Abstract
Interspecies RNA-Seq datasets are increasingly common, and have the potential to answer new questions about the evolution of gene expression. Single-species differential expression analysis is now a well-studied problem that benefits from sound statistical methods. Extensive reviews on biological or synthetic datasets have provided the community with a clear picture of the relative performance of the available methods in various settings. However, synthetic dataset simulation tools are still missing in the interspecies gene expression context. In this work, we develop and implement a new simulation framework. This tool builds on both the RNA-Seq and the phylogenetic comparative methods literatures to generate realistic count datasets, while taking into account the phylogenetic relationships between the samples. We illustrate the usefulness of this new framework through a targeted simulation study that reproduces the features of a recently published dataset containing gene expression data in adult eye tissue across blind and sighted freshwater crayfish species. Using our simulated datasets, we perform a fair comparison of several approaches used for differential expression analysis. This benchmark reveals some of the strengths and weaknesses of both the classical and phylogenetic approaches for interspecies differential expression analysis, and allows for a reanalysis of the crayfish dataset. The tool has been integrated in the R package compcodeR, freely available on Bioconductor.
Keywords: RNA-Seq, differential gene expression, phylogenetic comparative methods, orthologous genes, comparative transcriptomics, crayfish
Introduction
The study and analysis of gene expression differences across species is a long-standing problem (King and Wilson 1975). The development of microarray technologies led to the gathering of the first large-scale cross-species gene expression datasets, which allowed for the formulation and study of various hypotheses regarding the link between gene expression and evolution (Enard 2002; Khaitovich et al. 2004; Gilad et al. 2006; Whitehead and Crawford 2006). RNA-Sequencing technologies have changed the way gene expression is measured (Wang et al. 2009), making comparisons across several species easier, even for species with no reference genome available (Perry et al. 2012; Romero et al. 2012). Interspecies gene expression data are increasingly common, with well-curated resources, such as the Bgee database (Bastian et al. 2021), that make them available to the community.
Since changes in expression may underlie complex phenotypes, across species gene expression datasets can be used to test a wide range of evolutionary scenarios (Romero et al. 2012; Dunn et al. 2013). Tested hypotheses include, for instance, expression divergence (Gu 2004); the strength of expression conservation (Gu et al. 2019); the coevolution of gene expression (Cope et al. 2020); test of the orthology conjecture (Rogozin et al. 2014; Dunn et al. 2018); the detection of “phylogenetic signal” (Musser and Wagner 2015); equality of within-species variance (Catalán et al. 2019); constant stabilizing selection, loss through drift, parallel, or divergent selection (Stern and Crandall 2018a, 2018b); or the detection of duplication-specific effects in expression evolution (Fukushima and Pollock 2020).
In this work, we focus only on the detection of changes in gene expression levels across species, in a specific lineage or between different groups of species. This problem can be formalized as an interspecies differential expression analysis, and has been studied in various groups of organisms (Cáceres et al. 2003; Zheng-Bradley et al. 2010; Blake et al. 2018; Stern and Crandall 2018b; Chen et al. 2019; Alam et al. 2020; Blake et al. 2020). For instance, differences in gene expression levels were found between mammalian lineages and birds (Brawand et al. 2011), across nonmodel primate species (Perry et al. 2012), and between Drosophila species (Torres-Oliva et al. 2016) or Heliconius butterflies (Catalán et al. 2019). Note that the biological interpretation of changes in the level of expression of a gene across species is not easy (Romero et al. 2012). Shifts in gene expression across species could be molecular signatures of ecological adaptation, associated with a directional selection scenario, or a relaxation of evolutionary constraints.
From a bioinformatic point of view, the comparison of RNA-Seq samples between multiple species requires, first, the detection of orthologous relationships between genes (Tatusov 1997; Tekaia 2016), second, the consideration of differences in genome mappability (Zhu et al. 2014) and, third, the adaptation of alignment and quantification pipelines (LoVerso and Cui 2015; Chung et al. 2021). Multispecies alignment techniques have also been developed (Bradley et al. 2009; Brawand et al. 2011). In this work, we name orthologous genes (OG), or simply genes, the set of genes having an orthologous relationship across species. Once the orthologous gene expression matrix has been created, the level of expression can be transformed into a discrete variable to detect the presence vs. absence of gene expression (Bastian et al. 2021). Other approaches perform a separate differential expression analysis for each species (Dunn et al. 2013; Kristiansson et al. 2013) or focus on pairwise comparisons only (Zhou et al. 2019; Chung et al. 2021). However, direct comparisons of expression between species can be complicated by batch effects (Gilad and Mizrahi-Man 2015), or potential confounding factors (Roux et al. 2015; Cope et al. 2020), and comparative gene expression studies should be carefully designed (Romero et al. 2012; Dunn et al. 2013; Chung et al. 2021).
In the present study, we assume that the alignment has already been performed, and we focus our attention on genes having a one-to-one relationship across several species (more than two species). We consider the level of expression of genes as a quantitative trait evolving across several species, and we detect genes with a shift in the level of expression across species as performed in, for example, Brawand et al. (2011), Perry et al. (2012), Torres-Oliva et al. (2016), and Stern and Crandall (2018a). The specificities of interspecies RNA-Seq data are manifold. RNA-Seq data are counts, usually measured on a low number of samples. In addition, several technical biases affect the measured level of expression of a given gene in a given sample, either gene-specific (such as heterogeneity of gene length and GC content across genes and samples), or sample-specific (such as heterogeneity in library size across samples). Finally, since the level of expression of a gene is measured across several species, the phylogenetic relationships between species induce some correlations in the data. While, ideally, all these specificities should be taken into account in the statistical analysis, to our knowledge, there exists no model that includes all these constraints in its hypotheses. The user has the choice between methods specifically designed for analyzing gene expression data such as limma, DESeq2, or edgeR (Smyth 2004; Smyth et al. 2005; Anders and Huber 2010; Robinson and Oshlack 2010; Love et al. 2014); or phylogenetic comparative methods (PCMs) such as phylolm (Ho and Ané 2014a), that implement a phylogenetic regression or ANOVA (Martins and Hansen 1997; Rohlfs and Nielsen 2015), designed for analyzing quantitative traits across species. Although PCMs have already been described for gene expression analysis (Gu 2004; Gu and Su 2007; Bedford and Hartl 2009; Rohlfs et al. 2014; Gu et al. 2019), and applied in particular for differential expression detection (Brawand et al. 2011; Stern and Crandall 2018a; Catalán et al. 2019; Chen et al. 2019), these methods do not explicitly account for count data, which can lead to biased results.
To benchmark these methods, a common strategy is to simulate RNA-Seq count data. There are several well-established tools to simulate RNA-Seq count data in the classical, intraspecies case (Dillies et al. 2013; Soneson and Delorenzi 2013; Soneson 2014; Frazee et al. 2015), which allowed for the benchmark of many differential expression analysis models (Anders and Huber 2010; Robinson and Oshlack 2010; Law et al. 2014). Although some methodological questions remain open (Van den Berge et al. 2019), these extensive simulation studies helped set good practices in terms of model choice or normalization methods in various intraspecies RNA-Seq settings. To our knowledge, there exists no extension of these frameworks to the interspecies setting. Simulation of gene expression across species has been performed using linear models and Gaussian variables (Rohlfs et al. 2014; Rohlfs and Nielsen 2015; Gu et al. 2019), but without taking into account the specificity of RNA-Seq count data and without focusing on the detection of shifts across species. In this work, we propose a framework to simulate RNA-Seq data across species. We use this framework to compare different strategies to detect genes with an expression level shift across multiple species, and draw recommendations for interspecies gene expression comparison.
New Approach
Simulation Framework
We designed and implemented a new simulation framework, which we used to benchmark and calibrate differential analysis methods on synthetic datasets that exhibit features specific to interspecies RNA-Seq datasets. Our approach is based on a phylogenetic Poisson log-normal (pPLN) model that relies on two layers. First, latent variables are simulated using a Brownian motion (BM) or an Ornstein-Uhlenbeck (OU) stochastic process on the phylogenetic tree, using well-studied tools from the PCM literature. This layer can account for correlations induced by the phylogenetic relationships, including intraspecies-independent variations. Second, the values of these latent variables are used to define the parameter of a Poisson distribution, from which the final simulated count values are drawn. Built on top of tools tailored for RNA-Seq, the framework can simulate realistic count data that reproduce some of the important characteristics of a given empirical dataset, with genes of possibly varying lengths between samples. The simulation can include genes that are differentially expressed in different a priori specified groups of species. (See Section “A framework to simulate interspecies RNA-seq datasets” in the Materials and Methods for more details on the simulation framework.)
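The two-layer structure described above can be illustrated with a minimal Python sketch, given here only as an illustration and not as the compcodeR implementation. It simulates one gene: a latent log-expression evolves as a BM along a small tree, independent per-sample noise is added at the tips, and counts are drawn from a Poisson distribution whose mean is the exponential of the latent value. The four-species tree, the parameter values, and all function names are hypothetical.

```python
import math
import random

# Hypothetical ultrametric tree, listed parent-first: node -> (parent, branch length).
TREE = {
    "root": (None, 0.0),
    "AB": ("root", 0.6),
    "CD": ("root", 0.6),
    "A": ("AB", 0.4),
    "B": ("AB", 0.4),
    "C": ("CD", 0.4),
    "D": ("CD", 0.4),
}
TIPS = ["A", "B", "C", "D"]


def simulate_bm(tree, root_value, sigma2, rng):
    """Layer 1: Brownian motion along the tree (each child = parent + Gaussian step)."""
    values = {}
    for node, (parent, length) in tree.items():
        if parent is None:
            values[node] = root_value
        else:
            sd = math.sqrt(sigma2 * length)
            values[node] = values[parent] + rng.gauss(0.0, sd)
    return values


def sample_poisson(lam, rng):
    """Exact Poisson draw: count unit-rate exponential arrivals in [0, lam]."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed <= lam:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_gene(tree, tips, n_rep, log_mean, sigma2_tree, sigma2_indep, rng):
    """Layer 2: Poisson counts whose log-mean is the latent value plus independent noise."""
    latent = simulate_bm(tree, log_mean, sigma2_tree, rng)
    sd_indep = math.sqrt(sigma2_indep)
    counts = {}
    for tip in tips:
        counts[tip] = [
            sample_poisson(math.exp(latent[tip] + rng.gauss(0.0, sd_indep)), rng)
            for _ in range(n_rep)
        ]
    return counts


rng = random.Random(1)
counts = simulate_gene(TREE, TIPS, n_rep=3, log_mean=math.log(50.0),
                       sigma2_tree=0.1, sigma2_indep=0.02, rng=rng)
for tip in TIPS:
    print(tip, counts[tip])
```

In this sketch, replicates of the same species share the tip value of the latent process, so they are more correlated with each other than with replicates of other species; species that share a longer history on the tree are in turn more similar than distant ones.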
Impact of Tree Group Design
Contrary to classical differential analysis, where all the samples are independent, we showed that the specific group design on the tree had a strong influence on the signal present in the data. The phylogeny indeed acts as a confounding factor, as species within a clade tend to look alike while differing from species in other clades, just because of the tree structure. The differential effect of a group that spans a single clade was hence more difficult to distinguish from a simple evolutionary random drift than the differential effect of a group with species present in various clades (see, e.g., the “block” vs. “alt” designs in fig. 1). To help practitioners who need to design new interspecies RNA-Seq studies, we propose a new “differential analysis phylogenetic asymptotic effective sample size” () score. This score takes into account the tree and the group design at the tips of the tree only, so that it can be computed prior to any data collection. Using simulations, we showed that this score was a good predictor of the performance of differential analysis methods in a specific dataset. (See Section “Differential Analysis Phylogenetic Asymptotic Effective Sample Size” in the Materials and Methods for a complete derivation of this effective sample size.)
Reanalysis of the Crayfish Dataset
We applied our simulation framework with parameters empirically drawn from a recent study on the evolution of gene expression underlying vision loss in cave animals (Stern and Crandall 2018a). The synthetic datasets had characteristics similar to the original data, while allowing us to control the actual simulation parameters, and hence assess the performance of various differential expression analysis tools. We tested some of the most popular methods, along with several normalization and transformation strategies, drawn from both the RNA-Seq and the PCM literature. (See supplementary section A, Supplementary Material online for a survey of all the methods used.) Informed by this focused benchmark, we proposed a reanalysis of the original dataset, resulting in a list of possibly differentially expressed genes that is much shorter than the previously published one, and that we believe is more robust.
Results
Simulation Studies
Stern and Crandall (2018a) collected an interspecies RNA-Seq dataset to study the molecular mechanisms involved in vision loss in the North American family Cambaridae of crayfish species. We exploited our new simulation framework to generate realistic synthetic datasets following a base scenario, which was set to mimic the features of the crayfish dataset, using the estimated crayfish tree with the observed vision status design (“sight” design, see fig. 1), and matching the moments of the empirical counts and gene lengths. From this base scenario, we varied several parameters in order to study the impact of evolutionary dependence on the simulated data. In all simulated datasets, we could control exactly which genes were generated as differentially expressed, and which genes had a constant expression across clades (see section “A framework to simulate interspecies RNA-seq datasets” in the Materials and Methods). This allowed us to compare the list of truly differentially expressed genes with the list of candidate genes found by the various statistical methods tested. We tested the performance of the following differential expression analysis methods: DESeq2 (Love et al. 2014); limma (Ritchie et al. 2015), or limma cor (Smyth et al. 2005); and phylolm (Ho and Ané 2014a) with BM or OU processes. See Materials and Methods Section “Simulations and Empirical Studies” and supplementary section A, Supplementary Material online for a detailed presentation of the parameters and methods used. We did not aim at a comprehensive comparison of all the methods available, and chose these tools as representing the three main approaches used in an interspecies context, and available in R. Using the framework implemented in compcodeR, other methods could however easily be added to the comparison.
PLN and NB Simulation Frameworks Produce Similar Datasets
To check that our new pPLN framework produced datasets with properties similar to the well-known NB framework as implemented in compcodeR (Soneson 2014), we replaced the crayfish tree with a star-tree, which mimics the NB situation where all species and replicates are independent. When parametrized to produce the same moments, the pPLN framework on a star tree produced datasets that were similar in difficulty to the classical NB framework (fig. 2, first two columns). While limma controlled the FDR to the nominal rate, DESeq2 failed to control the FDR. However, both methods had a better TPR under the pPLN model (see supplementary fig. S1, Supplementary Material online), and DESeq2 had the best MCC score under the pPLN on a star tree. As shown by the countsimQC (Soneson and Robinson 2018) analysis, the datasets simulated with the pPLN and the NB frameworks had similar features, and were comparable to the original empirical dataset (see supplementary section D, Supplementary Material online).
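Because the set of truly differentially expressed genes is known in simulated data, the three scores used throughout this section can be computed directly from the confusion matrix. The short Python sketch below uses the standard definitions of the true positive rate (TPR), false discovery rate (FDR), and Matthews correlation coefficient (MCC); the function name and the toy vectors are hypothetical, and the conventions chosen for empty denominators are an assumption.

```python
import math


def performance(truth, called):
    """Return (TPR, FDR, MCC) from boolean vectors of true and called DE status."""
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    # TPR: fraction of truly DE genes that are recovered.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    # FDR: fraction of called genes that are false positives (0 if nothing is called).
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    # MCC: correlation between truth and calls, set to 0 when undefined.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return tpr, fdr, mcc


# Toy example: 3 truly DE genes out of 8, 3 genes called by a method.
truth = [True, True, True, False, False, False, False, False]
called = [True, True, False, True, False, False, False, False]
tpr, fdr, mcc = performance(truth, called)
print(round(tpr, 3), round(fdr, 3), round(mcc, 3))
```

A method can reach a high TPR simply by calling many genes, at the price of a high FDR; the MCC summarizes both aspects in a single number, which is why it is used as the overall score.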
Phylogenetic Data Requires Correlation Modeling
For data simulated according to the base scenario, that is, when the real tree was used to generate the data instead of the star-tree, both DESeq2 and limma methods, that do not take any correlation into account, exhibited very high rates of false discoveries (more than three quarters, fig. 2, last column). In this setting, methods that explicitly model correlations between samples (limma cor and phylolm) performed best (fig. 3, dark purple line). limma cor exhibited the best behavior with the highest MCC, and a TPR reaching about . Its FDR was still above the nominal rate (median around ).
Tree Group Design Matters
The group design on the tree is known to strongly impact the properties of the data, in particular through its “phylogenetic effective sample size” (Ané 2008; Bartoszek 2016). To study its effect in a gene expression context, we replaced the “sight” design with “block” and “alt” designs (see fig. 1), which were chosen to model two extreme situations. In the “block” design, all the species within a given group are nested within a single clade, so that the differential expression signal is redundant with the phylogenetic signal. At the other end of the spectrum, the “alt” design was chosen so that sister species are in different groups, in order to maximize the contrast between organisms that share a long common history. We expect the “alt” design to produce datasets with a stronger signal.
The “alt” design produced datasets with the clearest signal (fig. 3, light orange line). In this case, limma cor was able to correctly control the FDR. Although the phylolm methods had a slightly higher FDR, they achieved a better TPR, reaching almost , leading to a better overall MCC score. At the opposite end of the spectrum, the “block” design produced datasets with a very weak signal, with the M-A values of differentially expressed genes strongly overlapping the distribution of nondifferentially expressed genes, which was very diffuse (fig. 4). All methods applied to the “block” design had an FDR higher than or equal to about (fig. 3, black line). The BM phylolm tool had the least bad MCC score (about ), although with the worst TPR (around ). The relative difficulty of each design was correctly captured by the normalized (, so that in the independent case). While the “block” design had a lower than the independent case (), the “alt” design had a higher one (), and the “sight” design lay in the middle ().
OU Makes the Signal Weaker and is Hard to Correct For
The simulation process impacts the tree-induced correlation between species (Blomberg et al. 2003; Harmon 2019). To study the impacts of this modeling choice, we replaced the BM process with an OU, with a phylogenetic half-life (Hansen 1997) fixed equal to of the tree height.
When simulating the counts using an OU model of trait evolution for the latent trait instead of a BM, the signal became weaker, and all methods achieved lower MCC scores (fig. 5). The limma cor method performed the best in this case, even when compared with a phylolm method that explicitly takes the OU model into account. Further results of data simulated under the OU for different group designs are presented in supplementary figure S3, Supplementary Material online.
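To make the role of the phylogenetic half-life concrete, the Python sketch below converts a half-life into the OU selection strength through the standard relation $t_{1/2} = \ln(2)/\alpha$, and evaluates how the correlation between two tips decays with their tree distance. It assumes an ultrametric tree and an OU process started from its stationary distribution, in which case the correlation is $\exp(-\alpha d)$ for tips at distance $d$. The half-life value and the distances are hypothetical and are not the ones used in the simulations.

```python
import math


def alpha_from_half_life(half_life):
    """Selection strength of an OU process with the given phylogenetic half-life."""
    return math.log(2.0) / half_life


def ou_stationary_correlation(alpha, distance):
    """Correlation between two tips at tree distance `distance` (stationary OU, ultrametric tree)."""
    return math.exp(-alpha * distance)


tree_height = 1.0
half_life = 0.5 * tree_height  # hypothetical value, for illustration only
alpha = alpha_from_half_life(half_life)

# Sister species that diverged 0.2 time units ago are at tree distance 0.4;
# species separated at the root are at tree distance 2 * tree_height.
corr_sisters = ou_stationary_correlation(alpha, 0.4)
corr_distant = ou_stationary_correlation(alpha, 2.0 * tree_height)
print(round(alpha, 4), round(corr_sisters, 4), round(corr_distant, 4))
```

Shorter half-lives (larger $\alpha$) erase the shared history faster, so that the tree-induced correlation between species depends on the process as well as on the tree.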
Phylogenetic Methods are Robust to Intraspecies Variations
We mitigated the effect of the BM model on the tree by varying the level of the independent individual variation representing , from to . When reducing the intra-species variance to (inducing a correlation of between sample values of the same species), the limma cor method lost its advantage compared with the phylolm methods, whose performances were less affected by the level of intraspecies noise (fig. 6).
Normalization is Slightly Better on Phylogenetic Data
Normalization and transformation of RNA-Seq count data are known to strongly impact the analysis (Musser and Wagner 2015). We studied the effect of these choices by testing combinations of length normalization methods [TPM (Wagner et al. 2012), RPKM (Mortazavi et al. 2008), or a simple , i.e., no length normalization], and transformation functions [ (Law et al. 2014) or square root (Musser and Wagner 2015)]. (See Materials and Methods Section “Simulations and Empirical Studies” and supplementary section A, Supplementary Material online for a detailed presentation of these normalization techniques.)
Taking gene lengths into account, using either TPM or RPKM, significantly improved the power of the methods, in particular in terms of TPR (fig. 7). Although TPM normalization led to a slightly better MCC median, its performance was largely similar to that of the RPKM normalization. On this base scenario, the transformation led to a consistent gain of about in TPR compared with the square root (increasing from around to , fig. 7).
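As a reminder of how the two length normalizations differ, the following Python sketch computes RPKM and TPM from a count matrix and a matrix of gene lengths that may vary between samples, as is the case when samples come from different species. It follows the usual textbook definitions rather than the exact code used in this study; the function names, the toy matrices, and the pseudocount of the logarithmic transformation are assumptions.

```python
import math


def rpkm(counts, lengths):
    """Reads per kilobase per million: 1e9 * count / (gene length * library size)."""
    n_samples = len(counts[0])
    totals = [sum(row[i] for row in counts) for i in range(n_samples)]
    return [
        [1e9 * c / (length * totals[i]) for i, (c, length) in enumerate(zip(c_row, l_row))]
        for c_row, l_row in zip(counts, lengths)
    ]


def tpm(counts, lengths):
    """Transcripts per million: length-scaled counts rescaled to sum to 1e6 per sample."""
    rates = [[c / length for c, length in zip(c_row, l_row)]
             for c_row, l_row in zip(counts, lengths)]
    n_samples = len(counts[0])
    sums = [sum(row[i] for row in rates) for i in range(n_samples)]
    return [[1e6 * r / sums[i] for i, r in enumerate(row)] for row in rates]


def log2_transform(matrix, pseudocount=1.0):
    """Log transformation; the pseudocount is an assumption of this sketch."""
    return [[math.log2(x + pseudocount) for x in row] for row in matrix]


def sqrt_transform(matrix):
    return [[math.sqrt(x) for x in row] for row in matrix]


# Rows are genes, columns are samples; gene 0 is twice as long in sample 1.
counts = [[10, 20], [30, 60]]
lengths = [[1000, 2000], [1500, 1500]]
print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

In the toy example, the raw count of gene 0 doubles between the two samples, but so does its length: after either normalization, its expression relative to gene 1 no longer reflects this length difference.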
No Small Counts in De Novo Assembled Data
Including a mean-variance trend correction in the limma cor method did not change its performance on the base scenario, producing very similar MCC values (the median MCC on all runs differs by less than ). This is consistent with the fact that the original dataset uses de novo assembled data, which naturally exclude small counts and hence remove the need for a mean-variance trend correction (see Discussion).
Reanalysis of the Crayfish Dataset
While Stern and Crandall (2018a) found a list of differentially expressed genes (see supplementary table S2, Supplementary Material online in Stern and Crandall 2018a), the limma cor method found evidence for only , with only one gene that was not in the previous list. Among those, one gene was clearly associated with vision (coding for the Rhodopsin protein). The phylolm OU method found evidence for differentially expressed genes, including the same genes common to OUwie and limma cor. Raising the threshold from to (respectively, ), the limma cor method selected another (resp., ) genes, including (resp., ), matching with the list from Stern and Crandall (2018a). Using the RPKM instead of TPM gave a list of proteins, including two genes that were not in the previous lists.
Discussion
Simulation Study
Our targeted simulation study illustrates some of the specificities of interspecies RNA-Seq differential expression analysis. First, it is essential to take the correlation between replicates within a given species into account. Failure to do so leads to very high rates of false discoveries (fig. 3), that make the analysis unreliable and hard to exploit. Indeed, the limma method with added correlation seems to outperform other tools, including PCMs, in many settings. These results tend to indicate that, even if the full tree is not included in the analysis, incorporating these simple correlations between replicates might be sufficient to efficiently analyze interspecies datasets, at least for some simulation designs.
The group design on the tree was indeed found to be extremely important (fig. 3). A balanced design, where the groups are evenly spread over all clades, has a stronger signal (fig. 4), and allows the analysis to be abstracted from the phylogeny to some extent, as classical tools for differential expression analysis work best in this configuration. On the other hand, when the groups are clustered in the phylogeny, the signal is weaker as it becomes more difficult to distinguish the real group effect from the simple drift that tends to isolate clades from one another. This is in particular the case for designs where one clade or species is tested against out-groups, which are sometimes encountered in the literature (Brawand et al. 2011; Rohlfs and Nielsen 2015). In this configuration, PCMs, although imperfect, are essential.
Finally, this study confirms the importance of length normalization for interspecies differential gene expression analysis to achieve acceptable detection power (fig. 7). Although we did not find any significant difference in performance between RPKM and TPM normalizations, the transformation seemed to have a slight advantage over the square root in this simulation setting. This advantage could however be simply an effect of the simulation framework, which is based on a pPLN distribution.
Simulation Design
In this work, we proposed a method to simulate RNA-Seq gene expression across multiple species. Similar to intraspecies simulation tools (Dillies et al. 2013; Soneson and Delorenzi 2013; Soneson 2014), our simulation method can use empirical datasets to set the value of parameters such that the simulated datasets are as close as possible to the real ones, with matching empirical marginal expectation and variance. When applied to independent species, our pPLN model produces datasets with features comparable to the classical NB model (fig. 2 and supplementary section D, Supplementary Material online). In our specific simulation studies, we use the dataset from Stern and Crandall (2018a). This dataset was obtained using de novo assembled data. In addition, we focused on genes with one-to-one orthologous relationships across species. As a consequence, this dataset had a low number of zeros and small counts, and a large variance across samples (empirical dispersion ranged from to , see fig. 4). The simulated datasets had similar characteristics, which could explain the low performance of DESeq2 in terms of FDR, even when the data were simulated without correlation (fig. 2), and the fact that the trend procedure did not add any power to the limma method. In addition, as DESeq2 explicitly assumes an NB distribution of the counts, it suffers from the deviation from this model, as opposed to limma, which controlled the FDR to the nominal rate in both the NB and pPLN models (fig. 2). Interspecies RNA-Seq gene expression datasets are very diverse, with specificities depending on the underlying biological question being studied. This work provides a first step toward realistic simulation of such datasets.
Simulation Tool
Compared with classical intraspecies simulation tools (Dillies et al. 2013; Soneson and Delorenzi 2013; Soneson 2014), our simulation framework incorporates the species tree and the gene length, which may vary across species. It makes it possible to model the evolution of gene expression on the tree using two different processes (BM or OU), and it allows for additional independent variation, which can model, for example, intraspecies variation or measurement error. This complex model leads to new effects that can be difficult to predict. In particular, we showed that the distribution of the groups on the tree had strong effects on the ability of all methods to detect a group expression shift. We proposed a normalized criterion () to assess the difficulty of the group design for the differential gene expression analysis problem. Although it does not take into account the number of replicates or the specific evolution model, we showed that it could well represent the difficulty of an experimental design. The strength of this criterion is that it only depends on the timed species tree and the group allocation at the tips, and can be computed before any statistical inference or even data collection. It can hence be used as a practical guide on the expected power of the experimental design. In particular, if the normalized is lower than one, results from methods that do not take the phylogeny into account should be interpreted with particular caution.
In this work, we focused our attention on the detection of shifts of expression between groups spanning across species. However, interspecies datasets are also used to address many other questions, such as equality of within-species variance, expression divergence, or detection of neutral vs. directed evolution regimes. Several tools from the PCM literature have been used to this end, that rely on various models of trait evolution with appropriate parameter constraints. Since our simulation tool is modular, those various processes could be implemented, in order to produce realistic RNA-Seq datasets with the desired structure. More generally, our tool could be extended to take into account any correlation structure between samples, not necessarily deriving from a phylogenetic model. Such an extended framework could help researchers to test the statistical properties of these complex inference models.
Inference Tools
In this study, we focused on a few inference tools, that come either from the RNA-Seq or the PCM literature, limiting ourselves to methods implemented in R and that can do differential analysis. Although a more comprehensive simulation design would be needed to draw stronger conclusions, our results show that simulations under the OU model could lead to more difficult datasets for some group designs, and that even methods that include the OU model in their framework fail to completely correct for this effect. This could be linked with the fact that the estimation of the selection strength in an OU model is a notoriously difficult question, especially on an ultrametric tree (Ho and Ané 2014b; Cooper et al. 2016). Having to estimate this parameter for thousands of genes is bound to generate some instability, and to deteriorate the performance of those tools. Gu et al. (2019) recently proposed an empirical Bayes approach to deal with this parameter in an RNA-Seq setting. One possible direction could be to adapt this method to a differential analysis problem. More generally, our simulation study seems to show that none of the methods presented here had really satisfactory results, but that taking gene lengths and sample correlation into account was essential. This study illustrates the need for new statistical tools for interspecies differential analysis, that would combine the strengths of both the classical RNA-Seq literature, that can deal with the specificities of this noisy data, and the PCM literature, that takes into account the phylogeny, an information that can be crucial to correctly interpret interspecies data.
Reanalysis of the Crayfish Dataset
In the setting that was most similar to the empirical Crayfish dataset, we found that the limma cor method worked best, with a similar TPR but smaller FDR compared with phylogenetic methods. When we applied this method to the Crayfish dataset, we found a list of only differentially expressed genes between sighted and blind species across all clades. This list was robust to the choice of the detection threshold, as raising it from to only added candidates. Allowing more false discoveries with a threshold of yielded a total of genes, of which only approximately a third matched the original list of genes found by Stern and Crandall (2018a), supporting our suspicion that the data only support a limited number of differentially expressed genes. However, such a small list, containing only one gene directly associated with vision, might not be enough to explain the complex mechanisms of vision loss in cave animals. There are at least two factors that could explain this result. First, our model was designed to find genes that are differentially expressed in all clades, that is, it assumed that the mechanisms underlying vision loss were the same for all groups of organisms, which is a very strong assumption. A different design could be to assume that each clade that went through vision loss has its own differentially expressed genes. In the linear model of limma, this simply amounts to adding one group factor per clade of interest, and testing the coefficient associated with each group. Such an analysis gave us a different list for blind species of each genus Procambarus, Cambarus, and Orconectes, with only one gene that was common to all three lists (see supplementary section B, Supplementary Material online). The fact that only a few genes overlap between each group could indicate that different sets of genes are associated with vision loss in each clade, that is, that evolution has taken different genomic routes to vision loss in cave crayfish.
This finding could be consistent with Stern and Crandall (2018a), which concluded that convergent vision loss among blind species was driven by increased gene expression variance (i.e., loss of selective constraint) rather than directional selection on a common set of genes. A second limitation to this study is that we only tested for differentially expressed genes, that is, differences in mean gene expression between groups. This is only one of the many possible ways evolution can impact gene expression (see, e.g., table 1 in Stern and Crandall 2018a). Other tools, designed to test for other patterns of evolution as mentioned above, might be able to detect other genes. Finally, when using RPKM instead of TPM normalization, we found a list of genes that only partially matched the previous ones. Although TPM and RPKM seemed to perform similarly well on the simulated data (see fig. 7), this result shows that normalization can have a strong impact when analyzing biological datasets, and the robustness of these methods should be carefully assessed, when possible. Unfortunately, little is still known regarding the evolutionary genetics of vision loss in crayfishes, which makes the biological validation of the results difficult.
Table 1.
Orthogroup | Adj. P value | Uniprot Top Hit | Protein Name |
---|---|---|---|
OG0002505 | | XYLA_ARATH | Xylose isomerase |
OG0001105 | | PIPA_DROME | 1-phosphatidylinositol 4,5-bisphosphate phosphodiesterase |
OG0000233 | | RTBS_DROME | Probable RNA-directed DNA polymerase from transposon BS |
OG0002370 | | ARRH_LOCMI | Arrestin homolog |
OG0006977 | | CSK2B_RAT | Casein kinase II subunit beta |
OG0001281 | | OPSD_PROCL | Rhodopsin |
Materials and Methods
A Framework to Simulate Interspecies RNA-Seq Datasets
Building on existing RNA-Seq methods (Robles et al. 2012; Soneson and Delorenzi 2013; Soneson 2014), we developed a new interspecies simulation framework that can generate realistic count datasets, and takes into account, first, the gene expression correlations induced by the phylogeny and, second, the different lengths a given gene can have in different species.
Realistic Simulations using the Negative Binomial Distribution
We briefly recall here the simulation framework detailed in Soneson and Delorenzi (2013), and implemented in compcodeR (Soneson 2014).
Negative Binomial Distribution
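Let $Y_{ij}$ be the random variable representing the count for gene $i$ ($1 \le i \le p$) in sample $j$ ($1 \le j \le n$), with true expression level $\lambda_{ij}$ and sampling depth $M_j$. Following Robinson and Oshlack (2010), we model each count independently by a negative binomial (NB) distribution with expectation $\mu_{ij}$ and dispersion $\phi_i$, such that $\mathrm{Var}[Y_{ij}] = \mu_{ij}(1 + \phi_i \mu_{ij})$, with
$$\mu_{ij} = \frac{\lambda_{ij}}{\sum_{i'=1}^{p} \lambda_{i'j}} M_j. \tag{1}$$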
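As an illustration of the NB parameterization recalled above, the following minimal Python sketch computes, for one sample, the expectation of each gene as its share of the sampling depth, and the associated NB variance (expectation plus dispersion times squared expectation). This is not the compcodeR implementation: the function name `nb_moments` and the toy values are hypothetical.

```python
def nb_moments(lam, depth, phi):
    """Return NB expectations and variances for the genes of one sample.

    lam   : true expression levels, one per gene
    depth : sampling depth (library size) of the sample
    phi   : gene-wise dispersions
    """
    total = sum(lam)
    # Each gene receives a share of the depth proportional to its expression level.
    mu = [depth * level / total for level in lam]
    # NB variance: mu * (1 + phi * mu) = mu + phi * mu^2.
    var = [m * (1.0 + p * m) for m, p in zip(mu, phi)]
    return mu, var


if __name__ == "__main__":
    mu, var = nb_moments(lam=[10.0, 30.0, 60.0], depth=1000.0, phi=[0.1, 0.1, 0.5])
    print(mu)
    print(var)
```

By construction, the expectations of a sample sum to its sampling depth, so that only relative expression levels matter.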
Differential Expression
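To model differential expression, we assume that the samples are partitioned into two groups $S_1$ and $S_2$. For each gene $i$, the dispersion parameter $\phi_i$ is the same for all samples, while the expression level $\lambda_{ij}$ can only take two values: $\lambda_i^{(1)}$ if $j$ is in $S_1$ and $\lambda_i^{(2)}$ if $j$ is in $S_2$. Given $\lambda_i^{(1)}$, we take $\lambda_i^{(2)}$ as
$$\lambda_i^{(2)} = \begin{cases} \lambda_i^{(1)} & \text{if gene } i \text{ is not differentially expressed,} \\ \lambda_i^{(1)} \, (e_s + E_i) & \text{if gene } i \text{ is upregulated in } S_2, \\ \lambda_i^{(1)} / (e_s + E_i) & \text{if gene } i \text{ is downregulated in } S_2, \end{cases}$$
with $e_s$ the minimal differential effect size, and $E_i$ random variables independent identically distributed according to an exponential distribution with parameter $1$. The values of the parameters are set to match the empirical count expectations and dispersions of a real dataset.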
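The construction of the second group's expression levels can be sketched as follows. This is a hedged illustration rather than the compcodeR code: the function name `simulate_group_levels` is hypothetical, and the sketch assumes that the fold change is the minimal effect size plus an exponential draw of rate 1, applied multiplicatively for upregulated genes and by division for downregulated ones.

```python
import random


def simulate_group_levels(lam1, is_de, upregulated, effect_size, rng):
    """Draw expression levels in group 2 from those in group 1.

    lam1        : expression levels in group 1, one per gene
    is_de       : booleans, True if the gene is differentially expressed
    upregulated : booleans, direction of the change for DE genes
    effect_size : minimal differential effect size (assumed > 1)
    rng         : a random.Random instance
    """
    lam2 = []
    for level, de, up in zip(lam1, is_de, upregulated):
        if not de:
            lam2.append(level)  # non-DE genes keep the same level
            continue
        # Assumption of this sketch: fold change = effect size + Exp(1) draw.
        fold = effect_size + rng.expovariate(1.0)
        lam2.append(level * fold if up else level / fold)
    return lam2


if __name__ == "__main__":
    rng = random.Random(42)
    print(simulate_group_levels([10.0, 10.0, 10.0],
                                [False, True, True],
                                [False, True, False],
                                effect_size=1.5, rng=rng))
```

Because the exponential draw is nonnegative, every differentially expressed gene has a fold change at least as large as the minimal effect size.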
Realistic Simulations using the Poisson Log-Normal Distribution
The Poisson log-normal (PLN) distribution has been advocated as an alternative to the NB distribution for the analysis of RNA-Seq data. Being more flexible, it is particularly well suited in the presence of correlations (Gallopin et al. 2013; Zhang et al. 2015), which proves essential for interspecies datasets, as demonstrated in the next section. We show here how the parameters of a PLN model can be chosen to match first- and second-order moments of the NB model described above, making it possible to simulate realistic datasets under this more flexible framework.
The PLN Distribution
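Under the PLN model, for each gene $i$ and sample $j$, we assume that the observed count random variable $Y_{ij}$ follows a Poisson distribution, with log parameter a Gaussian latent variable $Z_{ij}$, such that
$$Y_{ij} \mid Z_{ij} \sim \mathcal{P}\left(e^{Z_{ij}}\right), \qquad Z_{ij} \sim \mathcal{N}\left(m_{ij}, \sigma_{ij}^2\right). \tag{2}$$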
This model is similar in spirit to the NB distribution, which can be seen as a Gamma-Poisson mixture (see, e.g., Holmes and Huber 2019, Chap. 4). Note that in both models, the coefficient of variation of the mixing distribution is constant across samples for a given gene (Chen et al. 2014).
Matching Moments
Using standard moment expressions for the NB (Holmes and Huber 2019) and PLN (Aitchison and Ho 1989) distributions, it is straightforward to show that a PLN distribution with parameters $m_{ij}$ and $\sigma_{ij}^2$ yields the same first- and second-order moments as a NB distribution with expectation $\mu_{ij}$ and dispersion $\phi_i$ if and only if
$$m_{ij} = \log(\mu_{ij}) - \frac{1}{2}\log(1 + \phi_i), \qquad \sigma_{ij}^2 = \log(1 + \phi_i). \tag{3}$$
These equations allow us to readily use the framework developed in the previous section also in the case of a PLN simulation.
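The moment matching can be checked numerically. The sketch below (hypothetical function names, standard library only) converts an NB expectation and dispersion into the log-scale mean and variance of a PLN distribution, and then recomputes the PLN expectation and variance from the standard log-normal mixture formulas to verify that they agree with the NB ones.

```python
import math


def nb_to_pln(mu, phi):
    """Return the PLN log-scale mean and variance matching NB(mu, phi)."""
    sigma2 = math.log(1.0 + phi)
    m = math.log(mu) - 0.5 * sigma2
    return m, sigma2


def pln_moments(m, sigma2):
    """Expectation and variance of a Poisson log-normal count."""
    mean = math.exp(m + 0.5 * sigma2)
    # Poisson noise plus the variance of the log-normal mixing distribution.
    var = mean + mean ** 2 * (math.exp(sigma2) - 1.0)
    return mean, var


if __name__ == "__main__":
    mu, phi = 100.0, 0.2
    m, sigma2 = nb_to_pln(mu, phi)
    print(pln_moments(m, sigma2))   # PLN moments
    print((mu, mu + phi * mu ** 2))  # NB moments, for comparison
```

Note that the matched log-scale variance depends only on the dispersion, which is why it is constant across samples for a given gene.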
Phylogenetic Comparative Methods
Phylogenetic relationships are known to induce correlations between quantitative traits observed on several species (Felsenstein 1985). The field of PCMs specializes in the comparative study of such phylogenetically related traits, and has flourished over the last decades [see, e.g., Harmon (2019) for a recent review]. Conditionally on a phylogenetic tree that links a set of species, PCMs model the evolution of a quantitative trait as a stochastic process along the branches of the tree (see fig. 8). This generative model induces a multivariate Gaussian structure of the observed vector of traits across species, with a correlation structure that depends on the tree and on the chosen process. The values of the trait are only observed at the tips of the tree. The values at the root or at the internal nodes are unobserved and are modeled using latent variables.
Brownian Motion on a Tree
The most commonly used process is the BM (Felsenstein 1985). Under this model, for a given continuous trait $Z'$ measured at the tips of the tree, the covariance between traits $Z'_k$ and $Z'_l$ is simply proportional to the time of shared evolution between species $k$ and $l$, that is, the time $t_{kl}$ between the root of the tree and the most recent common ancestor of $k$ and $l$: $\mathrm{Cov}[Z'_k, Z'_l] = \sigma^2_{BM} \, t_{kl}$, where $\sigma^2_{BM}$ is the variance of the BM process. The expectation of each trait is equal to $z_0$, the ancestral value of the process at the root.
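To make the BM covariance structure concrete, the following sketch computes it on a small toy tree. The tree encoding (a dictionary of parents and a dictionary of branch lengths) and all function names are assumptions of this illustration, not part of any phylogenetics package.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root (both included)."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def depth(node, parent, length):
    """Time between the root and `node` (sum of branch lengths)."""
    return sum(length[n] for n in path_to_root(node, parent) if parent[n] is not None)


def shared_time(a, b, parent, length):
    """Time between the root and the most recent common ancestor of a and b."""
    ancestors_a = set(path_to_root(a, parent))
    for n in path_to_root(b, parent):  # first ancestor of b also above a
        if n in ancestors_a:
            return depth(n, parent, length)


def bm_covariance(tips, parent, length, sigma2):
    """BM covariance matrix: sigma2 times the shared evolution times."""
    return [[sigma2 * shared_time(a, b, parent, length) for b in tips] for a in tips]


if __name__ == "__main__":
    # Toy ultrametric tree ((A:1,B:1):1,C:2) of height 2.
    parent = {"root": None, "AB": "root", "A": "AB", "B": "AB", "C": "root"}
    length = {"root": 0.0, "AB": 1.0, "A": 1.0, "B": 1.0, "C": 2.0}
    for row in bm_covariance(["A", "B", "C"], parent, length, sigma2=2.0):
        print(row)
```

On this toy tree, the two sister species A and B share half of their history and are therefore correlated, while C is independent of both.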
Ornstein-Uhlenbeck on a Tree
To model stabilizing selection, the OU process is often used (Hansen and Martins 1996; Hansen 1997). Compared with the BM, it has an equilibrium value $\theta$, that represents the “optimal value” of the trait in a given environment. The trait is attracted to this optimum with a speed that is controlled by the selection strength $\alpha$, or, better, the phylogenetic half-life $t_{1/2} = \log(2)/\alpha$ (Hansen 1997). This process induces a different correlation structure than the BM, with stronger selection strength inducing weaker interspecies correlations (Hansen 1997; Ho and Ané 2013). Specifically, conditionally on a fixed root, $\mathrm{Cov}[Z'_k, Z'_l] = \gamma^2 e^{-\alpha(t_k + t_l - 2t_{kl})}\left(1 - e^{-2\alpha t_{kl}}\right)$, with $\gamma^2$ the stationary variance of the process, $t_k$ the time between the root and node $k$, and $t_{kl}$ the time of shared evolution between $k$ and $l$ (Ho and Ané 2013).
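The effect of the selection strength on interspecies correlations can be illustrated with a few lines of Python. The sketch assumes the usual expression of the OU covariance conditional on a fixed root, written in terms of the stationary variance, the root-to-tip times, and the time of shared evolution; function names are hypothetical.

```python
import math


def ou_covariance(t_k, t_l, t_kl, alpha, gamma2):
    """OU covariance between two nodes, conditionally on a fixed root.

    t_k, t_l : times between the root and each node
    t_kl     : time between the root and their most recent common ancestor
    alpha    : selection strength
    gamma2   : stationary variance of the process
    """
    distance = t_k + t_l - 2.0 * t_kl  # time separating the two nodes
    return gamma2 * math.exp(-alpha * distance) * (1.0 - math.exp(-2.0 * alpha * t_kl))


def ou_correlation(t_k, t_l, t_kl, alpha):
    """Correlation implied by the covariance above (gamma2 cancels out)."""
    cov = ou_covariance(t_k, t_l, t_kl, alpha, 1.0)
    var_k = ou_covariance(t_k, t_k, t_k, alpha, 1.0)
    var_l = ou_covariance(t_l, t_l, t_l, alpha, 1.0)
    return cov / math.sqrt(var_k * var_l)


def half_life(alpha):
    """Phylogenetic half-life associated with a selection strength."""
    return math.log(2.0) / alpha


if __name__ == "__main__":
    for alpha in (0.1, 1.0, 10.0):
        print(alpha, half_life(alpha), ou_correlation(1.0, 1.0, 0.5, alpha))
```

For two tips of a unit-height tree sharing half of their history, the correlation stays below the BM value of one half and decreases as the selection strength grows.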
Within-Species Variation
The traditional PCM framework assumes that only one measurement is available for each species, and that there is no measurement error, that is, that all the observed variation can be explained by the evolution process on the tree. However, ignoring measurement error can lead to severe biases (Silvestro et al. 2015; Cooper et al. 2016). In addition, in an interspecies RNA-Seq differential analysis, it is usual to have access to replicated measurements, that is, to measurements for several individuals of the same species. There is a vast literature on the subject of within-species variation (Grafen 1989, 1992; Lynch 1991; Housworth et al. 2004; Ives et al. 2007; Hadfield and Nakagawa 2010; Goolsby et al. 2017). One simple way to look at the problem in a univariate setting is to assume that all the individuals from the same species are placed on the tree as tips linked to the same species node with a branch of length zero (Felsenstein 2008) and to add a uniform Gaussian individual variance $s^2$ to all the tip sample traits (see figs. 8 and 1). In such a framework, the total variance of a sample trait $Z_j$ attached to a latent tip with trait $Z'_k$ is given by $\mathrm{Var}[Z_j] = \mathrm{Var}[Z'_k] + s^2$, where $\mathrm{Var}[Z'_k]$ is determined by the chosen stochastic process to model the latent trait (BM or OU). Similarly, the covariance between two sample traits $Z_j$ and $Z_{j'}$ attached, respectively, to latent tip traits $Z'_k$ and $Z'_l$ is given by $\mathrm{Cov}[Z_j, Z_{j'}] = \mathrm{Cov}[Z'_k, Z'_l]$.
The Phylogenetic Poisson Log-Normal Distribution
In an interspecies framework, the samples come from different species, which induces a specific correlation between measurements that can be taken into account in a multivariate PLN model.
Continuous Trait Evolution Model
The models of trait evolution used in PCMs are generative, and can be used to simulate continuous traits at the tips of a tree (with possible replicates) such that their correlation structure is consistent with their phylogeny (see fig. 8). Using a simple uniform Gaussian individual variance to model within-species variation, the variance matrix $\Sigma_i$ of the vector of continuous traits at the tips of the tree generated by such a process for gene $i$ can be expressed as
$$[\Sigma_i]_{jj'} = [V_i]_{k_j k_{j'}} + s_i^2 \, \mathbb{1}\{j = j'\},$$
where $[V_i]_{k_j k_{j'}}$ is the phylogenetic variance between species $k_j$ and $k_{j'}$ of samples $j$ and $j'$ (see fig. 8), with a structure given by the evolution process (BM or OU, see expressions above), and $s_i^2$ the added intraspecies variation. Note that the variance parameters do depend on the gene $i$ (for instance, $\sigma^2_{BM,i}$ and $s_i^2$ in the BM case), which allows us to tune the marginal moments to realistic values for count data, as detailed below.
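The assembly of the sample-level variance matrix from a species-level one can be sketched as follows: replicates of the same species inherit the full phylogenetic variance of their species as covariance, and an individual variance is added on the diagonal only. The function name and the toy values are hypothetical.

```python
def sample_covariance(species_cov, species_of_sample, s2):
    """Expand a species-level covariance to the sample level.

    species_cov       : phylogenetic covariance matrix between species
    species_of_sample : index of the species of each sample
    s2                : individual (within-species) variance
    """
    n = len(species_of_sample)
    cov = [[species_cov[species_of_sample[a]][species_of_sample[b]]
            for b in range(n)] for a in range(n)]
    for a in range(n):
        cov[a][a] += s2  # independent individual variation, diagonal only
    return cov


if __name__ == "__main__":
    species_cov = [[1.0, 0.4], [0.4, 1.0]]
    # Two replicates for species 0, one sample for species 1.
    for row in sample_covariance(species_cov, [0, 0, 1], s2=0.25):
        print(row)
```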
The Phylogenetic Poisson Log-Normal Distribution
The models described above are well suited for quantitative traits, but need to be adapted for count measures, such as those produced by an RNA-Seq analysis. To handle such counts, we propose to add a Poisson layer to the trait evolution models described above, defining a “phylogenetic” Poisson log-normal (pPLN) distribution. More specifically, for a given gene $i$, we simulate a vector of latent traits $\mathbf{Z}_i$ as the result of such a process running on the tree, and then, conditionally on this vector, draw the observed counts $Y_{ij}$ from a Poisson distribution with parameter $e^{Z_{ij}}$:
$$Y_{ij} \mid Z_{ij} \sim \mathcal{P}\left(e^{Z_{ij}}\right), \qquad \mathbf{Z}_i \sim \mathcal{N}\left(\mathbf{m}_i, \Sigma_i\right). \tag{4}$$
In other words, the vector of counts $\mathbf{Y}_i$ for each gene $i$ is drawn from a multivariate PLN distribution, with parameters $\mathbf{m}_i$ and $\Sigma_i$ obtained from the evolutionary models described above, $\Sigma_i$ being the structured variance matrix of both phylogenetic and independent effects, and $\mathbf{m}_i$ a vector of expectation values at the tips, that can be set independently from the process.
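A draw from such a multivariate PLN distribution can be sketched with the standard library alone: correlated Gaussian latent traits are obtained from a Cholesky factor of the variance matrix, and each count is then drawn from a Poisson distribution whose rate is the exponential of the latent trait. This is an illustrative sketch, not the compcodeR implementation; all names are hypothetical, the variance matrix is assumed positive definite, and the simple Poisson sampler used here is only suitable for moderate rates.

```python
import math
import random


def cholesky(a):
    """Lower-triangular factor of a positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            acc = a[r][c] - sum(low[r][k] * low[c][k] for k in range(c))
            low[r][c] = math.sqrt(acc) if r == c else acc / low[c][c]
    return low


def poisson_draw(rate, rng):
    """One Poisson variate (Knuth's multiplication method, moderate rates only)."""
    threshold, k, prod = math.exp(-rate), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_ppln_counts(mean, cov, rng):
    """Draw latent traits N(mean, cov), then counts Poisson(exp(latent))."""
    low = cholesky(cov)
    n = len(mean)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    latent = [mean[r] + sum(low[r][c] * eps[c] for c in range(r + 1)) for r in range(n)]
    counts = [poisson_draw(math.exp(z), rng) for z in latent]
    return latent, counts


if __name__ == "__main__":
    rng = random.Random(0)
    cov = [[1.25, 1.0, 0.4], [1.0, 1.25, 0.4], [0.4, 0.4, 1.25]]
    print(simulate_ppln_counts([math.log(5.0)] * 3, cov, rng))
```

The individual variance on the diagonal keeps the variance matrix positive definite even when several replicates are attached to the same species.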
Matching Moments for Realistic Simulations
Assuming that the diagonal coefficients of $\Sigma_i$ are all equal to a single value $\sigma_i^2$, equation (3) can be used to ensure that the pPLN model above yields the same marginal expectation and variance as a NB model with expectation $\mu_{ij}$ and dispersion $\phi_i$. At a macro-evolutionary scale, most of the dated phylogenetic trees encountered are ultrametric, that is, are such that all the tips are at the same distance from the root. In that case, all the phylogenetic models described above verify this variance homogeneity assumption. For instance, for the simple BM model with an extra layer of independent variation, we have $\sigma_i^2 = \sigma^2_{BM,i} \, h + s_i^2$, with $h$ the height of the tree, $\sigma^2_{BM,i}$ the BM variance, and $s_i^2$ the individual variance for gene $i$. Note that although the NB and pPLN models are set to have the same expectations and variances, they differ significantly in their covariances: while in the standard NB model, all the samples are independent from one another, in the proposed pPLN framework, the measurements are correlated, with a structure reflecting both the tree and the selected evolutionary process.
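For the BM case on an ultrametric tree, the matching can be sketched as below: the common tip variance is fixed by the dispersion, and is then split between the phylogenetic and individual layers. The splitting parameter `prop_individual` (proportion of the tip variance attributed to individual variation) and the function name are hypothetical, and the values used are purely illustrative.

```python
import math


def ppln_bm_parameters(mu, phi, tree_height, prop_individual):
    """Latent BM parameters giving NB-like marginal moments.

    mu, phi         : target NB expectation and dispersion
    tree_height     : root-to-tip distance of the ultrametric tree
    prop_individual : share of the tip variance due to individual variation
    """
    sigma2_tip = math.log(1.0 + phi)           # total log-scale tip variance
    s2 = prop_individual * sigma2_tip          # independent individual layer
    sigma2_bm = (sigma2_tip - s2) / tree_height  # BM variance per unit of time
    m = math.log(mu) - 0.5 * sigma2_tip        # log-scale expectation
    return m, sigma2_bm, s2


if __name__ == "__main__":
    print(ppln_bm_parameters(mu=100.0, phi=0.2, tree_height=1.0, prop_individual=0.25))
```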
Taking Differential Gene Lengths into Account
Length Normalization of Counts
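Let $\ell_{ij}$ denote the length of gene $i$ for sample $j$. Following Robinson and Oshlack (2010), we take this length into account by changing equation (1) to
$$\mu_{ij} = \frac{\lambda_{ij} \, \ell_{ij}}{\sum_{i'=1}^{p} \lambda_{i'j} \, \ell_{i'j}} M_j, \tag{5}$$
where $\lambda_{ij}$, $M_j$, and $\mu_{ij}$ denote, as before, the true expression level, the sampling depth, and the NB expectation.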
Note that the same overall sequencing depth is attributed to each sample, but that, because of the weighted average, it is preferentially allocated to longer genes.
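The length-weighted allocation of the sequencing depth can be illustrated with a short sketch (hypothetical function name and toy values): two genes with the same expression level but different lengths receive shares of the depth proportional to their lengths.

```python
def nb_mean_with_lengths(lam, lengths, depth):
    """Expected counts for one sample, weighting expression levels by gene length."""
    weights = [level * size for level, size in zip(lam, lengths)]
    total = sum(weights)
    return [depth * w / total for w in weights]


if __name__ == "__main__":
    # Same expression level, but the second gene is three times longer.
    print(nb_mean_with_lengths(lam=[1.0, 1.0], lengths=[1000.0, 3000.0], depth=400.0))
```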
Lengths Simulation
The lengths are simulated according to the pPLN model described above, with expectations and dispersions empirically estimated from the dataset at hand.
Differential Analysis Phylogenetic Asymptotic Effective Sample Size
To quantify the intrinsic difficulty of a design compared with another, we propose a new “differential analysis phylogenetic asymptotic effective sample size” ($n_{\mathrm{eff}}$). Given a phylogenetic tree $\mathcal{T}$, we first remove all replicates, so that there are no zero-length branches. Then, given a design vector $\mathbf{x}$, we postulate a simple BM model for a hypothetical continuous trait $\mathbf{Z}'$ at the tips: $\mathbf{Z}' = \mu \mathbf{1} + \beta \mathbf{x} + \mathbf{E}$ with $\mathbf{E} \sim \mathcal{N}(\mathbf{0}, \sigma^2 V(\mathcal{T}))$, where $V(\mathcal{T})$ is the matrix of shared evolution times between the tips. From standard linear model theory, the variance of the maximum likelihood estimator of the coefficient $\beta$ is given by Ané (2008): $\mathrm{Var}[\hat{\beta}] = \sigma^2 \left[(X^T V(\mathcal{T})^{-1} X)^{-1}\right]_{22}$, with $X = (\mathbf{1} \; \mathbf{x})$ the matrix of predictors. We hence define: $n_{\mathrm{eff}} = 1 / \left[(X^T V(\mathcal{T})^{-1} X)^{-1}\right]_{22}$. In the case where all the species are independent (star tree, $V(\mathcal{T}) = I_N$), we fall back on a standard differential expression analysis, and we get, assuming that there are $N$ species and that the groups are balanced: $n_{\mathrm{eff}} = N/4$, which is the standard effective sample size for a balanced two-sample t-test with uniform variance. This gives us a baseline for a “standard” difficulty, and we use in the following the normalized $n_{\mathrm{eff}}$: $\bar{n}_{\mathrm{eff}} = 4 \, n_{\mathrm{eff}} / N$. A value lower than $1$ indicates a design that is deemed more difficult than a standard independent design (larger asymptotic variance of the estimator), while a value greater than $1$ indicates a problem where the phylogeny actually helps in finding the significant differences. Note that this score can be computed a priori, and, as shown below, can be used to assess the quality of the experimental design.
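The normalized score can be sketched numerically as follows, assuming that the effective sample size is defined as the inverse of the scaled asymptotic variance of the group coefficient in a generalized least squares regression with an intercept, and that the normalization divides by the value obtained for a balanced design on independent species. Function names and toy matrices are hypothetical, and the linear systems are solved by plain Gaussian elimination.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def n_eff_ratio(v, design):
    """Normalized effective sample size for a 0/1 design and tip covariance v."""
    n = len(design)
    ones = [1.0] * n
    x = [float(d) for d in design]
    vinv_ones, vinv_x = solve(v, ones), solve(v, x)
    a = sum(vinv_ones)                              # 1' V^-1 1
    b = sum(vinv_x)                                 # 1' V^-1 x
    d = sum(xi * w for xi, w in zip(x, vinv_x))     # x' V^-1 x
    # The (2,2) entry of (X' V^-1 X)^-1 is a / (a d - b^2).
    n_eff = (a * d - b * b) / a
    return 4.0 * n_eff / n


if __name__ == "__main__":
    two_clades = [[1.0, 0.9, 0.0, 0.0], [0.9, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.9], [0.0, 0.0, 0.9, 1.0]]
    print(n_eff_ratio(two_clades, [0, 0, 1, 1]))  # groups confounded with clades
    print(n_eff_ratio(two_clades, [0, 1, 0, 1]))  # groups contrasted within clades
```

In this toy example, a design in which the groups coincide with the two clades is harder than the independent case, whereas contrasting the groups within each pair of closely related species is easier.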
Simulations and Empirical Studies
Gene Expression Underlying Vision Loss in Cave Animals
The real dataset used to set simulation parameters and for the real data analysis case study was extracted from Stern and Crandall (2018a). In this study, the authors selected eight blind and six sighted crayfish species, for which a time-calibrated maximum likelihood phylogeny is known (Stern et al. 2017). The expression levels of 3,560 orthologous genes were estimated using the method RNA-Seq by Expectation Maximization (RSEM) (Li and Dewey 2011), with one to three replicates per species (see fig. 1).
Base Simulation Parameters
We used the real dataset to set the parameters of our simulations. We took the estimated crayfish tree rescaled to unit height (), with the observed vision status design (“sight” design, see fig. 1), and matching the empirical counts and gene lengths expectation and dispersion. The expression level and the dispersion were estimated from the dataset for each gene , while for each sample the simulation sequencing depth was independently drawn from a uniform distribution with bounds and the observed empirical minimal and maximal values of the library size across all samples. We used a BM model of trait evolution, with an independent layer of individual variation representing of the total tip variance for each gene : , with . We chose a base effect size of , with differentially expressed genes out of the 3,560 simulated ones. From this base scenario, we varied several parameters in order to study their impacts on the simulated data. Each scenario was replicated times.
Inference Methods Parameters
We used the following statistical inference methods: DESeq2 (Love et al. 2014) assumes a NB distribution on independent counts; limma (Ritchie et al. 2015) applies an Empirical Bayes moderation (without a mean-variance trend correction, unless otherwise specified) on independent normalized counts, possibly assuming that all the samples in a same species are correlated [limma cor (Smyth et al. 2005)]; and phylolm (Ho and Ané 2014a) uses a phylogenetic regression framework based on a BM or OU process, with measurement error. We refer to supplementary section A, Supplementary Material online for a detailed presentation of these methods. For phylolm, the differential analysis relied on a t statistic computed for each gene independently, conditionally on the estimated maximum likelihood parameters ( and for the OU). The raw P values computed by all methods were adjusted using the BH method (Benjamini and Hochberg 1995), using the R function p.adjust. Inferred gene expression differences across groups were marked as significant if their associated adjusted P value was below the threshold of .
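For readers who do not use R, the BH adjustment can be sketched in a few lines of Python. The sketch follows the usual step-up definition (each sorted P value is multiplied by the number of tests divided by its rank, and monotonicity is then enforced from the largest P value down); the function name is hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    n = len(pvalues)
    # Visit the P values from the largest to the smallest.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = n - position  # rank of this P value in increasing order
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    print(bh_adjust([0.01, 0.04, 0.03, 0.5]))
```

A gene is then called significant when its adjusted P value falls below the chosen threshold.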
Length Normalization and Transformation
In DESeq2 (Love et al. 2014), we used the default RLE method (Anders and Huber 2010) to compute the sample-specific normalization factor . We followed the recommendations of the section “Sample-/gene-dependent normalization factors” from the DESeq2 vignette to compute the coefficients from the coefficients and gene lengths detailed in supplementary section A, Supplementary Material online. For methods requiring a preprocessing normalization of the count data (limma and phylolm), we used the TMM method (Robinson and Oshlack 2010) implemented in the calcNormFactors function in edgeR, and a length normalization with a transformation.
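As a reminder of what length normalization involves, the sketch below computes the textbook TPM and RPKM values for one sample from raw counts and gene lengths. It deliberately ignores the TMM normalization factors and any subsequent transformation, so it is only an illustration of the two length normalizations and not of the exact pipeline used here; function names are hypothetical.

```python
def tpm(counts, lengths):
    """Transcripts per million: length-scaled rates rescaled to sum to 1e6."""
    rates = [c / size for c, size in zip(counts, lengths)]
    total = sum(rates)
    return [1e6 * r / total for r in rates]


def rpkm(counts, lengths):
    """Reads per kilobase per million mapped reads (raw library size)."""
    library_size = sum(counts)
    return [1e9 * c / (size * library_size) for c, size in zip(counts, lengths)]


if __name__ == "__main__":
    counts, lengths = [100, 300], [1000, 3000]
    print(tpm(counts, lengths))
    print(rpkm(counts, lengths))
```

With species-specific gene lengths, the same gene is divided by a different length in each sample, which is what distinguishes the interspecies setting from a single-species analysis.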
Scores Used to Assess the Performance of the Inference Methods
To assess the performance of the inference methods, based on the list of true (simulated) differentially expressed genes, we computed the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). We used the Matthews correlation coefficient (MCC) as advised in Chicco and Jurman (2020). We also computed the true positive rate (TPR) and the false discovery rate (FDR). In addition, we compared the features of the simulated datasets with the empirical one using the countsimQC R package (Soneson and Robinson 2018).
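These three scores follow from the confusion counts through their standard definitions, as in the sketch below. The function name is hypothetical, and returning zero when a denominator vanishes is a convention of this sketch.

```python
import math


def scores(tp, tn, fp, fn):
    """Return (MCC, TPR, FDR) from the four confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    tpr = tp / (tp + fn) if tp + fn > 0 else 0.0  # share of true DE genes detected
    fdr = fp / (tp + fp) if tp + fp > 0 else 0.0  # share of detections that are false
    return mcc, tpr, fdr


if __name__ == "__main__":
    print(scores(tp=8, tn=80, fp=2, fn=10))
```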
Reanalysis of the Crayfish Dataset
Stern and Crandall (2018a) used the OUwie package (Beaulieu et al. 2012) on each gene to compare an OU model with a single optimal value to a model with two optimal values, one for the sighted species and one for the blind. This method is similar to the phylogenetic ANOVA, but differs in two important aspects. First, it takes as input the within-species empirical means and variances instead of the individual values. Second, it uses a likelihood ratio test assuming a chi-square distribution with one degree of freedom, instead of the conditional t-test used in phylolm. In a mixed model setting, such likelihood ratio tests have been shown to be anticonservative (see, e.g., Section 2.4 in Pinheiro and Bates 2006), and can hence lead to many false discoveries. We applied the limma cor method on the TPM values, which performed best on the realistic simulations above, to the same dataset, and compared the list of differentially expressed genes to the one found in Stern and Crandall (2018a). We also applied the phylolm OU method, which is the most similar to the OUwie method, for comparison.
Acknowledgments
P.B. and M.G. are grateful to Sylvain Merlot for initiating this project, to Christine Drevet, Marie-Laure Martin and Guillem Rigaill for useful discussions, and to Claire Ducos, Marie Michel, and Sarah Jelassi for their work during their master internship. This work was partly funded by the I2BC and the MI CNRS through the MODELCOG (M.G.) and X-TrEM projects (Sylvain Merlot). We are grateful to the INRAE MIGALE bioinformatics facility (MIGALE, INRAE, 2020. Migale bioinformatics Facility, doi: 10.15454/1.5572390655343293E12) for providing computing and storage resources.
Contributor Information
Paul Bastide, IMAG, Université de Montpellier, CNRS, Montpellier, France.
Charlotte Soneson, Friedrich Miescher Institute for Biomedical Research, 4058 Basel, Switzerland; SIB Swiss Institute of Bioinformatics, 4058 Basel, Switzerland.
David B Stern, Department of Integrative Biology, University of Wisconsin-Madison, 430 Lincoln Drive, Madison, WI 53706, USA.
Olivier Lespinet, Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CEA, CNRS, 91198 Gif-sur-Yvette, France.
Mélina Gallopin, Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CEA, CNRS, 91198 Gif-sur-Yvette, France.
Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.
Data Availability
The simulation tool is integrated into the compcodeR package, which is freely available on the Bioconductor platform, and documented through a specific vignette (https://doi.org/10.18129/B9.bioc.compcodeR). The data and code used for the simulation study are available on the following GitHub repository: https://github.com/i2bc/InterspeciesDE, and were deposited on Zenodo: www.doi.org/10.5281/zenodo.7311523.
References
- Aitchison J, Ho CH. 1989. The multivariate Poisson-log normal distribution. Biometrika 76(4):643–653.
- Alam T, Agrawal S, Severin J, Young RS, Andersson R, Arner E, Hasegawa A, Lizio M, Ramilowski JA, Abugessaisa I, et al. 2020. Comparative transcriptomics of primary cells in vertebrates. Genome Res. 30(7):951–961.
- Anders S, Huber W. 2010. Differential expression analysis for sequence count data. Genome Biol. 11(10):R106.
- Ané C. 2008. Analysis of comparative data with hierarchical autocorrelation. Ann Appl Stat. 2(3):1078–1102.
- Bartoszek K. 2016. Phylogenetic effective sample size. J Theor Biol. 407:371–386.
- Bastian FB, Roux J, Niknejad A, Comte A, Fonseca Costa S, de Farias TM, Moretti S, Parmentier G, de Laval VR, Rosikiewicz M, et al. 2021. The Bgee suite: integrated curated expression atlas and comparative transcriptomics in animals. Nucleic Acids Res. 49(D1):D831–D847.
- Beaulieu JM, Jhwueng D-C, Boettiger C, O’Meara BC. 2012. Modeling stabilizing selection: expanding the Ornstein-Uhlenbeck model of adaptive evolution. Evolution 66(8):2369–2383.
- Bedford T, Hartl DL. 2009. Optimization of gene expression by natural selection. Proc Natl Acad Sci. 106(4):1133–1138.
- Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B (Methodol). 57(1):289–300.
- Blake LE, Roux J, Hernando-Herraez I, Banovich NE, Perez RG, Hsiao CJ, Eres I, Cuevas C, Marques-Bonet T, Gilad Y. 2020. A comparison of gene expression and DNA methylation patterns across tissues and species. Genome Res. 30(2):250–262.
- Blake LE, Thomas SM, Blischak JD, Hsiao CJ, Chavarria C, Myrthil M, Gilad Y, Pavlovic BJ. 2018. A comparative study of endoderm differentiation in humans and chimpanzees. Genome Biol. 19(1):162.
- Blomberg SP, Garland T, Ives AR. 2003. Testing for phylogenetic signal in comparative data: behavioral traits are more labile. Evolution 57(4):717–745.
- Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L. 2009. Fast statistical alignment. PLoS Comput Biol. 5(5):e1000392.
- Brawand D, Soumillon M, Necsulea A, Julien P, Csárdi G, Harrigan P, Weier M, Liechti A, Aximu-Petri A, Kircher M, et al. 2011. The evolution of gene expression levels in mammalian organs. Nature 478(7369):343–348.
- Cáceres M, Lachuer J, Zapala MA, Redmond JC, Kudo L, Geschwind DH, Lockhart DJ, Preuss TM, Barlow C. 2003. Elevated gene expression levels distinguish human from non-human primate brains. Proc Natl Acad Sci USA 100(22):13030–13035.
- Catalán A, Briscoe AD, Höhna S. 2019. Drift and directional selection are the evolutionary forces driving gene expression divergence in eye and brain tissue of heliconius butterflies. Genetics 213(2):581–594.
- Chen Y, Lun ATL, Smyth GK. 2014. Differential expression analysis of complex RNA-seq experiments using edgeR. In: Datta S, Nettleton D, editors. Statistical analysis of next generation sequencing data. Cham: Springer International Publishing. p. 51–74.
- Chen J, Swofford R, Johnson J, Cummings BB, Rogel N, Lindblad-Toh K, Haerty W, Di Palma F, Regev A. 2019. A quantitative framework for characterizing the evolutionary history of mammalian gene expression. Genome Res. 29(1):53–63.
- Chicco D, Jurman G. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 21(1):1–13.
- Chung M, Bruno VM, Rasko DA, Cuomo CA, Muñoz JF, Livny J, Shetty AC, Mahurkar A, Dunning Hotopp JC. 2021. Best practices on the differential expression analysis of multi-species RNA-seq. Genome Biol. 22(1):121.
- Cooper N, Thomas GH, Venditti C, Meade A, Freckleton RP. 2016. A cautionary note on the use of Ornstein-Uhlenbeck models in macroevolutionary studies. Biol J Linn Soc. 118(1):64–77.
- Cope AL, O’Meara BC, Gilchrist MA. 2020. Gene expression of functionally-related genes coevolves across fungal species: detecting coevolution of gene expression using phylogenetic comparative methods. BMC Genom. 21(1):370.
- Dillies M-A, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, et al. 2013. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform. 14(6):671–683.
- Dunn CW, Luo X, Wu Z. 2013. Phylogenetic analysis of gene expression. Integr Comp Biol. 53(5):847–856.
- Dunn CW, Zapata F, Munro C, Siebert S, Hejnol A. 2018. Pairwise comparisons across species are problematic when analyzing functional genomic data. Proc Natl Acad Sci USA. 115(3):E409–E417.
- Enard W. 2002. Intra- and interspecific variation in primate gene expression patterns. Science 296(5566):340–343.
- Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat. 125(1):1–15.
- Felsenstein J. 2008. Comparative methods with sampling error and within-species variation: contrasts revisited and revised. Am Nat. 171(6):713–725.
- Frazee AC, Jaffe AE, Langmead B, Leek JT. 2015. Polyester: simulating RNA-seq datasets with differential transcript expression. Bioinformatics 31(17):2778–2784.
- Fukushima K, Pollock DD. 2020. Amalgamated cross-species transcriptomes reveal organ-specific propensity in gene expression evolution. Nat Commun. 11(1):4459.
- Gallopin M, Rau A, Jaffrézic F. 2013. A hierarchical Poisson log-normal model for network inference from RNA sequencing data. PLoS ONE 8(10):e77503.
- Gilad Y, Mizrahi-Man O. 2015. A reanalysis of mouse ENCODE comparative gene expression data. F1000Research 4:121.
- Gilad Y, Oshlack A, Smyth GK, Speed TP, White KP. 2006. Expression profiling in primates reveals a rapid evolution of human transcription factors. Nature 440(7081):242–245.
- Goolsby EW, Bruggeman J, Ané C. 2017. Rphylopars: fast multivariate phylogenetic comparative methods for missing data and within-species variation. Methods Ecol Evol. 8(1):22–27.
- Grafen A. 1989. The phylogenetic regression. Phil Trans R Soc Lond B. 326(1233):119–157.
- Grafen A. 1992. The uniqueness of the phylogenetic regression. J Theor Biol. 156(4):405–423.
- Gu X. 2004. Statistical framework for phylogenomic analysis of gene family expression profiles. Genetics 167(1):531–542.
- Gu X, Ruan H, Yang J. 2019. Estimating the strength of expression conservation from high throughput RNA-seq data. Bioinformatics 35(23):5030–5038.
- Gu X, Su Z. 2007. Tissue-driven hypothesis of genomic evolution and sequence-expression correlations. Proc Natl Acad Sci USA. 104(8):2779–2784.
- Hadfield JD, Nakagawa S. 2010. General quantitative genetic methods for comparative biology: phylogenies, taxonomies and multi-trait models for continuous and categorical characters. J Evol Biol. 23(3):494–508.
- Hansen TF. 1997. Stabilizing selection and the comparative analysis of adaptation. Evolution 51(5):1341.
- Hansen TF, Martins EP. 1996. Translating between microevolutionary process and macroevolutionary patterns: the correlation structure of interspecific data. Evolution 50(4):1404.
- Harmon LJ. 2019. Phylogenetic comparative methods: learning from trees. Center for Open Science, version 1. edition. Available from: https://lukejharmon.github.io/pcm/.
- Ho LST, Ané C. 2013. Asymptotic theory with hierarchical autocorrelation: Ornstein-Uhlenbeck tree models. Ann Stat. 41(2):957–981.
- Ho LST, Ané C. 2014a. A linear-time algorithm for gaussian and non-Gaussian trait evolution models. Syst Biol. 63(3):397–408.
- Ho LST, Ané C. 2014b. Intrinsic inference difficulties for trait evolution with Ornstein-Uhlenbeck models. Methods Ecol Evol. 5(11):1133–1146.
- Holmes S, Huber W. 2019. Modern statistics for modern biology. Cambridge (UK): Cambridge University Press.
- Housworth EA, Martins EP, Lynch M. 2004. The phylogenetic mixed model. Am Nat. 163(1):84–96.
- Ives AR, Midford PE, Garland T, Oakley T. 2007. Within-species variation and measurement error in phylogenetic comparative methods. Syst Biol. 56(2):252–270.
- Khaitovich P, Weiss G, Lachmann M, Hellmann I, Enard W, Muetzel B, Wirkner U, Ansorge W, Pääbo S. 2004. A neutral model of transcriptome evolution. PLoS Biol. 2(5):e132.
- King M, Wilson A. 1975. Evolution at two levels in humans and chimpanzees. Science 188(4184):107–116. [DOI] [PubMed] [Google Scholar]
- Kristiansson E, Österlund T, Gunnarsson L, Arne G, Larsson DGJ, Nerman O. 2013. A novel method for cross-species gene expression analysis. BMC Bioinform. 14(1):70. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Law CW, Chen Y, Shi W, Smyth GK. 2014. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15(2):R29. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li B, Dewey CN. 2011. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinform. 12(1):323. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Love MI, Huber W, Anders S. 2014. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15(12):550. [DOI] [PMC free article] [PubMed] [Google Scholar]
- LoVerso PR, Cui F. 2015. A computational pipeline for cross-species analysis of RNA-seq data using r and bioconductor. Bioinform Biol Insights. 9:BBI.S30884. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lynch M. 1991. Methods for the analysis of comparative data in evolutionary biology. Evolution 45(5):1065–1080. [DOI] [PubMed] [Google Scholar]
- Martins EP, Hansen TF. 1997. Phylogenies and the comparative method: a general approach to incorporating phylogenetic information into the analysis of interspecific data. Am Nat. 149(4):646–667. [Google Scholar]
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 5(7):621–628. [DOI] [PubMed] [Google Scholar]
- Musser JM, Wagner GP. 2015. Character trees from transcriptome data: origin and individuation of morphological characters and the so-called “species signal”. J Exp Zool B: Mol Dev Evol. 324(7):588–604. [DOI] [PubMed] [Google Scholar]
- Perry GH, Melsted P, Marioni JC, Wang Y, Bainer R, Pickrell JK, Michelini K, Zehr S, Yoder AD, Stephens M, et al. 2012. Comparative RNA sequencing reveals substantial genetic variation in endangered primates. Genome Res. 22(4):602–610. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pinheiro J, Bates D. 2006. Mixed-effects models in S and S-PLUS. Springer New York, NY: Springer Science & Business Media. Available from: https://link.springer.com/book/10.1007/b98882. [Google Scholar]
- Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. 2015. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43(7):e47–e47. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Robinson MD, Oshlack A. 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11(3):R25.
- Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ, Taylor JM. 2012. Efficient experimental design and analysis strategies for the detection of differential expression using RNA-sequencing. BMC Genom. 13(1):484.
- Rogozin IB, Managadze D, Shabalina SA, Koonin EV. 2014. Gene family level comparative analysis of gene expression in mammals validates the ortholog conjecture. Genom Biol Evol. 6(4):754–762.
- Rohlfs RV, Harrigan P, Nielsen R. 2014. Modeling gene expression evolution with an extended Ornstein–Uhlenbeck process accounting for within-species variation. Mol Biol Evol. 31(1):201–211.
- Rohlfs RV, Nielsen R. 2015. Phylogenetic ANOVA: the expression variance and evolution model for quantitative trait evolution. Syst Biol. 64(5):695–708.
- Romero IG, Ruvinsky I, Gilad Y. 2012. Comparative studies of gene expression and the evolution of gene regulation. Nat Rev Genet. 13(7):505–516.
- Roux J, Rosikiewicz M, Robinson-Rechavi M. 2015. What to compare and how: comparative transcriptomics for Evo-Devo. J Exp Zool B: Mol Dev Evol. 324(4):372–382.
- Silvestro D, Kostikova A, Litsios G, Pearman PB, Salamin N. 2015. Measurement errors should always be incorporated in phylogenetic comparative analysis. Methods Ecol Evol. 6(3):340–346.
- Smyth GK. 2004. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 3(1):1–25.
- Smyth GK, Michaud J, Scott HS. 2005. Use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics 21(9):2067–2075.
- Soneson C. 2014. compcodeR–an R package for benchmarking differential expression methods for RNA-seq data. Bioinformatics 30(17):2517–2518.
- Soneson C, Delorenzi M. 2013. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinform. 14(1):91.
- Soneson C, Robinson MD. 2018. Towards unified quality verification of synthetic count data with countsimQC. Bioinformatics 34(4):691–692.
- Stern DB, Breinholt J, Pedraza-Lara C, López-Mejía M, Owen CL, Bracken-Grissom H, Fetzner JW, Crandall KA. 2017. Phylogenetic evidence from freshwater crayfishes that cave adaptation is not an evolutionary dead-end. Evolution 71(10):2522–2532.
- Stern DB, Crandall KA. 2018a. The evolution of gene expression underlying vision loss in cave animals. Mol Biol Evol. 35(8):2005–2014.
- Stern DB, Crandall KA. 2018b. Phototransduction gene expression and evolution in cave and surface crayfishes. Integr Comp Biol. 58(3):398–410.
- Tatusov RL. 1997. A genomic perspective on protein families. Science 278(5338):631–637.
- Tekaia F. 2016. Inferring orthologs: open questions and perspectives. Genom Insights. 9:GEI.S37925.
- Torres-Oliva M, Almudi I, McGregor AP, Posnien N. 2016. A robust (re-)annotation approach to generate unbiased mapping references for RNA-seq-based analyses of differential expression across closely related species. BMC Genom. 17(1):392.
- Van den Berge K, Hembach KM, Soneson C, Tiberi S, Clement L, Love MI, Patro R, Robinson MD. 2019. RNA sequencing data: Hitchhiker’s guide to expression analysis. Annu Rev Biomed Data Sci. 2(1):139–173.
- Wagner GP, Kin K, Lynch VJ. 2012. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131(4):281–285.
- Wang Z, Gerstein M, Snyder M. 2009. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 10(1):57–63.
- Whitehead A, Crawford DL. 2006. Variation within and among species in gene expression: raw material for evolution. Mol Ecol. 15(5):1197–1211.
- Zhang H, Xu J, Jiang N, Hu X, Luo Z. 2015. PLNseq: a multivariate Poisson lognormal distribution for high-throughput matched RNA-sequencing read count data. Stat Med. 34(9):1577–1589.
- Zheng-Bradley X, Rung J, Parkinson H, Brazma A. 2010. Large scale comparison of global gene expression patterns in human and mouse. Genome Biol. 11(12):R124.
- Zhou Y, Zhu J, Tong T, Wang J, Lin B, Zhang J. 2019. A statistical normalization method and differential expression analysis for RNA-seq data between different species. BMC Bioinform. 20(1):163.
- Zhu Y, Li M, Sousa AM, Šestan N. 2014. XSAnno: a framework for building ortholog models in cross-species transcriptome comparisons. BMC Genom. 15(1):343.
Data Availability Statement
The simulation tool is integrated into the compcodeR package, which is freely available on the Bioconductor platform and documented in a dedicated vignette (https://doi.org/10.18129/B9.bioc.compcodeR). The data and code used for the simulation study are available in the following GitHub repository: https://github.com/i2bc/InterspeciesDE, and were deposited on Zenodo: www.doi.org/10.5281/zenodo.7311523.