TROM: A Testing-Based Method for Finding Transcriptomic Similarity of Biological Samples

Wei Vivian Li; Yiling Chen; Jingyi Jessica Li

doi:10.1007/s12561-016-9163-y

Author manuscript; available in PMC: 2018 Jun 1.

Published in final edited form as: Stat Biosci. 2016 Aug 29;9(1):105–136. doi: 10.1007/s12561-016-9163-y

PMCID: PMC5542419 NIHMSID: NIHMS813908 PMID: 28781712

Abstract

Comparative transcriptomics has gained increasing popularity in genomic research thanks to the development of high-throughput technologies, including microarray and next-generation RNA sequencing, that have generated large amounts of transcriptomic data. An important question is to understand the conservation and divergence of biological processes in different species. We propose a testing-based method TROM (Transcriptome Overlap Measure) for comparing transcriptomes within or between different species, and provide a different perspective, in contrast to traditional correlation analyses, about capturing transcriptomic similarity. Specifically, the TROM method focuses on identifying associated genes that capture molecular characteristics of biological samples, and subsequently comparing the biological samples by testing the overlap of their associated genes. We use simulation and real data studies to demonstrate that TROM is more powerful in identifying similar transcriptomes and more robust to stochastic gene expression noise than Pearson and Spearman correlations. We apply TROM to compare the developmental stages of six Drosophila species, C. elegans, S. purpuratus, D. rerio and mouse liver, and find interesting correspondence patterns that imply conserved gene expression programs in the development of these species. The TROM method is available as an R package on CRAN (https://cran.r-project.org/package=TROM) with manuals and source code available at http://www.stat.ucla.edu/~jingyi.li/software-and-data/trom.html.

Keywords: Transcriptomic similarity measure, Multi-species developmental stages, Robustness to platform differences, Comparative transcriptomics, Microarray vs. RNA-seq, Pearson correlation coefficient, Spearman correlation coefficient, overlap test

1 Introduction

Comparative genomics is an important field that addresses evolutionary questions and studies developmental processes across distant species [17]. Studying transcriptomes is essential for understanding functions of genomic regions and interpreting regulatory relationships of multiple genomic elements [25]. Comparing transcriptomes of the same species can reveal molecular mechanisms behind the occurrence and progression of important biological processes, such as organism development and stem cell differentiation [12,19]. Comparing transcriptomes of different species can help understand the conservation and differentiation of these molecular mechanisms in evolution [14]. High-throughput technologies have generated large amounts of publicly available transcriptomic data, creating an unprecedented opportunity for comparing multi-species transcriptomes under various biological conditions.

Finding the transcriptomic similarity and disparity of biological samples is a key step to understand the underlying molecular mechanisms common or unique to them. It is desirable to have a transcriptomic similarity measure that can lead to a clear correspondence pattern of biological samples from the same or different species. Correlation analysis is a classical approach for comparing transcriptomes based on gene expression data. Commonly used measures are Pearson and Spearman correlation coefficients, both of which have played important roles in biological discoveries [1,16,20]. However, in most scenarios neither of them can produce a clear correspondence pattern among biological samples. The main reason is the existence of many housekeeping genes, which would inflate correlation coefficients. Moreover, correlation measures rely heavily on the accuracy of gene expression data and are susceptible to the low signal-to-noise ratios of lowly expressed genes. Therefore, it is often difficult to use correlation analysis to find a clear correspondence pattern of transcriptomes.

Here we introduce a new testing-based measure, the transcriptome overlap measure (TROM), to find correspondence of transcriptomes in the same or different species. The measure is based on testing the overlap of “associated genes,” which represent transcriptomic characteristics of biological samples. For the purpose of discovering sparse sample relationships, we define a sample correspondence map as the binarized mapping pattern resulting from a sample similarity matrix: a nonzero value means that two samples are mapped to each other, while a zero value means that two samples are unmapped. We show that compared to Pearson and Spearman correlations, TROM has better power to detect transcriptome correspondence in simulations and leads to clearer correspondence maps of developmental stages within and between multiple species in real data studies. TROM also provides a systematic approach for selecting associated genes of every biological sample. We show that these associated genes can well capture transcriptomic characteristics and help construct developmental trees in multiple species. In addition, we demonstrate that TROM is robust to data normalization and high-throughput platform difference.

In Sect. 2, we describe the TROM method including the identification of associated genes, the calculation of TROM scores, and the selection of a threshold parameter. In Sect. 3, we present real data applications of TROM to large-scale transcriptomic data sets, power analysis of TROM versus Pearson and Spearman correlations, demonstration of the robustness of TROM to data normalization and platform difference, and bioinformatic analyses of the TROM results.

2 Method

2.1 Associated Genes and TROM Scores

Our method focuses on selecting associated genes to perform a gene set overlap test [14], which will lead to TROM scores that can be used to compare biological samples. We define associated genes of a sample using the following criterion: the genes that have z-scores (normalized expression levels across samples) ≥ z in the sample, where z is a threshold that can be selected in a systematic approach (please see Sect. 2.2) or set by users. Based on this definition, associated genes of a sample are those with higher expression in the sample compared to a few other samples. In other words, associated genes are highly expressed in the sample of interest but not always highly expressed in all samples, and they are a superset of sample specific genes. Hence, associated genes capture gene expression characteristics of a sample, and these characteristics are either specific to the sample or shared by a few other samples but not all samples. Associated genes provide a basis for comparing biological samples. We compare two biological samples by statistically testing the dependence of their associated genes: to compare two samples of the same species, we calculate the significance of the number of their overlapping associated genes (resulting in a within-species TROM score); to compare two samples of different species, we calculate the significance of the number of orthologous gene pairs in their associated genes (resulting in a between-species TROM score).
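The selection criterion above can be illustrated with a minimal sketch. It assumes z-scores are computed per gene across samples using the population standard deviation and that genes with constant expression are never associated; neither detail is specified in the text, and the function name `associated_genes` and the toy data are hypothetical.

```python
from statistics import mean, pstdev

def associated_genes(expr, z):
    """expr maps gene -> list of expression values, one per sample.
    Returns one set of associated genes per sample."""
    n_samples = len(next(iter(expr.values())))
    result = [set() for _ in range(n_samples)]
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        if sd == 0:
            continue  # assumption: constant genes have no defined z-score
        for j, v in enumerate(values):
            if (v - mu) / sd >= z:  # criterion: z-score >= threshold z
                result[j].add(gene)
    return result

# Toy data (hypothetical): three genes measured in four samples.
expr = {
    "geneA": [10.0, 1.0, 1.0, 1.0],  # high only in sample 0
    "geneB": [5.0, 5.0, 5.0, 5.0],   # constant, housekeeping-like
    "geneC": [1.0, 8.0, 9.0, 1.0],   # high in samples 1 and 2
}
sets = associated_genes(expr, z=0.5)
```

In this toy case the constant gene is associated with no sample, while geneC is shared by two samples, which matches the description of associated genes as a superset of sample-specific genes.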

We consider the two sample-associated gene sets as two samples drawn from the gene population. In the within-species scenario, we denote the number of biological samples of a given species as m, and use X_i and X_j (i, j = 1, 2, …, m) to denote the associated genes of samples i and j to be compared. The gene population consists of all genes of the given species, and the size of the gene population is denoted as N. Then to test for the null hypothesis that X_i and X_j are two independent samples drawn from the gene population versus the alternative hypothesis that X_i and X_j are dependent samples, the p-value for within-species comparison between samples i and j is calculated as

p\text{-value} = \sum_{k = |X_i \cap X_j|}^{\min(|X_i|, |X_j|)} \frac{\binom{N}{k} \binom{N - k}{|X_i| - k} \binom{N - |X_i|}{|X_j| - k}}{\binom{N}{|X_i|} \binom{N}{|X_j|}} .

(1)
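Eq. (1) can be evaluated directly with exact integer binomial coefficients. The sketch below is an illustration rather than the package's implementation; the function name `overlap_pvalue` and the toy numbers are hypothetical.

```python
from math import comb

def overlap_pvalue(n_pop, size_i, size_j, overlap):
    """Eq. (1): probability of an overlap at least as large as observed,
    for two independent gene sets of the given sizes drawn from n_pop genes."""
    denom = comb(n_pop, size_i) * comb(n_pop, size_j)
    total = 0
    for k in range(overlap, min(size_i, size_j) + 1):
        total += (comb(n_pop, k)
                  * comb(n_pop - k, size_i - k)
                  * comb(n_pop - size_i, size_j - k))
    return total / denom  # exact integers until this final division

p_full = overlap_pvalue(10, 3, 3, 3)  # both sets identical
p_none = overlap_pvalue(10, 3, 3, 0)  # no overlap required
```

Because the first two factors of the numerator divided by the first factor of the denominator reduce to a single binomial coefficient, each summand equals a hypergeometric probability, so an upper-tail hypergeometric function from a statistics library should give the same value for large N.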

In the between-species scenario, we denote the numbers of biological samples from species 1 and 2 as m₁ and m₂. The gene population consists of all orthologous gene pairs between the two species, and the number of pairs is denoted as N. The ortholog pairs can be represented as a two-column table with N rows. We use X_i (i = 1, 2, …, m₁) and Y_j (j = 1, 2, …, m₂) to denote the orthologous gene pairs (i.e., rows in the table) that overlap with the associated genes of sample i in species 1 and sample j in species 2, respectively. In other words, X_i (or Y_j) represents the orthologous gene pairs that contain the associated genes in sample i of species 1 (or sample j of species 2). Then to test for the null hypothesis that X_i and Y_j are two independent samples drawn from the population of orthologous gene pairs versus the alternative hypothesis that X_i and Y_j are dependent samples, the p-value for between-species comparison of the two samples is calculated as

p\text{-value} = \sum_{k = |X_i \cap Y_j|}^{\min(|X_i|, |Y_j|)} \frac{\binom{N}{k} \binom{N - k}{|X_i| - k} \binom{N - |X_i|}{|Y_j| - k}}{\binom{N}{|X_i|} \binom{N}{|Y_j|}} .

(2)
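In the between-species case the sets X_i and Y_j contain rows of the ortholog table rather than genes. The following sketch shows one way to build them; the table contents, gene names, and the helper `ortholog_rows` are hypothetical.

```python
def ortholog_rows(ortholog_table, associated, column):
    """Indices of table rows whose gene in `column` (0 or 1) is associated."""
    return {r for r, pair in enumerate(ortholog_table)
            if pair[column] in associated}

# Hypothetical ortholog table: one (species-1 gene, species-2 gene) pair per row.
table = [("f1", "w1"), ("f1", "w2"), ("f2", "w3"), ("f3", "w3"), ("f4", "w4")]

x_i = ortholog_rows(table, {"f1", "f2"}, column=0)  # sample i of species 1
y_j = ortholog_rows(table, {"w2", "w3"}, column=1)  # sample j of species 2
shared = x_i & y_j                                  # rows counted in |X_i ∩ Y_j|
n_pairs = len(table)                                # N in Eq. (2)
```

Working with row indices handles one-to-many orthologs naturally, since a gene with several orthologs contributes several rows.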

Then we define the within-species or between-species TROM score as

\text{TROM score} = -\log_{10}(\text{Bonferroni-corrected } p\text{-value}),

(3)
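A minimal sketch of Eq. (3) follows. It assumes the Bonferroni correction multiplies the p-value by the number of pairwise tests and caps the result at 1; the exact number of tests used is not stated here, so `n_tests` is a hypothetical parameter.

```python
import math

def trom_score(p_value, n_tests):
    """Eq. (3): -log10 of the Bonferroni-corrected p-value."""
    corrected = min(1.0, p_value * n_tests)  # assumed form of the correction
    if corrected == 0.0:
        return math.inf  # assumption: callers saturate very large scores
    return max(0.0, -math.log10(corrected))  # max() avoids returning -0.0

weak = trom_score(0.5, 10)       # corrected p-value capped at 1
strong = trom_score(1e-8, 100)   # corrected p-value of 1e-6
```

A corrected p-value of 1 gives a score of 0, which is why a nonzero score can be read as a mapped pair in the binarized correspondence map.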

which describes transcriptome similarity of two biological samples. A larger TROM score represents greater similarity.

2.2 Selection of z-Score Threshold

The selection of the z-score threshold z will directly influence the sensitivity and specificity of sample-associated genes and thus affect the resulting TROM scores. If z is too small, a large number of associated genes will be selected for every sample and more associated genes will be shared by different samples, and thus it becomes difficult to distinguish different biological samples. If z is too large, only a small number of associated genes will be identified for each sample and potentially informative genes could be filtered out, and thus no similarity of biological samples will be captured by TROM. Although the selection of z is ultimately subject to users’ preference for the resulting correspondence maps (a larger z for a sparser map or a smaller z for a denser map), we propose an objective approach to choose an appropriate threshold when no prior knowledge is available. Our approach aims at balancing two goals: (1) the threshold should help minimize noisy correspondence of biological samples and thus lead to a sparse correspondence map; (2) the threshold should help preserve strong correspondence of samples and thus lead to a stable correspondence map.

We use the log-transformed mean of the TROM scores of all pairwise comparisons of biological samples in the correspondence map as the objective function, which is defined as

u(z) = \log_{10} \left( \frac{\sum_{i = 1}^{m} \sum_{j = 1, j \neq i}^{m} a_{ij}(z)}{m^{2} - m} + 1 \right)

(4)
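As a small worked instance of Eq. (4), the sketch below averages the off-diagonal entries of a TROM score matrix and applies the log transform. The function name `objective_u` and the 3 × 3 matrix are hypothetical.

```python
import math

def objective_u(score_matrix):
    """Eq. (4): log10 of (mean off-diagonal TROM score + 1)."""
    m = len(score_matrix)
    off_diag = sum(score_matrix[i][j]
                   for i in range(m) for j in range(m) if i != j)
    return math.log10(off_diag / (m * m - m) + 1)

# Hypothetical 3 x 3 score matrix; diagonal entries are ignored.
scores = [[9.0, 3.0, 0.0],
          [3.0, 9.0, 6.0],
          [0.0, 6.0, 9.0]]
u = objective_u(scores)  # off-diagonal mean is 18 / 6 = 3, so u = log10(4)
```

The added 1 inside the logarithm keeps u(z) at exactly 0 when every off-diagonal score is 0.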

where m is the number of biological samples and A(z) = (a_{ij}(z))_{m×m} is the TROM score matrix based on threshold z. We select the desirable threshold z^* by the following approach. Considering our goal (2), we would like u(z) to be stable for z values near z^*. Since similar u(z) values would lead to a peak in the density of u(z), denoted as f(u), we consider the z values corresponding to the peak, that is, {z : u(z) = mode(u)}, where mode(u) = arg max_u f(u) (i.e., the u value that maximizes the density f(u) for u = u(z) with z ∈ [−2, 3]). Also considering our goal (1), we would like to select z^* as the largest z value in the stable region of u(z). Hence, we find z^* as

z^{*} = \sup \{ z : u(z) = \mathrm{mode}(u) \},

(5)

where u = u(z) for z ∈ [−2, 3]. If users desire a sparser correspondence map, we suggest an alternative approach to finding the z-score threshold as z^* = sup{z : u(z) = mode(u) + sd(u)}, where sd(u) stands for the standard deviation of the u(z) values. According to Lemma 1 and also our empirical observation, [−2, 3] is a large enough region to capture the peak with low computational intensity, as the u(z) values are close to 0 outside of this region.
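On a finite grid of z values, the condition u(z) = mode(u) has to be relaxed. The sketch below is one possible discretization and not the package's procedure: it replaces the density estimate with a simple histogram of the u(z) values and returns the largest z whose u(z) falls in the most populated bin. The function name `select_threshold`, the parameter `n_bins`, and the toy curve are all assumptions.

```python
def select_threshold(z_grid, u_values, n_bins=10):
    """Largest z whose u(z) lies in the most populated histogram bin of u.
    The histogram stands in for the density f(u); ties pick the lowest bin."""
    lo, hi = min(u_values), max(u_values)
    if hi == lo:
        return max(z_grid)  # flat curve: every z is equally stable
    width = (hi - lo) / n_bins

    def bin_of(u):
        return min(int((u - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for u in u_values:
        counts[bin_of(u)] += 1
    mode_bin = counts.index(max(counts))
    return max(z for z, u in zip(z_grid, u_values) if bin_of(u) == mode_bin)

# Hypothetical u(z) curve with a plateau around its maximum.
z_grid = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
u_values = [0.2, 0.9, 1.4, 1.42, 1.41, 0.8, 0.1]
z_star = select_threshold(z_grid, u_values)
```

Here three grid points share the top bin and the largest of them, z = 1.0, is returned. With real data the choice of bin width (or kernel bandwidth) affects which region counts as stable, so the result should be checked against the plotted u(z) curve.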

As shown in Lemma 1, an important feature of u(z) is that it approaches 0 when the absolute value of z is large. This is because the entire gene population will be selected as associated genes when the threshold z is small enough while no genes will be selected when z is large enough. In both extreme cases, the resulting TROM score is 0 for any pair of samples. Because of this feature and the non-negativity of u(z), u(z) must have a maximum at a certain value of z. The observed unimodal shape is a typical feature of u(z) for the various species we have investigated.

Lemma 1

u(z) → 0 as |z| → ∞.

Proof

Because of the criterion of selecting associated genes: z-scores ≥ z, for within-species comparison between samples i and j, whose sets of associated genes are denoted as X_i and X_j, we have

as z → −∞, |X_i| → N, |X_j| → N, and |X_i ∩ X_j| → N, where N is the number of all genes of the species;
as z → ∞, |X_i| → 0, |X_j| → 0, and |X_i ∩ X_j| → 0.

Given the p-value formula (Eq. (1)) of the within-species overlap test in TROM, we have

as |X_i| → N, |X_j| → N, and |X_i ∩ X_j| → N, p-value → 1;
as |X_i| → 0, |X_j| → 0, and |X_i ∩ X_j| → 0, p-value → 1.

For between-species comparison between sample i from species 1 and sample j from species 2, whose associated genes correspond to ortholog pairs denoted as X_i and Y_j, with |X_i ∩ Y_j| ortholog pairs shared between X_i and Y_j, we have

as z → −∞, |X_i| → N, |Y_j| → N, and |X_i ∩ Y_j| → N, where N is the total number of ortholog pairs between the two species;
as z → ∞, |X_i| → 0, |Y_j| → 0, and |X_i ∩ Y_j| → 0.

Given the p-value formula (Eq. (2)) of the between-species overlap test in TROM, we have

as |X_i| → N, |Y_j| → N, and |X_i ∩ Y_j| → N, p-value → 1;
as |X_i| → 0, |Y_j| → 0, and |X_i ∩ Y_j| → 0, p-value → 1.

So for both within-species and between-species comparisons, we have TROM score a_ij (z) → 0 as |z| → ∞ given Eq. (3).

Hence, given the definition of u(z) in Eq. (4), we have u(z) → 0 as |z| → ∞.
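As a check on the two limits used in the proof, substituting the boundary cases into Eq. (1) leaves a single term in the sum (k = N in the first case, k = 0 in the second), and both evaluate to exactly 1:

```latex
\begin{align*}
|X_i| = |X_j| = |X_i \cap X_j| = N: \quad
  & p\text{-value} = \frac{\binom{N}{N} \binom{0}{0} \binom{0}{0}}{\binom{N}{N} \binom{N}{N}} = 1, \\
|X_i| = |X_j| = |X_i \cap X_j| = 0: \quad
  & p\text{-value} = \frac{\binom{N}{0} \binom{N}{0} \binom{N}{0}}{\binom{N}{0} \binom{N}{0}} = 1.
\end{align*}
```

A p-value of 1 stays at 1 after a Bonferroni correction capped at 1, so the TROM score in Eq. (3) is 0 and u(z) = \log_{10}(0 + 1) = 0 in Eq. (4). The same substitution applies to Eq. (2) with Y_j in place of X_j.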

Using this proposed approach, we can easily select a z-score threshold for a specified species given its gene expression data. We demonstrate how this approach can select an appropriate threshold for comparing D. melanogaster developmental stages by applying it to the RNA-seq data of m = 30 stages. We consider candidate thresholds in the range of z ∈ [−2, 3] and calculate TROM matrices for all the candidate values in this range with a step size of 0.1. The corresponding u(z) is plotted in Fig. 1a.

From the density of u(z) (see Fig. 1b), we determine that the mode of u(z) is 1.42. By finding the maximum z value such that u(z) = 1.42, our approach selects z^* = 0.5. Figure 1c shows how different z-score thresholds influence the patterns of correspondence maps. When the threshold is too low (e.g., −0.4), many stage pairs are mapped to each other, providing vague information on the relationships of different stages. On the other hand, when the threshold is too high (e.g., 2.0), so much information is filtered out that most stages are only mapped to themselves, and important correspondence such as the similarity between fly early embryos and female adults is missing [14]. Unlike the two extremes, our selected threshold 0.5 reveals important correspondence patterns and meanwhile yields a clean correspondence map.

3 Results

3.1 Application of TROM to Finding Correspondence of Developmental Stages of Multiple Species

We first demonstrate the use and the performance of TROM in comparative transcriptomics. We apply TROM to find correspondence patterns of developmental stages of six Drosophila (fly) species, C. elegans (worm), S. purpuratus (sea urchin), D. rerio (zebrafish) and mouse liver tissues. The goal is to find similarity of developmental stages within and between species in terms of gene expression dynamics. We use multiple datasets including RNA-seq data of 30 D. melanogaster developmental stages with expression estimates of 15,095 genes, RNA-seq data of 35 C. elegans stages with 31,622 genes [9,14], RNA-seq data of 10 sea urchin stages with 21,090 genes [22], microarray data of six fly species: D. melanogaster, D. simulans, D. ananassae, D. persimilis, D. pseudoobscura and D. virilis with 9 to 13 embryonic stages and 3663 genes [1], microarray data of mouse liver development with 14 stages and 45,101 genes [15] and microarray data of D. rerio with 61 stages and 18,259 genes [6]. To implement TROM on these gene expression datasets, we select z-score thresholds based on the alternative approach described in Sect. 2.2, and the selected thresholds for various species are summarized in Appendix Table 2 and used throughout this paper unless otherwise specified. A detailed description of these datasets is given in Appendix Table 3.

Table 2.

Selected z-score thresholds for different species

Species	Threshold of z-scores
D. melanogaster (RNA-seq)	1.8
C. elegans	2.0
D. melanogaster (microarray)	0.9
D. ananassae	0.9
D. simulans	0.9
D. persimilis	0.9
D. pseudoobscura	0.8
D. virilis	1.1
Mouse liver	1.4
Sea urchin	1.1
D. rerio	1.0

Table 3.

Description of sample labels of different species

Species	Sample labels and corresponding explanation
D. melanogaster (RNA-seq)	Embryo0–2h, Embryo2–4h, Embryo4–6h, Embryo6–8h, Embryo8–10h, Embryo10–12h, Embryo12–14h, Embryo14–16h, Embryo16–18h, Embryo18–20h, Embryo20–22h, Embryo22–24h, L1 (L1 stage larvae), L2 (L2 stage larvae), L3+12h (L3 stage larvae, 12 hr post-molt), L3PS1–2 (L3 stage larvae, dark blue gut, puff stage 1–2), L3PS3–6 (L3 stage larvae, light blue gut, puff stage 3–6), L3PS7–9 (L3 stage larvae, clear gut puff stage 7–9), Prepupae (White prepupae), Prepupae+12h (Pupae, 12 hours after white prepupae), Prepupae+24h (Pupae, 24 hours after white prepupae), Prepupae+2d (Pupae, 2 days after white prepupae), Prepupae+3d (Pupae, 3 days after white prepupae), Prepupae+4d (Pupae, 4 days after white prepupae), Male+1d (Adult male, one day after eclosion), Male+5d (Adult male, 5 days after eclosion), Female+1d (Adult female, one day after eclosion), Female+5d (Adult female, 5 days after eclosion), Female+30d (Adult female, 30 days after eclosion)
C. elegans	EE_50-0 (embryo 0 mins), EE_50-30 (embryo 30 mins), EE_50–60, EE_50–90, EE_50–120, EE_50–150, EE_50–180, EE_50–210, EE_50–240, EE_50–300, EE_50–330, EE_50–360, EE_50–390, EE_50–420, EE_50–450, EE_50–480, EE_50–510, EE_50–540, EE_50–570, EE_50–600, EE_50–630, EE_50–660, EE_50–690, EE_50–720, L1 (larva L1), LIN35 (larva L1 lin35), L2 (larva L2), L3 (larva L3), L4 (larva L4), L4MALE (larva L4 male), YA (young adult), AdultSPE9 (adult spe9), DauerEntryDAF2, DauerDAF2, DauerExitDAF2
Drosophila (microarray)	E0–2h (Embryo 0–2h), E2–4h (Embryo 2–4h), E4–6h (Embryo 4–6h), E6–8h (Embryo 6–8h), E8–10h (Embryo 8–10h), E10–12h (Embryo 10–12h), E12–14h (Embryo 12–14h), E14–16h (Embryo 14–16h), E16–18h (Embryo 16–18h), E18–20h (Embryo 18–20h), E20–22h (Embryo 20–22h), E22–24h (Embryo 22–24h), E24–26h (Embryo 24–26h)
Mouse liver	E11.5 (embryonic day 11.5), E12.5 (embryonic day 12.5), E13.5 (embryonic day 13.5), E14.5 (embryonic day 14.5), E15.5 (embryonic day 15.5), E16.5 (embryonic day 16.5), E17.5 (embryonic day 17.5), E18.5 (embryonic day 18.5), Day0 (the day of birth), Day3, Day7, Day14, Day21, and NL (normal adult liver)
Sea urchin	00 hpf (0 hours post-fertilization), 10 hpf, 18 hpf, 24 hpf, 30 hpf, 40 hpf, 48 hpf, 56 hpf, 64 hpf, 72 hpf
D. rerio	0min (egg 0min), 15min (zygote 15min), 45min (cleavage 45min), 1h15min (cleavage 1h15min), 1h45min (cleavage 1h45min), 2h15min (blastula 2h15min), 2h45min (blastula 2h45min), 3h20min (blastula 3h20min), 4h (blastula 4h), 4h40min (blastula 4h40min), 5h20min (gastrula 5h20min), 6h (gastrula 6h), 7h (gastrula 7h), 8h (gastrula 8h), 9h (gastrula 9h), 10h (gastrula 10h), 10h20min (segmentation 10h20min), 11h (segmentation 11h), 11h40min (segmentation 11h40min), 12h (segmentation 12h), 13h (segmentation 13h), 14h (segmentation 14h), 15h (segmentation 15h), 16h (segmentation 16h), 17h (segmentation 17h), 18h (segmentation 18h), 19h (segmentation 19h), 20h (segmentation 20h), 21h (segmentation 21h), 22h (segmentation 22h), 23h (segmentation 23h), 1d1h (pharyngula 1d1h), 1d3h (pharyngula 1d3h), 1d6h (pharyngula 1d6h), 1d10h (pharyngula 1d10h), 1d14h (pharyngula 1d14h), 1d18h (pharyngula 1d18h), 2d (hatching 2d), 2d12h (hatching 2d12h), 3d (hatching 3d), 4d (larva 4d), 6d (larva 6d), 8d (larva 8d), 10d (larva 10d), 14d (larva 14d), 18d (larva 18d), 24d (larva 24d), 30d (larva 30d), 40d (larva 40d), 45d (juvenile 45d), 55d (juvenile 55d), 65d (juvenile 65d), 80d (juvenile 80d), 90d (adult 90d female), 3m15d (adult 3m15d female), 4m (adult 4m female), 7m (adult 7m female), 9m (adult 9m female), 1y2m (adult 1y2m female), 1y6m (adult 1y6m female), 1y9m (adult 1y9m), 55d (adult 55d male), 65d (adult 65d male), 80d (adult 80d male), 90d (adult 90d male), 3m15d (adult 3m15d male), 4m (adult 4m male), 7m (adult 7m male), 9m (adult 9m male), 1y2m (adult 1y2m male), 1y6m (adult 1y6m male), 1y9m (adult 1y9m male)

In the comparison of developmental stages within each species, the TROM method finds block diagonal correspondence patterns as expected. That is, in every species, adjacent developmental stages close to each other in the time order have high TROM scores. We illustrate the correspondence maps of developmental stages of mouse liver (Fig. 2a), sea urchin (Fig. 2b) and the six Drosophila species (Appendix Fig. 7). These results provide strong support to the efficacy and validity of TROM in finding transcriptomic similarity of biological samples, in addition to our previous results on the correspondence of D. melanogaster and C. elegans stages based on RNA-seq data [14], to which we applied the preliminary idea of TROM.

Fig. 2 — Within-species and between-species correspondence maps of TROM scores. For better illustration, TROM scores are saturated at 6: all the scores larger than 6 are set to 6. a Pairwise within-species TROM scores calculated for the 14 stages of mouse liver; b pairwise within-species TROM scores calculated for the 10 stages of sea urchin; c pairwise within-species TROM scores calculated for the 10 stages of *D. melanogaster*. The *column stages* are from the microarray data, and the *row stages* are from the RNA-seq data; d pairwise between-species TROM scores of *D. melanogaster* vs. mouse liver. The *columns* represent the 14 mouse stages, and the *rows* represent the 30 fly stages

Fig. 7 — Correspondence maps of within-species and between-species TROM scores (calculated based on the z-score thresholds listed in the table). TROM scores are saturated at 6. The names of the species are marked as *row* or *column* labels of the corresponding heatmaps. For the *Drosophila* species, the stages labels 1–13 refer to Embryo 0–2, 2–4, 4–6, 6–8, 8–10, 10–12, 12–14, 14–16, 16–18, 18–20, 20–22, 22–24 and 24–26 h, respectively

We also apply TROM to compare the developmental stages of two different species. We use ortholog information downloaded from Ensembl [4] in the comparison. Since fly, worm, and mouse are vastly distant from each other in evolution, any correspondence between their developmental stages revealed by TROM will be interesting and may imply conserved developmental programs. Between D. melanogaster life cycle and mouse liver development (Fig. 2d), TROM finds unknown correspondence between fly early embryos and mouse embryo liver tissues, and between fly female adults and mouse embryo liver tissues. A main reason for the latter correspondence is the transcriptomic similarity of fly early embryos and female adults due to the expression of maternal effect genes [14]. Additionally, there is some irregular correspondence between fly larvae and liver tissues of born mice. We can see a clear separation of the liver tissues of mouse embryos and born mice, and their corresponding fly stages also exhibit a separation of embryos and female adults from other stages. These results indicate that even for vastly different species such as fly and mouse, there is good conservation in their embryonic development. Similarly between the six Drosophila species’ embryonic development and mouse liver development, we also see good correspondence of fly early embryos and mouse embryo liver tissues, and correspondence between fly late embryos and mouse adult liver tissues (Appendix Fig. 7). Moreover, mouse embryo liver tissues are observed to correspond well with worm embryos, and this is consistent with the observed correspondence between fly embryos and worm embryos (Appendix Fig. 7). These consistent correspondence patterns together validate the efficacy of the TROM approach.

Between the six Drosophila species, since they are known to have similar developmental programs [1], comparisons of their developmental stages resemble within-species comparisons, and block diagonal correspondence patterns are expected. Our results confirm this: diagonal patterns are observed between the developmental stages of every two fly species (Appendix Fig. 7). These results again demonstrate the validity of TROM.

3.2 Comparison of TROM and Pearson/Spearman Correlation Measures

We next describe the scenarios where TROM serves as a better similarity measure than Pearson/Spearman correlation measures in differentiating stage pairs that exhibit high dependence in highly expressed genes from other stage pairs. A key difference between our TROM method and the Pearson/Spearman correlation analysis is that TROM divides genes into two sets (associated genes and non-associated genes) for every sample based on gene expression dynamics across all samples. After the division, calculation of TROM scores does not rely on actual gene expression measurements. Hence, TROM defines sample similarity based on the overlap of their associated genes. In contrast to TROM, Pearson and Spearman correlations are calculated based on actual expression measurements of the same set of genes in two samples. Hence, they are more sensitive to expression fluctuations of lowly expressed genes due to measurement errors, and their values can be driven high by the genes (e.g., housekeeping genes) that have approximately constant expression across samples and carry little information on sample characteristics. For our goal of constructing a sparse sample correspondence map based on gene expression, Pearson and Spearman correlation measures are often unsatisfactory, as they give rise to noisy correspondence maps (Appendix Figs. 8, 9).

Fig. 8 — Outline of the package TROM. The three *left* (and *right*) heatmaps illustrate the within-species (and between-species) comparison results by using TROM (with z-score threshold 1.5 for both fly and worm), Pearson correlation and Spearman correlation. Greater similarities are shown in *darker colors*. The results show that compared to the popular Pearson and Spearman correlations, TROM can find clearer correspondence patterns. TROM takes gene expression matrices and orthologous genes of the species of interest as input. The functions select.associated.genes and select.associated.orthologs select the associated genes of different biological samples among all the genes or only among the genes with orthologs in the other species to be compared with. They also provide graphical summaries of the numbers of selected associated genes and orthologs. The functions ws.trom and ws.trom.orthologs perform the within-species transcriptome comparison, find the overlapping associated genes between every two samples and calculate within-species TROM scores. The function bs.trom performs the between-species transcriptome comparison, finds the overlapping associated orthologs between every two samples from different species and calculates the between-species TROM scores. The function heatmap.3 visualizes the TROM scores in a heatmap, with various add-on options for customization. The functions find.top.GO.terms and find.top.GO.slim.terms perform gene set enrichment analysis and find top enriched Gene Ontology (GO) terms and GO slim terms in the associated genes. Instead of using the selected associated genes, users may input customized gene lists representing characteristics of different biological samples into the above functions. Please see the package manual and vignette of TROM for details

Fig. 9 — Correlation measures calculated based on the union of associated genes. Pearson correlation (a) and Spearman correlation (b) for every pair of *D. melanogaster* stages calculated based on the union of associated genes of all stages. Pearson correlation (c) and Spearman correlation (d) for every pair of *D. melanogaster* and *C. elegans* stages calculated based on the union of associated ortholog pairs of all stages. These heatmaps show that correlation measures calculated based on associated genes only still cannot lead to clear correspondence patterns

To demonstrate the power of TROM in detecting the correspondence of biological samples that share transcriptomic characteristics embedded in highly expressed genes, we conduct a simulation study to compare TROM with Pearson and Spearman correlation measures. Specifically, we consider their values as classification scores to differentiate the sample pairs with strong dependence in highly expressed genes from the remaining sample pairs. We evaluate their performance in terms of classification accuracy.

Suppose a species of interest has a total number of N genes and m samples. For the observed data, let X_j = (X_{1j}, …, X_{Nj})^T denote the expression vector of the N genes in sample j. For the underlying (hidden) sample similarity, we use a state matrix E_{m×m} to denote the pairwise relationships between the m samples. That is, if samples i and j have high dependence in their associated genes, E_{ij} = 1; otherwise E_{ij} = 0. We treat the prediction of E_{ij} for every pair 1 ≤ i ≠ j ≤ m from the gene expression matrix as a classification problem. We would like to compare the three measures in this setting and evaluate their performance as classification scores using precision-recall curves, receiver operating characteristic (ROC) curves, and Neyman–Pearson ROC curves [21].

In this simulation, we define the state matrix E_{m×m} based on a correlation matrix of associated genes. Specifically, in the example of comparing developmental stages, we assume a Toeplitz-type correlation matrix Σ where Σ_{ij} = ρ^{|i−j|} (i, j = 1, 2, …, m; ρ ∈ [0, 1]), which is reasonable as it assigns a higher correlation to more adjacent stage pairs. To reduce arbitrariness in defining E based on Σ, we vary a threshold c ∈ (0, 1) and define E as

E_{ij} = \begin{cases} 1 & \text{if } \Sigma_{ij} > c \\ 0 & \text{if } \Sigma_{ij} \leq c \end{cases},

(6)

and we would like to track how the classification accuracy of the three measures changes as the parameter c changes.
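The construction of Σ and E in Eq. (6) can be sketched as follows. The function names and the small example with m = 5 are hypothetical; ρ = 0.5 and c = 0.2 are among the values used in the simulation.

```python
def toeplitz_correlation(m, rho):
    """Sigma_ij = rho ** |i - j| for i, j = 0, ..., m - 1."""
    return [[rho ** abs(i - j) for j in range(m)] for i in range(m)]

def state_matrix(sigma, c):
    """Eq. (6): E_ij = 1 if Sigma_ij > c, else 0."""
    return [[1 if value > c else 0 for value in row] for row in sigma]

sigma = toeplitz_correlation(5, 0.5)
E = state_matrix(sigma, 0.2)
# Stages up to two steps apart (correlation 0.25 or more) are labeled 1.
```

Lowering c widens the band of stage pairs labeled as truly corresponding, which is why the classification accuracy is tracked across several c values.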

We use the following generative model to simulate gene expression matrices. We let I_{N×m} be an indicator matrix, with I_{ij} = 1 if gene i is an associated gene of sample j and I_{ij} = 0 otherwise. Given the correlation matrix Σ, we assume that the ith row I_i ∈ {0, 1}^m is a binary vector randomly sampled from a multivariate Bernoulli distribution with expectation q × 1_{m×1} (q ∈ (0, 1) inferred from real data) and correlation matrix Σ. Given the associated-gene indicator matrix I_{N×m}, we generate a gene expression matrix in a data-driven approach, because gene expression values in real data contain noise and cannot be easily described by any common probability distribution. We first scale a real gene expression matrix Y_{N×m} by dividing each of its rows by the row maximum, and denote the scaled matrix by Y^{scale}. Then for each gene i = 1, 2, …, N, we locate its closest counterpart in real data by searching for gene i′ in Y^{scale} such that the i′th row Y^{scale}_{i′} and I_i have the minimal Euclidean distance. Given Y_{i′}, the i′th row of Y, we define sets A_{i′} = {Y_{i′j} : gene i′ is an associated gene in sample j, j = 1, …, m} and A^c_{i′} = {Y_{i′j} : gene i′ is not an associated gene in sample j, j = 1, …, m} to collect the expression values of gene i′ when it is identified as associated or not associated with real-data samples, based on a pre-determined z-score threshold z^*. Finally, we create a gene expression matrix X_{N×m} as follows: for gene i in sample j, if I_{ij} = 1, we randomly sample the value of X_{ij} from A_{i′}; if I_{ij} = 0, we randomly sample the value of X_{ij} from A^c_{i′}.
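The data-driven step of this generative model (nearest real gene, then resampling from its associated or non-associated values) can be sketched as below. The sketch takes the indicator matrix as given and does not cover sampling it from the multivariate Bernoulli distribution. It assumes every real row has a positive maximum, and it falls back to the other pool when one pool is empty; that fallback, the function name, and the toy matrices are assumptions not taken from the text.

```python
import random

def simulate_expression(indicator, real_expr, real_assoc, rng):
    """indicator: simulated 0/1 rows I_i; real_expr: real expression rows Y;
    real_assoc: 0/1 rows marking where each real gene is associated.
    Returns the simulated expression matrix X, row by row."""
    # Scale each real row by its maximum (assumes the maximum is positive).
    scaled = [[v / max(row) for v in row] for row in real_expr]
    simulated = []
    for sim_row in indicator:
        # Closest real gene i' by squared Euclidean distance to I_i.
        best = min(range(len(scaled)),
                   key=lambda r: sum((s - t) ** 2
                                     for s, t in zip(scaled[r], sim_row)))
        pairs = list(zip(real_expr[best], real_assoc[best]))
        high = [v for v, a in pairs if a == 1]  # A_{i'}
        low = [v for v, a in pairs if a == 0]   # A^c_{i'}
        # Assumption: if one pool is empty, draw from the other pool.
        simulated.append([rng.choice(high or low) if flag == 1
                          else rng.choice(low or high)
                          for flag in sim_row])
    return simulated

# Hypothetical toy inputs: two real genes, three samples.
real_expr = [[10.0, 1.0, 1.0], [1.0, 1.0, 8.0]]
real_assoc = [[1, 0, 0], [0, 0, 1]]
indicator = [[0, 0, 1], [1, 0, 0]]
X = simulate_expression(indicator, real_expr, real_assoc, random.Random(0))
```

In this toy case each pool holds a single distinct value, so the output does not depend on the random seed; with real data the draws vary from run to run.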

Using this generative model, we simulate K = 200 gene expression matrices of the same species. We denote the matrices as $X^{(k)}$, k = 1, …, K. Then we calculate the similarity score matrices based on the three similarity measures. For TROM, to determine the associated genes and non-associated genes of each sample, we calculate the z-score threshold based on $X^{(k)}$ using the method introduced in Sect. 2.2. The resulting TROM score matrix is denoted as $T^{(k)}$. The Pearson and Spearman correlation matrices are denoted as $P^{(k)}$ and $S^{(k)}$, respectively. Please note that $T^{(k)}$, $P^{(k)}$ and $S^{(k)}$ are all m × m matrices, with the same dimensions as E.

To perform classification based on the score matrices of the three measures, we apply multiple cutoffs to the matrices and calculate the resulting precision and recall rates. For example, if we use c_T as the cutoff for TROM scores, for k = 1, 2, …, K we have predicted class labels

\hat{E}_{ij}^{(k)} = \begin{cases} 1 & \text{if } T_{ij}^{(k)} > c_{T} \\ 0 & \text{if } T_{ij}^{(k)} \leq c_{T} \end{cases}.

The precision and recall rates of TROM in the kth run are then calculated as

\begin{aligned} \text{precision}^{(k)} &= \frac{\underset{i \neq j}{\sum \sum} \hat{E}_{ij}^{(k)} E_{ij}}{\underset{i \neq j}{\sum \sum} \hat{E}_{ij}^{(k)}}, \\ \text{recall}^{(k)} &= \frac{\underset{i \neq j}{\sum \sum} \hat{E}_{ij}^{(k)} E_{ij}}{\underset{i \neq j}{\sum \sum} E_{ij}}. \end{aligned}

Similarly, we can calculate the precision and recall rates of Pearson/Spearman correlation by applying varying cutoffs on $P^{(k)}$ and $S^{(k)}$, respectively.
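For a single score matrix and a single cutoff, the precision and recall defined above can be computed as in the following Python sketch. The function name is hypothetical, and returning None when a denominator is zero is an assumption of this sketch, since the text does not discuss that case.

```python
def precision_recall(score, E, cutoff):
    """Precision and recall over off-diagonal pairs (i != j) for one score matrix."""
    m = len(E)
    true_pos = predicted = actual = 0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue  # diagonal pairs are excluded from both sums
            hat = 1 if score[i][j] > cutoff else 0  # predicted label
            true_pos += hat * E[i][j]
            predicted += hat
            actual += E[i][j]
    # Assumption of this sketch: report None when a denominator is zero.
    precision = true_pos / predicted if predicted else None
    recall = true_pos / actual if actual else None
    return precision, recall


# Toy 3 x 3 example: the same function applies to TROM, Pearson, or Spearman scores.
score = [[0, 5, 1], [5, 0, 3], [1, 3, 0]]
E = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
curve = [precision_recall(score, E, cutoff) for cutoff in (0.5, 2, 4)]
```

Sweeping the cutoff over a grid, as in the last line, yields the points of one precision-recall curve; averaging over the K simulated matrices then gives the mean curves reported in the figures.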

We carry out this simulation study in the context of D. melanogaster (fly) and C. elegans (worm). For fly, we have N = 10,000, m = 30, q = 0.15, z^* = 0.5; for worm, we have N = 10,000, m = 35, q = 0.2, z^* = 0.6. In both cases, we set ρ = 0.5 and let c take four different values: 0.3, 0.2, 0.1, 0.05. The real data used to generate the simulated gene expression matrices are processed from modENCODE RNA-seq data of 30 fly developmental stages and 35 worm developmental stages [9,14]. The precision-recall curves of the three measures are illustrated in Figs. 3 and 4. In both cases, we see that TROM produces clearer sparse patterns of sample similarity (Figs. 3a vs. b, c and 4a vs. b, c), and for predicting stage-pair labels defined by different threshold c values, TROM always has the largest area under the precision-recall curves (Figs. 3e, f, 4e, f, in terms of both the mean area and the 95% confidence intervals from the K = 200 simulation runs). We also calculate Receiver Operating Characteristic (ROC) and Neyman–Pearson Receiver Operating Characteristic (NP-ROC [21]) curves of the three measures in each case (see Appendix Fig. 10), and TROM still has the best classification accuracy.

Fig. 3 — Comparison of TROM and Pearson/Spearman correlation on simulated *D. melanogaster* (fly) data. **a–c** The correspondence maps produced by TROM (a), Pearson correlation (b) and Spearman correlation (c) on a randomly selected gene expression matrix (among the K = 200 matrices). d The correlation matrix Σ that defines the dependence of associated genes between samples. e The true sample relationships (1: high dependence in associated genes; 0: otherwise) defined as in Eq. (6) for varying c. In **a–e**, the columns and rows correspond to the 30 developmental stages of fly. f The mean precision-recall curves on the 200 gene expression matrices, given the true labels in e. The 95 % confidence intervals of each measure’s area under the curve (AUC) are marked next to the *curves*

Fig. 4 — Comparison of TROM and Pearson/Spearman correlation on simulated *C. elegans* (worm) data. **a–c** The correspondence maps produced by TROM (a), Pearson correlation (b) and Spearman correlation (c) on a randomly selected gene expression matrix (among the 200 matrices). d The correlation matrix Σ that defines the dependence of associated genes between samples. e The true sample relationships (1: high dependence in associated genes; 0: otherwise) defined as in Eq. (6) for varying c. In **a–e**, the *columns* and *rows* correspond to the 35 developmental stages of worm. f The mean precision-recall curves on the 200 gene expression matrices, given the true labels in e. The 95 % confidence intervals of each measure’s area under the curve (AUC) are marked next to the *curves*

Fig. 10 — Comparison of TROM and Pearson/Spearman correlation on simulated data, with a for fly and b for worm. In both *panels*, the *first row* gives the true sample relationships (1: high dependence in associated genes; 0: otherwise) defined as in Eq. 6 for varying c. The *second* row gives the mean receiver operating characteristic (ROC) curves on the 200 simulated gene expression matrices, given the true labels in the *first row*. The *third row* gives the mean Neyman–Pearson receiver operating characteristic (NP-ROC) curves, accordingly. The 95 % confidence intervals of the area under the curve (AUC) are marked next to the *curves*

In this classification setting, TROM scores, Pearson correlations, and Spearman correlations are essentially three ways of transforming a gene expression matrix into features of sample pairs. The above simulation results suggest that TROM scores serve as better features for this task, that is, to capture the sparse similarity relationships of samples. The main reason is that TROM scores are based on gene expression levels of all samples, while Pearson and Spearman correlations only capture the similarity of gene expression profiles for every pair of samples.

In addition, we directly compare TROM with Pearson and Spearman correlation coefficients on the two real datasets of fly and worm used in the simulation. In our previous work [14], we applied the preliminary idea of TROM to compare the developmental stages within each species and between the two species, and found interesting correspondence patterns: a block diagonal pattern for within-species comparison and two parallel patterns between fly and worm developmental stages. When using Pearson and Spearman correlations on the same data to compare these stages, however, we find that neither correlation measure leads to clear correspondence patterns in the between-species comparison (Appendix Fig. 8). In the within-species comparison, Spearman correlation finds a vague diagonal pattern, while Pearson correlation leads to an unreasonable checkerboard pattern. We also calculate Pearson and Spearman correlation matrices based on the union of all the stage-associated genes found by TROM. However, the correlation methods still cannot provide correspondence maps as clear as those of TROM (Appendix Fig. 9).

3.3 Robustness of TROM to Data Normalization

Since quantile normalization has been suggested as an essential step in many analysis pipelines for high-throughput data such as microarray and RNA-seq data [3,11], we conduct a simulation study to demonstrate the influence of quantile normalization on TROM scores. We simulate 200 gene expression matrices and compute their TROM scores with or without quantile normalization as a preceding step. Then we test if the distribution of TROM scores changes with the use of quantile normalization.
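For readers unfamiliar with the procedure, a minimal Python sketch of quantile normalization on a genes × samples matrix is given below. It is a generic textbook version with a hypothetical function name, and its tie handling (ties broken by gene index) is an assumption of this sketch; the implementation used in the study may differ.

```python
def quantile_normalize(X):
    """Quantile-normalize the columns (samples) of a genes x samples matrix."""
    n_genes, n_samples = len(X), len(X[0])
    # Sort each sample's values, then average across samples at each rank.
    sorted_cols = [sorted(X[i][j] for i in range(n_genes)) for j in range(n_samples)]
    rank_means = [sum(col[r] for col in sorted_cols) / n_samples for r in range(n_genes)]
    out = [[0.0] * n_samples for _ in range(n_genes)]
    for j in range(n_samples):
        # Genes ordered by expression in sample j; ties are broken by gene index.
        order = sorted(range(n_genes), key=lambda i: X[i][j])
        for rank, i in enumerate(order):
            out[i][j] = rank_means[rank]
    return out


X = [[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]]  # toy matrix: 4 genes, 3 samples
Q = quantile_normalize(X)
```

After normalization every sample has the same set of values, so only the within-sample ranking of genes is retained.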

We use the same procedure as described in Sect. 3.2 to generate 200 gene expression matrices based on the modENCODE RNA-seq data of 35 worm developmental stages. By applying the TROM method to these gene expression matrices before or after quantile normalization, we obtain two sets of TROM matrices $T^{(0k)}$ and $T^{(1k)}$, k = 1, 2, …, 200. For each pair of samples, say samples i and j, we have two sets of TROM scores $T_{ij}^{(0k)}$ and $T_{ij}^{(1k)}$. We then use the Wilcoxon signed-rank test and, separately, the paired Student’s t test to check whether the TROM scores change significantly before and after quantile normalization. We consider the change significant if the Bonferroni-corrected p-value is smaller than 0.05. The results are shown in Fig. 5.
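The per-pair testing step can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the Wilcoxon signed-rank p-value uses the large-sample normal approximation without tie or continuity corrections (a simplification chosen for this sketch, not necessarily what the study used), and the number of tests entering the Bonferroni correction is left as a parameter because the text does not state it.

```python
import math


def wilcoxon_signed_rank_p(before, after):
    """Two-sided p-value by normal approximation (no tie or continuity correction)."""
    diffs = [a - b for a, b in zip(after, before) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        # Give tied absolute differences their average rank.
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)  # sum of positive ranks
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return math.erfc(abs(w_plus - mean) / sd / math.sqrt(2))


def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """True when the Bonferroni-corrected p-value falls below alpha."""
    return min(1.0, p_value * n_tests) < alpha
```

In the setting above, `before` and `after` would hold the 200 TROM scores of one stage pair without and with quantile normalization, and `n_tests` would be the number of stage pairs tested.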

Fig. 5 — Robustness of TROM to quantile normalization on simulated *C. elegans* (worm) data. a The correspondence maps based on TROM scores of a randomly selected gene expression matrix (among the 200 simulated matrices), before (*left*) and after (*right*) quantile normalization. b The results of the Wilcoxon signed-rank test (*left*) and the paired Student’s t test (*right*). Every blank cell means that the Bonferroni-corrected p-value is insignificant for the corresponding pair of stages, i.e., the TROM scores do not change significantly after quantile normalization

The results of both tests suggest that TROM is robust to the lack of normalization: the correspondence patterns resulting from TROM scores do not change significantly after quantile normalization. Even in the two rare cases where the p-values are significant (Fig. 5b), the corresponding samples are consistently mapped before and after normalization. We also replace the gene expression data in Sect. 3.2 with their normalized version, and the confidence intervals of TROM’s area under the curve (AUC) remain the same. This result implies that the classification power of TROM is also robust to data normalization.

3.4 Robustness of TROM to Different Platforms: Comparison of D. melanogaster Developmental Stages Based on Microarray and RNA-seq Data

Although many studies have claimed that RNA-seq is the technique of choice that provides more accurate estimation of absolute gene expression levels compared with microarray [8,26], several genome-wide analyses have also suggested that microarray can measure the expression of above-median expressed genes reasonably well, and on those genes the two platforms have good concordance [24]. Since microarray has been widely used to study transcriptomes of multiple species under various conditions in the past decade, it is desirable to have a good comparative transcriptomic method that is robust to the platform difference of microarray and RNA-seq data.

Here we demonstrate the robustness of TROM by applying it to compare the microarray and RNA-seq data of the developmental stages of D. melanogaster. If TROM is robust, it should identify strong correspondence between similar developmental stages in the microarray and RNA-seq data. For a pair of developmental stages, one with microarray data and the other with RNA-seq data, TROM identifies a set of associated genes for each of them based on all the stages with microarray and RNA-seq data, respectively. Then TROM performs the overlap test and produces a correspondence map. The results show that TROM can find almost perfect correspondence of the same D. melanogaster embryonic stages between microarray and RNA-seq (Fig. 2c). There are five other Drosophila species that have developmental patterns similar to those of D. melanogaster, as we have already shown in the within-species and between-species comparison in Sect. 3.1. We also compare their microarray data of embryonic stages with the RNA-seq data of D. melanogaster as a further check. In the result (Appendix Fig. 11), we observe strong block diagonal patterns. Although RNA-seq data contain larvae, prepupae, and adult stages that do not have corresponding microarray data, the off-diagonal patterns, which we observe (1) between late embryos in microarray and prepupae in RNA-seq and (2) between early embryos in microarray and female adults in RNA-seq, are consistent with our previous within-species correspondence map based on RNA-seq data only [14] and previous studies [1]. These results show that TROM can find almost the same correspondence of Drosophila developmental stages regardless of whether the platform is microarray or RNA-seq.

Fig. 11 — Correspondence maps of developmental stages. TROM scores are calculated using the RNA-seq data of *D. melanogaster* and the microarray data of the other five *Drosophila* species

3.5 Gene Ontology (GO) Enrichment Analysis

To understand the biological functions behind the correspondence we have observed between developmental stages, we perform enrichment analysis [2] of biological process (BP) gene ontology (GO) terms in stage-associated genes, as a way to determine common biological functions and processes in corresponding stages. First, we examine the GO term enrichment in the associated genes of every D. melanogaster embryonic stage, using RNA-seq data (with z-score threshold 1.5) and microarray data (with z-score threshold 0.5), respectively. The enrichment scores are defined as − log₁₀(Bonferroni-corrected p-value), where p-values are calculated based on the hypergeometric test, and the results are illustrated in Appendix Figs. 12 and 13. For every fly embryonic stage, the top 20 enriched GO terms in the associated genes identified from RNA-seq data contain biological functions highly relevant to these stages, and many of these terms have been discovered as enriched in relevant embryonic samples by previous studies [14,18]. A selection of these top enriched GO terms with support in the literature is listed in Table 1. The enriched GO terms identified from both RNA-seq and microarray data support the correspondence patterns observed in TROM correspondence maps: common enriched GO terms are often shared by adjacent stages whose pairwise TROM scores are high. The top enriched GO terms found by both microarray and RNA-seq are informative for further functional studies on the associated genes of every stage, so as to better understand embryonic development of D. melanogaster.
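The enrichment score defined above can be sketched in Python as follows. The function and parameter names are hypothetical, the test is written as the usual one-sided (upper-tail) hypergeometric test, the number of GO terms entering the Bonferroni correction is passed in as an assumed parameter, and the cap of 6 mirrors the saturation stated in the captions of Figs. 12 and 13.

```python
import math


def hypergeom_upper_tail(k, n_assoc, n_term, n_total):
    """P(X >= k) when n_assoc genes are drawn from n_total, n_term of which carry the term."""
    upper = min(n_assoc, n_term)
    numer = sum(math.comb(n_term, x) * math.comb(n_total - n_term, n_assoc - x)
                for x in range(k, upper + 1))
    return numer / math.comb(n_total, n_assoc)


def enrichment_score(k, n_assoc, n_term, n_total, n_tests, cap=6.0):
    """-log10 of the Bonferroni-corrected p-value, saturated at `cap`."""
    p = min(1.0, hypergeom_upper_tail(k, n_assoc, n_term, n_total) * n_tests)
    if p == 0.0:
        return cap  # p-value underflowed to zero: report the saturated score
    return min(cap, max(0.0, -math.log10(p)))


# Toy numbers (not from the paper): 40 of 300 associated genes carry a term
# that annotates 200 of 10,000 genes, with 1,000 GO terms tested.
score = enrichment_score(k=40, n_assoc=300, n_term=200, n_total=10000, n_tests=1000)
```

Here k is the number of stage-associated genes annotated with the GO term, n_assoc the number of associated genes of the stage, n_term the number of genes annotated with the term, and n_total the size of the gene universe.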

Fig. 12 — Top 20 enriched biological process GO terms of *D. melanogaster*. The enrichment scores in the heatmap are calculated based on stage-associated genes identified from the RNA-seq data (with z-score threshold 1.5) and saturated at 6. For each stage, the common enriched GO terms identified from both microarray (Fig. 13) and RNA-seq datasets are marked in *red color* (Color figure online)

Fig. 13 — Top 20 enriched biological process GO terms of *D. melanogaster*. The enrichment scores in the heatmap are calculated based on the stage-associated genes identified from the microarray data (with z-score threshold 0.5) and saturated at 6. For each stage, the common enriched GO terms identified from both microarray and RNA-seq (Fig. 12) datasets are marked in *red color* (Color figure online)

Table 1.

Selected enriched GO terms in each stage of D. melanogaster

Stage name	Top enriched GO terms
Embryo 0–2 h	Oogenesis, DNA replication, germ cell development, neurogenesis
Embryo 2–4 h	Neurogenesis, mRNA splicing via spliceosome, zygotic determination of anterior/posterior axis
Embryo 4–6 h	mRNA splicing via spliceosome, specification of segmental identity, cell fate specification
Embryo 6–8 h	Cell fate specification, sensory organ development, open tracheal system development
Embryo 8–10 h	Myoblast fusion, multicellular organism reproduction, puparial adhesion
Embryo 10–12 h	Myoblast fusion, translation, mitotic spindle elongation, septate junction assembly
Embryo 12–14 h	Axon guidance, septate junction assembly, branch fusion open tracheal system
Embryo 14–16 h	Circadian rhythm, response to light stimulus, crystal cell differentiation
Embryo 16–18 h	Chitin-based cuticle development, body morphogenesis, chitin metabolic process
Embryo 18–20 h	Body morphogenesis, chitin metabolic process, proteolysis


We also examine the GO term enrichment in the associated genes (identified with z-score threshold 1.5) of every developmental stage of mouse liver. The resulting enrichment scores are illustrated in Appendix Fig. 14. The top 10 enriched GO terms in our selected stage-associated genes of every stage confirm previous findings on liver development and regeneration. In E11.5–12.5, two of the early stages, top enriched GO terms are mostly cell cycle-related terms like “translation,” “mRNA processing,” “cell cycle,” and “cell division” [15]. Previous research has shown that mouse liver takes over the function of hematopoiesis at E10.5–12.5 [10,15], and we found that the GO terms including “heme biosynthetic process” and “porphyrin-containing compound biosynthetic process” are top enriched in subsequent stages. For stages E17.5–Day7, the GO terms “innate immune response” and “immune system process” are top enriched, in accordance with the theory that liver is an organ with innate immune features [7]. Finally, as the function of mouse liver switches from hematopoiesis to metabolism and this capacity dominates in the adult liver [10,15], we observe that GO terms related to various metabolic processes become enriched in stages E17.5–NL (normal adult liver tissue). These findings again illustrate the capacity of the associated genes in capturing transcriptomic characteristics of biological samples.

Fig. 14 — Top 10 enriched biological process GO terms of mouse liver. The enrichment scores in the heatmap are calculated based on the stage-associated genes identified from the microarray data (with z-score threshold 1.5). For each stage, the highly relevant GO terms that have been confirmed in previous studies are marked in *red color* (Color figure online)

3.6 Construction of Developmental Trees Using Stage-Associated Genes

We further demonstrate that the selected stage-associated genes contain abundant information to group and distinguish developmental stages. Tree construction has been a popular approach for studying the relationships of different developmental stages in organism development [1] as well as cell lineages in cell differentiation [23]. Here we attempt to construct developmental trees of diverse species (see Fig. 6 and Appendix Fig. 15) based on the identified associated genes of each developmental stage, reasoning that the associated genes capture stage characteristics and thus can lead to reasonable developmental trees. In tree construction, both Simpson and Jaccard similarity coefficients can be used to measure the distance between the associated genes of different samples. However, the Simpson coefficient equals 1 whenever the associated genes of one sample are a subset of the associated genes of the other sample, and it thus fails to distinguish two samples in this case. In contrast, the Jaccard coefficient is able to separate two biological samples in this case, because it considers two samples as identical if and only if they have exactly the same associated genes. As a consequence, we carry out the tree construction by hierarchical clustering, using average linkage and the Jaccard coefficient, where the Jaccard coefficient between two stages i and j is calculated as

Fig. 6 — Developmental trees constructed based on stage-associated genes (identified with z-score thresholds 1.4 and 1.1 for mouse liver and sea urchin respectively). a Developmental tree of mouse liver. b Developmental tree of sea urchin

Fig. 15 — Developmental trees constructed using stage-associated genes (identified with the z-score thresholds in the table). a–f are for *Drosophila* species and G is for *C. elegans*

J_{ij} = \frac{|X_{i} \cap X_{j}|}{|X_{i}| + |X_{j}| - |X_{i} \cap X_{j}|},

(7)

where |X_i| and |X_j| are the sizes of two sets of stage-associated genes and |X_i ∩ X_j| is the number of genes in their intersection.
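The contrast between the two coefficients, and the clustering built on Eq. (7), can be illustrated with a short Python sketch. The function names and toy gene sets are hypothetical, and using 1 − J as the clustering distance is an assumption of this sketch, since the text specifies only the coefficient J itself.

```python
def jaccard(a, b):
    """Jaccard coefficient of Eq. (7) for two non-empty sets of associated genes."""
    return len(a & b) / (len(a) + len(b) - len(a & b))


def simpson(a, b):
    """Simpson coefficient: shared genes relative to the smaller set."""
    return len(a & b) / min(len(a), len(b))


def average_linkage_merges(gene_sets):
    """Agglomerative clustering with average linkage on 1 - J; returns the merge order."""
    names = list(gene_sets)
    dist = {(p, q): 1 - jaccard(gene_sets[p], gene_sets[q]) for p in names for q in names}

    def linkage(pair):
        # Average pairwise distance between the members of two clusters.
        left, right = pair
        return sum(dist[p, q] for p in left for q in right) / (len(left) * len(right))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        pairs = [(clusters[a], clusters[b])
                 for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        left, right = min(pairs, key=linkage)
        clusters = [c for c in clusters if c not in (left, right)] + [left + right]
        merges.append((left, right))
    return merges


stages = {"s1": {1, 2, 3, 4}, "s2": {1, 2, 3}, "s3": {7, 8, 9}}  # toy associated genes
# Simpson gives 1.0 for the nested pair s1/s2, whereas Jaccard gives 0.75.
nested = (simpson(stages["s1"], stages["s2"]), jaccard(stages["s1"], stages["s2"]))
tree = average_linkage_merges(stages)
```

In the toy example, the nested stages s1 and s2 are merged first and s3 joins last, while the Jaccard coefficient still registers that s1 and s2 are not identical.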

The developmental tree (see Fig. 6a) constructed for mouse liver development shows an interesting pattern: the first major branch of the tree successfully divides the 14 stages into embryonic stages and postnatal stages, with the one exception that the last embryonic stage E18.5 is clustered with the postnatal stages. Moreover, neighboring stages are clustered with each other in small branches. These observations are in accordance with the correspondence pattern illustrated by TROM scores (see Fig. 2a): mappings exist between neighboring stages but not between E11.5–E17.5 and E18.5–NL. Previous hierarchical clustering results on genes whose expression levels change by more than 1.5-fold relative to the average [15] support our constructed tree and the similarity between E18.5 and postnatal stages. The GO enrichment analysis provides a functional explanation for the observed clustering of E18.5 and Day 7, both of which have enriched GO terms including “innate immune response,” “immune system process,” and “multicellular organismal development.”

The developmental tree (see Fig. 6b) constructed for sea urchin embryonic development also matches the existing understanding of the temporal interrelations of developmental stages. First, the major branch of the differentiation tree divides the stages into two sub-groups: one is 00, 10, 18, 24 and 30 hpf and the other is 40, 48, 56, 64 and 72 hpf. Previous studies show that oral/aboral (O/A) axis specification, endomesoderm development, and autonomous specification are the major developmental processes before 40 hpf, while set-aside cells and rudiment formation and embryonic morphogenesis become the major processes after 40 hpf [5]. This functional explanation supports our constructed tree. Second, neighboring stages are grouped into small branches, and the overall tree is in accordance with sea urchin’s embryonic development periods of cleavage, blastula, gastrula, and prism-pluteus [5].

We also observe reasonable and meaningful developmental trees constructed for the six Drosophila species and C. elegans (Appendix Fig. 15). We note that the tree construction is robust to the z-score threshold choices.

4 Discussion

In this work, we demonstrate that our proposed measure TROM is more efficient in finding transcriptomic similarity and correspondence patterns of biological samples within and between species compared with Pearson and Spearman correlations. Both simulation and real data analysis verify the superior power of TROM in detecting biologically meaningful relationships between different samples. The comparison results suggest that in the TROM method the selection of associated genes is a critical step before the overlap test. The selection step ensures that the transcriptomic characteristics of each sample are well captured and represented. Moreover, the strength of TROM also lies in the overlap test that does not directly rely on absolute gene expression values and is thus relatively robust to noisy data. On the other hand, Pearson and Spearman correlations fail to detect clear correspondence patterns even based on the associated genes.

We observe that it is possible to improve the correspondence map found by Spearman correlation by thresholding its correlation values, i.e., setting all the values below the threshold to the minimum value of all pairwise comparisons. We test this procedure on the RNA-seq datasets of D. melanogaster and C. elegans, and the results are summarized in Appendix Fig. 16. As expected, thresholding the Spearman correlation can give rise to relatively clearer correspondence patterns. However, this procedure is very sensitive to the threshold and often misses biologically meaningful mappings: the similarity of early embryos and female adults in fly is captured at only one threshold, and the similarity of embryos and adults in worm is missed at all thresholds [14].
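The thresholding procedure described above amounts to a one-line matrix operation, sketched here in Python with a hypothetical function name and a toy correlation matrix.

```python
def threshold_to_minimum(corr, threshold):
    """Set every entry below `threshold` to the minimum entry of the matrix."""
    floor = min(min(row) for row in corr)
    return [[value if value >= threshold else floor for value in row] for row in corr]


corr = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.5], [0.2, 0.5, 1.0]]  # toy Spearman correlations
sparse = threshold_to_minimum(corr, 0.6)
# sparse == [[1.0, 0.8, 0.2], [0.8, 1.0, 0.2], [0.2, 0.2, 1.0]]
```

Because every entry below the threshold collapses to the same value, a mapping with a moderate correlation (0.5 in the toy matrix) disappears entirely once the threshold exceeds it, which is the sensitivity noted above.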

Fig. 16 — Spearman correlations of the developmental stages of *D. melanogaster* (fly) and *C. elegans* (worm). a The first *panel* shows the original Spearman correlations of fly stages, while the remaining *panels* show the Spearman correlations of fly stages under different thresholds. b The first *panel* shows the original Spearman correlations of worm stages, while the remaining *panels* show the Spearman correlations of worm stages under different thresholds. c TROM scores of fly. d TROM scores of worm. All the values under the selected threshold are set to the minimum value of each correlation matrix

We would also like to point out that although TROM is not a parameter-free method, the resulting similarity patterns are largely robust to the selection of the z-score threshold. In addition, the TROM method provides users with the flexibility to tune the threshold according to the level of relationships they look for between biological samples.

The sample-associated genes identified based on the threshold carry important transcriptomic characteristics of the corresponding samples and are not simply the complement of housekeeping genes. The identification of sample-associated genes filters out not only housekeeping genes, but also those genes that exhibit little variation across samples. In addition, it is worth noting that the concept of associated genes is not equivalent to specific genes, since associated genes also contain genes that capture transcriptomic similarity among closely related samples, and these genes can be shared by several but not all samples.

To the best of our knowledge, Le et al. [13] is the only previous attempt other than correlation-based methods to compare biological samples across species. This method compares expression experiments from different species through a newly defined distance metric between the ranking of orthologous genes in the two species. However, their method relies on a large training dataset of known similar samples to learn the parameters for distance functions, and is thus not practical for finding novel patterns of biological samples from rarely studied species such as D. rerio. Another advantage of TROM compared with this method is that TROM can identify informative associated genes that enable various downstream analyses.

5 Conclusion

TROM, a testing-based method, is introduced for finding correspondence patterns among transcriptomes of the same or different species. We demonstrate the greater power of TROM compared to correlation measures in finding transcriptomic similarity in terms of highly expressed genes. We apply TROM to find correspondence maps of developmental stages within and between multiple species, and we show that the associated genes TROM identifies for developmental stages can be used to construct developmental trees in these species. We also show that TROM is robust to data normalization and platform difference of microarray and RNA-seq. In addition, we design a systematic approach for selecting a key threshold parameter in TROM. We implement the TROM method in an R package, which provides functions with flexibility for illustration and customization and can be easily integrated into existing comparative genomic pipelines.

Appendix

See Tables 2, 3 and Figs. 7, 8, 9, 10, 11, 12, 13, 14, 15, and 16.

References

1. Arbeitman MN, Furlong EE, Imam F, Johnson E, Null BH, Baker BS, Krasnow MA, Scott MP, Davis RW, White KP. Gene expression during the life cycle of Drosophila melanogaster. Science. 2002;297(5590):2270–2275. doi: 10.1126/science.1072152.
2. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25–29. doi: 10.1038/75556.
3. Bolstad BM, Irizarry RA, Åstrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19(2):185–193. doi: 10.1093/bioinformatics/19.2.185.
4. Cunningham F, Amode MR, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fitzgerald S, et al. Ensembl 2015. Nucl Acids Res. 2015;43(D1):D662–D669. doi: 10.1093/nar/gku1010.
5. Davidson EH, Cameron RA, Ransick A. Specification of cell fate in the sea urchin embryo: summary and some proposed mechanisms. Development. 1998;125(17):3269–3290. doi: 10.1242/dev.125.17.3269.
6. Domazet-Lošo T, Tautz D. A phylogenetically based transcriptome age index mirrors ontogenetic divergence patterns. Nature. 2010;468(7325):815–818. doi: 10.1038/nature09632.
7. Dong Z, Wei H, Sun R, Tian Z. The roles of innate immune cells in liver injury and regeneration. Cell Mol Immunol. 2007;4(4):241–252.
8. Fu X, Fu N, Guo S, Yan Z, Xu Y, Hu H, Menzel C, Chen W, Li Y, Zeng R, et al. Estimating accuracy of RNA-Seq and microarrays with proteomics. BMC Genom. 2009;10(1):161. doi: 10.1186/1471-2164-10-161.
9. Gerstein MB, Rozowsky J, Yan KK, Wang D, Cheng C, Brown JB, Davis CA, Hillier L, Sisu C, Li JJ, et al. Comparative analysis of the transcriptome across distant species. Nature. 2014;512(7515):445–448. doi: 10.1038/nature13424.
10. Hata S, Namae M, Nishina H. Liver development and regeneration: from laboratory study to clinical therapy. Develop Growth Differ. 2007;49(2):163–170. doi: 10.1111/j.1440-169X.2007.00910.x.
11. Hicks SC, Irizarry RA. When to use quantile normalization? 2014 doi: 10.1101/012203. bioRxiv.
12. Labbé RM, Irimia M, Currie KW, Lin A, Zhu SJ, Brown DD, Ross EJ, Voisin V, Bader GD, Blencowe BJ, et al. A comparative transcriptomic analysis reveals conserved features of stem cell pluripotency in planarians and mammals. Stem Cells. 2012;30(8):1734–1745. doi: 10.1002/stem.1144.
13. Le HS, Oltvai ZN, Bar-Joseph Z. Cross-species queries of large gene expression databases. Bioinformatics. 2010;26(19):2416–2423. doi: 10.1093/bioinformatics/btq451.
14. Li JJ, Huang H, Bickel PJ, Brenner SE. Comparison of D. melanogaster and C. elegans developmental stages, tissues, and cells by modencode RNA-Seq data. Genome Res. 2014;24(7):1086–1101. doi: 10.1101/gr.170100.113.
15. Li T, Huang J, Jiang Y, Zeng Y, He F, Zhang MQ, Han Z, Zhang X. Multi-stage analysis of gene expression and transcription regulation in c57/b6 mouse liver development. Genomics. 2009;93(3):235–242. doi: 10.1016/j.ygeno.2008.10.006.
16. Necsulea A, Soumillon M, Warnefors M, Liechti A, Daish T, Zeller U, Baker JC, Grützner F, Kaessmann H. The evolution of lncRNA repertoires and expression patterns in tetrapods. Nature. 2014;505(7485):635–640. doi: 10.1038/nature12943.
17. Pantalacci S, Sémon M. Transcriptomics of developing embryos and organs: a raising tool for evo–devo. J Exp Zool Part B Mol Dev Evol. 2015;324(4):363–371. doi: 10.1002/jez.b.22595.
18. Puniyani K, Faloutsos C, Xing EP. Spex2: automated concise extraction of spatial gene expression patterns from fly embryo ISH images. Bioinformatics. 2010;26(12):i47–i56. doi: 10.1093/bioinformatics/btq172.
19. Shen Y, Yue F, McCleary DF, Ye Z, Edsall L, Kuan S, Wagner U, Dixon J, Lee L, Lobanenkov VV, et al. A map of the cis-regulatory sequences in the mouse genome. Nature. 2012;488(7409):116–120. doi: 10.1038/nature11243.
20. Spencer WC, Zeller G, Watson JD, Henz SR, Watkins KL, McWhirter RD, Petersen S, Sreedharan VT, Widmer C, Jo J, et al. A spatial and temporal map of C. elegans gene expression. Genome Res. 2011;21(2):325–341. doi: 10.1101/gr.114595.110.
21. Tong X, Feng Y, Li JJ. Neyman-Pearson (NP) classification algorithms and NP receiver operating characteristic (NP-ROC) curves. arXiv preprint. 2016 doi: 10.1126/sciadv.aao1659. arXiv:1608.03109.
22. Tu Q, Cameron RA, Davidson EH. Quantitative developmental transcriptomes of the sea urchin Strongylocentrotus purpuratus. Dev Biol. 2014;385(2):160–167. doi: 10.1016/j.ydbio.2013.11.019.
23. Virmani AK, Tsou JA, Siegmund KD, Shen LY, Long TI, Laird PW, Gazdar AF, Laird-Offringa IA. Hierarchical clustering of lung cancer cell lines using DNA methylation markers. Cancer Epidemiol Biomark Prevent. 2002;11(3):291–297.
24. Wang C, Gong B, Bushel PR, Thierry-Mieg J, Thierry-Mieg D, Xu J, Fang H, Hong H, Shen J, Su Z, et al. The concordance between RNA-seq and microarray data depends on chemical treatment and transcript abundance. Nat Biotechnol. 2014;32(9):926–932. doi: 10.1038/nbt.3001.
25. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63. doi: 10.1038/nrg2484.
26. Zhao S, Fung-Leung WP, Bittner A, Ngo K, Liu X. Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PloS One. 2014;9(1) doi: 10.1371/journal.pone.0078644.

