Characterizing Cancer Subtypes Using Dual Analysis in Caleydo StratomeX

Cagatay Turkay; Alexander Lex; Marc Streit; Hanspeter Pfister; Helwig Hauser

doi:10.1109/MCG.2014.1

Author manuscript; available in PMC: 2014 Oct 14.

Published in final edited form as: IEEE Comput Graph Appl. 2014 Mar-Apr;34(2):38–47. doi: 10.1109/MCG.2014.1

PMCID: PMC4196636 EMSID: EMS60636 PMID: 24808198

Abstract

In this approach, dual-analysis views depict distributions of genes or data samples within Caleydo. Significant-difference plots show the elements of a cancer subtype that differ significantly from other subtypes. Analysts can characterize subtypes, investigate how samples relate to their subtype and other groups, and create well-defined subtypes based on statistical properties.

Although cancers are colloquially referred to by the tissue from which they originate (for example, lung cancer), significant differences can exist between cancers from the same tissue. The differences are often characterized by various biomolecular properties. These different forms of cancer are called subtypes. Large-scale research projects such as the Cancer Genome Atlas (TCGA; http://cancergenome.nih.gov) elicit comprehensive genomic and clinical datasets with the goals of characterizing the molecular alterations responsible for cancer and identifying and characterizing cancer subtypes.

Owing to next-generation sequencing and microarray technology, these projects can employ large, heterogeneous datasets. However, deriving insight from these complex datasets remains a challenge. Current analysis relies largely on custom scripts to find interesting genes or clusters of patients (stratifications) in these datasets. To remedy this, we developed Caleydo StratomeX, an interactive visualization method to analyze and discover relationships in these datasets.¹ Researchers can use StratomeX to evaluate overlaps and relationships of stratifications.

However, StratomeX doesn’t inherently enable analysts to identify the characteristic genes of candidate subtypes, nor does it communicate how patients relate to a given subtype. The former capability is important because the characteristic genes could also be causally involved in a subtype and thus might be a target for a therapeutic or diagnostic approach. With the latter capability, researchers can investigate how samples relate to a subtype to estimate the quality of candidate subtypes and build a deeper characterization of a subtype.

To address these limitations, we integrated two techniques into StratomeX:

■ dual analysis,² a general high-dimensional data analysis methodology, and
■ significant-difference plots, a novel visual representation of the differences between data subsets.

With this approach, domain scientists can discover genes that are distinctive for specific subtypes. They can also observe the properties of a cluster’s samples and compare how they behave in different datasets and clusters. These capabilities can provide a deeper understanding of stratifications. Moreover, scientists can employ the dual-analysis methodology to interactively generate stratifications.

Biological Background and Analysis Tasks

Subtype analysis is based on a variety of biomolecular datasets that capture different aspects of the process of life, ranging from the information stored in the genome to the functional products that trigger biochemical reactions in cells. Projects such as TCGA capture information on gene activity, factors influencing gene expression, and the genome’s structure and sequence. An example of gene activity data is messenger RNA (mRNA) data, which measures mRNA’s abundance in the cell. mRNA is translated into proteins, which are the functional products. In addition, microRNA (miRNA) and DNA methylation influence gene expression and thus are important factors in many processes and diseases.

All these factors play a role in the development of certain cancers, so a comprehensive analysis solution must take into account all these datasets, in addition to metadata such as clinical patient data. In this article, we demonstrate our approach by investigating mRNA, mRNA-seq (which relates to the same biological process as mRNA but uses a different acquisition technique), miRNA, and methylation data. However, a comprehensive analysis would also incorporate other datasets—for instance, related to structural variations occurring on various scales in the genome.

In previous research, we elicited subtype analysis tasks that dealt with finding and evaluating stratifications based on multiple datasets.¹ We recently revisited those requirements in collaboration with domain scientists and found the need to supplement them with the following three tasks to further characterize stratifications.

Find Distinctive Elements

Identifying distinctive elements of clusters in a stratification provides a deeper understanding of why a particular cluster exists and how it relates to other clusters in the analysis. Distinctive elements are also good candidates to investigate as diagnostic markers or might even be causally involved in the disease.

Compare Samples

Investigating samples’ characteristics over several datasets and in comparison to other stratifications helps build a more complete picture of the samples’ properties. Analysts can observe how strongly a cluster’s members are related and explore whether they show similar properties in a dataset different from the one used for clustering.

Create Clusters

Analysts should be able to create clusters in an exploratory manner and interactively compare the intermediate results to metadata such as clinical data. Moreover, this manual clustering should enable analysts to merge observations of different datasets. The resulting clusters will be well defined in terms of statistical properties and richer in terms of the information sources included during construction.

Methodological Building Blocks

To enable the aforementioned tasks, our solution employs StratomeX and dual analysis.

StratomeX

Caleydo (www.caleydo.org) is an open-source visualization framework for biomolecular data analysis. It provides rich functionality for loading and handling multiple heterogeneous datasets as well as stratifications defined on the data. A core strength is its ability to slice datasets into meaningful subsets and flexibly combine multiple small visualizations of these subsets, using views such as histograms or heat maps, to create a fully integrated composite visualization.³ Other examples of visual methods that improve analysis of genomics data are the Hierarchical Cluster Explorer⁴ and Mayday.⁵

StratomeX, a Caleydo project, is a comparative-visualization technique that uses slicing. It lets analysts investigate the relationships between multiple stratifications, represented as columns. Each column consists of blocks, each corresponding to a group of patients. Ribbons of varying width visualize the overlap between neighboring stratifications, resulting in an overall appearance similar to Parallel Sets⁶ or Sankey Diagrams.⁷ Wide ribbons indicate a strong overlap between two groups; thin or absent ribbons correspond to only a few or no shared patients. Each block contains a visualization of the data for that group’s patients. Analysts can switch between different types of visualizations. For numerical data, clustered heat maps are the default because they effectively communicate global trends and patterns.
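The overlap that a ribbon encodes can be illustrated with a minimal sketch. It assumes each stratification is given as a mapping from patient ID to cluster label; the function name `ribbon_widths` and the toy patient IDs are hypothetical and are not part of Caleydo.

```python
from collections import Counter

def ribbon_widths(left, right):
    """Count shared patients for every pair of blocks in two neighboring columns.

    `left` and `right` map patient id -> cluster label (assumed input format).
    The count for a (left block, right block) pair is what a ribbon's width
    would be proportional to; a missing pair means no ribbon.
    """
    shared_patients = left.keys() & right.keys()
    return Counter((left[p], right[p]) for p in shared_patients)

# Toy stratifications with made-up patient ids and labels.
left = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
right = {"p1": "X", "p2": "X", "p3": "X", "p4": "Y"}
widths = ribbon_widths(left, right)
```

In this toy case, blocks A and X share two patients, so they would be connected by the widest ribbon, while A and Y share none and would have no ribbon.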

Dual Analysis

In this approach, the visual analysis occurs in parallel on both the data items and the dimensions. We achieve this duality by using statistics computed over both the dataset rows and columns.

For example, consider an mRNA gene expression dataset given as a 2D data table with n rows and p columns, where each row corresponds to a single sample (patient) and each column to a single gene. The matrix cells contain the expression values.

After normalizing the data appropriately, we calculate the central tendency (the mean, μ, or median) and the spread (the standard deviation, σ, or interquartile range, IQR) for each of the n samples and each of the p genes separately. We calculate the robust counterparts of the statistical moments (the median and IQR) to make the statistics more resistant to outliers. Because experts are often accustomed to the nonrobust versions (for example, μ or σ), our system also incorporates those measures. This helps users quickly familiarize themselves with the information in the views, and at any point in an analysis, they can modify the set of statistics they’re using.
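The row-wise and column-wise computation can be sketched as follows. This is a minimal illustration using only Python's standard library, not the system's implementation; the function names, the toy table, and the choice of the sample standard deviation and the "inclusive" quartile method are assumptions.

```python
import statistics

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def summarize(values, robust=False):
    """Return (central tendency, spread) for one row or one column."""
    if robust:
        return statistics.median(values), iqr(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.mean(values), statistics.stdev(values)

def dual_statistics(table, robust=False):
    """Compute per-sample (row) and per-gene (column) statistics.

    `table` is an n x p list of lists: n samples (rows), p genes (columns).
    Each returned pair is one point in the corresponding scatterplot.
    """
    per_sample = [summarize(row, robust) for row in table]
    per_gene = [summarize(col, robust) for col in zip(*table)]
    return per_sample, per_gene

# Toy 3 x 4 table (3 samples, 4 genes); the values are made up.
table = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 2.0, 2.0, 2.0],
    [0.0, 4.0, 8.0, 12.0],
]
per_sample, per_gene = dual_statistics(table)
```

Calling `dual_statistics(table, robust=True)` switches both views to the median and IQR, which mirrors the option of changing the set of statistics during an analysis.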

Figure 1 illustrates how we construct dual-analysis views. Visualizations of samples have a yellow background, with each point representing a sample; visualizations of genes have a light-green background, with each point depicting a gene. The computed statistics determine a point’s location in a scatterplot. We can elaborate the analysis by using statistics other than the first two statistical moments. For the analyses in this article, we also computed skewness, which indicates a distribution’s asymmetry (and the asymmetry’s direction), and kurtosis, which characterizes its peakedness.
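For readers who want to reproduce the two higher-order statistics, the following sketch computes them from central moments. The text does not say which estimator was used, so the population (uncorrected) moments and the "excess" convention for kurtosis are assumptions, and the function names are hypothetical.

```python
import statistics

def _central_moment(values, k):
    """k-th central moment using the population (1/n) convention."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Third standardized moment; its sign gives the direction of asymmetry."""
    # Raises ZeroDivisionError for constant input (zero variance).
    return _central_moment(values, 3) / _central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    return _central_moment(values, 4) / _central_moment(values, 2) ** 2 - 3.0

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [0.0, 0.0, 0.0, 10.0]
```

A symmetric row such as `symmetric` yields a skewness of zero, whereas `right_tailed` yields a positive value because its long tail points toward larger values.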

Figure 1. To construct the view depicting samples (the one with the yellow background), we computed the statistics for each sample (the mean, μ, and standard deviation, σ) using a row of the data. To construct the view of the genes (the one with the light-green background), we computed the statistics using a column of the data.

Characterizing Subtypes

To facilitate subtype characterization, we incorporate dual-analysis scatterplots and significant-difference plots as blocks. We also use these visualizations as separate linked views to enhance interactive visual exploration and achieve tasks such as manual cluster creation.

Dual-Analysis Views

Figure 2 shows the embedded dual-analysis views. The sample scatterplots (yellow) display only those samples that are members of the represented cluster. The gene scatterplots (green) display the statistics for all the genes computed, using only the members of the represented cluster.
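The restriction to cluster members can be sketched as below. The sketch assumes an n × p table with one cluster label per sample and uses the mean and sample standard deviation as the two plotted statistics; `cluster_views` and the toy data are hypothetical, and every cluster needs at least two members for the standard deviation to be defined.

```python
import statistics
from collections import defaultdict

def point(values):
    """(mean, sample standard deviation) for one row or column."""
    return statistics.mean(values), statistics.stdev(values)

def cluster_views(table, labels):
    """Build the sample and gene scatterplot points for every cluster block.

    `table` is n x p (samples x genes); `labels[i]` is the cluster of row i.
    """
    rows_by_cluster = defaultdict(list)
    for row, label in zip(table, labels):
        rows_by_cluster[label].append(row)

    views = {}
    for label, rows in rows_by_cluster.items():
        views[label] = {
            # One point per member sample, computed over that sample's row.
            "samples": [point(row) for row in rows],
            # One point per gene, computed over the member samples only.
            "genes": [point(col) for col in zip(*rows)],
        }
    return views

# Toy table: 4 samples x 3 genes, two clusters of two samples each.
table = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [10.0, 10.0, 10.0],
    [20.0, 20.0, 20.0],
]
views = cluster_views(table, ["A", "A", "B", "B"])
```

Each block therefore shows as many sample points as the cluster has members, but always one gene point per gene in the dataset.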

Figure 2. The first column shows a four-cluster stratification for a microRNA (miRNA) dataset. The scatterplots show the median versus the interquartile range (IQR) for the cluster samples. The second column shows a three-cluster stratification for a messenger RNA (mRNA) dataset, again showing samples. The third column uses the same three-cluster stratification for the same dataset but shows genes instead of samples. The sample scatterplots (yellow) depict the statistical characteristics of each cluster’s members; the gene scatterplots (green) depict statistics computed for the genes using only the samples from the cluster represented by the block. The selection of samples is highlighted in the first two columns and in the ribbons. The selection of the genes enables investigation of the distribution of expression values for the genes for different clusters in a stratification.

We enhance interactive exploration by enabling a selection mechanism that’s linked with all the views in StratomeX. Users can select both samples (see the second cluster in the second column of Figure 2) and genes (see the second cluster in the third column in Figure 2) at the same time. In Figure 2, the ribbons in StratomeX highlight the selection of the samples.

Significant-Difference Plots

We previously used plots to effectively display the changes in statistical computations in response to a user’s selection.⁸ Here, we extend that approach with the determination and communication of the visualized differences’ significance.

Figure 3 illustrates how we construct significant-difference plots (we call them just difference plots from now on). The user first selects (brushes) a subset of samples. In response, the system automatically calculates μ and σ for each gene using only the set of selected samples B (μ^B and σ^B) and the rest of the samples R (μ^R and σ^R) separately. We then compute the differences between the values with

$$Δ_{μ} = μ^{B} - μ^{R}, \quad Δ_{σ} = σ^{B} - σ^{R}, \tag{1}$$

where Δ_μ and Δ_σ are data vectors of size p, the number of genes. The difference plot then visualizes these values for all p genes. When no difference exists between a gene’s expression values for B and R, we place that gene at the origin (0, 0). While building this view, we compare the selected set of samples to the rest instead of comparing against all the samples. This approach avoids overlap between the two compared sets.
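Equation 1 can be sketched in a few lines. The sketch assumes the brush is given as a set of row indices and that both B and R contain at least two samples; the function name `difference_plot` and the toy table are hypothetical.

```python
import statistics

def difference_plot(table, selected):
    """Return one (delta_mu, delta_sigma) point per gene.

    `table` is n x p (samples x genes); `selected` is the set of row
    indices forming the brushed subset B. All other rows form R, so the
    two compared sets never overlap.
    """
    brushed = [row for i, row in enumerate(table) if i in selected]
    rest = [row for i, row in enumerate(table) if i not in selected]
    points = []
    for col_b, col_r in zip(zip(*brushed), zip(*rest)):
        delta_mu = statistics.mean(col_b) - statistics.mean(col_r)
        # Assumption: sample standard deviation for sigma.
        delta_sigma = statistics.stdev(col_b) - statistics.stdev(col_r)
        points.append((delta_mu, delta_sigma))
    return points

# Toy table: 4 samples x 2 genes; rows 0 and 1 are brushed.
table = [
    [1.0, 5.0],
    [3.0, 5.0],
    [2.0, 0.0],
    [4.0, 10.0],
]
points = difference_plot(table, selected={0, 1})
```

In the toy data, the first gene has a lower mean but the same spread in B, so it lands on the negative x-axis; the second gene has the same mean but a smaller spread in B, so it lands on the negative y-axis.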

Figure 3. The plot visualizes the differences between the selected samples (B) and unselected samples (R) for the genes. Δ_μ and Δ_σ are data vectors of size p, the number of genes. Genes that differ significantly are red; all others are blue.

The difference plot on the right in Figure 3 displays the distribution of the differences in the computed statistics in response to the (sample) selection in the scatterplot (see the left of Figure 3). In this example, most genes have lower μ values for the selected items; that is, they lie to the left of the y-axis.

Communicating significance

An important consideration when analyzing differences between two subsets is statistical significance, that is, whether the difference is unlikely to be due to chance. As in many other domains, researchers who analyze genomic data use statistical hypothesis tests to test for significance.⁹ So, we enhance difference plots with integrated statistical hypothesis testing.

As the hypothesis-testing procedure, we use the two-sample Welch’s t-test.¹⁰ This test doesn’t assume that the two subsets have equal variance, which makes it suitable for our application. We perform the test on B and R and test against the (null) hypothesis that they have equal central tendencies. We compute t and the degrees of freedom, d.f.:

$$t = \frac{\bar{μ}_{B} - \bar{μ}_{R}}{\sqrt{\frac{s_{B}^{2}}{N_{B}} + \frac{s_{R}^{2}}{N_{R}}}},$$

$$\mathrm{d.f.} = \frac{\left(s_{B}^{2}/N_{B} + s_{R}^{2}/N_{R}\right)^{2}}{\left(s_{B}^{2}/N_{B}\right)^{2}/(N_{B} - 1) + \left(s_{R}^{2}/N_{R}\right)^{2}/(N_{R} - 1)},$$

where, for each subset i ∈ {B, R}, $\bar{μ}_{i}$ is the sample mean, $s_{i}^{2}$ is the sample variance, and $N_{i}$ is the sample size.

We then use these values together with the t-distribution to test the null hypothesis at a significance level of 0.05, using a two-tailed test. We perform this test for all p genes. For each gene, we store whether it shows a significant difference between B and R. We communicate this information by modifying the color of each gene in the difference plot. Genes with significant differences are red; the others are blue (see the right side of Figure 3). This enhancement lets analysts get immediate feedback on the differences’ significance. On the basis of this initial assessment, analysts can employ more-advanced routines to confirm the significance of the changes between the two subsets.
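The per-gene test and coloring can be sketched as follows. The computation of t and d.f. follows the two formulas above; the p-value step is a standard-library stand-in that integrates the t-distribution's density numerically with Simpson's rule, which is an assumption made to keep the sketch self-contained. In practice a statistics library would supply this step (for example, SciPy's `scipy.stats.ttest_ind` with `equal_var=False`). All function names are hypothetical.

```python
import math
import statistics

def welch_t(b, r):
    """Return (t, d.f.) for Welch's two-sample t-test on lists b and r."""
    n_b, n_r = len(b), len(r)
    v_b = statistics.variance(b) / n_b  # s_B^2 / N_B
    v_r = statistics.variance(r) / n_r  # s_R^2 / N_R
    # Raises ZeroDivisionError if both subsets have zero variance.
    t = (statistics.mean(b) - statistics.mean(r)) / math.sqrt(v_b + v_r)
    df = (v_b + v_r) ** 2 / (v_b ** 2 / (n_b - 1) + v_r ** 2 / (n_r - 1))
    return t, df

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    if a == 0.0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = total * h / 3  # approximates P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central)

def gene_color(b, r, alpha=0.05):
    """Red if the gene differs significantly between B and R, else blue."""
    t, df = welch_t(b, r)
    return "red" if two_tailed_p(t, df) < alpha else "blue"

# Toy expression values of one gene in B and in R (made up).
shifted = gene_color([5.0, 6.0, 7.0, 8.0, 9.0], [1.0, 2.0, 3.0, 4.0, 5.0])
similar = gene_color([1.0, 2.0, 3.0, 4.0, 5.0], [1.5, 2.5, 3.5, 2.0, 4.0])
```

Because the test runs once per gene, the same routine would be applied to every column of the data table, with B and R taken from the current selection.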

Difference plots as blocks

While constructing these blocks, we again compute Δ_μ and Δ_σ for each gene, using Equation 1. Here, however, B corresponds to the samples that are members of the represented cluster, and R corresponds to the rest of the samples. We also compute the differences’ significance and color the visualization accordingly. The resulting blocks communicate which genes are more distinctive for each cluster. Moreover, the selection mechanism lets analysts compare these distinctive genes across different clusters. We show an example of this feature later.
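The one-versus-rest partition behind these blocks can be sketched as follows. The sketch assumes one cluster label per sample and at least two samples both inside and outside every cluster; `cluster_difference_blocks` and the toy data are hypothetical, and the significance coloring is omitted for brevity.

```python
import statistics

def cluster_difference_blocks(table, labels):
    """Map each cluster label to its per-gene (delta_mu, delta_sigma) list.

    For each cluster, B is the set of member samples and R is every other
    sample, as in Equation 1.
    """
    blocks = {}
    for label in sorted(set(labels)):
        members = [row for row, lab in zip(table, labels) if lab == label]
        others = [row for row, lab in zip(table, labels) if lab != label]
        blocks[label] = [
            (statistics.mean(col_b) - statistics.mean(col_r),
             statistics.stdev(col_b) - statistics.stdev(col_r))
            for col_b, col_r in zip(zip(*members), zip(*others))
        ]
    return blocks

# Toy table: 4 samples x 2 genes, two clusters of two samples each.
table = [
    [1.0, 0.0],
    [3.0, 0.0],
    [5.0, 4.0],
    [7.0, 8.0],
]
blocks = cluster_difference_blocks(table, ["A", "A", "B", "B"])
```

With only two clusters, the two blocks mirror each other; with more clusters, each block highlights the genes that separate one cluster from all remaining samples.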

Case Studies

We demonstrated our approach’s effectiveness by analyzing a comprehensive invasive breast carcinoma (BRCA) dataset collected by the TCGA consortium. We used the mRNA expression data, miRNA sequencing data, and methylation data from more than 800 breast cancer patients. First, we loaded the BRCA data, which is available pre-packaged for Caleydo. For comparison and evaluation, we used a recently published reference study that provided a stratification of samples.¹¹

The case studies aimed to demonstrate how our approach lets analysts execute the three tasks we described earlier.