Author manuscript; available in PMC: 2024 Mar 19.
Published in final edited form as: Methods Mol Biol. 2023;2629:73–93. doi: 10.1007/978-1-0716-2986-4_5

Statistical Methods for Integrative Clustering of Multi-omics Data

Prabhakar Chalise 1, Deukwoo Kwon 2, Brooke L Fridley 3, Qianxing Mo 3
PMCID: PMC10950392  NIHMSID: NIHMS1970698  PMID: 36929074

Abstract

Cancers are heterogeneous diseases caused by accumulated mutations or abnormal alterations at multi-levels of biological processes including genomics, epigenomics, transcriptomics, and proteomics. There is a great clinical interest in identifying cancer molecular subtypes for disease prognosis and personalized medicine. Integrative clustering is a powerful unsupervised learning method that has been increasingly used to identify cancer molecular subtypes using multi-omics data including somatic mutations, DNA copy numbers, DNA methylation, and gene expression. Integrative clustering methods are generally classified into model-based or nonparametric approaches. In this chapter, we will give an overview of the frequently used model-based methods, including iCluster, iClusterPlus, and iClusterBayes, and the nonparametric method, integrative nonnegative matrix factorization (intNMF). We will use the integrative analyses of uveal melanoma and lower-grade glioma to illustrate these representative methods. Finally, we will discuss the strengths and limitations of these representative methods and give suggestions for performing integrative analyses of cancer multi-omics data in practice.

Keywords: Integrative clustering, iCluster, iClusterPlus, iClusterBayes, NMF, intNMF, TCGA, Uveal melanoma, Lower-grade gliomas

1. Introduction

Identification of molecular subtypes of cancers using multi-omics data such as gene expression, DNA methylation, DNA copy number, and protein expression has played a very important role in understanding the molecular basis of disease prognosis and treatment response. Similar-looking cells at the morphological level can have very different molecular profiles. A few examples of molecular heterogeneity within a cell include point mutations, deletions or insertions in tumor suppressor genes, differential promoter methylation, chromosomal translocations, and so on. Studies have shown that such molecular heterogeneity can result in multiple cancer subtypes [1–3]. People with the same cancer, diagnosed with similar clinicopathological, morphological, and functional characteristics, can exhibit very different clinical outcomes with respect to the severity of the disease, response trajectory to drugs, etc. Therefore, it is clinically important to identify the subtypes of cancers at the molecular level to understand their etiology and to develop personalized medicine. One way to identify cancer subtypes is to perform clustering analysis, which groups the samples based on the similarity pattern of their molecular profiles. Within the subgroups, it is thought that the tumors will be more homogeneous and thus may have similar clinical response to a given therapy regimen. There have been many successful examples of the application of clustering methods to identify cancer subtypes [4–6].

Clustering can be carried out using a single type of data at a time or using multiple types of omics (multi-omics) datasets comprehensively together [7]. Multiple layers of biological data have been increasingly generated due to the advent of high-throughput technologies such as microarrays and next-generation sequencing technologies. For example, the multi-institution collaborative project, The Cancer Genome Atlas (TCGA), has generated multiple layers of omics data including genome, transcriptome, epigenome, and proteome information for a large number of subjects for many cancers [6, 8–13]. The different types of datasets are assayed at different layers of biological processes. The biological signal (a systematic pattern of molecular profiles) may or may not be present in all those datasets. For example, a similarity (or dissimilarity) pattern may be observed with gene expression data for a few subjects, while such signals may be clearer with methylation data for others. Also, there might be weak but consistent signals present across several datasets. Since multiple layers of data are assayed on the same subjects, they are interrelated. Clustering analysis using a single data type cannot capture such signals. Integrative clustering is a powerful approach to identify latent subtype structures inherent in the datasets, accounting for both between- and within-data correlations. The goal of the integrative clustering analysis is to assign the samples to distinct classes (clusters), considering the biological phenomena at several levels including gene expression, DNA methylation, DNA copy number variation (CNV), protein expression, etc. [4, 14, 15].

Different molecular data have their own characteristics in terms of their measurement methods, scales of measurement, and distribution. For example, gene mutation is usually summarized as the presence/absence of somatic mutations, methylation is measured by the ratio of methylated and unmethylated signal intensities, and gene expression is measured by the amount of production of mRNAs using microarray or sequencing technologies. The data differ widely in their quantitative scales and statistical distributions. Such disparity of the datasets poses a challenge for integrative analyses. The simplest approach to integration is to concatenate multiple datasets after appropriate normalization (i.e., rescaling of the data). Then well-known methods designed for single omics data clustering are applied to the combined data matrix. Such approaches are simple and computationally efficient but tend to dilute the already low signal-to-noise ratio in multiple datasets [16]. Another approach is to carry out clustering analysis using one data type at a time and manually integrate the results. Such an approach is time-consuming, suffers from subjective bias, and is non-reproducible. In a third approach, clustering analyses are carried out for each dataset separately, the clustering assignments are combined, and another clustering method is used on the combined clustering assignment data (cluster of clusters) [17]. However, defining the similarity metric or distance is not easy with such methods.

A few parametric (model-based) [15, 18–22] and nonparametric methods [14, 16, 23, 24] have been developed for integrative clustering analysis. In this chapter, we will focus on introducing the frequently used model-based methods including iCluster, iClusterPlus, and iClusterBayes [15, 20–22] and the nonparametric method integrative nonnegative matrix factorization (intNMF) [14]. To illustrate these methods, we will present the analyses of the uveal melanoma and lower-grade glioma multi-omics data from TCGA. Finally, we will discuss the strengths and limitations of the model-based methods and the nonparametric methods.

2. Model-Based Methods of Integrative Clustering

2.1. The iCluster Method

The integrative clustering (iCluster) method is a model-based approach for unsupervised joint clustering analysis of multi-omics data [15]. The major goals of the iCluster are to obtain joint sample clusters and simultaneously identify the multi-omics features contributing to the joint sample clustering (Fig. 1). The iCluster method is designed to decompose multiple continuous data matrices into the product of omics-specific weight matrices and a shared factor matrix, which can be written as

$$X_t = W_t Z + E_t,$$

where $X_t$ is the $t$th data matrix with dimension $p_t \times n$ ($p_t$ rows of features and $n$ columns of samples), $W_t$ is the weight (coefficient) matrix of dimension $p_t \times k$, $Z$ is a matrix of dimension $k \times n$ that connects all the data matrices, and $E_t$ is a random error matrix of dimension $p_t \times n$ (Fig. 1). The iCluster model can be viewed as a Gaussian latent variable model in which the $i$th column vector $z_i = (z_{i1}, z_{i2}, \dots, z_{ik})'$ of matrix $Z$ is associated with sample $i$ ($i = 1, 2, \dots, n$) and is used to capture the correlative structure of multi-omics data. To obtain a likelihood-based solution of the joint matrix decomposition, the components of $Z$ and $E_t$ are assumed to be normally distributed. A penalized expectation-maximization algorithm is used to obtain optimized $W_t$ and $Z$, in which a lasso-like penalty term is added on $W_t$ to generate a sparse $W_t$ matrix. As a result, informative features that contribute to the sample clustering will have nonzero coefficients (rows) in the $W_t$ matrix, and non-informative features will have zero coefficients. Integrative sample cluster assignments are obtained by performing k-means clustering on the $n$ column vectors of $Z$. Therefore, the iCluster method can perform sample clustering and distinguish informative features from non-informative features, both of which are important for integrative clustering analysis of multi-omics data. Different data types have different data structures that call for applying different penalty functions on the coefficient matrix $W_t$. For example, DNA copy number data tend to be spatially correlated along contiguous chromosomal regions, which makes the fused lasso penalty an appealing method for inducing sparsity on $W_t$. A set of genes in a certain pathway or functional group may be co-regulated. The elastic net penalty shrinks coefficients of correlated features towards each other, resulting in a group effect by removing or selecting highly correlated omics features together. In order to accommodate different omics data structures, Shen et al. further extended the iCluster method by applying lasso, elastic net, and fused lasso penalty functions on $W_t$ to facilitate multi-omics feature selection [22].
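
To illustrate how a lasso-type penalty can produce a sparse weight matrix, the following minimal Python sketch applies soft-thresholding, one common way such penalties yield exact zeros, to a small made-up coefficient matrix and then reads off the informative features as the rows with at least one nonzero coefficient. The function names, the toy matrix, and the threshold value are illustrative assumptions; this is not the iCluster implementation, which estimates the weights and the latent matrix jointly with a penalized EM algorithm.

```python
def soft_threshold(w, lam):
    """Shrink w toward zero by lam; values in [-lam, lam] become exactly 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def sparsify(W, lam):
    """Apply soft-thresholding to every entry of a p x k weight matrix."""
    return [[soft_threshold(w, lam) for w in row] for row in W]


def informative_rows(W):
    """Indices of features (rows) that keep at least one nonzero coefficient."""
    return [j for j, row in enumerate(W) if any(w != 0.0 for w in row)]


# Hypothetical 3 x 2 weight matrix (3 features, k = 2 latent variables).
W = [[0.9, -0.4],
     [0.05, -0.02],
     [-0.7, 0.1]]

W_sparse = sparsify(W, lam=0.1)   # lam is an arbitrary illustrative value
print(informative_rows(W_sparse))
```

In this toy case the second feature has only small coefficients, so all of them are set to zero and the feature is treated as non-informative.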

Fig. 1. Integrative clustering (iCluster) framework. (a) $n$ cancer samples have undergone somatic mutation, DNA copy number, promoter methylation, and mRNA gene expression analyses. Matrices $X_t$ ($t = 1, 2, 3, 4$) represent the multi-omics datasets with $n$ common samples (columns) and $p_t$ ($t = 1, 2, 3, 4$) features (rows), respectively. (b) iCluster models are based on matrix factorization and can be considered latent variable models in which $Z$ is a $k \times n$ latent variable matrix and $W_t$ is a $p_t \times k$ coefficient matrix for $t = 1, 2, 3, 4$. When $k = 3$, the $n$ samples can be divided into 4 clusters in the 3-dimensional latent variable space. (c) Besides performing sample clustering, iCluster models identify the informative (driver) features that drive the sample clustering. The simplified correlated patterns between DNA copy number and gene expression are shown in clusters 1 and 4, and the simplified correlated patterns between methylation and gene expression are shown in clusters 2 and 3

2.2. The iClusterPlus Method

The iCluster method is limited to modeling continuous multi-omics data, which are usually assumed to be normally distributed. In practice, multi-omics data are made up of many data types. For example, in a typical RNA-seq experiment, the expression level of a gene is usually represented by sequence count (the number of reads falling in the gene); for DNA copy number data, a chromosomal region can be classified as gain, loss, or normal; for somatic mutation data, the mutation status of a gene can be classified as mutation or normal. The iClusterPlus method can integrate four different data types (continuous, count, binary, and multi-categorical), which is a significant enhancement of the iCluster method [21]. Suppose $n$ samples are analyzed by $m$ high-throughput techniques generating $m$ omics datasets and the $t$th dataset has $p_t$ features. Let $x_{ijt}$ denote the value of the $j$th ($j = 1, 2, \dots, p_t$) omics feature of the $i$th ($i = 1, 2, \dots, n$) sample in the $t$th ($t = 1, 2, \dots, m$) dataset. It is assumed that sample $i$ is associated with the latent variable $z_i = (z_{i1}, z_{i2}, \dots, z_{ik})'$, a column vector consisting of $k$ unobserved latent variables that can be used for sample clustering. In the iClusterPlus framework, when $x_{ijt}$ is a continuous variable, $x_{ijt}$ and $z_i$ are related through a standard linear regression:

$$x_{ijt} = \alpha_{jt} + \beta_{jt} z_i + \varepsilon_{ijt}, \quad \varepsilon_{ijt} \sim N(0, \sigma_{jt}^2),$$

where $\alpha_{jt}$ is the intercept, $\beta_{jt} = (\beta_{1jt}, \dots, \beta_{kjt})$ are the slope coefficients, and $\varepsilon_{ijt}$ is the random error that is assumed to be independent and normally distributed with mean 0 and variance $\sigma_{jt}^2$. When $x_{ijt}$ is a count variable, the relationship between $x_{ijt}$ and $z_i$ is modeled by a Poisson regression:

$$\log \lambda(x_{ijt} \mid z_i) = \alpha_{jt} + \beta_{jt} z_i,$$

where $\lambda(x_{ijt} \mid z_i)$ is the conditional mean of the count given $z_i$. When $x_{ijt}$ is a binary variable with a realized value of 0 or 1, the following logistic regression is used:

$$\log \frac{P(x_{ijt} = 1 \mid z_i)}{1 - P(x_{ijt} = 1 \mid z_i)} = \alpha_{jt} + \beta_{jt} z_i,$$

where $P(x_{ijt} = 1 \mid z_i)$ is the probability of $x_{ijt} = 1$ given $z_i$. Finally, when $x_{ijt}$ is a multi-categorical variable, the following multinomial logistic regression is used:

$$P(x_{ijt} = c \mid z_i) = \frac{\exp(\alpha_{jct} + \beta_{jct} z_i)}{\sum_{l=1}^{C} \exp(\alpha_{jlt} + \beta_{jlt} z_i)}, \quad c = 1, 2, \dots, C,$$

where $P(x_{ijt} = c \mid z_i)$ denotes the probability of $x_{ijt}$ belonging to category $c$. To identify the omics features that make an important contribution to sample clustering, the lasso penalty is applied for the estimation of the model parameters. As a result, the coefficients $\beta_{jt}$ are shrunk to 0 for the non-informative features, while the coefficients for the informative features contributing to sample clustering have nonzero components. Sample clusters are determined by the values of the latent variables $z_i$, a length-$k$ vector. K-means clustering is used to separate the $n$ samples into $k + 1$ clusters. In order to achieve an optimal solution for the iClusterPlus model, a small number of values of $k$ (e.g., $k = 1, 2, 3, 4, 5$) are tested, and a grid search of the lasso shrinkage parameters is performed. The optimal $k$ is selected based on the plot of the deviance ratio or the Bayesian information criterion (BIC) versus $k$, choosing the point at which the deviance ratio or BIC becomes stable (see the example below).
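
The four regression forms above share the same linear predictor in the latent variable and differ only in the link function. The following minimal Python sketch evaluates the conditional mean or probability under each form for one feature and one sample. The function names and all numeric values are hypothetical, and the sketch covers only the forward evaluation of the models, not the penalized estimation performed by iClusterPlus.

```python
import math


def linear_predictor(alpha, beta, z):
    """alpha + beta . z for one feature (beta and z both have length k)."""
    return alpha + sum(b * zk for b, zk in zip(beta, z))


def gaussian_mean(alpha, beta, z):
    """Continuous data: the conditional mean is the linear predictor itself."""
    return linear_predictor(alpha, beta, z)


def poisson_mean(alpha, beta, z):
    """Count data: log link, so the conditional mean is exp(predictor)."""
    return math.exp(linear_predictor(alpha, beta, z))


def bernoulli_prob(alpha, beta, z):
    """Binary data: logit link, so P(x = 1 | z) is the logistic function."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(alpha, beta, z)))


def multinomial_probs(alphas, betas, z):
    """Multi-categorical data: softmax over one predictor per category."""
    etas = [linear_predictor(a, b, z) for a, b in zip(alphas, betas)]
    top = max(etas)                      # subtract the max for stability
    expo = [math.exp(e - top) for e in etas]
    total = sum(expo)
    return [e / total for e in expo]


# Hypothetical latent vector for one sample (k = 2) and made-up coefficients.
z_i = [0.5, -1.0]
print(gaussian_mean(1.0, [2.0, 0.5], z_i))
print(poisson_mean(0.2, [0.3, -0.1], z_i))
print(bernoulli_prob(-0.5, [1.0, 1.0], z_i))
print(multinomial_probs([0.0, 0.1, -0.2],
                        [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], z_i))
```

A feature whose slope coefficients are all zero has the same conditional distribution for every sample, which is why features shrunk to zero by the lasso penalty carry no clustering information.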

2.3. The iClusterBayes Method

The iClusterBayes model is a fully Bayesian latent variable model built on the iClusterPlus framework [20]. This method applies a Bayesian variable selection algorithm to identify informative features that contribute to sample clustering. Briefly, the iClusterBayes model for the continuous, count, and binary variables can be described as follows:

$$x_{ijt} = \beta_{jt} \Gamma_{jt} z_i + \varepsilon_{ijt}, \quad \varepsilon_{ijt} \sim N(0, \sigma_{jt}^2),$$
$$\log \lambda(x_{ijt} \mid z_i) = \beta_{jt} \Gamma_{jt} z_i,$$
$$\log \frac{P(x_{ijt} = 1 \mid z_i)}{1 - P(x_{ijt} = 1 \mid z_i)} = \beta_{jt} \Gamma_{jt} z_i,$$

where $\beta_{jt} = (\beta_{0jt}, \beta_{1jt}, \dots, \beta_{kjt})$ are the coefficients for the $j$th feature in the $t$th dataset, $z_i = (1, z_{i1}, z_{i2}, \dots, z_{ik})'$, and $\Gamma_{jt} = \mathrm{diag}(1, \gamma_{jt}, \dots, \gamma_{jt})$ is a diagonal matrix with $k + 1$ diagonal elements. The constant 1 in $z_i$ and $\Gamma_{jt}$ is designed to let the models have intercepts $\beta_{0jt}$ (equivalent to $\alpha_{jt}$ in the iClusterPlus model). In the iClusterBayes model, an indicator variable $\gamma_{jt}$ with value 1 or 0 is used for Bayesian variable selection. When $\gamma_{jt} = 1$, it indicates that the $j$th feature in the $t$th dataset contributes to the sample clustering. When $\gamma_{jt} = 0$, it indicates that the contribution of the corresponding feature to sample clustering is negligible. To perform Bayesian analysis, the model parameters are assumed to have the following prior distributions:

$$\beta_{jt} \sim N(0, \Sigma_t), \quad \sigma_{jt}^2 \sim \mathrm{InverseGamma}\left(\frac{\nu}{2}, \frac{\nu \sigma^2}{2}\right), \quad \gamma_{jt} \sim \mathrm{Bernoulli}(q_t).$$

In words, $\beta_{jt}$ follows a multivariate normal distribution with mean vector 0 and covariance $\Sigma_t$, $\sigma_{jt}^2$ follows an inverse gamma distribution with shape parameter $\nu/2$ and scale parameter $\nu\sigma^2/2$, and $\gamma_{jt}$ follows a Bernoulli distribution in which $q_t$ is the prior probability that a feature is a driver of sample clustering. With these assumptions, the model parameters and $z_i$ can be drawn from their posterior distributions. The drivers of the sample clustering are those features with a high posterior probability of $\gamma_{jt} = 1$. As in the iClusterPlus method, the sample cluster assignments are obtained by performing K-means clustering on the latent variable $z_i$. In addition, to achieve an optimal $k$, a small number of values of $k$ are tested, and the optimal $k$ is selected at the transition point on the plots of the deviance ratio and/or the Bayesian information criterion (BIC) versus $k$, where the deviance ratio or BIC becomes relatively stable (see the example below). To analyze the same multi-omics dataset, iClusterBayes needs much less computing time than iClusterPlus since it does not need to perform a grid search of the lasso parameters. Another advantage of the iClusterBayes method is that it calculates the posterior probabilities of being drivers for the omics features, while the iClusterPlus method does not calculate statistical significance for omics features.
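
The role of the indicator variable can be made concrete with a minimal Python sketch. It evaluates the linear predictor built from the coefficients, the diagonal indicator matrix, and the augmented latent vector for a single feature, and it estimates the posterior probability of being a driver as the fraction of sampled indicator values equal to 1. The function names and all numbers are hypothetical, and the sketch does not reproduce the posterior sampler used by iClusterBayes.

```python
def bayes_linear_predictor(beta, gamma, z):
    """Evaluate beta * Gamma * z_aug for one feature and one sample.

    beta  : (beta_0, beta_1, ..., beta_k), intercept first
    gamma : 0 or 1, the variable-selection indicator of the feature
    z     : latent vector of length k (the leading 1 is added here)
    """
    z_aug = [1.0] + list(z)                 # the constant 1 yields the intercept
    diag = [1.0] + [float(gamma)] * len(z)  # diagonal of Gamma
    return sum(b * g * v for b, g, v in zip(beta, diag, z_aug))


def posterior_inclusion_prob(gamma_draws):
    """Fraction of posterior draws in which the indicator equals 1."""
    return sum(gamma_draws) / len(gamma_draws)


beta = [0.5, 2.0, -1.0]   # hypothetical coefficients, k = 2
z_i = [0.3, 0.7]          # hypothetical latent vector of one sample

# With gamma = 0 only the intercept remains, so the feature is not related
# to the latent variables and cannot contribute to sample clustering.
print(bayes_linear_predictor(beta, 0, z_i))
print(bayes_linear_predictor(beta, 1, z_i))

# Made-up indicator draws for one feature.
print(posterior_inclusion_prob([1, 1, 0, 1]))
```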

The iCluster methods have been used by TCGA and other research groups to characterize a variety of cancers, including breast cancer, lung squamous cell carcinoma (LUSC), lung adenocarcinoma (LUAD), uterine corpus endometrial carcinoma (UCEC), stomach adenocarcinoma (STAD), skin cutaneous melanoma (SKCM), prostate adenocarcinoma (PRAD), oesophageal carcinoma (ESCA), liver hepatocellular carcinoma (LIHC), bladder urothelial carcinoma (BLCA), mesothelioma (MESO), sarcoma (SARC), and uveal melanoma (UM) [8–13, 25–31]. Recently, Hoadley et al. performed an unsupervised pan-cancer analysis of the 33 cancer types using the iCluster method and found that about one-third of the iClusters were almost homogeneous for a single cancer type, and the remaining two-thirds had various degrees of heterogeneity [32]. These findings demonstrated the dominant role of cell-of-origin in the classification of some cancer types, while, for the other cancer types, the cell-of-origin had less effect on cancer classification.

3. Nonparametric Methods of Integrative Clustering

In contrast to the model-based methods, the nonparametric methods do not rely on any distribution assumptions of the data. A few nonparametric methods have been proposed in recent years including the integrative nonnegative matrix factorization method (intNMF) [14], similarity network fusion (SNF) [16], perturbation clustering (PINS) [24], and neighborhood-based multi-omics clustering (NEMO) [23], etc. The nonparametric methods utilize the similarity among the subjects as a criterion to identify the subtypes. Different methods utilize different types of similarity metrics, such as distance, correlation, nearest neighborhood, or consensus. The integrative clustering intNMF utilizes state-of-the-art consensus clustering in integrating multiple types of molecular data. We briefly introduce a few nonparametric methods for integrative clustering and present more details on intNMF clustering method.

3.1. Similarity Network Fusion (SNF) [16]

For each omics data type, SNF creates a sample similarity matrix based on pairwise distance measures, termed a patient network. The samples represent the nodes, and the similarity measures represent the edges between the nodes. Then, the networks are integrated to create a single common network matrix using a nonlinear combination method. The SNF method utilizes kernel methods to create and fuse the network structures. Spectral clustering is then used on the final network structure to identify the disease subtypes.

3.2. Perturbation Clustering (PINS) [24]

PINS uses a perturbation technique, i.e., repeatedly adding Gaussian noise to the data and applying a classical clustering algorithm to the perturbed data. For each type of omics data, a connectivity matrix is constructed, and these matrices are merged into a single combined similarity matrix. Then, classical clustering methods, such as k-means, hierarchical clustering, partitioning around medoids, or dynamic tree cut, are used to identify the subtypes and determine the cluster membership assignment of the subjects.

3.3. Neighborhood-Based Multi-omics Clustering (NEMO) [23]

NEMO constructs similarity networks for the samples as in SNF for each data and then modifies the similarity to relative similarity. The relative similarity networks are then averaged to create a common relative similarity network structure. Then, as in SNF, spectral clustering is used on the common similarity matrix to carry out the clustering of the subjects.

In this chapter, we describe the intNMF method and its application to real-life data on lower-grade gliomas. The method is based on the powerful nonnegative matrix factorization technique [33] and consensus clustering-based approaches [34]. The nonnegative matrix factorization method has been shown to be highly effective in pattern recognition studies [33]. The consensus clustering technique is considered to be the state-of-the-art approach for clustering algorithms [24, 34]. The integrative clustering method intNMF utilizes nonnegative matrix factorization and consensus clustering in integrating multiple types of molecular data. The method has been found to be highly effective in identifying disease subtypes [14, 35].

4. Integrative NMF Clustering (intNMF)

Let us begin with a brief introduction of nonnegative matrix factorization (NMF) framework. By definition, NMF imposes a non-negativity constraint on its estimation process. In other words, the linear combination of the estimated matrices has only the additive effect if any effect is present; otherwise, it is zero, i.e., there is no negative effect. NMF follows an intuitive approach in which the parts can be combined to form a whole object without having any cancellation effects during the process. NMF approach was first proposed by Paatero and Tapper [36] in 1994. In 1999, Lee and Seung [33] proposed an algorithm for matrix factorization and demonstrated the successful application of NMF in pattern recognition problem in human image analysis. Prior to NMF, the standard method in image analyses was singular value decomposition (SVD) or principal component-based method. Lee and Seung showed that although SVD-based methods are powerful and have unique stable decomposition, in pattern recognition problems, they often result in noisy patterns induced by the cancellation effects of negative signs. They concluded that NMF is more efficient than SVD for pattern recognition research.

Brunet et al. [37] combined the algorithm proposed by Lee and Seung with the consensus clustering technique [34] to determine subtypes of cancer. Suppose $X_{p \times n}$ is a matrix with $p$ features and $n$ subjects in which all entries are nonnegative. Then NMF approximately factorizes the matrix $X_{p \times n}$ into two nonnegative matrices $W_{p \times k}$ and $Z_{k \times n}$, $X_{p \times n} \approx W_{p \times k} Z_{k \times n}$, where $k$ is the user-specified number of latent factors, e.g., the number of groups or classes. The resulting matrices $Z$ and $W$ are called the matrix of basis vectors and the matrix of coefficient vectors, respectively. Using the algorithm, the optimum basis matrix $Z$ is identified and used to classify the subjects based on their rank in the columns.

In order to estimate the optimum $Z$, an objective function is defined, such as the Frobenius norm $Q = \min_{W,Z} \lVert X - WZ \rVert^2$. Although divergence-based functions can also be used instead of the Frobenius norm, the results are very similar [14, 33]. The objective function is convex in $Z$ when $W$ is given, or convex in $W$ when $Z$ is given. But, since both $Z$ and $W$ are unknown, the objective function is non-convex in general. As a result, the algorithm is not guaranteed to find the global minimum of the optimization problem [36, 38]. In pattern recognition or clustering analysis applications, achieving the "best" local minimum out of several random initializations of $Z$ and $W$ in the algorithm is enough and has been successfully implemented and interpreted [37]. The clustering structure is determined based on the relative ranks of the subjects within the columns of the $Z$ matrix rather than the numerical values. Therefore, as long as the relative rank (i.e., cluster pattern) remains stable in a local optimum, we will be able to extract the latent structure successfully.
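
The single-dataset NMF idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: it uses the multiplicative update rule of Lee and Seung for the Frobenius objective, keeps the best of a few random initializations, and assigns each subject to the row of the $k \times n$ factor with the largest value in its column. The function names, the toy matrix, the iteration count, and the number of restarts are all hypothetical choices, and the code is not the implementation of any published package.

```python
import random


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(r) for r in zip(*A)]


def frobenius_sq(X, W, Z):
    """Squared Frobenius norm of X - WZ."""
    WZ = matmul(W, Z)
    return sum((x - y) ** 2 for rx, ry in zip(X, WZ) for x, y in zip(rx, ry))


def nmf(X, k, n_iter=300, seed=0, eps=1e-9):
    """Multiplicative-update NMF: X (p x n) ~ W (p x k) Z (k x n)."""
    rng = random.Random(seed)
    p, n = len(X), len(X[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(p)]
    Z = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for _ in range(n_iter):
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), Z)
        Z = [[Z[a][i] * num[a][i] / (den[a][i] + eps) for i in range(n)]
             for a in range(k)]
        Zt = transpose(Z)
        num, den = matmul(X, Zt), matmul(W, matmul(Z, Zt))
        W = [[W[j][a] * num[j][a] / (den[j][a] + eps) for a in range(k)]
             for j in range(p)]
    return W, Z


def best_of_restarts(X, k, n_starts=5):
    """Keep the local optimum with the smallest objective value."""
    fits = [nmf(X, k, seed=s) for s in range(n_starts)]
    return min(fits, key=lambda wz: frobenius_sq(X, wz[0], wz[1]))


def assign_clusters(Z):
    """Each subject (column) goes to the factor with the largest entry."""
    return [max(range(len(Z)), key=lambda a: Z[a][i])
            for i in range(len(Z[0]))]


# Toy nonnegative matrix: 4 features x 4 subjects with two clear groups.
X = [[5, 4, 0, 0],
     [4, 5, 0, 0],
     [0, 0, 4, 5],
     [0, 0, 5, 4]]
W, Z = best_of_restarts(X, k=2)
print(assign_clusters(Z))
```

Only the relative sizes of the entries within each column are used for the assignment, which is why a stable local optimum is sufficient for clustering.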

intNMF extends the NMF framework to more than one type of data (e.g., mRNA, DNA methylation, protein expression) collected on the same set of subjects. Let $X_t$, $t = 1, 2, \dots, m$, be matrices representing $m$ data types assayed on $n$ samples with $p_t$ features. Integrative clustering is carried out by estimating the common latent structure (basis matrix) $Z$ and data-specific coefficient matrices $W_t$ such that

$$X_t \approx W_t Z, \quad t = 1, 2, \dots, m,$$

where all entries of $Z$ and $W_t$ are nonnegative. Figure 2 shows the overall framework of the intNMF method. The objective function is then defined as the weighted Frobenius norm

$$Q = \min_{W,Z} \sum_{t=1}^{m} \theta_t \lVert X_t - W_t Z \rVert^2,$$

where $\theta_t > 0$ is the user-specified weight if such weights are available. The default weight in the current implementation of intNMF is 1, i.e., equal weight for all data. However, users can provide their own estimated weights as appropriate. One example of such weight estimation for each dataset can be based on the mean sums of squares, given by $\theta_t = \dfrac{\max\{\mathrm{mean}(X_t^2),\ t = 1, \dots, m\}}{\mathrm{mean}(X_t^2)}$, $t = 1, \dots, m$. Numerical optimization methods are used to find the local optimum by minimizing the non-convex objective function. intNMF utilizes the nonnegative-constrained alternating least squares (NNALS) algorithm [39] to solve the optimization problem. The strategy is to initialize only one $Z$ matrix regardless of how many datasets are being integrated. In the first part of each iteration, the nonnegative least squares method is used to estimate the $W_t$ matrices. Then, in the second part of the same iteration, the $Z$ matrix is estimated by feeding those $W_t$ matrices into the least squares method. In this way, the non-convex problem is converted into a convex problem piecewise for each of the alternating iterative steps. There are several advantages to this algorithm over the multiplicative update rule as proposed earlier [14, 40]. After the optimum $Z$ matrix is identified, the cluster memberships for the subjects are determined based on the relative rank of the values in each column. The convergence criterion is based on the stability of the clustering assignment using the idea of consensus clustering [14, 34].
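
As a small worked illustration of the weighting scheme and the weighted objective, the Python sketch below computes mean-sum-of-squares weights for two toy datasets and evaluates the weighted Frobenius objective for given factor matrices. The function names and all matrices are made up for illustration. The sketch only evaluates the objective; it does not reproduce the NNALS solver or the intNMF package.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def mean_square(X):
    """Mean of the squared entries of a matrix."""
    vals = [x * x for row in X for x in row]
    return sum(vals) / len(vals)


def mean_square_weights(datasets):
    """Weight of dataset t: largest mean square over all datasets divided
    by the mean square of dataset t, so small-scale data are up-weighted."""
    ms = [mean_square(X) for X in datasets]
    top = max(ms)
    return [top / m for m in ms]


def weighted_objective(datasets, Ws, Z, weights):
    """Sum over datasets of weight * squared Frobenius norm of X_t - W_t Z."""
    total = 0.0
    for X, W, theta in zip(datasets, Ws, weights):
        WZ = matmul(W, Z)
        total += theta * sum((x - y) ** 2
                             for rx, ry in zip(X, WZ)
                             for x, y in zip(rx, ry))
    return total


# Two toy datasets on the same n = 2 samples (columns), with k = 1.
X1 = [[1.0, 2.0],
      [3.0, 6.0]]          # p_1 = 2 features
X2 = [[2.0, 4.0]]          # p_2 = 1 feature
Z = [[1.0, 2.0]]           # shared 1 x 2 latent matrix
W1 = [[1.0], [3.0]]        # X1 equals W1 Z exactly
W2 = [[2.0]]               # X2 equals W2 Z exactly

theta = mean_square_weights([X1, X2])
print(theta)
print(weighted_objective([X1, X2], [W1, W2], Z, theta))
```

Because the same latent matrix appears in every term of the sum, a change in it affects the fit to all datasets at once, which is what ties the data types together.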

Fig. 2. The framework of the integrative nonnegative matrix factorization (intNMF) method

In any clustering method, there are two major steps involved: (i) identification of the optimum number of clusters $k$ and (ii) assignment of the subjects to those clusters, i.e., cluster memberships. A meaningful number of clusters $k$ needs to be small enough to reduce the noise but large enough to retain important patterns present in the data. intNMF utilizes a resampling-based cross-validation technique to estimate the optimum number of clusters by calculating the Cluster Prediction Index (CPI). After the number of clusters is identified, cluster memberships for the subjects are assigned by running the intNMF algorithm with the optimum number of clusters specified. The details of the algorithm and the clustering assignments can be found in Chalise and Fridley [14]. The method is implemented in the R package intNMF and is freely available to download.

5. Preparation of Multi-omics Data for Integrative Clustering Analysis

TCGA has generated integrative multi-omics data including genomic, epigenomic, transcriptomic, and proteomic data for 36 cancer types, which have become a great resource for the discovery of integrative cancer subtypes. The level 3 multi-omics data are available at http://firebrowse.org/ and can be used for integrative clustering analysis. Usually, the multi-omics data are processed to form multiple data matrices with columns corresponding to the samples and rows corresponding to omics features for each dataset. For example, somatic mutation data can be summarized using a binary matrix with values 1 or 0, which indicates a gene that contains mutations or no mutation, respectively. A gene is classified as “mutated” if it contains in-frame deletion/insertion, frame shift deletion/insertion, missense/nonsense/nonstop mutation, RNA splice site, and/or translation start site mutation. Genes with a low mutation rate (e.g., ≤2%) in the samples usually do not significantly contribute to sample clustering, and it is challenging to fit the iCluster models due to the sparsity of the data. Therefore, we suggest only including genes with mutation rates >2% in iCluster analysis. For the copy number data, the log2 ratio segment means can be used for iCluster analysis. Since neighboring genomic regions tend to have similar values, they can be merged into condensed regions using the methods described by Mo et al. [21]. Gene expression and methylation data usually contain tens of thousands of genes. Typically, the cluster driver genes are those with relatively large variances. In practice, we found that using the most variable genes (e.g., top 25%) was sufficient for iCluster analysis [30, 31]. To make the gene expression and methylation data fit the model better, for the iCluster method, we suggest performing log2 transformation of the gene expression values (e.g., normalized mRNA-seq counts) and logit transformation of the methylation beta values. 
Similar data preparation steps can be followed for the intNMF method as well. But since intNMF does not rely on any statistical distribution assumption, no such transformation is required. The specific requirement of intNMF is that each of the datasets must have only nonnegative entries. While a few types of data are naturally nonnegative, e.g., methylation and mRNA expression from RNA-seq, many others contain both negative and positive entries, e.g., microarray datasets. We recommend calculating the absolute value of the smallest negative number present in the data and adding that to the whole data matrix, thus shifting the data in the positive direction. In doing so, the variance of the data will remain unchanged, and the shift will not have any effect on clustering performance. Also, it is important to keep the scales of the multiple datasets similar in order to avoid the results being biased towards the data with a wider range. The data can be rescaled such that they fall within a certain range. One simple approach is to divide the data by the maximum value present in the data.
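
The preparation steps described in this section can be summarized in a short Python sketch. The helper names, the pseudocount in the log2 transform, and the clipping constant in the logit transform are assumptions added for numerical safety and are not taken from the chapter; the 2% mutation-rate cutoff and the top 25% variance filter follow the suggestions given above.

```python
import math


def filter_mutated_genes(mut, min_rate=0.02):
    """mut maps gene -> list of 0/1 per sample; keep genes with rate > min_rate."""
    return {g: v for g, v in mut.items() if sum(v) / len(v) > min_rate}


def log2_transform(values, pseudocount=1.0):
    """log2 of expression values; the pseudocount (an assumption) avoids log2(0)."""
    return [math.log2(v + pseudocount) for v in values]


def logit_transform(betas, eps=1e-6):
    """Logit of methylation beta values, clipped away from 0 and 1 (assumption)."""
    clipped = [min(max(b, eps), 1.0 - eps) for b in betas]
    return [math.log(b / (1.0 - b)) for b in clipped]


def top_variable_features(X, frac=0.25):
    """Indices of the rows (features) with the largest sample variance."""
    def var(row):
        m = sum(row) / len(row)
        return sum((x - m) ** 2 for x in row) / (len(row) - 1)
    n_keep = max(1, int(round(frac * len(X))))
    ranked = sorted(range(len(X)), key=lambda j: var(X[j]), reverse=True)
    return sorted(ranked[:n_keep])


def shift_nonnegative(X):
    """Add |smallest negative value| to every entry so all entries are >= 0."""
    lowest = min(min(row) for row in X)
    shift = -lowest if lowest < 0 else 0
    return [[x + shift for x in row] for row in X]


def rescale_by_max(X):
    """Divide a nonnegative matrix by its maximum so entries fall in [0, 1]."""
    highest = max(max(row) for row in X)
    return [[x / highest for x in row] for row in X] if highest > 0 else X


# Toy microarray-like matrix with negative entries, prepared for intNMF.
X = [[-2, 0],
     [1, 3]]
X_ready = rescale_by_max(shift_nonnegative(X))
print(X_ready)
```

The first four helpers correspond to the iCluster-style preparation, and the last two correspond to the nonnegativity and scale requirements of intNMF.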

6. Integrative Clustering Analyses of Uveal Melanoma

Uveal melanoma (UM) is a rare cancer that originates from melanocytes within the uveal tract of the eye [41]. UM is a highly aggressive disease that preferentially metastasizes to the liver within a few years after diagnosis, regardless of the successful local treatment of the disease. The three-year survival rate was less than 5% for patients with metastatic UM [42–44]. UM characterized by loss of a copy of chromosome 3 (monosomy 3), gain of chromosome 8q, and a high-frequency mutation of BRCA-associated protein 1 (BAP1) is associated with a high risk of metastasis [45–47]. UM characterized by the normal copy number of chromosome 3 (disomy 3) and gain of chromosome 6p is associated with a low risk of metastasis [47, 48]. TCGA generated multi-omics data for 80 UM samples and reported DNA copy number, methylation, and gene expression subtypes [49]. However, TCGA did not report integrative subtypes based on integrative clustering analysis of the multi-omics data. Therefore, we use the analysis of UM multi-omics data as an example to illustrate the iCluster methods.

We performed an iCluster analysis of the TCGA uveal melanoma (UM) multi-omics data, including somatic mutation, DNA copy number, methylation, and gene expression [30]. The multi-omics data were processed as described in the previous section and analyzed using the iClusterPlus method. To identify an optimal number of clusters, we tested the cluster number parameter k=1,2,3,4,5 and searched 701 lasso parameters for each k. Figure 3a shows, for each k, the BIC value and deviance ratio of the model whose deviance ratio is the maximum among the 701 candidate models. The BIC value is the second smallest at k=3 and the smallest at k=4, suggesting that four and five clusters are the two candidate solutions (Fig. 3a). By examining the two options, we found that the major difference between the four- and five-cluster solutions was that one cluster of the four-cluster solution was further divided into two smaller clusters, generating five clusters in total [30]. To avoid overfitting, the four-cluster solution was considered optimal. Figure 3b shows the distribution of the UM samples in the three-dimensional latent variable space, in which the samples form two major clusters, each of which is divided into two smaller clusters. This observation is consistent with the multi-omics patterns that drive the formation of the four iClusters (Fig. 3c). It is well known that loss of one copy of chr3 (monosomy 3) is the most common aberration in UM [46, 49]. Strikingly, the samples with chr3 loss form one cluster, and the samples with normal chr3 (disomy 3) form another (Fig. 3c, copy number). Following tradition, the two major clusters are named M3 and D3, respectively. Primarily driven by chromosome 6q status, M3 is further divided into the M3.1 and M3.2 clusters, and D3 is further divided into the D3.1 and D3.2 clusters, where M3.2 and D3.1 are characterized by 6q loss (Fig. 3c, copy number). Notably, M3 is correlated with 8q gain, D3 is correlated with 6p gain, and M3 and 6p gain appear to be mutually exclusive (Fig. 3c, copy number). The top genes driving the sample clustering form two gene clusters (Me1, Me2) in the methylation data and two gene clusters (Ex1, Ex2) in the gene expression data (Fig. 3c, methylation, mRNA). The methylation and gene expression patterns are negatively correlated. Overall, the methylation patterns of Me1 and the expression patterns of Ex2 are positively correlated with the chr3 status. In contrast, the methylation patterns of Me2 and the expression patterns of Ex1 are negatively correlated with the chr3 status. In addition, genes including GNA11, BAP1, EIF1AX, SF3B1, and GNAQ are identified as driver genes with different mutation rates between the M3 and D3 iClusters (Fig. 3c, somatic mutation). The M3 iCluster is associated with worse overall survival than the D3 iCluster (Fig. 3d). However, the survival functions are not significantly different between M3.1 and M3.2 or between D3.1 and D3.2 (Fig. 3e). These observations suggest that it may be sufficient to classify UM into two major subtypes for personalized management in practice.
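The model-selection step described above (for each k, take the model with the largest deviance ratio among the lasso settings tried, then compare BIC across k) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the iClusterPlus implementation: the function name `summarize_models`, the array layout, and the toy numbers are hypothetical and are not the values plotted in Fig. 3a.

```python
import numpy as np

def summarize_models(dev_ratio, bic):
    """For each k (row), pick the lasso setting (column) with the largest
    deviance ratio and return that model's deviance ratio and BIC."""
    dev_ratio = np.asarray(dev_ratio, dtype=float)
    bic = np.asarray(bic, dtype=float)
    best_col = dev_ratio.argmax(axis=1)       # best lasso setting per k
    rows = np.arange(dev_ratio.shape[0])
    return dev_ratio[rows, best_col], bic[rows, best_col]

# Hypothetical toy values: 5 values of k (rows) x 3 lasso settings (columns).
ks = [1, 2, 3, 4, 5]
toy_dev_ratio = [[0.10, 0.15, 0.12],
                 [0.30, 0.28, 0.25],
                 [0.40, 0.45, 0.50],
                 [0.52, 0.55, 0.51],
                 [0.53, 0.54, 0.56]]
toy_bic = [[900, 880, 890],
           [800, 820, 810],
           [760, 750, 700],
           [720, 690, 730],
           [740, 735, 725]]

best_dev, best_bic = summarize_models(toy_dev_ratio, toy_bic)
ks_by_bic = [ks[i] for i in np.argsort(best_bic)]  # smallest BIC first

for k, d, b in zip(ks, best_dev, best_bic):
    # As in the text, a cluster parameter k corresponds to k + 1 clusters.
    print(f"k={k} ({k + 1} clusters): deviance ratio={d:.2f}, BIC={b:.0f}")
print("k ranked by BIC (smallest first):", ks_by_bic)
```

The ranking only narrows the choice to a few candidates. Picking between them, as done above for the four- and five-cluster solutions, still requires inspecting how the clusters differ.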

Fig. 3.

UM integrative clusters (adapted from Mo et al. [30]). (a) Bayesian information criterion (BIC) and the maximum deviance ratio at the cluster parameter k=1,2,3,4,5. Based on the plot, k=3 (4 clusters) and k=4 (5 clusters) are two optimal solutions. (b) The UM samples distributed in the latent variable space of the iClusterPlus model. (c) Heatmaps of the multi-omics features driving the UM sample clustering. Copy number: copy number loss, normal, and gain are indicated by blue, white, and red, respectively. Methylation: hypomethylation and hypermethylation are indicated by blue and red, respectively. mRNA: high and low gene expression are indicated by blue and red, respectively. Somatic mutation: mutated genes are indicated by black bars. (d) Patient overall survival of the two major M3 and D3 iClusters. (e) Patient overall survival of the four iClusters

7. Integrative Clustering Analyses of Lower-Grade Glioma (LGG)

We illustrate the application of intNMF using lower-grade glioma (LGG) data from the TCGA and Ceccarelli et al. studies [5, 50]. The TCGA study consisted of LGG subjects only, whereas Ceccarelli et al. combined those subjects with glioblastoma subjects for their analysis. The data used in this example consist of the LGG subjects only, and hereafter we refer to them as the TCGA data.

Lower-grade gliomas (grades II and III) have extremely variable clinical behavior and disease progression, which cannot be adequately predicted from the histological class diagnosis alone. Although LGGs look similar on histological examination, a few of them stay indolent while others progress rapidly to glioblastoma [5]. This shows that the disease is dissimilar at the molecular level, indicating the presence of subtypes characterized by molecular activities. The LGG data were downloaded from the Genomic Data Commons (GDC) website, https://portal.gdc.cancer.gov/, and consist of mRNA (20,330 genes), DNA methylation (25,978 probes), and DNA copy number (24,776 genes) assayed on 511 subjects. Many genomic features are similar across the subjects (noise features), and those features do not contribute to the clustering analysis. Therefore, the most variable features in each data type, as measured by standard deviation, were selected for the clustering analysis. Such feature selection also improves computational efficiency by reducing the dimension of the data. After selecting the top three percentile features ranked by standard deviation, the new dimensions of the datasets become 584 mRNAs, 553 methylation features, and 493 CNVs for 511 subjects. In addition, a few different percentile cutoffs of the top varying features were also used in the intNMF method, and we observed that the results were stable. For simplicity, we picked the results from the top three percentile features for the illustrative example here. intNMF is applied to the datasets, and the resulting subtypes are compared with the subtypes identified by the TCGA study. The three subtypes identified by the TCGA study have been characterized by IDH mutation and 1p/19q co-deletion status: IDHmut-codel, IDHmut-non-codel, and IDHwt [5]. In our implementation of intNMF, we specified the search range of the number of clusters k from 2 to 7. 
To identify the optimum number of clusters, we carried out cross-validation to estimate the cluster prediction index (CPI). We found that the CPI value was highest for k=2 and only slightly lower for k=3 (Fig. 4a). Beyond k=3, the CPI values were much smaller. Because the CPI at k=3 was nearly as high as at k=2 and dropped sharply thereafter, we took three clusters as the optimum solution for the LGG samples (Fig. 4d). The three sample clusters can be visualized on the tSNE (t-distributed stochastic neighbor embedding) plot (Fig. 4b). The three subtypes identified by intNMF agree closely with the subtypes identified by the TCGA study [5], which were based on a cluster-of-clusters analysis approach and characterized by IDH mutation and 1p/19q co-deletion status. intNMF-C1 was enriched for IDH wild-type tumors without 1p/19q codeletion, intNMF-C2 consisted of IDH-mutant tumors and was highly enriched for 1p/19q codeletion, and intNMF-C3 consisted entirely of IDH-mutant tumors without codeletion. Notably, intNMF-C1 was widely separated from the other two clusters and had worse overall survival than they did (Fig. 4e). The survival differences among the subtypes were assessed by the Kaplan-Meier method followed by a log-rank test (p-value < 0.001). The subtypes identified by intNMF were biologically relevant and had differing survival probabilities.
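The variance-based feature filter described above can be sketched in a few lines. This is a minimal Python illustration, assuming a samples-by-features matrix; the function name `top_variable_features`, the `top_fraction` argument, and the simulated data are hypothetical, and the exact cutoff rule used for the LGG analysis may differ.

```python
import numpy as np

def top_variable_features(x, top_fraction=0.03):
    """Return the (sorted) column indices of the most variable features.

    x is a samples x features matrix; features are ranked by their
    sample standard deviation and the top `top_fraction` are kept.
    """
    x = np.asarray(x, dtype=float)
    sd = x.std(axis=0, ddof=1)                    # per-feature standard deviation
    n_keep = max(1, int(round(top_fraction * x.shape[1])))
    order = np.argsort(sd)[::-1]                  # most variable first
    return np.sort(order[:n_keep])

# Simulated example: 50 samples, 200 features, 6 of which are highly variable.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 200))
x[:, :6] *= 10.0                                  # inflate the spread of 6 features

selected = top_variable_features(x, top_fraction=0.03)
x_filtered = x[:, selected]
print(selected, x_filtered.shape)
```

Applying the same filter to each data type separately, as in the text, keeps the samples aligned across data types while reducing the number of features in each.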

Fig. 4.

Integrative clustering analysis of lower-grade glioma using three datasets: DNA methylation, mRNA expression, and copy number variation. (a) Plot of the cluster prediction index (CPI) against the search range of the number of clusters (k), showing the optimum number of clusters. (b) Visualization of the subgroups using tSNE. (c) Sample block structure of the consensus matrix. Stability of this matrix is the stopping criterion of the algorithm. (d) Heatmaps of mRNA, methylation, and copy number, with the identified cluster groups shown on top. The bottom panel shows the biologically important IDH mutation status and 1p/19q codeletion status. (e) Survival probability differences using the Kaplan-Meier method followed by a log-rank test

8. Conclusion

A fundamental approach in many omics data analyses is to find an appropriate lower-dimensional representation of the data. One such approach is to identify the latent structure of the datasets so that subgroups of the subjects (and/or features) can be discovered. Clustering analysis is a powerful method for identifying such latent structures. Traditionally, clustering analyses are carried out on one omics data type at a time. Although often analyzed independently, the molecular datasets are highly correlated because they are assayed at several layers of biological processes on the same set of patient samples. A strong latent structure may not be present in every data type, and there may be weak but consistent signals across multiple data types. Integrative analysis can combine such signals across the omics data types.

Several model-based and nonparametric integrative clustering methods have been developed. In this chapter, we focus on one representative model-based family (iCluster) and one nonparametric method (intNMF), with illustrative examples. Model-based methods assume that omics data follow certain distributions. For example, the iCluster methods assume that microarray gene expression data follow a normal distribution and that RNA-seq gene expression data (sequence counts) follow a Poisson distribution. In most cases, the raw omics data may not exactly follow the assumed distributions. However, we found that the iCluster methods performed reasonably well when the data were approximately distributed according to the assumed distributions, which can usually be achieved via data transformation. For example, methylation beta values (range: 0–1) do not follow a normal distribution. However, after logit transformation, the transformed data can be modeled using linear regression in the iCluster models. Similarly, after log transformation, sequence count data can also be modeled using linear regression. In contrast, nonparametric methods such as intNMF do not make distributional assumptions about the data; they only require that the data be nonnegative. Both methods have pros and cons. For example, the iClusterPlus and iClusterBayes methods are computationally intensive and need more computational resources than the intNMF method. However, the iCluster methods can perform feature selection to identify the driver features for sample clustering, while there is no corresponding function in the intNMF method. In addition, the iCluster methods can model different types of data (e.g., binary, categorical), while the intNMF method is limited to continuous data. Regardless, if there is a reasonably clear cluster pattern in the omics datasets, both methods are able to identify the biologically relevant subtypes, and they mostly agree with each other. 
The iCluster methods are implemented in the iClusterPlus package (https://www.bioconductor.org/packages/release/bioc/html/iClusterPlus.html), and the intNMF method is implemented in the IntNMF package (https://cran.r-project.org/web/packages/IntNMF/index.html), both of which are publicly available for download.
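The two data transformations mentioned above, logit for methylation beta values and log for sequence counts, can be sketched as follows. This is a minimal Python illustration rather than code from either package; the function names, the clipping constant `eps`, the `pseudocount`, and the choice of base 2 are assumptions made here so that values of exactly 0 or 1 and zero counts remain finite.

```python
import numpy as np

def logit_beta(beta, eps=1e-6):
    """Logit-transform methylation beta values in [0, 1].

    Values are clipped to [eps, 1 - eps] first (an assumption of this
    sketch) so that betas of exactly 0 or 1 do not map to +/- infinity.
    """
    beta = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log(beta / (1.0 - beta))

def log_counts(counts, pseudocount=1.0):
    """Log2-transform sequence counts after adding a pseudocount
    (an assumption of this sketch) so that zero counts stay finite."""
    return np.log2(np.asarray(counts, dtype=float) + pseudocount)

beta_values = [0.0, 0.2, 0.5, 0.9, 1.0]
read_counts = [0, 1, 3, 1023]

print(logit_beta(beta_values))
print(log_counts(read_counts))
```

After such transformations the values are unbounded and more nearly symmetric, which is closer to the normal model assumed for continuous data in the iCluster methods.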

Acknowledgments

We thank the Cancer Genome Atlas project for the use of data from uveal melanoma and lower-grade gliomas. Mo and Fridley are supported in part by the National Cancer Institute Center Core Grant P30 CA076292.

References

  • 1.Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, Thorsen T, Quist H, Matese JC, Brown PO, Botstein D, Lønning PE, Børresen-Dale A-L (2001) Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci 98(19):10869–10874. 10.1073/pnas.191367098
  • 2.Sotiriou C, Neo S-Y, McShane LM, Korn EL, Long PM, Jazaeri A, Martiat P, Fox SB, Harris AL, Liu ET (2003) Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc Natl Acad Sci 100(18):10393–10398. 10.1073/pnas.1732912100
  • 3.Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, Alexe G, Lawrence M, O’Kelly M, Tamayo P, Weir BA, Gabriel S, Winckler W, Gupta S, Jakkula L, Feiler HS, Hodgson JG, James CD, Sarkaria JN, Brennan C, Kahn A, Spellman PT, Wilson RK, Speed TP, Gray JW, Meyerson M, Getz G, Perou CM, Hayes DN, Cancer Genome Atlas Research N (2010) Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17(1):98–110. 10.1016/j.ccr.2009.12.020
  • 4.The Cancer Genome Atlas Network (2012) Comprehensive molecular portraits of human breast tumours. Nature 490(7418):61–70. http://www.nature.com/nature/journal/v490/n7418/abs/nature11412.html#supplementary-information
  • 5.The Cancer Genome Atlas Network (2015) Comprehensive, integrative genomic analysis of diffuse lower-grade gliomas. N Engl J Med 372(26):2481–2498. 10.1056/NEJMoa1402121
  • 6.Cancer Genome Atlas Research N (2017) Integrated genomic and molecular characterization of cervical cancer. Nature 543(7645):378–384. 10.1038/nature21386
  • 7.Chalise P, Koestler DC, Bimali M, Yu Q, Fridley BL (2014) Integrative clustering methods for high-dimensional molecular data. Transl Cancer Res 3(3):202–216. 10.3978/j.issn.2218-676X.2014.06.03
  • 8.Cancer Genome Atlas Research N (2012) Comprehensive genomic characterization of squamous cell lung cancers. Nature 489(7417):519–525. 10.1038/nature11404
  • 9.Cancer Genome Atlas Research N (2013) Integrated genomic characterization of endometrial carcinoma. Nature 497(7447):67–73. 10.1038/nature12113
  • 10.Cancer Genome Atlas Research N (2014) Comprehensive molecular profiling of lung adenocarcinoma. Nature 511(7511):543–550. 10.1038/nature13385
  • 11.Cancer Genome Atlas Research N (2014) Comprehensive molecular characterization of gastric adenocarcinoma. Nature 513(7517):202–209. 10.1038/nature13480
  • 12.Cancer Genome Atlas Research N (2015) The molecular taxonomy of primary prostate Cancer. Cell 163(4):1011–1025. 10.1016/j.cell.2015.10.025
  • 13.Cancer Genome Atlas Research N (2017) Integrated genomic characterization of oesophageal carcinoma. Nature 541(7636):169–175. 10.1038/nature20805
  • 14.Chalise P, Fridley BL (2017) Integrative clustering of multi-level ‘omic data based on non-negative matrix factorization algorithm. PLoS One 12(5):e0176278. 10.1371/journal.pone.0176278
  • 15.Shen R, Olshen AB, Ladanyi M (2009) Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics (Oxford, England) 25(22):2906–2912. 10.1093/bioinformatics/btp543
  • 16.Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, Haibe-Kains B, Goldenberg A (2014) Similarity network fusion for aggregating data types on a genomic scale. Nat Methods 11:333. 10.1038/nmeth.2810. https://www.nature.com/articles/nmeth.2810#supplementary-information
  • 17.Cancer Genome Atlas N (2012) Comprehensive molecular portraits of human breast tumours. Nature 490(7418):61–70. 10.1038/nature11412
  • 18.Lock EF, Dunson DB (2013) Bayesian consensus clustering. Bioinformatics 29(20):2610–2616. 10.1093/bioinformatics/btt425
  • 19.Kirk P, Griffin JE, Savage RS, Ghahramani Z, Wild DL (2012) Bayesian correlated clustering to integrate multiple datasets. Bioinformatics 28(24):3290–3297. 10.1093/bioinformatics/bts595
  • 20.Mo Q, Shen R, Guo C, Vannucci M, Chan KS, Hilsenbeck SG (2018) A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. Biostatistics 19(1):71–86. 10.1093/biostatistics/kxx017
  • 21.Mo Q, Wang S, Seshan VE, Olshen AB, Schultz N, Sander C, Powers RS, Ladanyi M, Shen R (2013) Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc Natl Acad Sci U S A 110(11):4245–4250. 10.1073/pnas.1208949110
  • 22.Shen R, Wang S, Mo Q (2013) Sparse integrative clustering of multiple omics data sets. Ann Appl Stat 7(1):269–294. 10.1214/12-AOAS578
  • 23.Rappoport N, Shamir R (2019) NEMO: cancer subtyping by integration of partial multi-omic data. Bioinformatics 35(18):3348–3356. 10.1093/bioinformatics/btz058
  • 24.Nguyen T, Tagett R, Diaz D, Draghici S (2017) A novel approach for data integration and disease subtyping. Genome Res 27:2025–2039. 10.1101/gr.215129.116
  • 25.Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y, Graf S, Ha G, Haffari G, Bashashati A, Russell R, McKinney S, Group M, Langerod A, Green A, Provenzano E, Wishart G, Pinder S, Watson P, Markowetz F, Murphy L, Ellis I, Purushotham A, Borresen-Dale AL, Brenton JD, Tavare S, Caldas C, Aparicio S (2012) The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486(7403):346–352. 10.1038/nature10983
  • 26.The Cancer Genome Atlas Network (2015) Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature 517(7536):576–582. 10.1038/nature14129. http://www.nature.com/nature/journal/v517/n7536/abs/nature14129.html#supplementary-information
  • 27.Cancer Genome Atlas Research Network. Electronic address edsc, Cancer Genome Atlas Research N (2017) Comprehensive and integrated genomic characterization of adult soft tissue sarcomas. Cell 171(4):950–965. e928. 10.1016/j.cell.2017.10.014
  • 28.Network CGAR (2017) Comprehensive and integrative genomic characterization of hepatocellular carcinoma. Cell 169(7):1327–1341. e1323. 10.1016/j.cell.2017.05.046
  • 29.Hmeljak J, Sanchez-Vega F, Hoadley KA, Shih J, Stewart C, Heiman D, Tarpey P, Danilova L, Drill E, Gibb EA, Bowlby R, Kanchi R, Osmanbeyoglu HU, Sekido Y, Takeshita J, Newton Y, Graim K, Gupta M, Gay CM, Diao L, Gibbs DL, Thorsson V, Iype L, Kantheti H, Severson DT, Ravegnini G, Desmeules P, Jungbluth AA, Travis WD, Dacic S, Chirieac LR, Galateau-Salle F, Fujimoto J, Husain AN, Silveira HC, Rusch VW, Rintoul RC, Pass H, Kindler H, Zauderer MG, Kwiatkowski DJ, Bueno R, Tsao AS, Creaney J, Lichtenberg T, Leraas K, Bowen J, Network TR, Felau I, Zenklusen JC, Akbani R, Cherniack AD, Byers LA, Noble MS, Fletcher JA, Robertson AG, Shen R, Aburatani H, Robinson BW, Campbell P, Ladanyi M (2018) Integrative molecular characterization of malignant pleural mesothelioma. Cancer Discov 8(12):1548–1565. 10.1158/2159-8290.CD-18-0804
  • 30.Mo Q, Wan L, Schell MJ, Jim H, Tworoger SS, Peng G (2021) Integrative analysis identifies multi-omics signatures that drive molecular classification of uveal melanoma. Cancers (Basel) 13(24). 10.3390/cancers13246168
  • 31.Mo Q, Li R, Adeegbe DO, Peng G, Chan KS (2020) Integrative multi-omics analysis of muscle-invasive bladder cancer identifies prognostic biomarkers for frontline chemotherapy and immunotherapy. Commun Biol 3(1):784. 10.1038/s42003-020-01491-2
  • 32.Hoadley KA, Yau C, Hinoue T, Wolf DM, Lazar AJ, Drill E, Shen R, Taylor AM, Cherniack AD, Thorsson V, Akbani R, Bowlby R, Wong CK, Wiznerowicz M, Sanchez-Vega F, Robertson AG, Schneider BG, Lawrence MS, Noushmehr H, Malta TM, Cancer Genome Atlas N, Stuart JM, Benz CC, Laird PW (2018) Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. Cell 173(2):291–304 e296. 10.1016/j.cell.2018.03.022
  • 33.Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791
  • 34.Monti S, Tamayo P, Mesirov J, Golub T (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 52(1–2):91–118. 10.1023/A:1023949509487
  • 35.Cantini L, Zakeri P, Hernandez C, Naldi A, Thieffry D, Remy E, Baudot A (2021) Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer. Nat Commun 12(1):124. 10.1038/s41467-020-20430-7
  • 36.Paatero P, Tapper U (1994) Positive matrix factorization - a nonnegative factor model with optimal utilization of error-estimates of data values. Environmetrics 5(2):111–126. 10.1002/env.3170050203
  • 37.Brunet JP, Tamayo P, Golub TR, Mesirov JP (2004) Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A 101(12):4164–4169. 10.1073/pnas.0308531101
  • 38.Berry MW, Browne M, Langville AN, Pauca VP, Plemmons RJ (2007) Algorithms and applications for approximate nonnegative matrix factorization. Comput Stat Data Anal 52(1):155–173. 10.1016/j.csda.2006.11.006
  • 39.Lawson C, Hanson R (1995) Solving least squares problems. SIAM
  • 40.Zhang S, Liu CC, Li W, Shen H, Laird PW, Zhou XJ (2012) Discovery of multidimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res 40(19):9379–9391. 10.1093/nar/gks725
  • 41.Kaliki S, Shields CL (2017) Uveal melanoma: relatively rare but deadly cancer. Eye (Lond) 31(2):241–257. 10.1038/eye.2016.275
  • 42.Amaro A, Gangemi R, Piaggio F, Angelini G, Barisione G, Ferrini S, Pfeffer U (2017) The biology of uveal melanoma. Cancer Metastasis Rev 36(1):109–140. 10.1007/s10555-017-9663-3
  • 43.Collaborative Ocular Melanoma Study G (2001) Assessment of metastatic disease status at death in 435 patients with large choroidal melanoma in the Collaborative Ocular Melanoma Study (COMS): COMS report no. 15. Arch Ophthalmol 119(5):670–676. 10.1001/archopht.119.5.670
  • 44.Lane AM, Kim IK, Gragoudas ES (2018) Survival rates in patients after treatment for metastasis from uveal melanoma. JAMA Ophthalmol 136(9):981–986. 10.1001/jamaophthalmol.2018.2466
  • 45.Gupta MP, Lane AM, DeAngelis MM, Mayne K, Crabtree M, Gragoudas ES, Kim IK (2015) Clinical characteristics of uveal melanoma in patients with germline BAP1 mutations. JAMA Ophthalmol 133(8):881–887. 10.1001/jamaophthalmol.2015.1119
  • 46.Harbour JW, Onken MD, Roberson ED, Duan S, Cao L, Worley LA, Council ML, Matatall KA, Helms C, Bowcock AM (2010) Frequent mutation of BAP1 in metastasizing uveal melanomas. Science 330(6009):1410–1413. 10.1126/science.1194472
  • 47.White VA, Chambers JD, Courtright PD, Chang WY, Horsman DE (1998) Correlation of cytogenetic abnormalities with the outcome of patients with uveal melanoma. Cancer 83(2):354–359
  • 48.Damato B, Dopierala J, Klaasen A, van Dijk M, Sibbring J, Coupland SE (2009) Multiplex ligation-dependent probe amplification of uveal melanoma: correlation with metastatic death. Invest Ophthalmol Vis Sci 50(7):3048–3055. 10.1167/iovs.08-3165
  • 49.Robertson AG, Shih J, Yau C, Gibb EA, Oba J, Mungall KL, Hess JM, Uzunangelov V, Walter V, Danilova L, Lichtenberg TM, Kucherlapati M, Kimes PK, Tang M, Penson A, Babur O, Akbani R, Bristow CA, Hoadley KA, Iype L, Chang MT, Network TR, Cherniack AD, Benz C, Mills GB, Verhaak RGW, Griewank KG, Felau I, Zenklusen JC, Gershenwald JE, Schoenfield L, Lazar AJ, Abdel-Rahman MH, Roman-Roman S, Stern MH, Cebulla CM, Williams MD, Jager MJ, Coupland SE, Esmaeli B, Kandoth C, Woodman SE (2018) Integrative analysis identifies four molecular and clinical subsets in uveal melanoma. Cancer Cell 33(1):151. 10.1016/j.ccell.2017.12.013
  • 50.Ceccarelli M, Barthel Floris P, Malta Tathiane M, Sabedot Thais S, Salama Sofie R, Murray Bradley A, Morozova O, Newton Y, Radenbaugh A, Pagnotta Stefano M, Anjum S, Wang J, Manyam G, Zoppoli P, Ling S, Rao Arjun A, Grifford M, Cherniack Andrew D, Zhang H, Poisson L, Carlotti Carlos G, da Tirapelli Daniela Pretti C, Rao A, Mikkelsen T, Lau Ching C, Yung WKA, Rabadan R, Huse J, Brat Daniel J, Lehman Norman L, Barnholtz-Sloan Jill S, Zheng S, Hess K, Rao G, Meyerson M, Beroukhim R, Cooper L, Akbani R, Wrensch M, Haussler D, Aldape Kenneth D, Laird Peter W, Gutmann David H, Anjum S, Arachchi H, Auman JT, Balasundaram M, Balu S, Barnett G, Baylin S, Bell S, Benz C, Bir N, Black Keith L, Bodenheimer T, Boice L, Bootwalla Moiz S, Bowen J, Bristow Christopher A, Butterfield Yaron SN, Chen Q-R, Chin L, Cho J, Chuah E, Chudamani S, Coetzee Simon G, Cohen Mark L, Colman H, Couce M, D’Angelo F, Davidsen T, Davis A, Demchok John A, Devine K, Ding L, Duell R, Elder JB, Eschbacher Jennifer M, Fehrenbach A, Ferguson M, Frazer S, Fuller G, Fulop J, Gabriel Stacey B, Garofano L, Gastier-Foster Julie M, Gehlenborg N, Gerken M, Getz G, Giannini C, Gibson William J, Hadjipanayis A, Hayes DN, Heiman David I, Hermes B, Hilty J, Hoadley Katherine A, Hoyle Alan P, Huang M, Jefferys Stuart R, Jones Corbin D, Jones Steven JM, Ju Z, Kastl A, Kendler A, Kim J, Kucherlapati R, Lai Phillip H, Lawrence Michael S, Lee S, Leraas Kristen M, Lichtenberg Tara M, Lin P, Liu Y, Liu J, Ljubimova Julia Y, Lu Y, Ma Y, Maglinte Dennis T, Mahadeshwar Harshad S, Marra Marco A, McGraw M, McPherson C, Meng S, Mieczkowski Piotr A, Miller CR, Mills Gordon B, Moore Richard A, Mose Lisle E, Mungall Andrew J, Naresh R, Naska T, Neder L, Noble Michael S, Noss A, O’Neill Brian P, Ostrom Quinn T, Palmer C, Pantazi A, Parfenov M, Park Peter J, Parker Joel S, Perou Charles M, Pierson Christopher R, Pihl T, Protopopov A, Radenbaugh A, Ramirez Nilsa C, Rathmell WK, Ren X, Roach J, Robertson AG, Saksena G, Schein 
Jacqueline E, Schumacher Steven E, Seidman J, Senecal K, Seth S, Shen H, Shi Y, Shih J, Shimmel K, Sicotte H, Sifri S, Silva T, Simons Janae V, Singh R, Skelly T, Sloan Andrew E, Sofia Heidi J, Soloway Matthew G, Song X, Sougnez C, Souza C, Staugaitis Susan M, Sun H, Sun C, Tan D, Tang J, Tang Y, Thorne L, Trevisan Felipe A, Triche T, Van Den Berg David J, Veluvolu U, Voet D, Wan Y, Wang Z, Warnick R, Weinstein John N, Weisenberger Daniel J, Wilkerson Matthew D, Williams F, Wise L, Wolinsky Y, Wu J, Xu Andrew W, Yang L, Yang L, Zack Travis I, Zenklusen Jean C, Zhang J, Zhang W, Zhang J, Zmuda E, Noushmehr H, Iavarone A, Verhaak RGW (2016) Molecular profiling reveals biologically discrete subsets and pathways of progression in diffuse glioma. Cell 164(3):550–563. 10.1016/j.cell.2015.12.028
