Abstract
Summary
Multilayer omics profiling has become a major avenue for understanding complex diseases. We develop NCutYX, an R package for clustering analysis of multilayer omics data. The package and methods jointly analyze multiple layers of omics measurements and effectively accommodate the regulations among them. Based on the normalized cut technique, they systematically conduct a series of analyses, including the clustering of subjects, the clustering of omics measurements and biclustering. The package can be valuable for its timely context, novel methods and comprehensiveness.
Availability and implementation
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
In recent biomedical studies, multilayer profiling has been extensively conducted, collecting data on multiple types of omics measurements, such as copy number variations (CNVs), microRNAs, DNA methylations, mRNA gene expressions (GEs) and protein expressions (Pucher et al., 2019). In the analysis of omics data, clustering of both subjects and variables plays an essential role. Compared to single-layer data, clustering analysis of multilayer data is more challenging because of the higher dimensionality and the regulations between layers. Recent clustering analysis packages/methods for multilayer data include iCluster (Shen et al., 2009), LRAcluster (Wu et al., 2015), CancerSubtypes (Xu et al., 2017), VSClust (Schwämmle and Jensen, 2018) and others. They usually conduct a single type of analysis, in particular subtype identification. Packages/methods that can systematically conduct a variety of analyses are still lacking. We develop the NCutYX package, which can systematically conduct multiple types of clustering analysis, including the clustering of omics measurements, the clustering of subjects and biclustering. Specifically, it contains five main methods, all of which are recently developed and novel. They are all based on the normalized cut (NCut) technique, and this unity in methodological design, construction of the objective function and optimization is highly desirable. Overall, NCutYX can provide a timely and practical tool for the clustering analysis of multilayer omics data.
2 Materials and methods
For completeness, we include ncut to realize the ordinary NCut for single-layer data. It clusters the columns of an input data matrix into disjoint groups by simultaneously maximizing the within-cluster similarity (WCS) and minimizing the across-cluster similarity (ACS). Additionally, NCutYX implements three methods for multilayer omics data with different goals: ancut (Teran Hidalgo et al., 2017), awncut (Li et al., 2018) and muncut (Teran Hidalgo and Ma, 2018). Take GE and CNV as an example. (i) The goal of ancut (assisted NCut) is to cluster GEs more effectively with the assistance of CNVs. It first decomposes GEs into two components, one regulated by CNVs and one ‘residual’ component, and then maximizes the WCS and minimizes the ACS for both components. (ii) The goal of awncut (assisted weighted NCut) is to cluster subjects based on GE patterns with the assistance of CNVs. It computes the WCS and ACS using GEs and CNVs separately, where weighted similarity measures are introduced to eliminate the effects of GEs and CNVs that are irrelevant for clustering. A new penalty is developed to promote coherence between the GE and CNV clusterings. Two tuning parameters, τ and λ, are involved: τ adjusts the relative importance of CNVs in case some CNVs are not informative, and λ balances the NCut measures and the regulations. awncut simultaneously clusters subjects and generates weights for GEs and CNVs, which can indicate their relative importance. (iii) For muncut (multilayer NCut), consider, as an example, data with CNV, GE and protein measurements. Its goal is to identify distinct omics clusters, where each cluster contains multiple layers of omics measurements. It accommodates interconnections both within and across layers. A tuning parameter γ is introduced to allow for different ‘degrees of emphasis’ on the within-layer and across-layer measures.
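To make the WCS/ACS trade-off concrete, the following is a minimal sketch in Python (not the package's R code). It assumes the two quantities are combined as a sum over clusters of the ratio ACS/WCS, which is one common form of a normalized-cut objective; the exact objective used inside NCutYX may differ. The function name `ncut_objective` and the toy similarity matrix are hypothetical.

```python
def ncut_objective(similarity, labels):
    """Sum over clusters of across-cluster / within-cluster similarity.

    similarity: symmetric n x n list of lists; labels: cluster id per item.
    Smaller values mean tighter, better separated clusters.
    Assumption: ACS/WCS ratio form; not taken from the NCutYX source.
    """
    n = len(labels)
    total = 0.0
    for k in set(labels):
        within = 0.0   # WCS: similarity between members of cluster k
        across = 0.0   # ACS: similarity between cluster k and the rest
        for i in range(n):
            if labels[i] != k:
                continue
            for j in range(n):
                if i == j:
                    continue
                if labels[j] == k:
                    within += similarity[i][j]
                else:
                    across += similarity[i][j]
        if within == 0.0:          # singleton cluster: ratio undefined
            return float("inf")
        total += across / within
    return total


# Hypothetical toy data: items 0-1 and items 2-3 form two tight pairs.
S = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
good = ncut_objective(S, [0, 0, 1, 1])   # respects the pair structure
bad = ncut_objective(S, [0, 1, 0, 1])    # splits both pairs
print(good < bad)
```

A partition that keeps similar items together yields a smaller objective value than one that splits them, which is the behavior the optimization exploits.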
Moreover, a new method, mlbncut (multilayer biclustering NCut), is developed in NCutYX, which conducts the biclustering of subjects and multilayer omics measurements. First, consider the clustering of subjects. For each omics cluster, the WCS and ACS are computed with the similarity defined only using omics measurements in this omics cluster. Second, consider the clustering of omics measurements. For each subject group, it constructs the similarity between any pair of omics measurements from different layers only using subjects in this subject group, and further computes the corresponding WCS and ACS to accommodate interconnections across layers. mlbncut maximizes the WCS and minimizes the ACS for all the subject groups and omics clusters.
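The idea that subject similarity is defined only from the measurements in a given omics cluster can be illustrated with a small Python sketch. The Gaussian-kernel similarity, the function name `subject_similarity` and the toy data are assumptions made for illustration; the similarity actually used by mlbncut is described in the Supplementary Materials and may differ.

```python
import math


def subject_similarity(data, feature_idx):
    """Pairwise subject similarity using only the listed feature columns.

    data: list of subject rows; feature_idx: columns in one omics cluster.
    Assumption: similarity = exp(-squared Euclidean distance).
    """
    n = len(data)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq = sum((data[i][f] - data[j][f]) ** 2 for f in feature_idx)
            sim[i][j] = math.exp(-sq)
    return sim


# Hypothetical data: 3 subjects, 3 measurements.
data = [
    [0.0, 0.1, 5.0],
    [0.1, 0.0, -5.0],
    [3.0, 3.1, 5.0],
]
sim_a = subject_similarity(data, [0, 1])  # omics cluster A: columns 0 and 1
sim_b = subject_similarity(data, [2])     # omics cluster B: column 2

# Under cluster A, subject 0 resembles subject 1; under cluster B, subject 2.
print(sim_a[0][1] > sim_a[0][2], sim_b[0][2] > sim_b[0][1])
```

The same pair of subjects can be similar under one omics cluster and dissimilar under another, which is why the subject grouping is evaluated separately for each omics cluster.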
To optimize ancut, awncut and muncut, the simulated annealing (SA) technique is adopted, which involves two parameters: the maximum number of iterations B and the temperature coefficient L. For ncut and mlbncut, the cross entropy technique is adopted, which involves three parameters: B, the number of selected samples N and the sampling percentile q0. The literature suggests that the exact value of B matters little as long as it is large enough. L and N are suggested to be set to 1000 and 500, respectively, and q0 is suggested to be small, for example 0.1.
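The role of B and L in simulated annealing can be sketched as follows in Python. This is a generic illustration, not the NCutYX implementation: the single-label-flip proposal, the acceptance probability exp(-L · log(b + 1) · Δ) and the ratio-type cost are all assumptions, and the names `ratio_cost` and `anneal` are hypothetical. The cross entropy technique used for ncut and mlbncut is not shown.

```python
import math
import random


def ratio_cost(similarity, labels, k):
    """Sum over clusters of across-cluster / within-cluster similarity."""
    n = len(labels)
    total = 0.0
    for c in range(k):
        within = across = 0.0
        for i in range(n):
            if labels[i] != c:
                continue
            for j in range(n):
                if j == i:
                    continue
                if labels[j] == c:
                    within += similarity[i][j]
                else:
                    across += similarity[i][j]
        if within == 0.0:  # empty or singleton cluster: treat as invalid
            return math.inf
        total += across / within
    return total


def anneal(similarity, k, B=500, L=1000, seed=0):
    """Simulated annealing over cluster labels (illustrative schedule)."""
    rng = random.Random(seed)
    n = len(similarity)
    labels = [i % k for i in range(n)]          # deterministic start
    current = ratio_cost(similarity, labels, k)
    best_labels, best_cost = labels[:], current
    for b in range(1, B + 1):
        proposal = labels[:]
        proposal[rng.randrange(n)] = rng.randrange(k)  # relabel one item
        new = ratio_cost(similarity, proposal, k)
        delta = new - current
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the iteration count b and the coefficient L grow.
        if delta <= 0 or rng.random() < math.exp(-L * math.log(b + 1) * delta):
            labels, current = proposal, new
            if current < best_cost:
                best_labels, best_cost = labels[:], current
    return best_labels, best_cost


# Hypothetical toy similarity: two blocks of three items each.
blocks = [0, 0, 0, 1, 1, 1]
S = [[1.0 if i == j else (0.9 if blocks[i] == blocks[j] else 0.1)
      for j in range(6)] for i in range(6)]
best_labels, best_cost = anneal(S, k=2)
print(best_labels, best_cost)
```

With a large L, worse moves are almost never accepted, so the search behaves like a greedy descent; a smaller L permits more exploration early on. The best labeling seen so far is tracked separately, so the returned cost never exceeds the starting cost.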
For all methods, the number of clusters is determined using the GAP statistic with the function clusGap. For awncut, NCutYX provides the function awncut.selection to select τ and λ with the DBI. For muncut, γ can be selected using cross validation. Researchers can also determine the tuning parameters using their own methods. NCutYX is computationally affordable. For example, consider a dataset with 200 subjects and 300 measurements for each layer. With B = 500 and fixed tuning parameters, ancut, awncut, muncut and mlbncut take about 0.47 min, 2.51 min, 2.98 min and 14.21 min, respectively, on a computer with a 2.5 GHz CPU and 8 GB memory. We examine data of various sizes in the Supplementary Materials and observe approximately quadratic time complexity. Parallel computing can potentially be applied to tuning parameter selection and some other tasks to reduce computing time.
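As a hedged illustration of index-based tuning, the sketch below computes a Davies-Bouldin index in Python, assuming that DBI refers to this index. It is not the code behind awncut.selection; the function name `davies_bouldin` and the toy points are hypothetical. Smaller values indicate more compact, better separated clusters, so a tuning grid would typically be searched for the setting with the smallest value.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index for a labeled set of points (lower is better).

    Requires at least two clusters with distinct centroids.
    """
    clusters = sorted(set(labels))
    centroids, scatters = {}, {}
    for k in clusters:
        members = [p for p, l in zip(points, labels) if l == k]
        dim = len(members[0])
        center = [sum(p[d] for p in members) / len(members)
                  for d in range(dim)]
        centroids[k] = center
        # Scatter: mean distance from cluster members to their centroid.
        scatters[k] = sum(math.dist(p, center) for p in members) / len(members)
    total = 0.0
    for i in clusters:
        # Worst-case (largest) similarity ratio against any other cluster.
        total += max(
            (scatters[i] + scatters[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
    return total / len(clusters)


# Hypothetical toy points: two groups separated along the first axis.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
dbi_good = davies_bouldin(points, [0, 0, 1, 1])
dbi_bad = davies_bouldin(points, [0, 1, 0, 1])
print(dbi_good < dbi_bad)
```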
In the Supplementary Materials, we conduct extensive simulations on mlbncut. Four alternatives are also considered: sparse biclustering (SBC), convex biclustering (COBRA), naive spectral clustering (NSC) and naive K-means (NKM). We propose a measure M4 to evaluate the error between the true and estimated clusters. It takes values between 0 and 1, with a smaller value indicating better clustering accuracy. Overall, mlbncut outperforms the four alternatives, with improvements in M4 ranging over 0.002–0.298 (SBC), 0.066–0.325 (COBRA), 0.005–0.134 (NSC) and 0.023–0.134 (NKM). For mlbncut, comparisons with random clustering are further conducted, and the P-values from Wilcoxon tests are all <0.001. We refer to the Supplementary Materials for details on the methods, computation, parameter selection and numerical comparison.
3 Application examples
TCGA prostate adenocarcinoma data (Abeshouse et al., 2015), with both clinical and multilayer omics measurements, are analyzed. Detailed information is provided in the Supplementary Materials. Analysis is conducted on 346 subjects with 316 CNV, 670 GE and 136 protein measurements. Take awncut as an example. First, we use awncut.selection and clusGap to select the tuning parameters, including the number of clusters T, λ and τ. Detailed commands and the values of the tuning parameter selection statistics as functions of the parameters are provided in Supplementary Figure S2. Then, with the optimal tuning parameters, conduct
where Y and X are the GE and CNV data matrices, the two parameters of the SA algorithm are set to their defaults and clust is a list that includes the estimated clustering results Cs and the variable weights ws. Cs is a K-column (0, 1) matrix where each row corresponds to one subject and the column with value 1 indicates the assigned cluster. ws is a vector with the first 670 elements corresponding to GEs and the remaining elements to CNVs, where a larger weight indicates higher importance. Two subject groups are generated with sizes 138 and 208. For visualization, the weighted similarity matrix for the two groups is presented in Supplementary Figure S3 by conducting
where subjects in the same group have larger similarity. To better comprehend the differences between groups, we examine clinical T stage and Gleason score in Supplementary Figure S4. The differences provide some support for the clustering method and package, with P-values from chi-squared tests of 0.009 and 0.002. We also present the estimated weights (ws) for GEs and CNVs in Supplementary Figure S4, from which we can identify the most relevant ones. Analysis of protein and its regulator GE (and other ‘combinations’) can be conducted in a similar manner. Detailed analysis is provided in the Supplementary Materials.
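The post-processing described above, turning a (0, 1) indicator matrix such as Cs into group labels and then comparing the groups against a clinical variable with a chi-squared statistic, can be sketched in Python as follows. This is a language-neutral illustration, not the R commands used in the analysis; all function names and the toy data are hypothetical, and the six-subject example is far too small for the chi-squared approximation to be reliable in practice.

```python
def labels_from_indicator(cs):
    """Convert a (0, 1) indicator matrix (one row per subject) to labels."""
    labels = []
    for row in cs:
        if sorted(row) != [0] * (len(row) - 1) + [1]:
            raise ValueError("each row must hold exactly one 1 and 0s elsewhere")
        labels.append(row.index(1))   # column holding the 1 = assigned cluster
    return labels


def contingency(labels, categories):
    """Cross-tabulate cluster labels (rows) against a categorical variable."""
    rows, cols = sorted(set(labels)), sorted(set(categories))
    table = [[0] * len(cols) for _ in rows]
    for l, c in zip(labels, categories):
        table[rows.index(l)][cols.index(c)] += 1
    return table


def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, df


# Hypothetical toy data: 6 subjects, 2 clusters, a two-level clinical stage.
cs = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
stage = ["T2", "T2", "T2", "T3", "T3", "T2"]
labels = labels_from_indicator(cs)
table = contingency(labels, stage)
stat, df = chi_squared(table)
print(labels, table, stat, df)
```

The statistic and degrees of freedom would then be compared against the chi-squared distribution to obtain a P-value, for example with a statistics library.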
4 Discussion
For multilayer omics data, which are rapidly gaining popularity, NCutYX can conduct a variety of clusterings. This unique multilayer perspective and comprehensiveness may make it more attractive than alternatives. One limitation is that NCutYX does not allow overlapping clusters. Computing time can potentially be reduced by writing certain ‘repetitive’ computations in, for example, C++. These are left to future research.
Funding
This work was supported by National Institutes of Health [CA216017, CA204120]; Bureau of Statistics of China [2018LD02] and Shanghai Pujiang Program [19PJ1403600].
Conflict of Interest: none declared.
References
- Abeshouse A. et al. (2015) The molecular taxonomy of primary prostate cancer. Cell, 163, 1011–1025.
- Li Y. et al. (2018) Assisted gene expression-based clustering with AWNCut. Stat. Med., 37, 4386–4403.
- Pucher B.M. et al. (2019) Comparison and evaluation of integrative methods for the analysis of multilevel omics data: a study based on simulated and experimental cancer data. Brief. Bioinform., 20, 671–681.
- Shen R. et al. (2009) Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics, 25, 2906–2912.
- Teran Hidalgo S.J. et al. (2017) Assisted clustering of gene expression data using ANCut. BMC Genom., 18, 623.
- Teran Hidalgo S.J., Ma S. (2018) Clustering multilayer omics data using MuNCut. BMC Genom., 19, 198.
- Schwämmle V., Jensen O.N. (2018) VSClust: feature-based variance-sensitive clustering of omics data. Bioinformatics, 34, 2965–2972.
- Wu D. et al. (2015) Fast dimension reduction and integrative clustering of multi-omics data using low-rank approximation: application to cancer molecular classification. BMC Genom., 16, 1022.
- Xu T. et al. (2017) CancerSubtypes: an R/Bioconductor package for molecular cancer subtype identification, validation, and visualization. Bioinformatics, 33, 3131–3133.
