Abstract
Background
Advancements in genomic sequencing continually improve personalized medicine, and recent breakthroughs generate multimodal data on a cellular level. We introduce MOSCATO, a technique for selecting features across multimodal single-cell datasets that relate to clinical outcomes. We summarize the single-cell data using tensors and perform regularized tensor regression to return clinically-associated variable sets for each ‘omic’ type.
Results
Robustness was assessed over simulations based on available single-cell simulation methods, and applicability was assessed through an example using CITE-seq data to detect genes associated with leukemia. We find that MOSCATO performs favorably in selecting network features while also shown to be applicable to real multimodal single-cell data.
Conclusions
MOSCATO is a useful analytical technique for supervised feature selection in multimodal single-cell data. The flexibility of our approach enables future extensions on distributional assumptions and covariate adjustments.
Supplementary Information
The online version contains supplementary material available at (10.1186/s12864-022-08759-3).
Keywords: Tensor regression, Single-cell sequencing, Multi-omics, Multimodal, Network analysis
Background
Classic bulk genetic sequencing involves averaging signature levels across all cells. Different sequencers may sequence different types of molecules such as ribonucleic acid (RNA), proteins, DNA methyl groups, etc. Disease progression, therapy success, and other clinical outcomes often vary among individuals suffering from complex diseases [3, 7, 17, 35], and the heterogeneity in their outcomes may be better understood through the intricacies of a patient’s molecular signatures [1, 5, 24, 26]. This has led to an explosive demand for multi-omics which involves integrating multiple types of molecular information in order to have a more Systems Biology approach. For example, in breast cancer patients with resistance to lapatinib therapy, Komurov et al. were able to suggest additional therapy targets by identifying combinations of RNA and proteins responsible for glucose deprivation that was associated with the resilience [18].
Methods for identifying graphs and gene regulatory networks within a single molecular type has been well studied [14, 20], however, different methods should be considered when integrating multiple types of molecular information in order to accommodate the between and within molecular relationships [6]. Each molecular type often contains thousands of features, and integrating them creates a higher dimensional problem with more sophisticated relationships both within and between molecular types. For example, the Decomposition of Network Summary Matrix via Instability (DNSMI) method decomposes a matrix of network strengths by fitting a series of models for the expected relationships across molecular types and with the disease outcome [38]. Supervised sparse canonical correlation analysis (SCCA) attempts to optimize the correlation matrix between molecular types through lasso constrained linear combinations of the features and also eliminates features weakly correlated with the outcome [36].
In bulk sequencing experiments, rare cells or smaller cell-types will be diluted due to the averaging across all cells within the sample. This motivated single-cell sequencing techniques where molecular information could then be sequenced on a cell-by-cell basis. While initial protocols were limited to RNA [23, 29], newer technology may now sequence multiple types of molecular information within each cell, denoted as multi-modal (or multi-omic) single-cell sequencing. For example, CITE-seq simultaneously sequences both cell surface proteins and RNA on each cell of a sample [32]. Although still a growing technology, applications have already been considered using this novel sequencing approach. For example, Kendal et al. utilized CITE-seq technology to compare tendons in healthy individuals to those with tendinopathy [15]. Figure 1 displays an example of single-cell data from each patient.
Many bulk sequencing problems utilize matrix decomposition techniques for various analytical goals where the dimensions correspond to different feature sets (e.g., RNA or proteins). However, single-cell sequencing essentially creates an added dimension for cells that did not previously exist in bulk sequencing. Tensors provide a general framework for organizing high dimensional data, and a matrix is a special case of a two dimensional tensor. Therefore, leveraging tensors for single-cell sequencing is an interesting approach to accommodate the added cellular dimension. For example, ScLRTC and scLRTD utilize tensor decompositions to impute dropouts or missing data found in single-cell sequencing data [25, 28]. scTenifoldNet also utilizes tensor decomposition to identify unsupervised gene regulatory networks in single-cell RNA-seq data [27].
This manuscript proposes a novel method, Multi-Omic Single-Cell Analysis using TensOr regression (MOSCATO), for identifying the superstructure of a semi-directed graph, or network, within multimodal single-cell data that relates to a disease or phenotypic outcome. Current single-cell methods are often either limited to one subject/experiment, limited to just RNA, unsupervised with no clinical outcome associations, or intended for cell clustering and not feature selection. MOSCATO uses regularized tensor regression to address each of those common limitations found in existing methods. “Preliminaries” section describes preliminary tensor concepts, “Results” section presents the MOSCATO results from a series of simulations and a real data application, “Discussion” and “Conclusions” sections discuss future work and limitations, and “Methods” section describes the mathematical details behind MOSCATO. MOSCATO is applicable when two ‘omic’ types of single-cell data are present (e.g., RNA and proteins) with a univariate outcome of interest.
Preliminaries
MOSCATO utilizes regularized tensor regression, and this section describes existing and relevant tensor concepts. “Tensor definitions” section defines tensors and basic tensor operations, and “Tensor regression” section uses the operations and definitions from “Tensor definitions” section to describe tensor regression and regularization techniques.
Tensor definitions
High dimensional data may be organized into a tensor, and a matrix may be thought of as a 2-dimensional tensor. Utilizing familiar tensor notation as provided by Kolda and Bader [16], we let denote a D-dimensional tensor where dimension d contains pd variables for d=1,...,D. For example, D=1 denotes a vector and D=2 denotes a matrix. Many mathematical operations for tensors build on mathematical operations used in matrices. For example, Definition 1 describes outer products between D vectors to create a D-dimensional tensor, where ∘ denotes the Khatri-Rao product.
Definition 1
Let b1,b2,...,bD, denote vectors where . Then the outer product of those vectors, b1∘b2∘⋯∘bD, creates a D-dimensional tensor of size p1×p2×⋯×pD, and each (i1,...,iD)th element equals .
It may also be convenient to reorganize a tensor into a lower dimensional space by vectorizing or mode-d matricizing the tensor. Definitions 2 and 3 describe these reorganization techniques.
Definition 2
Let denote a D-dimensional tensor. Then may be reorganized into a column vector through the vec operator , where the element of corresponds to the (i1,...,iD)th value in .
Definition 3
Let denote a D-dimensional tensor. Then may be reorganized into a matrix through the mode-d matricization operator , where the (id,j)th element within equals the (i1,...,iD)th value within and .
Similarly as done in matrix operations, it may be of interest to multiply two tensors with comparable dimensions via inner products, as described in Definition 4.
Definition 4
Suppose two tensors and . The inner product may be obtained by
1 |
Furthermore, it may be of interest to multiply a matrix along the dth dimension of a tensor through d-mode products as described in Definition 5.
Definition 5
Suppose a tensor and a matrix . The d-mode product between and U may be expressed as where the (i1,...,id−1,j,id+1,...,iD)th value equals .
The rank of a matrix denotes the maximum number of linearly independent rows/columns in the matrix. Building on those concepts, the rank of a tensor may be thought of as the maximum number of vectors that can be multiplied and added to replicate the tensor, as shown in Definition 6.
Definition 6
Assume a D-dimensional tensor . has rank R if no smaller R exists such that
2 |
where Definition 7 defines and for some set of Zd matrices for d=1,...,D.
Definition 7
Let for d=1,...,D. Furthermore, let denote the rth column of Zd. Then
3 |
The true rank of a tensor may often be difficult to determine due to the high dimensionality, motivating decomposition techniques that estimate vectors for a given rank that approximate the tensor, as shown in Definition 8.
Definition 8
Assume a D-dimensional tensor . A rank-R CP decomposition aims to use R vector sets (one vector per dimension) to approximate by
4 |
where .
Kolda and Bader present additional details on decomposition and other tensor operations [16].
Tensor regression
Building on notation covered in “Tensor definitions” section, this section will briefly describe tensor regression that was originally presented by Zhou et al. [39]. Tensor regression builds on Generalized Linear Model (GLM) concepts, where we have an outcome y for each subject that follows some exponential family with link function g(·) and mean μ. Classic GLM uses univariate independent variables to predict the outcome, but tensor regression extends those concepts by additionally allowing a predictor tensor. This is accomplished by multiplying the dimensions of the tensor through coefficient vectors that convert the dimensions to a univariate value that may then predict the outcome.
Figure 2 shows a simple example with rank-1 tensor regression and D=2 dimensions for the predictor tensor. Each dimension in the predictor tensor corresponds to a different feature set, and each subject will contain its own D=2 predictor tensor to be used to predict their univariate outcome y.
Imaging modalities such as magnetic resonance imaging (MRI) present an excellent application for tensor regression models. For example, a particular brain MRI slice may be of interest for an outcome, say, disease status. This image slice can be expressed in an array structure (i.e., a 2-dimensional tensor) where the value at a given pixel denotes the MRI signal associated with that location. Referring to the model shown in Fig. 2, the MRI image slice would correspond to the predictor tensor, , and the estimated coefficients would indicate regions of interest associated with the outcome.
Tensor regression may also involve higher rank problems with the more formal representation
5 |
where U contains the univariate independent variables (e.g., age or sex). A rank-R tensor regression estimates R coefficient vectors for each dimension in the predictor tensor, but for simplicity, in this manuscript we will assume rank-1. The Block Relaxation Algorithm is used to estimate the coefficient vectors with additional details described by Zhou et al. [39]. Zhou et al. [39] also claim that regularization in tensor regression may be accomplished by simply imposing constraints when fitting the models on each dimension.
If one naively vectorized the predictor tensor and fit a classic GLM model, it would require estimating coefficients for the tensor. This approach would not only ignore the inherent structure of the data by treating each element in the tensor as independent with no distinction between the dimensions, it would also attempt to estimate many more coefficients compared to coefficients in tensor regression. Consequently, this naive approach may be unrealistic in high dimensional problems given a typically much smaller sample size. This reduction in parameters highlights the benefits of tensor regression. However, it is subject to limitations such as uniqueness and identifiability. For example, suppose a rank-1 model with . Then for any scalar τ, we could derive an equally optimal model where and . Additionally, the Block Relaxation Algorithm may converge to a local maxima as opposed to the global maxima when attempting to maximize the log likelihood. Zhou et al. [39] describe measures that may be used to check whether these issues are present in the estimated tensor model.
Results
MOSCATO was applied to both simulated data and real data. Results from the simulations are presented in “Simulation results” section and results from the real data application are presented in “Real data application results” section. To our knowledge there are no appropriate methods that easily match the goals of MOSCATO. Nevertheless, to create a sensible competitor we employed a reasonable, but ad hoc, alternative method. MOSCATO was benchmarked against a competing feature selection technique using area under the receiver operating curve (AUC). In short, AUC methods select features that best predict the outcome according to estimated receiver operating curves (ROCs). AUC selections were based on either Bonferroni adjusted p-values or whether the AUC was less than 0.3 or greater than 0.7.
Simulation results
Two ‘omic’ types denoted as and , along with an outcome, were simulated for each subject, and simulations were performed under 9 different settings accounting for differing number of cells per subject and level of technical noise. For additional details on the simulations, refer to “Simulation details” section.
Figure 3 displays the tuned maximum network size for and , denoted as maxG and maxX. In a perfect execution of MOSCATO, maxG would be tuned to 10 and maxX would be tuned to 15 due to known network sizes in the simulated data. All simulations tuned maxG and maxX to values greater than the true number of network features, regardless of the simulation setting. For maxX, smaller values were tuned as technical noise decreased and number of cells increased, but this trend did not persist when tuning maxG.
Figure 4 displays the sensitivity and specificity across the 9 simulation settings for data types and for the 3 different feature selection methods (MOSCATO, AUC using p-values, and AUC using cutoffs). Sensitivity measures the probability that network features are properly included in the selections, and specificity measures the probability that non-network features are properly excluded from the selections. As shown in Fig. 4, the sensitivity and specificity under MOSCATO generally improve as the number of cells increases per subject and as technical noise decreases. Conversely, the specificity declines for AUC selections using p-values as the number of cells increases and the technical noise improves. AUC based on p-values not only produced counterintuitive results where the performance actually degraded as the technical noise reduced, it also selected too many features such that the results were not remotely sparse. This explains that while the sensitivity remained high for all simulations, this is simply due to the fact that nearly all features were selected using that criteria. AUC selections based on cutoffs resulted in opposite issues where it did not select nearly any features and produced poor sensitivity with nearly perfect specificity.
MOSCATO reproduced the superstructure of the graph (i.e., network) reasonably well with generally high sensitivity and also limited false positives present. This is especially true when comparing against approaches using the AUC. However, when it is expected that high levels of technical noise are present with limited cells per subject, caution should be used when considering MOSCATO.
Real data application results
Leukemia is a broad disease that encompasses all cancers that occur in blood cells. The 5-year survival rate is about 65% according to data from 2011 to 2017, and about 459,000 people were living with leukemia in 2018 in the United States [12]. Leukemia may be classified based on progression speed where chronic denotes slow progression and acute denotes aggressive progression. In addition to cancer progression, leukemia may be subtyped by the type of cells where the cancer forms. For example, lymphocytic leukemia describes cancer developing from white blood cells and myelogenous leukemia describes cancer developing in blood forming cells within the bone marrow. Although rare, one may have both lymphocytic and myelogenous leukemia which is denoted as mixed phenotype leukemia.
To assess MOSCATO in practice, we applied it to real single-cell data with multiple data types. Limited data is currently available due to the infancy of multimodal single-cell sequencing, so data across multiple studies that all used CITE-seq protocols [32] on bone marrow / peripheral blood cells were used. CITE-seq produces cellular level RNA information and cell surface protein abundance via antibody derived tags (ADT) simultaneously, and our outcome of interest will be leukemia versus healthy patients. Our goal will be to apply MOSCATO to this data in order to obtain a subset of RNA and ADT features associated with leukemia. After combining the data across studies, we have 14 healthy patients and 7 patients with leukemia. Of the 7 leukemia subjects, 1 had chronic lymphocytic leukemia while the other 6 had mixed-phenotype acute leukemia. The studies used and further details are described in “Real data application details” section.
The data was pre-processed and integrated using Seurat version 4.0.3 [10], and Fig. 5 displays the Uniform Manifold Approximation and Projection (UMAP) [2] plots from the cell clusters established from the integrated data across all 21 subjects. The Seurat workflow normalizes the scRNA-seq data, identifies the highly variable genes, scales the data, performs reciprocal principal component analysis using the highly variable genes, finds the nearest neighbors for each cell, and clusters the cells. After integration, 8 of the cell clusters (clusters 0, 1, 2, 3, 4, 5, 6 and 14 from Fig. 5), 17991 RNA features, and 5 ADTs (CD3, CD4, CD14, CD19, and CD56) were measured across all subjects.
In this application after pre-processing, each subject contained 8 scRNA-seq matrices (one for each cell cluster) where each matrix contained 17991 columns and rows corresponding to the cells within that cluster. In addition to the scRNA-seq datasets, each subject contained another 8 matrices (one for each cell cluster) for their single-cell ADT abundance where each matrix contained 5 columns and rows corresponding to the cells within that cluster (same cells as in the scRNA-seq datasets). Finally, each subject had a binary disease status (healthy or leukemia). The feature selection techniques were applied to the 8 pairs of matrices separately (i.e., each cell cluster treated independently), resulting in 8 different MOSCATO and AUC selection results. The first step of MOSCATO estimates the correlation between each subject’s scRNA-seq and single-cell ADT features, resulting 21 (i.e., one for each subject) correlation matrices, each of size 17991×5.
Analysis was performed on each cell cluster separately, and the selections were later reviewed for biological relevancy. The rest of this section will focus on the results obtained from cell cluster 0, but the complete results from MOSCATO, AUC selections based on p-values, and AUC selections based on the AUC cutoffs for each of the 8 cell clusters are provided in Additional file 2. In summary, the number of features selected by MOSCATO and AUC cutoffs were similarly sized, but AUC selections based on p-values resulted in nonsparse feature sets. DAVID [11, 31] was used to analyze and organize the gene ontology information from the RNA gene selections. DAVID clusters genes based on common annotations and functional information, and DAVID only allows clustering on gene sets with less than 3000 genes. Since the AUC selections based on p-values resulted in RNA selections well over this 3000 restriction for most cell clusters, we only focused on gene clusters from the MOSCATO and AUC cutoff selections.
MOSCATO selected 96 RNA features within cell cluster 0, and DAVID identified 2 gene clusters. The strongest gene cluster (based on highest enrichment score) used 7 of these 96 genes. This was the most notable gene cluster which included the genes CD3D, CD3E, and CD3G which are part of the KEGG pathway for Human T-cell Leukemia Virus type 1 (HTLV-I) infection (KEGG pathway hsa05166), and HTLV-I infections are a known risk factor for developing adult T-cell leukemia/lymphoma [13]. Additionally, the genes CD2, CD3D, CD3E, CD3G, and CD4 within this gene cluster belong to the KEGG pathway for Hematopoietic cell lineage (hsa04640) which assists in producing blood cells. Given that the disease of interest in this application is based on leukemia (i.e., cancer in tissues which produce blood), it is reassuring that this gene cluster contains genes associated with blood production. Also, genes CD2, CD3D, CD3E, CD3G, KLRB1, CD247, and CD4 within the gene cluster are associated with the gene ontology for the cell surface receptor signaling pathway (GO:0007166) which makes sense given that MOSCATO summarized the single-cell data between RNA and cell surface proteins (i.e., ADTs). Of the 5 cell surface proteins considered, MOSCATO selected the ADT’s CD3 and CD4 for this cell cluster. Figure 6 displays the RNA features within the strongest functional DAVID cluster, along with the ADT selections. No features selected by MOSCATO under cell cluster 0 were selected by AUC cutoffs, and although 83 of the 96 MOSCATO RNA selections were also selected by AUC based on p-values, the AUC selections based on p-values selected nearly half of all RNA features considered. AUC selections based on cutoffs selected 11 RNA features and 0 ADTs for cell cluster 0, and DAVID was not able to discern any gene clusters based on the genes selected.
In conclusion, the selections made by MOSCATO under cell cluster 0 resulted in a concise gene cluster discovered by previously known annotations and functionalities. These functionalities not only related to the disease of interest (i.e., leukemia), it also related to cell surface functionalities. This highlights that MOSCATO not only considers supervised information (e.g., disease versus no disease), it also considers the relationships across data types (e.g., RNA and cell surface proteins). Performing feature selection using solely AUC not only neglects the between data type relationships, it also was not able to return a concise set of genes that related to leukemia. Furthermore, it is arbitrary to select pre-specified AUC cutoffs for selections, and the p-value selections did not produce sparse solutions. Also, AUC selections were based directly on normalized expression/sequenced values, but MOSCATO performs selections based on the similarities between data types. This possibly helps reduce batch effects found in individual subjects by standardizing each value between -1 and 1.
Discussion
MOSCATO was performed on both simulated and real data. The simulations produced fairly accurate results with a sensitivity and specificity close to 1 for many of the simulations, although MOSCATO did not perform as well in situations with high technical noise or low cell counts per subject. Since MOSCATO calculates its predictor tensor to be the estimated correlation of the datasets per individual, it is unsurprising that cell count would contribute to more accurate correlation estimates. Although not used in this manuscript, covariate adjustments could easily be made in MOSCATO by simply adding them to the tensor regression.
The real data application involved 21 subjects, and given the highly variable nature of single-cell sequencing data, the features selected should be interpreted with some caution. Furthermore, since the data was collected across different studies, only 5 proteins were used that merged across all 21 subjects. As single cell technologies become more available and cost feasible, future experiments should be considered with more consistent proteins across experiments and larger sets of subjects.
MOSCATO currently assumes all cells come from the same cell type, but it might be more interesting to accommodate situations in which multiple cell types are present. We are currently working on higher dimensional applications of MOSCATO in a future manuscript. A reasonable solution could incorporate another step which estimates a similarity matrix for each cell type and includes another dimension to the predictor tensor for cell type. Additionally, MOSCATO was only tested on experiments with two data types, but extensions should be considered in situations where more than two data types are present. This could be accomplished by extending the predictor tensor to accommodate a dimension per data type extracted, although higher dimensional summary measures would need to be considered.
MOSCATO was only tested using Pearson’s correlation as the summary measure to construct the predictor tensor , but other summaries should be considered. For example, mutual information or inverse covariance matrices might be interesting avenues to explore for future consideration. Graphical lasso [8] is a popular technique to obtain both the nodes and edges of a graph by applying lasso regressions to estimate the inverse of the covariance matrix, and this estimated inverse of the covariance matrix could be explored as the predictor tensor input for MOSCATO.
This study only used rank-1 tensors, but higher ranks could be considered that may unveil other patterns and networks available in the data. The proper rank could be obtained using typical model selection criteria such as cross validation, although interpreting the results may not be as straight forward.
Zhou et al. discuss hypothesis testing via asymptotic normality results [39], and these hypothesis testing schemes could be explored to assess network strength. For example, one could perform a global test whether the model coefficients equal zero for the network selections.
Although MOSCATO returns the superstructure for a graph, it does not provide information on directionality and does not currently consider directional consistency. For example, suppose two genes are positively correlated with each other, but they contain conflicting correlative directions with the outcome. This inconsistency in directionality makes interpreting the results more difficult, and may be mitigated by additionally tuning based on optimizing balance. This concept has been considered in bulk level analyses with a single ‘omic’ type [34], and it could be considered for future work for multimodal data.
Conclusions
MOSCATO is a statistical technique for analyzing multimodal single-cell data where the study goals are to identify which features within the ‘omic’ datasets relate to each other and a clinical outcome. MOSCATO was found to perform favorably through a series of simulations and a real data application. Multimodal single-cell data continues to grow in popularity, and feature selection techniques such as MOSCATO may be critical to fully leverage the potential in using the data for highlighting biological markers in complex diseases.
Methods
Multimodal network analysis using tensor regression
In classic bulk sequencing, the data contains one record per subject. Supposing n subjects with two data types, bulk sequencing studies would contain two data sets (i.e., a dataset for each data type), and . In single-cell sequencing, there are multiple records per subject where each row corresponds to a cell within the subject. Consequently, for a given subject i with two data types, their single-cell data would contain two datasets and , where mi denotes the number of cells for subject i. Since the number of cells typically differs across subjects (i.e., where i≠i′), we organize each subject’s data into separate datasets (i.e., and ) as opposed to organizing the input data directly into a 3-dimensional tensor with a dimension for cells. It should also be noted that each subject’s data often consists of thousands of cells, and concatenating the single-cell data in long format may be computationally inefficient. Furthermore, we assume each subject i contains a univariate outcome yi for i=1,...,n, and we may express the outcomes in a vector as y=[y1,y2,...,yn]T. For simplicity, we may denote the two data types as and without the i subscript, although as described previously, each subject’s single-cell data will contain separate matrices for the data types as opposed to expressing data in long format as found in bulk sequencing.
MOSCATO aims to identify a subset of features within and that relate to each other and the outcome. In graphical modelling terms, MOSCATO identifies the superstructure of a semi-directed graph with undirected nodes involving features within and with some path directed to the outcome y. MOSCATO accomplishes this by imposing elastic net constraints on a tensor regression model [40].
Similarly as in the MRI image example from “Tensor regression” section, multimodal single-cell data contains multi-dimensional data per subject (i.e., features within and features within ) with a univariate outcome. This motivates the use of tensor regression for multimodal single-cell data. Additionally, tensor regression not only efficiently accommodates multi-dimensional input data with a univariate outcome, it also handles regularization techniques and allows for additional covariate adjustments (e.g., age, sex, race, etc.). However, tensor regression requires equivalent dimensions for each subject’s input tensor, and single-cell sequencing experiments nearly always return differing number of cells. Therefore, to standardize the dimensions across each subject, the first step of MOSCATO involves collapsing the cellular dimension all together by estimating a correlation matrix between their data type matrices,
6 |
where letting xij denote the jth feature in and gik denote the kth feature in for the ith subject. Although many summary matrices could be considered, such as the inverse of the covariance matrix or mutual information, Pearson’s correlation provides a simple interpretation while also standardizing the values within between -1 and 1.
In broad strokes, MOSCATO applies a rank-1 tensor regression on to determine the elements in each ‘omic’ type that are associated with each other and with the outcome. Specifically, using the tensors for i=1,...,n to estimate the coefficients, a tensor regression model similar to (5) and depicted in Fig. 2 will be fit with elastic net constraints. The elastic net constraint works to balance by a weighted average between an L1-norm and L2-norm, where the L1-norm truncates small coefficients to zero and the L2-norm better handles highly correlated features. In summary, the elastic net constraint typically denoted as in the classical GLM setting will now involve
7 |
where denotes the coefficient vector for denotes the coefficient vector for denotes the coefficient for the jth feature in , and denotes the coefficient for the kth feature in . The hyperparameters αX∈[0,1] and αG∈[0,1] denote the weights to put on the L1-norm constraints for and , respectively. The hyperparameter λ in the classical GLM setting denotes the overall weight to put on the constraint, and it may be any positive number from 0 to infinity. Since tuning λ to the proper range may be difficult due to the nontrivial parameter space, we use maxX and maxG instead to denote the maximum number of non-zero values within βX and βG, respectively. This reparameterization of the constraints drastically simplifies the hyperparameter space and subsequent tuning, as described in “Tuning on stability” section.
For some fixed αX,αG,maxX, and maxG, the tensor regression model will be fit to obtain and . Due to the L1-norm truncating small values to zero from the elastic net constraint, only a subset of values within and will be nonzero. Thus, final network features within data type will be the set , and final network features within data type will be the set . Algorithm 1 summarizes the steps to MOSCATO.
Tuning on stability
MOSCATO assumes fixed values for αX,αG,maxX, and maxG which will be tuned using an extension of the Stability Approach to Regularization Selection (StARS) method [22]. Tuning on accuracy, such as by cross validation or Bayesian information criterion, tends to result in overly dense solutions in high dimensional problems with results that are not reproducible [22]. The most extreme scenarios for stability are perfectly stable results from selecting no features (i.e., completely sparse) or selecting all features (i.e., no sparseness). Building on that logic, StARS initializes the parameters to the sparsest solution and gradually relaxes the sparsity until some instability threshold ϕ is met. Instability is estimated based on subsamples from the data by performing the feature selection under each subsample and summarizing the consistency in results across different subsamples. Although ϕ may initially be thought of as an arbitrary cutoff between 0 and 1, it may be easily interpreted as the amount of allowable instability. In essence, a smaller ϕ would imply a more sparse but stable result. The motivation behind allowing some instability as opposed to fixing ϕ to 0 is to allow some noise to be selected in order to ensure that no true signal is missed in the final feature selection. In statistical terms, this means that StARS prioritizes reducing type II errors.
Although the StARS method was developed for tuning a single sparsity parameter, the four hyperparameters αX,αG,maxX, and maxG will be tuned using similar logic. Focusing on tuning one dimension at a time, we initialize to a sparse solution with some small maxX. Fixing maxX, we estimate the instability for a range of αX values between 0 and 1. Select the αX value resulting in the lowest instability, and if that instability is less than ϕ, increase maxX and repeat the process. This continues until the ϕ instability is hit to select maxX and αX. Using the highest maxX and corresponding optimal αX with instability less than ϕ, a similar process is then repeated for tuning maxG and αG. In this case with two dimensions, one for and another for , we first tune αX and maxX for some fixed αG and maxG, and then use and when tuning αG and maxG. The initial fixed αG and maxG may be kept large, suppose αG=0.5 and maxG=floor(p/2) such that an overly sparse does not impact the stability on for the first dimension of tuning. This is summarized in Algorithm 2.
Simulation details
Theoretical details of simulations
Splatter [37] is a popular technique to simulate single-cell RNA-seq (scRNA-seq) data, and it has been shown to mimic distributions from real scRNA-seq data. The general Splatter schematic initiates by simulating a gene mean and then adjusts the gene mean to account for variation in outliers, library size, and dispersion. It then simulates the “observed” scRNA-seq values through a Poisson distribution using the adjusted gene mean, and the values are then randomly truncated to zero to replicate dropouts. Although Splatter realistically portrays scRNA-seq distributions, a few extensions were required in order to simulate multimodal single-cell data with supervised gene networks.
To accomplish this, we leverage latent structures for multimodal supervised networks detailed in Zhang et al. [38]. Figure 7 demonstrates the expected causal relationships within the data containing a supervised multimodal network. In summary, we expect there to be a subset of features within and that relate to the outcome but not with each other; a subset of features within and that relate to each other but not with the outcome; a subset of features within and that are independent of each other and the outcome; and finally the target subgroups of the analysis, subset of network features within that relate to the network features within that ultimately relates to the outcome. These expected relationships among these latent components may be represented through a covariance matrix where the off diagonals will be non-zero where relationships exist and zero where independence is expected. More details are provided in the Additional file 1.
To extend Splatter for supervised multimodal networks, we will first simulate the latent values (S, H, X, G, G′,X′, and y) for each subject using a multivariate normal distribution with zero mean and the latent variables’ covariance matrix. Since the latent values were simulated from a multivariate normal, they will each be normally distributed with mean 0. The Splatter simulation assumes the initial gene mean comes from a gamma distribution, so we take the square of the latent value divided by its standard deviation. By doing so, the transformed latent values then become gamma distributed with shape equal to 1/2 and scale equal to 2 times its variance. These transformed latent values will be used as the initial gene means for each subset of features, and the latent value for y will be used as the mean to simulate an observed outcome from a normal distribution. Initial gene means for the noise subset of features are randomly simulated independently. Additional theoretical details may be found in the Additional file 1.
Dispersion is adjusted on a subject level, library size is adjusted on a cellular level within a subject, outliers are adjusted on a feature level within a subject, and dropouts are accounted for on a cellular/feature level within a subject. The Splatter simulation is performed using the transformed gene means (combining all latent components within and ), and then the features are later separated by data type for each subject.
Simulation specifications
MOSCATO was applied to a series of simulations using the techniques described in the previous section. Simulations were performed under 9 different settings accounting for the average number of cells per subject (250, 500, or 1000) and amount of technical noise (low, moderate, or high). Simulations were replicated 50 times under each simulation setting and each simulation had 100 subjects. contained 1440 total features where only 10 belonged to the network, and contained 1555 total features where only 15 belonged to the network. Table 1 describes the total number of features contained within each of the latent variables described in Fig. 7.
Table 1.
Latent Variable | # Features |
---|---|
H | 15 |
G | 10 |
G′ | 15 |
Noise in | 1400 |
S | 20 |
X | 15 |
X′ | 20 |
Noise in | 1500 |
The table summarizes the number of features within each latent variable used in the simulations. These latent variables are displayed in Fig. 7. H describes the subset of features within that relate to the outcome but not any features within , G describes the subset of features within belonging to the network, and G′ describes the subset of features within related to some features within but not the outcome. S, X, and X′ describe similar subsets of features within
For tuning the hyperparameters, we set gridα={0.2,0.5,0.7},ϕ=0.02,R=50, and c=50. and were initially set to 5, although this number may be increased in order to reduce runtimes as long as the instability remains below ϕ for its initialization. Ideally, MOSCATO would tune maxG to 10 and maxX to 15 in order to select the proper network size according to Table 1, but this will be unlikely due to the mechanics behind the StARS tuning method described in “Tuning on stability” section which prioritizes reducing type II error over type I error.
In addition to applying MOSCATO, we also applied competing methods using the AUC. Seurat provides a popular single-cell sequencing workflow, and following similar methods used by the authors of Seurat [10], this AUC approach was done using the presto version 1.0.0 R package [19]. Selections using AUC were performed using two different criteria. One criterion was based on whether the Bonferroni adjusted p-value was less than the nominal significance level (set to 0.05) under the null hypothesis that the AUC equals 0.5. Additionally, selection criteria using cutoff values where features with an AUC either less than 0.3 or greater than 0.7 were selected. Since AUC requires categorical outcomes, we use the median of the outcome to binarize it (i.e., if the outcome is less than the median then recode the outcome as ‘0’, otherwise if the outcome is greater than the median then recode as ‘1’).
Real data application details
Data was combined across multiple studies for healthy and leukemia subjects. The studies used to obtain the data are summarized in Table 2. Only non-perturbed, baseline cells were considered.
Table 2.
Seurat version 4.0.3 [10] was used to normalize the data, cluster cells, and integrate the cell types across subjects.
We applied MOSCATO to each of these cell clusters separately, each with gridα={0.01,0.05,0.1}. Since Seurat clusters cells by maximizing correlation between features, the multicollinearity across features would consequently be high and require more weight on the L2-norm (i.e., lower α within the elastic net constraint).
Due to the modest sample size, we tuned the hyperparameters using a subsampling size of 20 (out of 21 total subjects) to estimate stability based on a “leave one out” scheme. Although StARS suggests setting the instability threshold ϕ to 0.05 for most applications [22], in this application with a small sample size (i.e., only 21 subsamples) and large number of RNA features (i.e., 17991 variables), the estimated instability under sparsest solutions will be much smaller than 0.05. For example, suppose all 21 subsamples select completely disjoint feature sets, but due to the high number of variables in consideration, many variables are consistently excluded from any selections across all of the subsampled results. Since StARS considers both consistency in selections and consistency in exclusions, the estimated instability will be quite small due to the consistency in exclusions despite if the small number of features selected may be completely disjoint across all subsamples. Therefore, ϕ was set to 0.001.
To compare selections with another method, the MOSCATO results were compared to selections based on the AUC. The AUC approach was done using the presto version 1.0.0 R package [19]. Similarly as was done for MOSCATO, the AUC feature selections were performed on each cell cluster separately. Feature selections were made under two different selection criteria for AUC: if the Bonferroni adjusted p-value was less than 0.05 under the null hypothesis that the AUC equals 0.5 or whether the AUC was less than 0.3 or greater than 0.7. Since the p-value would likely be small in situations with many cells (i.e., large sample sizes producing sensitive p-values for miniscule AUC deviations from 0.5), both a p-value approach and an approach based on the AUC values were considered.
The real data application may be reproduced by following the steps provided at https://github.com/lorinmil/MOSCATOLeukemiaExample.
Supplementary Information
Acknowledgments
The authors would like to thank Dr. David Tritchler for providing valuable feedback for the work presented in this manuscript.
Abbreviations
- ADT
Antibody Derived Tag
- AUC
Area Under the receiver operating Curve
- DNSMI
Decomposition of Network Summary Matrix via Instability
- GLM
Generalized Linear Model
- HTLV-I
Human T-cell Leukemia Virus type 1
- MOSCATO
Multi-Omic Single-Cell Analysis using TensOr regression
- MRI
Magnetic Resonance Imaging
- RNA
Ribonucleic Acid
- SCCA
Sparse Canonical Correlation Analysis
- scRNA-seq
single-cell RNA-seq
- StARS
Stability Approach to Regularization Selection
- UMAP
Uniform Manifold Approximation and Projection
Authors’ contributions
LT-M performed the simulations, real data application, and was a major contributor in drafting the manuscript. JM was also a major contributor in drafting the manuscript and reviewing methods. All authors read and approved the final manuscript.
Funding
LT-M was funded under the Empire Clinical Research Investigator Program (ECRIP) grant (PI: Dr. Ekaterina Noyes). The funding body played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Availability of data and materials
All code in this manuscript was done using R version 4.1.0 [30]. Code to perform MOSCATO and replicate the simulations may be found publicly on GitHub at https://github.com/lorinmil/MOSCATO.
The real data application utilized publicly available data across three studies (assession numbers ERP124005, GSE152469, and GSE139369). The data for ERP124005 was downloaded through the Human Cell Atlas at https://data.humancellatlas.org/explore/projects/efea6426-510a-4b60-9a19-277e52bfa815/project-matrices, the data from GSE152469 was downloaded through the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4616298, and the data from GSE139369 was downloaded through NCBI’s Gene Expression Omnibus at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE139369. Steps and code to replicate the real data application may be found at https://github.com/lorinmil/MOSCATOLeukemiaExample.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Lorin M. Towle-Miller, Email: lorinmil@buffalo.edu
Jeffrey C. Miecznikowski, Email: jcm38@buffalo.edu
References
- 1.Balmain A, Gray J, Ponder B. The Genetics and Genomics of Cancer. Nat Genet. 2003;33(3):238–44. doi: 10.1038/ng1107. [DOI] [PubMed] [Google Scholar]
- 2.Becht E, McInnes L, Healy J, Dutertre C-A, Kwok IW, Ng LG, Ginhoux F, Newell EW. Dimensionality Reduction for Visualizing Single-Cell Data Using UMAP. Nat Biotechnol. 2019;37(1):38–44. doi: 10.1038/nbt.4314. [DOI] [PubMed] [Google Scholar]
- 3.Benatar D, Bondmass M, Ghitelman J, Avitall B. Outcomes of Chronic Heart Failure. Arch Intern Med. 2003;163(3):347–52. doi: 10.1001/archinte.163.3.347. [DOI] [PubMed] [Google Scholar]
- 4.Cadot S, Valle C, Tosolini M, Pont F, Largeaud L, Laurent C, Fournie JJ, Ysebaert L, Quillet-Mary A. Longitudinal CITE-Seq Profiling of Chronic Lymphocytic Leukemia During ibrutinib Treatment: Evolution of Leukemic and Immune Cells at Relapse. Biomark Res. 2020;8(1):1–13. doi: 10.1186/s40364-020-00253-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Cookson W, Liang L, Abecasis G, Moffatt M, Lathrop M. Mapping Complex Disease Traits with Global Gene Epression. Nat Rev Genet. 2009;10(3):184–94. doi: 10.1038/nrg2537. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Creixell P, Reimand J, Haider S, Wu G, Shibata T, Vazquez M, Mustonen V, Gonzalez-Perez A, Pearson J, Sander C, et al. Pathway and Network Analysis of Cancer Genomes. Nat Methods. 2015;12(7):615. doi: 10.1038/nmeth.3440. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Elissen AM, Steuten LM, Lemmens LC, Drewes HW, Lemmens KM, Meeuwissen JA, Baan CA, Vrijhoef HJ. Meta-Analysis of the Effectiveness of Chronic Care Management for Diabetes: Investigating Heterogeneity in Outcomes. J Eval Clin Pract. 2013;19(5):753–62. doi: 10.1111/j.1365-2753.2012.01817.x. [DOI] [PubMed] [Google Scholar]
- 8.Friedman J, Hastie T, Tibshirani R. Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics. 2008;9(3):432–41. doi: 10.1093/biostatistics/kxm045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Granja JM, Klemm S, McGinnis LM, Kathiria AS, Mezger A, Corces MR, Parks B, Gars E, Liedtke M, Zheng GX, et al. Single-Cell Multiomic Analysis Identifies Regulatory Programs in Mixed-Phenotype Acute Leukemia. Nat Biotechnol. 2019;37(12):1458–65. doi: 10.1038/s41587-019-0332-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Hao Y, Hao S, Andersen-Nissen E, Mauck III WM, Zheng S, Butler A, Lee MJ, Wilk AJ, Darby C, Zagar M, Hoffman P, Stoeckius M, Papalexi E, Mimitou EP, Jain J, Srivastava A, Stuart T, Fleming LB, Yeung B, Rogers AJ, McElrath JM, Blish CA, Gottardo R, Smibert P, Satija R. Integrated Analysis of Multimodal Single-Cell Data. Cell. 2021. 10.1016/j.cell.2021.04.048. [DOI] [PMC free article] [PubMed]
- 11.Huang DW, Sherman BT, Lempicki RA. Bioinformatics Enrichment Tools: Paths Toward the Comprehensive Functional Analysis of Large Gene Lists. Nucleic Acids Res. 2009;37(1):1–13. doi: 10.1093/nar/gkn923. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.SRPS in NCI’s Division of Cancer Control, (DCCPS) PS. Cancer Stat Facts: Leukemia. 2021. https://seer.cancer.gov/statfacts/html/leuks.html. Accessed 25 Aug 2021.
- 13.Ishitsuka K, Tamura K. Human T-cell Leukaemia Virus Type I and Adult T-cell Leukaemia-lymphoma. Lancet Oncol. 2014;15(11):517–26. doi: 10.1016/S1470-2045(14)70202-5. [DOI] [PubMed] [Google Scholar]
- 14.Karlebach G, Shamir R. Modelling and Analysis of Gene Regulatory Networks. Nat Rev Mol Cell Biol. 2008;9(10):770–80. doi: 10.1038/nrm2503. [DOI] [PubMed] [Google Scholar]
- 15.Kendal AR, Layton T, Al-Mossawi H, Appleton L, Dakin S, Brown R, Loizou C, Rogers M, Sharp R, Carr A. Multi-Omic Single Cell Analysis Resolves Novel Stromal Cell Populations in Healthy and Diseased Human Tendon. Sci Rep. 2020;10(1):1–14. doi: 10.1038/s41598-020-70786-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Kolda TG, Bader BW. Tensor Decompositions and Applications. SIAM Rev. 2009;51(3):455–500. [Google Scholar]
- 17.Komarova NL, Thalhauser CJ. High Degree of Heterogeneity in Alzheimer’s Disease Progression Patterns. PLoS Comput Biol. 2011;7(11):1002251. doi: 10.1371/journal.pcbi.1002251. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Komurov K, Tseng J-T, Muller M, Seviour EG, Moss TJ, Yang L, Nagrath D, Ram PT. The Glucose-Deprivation Network Counteracts Lapatinib-Induced Toxicity in Resistant ErbB2-Positive Breast Cancer Cells. Mol Syst Biol. 2012;8(1):596. doi: 10.1038/msb.2012.25. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Korsunsky I, Nathan A, Millard N, Raychaudhuri S. Presto Scales Wilcoxon and auROC Analyses to Millions of Observations. BioRxiv. 2019:653253.
- 20.Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9:559. doi: 10.1186/1471-2105-9-559. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Lawlor N, Nehar-Belaid D, Grassmann JD, Stoeckius M, Smibert P, Stitzel ML, Pascual V, Banchereau J, Williams A, Ucar D. Single Cell Analysis of Blood Mononuclear Cells Stimulated Through Either LPS or Anti-CD3 and Anti-CD28. Front Immunol. 2021;12:691. doi: 10.3389/fimmu.2021.636720. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Liu H, Roeder K, Wasserman L. Stability approach To Regularization Selection (StARS) for High Dimensional Graphical Models. Adv Neural Inf Process Syst. 2010;24(2):1432. [PMC free article] [PubMed] [Google Scholar]
- 23.Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, et al. Highly Parallel Genome-Wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell. 2015;161(5):1202–14. doi: 10.1016/j.cell.2015.05.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.McCarthy MI. Genomics, Type 2 Diabetes, and Obesity. N Engl J Med. 2010;363(24):2339–50. doi: 10.1056/NEJMra0906948. [DOI] [PubMed] [Google Scholar]
- 25.Ni Z, Zheng X, Zheng X, Zou X. scLRTD: A Novel Low Rank Tensor Decomposition Method for Imputing Missing Values in Single-Cell Multi-Omics Sequencing Data. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics: 2020. [DOI] [PubMed]
- 26.O’Donnell CJ, Nabel EG. Genomics of Cardiovascular Disease. N Engl J Med. 2011;365(22):2098–109. doi: 10.1056/NEJMra1105239. [DOI] [PubMed] [Google Scholar]
- 27.Osorio D, Zhong Y, Li G, Huang JZ, Cai JJ. scTenifoldNet: A Machine Learning Workflow for Constructing and Comparing Transcriptome-Wide Gene Regulatory Networks from Single-Cell Data. Patterns. 2020;1(9):100139. doi: 10.1016/j.patter.2020.100139. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Pan X, Li Z, Qin S, Yu M, Hu H. ScLRTC: Imputation for Single-Cell RNA-seq Data via Low-Rank Tensor Completion. BMC Genomics. 2021;22(1):1–19. doi: 10.1186/s12864-021-08101-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Picelli S, Faridani OR, Björklund ÅK, Winberg G, Sagasser S, Sandberg R. Full-length RNA-seq from Single Cells Using Smart-seq2. Nat Protoc. 2014;9(1):171–81. doi: 10.1038/nprot.2014.006. [DOI] [PubMed] [Google Scholar]
- 30.R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2021. R Foundation for Statistical Computing. https://www.R-project.org/.
- 31.Sherman BT, Lempicki RA, et al. Systematic and Integrative Analysis of Large Gene Lists Using DAVID Bioinformatics Resources. Nat Protoc. 2009;4(1):44–57. doi: 10.1038/nprot.2008.211. [DOI] [PubMed] [Google Scholar]
- 32.Stoeckius M, Hafemeister C, Stephenson W, Houck-Loomis B, Chattopadhyay PK, Swerdlow H, Satija R, Smibert P. Simultaneous Epitope and Transcriptome Measurement in Single Cells. Nat Methods. 2017;14(9):865–68. doi: 10.1038/nmeth.4380. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Towle-Miller LM, Miecznikowski JC, Zhang F, Tritchler DL. Sumo-fil: Supervised multi-omic filtering prior to performing network analysis. PLoS ONE. 2021;16(8):0255579. doi: 10.1371/journal.pone.0255579. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Tritchler D, Towle-Miller LM, Miecznikowski JC. Balanced Functional Module Detection in Genomic Data. bioRxiv. 2020. [DOI] [PMC free article] [PubMed]
- 35.Turner NC, Reis-Filho JS. Genetic Heterogeneity and Cancer Drug Resistance. Lancet Oncol. 2012;13(4):178–85. doi: 10.1016/S1470-2045(11)70335-7. [DOI] [PubMed] [Google Scholar]
- 36.Witten DM, Tibshirani RJ. Extensions of Sparse Canonical Correlation Analysis with Applications to Genomic Data. Stat Appl Genet Mol Biol. 2009;8(1):Article28. doi: 10.2202/1544-6115.1470. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Zappia L, Phipson B, Oshlack A. Splatter: Simulation of Single-Cell RNA Sequencing Data. Genome Biol. 2017;18(1):174. doi: 10.1186/s13059-017-1305-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Zhang F, Miecznikowski JC, Tritchler DL. Identification of Supervised and Sparse Functional Genomic Pathways. Stat Appl Genet Mol Biol. 2020;19(1):20180026. doi: 10.1515/sagmb-2018-0026. [DOI] [PubMed] [Google Scholar]
- 39.Zhou H, Li L, Zhu H. Tensor Regression with Applications in Neuroimaging Data Analysis. J Am Stat Assoc. 2013;108(502):540–52. doi: 10.1080/01621459.2013.776499. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Zou H, Hastie T. Regularization and Variable Selection via the Elastic Net. J R Stat Soc Ser B Stat Methodol. 2005;67(2):301–20. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All code in this manuscript was done using R version 4.1.0 [30]. Code to perform MOSCATO and replicate the simulations may be found publicly on GitHub at https://github.com/lorinmil/MOSCATO.
The real data application utilized publicly available data across three studies (assession numbers ERP124005, GSE152469, and GSE139369). The data for ERP124005 was downloaded through the Human Cell Atlas at https://data.humancellatlas.org/explore/projects/efea6426-510a-4b60-9a19-277e52bfa815/project-matrices, the data from GSE152469 was downloaded through the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4616298, and the data from GSE139369 was downloaded through NCBI’s Gene Expression Omnibus at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE139369. Steps and code to replicate the real data application may be found at https://github.com/lorinmil/MOSCATOLeukemiaExample.