Summary
High-dimensional biological data collection across heterogeneous groups of samples has become increasingly common, creating high demand for dimensionality reduction techniques that capture the underlying structure of the data. Discovering low-dimensional embeddings that describe the separation of any underlying discrete latent structure is an important motivation for applying these techniques, since these latent classes can represent important sources of unwanted variability, such as batch effects, or interesting sources of signal, such as unknown cell types. The features that define this discrete latent structure are often hard to identify in high-dimensional data. Principal component analysis (PCA) is one of the most widely used methods as an unsupervised step for dimensionality reduction; it finds linear transformations of the data which explain total variance. When the goal is detecting discrete structure, PCA is applied with the assumption that classes will be separated in directions of maximum variance. However, PCA will fail to accurately recover discrete latent structure if this assumption does not hold. Visualization techniques, such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), attempt to mitigate these problems with PCA by creating a low-dimensional space where, with high probability, similar objects are modeled by nearby points in the low-dimensional embedding and dissimilar objects are modeled by distant points. However, since t-SNE and UMAP are computationally expensive, a PCA reduction is often performed before applying them, which makes them sensitive to PCA's shortcomings. Also, t-SNE is limited to only two or three dimensions as a visualization tool, which may not be adequate for retaining discriminatory information. The linear transformations of PCA are preferable to the non-linear transformations provided by methods like t-SNE and UMAP when interpretable feature weights are desired. Here, we propose iterative discriminant analysis (iDA), a dimensionality reduction technique designed to mitigate these limitations. iDA produces an embedding that carries discriminatory information which optimally separates latent clusters, using linear transformations that permit post hoc analysis to determine the features that define these latent structures.
Keywords: Bioinformatics, Multivariate data, Statistical modeling
1. Introduction
The number of studies that rely on high-dimensional biological data collection continues to grow. This trend has been intensified recently by the advent of single cell RNA-seq (scRNA-seq), which simultaneously measures genome-wide expression profiles for tens of thousands of cells. In these studies, dimensionality reduction techniques are commonly used to capture the underlying discrete structure of the data in a small number of dimensions. This structure can be imposed by sources of variation that are strategic (experimental), biological but unknown and potentially interesting, or unwanted and unintentionally introduced by laboratory techniques. In the case of scRNA-seq, investigators are often interested in discovering previously unknown cell types based on their gene expression profiles. Whether heterogeneity across classes arises from unwanted sources of variation, variation in response to an experimental intervention, or different cell types in scRNA-seq experiments, it is vital to identify a low-dimensional embedding which describes the separation of classes. This embedding can be used as an interpretable reduction to characterize and investigate features which drive class separation, or to correct for unwanted sources of variation in downstream analysis. Varying experimental conditions commonly introduce batch effects that induce unwanted variability (Leek and others, 2010). For example, the Geuvadis Project expression data (Lappalainen and others, 2013) show clear evidence of a batch effect, where the laboratory in which the samples were processed contributes a large part of the variation in the data (Figure S1 of the Supplementary material available at Biostatistics online). If unaccounted for, this batch effect may cause spurious associations in downstream analysis. This is particularly problematic when the outcome of interest is confounded with the variable responsible for the batch effect (Leek and others, 2010).
Similarly, in genome-wide or transcriptome-wide association studies (GWAS/TWAS), where each locus or transcript is tested for association with an outcome, standard statistical tests assume observations are independent. When there is underlying structure in the data, genetic population structure for example, the independence assumption does not hold (Brown and others, 2018). To correct for this latent structure, state-of-the-art methods include population structure as a covariate in the model. In GWAS, self-reported ancestry is not reliable (Mersha and Abebe, 2015), and the latent structure attributed to ancestry must be estimated. Similarly, in TWAS, sources of technical variability are impossible to predict, and methods that estimate these unknown factors, such as surrogate variable analysis (SVA) (Leek and Storey, 2007), have been proposed.
A common motivation of studies using scRNA-seq is to detect novel cell types and identify marker genes that define these cell types. The current standard approach is to define cell types using unsupervised clustering, followed by statistical tests to identify genes with greater expression in one cell type compared to the others. While this may give researchers valuable information for a particular gene of interest, it fails to provide a broad summary of the set of genes which define cell type separation. A more useful approach would be to find an embedding space which defines the latent structure of the separation of cell types. In the 10X Genomics 3k-cell PBMC data set, we see a separation based on cell type, but with boundaries between cell types that are sometimes ill-defined (Figure S2 of the Supplementary material available at Biostatistics online). Defining the latent structure which captures the features that best separate cell types from each other in low-dimensional space would be preferable.
PCA is commonly used as a dimensionality reduction method in these applications since it finds principal components (PCs) that explain variation in the data. For sources of variation such as batch effects or genetic ancestry, the variation in the data set is largely attributed to factors not associated with the outcome variable, and PCA is used to find an embedding to estimate and account for these latent factors. In other data types like scRNA-seq, PCA is used to reduce the computational cost of downstream analyses like clustering. The application of PCA to detect latent discrete structure assumes that latent classes will in fact account for most of the data set variance. However, since PCA is unsupervised, it does not necessarily separate classes (Lever and others, 2017). When the clusters formed by latent classes fall in directions of maximum variance, PCA will indeed find an embedding where clusters separate, but when they do not, embeddings computed by PCA will fail to separate these clusters.
Since PCA seeks transformations that explain the most variance in the data, it inherently favors separating clusters with large variance. A well-separated cluster with low variance will likely not be separated by the top embedding directions since it does not contribute much to the total variance of the data set. For data sets with an innate discrete latent structure containing clusters with unbalanced variances, an alternative method for recovering the latent structure is needed.
Visualization techniques, such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), use a different approach, retaining the local structure of the original high-dimensional data in the low-dimensional embedding. These methods often yield visualizations that capture latent discrete structure in data well. However, they are computationally expensive and in practice are usually applied to data already reduced to a smaller dimension using PCA as a first step, thus inheriting the issues raised about PCA above. More importantly, embeddings obtained by t-SNE and UMAP are nonlinear, making it hard to determine the features in the original data that may be driving some of the signal separating inferred latent discrete structure.
In this paper, we propose iterative Discriminant Analysis (iDA), an unsupervised dimensionality reduction technique that addresses these problems with PCA and with nonlinear methods like t-SNE and UMAP. iDA uses linear transformations to produce an embedding space that contains discriminatory information which optimally separates latent clusters. The method leverages linear discriminant analysis (LDA) to find transformations which separate classes from each other while also minimizing variance within each class. It improves on PCA by defining the latent space while maintaining unobserved clustered architecture.
2. Methods
2.1. iDA Algorithm
Denote the observed data matrix as $X \in \mathbb{R}^{n \times p}$, with $n$ observations and $p$ features. Linear discriminant analysis (LDA), like PCA, is a dimensionality reduction technique: it finds a linear projection $W$ of the data features that best fits the data in a lower dimension, producing an embedding $Y = XW$. A major difference between these two linear projection techniques is that LDA is supervised while PCA is unsupervised. Unsupervised PCA finds projections which maximize the total variance in the data, which may or may not separate classes in these projections (Figure 1). If the class assignments are known a priori, LDA can be used to find projections $W$ which best separate the classes. In Figure 1, the direction which maximizes variation is uninteresting since it does not separate the latent clustering; the LDA projection, however, finds a distinct boundary between the classes. As in this example, a direction with low variation but potentially valuable, class-defining information is not guaranteed to be preserved in the first few principal components. When determining the number of PCs to use, it is common practice to evaluate the eigenvalues for the top PCs in an elbow plot to choose a cutoff for which eigenvectors to include in the reduced embedding. Since each eigenvalue indicates the amount of variation attributed to its respective eigenvector, the top PCs which capture the majority of the variation in the data are commonly used, meaning that lowly variable but well-defined clusters will not be preserved in the PCA embedding.
Fig. 1. PCA versus LDA projections for clustered data. The PCA projection fails to separate the classes because they do not fall along the projection which maximizes variation for the data. The supervised LDA projection maximizes the ratio of between- to within-cluster variation. Graphic inspired by Malakar (2017).
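To make this contrast concrete, the following is a minimal simulated illustration (ours, not from the paper): two classes are separated along a low-variance direction, so the first PC ignores them while LDA recovers the boundary.

```r
# Minimal simulated illustration (ours): two classes separated along a
# low-variance direction. PCA's first direction tracks the high-variance
# axis and mixes the classes; LDA finds the separating direction.
library(MASS)  # lda()

set.seed(1)
n  <- 200
x1 <- rnorm(n, sd = 5)                                       # high-variance, uninformative axis
x2 <- rnorm(n, sd = 0.5) + rep(c(-1.5, 1.5), each = n / 2)   # low-variance, class-defining axis
X  <- cbind(x1, x2)
cls <- factor(rep(c("A", "B"), each = n / 2))

pc1 <- prcomp(X)$x[, 1]                        # first principal component scores
ld1 <- predict(lda(X, grouping = cls))$x[, 1]  # first (only) discriminant scores

# Class separation along each 1-D projection, summarized by a t-statistic:
t.test(pc1 ~ cls)$statistic   # near zero: PC1 does not separate the classes
t.test(ld1 ~ cls)$statistic   # large: LD1 separates them cleanly
```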
PCA maximizes total variance of the embedded data by maximizing the objective function

$$\hat{W} = \underset{W :\, W^{\top}W = I}{\arg\max}\ \operatorname{tr}\big(\hat{\Sigma}\big), \qquad (2.1)$$

where $\hat{\Sigma}$ is the empirical covariance of the embedded data $Y = XW$,

$$\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})^{\top}, \qquad (2.2)$$

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad (2.3)$$

and $y_i$ is the $i$th row of embedding $Y$ corresponding to the low-dimensional representation of the $i$th data observation ($x_i$).
LDA uses observed assignments $z_i$ to one of $K$ classes to find projections which separate classes by maximizing the objective function:

$$\hat{W} = \underset{W}{\arg\max}\ \frac{\big| W^{\top} \hat{\Sigma}_b W \big|}{\big| W^{\top} \hat{\Sigma}_w W \big|}, \qquad (2.4)$$

where $\hat{\Sigma}_b$ is the between-class empirical covariance and $\hat{\Sigma}_w$ is the within-class empirical covariance matrix:

$$\hat{\Sigma}_b = \frac{1}{K} \sum_{k=1}^{K} (\mu_k - \bar{x})(\mu_k - \bar{x})^{\top}, \qquad \hat{\Sigma}_w = \frac{1}{n} \sum_{k=1}^{K} \sum_{i :\, z_i = k} (x_i - \mu_k)(x_i - \mu_k)^{\top}, \qquad (2.5)$$

with $\mu_k$ the mean of the observations assigned to class $k$ and $\bar{x}$ the overall mean.
While PCA maximizes global variation of the data, LDA maximizes the variation specific to the clustered architecture since class labels are known. Intuitively, by separating the data into clusters and maximizing the ratio of between- to within-cluster variation, the resulting projection $\hat{W}$ maximizes the variance between clusters while minimizing the variance within each cluster. LDA assumes that all classes have equal covariance, but the extension to quadratic discriminant analysis (QDA) allows for unequal covariance matrices when calculating the within-class scatter matrices, thereby allowing more flexible, nonlinear decision boundaries between classes.
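As an illustrative check of the objective in (2.4) and the scatter matrices in (2.5) (ours, using `X` and `cls` from the simulated example above), the LDA directions can be computed as the top eigenvectors of $\hat{\Sigma}_w^{-1} \hat{\Sigma}_b$:

```r
# Illustrative check (ours): the LDA directions of eq. (2.4) are the top
# eigenvectors of Sw^{-1} Sb, computed from X and cls defined above.
K <- nlevels(cls)
class_means <- rowsum(X, cls) / as.vector(table(cls))   # per-class mean vectors
xbar <- colMeans(X)
Sb <- crossprod(sweep(class_means, 2, xbar)) / K        # between-class covariance (eq. 2.5)
Xc <- X - class_means[cls, ]                            # center each point by its class mean
Sw <- crossprod(Xc) / nrow(X)                           # within-class covariance (eq. 2.5)
ev <- eigen(solve(Sw) %*% Sb)
W  <- Re(ev$vectors[, seq_len(K - 1), drop = FALSE])    # K - 1 discriminant directions

# These directions agree (up to sign and scale) with MASS::lda(X, cls)$scaling.
```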
LDA is a supervised classification technique which requires class labels in order to maximize this ratio of between-class to within-class variation. The challenge in bridging the gap between PCA and LDA is defining class labels which capture the latent discrete structure of the data. The core idea of iDA is to use an unsupervised clustering method to define classes, and then use those classes in LDA to find a low-dimensional embedding. This unsupervised method aims to find a latent clustered structure in the data while remaining unbiased with respect to factors such as batch, population, or cell type. This agnostic approach avoids using recorded class labels, but the iDA algorithm does allow the user to supply an initial set of class labels if desired. Iteration in this case starts with the embedding step instead of the clustering step, but this will simply find an embedding that separates based on the provided labels. An approach that incorporates these labels as soft assignments would be helpful, but we leave it for future work.
The iDA algorithm determines the number of clusters to use in each iteration by maximizing the clustering modularity in the Walktrap clustering step. If, however, the user would like to find an embedding with a pre-determined number of clusters, this can be achieved by setting the user-defined $K$ parameter as input to the iDA function.
Algorithm 1: iDA algorithm.
iDA results in an embedding space of dimension $K - 1$, where $K$ is the number of clusters identified by iDA which maximizes the community structure modularity, and each dimension determines a decision boundary which best separates an individual class from the rest. To find reliable clusters which define the discrete latent structure, we use the Walktrap method for community detection in a nearest neighbor graph, which computes the similarities between neighbors based on random walks in the graph (Pons and Latapy, 2005). To compute the shared nearest neighbors graph, first a k-nearest neighbors graph is built with the default number of neighbors k = 10. Then, the pairwise Jaccard similarity is computed (the number of common neighbors each pair of vertices has, divided by the number of vertices that are neighbors of at least one of the two). The shared nearest neighbors graph is then pruned such that pairwise similarities less than 1/15 are set to zero, and the graph is converted to an undirected graph, which is the input to the Walktrap clustering algorithm. Walktrap returns a hierarchical clustering; the point at which to cut the tree is determined by computing the modularity at each level of the tree and cutting where the modularity is maximized.
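As a concrete sketch, the loop below is our minimal illustration of this procedure, not the released package code (see the Software section for that). The helper names `snn_walktrap()` and `ida_sketch()` are hypothetical; the k = 10 neighbors, the 1/15 pruning threshold, the modularity-maximizing cut, and the 0.98 ARI stopping rule follow the description above and Section 2.2.1.

```r
# Minimal illustration of the iDA loop (ours, not the released iDA package).
library(FNN)     # get.knn()
library(Matrix)  # sparseMatrix(), drop0()
library(igraph)  # graph_from_adjacency_matrix(), cluster_walktrap(), membership()
library(MASS)    # lda()
library(mclust)  # adjustedRandIndex()

# Shared-nearest-neighbor graph + Walktrap, cut at maximum modularity.
snn_walktrap <- function(emb, k = 10, prune = 1 / 15) {
  n   <- nrow(emb)
  nn  <- get.knn(emb, k = k)$nn.index
  adj <- sparseMatrix(i = rep(seq_len(n), k), j = as.vector(nn),
                      x = 1, dims = c(n, n))
  shared <- tcrossprod(adj)               # shared-neighbor counts per vertex pair
  jac <- shared
  jac@x <- jac@x / (2 * k - jac@x)        # Jaccard similarity of neighbor sets
  jac@x[jac@x < prune] <- 0               # prune similarities below 1/15
  g <- graph_from_adjacency_matrix(drop0(jac), mode = "undirected",
                                   weighted = TRUE, diag = FALSE)
  membership(cluster_walktrap(g))         # modularity-maximizing cut
}

# Iterate: cluster, fit LDA on the cluster labels, re-embed, and repeat
# until cluster assignments stabilize (ARI > 0.98, Section 2.2.1).
ida_sketch <- function(X, max_iter = 10, ari_tol = 0.98) {
  embedding <- X                          # init: the most variable features
  clusters  <- rep(1L, nrow(X))
  for (iter in seq_len(max_iter)) {
    new_clusters <- snn_walktrap(embedding)
    if (adjustedRandIndex(new_clusters, clusters) > ari_tol) break
    clusters  <- new_clusters
    fit       <- lda(X, grouping = clusters)  # discriminant step
    embedding <- predict(fit)$x               # K - 1 dimensional LD scores
  }
  list(embedding = embedding, clusters = clusters)
}
```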
For our experiments below, we check for convergence as explained in Section 2.2.1. Since iDA remains agnostic to the study design, we do not need to worry about recording errors for phenotype or batch data, or self-identification discrepancies. In our implementation, we initialize the embedding by selecting the features with the highest marginal variance.
2.2. Methods for performance assessment
To assess the performance of iDA, we used two data sets. The first is from the Geuvadis bulk RNA-sequencing Project, which has expression data from 462 individuals of European and Yoruba ancestry, sequenced in 7 different laboratories (Brown and others, 2018). This data set exhibits a latent structure, with an overwhelming amount of variation attributed to batch effects and some variation attributed to population structure. The second is the 10X Genomics peripheral blood mononuclear cell (PBMC) single cell RNA-sequencing (scRNA-seq) data set, which has expression from roughly 3000 unlabeled immune cells (Zheng and others, 2017). We chose these data sets because of the known structure embedded in each (genetic population structure and batch in Geuvadis; distinct cell types in the PBMC data set). Note, however, that these labels are not used as input to the iDA algorithm.
2.2.1. Convergence assessment
To assess the convergence of iDA, we inspect the stability of the cluster assignments resulting from the Walktrap community detection at each iteration. Since the cluster assignments dictate the scatter matrices calculated for the discriminant analysis, once the cluster assignments are stable, the discriminators will not change. To evaluate when the cluster assignments have stopped changing, we calculate the Adjusted Rand Index as a measure of concordance between the cluster assignments of successive iterations. The iterations are stopped when the Adjusted Rand Index between the cluster assignment vectors is larger than 0.98.
2.2.2. Cumulative F-statistic
To compare the capacity of iDA and PCA to accurately find a subspace which describes the underlying clustering in the data, we adapt the F-statistic as well as $R^2$ to assess models across dimensions. First, we perform linear regressions which model each dimension of the PCA and iDA embeddings as a function of the observed variables that describe the known underlying group structure (for the Geuvadis data set, this is the interaction between population and laboratory; for the 10X Genomics PBMC scRNA-seq data, this is cell type). To determine the appropriate number of PCs to use in this comparison for each data set, we evaluated the elbow plots to determine the eigenvectors accounting for the majority of the variation in each data set (Figure S3 of the Supplementary material available at Biostatistics online). To ensure an appropriate amount of variation was captured in the PC embedding for comparison to iDA, we use the first 10 PCs for the Geuvadis and PBMC data sets. For embedding dimension $j$ we fit

$$y^{(j)}_i = \beta^{(j)}_0 + \sum_{g} \beta^{(j)}_g \, \mathbb{1}[z_i = g] + \epsilon^{(j)}_i, \qquad (2.6)$$

where $y^{(j)}_i$ is the value of dimension $j$ for sample $i$ and $g$ indexes the known groups.
Then, to assess how well these known classes are separated across dimensions, we use an adaptation of the F-statistic. We define this cumulative F-statistic as:

$$F_{1:k} = \frac{\sum_{j=1}^{k} \big(\mathrm{TSS}_j - \mathrm{RSS}_j\big) \big/ (G - 1)}{\sum_{j=1}^{k} \mathrm{RSS}_j \big/ (n - G)}, \qquad (2.7)$$

where we compute the total sum of squares ($\mathrm{TSS}_j$) and residual sum of squares ($\mathrm{RSS}_j$) per regression model and sum over each dimension to compute a cumulative F-statistic from 1 to $k$, with $G$ the number of groups. This statistic measures the ratio of the variance of the group means to the within-group variance over dimensions, indicating how well the dimensionality reductions from PCA and iDA describe group separation for known groups. Because both the $\mathrm{TSS}$ and $\mathrm{RSS}$ change as dimensions are added, the cumulative F-statistic is not monotonic.
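A short R sketch of this computation follows (our illustration, not package code; `emb` is an n-by-k embedding matrix and `groups` the vector of known labels, both hypothetical names):

```r
# Hypothetical sketch: per-dimension regressions (eq. 2.6) accumulated into
# the cumulative F-statistic of eq. 2.7.
cumulative_f <- function(emb, groups) {
  groups <- as.factor(groups)
  G <- nlevels(groups); n <- nrow(emb)
  ss <- sapply(seq_len(ncol(emb)), function(j) {
    fit <- lm(emb[, j] ~ groups)               # one regression per dimension
    c(tss = sum((emb[, j] - mean(emb[, j]))^2),
      rss = sum(residuals(fit)^2))
  })
  tss_c <- cumsum(ss["tss", ]); rss_c <- cumsum(ss["rss", ])
  ((tss_c - rss_c) / (G - 1)) / (rss_c / (n - G))  # F for dimensions 1..k
}
```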
2.2.3. Cumulative $R^2$
We also compute a cumulative $R^2$ for dimensions 1 through $k$ to measure the total variance captured in the subspaces from each reduction technique over dimensions:

$$R^2_{1:k} = 1 - \frac{\sum_{j=1}^{k} \mathrm{RSS}_j}{\sum_{j=1}^{k} \mathrm{TSS}_j}. \qquad (2.8)$$

This cumulative $R^2$ over dimensions represents the proportion of variance over dimensions that is explained by the underlying grouping.
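Continuing the sketch above (again our illustration), the cumulative $R^2$ reuses the same per-dimension sums of squares:

```r
# Cumulative R^2 over dimensions 1..k (eq. 2.8), from the same per-dimension
# TSS and RSS as in cumulative_f() above:
cumulative_r2 <- function(emb, groups) {
  groups <- as.factor(groups)
  ss <- sapply(seq_len(ncol(emb)), function(j) {
    fit <- lm(emb[, j] ~ groups)
    c(tss = sum((emb[, j] - mean(emb[, j]))^2), rss = sum(residuals(fit)^2))
  })
  1 - cumsum(ss["rss", ]) / cumsum(ss["tss", ])
}
```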
2.2.4. Clustering validation
To validate that the iDA method identifies better latent clustering in the original data than the PCA dimensionality reduction and is not creating spurious clustering, we assess the compactness and separation of clusters in the original space using the Dunn Index (DI). The DI is defined as the ratio of the smallest distance between observations not in the same cluster to the largest intracluster distance,

$$\mathrm{DI} = \frac{\min_{k \neq k'} \delta(C_k, C_{k'})}{\max_{k} \Delta(C_k)}, \qquad (2.9)$$

where $\delta(C_k, C_{k'})$ is the intercluster distance between clusters $C_k$ and $C_{k'}$, and $\Delta(C_k)$ is the intracluster distance (diameter) of cluster $C_k$.
To ensure robustness, the DI is calculated using a bootstrapping approach: observations are resampled with replacement, and the DI is calculated over 1000 bootstrap samples.
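A minimal R sketch of this bootstrap follows (ours; `X` and `clusters` are hypothetical names for the original expression matrix and the class assignments being validated):

```r
# Hypothetical sketch: Dunn Index (eq. 2.9) with a bootstrap over observations.
dunn_index <- function(d, cl) {
  d  <- as.matrix(d)
  cl <- as.vector(cl)
  inter <- min(d[outer(cl, cl, "!=")])          # smallest between-cluster distance
  intra <- max(sapply(unique(cl), function(k) { # largest cluster diameter
    dk <- d[cl == k, cl == k, drop = FALSE]
    if (nrow(dk) > 1) max(dk) else 0
  }))
  inter / intra
}

boot_di <- replicate(1000, {
  idx <- sample(nrow(X), replace = TRUE)        # resample observations
  dunn_index(dist(X[idx, ]), clusters[idx])
})
```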
3. Results
3.1. iDA preserves cluster structure better than PCA
The cumulative F-statistic evaluates how well the low-dimensional embedding captures the separation of clusters in the data. The iDA algorithm was applied to the 10X Genomics scRNA-seq PBMC data set and compared to a PCA reduction of the same dimension. For comparison of the embeddings from iDA and PCA, we used the cell identities assigned by the Seurat scRNA-seq pipeline as the cell types.
Figure 2(A) provides evidence that the decision boundaries from iDA offer an embedding which better separates the putative cell types compared to PCA, as determined by the cumulative F-statistic. As expected, the first dimension of PCA, which captures the most variation in the data, separates one cluster very well (likely the most variable cluster), but as dimensions are added, the projections do not capture the cluster separation. In the iDA embedding, each dimension captures the separation of one cluster about equally, so the cumulative F-statistic does not degrade as we add dimensions. iDA also finds LDs which capture a higher proportion of variance explained over all dimensions than the PCs do when evaluating variance between cell types (Figure 2(B)). Although PCA's objective function is to maximize variance, we see that it fails to find variation attributed to some cell types, likely ones with low variation. There is evidence of this in the acute drop-off in both the cumulative F-statistic and cumulative $R^2$ values.
Fig. 2. Cumulative F-statistic and cumulative $R^2$ for the PBMC and Geuvadis data sets. (A) As more dimensions are added for each of the reduction methods, the clustering as determined by the Seurat putative cell type assignments for each of the samples is better separated by the iDA embedding than by PCA. (B) The iDA embedding captures more of the total variance attributed to the latent clustering than PCA. (C) As more dimensions are added for each of the reduction methods, the clustering from the main effects and interaction between laboratory and population for each of the samples is better separated by the iDA embedding than by PCA. (D) The iDA embedding captures more of the total variance attributed to the latent clustering than PCA.
We see a similar trend for the statistics computed using the Geuvadis data set. We evaluate the performance of iDA and PCA using the groups from the main effects and interaction between the population and the laboratory in which the samples were processed. While these variables may not account for all latent clustering in this data, we know they contribute a large portion of it, and an embedding should capture these clusters (and potentially others). Here, the first dimension of PCA does a very poor job of accounting for the latent clustering attributed to population and laboratory, further evidence that maximizing variation does not guarantee the best separation of clusters (Figure 2(C)). iDA again consistently finds better separation for this data, as well as capturing more of the overall variation attributed to these groups (Figure 2(D)).
Many of the clusters produced by iDA and by PCA + Louvain (Seurat clusters) are nearly identical between the two methods (Table S1 of the Supplementary material available at Biostatistics online). The main difference is that the boundary between Seurat Clusters 0 and 1 changes such that some Seurat Cluster 0 cells get grouped with cells in Seurat Cluster 1. Based on both the tSNE and UMAP of these projections, the boundary between these two clusters is not well defined (Figure S4 of the Supplementary material available at Biostatistics online). To determine which clustering is optimal, we compute the bootstrapped DI in the full gene expression space for class assignments resulting from iDA compared to the class assignments from other methods over 1000 iterations. The PCA and iDA embeddings were compared using multiple clustering techniques, including two community detection algorithms common in single cell clustering, Walktrap clustering (the default for the iDA algorithm) and Louvain community detection (Blondel and others, 2008), as well as K-means clustering, which has been shown to be closely linked to the unsupervised dimension reduction of PCA (Ding and He, 2004).
The clusters found in the PBMC scRNA-seq data set using the iDA embedding with each of the community detection algorithms—Walktrap (green) and Louvain clustering (light blue)—both have Dunn Indices which are significantly increased compared to the clusters found with the PCA embedding and Louvain clustering (i.e., the clusters found with the Seurat pipeline) (yellow) (Figure S5A of the Supplementary material available at Biostatistics online). In the Geuvadis data set, the Dunn Indices of the clusters found by all clustering methods with the iDA embedding are all significantly increased compared to the clusters found by both the PCA embedding with Louvain clustering as well as the recorded batch (laboratory) and population of each sample (Figure S5B of the Supplementary material available at Biostatistics online). This indicates that the class assignments resulting from iDA define clusters in the original space which are compact and better separated than class assignments determined by either the Seurat pipeline for scRNA-seq or measured covariates in Geuvadis.
3.2. iDA directions are highly correlated with known PBMC cell type markers
Canonical markers of particular cell types have been characterized in PBMCs and can be used to validate clustering methods (Zheng and others, 2017). These markers have been experimentally elucidated such that cells with high expression of these markers are likely of the corresponding cell type based on their known function.
CD14 and LYZ are known markers for monocytes, MS4A1 is a marker for B-cells, GNLY and NKG7 are markers for Natural Killer (NK) cells, and PPBP is a marker for platelets. If the clustering algorithm captures clusters indicative of cell type, the resulting clusters should reflect this in the expression of these marker genes. For example, LD6 is the directional vector which separates cluster 1 (with cells in cluster 1 mapping to the lowest weights of the vector). We see that the cells in cluster 1 highly express the two markers for monocytes, CD14 and LYZ (Figure 3(A)). Quantitatively, both markers for monocytes have much higher mean scaled expression in these cells than in the cells of all other clusters (Table 1).
Fig. 3. Putative cell type clusters found with iDA are separated both by known cell type markers and in the iDA embedding. (A) The group of cells which highly express both LYZ and CD14, known markers for monocytes, is separated by the 6th iDA dimension (LD6) (salmon colored points). (B) The group of cells which highly express GNLY and NKG7, known markers for NK cells, is separated by the 3rd iDA dimension (LD3) (light blue colored points). (C) The group of cells which highly express MS4A1, a known marker for B-cells, is separated by the 5th iDA dimension (LD5) (teal colored points). (D) PPBP is a known marker for platelets. The group of cells which highly express PPBP is separated by the 1st iDA dimension (LD1) (pink colored points). (E) FCER1A is a known marker for dendritic cells. The group of cells which highly express FCER1A is separated by the 4th iDA dimension (LD4) (purple colored points).
Table 1.
Difference in mean scaled expression for canonical PBMC cell type markers

| Cell type | Marker | Mean scaled expression within cell type | Mean scaled expression, all other cells |
|---|---|---|---|
| Monocytes | LYZ | 1.668 | −0.382 |
| Monocytes | CD14 | 1.431 | −0.328 |
| NK cells | GNLY | 2.740 | −0.150 |
| NK cells | NKG7 | 2.436 | −0.133 |
| B-cells | MS4A1 | 1.971 | −0.291 |
| Platelets | PPBP | 9.792 | −0.053 |
The LD3 vector separates the cluster of cells (cluster 6) which highly express the markers of NK cells (GNLY and NKG7) (Figure 3(B)). These markers are shown to have much higher average scaled expression than cells not in this cluster. Additionally, iDA is sensitive to separating the cluster of cells which are high in the expression of both marker genes (as opposed to only one), separating them from cells in cluster 2 which are high in only NKG7 expression but not GNLY, indicating that cluster 6 is a cluster of NK cells.
LD5 separates the cluster which highly expresses the marker for B-cells, MS4A1 (Figure 3(C)), also with an overall much higher mean scaled expression when compared to cells in all other clusters (1.971 versus -0.291). This indicates cluster 5 is likely a population of B-cells.
Lastly, a marker for platelets, PPBP, is highly expressed in the cells in cluster 9 identified by the iDA algorithm. This cluster is separated by LD1 and shows a clear separation, with the platelet cells mapping to high LD weights (Figure 3(D)).
3.3. iDA directions are highly correlated with SNPs in eQTL
The Geuvadis data set has known sub-population structure attributed to ancestry and to the location where the samples were sequenced. iDA identifies an LD which separates the YRI population from the EUR populations (LD3). Interestingly, this LD purely captures the separation in population and groups together all samples sequenced in different laboratories.

Brown and others (2018) show that expression reflects population structure because alleles, which may have different frequencies among populations, can affect the expression level of nearby genes. Such allele–gene pairs are called expression quantitative trait loci, or eQTL. The Geuvadis data set has 116 SNP–gene pairs which are in eQTL only in YRI individuals and not in EUR individuals. The average difference between the YRI minor allele frequency (MAF) and the EUR populations' MAF is a substantial 0.175. Since these alleles are in eQTL with a gene and the allele frequencies differ between populations, the effect on expression level will also differ between populations. These SNPs' allele frequencies vary greatly between the YRI and EUR populations (Table S2 of the Supplementary material available at Biostatistics online).
The MAFs differ by as much as 36% (as seen in the rs11757158–RAB5C pair), and in some cases the minor allele is not present at all in the EUR population (rs143415501). In Figures 4(A, B, and C), the minor alleles clearly increase the expression of their respective eQTL genes, while (D) shows a decrease in expression of PWAR6 in YRI as compared to the EUR samples, mirroring the difference in MAF between the YRI population and the EUR populations.
Fig. 4.
SNP–gene pairs in eQTL affect the expression dependent on the sample population since the SNP MAF is variable across populations. The 3rd iDA dimension (LD3) separates the YRI population from the European ones. (A) RAB5C, which is in eQTL with SNP rs11757158 in the YRI population, has higher average expression than the European populations. (B) SRF, which is in eQTL with SNP rs11413536 in the YRI population, has higher average expression than the European populations. (C) PSKH1, which is in eQTL with SNP rs143415501 in the YRI population, has higher average expression than the European populations. (D) PWAR6, which is in eQTL with SNP rs200846953 in the YRI population, has higher average expression than the European populations.
3.4. iDA directions are highly correlated with known technical noise
Another point to consider is whether technical noise is being further entrenched in the embeddings found by iDA. To assess how iDA behaves on data sets with unknown technical noise, we performed iDA on a control data set designed so that all variation is attributed to technical noise (Zheng and others, 2017). This data set consists of droplets which were loaded with the same ratio of 92 External RNA Controls Consortium (ERCC) synthetic RNAs into GEMs. Since there are no differences in cell size, RNA content, or transcriptional activity in these GEMs, the only variation is technical noise introduced by the differing number of UMIs per GEM. The embedding iDA produces on both the raw counts and the normalized counts confirms that the entrenchment of technical noise can arise from the normalization technique (Townes and others, 2019). For example, in the negative-control ERCC data set, the iDA embedding of the raw counts clearly separates GEMs by the technical variation introduced by the mean UMI count per GEM (Figure S6B of the Supplementary material available at Biostatistics online), but the embedding of the log-normalized counts does not produce this separation (Figure S6A of the Supplementary material available at Biostatistics online). Therefore, the entrenchment of technical noise in the iDA embedding is a result of the normalization step, and not of the iDA algorithm itself.
To show further evidence of accurate iDA clustering and dimensionality reduction, we applied both iDA and PCA with Walktrap clustering to two single cell benchmarking data sets from the 10X Genomics platform with known cell type labels (Tian and others, 2019). For these two data sets, known populations of three (10X-3cl) and five (10X-5cl) cell lines, respectively, were sequenced using the 10X Genomics single cell RNA sequencing platform. The tSNE plot of the PCA reduction of log counts for each cell reveals discrete clustering within each cell line, but there is also a moderate amount of heterogeneity in each cell line (particularly in H1975 and A549, which each separate into distinct sub-clusters) (Figure S7A of the Supplementary material available at Biostatistics online). When iDA is applied to each of these data sets without any guidance on how many cell lines are present, the resulting clustering is sensitive to this within cell type sub-clustering. Because of this sensitivity to within cell type heterogeneity, when compared to the known cell type labels, the Adjusted Rand Index (ARI) for iDA clustering is 0.519 on the 3 cell line data set and 0.798 on the 5 cell line data set. Comparatively, when unsupervised clustering is applied to the PCA embedding, the ARI between these clusters and the known cell type labels is 0.457 for 10X-3cl and 0.615 for 10X-5cl.
In the case where the number of cell lines in a sample is known, this parameter can be given to the Walktrap clustering. In these two data sets, because the expression profiles of the cell lines differ so greatly, supervised clustering yields much better results. When iDA is applied with the number of known clusters given, the resulting ARIs are 0.997 and 0.986 for 10X-3cl and 10X-5cl, respectively. PCA with supervised clustering also does well, with ARIs of 1.00 for 10X-3cl and 0.987 for 10X-5cl (Table S3 of the Supplementary material available at Biostatistics online).
3.5. iDA Converges
When assessing the convergence of iDA, we evaluate the concordance of cluster assignments at each iteration with the Adjusted Rand Index between the two sets of class assignments. We set a concordance threshold larger than 0.98 as the stopping criterion for iDA. Each data set reaches the threshold within three iterations of iDA (Figure S8 of the Supplementary material available at Biostatistics online). Since the discriminants in iDA are determined by the cluster assignments, once the assignments stabilize between iterations, the discriminants are stable and unchanged as well.
The speed of iDA depends heavily on the number of variable features used initially, since each iteration decomposes the gene-by-gene matrix. Although PCA is faster, iDA runs within six minutes for data sets with fewer than 3000 variable genes and 500 samples. iDA identifies 2829 and 3000 variable genes in the Geuvadis and PBMC data sets and runs in five and three minutes, respectively (Figure S9 of the Supplementary material available at Biostatistics online). In the breakdown of time allocation for the iDA algorithm, the SVD takes the most time for each of the data sets, since a decomposition is computed in each iteration.
4. Discussion
Discovering the latent structure in high-dimensional data is essential in many applications. The latent structure can be attributed to batch, which needs to be corrected for before downstream analysis to avoid spurious associations as well as reduced power in association testing. The latent structure can also define biological clusters which researchers are interested in evaluating, such as in scRNA-seq, where identifying a set of features which define cell types is of main concern. In standard pre-processing using PCA, as well as in methods which use PCA to identify sources of variation in batch effect settings, classes which are distinctly defined but have low variation will not be maximally separated. Although neither PCA nor iDA is specifically a batch correction method, the resulting low-dimensional embedding which captures unwanted variation can in turn be used as covariates in downstream analysis to produce more accurate results. For example, the iDA embedding which captures the separation of batches in the scaled Geuvadis counts clearly separates the samples by the laboratories in which they were processed, and could be used as covariates in downstream analysis to account for this separation (Figure S6C of the Supplementary material available at Biostatistics online). This approach, namely adding embeddings to control for unwanted structure in downstream analysis, is preferred to another common approach of removing the first few embedding directions from the original input data to produce corrected data sets for further downstream analysis. Accordingly, we have focused our analyses in this paper on the preferred approach.
One advantage of PCA embeddings is the interpretability of the variance contribution each dimension adds when included in the embedding. Often, the respective eigenvalues of the top PCs are plotted to determine the "elbow" which suggests the ideal number of top PCs to include in the final embedding (Figure S3 of the Supplementary material available at Biostatistics online). This offers a natural ordering of PCs based on the amount of variation each accounts for in the data. Because iDA also finds its embedding with an SVD, the resulting LDs are ordered such that the leading LDs correspond to the most easily separated clusters. For example, in the PBMC data set, LD1 separates the platelet cells, which are best separated from other cell types (Figure 3(D)), while the later LDs are not as well separated (Figures 3(A, C, and E)).
Several neighbor-preserving non-linear dimensionality reduction techniques exist for data visualization, such as t-SNE and Uniform Manifold Approximation and Projection (UMAP). These methods are particularly popular for scRNA-seq dimensionality reduction and visualization. t-SNE constructs a probability distribution over all pairs of samples such that samples close in Euclidean distance in the original data space have a higher probability of being placed near each other in the resulting lower-dimensional space than those which are far apart in the original space, thereby preserving the neighborhood structure in a two- or three-dimensional space (van der Maaten and Hinton, 2008). t-SNE has a few shortcomings, particularly when applied to scRNA-seq. t-SNE does not scale well, which is problematic with the rapidly growing number of cells being sequenced for scRNA-seq. iDA is more scalable for large numbers of samples since the time-consuming SVD step in iDA depends on the number of variable features and not the number of samples. Since t-SNE computes the k-nearest neighbor graph as an initial step, and it is infeasible to compute this in the original high-dimensional space, a preprocessing step like PCA is used to reduce the number of dimensions before applying t-SNE, which means it can fall victim to the downfalls of PCA discussed above. The newer method, UMAP, is also a neighborhood-preserving method with several adaptations to t-SNE, such as a cost function which attempts to preserve global structure (McInnes and others, 2020).
While t-SNE or UMAP may suffice for data visualization, those embeddings only offer visual information on dissimilarity between samples and lose the interpretability that the iDA embedding offers. The t-SNE and UMAP embeddings do not provide weights of the features which define the differences between samples, which makes them unable to explain which features drive cluster separation or to define features that may be used as covariates in downstream analysis. In contrast, the SVD of $\hat{\Sigma}_w^{-1} \hat{\Sigma}_b$ yields a weight matrix of genes by linear discriminants. These gene weights allow for the interpretation of the specific features which separate latent clusters.
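For example, with a `MASS::lda` fit like the one in the sketch from Section 2.1 (`fit` is our hypothetical object name, not part of the released package), the gene weights can be inspected directly:

```r
# Gene-by-discriminant weight matrix from an lda() fit (illustrative):
weights <- fit$scaling                                # rows = genes, cols = LDs
head(sort(abs(weights[, "LD1"]), decreasing = TRUE))  # top genes driving LD1
```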
With these interpretable embeddings, we can visualize the separation of the clusters in each dimension of the embedding and validate these clusters using known properties of each group. For instance, to validate the putative cell types found in the scRNA-seq data set, we compare the expression of known markers to the LDs which separate the identified cell type clusters. The cell type clusters identified by the iDA algorithm show the corresponding levels of expression for known markers of these cell types. The iDA embedding and clustering are also sensitive to lowly abundant cell types. In PBMCs, lymphocytes (T cells, B cells, and NK cells) make up the majority of the cell population, monocytes are the next most abundant, and dendritic cells and platelets are the most rare (Kleiveland, 2015). In the iDA embedding of the PBMC data set, the least abundant platelet cell population is the cluster best separated from the others (in LD1), and the dendritic cell population is captured in LD4 (Figure 3). Therefore, iDA will identify rare populations that are sufficiently separated.
Expression is known to reflect population structure because of eQTL (Brown and others, 2018), so it is essential to identify this source of variation as a covariate when it is not the desired response variable. In the Geuvadis data set, among the 116 SNPs which are in eQTL only in YRI individuals, the average difference between the YRI MAF and the EUR populations' MAF is a substantial 0.175. This means that, on average, SNPs in eQTL in the YRI population but not in the EUR populations have an allele frequency difference of 17.5% between populations (Table S4 of the Supplementary material available at Biostatistics online). Since these SNPs are in eQTL only in the YRI population, they affect the expression of their paired genes much differently than in other populations. The population-dependent differences in MAF mean that the expression of genes in eQTL is affected by population structure as well, as seen in Figure 4, causing a latent clustering of expression. The genes in eQTL with population-dependent alleles are recapitulated in the separation of populations by the LD as well as by gene expression.
When comparing the embeddings produced by iDA and PCA, the cumulative F-statistic shows that the iDA supervised embedding outperforms the other methods in both data sets with respect to the known cell types. The iDA unsupervised embedding outperforms its PCA counterpart in 10X-5cl and does just as well as the PCA embedding in 10X-3cl (Figure S7B of the Supplementary material available at Biostatistics online).
In instances where the latent structure arises from discrete sources of variation, such as batch effects, population stratification, and discrete cell types, iDA is able to capture these clusters well in the resulting embedding and clustering. Because iDA leverages LDA to find the embedding, it is not as effective at finding an accurate embedding when the latent structure is continuous, such as in developmental biology. In data sets where cell types are differentiating and form a continuous trajectory, iDA may find natural or unnatural breaks in the trajectories, and the resulting embedding will impose a more discrete structure than actually exists. Because of this, iDA is most appropriate for data sets where the latent structure is thought to consist of relatively discrete clusters.
The “pseudo cell” mixture data sets from the CellBench experiments, which include 9 cell combinations of the same 3 cell lines yielding 34 unique groups with a continuous structure, illustrate the case where cell lines are not discretely separated from each other (Figure S7C of the Supplementary material available at Biostatistics online). The cumulative F-statistic shows that even with continuous, trajectory-like structure between mixes of different cell type compositions, iDA unsupervised (no number of clusters given as input) and iDA supervised (the estimated number of latent clusters given as input) either outperform or match the performance of a PCA embedding with respect to the ratio of between- to within-cluster separation across cell type groupings.
In standard preprocessing using PCA, classes which are distinctly defined but have low variation will not be separated along the axes of maximum variation. We have shown that iDA is sensitive to classes with unequal within-class variation, that the iDA embedding better separates clusters, and that it identifies class assignments which determine a more optimal clustering in the original data space, as evidenced by the DI. The iDA embedding can be evaluated to determine which features contribute most to class separation, and its dimensions can be used as covariates in downstream analysis to correct for latent clustering.
Acknowledgments
Conflict of Interest: None declared.
Contributor Information
Theresa A Alexander, Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA.
Rafael A Irizarry, Biostatistics and Computational Biology, Dana Farber Cancer Institute and Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA.
Héctor Corrada Bravo, Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA and Data Science and Statistical Computing, Genentech, Inc. South San Francisco, CA 94080, USA.
Software
R (http://www.r-project.org) code for iDA is available at https://github.com/hcbravolab/iDA; analysis code available at https://github.com/reesyxan/iDA/tree/master/scripts.
Supplementary Material
Supplementary material is available at http://biostatistics.oxfordjournals.org.
Funding
NSF Training (DGE-1632976 to T.A.) and NIH (R01 HG005220 and R35 GM131802 to R.I.) and (R01 GM114267 to H.C.B.), in part.
References
- Blondel, V. D., Guillaume, J.-L., Lambiotte, R. and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008, P10008.
- Brown, B. C., Bray, N. L. and Pachter, L. (2018). Expression reflects population structure. PLoS Genetics 14, e1007841.
- Ding, C. and He, X. (2004). K-means clustering via principal component analysis. In: Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04). New York: Association for Computing Machinery, 29.
- Kleiveland, C. R. (2015). Peripheral blood mononuclear cells. In: Verhoeckx, K., Cotter, P., López-Expósito, I. and others (editors), The Impact of Food Bioactives on Health: In Vitro and Ex Vivo Models. Cham: Springer, Chapter 15.
- Lappalainen, T., Sammeth, M., Friedländer, M., Hoen, P., Monlong, J., Rivas, M., González-Porta, M., Kurbatova, N., Griebel, T., Ferreira, P. and others. (2013). Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511.
- Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E., Geman, D., Baggerly, K. and Irizarry, R. A. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11, 733–739.
- Leek, J. T. and Storey, J. D. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics 3, e161.
- Lever, J., Krzywinski, M. and Altman, N. (2017). Principal component analysis. Nature Methods 14, 641–642.
- Malakar, G. P. (2017). Linear discriminant analysis (LDA) vs principal component analysis (PCA). https://www.youtube.com/watch?v=M4HpyJHPYBY.
- McInnes, L., Healy, J. and Melville, J. (2020). UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
- Mersha, T. B. and Abebe, T. (2015). Self-reported race/ethnicity in the age of genomic research: its potential impact on understanding health disparities. Human Genomics 9, 1–15.
- Pons, P. and Latapy, M. (2005). Computing communities in large networks using random walks. Berlin, Heidelberg: Springer.
- Tian, L., Dong, X., Freytag, S., Cao, K., Su, S., JalalAbadi, A., Amann-Zalcenstein, D., Weber, T., Seidi, A., Jabbari, J. and others. (2019). Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nature Methods 16, 479–487.
- Townes, F. W., Hicks, S. C., Aryee, M. J. and Irizarry, R. A. (2019). Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biology 20, 295.
- van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605.
- Zheng, G. X. Y., Terry, J. M., Belgrader, P., Ryvkin, P., Bent, Z. W., Wilson, R., Ziraldo, S. B., Wheeler, T. D., McDermott, G. P., Zhu, J. and others. (2017). Massively parallel digital transcriptional profiling of single cells. Nature Communications 8, 14049.