Biostatistics (Oxford, England). 2022 Sep 5;25(1):188–202. doi: 10.1093/biostatistics/kxac035

A scalable and unbiased discordance metric with H+

Nathan Dyjack, Daniel N Baker, Vladimir Braverman, Ben Langmead, Stephanie C Hicks
PMCID: PMC10724244  PMID: 36063544

Summary

A standard unsupervised analysis is to cluster observations into discrete groups using a dissimilarity measure, such as Euclidean distance. If there does not exist a ground-truth label for each observation necessary for external validity metrics, then internal validity metrics, such as the tightness or separation of the clusters, are often used. However, the interpretation of these internal metrics can be problematic when using different dissimilarity measures as they have different magnitudes and ranges of values that they span. To address this problem, previous work introduced the “scale-agnostic” $G_{+}$ discordance metric; however, this internal metric is slow to calculate for large data. Furthermore, in the setting of unsupervised clustering with $k$ groups, we show that $G_{+}$ varies as a function of the proportion of observations assigned to each of the groups (or clusters), referred to as the group balance, which is an undesirable property. To address this problem, we propose a modification of $G_{+}$, referred to as $H_{+}$, and demonstrate that $H_{+}$ does not vary as a function of group balance using a simulation study and with public single-cell RNA-sequencing data. Finally, we provide scalable approaches to estimate $H_{+}$, which are available in the fasthplus R package.

Keywords: Clustering, Discordance, Dissimilarity, Single cell

1. Introduction

Quantifications of discordance such as Gamma (Goodman and Kruskal, 1979) and Tau (Kendall, 1938) have historically been derived to assess fitness from contingency tables. (The terms “discordance” and “disconcordance” have been used interchangeably to describe related metrics for contingency tables (Rohlf, 1974; Goodman and Kruskal, 1979), but here we use “discordance.”)

In this article, we explore the problem of unsupervised clustering (also known as observation partitioning). A typical clustering algorithm seeks to optimally group $n$ observations into $k$ groups (or clusters) using a dissimilarity matrix $D$ (e.g., Euclidean distance), with entries $d_{ij}$ for each pair of observations $i, j \in \{1, \ldots, n\}$ and $N_d = \binom{n}{2}$ unique pairs of distances. If there does not exist a ground-truth label for each observation, internal validity metrics are often used to evaluate the performance of a set of predicted cluster labels $L = (l_1, \ldots, l_n)$ for a fixed $k$. Many internal fitness metrics quantify the tightness or separation of partitions with functions such as within-cluster sums of squares or mean Silhouette scores (Rousseeuw, 1987). However, when comparing multiple dissimilarity measures, the interpretation of these performance metrics can be problematic as different dissimilarity measures have different magnitudes and ranges, leading to different ranges in the tightness of the clusters.

One solution is to use discordance as an internal validity metric that depends on the ranks of the dissimilarities, rather than on the dissimilarities themselves, thereby making it “scale-agnostic.” For example, the discordance metric $G_{+}$ (Williams and Clifford, 1971; Rohlf, 1974) uses the following quantity to assess how well a given predicted cluster label $L$ fits a dissimilarity $D$ induced from the same observations (Rohlf, 1974; Desgraupes, 2018) (Note 1 of the Supplementary material available at Biostatistics online):

$$s = \sum_{d_w \in D_W} \sum_{d_b \in D_B} \mathbb{1}[d_w > d_b] \tag{1.1}$$

Given fixed $k$, an adjacency matrix $A$ is defined using the predicted cluster label $L$, with $A_{ij} \in \{0, 1\}$ for the $i, j \in \{1, \ldots, n\}$ observations, where $A_{ij} = 1$ if $l_i = l_j$ or $A_{ij} = 0$ otherwise. We can define the set of within-cluster distances as $D_W = \{d_{ij} : A_{ij} = 1, i < j\}$ and between-cluster distances as $D_B = \{d_{ij} : A_{ij} = 0, i < j\}$, with the total number of distances in each set as $|D_W|$ and $|D_B|$, respectively. Because each upper triangular entry of $A$ is binary (every distance is either between- or within-cluster), the total number of unique pairwise distances is $N_d = |D_W| + |D_B|$. Here, we define $\alpha$ as the proportion of total distances $N_d$ that are within-cluster distances, or $\alpha = |D_W| / N_d$.
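As an illustration only, the quantities defined above can be sketched in Python for a toy data set. The function names `split_distances` and `count_discordant` and the one-dimensional example data are assumptions made for this sketch and are not taken from the fasthplus package. The sketch splits the unique pairwise distances into within- and between-cluster sets and counts, by brute force, the pairs in which a within-cluster distance is strictly greater than a between-cluster distance.

```python
from itertools import combinations


def split_distances(points, labels, dist):
    """Split all unique pairwise distances into within- and between-cluster lists."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        # A pair is "within" when both observations share a label.
        (within if labels[i] == labels[j] else between).append(d)
    return within, between


def count_discordant(within, between):
    """s: number of (within, between) pairs whose within distance is strictly larger."""
    return sum(1 for dw in within for db in between if dw > db)


# Toy data (hypothetical): two tight, well-separated groups on a line.
points = [0.0, 1.0, 5.0, 6.0]
labels = [0, 0, 1, 1]
within, between = split_distances(points, labels, lambda a, b: abs(a - b))
s = count_discordant(within, between)
alpha = len(within) / (len(within) + len(between))
print(s, alpha)
```

The double loop in `count_discordant` makes the cost of the exact count explicit: it grows with the product of the sizes of the two distance sets.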

In the following sections, we first consider properties of $G_{+}$ and show how $G_{+}$ is a function of $\alpha$ (Section 2), which has an explicit relationship with what we refer to as $b$ (the group balance, Section 2.4.1), where $b = (b_1, \ldots, b_k)$ is the proportion of observations assigned to each of $k$ groups (or clusters) and $\sum_{i=1}^{k} b_i = 1$. We illustrate why it is an undesirable property for $G_{+}$ to vary as a function of $\alpha$, and thereby also of the vector $b$ and $k$. For example, when simulating “null” data (random Gaussian data with no mean difference between $k = 2$ groups), the expected mean (and the interpretation itself) of the $G_{+}$ discordance metric varies depending on $b$ (e.g., if the groups are balanced or $b = (0.5, 0.5)$, then Inline graphic, but if the groups are imbalanced, such as $b = (0.9, 0.1)$, then Inline graphic using simulated data) (Figure 1). In addition, we demonstrate that $G_{+}$ is slow to calculate for large data (due to the pairwise comparisons of dissimilarities in (1.1)). To ameliorate these challenges, we propose a modification to $G_{+}$, referred to as $H_{+}$ (Section 3), and demonstrate that $H_{+}$ does not vary as a function of group balance using a simulation study and with public single-cell RNA-sequencing (scRNA-seq) data (Section 4). Finally, we provide scalable approaches to estimate $H_{+}$, which are available in the fasthplus R package.

Fig. 1.


The $G_{+}$ discordance metric varies as a function of $\alpha$ (proportion of within-cluster distances), which is a function of the group balance. We randomly sampled $n$ = 1000 observations with 500 features from a mixture distribution Inline graphic with Inline graphic being the probability of an observation coming from Inline graphic and Inline graphic coming from Inline graphic with (a,b) no mean difference (Inline graphic) (or a “null” setting), (c,d) a small mean difference (Inline graphic), and (e,f) a large mean difference (Inline graphic). We simulate data with (a,c,e) balanced groups (Inline graphic = 0.5) and (b,d,f) imbalanced groups (Inline graphic = 0.9). For each simulation, the top row contains observations belonging to a group (Inline graphic and Inline graphic) along the first two principal components (PCs) and the bottom row contains histograms of the within- ($D_W$) and between- ($D_B$) cluster distances (Euclidean) for the balanced and imbalanced groups. Refer to Figure S1 of the Supplementary material available at Biostatistics online for an illustration of (and Section 2.4.1 for the explicit relationship between) the proportion of within-cluster distances ($\alpha$) and the group balance ($b$). For each simulation, the bottom row includes $\alpha$ and the two discordance metrics $G_{+}$ and $H_{+}$. Generally, values close to zero represent more concordance, while larger values represent more discordance.

2. The $G_{+}$ discordance metric

The discordance metric $G_{+}$ (Williams and Clifford, 1971; Rohlf, 1974) scales $s$ (1.1) by $N_d (N_d - 1)/2$, the number of ways to compare each of the $N_d$ unique distances to every other:

$$G_{+} = \frac{s}{N_d (N_d - 1)/2} = \frac{2s}{N_d (N_d - 1)} \tag{2.2}$$

Generally, $G_{+}$ close to zero represents high concordance, while a larger $G_{+}$ is more discordant. In this way, $G_{+}$ can be used to quantify the cluster fitness for a given $D$ and $L$ (that is, a designation of each pairwise dissimilarity as within- or between-cluster), where a small $G_{+}$ value would be interpreted as good performance with tight, separate clusters.
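A minimal sketch of this scaling, assuming the within- and between-cluster distances have already been collected into two lists: the count of discordant (within, between) pairs is divided by the number of ways to compare all unique distances with one another. The function name `g_plus` and the toy inputs are hypothetical.

```python
def g_plus(within, between):
    """G+ = 2s / (N_d (N_d - 1)), with s counted by brute force."""
    n_d = len(within) + len(between)  # total number of unique pairwise distances
    s = sum(1 for dw in within for db in between if dw > db)
    return 2.0 * s / (n_d * (n_d - 1))


# Perfectly separated toy clusters: no within distance exceeds a between distance.
print(g_plus([1.0, 1.0], [4.0, 5.0, 5.0, 6.0]))
# Overlapping toy distances: some within distances exceed between distances.
print(g_plus([3.0, 1.0], [2.0, 2.0]))
```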

2.1. Applications of $G_{+}$

As noted above, if $D$ is fixed, smaller values of $G_{+}$ among many sets of labels $L$ indicate increased cluster fitness (that is, the generated labels with smaller $G_{+}$ more accurately describe the dissimilarity structure of the data) (Rand, 1971; Williams and Clifford, 1971). If we instead fix $L$, we can also use $G_{+}$ to assess the fitness of multiple dissimilarity matrices $D$ (Rohlf, 1974).

Because $G_{+}$ depends on the relative rankings of pairwise distances, this transformation enables a “scale-agnostic” approach to compare dissimilarity measures through the structure they impose on the data, rather than by the exact values of the distances themselves. This allows distances on varying scales to be compared without imposing bias from the expected magnitude of the distances.

2.2. Properties of $s$

Consider $s$ (1.1) with an adjacency matrix $A$ and dissimilarity matrix $D$, with induced within-cluster distances $D_W$ and between-cluster distances $D_B$ where $|D_W| + |D_B| = N_d$. We can define $\alpha$ as the proportion of total distances $N_d$ that are within-cluster distances. In this way, $|D_W| = \alpha N_d$, and similarly, $|D_B| = (1 - \alpha) N_d$. Then, conditional on Inline graphic and Inline graphic, the expected value of $s$ is (Note 1 of the Supplementary material available at Biostatistics online):

$$E[s] = |D_W| \, |D_B| \, P(d_w > d_b) = \alpha (1 - \alpha) N_d^2 \, P(d_w > d_b) \tag{2.3}$$

where $P(d_w > d_b)$ is the probability that a within-cluster distance $d_w$ is greater than a between-cluster distance $d_b$. This is the quantity we are interested in estimating, but there is a scaling factor $\alpha (1 - \alpha) N_d^2$ that depends on both $\alpha$ and $N_d$. Next, we consider properties of first $P(d_w > d_b)$ and then $G_{+}$.

2.3. Properties of Inline graphic

If we know the expected mean and variance for Inline graphic and Inline graphic, we can estimate Inline graphic. In the simple case where Inline graphic, we can consider Inline graphic, then Inline graphic and a standardization of Inline graphic demonstrates (assuming [co]variances exist) that Inline graphic. As we might expect, there is a Inline graphic chance that Inline graphic when Inline graphic.

2.4. Properties of $G_{+}$

Using (2.2) and (2.3), the expected value of $G_{+}$ is (Note 1 of the Supplementary material available at Biostatistics online):

$$E[G_{+}] = \frac{2 E[s]}{N_d (N_d - 1)} = 2 \alpha (1 - \alpha) \, P(d_w > d_b) \, \frac{N_d}{N_d - 1}$$

where $\alpha$ is the proportion of the $N_d$ unique distances that are within-cluster and $P(d_w > d_b)$ is the probability that a within-cluster distance exceeds a between-cluster distance. As $N_d / (N_d - 1) \approx 1$ for large enough $n$, we see that $G_{+}$ is a function of $\alpha$. Next, we derive the relationship between $\alpha$ and $b$ (group balance) (Section 2.4.1). Then, we provide an illustration of how $G_{+}$ varying as a function of $\alpha$ and $b$ is an undesirable property (Section 2.4.2).

2.4.1. Relationship between $\alpha$ and $b$

Herein, we derive the relationship between $\alpha$ (the proportion of total distances $N_d$ that are within-cluster distances) and the group balance $b$ (the proportion of observations assigned to each of the $k$ groups). For an arbitrary label $L$ (a vector of length $n$, $L = (l_1, \ldots, l_n)$) where $l_j = i$ indicates that the $j$th observation is assigned membership to the $i$th cluster group, we can define the proportion of observations in group $i$ using $b_i$, defined as

$$b_i = \frac{1}{n} \sum_{j=1}^{n} \mathbb{1}[l_j = i]$$

By definition, we know $\sum_{i=1}^{k} b_i = 1$. Each of the $k$ clusters will contribute to the quantity $\alpha$, which is a fraction of the $N_d = n(n-1)/2$ unique pairs of distances. Now, for the $i$th cluster, this contribution ($\alpha_i$) is the number of upper triangular elements of a matrix block with size $b_i n \times b_i n$, expressed as a fraction of $N_d$:

$$\alpha_i = \frac{b_i n (b_i n - 1)/2}{N_d} = \frac{b_i n (b_i n - 1)}{n (n - 1)}$$

Finally, we can express $\alpha$ as a sum over each of the $k$ contributions $\alpha_i$ for $i = 1, \ldots, k$, which gives the explicit relationship between $\alpha$ and the $b_i$s (and consequently $k$):

$$\alpha = \sum_{i=1}^{k} \alpha_i = \sum_{i=1}^{k} \frac{b_i (b_i n - 1)}{n - 1} \tag{2.4}$$
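To make the relationship concrete, the following sketch computes the proportion of within-cluster distances in two ways: by counting the within-cluster pairs contributed by each cluster, and from the group proportions directly. It assumes each cluster of size m contributes m(m - 1)/2 within-cluster pairs out of n(n - 1)/2 unique pairs in total; the function names are hypothetical.

```python
def alpha_from_sizes(sizes):
    """Proportion of unique pairwise distances that fall within a cluster."""
    n = sum(sizes)
    n_d = n * (n - 1) // 2                             # all unique pairs
    n_within = sum(m * (m - 1) // 2 for m in sizes)    # pairs inside each cluster
    return n_within / n_d


def alpha_from_balance(b, n):
    """Same quantity written in terms of the group proportions b_i."""
    return sum(b_i * (b_i * n - 1) / (n - 1) for b_i in b)


# n = 1000 observations split into two groups.
print(alpha_from_sizes([500, 500]))   # balanced groups
print(alpha_from_sizes([900, 100]))   # imbalanced groups
print(alpha_from_balance([0.9, 0.1], 1000))
```

With 900 and 100 observations the result rounds to 0.82, so most distances become within-cluster distances purely because of the imbalance.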

2.4.2. $G_{+}$ as a function of $\alpha$ and $b$ is an undesirable property

Because $G_{+}$ is a function of $\alpha$ and thereby the group balance $b$ (and consequently $k$), the interpretation of what we expect $G_{+}$ to mean, for example, in a null setting without any true difference between groups, changes across data sets with different group balances $b$.

For example, assume we randomly sampled Inline graphic = 1000 observations with 500 features from a mixture distribution Inline graphic with no mean difference (Inline graphic) and balanced classes (Inline graphic = 0.5 and Inline graphic = 0.5) then, we know Inline graphic and Inline graphic. This can be thought of as a “null” simulation where we expect no difference in class character or balance, yet Inline graphic will (perhaps unintuitively) equal Inline graphic. However, if there is an imbalance in class sizes (Inline graphic = 0.9 and Inline graphic = 0.82) then Inline graphic (Figure 1). An illustration of the relationship between Inline graphic and Inline graphic for this example can be seen in Figure S1A,B of the Supplementary material available at Biostatistics online, which shifts the majority of the distances to within-cluster distances simply due to the imbalance of the classes.
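The effect of imbalance in a null setting can be worked through numerically. The sketch below is a back-of-the-envelope calculation, not a reproduction of the simulation: it assumes that each (within, between) pair of distances is discordant with the same probability `p_discordant`, so that the expected count of discordant pairs is the number of such pairs times that probability, and it then applies the scaling used by the discordance metric of Section 2. The function name and the value 0.5 for the null setting are assumptions of this sketch.

```python
def expected_g_plus(sizes, p_discordant):
    """Expected G+ when every (within, between) pair is discordant with equal probability."""
    n = sum(sizes)
    n_d = n * (n - 1) // 2                             # all unique pairs of observations
    n_within = sum(m * (m - 1) // 2 for m in sizes)    # within-cluster distances
    n_between = n_d - n_within                         # between-cluster distances
    expected_s = n_within * n_between * p_discordant
    return 2.0 * expected_s / (n_d * (n_d - 1))


# Null setting: a within distance is as likely as not to exceed a between distance.
balanced = expected_g_plus([500, 500], 0.5)
imbalanced = expected_g_plus([900, 100], 0.5)
print(round(balanced, 3), round(imbalanced, 3))
```

Under these assumptions the same null probability of 0.5 maps to different expected values of the metric for balanced and imbalanced groups, which is the dependence on group balance described above.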

However, if we consider the same scenario as above but change $k$ from Inline graphic to Inline graphic, the larger number of groups changes $\alpha$ (the proportion of within-cluster distances) for both the balanced (Figure S1C of the Supplementary material available at Biostatistics online) and imbalanced simulations (Figure S1D of the Supplementary material available at Biostatistics online).

3. The proposed method

3.1. An unbiased discordance metric with $H_{+}$

To ameliorate this effect, we propose $H_{+}$, which replaces the scaling factor $N_d (N_d - 1)/2$ in the denominator of $G_{+}$ with $|D_W| \, |D_B|$:

$$H_{+} = \frac{s}{|D_W| \, |D_B|} \tag{3.5}$$

In other words, instead of scaling $s$ by the total number of ways to compare every distance to every other distance, we divide by the number of ways to compare within-cluster distances to between-cluster distances. Hence, $H_{+}$ is not a function of $\alpha$:

$$E[H_{+}] = \frac{E[s]}{|D_W| \, |D_B|} = P(d_w > d_b)$$

where $P(d_w > d_b)$ is the probability that a within-cluster distance $d_w$ is greater than a between-cluster distance $d_b$. In fact, we can empirically verify that while $G_{+}$ varies as a function of $\alpha$ (and $b$) (Figure 2a), $H_{+}$ does not (Figure 2b), regardless of the difference in expectation between the groups.
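A small sketch of the proposed scaling, assuming the within- and between-cluster distances are available as two lists. Instead of a double loop, it sorts the between-cluster distances once and uses binary search to count how many are strictly smaller than each within-cluster distance; this is an implementation choice for the sketch (the name `h_plus` is hypothetical), not a description of the fasthplus internals.

```python
from bisect import bisect_left


def h_plus(within, between):
    """H+ = s / (|D_W| * |D_B|), with s counted via sorting and binary search."""
    sorted_between = sorted(between)
    # bisect_left returns how many between distances are strictly less than dw.
    s = sum(bisect_left(sorted_between, dw) for dw in within)
    return s / (len(within) * len(between))


within = [3.0, 1.0]
between = [2.0, 2.0]
print(h_plus(within, between))
# Replicating the within distances changes the proportion of within-cluster
# distances but leaves the fraction of discordant pairs unchanged.
print(h_plus(within * 5, between))
```

Both calls return the same value, because the estimate depends only on how the two sets of distances are ordered relative to each other and not on their relative sizes.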

Fig. 2.


The $H_{+}$ discordance metric does not change as a function of class balance. We randomly sampled $n$ = 1000 observations with 500 features from a mixture distribution Inline graphic with Inline graphic being the probability of an observation coming from Inline graphic and Inline graphic coming from Inline graphic with a true mean difference (Inline graphic) (Inline graphic-axis). Along the Inline graphic-axis we change group (or class) balance from balanced (e.g., Inline graphic = 0.50) to imbalanced (e.g., Inline graphic = 0.05) groups. The plots are heatmaps of true $G_{+}$ (left) and $H_{+}$ (right) discordance metrics, which show that $H_{+}$ does not change as a function of class balance (Inline graphic-axis), only as a function of the true effect size (Inline graphic-axis).

3.2. Generalizing properties of Inline graphic

More generally, consider the function Inline graphic. For some constant Inline graphic, we can decompose this event as a joint event Inline graphic or Inline graphic (Jardine and Sibson, 1968; Rohlf, 1974). Therefore, as Inline graphic, we can decompose Inline graphic into two quantities: Inline graphic where Inline graphic and Inline graphic. In other words, Inline graphic empirically states a Inline graphic of Inline graphic is strictly greater than Inline graphic of Inline graphic. This implies Inline graphic is not uniquely determined. For example, if Inline graphic, we could have Inline graphic or Inline graphic. It should be noted that one can construct examples where two distinct pairs of Inline graphic will have the same product, but do not imply each other.

3.3. Two algorithms to estimate Inline graphic and Inline graphic, Inline graphic

One problem with the $G_{+}$ (and $H_{+}$) discordance metric (3.5) is that it requires the calculation of both (i) the dissimilarity matrix $D$, which scales as $O(n^2)$, and (ii) $s$ (1.1), which scales with the number of ways to compare within-cluster distances to between-cluster distances (or $|D_W| \, |D_B|$ comparisons). For example, with data sets of sizes $n$ = 100 and 500, it takes 0.01 and 0.22 s, respectively, to calculate $D$ and it takes 0.08 and 59.68 s, respectively, to calculate $s$ (Figure 3a, Table S1 of the Supplementary material available at Biostatistics online). For data sets with more than $n$ = 500 observations, this quickly becomes computationally infeasible.

Fig. 3.


Computation times (seconds) for exact and approximate $H_{+}$ calculations as a function of increasing number of observations $n$. Computational time ($y$-axis) as a function of observations ($x$-axis) to calculate the individual components of (a) exact $H_{+}$ including (i) the dissimilarity matrix $D$ (purple) scaling Inline graphic, (ii) the adjacency matrix (orange), and (iii) the most expensive operation $s$ (pink) scaling Inline graphic. Note that $s$ is only shown for $n$ = 100 and 500 observations, but the trend is shaded in for the other observations. The diagonal line between (a) and (b) connects the 20-s ticks of these two axes. (b) Approximate $H_{+}$ estimation (HPE) using the grid search procedure including (i) the dissimilarity matrix $D$ (blue) scaling Inline graphic and (ii) the HPE algorithm to estimate Inline graphic (orange) scaling Inline graphic; (c) approximate $H_{+}$ estimation using the bootstrap procedure (HPB) (purple), which scales similarly to HPE without the computational expense required for calculating $D$. Note that (b) and (c) have a different $y$-axis scale than (a) for a zoomed-in visualization of time.

To address this, we propose two algorithms to estimate $H_{+}$, both referred to as an “h-plus estimator” (HPE): (i) a brute force approach inspired by the Top-Scoring Pair (Leek, 2009; Magis and Price, 2012) algorithms, which use relative ranks to classify observations, with $O(p^2)$ comparisons and (ii) a grid search approach with $O(p)$ comparisons, where $p$ refers to percentiles of the data (rather than the $n$ observations themselves). Typically, $p$ is chosen such that Inline graphic, leading to significant improvements in the computational speed to calculate $H_{+}$. Specifically, both algorithms estimate $H_{+}$ (referred to as Inline graphic or HPE) assuming $D$ has been precalculated and provide faster ways to approximate $s$ (Figure 3b). Both algorithms are implemented in the hpe() function in the fasthplus R package.

Finally, in a later section (Section 3.5), we introduce a third algorithm based on bootstrap sampling to avoid calculating the full dissimilarity matrix Inline graphic, thereby leading to further improvements in computational speed to estimate Inline graphic (referred to as Inline graphic or HPB) (Figure 3c). The bootstrap algorithm is implemented in the hpb() function in the fasthplus R package.

3.3.1. Intuition behind HPE algorithms

The estimator Inline graphic (or HPE) assumes Inline graphic has been precalculated and then provides faster ways to approximate Inline graphic (the pairwise comparisons of Inline graphic and Inline graphic). Specifically, we let the two sets Inline graphic and Inline graphic represent the ordered (ascending) dissimilarities Inline graphic and Inline graphic, respectively. Then, we bin the sets Inline graphic and Inline graphic into Inline graphic percentiles where Inline graphic and Inline graphic are the percentiles for Inline graphic. Note, Inline graphic and Inline graphic. In both algorithms below, we check if Inline graphic, then Inline graphic, and similarly, if Inline graphic then Inline graphic.

Next, we provide a graphical intuition for the two HPE algorithms by performing a simulation study. First, we simulate observations from two Gaussian distributions, namely Inline graphic and Inline graphic and calculate the quantiles Inline graphic and Inline graphic for each of the sets with Inline graphic (Figure 4a), Inline graphic (Figure 4b), and Inline graphic (Figure 4c). The calculation of these quantiles seeks to approximate the true ordered inequality information for each Inline graphic and Inline graphic. That is, if Inline graphic were both given in ascending order, the white line in Figure 4 shows the percent of Inline graphic that is strictly less than each Inline graphic. The true Inline graphic is then given by the area under the white curve (the true rank orderings for each pair). Our goal is to use the following two algorithms to estimate the true Inline graphic (fraction of blue area in the grid).

Fig. 4.


Graphical representation of two HPE algorithms to estimate Inline graphic. We simulate observations from two Gaussian distributions, namely Inline graphic and Inline graphic and calculate the quantiles Inline graphic and Inline graphic for each of the sets with (a) Inline graphic, (b) Inline graphic, and (c) Inline graphic. The white curve represents the percent of elements in Inline graphic that are strictly less than each element in Inline graphic. The goal is to estimate the true Inline graphic (area under the white curve) using one of two HPE algorithms. The brute force approach (HPE algorithm 1) uses Riemann integration to approximate the white curve by summing the area of the blue squares below the curve. The grid search approach (HPE algorithm 2) starts at the minimum of Inline graphic and Inline graphic and moves along the red–blue border to approximate the white curve (path followed represents the squares with the light blue borders). The HPE contour Inline graphic (or estimate of Inline graphic) Inline graphic is given by yellow-bordered squares. In other words, every pair Inline graphic such that Inline graphic, the interval guaranteed to contain Inline graphic. The intersection of this yellow contour (Inline graphic) and blue contour (grids visited by HPE algorithm 2) are the green-bordered squares, which represents the numerical estimate for Inline graphic and Inline graphic.

3.3.2. HPE algorithm 1: Inline graphic (brute force)

Algorithm 1 numerically approximates $H_{+}$ with Riemann integration. Specifically, using a double for loop with $p^2$ comparisons, this brute force approach sums the area of the squares that are blue in Figure 4, resulting in an algorithm on the order of $O(p^2)$. The path taken by our implementation of this algorithm is given by the squares with light blue borders, and the contour corresponding to the true $H_{+}$ is (approximately) represented by the squares with yellow outlines (Figure 4).

Algorithm 1

Inline graphic (brute force)

  • 1. Inline graphic

  • 2. for Inline graphic do

  • 3.       for Inline graphic do

  • 4.             Inline graphic

  • 5.       end for

  • 6. end for

  • 7. Inline graphic
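As a hedged illustration of the brute force idea, the sketch below assumes the estimate is the fraction of (within-percentile, between-percentile) pairs in which the within-cluster percentile is strictly larger, evaluated with a double loop over the two sets of percentiles. The helper `percentiles` (nearest-rank order statistics), the default `p = 100`, and the function names are choices made for this sketch and may differ from the hpe() implementation in fasthplus.

```python
def percentiles(values, p):
    """p + 1 evenly spaced order statistics, from the minimum to the maximum."""
    xs = sorted(values)
    last = len(xs) - 1
    return [xs[round(i * last / p)] for i in range(p + 1)]


def hpe_brute_force(within, between, p=100):
    """Estimate H+ by comparing every within percentile with every between percentile."""
    qw = percentiles(within, p)
    qb = percentiles(between, p)
    hits = 0
    for a in qw:            # outer loop over within-cluster percentiles
        for b in qb:        # inner loop over between-cluster percentiles
            if a > b:
                hits += 1
    return hits / (len(qw) * len(qb))


# Hypothetical distances: the two sets interleave, so about half the pairs are discordant.
within = [float(v) for v in range(1000)]
between = [v + 0.5 for v in range(1000)]
print(hpe_brute_force(within, between, p=100))
```

The number of comparisons depends only on the number of percentiles, not on the number of distances, which is where the saving over the exact count comes from.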

3.3.3. HPE algorithm 2: Inline graphic (grid search)

An alternative and faster approach (on the order of Inline graphic comparisons) is to sketch the surface (blue–red border) that defines Inline graphic. By starting at the minimum of Inline graphic and Inline graphic, Algorithm 2 moves along the blue–red border that defines Inline graphic using grid search to determine whether to increase Inline graphic or Inline graphic with each iteration.

Algorithm 2

Inline graphic (grid search)

  •  1. Inline graphic

  •  2. Inline graphic

  •  3. Inline graphic

  •  4. Inline graphic

  •  5. while Inline graphic and Inline graphic do

  •  6.      Inline graphic

  •  7.      if Inline graphic then

  •  8.        Inline graphic

  •  9.      else

  • 10.        Inline graphic

  • 11.        Inline graphic

  • 12.      end if

  • 13. end while

  • 14. Inline graphic
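One way to sketch the border-following idea in Python is a two-pointer walk over the sorted percentiles: because both sets of percentiles are in ascending order, the position of the border can only move forward as the within-cluster percentile increases, so the total work is linear in the number of percentiles. This is an assumed reading of the grid search, written as a for loop with an inner pointer rather than the single while loop of Algorithm 2; the helper `percentiles` and the function name are hypothetical.

```python
def percentiles(values, p):
    """p + 1 evenly spaced order statistics, from the minimum to the maximum."""
    xs = sorted(values)
    last = len(xs) - 1
    return [xs[round(i * last / p)] for i in range(p + 1)]


def hpe_grid_search(within, between, p=100):
    """Estimate H+ by walking once along the border between the two outcomes."""
    qw = percentiles(within, p)
    qb = percentiles(between, p)
    hits = 0
    j = 0  # number of between percentiles strictly below the current within percentile
    for a in qw:
        # The border never moves backwards because qw and qb are ascending.
        while j < len(qb) and qb[j] < a:
            j += 1
        hits += j
    return hits / (len(qw) * len(qb))


# Hypothetical distances, interleaved so that about half the pairs are discordant.
within = [float(v) for v in range(1000)]
between = [v + 0.5 for v in range(1000)]
print(hpe_grid_search(within, between, p=100))
```

The pointer `j` advances at most once per between-cluster percentile over the whole run, so the walk visits on the order of 2p grid cells rather than all p squared of them.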

3.4. Convergence of HPE algorithms 1 and 2

Next, we provide a numerical bound for the accuracy of Inline graphic for both the brute force and grid search approaches. For each Inline graphic, Inline graphic, HPE algorithm 2 (and intrinsically algorithm 1) ascertains one of the following:

graphic file with name Equation10.gif (3.6)

In (1), we have confirmed that Inline graphic of Inline graphic are less than Inline graphic of Inline graphic and the Inline graphic addition to the numerical integral will be zero, that is, Inline graphic in HPE algorithm 2. In (3), we see that Inline graphic of Inline graphic are greater than Inline graphic of Inline graphic and Inline graphic in HPE algorithm 2. In (2), we know that Inline graphic of Inline graphic are bigger than Inline graphic of Inline graphic, but not greater than Inline graphic of Inline graphic, and Inline graphic in HPE algorithm 2. Recall that Inline graphic is estimated as the sum over each Inline graphic where Inline graphic. We denote Inline graphic as the true value of this sum for column Inline graphic, that is, for some Inline graphic, Inline graphic where Inline graphic of Inline graphic are less than or equal to Inline graphic and Inline graphic. Thus, for (2), we have the condition Inline graphic; in other words, the addition to Inline graphic from the Inline graphic column will differ from the true value (Inline graphic) by at most Inline graphic. Thus, for all Inline graphic:

graphic file with name Equation11.gif (3.7)

That is, by taking Inline graphic percentiles of Inline graphic and Inline graphic, our estimate for HPE algorithm 2 will be within Inline graphic of Inline graphic. This follows when one considers HPE algorithms 1 and 2 are approximations of the paired true rank comparisons (white curve in Figure 4) using Riemann integration with increasing accuracy as a function of Inline graphic. An additional argument for the convergence of these algorithms is presented in Note 2 of the Supplementary material available at Biostatistics online.

3.4.1. Estimating Inline graphic and Inline graphic

To estimate Inline graphic and Inline graphic, we use the intersection of the yellow contour (Inline graphic) and blue contour (path visited by HPE algorithm 2), which are the green-bordered squares in Figure 4. As our approach guarantees that Inline graphic, we can identify every pair Inline graphic as potential values of Inline graphic. Our algorithm also identifies the values of Inline graphic that are true for the observed data (all areas below the white line in Figure 4) or those which have been verified as false (all area above the white line in Figure 4). Our estimate for Inline graphic is then the intersection of Inline graphic that are empirically verified in HPE algorithm 2 such that Inline graphic of Inline graphic is strictly greater than Inline graphic of Inline graphic (blue squares in Figure 4) and Inline graphic which satisfy Inline graphic (yellow squares in Figure 4).

3.5. Bootstrap algorithm to estimate $H_{+}$

As noted in Section 3.3, while the HPE algorithms for approximating Inline graphic are significantly faster than calculating the full Inline graphic (Figure 3a,b), both of these algorithms assume that the dissimilarity matrix $D$ has been precomputed and that an adjacency matrix $A$ must be calculated. Unfortunately, the $O(n^2)$ computational requirements for full pairwise dissimilarity calculation quickly become infeasible (Figure 3, Table S1 of the Supplementary material available at Biostatistics online).

To address the limitation of computing and storing all pairwise dissimilarities, we implemented a bootstrap approximation of $H_{+}$ (HPB or Inline graphic) that samples with replacement from the original $n$ observations Inline graphic times (bootstraps) with a per-bootstrap sample size Inline graphic. We sample proportionally according to the vector $b$ as described in Section 2.4.1, that is, each of the $k$ clusters is randomly sampled Inline graphic times (where Inline graphic) such that Inline graphic. For each of Inline graphic iterations, the Inline graphic sampled observations are used to generate dissimilarity and adjacency matrices, which are then used to calculate a point estimate of $H_{+}$. The mean over these Inline graphic bootstraps is Inline graphic, the bootstrap $H_{+}$ estimate. The bootstrap approach scales substantially better than full dissimilarity calculation (Figure 3c). In our simulations, bootstrap parameters Inline graphic, Inline graphic yield $H_{+}$ estimates within Inline graphic of that given by HPB with Inline graphic (Inline graphic accuracy) with economical performance improvements. For example, we saw a reduction in computation time from Inline graphic s with HPE to Inline graphic s with HPB at 3000 observations (Figure S2 and Table S1 of the Supplementary material available at Biostatistics online).
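The bootstrap procedure can be sketched as follows, under stated assumptions: the parameter names `r` (number of bootstraps) and `t` (per-bootstrap sample size), their defaults, the rounding used to allocate draws to clusters, and the function names are all choices made for this illustration rather than details of the hpb() function. Each bootstrap draws from every cluster in proportion to its size, computes the fraction of discordant (within, between) distance pairs on the small sample, and the estimates are then averaged.

```python
import random
from bisect import bisect_left
from itertools import combinations


def h_plus_points(points, labels, dist):
    """Exact fraction of discordant (within, between) pairs for a small sample."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    if not within or not between:
        return None  # undefined without both kinds of distance
    between.sort()
    s = sum(bisect_left(between, dw) for dw in within)
    return s / (len(within) * len(between))


def hpb(points, labels, dist, r=30, t=50, seed=0):
    """Mean estimate over r bootstrap samples of roughly t observations each."""
    rng = random.Random(seed)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    n = len(points)
    estimates = []
    for _ in range(r):
        sample = []
        for members in clusters.values():
            # Draw from each cluster in proportion to its share of the data.
            t_i = max(1, round(t * len(members) / n))
            sample.extend(rng.choices(members, k=t_i))  # with replacement
        h = h_plus_points([points[i] for i in sample],
                          [labels[i] for i in sample], dist)
        if h is not None:
            estimates.append(h)
    return sum(estimates) / len(estimates) if estimates else None


# Hypothetical data: 90 points near 0 and 10 points near 100, labelled correctly.
points = [i * 0.01 for i in range(90)] + [100 + i * 0.01 for i in range(10)]
labels = [0] * 90 + [1] * 10
print(hpb(points, labels, lambda a, b: abs(a - b)))
```

Only distances among the sampled observations are ever computed, so the full pairwise dissimilarity matrix is never formed.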

4. Application of $H_{+}$ to the analysis of single-cell RNA-sequencing data

In this section, we demonstrate the use of $H_{+}$ as an internal validity metric in the application of scRNA-seq data with predicted cluster labels. Also, we compare $H_{+}$ to other widely used validity measures, including both (i) external (i.e., comparing predicted labels to ground-truth clustering known a priori) and (ii) internal (derived from the data itself) measures (Halkidi and others, 2001; Theodoridis and Koutroumbas, 2008).

4.1. Motivation

Consider a scRNA-seq data set with $n$ observations (or cells) each with Inline graphic features (or genes). We introduced and formulated $H_{+}$ as an internal validity metric to assess the fitness of a single dissimilarity measure $D$ and label $L$. Here, we introduce two scenarios where the goal is to compare the performance of either (i) two label sets $L_1$, $L_2$ and a fixed dissimilarity $D$ or (ii) two dissimilarity measures $D_1$, $D_2$ with a fixed label $L$. In the first scenario, $L_1$ and $L_2$ could represent two iterations in a single clustering algorithm or they could be labels from two separate clustering algorithms. As Inline graphic (and similarly with Inline graphic), the condition Inline graphic can be rewritten as follows

graphic file with name Equation12.gif (4.8)

As Inline graphic is fixed in the following subsections, we offer interpretations of the condition in (4.8) for fixed Inline graphic with varying Inline graphic and fixed Inline graphic with varying Inline graphic.

4.2. Data

We used the Inline graphic (Tian and others, 2019) scRNA-seq data set, which provides an experimentally derived “gold standard” true cell type identity (label) for each cell (https://github.com/LuyiTian/sc_mixology/).

The UMI counts and cellular identities were obtained for Inline graphic = 902 cells from three cell lines (H1975, H2228, and HCC827). The cell lines are used as the true cell type labels. Raw counts were Inline graphic-normalized with a pseudocount of 1, and per-gene variance was calculated using Inline graphic (Lun and others, 2016). For comparison of distances, five dissimilarities (Euclidean, Maximum, Manhattan, Canberra, and Binary) were calculated using Inline graphic-normalized counts and the 1000 most variable genes. For comparison of induced labels, dendrograms were induced directly from Euclidean distances using four hierarchical clustering methods (Ward’s method, single linkage method, complete linkage method, and unweighted pair group method with arithmetic mean). Cluster labels were induced by cutting each dendrogram at the true value of Inline graphic.

4.3. Fixed Inline graphic varying Inline graphic

If a user were generating an analysis pipeline, it may be insightful, prior to deployment, to compare the performance of several dissimilarity measures on a previously validated label–data set pair (Baker and others, 2021). In this case, fixing Inline graphic will imply that Inline graphic, then from (4.8), we know that Inline graphic for two dissimilarity matrices Inline graphic and Inline graphic. That is, the number of within-cluster distances greater than between-cluster distances will have strictly decreased. To illustrate this capacity, we used Inline graphic to compare the fitness of five dissimilarity methods induced from the same data and using the same “gold standard” true cell identities. These values may be found in Table S2 of the Supplementary material available at Biostatistics online. Further evaluation of dissimilarities in this setting is outside the scope of this work, and we refer the reader to Baker and others (2021) for an exploration of this topic.
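As a small illustration of this fixed-label comparison, the following Python sketch scores several dissimilarity measures against the same labels. It is a toy example under stated assumptions: the discordance is taken to be the fraction of (within-cluster, between-cluster) dissimilarity pairs in which the within-cluster value is strictly greater, the helper names (`discordance`, `compare_dissimilarities`, `MEASURES`) are hypothetical, the binary dissimilarity is omitted for brevity, and the Canberra implementation simply skips terms with a zero denominator.

```python
import math
from bisect import bisect_left
from itertools import combinations


def euclidean(a, b):
    return math.dist(a, b)


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def maximum(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def canberra(a, b):
    # Terms with a zero denominator are skipped (an assumption of this sketch).
    return sum(abs(x - y) / (abs(x) + abs(y))
               for x, y in zip(a, b) if abs(x) + abs(y) > 0)


MEASURES = {"euclidean": euclidean, "manhattan": manhattan,
            "maximum": maximum, "canberra": canberra}


def discordance(points, labels, dist):
    """Fraction of (within, between) dissimilarity pairs with within > between."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    if not within or not between:
        raise ValueError("need both within- and between-cluster pairs")
    between.sort()
    # bisect_left counts the between-cluster values strictly below w.
    greater = sum(bisect_left(between, w) for w in within)
    return greater / (len(within) * len(between))


def compare_dissimilarities(points, labels, measures=MEASURES):
    """Discordance of one fixed label set under each dissimilarity measure."""
    return {name: discordance(points, labels, fn) for name, fn in measures.items()}


# Toy data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
scores = compare_dissimilarities(points, labels)
```

Because every score is a proportion of pairwise comparisons, the values are directly comparable across measures even though the dissimilarities themselves live on different scales.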

4.4. Fixed Inline graphic varying Inline graphic

Similarly, Inline graphic can be fixed (e.g., Euclidean distance) with the goal of comparing the fitness of one generated label set Inline graphic (i.e., iteration Inline graphic of a clustering algorithm) to a previous label Inline graphic. In this scenario, Equation (4.8) does not imply an explicit relation for Inline graphic; however, the discordance has still decreased. To demonstrate the use of Inline graphic as a cluster fitness metric, we induce labels using four hierarchical clustering methods (Ward’s method, single linkage method, complete linkage method, and unweighted pair group method with arithmetic mean) (Figures 5(a–d)), and compare against both external and internal well-known validity metrics (Figure 5(e) and (f)).

Fig. 5.

The Inline graphic metric is an internal validity measure for assessing the performance of induced cluster labels. Multidimensional scaling (MDS) plots with shapes representing true cell type labels from the Inline graphic scRNA-seq data set and colors representing induced (or predicted) cluster labels from four hierarchical clustering methods implemented in the hclust() function in the base R stats package including (a) Ward’s method, (b) single linkage method, (c) complete linkage method, and (d) unweighted pair group method with arithmetic mean (UPGMA). (e) Scatter plot of Inline graphic (an internal validity metric) compared to Adjusted Rand Index (ARI) (an external validity metric) demonstrating shared information between the two metrics, which Inline graphic (calculated with the HPE algorithm 1 using Inline graphic) recovers without the need for an externally labeled set of observations. (f) A performance plot with three internal validity metrics (Inline graphic-axis scaled between 0 and 1): (i) Inline graphic (for ease of comparison) calculated from labels induced with Inline graphic (Inline graphic-axis), (ii) mean silhouette score, and (iii) within-cluster sum of squares (WCSS). The “peak” of the Inline graphic metric at the correct Inline graphic indicates that Inline graphic identifies the most accurate label in a comparable fashion to established internal fitness measures, namely a “peak” in the mean silhouette score and a “bend” in the WCSS curve.

First, we compare Inline graphic as an internal validity metric to an external validity metric, namely the Adjusted Rand Index (ARI), which assesses the performance of the induced cluster labels using a gold-standard set of cell type labels in the Inline graphic (Tian and others, 2019) scRNA-seq data set. Here, the induced labels with better (higher) ARI also yield better (less) Inline graphic discordance (Figure 5(e)). In this sense, Inline graphic (an internal validity measure without the dependency of a gold-standard set of labels) captures similar information as ARI (an external validity measure that depends on the use of a gold-standard set of labels).
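For readers unfamiliar with the external metric used here, the Adjusted Rand Index can be computed directly from the contingency table of true and induced labels. The sketch below uses the standard pair-counting formula; the function name `adjusted_rand_index` and the convention of returning 1.0 when the denominator is zero are choices made for this illustration and are not taken from the article.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand Index from the contingency table of two label vectors."""
    if len(truth) != len(predicted):
        raise ValueError("label vectors must have the same length")
    n = len(truth)
    if n < 2:
        raise ValueError("need at least two observations")
    # Pairs of observations that share a cell, a row, or a column of the table.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case (e.g., both labelings are a single cluster);
        # returning 1.0 here is a convention assumed for this sketch.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

Unlike the discordance, this calculation requires the true labels, which is why it is an external rather than an internal validity metric.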

Next, we compare Inline graphic as an internal validity measure to other internal validity measures. Specifically, we induce labels using partitioning around medoids (Inline graphic-medoids clustering) for values of Inline graphic. For each label and Inline graphic, the mean Silhouette score (Rousseeuw, 1987) and Inline graphic were calculated. We found that Inline graphic accurately identifies the correct Inline graphic for induced labels when compared with other internal validity metrics (i.e., measures of how well the data are explained by a single set of labels), namely the within-cluster sum of squares “bend” (or “elbow”) criterion and the mean Silhouette score (Figure 5(f)).

5. Discussion

Quantifying how well a generated clustering fits the observed data is an essential problem in the statistical and computational sciences. Most methods for measuring cluster fitness are explicitly valued on the dissimilarity induced from the data. While appealing in their simplicity and interpretation, these approaches are potentially more susceptible to numerical bias between observations or types of dissimilarity measures. Discordance metrics, such as Inline graphic and Inline graphic, circumvent this issue by assessing label-dissimilarity fitness implicitly on the dissimilarity values. In this work, we show Inline graphic is an estimator for the probability that a within-cluster dissimilarity is strictly greater than a between-cluster dissimilarity, Inline graphic. However, we also show that Inline graphic varies as a function of the proportion of total distances that are within-cluster distances (Inline graphic) and thereby also the group balance (Inline graphic) and number of groups Inline graphic, which is an undesirable property of the discordance metric.
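The dependence on the proportion of within-cluster distances can be seen in a toy calculation. The Python sketch below is illustrative only and rests on two assumptions that are not spelled out in this section: that the original index follows the textbook G+ definition, 2s⁻/(N_d(N_d − 1)), where s⁻ counts the (within, between) pairs with the within-cluster dissimilarity strictly greater and N_d is the total number of dissimilarities, and that the modified index normalizes s⁻ by the number of (within, between) pairs instead. The function names `g_plus` and `h_plus` are hypothetical.

```python
def discordant_pairs(within, between):
    """Count (within, between) pairs whose within value is strictly greater."""
    return sum(1 for w in within for b in between if w > b)


def g_plus(within, between):
    # Assumed textbook definition: normalize by all pairs of dissimilarities.
    n_d = len(within) + len(between)
    return 2 * discordant_pairs(within, between) / (n_d * (n_d - 1))


def h_plus(within, between):
    # Assumed modified index: normalize by the number of (within, between) pairs.
    return discordant_pairs(within, between) / (len(within) * len(between))


# Same distribution of values, but a different share of within-cluster distances.
within = [1, 2, 5]
between = [3, 4, 6, 7]
more_within = within * 3  # triples the within-cluster share, values unchanged

balanced = (g_plus(within, between), h_plus(within, between))
shifted = (g_plus(more_within, between), h_plus(more_within, between))
```

In this toy case the first component of the two tuples differs while the second is identical, which mirrors the property described above: changing only the share of within-cluster distances moves the original index but not the pairwise-normalized one.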

Here, we present Inline graphic, a modification of Inline graphic that retains the scale-agnostic discordance quantification while addressing problems with Inline graphic. Explicitly, Inline graphic is an unbiased estimator for Inline graphic. This benefit is most easily seen in the manner that Inline graphic will be unaffected by the value of Inline graphic (the portion of distance pairs that are within the same cluster), a formulation that permits the user to assess fitness for an arbitrary value of Inline graphic. We discuss the theoretical properties of this estimator, provide two simple algorithms for implementation, and ascertain a strict numerical bound for their accuracy as a function of a simple user-defined parameter. We also introduce an estimator of Inline graphic based on bootstrap resampling from the original observations that does not require the full dissimilarity and adjacency matrices to be calculated.

As Inline graphic can be used to assess the fitness of multiple dissimilarities for a fixed label, or to compare multiple labels given a fixed dissimilarity, we envision that Inline graphic can be employed in both development and analysis settings. If the true observation identities (labels) are known for a data set, Inline graphic could be utilized in the development stages of analytical software and pipelines to ascertain the most advantageous dissimilarity measure for that specific problem. In the alternate setting, we envision that Inline graphic can be used to quantify performance in clustering/classification scenarios. If the true labels are unknown, Inline graphic could be used to identify the clustering algorithm that produces the tightest clusters for a fixed dissimilarity measure. As a possible future direction, one could imagine directly minimizing discordance as the objective criterion within a clustering algorithm for optimizing iterative labels.

Due to its generalizability to the number of clusters Inline graphic or the proportion of within- to between-cluster dissimilarity pairs Inline graphic, Inline graphic may be susceptible to degenerate cluster labels. For example, in the hierarchical clustering portion of Figure 5, Label 4 is less discordant than Label 3 in terms of both Inline graphic and ARI. Label 4 has simply merged two true clusters, and placed a single point in a third identity. While Label 4 is more accurate than Label 3, it achieves this by exploiting an opportunity to increase the proportion of same-cluster pairs, that is, maximizing Inline graphic. One could also imagine a scenario where an algorithm simply makes Inline graphic very large to minimize Inline graphic. In both scenarios, the labels generated are unlikely to be particularly informative for the user. We posit that some form of penalization for Inline graphic may help to alleviate these degenerate cases. For example, dividing Inline graphic by Inline graphic is a penalty for degeneracy in the case of putting many observations in the same label. Conversely, a division by Inline graphic is a potential penalty for the other degeneracy of making many very small clusters.

We also imagine that discordance measures can be synthesized with probabilistic dissimilarity frameworks such as locality-sensitive hashing (LSH) and coresets (Datar and others, 2004; Har-Peled and Mazumdar, 2004). For example, it could be useful if theoretical (probabilistic) guarantees of observation proximity from LSH algorithms could be extended to similar guarantees for the discordance of observations embedded in the hash space. It may also prove fruitful to explore discordance outside the scope of the clustering/classification problem, such as pseudotime (1-dimensional ordering) or “soft” (weighting membership estimation) clustering problems.

In practice, Inline graphic could provide an additional means to consider the termination of a clustering algorithm in a distance-agnostic manner. For example, the Inline graphic-means algorithm (Hartigan and Wong, 1979) and its variants seek to minimize a form of the total within-cluster dispersion (dissimilarity). These algorithms with similar objective functions are subject to changes in behavior as the distance function changes. The extent to which minimizing discordance such as Inline graphic provides benefits regarding sensitivity to noise and magnitude of the distances is intriguing and outside the scope of this work.

Supplementary Material

kxac035_Supplementarey_Data

Acknowledgments

The authors would like to thank Kasper Hansen for the pre-print template and the Joint High Performance Computing Exchange (JHPCE) for providing computing resources.

Conflict of Interest: None declared.

Contributor Information

Nathan Dyjack, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N Wolfe St, Baltimore, MD 21205, USA.

Daniel N Baker, Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, MD 21218, USA.

Vladimir Braverman, Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, MD 21218, USA.

Ben Langmead, Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, MD 21218, USA.

Stephanie C Hicks, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N Wolfe St, Baltimore, MD 21205, USA.

Code and software availability

All analyses and simulations were conducted in the Inline graphic programming language. Code for reproduction of all plots in this article is available at https://github.com/stephaniehicks/fasthpluspaper. Both HPE and HPB have been implemented in the Inline graphic package in Inline graphic available on CRAN at https://CRAN.R-project.org/package=fasthplus and for developmental versions on GitHub at https://github.com/ntdyjack/fasthplus.

Supplementary material

Supplementary material is available online at http://biostatistics.oxfordjournals.org.

Funding

The National Institutes of Health (R00HG009007 to N.D. and S.C.H.); the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation (CZF2019-002443 to N.D. and S.C.H.); the National Institutes of Health (R35GM139602 to D.N.B. and B.L.); NSF CAREER (1652257), ONR Award (N00014-18-1-2364), and the Lifelong Learning Machines program from DARPA/MTO to V.B., in part.

References

  1. Baker, D. N., Dyjack, N., Braverman, V., Hicks, S. C. and Langmead, B. (2021). Fast and memory-efficient scRNA-seq k-means clustering with various distances. In: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB ’21). New York, NY, USA: Association for Computing Machinery, Article 24, pp. 1–8. doi: 10.1145/3459930.3469523
  2. Datar, M., Immorlica, N., Indyk, P. and Mirrokni, V. S. (2004). Locality-Sensitive Hashing Scheme Based on p-Stable Distributions, SCG ’04. New York, NY, USA: Association for Computing Machinery.
  3. Desgraupes, B. (2018). clusterCrit: Clustering Indices. R package version 1.2.8. https://CRAN.R-project.org/package=clusterCrit
  4. Goodman, L. A. and Kruskal, W. H. (1979). Measures of Association for Cross Classifications. New York, NY: Springer New York.
  5. Halkidi, M., Batistakis, Y. and Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems 17, 107–145.
  6. Har-Peled, S. and Mazumdar, S. (2004). On coresets for k-means and k-median clustering. In: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing (STOC ’04). New York, NY, USA: Association for Computing Machinery, pp. 291–300. doi: 10.1145/1007352.1007400
  7. Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: a k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28, 100–108.
  8. Jardine, N. and Sibson, R. (1968). The construction of hierarchic and non-hierarchic classifications. The Computer Journal 11, 177–184.
  9. Kendall, M. G. (1938). A new measure of rank correlation. Biometrika 30, 81–93.
  10. Leek, J. T. (2009). The tspair package for finding top scoring pair classifiers in R. Bioinformatics 25, 1203–1204.
  11. Lun, A. T. L., McCarthy, D. J. and Marioni, J. C. (2016). A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Research 5, 2122.
  12. Magis, A. T. and Price, N. D. (2012). The top-scoring ‘N’ algorithm: a generalized relative expression classification method from small numbers of biomolecules. BMC Bioinformatics 13, 1–11.
  13. Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850.
  14. Rohlf, F. J. (1974). Methods of comparing classifications. Annual Review of Ecology and Systematics 5, 101–113.
  15. Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65.
  16. Theodoridis, S. and Koutroumbas, K. (2008). Pattern Recognition, 4th edition. USA: Academic Press.
  17. Tian, L., Dong, X., Freytag, S., Lê Cao, K.-A., Su, S., Jalalabadi, A., Amann-Zalcenstein, D., Weber, T. S., Seidi, A., Jabbari, J. S. and others. (2019). Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nature Methods 16, 479–487.
  18. Williams, W. T. and Clifford, H. T. (1971). On the comparison of two classifications of the same set of elements. Taxon 20, 519–522.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
