2024 Mar 11;80(1):ujae001. doi: 10.1093/biomtc/ujae001

From local to global gene co-expression estimation using single-cell RNA-seq data

Jinjin Tian 1, Jing Lei 2, Kathryn Roeder 3
PMCID: PMC10926266  PMID: 38465983

ABSTRACT

In genomics studies, the investigation of gene relationships often brings important biological insights. Large, heterogeneous datasets now pose new challenges for statisticians because gene relationships are often local: they change from one sample point to another, may exist only in a subset of the sample, and can be nonlinear or even nonmonotone. Most previous dependence measures do not specifically target local dependence relationships, and the ones that do are computationally costly. In this paper, we explore a state-of-the-art network estimation technique that characterizes gene relationships at the single-cell level, under the name of cell-specific gene networks. We first show that averaging the cell-specific gene relationship over a population gives a novel univariate dependence measure, the averaged Local Density Gap (aLDG), that accumulates local dependence and can detect any nonlinear, nonmonotone relationship. Together with a consistent nonparametric estimator, we establish its robustness on both the population and empirical levels. Then, we show that averaging the cell-specific gene relationship over mini-batches determined by some external structure information (eg, spatial or temporal factor) better highlights meaningful local structure change points. We explore the application of aLDG and its minibatch variant in many scenarios, including pairwise gene relationship estimation, bifurcating point detection in cell trajectory, and spatial transcriptomics structure visualization. Both simulations and real data analysis show that aLDG outperforms existing measures.

Keywords: dependence measure, gene co-expression, independence test, single-cell RNA-seq, spatial and temporal data

1. INTRODUCTION

Experimental biologists and clinicians seek a deeper understanding of biological processes and their link with disease phenotypes by characterizing cell behavior. Gene expression offers a fruitful avenue for insights into cellular traits and changes in cellular state. Advances in technology that enable the measurement of RNA levels for individual cells via single-cell RNA sequencing (scRNA-seq) significantly increase the potential to advance our understanding of the biology of disease by capturing the heterogeneity of expression at the cellular level (Haque et al., 2017). Gene differential expression analysis, which contrasts the marginal expression levels of genes between groups of cells, is the most commonly used mode of analysis to interrogate cellular heterogeneity. By contrast, the relational patterns of gene expression have received far less attention. The most intuitive relational effect is gene co-expression, a synchronization between gene expressions, which can vary dramatically among cells. Converging evidence has revealed the importance of co-expression among genes. When looking at a collection of highly heterogeneous cells, such as cells from multiple cell types, significant gene co-expression may indicate rich cell-level structure. Alternatively, when looking at a batch of highly homogeneous cells, gene co-expression could imply gene cooperation through gene co-regulation (Raj et al., 2006; Emmert-Streib et al., 2014).

The recent work by Dai et al. (2019) attempts an ambitious task: characterizing gene co-expression at the single-cell level (termed “cell-specific network,” CSN). Specifically, for a pair of genes and a target cell, Dai et al. (2019) construct a 2-way $2 \times 2$ contingency table test by binning all the cells based on whether they are in the marginal neighborhoods of the target cell and assigning the test result as a binary indicator of gene association in the target cell. Viewed over all gene pairs, the result is a cell-specific gene network. Forgoing interpretation of the detected associations, they utilize the CSN to obtain a data transformation. Specifically, they replace the transcript counts in the gene-by-cell matrix with the degree sequence of each CSN. Although this data transformation shows encouraging success in various downstream tasks, such as cell clustering, it remains unclear what the detected “cell-specific” gene association network really represents. The implementation details and interpretation of the results are presented at a heuristic level, making it difficult for others to appreciate and generalize this line of work.

In a follow-up paper, Wang et al. (2021) take the first steps to capitalize on the CSN approach by redirecting the concept to obtain an estimator of co-expression. Specifically, they propose averaging the “cell-specific” gene association indicators over cells in a class to recover a global measure of gene association (avgCSN). The resulting measure performs remarkably well in certain simulations and detailed empirical investigations of brain cell data. Compared to Pearson’s correlation, the avgCSN gene co-expression appears less noisy and provides more accurate edge estimation in simulations. It is also more powerful in a test to uncover differential gene networks between diseased and control brain cells. Finally, it provides biologically meaningful gene networks in developing cells.

To make the method more stable, Wang et al. (2021) propose some heuristic and practical techniques to compute avgCSN, for which we would like to have more principled insights. Examples are the choice of window size in defining neighborhoods in the local contingency table test, the choice of thresholding in constructing an edge, and the range of cells to aggregate over. Some questions also emerge: how does avgCSN relate to other gene co-expression measures and the full range of general univariate dependence measures, and why does it perform well in practice? In this paper, through theoretical analysis and extensive experimental evaluations, we address these questions, revealing that avgCSN is an empirical estimator of a new dependence measure, which enjoys various advantages over the existing measures. By studying the theoretical properties of this new measure, we are also able to provide principled guidelines for practice.

The paper is outlined as follows: we first give a detailed review of the related methods in Section 2. Then, in Section 3.1, we show that avgCSN is indeed an empirical estimate of a valid dependence measure, which we define as averaged Local Density Gap (aLDG). In Sections 3.2 and 3.3, we formally establish its statistical properties, including estimation consistency and robustness. We also investigate data-adaptive hyperparameter selection to justify and refine the heuristic choices in application in Section 3.4. We discuss a minibatch variant of aLDG in Section 4 and point out its application in highlighting change points of cellular states. Finally, we provide a systematic comparison of aLDG and its competitors via both simulation and real data examples in Section 5.

2. A BRIEF REVIEW OF DEPENDENCE AND ASSOCIATION MEASURES

For comparison, we briefly review the related work in gene co-expression measures and general univariate dependence. In the context of gene co-expression analysis, the pair of random variables $(X, Y)$ represents the expression level of a pair of genes, and the goal is to find the relationship between them. Since the work by Eisen et al. (1998), Pearson’s correlation has been the most popular gene co-expression measure for its simple interpretation and fast computation. However, Pearson’s correlation fails to detect nonlinear relationships and is sensitive to outliers. Another class of co-expression methods is based on mutual information (MI) (Bell, 1962; Steuer et al., 2002; Daub et al., 2004). The computation of MI involves discretizing the data and tuning parameters, and the dependence measure does not have an interpretable scale. Reshef et al. (2011) proposed the maximal information coefficient (MIC) as an extension of MI, but MIC was shown to be over-sensitive in practice. More comparisons of different co-expression measures and the constructed co-expression networks can be found in Song et al. (2012) and Allen et al. (2012).

In the broader statistical literature, the problem of finding gene co-expression is closely related to that of detecting univariate dependence between 2 random variables. Specifically, for a pair of univariate random variables $(X, Y)$, how to measure the dependence between them has been a long-standing problem. The problem is often described as finding a function $\delta(X, Y)$ that measures the discrepancy between the joint distribution of $(X, Y)$ and the product of its marginal distributions. Numerous solutions to this problem have been proposed, including the Rényi correlation (Rényi, 1959), which measures the correlation between 2 variables after suitable transformations; various regression-based techniques; Hoeffding’s D (Hoeffding, 1948); distance correlation (dCor) (Székely et al., 2007); kernel-based measures like the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005); and rank-based measures like Kendall’s $\tau$ and its later refinement, $\tau^*$ (Bergsma and Dassios, 2014). Most of these methods have not yet been widely adopted in genetics applications.

To evaluate the above measures under a general framework, recall that Rényi (1959) proposed that a measure of dependence between 2 stochastic variables X and Y, $\delta(X, Y)$, should ideally have the following properties: (i) $\delta(X, Y)$ is defined for any $X, Y$, neither of which is constant with probability 1; (ii) $\delta(X, Y) = \delta(Y, X)$; (iii) $0 \le \delta(X, Y) \le 1$; (iv) $\delta(X, Y) = 0$ if and only if X and Y are independent; (v) $\delta(X, Y) = 1$ if either $Y = f(X)$ or $X = g(Y)$, where f and g are measurable functions; (vi) if the Borel-measurable functions f and g map the real axis in a one-to-one way to itself, then $\delta(f(X), g(Y)) = \delta(X, Y)$.

Particularly, a measure satisfying (iv) is called a strong dependence measure. Apart from the above properties, there are 2 more properties that are particularly useful in single-cell data analysis. Single-cell data often contain a significant amount of noise, among which outliers account for a nonnegligible fraction. Therefore, robustness is a desirable property in a dependence measure. Specifically, keeping with previous literature (Dhar et al., 2016), by robustness, we mean that the value of the measure does not change much when a small contamination point mass, far away from the main population, is added. A formal description and corresponding evaluation metric will be described later. Another often overlooked property is locality, which is a relatively novel concept and has not been properly defined to the best of our knowledge. Nevertheless, this concept has been catching attention over the recent decade (Reshef et al., 2011; Heller et al., 2013; 2016; Wang et al., 2014), especially in work motivated by genetic data analysis. Locality targets a special kind of dependence relationship that is generally restricted to a particular neighborhood in the sample space. A natural example is the dependence that occurs in some, but not necessarily all of the components in a finite mixture. Another is dependence within a moving time window in a time series. Generally speaking, the interactions change as the hidden condition varies, or only exist under a specific hidden condition. A dependence measure that is local should be able to accumulate dependence in the local regions.

As far as we know, none of those measures has all of the properties mentioned above. The new measure we discovered from avgCSN possesses all but properties (v) and (vi).

3. OUR METHOD: ALDG

First, we elaborate on the origin of our work, the CSN proposed by Dai et al. (2019). They define the cell-specific gene relationship using the following approach: for the gene pair Inline graphic, and a target cell j, partition the n cells based on whether Inline graphic and Inline graphic, where Inline graphic and Inline graphic are predefined window sizes. This partition can be summarized as a $2 \times 2$ contingency table (Table 1). Then the evidence against independence in this $2 \times 2$ table can be quantified by a general contingency table test statistic. Dai et al. (2019) use Inline graphic, and conduct a one-sided level-$\alpha$ test based on its asymptotic normality, that is

TABLE 1.

The $2 \times 2$ contingency table based on distance from the j-th sample.

Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic
Inline graphic
Inline graphic n
3. (1)

Assume the variables Inline graphic have joint density Inline graphic, and marginal densities, Inline graphic and Inline graphic, that have common support. Define X and Y as being locally independent at position Inline graphic as Inline graphic, then Inline graphic provides a way of assessing local independence. Specifically, as a one-sided test, Inline graphic assesses whether or not Inline graphic, at position Inline graphic marked by cell j. To assess global independence, aggregation, as proposed by Wang et al. (2021), is needed. Their empirical measure can be formally written as: Inline graphic. Some simple approximations give us a population correspondence of avgCSN. Let Inline graphic be the estimated densities given observations of Inline graphic. Under the assumption that the bandwidth Inline graphic and Inline graphic, with some simple algebra (see Supplementary 1 for detailed derivation), we have that

3. (2)

and Inline graphic is some hyperparameter related to the test level of the local contingency test (usually Inline graphic is set to 0.05 or 0.01). Because Inline graphic as n goes to infinity, we naturally think of the following population dependence measure:

3.
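A rough version of the algebra behind (2) can be sketched as follows. This is not the derivation in Supplementary 1. It assumes that the windows are closed intervals of half-width $h_x$ and $h_y$ around the target cell, writes $n_x$, $n_y$, and $n_{xy}$ for the marginal and joint window counts of that cell, and assumes the standardized statistic $S$ below, which is one common normalization for a $2 \times 2$ table. All of these symbols are notation introduced here for illustration.

```latex
\begin{align*}
\hat f_X(x_j) &= \frac{n_x}{2 n h_x}, \qquad
\hat f_Y(y_j) = \frac{n_y}{2 n h_y}, \qquad
\hat f_{XY}(x_j, y_j) = \frac{n_{xy}}{4 n h_x h_y}, \\[4pt]
S &= \frac{\sqrt{n}\,\bigl(n\, n_{xy} - n_x n_y\bigr)}
          {\sqrt{n_x n_y (n - n_x)(n - n_y)}} \\[4pt]
  &= \frac{\sqrt{n}\cdot 4 n^2 h_x h_y\,\bigl(\hat f_{XY} - \hat f_X \hat f_Y\bigr)}
          {\sqrt{4 n^2 h_x h_y\, \hat f_X \hat f_Y\,(n - n_x)(n - n_y)}} \\[4pt]
  &\approx 2\sqrt{n h_x h_y}\;
     \frac{\hat f_{XY} - \hat f_X \hat f_Y}{\sqrt{\hat f_X \hat f_Y}},
     \qquad \text{since } (n - n_x)(n - n_y) \approx n^2 \text{ as } h_x, h_y \to 0 .
\end{align*}
```

Under these assumptions, rejecting when $S$ exceeds the upper-$\alpha$ standard normal quantile $z_{1-\alpha}$ is the same as the scaled density gap exceeding $z_{1-\alpha} / (2\sqrt{n h_x h_y})$, a threshold that depends only on the test level and on $n h_x h_y$. The constant 2 is specific to the half-width convention assumed here.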

In the remainder of this section, we formally define a generalized version of this measure in Section 3.1, along with its properties on the population level. Then we discuss consistent and robust estimation in Section 3.3 and hyper-parameter selection in Section 3.4.
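The local test and its average over cells, as described at the start of this section, can be sketched in a few lines. This is a minimal illustration rather than the implementation of Dai et al. (2019) or Wang et al. (2021): the function name `avg_csn`, the closed windows of half-width `hx` and `hy`, the inclusion of the target cell in its own window, and the particular standardized statistic are all assumptions made for the example.

```python
import numpy as np
from statistics import NormalDist


def avg_csn(x, y, hx, hy, alpha=0.05):
    """Fraction of cells whose local 2x2 test flags the gene pair (a sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    # near_x[j, i] is True when cell i falls in the x-window around target cell j.
    near_x = np.abs(x[:, None] - x[None, :]) <= hx
    near_y = np.abs(y[:, None] - y[None, :]) <= hy
    n_x = near_x.sum(axis=1).astype(float)
    n_y = near_y.sum(axis=1).astype(float)
    n_xy = (near_x & near_y).sum(axis=1).astype(float)
    # Assumed standardized statistic for the 2x2 table of each target cell.
    denom = np.sqrt(n_x * n_y * (n - n_x) * (n - n_y))
    stat = np.full(n, -np.inf)  # degenerate tables never reject
    ok = denom > 0
    stat[ok] = np.sqrt(n) * (n * n_xy[ok] - n_x[ok] * n_y[ok]) / denom[ok]
    # One-sided test at level alpha against the standard normal quantile.
    z = NormalDist().inv_cdf(1.0 - alpha)
    return float(np.mean(stat > z))


# Toy comparison: a strongly dependent pair versus an independent pair.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
dep = avg_csn(x, x + 0.1 * rng.normal(size=300), hx=0.3, hy=0.3)
indep = avg_csn(x, rng.normal(size=300), hx=0.3, hy=0.3)
```

In this toy setting the dependent pair should receive a much larger average than the independent pair. The pairwise distance matrices make the sketch quadratic in the number of cells, which is acceptable for illustration but not for large datasets.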

3.1. Definition and basic properties

Definition 1

(aLDG) Consider a pair of random variables Inline graphic whose joint and marginal densities both exist, and denote Inline graphic as their joint and marginal densities. The aLDG measure is then defined as

Definition 1 (3)

and Inline graphic is a tunable hyper-parameter.

The following lemma follows immediately from the definition.

Lemma 1

For Inline graphic whose joint and marginal densities both exist, we have (1) Inline graphic; (2) if Inline graphic, then Inline graphic; (3) Inline graphic is nonincreasing with regard to t for all Inline graphic; (4) Inline graphic; (5) Inline graphic.

As a concrete example of the aLDG measure, the left plot of Figure S6 displays aLDG for different values of t under a bivariate Gaussian with different choices of correlation. We can see that (1) $\mathrm{aLDG}_t$ is nonincreasing with regard to t, as Lemma 1 suggests; (2) $\mathrm{aLDG}_t$ equals zero at independence for every threshold t considered, while $\mathrm{aLDG}_0$ equals zero if and only if there is no dependence, as Lemma 1 suggests; (3) $\mathrm{aLDG}_t$ increases with the dependency level, indicating that it is a sensible dependence measure. We also include a discussion of the link between aLDG and existing methods in Supplementary 9.

Note that, from Lemma 1, $\mathrm{aLDG}_0$ is a strong measure of dependence. While being strong is a desirable feature of a dependence measure, for aLDG-type measures we find that it comes at the cost of robustness under independence (Proposition 1). On the other hand, setting $t > 0$ could result in insensitivity under weak dependence, but with a provable guarantee of robustness (Theorem 1). In summary, the hyper-parameter t serves as a trade-off between robustness and sensitivity. In Section 3.4 we will discuss the practical choice of t in more detail. For now, we treat it as a predefined non-negative constant.

3.2. Robustness analysis

In the following, we present a formal robustness analysis. An important tool to measure the robustness of a statistical measure is the influence function (IF). It measures the influence of an infinitesimal amount of contamination at a given value on the statistical measure. The gross error sensitivity (GES) summarizes IF in a single index by measuring the maximal influence an observation could have.

Definition 2 (IF and GES)

Assume that the bivariate random variable $(X, Y)$ follows a distribution F. The IF of a statistical functional R at F is defined as

$$\mathrm{IF}\big((x, y), R, F\big) := \lim_{\epsilon \to 0} \frac{R\big((1 - \epsilon) F + \epsilon\, \delta_{(x, y)}\big) - R(F)}{\epsilon}, \tag{4}$$

where $\delta_{(x, y)}$ is a Dirac measure putting all its mass at $(x, y)$. The GES summarizes IF in a single index by measuring the maximal influence over all possible contamination locations, which is defined as $\mathrm{GES}(R, F) := \sup_{(x, y)} \big|\mathrm{IF}\big((x, y), R, F\big)\big|$.
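To make the influence function concrete, the ratio inside the limit can be evaluated numerically at a small but finite contamination mass. The sketch below does this for Pearson’s correlation on an empirical distribution, chosen only because its sensitivity to outliers is well known. The helper names `weighted_corr` and `influence_approx` and the value `eps=1e-4` are assumptions made for illustration, not part of the paper’s analysis.

```python
import numpy as np


def weighted_corr(x, y, w):
    """Pearson correlation of a discrete distribution with weights w (sum to 1)."""
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    vx = np.sum(w * (x - mx) ** 2)
    vy = np.sum(w * (y - my) ** 2)
    return cov / np.sqrt(vx * vy)


def influence_approx(x, y, point, eps=1e-4):
    """Finite-eps version of the IF ratio for Pearson correlation."""
    n = x.size
    base = weighted_corr(x, y, np.full(n, 1.0 / n))
    # Contaminated distribution: (1 - eps) * F_n + eps * Dirac(point).
    xc = np.append(x, point[0])
    yc = np.append(y, point[1])
    wc = np.append(np.full(n, (1.0 - eps) / n), eps)
    return (weighted_corr(xc, yc, wc) - base) / eps


rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.8 * x + 0.6 * rng.normal(size=500)
near = influence_approx(x, y, (0.0, 0.0))     # contamination at the center
far = influence_approx(x, y, (10.0, -10.0))   # contamination far from the bulk
```

The magnitude of `far` is much larger than that of `near`, and it keeps growing as the contamination point moves further out, which is what an unbounded influence function (infinite GES) looks like in practice.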

Among the related work we have mentioned, to the best of our knowledge only the robustness of $\tau$, $\tau^*$, and dCor has been theoretically investigated. Dhar et al. (2016) proved that dCor is not robust while $\tau$ and $\tau^*$ are. Their evaluation criterion differs slightly from ours: we investigate the limit of the ratio as the contamination mass goes to zero, whereas they investigate the limit of the ratio as the contamination position moves far away, given a fixed contamination mass. We argue that our analysis aligns better with the main statistical literature. We show that $\mathrm{aLDG}_t$ with $t > 0$ is B-robust, under some reasonable regularity conditions.

Theorem 1

Consider Inline graphic, and a bivariate distribution F of variable Inline graphic whose joint and marginal densities exist as Inline graphic, Inline graphic, Inline graphic, and satisfy

Theorem 1 (5)

then we have Inline graphic.

The proof of Theorem 1 is in Supplementary 2. In the following, we show that Inline graphic is not robust under independence.

Proposition 1

For any distribution F over a pair of independent random variables Inline graphic whose joint and marginal density exists and are smooth almost everywhere, we have Inline graphic if and only if X is independent of Y.

The proof of Proposition 1 is in Supplementary 3. The right plot in Figure S6 provides some empirical evidence of the nonrobustness of Inline graphic under independence. Specifically, we plot the population value of the ratio inside the limit in (4), under a bivariate Gaussian with a small enough contamination proportion Inline graphic, to show approximately that the IF value of Inline graphic at independence indeed goes to infinity as t goes to zero.

3.3. Consistent and robust estimation

In this section, we investigate the estimation of Inline graphic given finite samples. One natural way to estimate Inline graphic is using the following plug-in estimator: recall that Inline graphic are the estimated joint and marginal densities, then given n observations Inline graphic of Inline graphic, Inline graphic can be estimated by

3.3. (6)

In the following, we establish the nonasymptotic high probability bound of the estimation error using the above simple plug-in estimator Inline graphic. The error rate is determined by the density estimation error for variable Inline graphic, and the probability estimation error for Inline graphic.

Theorem 2

Consider Inline graphic, and a bivariate distribution F of variable Inline graphic whose joint and marginal densities exist as Inline graphic, Inline graphic, Inline graphic, and satisfy

Theorem 2

and for some Inline graphic with Inline graphic, with probability at least Inline graphic  

Theorem 2 (7)

and with some Inline graphic, Inline graphic for all Inline graphic. Then we have, with probability at least Inline graphic, Inline graphic.

Theorem 2 is flexible in the sense that one can plug-in any kind of density estimator and its error rate to obtain the error rate of the corresponding Inline graphic estimator. The proof of Theorem 2 is in Supplementary 4. Though Theorem 2 was for fixed t, we also provide a similar result that holds true uniformly over all possible t in Supplementary 5. We also provide an example of error rate calculation given a boxcar kernel density estimator in Supplementary 6.

We also include a robustness analysis of Inline graphic in Supplementary 7. Specifically, we consider an empirical contamination model that is commonly encountered in single-cell data analysis: a small proportion of the sample points are replaced by “outliers” far away from the rest of the sample. We show that the values of Inline graphic with and without outliers are close as long as the outlier proportion is small. This suggests that the estimator of Inline graphic preserves its robust nature.
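The plug-in idea of this section can be sketched with boxcar kernel density estimates. The sketch assumes that the local density gap is the joint density minus the product of the marginals, scaled by the square root of that product, and that the estimator is the share of sample points whose estimated gap exceeds t. The function names `local_density_gap` and `aldg_hat`, the bandwidths, and the threshold used below are illustrative choices, not values recommended in the paper.

```python
import numpy as np


def local_density_gap(x, y, hx, hy):
    """Estimated density gap at every sample point, using boxcar kernels."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    near_x = np.abs(x[:, None] - x[None, :]) <= hx
    near_y = np.abs(y[:, None] - y[None, :]) <= hy
    # Boxcar density estimates; each point counts itself, so f_x, f_y > 0.
    f_x = near_x.mean(axis=1) / (2.0 * hx)
    f_y = near_y.mean(axis=1) / (2.0 * hy)
    f_xy = (near_x & near_y).mean(axis=1) / (4.0 * hx * hy)
    # Assumed form of the gap: joint minus product, scaled by sqrt(product).
    return (f_xy - f_x * f_y) / np.sqrt(f_x * f_y)


def aldg_hat(x, y, t, hx, hy):
    """Share of sample points whose estimated gap exceeds the threshold t."""
    return float(np.mean(local_density_gap(x, y, hx, hy) > t))


rng = np.random.default_rng(0)
x = rng.normal(size=300)
y_dep = x + 0.1 * rng.normal(size=300)
y_ind = rng.normal(size=300)
dep = aldg_hat(x, y_dep, t=0.3, hx=0.3, hy=0.3)
ind = aldg_hat(x, y_ind, t=0.3, hx=0.3, hy=0.3)
```

Because the estimate is a share of points above a threshold, it is nonincreasing in t by construction, and any density estimator could replace the boxcar kernel without changing the structure of the code.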

3.4. Selection of hyper-parameter t

In this section, we propose 2 methods for selecting t, each of which has merit. We also provide guidance on which one is preferable in different practice settings.

Uniform error method: From the results in the previous section, we learn that Inline graphic is not robust under independence. To prevent Inline graphic from approaching Inline graphic under independence, it is sufficient to make sure that the estimation error of T under independence is uniformly dominated by t with high probability. To compute the uniform estimation error of T under independence, we first manually construct the independence case via random shuffling. Given n samples Inline graphic of Inline graphic, denote the corresponding empirical joint distribution as Inline graphic, and the empirical marginal distributions as Inline graphic and Inline graphic. Applying the random shuffle function Inline graphic to the indices of one dimension (ie, Y), we have Inline graphic, that is, the shuffled samples now come from a different joint distribution under which Inline graphic are independent.

We can then use the shuffled samples to compute the uniform estimation error of T under independence. Note that T under independence is exactly zero, therefore its uniform estimation error is just the uniform upper bound of its estimation. To stabilize the estimation of such an upper bound, we use the median of the estimated upper bound from Inline graphic different random shuffles as the final estimation. We call this the uniform error method.
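The shuffle-based selection of t can be sketched as follows. The sketch takes the largest estimated gap over the sample points as the "uniform upper bound" for one shuffle and returns the median over shuffles. The boxcar gap estimate, the function names, and the default of 11 shuffles are assumptions made for illustration; the number of shuffles used in the paper is not reproduced here.

```python
import numpy as np


def local_density_gap(x, y, hx, hy):
    """Boxcar-kernel estimate of the (assumed) density gap at each sample point."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    near_x = np.abs(x[:, None] - x[None, :]) <= hx
    near_y = np.abs(y[:, None] - y[None, :]) <= hy
    f_x = near_x.mean(axis=1) / (2.0 * hx)
    f_y = near_y.mean(axis=1) / (2.0 * hy)
    f_xy = (near_x & near_y).mean(axis=1) / (4.0 * hx * hy)
    return (f_xy - f_x * f_y) / np.sqrt(f_x * f_y)


def select_t_uniform_error(x, y, hx, hy, n_shuffles=11, seed=0):
    """Median over shuffles of the largest estimated gap under independence."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    upper_bounds = []
    for _ in range(n_shuffles):
        y_shuffled = rng.permutation(y)  # breaks any dependence between x and y
        gap = local_density_gap(x, y_shuffled, hx, hy)
        upper_bounds.append(gap.max())   # the true gap is zero, so this is pure error
    return float(np.median(upper_bounds))


rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = x ** 2 + 0.2 * rng.normal(size=200)
t_selected = select_t_uniform_error(x, y, hx=0.3, hy=0.3)
```

Each shuffle costs as much as one evaluation of the estimator itself, which is the extra computation that motivates the cheaper alternative described next.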

Asymptotic norm method: When using Inline graphic in large-scale data analysis, choosing t using the above data-dependent choice may be undesirable because it requires additional computations. In extensive simulations, we observe that a simple alternative also performs fine in terms of maintaining consistency, power, and robustness:

3.4. (8)

This choice is motivated by the following heuristic. Recall our derivation of aLDG statistics from avgCSN around (2): as the sample size n goes to infinity, and Inline graphic, Inline graphic, the empirical estimation of Inline graphic using the boxcar kernel coincides with avgCSN. Therefore, Inline graphic in (2) could serve as a natural choice for t, but one needs to be extra careful about Inline graphic, which is the test level of local contingency test (1) in definition toward avgCSN. We specifically modify Inline graphic to decrease with n instead of a fixed value like 0.05 since we desire consistency: that is, Inline graphic under independence should go to zero as n goes to infinity. Finally, plugging in our derived asymptotic near-optimal choice of bandwidth Inline graphic, Inline graphic for boxcar kernel density estimator (see Supplementary 6 for derivation), together with the new Inline graphic in place of Inline graphic into Inline graphic (2), we get (8). We call this the asymptotic norm method.

Empirically, we find that the asymptotic norm method is often too conservative when the sample size is small (which is expected, since it is based on the asymptotic normality of a contingency table test statistic). In practice, we recommend the uniform error method over the asymptotic norm method when the sample size is not too big (eg, no bigger than 200). When the sample size is big enough (eg, bigger than 200) and the computation budget is limited, we recommend the asymptotic norm method. In the rest of the paper, we use the uniform error method when the sample size is no bigger than 200 and the asymptotic norm method when the sample size is bigger than 200. There could be other promising ways of selecting t, for example, the geometric approach described in Supplementary 8. Here we present only the methods that we found to work best after a careful evaluation.

4. MINIBATCHED LDG: LOCAL RELATIONAL STRUCTURE

In many cases, a special structure emerges between cellular states. For example, a smooth transition where individual cells represent points along a continuum or lineage; or a spatial graph where cell states represent nodes in a graph. Cells in these cases change states by undergoing gradual transcriptional changes that are controlled by an underlying temporal or spatial factor. The majority of the work in structured genetic data analysis focuses on marginal characterization, while higher-order perspectives like gene-gene relationships are underexplored. scHOT (Ghazanfar et al., 2020) makes the first attempt toward this direction: they infer gene pairs with relational differences along a trajectory or across spatial locations. Despite the novel perspective, their approach is rather heuristic: assuming the trajectories and corresponding pseudotime (or the spatial location) are given, they compute gene coexpression at each time point (or location) using weighted univariate correlation (weights are determined by a triangular kernel). To test whether a gene pair is differentially associated along a curve or across spatial location, they use the standard deviation of the series of time-specific gene coexpressions along the curve as the summary statistics and perform a permutation test. Wang et al. (2021) explore a similar task, but they split the cells into multiple bins along the trajectory first and then compute one covariance matrix (avgCSN) for each bin using only cells from that bin. Finally, they test the differences between the covariance matrices as a whole and report the leverage genes as the differentially associated genes along the trajectory.

Formally put, assume there are p genes and n cells, and each cell is associated with a structure covariate S taking values on a set Inline graphic. Assume the data-generating mechanism:

  1. For each Inline graphic, independently generate Inline graphic from a distribution Inline graphic. These are the structure covariates for each cell.

  2. For each Inline graphic, generate Inline graphic independently from Inline graphic, where Inline graphic is a class of probability distribution on Inline graphic indexed by s.

Then both scHOT and avgCSN estimate the dependence of gene pairs under Inline graphic, which is a Inline graphic dependence matrix for the joint distribution Inline graphic. The local aggregation in scHOT or binning in avgCSN further reduces the estimation error from similar time points. The underlying assumption is that Inline graphic, and hence the corresponding dependence matrix indexed by s, varies smoothly as s changes.

The approach we are going to propose instead works on the mixture distribution Inline graphic where S is treated as randomly generated from Inline graphic. For gene pairs Inline graphic, we use Inline graphic to estimate at cell k the LDG matrix:

4. (9)

Then for each cell k, we aggregate the LDG matrices of its neighboring cells, where neighbors are determined by the closeness of their structure covariate values (eg, pseudotime or spatial location). This local aggregation pools G to get the final estimate of time/location-specific gene coexpression and is designed to reduce estimation error. We call these estimates the minibatched LDG. The underlying assumption is that the G matrix, which is a random matrix, moves smoothly in its sample space as the structure covariate S changes.
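The aggregation step can be sketched as a nearest-neighbor average over the structure covariate. The sketch assumes that the per-cell LDG matrices are already available as a binary array `G` of shape (cells, genes, genes) and that each minibatch consists of the `batch_size` cells closest to cell k in the covariate, including k itself. The function name and the fixed batch size are illustrative choices.

```python
import numpy as np


def minibatch_ldg(G, s, batch_size):
    """Average each cell's LDG matrix with those of its nearest cells in s."""
    G = np.asarray(G, dtype=float)
    s = np.asarray(s, dtype=float)
    n = G.shape[0]
    s2 = s.reshape(n, -1)  # accepts pseudotime (n,) or spatial coordinates (n, d)
    out = np.empty_like(G)
    for k in range(n):
        dist = np.linalg.norm(s2 - s2[k], axis=1)
        # Stable sort keeps ties in index order, so the result is deterministic.
        batch = np.argsort(dist, kind="stable")[:batch_size]
        out[k] = G[batch].mean(axis=0)
    return out


# Toy example: genes 0 and 1 are associated only in the first three cells.
G = np.zeros((6, 2, 2))
G[:3, 0, 1] = G[:3, 1, 0] = 1.0
pseudotime = np.arange(6.0)
M = minibatch_ldg(G, pseudotime, batch_size=3)
```

In the toy example the pooled value for the gene pair is 1 at the start of the trajectory, 0 at the end, and intermediate around the third and fourth cells, which is the kind of local change that the minibatched estimate is meant to highlight.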

In Section 5.2, we provide 2 real data examples to demonstrate how minibatched LDG can be used to highlight local structural change.

5. EMPIRICAL EVALUATION

5.1. Simulation results

In this section, we consider simulations that resemble single-cell data to gain insights underlying the behavior of aLDG relative to the other methods. Specifically, we investigate scenarios where the bivariate relationship is (1) a finite mixture; (2) linear or nonlinear; and (3) monotone or nonmonotone. See Figure S8 for all the synthetic data distributions we considered. We include the detailed generation scheme in Supplementary 12. We evaluate each dependence measure from the following perspective: (1) ability to capture complex relationships; (2) ability to accumulate subtle local dependence; (3) interpretation of the strength of dependence in common sense; (4) power as an independence test; and (5) computation time. In the following, we focus on one perspective in each subsection, showing selective examples that inform our conclusions, and relegating other examples to Supplementary.

aLDG detects nonlinear, nonmonotone relationships: By construction, aLDG is expected to detect any non-negligible deviation from independence. Though many existing measures, such as HSIC, Hoeffding’s D, dCor, and $\tau^*$, claim to be sensitive to nonlinear, nonmonotone relationships, some approaches are known to perform poorly under certain circumstances. By contrast, aLDG outperforms most of its competitors in the following standard evaluation experiment. Figure 1 (a) illustrates three points: (1) at independence, except for dCor, HHG, and MIC, most measures produce negligible values, as desired; (2) for linear and monotone relationships, all measures produce high values as expected; and (3) for nonlinear nonmonotone relationships, only aLDG, dCor, HHG, and MIC produce high values consistently. In conclusion, only aLDG can effectively detect various types of dependency relationships while maintaining a near-zero value at independence. dCor, HHG, and MIC are known to be sensitive to small, artificial deviations from independence, and these simulations reveal that they are indeed too sensitive, as they often produce high values at independence. A large portion of scRNA-seq data are collected over time, so nonlinear, nonmonotone, and specifically oscillatory relationships are expected to occur. It is therefore desirable to have a measure that is sensitive to dependence while remaining near zero under true independence, even under small perturbations.

FIGURE 1.

(a) Empirical dependency estimates obtained for different data distributions for a variety of relationships between a pair of variables. For the visualization of different data distributions, see Figure S8. Here, we show the corresponding dependence level given by different measures using 200 samples with noise level Inline graphic. (averaged over 50 trials). (b) The empirical power of permutation test at level 0.05, based on different dependency measures under different data distributions and sample sizes. For the visualization of different data distributions, see Figure S8. the power is estimated using 50 independent trials.
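The permutation test whose power is reported in panel (b) can be sketched generically: any dependence measure is recomputed after shuffling one variable, and the p-value is the share of shuffled values at least as large as the observed one. The function names, the 199 permutations, and the use of absolute Pearson correlation as the plugged-in statistic are assumptions made to keep the example short. They are not the settings used for Figure 1.

```python
import numpy as np


def permutation_pvalue(x, y, statistic, n_perm=199, seed=0):
    """One-sided permutation p-value for independence of x and y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = statistic(x, y)
    # Shuffling y simulates the null hypothesis of independence.
    null = np.array([statistic(x, rng.permutation(y)) for _ in range(n_perm)])
    # The +1 terms count the observed statistic as one of the permutations.
    return float((1 + np.sum(null >= observed)) / (1 + n_perm))


def abs_pearson(x, y):
    """Stand-in dependence measure; any other measure can be passed instead."""
    return abs(np.corrcoef(x, y)[0, 1])


rng = np.random.default_rng(2)
x = rng.normal(size=100)
p_dep = permutation_pvalue(x, x + 0.5 * rng.normal(size=100), abs_pearson)
reject = p_dep <= 0.05  # decision at level 0.05
```

Empirical power is then the fraction of independent trials in which `reject` is true. Swapping `abs_pearson` for another measure changes only the statistic, not the testing procedure.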

aLDG accumulates subtle local dependencies: aLDG detects the subset of the sample space that shows a pattern of dependence. In Figure 2a, we simulated the data as a bivariate negative binomial mixture consisting of 3 components with a varying proportion of highly dependent components and estimated the corresponding dependence level. We find that aLDG, together with the other dependence measures designed to capture local dependence (HHG and MIC), increases with the proportion of highly correlated components, indicating that these global dependence measures can also detect subtle local dependence structure. As finite mixtures and the negative binomial distribution are common modeling choices for scRNA-seq data (Tian et al., 2020), this suggests that measures able to accumulate dependencies across individual components could considerably benefit scRNA-seq data analysis. Similar results are obtained for Gaussian mixtures (Figure S9).

FIGURE 2.


(a) Empirical aLDG value for Negative Binomial mixtures. In each plot, we show the dependence level given by different measures for 200 samples (averaged over 50 trials). The data are generated as a 3-component Negative Binomial mixture. From left to right, there are 0, 1, 2, and 3 out of 3 components with correlation of 0.8, while the remaining components have correlation 0, that is, the dependence level increases from left to right. For the visualization of these different data distributions, see Figure S8. (b) Empirical dependence measure versus noise levels Inline graphic for different bivariate relationships. For the visualization of different data distributions, see Figure S8. The results are shown for 100 samples (averaged over 50 trials). We claim that the higher the noise level is, the lower the estimated degree of dependence should be. Compared with other measures, aLDG decreases significantly as the noise level increases, and hence correctly infers the relative degree of dependence.

aLDG interprets degree of dependencies: While it is hard to define the relative dependence level in general, we argue that when one random variable is a function of the other, Inline graphic, the pair should be regarded as perfectly dependent (and be assigned dependence level 1). Moreover, the dependence level should decrease as independent noise is added. That is, for Inline graphic, where Inline graphic, one should expect the dependence measure Inline graphic to satisfy Inline graphic. We checked this monotonicity property by simulating data with several bivariate relationships and varying levels of noise (Figure 2b). Specifically, we simulate the noise Inline graphic to be standard normal, and Inline graphic where Inline graphic indicates the noise level. We find that aLDG, HSIC, MIC, dCor, and HHG all show a clear decreasing pattern as the noise level increases; however, aLDG shows the most consistent monotonic drop from perfect dependence.
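The noise experiment described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the additive form `f(x) + sigma * eps`, the quadratic relationship, the uniform design for `x`, and the noise grid are all assumptions, and sample distance correlation (Székely et al., 2007) stands in for the dependence measure because aLDG itself is not implemented here.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation for two 1-D arrays (stand-in measure)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-center each pairwise distance matrix.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = max((A * B).mean(), 0.0)
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return float(np.sqrt(dcov2 / denom)) if denom > 0 else 0.0

def noise_curve(f, noise_levels, n=100, n_trials=50, seed=0):
    """Average dependence of (x, f(x) + sigma * eps) for each sigma."""
    rng = np.random.default_rng(seed)
    totals = np.zeros(len(noise_levels))
    for _ in range(n_trials):
        x = rng.uniform(-1.0, 1.0, size=n)
        eps = rng.standard_normal(n)  # standard normal noise, as in the text
        # Reuse the same x and eps across noise levels within a trial.
        for i, sigma in enumerate(noise_levels):
            totals[i] += distance_correlation(x, f(x) + sigma * eps)
    return totals / n_trials

noise_levels = [0.0, 0.25, 0.5, 1.0]  # hypothetical grid
curve = noise_curve(lambda x: x ** 2, noise_levels)
for sigma, value in zip(noise_levels, curve):
    print(f"sigma={sigma:.2f}  mean dependence={value:.3f}")
```

A measure with the monotonicity property should give a curve that falls as `sigma` grows; swapping in another measure only requires replacing `distance_correlation`.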

aLDG is powerful as an independence test: Dependence measures are natural candidates for tests of independence. In this context, most existing dependence measures rely on bootstrapping or permutation to determine significance; hence we adopt this practice for all the dependence measures under comparison. Figure 1b shows the empirical power at test level 0.05 for various types of data distribution and sample size, where we use 200 permutations to estimate the null distribution. We observe the following outcomes: (1) almost all tests have controlled type-I error under independence; (2) Pearson’s $\rho$, Spearman’s $\rho$, and Kendall’s $\tau$ are powerless for testing nonlinear and nonmonotone relationships; (3) aLDG, HHG, and HSIC are consistently among the top three most powerful approaches for testing both linear and nonlinear, monotone and nonmonotone relationships.
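The permutation procedure described above can be sketched generically. This is a minimal illustration, not the implementation in the aLDG R package: the function names, the two toy data generators, and the use of absolute Pearson correlation as the test statistic are assumptions; any dependence measure with the same `(x, y)` signature could be passed instead.

```python
import numpy as np

def permutation_pvalue(x, y, statistic, n_perm=200, rng=None):
    """Permutation p-value: shuffle y to break any dependence on x."""
    rng = np.random.default_rng(rng)
    observed = statistic(x, y)
    null = np.array([statistic(x, rng.permutation(y)) for _ in range(n_perm)])
    # Add-one correction keeps the p-value strictly positive.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

def abs_pearson(x, y):
    """Stand-in statistic: absolute Pearson correlation."""
    return abs(np.corrcoef(x, y)[0, 1])

def empirical_power(simulate, statistic, n_trials=50, alpha=0.05, seed=0):
    """Fraction of simulated datasets on which the test rejects at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        x, y = simulate(rng)
        rejections += int(permutation_pvalue(x, y, statistic, rng=rng) <= alpha)
    return rejections / n_trials

def linear_pair(rng, n=100):
    """Hypothetical linear relationship with additive noise."""
    x = rng.standard_normal(n)
    return x, x + rng.standard_normal(n)

def quadratic_pair(rng, n=100):
    """Hypothetical nonmonotone relationship with additive noise."""
    x = rng.uniform(-1.0, 1.0, n)
    return x, x ** 2 + 0.1 * rng.standard_normal(n)

power_linear = empirical_power(linear_pair, abs_pearson)
power_quadratic = empirical_power(quadratic_pair, abs_pearson)
print("power, linear pair:   ", power_linear)
print("power, quadratic pair:", power_quadratic)
```

With a Pearson-based statistic, the estimated power on the quadratic pair should be well below that on the linear pair, in line with outcome (2).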

aLDG is comparatively as fast as its competitors: Theoretically, aLDG requires Inline graphic computation time (where n is the number of samples), which is comparable to the reported requirements of most dependence measures that can detect complex relationships. This is confirmed empirically by comparing the computation time of aLDG with that of all its competitors. In Figure S8, we plot computation time versus sample size n for the different dependence measures. In the previous evaluations, we saw that HHG, a method motivated by capturing local dependence structure, was indeed a strong competitor to aLDG: it has high power as an independence test across almost all the data distributions we considered. However, it requires Inline graphic computation time, and Figure S10 shows this large discrepancy from all the other methods, which normally take Inline graphic time.

5.2. Real data applications and realistic simulations

In this section, we evaluate the performance of aLDG against the other measures using scRNA-seq data from 2 studies.

Case study: Chu dataset: this dataset (Chu et al., 2016) contains 1018 cells of human embryonic stem cell-derived lineage-specific progenitors. The 7 cell types, including H1 embryonic stem cells (H1), H9 embryonic stem cells (H9), human foreskin fibroblasts (HFF), neuronal progenitor cells (NPC), definitive endoderm cells (DEC), endothelial cells (EC), and trophoblast-like cells (TB), were identified by fluorescence-activated cell sorting (FACS) with their respective markers. On average, 9600 genes are measured per cell. In the following, we show some special gene pairs that exhibit strong, weak, or no relational patterns and the corresponding dependence values produced by different measures. We find that only aLDG gives a high value for strong relational patterns no matter how complex the pattern composition is; maintains near-zero values for known independent cases; and avoids a spurious relationship skewed by technical noise and sparsity (Figure 3). We include more examples in Figures S4 and S5, Supplementary 12.

FIGURE 3.


Example of gene pair scatter plots from the Chu dataset, which has 1018 cells from 7 cell types. Gene expression is recorded as counts per million (CPM) and Inline graphic transformed. In each plot, we show the scatter plot of Inline graphic for a pair of genes and provide the corresponding estimated dependence values using different methods to the right of the plots. (a) aLDG gives a much higher value than the others in these scenarios, which appear to illustrate a strong mixture dependence pattern, even when the signal is predominantly in one cell type. (b) aLDG produces a high value for the obvious 3-component mixture relationship in the first subplot. By contrast, in the second subplot, the cell identities are randomly shuffled for each gene pair, resulting in a constructed case of independence. Most measures, including aLDG, give near-zero values in this setting. The exception is MIC, which gives a misleadingly high value. (c) This example illustrates performance when there is a high level of sparsity: MIC and the moment-based methods like Pearson, dCor, and HSIC greatly overestimate the dependence, while aLDG, TauStar, and Hoeffding’s D are not influenced by this phenomenon. (d) This gene pair combines the challenge of sparsity with considerable noise: aLDG is still able to capture the less noisy, local cluster pattern in the upper left corner.

Detecting change point along trajectory: mouse liver datasets. The dataset we use is merged from 4 different sources using scMerge (Lin et al., 2019), following scHOT (Ghazanfar et al., 2020). It contains cells captured at 8 real-time stages, and different stages may contain different cell types. The scHOT paper (Ghazanfar et al., 2020) conducted downstream analysis on this dataset for the 3 most interesting cell types: Cholangiocyte, Hepatoblast, and Hepatocyte (540 cells in total). In particular, Hepatoblast cells are a predecessor of both Cholangiocyte and Hepatocyte cells: at some time point the Hepatoblast lineage splits into two branches, one becoming Cholangiocyte cells and the other becoming Hepatocyte cells. We focus on these 3 cell types in this section.

In Figure S21, we plot the first 2 principal components and indicate cell types and real-time stages for each point (cell), with the curves estimated by slingshot (Street et al., 2018) using only a randomly selected half of the data. We can see that the curves fit the data well, and the real-time stages generally agree with the pseudotime. Then we use the remaining half of the data to estimate the minibatched LDG. In Figure 4, we show results for the curve starting from Hepatoblast and ending at Cholangiocyte (conclusions are similar for the other branch). We visually spot a consistent emergence of strong gene coexpression patterns around the branching time (framed by the red rectangle).

FIGURE 4.


The minibatched LDG at some interesting pseudotime points for the curve starting with Hepatoblast cells and ending with Cholangiocyte cells, estimated using the other half of the data (the first half was used to estimate pseudotime). We show the estimated minibatched LDG for 3 independent trials (ie, different data splits). On top of each coexpression matrix, we annotate the sLED p-value (testing whether the current time point is significantly different from the one immediately after it).

Now that we have a gene coexpression matrix (ie, the minibatched LDG) that changes over pseudotime, we consider the task of change point detection. Our estimated time-specific gene coexpression appears to be very sparse in many stages, making dynamic community estimation based on the stochastic block model inappropriate. Other methods that impose fewer structural constraints require extensive tuning and computation time to obtain high-confidence results (Wang et al., 2021). We therefore present a simple heuristic method instead, which works well in simulation and real data examples. Specifically, we use sLED (Zhu et al., 2017) to test whether time (ie, pseudotime rank) Inline graphic and time Inline graphic are different: we input the LDG tensor and, during permutation, permute the entire set of time indices; the difference matrix is computed as the absolute difference between the minibatched LDGs at time Inline graphic and Inline graphic using window size w (ie, averaged LDG within the Inline graphicth and the Inline graphicth samples). We observe that the resulting sLED p-values fall below our testing level of 0.05 only near the branching point, meaning that our minibatched LDG method can reveal statistically significant changes around the branching point.
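The windowed comparison described above can be sketched as follows. This is a simplified illustration, not the authors' procedure: the exact window indices are elided in the text, so the adjacent windows `[t - w, t)` and `[t, t + w)` are an assumption; the maximum absolute entry of the difference matrix is a simple stand-in for the sLED test statistic; and the toy tensor of cell-specific networks is synthetic.

```python
import numpy as np

def window_mean(tensor, start, w):
    """Minibatched matrix: average of the cell-specific matrices in a window."""
    return tensor[start:start + w].mean(axis=0)

def difference_matrix(tensor, t, w):
    """Absolute difference between the windows just after and just before t.

    Assumes w <= t <= n_cells - w so that both windows are complete.
    """
    return np.abs(window_mean(tensor, t, w) - window_mean(tensor, t - w, w))

def change_pvalue(tensor, t, w, n_perm=200, seed=0):
    """Permutation p-value for a change at pseudotime rank t.

    The null distribution is built by permuting the whole time index.
    The max-entry statistic is a stand-in, not the sLED statistic.
    """
    rng = np.random.default_rng(seed)
    observed = difference_matrix(tensor, t, w).max()
    null = np.empty(n_perm)
    for b in range(n_perm):
        shuffled = tensor[rng.permutation(tensor.shape[0])]
        null[b] = difference_matrix(shuffled, t, w).max()
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

def toy_tensor(n=120, p=5, split=60, seed=1):
    """Synthetic cell-specific networks ordered by pseudotime.

    Edges are sparse before `split`; afterwards genes 0-2 become connected.
    """
    rng = np.random.default_rng(seed)
    prob = np.full((n, p, p), 0.05)
    prob[split:, :3, :3] = 0.8
    edges = (rng.random((n, p, p)) < prob).astype(float)
    return np.maximum(edges, edges.transpose(0, 2, 1))  # symmetrize

tensor = toy_tensor()
p_change = change_pvalue(tensor, t=60, w=20)  # at the simulated change point
p_flat = change_pvalue(tensor, t=30, w=20)    # inside a stable stretch
print("p-value at t=60:", p_change)
print("p-value at t=30:", p_flat)
```

Scanning `t` over the pseudotime ranks and flagging positions where the p-value falls below the testing level gives a heuristic change point detector of the kind described in the text.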

Highlight brain structure: MERFISH brain datasets. This dataset was used by Fischer et al. (2023) for cell communication estimation. It was originally collected by Zhang et al. (2021), who measured mouse primary motor cortex with multiplexed error-robust fluorescence in situ hybridization (MERFISH) in 634 images across 2 mice with 254 genes observed in 284 098 cells. The cell types were originally annotated by Zhang et al. (2021). We focus on L2/3, L4/5, L5, and L6 cells, which were shown to form an interesting transition in Zhang et al. (2021). We further constrain the other experimental conditions to rule out any other confounding effects. We include the details of all the preprocessing and filtration in Supplementary 11. The final dataset has around 2600 cells. Figure 5 shows that the gene interaction shows more spatial patterns: the top and bottom layers, and especially the top layer, seem to have more interaction than the middle layer.

FIGURE 5.


The spatial plot annotated by cell type, total gene expression level, and total gene interaction, using 52 genes (the union of 26 differentially co-expressed and 26 nondifferentially expressed genes) and L2/3, L4/5, L5, and L6 IT cells. (a) The cell type annotation for each spatial sample; (b) the average of all gene expression levels for each spatial sample; (c) the average of all edges in the minibatched LDG for each spatial sample; (d) the degree of each gene in the minibatched LDG for each spatial sample, where a larger 3D point size represents a larger degree. We can see that genes 30-50 contribute most of the gene interactions.

6. DISCUSSION

In this paper, we formalize the idea of averaging the cell-specific gene association (Dai et al., 2019; Wang et al., 2021) under a general statistical framework. We show that this approach produces a novel univariate dependence measure, called aLDG, that can detect nonlinear, nonmonotone relationships between a pair of variables. We then develop the corresponding theoretical properties of this estimator, including robustness and consistency. We also provide several hyperparameter choices that are more justifiable and effective. Extensive simulations, motivated by expected scRNA-seq gene co-expression relationships and real data applications, show that this measure outperforms existing dependence measures in various aspects: (1) it accumulates subtle local dependence over subpopulations; (2) it reflects the relative strength of dependence in the presence of noise, decreasing monotonically with the noise level, more reliably than many other measures that arose from independence testing; (3) it is sensitive to complex relationships while robustly maintaining near-zero values at true independence, whereas several other measures are often overly sensitive to slight perturbations from independence and noise; (4) it is comparatively fast to compute among dependence measures designed to capture complex relationships. Other measures perform well in some settings but fail in others that are highly relevant to the single-cell setting. For instance, MIC performed well as part of the sLED test for differences in co-expression matrices, but this measure tends to produce a high estimate of dependence even when the variables are independent, or nearly so (Figures 1a and 2b). The moment-based methods like Pearson, dCor, and HSIC perform poorly when the expression values are sparse, producing false indications of correlation (Figure 3), and yet sparsity is the norm in most single-cell data.
Our method is implemented in the R package aLDG (Tian, 2022), where we also include all the other methods that we have compared with.

Inspired by new techniques like spatial transcriptomics (Marx, 2021), we also explored the potential of aLDG in structured data analysis. We show that a minibatched version of aLDG highlights the structural changepoint using 2 real data examples: bifurcating point detection in cell trajectory, and spatial transcriptomics structure visualization.

The aLDG method does have some practical challenges. As a measure based on density estimation, its performance can be affected by hyperparameter choices such as the bandwidth. Though we provide some asymptotically optimal choices of those hyperparameters, they can fail in practice when the sample size is small. For any given setting, the hyperparameters can be adjusted based on realistic simulations of the actual data and a solid understanding of the scRNA-seq data distribution. Similarly, due to the reliance on density estimation, it is hard to extend this measure to a multivariate setting, because the sample size required for accurate estimation grows exponentially with the dimension. This limitation has little practical importance, however, because gene co-expression studies focus on bivariate relationships.
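The sensitivity to bandwidth mentioned above is a general feature of kernel density estimation and can be seen in a small example. This is a generic one-dimensional sketch, not the aLDG estimator: the Gaussian kernel, the sample of 100 standard normal draws, the candidate bandwidths, and the use of Silverman's rule of thumb as a reference value are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_at(point, sample, bandwidth):
    """Gaussian kernel density estimate at a single point."""
    z = (point - sample) / bandwidth
    return float(np.mean(np.exp(-0.5 * z ** 2)) / (bandwidth * np.sqrt(2 * np.pi)))

rng = np.random.default_rng(0)
sample = rng.standard_normal(100)  # small sample, as in the practical concern

# Silverman's rule of thumb for a Gaussian kernel (reference bandwidth).
h_ref = 1.06 * sample.std(ddof=1) * len(sample) ** (-1 / 5)

true_density = 1 / np.sqrt(2 * np.pi)  # standard normal density at 0
estimates = {h: gaussian_kde_at(0.0, sample, h) for h in (0.05, h_ref, 2.0)}
for h, value in estimates.items():
    print(f"bandwidth={h:.3f}  estimate at 0={value:.3f}  (truth {true_density:.3f})")
```

A very small bandwidth gives a noisy estimate and a very large one oversmooths, so any quantity built on such density estimates inherits this dependence on the bandwidth.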

Supplementary Material

ujae001_Supplemental_File

Web Appendices, Tables, and Figures referenced in Sections 3, 4, and 5.1 are available with this paper at the Biometrics website on Oxford Academic.

Acknowledgement

The authors would like to thank Xuran Wang for helpful comments.

Contributor Information

Jinjin Tian, Department of Statistics and Data Science, Carnegie Mellon University, 15213, Pittsburgh, PA, United States.

Jing Lei, Department of Statistics and Data Science, Carnegie Mellon University, 15213, Pittsburgh, PA, United States.

Kathryn Roeder, Department of Statistics and Data Science, Carnegie Mellon University, 15213, Pittsburgh, PA, United States.

FUNDING

This project is funded by National Institute of Mental Health (NIMH) grant R01MH123184 and National Science Foundation (NSF) grant DMS-2015492.

CONFLICT OF INTEREST

None declared.

DATA AVAILABILITY

No new data were generated or analyzed in support of this research.

References

  1. Allen J. D., Xie Y., Chen M., Girard L., Xiao G. (2012) Comparing statistical methods for constructing large scale gene networks. PLoS ONE, 7, e29348.
  2. Bell C. (1962) Mutual information and maximal correlation as measures of dependence. The Annals of Mathematical Statistics, 33, 587–595.
  3. Bergsma W., Dassios A. (2014) A consistent test of independence based on a sign covariance related to Kendall’s tau. Bernoulli, 20, 1006–1028.
  4. Chu L.-F., Leng N., Zhang J., Hou Z., Mamott D., Vereide D. T. et al. (2016) Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biology, 17, 1–20.
  5. Dai H., Li L., Zeng T., Chen L. (2019) Cell-specific network constructed by single-cell RNA sequencing data. Nucleic Acids Research, 47, e62–e62.
  6. Daub C. O., Steuer R., Selbig J., Kloska S. (2004) Estimating mutual information using B-spline functions–an improved similarity measure for analysing gene expression data. BMC Bioinformatics, 5, 1–12.
  7. Dhar S. S., Dassios A., Bergsma W. (2016) A study of the power and robustness of a new test for independence against contiguous alternatives. Electronic Journal of Statistics, 10, 330–351.
  8. Eisen M. B., Spellman P. T., Brown P. O., Botstein D. (1998) Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95, 14863–14868.
  9. Emmert-Streib F., Dehmer M., Haibe-Kains B. (2014) Gene regulatory networks and their applications: understanding biological and medical problems in terms of networks. Front Cell Dev Biol, 2, 38.
  10. Fischer D. S., Schaar A. C., Theis F. J. (2023) Modeling intercellular communication in tissues using spatial graphs of cells. Nature Biotechnology, 41, 332–336.
  11. Ghazanfar S., Lin Y., Su X., Lin D. M., Patrick E., Han Z.-G. et al. (2020) Investigating higher-order interactions in single-cell data with scHOT. Nature Methods, 17, 799–806.
  12. Gretton A., Bousquet O., Smola A., Schölkopf B. (2005) Measuring statistical dependence with Hilbert-Schmidt norms. In: International conference on algorithmic learning theory. Springer, Berlin, Heidelberg, 3734, 63–77.
  13. Haque A., Engel J., Teichmann S. A., Lönnberg T. (2017) A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med, 9, 75.
  14. Heller R., Heller Y., Gorfine M. (2013) A consistent multivariate test of association based on ranks of distances. Biometrika, 100, 503–510.
  15. Heller R., Heller Y., Kaufman S., Brill B., Gorfine M. (2016) Consistent distribution-free k-sample and independence tests for univariate random variables. The Journal of Machine Learning Research, 17, 978–1031.
  16. Hoeffding W. (1948) A non-parametric test of independence. The Annals of Mathematical Statistics, 19, 546–557.
  17. Lin Y., Ghazanfar S., Wang K. Y., Gagnon-Bartsch J. A., Lo K. K., Su X. et al. (2019) scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets. Proceedings of the National Academy of Sciences, 116, 9775–9784.
  18. Marx V. (2021) Method of the year: spatially resolved transcriptomics. Nature Methods, 18, 9–14.
  19. Raj A., Peskin C. S., Tranchina D., Vargas D. Y., Tyagi S. (2006) Stochastic mRNA synthesis in mammalian cells. PLoS Biology, 4, e309.
  20. Rényi A. (1959) On measures of dependence. Acta Mathematica Academiae Scientiarum Hungarica, 10, 441–451.
  21. Reshef D. N., Reshef Y. A., Finucane H. K., Grossman S. R., McVean G., Turnbaugh P. J. et al. (2011) Detecting novel associations in large data sets. Science, 334, 1518–1524.
  22. Song L., Langfelder P., Horvath S. (2012) Comparison of co-expression measures: mutual information, correlation, and model based indices. BMC Bioinformatics, 13, 1–21.
  23. Steuer R., Kurths J., Daub C. O., Weise J., Selbig J. (2002) The mutual information: detecting and evaluating dependencies between variables. Bioinformatics, 18, S231–S240.
  24. Street K., Risso D., Fletcher R. B., Das D., Ngai J., Yosef N. et al. (2018) Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics, 19, 1–16.
  25. Székely G. J., Rizzo M. L., Bakirov N. K. (2007) Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35, 2769–2794.
  26. Tian J. (2022) R Package: Averaged Local Density Gap. https://github.com/jinjint/aldg.
  27. Tian J., Wang J., Roeder K. (2020) ESCO: single cell expression simulation incorporating gene co-expression. Bioinformatics, 37, 2374–2381.
  28. Wang D., Yu Y., Rinaldo A. (2021) Optimal change point detection and localization in sparse dynamic networks. The Annals of Statistics, 49, 203–232.
  29. Wang X., Choi D., Roeder K. (2021) Constructing local cell-specific networks from single-cell data. Proceedings of the National Academy of Sciences, 118, e2113178118.
  30. Wang Y. R., Waterman M. S., Huang H. (2014) Gene coexpression measures in large heterogeneous samples using count statistics. Proceedings of the National Academy of Sciences, 111, 16371–16376.
  31. Zhang M., Eichhorn S. W., Zingg B., Yao Z., Cotter K., Zeng H. et al. (2021) Spatially resolved cell atlas of the mouse primary motor cortex by MERFISH. Nature, 598, 137–143.
  32. Zhu L., Lei J., Devlin B., Roeder K. (2017) Testing high-dimensional covariance matrices, with application to detecting schizophrenia risk genes. The Annals of Applied Statistics, 11, 1810.


Articles from Biometrics are provided here courtesy of Oxford University Press
