Author manuscript; available in PMC: 2019 Jun 12.
Published in final edited form as: J Am Stat Assoc. 2018 Jun 12;113(523):1296–1310. doi: 10.1080/01621459.2017.1341412

Modeling Persistent Trends in Distributions

Jonas Mueller 1, Tommi Jaakkola 1, David Gifford 1
PMCID: PMC6428438  NIHMSID: NIHMS1504762  PMID: 30906084

Abstract

We present a nonparametric framework to model a short sequence of probability distributions that vary both due to underlying effects of sequential progression and confounding noise. To distinguish between these two types of variation and estimate the sequential-progression effects, our approach leverages an assumption that these effects follow a persistent trend. This work is motivated by the recent rise of single-cell RNA-sequencing experiments over a brief time course, which aim to identify genes relevant to the progression of a particular biological process across diverse cell populations. While classical statistical tools focus on scalar-response regression or order-agnostic differences between distributions, it is desirable in this setting to consider both the full distributions as well as the structure imposed by their ordering. We introduce a new regression model for ordinal covariates where responses are univariate distributions and the underlying relationship reflects consistent changes in the distributions over increasing levels of the covariate. This concept is formalized as a trend in distributions, which we define as an evolution that is linear under the Wasserstein metric. Implemented via a fast alternating projections algorithm, our method exhibits numerous strengths in simulations and analyses of single-cell gene expression data.

Keywords: Wasserstein distance, batch effect, quantile regression, pool adjacent violators algorithm, single cell RNA-seq

1. Introduction

A common type of data in scientific and survey settings consists of real-valued observations sampled in batches, where each batch shares a common label (this numerical/ordinal value is the covariate) whose effects on the observations are the item of interest. When each batch consists of a large number of i.i.d. observations, the empirical distribution of the batch may be a good approximation of the underlying population distribution conditioned on the value of the covariate. A natural goal in this setting is to quantify the covariate’s effect on these conditional distributions, considering changes across all segments of the population. In the case of high-dimensional observations, one can measure this effect separately for each variable to identify which are the most interesting. However, it may often occur that, in addition to random sampling variability, there exist unmeasured confounding variables (unrelated to the covariate) that affect the observations in a possibly dependent manner within the same batch (cf. batch effects in Risso et al. 2014).

The primary focus of this paper is the introduction of the TRENDS (Temporally Regulated Effects on Distribution Sequences) regression model, which infers the magnitude of these covariate-effects across entire distributions. TRENDS is an extension of classic regression with a single covariate (typically of fixed-design), where one realization of our dependent variable is a batch’s entire empirical distribution (rather than a scalar) and the condition that fitted-values are smooth/linear in the covariate is replaced by the condition that fitted distributions follow a trend. Formally defined in §5, a trend describes a sequence of distributions where the pth quantile evolves monotonically for all p ∈ (0,1), though not necessarily in the same direction for different p, and there are at most two partitions of the quantiles that move in opposite directions. Thus, TRENDS extends scalar-valued regression to full distributions while retaining the ability to distinguish effects of interest from extraneous noise.

Despite the generality of our ideas, we motivate TRENDS with a concrete scientific application: the analysis of single-cell RNA-sequencing time course data (see §S7 for a different application to income data; references preceded by ‘S’ are in the Supplementary Material).

The recent introduction of single-cell RNA-seq (scRNA-seq) techniques to obtain transcriptome-wide gene expression profiles from individual cells has drawn great interest (Geiler-Samerotte et al. 2013). Previously only measurable in aggregate over a whole tissue-sample/culture consisting of thousands of cells, gene-expression at the single-cell level offers insight into biological phenomena at a much finer-grained resolution, and is important to quantify as even cells of the same supposed type exhibit dramatic variation in morphology and function. One promising experimental design made feasible by the advent of this technology involves sampling groups of cells at various times from tissues / cell-cultures undergoing development and applying scRNA-seq to each sampled cell (Trapnell et al. 2014, Buettner et al. 2015). It is hoped that these data can reveal which developmental genes regulate/mark the emergence of new cell types over the course of development.

Current scRNA-seq cost/labor constraints prevent dense sampling of cells continuously across the entire time-continuum. Instead, researchers target a few time-points, simultaneously isolating sets of cells at each time and subsequently generating RNA-seq transcriptome profiles for each individual cell that has been sampled. More concretely, from a cell population undergoing some biological process like development, one samples N_ℓ ⩾ 1 batches of cells from the population at time t_ℓ, where ℓ = 1, 2, …, L indexes the time-points in the experiment and i = 1, …, N = Σ_{ℓ=1}^{L} N_ℓ indexes the batches. Each batch consists of n_i cells sampled and sequenced together. We denote by x_{i,s}^{(g)} the measured expression of gene g in the sth cell of the ith batch (1 ⩽ s ⩽ n_i), sampled at time t_{ℓ_i}.

Because expression profiles are restricted to a sparse set of time points in current scRNA-seq experiments, the underlying rate of biological progression can drastically differ between equidistant times. Thus, changes in the expression of genes regulating different parts of this process may be highly nonuniform over time, invalidating assumptions like linearity or smoothness. One common solution in standard tissue-level RNA-seq time course analysis is time-warping (Bar-Joseph et al. 2003). Since our interest lies not in predicting gene-expression at new time-points, we instead aim for a procedure that respects the sequence of times without being sensitive to their precise values. In fact, researchers commonly disregard the wall-clock time at which sequencing is done, instead recording the experimental chronology as a sequence of stages corresponding to overall qualitative states of the biological sample. For example, in Deng et al. (2014): Stage 1 is the oocyte, Stage 2 the zygote, …, Stage 11 the late blastocyst. Attempting to impose a common scale on the stage numbering is difficult because the similarity in expression expected across different pairs of adjacent stages might be highly diverse for different genes. In this work, we circumvent this issue by disregarding the time-scale and t_ℓ values, instead working only with the ordinal levels (so the only information retained about the times is their order t_1 < t_2 < ⋯ < t_L), as done by Bijleveld et al. (1998) (Section 2.3.2).

Depictions of such data from two genes (where N_ℓ = 1 for each ℓ) are shown in the lefthand panels of Figure 1. Lacking longitudinal measurements, these data differ from those studied in time series analysis: at each time point, one observes a different group of numerous exchangeable samples (no cell is profiled in two time points), and the number of time points is small (generally L < 10). As a result of falling RNA-seq costs, multiple cell-capture plates (each producing a batch of sampled cells, i.e. N_ℓ > 1) are being used at each time point to observe larger fractions of the cell population (Zeisel et al. 2015). Because the cells in a batch are simultaneously collected and sequenced (independently of other batches), the measured gene-expression values are often biased by batch effects: technical artifacts that perturb observed values in a possibly correlated fashion between cells of the same batch (Risso et al. 2014, Kharchenko et al. 2014). Rather than treating the cells from a single time point identically, it is desirable to retain batch information and account for this nuisance variation. Batch effects are also prevalent in other applications including temporal studies of demographic statistics, where a simultaneously-collected group of survey results may be biased by latent factors like location.

Figure 1:

Violin plots (kernel density estimates) depicting the empirical distribution of known developmental genes’ expression measured in myoblast cells (on left), and the corresponding TRENDS fitted distributions (on right). Each point shows a sampled cell.

Furthermore, cell populations can exhibit enormous heterogeneity, particularly in developmental or in vivo settings (Trapnell et al. 2014, Buettner et al. 2015). A few high-expression cells often bias a population’s average expression, and transcript levels can vary 1,000-fold between seemingly equivalent cells (Geiler-Samerotte et al. 2013). By fitting a TRENDS model (which accounts for both batch effects and the full distribution of expression across cells) to each gene’s expression values, researchers can rank genes based on their presumed developmental relevance or employ hypothesis testing to determine whether observed temporal variation in expression is biologically relevant.

2. Related Work

To better motivate the ideas subsequently presented in this paper, we first describe why existing methods are not suited for scRNA-seq time course experiments and similar ordered-batched data lacking longitudinal measurements. As an alternative to time-series techniques, regression models might be applied in this setting, such as the Tobit generalized linear model of Trapnell et al. (2014). However, these models rely on linearity/smoothness assumptions, which can be inappropriate for sporadic processes such as development. More importantly, classic regression models scalar values such as conditional expectations, for which results must be interpreted as the effects in a hypothetical “average cell”.

Rather than focusing only on (conditional) expectations or a few quantiles, it is often more appropriate to model the full (conditional) distribution of values in a heterogeneous population (Geiler-Samerotte et al. 2013, Buettner et al. 2015). Let P_ℓ denote the underlying distribution of the observations from covariate-level ℓ. An omnibus test for distribution-equality (H0 : P_1 = ⋯ = P_L vs. the alternative that they are not all identical, cf. the Kolmogorov-Smirnov method described in §S3) can capture arbitrary changes, but fails to reflect sequential dynamics. Significance tests also do not quantify the size of effects, only the evidence for their existence. Krishnaswamy et al. (2014) have proposed a mutual-information based measure (DREMI) to quantify effects, which could be applied to our setting. However, under systematic noise caused by batch effects, measures of general statistical dependence between the batch-values and label (e.g. mutual information or hypothesis testing) become highly susceptible to the spurious variation present in the observed distributions (resulting in false positives). We thus prefer borrowing strength in the sense that a consistent change in distribution should ideally be observed across multiple time points for an effect to be deemed significant.

Instead of these general approaches, we model the P_ℓ as conditional distributions Pr(X | ℓ) which follow some assumed structure as ℓ increases. Work in this vein has focused on modeling only a few particular quantiles of interest (Bondell et al. 2010) or accurate estimation of the conditional distributions using smooth nonparametric regression techniques (Fan et al. 1996, Hall et al. 1999). While such estimators possess nice theoretical properties and good predictive-power, the relationships they describe may be opaque, and it is unclear how to quantify the covariate’s effect on the entire distribution. Note that in the case of classic regression, interpretable linear methods remain favored for measuring effects throughout the sciences, despite the availability of flexible nonlinear function families. Our TRENDS framework retains this interpretability while modeling effects across full distributions.

Change-point analysis can also be applied to sequences of distributions, but is designed for detecting the precise locations of change-points over long intervals. However, scRNA-seq experiments only span a brief time-course (typically L ⩽ 10), and the primary analytic goal is rather to quantify how much a gene’s expression has changed in a biologically interesting manner. Many change-point methods require explicit parameterization of the types of distributions, an undesirable necessity given the irregular nature of scRNA-seq expression measurements (Kharchenko et al. 2014). Moreover, some development-related genes exhibit gradual rather than abrupt temporal changes in expression. Requiring few statistical assumptions, TRENDS models changes ordinally rather than only considering effects that are either smooth or instantaneous, and this method can therefore accurately quantify both abrupt and gradual effects.

3. Methods

Formally, TRENDS fits a regression model to an ordered sequence of distributions, or more broadly, sample pairs {(ℓ_i, P̂_i)}_{i=1}^{N} where each ℓ_i ∈ {1,…,L} is an ordinal-valued label associated with the ith batch, for which we have univariate empirical distribution P̂_i. Here, it is supposed that for each batch i: an (empirical) quantile function F̂_i^{-1} is estimated from n_i scalar observations {X_{i,s}}_{s=1}^{n_i} ~ P_i sampled from the underlying distribution P_i = Pr(X | ℓ_i), which may be contaminated by different batch effects for each i. We assume a fixed design where each level of the covariate 1,…,L is associated with at least one batch. In scRNA-seq data, P̂_i is the empirical distribution of one gene’s measured expression values over the cells captured in the same batch, and ℓ_i indicates the index of the time point at which the batch was sampled from the population for sequencing.

Unlike the supervised learning framework where one observes samples of X measured at different ℓ and the goal is to infer some property of P_ℓ := Pr(X | ℓ), in our setting, we can easily obtain P̂_i as an empirical estimate of Pr(X | ℓ_i). We thus neither seek to estimate the distributions P_1,…,P_L, nor test for inequality between them. Rather, the primary goal of TRENDS analysis is to infer how much of the variation in Pr(X | ℓ) across different ℓ may be attributed to changes in ℓ as opposed to the effects of other unmeasured confounding factors. To quantify this variation, we introduce conditional effect-distributions Q_ℓ for which the sequence of transformations Q_1 → Q_2 → ⋯ → Q_L entirely captures the effects of ℓ-progression on Pr(X | ℓ), under the assumption that these underlying forces follow a trend (defined in §5). We emphasize that the Q_ℓ themselves are not our primary inferential interest; rather, it is the variation in these conditional effect-distributions that we attribute to increasing ℓ rather than batch effects.

Thus, the Q_ℓ are not estimators of the sequence of P_{ℓ_i}. Rather, the Q_ℓ represent the distributions one would expect to see in the absence of exogenous effects and random sampling variability, in the case where the underlying distributions only change due to ℓ-progression and we observe the entire population at each ℓ. Because we do not believe exogenous effects unrelated to ℓ-progression are likely to follow a trend over ℓ, we can identify the sequence of trending distributions which best models the variation in {P̂_i}_{i=1}^{N} and reasonably conclude that changes in this sequence reflect the ℓ-progression-related forces affecting P_ℓ.

4. Wasserstein Distance

TRENDS employs the Wasserstein distance to measure divergence between distributions. Intuitively interpreted as the minimal amount of “work” that must be done to transform one distribution into the other, this metric has been successfully applied in many domains (Levina & Bickel 2001, Mueller & Jaakkola 2015). The Wasserstein distance is a natural dissimilarity measure of populations because it accounts for the proportion of individuals that are different as well as how different these individuals are. For univariate distributions, the Lq Wasserstein distance is simply the Lq distance between quantile functions given by:

d_{L^q}(P, Q) = \left( \int_0^1 \left| F^{-1}(p) - G^{-1}(p) \right|^q \, dp \right)^{1/q} \qquad (1)

where F, G are the CDFs of P, Q and F^{-1}, G^{-1} are the corresponding quantile functions. Slightly abusing notation, we use d_{L^q}(·,·) to denote both the Wasserstein distance between distributions and the corresponding quantile functions’ L^q-distance (both q = 1, 2 are used in this work). In addition to being easy to compute (in 1-D), the L2 Wasserstein metric is equipped with a natural space of quantile functions, in which the Fréchet mean takes the simple form stated in Lemma 1. Calling this average the Wasserstein mean, we note its implicit use in the popular quantile normalization technique (Bolstad et al. 2003).
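Since the univariate Wasserstein distance reduces to an L^q distance between quantile functions, it can be computed directly from empirical quantiles. The following is a minimal sketch (our own illustrative code, not the paper's implementation; the grid size is an arbitrary choice) that evaluates (1) on a uniform grid over (0,1):

```python
import numpy as np

def wasserstein_1d(x, y, q=2, grid_size=99):
    """Empirical L^q Wasserstein distance between two univariate samples,
    computed as the L^q distance between their quantile functions (Eq. 1)."""
    ps = np.arange(1, grid_size + 1) / (grid_size + 1)  # interior grid on (0,1)
    Finv = np.quantile(x, ps)  # empirical quantile function of P
    Ginv = np.quantile(y, ps)  # empirical quantile function of Q
    return (np.mean(np.abs(Finv - Ginv) ** q)) ** (1.0 / q)
```

As a sanity check, translating a sample by a constant c yields a distance of c for any q, since all quantiles shift by exactly c.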

Lemma 1. Let Q denote the space of all quantile functions. The Wasserstein mean is the Fréchet mean in Q under the L2 norm:

\bar{F}^{-1} := \frac{1}{N} \sum_{i=1}^{N} F_i^{-1} = \operatorname*{argmin}_{G^{-1} \in \mathcal{Q}} \left\{ \sum_{i=1}^{N} \int_0^1 \left( F_i^{-1}(p) - G^{-1}(p) \right)^2 dp \right\} \qquad (2)
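Lemma 1 suggests a simple recipe: average the empirical quantile functions pointwise over a grid. A hedged sketch (our own illustrative code, with an arbitrary grid size):

```python
import numpy as np

def wasserstein_mean(samples, grid_size=99):
    """Frechet mean under the L2 Wasserstein metric (Lemma 1): the
    pointwise average of the empirical quantile functions."""
    ps = np.arange(1, grid_size + 1) / (grid_size + 1)  # grid on (0,1)
    quantiles = np.stack([np.quantile(x, ps) for x in samples])  # N x grid
    return ps, quantiles.mean(axis=0)  # quantile function of the mean
```

For two translated copies of the same sample, the Wasserstein mean is the sample translated by the average shift, illustrating how this mean preserves distributional shape (unlike, e.g., mixing the samples).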

5. Characterizing trends in distributions

Definition 1. Let F_ℓ^{-1}(p) denote the pth quantile of distribution P_ℓ with CDF F_ℓ. A sequence of distributions P_1,…,P_L follows a trend if:

  1. For any p ∈ (0,1), the sequence [F_1^{-1}(p), …, F_L^{-1}(p)] is monotonic.

  2. There exists p* ∈ [0,1) and two intervals A, B that partition the unit interval at p* (one of A or B equals (0, p*) and the other equals [p*, 1)) such that: for all p ∈ A, the sequences [F_1^{-1}(p), …, F_L^{-1}(p)] are all nonincreasing, and for all p ∈ B, the sequences [F_1^{-1}(p), …, F_L^{-1}(p)] are all nondecreasing. Note that if p* = 0, then all quantiles must change in the same direction as ℓ grows.

Our formal definition of a trend applies to distributions which evolve in a consistent fashion, ensuring that the temporal-forces that drive the transformation from P1 to PL do so without reversing their effects or leading to wildly different distributions at intermediate values. While the second condition of our definition technically subsumes the first, Condition 1 contains our key idea and is therefore separated from Condition 2, a subtler additional assumption that does not significantly alter results in practice. Note that the trend definition employed in this paper is intended for relatively short sequences and does not include cyclic/seasonal patterns studied in time-series modeling.
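As an illustration of Definition 1, both conditions can be checked numerically on quantile functions evaluated over a finite grid. The sketch below is our own simplification (a finite grid can only approximate the "for all p ∈ (0,1)" requirement, and flat quantile sequences are allowed to sit in either interval of the partition):

```python
import numpy as np

def follows_trend(Finv):
    """Check Definition 1 on a grid: Finv[l, j] is the j-th grid quantile
    of distribution P_{l+1}. Condition 1: every column is monotone.
    Condition 2: the columns' directions split the grid into at most two
    contiguous blocks, one nonincreasing and one nondecreasing."""
    diffs = np.diff(Finv, axis=0)                # (L-1) x grid differences
    nondec = np.all(diffs >= 0, axis=0)          # column is nondecreasing
    noninc = np.all(diffs <= 0, axis=0)          # column is nonincreasing
    if not np.all(nondec | noninc):              # Condition 1 violated
        return False
    # Condition 2: strict directions over p must flip at most once
    signs = np.where(nondec & ~noninc, 1, np.where(noninc & ~nondec, -1, 0))
    signs = signs[signs != 0]                    # flat columns fit either block
    return bool(np.all(np.diff(signs) >= 0) or np.all(np.diff(signs) <= 0))
```

For example, a uniform location shift or a symmetric variance increase passes both conditions, while a quantile that rises then falls (or three alternating direction blocks) fails.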

Lemma 2. If distributions P1,…,PL follow a trend, then

d_{L^1}(P_i, P_j) = \sum_{\ell = i+1}^{j} d_{L^1}(P_{\ell-1}, P_\ell) \quad \text{for all } i < j \in \{1, \dots, L\}

Measuring how much the distributions are perturbed between each pair of levels via the L1 Wasserstein metric, Lemma 2 shows the trend criterion to be an instance of Occam’s razor, where the underlying effects of interest are assumed to transform the distribution sequence in the simplest possible manner (recall that the Wasserstein distance is interpreted as the minimal work required for a given transformation). If one views the underlying effects of interest as a literal force acting in the space of distributions, Lemma 2 implies that this force points in the same direction for every ℓ (i.e. P_1, …, P_L lie along a line in the L1 Wasserstein metric space of distributions). A trend is more flexible than a linear restriction in the standard sense, because the magnitude of the force (how far along the line the distributions move) can vary over ℓ. Thus, we have formally extended the colloquial definition of a trend (“a general direction in which something is developing or changing”) to probability distributions.
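Lemma 2 can be verified numerically on a toy trending sequence, e.g. a location shift whose magnitude varies over ℓ but whose direction does not (the grid and shift values below are our own illustrative choices):

```python
import numpy as np

# Quantile functions (on a grid over (0,1)) of a trending sequence:
# the identity quantile function translated by a monotone shift.
ps = np.arange(1, 100) / 100.0
shifts = [0.0, 0.5, 2.0, 2.5]          # monotone over l: a trend
Finvs = [ps + s for s in shifts]       # F_l^{-1} = identity + shift

def d_l1(f, g):
    """L1 Wasserstein distance via Eq. (1) on the quantile grid."""
    return np.mean(np.abs(f - g))

direct = d_l1(Finvs[0], Finvs[-1])
stepwise = sum(d_l1(Finvs[l - 1], Finvs[l]) for l in range(1, len(Finvs)))
# Lemma 2: the direct distance equals the sum over adjacent levels,
# even though the per-step magnitudes (0.5, 1.5, 0.5) are unequal.
```

Here both quantities equal 2.5: the distributions lie on a line in Wasserstein space, with nonuniform spacing along it.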

To further conceptualize the trend idea, one can view quantiles as different segments of a population whose values are distributed according to Pr(X | ℓ) (e.g. for wealth-distributions, it has become popular to highlight the “one percent”). From this perspective, it is reasonable to assume that while the force of sequential progression may have different effects on the groups of individuals corresponding to different segments of the population, its effects on a single segment should be consistent over the sequence. If some segment’s values initially change in one way at lower levels of ℓ and subsequently revert in the opposite direction over larger ℓ (i.e. this quantile is non-monotone), it is natural to conclude there are actually multiple different progression-related forces affecting this homogeneous group of individuals. It is therefore natural to assume a trend if we only wish to measure the effects of a single primary underlying force. Often in settings such as scRNA-seq developmental experiments, the researcher has a priori interest in a specific effect (such as how each gene contributes to a specific stage of the developmental process). Therefore, data are collected over a short ℓ-range such that the primary effects of interest should follow a trend.

The second condition in the trend definition specifies that adjacent quantiles must move in the same direction over ℓ except at no more than a single p*. This restricts the number of population-segments which can increase over ℓ when a nearby segment of the population is decreasing. Intuitively, Condition 2 forces us to borrow strength across adjacent quantiles when estimating effects that follow a trend. The main effect of the additional restriction imposed by this condition is to prevent a trend from completely capturing extremely segmented effects (such as the example depicted in Figure 3C). However, applications involving such complex phenomena are uncommon (it is difficult to imagine a setting where the primary effects-of-interest push more than two adjacent segments of a population in different directions), and such nuanced changes can be reasonably attributed to spurious nuisance variation. We note that a trend can still roughly approximate the major overall effects even when the actual distribution-evolution violates Condition 2 (as seen in Figure 3C). In practice, the results of our method are not significantly affected by this second restriction, but it provides nice theoretical properties ensuring our estimation procedure (presented in §8) efficiently finds a globally optimal solution, as well as additional robustness against spurious quantile-variation in the data (possibly due to estimation-error given limited samples per batch).

Figure 3:

Violin plots depicting sequences of distributions which do not follow a trend (Observed Distributions in lefthand panels). Shown to the right of each example are the corresponding fitted distributions estimated by TRENDS (with the TRENDS R2 value).

Figure 2 depicts simple examples of trending distribution-sequences. In each example, it is visually intuitive that the evolution of the distributions proceeds in a single consistent fashion. To highlight the broad spectrum of interesting effects TRENDS can detect, we present three conceptual examples in §S1 of distribution-sequences that follow a trend, which includes consistent changes in location/scale and the growth/disappearance of modes. Despite imposing conditions on every quantile, the trend criterion does not require: explicit parameterization of the distributions, specification of a precise functional form of the -effects, or reliance on a smooth or constant amount of change between different levels. This generality is desirable for modeling developmental gene expression and other enigmatic phenomena where stronger assumptions may be untenable.

Figure 2:

Violin plots depicting four different sequences of distributions which follow a trend. The pth rectangle in the color bar on the righthand side indicates the monotonicity of the pth quantile over the sequence of distributions (for p = 0.01, 0.02,…, 0.99).

The lefthand panels of Figure 3 depict three examples of sequences which do not follow a trend for different reasons. To the right of each example, we show the “best-fitting” sequence that does follow a trend (formally defined in (4)), each distribution of which corresponds to our estimate of Q_ℓ (introduced in §3). We reiterate that the Q_ℓ are not by themselves of interest, but are merely used to quantify the sequential-progression effects (as will be described in §7). Nonetheless, the visual depiction of the trending Q_ℓ provides insight regarding what sort of changes a trend can accurately approximate. Whereas the evolution of the (trending) fitted distributions in Figure 3A (on right) can intuitively be attributed to one consistent force, multiple are required to explain the variation in the original non-trending sequence of distributions on the left. Identifying a single consistent effect responsible for the changes in the left panel of Figure 3B is far more plausible, and we note that these distributions in fact are much closer to following a trend (while hard to visually discern, the 0.04th – 0.16th quantiles of the observed distribution sequence increase from ℓ = 1 to 2 and decrease slightly from ℓ = 2 to 3, thus violating a trend).

During specific stages of development, changes in the observed cellular gene-expression distributions generally stem from the emergence/disappearance of different cell subtypes (plus batch and random sampling effects). Clear subtype distinctions may not exist in early stages where cells remain undifferentiated, and thus not only are the relative proportions of different subtypes changing, but the subtypes themselves may transform as well. Therefore, developmental genes’ underlying expression patterns are likely described by Examples 2 and 3 (of specific conceptual types of trends) in §S1. The trend criterion fits our a priori knowledge well, while remaining flexible with respect to the precise nature of expression changes.

6. TRENDS regression model

Recall that in our setting, even the underlying batch distributions Pi (from which the observations Xi,s are sampled) may be contaminated by latent confounding effects. We assume the quantile functions of each Pi are generated from the model below:

F_i^{-1} = G_{\ell_i}^{-1} + \varepsilon_i \quad \text{such that } G_1^{-1}, \dots, G_L^{-1} \text{ follow a trend, and the following hold:} \qquad (3)

(A.1) Ɛ_i : (0,1) → ℝ is constrained so that G_{ℓ_i}^{-1} and F_i^{-1} are valid quantile functions.

(A.2) For all p ∈ (0,1) and i: Ɛ_i(p) follows a sub-Gaussian(σ) distribution (Honorio & Jaakkola 2014), so E[Ɛ_i(p)] = 0 and Pr(|Ɛ_i(p)| > t) ⩽ 2 exp(−t²/(2σ²)) for any t > 0.

(A.3) For all p ∈ (0,1) and i ≠ j: Ɛ_i(p) is statistically independent of Ɛ_j(p).

In this model, G_ℓ^{-1} is the quantile function of the conditional effect-distribution Q_ℓ, whose evolution captures the underlying effects of level-progression. The random noise functions Ɛ_i : (0,1) → ℝ can represent measurement-noise or the effects of other unobserved variables which contaminate a batch. Note that the form of Ɛ_i is implicitly constrained to ensure all F_i^{-1}, G_ℓ^{-1} are valid quantile functions. Because Ɛ_i(p_1) and Ɛ_i(p_2) are allowed to be dependent for p_1 ≠ p_2, the effect of one Ɛ_i may manifest itself in multiple observations X_{i,s}, even if these observations are drawn i.i.d. from P_i (for example, a batch effect can cause all of the observed values from a batch to be under-measured). In fact, condition (A.1) encourages significant dependence between the noise at different quantiles for the same batch. The assumption of sub-Gaussian noise is quite general, encompassing cases in which the Ɛ_i(p) are either: Gaussian, bounded, of strictly log-concave density, or any finite mixture of sub-Gaussian variables (Honorio & Jaakkola 2014). Although condition (A.3) stringently ensures all dependence between observations from different ℓ arises due to the trend, similar independence assumptions are required in general regression settings where one cannot reasonably a priori specify a functional form of dependence in the noise. Real batch effects are likely to satisfy (A.3) since they typically have the same chance of affecting any given batch in a certain manner (because the same experimental procedure is repeated across batches, as in the case of the cell-capture and library preparation in scRNA-seq). Nonetheless, we note that assumption (A.2) can be immediately generalized (with trivial changes to our proofs) to allow heteroscedasticity in the batch effects Ɛ_i (endowing each batch with its own sub-Gaussian parameter σ_i), but we opt for simplicity in this theoretical exposition.

Model (3) is a distribution-valued analog of the usual regression model, which assumes scalars Y_i = f(X_i) + ε_i where ε_i ~ sub-Gaussian(σ) and ε_i is independent of ε_j for i ≠ j. In (3), an analogous f maps each ordinal level ℓ ∈ {1,…, L} to a quantile function, f(ℓ_i) = G_{ℓ_i}^{-1}, and the class of functions is restricted to those which follow a trend. Our assumption of mean-zero Ɛ_i that are independent between batches is a straightforward extension of the scalar error-model to the batch-setting, and ensures that the exogenous noise is unrelated to ℓ-progression under (3). Just as the Y_1,…,Y_N are rarely expected to exactly lie on the curve f(x) in the classic scalar-response model, we do not presume that the observed distributions P̂_i will exactly follow a trend (even as n_i → ∞ for all i so that P̂_i → P_i). Rather, our model simply encodes the assumption that the effects of level-progression on the distributions should be consistent over different ℓ (i.e. the effects follow a trend).

For each ℓ, TRENDS finds a fitted distribution Q̂_ℓ using the Wasserstein least-squares fit which minimizes the following objective:

\hat{Q}_1, \dots, \hat{Q}_L = \operatorname*{argmin}_{Q_1, \dots, Q_L} \left\{ \sum_{\ell=1}^{L} \sum_{i \in I_\ell} d_{L^2}(Q_\ell, \hat{P}_i)^2 \right\} \quad \text{where } Q_1, \dots, Q_L \text{ follow a trend} \qquad (4)

where I_ℓ is the set of batch-indices i such that ℓ_i = ℓ, and we require N_ℓ := |I_ℓ| ⩾ 1 for all ℓ ∈ {1,…,L}. Subsequently, one can inspect changes in the Q̂_ℓ, which should reflect the transformations in the underlying P_ℓ that are likely caused by increasing ℓ. Figure 3 shows some examples of fitted distributions produced by TRENDS regression. The objective in (4) bears great similarity to the usual least-squares loss used in scalar regression, the only differences being: scalars have been replaced by distributions, squared Euclidean distances are now squared Wasserstein distances, and the class of regression functions is defined by a trend rather than linearity/smoothness criteria.
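To make the objective (4) concrete, the sketch below fits each grid quantile's sequence over ℓ by isotonic regression (pool adjacent violators), trying both monotone directions and keeping the lower squared error. This is only a simplified illustration of Condition 1 with one batch per level: the paper's actual alternating-projections algorithm (§8) additionally enforces Condition 2 and the validity of the fitted quantile functions, both of which this sketch omits.

```python
import numpy as np

def pava(y):
    """Pool adjacent violators: least-squares nondecreasing fit to y."""
    vals, wts, sizes = [], [], []
    for yi in np.asarray(y, float):
        vals.append(yi); wts.append(1.0); sizes.append(1)
        # merge blocks while the nondecreasing constraint is violated
        while len(vals) > 1 and vals[-2] > vals[-1]:
            wtot = wts[-2] + wts[-1]
            vals[-2] = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / wtot
            wts[-2] = wtot; sizes[-2] += sizes[-1]
            vals.pop(); wts.pop(); sizes.pop()
    return np.repeat(vals, sizes)

def fit_trend_quantilewise(Finv):
    """Simplified Wasserstein least-squares fit in the spirit of Eq. (4),
    enforcing Condition 1 only: each grid quantile's sequence over l is
    replaced by its best isotonic fit in either direction."""
    Qhat = np.empty_like(np.asarray(Finv, float))
    for j in range(Qhat.shape[1]):
        up = pava(Finv[:, j])                  # nondecreasing fit
        down = -pava(-np.asarray(Finv[:, j]))  # nonincreasing fit
        if np.sum((up - Finv[:, j]) ** 2) <= np.sum((down - Finv[:, j]) ** 2):
            Qhat[:, j] = up
        else:
            Qhat[:, j] = down
    return Qhat
```

For example, the non-monotone quantile sequence 0, 2, 1 is projected to the nondecreasing sequence 0, 1.5, 1.5, pooling the two violating levels.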

Expression measurements in scRNA-seq are distorted by significant batch effects, so the Ɛi are likely to be large. In addition to technical artifacts, Buettner et al. (2015) find biological sources of noise due to processes such as transcriptional bursting and cell-cycle modulation of expression. Unlike development-driven changes in the underlying expression of a developmental gene, other biological/technical sources of variation are unlikely to follow any sort of trend. TRENDS thus provides a tool for modeling full distributions, while remaining robust to the undesirable variation rampant in these applications by leveraging independence of the noise between different batches of simultaneously captured and sequenced cells.

7. Measuring fit, effect size, and statistical significance

Analogous to the coefficient of determination used in classic regression, we define the Wasserstein R² to measure how much of the variation in the observed distributions P̂_1,…,P̂_N is captured by the TRENDS model’s fitted distributions Q̂_1,…,Q̂_L:

R^2 := 1 - \left( \frac{1}{N} \sum_{i=1}^{N} d_{L^2}(\hat{Q}_{\ell_i}, \hat{P}_i)^2 \right) \Big/ \left( \frac{1}{N} \sum_{i=1}^{N} d_{L^2}(\hat{P}_i, \bar{F}^{-1})^2 \right) \in [0, 1] \qquad (5)

Here, squared distances between scalars in the classic R² are replaced by squared Wasserstein distances between distributions, and the quantile function F̄^{-1} = (1/N) Σ_{i=1}^{N} F̂_i^{-1} is the Wasserstein mean of all observed distributions. By Lemma 1, the numerator and denominator in (5) are respectively analogous to the residuals and the overall variance from usual scalar regression models.
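Given observed and fitted quantile functions on a common grid, (5) amounts to one minus a ratio of mean squared L2 distances. A minimal sketch (our own interface, not the paper's code; `labels[i]` holds each batch's 0-indexed level):

```python
import numpy as np

def wasserstein_r2(Fhat_inv, Qhat_inv, labels):
    """Wasserstein R^2 of Eq. (5) on a quantile grid. Fhat_inv[i] is the
    observed quantile function of batch i; Qhat_inv[l] is the fitted
    quantile function for level l; labels[i] in {0,...,L-1}."""
    Fbar = Fhat_inv.mean(axis=0)  # Wasserstein mean (Lemma 1)
    resid = np.mean([np.mean((Qhat_inv[l] - f) ** 2)
                     for f, l in zip(Fhat_inv, labels)])
    total = np.mean([np.mean((f - Fbar) ** 2) for f in Fhat_inv])
    return 1.0 - resid / total
```

A perfect fit gives R² = 1, while fitting every level with the Wasserstein mean itself gives R² = 0, mirroring the behavior of the scalar coefficient of determination.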

In classic linear regression, the regression line slope is interpreted as the expected change in the response resulting from a one-unit increase in the covariate. While TRENDS operates on unit-less covariates, we can instead measure the overall expected Wasserstein-change under model (3) in the P̂i over the full ordinal progression ℓ = 1,…,L using:

$\Delta := \frac{1}{L}\, d_{L^1}\big(\hat{Q}_1, \hat{Q}_L\big)$ (6)

The L1 Wasserstein distance is a natural choice, since by Lemma 2, it measures the aggregate difference over each pair of adjacent levels (just as the difference between the largest and smallest fitted-values in linear regression may be decomposed in terms of covariate units to obtain the regression-line slope). Thus, Δ measures the raw magnitude of the inferred trend-effect (depends on the scale of X), while R2 quantifies how well the trend-effect explains the variation in the observed distributions (independently of scaling). Note that if the TRENDS model is fit to the distributions from the example in Figure 3B, the TRENDS-inferred effect of sequential-progression is nearly as large as the overall variation in this sequence, which agrees with our visual intuition that the observed distributions already evolve in a fairly consistent fashion.
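Given the fitted quantile vectors for the first and last levels, Δ is a one-line computation; the following sketch (our own illustrative names) uses the same quantile-grid representation as above:

```python
import numpy as np

def trend_effect(q_first, q_last, dp, L):
    """Delta of equation (6): the L1-Wasserstein distance between the first
    and last fitted distributions (quantile vectors on a shared grid with
    quadrature weights dp), scaled by 1/L to give a per-level effect size."""
    d_l1 = float(np.sum(np.abs(q_last - q_first) * dp))
    return d_l1 / L
```

For instance, a distribution that shifts by one unit in total over L = 5 levels yields Δ = 0.2, the average per-level Wasserstein change.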

Finally, we introduce a test to assess statistical significance of the trend-effect. We compare the null hypothesis H0 : Q1 = Q2 = ⋯ = QL against the alternative that the Qℓ are not all equal and follow a trend. To obtain a p-value, we employ permutation testing on the ℓi-labels of our observed distributions P̂i with test-statistic R2 (Good 1994). More specifically, the null distribution is determined by repeatedly executing the following steps: (i) randomly shuffle the ℓi so that each P̂i is paired with a random value ℓiPerm ∈ {1,…,L}, (ii) fit the TRENDS model to the pairs {(ℓiPerm, P̂i)}i=1N to produce Q̂1Perm,…,Q̂LPerm, (iii) use these estimated distributions to compute R2Perm using (5). Due to the quantile-noise functions Ɛi(⋅) assumed in our model (3), H0 allows variation in our sampling distributions Pi which stems from non-trending forces. Thus the TRENDS test attempts to distinguish whether the effects transforming the Pi follow a trend or not, but does not presume the Pi will look identical under the null hypothesis. By measuring how much further the P̂i lie from one distribution vs. a sequence of trending distributions in Wasserstein-space, we note that our R2 resembles a likelihood-ratio-like test statistic between maximum-likelihood-like estimates F̄⁻¹ and Q̂ℓ (where we operate under the Wasserstein distance rather than the Kullback-Leibler divergence which underlies the maximum likelihood framework).
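Steps (i)-(iii) can be sketched generically as follows; `fit_trends` and `r2_stat` are placeholders standing in for the TF algorithm of §8 and equation (5), not functions from the paper's software:

```python
import numpy as np

def permutation_pvalue(levels, batches, fit_trends, r2_stat, n_perm=1000, seed=0):
    """Permutation test for the trend-effect: shuffle the level labels,
    refit the model, and recompute R^2 to build the null distribution."""
    rng = np.random.default_rng(seed)
    observed = r2_stat(fit_trends(levels, batches), batches)
    exceed = 0
    for _ in range(n_perm):
        perm_levels = rng.permutation(levels)        # step (i)
        fit = fit_trends(perm_levels, batches)       # step (ii)
        exceed += r2_stat(fit, batches) >= observed  # step (iii)
    # standard +1 correction so the permutation p-value is never exactly zero
    return (1 + exceed) / (1 + n_perm)
```

With only N labeled batches the number of distinct permutations is limited, which is exactly the granularity issue the smoothed procedure of §S2 addresses.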

As we do not parametrically treat the distributions, we find permutation testing more suitable than relying on asymptotic approximations. Unfortunately, N and L may be small, undesirably limiting the number of possible label-permutations. In §S2, we overcome the granularity problem that arises in such settings by developing a more intricate permutation procedure akin to the smoothed bootstrap of Silverman & Young (1987).

To determine whether our model is reasonable when working with real data, it is best to rely on prior domain knowledge regarding whether or not the effects of primary interest should follow a trend. When this fact remains uncertain, then (as in the case of classical regression) the question is not properly answered using just our Wasserstein R2 values (which we caution tend to be much larger than the familiar R2 values from linear regression, due to the heightened flexibility of our TRENDS model). §S6 demonstrates a simple method for model checking based on plotting empirically-estimated residual functions Ɛ̂i against the sequence-level ℓ. Similar plots of scalar residuals are the most common diagnostic employed in standard regression analysis. While this model-checking procedure is able to clearly delineate simulated deviations from our assumptions, it shows little indication that the TRENDS assumptions are inappropriate for the real scRNA-seq data from major known developmentally-relevant genes. Our simulation in §S6 also empirically demonstrates that despite its restrictive assumptions, the TRENDS model can provide estimates of severely-misspecified effects that are superior to the initial empirical distributions.

8. Fitting the TRENDS model

We propose the trend-fitting (TF) algorithm which finds distributions satisfying

$\hat{Q}_1,\dots,\hat{Q}_L = \operatorname{argmin}_{Q_1,\dots,Q_L} \left\{\sum_{\ell=1}^{L}\sum_{i\in I_\ell} w_i\, d_{L^2}\big(Q_\ell, \hat{P}_i\big)^2\right\}$ where $Q_1,\dots,Q_L$ follow a trend (7)

If the P̂i (the empirical per-batch distributions) are estimated from widely varying sample sizes ni for different batches i, then it is preferable to replace the objective in (4) with the weighted sum in (7). Given weights wi chosen based on ni and Nℓ, TRENDS can better model the variation in the empirical distributions that are likely more accurate due to larger sample size. As the ni and Nℓ are fairly homogeneous in scRNA-seq experiments, we use uniform weights here (but provide an algorithm for the general formulation). To fit TRENDS to data {(ℓi, P̂i, wi)}i=1N via our procedure, the user must first specify:

  • Numerical quadrature points 0 < p1 < p2 < ⋯ < pP−1 < 1 for evaluating the Wasserstein distance integral in (1), i.e. which P – 1 quantiles to use for each batch

  • A quantile estimator F̂⁻¹(p) for each empirical CDF F̂

Given these two specifications, the TF procedure solves a numerical approximation of the constrained distribution-valued optimization problem in (7). Defining p0 := 2p1 − p2 and pP := 2pP−1 − pP−2, we employ the following midpoint-approximation of the integral

$\min_{G_1^{-1},\dots,G_L^{-1}} \left\{\sum_{\ell=1}^{L}\sum_{i\in I_\ell} w_i \sum_{k=1}^{P-1}\Big(\hat{F}_i^{-1}(p_k) - G_\ell^{-1}(p_k)\Big)^2\left[\frac{p_{k+1}-p_{k-1}}{2}\right]\right\}$ where $G_1,\dots,G_L$ must follow a trend (8)

While this problem is underspecified between the pkth and pk+1th quantiles, all we require to numerically compute Wasserstein distances (and hence R2 or Δ) are the values of the quantile functions at p1,…,pP−1, which are uniquely determined by (8). Although our algorithm operates on a discrete set of quantiles like techniques for quantile regression (Bondell et al. 2010), this is only for practical numerical reasons; the goal of our TRENDS framework is to measure effects across an entire distribution. Throughout this work, we use P − 1 uniformly spaced quantiles between 1/P and (P−1)/P (with P = 100) to comprehensively capture the full distributions while ensuring computational efficiency. In settings with limited data per batch, one might alternatively select fewer quadrature points (quantiles), avoiding tail regions of the distributions for increased stability (our results were robust to the precise number of quadrature points employed).

Since no minimum-variance unbiased quantile estimator valid for all p ∈ (0,1) is known, we simply use the default setting in R’s quantile function, which provides the best approximation of the mode (Type 7 of Hyndman & Fan (1996)). Other quantile estimators perform similarly in our experiments, and Keen (2010) has found little practical difference between estimation procedures for sample sizes ⩾ 30. Here, we assume the ni cells sampled in the ith batch are i.i.d. samples (reasonable for cell-capture techniques). If this assumption is untenable in another domain, then the quantile-estimation should be accordingly adjusted (cf. Heidelberger & Lewis 1984).
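For reference, NumPy's default interpolation scheme coincides with R's default (Type 7 of Hyndman & Fan), so the per-batch quantile grid of this section can be formed as below; the standard-normal batch is an arbitrary illustration:

```python
import numpy as np

P = 100
grid = np.arange(1, P) / P  # the P - 1 uniformly spaced quantiles 1/P, ..., (P-1)/P

rng = np.random.default_rng(1)
batch = rng.normal(size=500)  # one batch of n = 500 i.i.d. observations

# np.quantile's default method ("linear") matches R's default, Type 7
q_hat = np.quantile(batch, grid)

assert q_hat.shape == (P - 1,)
assert np.all(np.diff(q_hat) >= 0)  # an estimated quantile function is nondecreasing
```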

Our procedure uses the Pool-Adjacent-Violators Algorithm (PAVA), which, given an input sequence y1,…,yL, finds the least-squares-fitting nondecreasing sequence in only O(L) runtime (de Leeuw 1977). The basic PAVA procedure is extended to weighted observations by performing weighted backaveraging in Step 3. When multiple (ℓi, yi) pairs are observed with identical covariate-levels, i.e. ∃ℓ s.t. Nℓ := |Iℓ| > 1 where Iℓ := {i : ℓi = ℓ}, we adopt the simple tertiary approach for handling predictor-ties (de Leeuw 1977). Here, one defines ȳℓ as the (weighted) average of the {yi : i ∈ Iℓ}, and for each level ℓ all yi : i ∈ Iℓ are simply replaced with their mean-value ȳℓ. Subsequently, PAVA is applied with non-uniform weights to {(ℓ, ȳℓ)}ℓ=1L where the ℓth point receives weight Nℓ (or weight Σi∈Iℓ wi if the original points are assigned non-uniform weights w1,…,wN). By substituting “nonincreasing” in place of “nondecreasing” in Steps 2 and 3, the basic PAVA method can be trivially modified to find the least-squares nonincreasing sequence. From here on, we use PAVA((y1,w1),…,(yN,wN); δ) to refer to a more general version of basic PAVA, which incorporates observation-weights wi (for multiple y values at a single ℓ), and a user-specified monotonicity condition δ ∈ {“nonincreasing”, “nondecreasing”} that determines which monotonic best-fitting sequence to find.
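The tie-handling step just described amounts to collapsing each level's observations to a weighted mean before calling PAVA; a small sketch (names are ours, purely illustrative):

```python
import numpy as np

def collapse_ties(levels, y, w=None):
    """Tertiary tie-handling: replace all observations sharing a covariate
    level with their weighted mean, carrying the summed weight forward to
    the subsequent weighted PAVA call."""
    levels = np.asarray(levels)
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    uniq = np.unique(levels)
    ybar = np.array([np.average(y[levels == l], weights=w[levels == l]) for l in uniq])
    wsum = np.array([w[levels == l].sum() for l in uniq])
    return uniq, ybar, wsum
```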

Basic PAVA Algorithm: minz Σℓ=1L (yℓ − zℓ)2 s.t. z1 ⩽ ⋯ ⩽ zL
Input:  A sequence of real numbers y1,…,yL
Output: The minimizing sequence ŷ1,…,ŷL, which is nondecreasing.
 1. Start with the first level ℓ = 1 and set the fitted value ŷ1 = y1
 2. While the next yℓ ⩾ ŷℓ−1, set ŷℓ = yℓ and increment ℓ
 3. If the next yℓ violates the nondecreasing condition, i.e. yℓ < ŷℓ−1, then backaverage to restore monotonicity: find the smallest integer k such that replacing ŷℓ−k,…,ŷℓ by their average restores the monotonicity of the sequence ŷ1,…,ŷℓ. Repeat Steps 2 and 3 until ℓ = L.
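A minimal Python rendition of this procedure, including the observation weights and direction flag δ described above (a sketch under our own naming, not the authors' implementation):

```python
def pava(y, w=None, direction="nondecreasing"):
    """Weighted Pool-Adjacent-Violators: least-squares monotone fit via the
    backaveraging of Step 3, maintaining a stack of pooled blocks."""
    if w is None:
        w = [1.0] * len(y)
    if direction == "nonincreasing":  # reduce to the nondecreasing case
        return [-v for v in pava([-v for v in y], w)]
    blocks = []  # each block: [weighted mean, total weight, length]
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # backaverage while the newest block violates monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    return [b[0] for b in blocks for _ in range(b[2])]
```

For example, `pava([1, 3, 2])` pools the violating pair (3, 2) into its average, giving `[1.0, 2.5, 2.5]`.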

Theorem 1. The Trend-Fitting algorithm produces valid quantile-functions Ĝ1⁻¹,…,ĜL⁻¹ which solve the numerical version of the TRENDS objective given in (8).

Fundamentally, our TF algorithm utilizes Dykstra’s method of alternating projections (Boyle & Dykstra 1986) to project between the set of L-length sequences of vectors which are monotone over ℓ in each index k and the set of L-length sequences of vectors where each vector represents a valid quantile function. Despite the iterative nature of alternating projections, we find that the TF algorithm converges extremely quickly in practice. This procedure has overall computational complexity O(TLP2 + NP), which is efficient when T (the total number of projections performed) is small, since both P and L are limited. The proof of Theorem 1 provides much intuition about the TF algorithm (all proofs are relegated to §S8). Essentially, once we fix a δ configuration (specifying which quantiles are decreasing over ℓ and which are increasing), our feasible set becomes the intersection of two convex sets between which projection is easy via PAVA. Furthermore, the second statement in our trend definition limits the number of possible δ configurations, so we simply solve one convex subproblem for each possible δ to find the global solution.

Trend-Fitting Algorithm: Numerically solves (7) by optimizing (8)
Input 1: Empirical distributions and associated levels (and optional weights) {(ℓi, F̂i, wi)}i=1N
Input 2: A grid of quantiles to work with 0 < p1 < ⋯ < pP−1 < 1
Output: The estimated quantiles of each Qℓ, {Ĝℓ⁻¹(pk) : k = 1,…,P−1} for ℓ ∈ {1,…,L}, from which these underlying trending distributions can be reconstructed.
 1. F̂i⁻¹(pk) := quantile(F̂i, pk) for each i ∈ {1,…,N}, k ∈ {1,…,P−1}
 2. wℓ* := Σi∈Iℓ wi for each ℓ ∈ {1,…,L}
 3. xℓ[k] := (1/wℓ*) Σi∈Iℓ wi F̂i⁻¹(pk) for each ℓ ∈ {1,…,L}, k ∈ {1,…,P−1}
 4. for p* = 0, p1, p2, …, pP−1:
 5.  δ[k] := “nondecreasing” if pk > p*; otherwise δ[k] := “nonincreasing”
 6.  y1,…,yL := AlternatingProjections(x1,…,xL; δ; {wℓ*}ℓ=1L, {pk}k=1P−1)
 7.  W[δ] := the value of (8) evaluated with Gℓ⁻¹(pk) = yℓ[k] ∀ℓ, k
 8.  Redefine δ[k] := “nonincreasing” if pk > p*; otherwise δ[k] := “nondecreasing”, and repeat Steps 6 and 7 with the new δ
 9. Identify the minimal W[δ] and return Gℓ⁻¹(pk) = yℓ*[k] ∀ℓ, k, where y* was produced at the Step 6 or 8 corresponding to δ* := argmin W[δ].
AlternatingProjections Algorithm: Finds the Wasserstein-least-squares sequence of vectors which represent valid quantile-functions and a trend whose monotonicity is specified by δ.
Input 1:  Initial sequence of vectors x1(0),…,xL(0)
Input 2:  Vector δ whose indices specify directions constraining the quantile-changes over ℓ.
Input 3:  Weights wℓ* and quantiles to work with 0 < p1 < ⋯ < pP−1 < 1
Output:  Sequence of vectors y1(t),…,yL(t) where ∀ℓ, k: yℓ(t)[k] ⩽ yℓ(t)[k+1], and the sequence y1(t)[k],…,yL(t)[k] is monotone nonincreasing/nondecreasing as specified by δ[k], provided that xℓ(0)[k] ⩽ xℓ(0)[k+1] for each ℓ, k
 1. rℓ(0)[k] := 0, sℓ(0)[k] := 0 for each ℓ ∈ {1,…,L}, k ∈ {1,…,P−1}
 2. for t = 0, 1, 2, … until convergence:
 3.   y1(t)[k],…,yL(t)[k] := PAVA((x1(t)[k] + r1(t)[k], w1*),…,(xL(t)[k] + rL(t)[k], wL*); δ[k]) for each k ∈ {1,…,P−1}. PAVA computes either the least-squares nondecreasing or nonincreasing weighted fit, depending on δ[k].
 4.   rℓ(t+1)[k] := xℓ(t)[k] + rℓ(t)[k] − yℓ(t)[k] for each ℓ, k
 5.   ∀ℓ ∈ {1,…,L}: xℓ(t+1)[1],…,xℓ(t+1)[P−1] := PAVA((yℓ(t)[1] + sℓ(t)[1], (p2 − p0)/2),…,(yℓ(t)[P−1] + sℓ(t)[P−1], (pP − pP−2)/2); “nondecreasing”)
 6.   sℓ(t+1)[k] := yℓ(t)[k] + sℓ(t)[k] − xℓ(t+1)[k] for each ℓ, k
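To make Steps 1-6 concrete, here is a compact sketch of the alternating-projections loop for a fixed δ, taken as "nondecreasing" in every quantile index for simplicity (variable and function names are ours, and the fixed iteration count stands in for a proper convergence check):

```python
import numpy as np

def pava_nondecr(y, w):
    """Weighted least-squares nondecreasing fit (basic PAVA)."""
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    return np.repeat([b[0] for b in blocks], [b[2] for b in blocks])

def dykstra_trend(x, level_w, dp, n_iter=500):
    """Dykstra's alternating projections between (a) sequences monotone over
    levels l in every quantile index k and (b) valid (nondecreasing) quantile
    vectors, keeping the correction terms r and s of Steps 4 and 6."""
    x = np.array(x, dtype=float)
    L, K = x.shape
    r = np.zeros_like(x)
    s = np.zeros_like(x)
    for _ in range(n_iter):
        z = x + r                                   # Step 3: project each column over l
        y = np.column_stack([pava_nondecr(z[:, k], level_w) for k in range(K)])
        r = z - y                                   # Step 4
        z = y + s                                   # Step 5: project each row over k
        x = np.vstack([pava_nondecr(z[l], dp) for l in range(L)])
        s = z - x                                   # Step 6
    return x
```

On a small example the output satisfies both constraint sets: every row is a valid (nondecreasing) quantile vector and every column is nondecreasing over levels.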

9. Theoretical results

Under the model given in (3), we establish some results regarding the quality of the Q̂1,…,Q̂L estimates produced by the TF algorithm. To develop pragmatic theory, we use finite-sample bounds defined in terms of quantities encountered in practice rather than the true Wasserstein distance (1), which relies on an integral that must be numerically approximated. Thus, in this section, dW(·,·) is used to refer to the midpoint-approximation of the L2 Wasserstein integral illustrated in (8). In addition to the conditions of model (3), we make the following simplifications throughout for ease of exposition:

(A.4) The number of batches at each level is the same, i.e. N := N1 = ⋯ = NL ⩾ 1

(A.5) The same number of samples are drawn per batch, i.e. n := ni for all 1 ⩽ i ⩽ N

(A.6) For k = 1,…,P — 1: the (k/P)th quantiles of each distribution are considered

(A.7) Uniform weights are employed, i.e. in (7): wi = 1 for all i

Theorem 2. Under model (3) and additional conditions (A.4)–(A.7), suppose the TF algorithm is applied directly to the true quantiles of P1,…,PN. Then, given any ϵ > 0, the resulting estimates satisfy: dW(Ĝℓ⁻¹, Gℓ⁻¹) < ϵ for each ℓ ∈ {1,…,L}

with probability greater than: $1 - 2PL\exp\left(-\frac{\epsilon^2 N}{8\sigma^2 L}\right)$ (9)

Thus, Theorem 2 implies that our estimators are consistent with asymptotic rate OP(1/√N) if we directly observe the true per-batch quantiles of P1,…,PN (which are contaminated by Ɛi under our model). By using the union-bound, our proof does not require any independence assumptions for the noise introduced at different quantiles of the same batch. Because direct quantile-observation is unlikely in practice, we now examine the performance of TRENDS when these quantiles are instead estimated using n samples from each Pi. Here, we additionally assume:

(A.8) For i = 1,…,N : quantiles are estimated from n i.i.d. samples X1,i,…, Xn,i ~ Pi

(A.9) There is nonzero density at each of the quantiles we estimate, i.e. CDF Fi is strictly increasing around each Fi⁻¹(k/P) for k = 1,…,P − 1.

(A.10) The simple quantile estimator defined below is used for each k/P, k = 1,…,P – 1

$\hat{F}_i^{-1}(p) := \inf\{x : \hat{F}_i(x) \geq p\}$

where F̂i(⋅) is the empirical CDF computed from X1,i,…,Xn,i ~ Pi.

Theorem 3. Under the assumptions of Theorem 2 and (A.8)–(A.10), suppose the TF algorithm is applied to estimated quantiles F̂i⁻¹(k/P) for i = 1,…,N, k = 1,…,P−1. Then, given any ϵ > 0, the resulting estimates satisfy dW(Ĝℓ⁻¹, Gℓ⁻¹) < ϵ for each ℓ ∈ {1,…,L} with probability greater than:

$1 - 2PL\left[\exp\left(-\frac{\epsilon^2 N}{32\sigma^2 L}\right) + N\exp\left(-2nR\left(\frac{\epsilon}{4L}\right)^2\right)\right]$ (10)

where for γ > 0:

$R(\gamma) := \min\left\{R(\gamma, i, k/P) : i = 1,\dots,N,\; k = 1,\dots,P-1\right\}$
$R(\gamma, i, p) := \min\left\{F_i\big(F_i^{-1}(p)+\gamma\big) - p,\;\; p - F_i\big(F_i^{-1}(p)-\gamma\big)\right\}$ (11)

Theorem 3 is our most general result, applying to arbitrary distributions Pi that satisfy the basic condition (A.9). However, the resulting probability-bound may not converge toward 1 if nR(ϵ/4L)2 < O(log N), which occurs if few samples are available per batch (because then the Pi can be very poorly estimated). Thus, TRENDS is in general only designed for applications with large per-batch sample sizes. The bounds obtained under the extremely broad setting of Theorem 3 may be significantly improved by instead adopting one of the following stronger assumptions:

(A.11) The simple quantile-estimator defined in (A.10) is used, and the support of each Pi is bounded and connected with non-negligible density, i.e. ∃ constants B, c > 0 s.t. ∀i : fi(x) = 0 ∀x ∉ [−B, B] and fi(x) ⩾ c ∀x ∈ [−B, B] (fi is the density for CDF Fi).

(A.12) The following is known regarding the quantile-estimation procedure:

  1. The quantiles of each Pi are estimated independently of the others.

  2. The quantile-estimates converge at a sub-Gaussian rate for each quantile of interest, i.e. there exists c > 0 such that for each k, i and any ϵ > 0:
    Pr(|F̂i⁻¹(k/P) − Fi⁻¹(k/P)| > ϵ) ⩽ 2 exp(−2nc2ϵ2)

Theorem 4. Under the assumptions of Theorem 2, conditions (A.8), (A.9), and one of either (A.11) or (A.12), the bound in (10) may be sharpened to ensure that for any ϵ > 0:

dW(Ĝℓ⁻¹, Gℓ⁻¹) < ϵ for each ℓ ∈ {1,…,L}

with probability greater than:

$1 - 2P\left[L\exp\left(-\frac{\epsilon^2 N}{32\sigma^2 L}\right) + \exp\left(-\frac{c^2}{8}Nn\epsilon^2\right)\right]$ (12)

In Theorem 4, the additional assumption of bounded/connected underlying distributions results in a much better finite-sample bound that is exponential in both n and N (implying asymptotic OP(N−1/2 + n−1/2) convergence). While this condition and the result of Theorem 3 assume use of the simple quantile-estimator from (A.10), numerous superior procedures have been developed which can likely improve practical convergence rates (Zielinski 2006). Assuming guaranteed bounds for the quantile-estimation error (which may be based on both underlying properties of the Pi as well as the estimation procedure), one can also obtain the same exponential bound. In fact, condition (A.11) is an example of a distribution and quantile-estimator combination which achieves the error required by (A.12). Because the boundedness assumption is undesirably limiting, we also derive a similar result under weaker assumptions:

(A.13) Each Pi has connected support with non-negligible interior density and sub-Gaussian tails, i.e. there are constants B > b > 0, a > 0, c > 0 such that for all i :

  (1) Fi is strictly increasing,

  (2) fi(x) ⩾ c ∀x ∊ [–B, B], where fi is the density function of CDF Fi,

  (3) Pr(Xi > x) ⩽ exp(–a[x – (B – b)]2) if x > B, and Pr(Xi < x) ⩽ exp(–a[x – (–B + b)]2) if x < –B

(A.14) Defining r := min{2c2, (2ab2 − 1)/(4PB2)}, we have r > 0, or equivalently, 2ab2 > 1.

(A.15) We avoid estimating extreme quantiles, i.e. Fi⁻¹(k/P) ∈ (−B, B) ∀k ∈ {1,…,P−1}

Theorem 5. Under the assumptions of Theorems 2 and 3 as well as conditions (A.13)-(A.15), the previous bound in (10) may be sharpened to ensure that for all ϵ > 0:

dW(Ĝℓ⁻¹, Gℓ⁻¹) < ϵ for each ℓ ∈ {1,…,L}

with probability greater than:

$1 - 2P\left[L\exp\left(-\frac{\epsilon^2 N}{32\sigma^2 L}\right) + \exp\left(-\frac{r^2}{16}Nn\epsilon^2\right)\right]$ (13)

Theorem 5 again provides an exponential bound in both n and N under a realistic setting where the distributions are light-tailed with connected support, and the simple quantile estimator of (A.10) is applied at non-extreme quantiles. Note that while we specified properties of the distributions, noise, and quantile estimation in order to develop this theory, our nonparametric significance tests do not rely on these assumptions.

10. Simulation study

We perform a simulation which realistically reflects various properties of scRNA-seq data, based on assumptions similar to those explicitly relied upon by the model of Kharchenko et al. (2014). Samples are generated from one of the following choices of the underlying trending distribution sequence Q1,…,QL with L = 5 (additional details in §S4):

(S1) Qℓ ~ NB(r, p) with r = 5 and p = 0.3, 0.3, 0.4, 0.5, 0.8 for ℓ = 1,…,5.

(S2) Qℓ is a mixture of NB(r = 5, p = 0.3) and NB(r = 5, p = 0.7) components, with the mixing proportion of the latter ranging over λℓ = 0.1, 0.4, 0.8, 0.8, 0.8 for ℓ = 1,…,5.

(S3) Qℓ ~ NB(r = 5, p = 0.5) for ℓ = 1,…,5.

NB(r, p) denotes the negative binomial distribution parameterized by r (target number of successful trials) and p (probability of success in each trial). To capture various types of noise affecting scRNA-seq measurements (e.g. dropout, PCR amplification bias, transcriptional bursting), noise for the ith batch is introduced (independently of the other batches) via the following steps: rather than sampling from Qℓi, we instead sample from Pi ~ NB(r̃, p̃), where r̃ = r + rnoise and p̃ = p + pnoise. Here, pnoise and rnoise are independently drawn from centered Gaussian distributions with standard deviations σ and 10σ, respectively (σ thus controls the degree of noise). For the mixture-models in S2, we sample from a Pi which is also a mixture of negative binomials (with the same mixing proportions as Qℓi) where the parameters of both mixing components are perturbed by noise variables rnoise, pnoise. To the observations sampled from Pi, we finally apply a log10(x + 1) transform (also applied to the scRNA-seq data in §11) before proceeding with our analysis.
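The S1 noise mechanism can be sketched as follows. This is a paraphrase of the description above rather than the authors' code; in particular, the clipping of the perturbed parameters into valid ranges is our own addition:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_batch_s1(level, n=1000, sigma=0.1):
    """Draw one noisy batch under model S1: perturb the NB(r, p) parameters
    with batch-level Gaussian noise (sd 10*sigma for r, sigma for p), sample
    counts, then apply the log10(x + 1) transform used in the analysis."""
    p_by_level = [0.3, 0.3, 0.4, 0.5, 0.8]
    r, p = 5, p_by_level[level - 1]
    r_tilde = max(0.5, r + rng.normal(0.0, 10 * sigma))               # r + r_noise
    p_tilde = float(np.clip(p + rng.normal(0.0, sigma), 0.05, 0.95))  # p + p_noise
    counts = rng.negative_binomial(r_tilde, p_tilde, size=n)
    return np.log10(counts + 1.0)
```

Because the perturbation is drawn once per batch, the noise is shared within a batch but independent across batches, matching the batch-effect structure TRENDS is designed to be robust against.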

We first investigate the convergence of TRENDS estimates under each of the models S1, S2, and S3, varying n, N, and the amount of noise independently. Figure 4 shows the Wasserstein error (the sum over ℓ of the squared Wasserstein distances between the underlying Qℓ and our estimates thereof) of the TRENDS estimates vs. the error of the empirical distributions. The plot demonstrates rapid convergence of the TRENDS estimator (as guaranteed by our theory in §9) and shows that TRENDS can produce a much better picture of the underlying distributions than the (noisy) observed empirical distributions. As shown in Figure 4A, this may occur even in the absence of noise, thanks to the additional structure of the trend-assumption exploited by our estimator. Thus, when the underlying effects follow a trend, our Δ statistic provides a much more accurate measure of their magnitude than distances between the empirical distributions. These results indicate that the largest benefit of our TRENDS approach is for small to moderate sized samples.

Figure 4:


The Wasserstein error of the TRENDS fitted distributions vs. the observed empirical distributions, under models S1 - S3 with various settings of n, σ, and N. Depicted is the average error (and standard deviation) over 100 repetitions.

To compare performance, we evaluate TRENDS against alternative methods under our models S1–S3 with substantial batch-noise (σ = 0.1). Fixing Nℓ = 1, ni = 1000 for all ℓ, i, we generate 400 datasets from the different underlying trending models described above (100 from each of S1, S2, and 200 from S3). TRENDS is applied to each dataset to obtain a p-value (via the permutation procedure described in §S2). In this analysis, we also apply the following alternative methods (detailed in §S3): a linear variant of our TRENDS model (where quantiles are restricted to evolve linearly rather than monotonically), an omnibus-testing approach (using the maximal Kolmogorov-Smirnov (KS) statistic between any pair of distributions), and a measure of the (marginally-normalized) mutual information (MI) between ℓ and the values in each batch. The latter two alternative methods make no underlying trend-assumption and capture arbitrary variation in distributions over ℓ. We employ the same approach to ascertain statistical significance (at the 0.05 level) under each method. All p-values are obtained via permutation-testing (with 1000 permutations). To correct these p-values for multiple comparisons, we employ the step-down minP adjustment algorithm of Ge et al. (2003), which cleverly avoids double permutations to remain computationally efficient.

Table 1 demonstrates that methods sensitive to arbitrary differences in distributions are highly susceptible to spurious batch effects (both the KS and MI identify all 400 datasets as statistically significant), whereas our TRENDS method has the lowest false-positive rate, only incorrectly rejecting its null hypothesis for 4 out of the 200 datasets from S3. TRENDS also exhibits the greatest power in these experiments. To ascertain how well these methods distinguish the trending data from the non-trending samples, we computed area under the ROC curve (AUROC) by generating ROC curves for each method using its p-values (ties broken using test statistics) as a classification-rule for determining which simulated datasets the method would correctly distinguish from constant model S3 at each possible cutoff value. The results of Table 1 show that TRENDS is superior at drawing this distinction in these simulations.

Table 1:

False-positive rate (FPR) and true-positive rate (TPR) produced by different methods, as well as AUROC values. FPR is determined by the fraction of datasets generated under model S3 deemed statistically significant (or S1, S2 for TPR).

Method FPR TPR AUROC
TRENDS 0.02 0.35 0.87
Linear-TRENDS 0.03 0.32 0.85
KS 1.0 1.0 0.44
MI 1.0 1.0 0.53

11. Single cell RNA-seq analysis

To evaluate the practical utility of our method, we analyze two scRNA-seq time course experiments and compare TRENDS against the alternative approaches described in §S3. The first dataset is from Trapnell et al. (2014) who profiled single-cell transcriptome dynamics of skeletal myoblast cells at 4 time-points during differentiation (myoblasts are embryonic progenitor cells which undergo myogenesis to become muscle cells). In a second larger-scale scRNA-seq experiment, Zeisel et al. (2015) isolated 1,691 cells from the somatosensory cortex (the brain’s sensory system) of juvenile CD1 mice aged P22-P32. We treat age (in postnatal days) as our batch-labels, with L = 10 possible levels. §S5 contains detailed descriptions of the data and our analysis.

Assuming that trending temporal-progression effects on expression reflect each gene’s importance in development, we measure the size of these effects using our Δ statistic (6). Fitting a separate TRENDS model to each gene’s measurements, we thus produce a ranking of the genes’ presumed developmental importance. If instead, one’s goal is simply to pinpoint high-confidence candidate genes relevant at all in development (ignoring the degree to which their expression transforms in the developmental progression), then our permutation test can be applied to establish which genes exhibit strong statistical evidence of an underlying nonconstant TREND effect. For all methods, p-values are obtained using the same procedure as in the simulation study (1000 permutations with step-down minP multiple-testing correction). In these analyses, significance testing (which identifies high-confidence effects) and the Δ statistic (which identifies very large effects) both produce informative results.

As the myoblast data only contain four ℓ-levels and one batch from each, the TRENDS permutation test stringently identifies only 20 genes with significant non-constant trend at the 0.05 level (with multiple-testing correction). Terms which are statistically overrepresented in the Gene Ontology (GO) annotations of these significant genes (Kamburov et al. 2011) indicate the known developmental relevance of a large subset (see Figure 5A). Enriched biological process annotations include “anatomical structure development” and “cardiovascular system development” (Table S2A). In contrast, the cortex data are much richer, and TRENDS accordingly finds far stronger statistical evidence of trending genes, identifying 212 as significant (at the 0.05 level with multiple testing correction). A search for GO enriched terms in the annotations of these genes shows a large subset to be developmentally relevant (Figure 5B), with enriched terms such as “neurogenesis” and “nervous system development” (Table S2B). Due to the limited batches in these scRNA-seq data (each of which may be corrupted under our model), the TRENDS significance-tests act conservatively (a desirable property given the pervasive noise in scRNA-seq data), identifying small sets of genes we have high-confidence are primarily developmentally relevant.

Figure 5:


Word clouds of terms significantly enriched (at the 0.01 level) in GO annotations of the genes with significantly trending expression in each analysis (Kamburov et al. 2011).

Ranking the genes by their TRENDS-inferred developmental effects (using Δ), 9 of the top 10 genes in the myoblast experiment have been previously discovered as significant regulators of myogenesis and some are currently employed as standard markers for different stages of differentiation (see Table S3A). Also, 7 of the top 10 genes in the cortex analysis have been previously implicated in brain development, particularly in sensory regions (Table S3B). Thus, TRENDS accurately assigns the largest inferred effects to clearly developmental genes (see also Table S4). Since experiments to probe putative candidates require considerable effort, this is a very desirable feature for studying less well-characterized developmental systems than our cortex/myoblast examples. Figure 1A shows TRENDS predicts that MT2A (the gene with the largest Δ-inferred effect in myogenesis and a known regulator of this process) is universally down-regulated in development across the entire cell population. Interestingly, the majority of cells express MT2A at a uniformly high level of ⩾ 3 log FPKM just before differentiation is induced, but almost no cell exhibits this level of expression 24 hours later. MT2A expression becomes much more heterogeneous with some cells retaining significant MT2A expression for the remainder of the time course while others have stopped expressing this gene entirely by the end. TRENDS accounts for all of these different changes via the Wasserstein distance which appropriately quantifies these types of effects across the population.

Because any gene previously implicated in muscle development is of interest in the myoblast analysis, we can form a lower-bound approximation of the fraction of “true positives” discovered by different methods by counting the genes with a GO annotation containing both the words “muscle” and “development” (e.g. “skeletal muscle tissue development”). Table S5 contains all GO annotations meeting this criterion. Figure 6A depicts a pseudo-sensitivity plot based on this approximation over the genes with the highest presumed developmental importance inferred under different methods. Here, the Tobit models are censored regressions specifically designed for scRNA-seq data, which solely model conditional expectations rather than the full distribution of expression across the cells (see §S3). A larger fraction of the top genes found by TRENDS and our closely-related Linear TRENDS method have been previously annotated for muscle development than top candidates produced by the other methods.

Figure 6:


Pseudo-sensitivity of various methods based on their ability to identify known developmental genes. (A) the number of genes with a GO annotation containing both “muscle” and “development” found in the top K genes (ranked by the different methods for the myoblast data), over increasing K. (B) similar plot for the cortex data, where developmental genes are now those annotated with a relevant GO term from Table S6.

We repeat this analysis for the cortex data using a different set of “ground truth” GO annotations (listed in Table S6), and again find that TRENDS produces higher sensitivity than the other approaches (Figure 6B) based on this crude measure. As researchers cannot practically probe a large number of genes in greater detail, it is important that a computational method for developmental gene discovery produces many high ranking true positives which can be verified through limited additional experimentation. While TRENDS appears to display greater sensitivity than other methods, we note that it is difficult to evaluate other performance-metrics (e.g. specificity) using the scRNA-seq data, since the complete set of genes involved in relevant developmental processes remains unknown.

The Nestin gene in the myoblast data provides one example demonstrating the importance of treating full expression distributions rather than just mean-effects. Nestin plays an essential role in myogenesis, determining the onset and pace of myoblast differentiation, and its overexpression can also bring differentiation to a halt (Pallari et al. 2011), a process possibly underway in the high-expression cells from the later time points depicted in Figure 1B. TRENDS ranks Nestin 35th in terms of inferred developmental effect-size (with TRENDS p-value = 0.02 before multiple-testing correction and 0.09 after), but this gene is overlooked by the scalar regression methods (ranked only 3,291st and 5,094th in the linear and B-spline Tobit results, respectively). Although Figure 1B depicts a clear temporal effect on mean Nestin expression, scalar regression does not prioritize this gene because these methods fail to properly consider the full spectrum of changes affecting different segments of the cell population in the multitude of other genes with similar mean-effects as Nestin.

Although the closely related Linear TRENDS model performs nearly as well as TRENDS in our Figure 6 pseudo-sensitivity analysis, we find the linearity assumption overly restrictive: it prevents the Linear TRENDS model from identifying important genes like TSPYL5, a nuclear transcription factor which suppresses levels of the well-known myogenesis regulator p53 (Epping et al. 2011, Porrello et al. 2000). The Linear TRENDS model assigns this gene a p-value of only 0.2, whereas TRENDS identifies it as significant (p = 0.05), since TSPYL5 expression follows a monotonic trend fairly closely (R2 = 0.95) but is not as well approximated by a linear trend (R2 = 0.68).

12. Discussion

While established methods exist to quantify change over a sequence of probability distributions, TRENDS addresses the scientific question of how much of the observed change can be attributed to sequential progression rather than nuisance variation. Although the TF algorithm resembles quantile-modeling techniques, our ideas are grounded in the unifying lens of the Wasserstein distance, which we use to measure effects (6), goodness-of-fit (5), and a distribution-based least-squares fit (4). Like linear regression, an immensely popular scientific method despite rarely reflecting true underlying relationships, our TRENDS model is not intended to accurately model or predict the data, which are likely subject to many more effects than our simple trend definition encompasses. Rather, TRENDS quantifies effects of interest, which remain highly interpretable (via our Wasserstein perspective) despite being estimated over fully nonparametric populations.
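
Because the distributions treated here are univariate, the Wasserstein distance admits a closed form as a distance between quantile functions (Levina & Bickel 2001); in particular, for two equal-size samples the empirical 2-Wasserstein distance reduces to comparing sorted values. A minimal sketch of this computation (our own illustration, assuming equal sample sizes; not the paper's implementation):

```python
import numpy as np

def wasserstein2(x, y):
    """Empirical 2-Wasserstein distance between two univariate samples of
    equal size: the root-mean-square gap between sorted values, i.e. the
    L2 distance between the empirical quantile functions."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    assert x.shape == y.shape, "equal sample sizes assumed for simplicity"
    return float(np.sqrt(np.mean((x - y) ** 2)))

# shifting a sample by a constant c moves it a Wasserstein distance of exactly c
x = np.random.default_rng(0).normal(size=100)
print(wasserstein2(x, x + 1.0))  # → 1.0
```

The shift example illustrates why the metric is interpretable in the units of the data, a property exploited by the effect-size measure (6).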

We recommend our model for data in which the underlying population is heterogeneous (possibly subject to diverse effects), each batch contains many samples (ni ⩾ 50), and the sequence of levels L ⩾ 3 is short enough that effects of interest should follow persistent trends. When considering TRENDS analysis, it is important to ensure that the primary effects of interest are a priori expected to follow our trend definition. For the developmental scRNA-seq data considered in this work, this is a reasonable assumption because the experiments typically focus on a limited window of the underlying process. Furthermore, the pervasiveness of nuisance variation makes it preferable to identify a high-confidence developmentally relevant subset of genes (e.g., those displaying consistent effects over time), rather than attempting to characterize the complete set of genes displaying interesting effects.

While our trend definition produces good empirical results in these scRNA-seq analyses (and encompasses various conceptually interesting effects discussed in §S1), we emphasize that adopting this assumption narrowly restricts the sort of effects measured by our approach. Our limited definition is unlikely to characterize more complex effects of interest in general settings (particularly for longer sequences), and future work should explore extensions such as allowing change-points in the model. Note that our proposed Wasserstein-least-squares fit objective and Wasserstein-R2 measure remain applicable for more general classes of regression functions on distributions. Furthermore, Lemma 2 provides an alternative definition of a trend which also applies to multidimensional distributions, and thus may be useful for applications such as spatiotemporal modeling. Nevertheless, the basic TRENDS methodology presented in this work can produce valuable insights. As simultaneously-profiled cell numbers grow to the many-thousands thanks to technological advances (Macosko et al. 2015), significant discoveries may be made by studying the evolution of population-wide expression distributions, and TRENDS provides a principled framework for this analysis.

Supplementary Material

Supp1

Footnotes

1. Geiler-Samerotte et al. (2013) lament: “analyzing gene expression in a tissue sample is a lot like measuring the average personal income throughout Europe – many interesting and important phenomena are simply invisible at the aggregate level. Even when phenotypic measurements have been meticulously obtained from single cells or individual organisms, countless studies ignore the rich information in these distributions, studying the averages alone”.

References

  1. Bar-Joseph Z, Gerber G, Simon I, Gifford DK & Jaakkola TS (2003), ‘Comparing the continuous representation of time-series expression profiles to identify differentially expressed genes’, Proceedings of the National Academy of Sciences 100(18), 10146–51. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Bijleveld C, van der Kamp LJT, Mooijaart A, van der Kloot WA, van der Leeden R & van der Burg E (1998), Longitudinal Data Analysis: Designs, Models and Methods, Sage Publications. [Google Scholar]
  3. Bolstad BM, Irizarry RA, Astrand M & Speed TP (2003), ‘A comparison of normalization methods for high density oligonucleotide array data based on variance and bias’, Bioinformatics 19(2), 185–193. [DOI] [PubMed] [Google Scholar]
  4. Bondell HD, Reich BJ & Wang H (2010), ‘Non-crossing quantile regression curve estimation’, Biometrika 97(4), 825–838. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Boyle J & Dykstra R (1986), ‘A Method for Finding Projections onto the Intersection of Convex Sets in Hilbert Spaces’, Lecture Notes in Statistics 37, 28–47. [Google Scholar]
  6. Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC & Stegle O (2015), ‘Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells’, Nat Biotechnol 33(2), 155–60. [DOI] [PubMed] [Google Scholar]
  7. de Leeuw J (1977), ‘Correctness of Kruskal’s algorithms for monotone regression with ties’, Psychometrika 42(1), 141–144. [Google Scholar]
  8. Deng Q, Ramskold D, Reinius B & Sandberg R (2014), ‘Single-Cell RNA-Seq Reveals Dynamic, Random Monoallelic Gene Expression in Mammalian Cells’, Science 343(6167), 193–196. [DOI] [PubMed] [Google Scholar]
  9. Epping MT, Meijer LAT, Krijgsman O, Bos JL, Pandolfi PP & Bernards R (2011), ‘TSPYL5 suppresses p53 levels and function by physical interaction with USP7’, Nat Cell Biol 13(1), 102–108. [DOI] [PubMed] [Google Scholar]
  10. Fan J, Yao Q & Tong H (1996), ‘Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems’, Biometrika 83(1), 189–206. [Google Scholar]
  11. Ge Y, Dudoit S & Speed TP (2003), ‘Resampling-based multiple testing for microarray data analysis’, Test 12(1), 1–77. [Google Scholar]
  12. Geiler-Samerotte KA, Bauer CR, Li S, Ziv N, Gresham D & Siegal ML (2013), ‘The details in the distributions: why and how to study phenotypic variability.’, Current opinion in biotechnology 24(4), 752–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Good P (1994), Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses, Springer-Verlag. [Google Scholar]
  14. Hall P, Wolff RCL & Yao Q (1999), ‘Methods for Estimating a Conditional Distribution Function’, Journal of the American Statistical Association 94(445), 154–163. [Google Scholar]
  15. Heidelberger P & Lewis PAW (1984), ‘Quantile Estimation in Dependent Sequences’, Operations Research 32(1), 185–209. [Google Scholar]
  16. Honorio J & Jaakkola T (2014), ‘Tight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Sample Guarantees’, Fourteenth International Conference on Artificial Intelligence and Statistics. [Google Scholar]
  17. Hyndman RJ & Fan Y (1996), ‘Sample Quantiles in Statistical Packages’, The American Statistician 50(4), 361–365. [Google Scholar]
  18. Kamburov A, Pentchev K, Galicka H, Wierling C, Lehrach H & Herwig R (2011), ‘ConsensusPathDB: toward a more complete picture of cell biology.’, Nucleic acids research 39, D712–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Keen KJ (2010), Graphics for Statistics and Data Analysis with R, Taylor & Francis. [Google Scholar]
  20. Kharchenko PV, Silberstein L & Scadden DT (2014), ‘Bayesian approach to single-cell differential expression analysis’, Nat. Meth 11(7), 740–742. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Krishnaswamy S, Spitzer MH, Mingueneau M, Bendall SC, Litvin O, Stone E, Pe’er D & Nolan GP (2014), ‘Conditional density-based analysis of T cell signaling in single-cell data’, Science 346(6213). [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Levina E & Bickel P (2001), ‘The Earth Mover’s distance is the Mallows distance: some insights from statistics’, Proceedings. Eighth IEEE International Conference on Computer Vision 2, 251–256. [Google Scholar]
  23. Macosko E, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas A, Kamitaki N, Martersteck E, Trombetta J, Weitz D, Sanes J, Shalek A, Regev A & McCarroll S (2015), ‘Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets’, Cell 161(5), 1202–1214. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Mueller J & Jaakkola T (2015), ‘Principal Differences Analysis: Interpretable Characterization of Differences between Distributions’, Advances in Neural Information Processing Systems pp. 1702–1710. [Google Scholar]
  25. Pallari H-M, Lindqvist J, Torvaldson E, Ferraris SE, He T, Sahlgren C & Eriksson JE (2011), ‘Nestin as a regulator of Cdk5 in differentiating myoblasts’, Molecular Biology of the Cell 22(9), 1539–1549. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Porrello A, Cerone MA, Coen S, Gurtner A, Fontemaggi G, Cimino L, Piaggio G, Sacchi A & Soddu S (2000), ‘p53 regulates myogenesis by triggering the differentiation activity of pRb.’, The Journal of cell biology 151(6), 1295–1304. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Risso D, Ngai J, Speed TP & Dudoit S (2014), ‘Normalization of RNA-seq data using factor analysis of control genes or samples’, Nature Biotechnology 32(9), 896–902. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Silverman BW & Young GA (1987), ‘The bootstrap: To smooth or not to smooth?’, Biometrika 74(3), 469–79. [Google Scholar]
  29. Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS & Rinn JL (2014), ‘The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells’, Nat. Biotechnol 32(4), 381–386. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Zeisel A, Munoz-Manchado AB, Codeluppi S, Lonnerberg P, La Manno G, Jureus A, Marques S, Munguba H, He L, Betsholtz C, Rolny C, Castelo-Branco G, Hjerling-Leffler J & Linnarsson S (2015), ‘Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.’, Science 347(6226), 1138–42. [DOI] [PubMed] [Google Scholar]
  31. Zielinski R (2006), ‘Small-Sample Quantile Estimators in a Large Nonparametric Model’, Communications in Statistics - Theory and Methods 35(7), 1223–1241. [Google Scholar]
