Riemannian geometry and statistical modeling correct for batch effects and control false discoveries in single-cell surface protein count data

Shuyi Zhang; Jacob R Leistico; Christopher Cook; Yale Liu; Raymond J Cho; Jeffrey B Cheng; Jun S Song

doi:10.1103/PhysRevE.102.012409

. Author manuscript; available in PMC: 2020 Aug 19.

Published in final edited form as: Phys Rev E. 2020 Jul;102(1-1):012409. doi: 10.1103/PhysRevE.102.012409

Riemannian geometry and statistical modeling correct for batch effects and control false discoveries in single-cell surface protein count data

Shuyi Zhang ^1,², Jacob R Leistico ^1,², Christopher Cook ³, Yale Liu ³, Raymond J Cho ³, Jeffrey B Cheng ^3,⁴, Jun S Song ^1,^2,^*

PMCID: PMC7437020 NIHMSID: NIHMS1614829 PMID: 32794969

Abstract

Recent advances in next generation sequencing-based single-cell technologies have allowed high-throughput quantitative detection of cell-surface proteins along with the transcriptome in individual cells, extending our understanding of the heterogeneity of cell populations in diverse tissues that are in different diseased states or under different experimental conditions. Count data of surface proteins from the cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) technology pose new computational challenges, and there is currently a dearth of rigorous mathematical tools for analyzing the data. This work utilizes concepts and ideas from Riemannian geometry to remove batch effects between samples and develops a statistical framework for distinguishing positive signals from background noise. The strengths of these approaches are demonstrated on two independent CITE-seq data sets in mouse and human.

I. INTRODUCTION

In recent years, single-cell analysis has undergone immense and rapid progress, continuing to transform our understanding of the diversity, development, and cooperation of distinct cell types in various tissues. It is now possible to measure the level of messenger RNAs (mRNAs) in thousands of individual cells via a single experiment of single-cell RNA sequencing (scRNA-seq). Furthermore, multi-omics technologies providing complementary information about the genomic, proteomic, and metabolomic states of single cells are being developed and applied.

Immunophenotyping is the process of classifying immune cells, often relying on the detection of cell-surface proteins. For example, fluorescent activated cell sorting (FACS), a commonly used technique, can be performed before scRNA-seq to provide the immunophenotype information of cells. Three recent technologies based on next-generation sequencing (NGS) have enabled simultaneous performance of immunophenotyping and scRNA-seq transcriptomic profiling at the single-cell level: Ab-Seq [1], cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) [2] and RNA expression and protein sequencing (REAP-seq) [3]. These methods allow the detection of selected proteins on the surface of single cells by adding a panel of DNA-barcoded antibodies on top of the existing high-throughput scRNA-seq techniques. The antibodies bind their corresponding surface proteins, and after cell lysis, the DNA barcodes attached to the antibodies are PCR amplified and sequenced along with the mRNAs. All three methods use a unique molecular identifier (UMI)-based protocol, which largely reduces amplification biases. In addition to a count matrix for genes from sequencing the mRNAs, these methods also yield a matrix of UMI counts – referred to as the antibody-derived tag (ADT) counts in the CITE-seq literature – derived from sequencing the barcodes attached to the antibodies.

The number of different DNA-barcoded antibodies added in CITE-seq, typically 10–100, is much smaller than the number of genes measured, and the ADT assay is currently less prone to “dropout” events compared to the RNA assay [2]. Arising directly from measuring a selected list of biologically relevant cell-surface proteins, the ADT count matrix provides complementary information about the immunophenotypes of single cells, while posing new computational challenges in data analysis. Similar to other single-cell techniques, sequencing depth differs from cell to cell; a sound model of ADT count data should take the variation in sequencing depth into account. While it has been demonstrated that UMI-based scRNA-seq data can be modeled with negative binomial (NB) or zero-inflated negative binomial (ZINB) models even for heterogeneous cells [4–6], a direct application of the same approach is not ideal for the count matrix of surface proteins, because a significant portion of the counts comes from nonspecific background binding of antibodies, making the distribution of the data bimodal or multimodal [2]. Fortunately, this type of background noise can be assessed by spiking in control cells from another species that normally do not cross-react with the antibodies.

We are thus motivated to develop a rigorous statistical method that, for each protein measured, fits the NB or ZINB distribution to the ADT count data of spiked-in cells and then uses this null model to distinguish positive signals from the background noise; to our knowledge, a rigorous statistical framework for such hypothesis testing is not yet available. Once the parameters of the null model are determined, we can detect positive signals at an adjustable false discovery rate (FDR) and also derive an interpretable method of data transformation. However, when multiple samples from the same lab are being analyzed, we have observed that model fitting could be adversely affected by systematic differences in measurement between samples, suggesting that potential systematic biases should be removed prior to model fitting. To accomplish this task, we view single cells as points on a Riemannian manifold, while defining the difference between any two cells as the Riemannian distance on the manifold. This approach allows us to apply ideas from differential geometry to develop a method for removing inter-sample differences on the manifold, while preserving the pairwise distance between cells from the same sample. Therefore, for multiple samples, our aim is to first alleviate potential batch effects using what we can learn from the spike-in data and subsequently apply the above statistical formalism to the corrected protein counts to detect positive signals at a desired FDR. We acknowledge that the problem of batch effects is complicated and that the information contained in spiked-in cells might not be sufficient to resolve it completely.

In the following sections, we first introduce the notion of high-dimensional Riemannian manifold endowed with the Fisher-Rao metric, and apply the idea to map the immunophenotype profiles of single cells to a hypersphere. The gist of our batch correction approach relies on the intuition that on this hypersphere, the distribution of points corresponding to spiked-in control cells should be similar between independent samples. We will thus remove sample-specific biases by aligning the center of mass (COM) of spiked-in cells from each sample to a consensus COM. We then apply the same aligning transformation learned from the spike-in data to the native species data, thereby removing potential systematic biases present in both spiked-in and native cells of a given sample. Main computational challenges lie in computing the COM of a point cloud on the hypersphere and “parallel transporting” the point cloud along a specific path connecting the old and new COM, according to some notion of geometry defined on the manifold. Finally, we implement expectation-maximization (EM) algorithms to fit the parameters of our null models describing nonspecific binding of antibodies, and perform statistical tests to detect signals, while keeping the false discovery under control.

II. RESULTS

We have applied our geometric and statistical methods to the following two CITE-seq data sets: (1) the public data set of human cord blood mononuclear cells (CBMC) [2], with a low level (~ 5%) of spiked-in mouse control cells; (2) our own data set consisting of 6 samples, each containing immune cells isolated from mouse skin after topical treatment with either inflammation-inducing oxazolone (OXA, 3 mice) or ethyl alcohol (EtOH, 3 mice) as control, as well as a small percentage (~ 5%) of human HEK293 cells spiked in after the treatment and isolation of the mouse cells [7].

In the first data set, individual cells’ RNA expression (scRNA-seq) profiles were used to identify the mouse spiked-in cells and classify the human cells into distinct cell types (Appendix A); in the second set, scRNA-seq data were used only to identify the human spiked-in cells. For both data sets, the relatively small number of spiked- in cells from a distinct species were identified and separated by calculating the percentage of total RNA counts mapped to the native vs. spiked-in species’ genome (Appendix A). Our approach utilizes the geometric and statistical information contained in spiked-in cells to model the background noise and systematic batch effects in the count data of cell-surface proteins.

Each experiment under consideration has a fixed list of M DNA-barcoded antibodies added before sequencing, where each antibody primarily binds its corresponding cell-surface protein, although some nonspecific binding manifested as background noise may also be possible. This experimental design determines the list of M cell-surface proteins, the abundance of which is to be measured by sequencing the DNA barcodes of corresponding antibodies. For a set of N single cells sequenced, a subset of D proteins selected from the whole list yields an N × D matrix of count data,

C = [\begin{matrix} c_{1, 1} & c_{1, 2} & \dots & c_{1, D} \\ c_{2, 1} & c_{2, 2} & \dots & c_{2, D} \\ ⋮ & ⋮ & ⋮ \\ c_{N, 1} & c_{N, 2} & \dots & c_{N, D} \end{matrix}],

(1)

where $c_{i, j} \in Z_{\geq 0}$ denotes the UMI count, for the i-th cell, of the j-th protein in the selected subset. A row vector in this matrix thus contains the immunophenotype information of the corresponding cell. In our analysis, the chosen set of N cells could be all the cells sequenced in an experiment or only a subset, representing a certain species or a particular inferred cell type. Similarly, the dimension D could be equal to the total number M of assayed surface proteins, or it could be chosen to be smaller, depending on the biological question of interest.

A. Mapping immunophenotypes of cells to points on a Riemannian manifold

We first transform the row vector of count data for the i-th cell into a probability vector, with the j-th component p_j calculated as the fraction $c_{i, j} / \sum_{k = 1}^{D} c_{i, k} .$ The fraction can be interpreted as the maximum likelihood estimation (MLE) of the probability of finding a certain protein on the i-th cell to be the j-th protein, given that it is one of the D proteins. This transformation maps each cell to a point on the (D − 1)-dimensional probability simplex $Δ_{(D - 1)} \subset R^{D},$ which, under the coordinate system (p₁, … , p_D) of the ambient space, is a polytope satisfying $\sum_{j = 1}^{D} p_{j} = 1$ and p_j ≥ 0. On the simplex, the usual Euclidean distance does not properly represent how dissimilar two points are from each other. Hence, we employ mathematical techniques from information geometry and differentiable manifolds to enable the analysis of single-cell data on the probability simplex.

The open probability simplex $ᐂ_{(D - 1)}$ , i.e., the relative interior of the probability simplex, satisfying

\sum_{j = 1}^{D} p_{j} = 1, p_{j} > 0,

(2)

forms a differentiable Riemannian manifold $M$ when equipped with the Fisher-Rao information metric [8–10]. A vector $\vec{u}$ in the tangent space $T_{p} M$ at p = (p₁, … , p_D) is given by

\vec{u} = (u_{1}, \dots, u_{D}) \in ℝ^{D}, such that \sum_{j = 1}^{D} u_{j} = 0;

(3)

for any $\vec{u}, \vec{u}' \in T_{p} M,$ the inner product defined by the Fisher-Rao metric is

{〈 \vec{u}, \vec{u}' 〉}_{p} = \sum_{j = 1}^{D} \frac{u_{j} u_{j}^{'}}{p_{j}} .

(4)

Let $N \equiv 2 S^{(D - 1)} \subset R^{D}$ denote a (D − 1)-dimensional hypersphere of radius R =2 centered at the origin, and $N_{+} \equiv 2 S_{+}^{(D - 1)} \subset N$ the positive orthant of the hypersphere. It is well known [8–10] that the open probability simplex can be isometrically mapped onto $N_{+}$ via the diffeomorphism

ψ : M \to N_{+}, p \mapsto x, (p_{1}, \dots, p_{D}) \mapsto (x_{1}, \dots, x_{D}) \equiv (2 \sqrt{p_{1}}, \dots, 2 \sqrt{p_{D}}),

(5)

where $x = ψ (p) \in N_{+}$ has coordinates (x₁, …, x_D) satisfying

\sum_{j = 1}^{D} x_{j}^{2} = R^{2} = 4, x_{j} > 0.

(6)

The tangent space at $x \in N_{+}$ can be obtained as the image of the differential of ψ,

D ψ (p) : T_{p} M \to T_{x} N_{+}, \vec{u} \mapsto \vec{w} = D ψ (p) [\vec{u}], w_{j} = \frac{u_{j}}{\sqrt{p_{j}}},

(7)

with the standard inner product

{\forall \vec{w}, \vec{w}' \in T_{x} N_{+}, 〈 \vec{w}, \vec{w}' 〉}_{x} = \sum_{j = 1}^{D} w_{j} w_{j}^{'} .

(8)

Note that the pullback of this standard inner product on $N_{+}$ by ψ is just the Fisher-Rao inner product on the open probability simplex $M$ .

The entire hypersphere $N$ can be regarded as a manifold embedded in the Euclidean space $R^{D}$ with the Cartesian coordinates (x₁, … , x_D). Unlike the geometry of the open probability simplex, several properties of the hypersphere with the standard induced metric from $R^{D}$ facilitate straightforward intuition and calculations. For example, any point $x \in N$ on the hypersphere can be represented as a vector x = (x₁, …, x_D), such that a vector $\vec{w} \in T_{x} N$ in the tangent space has coordinates w = (w₁, …, w_D) satisfying $x \cdot w = \sum_{j = 1}^{D} x_{j} w_{j} = 0,$ where the dot (·) denotes the usual dot product in $R^{D}$ . Furthermore, the geodesic between two points x and y on a manifold can be derived using the metric-compatible Levi-Civita connection on the manifold. When the manifold is the hypersphere $N$ of radius R = 2, the geodesic is simply the great arc connecting the two points; that is, with the vector representations x = (x₁, …, x_D) and y = (y₁, …, y_D) in the ambient Euclidean space, the geodesic distance between x and y is given by

d_{N} (x, y) \equiv R arccos (\frac{x \cdot y}{R^{2}}) = R arccos (\frac{\sum_{j = 1}^{D} x_{j} y_{j}}{R^{2}}) .

(9)

One of the goals in our analysis is to adjust the count data of immunophenotypes by first mapping them to points on the hypersphere $N$ , then removing sample-specific biases on $N$ where calculations are simpler, and eventually mapping the corrected points back to count data. As we try to map the count data of cell-surface proteins to the hypersphere, however, there is a small caveat that we need to address. That is, one or more counts of surface proteins might be zero for a cell, and the probability vector will consequently reside on the boundary of the probability simplex, where the Fisher-Rao metric is not defined. Suppose in the probability vector (p₁, … , p_D), one component, say p_k, is 0. One strategy is to replace it with a small positive number, $p_{k} \mapsto {\tilde{p}}_{k} = ϵ,$ and rescale the remaining components as $p_{ℓ \neq k} \mapsto {\tilde{p}}_{ℓ} = p_{ℓ} (1 - ϵ),$ so that the normalization $\sum_{j = 1}^{D} {\tilde{p}}_{j} = 1$ is still preserved. The probability vector now resides on $M$ and can be mapped to a point $(x_{1}, \dots x_{D}) \in N$ with $x_{k} = 2 \sqrt{ϵ}$ . The distance from this point to any other point on $N$ is well defined; taking the limit ϵ → 0, the distance remains finite as the point is pushed to the boundary of the positive orthant with its components being x_k = 0 and $x_{ℓ \neq k} = 2 \sqrt{p_{ℓ}} .$ The argument can be generalized to the case where there are more than one protein with zero UMI counts.

In summary, given a selected list of D surface proteins and the N × D count matrix, each row [c_i,1, … , c_i,D] representing the immunophenotype of the i-th cell can be mapped to a point on a (D − 1)-dimensional hypersphere of radius 2, with coordinates (x₁, … , x_D) given by

x_{j} = 2 \sqrt{\frac{c_{i, j}}{\sum_{k = 1}^{D} c_{i, k}} .}

(10)

Figure 1 demonstrates two distinct mappings of the in-dicated cell types from the human CBMC data set. For the sake of visualization, we have chosen the dimensionality to be D = 3. For the list {CD3, CD19, CD56}, we observe that most T cells reside in one corner of the positive-orthant hypersphere with a large CD3 component, while most B cells are in another corner with a large CD 19 component, both forming densely packed point clouds clearly separated from other cell types and from each other. For the complementary list {CD3, CD19, CD56}, we see a further separation of CD4+ T cells and CD8+ T cells (although some of them seem to have been misclassified). By contrast, the spiked-in mouse cells and the human erythrocytes (red blood cells) do not possess those human-specific surface proteins expressed on immune cells; therefore, their count data only come from background nonspecific binding to the DNA-barcoded antibodies, and their corresponding point clouds he far away from any of the corners or edges and mostly overlap with each other. In the following sections, we will introduce a method that utilizes the data of spiked-in control cells to remove systematic differences between samples and to model the “noise” of nonspecific binding.

FIG. 1. — Examples of mapping the surface protein count data of human CBMC with spiked-in mouse cells to a three-dimensional sphere of radius 2. (a) The list of selected proteins is {CD3, CD19, CD56}. (b) The list of selected proteins is {CD4, CDS, CDllc}. In both cases, each distinct cell type is displayed with the color indicated in the legend. NK and DC denote natural killer cells and dendritic cells, respectively. Small dots denote individual cells, and large dots with black outlines denote the Riemannian mean of the point cloud of each cell type.

B. Computing the Riemannian mean and removing batch effects on the hypersphere

With the immunophenotypes of single cells mapped to points on the hypersphere, we can adjust the points from different experiments and remove sample-specific biases by employing the idea, from statistics, of standardizing data distributions by aligning their mean vectors. On a Riemannian manifold, Ffechet mean generalizes the notion of Euclidean mean [11]. It is also often referred to as the Karcher mean, Riemannian center of mass, or Riemannian mean in the literature [10, 12, 13]. In this work, we will simply use the term Riemannian mean and adopt the recent numerical algorithm for computing the Riemannian mean of a set of points on the hypersphere [10].

Computing the COM of a point cloud on a Riemannian manifold involves minimizing an objective function consisting of the pairwise distance between the candidate COM and every point mass in the collection, thus requiring an investigation of the shortest distance between two given points or, equivalently, the geodesic path connecting them. On the (D − 1)-dimensional hypersphere $N$ of radius R, a geodesic g parameterized by t, satisfying the conditions $g (t = 0) = x \in N,$ and $\frac{d}{d t} g (t = 0) = \vec{w} \in T_{x} N,$ is given by

g_{j} (t) = x_{j} \cos (\frac{‖ \vec{w} ‖ t}{R}) + \frac{w_{j} R}{‖ \vec{w} ‖} \sin (\frac{‖ \vec{w} ‖ t}{R}),

(11)

where the embedding coordinates of x and g in $R^{D}$ are x = (x₁, …, x_D) and g = (g₁, …, g_D), respectively, and $‖ \vec{w} ‖ \equiv \sqrt{{〈 \vec{w}, \vec{w} 〉}_{x}}$ . Taking t = 1, we obtain the exponential map on the hypersphere, ${Exp}_{x} : T_{x} N \to N$ , $\vec{w} \mapsto y = g (1),$ with the corresponding vector in $R^{D}$ given by

y = x \cos (\frac{‖ \vec{w} ‖}{R}) + w \frac{R}{‖ \vec{w} ‖} \sin (\frac{‖ \vec{w} ‖}{R}) .

(12)

The inverse of the exponential map, ${Exp}_{x}^{- 1} : N \to T_{x} N$ , $y \mapsto \vec{w} = {Exp}_{x}^{- 1} (y),$ can be easily computed by the Gram-Schmidt process and is given in the embedding Euclidean coordinates by

w = \frac{θ}{\sin θ} (y - x cos θ), θ = \arccos (\frac{x \cdot y}{R^{2}}),

(13)

for which θ = 0 is reached if and only if s = r, and in that case, $\vec{w} = \vec{0}$ is obtained by taking the limit θ → 0. It follows that the norm of the vector $\vec{w} \equiv {Exp}_{x}^{- 1} (y)$ is equal to the geodesic distance between the two points on the hypersphere (9),

‖ \vec{w} ‖ = R θ = d_{N} (x, y) .

Note that because the exponential map commutes with isometry, the exponential map and the inverse exponential map on the open probability simplex $M$ can be obtained from these results on the hypersphere by using Dψ and (Dψ)⁻¹.

We can now define the Riemannian mean on a manifold, generalizing the Euclidean COM as follows: given a set of N data points {y⁽¹⁾, … , y^(N)} with corresponding masses {m₁, … , m_N} on the hypersphere, the Riemannian mean $\bar{y}$ of the collection of point masses is

\bar{y} = \underset{x}{\arg \min} \sum_{i = 1}^{N} m_{i} d_{N}^{2} (x, y^{(i)}) = \underset{x}{\arg \min} \sum_{i = 1}^{N} m_{i} {‖ {\vec{w}}^{(i)} ‖}^{2},

(14)

where ${\vec{w}}^{(i)} \equiv {Exp}_{x}^{- 1} (y^{(i)})$ is the inverse exponential map at x. The constrained gradient condition with respect to x then reads

\sum_{i = 1}^{N} m_{i} {\vec{w}}^{(i)} = \vec{0},

(15)

and the Riemannian mean on the hypersphere can be attained numerically in iterative steps until this condition is approximately satisfied [10]. Mapping back the resulting mean to the probability simplex $M$ via the inverse isometry ψ⁻¹ yields the corresponding Riemannian mean on $M$ . In our biological application, we take all point masses to be 1.

As the human cells in the CBMC data, are labeled with cell type information inferred from the transcriptome, we have computed the Riemannian mean for each cell type, as well as the spiked-in mouse cells. The result depends on the choice of surface protein subsets. In Fig. 1, the Riemannian mean of each cell type is shown as a large dot with black outline. We see that the Riemannian mean is a good representative of a densely packed point cloud on the sphere. Figure 2 shows the components of the Riemannian mean for D = M = 13 (all the proteins) and D = M − 1 = 12 (CD45 excluded). We see that, in this case, excluding CD45 increases the contrast of specific markers associated with each cell type, due to the fact that CD45 is generally high in human immune cells and may suppress the resolution of other cell-surface markers specific to certain immune cell subtypes. We also see that CD 10, CCR5, and CCR7 are not biological markers for any of the cell types, consistent with the result in [2]. This analysis illustrates how the Riemannian mean summarizes a set of homogeneous cells, and this idea will guide our method for removing sample-specific biases.

FIG. 2. — Riemanman mean calculated from the surface protein count data of each indicated cell type in the human CBMC data set. (a) All proteins are included (D = 13). (b) CD45RA is excluded (D = 12). In both ways of mapping, the components of the Riemanman mean correspond to the height of the bars; the light gray bars in the back represent the spiked-in mouse data, while the thin bars in the front represent the different cell types in human blood, with their order and colors indicated in the legend.

For our own data, set of immune cells isolated from the mouse skin, plus spiked-in human cells, we have only identified the species without further classification into distinct cell types. Figure 3 shows the components of Riemannian mean for the mouse and human cells in each of the 6 samples, in which we include all proteins measured, D = M = 42. We observe not only biological differences caused by the different treatment conditions – e.g., the enrichment of C-D11b in OXA-trea.ted mouse cells – but also some systematic differences between samples subjected to the same treatment – e.g., CD69 being much higher in EtOH2 than in EtOHl and EtOH3 for both mouse and human data.. For the spiked-in human cells, all the count data, should in principle come from nonspecific binding, but the Riemannia.n mean of some samples are very much separated from the rest. In fact, certain surface proteins (e.g., CD69, CD44, CD134, and CD86) show notable, reproducible skews in both mouse and human cells of the same sample. The pattern of certain surface protein enrichment in spiked-in human cells of a specific sample and the persistence of these biases in the mouse cells of the same sample suggest that there might exist systematic differences between the samples. These differences between samples can be further visualized in the principal component analysis (PCA), where the point clouds of EtOH2, OXA1, and OXA2 are seen not to overlap with the other similarly treated samples for both human and mouse cells [Fig. 4(a,b)]. Systematic differences between samples, also known as batch effects, will prevent comparison between different experiments. We now describe our method for correcting such batch effects by aligning the Riemannian mean of samples and parallel transporting the collection of data from each sample along a geodesic path connecting the old COM and the new aligned COM.

FIG. 3. — Batch effects within the three oxazolone-treated samples (OXA1,2,3) and the three control samples (EtOHl,2,3) of mouse skin cells, (a) Riemannian mean of the native mouse cells from each sample, (b) Riemannian mean of the spiked-in human cells from each sample. All proteins measured are included (D = 42). The bar height corresponds to the component of the Riemannian mean in the direction indicated on the ;r-axis.

FIG. 4. — Principal component analysis (PCA) and the analysis of variance (ANOVA) of spiked-in human cells and native mouse cells on the probability simplex before and after batch correction. The single-cell ADT count data of surface proteins were transformed to probability vectors, which were then projected to the plane spanned by the first two principal components (PCs) for the PCA plots (a-d) and on which ANOVA was performed, (a) Spiked-in human data before batch correction. One control sample (EtOH2) and two treated samples (OXA1,2) are seen to be outliers from the rest, (b) Mouse data before batch correction. The biases observed in (a) are seen to be carried over here, (c) Spiked-in human data after batch correction. Point clouds of all six samples are seen to overlap well, (d) Mouse data after batch correction. Points from the six samples are seen to align well with respect to the two treatment conditions, (e) Distribution of the F-statistics from ANOVA for 41 surface proteins before and after the batch correction in six murine skin samples.

In the above discussion of Riemannian mean, we have seen that given a reference point x, points y⁽¹⁾, … , y^(N) in its neighborhood can be mapped to vectors ${\vec{w}}^{(1)}, \dots, {\vec{w}}^{(N)}$ in the tangent space $T_{x} N {via y}^{(i)} \mapsto {\vec{w}}^{(i)} = {Exp}_{x}^{- 1} (y^{(i)}),$ and vice versa. Using the Levi-C-ivita connection, we propose to parallel transport these vectors along the geodesic path from x to a new reference point z, and then retrieve points in the neighborhood of z via the exponential map acting on the transported vectors lying in $T_{z} N$ . On the hypersphere embedded in $R^{D}$ , this transformation is equivalent to a rotation in the plane spanned by the vectors $x, z \in R^{D}$ . Using (12) and (13), we have for the rotation

z = x cos θ + \frac{v}{θ} \sin θ = R (\hat{x} cos θ + \hat{v} sin θ),

(16)

with θ = arccos(x · z/R²), $\vec{v} = {Exp}_{x}^{- 1} (z) \in T_{x} N,$ and the unit vectors $\hat{x} = x / \sqrt{x \cdot x}$ and $\hat{v} = v / \sqrt{v \cdot v}$ . For a point y⁽ⁱ⁾ in the neighborhood of x, the corresponding point transported to the neighborhood of z is given by

{\tilde{y}}^{(i)} = y^{(i)} - y_{∥}^{(i)} (1 - \cos θ) + y_{⊥}^{(i)} \sin θ,

(17)

in which

y_{∥}^{(i)} \equiv (y^{(i)} \cdot \hat{x}) \hat{x} + (y^{(i)} \cdot \hat{v}) \hat{v}, y_{⊥}^{(i)} \equiv (y^{(i)} \cdot \hat{x}) \hat{v} - (y^{(i)} \cdot \hat{v}) \hat{x} .

This approach thus enables a method of correcting batch effects by aligning the Riemannian mean x of a point cloud from each sample to a consensus reference point z, and thereby transporting each point cloud to the neighborhood of z. As previously discussed, the spiked-in human cells in each mouse sample are supposed to measure “noise” from nonspecific background binding and should be similarly distributed on the hypersphere. We have chosen the consensus reference point z to be the Riemannian mean of the aggregated human cells from three samples for which the point clouds of human cells mostly overlap, namely EtOHl, EtOH3, and OXA3 [Fig. 4(a)]. The points transported to this reference point z represent the immunophenotypes of cells after the batch correction. The same rotation on the hypersphere is now applied to the mouse cells to remove systematic biases between samples. As the operation preserves any pairwise distance within a point cloud on the manifold, our method of batch correction has the advantage of not distorting the relationship between cells from the same sample.

As an approximate inverse of (10), we calculate the corrected fractions of surface proteins from the corrected coordinates on the hypersphere, and restore the count data by multiplying the fractions by the total number of UMI counts and rounding the results to nearest integers. It is possible that after the batch correction, some points on the hypersphere could have small negative coordinate components; i.e., some points may be moved out of the positive orthant after the rotation. As the components of a probability vector can never be negative, we force the negative components in (x₁, … , x_D) to be zero, and rescale other components so that the probability normalization still holds. In our experience, this thresholding has a negligible effect on the downstream analysis. We see in Fig. 4(c) and (d) that our batch correction method has successfully removed the differences between spiked- in human cells, and also aligned the mouse samples according to the treatment conditions. The analysis of variance (ANOVA), comparing the between-sample variance against the within-sample variance for each surface protein, also shows that our batch correction reduces the inter-sample difference both among the spiked-in human cells from all six samples and among the mouse cells undergoing the same treatment [Fig. 4(e)]. We showcase the successful removal of batch effects using four surface proteins, CD69, TCR γ/δ, CD90.2 and I-A/I-E, in Fig. 5. Before the correction is applied, it can be seen that biases found in the distribution of the spiked-in human cells in an outlier sample is often replicated in that of the mouse cells in the same sample – e.g., CD69 in EtOH2, TCR γ/δ in EtOH2 and OXA2, CD90.2 in OXA1, and I-A/I-E in EtOH3 and OXA1. These systematic biases are largely removed in both human and mouse cells by our batch correction method.

FIG. 5. — Effects of batch correction on the six samples of mouse skin cells with spiked-in human cells. The data points on the hypersphere either before or after the batch correction are mapped back to the probability simplex. Distributions of proportion for human and mouse cells in each of the six samples are shown for the four selected surface proteins CD69, TCR γ/δ, CD90.2, and I-A/I-E.

C. Fitting the null model and performing statistical tests on count data

We now present a statistical framework for testing the significance of enrichment of a specific surface protein in the sequencing of a native cell, compared to the null distribution of read counts for that protein in the population of spiked-in cells. For the j-th surface protein, we build the null model on the j-th column, {c_1,j, c_2,j, …, c_N,j}, of the count matrix for spiked-in cells. In the following, we will focus on one surface protein at a time and omit the subscript j indexing surface proteins to simplify notation.

When the number of zero counts is small for the surface protein under consideration, we propose to model the null distribution by a generalized form of negative binomial (NB) model, with cell-specific relative size factors {t₁, t₂, …, t_N} capturing differences in individual cells’ sequencing depths. The idea is similar to that described in [5] for analyzing scRNA-seq UMI counts, but instead of regressing a generalized linear model (GLM), we will utilize the expectation-maximization (EM) algorithm to estimate the parameters in the NB model [14]; we will subsequently show that the algorithm can also be modified to estimate the parameters of zero-inflated models that are suitable for sparse data. Once the model parameters are determined, we will then use the null model to perform statistical tests on the count data of surface proteins in native cells, distinguishing “signal” from “noise” while keeping the false discovery rate (FDR) under control.

For the i-th cell, we denote the surface protein count random variable as q_i and its actual observed value as $q_{i}^{obs} = c_{i, j} .$ The relative size factor t_i is a measure of the cell’s sequencing depth covering all surface proteins, relative to a typical sequencing depth among all N cells. The calculation of t_i is discussed in Appendix D, where we offer two choices of definition, (D6) and (D7). We use the expression of t_i defined in (D7) in this paper, unless stated otherwise.

The NB probability distribution for a count random variable q_i with a predetermined relative size factor t_i is

NB (q_{i}; μ_{i} = t_{i} ρ, α) = \frac{Γ (q_{i} + α)}{Γ (α) q_{i}!} {(\frac{t_{i} ρ}{t_{i} ρ + α})}^{q_{i}} {(\frac{α}{t_{i} ρ + α})}^{α},

(18)

parametrized by a cell-specific mean μ_i and a universal ‘stopping-time’ parameter a, with the mean being E[q_i] = μ_i = t_iρ and the variance $Var (q_{i}) = μ_{i} + μ_{i}^{2} / α .$ For our EM implementation, it is instructive to view the NB distribution as an infinite mixture of Poisson distributions with mixing coefficients given by the Gamma distribution. This paper uses the following conventions:

Gamma (λ_{i}; α, β) = \frac{λ_{i}^{α - 1} β^{α}}{Γ (α)} e^{- β λ_{i}},

(19)

Poisson (q_{i}; λ_{i} t_{i}) = \frac{{(λ_{i} t_{i})}^{q_{i}}}{(q_{i}!)} e^{- λ_{i} t_{i}} .

(20)

The shape parameter α and rate parameter β for the Gamma distribution are the same for all N cells in the set. In the mixture model, once λ_i is sampled from the Gamma distribution for the i-th cell, the mean of the Poisson distribution is determined by the product of λ_i and the cell-specific size factor t_i, independent of the parameters (α, β); we get the NB probability distribution under a reparametrization ρ = α/β, as

P (q_{i} | t_{i}, α, β) = \int_{0}^{\infty} P (q_{i} | λ_{i}, t_{i}, α, β) p (λ_{i} | t_{i}, α, β) d λ_{i} = \int_{0}^{\infty} Poisson (q_{i}; λ_{i} t_{i}) Gamma (λ_{i}; α, β) d λ_{i} = \frac{Γ (q_{i} + α)}{Γ (α) (q_{i}!)} {(\frac{t_{i}}{t_{i} + β})}^{q_{i}} {(\frac{β}{t_{i} + β})}^{α} \equiv NB (q_{i}; μ_{i} = t_{i} α / β, α) .

(21)

In this formulation, the NB probability distribution arises by marginalizing the hidden variable λ_i from the joint distribution of (q_i, λ_i). In this paper, we use P to denote both probability mass functions of discrete random variables and joint distributions of discrete and continuous random variables. To apply the EM algorithm to infer α and β, we need not only the joint distribution

P (q_{i} λ_{i} | t_{i}, α, β) = Poisson (q_{i}; λ_{i} t_{i}) Gamma (λ_{i}; α, β),

(22)

but also the posterior density of λ_i, computed by applying the Bayes’ rule as

p (λ_{i} | q_{i}, t_{i}, α, β) = \frac{Poisson (q_{i}; λ_{i} t_{i}) Gamma (λ_{i}; α, β)}{NB (q_{i}; μ_{i} = t_{i} α / β, α)},

(23)

which is also Gamma distributed. Once the maximum likelihood estimation (MLE) of the parameters (α, β) is attained, we can interpret the mean of this posterior distribution as the expected surface protein levels of single cells.

For the set of N homogeneous spiked-in cells, with count variables q = {q₁, q₂, … q_N}, their observed values being $q^{obs} = {q_{1}^{obs}, q_{2}^{obs}, \dots, q_{N}^{obs}},$ and corresponding size-factors t = {t₁, t₂, …, t_N}, we have $P (q | t, α, β) = \prod_{i = 1}^{N} P (q_{i} | t_{i}, α, β) .$ Denoting λ = {λ₁, λ₂, …, λ_N} to be the set of hidden variables for the N cells, the full joint distribution is $P (q, λ | t, α, β) = \prod_{i = 1}^{N} P (q_{i}, λ_{i} | t_{i}, α, β),$ and we have

log P (q^{obs}, λ | t, α, β) = \sum_{i = 1}^{N} \log Gamma (λ_{i}; α, β) + \sum_{i = 1}^{N} \log Poisson (q_{i}^{obs}; λ_{i} t_{i}) .

(24)

By taking expectation of (24) with respect to the posterior distribution of λ given in (23), we then apply an implementation of the EM algorithm to obtain the maximum likelihood estimates of the model parameters α and β (see Appendix B for details) [14].

When an experiment produces ${q_{1}^{obs}, q_{2}^{obs}, \dots, q_{N}^{obs}}$ containing a relatively large number of zero counts, we use a zero-inflated negative binomial (ZINB) model, defined by

P (q_{i} | t_{i}, α, β, ω) = ω δ_{q_{i}, 0} + (1 - ω) NB (q_{i}; μ_{i} = t_{i} α / β, α),

(25)

where δ_q,0 = 1 for q = 0, and 0 otherwise. The new parameter ω is the probability of a “dropout” event in the measurement. Upon some modification, the above EM algorithm can be used to obtain the maximum likelihood estimates of α, β, and ω (see Appendix C).

Note that the NB distribution is a special case of the ZINB distribution with ω = 0. We have observed that if the NB model is sufficient for modeling the observed counts, then fitting a ZINB model will give ω → 0. However, fitting a ZINB model will take longer time, and one may wish to choose a particular model based on the data at hand. The human C-BMC dataset is not sparse, with only a few zero counts; we have thus chosen to fit a NB model for each of the M = 13 surface proteins. By contrast, the count data of mouse skin cells are sparse, with the rates of zeros sometimes as high as 70%, and we have chosen to fit a ZINB model for each of the M = 42 proteins.

Fitting the NB or ZINB distribution on the count data of each surface protein from spiked-in cells yields a null model, from which we can compute the p-values for the observed counts in native cells and thus distinguish potential “signals” from “noise” in native cells. In Fig. 6(a), we show the null model fitted on the count data of human CD3 observed on the spiked-in mouse cells, with the distribution of p-values being nearly uniform and thus indicating a reasonable fit. The p-values of the human cells show a clear distinction from the null model and show an enrichment of cells having a significant signal of CD3 (small p-values). Comparing the count data from native cells to the null distribution of spiked-in cells introduces a problem of multiple hypothesis testing. To control for the false discovery rate (FDR), we use the Benjamini-Hochberg (BF1) procedure [15], as implemented in the R package ‘stats’ [16]. The adjusted p-values for mouse and human data are shown in Fig. 6(b). We would decide a cell as having a significant signal for a surface protein when the corresponding adjusted p-value is smaller than the chosen FDR threshold.

FIG. 6. — Fitting the NB model on spiked-in mouse cells in the CBMC data set, and performing statistical tests and data transformation with the estimated model parameters. The surface protein is chosen to be (human) CD3, with the fitted model paramters α = 10.30, β = 0.2074 estimated from the mouse data, and ω = 0 fixed for the NB model, (a) The distribution of p-values for mouse and human cells calculated from the fitted model. The horizontal dashed line indicates a uniform distribution with constant density 1. (b) The distribution of adjusted p-values. The vertical red dashed line indicates the FDR threshold of 0.05; cells to the left of this line are considered as CD3+, and they are all human cells, (c) The distribution of the posterior mean E[λ] for mouse and human cells calculated from the model parameters, (d) The distribution of log₁₀E[λ] for mouse and human cells. In (c) and (d), the vertical dashed line indicates the 0.99 quantile of the spiked-in mouse data.

The model fitting also provides a method of transforming the count data in a way compatible with the statistical model. The transformed data can be used for downstream analysis, such as correlation analysis, dimension reduction, and unsupervised clustering. With the ZINB parameters (α, β, ω) estimated from a given set of (q^obs, t), we can transform a pair (q′, t′) of observed UMI count and size factor, either used in the model fitting or previously unseen, as

E_{λ' | q', t', α, β, ω} [λ'] = (1 - \frac{ω δ_{q', 0}}{ω + (1 - ω) {(\frac{β}{t' + β})}^{α}}) \frac{q' + α}{t' + β},

(26)

which is roughly the posterior expected Poisson mean of the UMI-count random variable in the ZINB model. For the NB model, we simply need to take ω = 0, and the expectation value reduces to (q′ + α)/(t′ + β). The transformed CD3 count data in the CBMC data set are shown in Fig. 6(c) and (d), where the human data are clearly seen to have a mode with high E[λ] signals, well separated from the background distribution of spiked-in mouse cells. Taking the logarithm of E[λ] results in a better visualization, as seen in Fig. 6(d), clearly capturing the bimodal distribution of the CD3 expression level on the surface of human blood cells. For CD3, the number of human cells above the 0.99 quantile of the mouse control distribution approximately coincides with the number of human cells passing the statistical test at the adjusted p-value threshold of 0.05 in Fig. 6(b).

Python source code implementing the algorithms described in this manuscript is available on Github [17].

III. DISCUSSION

Inspired by the techniques of differential geometry and stochastic processes often used to model physical systems, we proposed a series of methods for analyzing the count data of surface proteins from CITE-seq. Mapping the count data to a Riemannian manifold, we used the Riemannian center of mass to find an exemplar point that best represents each set of homogeneous control cells on the manifold. We then removed potential batch effects between multiple samples by aligning their center of mass on the Riemannian manifold and built a null model in order to separate significant signals in the count data from the noise of nonspecific antibody binding.

To date, CITE-seq analysis lacked a rigorous statistical framework for testing the significance of ADT counts and adjusting for multiple hypothesis testing. Our probabilistic modeling of ADT sequencing addresses this gap and also provides an appealing data representation based on the posterior mean E[λ], which can be used for downstream analyses such as clustering and visualization. Although visually similar to other data transformation methods, inheriting the parameters from the (ZI)NB model fitting makes our transformation easily interpretable and compatible with the proposed statistical hypothesis testing framework. Further details and comparison with other data transformation (normalization) methods are discussed in Appendix D.

Unlike the original approach [2], some CITE-seq data may lack a spike-in control from another species. In those cases, we recommend first finding a set of non-immune cells (e.g., erythrocytes in the blood, and keratinocytes in the skin [18, 19]) that are transcriptomically distinct from the rest of the cells, and then using the set to build the null model for immunophenotype profiling. If this strategy is not feasible, then an unsupervised method could be developed to distinguish signal from noise by fitting bimodal or multimodal distributions. Although spiking in foreign cells could provide useful information about background noise of antibodies, several issues may still remain. For example, high background noise may result from low quality of cells, making it difficult to disentangle true presence of surface proteins. Cross-reactivity of antibodies may also pose a problem. An interesting alternative approach to fitting the null distributions of ADT counts would be to perform parallel experiments, for instance mass cytometry [20], on select control cells and calibrate the ADT count data of these cells from CITE-seq.

As previously mentioned, parallel transporting the immunophenotypes of cells on the hypersphere might move some cells slightly out of the positive orthant. We here addressed the issue by setting the small negative components to zero and rescaling the rest of the components to preserve the normalization condition. Even though this simple correction method did not noticeably affect the neighborhood structure of the point clouds in our data, future studies would be needed to develop a more rigorous geometric construction that can handle these cells.

Our method of batch correction, upon some modification, may be also applicable to other types of count data; e.g., other multi-omics count data that complement the scRNA-seq assay, and count data used for topic modeling in text mining. A potentially interesting direction for future investigation would be integrating the geometric and statistical methods directly on a Riemannian data manifold.

ACKNOWLEDGMENTS

This project was supported in part by funds from NIH K08AR067243 (JBC), NIH R01CA163336 (JSS) and the Grainger Engineering Breakthroughs Initiative (JSS).

Appendix A: Data set preparation

The CITE-seq data set of human CBMC and spiked- in mouse cells was obtained from the Gene Expression Omnibus under the accession number GSE100866. For the scRNA-seq data, we followed the suggested procedures of normalization, feature selection, dimensional reduction, and Louvain clustering in Seurat v3 [21]. The cell labels were determined from the list of biomarkers detected for each cluster using Seurat, as in [2]. Furthermore, we summed up the RNA counts mapping to the mouse genome, and calculated the percentage of mouse gene counts with respect to the total RNA counts; the putative single cells with a percentage of mouse genes from 5% to 95% were filtered out, as they might be doublets of cells from the two species. The cells with larger than 95% mouse genes were labeled as mouse cells. A tSNE plot of the tra.nscriptomic data, with labeled cells is shown in Fig. 7(a). For a clear demonstration of our analysis, we have chosen human cells with labels only from the following eight cell types: B cells, memory CD4+ T cells, naive CD4+ T cells, CD8+ T cells, natural killer (NK) cells, dendritic cells (DCs), CD34+ cells and erythrocytes. The cells labeled as CD 14+ monocytes, CD 16+ monocytes, megakaryocytes, plasmacytoid dendritic cells (pDCs), and multiplets were all filtered out. The full list of 13 cluster of differentiation (CD) proteins measured in the experiment is {CD3, CD4, CD45RA, CD56, CD 16, CD10, CDllc, CD14, CD19, CD34, CCR5 (CD195), CCR7 (CD197)}, all of which are shown on the x-a.xis in Fig. 2.

FIG. 7. — Data transformation applied to human CBMC with spiked-in mouse cells, (a) tSNE plot of the single-cell transcriptomic (scRNA-seq) data. The RNA count data, have been log-normalized, as described in Eq. (D1), and compressed using a dimensional reduction method (Appendix A). The indicated color scheme for cell types is carried over to (b,c,d,e). NK and DC denote natural killer cells and dendritic cells, respectively. CD14+ monocytes, CD16+ monocytes, megakaryocytes, and plasmacytoid dendritic cells (pDCs) are grouped into the category “Other” and omitted in other panels. (b) A version of the centered log ratio (CLR) transformation of the single-cell immunophenotype data, as described in Eq. (D2). (c) Another version of the CLR transformation of the single-cell immunophenotype data, as described in Eq. (D3). (d) Our data transformation method using the relative size factor *t_i* = *a_i*/a₀, with *a_i* being the arithmetic mean of count per protein, as described in Eq. (D6). (e) Our data transformation method using the relative size factor *t_i* = *g_i*/g₀, with *g_i* being the geometric mean of count (plus one pseudocount) per protein, as described in Eq. (D7).

Processed tables of ADT counts in murine skin cells and spiked-in human embryonic kidney 293 (HEK293) cells [7] are available at https://github.com/jssong-lab/SAGACITE/data. The full list of 42 mouse proteins measured in the experiment is {CD3, CD4, CD8a, CD69, CD62L, CD44, CD25, CD127, TCRy/A, CD 19, IgM, CD335, NK-1.1, CD90.2, I-A/I-E, CD45R, GDI lb, CD172a, XCR1, CD103, CD64, CD11c, CD24, CD326, CD195, CD196, CXCR4, CD134, CD278, CD223, CD274, KLRG1, CD27, CD21/CD35, CD86, IgD, CD Id, CD 138, CD117, CX3CR1, F4/80, CD192}, as is shown on the x-a.xis in Fig. 3. The samples OXA1, 2, and 3 were from the ear skin of three different mice treated with inflammation-inducing oxa.zolone, while EtOH1, 2, and 3 were from the ear skin of three different mice treated with ethyl alcohol as control. The immune cells in each skin sample were isolated after enzymatic digestion and then cell sorting using flow cytometry. The F1EK293 cells were then spiked in, just before CITE-seq was performed. For each cell, we calculated the percentage of RNA counts mapping to the mouse genome with respect to the total amount of RNA counts; a cell with the percentage greater than 95% was classified as a mouse cell, and a cell with the percentage smaller than 1% was classified as a human cell. No further sub-classification of mouse cells based on the tra.nscriptome was performed.

Appendix B: EM algorithm for NB model fitting

The posterior density p(λ_i|q_i,t_i, α, β) of λ_i defined in (23) satisfies

E_{λ_{i} | q_{i}, t_{i}, α, β} [λ_{i}] = \int_{0}^{\infty} λ_{i} p (λ_{i} | q_{i}, t_{i}, α, β) d λ_{i} = \frac{q_{i} + α}{t_{i} + β}

(B1)

and

E_{λ_{i} | q_{i}, t_{i}, α, β} [\log λ_{i}] = \int_{0}^{\infty} \log λ_{i} p (λ_{i} | q_{i}, t_{i}, α, β) d λ_{i} = Ψ (q_{i} + α) - \log (t_{i} + β),

(B2)

where $Ψ (z) = \frac{d}{d z} \log Γ (z)$ is the digamma. function.

For a dataset (q, t) of N independent samples, we define the following sample averages of expectation values:

\bar{E_{λ | q, t, α, β} [1]} = 1,

(B3)

\bar{E_{λ | q, t, α, β} [λ]} = \frac{1}{N} \sum_{i = 1}^{N} \frac{q_{i} + α}{t_{i} + β},

(B4)

\bar{E_{λ | q, t, α, β} [\log λ]} = \frac{1}{N} \sum_{i = 1}^{N} [Ψ (q_{i} + α) - \log (t_{i} + β)] .

(B5)

Given the current estimate of (α, β), we need to update them as

(α^{new}, β^{new}) = \underset{α^{*}, β^{*}}{\arg \max} E_{λ | q, t, α, β} [\log P (q, λ | t, α^{*}, β^{*})] = \underset{α^{*}, β^{*}}{\arg \max} \sum_{i = 1}^{N} E_{λ_{i} | q_{i}, t_{i}, α, β} [\log P (q_{i}, λ_{i} | t_{i}, α^{*}, β^{*})] = \underset{α^{*}, β^{*}}{\arg \max} \sum_{i = 1}^{N} \int_{0}^{\infty} \log Gamma (λ_{i}; α^{*}, β^{*}) \times p (λ_{i} | q_{i}, t_{i}, α, β) d λ_{i} = \underset{α^{*}, β^{*}}{\arg \max} ℓ_{NB} (α^{*}, β^{*}),

(B6)

where

ℓ_{NB} (α^{*}, β^{*}) \equiv (α^{*} - 1) \bar{E_{λ | q, t, α, β} [\log λ]} - β^{*} \bar{E_{λ | q, t, α, β} [λ]} + α^{*} \log β^{*} - \log Γ (α^{*}) .

(B7)

Solving $\frac{\partial}{\partial β^{*}} ℓ_{NB} (α^{*}, β^{*}) = 0$ for β* we get $β^{*} = α^{*} / \bar{E_{λ | q, t, α, β} [λ]} .$ Substituting this expression of β* into ℓ_NB(α*, β*) we define a new function depending only on α*:

{\tilde{ℓ}}_{NB} (α^{*}) = (α^{*} - 1) \bar{E_{λ | q, t, α, β} [\log λ]} - α^{*} \log \bar{E_{λ | q, t, α, β} [λ]} + α^{*} \log α^{*} - α^{*} - \log Γ (α^{*}) .

(B8)

Maximizing this function, we finally obtain the updates

α^{new} = \underset{α^{*}}{\arg \min} {\tilde{ℓ}}_{NB} (α^{*}), β^{new} = α^{new} / \bar{E_{λ | q, t, α, β} [λ]} .

(B9)

In each iterative step, we compute the optimization of α^new numerically using a generalized version of Newton’s method which enables faster convergence [14, 22].

Appendix C: EM algorithm for ZINB model fitting

For the ZINB model, the joint probability function is

P (q_{i}, z_{i}, λ_{i} | t_{i}, α, β, ω) = ω δ_{z_{i}, 0} δ_{q_{i}, 0} δ (λ_{i}) + (1 - ω) δ_{z_{i}, 1} Poisson (q_{i}; λ_{i} t_{i}) Gamma (λ_{i}; α, β),

(C1)

where z_i ∈ {0, 1} is a latent Bernoulli random variable modeling the “dropout” event (z_i = 0). The posterior probability function is

P (z_{i}, λ_{i} | q_{i}, t_{i}, α, β, ω) = δ_{z_{i}, 0} δ (λ_{i}) h_{i} (α, β, ω) + δ_{z_{i}, 1} (1 - h_{i} (ω, α, β)) p (λ_{i} | q_{i}, t_{i}, α, β),

(C2)

where the posterior density p(λ_i|q_i, t_i, α, β), for z_i = 1, is defined in (23), and

h_{i} (α, β, ω) \equiv \frac{ω δ_{q_{i}, 0}}{ω δ_{q_{i}, 0} + (1 - ω) NB (q_{i}; μ_{i} = t_{i} α / β, α)} .

(C3)

Given the current estimate of (α, β, ω), we need to update them as

(α^{new}, β^{new}, ω^{new}) = \underset{α^{*}, β^{*}, ω^{*}}{\arg \max} E_{z, λ | q, t, α, β, ω} [\log P (q, z, λ | t, α^{*}, β^{*}, ω^{*})] = \underset{α^{*}, β^{*}, ω^{*}}{\arg \max} ℓ_{NB} (α^{*}, β^{*}, ω^{*}),

(C4)

where ℓ_ZINB(α*, β*, ω*) , collecting only the terms involving α*, β*, ω*, is given by

ℓ_{ZINB} (α^{*}, β^{*}, ω^{*}) = \frac{1}{N} \sum_{i = 1}^{N} [h_{i} (α, β, ω) \log ω^{*} + (1 - h_{i} (α, β, ω)) \log (1 - ω^{*})] + \frac{1}{N} \sum_{i = 1}^{N} (1 - h_{i} (α, β, ω)) \times \int_{0}^{\infty} p (λ_{i} | q_{i}, t_{i}, α, β) \log Gamma (λ_{i}; α^{*}, β^{*}) d λ_{i} = \frac{1}{N} \sum_{i = 1}^{N} [h_{i} (α, β, ω) \log ω^{*} + (1 - h_{i} (α, β, ω)) \log (1 - ω^{*})] + (α^{*} - 1) ε_{\log λ} - β^{*} ε_{λ} + [α^{*} \log β^{*} - \log Γ (α^{*})] ε_{1} .

(C5)

In the last line, ε₁, ε_λ, and ε_{log λ} are defined as follows, similar to (B3), (B4), and (B5):

ε_{1} \equiv \frac{1}{N} \sum_{i = 1}^{N} (1 - h_{i} (ω, α, β)),

(C6)

ε_{λ} \equiv \frac{1}{N} \sum_{i = 1}^{N} (1 - h_{i} (α, β, ω)) \int_{0}^{\infty} d λ_{i} f_{i} (α, β) λ_{i} = \frac{1}{N} \sum_{i = 1}^{N} (1 - h_{i} (α, β, ω)) \frac{q_{i} + α}{t_{i} + β},

(C7)

ε_{\log λ} \equiv \frac{1}{N} \sum_{i = 1}^{N} (1 - h_{i} (α, β, ω)) \int_{0}^{\infty} d λ_{i} f_{i} (α, β) λ_{i} = \frac{1}{N} \sum_{i = 1}^{N} (1 - h_{i} (α, β, ω)) [Ψ (q_{i} + α) - \log (t_{i} + β)] .

(C8)

Solving $\frac{\partial}{\partial ω^{*}} ℓ_{ZINB} (α^{*}, β^{*}, ω^{*}) = 0$ for ω*, we obtain

ω^{new} = \frac{1}{N} \sum_{i = 1}^{N} h_{i} (α, β, ω) .

(C9)

From $\frac{\partial}{\partial β^{*}} ℓ_{ZINB} (α^{*}, β^{*}, ω^{*}) = 0$ , we have α*/β* = ε_λ/ε₁.

Keeping only those terms involving α* and β* in (C5) and substituting β* = α*ε₁/ε_λ, we can define a function that depends only on α* as follows

{\tilde{ℓ}}_{Z I N B} (α^{*}) / ε_{1} \equiv (α^{*} - 1) ε_{\log λ} / ε_{1} - α^{*} \log (ε_{λ} / ε_{1}) + α^{*} \log α^{*} - α^{*} - \log Γ (α^{*}) .

(C10)

The update now reads

α^{n e w} = \underset{α^{*}}{\arg \max} {{\tilde{ℓ}}_{Z I N B} (α^{*}) / ε_{1}}, β^{n e w} = α^{n e w} ε_{1} / ε_{λ} .

(C11)

With the ratios ε_{log λ}/ε₁ and ε_λ/ε₁ calculated using (C6), (C7), and (C8), the resulting optimization is the same as in (B8).

Appendix D: Comparison of data transformation methods

The log normalization with a fixed scale factor so, commonly used to process scRNA-seq data, transforms the count data (c_i,1, …, c_i,D) within the i-th cell as

c_{i, j} \mapsto \log (\frac{c_{i, j}}{s_{i}} \times s_{0} + 1) = \log (\frac{c_{i, j}}{s_{i} / s_{0}} + 1),

(D1)

where $s_{i} \equiv \sum_{k = 1}^{D} c_{i, k}$ is the total sequencing depth in the i-cell, and s₀ can be chosen to be a typical value of s_i. Some common choices are s₀ = 1000, 10000, or 100000, depending on the data. We also find it reasonable to choose s₀ as either the arithmetic or the geometric mean of s_i’s.

The centered log ratio (CLR) is a related transformation method that is previously used to process the CITE-seq count data [2], and is defined as

c_{i, j} \mapsto \log (\frac{c_{i j} + 1}{g_{i}}) = \log (c_{i, j} + 1) - \frac{1}{D} \sum_{k = 1}^{D} \log (c_{i, k} + 1),

(D2)

where $g_{i} = {[\prod_{k = 1}^{D} (c_{k, j} + 1)]}^{1 / D}$ is the geometric mean of the D surface protein counts, each adjusted by one pseudocount. It can be interpreted as row mean-centering the table of pseudocount-adjusted log counts, log (c_i,j + 1). Another version of the CLR transformation, implemented in Seurat v3 [21], is given by

c_{i, j} \mapsto \log (\frac{c_{i, j}}{g_{i}} + 1),

(D3)

which is not row-centered. Unlike the expression defined in (D2), the alternative form given in (D3) always yields nonnegative values.

We here propose a new data transformation method using the posterior E[λ_i,j] computed using the MLE of ZINB model parameters (α_j, β_j, ω_j) for the j-th protein, as given in (26); we take the logarithm for better visualization of the data, the effect of which is clear when we compare Fig. 6(c) with Fig. 6(d). When the zero-inflation mixing coefficient ω_j = 0, our transformation is

c_{i, j} \mapsto \log (\frac{c_{i, j} + α_{j}}{t_{i} + β_{j}});

(D4)

when ω_j ≠ 0, it is

c_{i, j} \mapsto \log [(1 - \frac{ω_{j} δ_{c_{i, j}, 0}}{ω_{j} + (1 - ω_{j}) {(\frac{β_{j}}{t_{i} + β_{j}})}^{α_{j}}}) \frac{c_{i, j} + α_{j}}{t_{i} + β_{j}}] .

(D5)

We provide two choices regarding how to compute the relative size factor t_i for the i-th cell: the first definition is

t_{i} = s_{i} / s_{0} = a_{i} / a_{0}, a_{i} \equiv \frac{1}{D} \sum_{k = 1}^{D} c_{i, k},

(D6)

where the ratio s_i/s₀ is the same as that used in the log normalization method (D1), a_i is the arithmetic mean of the D surface protein counts, and a₀ = s₀/D is some choice of typical value of a_i; the other definition is

t_{i} = g_{i} / g_{0}, g_{i} = {[\prod_{k = 1}^{D} (c_{i, k} + 1)]}^{1 / D},

(D7)

where g_i is the geometric mean as in the two versions of CLR transformation (D2) and (D3), and g₀ is some choice of typical value of g_i. Although (D6) capturing the differences in total count might seem more intuitive, we recommend (D7), as the geometric mean is more robust against outliers than the arithmetic mean. Transformation results for the two choices are shown in Fig. 7(d) and (e), respectively, where it is apparent that the second convention t_i = g_i/g₀ better separates the CD8+ T cells from spiked-in mouse cells along the CD3+ direction. This phenomenon may be attributed to the fact that CD8+ T cells have high CD8 counts and, thus, inflated size factors under the former definition, thereby suppressing the transformed CD3 values. Consistent with this observation, performing statistical tests on the CD3 level shows that the CD8+ T cells are correctly classified as being CD3+, when (D7), but not when (D6), is used to calculate the size factors.

In each of the transformation methods described above, the argument of the logarithm can be considered as a normalized version of the raw count c_i,j. In our transformation, the argument is E[λ], and the normalization adds a data-driven pseudocount α_j to the raw count and β_j to the relative size factor t_i ~ 1, which corrects the sum c_i,j + α_j for different sequencing depths. Similarly, the term containing ω_j corrects for the case of an inflated zero count. Compared with the log normalization (D1) and the CLR transformation (D2,D3), our approach utilizes the model parameters inferred from the data, rather than adding an arbitrary pseudocount of 1. It is also specific to a particular surface protein and has the ability to address a potential dropout effect in the measurement.

In our implementation, we have chosen g₀ to be the geometric mean of all g_i’s. However, a different choice of g₀ would merely translate the distribution of the transformed data by a constant. That is, under rescaling t_i → bt_i by a fixed constant b, the mixture probabilities of the NB and ZINB models remain invariant under the compensating redefinitions λ_i,j → λ_i,j/b, α_j → α_j, β_j → b β_j, (c_i,j+α_j)/(t_i+β_j) → (1/b)(c_i,j+α_j)/(t_i+β_j) and ω_j → ω_j; hence, the only effect of choosing a different g₀ is to rescale E[λ_i,j] by a multiplicative constant for all i, resulting in translating log E[λ_i,j] by a global constant for all cells.

References

[1].Shahi P, Kim SC, Haliburton JR, Gartner ZJ, and Abate AR, Abseq: Ultrahigh-throughput single cell protein profiling with droplet microfluidic barcoding, Scientific Reports 7, 44447 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
[2].Stoeckius M, Hafemeister C, Stephenson W, Houck-Loomis B, Chattopadhyay PK, Swerdlow H, Satija R, and Smibert P, Simultaneous epitope and transcriptome measurement in single cells, Nature Methods 14, 865 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
[3].Peterson VM, Zhang KX, Kumar N, Wong J, Li L, Wilson DC, Moore R, McClanahan TK, Sadekova S, and Klappenbach JA, Multiplexed quantification of proteins and transcripts in single cells, Nature Biotechnology 35, 936 (2017). [DOI] [PubMed] [Google Scholar]
[4].Chen W, Li Y, Easton J, Finkelstein D, Wu G, and Chen X, UMI-count modeling and differential expression analysis for single-cell rna sequencing, Genome Biology 19, 70 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
[5].Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, Murray JI, Raj A, Li M, and Zhang NR, SAVER: gene expression recovery for single-cell RNA sequencing, Nature Methods 15, 539 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
[6].Risso D, Perraudeau F, Gribkova S, Dudoit S, and Vert J-P, A general and flexible method for signal extraction from single-cell RNA-seq data, Nature Communications 9, 1 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
[7].Liu Y, Cook C, Sedgewick A, Zhang S, Harirchian P, Ricardo-Gonzalez R, Kashem S, Hanakawa S, Leistico JR, North JP, Zhang W, Man M, Charruyer A, Beliakova-Bethell N, Benz SC, Ghadially R, Kaplan D, Kabashima K, Choi J, Song JS, Cho RJ, and Cheng JB, (unpublished).
[8].Amari S.-i. and Nagaoka H, Methods of information geometry, Vol. 191 (American Mathematical Soc., 2007). [Google Scholar]
[9].Amari S.-i., Information geometry and its applications, Vol. 194 (Springer, 2016). [Google Scholar]
[10].Åström F, Petra S, Schmitzer B, and Schnörr C, Image labeling by assignment, Journal of Mathematical Imaging and Vision 58, 211 (2017). [Google Scholar]
[11].Fréchet M, Les éléments aléatoires de nature quelconque dans un espace distancié, in Annales de I’institut Henri Poincaré, Vol. 10 (1948) pp. 215–310. [Google Scholar]
[12].Karcher H, Riemannian center of mass and mollifier smoothing, Communications on Pure and Applied Mathematics 30, 509 (1977). [Google Scholar]
[13].Karcher H, Riemannian center of mass and so called Karcher mean, arXiv preprint arXiv:1407.2087 (2014). [Google Scholar]
[14].Minka TP, Estimating a Gamma distribution (2002), unpublished, available at https://tminka.github.io/papers/minka-gamma.pdf.
[15].Benjamini Y and Hochberg Y, Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal statistical society: Series B (Methodological) 57, 289 (1995). [Google Scholar]
[16].R Core Team: R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria: (2019). [Google Scholar]
[17].Zhang S and Song JS, SAGACITE (Statistical and Geometric Analysis of CITE-seq) (2020), available at https://github.com/jssong-lab/SAGACITE.
[18].Cheng JB, Sedgewick AJ, Finnegan AI, Harirchian P, Lee J, Kwon S, Fassett MS, Golovato J, Gray M, Ghadially R, Liao W, Perez White BE, Mauro TM, Mully T, Kim EA, Sbitany H, Neuhaus IM, Grekin RC, Yu SS, Gray JW, Purdom E, Paus R, Vaske CJ, Benz SC, Song JS, and Cho RJ, Transcriptional programming of normal and inflamed human epidermis at single-cell resolution, Cell Reports 25, 871 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
[19].Finnegan A, Cho RJ, Luu A, Harirchian P, Lee J, Cheng JB, and Song JS, Single-cell transcriptomics reveals spatial and temporal turnover of keratinocyte differentiation regulators, Frontiers in Genetics 10, 775 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
[20].Bandura DR, Baranov VI, Ornatsky OI, Antonov A, Kinach R, Lou X, Pavlov S, Vorobiev S, Dick JE, and Tanner SD, Mass cytometry: technique for real time single cell multitarget immunoassay based on inductively coupled plasma time-of-flight mass spectrometry, Analytical Chemistry 81, 6813 (2009). [DOI] [PubMed] [Google Scholar]
[21].Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck III WM, Hao Y, Stoeckius M, Smibert P, and Satija R, Comprehensive integration of single-cell data, Cell 177, 1888 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
[22].Minka TP, Beyond Newton’s method (2000), unpublished, available at https://tminka.github.io/papers/minka-newton.pdf.

[R1] [1].Shahi P, Kim SC, Haliburton JR, Gartner ZJ, and Abate AR, Abseq: Ultrahigh-throughput single cell protein profiling with droplet microfluidic barcoding, Scientific Reports 7, 44447 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] [2].Stoeckius M, Hafemeister C, Stephenson W, Houck-Loomis B, Chattopadhyay PK, Swerdlow H, Satija R, and Smibert P, Simultaneous epitope and transcriptome measurement in single cells, Nature Methods 14, 865 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] [3].Peterson VM, Zhang KX, Kumar N, Wong J, Li L, Wilson DC, Moore R, McClanahan TK, Sadekova S, and Klappenbach JA, Multiplexed quantification of proteins and transcripts in single cells, Nature Biotechnology 35, 936 (2017). [DOI] [PubMed] [Google Scholar]

[R4] [4].Chen W, Li Y, Easton J, Finkelstein D, Wu G, and Chen X, UMI-count modeling and differential expression analysis for single-cell rna sequencing, Genome Biology 19, 70 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] [5].Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, Murray JI, Raj A, Li M, and Zhang NR, SAVER: gene expression recovery for single-cell RNA sequencing, Nature Methods 15, 539 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] [6].Risso D, Perraudeau F, Gribkova S, Dudoit S, and Vert J-P, A general and flexible method for signal extraction from single-cell RNA-seq data, Nature Communications 9, 1 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] [7].Liu Y, Cook C, Sedgewick A, Zhang S, Harirchian P, Ricardo-Gonzalez R, Kashem S, Hanakawa S, Leistico JR, North JP, Zhang W, Man M, Charruyer A, Beliakova-Bethell N, Benz SC, Ghadially R, Kaplan D, Kabashima K, Choi J, Song JS, Cho RJ, and Cheng JB, (unpublished).

[R8] [8].Amari S.-i. and Nagaoka H, Methods of information geometry, Vol. 191 (American Mathematical Soc., 2007). [Google Scholar]

[R9] [9].Amari S.-i., Information geometry and its applications, Vol. 194 (Springer, 2016). [Google Scholar]

[R10] [10].Åström F, Petra S, Schmitzer B, and Schnörr C, Image labeling by assignment, Journal of Mathematical Imaging and Vision 58, 211 (2017). [Google Scholar]

[R11] [11].Fréchet M, Les éléments aléatoires de nature quelconque dans un espace distancié, in Annales de I’institut Henri Poincaré, Vol. 10 (1948) pp. 215–310. [Google Scholar]

[R12] [12].Karcher H, Riemannian center of mass and mollifier smoothing, Communications on Pure and Applied Mathematics 30, 509 (1977). [Google Scholar]

[R13] [13].Karcher H, Riemannian center of mass and so called Karcher mean, arXiv preprint arXiv:1407.2087 (2014). [Google Scholar]

[R14] [14].Minka TP, Estimating a Gamma distribution (2002), unpublished, available at https://tminka.github.io/papers/minka-gamma.pdf.

[R15] [15].Benjamini Y and Hochberg Y, Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal statistical society: Series B (Methodological) 57, 289 (1995). [Google Scholar]

[R16] [16].R Core Team: R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria: (2019). [Google Scholar]

[R17] [17].Zhang S and Song JS, SAGACITE (Statistical and Geometric Analysis of CITE-seq) (2020), available at https://github.com/jssong-lab/SAGACITE.

[R18] [18].Cheng JB, Sedgewick AJ, Finnegan AI, Harirchian P, Lee J, Kwon S, Fassett MS, Golovato J, Gray M, Ghadially R, Liao W, Perez White BE, Mauro TM, Mully T, Kim EA, Sbitany H, Neuhaus IM, Grekin RC, Yu SS, Gray JW, Purdom E, Paus R, Vaske CJ, Benz SC, Song JS, and Cho RJ, Transcriptional programming of normal and inflamed human epidermis at single-cell resolution, Cell Reports 25, 871 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] [19].Finnegan A, Cho RJ, Luu A, Harirchian P, Lee J, Cheng JB, and Song JS, Single-cell transcriptomics reveals spatial and temporal turnover of keratinocyte differentiation regulators, Frontiers in Genetics 10, 775 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] [20].Bandura DR, Baranov VI, Ornatsky OI, Antonov A, Kinach R, Lou X, Pavlov S, Vorobiev S, Dick JE, and Tanner SD, Mass cytometry: technique for real time single cell multitarget immunoassay based on inductively coupled plasma time-of-flight mass spectrometry, Analytical Chemistry 81, 6813 (2009). [DOI] [PubMed] [Google Scholar]

[R21] [21].Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck III WM, Hao Y, Stoeckius M, Smibert P, and Satija R, Comprehensive integration of single-cell data, Cell 177, 1888 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] [22].Minka TP, Beyond Newton’s method (2000), unpublished, available at https://tminka.github.io/papers/minka-newton.pdf.

PERMALINK

Riemannian geometry and statistical modeling correct for batch effects and control false discoveries in single-cell surface protein count data

Shuyi Zhang

Jacob R Leistico

Christopher Cook

Yale Liu

Raymond J Cho

Jeffrey B Cheng

Jun S Song

Abstract

I. INTRODUCTION

II. RESULTS

A. Mapping immunophenotypes of cells to points on a Riemannian manifold

FIG. 1.

B. Computing the Riemannian mean and removing batch effects on the hypersphere

FIG. 2.

FIG. 3.

FIG. 4.

FIG. 5.

C. Fitting the null model and performing statistical tests on count data

FIG. 6.

III. DISCUSSION

ACKNOWLEDGMENTS

Appendix A: Data set preparation

FIG. 7.

Appendix B: EM algorithm for NB model fitting

Appendix C: EM algorithm for ZINB model fitting

Appendix D: Comparison of data transformation methods

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Riemannian geometry and statistical modeling correct for batch effects and control false discoveries in single-cell surface protein count data

Shuyi Zhang

Jacob R Leistico

Christopher Cook

Yale Liu

Raymond J Cho

Jeffrey B Cheng

Jun S Song

Abstract

I. INTRODUCTION

II. RESULTS

A. Mapping immunophenotypes of cells to points on a Riemannian manifold

FIG. 1.

B. Computing the Riemannian mean and removing batch effects on the hypersphere

FIG. 2.

FIG. 3.

FIG. 4.

FIG. 5.

C. Fitting the null model and performing statistical tests on count data

FIG. 6.

III. DISCUSSION

ACKNOWLEDGMENTS

Appendix A: Data set preparation

FIG. 7.

Appendix B: EM algorithm for NB model fitting

Appendix C: EM algorithm for ZINB model fitting

Appendix D: Comparison of data transformation methods

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases