MethodsX. 2021 Nov 17;8:101580. doi: 10.1016/j.mex.2021.101580

Statistical methods for analysis of single-cell RNA-sequencing data

Samarendra Das a,b,c, Shesh N Rai b,c,d,e,f,g
PMCID: PMC8720898  PMID: 35004214

Abstract

Single-cell RNA-sequencing (scRNA-seq) is a recent high-throughput genomic technology used to study the expression dynamics of genes at the single-cell level. Analyzing scRNA-seq data in the presence of biological confounding factors, including dropout events, is a challenging task. This article therefore presents a novel statistical approach for various analyses of scRNA-seq Unique Molecular Identifier (UMI) count data. These analyses include modeling and fitting of observed UMI data, cell type detection, estimation of cell capture rates, estimation of gene-specific model parameters, and estimation of the sample mean and sample variance of the genes. In addition, the approach can perform differential expression and other downstream analyses that account for the molecular capture process in scRNA-seq data modeling. External spike-in data can also be used in the approach for better results. The unique feature of the method is that, in modeling the observed UMI counts of genes, it considers the biological process that leads to severe dropout events.

• The differential expression analysis of observed scRNA-seq UMI counts data is performed after adjustment for cell capture rates.

• The statistical approach performs downstream differential zero inflation analysis, classification of influential genes, and selection of top marker genes.

• Cell auxiliaries including cell clusters and other cell variables (e.g., cell cycle, cell phase) are used to remove unwanted variation to perform statistical tests reliably.

Keywords: Zero inflated negative binomial model, Molecular capture model, Observed UMI count, True UMI count, Mean, Zero Inflation, Overdispersion


Specifications table

Subject area: Statistics
More specific subject area: Statistical Genomics and Computational Biology
Method name: SwarnSeq
Name and reference of original method: Das, S. and Rai, S.N. (2021). SwarnSeq: An improved statistical approach for differential expression analysis of single-cell RNA-seq data. Genomics, 113(3), 1308-1324. doi.org/10.1016/j.ygeno.2021.02.014
Resource availability: www.github/sam-uofl/SwarnSeq

Data descriptions

We illustrated the performance of the methods on a publicly available single-cell RNA-seq (scRNA-seq) dataset. The full dataset was obtained from Yoruba (YRI) induced pluripotent stem cell (iPSC) lines, with three 96-well plates per individual [1]. We downloaded the Unique Molecular Identifier (UMI) counts, ERCC spike-in, and molecular concentration datasets from the GitHub repository (https://github.com/jdblischak/singleCellSeq). We used only the data of two individual cell lines, NA19101 (288 cells) and NA19239 (288 cells), for further statistical analyses. The original UMI count data contain expression values of genes/transcripts over 576 cells. To reduce the dimension of the data, we removed the genes that have non-zero expression values in fewer than five cells.

Method details

Notations: Let $Y_{ijkl}$ be a random variable (rv) representing the observed (known) UMI count in the ith cell ($i = 1, 2, \ldots, I_k$) for the jth gene ($j = 1, 2, \ldots, J$) in the kth cell cluster ($k = 1, 2, \ldots, K$) at the lth ($l = 1, 2, \ldots, L$) cell type/pseudo-time; $Z_{ijkl}$: rv representing the unobserved/true (unknown) UMI count in the ith cell for the jth gene in the kth cell cluster at the lth cell type/pseudo-time; $I_k$: number of cells present in the kth cell cluster; $I = \sum_{k=1}^{K} I_k$: total number of cells present in the scRNA-seq data; $J$: total number of genes in the data; $K$: total number of cell clusters; $L$: number of cell types; $\mu_{ijkl}$: mean of non-zero counts in the ith cell for the jth gene in the kth cell cluster of the lth cell type; $\phi_{ijkl}$ ($= \theta_{ijkl}^{-1}$) and $\theta_{ijkl}$: dispersion and size parameters, respectively, in the ith cell for the jth gene in the kth cell cluster of the lth cell type; $\pi_{ijkl}$: zero inflation probability in the ith cell for the jth gene in the kth cell cluster of the lth cell type.

Traditional statistical models for fitting observed scRNA-seq data

Negative binomial (NB) model

NB models are extensively used in modeling the read counts obtained from RNA-sequencing (RNA-seq) studies. The Probability Mass Function (PMF) of the NB distributional model is expressed in Eq. (1).

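$$f_{NB}(y) = P[Y_{ijkl} = y \mid \theta_{ijkl}, \mu_{ijkl}] = \frac{G(y + \theta_{ijkl})}{G(y + 1)\, G(\theta_{ijkl})} \left( \frac{\theta_{ijkl}}{\theta_{ijkl} + \mu_{ijkl}} \right)^{\theta_{ijkl}} \left( \frac{\mu_{ijkl}}{\theta_{ijkl} + \mu_{ijkl}} \right)^{y}, \quad y = 0, 1, 2, \ldots \tag{1}$$

where $\mu_{ijkl} \ge 0$ and $\theta_{ijkl} > 0$ are the parameters of the NB model, and $G(\cdot)$ is the Gamma function. The NB distribution becomes Poisson when $\theta_{ijkl} \to \infty$.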

The mean and variance of the NB model are given in Eqs. (2) and (3), respectively.

$$E(Y_{ijkl}) = \mu_{ijkl} \tag{2}$$
$$Var(Y_{ijkl}) = \mu_{ijkl} + \frac{\mu_{ijkl}^2}{\theta_{ijkl}} = \mu_{ijkl} + \mu_{ijkl}^2 \phi_{ijkl} \tag{3}$$
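As a quick numerical check of Eqs. (1)–(3), the sketch below evaluates the NB PMF on the log scale and recovers the mean and variance by direct summation. The function names and the parameter values (mean 5, size 2) are illustrative assumptions, not part of the SwarnSeq package.

```python
import math

def nb_pmf(y, mu, theta):
    """NB probability of count y under Eq. (1), with mean mu and size theta."""
    if mu == 0:
        return 1.0 if y == 0 else 0.0
    # Work on the log scale so the Gamma terms do not overflow.
    log_p = (math.lgamma(y + theta) - math.lgamma(y + 1) - math.lgamma(theta)
             + theta * math.log(theta / (theta + mu))
             + y * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def nb_moments(mu, theta, y_max=2000):
    """Mean and variance obtained by summing the PMF over y = 0..y_max."""
    probs = [nb_pmf(y, mu, theta) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

# Hypothetical parameter values chosen only for illustration.
mean, var = nb_moments(mu=5.0, theta=2.0)
# Eq. (2) predicts mean = mu; Eq. (3) predicts var = mu + mu**2 / theta.
```

The truncation point `y_max` is a convenience of the sketch; it only needs to be large enough that the remaining tail mass is negligible for the chosen parameters.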

Zero inflated negative binomial (ZINB) model

The NB model implemented in bulk RNA-seq differential expression (DE) tools, including DESeq2, edgeR, baySeq, and SAMSeq, may not handle the excess overdispersion and zero inflation present in single-cell UMI count data [2,3]. Therefore, the ZINB model is widely used for modeling and fitting UMI count data obtained from single-cell studies [2,3,4,5]. The ZINB model can be briefly described as follows:

The PMF of the ZINB distribution is given in Eq. (4).

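$$f_{ZINB}(y) = P[Y_{ijkl} = y \mid \pi_{ijkl}, \theta_{ijkl}, \mu_{ijkl}] = \pi_{ijkl}\, \delta_0(y) + (1 - \pi_{ijkl})\, f_{NB}(y), \quad y = 0, 1, 2, \ldots \tag{4}$$

where $f_{NB}(\cdot)$ is the PMF of the NB distribution (Eq. (1)) and $\delta_0(\cdot)$ is Dirac's delta function. Here, $\delta_0(\cdot)$ is used to model the excess zeros; it equals zero for every non-zero UMI count and one for every zero count, as expressed in Eq. (5).

$$\delta_0(Y_{ijkl} = y) = \begin{cases} 1, \quad y = 0 \\ 0, \quad y \neq 0 \end{cases} \tag{5}$$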

The PMF of the ZINB distribution, used to model the UMI counts from scRNA-seq studies, is given in Eq. (6).

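$$P[Y_{ijkl} = y] = \begin{cases} \pi_{ijkl} + (1 - \pi_{ijkl}) \left( \dfrac{\theta_{ijkl}}{\theta_{ijkl} + \mu_{ijkl}} \right)^{\theta_{ijkl}}, \quad y = 0 \\ (1 - \pi_{ijkl}) \dfrac{G(y + \theta_{ijkl})}{G(y + 1)\, G(\theta_{ijkl})} \left( \dfrac{\theta_{ijkl}}{\theta_{ijkl} + \mu_{ijkl}} \right)^{\theta_{ijkl}} \left( \dfrac{\mu_{ijkl}}{\theta_{ijkl} + \mu_{ijkl}} \right)^{y}, \quad y > 0 \end{cases} \tag{6}$$

If $\pi_{ijkl} = 0$, $ZINB(\pi_{ijkl}, \mu_{ijkl}, \theta_{ijkl})$ reduces to $NB(\mu_{ijkl}, \theta_{ijkl})$.

If $\theta_{ijkl} \to \infty$ (no dispersion), $ZINB(\pi_{ijkl}, \mu_{ijkl}, \theta_{ijkl}) \to ZIP(\pi_{ijkl}, \mu_{ijkl})$, where ZIP is the Zero Inflated Poisson model.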
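The two limiting cases above can be checked numerically. The following sketch implements the ZINB PMF of Eq. (4) and compares it with the NB PMF (zero inflation set to 0) and with a zero-inflated Poisson PMF (a very large size parameter). All function names and parameter values are hypothetical and chosen only for illustration.

```python
import math

def nb_pmf(y, mu, theta):
    """NB PMF of Eq. (1), evaluated on the log scale."""
    log_p = (math.lgamma(y + theta) - math.lgamma(y + 1) - math.lgamma(theta)
             + theta * math.log(theta / (theta + mu))
             + y * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def zinb_pmf(y, pi, mu, theta):
    """ZINB PMF of Eq. (4): point mass at zero mixed with an NB count part."""
    point_mass = 1.0 if y == 0 else 0.0
    return pi * point_mass + (1.0 - pi) * nb_pmf(y, mu, theta)

def zip_pmf(y, pi, mu):
    """Zero-inflated Poisson PMF, the limit of the ZINB as theta grows."""
    point_mass = 1.0 if y == 0 else 0.0
    poisson = math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))
    return pi * point_mass + (1.0 - pi) * poisson

# pi = 0 recovers the NB model exactly.
nb_gap = abs(zinb_pmf(3, 0.0, 5.0, 2.0) - nb_pmf(3, 5.0, 2.0))
# A very large size parameter brings the ZINB close to the ZIP model.
zip_gap = abs(zinb_pmf(3, 0.2, 5.0, 1e6) - zip_pmf(3, 0.2, 5.0))
```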

SwarnSeq model

Existing single-cell analysis tools, including Seurat, DEsingle, Monocle, and MAST, treat the observed UMI counts as realizations of the true UMI counts. This assumption does not hold, because various sources of noise, including biological ones such as low molecular capture, are confounded with the observed UMI counts [2,4]. For instance, recent single-cell sequencing protocols capture only 1–10% of the transcripts present in the cell [4,5]. This property therefore needs to be incorporated when modeling the observed UMI count data. Here, we use a simple Binomial cell capture model for the observed UMI count data. However, other cellular capture models, e.g., Beta-Binomial, Poisson-NB, and Hypergeometric models, can also be considered to represent biological dropout events in single-cell studies.

Theorem: Let $\rho_{ijkl}$ be the rv representing the transcriptional capture rate of the ith cell for the jth gene in the kth cell cluster at the lth cell type/pseudo-time. If the true UMI counts, $Z_{ijkl}$, follow a $ZINB(\pi_{ijkl}, \mu_{ijkl}, \theta_{ijkl})$ distribution, and $\rho_{ijkl}$ follows a binomial model with parameter $p_{ijkl}$ ($0 \le p_{ijkl} \le 1$), then the observed UMI counts, $Y_{ijkl}$, also follow a ZINB distribution with parameters $(\pi_{ijkl}, \mu_{ijkl} p_{ijkl}, \theta_{ijkl})$.

Proof: Given that $Z_{ijkl} \sim ZINB(\pi_{ijkl}, \mu_{ijkl}, \theta_{ijkl})$ and $\rho_{ijkl} = (Y_{ijkl} \mid Z_{ijkl} = z) \sim B(z, p_{ijkl})$.

Now, the PMF of Zijkl is given in Eq. (4) and the PMF of ρijkl can be expressed in Eq. (7).

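$$P[Y_{ijkl} = y \mid Z_{ijkl} = z] = \binom{z}{y} p_{ijkl}^{y} (1 - p_{ijkl})^{z - y} \tag{7}$$

The joint probability distribution of the observed and true UMI counts, $Y_{ijkl}$ and $Z_{ijkl}$, can be written as:

$$P[Y_{ijkl} = y, Z_{ijkl} = z \mid \pi_{ijkl}, \mu_{ijkl}, \theta_{ijkl}, p_{ijkl}] = P[Y_{ijkl} = y \mid Z_{ijkl} = z, p_{ijkl}]\; P[Z_{ijkl} = z \mid \pi_{ijkl}, \mu_{ijkl}, \theta_{ijkl}] \tag{8}$$

Now, the marginal probability distribution of $Y_{ijkl}$ can be obtained as:

$$P[Y_{ijkl} = y \mid \pi_{ijkl}, \mu_{ijkl}, \theta_{ijkl}, p_{ijkl}] = \sum_{z} P[Y_{ijkl} = y \mid Z_{ijkl} = z, p_{ijkl}]\; P[Z_{ijkl} = z \mid \pi_{ijkl}, \mu_{ijkl}, \theta_{ijkl}] \tag{9}$$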

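Case-1: when the observed UMI count is zero (i.e., $Y_{ijkl} = 0$)

$$P[Y_{ijkl} = 0 \mid \pi_{ijkl}, \mu_{ijkl}, \theta_{ijkl}, p_{ijkl}] = \pi_{ijkl} + (1 - \pi_{ijkl}) \left( \frac{\theta_{ijkl}}{\theta_{ijkl} + \mu^{*}_{ijkl}} \right)^{\theta_{ijkl}}, \quad \mu^{*}_{ijkl} = \mu_{ijkl} p_{ijkl} \ \text{(say)} \tag{10}$$

Case-2: when the observed UMI count is non-zero (i.e., $Y_{ijkl} = t$, $t = 1, 2, 3, \ldots$)

$$P[Y_{ijkl} = t \mid \pi_{ijkl}, \mu_{ijkl}, \theta_{ijkl}, p_{ijkl}] = (1 - \pi_{ijkl}) \frac{G(t + \theta_{ijkl})}{G(t + 1)\, G(\theta_{ijkl})} \left( \frac{\theta_{ijkl}}{\theta_{ijkl} + \mu^{*}_{ijkl}} \right)^{\theta_{ijkl}} \left( \frac{\mu^{*}_{ijkl}}{\theta_{ijkl} + \mu^{*}_{ijkl}} \right)^{t} \tag{11}$$

Now, Eqs. (10) and (11) are in the form of Eq. (4), with $\mu^{*}_{ijkl} = \mu_{ijkl} p_{ijkl}$ in place of $\mu_{ijkl}$, which indicates that the observed UMI counts, $Y_{ijkl}$, also follow a $ZINB(\pi_{ijkl}, \mu^{*}_{ijkl}, \theta_{ijkl})$ distribution. The detailed proof of this theorem can be found in [2].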
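The statement of the theorem can be verified numerically without the algebra. The sketch below computes the marginal distribution of Eq. (9) by brute-force summation over the true counts and compares it with a ZINB PMF whose mean is the true mean multiplied by the capture rate. The function names, the truncation point of the sum, and the parameter values are assumptions made for illustration.

```python
import math

def nb_pmf(y, mu, theta):
    """NB PMF of Eq. (1), evaluated on the log scale."""
    log_p = (math.lgamma(y + theta) - math.lgamma(y + 1) - math.lgamma(theta)
             + theta * math.log(theta / (theta + mu))
             + y * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def zinb_pmf(y, pi, mu, theta):
    """ZINB PMF of Eq. (4)."""
    point_mass = 1.0 if y == 0 else 0.0
    return pi * point_mass + (1.0 - pi) * nb_pmf(y, mu, theta)

def binomial_pmf(y, z, p):
    """Binomial capture model of Eq. (7): y molecules captured out of z."""
    return math.comb(z, y) * (p ** y) * ((1.0 - p) ** (z - y))

def observed_pmf_by_summation(y, pi, mu, theta, p, z_max=600):
    """Marginal of Eq. (9), truncating the sum over true counts at z_max."""
    return sum(binomial_pmf(y, z, p) * zinb_pmf(z, pi, mu, theta)
               for z in range(y, z_max + 1))

# Hypothetical values: 30% zero inflation, true mean 20, size 2, 10% capture.
pi, mu, theta, p = 0.3, 20.0, 2.0, 0.1
gaps = [abs(observed_pmf_by_summation(y, pi, mu, theta, p)
            - zinb_pmf(y, pi, mu * p, theta))
        for y in range(6)]
```

Each entry of `gaps` is the absolute difference between the summed marginal and the closed form claimed by the theorem for one observed count; values at the level of floating-point noise support the claim for these parameters.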

Corollary 1: When $p_{ijkl} = 1$ (i.e., under full capture), all the transcriptomic material present in the cell is captured during the sequencing process; this is called perfect deep sequencing. Under this scenario, the observed mean $\mu_{ijkl} p_{ijkl}$ equals the true mean $\mu_{ijkl}$, so the distributions of the observed and true UMI counts are the same ZINB model. Mathematically,

$$ZINB(\pi_{ijkl}, \mu_{ijkl} p_{ijkl}, \theta_{ijkl}) \overset{d}{=} ZINB(\pi_{ijkl}, \mu_{ijkl}, \theta_{ijkl}) \tag{12}$$

Here, zero counts arise only for genes that are truly not expressed in a cell (i.e., biological zeros), and the single-cell experiment is free from dropout events. However, such a scenario is not attainable in real single-cell experiments. In other words, in practice $0 < p_{ijkl} < 1$.

Corollary 2: When $p_{ijkl} < 1$, i.e., in real experiments, the transcriptomic material present in cells is not fully captured; only a certain fraction is captured [9]. The zero counts in the single-cell expression data are then a mixture of dropout/false zeros and true zeros. Further, the mean of the observed non-zero UMI counts depends on the cell capture rate parameter, while the zero inflation and overdispersion parameters are independent of the cell capture rates. It is worth noting that $\hat{\pi}_{ijkl}$ from the observed data can be used to estimate the proportion of true zeros, as $\pi_{ijkl}$ remains unaffected by the capture rate parameter.

$$\text{True UMI counts: } Z_{ijkl} \sim ZINB(\pi_{ijkl}, \mu_{ijkl}, \theta_{ijkl}) \tag{13}$$
$$\text{Observed UMI counts: } Y_{ijkl} \sim ZINB(\pi_{ijkl}, \mu^{*}_{ijkl}, \theta_{ijkl}), \quad \mu^{*}_{ijkl} = \mu_{ijkl} p_{ijkl} \tag{14}$$

In single-cell experiments, the observed UMI counts are a noisy reflection of the true expression of genes due to low cellular transcriptional capture (Eqs. (13) and (14)). In other words, the distribution of the observed UMI counts of a gene is determined jointly by the gene's true expression and the transcriptional (cell) capture rate. The true and observed means of non-zero counts of genes satisfy $\mu_{ijkl} > \mu^{*}_{ijkl} = \mu_{ijkl} p_{ijkl}$. This means that the distribution of observed UMI counts shifts further towards zero as the cellular capture rate decreases. In other words, the mass at zero in the mixture distribution (Eq. (4)) increases relative to the NB part.

Expected value and variance of the observed UMI counts in SwarnSeq model

The expected value and variance of the observed UMI counts of genes, $Y_{ijkl}$, in the SwarnSeq model are given in Eqs. (15) and (16), respectively.

$$E(Y_{ijkl}) = (1 - \pi_{ijkl})\, \mu_{ijkl} p_{ijkl} \tag{15}$$
$$V(Y_{ijkl}) = (1 - \pi_{ijkl})\, \mu_{ijkl} p_{ijkl} \left( 1 + \pi_{ijkl} \mu_{ijkl} p_{ijkl} + \mu_{ijkl} p_{ijkl} \phi_{ijkl} \right) \tag{16}$$

In the SwarnSeq method, the expected value of the observed UMI counts of genes depends on the zero inflation, the mean of non-zero counts, and the cell capture rate parameters, while the variance is a function of the zero inflation, mean of non-zero counts, overdispersion, and cell capture rate parameters. Further, the variance of the observed UMI counts of a gene can be written as a function of its expected value, as shown in Eq. (17). Because the multiplier in Eq. (17) exceeds one, the variance exceeds the expected value (i.e., a case of overdispersion).

$$V(Y_{ijkl}) = E(Y_{ijkl}) \left\{ 1 + \mu_{ijkl} p_{ijkl} (\pi_{ijkl} + \phi_{ijkl}) \right\} \tag{17}$$
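A small worked example may help make Eqs. (15)–(17) concrete. The sketch below evaluates the expected value and variance for one hypothetical gene-cell combination and confirms that the two forms of the variance agree; the function names and numbers are illustrative assumptions.

```python
def observed_mean(pi, mu, p):
    """Eq. (15): expected observed UMI count."""
    return (1.0 - pi) * mu * p

def observed_variance(pi, mu, p, phi):
    """Eq. (16): variance of the observed UMI count."""
    captured_mean = mu * p
    return (1.0 - pi) * captured_mean * (1.0 + pi * captured_mean
                                         + captured_mean * phi)

def variance_from_mean(pi, mu, p, phi):
    """Eq. (17): the same variance written as a multiple of the mean."""
    return observed_mean(pi, mu, p) * (1.0 + mu * p * (pi + phi))

# Hypothetical values: 30% zero inflation, true mean 20, 10% capture,
# dispersion 0.5.
pi, mu, p, phi = 0.3, 20.0, 0.1, 0.5
mean_y = observed_mean(pi, mu, p)          # 0.7 * 2 = 1.4
var_y = observed_variance(pi, mu, p, phi)  # 1.4 * (1 + 0.6 + 1.0) = 3.64
var_y_alt = variance_from_mean(pi, mu, p, phi)
```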

Distributions of sample mean and sample variance of observed counts of genes

Usually, the population parameters of the genes, including the population mean and variance, are unknown and are estimated from experimentally observed sample UMI count data. Hence, it is important to obtain the sampling distributions of the sample means and variances of the genes in a single-cell experimental study. The sample mean and variance of the observed UMI counts for the jth gene are given in Eqs. (18) and (19), respectively. Here, for simplicity, we omit the subscript denoting cell type.

$$\bar{y}_j = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{I_k} \sum_{i=1}^{I_k} Y_{ijk} \tag{18}$$
$$s_j^2 = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{I_k - 1} \sum_{i=1}^{I_k} (Y_{ijk} - \bar{y}_j)^2 \tag{19}$$
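Eqs. (18) and (19) average cluster-level quantities rather than pooling all cells, which matters when cluster sizes differ. The following sketch computes both statistics for one gene from a list of per-cluster count vectors; the data layout, function names, and toy counts are assumptions made for illustration.

```python
def gene_sample_mean(counts_by_cluster):
    """Eq. (18): average of the per-cluster means for one gene."""
    n_clusters = len(counts_by_cluster)
    return sum(sum(c) / len(c) for c in counts_by_cluster) / n_clusters

def gene_sample_variance(counts_by_cluster):
    """Eq. (19): per-cluster squared deviations from the gene sample mean,
    each divided by (cluster size - 1), then averaged over clusters."""
    n_clusters = len(counts_by_cluster)
    y_bar = gene_sample_mean(counts_by_cluster)
    per_cluster = [sum((y - y_bar) ** 2 for y in c) / (len(c) - 1)
                   for c in counts_by_cluster]
    return sum(per_cluster) / n_clusters

# Toy UMI counts of one gene in two clusters (3 cells and 2 cells).
toy_counts = [[0, 2, 4], [1, 3]]
y_bar = gene_sample_mean(toy_counts)      # cluster means 2 and 2, so 2.0
s2 = gene_sample_variance(toy_counts)     # (8/2 + 2/1) / 2 = 3.0
```

Each cluster needs at least two cells for Eq. (19) to be defined, since the divisor is the cluster size minus one.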

The expected values of the gene sample mean and sample variance of the observed UMI counts can be derived under certain statistical assumptions. Specifically, we assume that the observed count data are drawn from the ZINB population model given in Eq. (4), that the transcriptional capture efficiencies are the same across genes, and that the model parameters for each gene are the same over the cells in different cell clusters, i.e.,

$$\begin{gathered} \mu_{1j1} = \cdots = \mu_{I_1 j 1} = \cdots = \mu_{I_K j K} = \mu_j; \\ \pi_{1j1} = \cdots = \pi_{I_1 j 1} = \cdots = \pi_{I_K j K} = \pi_j; \\ \theta_{1j1} = \cdots = \theta_{I_1 j 1} = \cdots = \theta_{I_K j K} = \theta_j; \\ p_{i1k} = p_{i2k} = \cdots = p_{iJk} = p_{ik} \end{gathered} \tag{20}$$

Now, the theoretical expression of expected value of the sample mean for jth gene can be derived as:

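$$E(\bar{y}_j) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{I_k} \sum_{i=1}^{I_k} E(Y_{ijk}) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{I_k} \sum_{i=1}^{I_k} E\{E(Y_{ijk} \mid Z_{ijk})\} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{I_k} \sum_{i=1}^{I_k} (1 - \pi_{ijk}) (\mu_{ijk} p_{ijk}) \tag{21}$$

Under the assumption of Eq. (20), the expected value of the sample mean for the jth gene (Eq. (21)) becomes Eq. (22).

$$E(\bar{y}_j) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{I_k} \sum_{i=1}^{I_k} (1 - \pi_j) \mu_j p_{ik} = \mu_j (1 - \pi_j) \frac{1}{K} \sum_{k=1}^{K} \frac{1}{I_k} \sum_{i=1}^{I_k} p_{ik} = \mu_j (1 - \pi_j)\, \bar{p}_{..} \tag{22}$$

The variance of the observed UMI data, $V(Y_{ijk})$ (Eq. (16)), under the assumption of Eq. (20), becomes:

$$V(Y_{ijk}) = (1 - \pi_j) \mu_j p_{ik} \left( 1 + \pi_j \mu_j p_{ik} + \mu_j p_{ik} \phi_j \right) \tag{23}$$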

Now, the variance of sample mean (Eq. (18)) can be obtained as shown in Eq. (24) under the assumption of Eq. (20).

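$$V(\bar{y}_j) = E(\bar{y}_j^2) - \{E(\bar{y}_j)\}^2 = \frac{\mu_j (1 - \pi_j)}{I} \left\{ 2 \bar{p}_{..} + \mu_j \phi_j\, \overline{p^2}_{..} \right\} + (1 - \pi_j)^2 \mu_j^2\, var(p_{ik}) \tag{24}$$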

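Let $s_j^2$ be the sample variance of the jth gene, as expressed in Eq. (19). Its expected value can be derived as follows.

$$E(s_j^2) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{I_k - 1} \sum_{i=1}^{I_k} \left\{ V(Y_{ijk}) + E(Y_{ijk})^2 \right\} - \frac{1}{K(K - 1)} \sum_{k \neq k' = 1}^{K} \frac{1}{I_k (I_k - 1)} \sum_{i \neq i' = 1}^{I_k} E(Y_{ijk})\, E(Y_{i'jk'}) = \mu_j \bar{p}_{..} + \mu_j^2 \phi_j\, \overline{p^2}_{..} + \mu_j^2\, var(p_{ik}) \tag{25}$$

where $\bar{p}_{..} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{I_k} \sum_{i=1}^{I_k} p_{ik}$, $\overline{p^2}_{..} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{I_k} \sum_{i=1}^{I_k} p_{ik}^2$, and $var(p_{ik})$ is the variance of $p_{ik}$. $I$ is the total number of cells, i.e., $I = \sum_{k=1}^{K} I_k$.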

Estimation of SwarnSeq model parameters

We have shown that the distributions of the sample means and variances of genes in experimental single-cell studies depend on gene-specific model parameters, which are unknown. It is therefore necessary to estimate them in order to obtain the distributions of gene-specific sample statistics and to perform other analyses, including DE analysis. Here, the parameters of the SwarnSeq model, given in Eqs. (10) and (11), were estimated from the observed UMI count data (adjusted for cell capture rates) under a Generalized Linear Model (GLM) framework. The observed UMI counts for the jth gene, $Y_{ijk}$, are modeled as a ZINB rv with parameters $\mu_j = (\mu_{1j1}, \ldots, \mu_{I_1 j 1}, \ldots, \mu_{I_2 j 2}, \ldots, \mu_{I_K j K})$, $\pi_j = (\pi_{1j1}, \ldots, \pi_{I_1 j 1}, \ldots, \pi_{I_2 j 2}, \ldots, \pi_{I_K j K})$, and $\theta_j = (\theta_{1j1}, \ldots, \theta_{I_1 j 1}, \ldots, \theta_{I_2 j 2}, \ldots, \theta_{I_K j K})$, and the GLMs in Eqs. (26)–(28) are used to model these parameters in the presence of cell-level covariates and cell cluster data.

$$\alpha_j = \log \mu_j = X \gamma_j + R w_j + C s_j + O_{\mu} \tag{26}$$
$$\tau_j = \mathrm{logit}\, \pi_j = X \beta_j + R u_j + C v_j + O_{\pi} \tag{27}$$
$$\omega_j = \log \theta_j \tag{28}$$

where $\mathrm{logit}(\pi_j) = \log\left( \frac{\pi_j}{1 - \pi_j} \right)$; $\alpha_j$, $\tau_j$ and $\omega_j$: I × 1 vectors of parameters for the jth gene; $X$: I × L design matrix providing group information (the first column consists of 1's to include the intercept term); L: number of cellular groups/types (cell clusters are divided into L cell groups if the cell group is unknown); $R$: I × K design matrix providing cell cluster information; $C$: I × C design matrix providing other cell-level auxiliary information, where the dimension C is the number of levels of the cell-level auxiliaries; $\gamma_j$ and $\beta_j$: L × 1 vectors of cellular group effects for the jth gene; $w_j$ and $u_j$: K × 1 vectors of cell cluster effects for the jth gene; $s_j$ and $v_j$: C × 1 vectors of effects of other cell-level covariates, including cell cycle, cell phase, etc., for the jth gene; $O_{\mu}$, $O_{\pi}$: offsets for $\mu_j$ and $\pi_j$, respectively.
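To illustrate how the linear predictors in Eqs. (26) and (27) map to the model parameters for a single cell, the sketch below applies the inverse log link and the inverse logit link to one row of a combined design matrix. The row layout (group columns, then cluster indicators, then one covariate), the coefficient values, and the function names are all hypothetical.

```python
import math

def linear_predictor(design_row, coefficients, offset=0.0):
    """Row of the stacked design [X | R | C] times stacked effects, plus offset."""
    return sum(x * b for x, b in zip(design_row, coefficients)) + offset

def inverse_log_link(eta):
    """Eq. (26): the mean is the exponential of its linear predictor."""
    return math.exp(eta)

def inverse_logit_link(eta):
    """Eq. (27): the zero-inflation probability is the logistic of its predictor."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical cell: intercept, group indicator, two cluster indicators,
# and one continuous covariate.
design_row = [1.0, 1.0, 0.0, 1.0, 0.5]
mean_effects = [math.log(4.0), math.log(2.0), 0.0, 0.0, 0.0]
zero_effects = [0.0, 0.0, 0.0, 0.0, 0.0]

mu_cell = inverse_log_link(linear_predictor(design_row, mean_effects))
pi_cell = inverse_logit_link(linear_predictor(design_row, zero_effects))
```

With these coefficients the reference group has mean 4 and the second group has twice that, so the group coefficient on the log scale plays the role of a log fold change.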

Expectation maximization (EM) algorithm

The parameters in Eqs. (26)–(28) for the jth gene, i.e., $\Omega_j = \{\alpha_j, \tau_j, \omega_j\}$, can be estimated by Maximum Likelihood Estimation (MLE). It is very difficult to obtain closed-form solutions that maximize the likelihood function given in Eq. (29), so we developed an EM algorithm to estimate the SwarnSeq model parameters. For simplicity, we omit the subscripts for cellular type/pseudo-time in the notation. For the EM algorithm, we recast the estimation procedure as a missing data problem by introducing a latent rv, $V_{ijk}$, defined in Eq. (30). The incomplete data likelihood function for the jth gene can be expressed as:

$$L(\Omega_j; Y_{ijk} = y_{ijk}) = \prod_{k=1}^{K} \prod_{i=1}^{I_k} \left\{ \pi_{ijk}\, \delta_0(y_{ijk}) + (1 - \pi_{ijk})\, f_{NB}(y_{ijk}) \right\} \tag{29}$$
$$V_{ijk} = \begin{cases} 1 \quad \text{if } Y_{ijk} \text{ comes from the zero component} \\ 0 \quad \text{if } Y_{ijk} \text{ comes from the count component} \end{cases} \tag{30}$$

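Now, the joint likelihood function for the complete data (in the presence of the latent variable), i.e., $(Y_{ijk}, V_{ijk})$, can be expressed as in Eq. (31):

$$L(\Omega_j; Y_{ijk}, V_{ijk}) = \prod_{k=1}^{K} \prod_{i=1}^{I_k} \left[ \left\{ \pi_{ijk} + (1 - \pi_{ijk}) \left( \frac{\theta_{ijk}}{\theta_{ijk} + \mu_{ijk}} \right)^{\theta_{ijk}} \right\}^{V_{ijk}} \left\{ (1 - \pi_{ijk}) \frac{G(y_{ijk} + \theta_{ijk})}{G(y_{ijk} + 1)\, G(\theta_{ijk})} \left( \frac{\theta_{ijk}}{\theta_{ijk} + \mu_{ijk}} \right)^{\theta_{ijk}} \left( \frac{\mu_{ijk}}{\theta_{ijk} + \mu_{ijk}} \right)^{y_{ijk}} \right\}^{1 - V_{ijk}} \right] \tag{31}$$

Then, the logarithm of the likelihood function in Eq. (31) becomes:

$$l(\Omega_j; Y_{ijk}, V_{ijk}) = \sum_{k=1}^{K} \sum_{i=1}^{I_k} V_{ijk} \log \left\{ \pi_{ijk} + (1 - \pi_{ijk}) \left( \frac{\theta_{ijk}}{\theta_{ijk} + \mu_{ijk}} \right)^{\theta_{ijk}} \right\} + \sum_{k=1}^{K} \sum_{i=1}^{I_k} (1 - V_{ijk}) \log \left\{ (1 - \pi_{ijk}) \frac{G(y_{ijk} + \theta_{ijk})}{G(y_{ijk} + 1)\, G(\theta_{ijk})} \left( \frac{\theta_{ijk}}{\theta_{ijk} + \mu_{ijk}} \right)^{\theta_{ijk}} \left( \frac{\mu_{ijk}}{\theta_{ijk} + \mu_{ijk}} \right)^{y_{ijk}} \right\} = l_1(\Omega_j; V_{ijk}) + l_2(\Omega_j; Y_{ijk}, V_{ijk}) \tag{32}$$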

where, l1(.): log-likelihood due to the zero-component of the model and l2(.): log-likelihood due to the count-component of the model. Further, the expected value of the log-likelihood function (Eq. (32)) can be obtained as:

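$$Q = E[l(\Omega_j; Y_{ijk} = y_{ijk}, V_{ijk})] = \sum_{k=1}^{K} \sum_{i=1}^{I_k} E(V_{ijk} \mid Y_{ijk}, \Omega_j) \log \left\{ \pi_{ijk} + (1 - \pi_{ijk}) \left( \frac{\theta_{ijk}}{\theta_{ijk} + \mu_{ijk}} \right)^{\theta_{ijk}} \right\} + \sum_{k=1}^{K} \sum_{i=1}^{I_k} w_{ijk} \log \left\{ (1 - \pi_{ijk}) \frac{G(y_{ijk} + \theta_{ijk})}{G(y_{ijk} + 1)\, G(\theta_{ijk})} \left( \frac{\theta_{ijk}}{\theta_{ijk} + \mu_{ijk}} \right)^{\theta_{ijk}} \left( \frac{\mu_{ijk}}{\theta_{ijk} + \mu_{ijk}} \right)^{y_{ijk}} \right\} \tag{33}$$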

The conditional expectations in Eq. (33) can be given as:

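$$E(V_{ijk} \mid Y_{ijk} = y_{ijk}, \Omega_j) = P[V_{ijk} = 1 \mid Y_{ijk}, \Omega_j] = \frac{\pi_{ijk} + (1 - \pi_{ijk}) \left( \frac{\theta_{ijk}}{\theta_{ijk} + \mu_{ijk}} \right)^{\theta_{ijk}}}{\pi_{ijk}\, \delta_0(y_{ijk}) + (1 - \pi_{ijk})\, f_{NB}(y_{ijk}; \mu_{ijk}, \theta_{ijk})} \tag{34}$$

The posterior probabilities, or conditional weights, in Eq. (33) that the observations originate from the count component of the model are given by:

$$w_{ijk} = 1 - E(V_{ijk} \mid Y_{ijk}, \Omega_j) = P[V_{ijk} = 0 \mid Y_{ijk}, \Omega_j] = \frac{(1 - \pi_{ijk})\, f_{NB}(y_{ijk}; \mu_{ijk}, \theta_{ijk})}{\pi_{ijk}\, \delta_0(y_{ijk}) + (1 - \pi_{ijk})\, f_{NB}(y_{ijk}; \mu_{ijk}, \theta_{ijk})} \tag{35}$$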

where, fNB(.) is the PMF of NB distribution given in Eq. (1).
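The count-component weight of Eq. (35) is simple to compute once the NB PMF is available. The sketch below evaluates it for a single observation; the function names and the example parameter values are assumptions for illustration only.

```python
import math

def nb_pmf(y, mu, theta):
    """NB PMF of Eq. (1), evaluated on the log scale."""
    log_p = (math.lgamma(y + theta) - math.lgamma(y + 1) - math.lgamma(theta)
             + theta * math.log(theta / (theta + mu))
             + y * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def count_component_weight(y, pi, mu, theta):
    """Eq. (35): posterior probability that y came from the NB count part."""
    point_mass = 1.0 if y == 0 else 0.0
    nb_part = (1.0 - pi) * nb_pmf(y, mu, theta)
    return nb_part / (pi * point_mass + nb_part)

# A zero count is shared between the two components ...
w_zero = count_component_weight(0, pi=0.5, mu=1.0, theta=1.0)
# ... while a non-zero count can only come from the count component.
w_positive = count_component_weight(4, pi=0.5, mu=1.0, theta=1.0)
```

For the zero count in this example, the NB part contributes 0.5 × 0.5 = 0.25 and the point mass contributes 0.5, so the weight is 0.25 / 0.75 = 1/3.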

E-step: The E-step of the EM algorithm evaluates the expected value of the complete-data log-likelihood (Eq. (33)), given the observed data and the current estimates of the parameters. In this approach, for each gene, the expected value of the log-likelihood is calculated given the observed data and the current estimates of the ZINB parameters. Let $\hat{\Omega}_j^{c} = \{\hat{\alpha}_j^{c}, \hat{\tau}_j^{c}, \hat{\omega}_j^{c}\}$ be the current estimate of the parameters; then the expected value of the log-likelihood (Eq. (33)) at step (c + 1), i.e., $Q^{c+1}$, is calculated. The conditional expectation at the cth step, i.e., $E(V_{ijk} \mid Y_{ijk}, \hat{\Omega}_j^{c})$ in Eq. (33), can be computed using Eq. (36).

$$E(V_{ijk} \mid Y_{ijk}, \hat{\Omega}_j^{c}) = \frac{\hat{\pi}_{ijk} + (1 - \hat{\pi}_{ijk}) \left( \frac{\hat{\theta}_{ijk}}{\hat{\theta}_{ijk} + \hat{\mu}_{ijk}} \right)^{\hat{\theta}_{ijk}}}{\hat{\pi}_{ijk}\, \delta_0(y_{ijk}) + (1 - \hat{\pi}_{ijk})\, f_{NB}(y_{ijk} \mid \hat{\mu}_{ijk}, \hat{\theta}_{ijk})} \tag{36}$$

M-step: Maximize $Q^{c+1}$ to update the parameter estimates.

(i) The parameters from the count component of the model, $\{\hat{\mu}_j, \hat{\theta}_j\}$, are updated within the GLM framework, as given in Eq. (37).

$$\log \mu_j = X \gamma_j + R w_j + C s_j + O_{\mu} \tag{37}$$

The updated estimates of the parameters at the (c + 1)th step are obtained by supplying the observation-wise weights, $\hat{w}_{ijk}^{(c)}$ (Eq. (35)), and the parameter estimates at the cth step. For this purpose, the glm.nb function in the MASS R package was used.

(ii) The zero-inflation probability, $\hat{\pi}_{ijk}$, is updated through a logistic regression, which can be expressed as:

$$\mathrm{logit}(\pi_j) = X \beta_j + R u_j + C v_j + O_{\pi} \tag{38}$$

The updated value of $\hat{\pi}_{ijk}$ at step (c + 1) is obtained by incorporating the observation-level weights, $\hat{w}_{ijk}^{(c)}$ (Eq. (35)), and the parameter estimates at the cth step. For this, the glm(…, family= ‘binomial’) function in the stats R package was used.

The above procedure is iterated until convergence is achieved; the detailed procedure can be found in [2]. It is important to note that, for some genes, the EM algorithm may fail to converge [8]; in such cases, we used the Nelder-Mead optimization algorithm [6], implemented in the optim function of the stats R package, to compute the MLEs of the parameters. The developed EM algorithm for estimating the SwarnSeq model parameters was applied to the considered experimental single-cell UMI data. The results are shown in Figs. 1 and 2, which also show the relations between the estimated values of the parameters for the genes.
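The E- and M-steps can be illustrated with a deliberately reduced version of the problem. The sketch below is not the SwarnSeq implementation: it assumes a single gene with no covariates (intercept-only mean and zero inflation), keeps the size parameter fixed at a hypothetical value, and replaces the glm.nb and glm fits with their closed-form intercept-only solutions. The E-step uses the count-component weights of Eq. (35). All names and the toy data are illustrative.

```python
import math

def nb_pmf(y, mu, theta):
    """NB PMF of Eq. (1), evaluated on the log scale (requires mu > 0)."""
    log_p = (math.lgamma(y + theta) - math.lgamma(y + 1) - math.lgamma(theta)
             + theta * math.log(theta / (theta + mu))
             + y * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def zinb_loglik(ys, pi, mu, theta):
    """Observed-data log-likelihood, i.e. the log of Eq. (29)."""
    total = 0.0
    for y in ys:
        point_mass = 1.0 if y == 0 else 0.0
        total += math.log(pi * point_mass + (1.0 - pi) * nb_pmf(y, mu, theta))
    return total

def em_zinb_fixed_theta(ys, theta, pi=0.5, max_iter=500, tol=1e-10):
    """Intercept-only EM for (pi, mu) with theta held fixed (a simplification).
    Assumes at least one non-zero count so that the mean stays positive."""
    mu = sum(ys) / len(ys)
    history = [zinb_loglik(ys, pi, mu, theta)]
    for _ in range(max_iter):
        # E-step: count-component weights of Eq. (35).
        weights = []
        for y in ys:
            point_mass = 1.0 if y == 0 else 0.0
            nb_part = (1.0 - pi) * nb_pmf(y, mu, theta)
            weights.append(nb_part / (pi * point_mass + nb_part))
        # M-step: with no covariates, the weighted fits reduce to averages.
        pi = 1.0 - sum(weights) / len(ys)
        mu = sum(w * y for w, y in zip(weights, ys)) / sum(weights)
        history.append(zinb_loglik(ys, pi, mu, theta))
        if history[-1] - history[-2] < tol:
            break
    return pi, mu, history

# Toy data: 60 zeros and 40 non-zero counts for one gene.
toy_counts = [0] * 60 + [1, 2, 3, 4, 5, 2, 3, 1, 4, 6] * 4
pi_hat, mu_hat, loglik_path = em_zinb_fixed_theta(toy_counts, theta=5.0)
```

Tracking the observed-data log-likelihood across iterations, as `loglik_path` does, is a cheap way to detect the non-convergence cases mentioned above, since a correct EM update should never decrease it.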

Fig. 1.

Relationship among the SwarnSeq model parameters with expected value of sample statistics. (A) Expected value vs. variance of the observed UMI counts. X-axis: log of the expected value of the observed UMI counts. Y-axis: log of the variance. (B) Expected value vs. Co-efficient of variation (CV) of the observed UMI counts. X-axis: log of the expected value of the observed UMI counts. Y-axis: log of CV. (C) Zero-inflation vs. CV of the observed UMI counts. X-axis: log of CV. Y-axis: log of zero-inflation. (D) CV vs. Dispersion. X-axis: log of the CV. Y-axis: log of Dispersion. (E) Variance vs. Zero-inflation observed UMI counts. X-axis: log of the variance. Y-axis: log of zero-inflation. (F) Variance of the observed UMI counts vs. Dispersion. X-axis: log of the variance. Y-axis: log of dispersion.

Fig. 2.

Parameters of the SwarnSeq model estimated through the EM algorithm. (A) Relationship between estimated values of mean with dispersion parameters of genes. X-axis: log of estimated values of means; Y-axis: log of estimated values of dispersions. (B) Relationship between estimated values of mean with zero-inflation parameters. X-axis: log of estimated values of means. Y-axis: log of estimated values of zero-inflation. (C) Relationship between estimated values of zero-inflation with dispersion parameters of genes. X-axis: log of estimated values of dispersion. Y-axis: log of estimated values of zero-inflation. (D) Relationship between estimated values of zero-inflation with observed zero proportions of genes. X-axis: observed means zero proportions. Y-axis: estimated values of zero-inflation parameters. (E) Relationship between observed zero proportions with difference between observed and true proportion of zeros of genes. X-axis: observed means zero proportions. Y-axis: difference between observed and true proportion of zeros. (F) Relation between true and dropout zeros. X-axis: dropout zero probability. Y-axis: true zero probability.

Cell capture rate estimation

The distributions of the observed scRNA-seq UMI counts (Eqs. (10)–(16)) and of the sample statistics, including the sample mean and variance (Eqs. (22)–(25)), depend on the value of the cell-specific capture rate parameter, $p_{ijk}$. However, it is extremely difficult to estimate the cell capture rate parameters within the EM-based estimation procedure. Hence, an analytical technique for estimating the cell capture rate parameters is discussed here. For computational simplicity, we assume that the cell-specific capture rate parameters are the same across all genes, i.e., $p_{i1k} = p_{i2k} = \cdots = p_{iJk} = p_{ik}$.

Case 1: External RNA spike-ins data available

Suppose n RNA spike-ins are added to each cell's lysate and the spike-in transcripts are processed in parallel. This process results in a set of UMI counts for the spike-in transcripts. Let $C_1, C_2, \ldots, C_u, \ldots, C_n$ be the respective mRNA concentrations of the n spike-in transcripts added to the ith ($i = 1, 2, \ldots, I_k$) cell of the kth ($k = 1, 2, \ldots, K$) cell cluster, and let $R_{i1k}, R_{i2k}, \ldots, R_{iuk}, \ldots, R_{ink}$ be the observed UMI counts of the n spike-in transcripts for the ith cell; here, $C_u$ and $R_{iuk}$ are the molecular concentration and UMI count of the uth spike-in transcript. The transcriptional capture rate for the ith cell in the kth cell cluster can then be estimated through the linear regression equation given in Eq. (40).

$$R_{iuk} = p_{ik0} + p_{ik} C_u + \epsilon_u \tag{40}$$

where $p_{ik0}$ is the intercept and $\epsilon_u$ is the random error for the uth spike-in transcript, assumed to follow a Gaussian distribution with zero mean and unit variance. The estimated regression coefficient, $\hat{p}_{ik}$, is the estimate of the capture rate for the ith cell in the kth cell cluster.
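For one cell, the regression in Eq. (40) is an ordinary least-squares fit of spike-in UMI counts on spike-in concentrations, with the slope taken as the capture rate estimate. The sketch below shows that calculation with closed-form formulas; the function name and the toy concentrations and counts are hypothetical.

```python
def capture_rate_from_spikeins(concentrations, umi_counts):
    """Least-squares fit of Eq. (40) for one cell.
    Returns (intercept, slope); the slope estimates the capture rate."""
    n = len(concentrations)
    c_mean = sum(concentrations) / n
    r_mean = sum(umi_counts) / n
    sxx = sum((c - c_mean) ** 2 for c in concentrations)
    sxy = sum((c - c_mean) * (r - r_mean)
              for c, r in zip(concentrations, umi_counts))
    slope = sxy / sxx
    intercept = r_mean - slope * c_mean
    return intercept, slope

# Toy spike-in data for one cell, constructed to lie exactly on a line
# with intercept 1 and slope 0.05 (molecules added vs. UMIs observed).
toy_concentrations = [10.0, 50.0, 100.0, 500.0]
toy_spike_counts = [1.5, 3.5, 6.0, 26.0]
intercept_hat, capture_rate_hat = capture_rate_from_spikeins(
    toy_concentrations, toy_spike_counts)
```

In practice the fit would be repeated for every cell, giving one slope per cell.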

Case 2: RNA spike-ins data not available

In most cases, spike-in data are not readily available to researchers in single-cell experimental studies. In such situations, the observed cell library sizes [7] can be used to compute the cell-specific capture rates empirically. The procedure is as follows.

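Let $(\rho_1, \rho_2)$ be the range of cell capture rates and $S_{ik}$ be the library size of the ith cell in the kth cell cluster, and

$$L_{ik} = \log_{10}(S_{ik}) \quad \forall\, i, k \tag{41}$$
$$\hat{p}_{ik} = \rho_1 + (\rho_2 - \rho_1) \frac{L_{ik} - L_{min}}{L_{max} - L_{min}} \tag{42}$$

where $L_{min}$ and $L_{max}$ in Eq. (42) are given in Eq. (43).

$$L_{min} = \min_{i,k} L_{ik} \quad \text{and} \quad L_{max} = \max_{i,k} L_{ik} \tag{43}$$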
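Eqs. (41)–(43) amount to a min-max rescaling of the log library sizes onto the chosen capture rate range. A minimal sketch follows; the function name, the example library sizes, and the range (0.05, 0.25) are assumptions for illustration, and the check for identical library sizes is an addition of the sketch rather than part of the described procedure.

```python
import math

def capture_rates_from_library_sizes(library_sizes, rho1, rho2):
    """Eqs. (41)-(43): rescale log10 library sizes onto [rho1, rho2]."""
    logs = [math.log10(s) for s in library_sizes]
    l_min, l_max = min(logs), max(logs)
    if l_max == l_min:
        # Eq. (42) is undefined when every cell has the same library size.
        raise ValueError("library sizes must not all be equal")
    return [rho1 + (rho2 - rho1) * (l - l_min) / (l_max - l_min)
            for l in logs]

# Hypothetical library sizes (total UMIs per cell) and capture rate range.
toy_library_sizes = [1000, 10000, 100000]
toy_rates = capture_rates_from_library_sizes(toy_library_sizes, 0.05, 0.25)
```

The cell with the smallest library receives the lower bound of the range and the cell with the largest library receives the upper bound, so the choice of the range directly determines the scale of all downstream capture-adjusted quantities.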

The above procedure for estimating the cell capture rate parameters was illustrated on the example single-cell dataset for both cases, (1) RNA spike-in data available and (2) RNA spike-in data not available, and the results are shown in Fig. 3.

Fig. 3.

Relationship between the cell specific parameters. (A) Distribution of cell library sizes. X-axis represents the cell ranks; Y-axis represents the cell library sizes. Relationship of cell library sizes with ranks of the cells is s-shaped sigmoid curve. (B) Distribution of cell library sizes with zero counts % in cells. X-axis represents the cell library sizes; Y-axis represents with the zero counts % in cells. Cells with lower library sizes have higher proportions of zero counts as genes expression and vice-versa. (C) Relationship of cell capture rates with cell ranks. Here, the cell capture rates are estimated from the external RNA spike-in data. (D) Relationship of cells’ captures rates (estimated from the UMI data) with cell library sizes. The relationship between the capture rates with cell library sizes is bell-shaped. It means the cells with higher library sizes have better cell capture rates and vice-versa. (E) Relationship between mean of non-zero counts and zero counts % in cells. X-axis represents the zero counts % in cells; Y-axis represents the mean of non-zero UMI counts. The relation is inversely proportional, i.e., cells with higher zero % have lower mean UMI counts and vice-versa. (F) Relationship between capture rates and zero counts % in cells. X-axis represents the zero counts % in cells; Y-axis represents the cell capture rates.

Estimated values of parameters from SwarnSeq model

Let $(\hat{\pi}_j, \hat{\theta}_j, \hat{\mu}_j)$ be the MLEs of the parameters for the jth gene obtained through the EM algorithm, $\hat{p}_{ik}$ be the estimate of the cell capture rate for the ith cell, and $\bar{\hat{p}}$ be the average of the cell capture rate estimates over all the cells. The estimated values of different statistics for the jth gene, including the expected value of the sample mean, the sample variance, the standard error, and the coefficient of variation, can then be obtained as in Eqs. (44)–(48). These formulae were applied to the considered experimental single-cell data to estimate the distribution of the sample means of genes, and the results are shown in Fig. 4.

Fig. 4.

Sample mean and variance of the observed UMI counts of the genes. (A) Expected value vs. variance of sample mean plot. X-axis: Expected value of sample mean; Y-axis: Variance of sample mean. (B) Expected value of sample mean vs. expected value of sample variance plot. X-axis: Expected value of sample mean; Y-axis: Expected value of sample variance. (C) Expected value of sample mean vs. CV of the sample mean plot. X-axis: Expected value of sample mean; Y-axis: CV of sample mean. (D) Expected value of sample mean vs. standard error of sample mean plot. X-axis: Expected value of sample mean; Y-axis: standard error of sample mean. (E) Variance of sample mean vs. expected value of sample variance plot. X-axis: Expected value of variance of sample mean; Y-axis: Expected value of sample variance. (F) CV of sample mean vs. expected value of sample variance. X-axis: CV of sample mean; Y-axis: Expected value of sample variance.

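The expression for the estimated value of the sample mean is given in Eq. (44).

$$\hat{E}(\bar{y}_j) = \hat{\mu}_j (1 - \hat{\pi}_j)\, \bar{\hat{p}} \tag{44}$$

The expression for the estimated variance of the sample mean for the jth gene is given in Eq. (45).

$$\hat{V}(\bar{y}_j) = \frac{\hat{\mu}_j (1 - \hat{\pi}_j)}{I} \left( 2 \bar{\hat{p}} + \hat{\mu}_j \hat{\phi}_j\, \overline{\hat{p}^2} \right) + (1 - \hat{\pi}_j)^2 \hat{\mu}_j^2\, var(\hat{p}) \tag{45}$$

The expression for the estimate of the expected value of the sample variance of the jth gene is shown in Eq. (46).

$$\hat{E}(s_j^2) = \hat{\mu}_j \bar{\hat{p}} + \hat{\mu}_j^2 \hat{\phi}_j\, \overline{\hat{p}^2} + \hat{\mu}_j^2\, var(\hat{p}) \tag{46}$$

The estimated coefficient of variation for the sample mean of the jth gene is expressed in Eq. (47).

$$\widehat{CV}(\bar{y}_j) = \frac{\widehat{sd}(\bar{y}_j)}{\hat{E}(\bar{y}_j)} \tag{47}$$

where $\widehat{sd}(\bar{y}_j) = +\sqrt{\hat{V}(\bar{y}_j)}$.

The estimated standard error (SE) of the sample mean for the jth gene can be expressed as in Eq. (48).

$$\widehat{SE}(\bar{y}_j) = \widehat{sd}(\bar{y}_j) / \sqrt{I} \tag{48}$$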
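The plug-in formulae for the expected sample mean, the variance of the sample mean, the expected sample variance, and the coefficient of variation can be gathered into one helper, as sketched below. The function name and input values are hypothetical. Two assumptions are made loudly: the averages of the capture rates are taken over all cells, and the variance of the capture rates is computed with divisor equal to the number of cells, since the text does not specify which form is used.

```python
import math

def sample_statistic_estimates(mu_hat, pi_hat, phi_hat, capture_rates):
    """Plug-in estimates for one gene from its fitted parameters and the
    estimated capture rates of all cells (see assumptions in the lead-in)."""
    n_cells = len(capture_rates)
    p_bar = sum(capture_rates) / n_cells
    p2_bar = sum(p * p for p in capture_rates) / n_cells
    # Assumed form: variance with divisor n_cells (not specified in the text).
    var_p = sum((p - p_bar) ** 2 for p in capture_rates) / n_cells

    expected_mean = mu_hat * (1.0 - pi_hat) * p_bar
    var_of_mean = (mu_hat * (1.0 - pi_hat) / n_cells
                   * (2.0 * p_bar + mu_hat * phi_hat * p2_bar)
                   + (1.0 - pi_hat) ** 2 * mu_hat ** 2 * var_p)
    expected_s2 = (mu_hat * p_bar + mu_hat ** 2 * phi_hat * p2_bar
                   + mu_hat ** 2 * var_p)
    cv_of_mean = math.sqrt(var_of_mean) / expected_mean
    return {"expected_mean": expected_mean, "var_of_mean": var_of_mean,
            "expected_s2": expected_s2, "cv_of_mean": cv_of_mean}

# Hypothetical gene: mean 10, zero inflation 0.2, dispersion 0.5,
# and four cells that all have a 10% capture rate.
toy_estimates = sample_statistic_estimates(10.0, 0.2, 0.5, [0.1, 0.1, 0.1, 0.1])
```

With identical capture rates the variance term of the capture rates vanishes, which makes the example easy to verify by hand: the expected mean is 10 × 0.8 × 0.1 = 0.8 and the variance of the mean is 2 × (0.2 + 0.05) = 0.5.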

Determination of optimum number of cell clusters

A major downstream analysis for scRNA-seq data is cluster analysis, which is extensively used for detecting various cell types [2,3]. For this purpose, the k-means clustering technique is implemented in various single-cell analysis tools. However, little work has been done on determining the optimum number of cell clusters into which the cells present in the scRNA-seq data should be grouped. In addition, the SwarnSeq model requires cell cluster information to model the observed UMI counts of the genes. Therefore, we report an algorithm to determine the optimum number of cell clusters based on the observed UMI count data, which is given as follows.

Let $Y_{ik}$ be the mean expression value of the ith cell in the kth cell cluster, $\bar{Y}_{.k}$ be the mean expression value of the kth cell cluster, and $\bar{Y}_{..}$ be the overall mean.

Then, Total Sum of Squares (TSS) can be expressed as:

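$$TSS = \sum_{k=1}^{K} \sum_{i=1}^{I_k} (Y_{ik} - \bar{Y}_{..})^2 = \sum_{k=1}^{K} \sum_{i=1}^{I_k} (Y_{ik} - \bar{Y}_{.k})^2 + \sum_{k=1}^{K} I_k (\bar{Y}_{.k} - \bar{Y}_{..})^2 = WSS + BSS \tag{49}$$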

where, WSS: Within cluster sum of squares, BSS: Between cluster sum of squares.

Now, the proposed index to decide the optimum number of cell clusters can be expressed in Eq. (50).

$$r_h = \frac{WSS}{BSS} \tag{50}$$

where $r_h > 0$ is the index value at h cell clusters.

In our algorithm, the clustering indices ($r_h$) were computed for different values of h (≥ 2) using the observed scRNA-seq UMI count data. The h value that provides the maximum value of $r_h$ can then be chosen as the estimate of the optimum number of cell clusters for that scRNA-seq dataset. Alternatively, the optimum value of h can be obtained graphically by plotting h vs. $r_h$ and choosing the point on the x-axis where the curve flattens. The algorithm for this technique is given in Fig. 5 and is implemented in the optimcluster function of the SwarnSeq R package. This algorithm was applied to the considered experimental single-cell data to demonstrate its utility, and the results are shown in Fig. 5. For instance, in the cluster index vs. cluster number plot, the curve has its inflexion point at eight clusters, which means that the 576 cells present in the data can be grouped into eight optimal cell clusters (Fig. 5B). The cluster-wise distribution of cells is also shown (Fig. 5C).
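The sum-of-squares decomposition and the clustering index can be computed directly from per-cell mean expression values and cluster labels. The sketch below does this for one candidate clustering and reads the index as the within-cluster sum of squares divided by the between-cluster sum of squares; it does not reproduce the optimcluster function. The labels would come from any clustering method (for example k-means), and the function name and toy data are hypothetical.

```python
def cluster_index(cell_means, labels):
    """Returns (WSS, BSS, WSS / BSS) for one clustering of the cells."""
    overall_mean = sum(cell_means) / len(cell_means)
    # Group the per-cell mean expression values by cluster label.
    groups = {}
    for value, label in zip(cell_means, labels):
        groups.setdefault(label, []).append(value)
    wss = 0.0
    bss = 0.0
    for members in groups.values():
        cluster_mean = sum(members) / len(members)
        wss += sum((v - cluster_mean) ** 2 for v in members)
        bss += len(members) * (cluster_mean - overall_mean) ** 2
    return wss, bss, wss / bss

# Toy per-cell mean expression values with two well separated clusters.
toy_cell_means = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
toy_labels = [0, 0, 0, 1, 1, 1]
wss, bss, index = cluster_index(toy_cell_means, toy_labels)
tss = sum((v - 5.0) ** 2 for v in toy_cell_means)  # overall mean is 5
```

Repeating the calculation for clusterings with 2, 3, 4, ... clusters gives the sequence of index values that the plot of index against cluster number is based on.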

Fig. 5.

Schematic layout of cluster analysis in SwarnSeq method. (A) Flowchart for cell cluster number determination algorithm. (B) Determination of the optimum number of cell cluster for the experimental single-cell data. X-axis: Number of cell clusters; Y-axis: Clustering indices for every cell cluster. (C) Distribution of the cells across the cell clusters.

Differential expression analysis of genes

In the SwarnSeq approach, the mean parameter of each gene depends on the cellular groups (Eq. (26)). Factors such as cell clusters and cell covariates are included in the model (Eq. (26)) to remove their unwanted effects on the mean of genes. For DE analysis of genes, two-group comparisons are made, and the model in Eq. (26) can be expanded as:

$$\log(\mu_{ijk}) = \gamma_{0j} + \gamma_{1j} x_{ijk} + w_{j1} r_{ij1} + \cdots + w_{jK} r_{ijK} + s_{j1} c_{1ij} + \cdots + s_{jM} c_{Mij} + O_{\mu j} \tag{51}$$

where $x_{ijk}$: binary indicator for cellular group membership; $\gamma_{0j}$ (intercept term): logarithm of the mean parameter for the jth gene in the reference cellular group; $\gamma_{1j}$: log fold change parameter for the jth gene; $w_{jk}$: regression coefficient of the kth cell cluster for the jth gene; $r_{ijk}$: indicator variable for membership of the ith cell in the kth cluster for the jth gene; $s_{jm}$: regression coefficient of the mth (m = 1, 2, …, M) cell covariate for the jth gene; $c_{mij}$: indicator variable for the mth covariate of the ith cell for the jth gene; and $O_{\mu j}$: offset term.

To test statistically whether the jth gene is differentially expressed across the cellular groups, the following hypotheses are tested.

$$H_{0}: \gamma_{1j} = 0 \quad \text{vs.} \quad H_{1}: \gamma_{1j} \neq 0$$

The above test can be performed using the Likelihood Ratio Test (LRT) statistic given in Eq. (52).

$$DS_{j} = -2\left\{ l(\Omega_{j} = \hat{\Omega}_{j0}) - l(\Omega_{j} = \hat{\Omega}_{j}) \right\} \tag{52}$$

where $DS_{j}$ is the LRT statistic of the jth gene; $\hat{\Omega}_{j0}$ is the MLE of $\Omega_{j}$ for the jth gene under the constraint of H0; and $\hat{\Omega}_{j}$ is the unconstrained MLE of $\Omega_{j}$ for the jth gene. Under H0, the test statistic $DS_{j}$ follows a Chi-square distribution with 1 degree of freedom (for 2 groups). Based on this distribution, the p-value for the jth gene was computed, and the procedure was repeated for all the genes. The adjusted p-values and FDRs for the genes were then computed to account for multiple hypothesis testing. These statistical methods of DE analysis were illustrated on the considered single-cell dataset [1], and the results are shown in Fig. 6. The volcano plot of the genes obtained through DE analysis is shown in Fig. 6A. The DE analysis identified 274 genes as differentially expressed between the NA19101 and NA19239 cell groups (Fig. 6A) for the considered data.
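The testing step can be sketched as follows, assuming the maximized log-likelihoods under the constrained and unconstrained models are already available for each gene. The LRT statistic is computed as twice the gain in maximized log-likelihood, and the p-value uses the identity P(X > x) = erfc(√(x/2)) for a Chi-square variable with 1 degree of freedom. The text does not name the multiple-testing procedure, so the Benjamini-Hochberg adjustment used here is an assumption, and the function names lrt_pvalue and bh_adjust are hypothetical, not part of the SwarnSeq package.

```python
import math
from typing import List


def lrt_pvalue(loglik_null: float, loglik_full: float) -> float:
    """P-value of a 1-df likelihood ratio test from two maximized log-likelihoods."""
    # Clamp at zero to guard against tiny negative values from numerical optimisation.
    stat = max(0.0, 2.0 * (loglik_full - loglik_null))
    # Chi-square survival function with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


def bh_adjust(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (assumed adjustment method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical (null, full) log-likelihood pairs for three genes.
logliks = [(-520.4, -510.1), (-300.0, -299.8), (-812.7, -809.9)]
pvals = [lrt_pvalue(l0, l1) for l0, l1 in logliks]
print(bh_adjust(pvals))
```

The same two functions apply unchanged to any other 1-df likelihood ratio test on a per-gene basis.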

Fig. 6.


Key analytical results obtained through the SwarnSeq model. (A) Volcano plot of the differential expression analysis results. X-axis: log2 transformation of the fold change values of genes; Y-axis: -log10 transformation of the p-values computed through the SwarnSeq model. Red: genes with both -log10 p-value > 20 and |log2FC| > 3; blue: genes with -log10 p-value > 20; green: genes with |log2FC| > 3; black: non-significant genes. (B) Volcano plot of the differential zero-inflation analysis results. X-axis: log2 transformation of the fold change values of genes; Y-axis: -log10 transformation of the p-values computed through the SwarnSeq model. Red: genes with both -log10 p-value > 7 and |log2FC| > 2; blue: genes with -log10 p-value > 7; green: genes with |log2FC| > 2; black: non-significant genes. (C) Schematic representation of the classification of key genes detected through the SwarnSeq model. DE: differentially expressed; DZI: differentially zero-inflated; DEZI: both differentially expressed and differentially zero-inflated; Non-DE: non-differentially expressed; non-DZI: non-differentially zero-inflated. (D) Illustration of the SwarnSeq method for classification of influential genes. Numbers in the cells give the number of genes belonging to each category; (.): classes of the genes.

Differential zero inflation analysis of genes

It is well established in the literature that genes in scRNA-seq data are highly zero-inflated (i.e., they contain both biological and dropout zeros) owing to the nature of single-cell studies and to several technical and biological factors [2], [3], [4], [5]. It is therefore important to identify the genes whose numbers of zero counts differ between the two cellular groups. For this purpose, the SwarnSeq method performs a zero-inflation analysis of the genes across the two cell groups and detects such genes for further study. In the SwarnSeq model, the zero-inflation parameters of genes depend on the cellular groups through the model given in Eq. (27). Factors such as cell clusters and other cell-level auxiliaries are also included in the model to remove unwanted confounding effects from the zero-inflation probabilities of genes. For Differential Zero Inflation (DZI) analysis of genes, a two-group comparison is made and the model in Eq. (27) can be written as:

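$$\operatorname{logit}(\pi_{ijk}) = \beta_{0j} + \beta_{1j} x_{ijk} + u_{j1} r_{ij1} + \cdots + u_{jK} r_{ijK} + v_{1j} c_{1ij} + \cdots + v_{Mj} c_{Mij} + O_{\pi j} \tag{53}$$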

where $x_{ijk}$ is the binary indicator of cellular group membership; $\beta_{0j}$ is the intercept term for the jth gene (reference cellular group); $\beta_{1j}$ is the log fold change (zero inflation) parameter of the jth gene; $u_{jk}$ is the regression co-efficient of the kth cell cluster for the jth gene; $r_{ijk}$ is the indicator variable for membership of the ith cell in the kth cluster for the jth gene; $v_{mj}$ is the regression co-efficient of the mth (m = 1, 2, …, M) cell co-variate for the jth gene; $c_{mij}$ is the indicator variable for the mth co-variate of the ith cell for the jth gene; and $O_{\pi j}$ is the offset term.
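As a small worked illustration of the logit model in Eq. (53), the Python sketch below converts a linear predictor into a zero-inflation probability with the inverse logit. The function names and all coefficient values are hypothetical and are not taken from the SwarnSeq package. Because the link is the logit, exp(β1j) corresponds to the ratio of the odds of zero inflation between the two groups when all other terms are held fixed, which the last line checks numerically.

```python
import math
from typing import Sequence


def inv_logit(eta: float) -> float:
    """Numerically stable inverse of logit(p) = log(p / (1 - p))."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def zero_inflation_prob(
    beta0: float, beta1: float, x: int,
    u: Sequence[float], r: Sequence[int],
    v: Sequence[float], c: Sequence[float],
    offset: float = 0.0,
) -> float:
    """Evaluate pi = inv_logit(linear predictor) for one cell and one gene."""
    if len(u) != len(r) or len(v) != len(c):
        raise ValueError("coefficient and indicator vectors must match in length")
    eta = beta0 + beta1 * x                        # intercept and group effect
    eta += sum(uk * rk for uk, rk in zip(u, r))    # cell-cluster terms
    eta += sum(vm * cm for vm, cm in zip(v, c))    # cell co-variate terms
    return inv_logit(eta + offset)


def odds(p: float) -> float:
    return p / (1.0 - p)


# Hypothetical coefficients for a single gene (illustration only).
beta0, beta1 = -1.0, 0.7
u, r = [0.2, -0.4], [1, 0]   # cell belongs to the first of K = 2 clusters
v, c = [0.1], [0]            # one co-variate, absent for this cell

pi_ref = zero_inflation_prob(beta0, beta1, 0, u, r, v, c)  # reference group
pi_alt = zero_inflation_prob(beta0, beta1, 1, u, r, v, c)  # other group
print(odds(pi_alt) / odds(pi_ref), math.exp(beta1))  # the two values agree
```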

To decide statistically whether the jth gene is DZI, the following hypotheses are tested.

$$H_{10}: \beta_{1j} = 0 \quad \text{vs.} \quad H_{1}: \beta_{1j} \neq 0$$

The above test can be performed using the LRT statistic given in Eq. (54).

$$DZ_{j} = -2\left\{ l(\Omega_{j} = \hat{\Omega}_{j0}) - l(\Omega_{j} = \hat{\Omega}_{j}) \right\} \tag{54}$$

where $DZ_{j}$ is the DZI LRT statistic of the jth gene; $\hat{\Omega}_{j0}$ is the MLE of $\Omega_{j}$ under the constraint $\beta_{1j} = 0$; and $\hat{\Omega}_{j}$ is the unconstrained MLE of $\Omega_{j}$. Under the null hypothesis, $DZ_{j}$ has a Chi-square distribution with 1 degree of freedom (for a 2-group comparison) for every j. The adjusted p-values and FDRs for the DZI analysis were computed for all the genes after adjusting for multiple hypothesis testing through the SwarnSeq method. These statistical methods of DZI analysis were illustrated on the considered scRNA-seq data of Tung et al. [1]. The volcano plot of the genes obtained through the DZI analysis is shown in Fig. 6B. The results indicated that 243 genes were identified as differentially zero-inflated between the NA19101 and NA19239 cell groups (Fig. 6B). In other words, for 243 genes the number of zero counts differs significantly between the NA19101 and NA19239 cell groups.

Classification of detected influential genes

DE and DZI analyses are two major downstream analytical procedures in single-cell experimental studies. It is therefore of interest to identify the genes that are both differentially expressed and differentially zero-inflated across the cellular groups. For this purpose, the SwarnSeq method classifies the detected influential genes into different classes based on the DE and DZI analyses, as shown in Fig. 6. The test of $H_{0}: \gamma_{1j} = 0$ detects the genes that are differentially expressed, while the test of $H_{10}: \beta_{1j} = 0$ detects the genes that are differentially zero-inflated across the cellular groups. The first class consists of the genes for which both H0 and H10 are rejected. For these genes, the number of cells with zero expression differs significantly across the cellular groups, and the (non-zero) expressions in the remaining cells also differ significantly. This group of influential genes is termed ‘DEZI’ genes (Fig. 6). The second class consists of the genes for which H0 is rejected but H10 is not. For these genes, there is no significant difference in the number of cells with zero expression across the cellular groups, but the genes are differentially expressed. We call this group the ‘DE’-only class (Fig. 6). The third class (DZI only) consists of the genes for which H10 is rejected but H0 is not (Fig. 6). For these genes, the number of cells with zero expression differs significantly across the two cellular groups, but the (non-zero) expressions in the remaining cells show no significant difference. The utility of the SwarnSeq method for classification of the detected influential genes in an scRNA-seq study was demonstrated on one real single-cell dataset, and the results are shown in Fig. 6.
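The classification rule described above reduces to a two-way decision on the outcomes of the DE and DZI tests. The Python sketch below illustrates it under stated assumptions: the inputs are taken to be adjusted p-values from the two tests, the significance threshold alpha = 0.05 is an assumed default that the text does not specify, and the function name classify_gene and the example values are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def classify_gene(de_adj_p: float, dzi_adj_p: float, alpha: float = 0.05) -> str:
    """Assign a gene to a class from its DE and DZI adjusted p-values."""
    de_rejected = de_adj_p < alpha      # H0 (no mean difference) rejected
    dzi_rejected = dzi_adj_p < alpha    # H10 (no zero-inflation difference) rejected
    if de_rejected and dzi_rejected:
        return "DEZI"
    if de_rejected:
        return "DE"
    if dzi_rejected:
        return "DZI"
    return "Non-DE/non-DZI"


# Hypothetical adjusted p-values (DE, DZI) for four genes, illustration only.
genes: Dict[str, Tuple[float, float]] = {
    "geneA": (0.001, 0.020),
    "geneB": (0.003, 0.400),
    "geneC": (0.700, 0.010),
    "geneD": (0.500, 0.900),
}
classes = {name: classify_gene(de_p, dzi_p) for name, (de_p, dzi_p) in genes.items()}
print(classes)
print(Counter(classes.values()))  # class sizes, analogous to the counts in Fig. 6D
```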

Conclusion

Statistical analysis of single-cell data in the presence of biological confounding factors (which lead to severe dropout events) is a challenging task. This paper therefore presents the statistical techniques implemented in SwarnSeq for various analyses of single-cell experimental datasets. These techniques include model fitting, EM algorithm-based estimation of model parameters, estimation of cell capture parameters, clustering and determination of the optimal number of cell clusters, the distribution of observed UMI counts of genes, the distribution of the sample mean and variance of genes, differential expression and differential zero-inflation analyses, and classification of genes. A real data example was given to illustrate all of the analytical techniques in SwarnSeq. The SwarnSeq method is intended to help experimental biologists and genome researchers perform these analyses on a single platform. In future, improved parameter estimation procedures, including Bayesian techniques, can be implemented in the SwarnSeq tool to estimate the gene-specific dispersion, which may enhance its performance. The SwarnSeq method assumes that factors such as cellular groups, cell clusters and other co-variates have fixed effects on the means and zero inflations. This assumption may not hold for single-cell data, as some biological factors may have random effects. Random or mixed effect models can therefore be implemented in the SwarnSeq method to improve its performance. The proposed approach is shown here with one application in single-cell data analytics, and it can be applied in other fields where the data are zero-inflated and overdispersed, such as studies of pest populations and sample surveys.

Submission type

Direct submission

CRediT authorship contribution statement

Samarendra Das: Conceptualization, Investigation, Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. Shesh N. Rai: Project administration, Supervision, Funding acquisition, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no competing interests.

Acknowledgments

Funding

Samarendra Das: Indian Council of Agricultural Research (ICAR), New Delhi, India (Netaji Subhas-ICAR International Fellowship, OM No. 18(02)/2016-EQR/Edn), ICAR-Indian Agricultural Statistics Research Institute (ICAR-IASRI), New Delhi, India.

Shesh N. Rai: Clinical Trial Research Fund (Wendell Cherry Chair), JG Brown Cancer Center, USA; multiple National Institutes of Health (NIH), USA grants (5P20GM113226, PI: McClain; 1P42ES023716, PI: Srivastava; 5P30GM127607-02, PI: Jones; 1P20GM125504-01, PI: Lamont; 2U54HL120163, PI: Bhatnagar/Robertson; 1P20GM135004, PI: Yan; 1R35ES0238373-01, PI: Cave; 1R01ES029846, PI: Bhatnagar; 1P30ES030283, PI: States); Kentucky Council on Postsecondary Education grant, USA (PON2 415 1900002934, PI: Chesney)

Availability of data and materials

The UMI counts, ERCC spike-ins and molecular concentration datasets were taken from the GitHub repository (https://github.com/jdblischak/singleCellSeq). The R software package for the SwarnSeq method is available at https://github.com/sam-uofl/SwarnSeq.

Acknowledgment

The authors duly acknowledge the help and support obtained from the Education Division, ICAR, New Delhi, India and ICAR-IASRI, New Delhi, India. The authors would also like to thank the anonymous reviewers, whose comments helped in deepening the understanding and improving the quality of the research presented in the original paper.

Contributor Information

Samarendra Das, Email: samarendra.das@louisville.edu.

Shesh N. Rai, Email: shesh.rai@louisville.edu.

References

  • 1. Tung P.Y., Blischak J.D., Hsiao C.J., Knowles D.A., Burnett J.E., Pritchard J.K., et al. Batch effects and the effective design of single-cell gene expression studies. Sci. Rep. 2017;7:39921. doi: 10.1038/srep39921.
  • 2. Das S., Rai S.N. SwarnSeq: an improved statistical approach for differential expression analysis of single-cell RNA-seq data. Genomics. 2021. doi: 10.1016/j.ygeno.2021.02.014.
  • 3. Van den Berge K., Soneson C., Love M.I., Robinson M.D., Clement L. zingeR: unlocking RNA-seq tools for zero-inflation and single cell applications. bioRxiv. 2017. doi: 10.1101/157982.
  • 4. Ye C., Speed T.P., Salim A. DECENT: differential expression with capture efficiency adjustmeNT for single-cell RNA-seq data. Bioinformatics. 2019;35:5155–5162. doi: 10.1093/bioinformatics/btz453. Berger B, editor.
  • 5. Miao Z., Deng K., Wang X., Zhang X. DEsingle for detecting three types of differential expression in single-cell RNA-seq data. Bioinformatics. 2018;34:3223–3224. doi: 10.1093/bioinformatics/bty332. Berger B, editor.
  • 6. Dempster A.P., Laird N.M., Rubin D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B. 1977;39:1–22. doi: 10.1111/j.2517-6161.1977.tb01600.x.
  • 7. Van den Berge K., Perraudeau F., Soneson C., Love M.I., Risso D., Vert J.P., et al. Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell applications. Genome Biol. 2018;19:24. doi: 10.1186/s13059-018-1406-4.
  • 8. McKinnon K.I.M. Convergence of the Nelder-Mead simplex method to a nonstationary point. SIAM J. Optim. 1998. doi: 10.1137/S1052623496303482.
  • 9. Ziegenhain C., Vieth B., Parekh S., Reinius B., Guillaumet-Adkins A., Smets M., et al. Comparative analysis of single-cell RNA sequencing methods. Mol. Cell. 2017. doi: 10.1016/j.molcel.2017.01.023.


Articles from MethodsX are provided here courtesy of Elsevier
