2022 Jul 18;24(7):995. doi: 10.3390/e24070995

Table 2.

Classes of statistical approaches and tools extensively used in DEA of scRNA-seq data.

SN.	Class	Features	Limitations	Tools
1	GLM	Gene expression can follow any distribution in the exponential family. Suitable for bi-modal data. Able to deal with categorical predictors, e.g., cell type, cell cycle, etc. Easy to interpret and gives a clear understanding of how each predictor influences the gene parameters. Can be generalized to multi-cell group comparisons. Less susceptible to model over-fitting.	Strict exponential family distributional assumptions about the data. Needs relatively large datasets (with more predictors and a large number of cells). Sensitive to outliers. Sensitive to dropout events. Not suitable for lowly expressed genes. Cannot handle multi-modality of the data. ZIM–GLM approaches cannot handle zero-deflation at any level of a factor, which results in infinite parameter estimates for the logistic component. Higher computational cost, especially for large datasets.	NBID, ZingeR, ZINB–WaVE, DECENT, SwarnSeq, scMMST, TPMM, Tweedieverse
2	GAM	Predictor functions are automatically derived during model estimation. Marginal impact of a single variable does not depend on the values of the other variables in the model. Flexibility in choosing the type of functions, which helps in finding patterns missed by a parametric model. Allows controlling the smoothness of the predictor functions to prevent model over-fitting. Controlling the wiggliness of the predictor functions directly addresses the bias/variance tradeoff. Highly effective in many settings, particularly when the response variable is modeled as a function of both categorical (e.g., cell groups) and continuous predictors (e.g., cell-level auxiliary variables). Considers both linear and non-linear functions of cell-level predictors to model gene parameters. Each lineage is represented by a separate cubic smoothing spline, and its flexibility allows adjustment for other covariates or confounders as fixed effects in the model.	Approaches such as Monocle can only handle a single lineage of cells. Lacks interpretability when inferring differences in expression between lineages of cells. Assumes the effect of dropout events to be linear; however, this effect is likely to be non-linear, especially for genes with low to moderate expression. Computationally complex.	Monocle, Monocle2, Monocle3, tradeSeq
3	Hurdle Model	Accounts for excess zeros during model building. Can handle both zero-inflation and zero-deflation present in the data. Models the bimodality of the gene expression distribution.	Does not distinguish the process generating excess zeros from the one generating sampling zeros. Fails to consider the multi-modality of the gene expression distribution. Requires longer runtime.	MAST, Random Hurdle
4	Mixture-Model	Considers the bi-modal or multi-modal nature of single-cell data. Can differentiate between major sources of variation in single-cell data.	Certain approaches, including BPSC and SC2P, cannot account for zero-inflation in single-cell data. Mostly uses linear models for DEA, which is cumbersome. Higher runtime and computationally intensive.	SCDE, D3E, BPSC, BASiCS, DESCEND, SC2P, ZIAQ, ZIQRank, SimCD
5	Non-parametric (two-class)	Distribution-free approaches. Considers the multi-modality of the data. Computationally light (shorter runtime). Estimates the parameters without fitting any distribution for genes. Performs DEA with distance-like metrics across two cell types. Performs well when the proportion of zeros in the data is low.	Mostly focuses on comparing two cellular groups. Computationally complex for multi-group comparisons. Performance is severely affected by high dropout rates (some methods exclude dropouts). Cannot separate true/biological zeros from false/dropout zeros. Sensitive to sparsity. Methods such as D3E and scDD fail to consider the UMI count nature of the data. Cannot separate confounding factors from each other.	Wilcox, NODES, ROTS, EMDomics, ROSeq, SINCERA, sigEMD, DTWscore, SAMstrt
6	Parametric (two-class)	Easy to understand and execute. Shorter runtime. Particularly suitable for larger datasets.	Makes strict distributional assumptions about the data. Cannot generalize to multi-group comparisons. Ignores the multi-modal distributions of the scRNA-seq data. Sensitive to sparsity or dropout events. Cannot differentiate between the major sources of variability in the data.	scDD, DEsingle, t-test, NYMP, IDEAS
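
To make the contrast between three of the classes in Table 2 concrete, the following minimal Python sketch tests one simulated gene across two cell groups with a non-parametric test (Wilcoxon rank-sum), a parametric test (Welch's t-test), and a simple two-part, hurdle-style comparison. It is an illustration only and does not reproduce any tool listed in the table (MAST, for instance, fits a regression-based hurdle model). The simulated data, the dropout rate, the log1p transform, and all function names are assumptions introduced here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible


def simulate_gene(n_cells, mean, dropout):
    """Hypothetical simulator: negative binomial counts plus extra dropout zeros."""
    size = 2  # assumed dispersion parameter
    counts = rng.negative_binomial(n=size, p=size / (size + mean), size=n_cells)
    counts[rng.random(n_cells) < dropout] = 0  # randomly zero out some cells
    return counts


def two_class_tests(a, b):
    """Compare two groups with a non-parametric and a parametric test."""
    log_a, log_b = np.log1p(a), np.log1p(b)
    # Non-parametric (two-class): Wilcoxon rank-sum / Mann-Whitney U test
    _, p_rank = stats.mannwhitneyu(log_a, log_b, alternative="two-sided")
    # Parametric (two-class): Welch's t-test on log-transformed counts
    _, p_t = stats.ttest_ind(log_a, log_b, equal_var=False)
    return {"wilcoxon_rank_sum": float(p_rank), "welch_t_test": float(p_t)}


def hurdle_style_tests(a, b):
    """Two-part comparison: detection rate first, then expression level in
    expressing cells. A simplification of the hurdle idea, not a fitted model."""
    # Discrete part: is the gene detected (count > 0) more often in one group?
    table = [[np.sum(a > 0), np.sum(a == 0)],
             [np.sum(b > 0), np.sum(b == 0)]]
    _, p_detect = stats.fisher_exact(table)
    # Continuous part: expression level among cells with non-zero counts
    _, p_level = stats.ttest_ind(np.log1p(a[a > 0]), np.log1p(b[b > 0]),
                                 equal_var=False)
    return {"detection_rate": float(p_detect), "positive_level": float(p_level)}


# Usage: one gene, two groups of 300 cells with different mean expression
group_a = simulate_gene(n_cells=300, mean=2.0, dropout=0.3)
group_b = simulate_gene(n_cells=300, mean=10.0, dropout=0.3)
two_class = two_class_tests(group_a, group_b)
hurdle = hurdle_style_tests(group_a, group_b)
print(two_class)
print(hurdle)
```

The two-part comparison returns separate p-values for the zero/non-zero component and for the positive expression level, which is what lets hurdle-type approaches account for excess zeros. The two-class tests, by contrast, each collapse both effects into a single p-value.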