1 |
GLM |
Gene expression can follow any distribution from the exponential family.
Suitable for bimodal data.
Able to deal with categorical predictors, e.g., cell type, cell cycle, etc.
Easy to interpret; gives a clear understanding of how each predictor influences the gene parameters.
Can be generalized to multi-cell group comparisons.
Less susceptible to model over-fitting.
|
Strict exponential family distributional assumptions about the data.
Needs relatively large datasets (more predictors and a large number of cells).
Sensitive to outliers.
Sensitive to dropout events.
Not suitable for lowly expressed genes.
Cannot handle multi-modality of the data.
ZIM–GLM approaches cannot handle zero-deflation at any level of a factor; in that case the parameter estimates for the logistic component become infinite.
Higher computational cost, especially for large datasets.
|
NBID, ZingeR, ZINB–WaVE, DECENT, SwarnSeq, scMMST, TPMM, Tweedieverse |
2 |
GAM |
Predictor functions are automatically derived during model estimation.
Marginal impact of a single variable does not depend on the values of the other variables in the model.
Flexibility in choosing the type of functions, which will help in finding patterns missed in a parametric model.
Allows controlling smoothness of the predictor functions to prevent model over-fitting.
By controlling the wiggliness of the predictor functions, we can directly tackle the bias/variance tradeoff.
Highly effective in many settings, particularly when one wishes to model the response variable as a function of both categorical (e.g., cell groups) and continuous predictors (e.g., cell-level auxiliary variables).
Considers both linear and non-linear functions of cell-level predictors to model gene parameters.
Each lineage is represented by a separate cubic smoothing spline, and its flexibility allows adjustment for other covariates or confounders as fixed effects in the model.
|
Approaches such as Monocle can only handle a single lineage of cells.
Lacks interpretability when inferring differences in expression between cell lineages.
Assumes the dropout events to be linear; however, the effect of dropout events is likely to be non-linear, especially for genes with low to moderate expression.
Computationally complex.
|
Monocle, Monocle2, Monocle3, tradeSeq |
3 |
Hurdle Model |
Accounts for excess zeros during model building.
Can handle zero-inflation as well as zero-deflation present in data.
Models the bimodality of gene expression distribution.
|
|
MAST, Random Hurdle |
4 |
Mixture-Model |
|
Certain approaches, including BPSC and SC2P, cannot account for zero-inflation in single-cell data.
Mostly uses linear models for DEA, which is cumbersome.
Higher runtime and computationally intensive.
|
SCDE, D3E, BPSC, BASiCS, DESCEND, SC2P, ZIAQ, ZIQRank, SimCD |
5 |
Non-parametric (two-class) |
Distribution-free approaches.
Considers the multi-modality of the data.
Computationally inexpensive (short runtime).
Estimates the parameters without fitting any distribution for genes.
Performs DEA with distance-like metrics across two cell types.
Performs well when the proportion of zeros in the data is low.
|
Mostly limited to comparisons between two cell groups.
Computationally complex for multi-group comparisons.
Performance is severely affected by high dropout rates (some methods exclude dropouts).
Cannot distinguish between true (biological) zeros and false (dropout) zeros.
Sensitive to sparsity.
Methods such as D3E and scDD fail to consider the UMI-count nature of the data.
Cannot separate confounding factors from each other.
|
Wilcox, NODES, ROTS, EMDomics, ROSeq, SINCERA, sigEMD, DTWscore, SAMstrt |
6 |
Parametric (two-class) |
|
Makes strict distributional assumptions about the data.
Cannot generalize to multi-group comparisons.
Ignores the multi-modal distributions of the scRNA-seq data.
Sensitive to sparsity or dropout events.
Cannot differentiate between the major sources of variability in the data.
|
scDD, DEsingle, t-test, NYMP, IDEAS |
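
To make the contrast between the two-class non-parametric and parametric rows concrete, the following is a minimal sketch that compares one gene across two cell groups with a rank-based test and a t-statistic. It is not the implementation of any tool listed above. The function names (`average_ranks`, `rank_sum_test`, `welch_t`) and the expression values are hypothetical, and the rank-sum p-value uses a tie-corrected normal approximation without continuity correction, which is an assumption made here for brevity.

```python
import math
from statistics import mean, variance


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Distribution-free two-class test (Wilcoxon rank-sum / Mann-Whitney U).

    Returns (U statistic for group_a, two-sided p-value from a
    tie-corrected normal approximation).
    """
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    pooled = list(group_a) + list(group_b)
    ranks = average_ranks(pooled)
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2

    # Tie correction matters for sparse data, where many cells share the value 0.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value identical: no evidence of a difference
        return u_a, 1.0

    z = (u_a - n_a * n_b / 2) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_a, p_value


def welch_t(group_a, group_b):
    """Parametric two-class statistic (Welch's t), which assumes approximate normality."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


if __name__ == "__main__":
    # Hypothetical expression values of one gene in two cell groups.
    cells_a = [0, 0, 1, 2, 3, 5, 8, 0, 4, 6]
    cells_b = [0, 0, 0, 0, 1, 0, 2, 0, 1, 0]
    u_stat, p = rank_sum_test(cells_a, cells_b)
    print("rank-sum U:", u_stat, "p-value:", p)
    print("Welch t:", welch_t(cells_a, cells_b))
```

Both functions take exactly two groups, which mirrors the two-class limitation noted for these rows. The tie-correction step also shows where a high proportion of zeros enters the rank-based test.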