2022 Jul 18;24(7):995. doi: 10.3390/e24070995

Table 2.

Classes of statistical approaches and tools extensively used in DEA of scRNA-seq data.

SN. Class Features Limitations Tools
1 GLM
Features:
  • Gene expression can follow any distribution from the exponential family.

  • Suitable for bi-modal data.

  • Able to deal with categorical predictors, e.g., cell type, cell cycle, etc.

  • Easy to interpret, giving a clear understanding of how each predictor influences the gene parameters.

  • Can be generalized to multi-cell group comparisons.

  • Less susceptible to model over-fitting.

Limitations:
  • Makes strict exponential-family distributional assumptions about the data.

  • Needs relatively large datasets (more predictors and a large number of cells).

  • Sensitive to outliers.

  • Sensitive to dropout events.

  • Not suitable for lowly expressed genes.

  • Cannot handle multi-modality of the data.

  • ZIM–GLM approaches cannot handle zero-deflation at any level of a factor, which results in infinite parameter estimates for the logistic component.

  • Higher computational cost, especially for large datasets.

Tools: NBID, ZingeR, ZINB–WaVE, DECENT, SwarnSeq, scMMST, TPMM, Tweedieverse
2 GAM
Features:
  • Predictor functions are automatically derived during model estimation.

  • The marginal impact of a single variable does not depend on the values of the other variables in the model.

  • Flexibility in choosing the type of functions, which helps in finding patterns missed by a parametric model.

  • Allows controlling the smoothness of the predictor functions to prevent model over-fitting.

  • Controlling the wiggliness of the predictor functions directly addresses the bias/variance tradeoff.

  • Highly effective in many settings, particularly when the response variable is modeled as a function of both categorical predictors (e.g., cell groups) and continuous predictors (e.g., cell-level auxiliary variables).

  • Considers both linear and non-linear functions of cell-level predictors to model gene parameters.

  • Each lineage is represented by a separate cubic smoothing spline, and its flexibility allows adjustment for other covariates or confounders as fixed effects in the model.

Limitations:
  • Approaches such as Monocle can only handle a single lineage of cells.

  • Lacks interpretability when inferring differences in expression between cell lineages.

  • Assumes the effect of dropout events to be linear; however, it is likely to be non-linear, especially for genes with low to moderate expression.

  • Computationally complex.

Tools: Monocle, Monocle2, Monocle3, tradeSeq
3 Hurdle Model
Features:
  • Accounts for excess zeros during model building.

  • Can handle zero-inflation as well as zero-deflation present in the data.

  • Models the bi-modality of the gene expression distribution.

Limitations:
  • Does not differentiate the generating process for excess zeros from that for sampling zeros.

  • Fails to consider the multi-modality of the gene expression distribution.

  • Requires longer runtime.

Tools: MAST, Random Hurdle
4 Mixture-Model
Features:
  • Considers the bi-modal or multi-modal nature of single-cell data.

  • Can differentiate between major sources of variation in single-cell data.

Limitations:
  • Certain approaches, including BPSC and SC2P, cannot account for zero-inflation in single-cell data.

  • Mostly uses linear models for DEA, which is cumbersome.

  • Longer runtime and computationally intensive.

Tools: SCDE, D3E, BPSC, BASiCS, DESCEND, SC2P, ZIAQ, ZIQRank, SimCD
5 Non-parametric (two-class)
Features:
  • Distribution-free approaches.

  • Considers the multi-modality of the data.

  • Computationally light (shorter runtime).

  • Estimates the parameters without fitting any distribution for genes.

  • Performs DEA with distance-like metrics across two cell types.

  • Performs well when the proportion of zeros in the data is low.

Limitations:
  • Mostly limited to comparisons between two cell groups.

  • Computationally complex for multi-group comparisons.

  • Performance is severely affected by high dropout rates (some methods exclude dropouts).

  • Cannot distinguish true (biological) zeros from false (dropout) zeros.

  • Sensitive to sparsity.

  • Methods such as D3E and scDD fail to consider the UMI-count nature of the data.

  • Cannot separate confounding factors from each other.

Tools: Wilcox, NODES, ROTS, EMDomics, ROSeq, SINCERA, sigEMD, DTWscore, SAMstrt
6 Parametric (two-class)
Features:
  • Easy to understand and execute.

  • Shorter runtime.

  • Particularly suitable for larger datasets.

Limitations:
  • Makes strict distributional assumptions about the data.

  • Cannot generalize to multi-group comparisons.

  • Ignores the multi-modal distributions of scRNA-seq data.

  • Sensitive to sparsity or dropout events.

  • Cannot differentiate between the major sources of variability in the data.

Tools: scDD, DEsingle, t-test, NYMP, IDEAS
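
As an illustration of the two-class non-parametric class in the table, the following is a minimal sketch of a two-sided Wilcoxon rank-sum (Mann-Whitney U) test for a single gene across two cell groups. It is not the implementation used by any tool listed above. The function names `average_ranks` and `rank_sum_test` and the example counts are hypothetical, and the p-value relies on a normal approximation with tie correction and no continuity correction, which is an assumption that may be poor for very small groups.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks (ties share their average rank) and tie-group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        tie_sizes.append(end - start + 1)
        start = end + 1
    return ranks, tie_sizes


def rank_sum_test(group_a, group_b):
    """Two-sided rank-sum test; returns (U statistic of group_a, approximate p-value)."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both groups must contain at least one cell")
    n = n_a + n_b
    ranks, tie_sizes = average_ranks(list(group_a) + list(group_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    # Tie correction: the many shared zeros in sparse data shrink the variance.
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term)
    if var_u <= 0:
        # All values identical: no evidence of a difference.
        return u_a, 1.0
    z = (u_a - mean_u) / math.sqrt(var_u)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_a, p_value


# Hypothetical normalized counts of one gene in two cell groups (many zeros).
group_a = [0, 0, 0, 1, 2, 0, 3, 0, 1, 0]
group_b = [0, 4, 5, 0, 7, 3, 6, 0, 8, 5]
u_stat, p_value = rank_sum_test(group_a, group_b)
print(f"U = {u_stat:.1f}, two-sided p = {p_value:.4f}")
```

The test works only on ranks, so no distribution is fitted to the gene. The sketch also makes two of the listed limitations concrete: it compares exactly two groups, and every zero is treated as one large tie, so biological zeros and dropout zeros cannot be told apart.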