2015 Dec 13;14(Suppl 1):57–67. doi: 10.4137/CIN.S21631

Algorithm Overview 5: Love et al.’s9 DESeq2

DESeq2 allows the normalization factors to be gene specific (m_gi) rather than fixed across genes (m_i). The estimation of m_gi is implemented in the authors’ R package.9
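To make the idea of a per-sample factor that is fixed across genes concrete, here is a minimal Python sketch of the median-of-ratios size factor used in the DESeq family. It illustrates only the fixed, sample-level case; the gene-specific factors are not reproduced here, and the function name `size_factors` and the toy counts are assumptions made for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factor per sample.

    counts: list of genes, each a list of raw counts (one per sample).
    """
    n_samples = len(counts[0])
    # Keep only genes with no zero count so that every logarithm is defined.
    log_counts = [
        [math.log(c) for c in gene]
        for gene in counts
        if all(c > 0 for c in gene)
    ]
    # Log of the geometric mean across samples, one value per gene.
    log_geo_means = [sum(row) / n_samples for row in log_counts]
    # For each sample, take the median log ratio to the geometric mean.
    return [
        math.exp(median(row[i] - g for row, g in zip(log_counts, log_geo_means)))
        for i in range(n_samples)
    ]


# Toy data: the second sample was sequenced twice as deeply as the first.
toy_counts = [[10, 20], [100, 200], [5, 10]]
factors = size_factors(toy_counts)
```

Dividing each sample's counts by its factor puts the samples on a common scale; in the toy data the second factor is twice the first.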
When modeling dispersion parameters, small sample sizes usually lead to highly variable estimates. DESeq2 addresses this by sharing information across genes with similar average expression. First, a dispersion is estimated separately for each gene by maximum likelihood. Next, a smooth curve is fitted to these estimates as a function of average normalized expression, which gives a location parameter for their distribution. Finally, the gene-specific dispersions are shrunk toward the fitted curve using an empirical Bayes approach. The authors stated that this procedure is superior to that of DESeq.
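The three steps can be sketched in Python as follows. This is a simplified illustration rather than DESeq2's implementation: a method-of-moments estimate stands in for maximum likelihood, the trend is assumed to have the form a1/mean + a0 and is fitted by ordinary least squares, and the shrinkage is a precision-weighted average on the log scale. The arguments `prior_var` and `sampling_var` are hypothetical inputs, and all function names are illustrative.

```python
import math
from statistics import mean, variance


def genewise_dispersion(norm_counts):
    """Step 1 (stand-in): solve Var = mu + alpha * mu^2 for alpha."""
    mu = mean(norm_counts)
    var = variance(norm_counts)
    # Floor at a small positive value so the log in step 3 is defined.
    return max((var - mu) / mu ** 2, 1e-8)


def fit_trend(means, disps):
    """Step 2: least-squares fit of alpha = a1 / mu + a0 (linear in 1/mu)."""
    x = [1.0 / m for m in means]
    x_bar, y_bar = mean(x), mean(disps)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, disps))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    return a0, a1


def shrink_dispersion(disp, trend, prior_var, sampling_var):
    """Step 3: pull log(disp) toward log(trend).

    The weight on the gene's own estimate grows as its sampling variance
    shrinks relative to the spread of true dispersions around the trend.
    """
    w = prior_var / (prior_var + sampling_var)
    return math.exp(w * math.log(disp) + (1.0 - w) * math.log(trend))


# Toy usage with made-up numbers.
alpha_gene = genewise_dispersion([10, 20, 30])
a0, a1 = fit_trend([10, 20, 50, 100], [0.25, 0.15, 0.09, 0.07])
alpha_trend = max(a0 + a1 / 20.0, 1e-8)
alpha_final = shrink_dispersion(alpha_gene, alpha_trend, prior_var=1.0, sampling_var=1.0)
```

With equal prior and sampling variances the final value is the geometric mean of the gene-wise estimate and the trend value, so it always lies between the two.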
To avoid calling genes with small average expression differentially expressed on the basis of noisy estimates, fold change estimates are shrunk toward 0 for genes with insufficient information, using empirical Bayes shrinkage. The procedure is as follows: (1) obtain the maximum likelihood estimates of the log fold changes from the GLM fit, (2) fit a normal distribution with mean 0 to these estimates, and (3) use that distribution as the prior for a second GLM fit. This procedure yields the maximum a posteriori estimate and its standard error for each log fold change, which are then used to calculate the Wald statistics for DEA.
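The effect of a zero-mean normal prior can be illustrated with a short Python sketch. It assumes a normal approximation to each gene's likelihood, so the posterior is available in closed form and no second GLM fit is run; the prior variance is estimated by a simple method of moments. Both choices, and all function names, are assumptions made for illustration and are not taken from DESeq2.

```python
import math


def estimate_prior_var(betas, ses):
    """Method of moments: observed spread minus average sampling noise."""
    n = len(betas)
    raw = sum(b * b for b in betas) / n - sum(s * s for s in ses) / n
    # Floor at a small positive value to keep the prior proper.
    return max(raw, 1e-8)


def shrink_lfc(beta_mle, se, prior_var):
    """Combine a N(0, prior_var) prior with a N(beta_mle, se^2) likelihood.

    Returns (MAP estimate, posterior standard deviation, Wald statistic).
    """
    post_var = 1.0 / (1.0 / prior_var + 1.0 / se ** 2)
    beta_map = post_var * beta_mle / se ** 2
    post_sd = math.sqrt(post_var)
    return beta_map, post_sd, beta_map / post_sd


# Same MLE, different precision: the noisy gene is shrunk much more.
noisy = shrink_lfc(2.0, se=1.0, prior_var=1.0)
precise = shrink_lfc(2.0, se=0.1, prior_var=1.0)
```

In this sketch a gene with a large standard error has its estimate pulled roughly halfway to 0, whereas a precisely estimated gene keeps almost all of its maximum likelihood value.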
DESeq2 computes a threshold, η, to filter genes based on their average normalized expression. The threshold is chosen to maximize the number of genes detected at a user-defined false discovery rate. The authors claimed that this filtering step effectively controls the power of detecting DE genes. The null hypothesis becomes |β_gp| ≤ η, where β_gp is the shrunken log fold change.
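The threshold search can be sketched in Python as a scan over candidate cutoffs. For each candidate, genes below the cutoff are dropped, the remaining P-values are adjusted with the Benjamini-Hochberg procedure, and the discoveries are counted. This is a simplified sketch under stated assumptions: the candidate grid, the function names, and the toy data are hypothetical, and the actual package may select the cutoff differently.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def choose_threshold(mean_expr, pvalues, candidates, fdr=0.1):
    """Return (threshold, discoveries) maximizing discoveries at the given FDR."""
    best = (None, -1)
    for eta in candidates:
        kept = [p for m, p in zip(mean_expr, pvalues) if m >= eta]
        discoveries = sum(q <= fdr for q in bh_adjust(kept))
        # Strict inequality keeps the smallest threshold among ties.
        if discoveries > best[1]:
            best = (eta, discoveries)
    return best


# Toy data: two low-expression genes with uninformative P-values.
mean_expr = [1, 2, 50, 80, 120, 200]
pvalues = [0.6, 0.8, 0.012, 0.03, 0.001, 0.04]
best_eta, n_found = choose_threshold(mean_expr, pvalues, [0, 10, 100], fdr=0.05)
```

In the toy data, removing the two low-expression genes reduces the multiple-testing burden, so more of the remaining genes pass the false discovery rate cutoff than with no filtering or with an overly strict threshold.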
Finally, the method provides a way to diagnose outliers using the Cook’s distance, C_d, computed from the GLM within each gene. A sample is flagged when C_d is at or above the 99% quantile of an F distribution with P and N − P degrees of freedom, where P is the number of parameters and N is the number of samples. When a large number of replicates is available, influential data points can be removed without removing the whole gene; when only a small number of replicates is available, the entire gene containing influential points should be removed from the analysis to preclude bias. More details on DESeq2’s features can be found in the study by Love et al.9 In conclusion, DESeq2 is recommended by its authors as an improved solution for differential analysis because it adopts many competitive features.
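For reference, the outlier flagging rule described above can be written compactly. The notation F_{0.99}(P, N − P) for the 99% quantile is introduced here for illustration and is not taken from the source.

```latex
\text{flag a sample if}\quad
C_d \;\ge\; F_{0.99}(P,\; N - P),
\qquad
P = \text{number of parameters},\quad
N = \text{number of samples}.
```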