. 2015 Dec 13;14(Suppl 1):57–67. doi: 10.4137/CIN.S21631

Algorithm Overview 4: Anders and Huber’s8 DESeq

The read count $y_{gi}$ is modeled by a negative binomial GLM with a log link:
$$\log(\lambda_{gi}) = \sum_{p=1}^{P} x_{ip}\,\beta_{gp} \tag{3.2.c}$$
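As a small illustration of the log link in (3.2.c), the sketch below inverts it to recover $\lambda_{gi}$ from a design row and a coefficient vector. The function name, the two-column design (an intercept plus a group indicator), and the coefficient values are hypothetical choices made for this example and do not come from Anders and Huber.

```python
import math

def expected_proportion(design_row, beta_g):
    """Invert the log link: lambda_gi = exp(sum_p x_ip * beta_gp)."""
    if len(design_row) != len(beta_g):
        raise ValueError("design row and coefficient vector must have equal length")
    linear_predictor = sum(x * b for x, b in zip(design_row, beta_g))
    return math.exp(linear_predictor)

# Hypothetical two-group design: column 1 is an intercept, column 2 a group indicator.
# The coefficients are made-up values for a single gene.
beta_g = [math.log(100.0), math.log(2.0)]
lambda_control = expected_proportion([1, 0], beta_g)
lambda_treated = expected_proportion([1, 1], beta_g)
fold_change = lambda_treated / lambda_control
```

Under this assumed design, the second coefficient is the log fold change between the two groups, so `fold_change` equals `exp(beta_g[1])` up to floating-point rounding.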
The mean $\mu_{gi}$ is the proportion of reads for gene $g$ in sample $i$, $\lambda_{gk(i)}$, scaled by a normalization factor $m_i$, that is, $\mu_{gi} = m_i\lambda_{gk(i)}$. The variance is $\sigma_{gi}^2 = \mu_{gi} + m_i^2\nu_{gk(i)}$, where $\nu_{gk(i)}$ is a per-gene raw variance, assumed to be a smooth function of $\lambda_g$ and $k$. Using a smooth function helps stabilize the variance estimates, especially when the number of samples is small. For the normalization factor $m_i$ of each sample (which Anders and Huber call the size factor), the authors noted that highly DE genes can have a disproportionate influence on the total count, so the median of the ratios of counts should be used as a more robust estimate:
$$\hat{m}_i = \operatorname*{median}_{g}\; \frac{y_{gi}}{\left(\prod_{v=1}^{N} y_{gv}\right)^{1/N}} \tag{3.2.d}$$
where the median is taken over genes $g$ and the denominator is the geometric mean of the counts for gene $g$ across the $N$ samples.
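The median-of-ratios estimate in (3.2.d) can be sketched in a few lines of plain Python. This is a minimal illustration, not the DESeq implementation: the function name and the toy count matrix are hypothetical, and skipping genes with a zero count in any sample (whose geometric mean would be zero) is an assumption made here to keep the ratios defined.

```python
import math
from statistics import median

def size_factors(counts):
    """counts[g][i] is the read count of gene g in sample i.

    Returns one size factor per sample (median of ratios to the geometric mean).
    """
    n_samples = len(counts[0])
    # Assumption: drop genes with any zero count, since their geometric
    # mean is zero and the ratio in (3.2.d) would be undefined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has positive counts in every sample")
    # Geometric mean across samples, computed on the log scale.
    geo_means = [math.exp(sum(math.log(c) for c in row) / n_samples)
                 for row in usable]
    return [median(row[i] / gm for row, gm in zip(usable, geo_means))
            for i in range(n_samples)]

# Toy data: sample 2 is sequenced twice as deeply as sample 1,
# and the fourth gene is strongly differentially expressed.
toy_counts = [
    [10, 20],
    [50, 100],
    [30, 60],
    [5, 500],   # highly DE gene
    [0, 4],     # skipped because of the zero count
]
factors = size_factors(toy_counts)
```

In this toy matrix the ratio `factors[1] / factors[0]` recovers the twofold depth difference even though the DE gene would distort a total-count normalization, which is the robustness argument made above.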
Since $\lambda_{gk(i)}$ is proportional to the expected value of the unknown proportion from gene $g$ in group $k$, it is estimated by the average of the counts from all samples in group $k$, brought to a common scale:
$$\hat{\lambda}_{gk(i)} = \frac{1}{M_k} \sum_{i:\,k(i)=k} \frac{y_{gi}}{\hat{m}_i}, \tag{3.2.e}$$
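where $M_k$ is the total number of replicates for group $k$. The sample variance on the common scale, $w_{gk}$, and the accompanying correction term, $z_{gk}$, are calculated as:
$$w_{gk} = \frac{1}{M_k - 1} \sum_{i:\,k(i)=k} \left(\frac{y_{gi}}{\hat{m}_i} - \hat{\lambda}_{gk(i)}\right)^2 \tag{3.2.f}$$
$$z_{gk} = \frac{\hat{\lambda}_{gk(i)}}{M_k} \sum_{i:\,k(i)=k} \frac{1}{\hat{m}_i} \tag{3.2.g}$$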
When $M_k$ is sufficiently large, $w_{gk} - z_{gk}$ can be used directly as an unbiased estimator of the raw variance $\nu_{gk}$. When $M_k$ is small, Anders and Huber suggested fitting a smooth function $w_k(\lambda)$ by local regression on the graph of $(\hat{\lambda}_{gk(i)}, w_{gk})$, so that $w_k(\hat{\lambda}_{gk(i)}) - z_{gk}$ is the estimate of the raw variance. More details are in the study by Anders and Huber.8
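The moment estimates in (3.2.e) to (3.2.g) and the direct raw-variance estimate $w_{gk} - z_{gk}$ can be sketched as follows for a single gene. This is a minimal illustration under stated assumptions: the function names, group labels, counts, and size factors are hypothetical, and the local-regression smoothing step recommended for small $M_k$ is deliberately not implemented here.

```python
import math

def raw_variance_moments(counts_g, size_factors, groups, k):
    """Return (lambda_hat, w, z, w - z) for one gene within group k."""
    idx = [i for i, grp in enumerate(groups) if grp == k]
    m_k = len(idx)  # number of replicates M_k
    if m_k < 2:
        raise ValueError("need at least two replicates to estimate a variance")
    # Counts brought to a common scale by dividing by the size factors.
    scaled = [counts_g[i] / size_factors[i] for i in idx]
    lambda_hat = sum(scaled) / m_k                                  # (3.2.e)
    w = sum((s - lambda_hat) ** 2 for s in scaled) / (m_k - 1)      # (3.2.f)
    z = lambda_hat / m_k * sum(1.0 / size_factors[i] for i in idx)  # (3.2.g)
    return lambda_hat, w, z, w - z

def count_variance(lambda_hat, raw_var, m_i):
    """Variance of a count: sigma^2 = mu + m_i^2 * nu, with mu = m_i * lambda."""
    return m_i * lambda_hat + m_i ** 2 * raw_var

# Hypothetical data for one gene: three replicates in group "A", two in group "B".
counts_g = [90, 110, 130, 40, 60]
factors = [1.0, 1.0, 1.0, 0.5, 2.0]
groups = ["A", "A", "A", "B", "B"]

lam, w, z, nu_direct = raw_variance_moments(counts_g, factors, groups, "A")
sigma2_unit = count_variance(lam, nu_direct, 1.0)
```

With so few replicates the direct estimate `nu_direct` is noisy and can even be negative for some genes, which is the motivation for smoothing $w_{gk}$ against $\hat{\lambda}_{gk(i)}$ across genes before subtracting $z_{gk}$.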