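The read count $y_{gi}$ for gene $g$ in sample $i$ is modeled by a negative binomial GLM with a log link:
$$y_{gi} \sim \mathrm{NB}\left(\mu_{gi}, \sigma_{gi}^2\right), \qquad \log \mu_{gi} = \log m_i + \log \lambda_{gk(i)}$$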
The mean $\mu_{gi} = m_i \lambda_{gk(i)}$ is the proportion of reads for gene $g$ in the group $k(i)$ of sample $i$, $\lambda_{gk(i)}$, scaled by a normalization factor, $m_i$. The variance is
$$\sigma_{gi}^2 = \mu_{gi} + m_i^2 \nu_{gk(i)},$$
where $\nu_{gk(i)}$ is the per-gene raw variance, assumed to be a smooth function of $\lambda_{gk}$ within each group $k$. Using a smooth function helps stabilize the variance estimates, especially when the number of samples is small. For the estimation of the normalization factor $m_i$ for each sample (referred to as the size factor by Anders and Huber), the authors noted that a few highly DE genes can strongly influence the total count, so the median of the ratios of counts to per-gene geometric means is used for robustness:
$$\hat{m}_i = \operatorname{median}_g \frac{y_{gi}}{\left(\prod_{i'=1}^{n} y_{gi'}\right)^{1/n}},$$
where $n$ is the total number of samples.
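The median-of-ratios idea can be illustrated with a minimal Python sketch. It is not the DESeq implementation; the function name `size_factors`, the nested-list layout `counts[g][i]`, and the choice to skip genes with a zero count (so that the geometric mean is non-zero) are assumptions made for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch).

    counts[g][i] is the read count of gene g in sample i.
    Returns one size factor per sample.
    """
    n_samples = len(counts[0])
    # Assumption: keep only genes with positive counts in every sample,
    # so the per-gene geometric mean is defined and non-zero.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has positive counts in all samples")
    # Log of the per-gene geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    # For each sample, take the median over genes of count / geometric mean.
    return [
        median(
            math.exp(math.log(row[i]) - lgm)
            for row, lgm in zip(usable, log_geo_means)
        )
        for i in range(n_samples)
    ]


# Sample 2 is sequenced twice as deeply as sample 1; the last gene is highly DE.
example_counts = [[10, 20], [100, 200], [5, 10], [1, 1000]]
example_factors = size_factors(example_counts)
```

In this toy input the last gene would distort a total-count normalization, but the median of the ratios still recovers the twofold depth difference between the two samples.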
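Since $\lambda_{gk}$ is proportional to the expected value of the unknown proportion of reads from gene $g$ in group $k$, it is estimated by the average of the counts from all samples in group $k$ after they are brought to a common scale:
$$\hat{\lambda}_{gk} = \frac{1}{M_k} \sum_{i:\,k(i)=k} \frac{y_{gi}}{\hat{m}_i},$$
where $M_k$ is the total number of replicates for group $k$. The sample variances on the common scale are calculated as:
$$w_{gk} = \frac{1}{M_k - 1} \sum_{i:\,k(i)=k} \left(\frac{y_{gi}}{\hat{m}_i} - \hat{\lambda}_{gk}\right)^2$$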
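When $M_k$ is sufficiently large, $w_{gk} - z_{gk}$ can be taken as an unbiased estimator of the raw variance $\nu_{gk}$, where, following Anders and Huber, $z_{gk} = \frac{\hat{\lambda}_{gk}}{M_k} \sum_{i:\,k(i)=k} \frac{1}{\hat{m}_i}$ corrects for the shot-noise part of the variance. When $M_k$ is small, local regression on the graph of $(\hat{\lambda}_{gk}, w_{gk})$ was suggested to obtain a smooth function $w_k(\lambda)$, so that
$$\hat{\nu}_{gk} = w_k(\hat{\lambda}_{gk}) - z_{gk}$$
would be the estimate for the raw variance. More details are in the study by Anders and Huber.8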
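The per-gene quantities above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `raw_variance_estimates` and the data layout are assumptions, the size factors are taken as given, and the local regression step is deliberately omitted, so the returned value is the unsmoothed $w_{gk} - z_{gk}$.

```python
def raw_variance_estimates(counts, size_factors, group):
    """Common-scale mean, sample variance, and unsmoothed raw variance.

    counts[g][i]  : read count of gene g in sample i
    size_factors  : one normalization factor per sample
    group         : indices of the samples that belong to condition k
    Returns a list of (lambda_hat, w, w - z) tuples, one per gene.
    """
    m_k = len(group)
    if m_k < 2:
        raise ValueError("at least two replicates are needed for a sample variance")
    # Sum of inverse size factors over the group, used in the correction term z.
    inv_sum = sum(1.0 / size_factors[i] for i in group)
    results = []
    for row in counts:
        # Bring the counts of this gene to a common scale.
        scaled = [row[i] / size_factors[i] for i in group]
        lambda_hat = sum(scaled) / m_k
        # Sample variance on the common scale.
        w = sum((x - lambda_hat) ** 2 for x in scaled) / (m_k - 1)
        # Shot-noise correction term.
        z = lambda_hat / m_k * inv_sum
        results.append((lambda_hat, w, w - z))
    return results


# One gene, three replicates, all size factors equal to 1 (toy values).
toy = raw_variance_estimates([[4, 10, 16]], [1.0, 1.0, 1.0], [0, 1, 2])
```

For the toy gene the common-scale mean is 10, the sample variance is 36, and the correction term is 10, so the unsmoothed raw variance is 26. With few replicates, individual values of $w_{gk} - z_{gk}$ are noisy and can be negative, which is the motivation for smoothing $w$ against $\hat{\lambda}$ before subtracting $z$.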