Differential Expression Analysis for RNA-Seq: An Overview of Statistical Methods and Computational Software

2015 Dec 13;14(Suppl 1):57–67. doi: 10.4137/CIN.S21631

Algorithm Overview 4: Anders and Huber’s⁸ DESeq

The read count y_gi for gene g in sample i is modeled by a negative binomial GLM with a log link, with design matrix entries x_ip and gene-specific coefficients β_gp:

\log (λ_{g i}) = \sum_{p = 1}^{P} x_{i p} β_{g p}

(3.2.c)
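As an illustration of the log link in equation (3.2.c), the following minimal sketch evaluates λ_gi for one gene under a two-group design. The design matrix, the coefficient values, and the function name `expected_level` are hypothetical and chosen only for illustration; they are not taken from Anders and Huber.

```python
import math


def expected_level(design_row, coefficients):
    """Return lambda_gi = exp(sum_p x_ip * beta_gp) for one gene and one sample."""
    if len(design_row) != len(coefficients):
        raise ValueError("design row and coefficient vector must have the same length")
    linear_predictor = sum(x * b for x, b in zip(design_row, coefficients))
    return math.exp(linear_predictor)


# Hypothetical design with P = 2 columns: an intercept and a treatment indicator.
design = [
    [1.0, 0.0],  # control sample
    [1.0, 0.0],  # control sample
    [1.0, 1.0],  # treated sample
    [1.0, 1.0],  # treated sample
]

# Hypothetical coefficients for one gene: baseline log level and log fold change.
beta_g = [math.log(100.0), math.log(2.0)]

# One lambda per sample: about 100 in the controls and about 200 in the treated samples.
lambdas = [expected_level(row, beta_g) for row in design]
print(lambdas)
```

With this coding, the second coefficient is the log fold change between the two groups, which is the quantity a test for differential expression would examine.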

The mean µ_gi is the proportion of reads for gene g in sample i, λ_gk(i), scaled by a normalization factor m_i, that is, µ_gi = m_i λ_gk(i), where k(i) denotes the group of sample i. The variance is

σ_{g i}^{2} = μ_{g i} + m_{i}^{2} ν_{g k (i)}

where ν_gk(i) is the per-gene raw variance, assumed to be a smooth function of λ_g and k. Using a smooth function helps stabilize the variance estimates, especially when the number of samples is small. To estimate the normalization factor m_i (which Anders and Huber call the size factor) for each of the N samples, the authors noted that a few highly DE genes can strongly influence the total count, so the median of the ratios of each count to the gene's geometric mean across samples is used instead for robustness:

{\hat{m}}_{i} = \underset{g}{\mathrm{median}} \frac{y_{g i}}{{(\prod_{v = 1}^{N} y_{g v})}^{1 / N}}

(3.2.d)
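The median-of-ratios estimate in equation (3.2.d) can be sketched in a few lines of plain Python. This is a minimal illustration, not the DESeq implementation: the function name `estimate_size_factors` and the toy counts are hypothetical, and skipping genes with a zero count in any sample is an assumption made here because their geometric mean is zero and the ratio is undefined.

```python
import math
from statistics import median


def estimate_size_factors(counts):
    """counts[g][i] is the read count of gene g in sample i.

    Returns one size factor per sample: the median over genes of the ratio
    between the count and the gene's geometric mean across samples.
    """
    n_samples = len(counts[0])
    usable = []
    for row in counts:
        # Assumption: genes with any zero count are left out of the median.
        if all(y > 0 for y in row):
            geo_mean = math.exp(sum(math.log(y) for y in row) / n_samples)
            usable.append((row, geo_mean))
    if not usable:
        raise ValueError("no gene has a nonzero count in every sample")
    return [median(row[i] / gm for row, gm in usable) for i in range(n_samples)]


# Toy data: sample 2 is sequenced twice as deeply as sample 1.
# Gene 4 is strongly DE and gene 5 has a zero count.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [30, 600],
    [0, 5],
]

m_hat = estimate_size_factors(counts)
print(m_hat)  # roughly [0.707, 1.414]
```

In this toy example the strongly DE gene does not move the estimate: the two size factors differ by a factor of 2, matching the sequencing depth of the non-DE genes, whereas the total counts of the two samples differ by a factor of about 4.9.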

Since λ_gk(i) is proportional to the expected value of the unknown proportion of reads from gene g in group k, it is estimated by the average of the counts from all samples in group k after they are brought to a common scale:

{\hat{λ}}_{g k (i)} = \frac{1}{M_{k}} \sum_{i : k (i) = k} \frac{y_{g i}}{{\hat{m}}_{i}},

(3.2.e)

where M_k is the number of replicates in group k. The sample variance on the common scale, w_gk, and a term z_gk that accounts for the Poisson (shot noise) part of the variance are calculated as:

w_{g k} = \frac{1}{M_{k} - 1} \sum_{i : k (i) = k} {(\frac{y_{g i}}{{\hat{m}}_{i}} - {\hat{λ}}_{g k (i)})}^{2}

(3.2.f)

z_{g k} = \frac{{\hat{λ}}_{g k (i)}}{M_{k}} \sum_{i : k (i) = k} \frac{1}{{\hat{m}}_{i}}

(3.2.g)
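Equations (3.2.e) to (3.2.g) can be traced for a single gene with the short sketch below. The function name `group_estimates`, the toy counts, and the size factors are hypothetical; the difference w_gk − z_gk returned as the last value is the raw variance estimate discussed next.

```python
def group_estimates(counts_g, m_hat, groups, k):
    """Per-gene estimates for group k.

    counts_g[i] is the count of one gene in sample i, m_hat[i] is the size
    factor of sample i, and groups[i] is the group label k(i) of sample i.
    Returns (lambda_hat, w, z, w - z).
    """
    idx = [i for i, label in enumerate(groups) if label == k]
    n_rep = len(idx)  # M_k, the number of replicates in group k
    if n_rep < 2:
        raise ValueError("the sample variance needs at least two replicates")

    scaled = [counts_g[i] / m_hat[i] for i in idx]              # counts on the common scale
    lambda_hat = sum(scaled) / n_rep                            # equation (3.2.e)
    w = sum((s - lambda_hat) ** 2 for s in scaled) / (n_rep - 1)  # equation (3.2.f)
    z = lambda_hat / n_rep * sum(1.0 / m_hat[i] for i in idx)   # equation (3.2.g)
    return lambda_hat, w, z, w - z


# Toy data: three replicates in group "A" and two in group "B".
counts_g = [45, 110, 260, 500, 520]
m_hat = [0.5, 1.0, 2.0, 1.0, 1.0]
groups = ["A", "A", "A", "B", "B"]

lambda_hat, w, z, raw_variance = group_estimates(counts_g, m_hat, groups, "A")
print(lambda_hat, w, z, raw_variance)  # about 110, 400, 128.33, 271.67
```

Here the scaled counts in group A are 90, 110, and 130, so the sample variance of 400 splits into a shot noise part of about 128 and a raw variance of about 272.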

When M_k is sufficiently large, w_gk − z_gk can be used as an unbiased estimator of the raw variance ν_gk. When M_k is small, this per-gene estimator is highly variable, so local regression for a smooth function w_k(λ) on the graph of

({\hat{λ}}_{g k (i)}, w_{g k})

was suggested so that

w_{k} ({\hat{λ}}_{g k (i)}) - z_{g k}

would be the estimate of the raw variance. More details are given in the study by Anders and Huber.⁸
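The smoothing step can be sketched as follows. Note that this is not the local regression procedure used by DESeq itself: as a simple stand-in, the sketch uses a Gaussian-kernel local linear fit written in plain Python, and the `bandwidth` parameter, the function names, and the toy values are all assumptions made for illustration.

```python
import math


def local_linear_fit(xs, ys, x0, bandwidth):
    """Gaussian-kernel weighted linear regression of ys on xs, evaluated at x0.

    Intended for x0 taken from xs, so that at least one weight equals 1.
    """
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    x_bar = sum(wt * x for wt, x in zip(weights, xs)) / total
    y_bar = sum(wt * y for wt, y in zip(weights, ys)) / total
    sxx = sum(wt * (x - x_bar) ** 2 for wt, x in zip(weights, xs))
    sxy = sum(wt * (x - x_bar) * (y - y_bar) for wt, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0  # fall back to the weighted mean
    return y_bar + slope * (x0 - x_bar)


def smoothed_raw_variances(lambda_hats, ws, zs, bandwidth):
    """Return w_k(lambda_hat_g) - z_g for every gene g in one group."""
    return [
        local_linear_fit(lambda_hats, ws, lam, bandwidth) - z
        for lam, z in zip(lambda_hats, zs)
    ]


# Toy values for five genes in one group. The sample variances lie exactly on
# the line w = 2 * lambda + 5, and z equals lambda (all size factors equal to 1).
lambda_hats = [10.0, 20.0, 40.0, 80.0, 160.0]
ws = [2.0 * lam + 5.0 for lam in lambda_hats]
zs = list(lambda_hats)

raw_variances = smoothed_raw_variances(lambda_hats, ws, zs, bandwidth=50.0)
print(raw_variances)  # about [15, 25, 45, 85, 165]
```

Because the toy variances are exactly linear in λ, the fit reproduces them and the result is simply λ + 5. With real data the values of w_gk scatter widely around the trend, and the fitted curve pools information across genes with similar expression levels.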