Table 1. Clustering of samples according to different normalization/scaling strategies.
Normalization method | Use of gene length data | Scaling | Fraction | T1 A | T1 B | T2 A | T2 B | T3 A | T3 B | T4 A | T4 B | Number of falsely grouped samples |
---|---|---|---|---|---|---|---|---|---|---|---|---|
None | Raw data | | | 2 | 2 | 4 | 3 | 3 | 3 | 2 | 1 | 2 |
1 | Exon or transcript length not considered | | | 4 | 4 | 2 | 1 | 2 | 1 | 4 | 3 | 3 |
2 | Reads multiplied by exon length | R | N | 3 | 3 | 2 | 4 | 2 | 4 | 1 | 1 | 2 |
3 | Reads divided by exon length | R | N | 4 | 4 | 4 | 3 | 1 | 3 | 2 | 2 | 2 |
4 | Reads multiplied by transcript length | R | N | 2 | 2 | 2 | 3 | 3 | 3 | 1 | 4 | 2 |
5 | Reads divided by transcript length | R | N | 3 | 2 | 1 | 4 | 1 | 4 | 1 | 1 | 3 |
6 | Reads multiplied by exon length | S | N | 3 | 3 | 4 | 2 | 4 | 2 | 1 | 1 | 2 |
7 | Reads divided by exon length | S | N | 3 | 3 | 3 | 2 | 4 | 1 | 4 | 4 | 2 |
8 | Reads multiplied by transcript length | S | N | 1 | 1 | 1 | 4 | 4 | 4 | 2 | 3 | 2 |
*9* | Reads divided by transcript length | S | N | 4 | 4 | 1 | 1 | 2 | 1 | 3 | 3 | 1 |
10 | Reads multiplied by exon length | R | Y | 1 | 1 | 2 | 3 | 2 | 3 | 4 | 4 | 2 |
11 | Reads divided by exon length | R | Y | 3 | 3 | 3 | 2 | 4 | 2 | 1 | 1 | 2 |
12 | Reads multiplied by transcript length | R | Y | 4 | 4 | 4 | 2 | 1 | 2 | 3 | 3 | 2 |
13 | Reads divided by transcript length | R | Y | 4 | 4 | 2 | 1 | 3 | 1 | 2 | 2 | 2 |
14 | Reads multiplied by exon length | S | Y | 3 | 3 | 1 | 2 | 1 | 4 | 3 | 3 | 4 |
15 | Reads divided by exon length | S | Y | 2 | 2 | 2 | 4 | 3 | 1 | 3 | 3 | 2 |
**16** | Reads multiplied by transcript length | S | Y | 3 | 3 | 4 | 4 | 1 | 1 | 2 | 2 | 0 |
*17* | Reads divided by transcript length | S | Y | 4 | 4 | 3 | 3 | 2 | 3 | 1 | 1 | 1 |
Shown are cluster memberships for technical replicates (A and B) of four randomly chosen samples (T1–4). Column 2 gives the principle behind each normalization strategy with respect to transcript or exon length. Column 3 gives the scaling method: R, counts are divided by the total number of reads per sample; S, counts are scaled to the total sum of gene × length products or quotients per sample. Column 4 indicates whether individual transcript reads are divided by the total number of mappable reads before entering the equation (N, no; Y, yes), that is, whether reads per transcript are treated as a fraction of all reads. A full description is given in the Methods section. The labels 1, 2, 3 and 4 define the four clusters; samples that belong to the same cluster receive the same label. Method 3 is analogous to the RPKM method. Only method 16 (bold) leads to the correct clustering. Methods 9 and 17 (italics) give the same clustering result (only the numbering of the clusters differs) and perform slightly worse than method 16, with one falsely grouped sample (T3B).
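
Because cluster labels are arbitrary, two rows of the table describe the same clustering whenever they agree up to a renaming of the labels. The following minimal Python sketch illustrates this for methods 9, 16 and 17, using the labels from the table. The helper names (`canonical`, `split_replicates`, `is_correct`) are hypothetical and not part of the original analysis. "Correct" is assumed here to mean that both replicates of every sample share a cluster and the four samples occupy four different clusters.

```python
from typing import Dict, List, Sequence, Tuple

# Sample names; each row of the table lists replicate A then B per sample.
SAMPLES = ["T1", "T2", "T3", "T4"]

# Cluster labels copied from Table 1, in the order T1A, T1B, ..., T4A, T4B.
LABELS: Dict[str, List[int]] = {
    "9": [4, 4, 1, 1, 2, 1, 3, 3],
    "16": [3, 3, 4, 4, 1, 1, 2, 2],
    "17": [4, 4, 3, 3, 2, 3, 1, 1],
}


def canonical(labels: Sequence[int]) -> Tuple[int, ...]:
    """Renumber clusters by order of first appearance (1, 2, 3, ...)."""
    mapping: Dict[int, int] = {}
    return tuple(mapping.setdefault(lab, len(mapping) + 1) for lab in labels)


def split_replicates(labels: Sequence[int]) -> List[str]:
    """Return the samples whose replicates A and B fall in different clusters."""
    return [
        name
        for i, name in enumerate(SAMPLES)
        if labels[2 * i] != labels[2 * i + 1]
    ]


def is_correct(labels: Sequence[int]) -> bool:
    """Assumed criterion: replicate pairs together, samples in distinct clusters."""
    return canonical(labels) == (1, 1, 2, 2, 3, 3, 4, 4)


if __name__ == "__main__":
    for method, labels in LABELS.items():
        print(method, canonical(labels), split_replicates(labels), is_correct(labels))
```

Under this renumbering, methods 9 and 17 reduce to the same label sequence, (1, 1, 2, 2, 3, 2, 4, 4), in which only the T3 replicates are separated, while method 16 is the only one of the three that satisfies the assumed correctness criterion.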