Table 2.
Dataset / Threshold | (a) No. of sequences | % | (b) No. of features | % | (c) No. of zeros | % | (d) No. of counts in [2-9] | % | |
---|---|---|---|---|---|---|---|---|---|
Yeast RNAseq, N= 16 samples | (mRNA) | ||||||||
No filtering | 37 710 728 | 100.00 | 3034 | 100.00 | 56 | 100.00 | 278 | 100.00 | |
Relative abundance ≥ .0001 | 34 330 805 | 91.04 | 2019 | 66.55 | 0 | 0.00 | 7 | 2.52 | |
Relative abundance ≥ .001 | 22 464 080 | 59.57 | 317 | 10.45 | 0 | 0.00 | 0 | 0.00 | |
Relative abundance ≥ .01 | 8 277 104 | 21.95 | 24 | 0.79 | 0 | 0.00 | 0 | 0.00 | |
Count ≥ 2 | 37 710 696 | 100.00 | 3031 | 99.90 | 8 | 14.29 | 278 | 100.00 | |
Count ≥ 10 | 37 708 896 | 100.00 | 3029 | 99.84 | 7 | 12.50 | 269 | 96.76 | |
Tara Oceans, N= 139 samples | (OTU) | ||||||||
No filtering | 14 129 941 | 100.00 | 35 651 | 100.00 | 4 394 814 | 100.00 | 199 424 | 100.00 | |
Relative abundance ≥ .0001 | 13 093 797 | 92.67 | 7250 | 20.34 | 595 938 | 13.56 | 155 003 | 77.73 | |
Relative abundance ≥ .001 | 8 241 812 | 58.33 | 2450 | 6.87 | 135 678 | 3.09 | 56 849 | 28.51 | |
Relative abundance ≥ .01 | 1 499 364 | 10.61 | 113 | 0.32 | 5324 | 0.12 | 2369 | 1.19 | |
Count ≥ 2 | 13 941 637 | 98.67 | 19 803 | 55.55 | 2 222 449 | 50.57 | 199 424 | 100.00 | |
Count ≥ 10 | 13 147 108 | 93.04 | 7483 | 20.99 | 623 333 | 14.18 | 157 107 | 78.78 | |
Gut microbiome, N= 265 samples | (OTU) | ||||||||
No filtering | 17 365 964 | 100.00 | 10 000 | 100.00 | 2 535 419 | 100.00 | 37 964 | 100.00 | |
Relative abundance ≥ .0001 | 17 266 878 | 99.43 | 9862 | 98.62 | 2 499 064 | 98.57 | 37 893 | 99.81 | |
Relative abundance ≥ .001 | 16 302 087 | 93.87 | 8992 | 89.92 | 2 276 347 | 89.78 | 34 302 | 90.35 | |
Relative abundance ≥ .01 | 12 125 721 | 69.82 | 1521 | 15.21 | 370 082 | 14.60 | 7431 | 19.57 | |
Count ≥ 2 | 17 346 927 | 99.89 | 9897 | 98.97 | 2 508 181 | 98.93 | 37 964 | 100.00 | |
Count ≥ 10 | 17 180 567 | 98.93 | 9419 | 94.19 | 2 382 756 | 93.98 | 37 141 | 97.83 |
The first row of each block shows the unfiltered dataset. For yeast, there are 37.7M sequences, which collapse to about 3K features after clustering; the resulting count table contains 56 zero counts and 278 counts between 2 and 9. In the Tara Oceans dataset, the second row shows that filtering on relative abundance ≥ 0.0001 reduces the number of OTUs from 35 651 to 7250 (20%), comparable to applying an absolute minimum count threshold of 10 (7483 OTUs, 21%). The number of zero counts also drops substantially, from 4.4M to 596K.
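The quantities reported in each row of Table 2 can be illustrated with a minimal Python sketch. It is not the procedure used to produce the table: the function names (`summarize`, `filter_features`, `percent_of_unfiltered`) are hypothetical, the toy matrix is invented for illustration, and the sketch assumes that a feature is kept or dropped as a whole, based on its total count across samples (count filter) or its share of all sequences in the dataset (relative abundance filter). The exact filtering rule behind Table 2 may differ.

```python
import numpy as np


def summarize(counts):
    """Summarize a samples x features count table (the four quantities in Table 2)."""
    counts = np.asarray(counts)
    return {
        "sequences": int(counts.sum()),                # total number of sequences
        "features": int(counts.shape[1]),              # number of features (columns)
        "zeros": int((counts == 0).sum()),             # table entries equal to zero
        "counts_2_to_9": int(((counts >= 2) & (counts <= 9)).sum()),
    }


def filter_features(counts, min_rel_abundance=None, min_count=None):
    """Drop whole features that fall below a threshold.

    Assumption (not taken from the source): thresholds are applied to each
    feature's total over all samples.
    """
    counts = np.asarray(counts)
    totals = counts.sum(axis=0)
    keep = np.ones(counts.shape[1], dtype=bool)
    if min_rel_abundance is not None:
        keep &= (totals / totals.sum()) >= min_rel_abundance
    if min_count is not None:
        keep &= totals >= min_count
    return counts[:, keep]


def percent_of_unfiltered(filtered, unfiltered):
    """Express each summary quantity as a percentage of the unfiltered value."""
    return {
        key: round(100.0 * filtered[key] / unfiltered[key], 2)
        if unfiltered[key] else float("nan")
        for key in unfiltered
    }


# Toy table: 3 samples (rows) x 4 features (columns), invented for illustration.
toy = np.array([
    [100, 5, 0, 1],
    [80, 3, 2, 0],
    [120, 0, 9, 0],
])

base = summarize(toy)
by_count = summarize(filter_features(toy, min_count=10))
by_abundance = summarize(filter_features(toy, min_rel_abundance=0.01))

print(base)
print(by_count, percent_of_unfiltered(by_count, base))
print(by_abundance, percent_of_unfiltered(by_abundance, base))
```

Under these assumptions, dropping a rare feature removes all of its table entries at once, which is why the number of zeros falls together with the number of features while the number of sequences changes comparatively little.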