2020 Jun 19;2(2):lqaa040. doi: 10.1093/nargab/lqaa040

Table 2.

Number (and proportion) of (a) sequences, (b) features (mRNAs for RNA-seq data, OTUs for metagenomic data), (c) zero counts and (d) counts between 2 and 9 that remain after each filtering method, for the three example studies.

		(a)		(b)		(c)		(d)
Dataset	Threshold	No. of sequences	%	No. of features	%	No. of zeros	%	No. of counts in [2-9]	%
Yeast RNA-seq, N = 16 samples				(mRNA)
	No filtering	37 710 728	100.00	3034	100.00	56	100.00	278	100.00
	Relative abundance ≥ .0001	34 330 805	91.04	2019	66.55	0	0.00	7	2.52
	Relative abundance ≥ .001	22 464 080	59.57	317	10.45	0	0.00	0	0.00
	Relative abundance ≥ .01	8 277 104	21.95	24	0.79	0	0.00	0	0.00
	Count ≥ 2	37 710 696	100.00	3031	99.90	8	14.29	278	100.00
	Count ≥ 10	37 708 896	100.00	3029	99.84	7	12.50	269	96.76
Tara Oceans, N = 139 samples				(OTU)
	No filtering	14 129 941	100.00	35 651	100.00	4 394 814	100.00	199 424	100.00
	Relative abundance ≥ .0001	13 093 797	92.67	7250	20.34	595 938	13.56	155 003	77.73
	Relative abundance ≥ .001	8 241 812	58.33	2450	6.87	135 678	3.09	56 849	28.51
	Relative abundance ≥ .01	1 499 364	10.61	113	0.32	5324	0.12	2369	1.19
	Count ≥ 2	13 941 637	98.67	19 803	55.55	2 222 449	50.57	199 424	100.00
	Count ≥ 10	13 147 108	93.04	7483	20.99	623 333	14.18	157 107	78.78
Gut microbiome, N = 265 samples				(OTU)
	No filtering	17 365 964	100.00	10 000	100.00	2 535 419	100.00	37 964	100.00
	Relative abundance ≥ .0001	17 266 878	99.43	9862	98.62	2 499 064	98.57	37 893	99.81
	Relative abundance ≥ .001	16 302 087	93.87	8992	89.92	2 276 347	89.78	34 302	90.35
	Relative abundance ≥ .01	12 125 721	69.82	1521	15.21	370 082	14.60	7431	19.57
	Count ≥ 2	17 346 927	99.89	9897	98.97	2 508 181	98.93	37 964	100.00
	Count ≥ 10	17 180 567	98.93	9419	94.19	2 382 756	93.98	37 141	97.83

The first row for each dataset shows the unfiltered data. For yeast, there are 37.7M sequences, which collapse down to about 3K features (3034) after clustering; the count table contains 56 zero counts and 278 counts between 2 and 9. For the Tara Oceans dataset, the second row shows that filtering on relative abundance ≥ 0.0001 reduces the number of OTUs from 35 651 to 7250 (20%), which is comparable to the 7483 OTUs (21%) retained with an absolute minimum count of 10. The number of zero counts also falls substantially, from 4.4M to 596K (14% of the unfiltered total).
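
As a rough illustration of how the four columns of the table could be computed, the sketch below applies a relative-abundance filter and a minimum-count filter to a toy samples-by-features count matrix. It is not the authors' code: the function names, the toy data, and in particular the filtering rule (a feature is kept when its total count across all samples, either as a share of all sequences or as an absolute number, meets the threshold) are assumptions made for this example, and the definition used in the study may differ.

```python
import numpy as np


def filter_by_relative_abundance(counts, threshold):
    """Keep features whose share of all sequences is >= threshold.

    Assumption: relative abundance is the feature's total count across
    samples divided by the grand total of the table.
    """
    counts = np.asarray(counts)
    feature_totals = counts.sum(axis=0)          # one total per feature (column)
    keep = feature_totals / feature_totals.sum() >= threshold
    return counts[:, keep]


def filter_by_min_count(counts, min_count):
    """Keep features whose total count across samples is >= min_count.

    Assumption: the count threshold applies to the feature total.
    """
    counts = np.asarray(counts)
    keep = counts.sum(axis=0) >= min_count
    return counts[:, keep]


def summarize(counts):
    """Return the four quantities reported in columns (a) to (d)."""
    counts = np.asarray(counts)
    return {
        "sequences": int(counts.sum()),                               # (a)
        "features": int(counts.shape[1]),                             # (b)
        "zeros": int((counts == 0).sum()),                            # (c)
        "counts_2_to_9": int(((counts >= 2) & (counts <= 9)).sum()),  # (d)
    }


# Hypothetical toy table: 3 samples (rows) x 4 features (columns).
toy = np.array([
    [120, 0, 3, 1],
    [95, 2, 0, 0],
    [110, 0, 5, 0],
])

baseline = summarize(toy)
for label, filtered in [
    ("Relative abundance >= .01", filter_by_relative_abundance(toy, 0.01)),
    ("Count >= 2", filter_by_min_count(toy, 2)),
]:
    stats = summarize(filtered)
    # Percentages are expressed relative to the unfiltered table, as in the table above.
    pct = {k: round(100 * v / baseline[k], 2) for k, v in stats.items()}
    print(label, stats, pct)
```

Expressing each filtered value as a percentage of its unfiltered counterpart mirrors the "%" columns of the table, for example 7250 / 35 651 ≈ 20.34% of OTUs retained for Tara Oceans at relative abundance ≥ .0001.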