2015 Jan 21;16(Suppl 1):S1. doi: 10.1186/1471-2105-16-S1-S1

Table 2.

Data statistics after using CD-HIT.

Sequence identity    Training data set (6259)     Testing data set (35494)
                     Positive      Negative       Positive      Negative
100% (original)      23949         228441         110695        1217977
90%                  21621         196808         38739         325640
80%                  21165         179691         36647         284713
70%                  20709         165560         35165         255134
60%                  18588         115296         29810         162044
50%                  10216         34428          14210         41700
40%                  2658          5532           3267          6214
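
As a rough illustration of how the class balance shifts as the sequence identity cutoff is lowered, the counts in Table 2 can be converted into negative-to-positive ratios and retained fractions. The following is a minimal sketch rather than part of the original analysis: the names `TABLE2` and `summarize` are hypothetical, and the numbers are simply copied from the table.

```python
# Counts copied from Table 2.
# Key: sequence identity cutoff in percent (100 = original data).
# Value: (train_positive, train_negative, test_positive, test_negative).
TABLE2 = {
    100: (23949, 228441, 110695, 1217977),
    90: (21621, 196808, 38739, 325640),
    80: (21165, 179691, 36647, 284713),
    70: (20709, 165560, 35165, 255134),
    60: (18588, 115296, 29810, 162044),
    50: (10216, 34428, 14210, 41700),
    40: (2658, 5532, 3267, 6214),
}


def summarize(table):
    """Return per-cutoff class ratios and fractions of positives retained."""
    base_train_pos, _, base_test_pos, _ = table[100]
    rows = {}
    for cutoff, (tr_pos, tr_neg, te_pos, te_neg) in table.items():
        rows[cutoff] = {
            # Negatives per positive in each data set.
            "train_neg_per_pos": tr_neg / tr_pos,
            "test_neg_per_pos": te_neg / te_pos,
            # Share of the original positives still present at this cutoff.
            "train_pos_retained": tr_pos / base_train_pos,
            "test_pos_retained": te_pos / base_test_pos,
        }
    return rows


if __name__ == "__main__":
    for cutoff, row in summarize(TABLE2).items():
        print(
            f"{cutoff:>3}%  "
            f"train neg/pos = {row['train_neg_per_pos']:.2f}  "
            f"test neg/pos = {row['test_neg_per_pos']:.2f}  "
            f"train pos kept = {row['train_pos_retained']:.1%}  "
            f"test pos kept = {row['test_pos_retained']:.1%}"
        )
```

Computed this way, the training set goes from roughly 9.5 negatives per positive in the original data to roughly 2.1 at the 40% cutoff, while only about 11% of the original training positives remain.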