Table 2.
Test datasets | Number of reads | Read length (nt) | Reads with homologs (by BLAST) | Running time, BLAST (CPU hours) | Running time, RAPSearch (CPU hours) | Reads with homologs in the IMG protein database a: Overlap g | Reads with homologs in the IMG protein database a: BLAST-only | Reads with homologs in the IMG protein database a: RAPSearch-only |
---|---|---|---|---|---|---|---|---|
SRR020796 (2%) b | 1,164,805 | 72 | 19% e | 1,590 f | 16.8 | 218,134 (98.4%) | 2,832 (1.3%) | 745 (0.3%) |
4440037 c | 188,445 | 100 | 5% | 154 | 3.5 | 9,791 (95.3%) | 270 (2.6%) | 213 (2.1%) |
TS50 d | 622,554 | 200 | 75% | 1,000 | 54.3 | 459,509 (97.9%) | 7,339 (1.5%) | 2,683 (0.6%) |
TS28 d | 312,665 | 329 | 75% | 900 | 45.7 | 225,953 (96%) | 7,511 (3.2%) | 1,222 (0.5%) |
a: The reads were searched against the 98% non-redundant dataset of proteins collected in the IMG database, which contains a total of 4,054,694 proteins. A less stringent E-value cutoff of 1e-1 was used to define homologs for the Illumina reads (the SRR020796 dataset) because these reads are extremely short; an E-value cutoff of 1e-3 was used for the rest.

b: The dataset was downloaded from the NCBI website (from the rumen microbiota response study). Only 2% of the reads were used for testing because a BLAST search of the entire dataset would require a computer farm.

c: The dataset was from the nine biomes project [7].

d: The TS50 (4440615.3) and TS28 (4440613.3) datasets were from the Twin Study [24]. The 4440037, TS50 and TS28 datasets were downloaded from the MG-RAST server.

e: The percentage of reads that have homologs in the IMG database as identified by BLAST.

f: The running time was estimated from the running time of a BLAST search of a small fraction of the original dataset, run for comparison purposes on the same computer (Intel Xeon 2.93 GHz) on which RAPSearch was run. The actual BLAST search of the original datasets was carried out on BigRed, a computer cluster maintained at Indiana University.

g: The Overlap column lists the total number of reads that have homologs in the IMG database detected by both BLAST and RAPSearch. The BLAST-only and RAPSearch-only columns list the total number of reads whose homologs were detected by BLAST only or by RAPSearch only, respectively.
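
The three-way split described in footnote g can be sketched with set operations. This is a minimal illustration, not the authors' pipeline: the function names `partition_hits` and `percentages` and the toy read IDs are hypothetical. The sketch also assumes that the percentages in the table are taken relative to the union of reads detected by either tool. That is an inference from the numbers in the SRR020796 row, not something the footnotes state.

```python
def partition_hits(blast_hits, rapsearch_hits):
    """Split read IDs into overlap, BLAST-only, and RAPSearch-only sets."""
    blast_hits, rapsearch_hits = set(blast_hits), set(rapsearch_hits)
    return {
        "overlap": blast_hits & rapsearch_hits,          # found by both tools
        "blast_only": blast_hits - rapsearch_hits,       # found by BLAST only
        "rapsearch_only": rapsearch_hits - blast_hits,   # found by RAPSearch only
    }


def percentages(overlap, blast_only, rapsearch_only):
    """Express three counts as percentages of their sum (one decimal place).

    Assumption: the denominator is the union of reads detected by either tool.
    """
    counts = (overlap, blast_only, rapsearch_only)
    total = sum(counts)
    return tuple(round(100 * n / total, 1) for n in counts)


# Toy read IDs (hypothetical), only to show the partition.
parts = partition_hits({"r1", "r2", "r3"}, {"r2", "r3", "r4"})
print({name: sorted(ids) for name, ids in parts.items()})

# Counts taken from the SRR020796 row of the table.
print(percentages(218134, 2832, 745))  # (98.4, 1.3, 0.3)
```

With the counts from the SRR020796 row, the sketch reproduces the percentages printed in the table, which is consistent with the union being the denominator.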