2018 Jul 18;35(3):407–414. doi: 10.1093/bioinformatics/bty632

Table 2.

Compression ratio of wavelet trie and Bloom filter schemes (measured as number of bits per edge)

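Proposed: WTr, WTr (CI), BF 95%, BF 99%

| Data set | Colors (m) | gzip | bzip2 | VARI | RBF | WTr | WTr (CI) | BF 95% | BF 99% |
|---|---|---|---|---|---|---|---|---|---|
| Virus100 | 100 | 11.4 | 4.8 | 9.8 | 5.8 | 2.2 | 1.3 (52) | 0.36 | 0.44 |
| Virus1000 | 1000 | 26.5 | 7.5 | 14.7 | 9.7 | 18.2 | 5.28 (272) | 0.49 | 0.82 |
| Virus50000 | 53,412 | 135.3 | 37.7 | 56.0^a | ^a,b | 662.1 | 64.8 (1693) | 2.58 | 7.41 |
| Lactobacillus | 135 | 15.6 | 5.7 | 19.3 | 7.8 | 3.3 | 1.6 (20) | 0.95 | 1.40 |
| chr22+gnomAD | 9 | 4.6 | 2.7 | 17.3^a | 3.3^a | N/A | 1.2 (1)^c | 0.45 | 2.41 |
| hg19+gnomAD | 30 | 10.9 | 5.4 | 14.5^a | 5.6^a | N/A | 5.4 (22)^c | 0.68 | 1.82 |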

Note: Each dataset is encoded with eight compression schemes: general-purpose compression with gzip and bzip2; the existing methods specific to colored de Bruijn graphs, VARI (Muggli et al., 2017) and Rainbowfish (RBF; Almodaresi et al., 2017); the wavelet trie encoding (WTr) without and with the class indicator bits set (CI; the value in parentheses is the number of leading columns of the annotation matrix that were used as indicator columns); and the corrected Bloom filters at >95% (BF 95%) and >99% (BF 99%) accuracy. All compression ratios are measured as the average number of bits per edge. VARI was compiled with 1024-bit support.

a On these datasets, VARI and RBF results are generated by exporting the annotation data in compatible formats.

b Memory consumption exceeded the 400 GB limit.

c The class indicators were the columns representing the reference chromosomes; hence, no extra columns were added.
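The bits-per-edge metric used throughout Table 2 can be illustrated with a minimal sketch for the two general-purpose baselines. Everything here is an assumption made for illustration: the toy annotation matrix, the row-major bit packing in the hypothetical helper `pack_rows`, and the helper `bits_per_edge` are not taken from the paper, which does not describe how the annotation was serialized before gzip or bzip2 was applied.

```python
import bz2
import gzip
import random


def bits_per_edge(compressed: bytes, num_edges: int) -> float:
    """Average number of bits spent per edge (one annotation row per edge)."""
    if num_edges <= 0:
        raise ValueError("num_edges must be positive")
    return 8 * len(compressed) / num_edges


def pack_rows(rows: list[list[int]]) -> bytes:
    """Pack each 0/1 row into ceil(m / 8) bytes, most significant bit first.

    This row-major layout is an assumption made for illustration only.
    """
    out = bytearray()
    for row in rows:
        for start in range(0, len(row), 8):
            chunk = row[start:start + 8]
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | bit
            byte <<= 8 - len(chunk)  # left-align a short final chunk
            out.append(byte)
    return bytes(out)


# Toy data (hypothetical): 10,000 edges, 16 colors, rows drawn from 4 patterns.
rng = random.Random(0)
patterns = [[rng.randint(0, 1) for _ in range(16)] for _ in range(4)]
rows = [rng.choice(patterns) for _ in range(10_000)]
raw = pack_rows(rows)

for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
    print(f"{name}: {bits_per_edge(compress(raw), len(rows)):.2f} bits per edge")
```

The uncompressed toy matrix costs 16 bits per edge (one bit per color), so any value below that indicates that the compressor is exploiting repeated rows. The numbers printed by this sketch are not comparable to those in the table, which were measured on real annotation data.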