Skip to main content
Bioinformatics logoLink to Bioinformatics
letter
. 2015 Nov 28;32(7):1115–1117. doi: 10.1093/bioinformatics/btv704

Comment on: ‘ERGC: an efficient referential genome compression algorithm’

Sebastian Deorowicz 1,*, Szymon Grabowski 2, Idoia Ochoa 3,*, Mikel Hernaez 3, Tsachy Weissman 3
PMCID: PMC4907388  PMID: 26615213

Abstract

Motivation: Data compression is crucial in effective handling of genomic data. Among several recently published algorithms, ERGC seems to be surprisingly good, easily beating all of the competitors.

Results: We evaluated ERGC and the previously proposed algorithms GDC and iDoComp, which are the ones used in the original paper for comparison, on a wide data set including 12 assemblies of human genome (instead of only four of them in the original paper). ERGC wins only when one of the genomes (referential or target) contains mixed-cased letters (which is the case for only the two Korean genomes). In all other cases ERGC is on average an order of magnitude worse than GDC and iDoComp.

Contact: sebastian.deorowicz@polsl.pl, iochoa@stanford.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

The rapid growth of genomic data in the last few years demands for efficient compression methods to facilitate their storage and transfer. One of the standard problems of this kind is (referential) compression of multiple individual genomes of the same species (Deorowicz et al., 2015; Deorowicz and Grabowski, 2011; Kuruppu et al., 2011; Ochoa et al., 2014; Wandelt and Leser, 2013). As a pair of individual genomes differ only in variant loci, we can expect radical savings in storing a referentially encoded genome.

Recently, (Saha and Rajasekaran 2015) published a new referential genome compression algorithm, called ERGC. As the experimental results presented in this article seemed to suggest that ERGC is surprisingly good, winning easily in compression ratio over the previously proposed algorithms GDC (Deorowicz and Grabowski, 2011) and iDoComp (Ochoa et al., 2014), we decided to take a closer look at ERGC and repeat the experiments on a wider set of data.

2 Discussion

In the first experiment, we repeated the referential compression evaluation from the ERGC paper, where the D1,,D5 experiments are specified in Table 1 in the cited paper and the actual results are shown there in Table 2. In each Di experiment, all 24 chromosomes of one human genome are referentially compressed, given some other human genome as a reference. We repeated these experiments in our Table 1.

Table 1.

. Rerun experiments from Table 2 in the ERGC paper

Dataset Target size GDC size iDoComp size ERGC size
D1 3132 6439 6288 7890
D2 3132 29 831 30 437 9217
D3 3132 35 227 33 926 9404
D4 2938 12 293 7213 4914
D5 3124 34 806 33 158 9373

Table 2.

. Rerun experiments from Table 1 above, with the difference that the KOREAN genomes are converted to upper case

Dataset Target size GDC size iDoComp size ERGC size
D2 3132 7660 7473 9273
D3 3132 7849 7691 9449
D4 2938 1073 1020 1246
D5 3124 7820 7754 9425

The D1 experiment is skipped run since both the reference and the target in it consist of upper case DNA symbols only.

As it can be observed, the results differ significantly for cases D1 and D5. First, the cited paper incorrectly reported ‘NA’ for the GDC algorithm in these two cases. Moreover, iDoComp achieves 6288 KB on D1 and 33 158 KB on D5, rather than 65 708.47 KB and 209 380.79 KB as reported in the cited paper (note the order of magnitude difference). Finally, applying ERGC on D5 resulted in a compressed file of 9373 KB, instead of the reported 19 396.40 KB.

With the corrected results, we concluded that ERGC achieves the best compression ratio in all cases except in D1, where it is outperformed by both GDC and iDoComp. Taking a closer look, we observed that in all but D1 one of the KOREAN genomes is used either as a reference or a target genome. The KOREAN genomes differ from the other ones in that they contain both lower and upper case letters (see Supplementary material). They also contain non-standard symbols, similarly to the YH genome.

To better understand if the potential of ERGC is due to the design of the algorithm itself, or to the handling of the lower and upper case letters and/or the non-standard symbols, we simulated these experiments again (D2,,D5) with the KOREAN genomes transformed to all upper case. Surprisingly, as demonstrated in Table 2, ERGC achieved the worst compressed size in all cases. The reported results suggest that the gain of ERGC comes from the handling of the lower and upper case symbols.

Since the previous experiments may not be representative enough, we decided to use a wider set of data. Specifically, we chose the human genome datasets from the GDC 2 paper (Deorowicz et al., 2015), which is a superset of the collections from (Deorowicz and Grabowski, 2011) and (Ochoa et al., 2014). To reduce the amount of computations, we chose only two chromosomes from each genome: chr10 and chr20. Then, to follow the ERGC paper methodology, we tested referential compression in pairs of chromosomes, in a round-robin fashion. In Table 3, we show the sums of compressed sizes; detailed results are presented in the Supplementary material. ERGC is outperformed in all cases by GDC and iDoComp.

Table 3.

Summary of the results for all the considered pairs both for chromosomes 10 and 20

Reference Chromosome 10
Chromosome 20
Target size GDC size iDoComp size ERGC size Target size GDC size iDoComp size ERGC size
CHM1_1.0 1 495 816 10 229 11 805 289 929 694 646 4242 5132 135 412a
CHM1_1.1 1 496 019 9505 11 054 290 014 694 641 4252 5088 100 306
CSA 1 497 653 51 057 59 500b 339 998 695 024 13 741 15 695 146 233
HG17 1 496 426 8146 7801 167 270 695 127 3457 3893 48 193
HG18 1 496 465 8106 7753 145 209 695 127 3457 3893 48 193
HG19 1 496 303 8153 7913 227 746 694 529 3472 3919 48 294
HG38 1 498 065 9329 9543 222 986 693 090 2743 2798 78 216
HuRef 1 502 946 12 680 15 582 295 338 697 946 5037 5862 95 444
KO131 1 496 143 16 388 18 246b 158 849 694 978 6169 6984b 75 827
KO224 1 496 143 14 736 16 997b 158 635 694 978 5486 6543b 68 173
WGSA 1 503 392 25 607 31 897 336 064 697 939 10 403 12 470 137 623
YH 1 496 143 9388 8785 153 532 694 978 4063 4406 49 951

aThis total ERGC result is not certain, as ERGC crashed on the pair of sequences: CHM1_1.0 (reference), CHM1_1.1 (target).

bThese total iDoComp results are taken from its bug-fix version (https://github.com/mikelhernaez/iDoComp, v1.2) as the earlier version v1.1 crashed on five pairs of sequences.

3 Conclusion

From the conducted experiments we can draw several conclusions. Firstly, the reported compression results on the D1 and D5 cases in the cited paper are not correct for both GDC and iDoComp. Secondly, ERGC seems to get the main advantage due to specific handling of lower and upper case difference between the target and the reference genome. We believe that using the KOREAN genomes is fair, however, the authors should have provided simulations on a wider set of data where the gain does not come from the lower and upper case difference. Finally, after running several experiments on different datasets, we can conclude that ERGC performs in general poorly when compared with GDC and iDoComp.

Funding

This work has been supported by a fellowship from the Basque Government, a grant from the Center for Science of Information (CSoI), the 2014-07364-01 NIH grant and the 1157849-1-QAZCC NSF grant. It is also supported by the Polish National Science Centre under the project DEC-2011/03/B/ST6/01588. The infrastructure supported by POIG.02.03.01-24-099/13 grant: ‘GeCONiI-Upper’ Silesian Center for Computational Science and Engineering’.

Conflict of Interest: none declared.

Supplementary Material

Supplementary Data

References

  1. Deorowicz S. et al. (2015) GDC 2: Compression of large collections of genomes,. Sci. Reports, 5, Article no. 11565. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Deorowicz S., Grabowski S. (2011) Robust relative compression of genomes with random access,. Bioinformatics, 27, 2979–2986. [DOI] [PubMed] [Google Scholar]
  3. Kuruppu S. et al. (2011) Optimized relative Lempel-Ziv compression of genomes In: Reynolds M. (ed.) Proceedings of the ACSC Australasian Computer Science Conference. Australian Computer Society, Inc, Sydney, Australia, pp. 91–98. [Google Scholar]
  4. Ochoa I. et al. (2015) iDoComp: a compression scheme for assembled genomes. Bioinformatics, 31, 626–633. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Saha S., Rajasekaran S. (2015) ERGC: An efficient referential genome compression algorithm. Bioinformatics, 31, 3468–3475. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Wandelt S., Leser U. (2013) FRESCO: Referential compression of highly similar sequences,. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 10, 1275–1288. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data
supp_btv704_supp.pdf (169.9KB, pdf)

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES