Bioinformation. 2011 Jan 22;5(8):350–360. doi: 10.6026/97320630005350

DNABIT Compress – Genome compression algorithm

Pothuraju Rajarajeswari 1,*, Allam Apparao 2
PMCID: PMC3046040  PMID: 21383923

Abstract

Data compression is concerned with how information is organized in data. Efficient storage means removing redundancy from the data stored in the DNA molecule. Data compression algorithms remove redundancy and also help in understanding biologically important molecules. We present a compression algorithm, “DNABIT Compress”, for DNA sequences, based on a novel scheme of assigning binary bits to small segments of DNA bases so as to compress both repetitive and non-repetitive DNA sequences. The proposed algorithm achieves the best compression ratio for DNA sequences of larger genomes. The significantly better compression results indicate that DNABIT Compress outperforms the other compression algorithms compared. While achieving the best compression ratios for DNA sequences (genomes), DNABIT Compress also significantly improves on the running time of all previous DNA compression programs. Assigning binary bits (a unique BIT CODE) to exact-repeat and reverse-repeat fragments of a DNA sequence is also a concept introduced in this algorithm for the first time in DNA compression. The proposed algorithm achieves a compression ratio as low as 1.58 bits/base, whereas the best existing methods could not achieve a ratio below 1.72 bits/base.

Keywords: Ziv-Lempel, BioCompress, GenCompress, Arithmetic coding, DNA compress

Background

Life is strongly associated with organization and structure [1]. The 1000 Genomes Project is estimated to generate about 8.2 billion bases per day, with the total sequence exceeding 6 trillion nucleotide bases on completion. The DNA molecule is a concatenation of four kinds of nucleotides: adenine, thymine, cytosine and guanine (A, T, C, G). General-purpose compression algorithms do not perform well on biological sequences. Giancarlo et al. [2] have provided a review of compression algorithms designed for biological sequences. Finding the characteristics of genomes and comparing them is a major task (Koonin 1999 [3]; Wooley 1999 [4]). From a mathematical point of view, compression implies understanding and comprehension (Li and Vitanyi 1998) [5]. Compression is a powerful tool for genome comparison and for studying various properties of genomes. DNA sequences, which encode life, should be compressible. It is well known that DNA sequences in higher eukaryotes contain many tandem repeats, and that essential genes (such as rRNAs) have many copies. It has also been shown that genes sometimes duplicate themselves for evolutionary purposes. All these facts suggest that DNA sequences should be compressible. However, the compression of DNA sequences is not an easy task (Grumbach and Tahi 1994 [6]; Rivals et al. 1995 [7]; Chen et al. 2000 [8]). DNA sequences consist of only four nucleotide bases {a, c, g, t}, so two bits are enough to store each base. Standard compression software such as “compress”, “gzip”, “bzip2” and “winzip” tends to expand a DNA genome file rather than compress it. Most existing software tools work well for English text compression (Bell et al. 1990 [9]) but not for DNA genomes. The growing volume of genome sequence data makes DNA databases two to three times larger every year, so it becomes very hard to download and maintain the data on a local system.
Other algorithms used specifically for DNA sequence compression, such as the Ziv-Lempel compression algorithms [10, 11], Biocompress [12], Gencompress [13] and DNAcompress [14], did not manage to achieve an average compression rate below 1.7 bits/base.

Their average compression rate is about 1.74 bits per base. Hence we present a new compression algorithm named “DNABIT Compress”, whose compression rate is below 1.56 bits per base (in the best case), even for larger genomes (nearly 200,000 characters).
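The 2 bits/base baseline mentioned above can be illustrated with a minimal sketch. This is a plain fixed-width packing, not the DNABIT Compress bit-code scheme (which is described only in the supplementary material); the names `CODE`, `pack` and `unpack` and the particular base-to-bits mapping are assumptions made for illustration.

```python
# Hypothetical 2-bit baseline codec; NOT the DNABIT Compress bit-code scheme.
# The mapping below is an arbitrary choice made for this illustration.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASE = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODE[ch.upper()]
        # Left-align a short final chunk so unpack reads from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Recover the first n_bases bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    # The sequence length is stored separately to discard padding bases.
    return "".join(bases[:n_bases])


seq = "ACGTTGCAAC"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
print(len(packed) * 8 / len(seq), "bits per base including padding")
```

Any DNA-specific compressor has to beat this trivial encoding, which is why rates such as 1.74 or 1.56 bits per base are quoted against the 2 bits/base reference point.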

Existing compression algorithms

General-purpose compression algorithms do not perform well on biological sequences, quite often resulting in expansion rather than compression. Probably the best-known compressors, the Ziv-Lempel compression algorithms [10, 11], are based on an idea of complexity presented by Lempel and Ziv in [15]. Grumbach and Tahi proposed two lossless compression algorithms for DNA sequences, namely Biocompress and Biocompress-2, using the data compression technique of Ziv and Lempel (1977). Biocompress-2 detects exact repeats and complementary palindromes located in the target sequence and then encodes them by the repeat length and the position of a previous occurrence. Biocompress-2 also uses order-2 arithmetic coding if no significant repetition is found. Biocompress uses the same methodology as Biocompress-2, except that it does not use order-2 arithmetic coding. Gencompress (Chen et al. 2000 [13]) achieves significantly higher compression ratios than either Biocompress-2 or Cfact. Gencompress is a one-pass algorithm that searches for approximate matches satisfying a condition referred to as condition C. The algorithm carefully finds the optimal prefix and uses order-2 arithmetic encoding (Nelson 1991; Bell et al. 1990) whenever needed. Gencompress also detects approximate complemented palindromes in DNA sequences. Its average compression ratio is 1.7428. DNAcompress employs the Lempel-Ziv compression scheme, as do Biocompress-2 and Gencompress. It consists of two phases: finding all approximate repeats, including complemented palindromes, and encoding the approximate repeat regions and the non-repeat regions. Its average compression ratio was found to be 1.7254. Behzadi and Fessant find repeats at the cost of a dynamic programming search and, for the non-repeated parts, select among a second-order Markov model, a context tree and two-bit coding.
They use an expert model with Bayesian averaging over a second-order Markov model, a first-order Markov model estimated on short-term data (the last 512 symbols) and a repeat model. Complexity expressed per element can provide information about the structure of regions and the local properties of a genome or proteome. Perhaps more than for compressing biological sequences, compression algorithms, in particular variants of the Ziv-Lempel algorithms [10, 11], have been useful as measures of evolutionary distance. Li et al. [16] used Gencompress as the distance estimator in hierarchical clustering of biological sequences, as a solution to the phylogeny construction problem.
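The repeat detection attributed to Biocompress-2 above (exact repeats and complementary palindromes, encoded by length and position of an earlier occurrence) can be sketched as follows. This is a naive illustrative search, not the published Biocompress-2 implementation; the function names, the `min_len` threshold and the returned tuple layout are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_earlier_match(seq: str, pos: int, min_len: int = 4):
    """Find the longest prefix of seq[pos:] that already occurs in seq[:pos],
    either directly ("repeat") or as its reverse complement ("palindrome").

    Returns (kind, start, length) or None. Naive search, illustration only.
    """
    history = seq[:pos]
    best = None
    length = min_len
    while pos + length <= len(seq):
        target = seq[pos:pos + length]
        kind, start = "repeat", history.find(target)
        if start < 0:
            kind, start = "palindrome", history.find(reverse_complement(target))
        if start < 0:
            break  # a longer prefix cannot match if this one does not
        best = (kind, start, length)
        length += 1
    return best


# Direct repeat: "GATTACA" at position 9 already occurred at position 0.
print(longest_earlier_match("GATTACAGGGATTACA", 9))
# Complementary palindrome: "ACGGTT" is the reverse complement of "AACCGT".
print(longest_earlier_match("AACCGTTTACGGTT", 8))
```

An encoder built on such a search would emit the (kind, start, length) triple instead of the matched bases whenever the triple is cheaper than coding the bases directly, and fall back to a statistical coder otherwise.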

Methodology

Please see (supplementary material) for methodology.

Experimental Results

To evaluate our algorithm, we compressed a standard set of DNA sequences and compared the results with those published for other efficient DNA compressors. A test prototype was implemented to assess the capability of our framework. The code is written in Java and compiled with the Java 1.2 SDK. Tests were run on an Intel Pentium based system. We tested our DNABIT program on a dataset of DNA sequences typically used in DNA compression studies. The dataset includes nine sequences: two chloroplast genomes (CHMPXX and CHNTXX); four human genes (HUMDYSTROP, HUMHBB (human beta globin region on chromosome 11), HUMHDABCD and HUMHPRTB (human hypoxanthine phosphoribosyl transferase gene)); one mitochondrial genome (MPOMTCG); and the genomes of two viruses (HEHCMVCG and VACCG).

The results are shown in Table 10. The table shows that our proposed DNABIT Compress program performs better than the other programs. The existing algorithms compared with DNABIT Compress are normal CTW, CTW+LZ, Biocompress-2, Gencompress, DNAcompress and DNAPack. In our experiments, we compressed DNA genomes of up to nearly 200,000 bases. Compression takes on the order of seconds. DNABIT Compress performs better than the best existing algorithms, Gencompress and DNAPack, whose compression ratio is approximately 1.7 bits/base. The compression ratio of DNABIT Compress is as low as 1.58 bits/base.
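The bits/base figure used throughout is simply the compressed size in bits divided by the number of bases. The sketch below computes it for a general-purpose compressor, using Python's standard `zlib` module (the DEFLATE method also used by gzip). The synthetic test sequences are assumptions made for illustration and are not the benchmark sequences of this paper.

```python
import random
import zlib


def bits_per_base(seq: str, compressed: bytes) -> float:
    """Compression ratio in bits per base: compressed bits / number of bases."""
    return 8 * len(compressed) / len(seq)


# Synthetic inputs (assumptions for illustration, not the paper's dataset).
random.seed(0)
random_seq = "".join(random.choice("ACGT") for _ in range(100_000))
repetitive_seq = "ACGTTGCA" * 12_500

for name, seq in [("random", random_seq), ("repetitive", repetitive_seq)]:
    compressed = zlib.compress(seq.encode("ascii"), 9)
    print(name, round(bits_per_base(seq, compressed), 3), "bits/base")
```

A uniformly random sequence over four symbols carries 2 bits of entropy per base, so no lossless compressor can be expected to go below 2 bits/base on it, while a strongly repetitive sequence compresses far below that. Real genomes lie between these two extremes, which is why small differences such as 1.74 versus 1.58 bits/base are meaningful.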

DNABIT Compress Comparison with other Compressors

Comparing compression algorithms designed for DNA sequences is difficult because the source or executable code of the compressors is usually not available. Another reason is that the space and time requirements of most compressors make it difficult to test them on the sequences of larger genomes. Complete data on running times are not available. In [14] the authors report some compression times for the file hehcmvcg (229354 bases) on a 700 MHz Pentium III. According to [14], the compression of hehcmvcg takes a few hours for CTW+LZ and 51 seconds for GenCompress-2 [18]; our proposed algorithm, DNABIT Compress, takes only 10 seconds. Unfortunately, we could not test the performance of DNACompress in depth, since it is based on the PatternHunter [17] search engine, which is not freely available.

Finally, we would like to comment on the performance of Off-line which, like our algorithms, encodes only exact repeats. While we use a simple and fast procedure to find repeats, Off-line is designed to search for an “optimal” sequence of textual substitutions. It is therefore not surprising that Off-line is much slower than our algorithms (for the yeast sequences the difference is a factor of 1000 or more). The experimental results show that our algorithms are an order of magnitude faster than the other DNA compressors and that their compression ratio is very close to that of the best compressors. Thanks to its speed and small working space, our algorithm has been able to compress sequences of length up to 220 MB, which is well beyond the range of any other DNA compressor. Figure 1 compares the compression ratios of the proposed DNABIT Compress algorithm with those of popular existing algorithms.

Figure 1. Comparison of compression ratios of DNABIT Compress with existing algorithms

DNA COMPRESS JAVA TOOL

Screens of the proposed DNABIT Compress tool are shown in Figures 2, 3 and 4. The tool offers two modes, encrypt (compress) and decrypt (decompress). A DNA genome of any length can be given as input in the input column with the encryption option selected. The compression ratio is displayed in the output column together with the time taken for the computation. The encrypted text can be converted back to its original DNA sequence by using the decryption option, in which case the original DNA text is displayed in the output column. Sample code is given in the supplementary material.

Detecting Tandem repeats of Repetitive Nucleotides

In addition to the compression technique, an algorithm is designed to find tandem repeats, since DNABIT Compress focuses on compressing repetitive bases. The detection of tandem repeats is important in biology and medicine, as it can be used for phylogenetic studies and disease diagnosis. This paper proposes two techniques for detecting approximate tandem repeats (ATRs) in DNA sequences. Our algorithm counts consecutive identical repeats of two bases (dinucleotides) and of three bases (trinucleotides or codons). Tandem repeats (TRs) are defined as two or more contiguous approximate copies of a pattern of nucleotides. Tandem repeats are known to play important roles in human disease, regulation and evolution.

Repetitive structures are present in over one-third of the human genome [18]. The expansion of a trinucleotide repeat results in anticipation, that is, progression in the severity of the disorder through each generation. In general, there is a correlation between the size of the expansion and the severity of the phenotype. Furthermore, instabilities in dinucleotide repeat sequences have been observed in colon cancer [19]. Biological mechanisms for the expansion of repeats include defects in the mismatch repair system, polymerase slippage during replication, and genetic instability of some DNA structures [21, 22]. Repeats play a role in gene regulation when present in regions with transcription factors [22]. Sample code is given in the supplementary material.
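As an illustration of counting consecutive dinucleotide and trinucleotide repeats, a minimal sketch is given below. It is not the authors' supplementary code, and it detects exact tandem copies only, whereas the text above also mentions approximate repeats. The function name `find_tandem_repeats` and the `min_copies` parameter are assumptions.

```python
def find_tandem_repeats(seq: str, unit_len: int, min_copies: int = 2):
    """Return (start, unit, copies) for runs of exact tandem copies of a unit.

    unit_len=2 scans for dinucleotide repeats, unit_len=3 for trinucleotides.
    Homopolymer runs such as "AAAA" are reported as repeats of "AA" as well.
    """
    results = []
    i = 0
    n = len(seq)
    while i + unit_len <= n:
        unit = seq[i:i + unit_len]
        copies = 1
        j = i + unit_len
        # Extend the run while the next block equals the repeat unit.
        while seq[j:j + unit_len] == unit:
            copies += 1
            j += unit_len
        if copies >= min_copies:
            results.append((i, unit, copies))
            i = j  # continue scanning after the run
        else:
            i += 1
    return results


print(find_tandem_repeats("GGCACACATT", 2))        # dinucleotide scan
print(find_tandem_repeats("TTCAGCAGCAGCAGAA", 3))  # trinucleotide scan
```

Each reported run could then be stored as a unit plus a copy count instead of the raw bases, which is the kind of redundancy a repeat-oriented compressor exploits.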

Conclusion

A simple DNA compression algorithm, completely new in its design, is proposed to compress DNA sequences that are repetitive as well as non-repetitive in nature. Data compression brings out theoretical ideas such as entropy, mutual information and complexity between sequences of different genomes. Data compression also plays a vital role in analyzing biological sequences to discover hidden patterns and to infer phylogenetic relationships between organisms, which are areas of active research in bioinformatics. With the DNABIT Compress algorithm, large volumes of DNA sequence data can be compressed more easily, with an average compression ratio of 1.5359 bits per base, which will be very useful in sequence comparison and multiple sequence alignment analysis. The simplicity and flexibility of the DNABIT Compress algorithm could make it an invaluable tool for DNA compression in clinical research.

Limitations

A summative evaluation of outcomes, such as testing real biological sequences with this algorithm, assessing performance with the tools, or transferring the knowledge to similar tasks, could not be performed.

(1) The compression algorithm can be applied to compute phylogeny and multiple sequence alignment of various DNA sequences. (2) A new DNA compressor can be developed that produces compressed text data rather than binary bits.

Supplementary material

Data 1
97320630005350S1.pdf (271.8KB, pdf)

Figure 2. DNAcompress tool screen.

Figure 3. Compression ratio displayed in the output box.

Figure 4. Decompressed text in the output box.

Acknowledgments

The authors are thankful for the support rendered by V. K. Kumar during the development phase.

Footnotes

Citation: Rajarajeswari & Apparao, Bioinformation 5(8), 350-360 (2011)

References

  • 1. Schrodinger E. Cambridge, UK: Cambridge University Press; 1944.
  • 2. Giancarlo R, et al. Bioinformatics. 2009;25:1575. doi: 10.1093/bioinformatics/btp117.
  • 3. Koonin EV, et al. Bioinformatics. 1999;15:265. doi: 10.1093/bioinformatics/15.4.265.
  • 4. Wooley JC, et al. J. Comput. Biol. 1999;6:459. doi: 10.1089/106652799318391.
  • 5. Bennett CH, et al. IEEE Trans. Inf. Theory. 1998;44:4.
  • 6. Grumbach S, Tahi F. Journal of Information Processing and Management. 1994;30(6):875.
  • 7. Rivals E, et al. LIFL, Lille I University, technical report IT-285. 1995.
  • 8. Chen X, et al. Proceedings of the Fourth Annual International Conference on Computational Molecular Biology; 2000. Tokyo, Japan.
  • 9. Bell TC, et al. New York: Prentice Hall; 1990.
  • 10. Ziv J, Lempel A. IEEE Trans. Inf. Theory. 1977;23:337.
  • 11. Ziv J, Lempel A. IEEE Trans. Inf. Theory. 1978;24:530.
  • 12. Grumbach S, Tahi F. Proceedings of the IEEE Data Compression Conference; 1993. Snowbird, UT, USA.
  • 13. Chen X, et al. Proceedings of the Fourth Annual International Conference on Computational Molecular Biology; 2000. Tokyo, Japan.
  • 14. Chen X, et al. Bioinformatics. 2002;18:1696. doi: 10.1093/bioinformatics/18.12.1696.
  • 15. Lempel A, Ziv J. IEEE Trans. Inf. Theory. 1976;22:75.
  • 16. Li M, et al. IEEE Trans. Inf. Theory. 2003;50:3250.
  • 17. Ma B, et al. Bioinformatics. 2002;18:440.
  • 18. Lander ES, et al. Nature. 2001;409:860.
  • 19. Thibodeau S, et al. Science. 1993;260:816. doi: 10.1126/science.8484122.
  • 20. Mitas M. Nucleic Acids Research. 1997;25:2245. doi: 10.1093/nar/25.12.2245.
  • 21. Wells RD, et al. Journal of Biological Chemistry. 1996;271:2875. doi: 10.1074/jbc.271.6.2875.
  • 22. Perutz M, et al. Current Opinion in Structural Biology. 1996;6:848. doi: 10.1016/s0959-440x(96)80016-9.


Articles from Bioinformation are provided here courtesy of Biomedical Informatics Publishing Group
