Abstract
Motivation: Genome sequencing has become faster and more affordable, and the number of available complete genomic sequences is increasing rapidly. As a result, the cost to store, process, analyze and transmit the data is becoming a bottleneck for research and future medical applications, so the need for efficient data compression and data reduction techniques for biological sequencing data grows by the day. Although a number of standard data compression algorithms exist, they are not efficient in compressing biological data because they do not exploit the inherent properties of sequencing data. Exploiting the statistical and information-theoretic properties of genomic sequences requires specialized compression algorithms. Five different next-generation sequencing data compression problems have been identified and studied in the literature. We propose a novel algorithm for one of these problems, known as reference-based genome compression.
Results: We have performed extensive experiments using five real sequencing datasets. The results on real genomes show that our proposed algorithm is competitive with, and achieves better compression ratios than, the currently best performing algorithms for this problem. The time to compress and decompress whole genomes is also very promising.
Availability and implementation: The implementations are freely available for non-commercial purposes. They can be downloaded from http://engr.uconn.edu/∼rajasek/ERGC.zip.
Contact: rajasek@engr.uconn.edu
1 Introduction
Next-generation sequencing (NGS) technologies produce millions to billions of short reads from DNA molecules simultaneously in a single run within a very short time, leading to a sharp decline in whole-genome sequencing costs. As a result, we are observing an explosion of genomic data from various species. Storing these data is a task that biologists have to perform on a daily basis, and compression can play an important role in saving space. Likewise, as the amount of data transmitted through the Internet increases, transmission cost and network congestion increase proportionally; here again compression can help. Although sequencing data can be compressed with standard general-purpose algorithms, these algorithms may not compress biological sequences effectively, since they do not exploit inherent properties of the biological data. Genomic sequences often contain repetitive elements, e.g. microsatellite sequences. The input sequences might also exhibit high levels of similarity; an example is multiple genome sequences from the same species. Additionally, the statistical and information-theoretic properties of genomic sequences can potentially be exploited. General-purpose algorithms do not exploit these properties. In this article, we offer a novel algorithm to compress genomic sequences effectively and efficiently. Our algorithm achieves compression ratios that are better than those of the currently best performing algorithms in this domain. By compression ratio, we mean the ratio of the uncompressed data size to the compressed data size.
The following five versions of the compression problem have been identified in the literature: (i) genome compression with a reference: here we are given many (hopefully very similar) genomic sequences. The goal is to compress all the sequences using one of them as the reference. The idea is to utilize the fact that the sequences are very similar. For every sequence other than the reference, we only have to store the difference between the reference and the sequence itself; (ii) reference-free genome compression: this is the same as Problem 1, except that there is no reference sequence. Each sequence has to be compressed independently; (iii) reference-free reads compression: it deals with compressing biological reads where there is no clear choice for a reference; (iv) reference-based reads compression: in this technique, complete reads data need not be stored but only the variations with respect to a reference genome are stored; and (v) metadata and quality scores compression: in this problem, we are required to compress quality sequences associated with the reads and metadata such as read name, platform and project identifiers.
In this article, we focus on Problem 1. We present an algorithm called ERGC (Efficient Referential Genome Compressor) that compresses a target genome with respect to a reference genome. It employs a divide-and-conquer strategy: it first divides both the target and reference sequences into parts of equal size and finds one-to-one mappings of similar regions between each pair of parts; it then outputs these mappings along with the dissimilar regions of the target sequence. The rest of this article is organized as follows: Section 2 has a literature survey. Section 3 describes the proposed algorithm and analyses its time complexity. Our experimental platform is explained in Section 4, which also contains the experimental results and discussion. Section 5 concludes the article.
2 A survey of compression algorithms
We now survey some of the algorithms that have been proposed in the literature to solve Problem 1. In referential genome compression, the goal is to compress a large set S of similar sequences. The core idea of reference-based compression can be described as follows. We first choose the reference sequence R. Then we compress every other sequence by comparing it with R. The target (i.e. the current sequence to be compressed) is first aligned to the reference. Then mismatches between the target and the reference are identified and encoded. Each record of a mismatch may consist of the position with respect to the reference, the type (e.g. insertion, deletion or substitution) of mismatch and the value.
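To make this concrete, a minimal Java sketch of what such a mismatch record might hold is shown below. The class and field names are ours and are purely illustrative; they do not reflect the on-disk format of any of the tools surveyed here.

```java
// Illustrative sketch of a mismatch record used in referential compression:
// position with respect to the reference, the type of difference and its value.
public class MismatchRecord {
    enum Type { SUBSTITUTION, INSERTION, DELETION }

    final long position;   // position with respect to the reference
    final Type type;       // kind of difference
    final String value;    // the differing bases (e.g. the inserted or substituted characters)

    MismatchRecord(long position, Type type, String value) {
        this.position = position;
        this.type = type;
        this.value = value;
    }

    @Override
    public String toString() {
        return position + ":" + type + ":" + value;
    }

    public static void main(String[] args) {
        // Example: the target has a C where the reference has an A at position 1042.
        System.out.println(new MismatchRecord(1042, Type.SUBSTITUTION, "C"));
    }
}
```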
Brandon et al. (2009) have used various coders like Golomb (Golomb, 1966), Elias (Peter, 1975) and Huffman (Huffman, 1952) to encode the mismatches. Wang and Zhang (2011) have presented a de novo compression program, GRS, which obtains variant information by using a modified Unix diff program. The algorithm GReEn (Pinho et al., 2012) employs a probabilistic copy model that calculates target base probabilities based on the reference; given these probabilities as input, an arithmetic coder is then used to encode the target. Recently, another algorithm, namely iDoComp (Ochoa et al., 2014), has been proposed which outperforms some of the previously best known algorithms like GRS, GReEn and GDC. GDC (Deorowicz and Grabowski, 2011) is an LZ77-style compression scheme for relative compression of multiple genomes of the same species. In contrast to the algorithms mentioned above, Christley et al. (2009) have proposed the DNAzip algorithm, which exploits the human population variation database, where a variant can be a single-nucleotide polymorphism (SNP) or an indel (an insertion or a deletion of multiple bases). Some other notable algorithms that employ VCF (Variant Call Format) files to compress genomes have been given by Deorowicz et al. (2013) and Pavlichin et al. (2013). These algorithms have been used in the 1000 Genomes project to encode SNPs and other structural genetic variants. Next, we describe some of the best known algorithms in this domain in more detail.
2.1 GRS
GRS is a reference-based genome compression tool that depends exclusively on the Unix program diff, as mentioned above. Specifically, the primary step of GRS is to find the longest common subsequences of the two input strings: the auxiliary Unix program diff is employed to calculate a similarity measure between a target genomic sequence and a reference genomic sequence. If the similarity score exceeds a predefined threshold, the difference between the target and reference sequences is encoded using Huffman encoding. If the similarity score is below the threshold, the target and reference sequences are split into smaller blocks and the computation is restarted on each pair of blocks. It is to be noted that GRS does not require any additional information about the sequences, e.g. a reference SNP map. However, if the difference between the target and reference sequences is excessive, GRS will not be able to compress the target sequence effectively.
2.2 GDC
GDC is an LZ77-style (Ziv and Lempel, 1977) compression algorithm closely related to RLZopt (Shanika et al., 2011): GDC performs a non-greedy parsing of the target against the reference using hashing, whereas RLZopt uses a suffix array. The main difference between GDC and the other reference-based compression tools is that it can choose a suitable reference sequence (or more than one) among the set of genomic sequences from the same species using a heuristic and use it to compress the rest. It also introduces a clever trick to encode approximate matches. The algorithm slightly alters the original Lempel-Ziv parsing scheme by considering trade-offs between the length of matches and the distance between matches. Compression is performed on input blocks with shared Huffman codes, enabling random access.
2.3 GReEn
GReEn is also a reference-based genome compression tool. It encodes the target sequence using an arithmetic encoder: it first generates statistics from the reference sequence and then compresses the target by arithmetic coding, with the encoder using the previously computed statistics. Experimental results show that GReEn outperforms both GRS and the non-optimized RLZ. Similar to the non-referential compression scheme XM (Cao et al., 2007), GReEn introduces a copy expert model, which tries to find identical k-mers between the target and reference sequences. Raw characters, in the form of arbitrary ASCII characters, are encoded with arithmetic encoding. There is also a special case where the target and reference sequences have equal length: although not justified, GReEn then assumes that the sequences are already aligned and can be distinguished by SNPs.
2.4 iDoComp
The basic functioning of iDoComp can be summarized in three main steps: (i) mapping generation: in this stage, the target genome is expressed in terms of the reference genome. It uses suffix arrays to parse the target into the reference; (ii) post-processing: the post-processing looks for consecutive matches that can be merged together and converted into an approximate match and (iii) entropy encoding: entropy encoder compresses the mapping and generates the compressed file.
3 Methods
3.1 Our algorithm
We have developed a reference-based genome compression algorithm called ERGC. It performs better than the best known algorithms in the current literature. Our algorithm runs in stages, each independent of the previous stage(s). In this setting, it can be readily transferred from an in-core to an out-of-core model and from a single-core to a multi-core environment. We discuss these enhancements later in this section. Details of our algorithm follow. Assume that R is the reference sequence and T is a target sequence to be compressed. At first ERGC divides the entire reference and target genomes into parts of equal size and processes each pair of parts sequentially. If the parts in R and T are r1, r2, …, rq and t1, t2, …, tq, respectively, then r1 and t1 are processed first, r2 and t2 are processed next and so on.
Let (ri, ti) be the pair processed at some point in the algorithm (where ri comes from the reference genomic sequence R and ti comes from the target genomic sequence T). To find the similarities between ri and ti, we need to align ti onto ri. Similar regions between the two sequences could be found by globally aligning ti onto ri with the Needleman–Wunsch algorithm, since the sequences in the query set are similar and of roughly equal size. However, the global alignment algorithm is based on dynamic programming and has quadratic time and space complexity, which makes it very time and space intensive, especially when the sequences are very long. In this context, we have devised our own greedy alignment algorithm to find similar regions between two sequences with high confidence (it is applicable when the sequences of interest are highly similar, e.g. two genomic sequences of the same species). We describe this greedy algorithm next.
Our greedy alignment algorithm is based on hashing. At first, the algorithm generates all the k-mers from ri and hashes them into a hash table H (for some suitable value of k). It then generates k-mers from ti one at a time and hashes them to H until one of these k-mers collides with an entry in H. If none of the k-mers collides with an entry in H, the algorithm generates another set of k′-mers (where k′ ≠ k) from ri and hashes them into a hash table H′. In a similar way, it then generates k′-mers from ti one at a time and hashes them to H′ until one of these k′-mers collides with an entry in H′. We employ a predefined set of values for k and try these values one at a time until a collision happens. The reason for using a range of k-mer lengths is that substitutions, insertions and/or deletions can be more frequent in some parts of ri and ti than in others; a range of values for k increases the chance that a collision will occur for at least one value. If there is no such collision, it is not possible to align ti onto ri. If none of the values of k from this set results in a collision, the algorithm extends the length of ri on both sides and a similar scheme is followed as described above. If all of the above mentioned techniques fail, then ti is saved as a raw sequence. Otherwise we align ri and ti with the k-mer that causes a collision as the anchor and extend the alignment beyond that position until there is a mismatch between ri and ti. We record the matching length and the starting position of this stretch of matching in the reference genome. At this point, we delete the matching sequences from ri and ti and align the rest using our greedy algorithm as described above, until the length of ri or ti becomes zero or no further alignment is possible. This is how the algorithm proceeds iteratively. Next we describe how ERGC takes care of unmatched sequences.
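Before turning to that, the anchor search just described can be sketched as follows. This is a simplified illustration under our own class and method names (ERGC's actual implementation may differ, e.g. in how k-mer positions are bucketed):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the greedy anchor search: hash every k-mer of the
// reference part r_i, then scan the k-mers of the target part t_i until one
// of them collides with an entry in the hash table.
public class AnchorSearch {

    // Returns {posInRef, posInTarget} of the first shared k-mer, or null if no collision occurs.
    static int[] findAnchor(String refPart, String targetPart, int k) {
        Map<String, Integer> table = new HashMap<>();
        // Keep the smallest position for each k-mer of the reference part.
        for (int i = 0; i + k <= refPart.length(); i++) {
            table.putIfAbsent(refPart.substring(i, i + k), i);
        }
        // Generate target k-mers one at a time and probe the table.
        for (int j = 0; j + k <= targetPart.length(); j++) {
            Integer hit = table.get(targetPart.substring(j, j + k));
            if (hit != null) {
                return new int[] { hit, j };   // collision: use this k-mer as the anchor
            }
        }
        return null;                            // no collision for this value of k
    }

    public static void main(String[] args) {
        // Toy example with a small k; ERGC's defaults are k = 21 and then k = 9.
        int[] anchor = findAnchor("ACGTACGTGGA", "TTACGTGGA", 5);
        System.out.println("ref pos = " + anchor[0] + ", target pos = " + anchor[1]);
    }
}
```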
As there can be substitutions, insertions and/or deletions in the reference and target genomes, some portions of the genomes between two alignments will not match perfectly. In this case, we align those sequences using edit scripting. If the edit distance is large enough that storing the edit script would exceed the cost of storing the unmatched sequence of the target genome, we discard the edit script and store the raw sequence. The information generated to compress the target sequence is stored in an ASCII-formatted file. After having processed all the parts of R and the corresponding parts of T, we compress the starting positions and matching lengths using delta encoding. The resulting file is further compressed using the PPMd lossless data compression algorithm, a variant of prediction by partial matching (PPM), which is an adaptive statistical data compression technique based on context modeling and prediction. For more details, the reader is referred to Moffat (1990). Recent implementations of PPMd are effective in compressing text files containing natural language text; the open-source 7-Zip compression utility provides several compression options including PPMd. Details of the algorithm are shown as Algorithm 1.
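As a generic illustration of the delta-encoding step (not ERGC's exact file layout), the recorded match start positions could be transformed as follows before the resulting text is handed to a PPMd-based compressor such as the one in 7-Zip:

```java
import java.util.Arrays;

// Generic illustration of delta encoding applied to recorded match start
// positions: each value is replaced by its difference from the previous one,
// which yields small numbers that a downstream PPMd coder compresses well.
public class DeltaEncoding {

    static long[] encode(long[] values) {
        long[] deltas = new long[values.length];
        long prev = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - prev;
            prev = values[i];
        }
        return deltas;
    }

    static long[] decode(long[] deltas) {
        long[] values = new long[deltas.length];
        long prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            values[i] = prev + deltas[i];
            prev = values[i];
        }
        return values;
    }

    public static void main(String[] args) {
        long[] starts = { 10_000, 10_350, 10_360, 12_045 };   // hypothetical match starts
        long[] deltas = encode(starts);
        System.out.println(Arrays.toString(deltas));          // [10000, 350, 10, 1685]
        System.out.println(Arrays.toString(decode(deltas)));  // original positions restored
    }
}
```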
The values of parameters such as the set K, q and τ have been optimized to get the best results. In our experiments, we have used default values of K and q, whereas τ is chosen dynamically. The set K contains two fixed values, i.e. K = {21, 9}. At first ERGC tries to align using 21-mers (i.e. k = 21); if this fails, k = 9 is used to align the parts. The value of q is chosen in such a way that each part is composed of approximately 20 000 nucleotides. If the unaligned substrings from the reference and target are of approximately equal length and the memory needed to store the edit distance information exceeds the memory needed to store the raw sequence, ERGC discards the edit distance information and stores the raw sequence in an ASCII-formatted text file. We have also used several techniques to speed up the proposed algorithm.
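Under these default settings, the partitioning step might look like the following sketch; the constants are taken from the description above, while the helper class itself is ours:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the partitioning step with the default parameters described above:
// each part holds roughly 20 000 nucleotides, and K = {21, 9} is the set of
// k-mer lengths tried when anchoring a target part to its reference part.
public class Partitioner {
    static final int PART_LEN = 20_000;       // approximate nucleotides per part
    static final int[] K = { 21, 9 };         // k values tried in this order

    static List<String> split(String sequence) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < sequence.length(); i += PART_LEN) {
            parts.add(sequence.substring(i, Math.min(i + PART_LEN, sequence.length())));
        }
        return parts;
    }

    public static void main(String[] args) {
        StringBuilder toy = new StringBuilder();
        for (int i = 0; i < 15_000; i++) toy.append("ACGT");   // 60 000 nt -> 3 parts
        System.out.println(split(toy.toString()).size() + " parts; k values tried: "
                + K[0] + " then " + K[1]);
    }
}
```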
Algorithm 1: ERGC

Input: Reference sequence R, target sequence T, a threshold τ and a set K of values for k. The set K contains two fixed values (21 and 9). τ is the memory needed to store the raw sequence of interest.
Output: Compressed sequence TC
begin
1 Divide R and T into q equal parts. Let these be r1, r2, …, rq and t1, t2, …, tq, respectively;
2 for i := 1 to q do
3 Hash the k-mers (for a suitable value of k from K) of ri into a hash table H;
4 Generate one k-mer at a time from ti and hash it into H;
5 If there is no collision, try different values of k from K and repeat lines 3 and 4;
6 If all the different values of k have been tried with no collision, extend the length of ri and go to line 3;
7 When a collision occurs in H, align ri and ti with this common k-mer as the anchor;
8 Extend the alignment beyond the common k-mer until there is a mismatch;
9 Record the matching length and the starting position of this match in the reference genome R;
10 Compute the edit distance between unmatched subsequences (one each from ri and ti);
11 If the edit distance di is ≥ τ, store the raw (unmatched) subsequence of ti;
12 Otherwise store the edit script information;
13 Compress the stored information using delta encoding;
14 Encode the stored information using the PPMd encoder;
15 Return the compressed sequence TC
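To make lines 7–9 of Algorithm 1 concrete, here is a simplified sketch of the extension step (our own helper, not ERGC's source code):

```java
// Simplified sketch of Algorithm 1, lines 7-9: starting from the anchoring
// k-mer, extend the alignment until the first mismatch and record the start
// position in the reference part together with the matching length.
public class MatchExtension {

    // Returns {startInRef, matchLength} for the maximal exact match that
    // begins at the given anchor positions.
    static int[] extend(String refPart, String targetPart, int refPos, int targetPos) {
        int len = 0;
        while (refPos + len < refPart.length()
                && targetPos + len < targetPart.length()
                && refPart.charAt(refPos + len) == targetPart.charAt(targetPos + len)) {
            len++;
        }
        return new int[] { refPos, len };
    }

    public static void main(String[] args) {
        // Anchor found at reference position 3 and target position 1 (compare the
        // anchor-search sketch above); the match runs until the first mismatch.
        int[] m = extend("ACGTACGTGGAC", "TTACGTGGAT", 3, 1);
        System.out.println("start = " + m[0] + ", length = " + m[1]);
    }
}
```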
3.2 An illustrative example
Let us briefly illustrate our algorithm with a suitable example (please see Figure 1 for visual details). Suppose (ri, ti) is the pair processed at some point in the algorithm. As described above, ri comes from the reference genomic sequence R, and ti comes from the target genomic sequence T. Let a1, a2, … and b1, b2, … denote nucleotide positions in ri and ti, respectively. At first, we hash the k-mers (for a suitable value of k from K) of ri into a hash table H. In this procedure, identical k-mers fall into the same bucket of H, and we record the position of occurrence of each k-mer. Next we generate one k-mer at a time from ti and hash it into H. Let a specific k-mer starting at position bp in ti collide with an entry of H. As a number of identical k-mers can be found across the genomic sequence, the bucket can contain multiple positions of identical k-mers; we retrieve the occurrence with the least position among all the identical k-mers in the bucket. Let this position be ai in ri. The proposed method then aligns ti onto ri using bp and ai as the anchor and extends the alignment beyond the common k-mer until a mismatch is found. Suppose that, while extending, the first mismatch occurs at some positions of ri and ti. The matched substring of ti can then be represented with respect to ri by recording the position ai and the length of the match. The same procedure is repeated again: the unaligned substring of ti is aligned with the corresponding unaligned substring of ri by considering some heuristics. At this point, three cases are possible.
3.2.1 Case 1: mutations in ri and ti
In this case, the unaligned substring of ti is aligned with the corresponding unaligned substring of ri by employing edit distance calculations. In our example, this case arises when we attempt to align an unmatched substring of ti onto the corresponding substring of ri. If the lengths of the two substrings are approximately equal and the memory needed to store the edit distance information between them is less than that needed for the raw substring of ti, we store the substring of ti with respect to ri by recording the starting position in ri and the edit distance information. Otherwise we store the raw substring of ti.
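A minimal sketch of this decision is given below; the cost model (one unit per edit operation versus one unit per raw base) is a simplification we assume for illustration, and ERGC's actual accounting may differ:

```java
// Sketch of the Case 1 decision: compute the edit distance between the two
// unaligned substrings and keep the edit information only if it is cheaper
// than storing the raw target substring itself.
public class EditOrRaw {

    // Classic dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        String refGap = "ACGTTACGA";   // unaligned substring of r_i
        String tgtGap = "ACGTAACGA";   // unaligned substring of t_i
        int dist = editDistance(refGap, tgtGap);
        int rawCost = tgtGap.length(); // simplified cost of storing t_i's substring raw
        if (dist < rawCost) {
            System.out.println("store edit information (distance = " + dist + ")");
        } else {
            System.out.println("store the raw substring");
        }
    }
}
```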
3.2.2 Case 2: insertions in ri
In this case, nucleotides are inserted at some positions of ri, i.e. they have no counterpart in ti. We align the next unmatched substring of ti onto ri at position am, extend the alignment until we find a mismatch using a similar procedure, and record the position am and the matching length as described above.
3.2.3 Case 3: insertions in ti
Case 3 arises when nucleotides are inserted at some positions of ti, i.e. they do not occur in ri. In this special case, we store the raw inserted sequence of ti along with its starting position, as there is no other choice left.
3.3 Time complexity analysis
Consider a pair of parts r and t (where r comes from the reference and t comes from the target). Let m = max{|r|, |t|}. We can generate the k-mers of r and hash them in O(km) time. The same amount of time is spent, in the worst case, to generate and hash the k-mers of t. The number of different k-values that we try is a small constant and hence the total time spent in all the hashing that we employ is O(km). If a collision occurs, the alignment we perform is greedy and takes only O(m) time. After the alignment, recording the differences and the subsequent encoding also take time linear in m.
If no collision occurs for any of the k-values tried, t is stored as such and hence the time is linear in m. Put together, the run time for processing r and t is O(km). Extending this analysis to the entire target sequence, we infer that the run time to compress any target sequence T of length n is O(nk), where k is the largest k-mer length used in hashing.
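In summary, writing |r_i| and |t_i| for the lengths of the i-th pair of parts, and assuming the reference and target are both of length Θ(n), the total work can be expressed as:

```latex
T_{\mathrm{compress}}(n)
  \;=\; \sum_{i=1}^{q} O\bigl(k\,(|r_i| + |t_i|)\bigr)
  \;=\; O\bigl(k\,(|R| + |T|)\bigr)
  \;=\; O(nk).
```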
It is easy to see that our algorithm can be implemented in a single pass through the data and thus can be employed in an out-of-core setting by using an appropriate value for the lengths of r and t. This can be ensured by choosing the lengths of r and t to be O(M), where M is the size of the core memory. The performance of any out-of-core algorithm is measured in terms of the number of I/O operations performed, and a single pass through the data refers to bringing each data item into the core memory exactly once. As a result, ERGC is optimal in terms of out-of-core computing.
4 Results and discussion
4.1 Experimental environment
We have compared our algorithm with the best known algorithms in the referential genome compression domain; in this section, we summarize the results. All the experiments were performed on an Intel Westmere compute node with 12 Intel Xeon X5650 cores and 48 GB of RAM, running Red Hat Enterprise Linux Server release 5.7 (Tikanga). The ERGC compression and decompression algorithms are written in the Java programming language; the source code is compiled and run with Java Virtual Machine (JVM) 1.6.0.
4.2 Datasets
To measure the effectiveness of our proposed algorithm, we have performed a number of experiments using real datasets. We have used the hg18 release from the UCSC Genome Browser, the Korean genomes KOR131 and KOR224 (Ahn et al., 2009) and the genome of a Han Chinese individual known as YH (Levy et al., 2008). Since a reference genome is needed to compress a genomic sequence, we have randomly chosen five target–reference pairs from the above benchmark datasets. We have taken chromosomes 1–22, X and Y (i.e. a total of 24 chromosomes) for comparison purposes. Please see Table 1 for details of the datasets we have used.
Table 1.
Dataset | Species | Chr. | Sequence | Taken from
---|---|---|---|---
D1 | Homo sapiens | 24 | Target: YH | yh.genomics.org.cn
 | | | Reference: hg18 | ncbi.nlm.nih.gov
D2 | Homo sapiens | 24 | Target: YH | yh.genomics.org.cn
 | | | Reference: KO224 | koreangenome.org
D3 | Homo sapiens | 24 | Target: YH | yh.genomics.org.cn
 | | | Reference: KO131 | koreangenome.org
D4 | Homo sapiens | 24 | Target: KO224 | koreangenome.org
 | | | Reference: KO131 | koreangenome.org
D5 | Homo sapiens | 24 | Target: hg18 | ncbi.nlm.nih.gov
 | | | Reference: KO131 | koreangenome.org
4.3 Discussion
Next we present details on the performance evaluation of our proposed algorithm ERGC with respect to both compression and running time. We have compared ERGC with two of the three best performing algorithms, namely GDC and iDoComp, using several standard benchmark datasets. GReEn is another state-of-the-art algorithm in the literature, but we could not compare it with our algorithm, as the site containing its code was down at the time of our experiments. GDC, GReEn and iDoComp are highly specialized algorithms designed to compress genomic sequences with the help of a reference genome and are the best performing algorithms in this area as of now.
Given a reference sequence, our algorithm compresses the target sequence by exploiting the reference, so it also needs the reference sequence at decompression time. We use the target and reference pairs of sequences listed in Table 1 to assess the effectiveness of the algorithm. Some notable algorithms, such as those of Pavlichin et al. (2013), Deorowicz et al. (2013) and Christley et al. (2009), exploit SNP/indel variation files and achieve high compression ratios. However, it may not be possible to find variation files for every species, and these algorithms will not work without them. Our algorithm does not employ variation files and so it can compress any genomic sequence given a reference. As a result, we feel that algorithms that employ variation files form a separate class of algorithms and are not directly comparable to ours. Furthermore, our proposed algorithm is devised in such a way that it is able to work with any alphabet used in the genomic sequence, whereas the other algorithms work only with a valid alphabet Σ intended for genomic sequences. The characters most commonly seen in sequences are A, C, G, T and N, but there are several other valid characters that are used in clones to indicate ambiguity about the identity of certain bases, and it is not uncommon to see these ‘wobble’ codes at polymorphic positions in DNA sequences. ERGC also differentiates between lower-case and upper-case letters. GDC, GReEn and iDoComp can differentiate between upper-case and lower-case letters specified in Σ, but earlier algorithms like GRS or RLZ-opt only work with A, C, G, T and N in the alphabet; iDoComp replaces any character in the genomic sequence that does not belong to Σ with N. Specifically, ERGC compresses the target genome file regardless of the alphabet used and decompresses the compressed file into an exact copy of the target file. This is also the case for GDC and iDoComp, but GReEn does not include the metadata information and outputs the sequence as a single line instead of multiple lines.
The effectiveness of the various algorithms, including ERGC, is measured using several performance metrics such as compressed size, compression time and decompression time. Gain measures the percentage improvement in compression achieved by ERGC with respect to GDC and iDoComp. Comparison results are shown in Table 2. Clearly, our proposed algorithm is competitive and performs better than the best known algorithms. In Tables 3 and 4, we show a comparison between the compressed sizes (from the different algorithms) and the actual sizes of individual chromosomes for some of the datasets. Memory consumption is also very low in our algorithm, as it processes only one part of the target and reference sequences at any time. Please note that we did not report the performance of GDC for every dataset, as for some datasets it ran for at least 1 h without being able to compress a single chromosome.
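The Gain values reported in Table 2 are consistent with the following formula (our reconstruction from the tabulated sizes):

```latex
\mathrm{Gain}
  \;=\; \Bigl(1 - \tfrac{\text{R.size}_{\mathrm{ERGC}}}{\text{R.size}_{\mathrm{other}}}\Bigr) \times 100\%,
\qquad
\text{e.g. for D1 vs. iDoComp: } \Bigl(1 - \tfrac{7704.75}{65\,708.47}\Bigr) \times 100\% \approx 88.27\%.
```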
Table 2.
Dataset | A.size | GDC R.size | GDC C.time | GDC D.time | iDoComp R.size | iDoComp C.time | iDoComp D.time | ERGC R.size | ERGC C.time | ERGC D.time | Gain (GDC) | Gain (iDoComp)
---|---|---|---|---|---|---|---|---|---|---|---|---
D1 | 2987 | NA | NA | NA | 65 708.47 | 3157.00 | 812.02 | 7704.75 | 606.12 | 129.69 | NA | 88.27% |
D2 | 2987 | 31 832.76 | 5980.40 | 40.70 | 29 723.53 | 2192.00 | 268.97 | 9016.41 | 840.56 | 124.15 | 71.68% | 69.67% |
D3 | 2987 | 37 154.19 | 6785.40 | 40.93 | 33 131.26 | 1857.00 | 276.47 | 9200.24 | 875.30 | 118.13 | 75.24% | 72.23% |
D4 | 2938 | 11 851.95 | 4829.10 | 50.09 | 7043.82 | 2534.00 | 137.45 | 5073.44 | 624.48 | 225.22 | 57.19% | 27.97% |
D5 | 2996 | NA | NA | NA | 209 380.79 | 3040.00 | 953.06 | 19 396.40 | 988.12 | 148.32 | NA | 90.73% |
Best values are shown in bold face. A.size and R.size refer to actual size in MB and reduced size in KB, respectively. C.time and D.time refer to the compression time and decompression time in seconds, respectively.
Table 3.
Chromosome | A.size | GDC R.size | GDC C.time | GDC D.time | iDoComp R.size | iDoComp C.time | iDoComp D.time | ERGC R.size | ERGC C.time | ERGC D.time
---|---|---|---|---|---|---|---|---|---|---
C1 | 251 370 554 | 2 391 974 | 720.8 | 3.19 | 2 366 447 | 200.00 | 23.09 | 700 785 | 57.81 | 8.78 |
C2 | 247 000 341 | 2 345 540 | 755.90 | 3.29 | 1 822 538 | 175.00 | 14.62 | 685 102 | 49.02 | 8.51 |
C3 | 202 826 864 | 1 266 255 | 368.20 | 2.72 | 1 124 827 | 164.00 | 11.11 | 612 271 | 38.30 | 8.18 |
C4 | 194 460 954 | 1 485 973 | 281.50 | 2.66 | 1 305 098 | 140.00 | 11.12 | 626 786 | 44.44 | 7.03 |
C5 | 183 872 170 | 1 575 619 | 353.60 | 2.45 | 1 364 733 | 117.00 | 10.56 | 545 591 | 36.56 | 7.41 |
C6 | 173 748 332 | 1 260 446 | 327.00 | 2.33 | 1 087 335 | 117.00 | 9.76 | 555 871 | 38.97 | 6.82 |
C7 | 161 468 454 | 1 893 681 | 413.40 | 2.14 | 1 495 373 | 108.00 | 10.27 | 512 645 | 30.61 | 6.79 |
C8 | 148 712 746 | 1 224 754 | 198.50 | 1.91 | 1 056 722 | 96.00 | 8.93 | 468 036 | 30.25 | 5.82 |
C9 | 142 611 146 | 2 258 374 | 232.40 | 1.76 | 2 606 197 | 82.00 | 19.51 | 421 259 | 28.04 | 5.66 |
C10 | 137 630 990 | 1 427 667 | 304.40 | 1.83 | 1 295 147 | 106.00 | 9.18 | 413 975 | 28.16 | 5.66 |
C11 | 136 693 265 | 1 127 511 | 218.40 | 1.79 | 935 087 | 96.00 | 7.84 | 418 994 | 28.18 | 5.91 |
C12 | 134 555 367 | 901 759 | 223.20 | 1.79 | 760 267 | 87.00 | 6.99 | 406 557 | 27.36 | 5.74 |
C13 | 116 045 370 | 682 679 | 63.30 | 1.40 | 858 431 | 86.00 | 13.92 | 302 716 | 22.28 | 4.61 |
C14 | 108 141 402 | 808 036 | 79.10 | 1.29 | 931 770 | 70.00 | 13.78 | 293 895 | 19.69 | 4.30 |
C15 | 102 011 238 | 1 130 386 | 96.00 | 1.17 | 1 275 730 | 61.00 | 14.75 | 251 420 | 18.92 | 4.31 |
C16 | 90 307 716 | 1 260 031 | 175.20 | 1.13 | 1 157 411 | 60.00 | 10.08 | 284 209 | 17.95 | 3.58 |
C17 | 80 087 662 | 870 827 | 286.20 | 1.10 | 677 073 | 51.00 | 4.58 | 245 816 | 16.21 | 3.43 |
C18 | 77 385 780 | 560 056 | 55.00 | 1.03 | 447 788 | 47.00 | 4.27 | 236 401 | 16.15 | 3.15 |
C19 | 64 875 186 | 546 966 | 128.30 | 0.84 | 577 941 | 39.00 | 7.00 | 276 688 | 12.87 | 2.66 |
C20 | 63 476 571 | 414 322 | 50.20 | 0.84 | 384 499 | 52.00 | 4.38 | 199 830 | 12.92 | 2.44 |
C21 | 47 726 736 | 318 750 | 12.60 | 0.50 | 445 831 | 46.00 | 8.44 | 141 432 | 13.73 | 2.10 |
C22 | 50 519 630 | 573 318 | 28.00 | 0.56 | 685 738 | 36.00 | 9.88 | 149 413 | 9.13 | 2.10 |
CX | 157 495 656 | 3 628 432 | 584.60 | 2.35 | 2 824 069 | 123.00 | 12.76 | 351 252 | 107.16 | 6.85 |
CY | 58 735 843 | 2 643 399 | 24.60 | 0.63 | 2 950 851 | 33.00 | 22.15 | 131 863 | 135.85 | 2.32 |
A.size and R.size refer to actual size and reduced size in bytes, respectively. C.time and D.time refer to the compression time and decompression time in seconds, respectively. Best results are shown in bold.
Table 4.
Chromosome | A.size | GDC R.size | GDC C.time | GDC D.time | iDoComp R.size | iDoComp C.time | iDoComp D.time | ERGC R.size | ERGC C.time | ERGC D.time
---|---|---|---|---|---|---|---|---|---|---
C1 | 251 370 554 | 2 721 204 | 762.70 | 3.20 | 2 586 013 | 154.00 | 23.78 | 697 595 | 57.75 | 8.74 |
C2 | 247 000 341 | 2 696 425 | 780.80 | 3.24 | 2 051 129 | 127.00 | 15.28 | 691 698 | 48.14 | 8.58 |
C3 | 202 826 864 | 1 534 808 | 386.50 | 2.74 | 1 294 797 | 118.00 | 11.61 | 618 221 | 37.80 | 7.66 |
C4 | 194 460 954 | 1 782 402 | 338.20 | 2.69 | 1 493 618 | 112.00 | 11.42 | 645 704 | 43.58 | 6.90 |
C5 | 183 872 170 | 1 826 898 | 421.50 | 2.49 | 1 536 335 | 119.00 | 11.40 | 544 582 | 37.17 | 6.83 |
C6 | 173 748 332 | 1 496 369 | 368.60 | 2.34 | 1 235 131 | 109.00 | 10.02 | 557 506 | 37.74 | 6.66 |
C7 | 161 468 454 | 2 175 785 | 541.60 | 2.15 | 1 685 493 | 83.00 | 10.84 | 516 224 | 30.36 | 6.53 |
C8 | 148 712 746 | 1 420 008 | 232.90 | 1.99 | 1 187 419 | 75.00 | 9.12 | 465 201 | 29.79 | 5.70 |
C9 | 142 611 146 | 2 435 967 | 295.80 | 1.76 | 2 801 794 | 76.00 | 19.91 | 427 900 | 27.46 | 5.26 |
C10 | 137 630 990 | 1 615 075 | 338.10 | 1.81 | 1 432 409 | 92.00 | 9.42 | 424 362 | 27.56 | 5.23 |
C11 | 136 693 265 | 1 329 807 | 225.60 | 1.85 | 1 059 181 | 85.00 | 8.19 | 429 965 | 28.16 | 5.21 |
C12 | 134 555 367 | 1 086 044 | 288.90 | 1.76 | 870 113 | 101.00 | 7.20 | 414 576 | 26.93 | 4.97 |
C13 | 116 045 370 | 820 625 | 72.10 | 1.42 | 948 002 | 59.00 | 14.24 | 305 009 | 21.84 | 4.29 |
C14 | 108 141 402 | 926 154 | 89.20 | 1.26 | 1 007 829 | 76.00 | 13.81 | 308 622 | 19.97 | 4.02 |
C15 | 102 011 238 | 1 242 272 | 131.50 | 1.19 | 1 362 715 | 57.00 | 14.71 | 272 473 | 19.22 | 3.92 |
C16 | 90 307 716 | 1 393 463 | 232.50 | 1.13 | 1 253 762 | 51.00 | 10.15 | 286 410 | 17.51 | 3.53 |
C17 | 80 087 662 | 1 003 297 | 258.00 | 1.10 | 757 520 | 63.00 | 4.78 | 250 363 | 15.98 | 3.15 |
C18 | 77 385 780 | 670 979 | 71.90 | 1.05 | 512 016 | 53.00 | 4.35 | 235 380 | 15.80 | 3.11 |
C19 | 64 875 186 | 655 576 | 143.70 | 0.83 | 644 870 | 29.00 | 7.07 | 274 639 | 12.95 | 2.56 |
C20 | 63 476 571 | 486 193 | 62.10 | 0.83 | 429 085 | 37.00 | 4.36 | 202 068 | 12.73 | 2.42 |
C21 | 47 726 736 | 376 979 | 13.70 | 0.50 | 475 821 | 22.00 | 8.41 | 142 895 | 13.84 | 2.19 |
C22 | 50 519 630 | 632 996 | 28.10 | 0.56 | 730 299 | 34.00 | 9.93 | 167 700 | 9.32 | 2.09 |
CX | 157 495 656 | 4 862 027 | 683.40 | 2.40 | 3 503 378 | 108.00 | 14.22 | 402 636 | 120.93 | 6.23 |
CY | 58 735 843 | 2 854 541 | 18.00 | 0.64 | 3 067 689 | 17.00 | 22.25 | 139 320 | 162.77 | 2.35 |
A.size and R.size refer to actual size and reduced size in bytes, respectively. C.time and D.time refer to the compression time and decompression time in seconds, respectively. Best results are shown in bold.
As stated above, ERGC differentiates between upper-case and lower-case characters. It compresses the target file, containing the genomic sequence and any metadata, with the help of a reference, and the decompression procedure produces exactly the same file as the input. It does not depend on the alphabet and is universal in this sense. Consider dataset D1, where the target and reference sequences/chromosomes are from YH and hg18, respectively (Table 1). In this setting, GDC runs indefinitely. iDoComp compresses 2987 MB of data (i.e. the total size of 24 chromosomes and header information/metadata) to roughly 64 MB, whereas ERGC reduces it to roughly 7 MB. The per cent improvement ERGC achieves with respect to iDoComp is 88.27%; specifically, ERGC compresses about 9× better than iDoComp for this particular dataset. ERGC is also faster than iDoComp in terms of both compression (about 5×) and decompression (about 6×) times (see Table 2 for details).
Now consider dataset D2, where the target and reference sequences are from YH and KO224, respectively. The compressed sizes achieved by GDC and iDoComp are roughly equal, whereas ERGC compresses more than 3× better than either of them. GDC’s compression time is longer than that of both iDoComp and ERGC, but it decompresses the sequences very quickly. ERGC’s compression is approximately 2.6× and 7× faster than that of iDoComp and GDC, respectively. Next consider D5. GDC runs indefinitely for this dataset. The percentage improvement ERGC achieves with respect to iDoComp is 90.73%; specifically, ERGC takes about 11× less disk space than iDoComp for this particular dataset. ERGC is also faster than iDoComp in terms of both compression (about 3×) and decompression (about 6×) times. Figure 2 shows a comparative study of the different algorithms, including ERGC, with respect to compression ratio, compression time and decompression time.
In brief, the minimum and maximum improvements observed across the datasets were 27.97% and 90.73%, respectively, with respect to iDoComp, and 57.19% and 75.24%, respectively, with respect to GDC. ERGC compresses at least about 2× and at most about 5× faster than iDoComp. Although ERGC is better than iDoComp and GDC in compression time for every dataset, it is slower than GDC with respect to decompression for datasets D2, D3 and D4.
5 Conclusions
Data compression is a very important problem in biology especially for NGS data. Five different NGS data compression problems have been identified and studied. In this article, we have presented a novel algorithm for one of these problems, namely, reference-based genome compression to effectively and efficiently compress genomic sequences. From the experimental results, it is evident that our algorithm indeed achieves compression ratios that are better than those of the currently best known algorithms. The compression time is also better than that of state-of-the-art algorithms in this domain. Although GDC is better than ERGC in terms of decompression time, the time ERGC takes to decompress the genomic sequences is also very promising.
Funding
This research has been supported in part by the NIH grant R01-LM010101 and the NSF grant 1447711.
Conflict of Interest: none declared.
References
- Ahn S.-M., et al. (2009) The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res., 19, 1622–1629.
- Brandon M.C., et al. (2009) Data structures and compression algorithms for genomic sequence data. Bioinformatics, 25, 1731–1738.
- Cao M.D., et al. (2007) A simple statistical algorithm for biological sequence compression. In: Proceedings of the 2007 IEEE Data Compression Conference (DCC 07), IEEE, pp. 43–52.
- Christley S., et al. (2009) Human genomes as email attachments. Bioinformatics, 25, 274–275.
- Deorowicz S., Grabowski S. (2011) Robust relative compression of genomes with random access. Bioinformatics, 27, 2979–2986.
- Deorowicz S., et al. (2013) Genome compression: a novel approach for large collections. Bioinformatics, 29, 1–7.
- Fritz M.H.-Y., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res., 21, 734–740.
- Golomb S.W. (1966) Run-length encodings. IEEE Trans. Inf. Theory, 12, 399–401.
- Huffman D. (1952) A method for the construction of minimum-redundancy codes. In: Proceedings of the Institute of Radio Engineers, pp. 1098–1101.
- Levy S., et al. (2008) The diploid genome sequence of an Asian individual. Nature, 456, 60–66.
- Moffat A. (1990) Implementing the PPM data compression scheme. IEEE Trans. Commun., 38, 1917–1921.
- Ochoa I., et al. (2014) iDoComp: a compression scheme for assembled genomes. Bioinformatics, 31, 626–633.
- Pavlichin D., et al. (2013) The human genome contracts again. Bioinformatics, 29, 2199–2202.
- Peter E. (1975) Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory, 21, 194–203.
- Pinho A.J., et al. (2012) GReEn: a tool for efficient compression of genome resequencing data. Nucleic Acids Res., 40, e27.
- Shanika K., et al. (2011) Optimized relative Lempel-Ziv compression of genomes. In: Reynolds, M. (ed.) 34th Australasian Computer Science Conference, Perth, Australia, Vol. 113, pp. 91–98.
- Wang C., Zhang D. (2011) A novel compression tool for efficient storage of genome resequencing data. Nucleic Acids Res., 39, e45.
- Ziv J., Lempel A. (1977) A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23, 337–343.