Bioinformatics. 2021 May 14;37(22):4248–4250. doi: 10.1093/bioinformatics/btab378

Sparse allele vectors and the savvy software suite

Jonathon LeFaive, Albert V Smith, Hyun Min Kang, Gonçalo Abecasis
Editor: Pier Luigi Martelli
PMCID: PMC9502232  PMID: 33989384

Abstract

Summary

The sparse allele vectors file format is an efficient storage format for large-scale DNA variation data and is designed for high throughput association analysis by leveraging techniques for fast deserialization of data into computer memory. A command line interface has been developed to complement the storage format and supports basic features like importing, exporting and subsetting. Additionally, a C++ programming API is available allowing for easy integration into analysis software.

Availability and implementation

https://github.com/statgen/savvy.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

As the number of deeply sequenced genomes grows at an unprecedented rate, storage formats that efficiently scale with sample size are essential. Modern sequence and imputation datasets can now include 100s of millions of genetic variants measured across 10 000s of individuals (Taliun et al., 2021). The number of genetic variants in these data grows steadily as more samples are sequenced, but the resulting genotype matrix is typically very sparse since most individuals only exhibit variation at ∼4–5 million sites, even as the number of variants grows to 10s or 100s of millions.

We have designed a file format that exploits this sparsity to both compress data and reduce deserialization overhead. Since the proportion of rare variants increases with sample size, the compression ratio and efficiency of our format both improve as study sizes grow (Table 1). The sparse vector design of our format makes it a natural companion to analysis methods that capitalize on the use of sparse matrix operations.

Table 1. SAV compression and deserialization performance

                               Sample size
                               2000            20 000           200 000
Deserialization speed (min)
  BCF (htslib) (a)             0.55 (1.000)    18.62 (1.000)    596.73 (1.000)
  BCF (savvy) (a)              0.47 (0.855)    15.60 (0.838)    494.08 (0.828)
  SAV (b)                      0.03 (0.055)    0.20 (0.011)     1.73 (0.003)
  SAV w/PBWT (b, c)            0.17 (0.309)    1.84 (0.099)     19.69 (0.033)
File size (GiB)
  BCF                          0.18 (1.00)     1.98 (1.00)      25.13 (1.00)
  BGT                          0.06 (0.33)     0.47 (0.24)      8.55 (0.34)
  GDS                          0.04 (0.22)     0.41 (0.21)      7.19 (0.29)
  GQT                          0.22 (1.22)     2.11 (1.07)      20.98 (0.83)
  PGEN                         0.12 (0.67)     1.02 (0.52)      9.69 (0.39)
  SAV (b)                      0.06 (0.33)     0.47 (0.24)      4.36 (0.17)
  SAV w/PBWT (b, c)            0.04 (0.22)     0.24 (0.12)      2.15 (0.09)
  spVCF                        0.21 (1.17)     2.11 (1.07)      19.35 (0.77)
  VCF                          0.21 (1.17)     2.50 (1.26)      39.79 (1.58)

Notes: This table shows an evaluation of compression and deserialization of deeply sequenced chromosome 20 genotypes. Values in parentheses are proportions relative to BCF with htslib.

(a) BCF files were evaluated using both savvy and htslib (the official BCF library) v1.11.

(b) SAV files were compressed with the maximum zstd compression level of 19.

(c) SAV with PBWT used an allele frequency threshold of 0.01 to selectively apply PBWT.

Our design extends the widely used binary variant call format (BCF) (Li, 2011a), making our format compatible with existing datasets and straightforward to implement in software tools such as bcftools (Li, 2011a), PLINK (Chang et al., 2015) and GATK (McKenna et al., 2010) that already support BCF input or output.

2 Materials and methods

2.1 SAV file format

The sparse allele vectors (SAV) file format supplements the dense vectors used for storing genomic information in BCF files with a new sparse vector data type. Instead of storing a value for each allele, only offsets and values of non-reference alleles are stored. Since low-frequency variants make up the majority of variants, this translates into large savings in data size. To accommodate common variants in an efficient manner, the non-reference offsets are relative to the previous offset, which produces repeated offset values for each variant and facilitates downstream compression. Further common variant compression can be achieved by enabling the optional positional Burrows–Wheeler transform (PBWT) (Durbin, 2014) feature, which repositions variant record data based on the sorted order of corresponding data in preceding variant records. PBWT sorting slows deserialization, so selectively applying the transformation to common variants based on a configurable allele frequency threshold balances data compression against read throughput (Table 1). SAV files supplement sparse vectors and PBWT with bit-level compression applied via the generic Zstandard (zstd) compression algorithm. To support random access to variants, independent zstd compression blocks are concatenated together to produce a single file.
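The relative-offset encoding and the PBWT reordering can be illustrated with a minimal Python sketch. This is not the SAV on-disk layout or the savvy implementation: the function names are hypothetical, the offset convention (distance from the previous stored entry, counted from index 0 for the first entry) is an assumption, and the PBWT step follows the standard stable-sort formulation of Durbin (2014) rather than anything specific to SAV.

```python
from typing import List, Sequence, Tuple


def encode_sparse(dense: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Keep only non-reference (non-zero) entries as (relative offset, value)."""
    offsets, values = [], []
    prev = 0  # index of the previous stored entry (0 before the first one)
    for i, v in enumerate(dense):
        if v != 0:
            offsets.append(i - prev)  # distance from the previous stored entry
            values.append(v)
            prev = i
    return offsets, values


def decode_sparse(offsets: Sequence[int], values: Sequence[int], size: int) -> List[int]:
    """Rebuild the dense vector; reference (zero) entries are never read."""
    dense = [0] * size
    pos = 0
    for off, v in zip(offsets, values):
        pos += off
        dense[pos] = v
    return dense


def pbwt_encode(columns: Sequence[Sequence[int]]) -> List[List[int]]:
    """Permute each variant's alleles by the order induced by earlier variants."""
    if not columns:
        return []
    order = list(range(len(columns[0])))
    out = []
    for col in columns:
        out.append([col[i] for i in order])
        # Stable sort: haplotypes that agreed on recent variants stay adjacent.
        order = sorted(order, key=lambda i: col[i])
    return out


def pbwt_decode(permuted: Sequence[Sequence[int]]) -> List[List[int]]:
    """Invert pbwt_encode by replaying the same sequence of permutations."""
    if not permuted:
        return []
    order = list(range(len(permuted[0])))
    out = []
    for pcol in permuted:
        col = [0] * len(order)
        for rank, i in enumerate(order):
            col[i] = pcol[rank]
        out.append(col)
        order = sorted(order, key=lambda i: col[i])
    return out
```

For an alternating common variant such as `[0, 1, 0, 1, 0, 1, 0, 1]`, `encode_sparse` yields offsets `[1, 2, 2, 2]`, which shows the repeated offset values that a general-purpose compressor can exploit. Decoding a PBWT-encoded record requires replaying the permutation, which is consistent with the slower deserialization reported for PBWT in Table 1.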

A commonly overlooked aspect of storage formats is deserialization speed. By only storing the offsets of alternate alleles, reference alleles are never parsed from disk. This results in faster deserialization of files, especially when reading genotypes into sparse vector and matrix data structures that only store non-zero values. In certain analyses where reading genotype data and iterating through the resulting genotype vectors represents a substantial portion of analysis time, the SAV format can reduce overall compute times (Supplementary Table S1).

2.2 S1R indexing

SAV files are indexed using a sort-tile-recursive one-dimensional r-tree (S1R) index file. Genomic regions are organized into an r-tree to enable fast random access to an SAV file without having to traverse the entire index file. Each leaf entry in the tree points to a zstd compressed block in the corresponding SAV file. The entry also encodes the number of variant records in the block, which can be variable depending on the parameters for compressing the SAV file. While the BGZF compression format used in BCF partitions files in 64 KiB chunks (Li, 2011b), our blocking scheme is based on the number of variant records, allowing for much larger blocks and providing for better compression with zstd. The r-trees are organized with a bottom-up design allowing index files to be generated without requiring the entire tree to fit into computer memory. This reversed structure also allows for the index to be appended to the end of the SAV file as opposed to being stored as a separate file. In the situation when the index is written as a separate file, association with a specific version of an SAV file can be enforced by matching a universally unique identifier that is stored in both the SAV and S1R headers.
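A bottom-up packed one-dimensional r-tree of this kind can be sketched as follows. The sketch only illustrates the general sort-tile-recursive idea: the tuple layout `(start, end, block_offset, record_count)`, the function names and the fan-out value are assumptions made for illustration, and the actual S1R binary format is not reproduced here.

```python
from typing import List, Tuple

Leaf = Tuple[int, int, int, int]  # (start, end, block_offset, record_count), hypothetical layout
Interval = Tuple[int, int]


def build_index(leaves: List[Leaf], fanout: int = 4):
    """Pack sorted leaves bottom-up; each higher level bounds `fanout` children."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    leaves = sorted(leaves)
    levels: List[List[Interval]] = [[(s, e) for s, e, _, _ in leaves]]
    while len(levels[-1]) > 1:
        below = levels[-1]
        above = []
        for i in range(0, len(below), fanout):
            group = below[i:i + fanout]
            # Bounding interval of the group of children.
            above.append((min(s for s, _ in group), max(e for _, e in group)))
        levels.append(above)
    return leaves, levels


def query(leaves: List[Leaf], levels: List[List[Interval]], fanout: int,
          start: int, end: int) -> List[Leaf]:
    """Return leaves overlapping [start, end], visiting only overlapping nodes."""
    hits: List[Leaf] = []

    def visit(depth: int, idx: int) -> None:
        s, e = levels[depth][idx]
        if e < start or s > end:
            return  # no overlap, so the whole subtree is skipped
        if depth == 0:
            hits.append(leaves[idx])
            return
        first = idx * fanout
        last = min(first + fanout, len(levels[depth - 1]))
        for child in range(first, last):
            visit(depth - 1, child)

    if levels[0]:
        visit(len(levels) - 1, 0)  # the top level holds a single root node
    return hits
```

Because every level is derived only from the level beneath it, the leaves can be written first and the upper levels afterwards, which matches the motivation for a bottom-up design given above.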

Another benefit of storing the number of records per block in the index is the added ability to query by record offset within the SAV file. This is useful for distributing compute work evenly across multiple machines. We have also used this feature to efficiently subset random variants using a geometric distribution of random integers.
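As a rough illustration of both uses, the Python sketch below maps a global record offset to a block using per-block record counts, and draws a random subset of record offsets by jumping geometrically distributed gaps. The function names and the inverse-transform sampling formula are assumptions chosen for illustration; the source does not describe how savvy implements either step.

```python
import bisect
import itertools
import math
import random
from typing import List, Sequence, Tuple


def locate_record(record_counts: Sequence[int], record_offset: int) -> Tuple[int, int]:
    """Map a global record offset to (block index, offset within that block)."""
    cumulative = list(itertools.accumulate(record_counts))
    total = cumulative[-1] if cumulative else 0
    if not 0 <= record_offset < total:
        raise IndexError("record offset out of range")
    block = bisect.bisect_right(cumulative, record_offset)
    before = cumulative[block - 1] if block > 0 else 0
    return block, record_offset - before


def sample_record_offsets(total_records: int, p: float, rng: random.Random) -> List[int]:
    """Select each record with probability p by skipping geometric gaps."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    picked: List[int] = []
    pos = -1
    log_q = math.log(1.0 - p)
    while True:
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is defined
        pos += 1 + int(math.log(u) / log_q)  # gap of at least one record
        if pos >= total_records:
            return picked
        picked.append(pos)
```

Drawing gaps instead of testing every record means only the blocks that contain a selected record need to be decompressed. The same `locate_record` lookup can be used to split a file into contiguous record ranges of equal size for separate machines.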

2.3 Savvy C++ API

Savvy is an open source C++ programming library for interfacing with SAV, BCF and VCF file formats. This library was designed for efficient association analysis and serves as an abstraction layer for variant call file formats. The API follows a Structure of Arrays memory layout for sample level data, which improves the performance of CPU cache and vectorized compute operations. Accompanying the library is a command line tool for converting and manipulating SAV files.
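The Structure of Arrays idea can be shown in a few lines of Python. This is only a conceptual sketch: `VariantRecord`, its fields and `from_samples` are hypothetical names and do not correspond to the savvy C++ types.

```python
from array import array
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class VariantRecord:
    """Hypothetical record: each sample-level field is one contiguous typed array."""
    position: int
    gt: array = field(default_factory=lambda: array("b"))  # allele codes, one per sample
    dp: array = field(default_factory=lambda: array("i"))  # read depths, one per sample


def from_samples(position: int, samples: Iterable[Dict[str, int]]) -> VariantRecord:
    """Convert an Array-of-Structures input (one dict per sample) into SoA form."""
    record = VariantRecord(position)
    for sample in samples:
        record.gt.append(sample["GT"])
        record.dp.append(sample["DP"])
    return record


samples = [{"GT": 0, "DP": 30}, {"GT": 1, "DP": 28}, {"GT": 0, "DP": 35}]
record = from_samples(12345, samples)
alt_allele_count = sum(record.gt)  # scans a single contiguous array
```

An analysis that needs only genotypes reads `record.gt` and never touches the depth values, which is the access pattern that benefits from CPU caching and vectorization.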

3 Results

We evaluated compression and read performance of genotype (GT) data in variant call sets with thousands, tens of thousands and hundreds of thousands of individuals, and we observed that both metrics increasingly improve relative to BCF as sample size grows. We tested read performance by timing the deserialization of GT values from each file into computer memory. An average was computed from multiple rounds, with the first round being discarded so that all evaluations could take advantage of file system caching. The results are shown in Table 1. When combining sparse vectors with PBWT to store 200 000 individuals, SAV is over 11 times smaller and 30 times faster than BCF.
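The two headline ratios can be rechecked from the 200 000-sample column of Table 1. The short Python calculation below uses only values from that table; the variable names are illustrative.

```python
# Values taken from the 200 000-sample column of Table 1.
bcf_size_gib = 25.13
sav_pbwt_size_gib = 2.15
bcf_htslib_minutes = 596.73
sav_pbwt_minutes = 19.69

size_ratio = bcf_size_gib / sav_pbwt_size_gib        # about 11.7 times smaller
speed_ratio = bcf_htslib_minutes / sav_pbwt_minutes  # about 30.3 times faster

print(f"size: {size_ratio:.1f}x, speed: {speed_ratio:.1f}x")
```

By the same table, SAV without PBWT is about 5.8 times smaller than BCF (25.13 / 4.36) and roughly 345 times faster to deserialize (596.73 / 1.73), which shows the trade-off between file size and read speed.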

The performance improvements in Table 1 naturally extend to standard file manipulations (like merging or subsetting samples or variants). To assess whether the deserialization efficiency of SAV could extend to improved association analysis times, we performed single variant association tests using a simple linear regression model (Supplementary Note S1). SAV reduced run time compared to BCF for analysis of 200 000 samples by 98% when using sparse vector operations and 31% with traditional dense vector operations (Supplementary Table S1).
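To see why sparse vectors help here, consider the slope of a simple linear regression of phenotype y on genotype x. It can be written as (Σxy − ΣxΣy/n) / (Σx² − (Σx)²/n), and every sum that involves x needs only the non-zero genotype entries. The Python sketch below is an assumption about how such a test could be arranged, not the model used in Supplementary Note S1; the function names are hypothetical and the sparse indices are absolute sample positions.

```python
from typing import Sequence


def sparse_slope(indices: Sequence[int], values: Sequence[float],
                 y: Sequence[float], y_sum: float) -> float:
    """Regression slope of y on x, touching only the non-zero entries of x."""
    n = len(y)
    sx = sxx = sxy = 0.0
    for i, v in zip(indices, values):
        sx += v
        sxx += v * v
        sxy += v * y[i]
    denom = sxx - sx * sx / n
    if denom == 0.0:
        raise ValueError("genotype vector has no variance")
    return (sxy - sx * y_sum / n) / denom


def dense_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Reference implementation that visits every sample."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


phenotype = [1.0, 2.0, 3.0, 1.5, 5.0, 2.5, 1.0, 2.0]
genotype = [0, 0, 1, 0, 2, 0, 0, 0]
phenotype_sum = sum(phenotype)  # computed once and reused for every variant
slope = sparse_slope([2, 4], [1, 2], phenotype, phenotype_sum)
```

The work per variant in `sparse_slope` scales with the number of non-reference alleles rather than with the sample size, so the benefit is largest for rare variants.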

We also compared the compression efficiency of SAV against other popular alternatives to BCF for storing genotype data: BGT (Li, 2016), GDS (Zheng et al., 2017), GQT (Layer et al., 2016), spVCF (Lin et al., 2020) and PGEN (Chang et al., 2015). We found that SAV outperforms these alternatives at larger sample sizes (Table 1).

SAV is effective at compressing other data types as well. Compression of imputed haplotype dosages of 487 409 individuals and 5 190 817 variants from chromosome 20 of UK Biobank (UKB) (Bycroft et al., 2018) is 32% smaller than BGEN (Band and Marchini, 2018) (the file format used by UKB) and 70% smaller than BCF (Supplementary Table S2). Further compression of imputed data is possible with discretization of input values. Non-sparse data such as read depth, allele depth, genotype quality and phred-scaled genotype likelihoods benefit from compression savings of 65%, 61%, 45% and 56%, respectively, when enabling field-specific PBWT on 201 503 individuals and 36 980 variants (Supplementary Table S3).

4 Conclusions

We have presented new file formats and software for storing, indexing and querying DNA variation data that are performant in terms of both storage size and deserialization speed. Our storage design serves as an exchange format for DNA variation as well as a fast analysis format optimized for both deeply sequenced and imputed genomes. By enhancing the widely used BCF specification, SAV remains fully compatible with existing datasets and can be more readily adopted as the file format of choice for large whole genome sequencing studies.

Funding

National Institutes of Health [3R01HL-117626-02S1 and U01HL137182].

Conflict of Interest: none declared.

Supplementary Material

btab378_Supplementary_Data

Contributor Information

Jonathon LeFaive, Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA.

Albert V Smith, Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA.

Hyun Min Kang, Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA.

Gonçalo Abecasis, Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA.

References

  1. Band G., Marchini J. (2018) BGEN: a binary file format for imputed genotype and haplotype data. bioRxiv, 308296.
  2. Bycroft C. et al. (2018) The UK Biobank resource with deep phenotyping and genomic data. Nature, 562, 203–209.
  3. Chang C. et al. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4, 7.
  4. Durbin R. (2014) Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics, 30, 1266–1272.
  5. Layer R. et al.; Exome Aggregation Consortium (2016) Efficient genotype compression and analysis of large genetic-variation data sets. Nat. Methods, 13, 63–65.
  6. Li H. (2011a) A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27, 2987–2993.
  7. Li H. (2011b) Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics, 27, 718–719.
  8. Li H. (2016) BGT: efficient and flexible genotype query across many samples. Bioinformatics, 32, 590–592.
  9. Lin M. et al. (2020) Sparse project VCF: efficient encoding of population genotype matrices. Bioinformatics, 36, 5537–5538.
  10. McKenna A. et al. (2010) The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., 20, 1297–1303.
  11. Taliun D. et al. (2021) Sequencing of 53,831 diverse genomes from the NHLBI TOPMed program. Nature, 590, 290–299.
  12. Zheng X. et al. (2017) SeqArray—a storage-efficient high-performance data format for WGS variant calls. Bioinformatics, 33, 2251–2257.


Articles from Bioinformatics are provided here courtesy of Oxford University Press
