Bioinformatics. 2015 Oct 24;32(4):590–592. doi: 10.1093/bioinformatics/btv613

BGT: efficient and flexible genotype query across many samples

Heng Li
PMCID: PMC5963361  PMID: 26500154

Abstract

Summary: BGT is a compact format, a fast command line tool and a simple web application for efficient and convenient query of whole-genome genotypes and frequencies across tens to hundreds of thousands of samples. On real data, it encodes the haplotypes of 32 488 samples across 39.2 million SNPs into a 7.4 GB database and decodes up to 420 million genotypes per CPU second. The high performance enables real-time responses to complex queries.

Availability and implementation: https://github.com/lh3/bgt

Contact: hengli@broadinstitute.org

1 Introduction

VCF/BCF (Danecek et al., 2011) is the primary format for storing and analyzing genotypes of multiple samples. It has a few issues, however. First, VCF is a site-oriented format. Although accessing a site and all the associated genotypes is efficient with indexing, retrieving site annotations or the genotypes of a few samples always requires decoding the genotypes of all samples, which is unnecessarily expensive. Second, VCF does not take advantage of linkage disequilibrium (LD), while using this information can dramatically improve compression ratio (Durbin, 2014). Third, a VCF record is not clearly defined. Each record may consist of multiple alleles with each allele composed of multiple SNPs and INDELs. This ambiguity complicates annotations, query of alleles and integration of multiple datasets. Finally, most existing VCF-based tools do not support expressive data query. We frequently need to write scripts for advanced queries, which costs both development and processing time. GQT (Layer et al., 2015) attempts to solve some of these issues. Although it is very fast for selecting a subset of samples and for traversing all sites, it discards phasing, is inefficient for region query and does not compress well. These limitations motivated us to develop BGT.

2 Methods

Unlike VCF, which stores sample phenotypes, site annotations and genotypes all in one file, BGT separates the three types of information into individual files. BGT keeps diploid genotypes as a 2-bit integer matrix $(H_{ki})$ with rows indexed by pairs of overlapping reference/non-reference alleles and columns by sample haplotypes (thus for m samples, the matrix has 2m columns). $H_{ki}$ takes value 0 if the i-th haplotype has the reference allele in the allele pair at row k, 1 if the haplotype has the non-reference allele, 2 if the allele is unknown and 3 if the haplotype has a different non-reference allele. BGT arbitrarily phases unphased genotypes and always breaks complex variants in VCF down to the smallest possible variants. It keeps the allele pairs (i.e. rows) in a site-only BCF, disallowing multiple alleles per VCF line, and stores the integer matrix as two positional BWTs (PBWTs), one for the lower bit and the other for the higher bit.
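The 2-bit coding and its split into two bit planes can be sketched as follows. This is an illustration only, not BGT's C implementation: the function names and the use of `None` for a missing allele are assumptions made for the example, while the four code values follow the definition above.

```python
# Illustrative sketch; names are hypothetical and not taken from BGT's source.
# Codes follow the text: 0 = reference, 1 = non-reference allele of this row,
# 2 = unknown, 3 = a different non-reference allele.
REF, ALT, UNKNOWN, OTHER_ALT = 0, 1, 2, 3


def encode_row(haplotype_alleles, ref, alt):
    """Map each haplotype's allele (None if missing) to a 2-bit code for one ref/alt pair."""
    codes = []
    for allele in haplotype_alleles:
        if allele is None:
            codes.append(UNKNOWN)
        elif allele == ref:
            codes.append(REF)
        elif allele == alt:
            codes.append(ALT)
        else:
            codes.append(OTHER_ALT)
    return codes


def split_bit_planes(codes):
    """Split 2-bit codes into a lower-bit row and a higher-bit row."""
    low = [c & 1 for c in codes]
    high = [c >> 1 for c in codes]
    return low, high


def merge_bit_planes(low, high):
    """Rebuild the 2-bit codes from the two binary rows."""
    return [(h << 1) | l for l, h in zip(low, high)]


# Two samples give four haplotype columns.
row = encode_row(["A", "G", None, "T"], ref="A", alt="G")
low, high = split_bit_planes(row)
```

Each of the two binary rows can then be handled by any encoder for binary matrices, which is why the text stores one PBWT per bit.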

BGT obtains phenotypes and site annotations from files in the Flat Metadata Format (FMF). FMF is TAB-delimited, with the first column showing the row name and the following columns giving typed key-value pairs. An example looks like:

    sample1 gender:Z:M height:f:1.73 foo:i:10
    sample2 gender:Z:F height:f:1.64 bar:i:20

BGT can retrieve rows via an arbitrary expression such as ‘height > 1.65’.
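A minimal sketch of reading such rows and applying a condition is shown below. It assumes that only the three type codes visible in the example (`Z` for string, `f` for float, `i` for integer) occur, and it uses a Python callable in place of BGT's own expression syntax; the function names are hypothetical.

```python
# Illustrative FMF reader; not BGT's parser. Assumes only the type codes
# Z (string), f (float) and i (integer) seen in the example above.
def parse_fmf_line(line):
    """Parse one TAB-delimited FMF row into (row_name, {key: typed_value})."""
    fields = line.rstrip("\n").split("\t")
    casts = {"Z": str, "f": float, "i": int}
    attrs = {}
    for field in fields[1:]:
        key, type_code, value = field.split(":", 2)
        attrs[key] = casts[type_code](value)
    return fields[0], attrs


def select_rows(lines, predicate):
    """Return the names of rows whose attributes satisfy the predicate."""
    names = []
    for line in lines:
        name, attrs = parse_fmf_line(line)
        if predicate(attrs):
            names.append(name)
    return names


fmf_lines = [
    "sample1\tgender:Z:M\theight:f:1.73\tfoo:i:10",
    "sample2\tgender:Z:F\theight:f:1.64\tbar:i:20",
]
# Stand-in for the expression 'height > 1.65'; rows lacking the key are
# skipped here, which is an assumption rather than documented BGT behaviour.
tall = select_rows(fmf_lines, lambda a: "height" in a and a["height"] > 1.65)
```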

The multi-file design makes BGT unfriendly to data streaming but it enables BGT to use one set of site annotations across multiple BGT files and allows users to modify phenotypes and annotations without re-encoding all the genotypes.

2.1 PBWT overview

PBWT (Durbin, 2014) is a generic way to encode binary matrices. Let $(A_k)_k = (A_0, \ldots, A_{n-1})$ denote a list of $m$-long binary strings. $(A_k)_k$ can be regarded as an $n \times m$ binary matrix with $A_k[i]$ representing the cell at row $k$ and column $i$. For simplicity, introduce a sentinel row $A_{-1} = \$_0 \$_1 \cdots \$_{m-1}$ with a lexicographical order $\$_0 < \cdots < \$_{m-1}$.

Define binary string:

$$P_{k,i} = A_k[i]\, A_{k-1}[i] \cdots A_0[i]\, A_{-1}[i]$$

to be the reverse of the column prefix ending at $(k, i)$ and define $S_k(i)$ to be the column index of the $i$-th smallest prefix among set $\{P_{k,j}\}_j$. $S_k(i)$ is a bijection on $\{0, \ldots, m-1\}$ and thus invertible. In a special case, $S_{-1}(i) = i$ because $P_{-1,i} = A_{-1}[i] = \$_i$.

The PBWT of $(A_k)_k$ is $(B_k)_k$, which is calculated by

$$B_k[i] = A_k[S_{k-1}(i)]$$

An important use of $(B_k)_k$ is to compute $S_k$. Define

$$\phi_k(i) = C_k(B_k[i]) + \mathrm{rank}_k(B_k[i], i)$$

where $C_k(b)$ gives the number of symbols in $B_k$ that are lexicographically smaller than $b$ and $\mathrm{rank}_k(b, i)$ the number of $b$ symbols in $B_k$ before position $i$. The $i$-th smallest column in row $k-1$ is ranked $\phi_k(i)$ in row $k$. Thus

$$S_k(\phi_k(i)) = S_{k-1}(i)$$

Given $A_k$ and $S_{k-1}$, we can compute $S_k$ and $B_k$ in the order of $B_k \to \phi_k \to S_k$, starting from $k = 0$. Conversely, given $B_k$ and $S_{k-1}$, computing $\phi_k \to S_k \to A_k$ derives $A_k$ from $B_k$.
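The recurrences above translate into a short encoder and decoder. The sketch below works on plain Python lists of 0/1 values and is meant only to make the definitions concrete; it is not BGT's or PBWT's actual code, and the function names are hypothetical. The decoder recovers each row from the previous ordering before updating the ordering, which is equivalent to the order given in the text because the current ordering composed with phi equals the previous ordering.

```python
# Illustrative PBWT over a binary matrix given as a list of rows.
def next_order(prev_order, b_row):
    """Compute S_k from S_{k-1} and B_k using phi_k(i) = C_k(b) + rank_k(b, i)."""
    zeros = b_row.count(0)  # C_k(1); C_k(0) is 0
    order = [0] * len(b_row)
    seen = [0, 0]  # seen[b] is rank_k(b, i) while scanning position i
    for i, b in enumerate(b_row):
        phi = (zeros if b == 1 else 0) + seen[b]
        order[phi] = prev_order[i]  # S_k(phi_k(i)) = S_{k-1}(i)
        seen[b] += 1
    return order


def pbwt_encode(rows):
    """Return (B_k)_k for the binary matrix (A_k)_k."""
    order = list(range(len(rows[0])))  # S_{-1}(i) = i
    encoded = []
    for a_row in rows:
        b_row = [a_row[j] for j in order]  # B_k[i] = A_k[S_{k-1}(i)]
        encoded.append(b_row)
        order = next_order(order, b_row)
    return encoded


def pbwt_decode(encoded):
    """Recover (A_k)_k from (B_k)_k."""
    order = list(range(len(encoded[0])))
    rows = []
    for b_row in encoded:
        a_row = [0] * len(b_row)
        for i, j in enumerate(order):
            a_row[j] = b_row[i]  # invert B_k[i] = A_k[S_{k-1}(i)]
        rows.append(a_row)
        order = next_order(order, b_row)
    return rows


matrix = [
    [0, 1, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
encoded = pbwt_encode(matrix)
```

Because the column ordering is rebuilt from the encoded rows alone, it never needs to be stored.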

When there are strong correlations between adjacent rows, which is true for haplotype data due to LD, 0s and 1s tend to form long runs in $B_k$. This usually makes $B_k$ much more compressible than $A_k$ under run-length encoding. For our test dataset, 32 000 genotypes in a row can be compressed to fewer than 200 bytes on average.
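To see why long runs help, here is a generic run-length encoder applied to two toy rows with the same number of 1s. The byte-level layout BGT uses is not described in the text, so this is only an illustration of the principle with hypothetical function names.

```python
from itertools import groupby


# Generic run-length coding of a binary row; not BGT's on-disk layout.
def run_length_encode(bits):
    """Collapse consecutive equal symbols into (symbol, run_length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(bits)]


def run_length_decode(runs):
    """Expand (symbol, run_length) pairs back into the original row."""
    return [symbol for symbol, length in runs for _ in range(length)]


scattered = [0, 1, 0, 1, 0, 1, 0, 1]  # alternating symbols: one run per cell
clustered = [0, 0, 0, 0, 1, 1, 1, 1]  # same symbols grouped: two runs
scattered_runs = run_length_encode(scattered)
clustered_runs = run_length_encode(clustered)
```

The clustered row needs two pairs where the scattered row needs eight, and the gap widens as rows get longer.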

2.2 Query genotypes and output

A BGT query may consist of three types of conditions: (a) genotype-independent sample selection, such as a list of sample names or an arbitrary expression on phenotypes; (b) genotype-independent site selection, such as a genomic region, a list of alleles or an arbitrary expression on variant annotations; (c) genotype-dependent site conditions, such as alleles being common among selected samples. We may select multiple groups of samples with type (a) conditions. For each group, BGT will compute aggregate variables, including the number of called samples and the allele count, which can be output or used in type (c) conditions.

BGT usually outputs VCF/BCF with aggregate variables written to the INFO field. It may optionally output a TAB-delimited table of user-selected fields. BGT may also output the samples having a list of alleles, and the counts of haplotypes across requested alleles in multiple sample groups.

2.3 BGT server

BGT comes with a standalone web server frontend implemented in the Go programming language. The server has a similar interface to the command line tool but with additional consideration of sample anonymity. With BGT, each sample has an attribute ‘minimal group size’ or MGS. If a query selects a group containing a sample with an MGS larger than the requested group size, the server will refuse the request. In particular, if a sample has an MGS larger than one, users cannot access its sample name and individual genotypes but can retrieve allele counts computed together with other samples. This prevents users from accessing data at the level of a single sample.
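The MGS rule can be expressed as a small check. The sketch below is in Python rather than the server's Go and reads "requested group size" as the number of samples in the selected group; both the function names and that reading are assumptions made for illustration.

```python
# Illustrative MGS check; names are hypothetical and the real server is written in Go.
def group_allowed(group, mgs_by_sample):
    """Allow a group only if no member's MGS exceeds the group size."""
    size = len(group)
    return all(mgs_by_sample[sample] <= size for sample in group)


def may_show_individual(sample, mgs_by_sample):
    """Sample name and individual genotypes are visible only when MGS is at most one."""
    return mgs_by_sample[sample] <= 1


mgs = {"a": 1, "b": 3, "c": 3, "d": 3}
```

Under this reading, samples `b`, `c` and `d` contribute to allele counts only in groups of three or more, while `a` can be queried alone.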

3 Results

We generated the BGT database for the first release of the Haplotype Reference Consortium (HRC; http://bit.ly/HRC-org). The input is a BCF containing 32 488 samples across 39.2 million SNPs on autosomes. The BGT file size is 7.4 GB, 11% of the genotype-only BCF or 8% of GQT. Decoding the genotypes of all samples across 142k sites in a 10 Mbp region takes 11 CPU seconds, which amounts to decoding 420 million genotypes per second. This speed is even faster than computing allele counts and outputting VCF.
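The decoding rate follows directly from the figures quoted above, taking the rounded site count of 142 000 at face value:

```latex
\[
\frac{32\,488 \times 142\,000\ \text{genotypes}}{11\ \text{CPU s}}
\approx \frac{4.61 \times 10^{9}}{11}
\approx 4.2 \times 10^{8}\ \text{genotypes per CPU second}
\]
```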

We use the following command line to demonstrate the query syntax of BGT:

    bgt view -G -d var.fmf.gz -a'gene=="BRCA1"' \
      -s 'source=="IBD"' -s 'source=="1000G"' \
      -f 'AC1/AN1>=0.001&&AC2/AN2>=0.001' \
      HRC-r1.bgt

It finds BRCA1 variants annotated in ‘var.fmf.gz’ that have a frequency of at least 0.1% in both the IBD dataset (http://www.ibdresearch.co.uk) and 1000 Genomes (1000 Genomes Project Consortium, 2012). In this command line, -G disables the output of genotypes. Option -a selects variants with the ‘gene’ attribute equal to ‘BRCA1’ according to the variant database specified with -d. This is a type (b) condition, independent of sample genotypes. Each option -s sets a type (a) condition, selecting a group of samples based on phenotypes. For the #-th sample group (i.e. the #-th -s option), BGT counts the total number of called alleles and the number of non-reference alleles and writes them to the AN# and AC# aggregate variables, respectively. Option -f then uses these aggregate variables to filter output. This is a type (c) condition.
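The per-group aggregates and the frequency filter can be sketched on one row of 2-bit codes. This assumes that a called allele is any code other than 2 (unknown) and that only code 1 counts toward the non-reference allele of the row; the function names and the handling of groups with no called alleles are illustrative choices, not BGT's documented behaviour.

```python
# Illustrative AN/AC computation on one row of 2-bit codes
# (0 = ref, 1 = this row's alt, 2 = unknown, 3 = other alt).
def allele_counts(codes, columns):
    """Return (AN, AC) over the selected haplotype columns."""
    an = sum(1 for j in columns if codes[j] != 2)  # called alleles (assumed: not unknown)
    ac = sum(1 for j in columns if codes[j] == 1)  # this row's non-reference allele
    return an, ac


def passes_filter(codes, groups, min_freq=0.001):
    """Mimic 'AC1/AN1>=0.001&&AC2/AN2>=0.001' for any number of sample groups."""
    for columns in groups:
        an, ac = allele_counts(codes, columns)
        if an == 0 or ac / an < min_freq:  # assumed: a group with no calls fails
            return False
    return True


# Two groups of two samples each, i.e. four haplotype columns per group.
groups = [[0, 1, 2, 3], [4, 5, 6, 7]]
shared_row = [1, 0, 0, 2, 0, 1, 1, 0]
private_row = [0, 0, 0, 0, 1, 0, 0, 0]
```

Here `shared_row` carries the allele in both groups and passes, while `private_row` is absent from the first group and is filtered out.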

The command line above takes 12 CPU seconds, with most of the time spent on reading through the variant annotation file to find matching alleles. The BGT server reads the entire file into memory to alleviate the overhead, but a better solution would be to use a proper database for variant annotations.

To demonstrate the server frontend, we have also set up a public BGT server at http://bgtdemo.herokuapp.com. It hosts 1000 Genomes haplotypes for chromosomes 11 and 20.

4 Discussion

Given a multi-sample VCF, most BGT functionalities can be achieved with small scripts, but as a command line tool, BGT has a few advantages. First, it saves development time. Extracting information from multiple files can be done with a command line instead of a script. Second, BGT saves processing time. With high-performance C code at the core, BGT is much faster than processing VCF in a scripting language such as Perl or Python. For example, deriving allele counts in a 10 Mbp region for the HRC data takes 30 s with BGT, but doing the same with a Perl script takes 40 min, an 80-fold difference. Third, the design of one non-reference allele per record simplifies BGT merge and makes it twice as fast as bcftools merge on two genotype-only input files.

The BGT server tries to solve a bigger problem: data sharing. Instead of always delivering full data in VCF, projects could have a new option to serve data publicly with the BGT server, letting users select the summary statistics of interest on the fly while keeping samples unidentifiable. This is an improvement over Stade et al. (2014), which only provides precomputed summaries.

We acknowledge that our MGS-based data sharing policy might have oversimplified real scenarios, but we believe this direction, with proper improvements and more importantly the approval of ethical review boards, will be more open, convenient, efficient and secure than our current share-everything-with-trust model.

Acknowledgements

The authors are grateful to HRC for granting the permission to use the data for evaluating the performance of BGT and thank the Global Alliance Data Working Group for the helpful suggestions.

Funding

NHGRI [U54HG003037]; NIH [GM100233].

Conflict of Interest: none declared.

References

  1. 1000 Genomes Project Consortium. (2012) An integrated map of genetic variation from 1 092 human genomes. Nature, 491, 56–65.
  2. Danecek P., et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.
  3. Durbin R. (2014) Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics, 30, 1266–1272.
  4. Layer R.M., et al. (2015) Efficient compression and analysis of large genetic variation datasets. bioRxiv, dx.doi.org/10.1101/018259.
  5. Stade B., et al. (2014) GrabBlur–a framework to facilitate the secure exchange of whole-exome and -genome SNV data using VCF files. BMC Genomics, 15 (Suppl. 4), S8.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
