vcfpp: a C++ API for rapid processing of the variant call format

Zilong Li

doi:10.1093/bioinformatics/btae049

Bioinformatics. 2024 Jan 25;40(2):btae049.

Editor: Peter N Robinson

PMCID: PMC10868310 PMID: 38273677

Abstract

Motivation

Given the widespread use of the variant call format (VCF/BCF), coupled with the continuous surge in big data, there remains a perpetual demand for fast and flexible methods to manipulate these comprehensive formats across various programming languages.

Results

This work presents vcfpp, a C++ API of HTSlib in a single file, providing an intuitive interface to manipulate VCF/BCF files rapidly and safely, in addition to being portable. Moreover, this work introduces the vcfppR package to demonstrate the development of a high-performance R package with vcfpp, allowing for rapid and straightforward variant analyses.

Availability and implementation

vcfpp is available from https://github.com/Zilong-Li/vcfpp under MIT license. vcfppR is available from https://cran.r-project.org/web/packages/vcfppR.

1 Introduction

Computational biologists have made numerous efforts and contributions to facilitate analyses of genomic variants. The variant call format (VCF) has become the standard for storing genetic variant information and is defined by a detailed specification (Danecek et al. 2011). With big data on the rise, the binary variant call format (BCF) was later designed to query and store large datasets efficiently. The C API of HTSlib (Bonfield et al. 2021) provides a full set of functionalities to manipulate the VCF/BCF for both compressed and uncompressed files. Because the C API is challenging for less proficient programmers to use, APIs in other languages have been created to fill the gap. Popular existing libraries include vcfR (Knaus and Grünwald 2017) for R, cyvcf2 (Pedersen and Quinlan 2017) for Python, hts-nim (Pedersen and Quinlan 2018) for Nim, and vcflib (Garrison et al. 2022) for C++. All are valuable to their respective communities, but not without disadvantages. In particular, vcflib is both an API and a large collection of command line tools, with the primary pitfall being that it does not support the BCF format. It is noteworthy that many methods written in C++ and designed for large sample sizes cannot read the compressed VCF or BCF, as in the case of Syllable-PBWT (Wang et al. 2023). The motivation behind vcfpp is to offer the full functionality of HTSlib and to provide a simple and safe API in a single header file, which can be easily integrated for programming in C++ as well as other languages that can call C/C++ code, such as R (https://www.r-project.org/) and Python (https://www.python.org/).

2 Materials and methods

2.1 The VCF and HTSlib

A VCF file consists of a header section and a body section. The header contains lines with meta-information, each starting with the characters “##” or “#”, while the body consists of TAB-delimited lines. A VCF file differs from a typical TAB-delimited file in several ways. Firstly, the header is too important to be ignored: a VCF file that contains a tag in the body without its declaration in the header violates the specification. Secondly, the VCF can be compressed and randomly accessed using bgzip and tabix, which were originally part of the SAMtools library (Li et al. 2009) but were later separated into HTSlib (Bonfield et al. 2021). Additionally, the VCF standard is established and regularly updated by the SAMtools team. Therefore, it is more advantageous to process the VCF with a specialized library, namely HTSlib, than with a customized parser. HTSlib offers full functionalities to interact with the VCF, including support for BCF, format validation, compression, random access, identification of variant types, and support for URL links as filenames, among others. Furthermore, HTSlib has demonstrated the best performance among many competitors (Bonfield et al. 2021). However, HTSlib is written in C, which can be challenging for beginner programmers to use, especially when it comes to memory management.
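
The header rule above can be illustrated with a short, library-free sketch. It is not how HTSlib or vcfpp validates a file; it only assumes that INFO tags are declared by header lines beginning with `##INFO=<ID=` and that INFO is the eighth TAB-delimited column of each body line. The function names `splitOn` and `undeclaredInfoTags` are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split a string on a single-character delimiter.
std::vector<std::string> splitOn(const std::string& s, char delim) {
    std::vector<std::string> out;
    std::string field;
    std::istringstream in(s);
    while (std::getline(in, field, delim)) out.push_back(field);
    return out;
}

// Return the INFO keys used in the body that have no matching
// "##INFO=<ID=...>" declaration in the header (sorted, unique).
std::vector<std::string> undeclaredInfoTags(const std::string& vcfText) {
    const std::string prefix = "##INFO=<ID=";
    std::set<std::string> declared;
    std::set<std::string> missing;
    std::istringstream in(vcfText);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        if (line.rfind(prefix, 0) == 0) {
            // The tag name runs from after "ID=" to the next ',' or '>'.
            const std::size_t stop = line.find_first_of(",>", prefix.size());
            declared.insert(line.substr(prefix.size(), stop - prefix.size()));
        } else if (line[0] != '#') {
            const std::vector<std::string> cols = splitOn(line, '\t');
            assert(cols.size() >= 8);      // CHROM through INFO are mandatory
            if (cols[7] == ".") continue;  // record carries no INFO entries
            for (const std::string& entry : splitOn(cols[7], ';')) {
                // Flags have no '=', so the whole entry is the key.
                const std::string key = entry.substr(0, entry.find('='));
                if (declared.count(key) == 0) missing.insert(key);
            }
        }
    }
    return std::vector<std::string>(missing.begin(), missing.end());
}
```

A body line carrying `DP=10;AF=0.5` under a header that declares only `DP` would be reported with `AF` as undeclared, which is the kind of specification violation a plain TAB-delimited reader would not notice.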

2.2 A C++ API

As a C++ API of HTSlib, vcfpp inherits all functionalities from HTSlib and is implemented in a single header file that can be easily integrated and used safely. There are four core classes in vcfpp, as summarized in Table 1, all of which allocate and free memory automatically and safely, enabling users to program with ease.

Table 1.

vcfpp capabilities and the implemented C++ classes.

Class	Capabilities
BcfReader	Read VCF/BCF, decompression, random access
BcfWriter	Write VCF/BCF, compression
BcfRecord	Access the fields, modify the fields
BcfHeader	Access the header, modify the header
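
The automatic memory management mentioned above is the usual C++ pattern of tying a C handle to an object's lifetime. The following is a minimal sketch of that pattern only: `c_record`, `c_record_init`, `c_record_destroy`, and `Record` are hypothetical stand-ins, not names from HTSlib or vcfpp, and the counter `live_records` exists purely to make the release observable.

```cpp
#include <cassert>
#include <memory>

// Hypothetical C-style API standing in for a C library (NOT HTSlib).
struct c_record {
    int n;
    int* values;
};

static int live_records = 0;  // number of handles not yet released

c_record* c_record_init(int n) {
    c_record* r = new c_record{n, new int[n]()};  // values zero-initialized
    ++live_records;
    return r;
}

void c_record_destroy(c_record* r) {
    if (r == nullptr) return;
    delete[] r->values;
    delete r;
    --live_records;
}

// C++ wrapper: the handle is released when the object goes out of scope,
// so callers never call c_record_destroy themselves.
class Record {
public:
    explicit Record(int n) : handle_(c_record_init(n), c_record_destroy) {}

    int size() const { return handle_->n; }

    int& operator[](int i) {
        assert(i >= 0 && i < handle_->n);  // bounds check in debug builds
        return handle_->values[i];
    }

private:
    std::unique_ptr<c_record, void (*)(c_record*)> handle_;
};
```

With such a wrapper, a `Record` declared inside a block is freed at the closing brace, including when an exception unwinds the stack, which is the safety property the C API leaves to the programmer.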


To illustrate commonly used features and demonstrate the simplicity of vcfpp, I showcase an example that is used throughout the article. Listing 1 counts the number of heterozygous sites for each sample in a VCF file. The code first includes the single vcfpp.h file (Line 1), opens a compressed VCF file constrained to three samples and region “chr21” (Line 2), and creates a variant record associated with the header information in the VCF file (Line 3). Then, it defines several objects to collect the results (Lines 4 and 5). Taking advantage of generic templates in C++, the genotype (GT) value can be of bool, char, or int type, so that users may control the memory consumed by their program. Next, it iterates over the VCF file and processes each variant record in the loop (Line 6). Here, a variant is skipped if it is of another type (INDEL, SV), if FILTER does not display “PASS”, or if the QUAL value is smaller than nine (Lines 8 and 9), though the API also allows more complicated filtering if needed. Finally, the number of heterozygous variants is counted for each diploid sample (Lines 10 and 11). The core is only 12 lines.

Listing 1: Counting the number of heterozygous GTs for three samples on chr21

[Listing 1 is an image in the source (btae049ilf1.jpg) and is not reproduced here.]
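
The counting logic described for Listing 1 can be sketched without any library on an in-memory, uncompressed VCF string. This is not the vcfpp code from the listing and it uses none of the vcfpp API; it is exactly the kind of customized parser the article advises against for real data, shown only to make the filters concrete. It assumes biallelic SNPs are records with single-base REF and ALT, that GT is the first FORMAT key, and that a missing QUAL (“.”) counts as below the threshold. The names `splitFields`, `isHet`, and `countHets` are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Split a line into fields on a single-character delimiter.
std::vector<std::string> splitFields(const std::string& s, char delim) {
    std::vector<std::string> out;
    std::string field;
    std::istringstream in(s);
    while (std::getline(in, field, delim)) out.push_back(field);
    return out;
}

// True for a diploid genotype such as "0|1" or "1/0" whose two alleles
// are both called and differ from each other.
bool isHet(const std::string& gt) {
    const std::size_t sep = gt.find_first_of("|/");
    if (sep == std::string::npos) return false;  // haploid or malformed
    const std::string a = gt.substr(0, sep);
    const std::string b = gt.substr(sep + 1);
    if (b.find_first_of("|/") != std::string::npos) return false;  // ploidy > 2
    if (a.empty() || b.empty() || a == "." || b == ".") return false;
    return a != b;
}

// Count heterozygous genotypes per sample over SNPs with FILTER == PASS
// and QUAL >= 9. Expects a "#CHROM" line before the first record.
std::vector<int> countHets(const std::string& vcfText) {
    std::vector<int> hets;
    std::istringstream in(vcfText);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        if (line.rfind("#CHROM", 0) == 0) {
            // Sample columns start after the nine fixed columns.
            const std::size_t ncols = splitFields(line, '\t').size();
            hets.assign(ncols > 9 ? ncols - 9 : 0, 0);
            continue;
        }
        if (line[0] == '#') continue;  // other header lines
        const std::vector<std::string> cols = splitFields(line, '\t');
        assert(cols.size() == hets.size() + 9);
        const std::string& ref = cols[3];
        const std::string& alt = cols[4];
        const bool isSnp = ref.size() == 1 && alt.size() == 1 &&
                           alt != "." && alt != "*";
        const bool pass = cols[6] == "PASS";
        const bool goodQual = std::atof(cols[5].c_str()) >= 9.0;  // "." parses as 0
        if (!isSnp || !pass || !goodQual) continue;
        for (std::size_t i = 9; i < cols.size(); ++i) {
            // GT is taken as the text before the first ':' of the sample field.
            const std::string gt = cols[i].substr(0, cols[i].find(':'));
            if (isHet(gt)) ++hets[i - 9];
        }
    }
    return hets;
}
```

Unlike this sketch, a library-backed reader also handles compressed input, BCF, multi-allelic records, and region or sample selection.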

3 Results

To demonstrate the simplicity, portability, and performance of vcfpp, the following sections include the benchmarking results and highlight the vcfppR package as an example of vcfpp working with R (R Core Team 2023).

3.1 Working with R

While vcfpp is very simple for writing a C++ program, a single C++ header file can also be easily integrated into popular scripting languages like R and Python. In particular, R is designed for statistical modeling and visualization and is widely used for data analyses. Therefore, I developed the vcfppR package to demonstrate how vcfpp can seamlessly work with R using Rcpp (Eddelbuettel and Francois 2011). For instance, with basic familiarity with C++ and Rcpp, we can turn the C++ code in Listing 1 into an Rcpp function that returns the heterozygosity counted per sample along with the sample’s name (Listing 2), which can then be compiled and called dynamically in R using sourceCpp (Listing 3). As such, we can further process and visualize these results straightforwardly within the R ecosystem. For example, we can analyze the data by genomic region in parallel using the parallel package and then stratify the results by population, leveraging the external labels of each sample, to finally visualize them in R.
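
The region-wise workflow mentioned above amounts to two small steps: summing the per-sample counts obtained independently for each region, then grouping the totals by an external population label. The article performs these steps in R; the sketch below shows the same bookkeeping in plain C++ for illustration only. The function names `mergeRegions` and `meanByPopulation` are hypothetical, and the regions are combined sequentially here rather than in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Sum per-sample counts that were computed independently for each region.
std::vector<int> mergeRegions(const std::vector<std::vector<int>>& perRegion) {
    std::vector<int> total;
    for (const std::vector<int>& counts : perRegion) {
        if (total.empty()) total.assign(counts.size(), 0);
        assert(counts.size() == total.size());  // same samples in every region
        for (std::size_t i = 0; i < counts.size(); ++i) total[i] += counts[i];
    }
    return total;
}

// Average the per-sample totals within each population label.
std::map<std::string, double> meanByPopulation(
    const std::vector<int>& counts,
    const std::vector<std::string>& populationOfSample) {
    assert(counts.size() == populationOfSample.size());
    std::map<std::string, double> sum;
    std::map<std::string, int> n;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        sum[populationOfSample[i]] += counts[i];
        ++n[populationOfSample[i]];
    }
    for (auto& entry : sum) entry.second /= n[entry.first];
    return sum;
}
```

Because each region contributes an independent vector of counts, the per-region work can be distributed across processes and only the small count vectors need to be combined afterwards.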

Listing 2: vcfpp-hets.cpp

[Listing 2 is an image in the source (btae049ilf2.jpg) and is not reproduced here.]

Listing 3: The R code compiles and calls the vcfpp-hets.cpp dynamically

[Listing 3 is an image in the source (btae049ilf3.jpg) and is not reproduced here.]

3.2 The vcfppR package

The vcfppR package is developed on top of the vcfpp API. To parse the VCF with vcfppR, the vcfreader and vcftable functions can rapidly read the contents of the VCF into R data types with fine control over the region, samples, variant types, FORMAT column, and filters. For instance, the code in Listing 4 parses the read depth per variant (DP) in the raw call set (VCF) released by the 1000 Genomes Project, accessed through its URL link. It restricts the analysis to three samples and to variants of SNP type in the “chr21:1-10000000” region that pass the FILTER, and it discards the INFO column in the returned list. Subsequently, a visual summary can be generated with boxplot() in R (see Fig. 1).
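
To make the region restriction concrete, the sketch below parses a region string of the form chrom:start-end and tests whether a variant position falls inside it. It illustrates the filter's meaning only: HTSlib resolves regions through an index rather than by scanning, and the names `Region`, `parseRegion`, and `contains` are hypothetical. The sketch assumes 1-based inclusive coordinates and treats a bare chromosome name as the whole chromosome.

```cpp
#include <cassert>
#include <limits>
#include <string>

struct Region {
    std::string chrom;
    long start;  // 1-based, inclusive
    long end;    // 1-based, inclusive
};

// Parse "chrom", "chrom:start" or "chrom:start-end" (ASCII hyphen).
Region parseRegion(const std::string& text) {
    Region r{text, 1, std::numeric_limits<long>::max()};
    const std::size_t colon = text.rfind(':');
    if (colon == std::string::npos) return r;  // whole chromosome
    r.chrom = text.substr(0, colon);
    const std::string range = text.substr(colon + 1);
    const std::size_t dash = range.find('-');
    r.start = std::stol(range.substr(0, dash));
    if (dash != std::string::npos && dash + 1 < range.size()) {
        r.end = std::stol(range.substr(dash + 1));
    }
    assert(r.start >= 1 && r.start <= r.end);
    return r;
}

// True if a variant at (chrom, pos) lies inside the region.
bool contains(const Region& r, const std::string& chrom, long pos) {
    return chrom == r.chrom && pos >= r.start && pos <= r.end;
}
```

For the region used in the text, a variant at position 5000 on chr21 is kept, while one at position 10000001, or any variant on another chromosome, is skipped.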

Figure 1. Analyzing variants discovered in the 1000 Genomes Project with vcfppR (see code in the Supplementary Material).

Listing 4: Example of vcfppR::vcftable

[Listing 4 is an image in the source (btae049ilf4.jpg) and is not reproduced here.]

Furthermore, as characterizing variants is an essential task in genomic analyses, I showcase the vcfsummary function in Fig. 1, which summarizes the variants found in the latest VCF released by the 1000 Genomes Project (Byrska-Bishop et al. 2022).

3.3 Benchmarking

In addition to simplicity and portability, I showcase here the performance of vcfpp and vcfppR. For the benchmarking, I developed scripts (https://github.com/Zilong-Li/vcfpp/tree/main/scripts) to perform a common analysis of counting heterozygous GTs per sample on a Linux server with an AMD EPYC 7643 48-Core Processor. As shown in Table 2, when using the compressed (gzipped) VCF of 3202 samples and 1 002 753 variants, which includes only GT in the FORMAT, the Rcpp function vcfppR::heterozygosity in Listing 2 demonstrates comparable performance to the compiled C++ code of vcfpp in Listing 1 with a minor overhead. This overhead is attributed to a list of sample names being returned to R. The dynamic script using vcfppR::vcfreader is only $1.3 \times$ slower than its compiled C++ counterpart, whereas cyvcf2::VCF is $1.9 \times$ slower. With the streaming strategy, all scripts use little RAM, given that they only load one variant into memory at a time. However, R packages like vcfR and data.table usually load all VCF data into memory first and perform analyses later, which is referred to here as the “two-step” strategy. To this end, I have also developed the vcftable function in vcfppR to load the entire contents of the VCF into R for the two-step comparison. Notably, vcfppR::vcftable is only $2.0 \times$ slower, compared to the $19 \times$ slower vcfR::read.vcfR and the $119 \times$ slower data.table::fread. This discrepancy arises because the GT values returned by both vcfR and data.table are characters, which are inefficient to process further in R. On the other hand, with vcfppR, an integer matrix of GTs can be returned to R directly for fast computation. If we exclude the elapsed time of loading data, which means ignoring the time marked by the *, then vcfppR demonstrates a $101 \times$ speed improvement over vcfR. Importantly, vcfpp and vcfppR offer users the full functionalities of HTSlib, including support for compressed VCF/BCF and selection of samples, regions, and variant types.
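
Some of the quoted figures can be checked by hand from Table 2. As a worked example, assume that the Ratio column for the two-step tools is the total elapsed time (loading plus analysis) divided by the 97 s of compiled vcfpp, an assumption that reproduces the vcfR and data.table entries, and that the speed improvement over vcfR compares analysis time only (1219 s against the 12 s of vcfppR::vcftable):

```latex
\begin{align*}
\text{vcfR::read.vcfR:} \quad & \frac{625 + 1219}{97} = \frac{1844}{97} \approx 19.0 \\
\text{data.table::fread:} \quad & \frac{313 + 11243}{97} = \frac{11556}{97} \approx 119.1 \\
\text{analysis only, vcfR vs. vcfppR:} \quad & \frac{1219}{12} \approx 101.6
\end{align*}
```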

Table 2.

Performance of counting heterozygous GTs per sample with VCF of 3202 samples and 1 002 753 variants.^a

API/function	Time	Ratio	RAM	Strategy
vcfpp	97	1.0	0.006	Streaming
vcfppR::heterozygosity	109	1.1	0.074	Streaming
vcfppR::vcfreader	150	1.3	0.136	Streaming
cyvcf2::VCF	212	1.9	0.036	Streaming
vcfppR::vcftable	206* + 12	2.0	64.7	Two-step
vcfR::read.vcfR	625* + 1219	19.0	97.5	Two-step
data.table::fread	313* + 11243	119.1	77.3	Two-step

^a Time is in seconds. RAM is in gigabytes.

* Time used for loading data in the two-step strategy.

4 Discussion

Here, I have developed vcfpp, a fast and flexible C++ API for high-performance genetic variant analyses with the VCF/BCF input. Its simplicity and portability can be very valuable for both developing packages and writing in-house scripts. The vcfppR package is an example of vcfpp working with R, and examples of developing a Python API for vcfpp are also available on GitHub. In vcfppR, the vcfreader and vcftable functions process the variants. The vcftable function is specifically designed for reading the VCF into a tabular data structure in R, but it can only read a single FORMAT item in one pass over the VCF. In contrast, vcfreader serves as the full R bindings of vcfpp and allows iterative parsing of variants, giving users the flexibility to decide which information to retrieve for each variant. As such, many packages written in C++ that use a customized VCF parser could simply replace it with vcfpp to offer more functionalities. For instance, vcfpp has been successfully integrated into the imputation software programs STITCH (Davies et al. 2016) and QUILT (Davies et al. 2021) to parse large reference panels in the compressed VCF/BCF.

Data and code availability

The latest release of vcfpp.h and documentation can be found at https://github.com/Zilong-Li/vcfpp. The vcfppR package can be installed through CRAN for all platforms (https://CRAN.R-project.org/package=vcfppR). Scripts for the benchmarking are available at https://github.com/Zilong-Li/vcfpp/tree/main/scripts.

Acknowledgements

I would like to thank Anders Albrechtsen at Copenhagen University and Robert W. Davies at Oxford University for their helpful comments. They are statisticians as well as R enthusiasts working on genetics, whom I work with and have learned a lot from. Also, I want to thank Cindy G. Santander and the reviewers for helping me to improve the quality of this article.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

This work was supported by the Novo Nordisk Foundation [NNF20OC0061343].

References

Bonfield JK, Marshall J, Danecek P et al. HTSlib: C library for reading/writing high-throughput sequencing data. Gigascience 2021;10:giab007. doi:10.1093/gigascience/giab007
Byrska-Bishop M, Evani US, Zhao X et al. High-coverage whole-genome sequencing of the expanded 1000 genomes project cohort including 602 trios. Cell 2022;185:3426–40.e19. doi:10.1016/j.cell.2022.08.004
Danecek P, Auton A, Abecasis G et al. The variant call format and VCFtools. Bioinformatics 2011;27:2156–8. doi:10.1093/bioinformatics/btr330
Davies RW, Flint J, Myers S et al. Rapid genotype imputation from sequence without reference panels. Nat Genet 2016;48:965–9. doi:10.1038/ng.3594
Davies RW, Kucka M, Su D et al. Rapid genotype imputation from sequence with reference panels. Nat Genet 2021;53:1104–11. doi:10.1038/s41588-021-00877-0
Eddelbuettel D, Francois R. Rcpp: seamless R and C++ integration. J Stat Soft 2011;40:1–18. doi:10.18637/jss.v040.i08
Garrison E, Kronenberg ZN, Dawson ET et al. A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar. PLoS Comput Biol 2022;18:e1009123. doi:10.1371/journal.pcbi.1009123
Knaus BJ, Grünwald NJ. vcfr: a package to manipulate and visualize variant call format data in R. Mol Ecol Resour 2017;17:44–53. doi:10.1111/1755-0998.12549
Li H, Handsaker B, Wysoker A et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25:2078–9. doi:10.1093/bioinformatics/btp352
Pedersen BS, Quinlan AR. cyvcf2: fast, flexible variant analysis with python. Bioinformatics 2017;33:1867–9. doi:10.1093/bioinformatics/btx057
Pedersen BS, Quinlan AR. hts-nim: scripting high-performance genomic analyses. Bioinformatics 2018;34:3387–9. doi:10.1093/bioinformatics/bty358
R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, 2023.
Wang V, Naseri A, Zhang S et al. Syllable-PBWT for space-efficient haplotype long-match query. Bioinformatics 2023;39:btac734. doi:10.1093/bioinformatics/btac734
