Abstract
Motivation
Genome-wide association studies (GWAS) are widely used to investigate the role of genetics in disease traits, but the resulting file sizes from these studies are large, posing barriers to efficient storage, sharing, and querying. This issue is especially important for biobanks like the UK Biobank that publish GWAS for thousands of traits, increasing the volume of data that must be effectively managed. Current compression and query methods reduce file sizes and allow for quick genomic position-based queries but do not provide utility for quickly finding loci based on their summary statistics. For example, finding all SNVs in a particular p-value range would require decompressing and scanning the whole file. We propose a new tool, STABIX, which introduces summary-statistic-based queries and improves upon the standard bgzip compression and Tabix query tool in both compression ratio and decompression speed.
Results
When applied to 10 GWAS files from PanUKBB, STABIX created smaller compressed data and indices than Tabix for all files; on average, the bgzip and tbi files were 1.2 times the size of the STABIX compressed files and indexes. In the same 10 files, STABIX per-gene decompression was, on average, 7× faster than Tabix per-gene decompression, and was faster for over 99% of nearly 20,000 genes.
Availability and implementation
Software freely available for download at GitHub: https://github.com/kristen-schneider/stabix/.
1 Introduction
Genome-wide association studies (GWAS) use statistical approaches to identify connections between genotypes and phenotypes by looking across control and disease populations (e.g. heart disease, type II diabetes, auto-immune and metabolic disorders, etc) to identify single nucleotide variants (SNVs) that are likely associated with certain disease traits (MacArthur et al. 2017; Uffelmann et al. 2021). Some of the earliest GWAS identified independent SNV association signals in bipolar disorder, coronary artery disease, Crohn’s disease, rheumatoid arthritis, and other common diseases (Wellcome Trust Case Control Consortium 2007). These early findings inspired hundreds of other studies which aimed to look more closely at single diseases and to explore disease patterns in diverse populations. Despite some of their early critiques (e.g. unclear biological relevance, flawed assumptions, and spurious results), GWAS has maintained its popularity for the last 15 years as researchers continue to improve analysis methods and publish new GWAS-based discoveries (Visscher et al. 2012, 2017; Loos 2020).
The recent and ongoing surge in biobanks, including the UK Biobank (Sudlow et al. 2015), deCODE genetics, and Biobank Japan (Nakamura 2007), has played a key role in supporting the popularity of GWAS (Zhou et al. 2022). The larger sample sizes in these biobanks provide greater statistical power to discover new associations (Abdellaoui et al. 2023). GWAS for height, blood pressure, and smoking initiation, for example, have made history with sample sizes of over one million. In 2015, the National Human Genome Research Institute-European Bioinformatics Institute (NHGRI-EBI) redesigned its GWAS Catalog database to support the influx of GWAS, as well as support a wider range of information included with the studies (e.g. ancestry and recruitment information) (MacArthur et al. 2017; Buniello et al. 2019). As of August 2024, this catalog contains nearly 67,000 top associations and more than 90,000 full summary statistics. Additionally, the most updated version of Pan UK Biobank (PanUKBB, v0.4) includes a multi-ancestry set of over 7,000 GWAS for a half million samples from six continental ancestry groups (Karczewski et al. 2024).
Beyond the fundamental exploration of trait-disease associations, large-scale GWAS can help derive predictions about patients’ disease risks, termed polygenic risk scores (PRS) (de los Campos et al. 2010; Kullo et al. 2022). These measurements estimate a patient’s risk for certain diseases based on the unique variants in that patient’s genome. When applied appropriately, PRS can raise awareness of diseases before symptoms arise, inform decisions that can help slow disease progression, and help identify targets for drug development (Kullo et al. 2022). The power of large sample sizes allows for a deeper characterization of a patient's medical profile and advances our understanding of the genetic landscape of complex diseases. Even still, several untapped applications for GWAS lie just beneath the surface, such as the inclusion of diverse populations, alternative inheritance models, and other biological and environmental metrics (Tam et al. 2019). One factor that limits this potential is data size. Summary statistics alone can require hundreds of terabytes of space in a single biobank, making storage, sharing, and downstream investigation challenging.
With current technologies, GWAS summary statistics for a single trait require between 3 and 11GB of storage for bgzipped and plain text data, respectively. For over 7,000 traits in the current PanUKBB, this can amount to over 10TB of compressed data. This large and growing collection of GWAS ensures continued interest in sharing summary statistics across research and clinical communities and requires new computational methods for efficient storage and computation.
Two of the most ubiquitous methods that improve the accessibility of GWAS data use compressed (bgzip) and indexed (Tabix) data to reduce the storage burden and the time required to retrieve individual records (Li 2011). Bgzip is similar to gzip, but introduces block-based compression, allowing for efficient regional data decompression. Tabix works alongside bgzip to create a genomic position index of the compressed blocks, enabling quick position-based data retrieval. While the standard B-tree index from SQL databases is a reasonable approach to perform these queries, Tabix is preferred to SQL approaches as it works directly with popular file formats, uniquely works on compressed data files, and is faster than SQL databases for big data [https://github.com/samtools/tabix/blob/master/tabix.1]. While bgzip and Tabix facilitate storing and retrieving GWAS summary statistic data (e.g. BED file formats), their compression is limited to a single compression scheme (i.e. codec), and the access pattern is restricted to genomic position-based queries.
To strengthen the efficient investigation of GWAS data, we introduce STABIX, a compression and indexing method that improves upon bgzip compression ratios with an ensemble codec compression approach and improves upon Tabix queries with the addition of a summary-statistic-based index. STABIX adds column compression to bgzip's block-based structure, allowing multiple codecs to be used for the different data types, resulting in a more effective compression strategy. Additionally, STABIX extends the Tabix position-based index to include a summary-statistic index, supporting a more comprehensive range of GWAS data exploration strategies. To accomplish this, STABIX first creates blocks of rows of fixed or variable (see Methods) length and compresses individual columns with codecs corresponding to their data type (Fig. 1A). Concurrently with compression, STABIX generates a genomic index which stores the block, genomic, and file information necessary to reconstruct individual blocks. Optionally, a second index can be created for a column of interest in the original GWAS file (e.g. p-value), which permits efficient searches for records that satisfy some constraint (e.g. p-value <= 5e-8) (Fig. 1B). Finally, STABIX’s query step allows users to query with both genomic and statistical constraints (Fig. 1C). Another workflow image is included on the main page of our GitHub repository, where the STABIX software is openly available: https://github.com/kristen-schneider/stabix.
Figure 1.
STABIX workflow. STABIX takes a single GWAS file and configuration file as input. STABIX compression separates GWAS data into blocks and generates a column-centric data structure for column-based compression. STABIX generates a genomics index (genomic positions) and a statistical index (statistical bins) specified by the configuration file. STABIX returns records which match a query specified by the configuration file.
2 Methods
In this section, we first describe the input files necessary to make use of STABIX. Second, we detail each of the three steps that comprise STABIX’s compression and indexing procedures. Third, we discuss a STABIX query, including what constitutes a query and how STABIX achieves its output. Finally, we describe methods for reproducing the experiments reported in our results section. Please note that documentation for installation and use of STABIX is available at: https://github.com/kristen-schneider/stabix/blob/main/.
2.1 STABIX input files
The GWAS file is in standard bed file format, sorted by genomic position and tab delimited. It includes columns such as chromosome (chrm) and base pair position (bp). For our experimentation and results, we used 10 per-phenotype files from the PanUKBB. See data and code availability below for a link to the PanUKBB data.
The configuration file specifies STABIX options, including the path to the GWAS file, block size, codec options, and query information. Examples are provided at: https://github.com/kristen-schneider/stabix/blob/main/config_files/test_config.yml.
2.2 STABIX compression and indexing
2.2.1 The STABIX header
STABIX’s custom header stores the information required to fully reconstruct (i.e. decompress) the original file and return queries. This header includes the number of columns in the original GWAS file, the number of blocks created during compression, a list of the column headers in the original GWAS file, a list of block header end bytes, a list of block end bytes, and a list of block sizes (i.e. the last block might differ from the specified block size, or all block sizes might differ if a map file is used to create blocks). Some of the header elements can be obtained without moving through any compression steps (e.g. the number of columns in the original GWAS file). Others cannot be included in the header until after data compression is complete (e.g. block end bytes). The header is therefore generated and written only once compression is complete.
2.2.2 Block-based compression
In many cases, it is not necessary to decompress a complete file (or other representation of data) to accomplish a task or answer a question. For example, with GWAS, it is common that a researcher might only be interested in looking through a particular region (i.e. gene or chromosome) or that they might only be interested in data that fall at or below some p-value. Block-based compression provides the benefit of decompressing only part of the data to investigate a query, over the need to decompress a full file. This feature can decrease decompression time when queries are small or span disjoint regions.
To achieve block-based compression for meaningful queries, a genomic index is created to store genomic information (e.g. block 1 = chromosome 1, base pairs 100–1300), byte-locations, and other information necessary to decompress a single block. Additionally, a binning index is created to store information about what blocks contain what kinds of data (e.g. blocks 1, 8, and 13 contain records with p-values at or below 5e-8.). More about how these indexes are created can be read about in the genomic and statistical index sections below.
The practicality of block-based compression depends on the size of the blocks as it relates to the queries being performed. Blocks with only a few lines might demonstrate quicker decompression times but will have a larger overhead for full compressed file size (i.e. a greater number of blocks requires a lengthier header to account for individual blocks). Furthermore, if the blocks are too small compared to the size of a typical query, the number of blocks which would need to be decompressed could end up reversing the benefit of quick decompression times. On the other hand, blocks with many lines might achieve a greater compression ratio (though not necessarily), but the decompression time starts to slow (Chang et al. 2009).
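The size side of this trade-off can be illustrated with a small sketch. It assumes synthetic rows and a single zlib codec for every block, neither of which is taken from STABIX, and the helper name `total_compressed_size` is hypothetical. It only shows how the number of independently compressed blocks affects the total compressed size.

```python
import zlib


def total_compressed_size(lines, block_size):
    """Compress `lines` in independent blocks; return (n_blocks, total_bytes)."""
    blocks = [lines[i:i + block_size] for i in range(0, len(lines), block_size)]
    total = sum(len(zlib.compress("".join(block).encode())) for block in blocks)
    return len(blocks), total


# Synthetic, deterministic stand-in for GWAS rows (not real data).
rows = [f"1\t{1000 + 13 * i}\t{(i * 7919) % 1000 / 1000:.3f}\n" for i in range(20000)]

# Compare many small blocks against fewer large blocks.
sizes = {block_size: total_compressed_size(rows, block_size) for block_size in (10, 2000)}
for block_size, (n_blocks, total) in sizes.items():
    print(f"block_size={block_size}: {n_blocks} blocks, {total} compressed bytes")
```

Each independently compressed block carries its own stream overhead and cannot reuse patterns from neighbouring blocks, so very small blocks tend to inflate the total size even though each one is quick to decompress.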
To achieve block-based compression, STABIX parses the input GWAS file into a set of blocks whose size is determined by the user in our configuration file. STABIX’s default block size is fixed at 2,000 records. Optionally, the user can set the block size to point to a genomic map file (examples provided in our GitHub repository). In this case, blocks are set to be 1 cM in length and contain a variable number of records.
2.2.3 Column-specific compression
Within each block, records (i.e. rows) are further split into columns. All data from a single column in a single block is stored together in a list. Each column is then compressed independently, with the codec specified in the config file (codec specified by data type). After a column is compressed, the end byte of that column is recorded in the block’s header. Finally, the block header is compressed with the zlib codec. A compressed block constitutes a compressed block header, and a list of compressed columns whose length is the same as the number of columns in the file.
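A minimal sketch of this per-column scheme is given below, using only Python's standard-library codecs. The codec-to-type mapping mirrors the ensemble-xzb description later in the text (int = xz, floating point = zlib, string = bz2); the function names, the tab-joined column serialization, and the comma-separated block header are assumptions made for illustration and are not STABIX's actual on-disk format.

```python
import bz2
import lzma
import zlib

# One (compress, decompress) pair per data type; lzma's default container is xz.
CODECS = {
    "int": (lzma.compress, lzma.decompress),
    "float": (zlib.compress, zlib.decompress),
    "string": (bz2.compress, bz2.decompress),
}


def compress_block(rows, col_types):
    """rows: list of records (lists of str). Returns (block_header, columns)."""
    columns = list(zip(*rows))  # transpose rows into columns
    compressed, end_bytes, offset = [], [], 0
    for values, col_type in zip(columns, col_types):
        payload = CODECS[col_type][0]("\t".join(values).encode())
        offset += len(payload)
        end_bytes.append(offset)  # end byte of this column within the block
        compressed.append(payload)
    # The block header (column end bytes) is itself compressed with zlib.
    block_header = zlib.compress(",".join(map(str, end_bytes)).encode())
    return block_header, compressed


def decompress_column(compressed, col_types, idx):
    """Decompress a single column without touching the others."""
    raw = CODECS[col_types[idx]][1](compressed[idx])
    return raw.decode().split("\t")


rows = [["1", "12345", "0.5", "rs1"], ["1", "12400", "7.9", "rs2"]]
types = ["int", "int", "float", "string"]
header, cols = compress_block(rows, types)
print(decompress_column(cols, types, 2))
```

Because each column is an independent compressed payload, a reader can decompress one column of interest and leave the rest of the block untouched.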
2.2.4 Writing compressed data
After each block has been independently compressed, we record the end bytes of the compressed block. The end bytes of each compressed block header and the end bytes of the compressed block are stored in the overall compressed file header as described above. The file header is also compressed with the zlib codec, and the size of this compressed data is stored in the first four bytes of the compressed file. After the first four bytes, the compressed file header is written, then the compressed header of the first block, then the first compressed block, then the second, and so forth.
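The layout described here can be sketched as follows. The four-byte size prefix and the ordering of header and blocks follow the text; everything else is an assumption made for the example, including the little-endian byte order, the JSON header encoding, the field names, offsets measured from the start of the data section, and the hypothetical function names `pack_file` and `unpack_header`. The header codec is passed in as a callable, with zlib used only as the sketch's default.

```python
import json
import struct
import zlib


def pack_file(blocks, column_names, compress=zlib.compress):
    """blocks: list of (compressed_block_header, compressed_block_body) bytes."""
    header_ends, block_ends, offset = [], [], 0
    for block_header, block_body in blocks:
        offset += len(block_header)
        header_ends.append(offset)  # end byte of the block header
        offset += len(block_body)
        block_ends.append(offset)   # end byte of the block
    # End bytes are only known now, so the file header is built last.
    file_header = json.dumps({
        "num_columns": len(column_names),
        "num_blocks": len(blocks),
        "column_names": column_names,
        "block_header_end_bytes": header_ends,
        "block_end_bytes": block_ends,
    }).encode()
    compressed_header = compress(file_header)
    body = b"".join(h + b for h, b in blocks)
    # First four bytes: size of the compressed file header.
    return struct.pack("<I", len(compressed_header)) + compressed_header + body


def unpack_header(data, decompress=zlib.decompress):
    """Return (file header dict, offset where block data starts)."""
    (size,) = struct.unpack("<I", data[:4])
    header = json.loads(decompress(data[4:4 + size]))
    return header, 4 + size


blocks = [(b"hdr0", b"body0body"), (b"h1", b"b1")]  # placeholder payloads
data = pack_file(blocks, ["chrm", "bp"])
file_header, data_start = unpack_header(data)
print(file_header["block_end_bytes"], data_start)
```

With the end bytes in hand, a reader can seek directly to any block without scanning the preceding ones.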
2.2.5 Genomic index
The genomic index is created to make searching for genomic coordinates (e.g. return all records from chromosome 1 base pairs 12345–56789) more efficient. Because the input GWAS files must be provided in sorted order (i.e. by increasing chromosome and base pair coordinates), the genomic index is created concurrently with the compression step. As a block is being compressed, the genomic index records the block’s index, the chromosome at the start of the block, the base pair at the start of the block, the line number at the start of the block, and the block’s byte offset. Every block is compressed independently, and the genomic index is written after the last step of the compression process.
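As a rough illustration, a genomic index of this kind and a logarithmic-time lookup over it might look like the sketch below. It assumes fixed-size blocks, omits the byte offset, and ranks chromosomes by their order of appearance in the sorted file; the names `build_genomic_index` and `candidate_blocks` are hypothetical. Because only the start of each block is stored, the lookup is conservative and may return a block that turns out to hold no matching record.

```python
from bisect import bisect_left, bisect_right


def build_genomic_index(records, block_size):
    """records: position-sorted list of (chrom, bp) tuples, one per file line."""
    entries, chrom_rank = [], {}
    for line_no, (chrom, bp) in enumerate(records):
        # Rank chromosomes by order of appearance, which is the file's sort order.
        chrom_rank.setdefault(chrom, len(chrom_rank))
        if line_no % block_size == 0:
            entries.append({"block": line_no // block_size, "chrom": chrom,
                            "bp": bp, "line": line_no})
    return entries, chrom_rank


def candidate_blocks(entries, chrom_rank, chrom, start, end):
    """Return indices of blocks that may hold records in chrom:start-end."""
    if chrom not in chrom_rank:
        return []
    keys = [(chrom_rank[e["chrom"]], e["bp"]) for e in entries]
    rank = chrom_rank[chrom]
    # The last block starting strictly before the query start may still overlap it.
    first = max(bisect_left(keys, (rank, start)) - 1, 0)
    # The last block starting at or before the query end.
    last = bisect_right(keys, (rank, end)) - 1
    return list(range(first, last + 1))


# Toy data: 10 records on chr1 and 5 on chr2, four records per block.
records = [("chr1", bp) for bp in range(100, 1100, 100)]
records += [("chr2", bp) for bp in range(100, 600, 100)]
entries, chrom_rank = build_genomic_index(records, block_size=4)
print(candidate_blocks(entries, chrom_rank, "chr2", 100, 250))
```

Note that the third block in the toy data starts on chr1 and ends on chr2, which is why the lookup works on (chromosome rank, base pair) keys rather than per-chromosome lists.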
2.2.6 Statistical index
The floating point-based, binning index is created to make searching within a threshold of some query statistics (e.g. all records with p-value at or below 5e-8) more efficient. In this case, the input file is not sorted by this statistic, so the binning index is not created concurrently with compression. First, the user specifies bin boundaries and a query threshold in the configuration file. Because the UKBB p-values are stored as -log_10(p-value) to avoid underflow, STABIX creates 3 default bins: less than 0.3, between 0.3 and 4.3, and greater than 7.29; and a default query threshold greater than or equal to 7.3 (−log_10(5e-08) ∼ 7.3). While our results and defaults demonstrate utility for p-value statistical indexing, STABIX can create a similar index for any statistical column of interest, with any user-defined bins, and query thresholds (e.g. less rigid p-values). At index time, each bin is assigned all record IDs that fall within the bin’s range based on the record’s corresponding value in the specified column (e.g. p-value). All records are guaranteed a bin ID unless the column-of-interest value is unavailable, regardless of the bin thresholds. This operation is linear in the number of records. The binning index is serialized uncompressed. At query time, bins that overlap the query threshold are identified conservatively and their included record IDs are returned in aggregate. Extraneous record IDs can occur due to the coarse nature of the binning process and are filtered out before being presented to the user.
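The binning and conservative query steps can be sketched as follows. The bin edges and values are illustrative rather than STABIX's defaults, the values are assumed to be -log_10(p-value) so that a larger value is more significant, and the function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_bin_index(values, edges):
    """Map bin ID -> record IDs; bin i covers [edges[i-1], edges[i])."""
    index = defaultdict(list)
    for record_id, value in enumerate(values):
        if value is None:  # unavailable values receive no bin
            continue
        index[bisect_right(edges, value)].append(record_id)
    return dict(index)


def query_at_least(index, values, edges, threshold):
    """Return record IDs with value >= threshold, using the bins as a coarse filter."""
    first_bin = bisect_right(edges, threshold)  # lowest bin that can contain a hit
    candidates = [rid for b, rids in index.items() if b >= first_bin for rid in rids]
    # Bins are coarse, so drop extraneous candidates with an exact check.
    return sorted(rid for rid in candidates if values[rid] >= threshold)


values = [0.1, 5.0, 7.3, None, 9.2, 4.3, 7.0]  # toy -log10(p) column
edges = [0.3, 4.3, 7.0]                         # illustrative bin boundaries
index = build_bin_index(values, edges)
print(index)
print(query_at_least(index, values, edges, threshold=7.3))
```

In this toy column, the record with value 7.0 shares a bin with the true hits and is returned as a candidate, then removed by the exact comparison, which is the filtering of extraneous record IDs described above.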
2.3 The STABIX query
STABIX first uses the statistical index to determine a set of blocks which contain records that satisfy the statistical threshold (e.g. p-value <= 5e-08). Then, for each query (e.g. gene), STABIX uses the genomic index to determine which blocks fall within the query boundaries (e.g. chrm1:12345–67890). During this step, STABIX searches through a map of base pair positions for each chromosome in O(log n) time to quickly return matching blocks. STABIX takes the intersection of these two sets (statistical hits ∩ genomic hits) to determine a final set of blocks to decompress. For each block in this set, STABIX first decompresses only the column specified by the statistical index (e.g. p-values) and identifies which specific indices in this column satisfy the statistical threshold (STABIX has already determined that this block contains at least one record which satisfies the statistical threshold and now needs to determine the specific set of records). Once these indices are determined, STABIX decompresses only the chromosome and base pair columns to determine whether each statistical hit falls within the gene (STABIX has already determined that the gene is contained in this block but must determine if each record is within the specific boundaries of the gene). If at any point during these checks a block no longer contains a hit (e.g. a p-value hit is just outside the boundary of the query), STABIX moves to the next block. Next, STABIX decompresses the remaining blocks back into the column-centric readable data that were generated in the column-specific compression step above, allocates space for only the set of records which meet both criteria (statistical and genomic hit), and transposes only those matrix locations to return.
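The staged filtering in this query procedure can be sketched as below. The sketch assumes that the two block sets have already been obtained from the indices, that each block is a mapping from column name to a zlib-compressed, tab-joined column, and that the statistic is stored as -log_10(p-value). The column names and the helpers `pack`, `unpack`, and `query` are hypothetical.

```python
import zlib


def pack(values):
    """Compress one column (toy stand-in for a per-column codec)."""
    return zlib.compress("\t".join(map(str, values)).encode())


def unpack(blob):
    return zlib.decompress(blob).decode().split("\t")


def query(blocks, stat_blocks, genomic_blocks, chrom, start, end, threshold):
    hits = []
    # Only blocks flagged by both indices are opened at all.
    for block_id in sorted(stat_blocks & genomic_blocks):
        block = blocks[block_id]
        # Stage 1: decompress only the statistic column.
        stats = [float(x) for x in unpack(block["neglog10_p"])]
        rows = [i for i, s in enumerate(stats) if s >= threshold]
        if not rows:
            continue
        # Stage 2: decompress only chromosome and base pair columns.
        chroms = unpack(block["chrom"])
        bps = [int(x) for x in unpack(block["bp"])]
        rows = [i for i in rows if chroms[i] == chrom and start <= bps[i] <= end]
        if not rows:
            continue
        # Stage 3: decompress the remaining columns and emit only matching rows.
        cols = {name: unpack(blob) for name, blob in block.items()}
        hits.extend({name: cols[name][i] for name in cols} for i in rows)
    return hits


blocks = {
    0: {"chrom": pack(["1", "1"]), "bp": pack([100, 200]),
        "neglog10_p": pack([1.0, 8.0]), "rsid": pack(["a", "b"])},
    1: {"chrom": pack(["1", "1"]), "bp": pack([300, 400]),
        "neglog10_p": pack([9.0, 0.5]), "rsid": pack(["c", "d"])},
}
result = query(blocks, stat_blocks={0, 1}, genomic_blocks={1},
               chrom="1", start=250, end=400, threshold=7.3)
print(result)
```

The boundary check is inclusive at both ends in this sketch; block 0 is never opened because it is absent from the genomic set, even though it contains a significant record.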
2.4 Experimentation
For a single file, we provide a bed file of 19,181 protein-coding genes and a p-value threshold <= 5e-08 (7.3 after -log_10(p-value) transformation by PanUKBB). We wrote a Python script with the pysam library to perform Tabix queries for each gene in the bed file adding a statistical check to compute p-value criteria. We perform STABIX compression, indexing, and decompression according to the appropriate configuration; and check that Tabix and STABIX output match. In the case of four genes across 10 files (TNIP2, SHPRH, MCCD1, and BNIP3L), STABIX reports 1 more record than Tabix, where STABIX is inclusive with a base pair boundary and Tabix is not. We compute STABIX average speedup over Tabix by computing [Tabix_time/STABIX_time] for each gene and taking the average over these measurements for a single file.
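The two summary measures used throughout the results, the average speedup and the percentage of per-gene wins, can be computed as in the sketch below. The timing values are made up for illustration and the function name `summarize` is hypothetical.

```python
from statistics import mean


def summarize(tabix_times, stabix_times):
    """Return (average per-gene speedup, percent of genes where STABIX is faster)."""
    ratios = [t / s for t, s in zip(tabix_times, stabix_times)]
    pct_wins = 100 * sum(r > 1 for r in ratios) / len(ratios)
    return mean(ratios), pct_wins


# Made-up per-gene timings in seconds (not measured data).
tabix_times = [4.0, 1.0, 9.0]
stabix_times = [2.0, 2.0, 1.0]
avg_speedup, pct_wins = summarize(tabix_times, stabix_times)
print(f"average speedup: {avg_speedup:.2f}x, STABIX wins: {pct_wins:.1f}%")
```

Because the average is taken over per-gene ratios, a few genes with very large speedups can dominate it, so the two measures are best read together.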
3 Results
3.1 Finding genome-wide significant SNVs
STABIX’s unique summary-statistic index offers efficient access to SNVs based on the strength of their association with the target trait. By only inflating blocks that contain an SNV with an association above the user-defined threshold, STABIX avoids a substantial amount of wasted work. Since the Tabix index stores no information about the underlying summary statistic data, Tabix must inflate all blocks first before passing the output to a second method that parses and tests the association values. To measure this improvement, we selected 10 PanUKBB GWAS files and queried 19,181 protein-coding genes for SNVs with p-values at or below 5e-08 using default parameters. STABIX was faster than its equivalent Tabix workflow for over 99% of gene queries (Fig. 2A; see Supplementary Table 1 for codec ensemble descriptions). As expected, the speedups depend on whether the target genes had significant SNVs (Fig. 2C). For the 88.6% of genes that contain zero significant SNVs, STABIX was 7.7× faster than Tabix. For the minority of genes with significant SNVs, STABIX was 1.75× faster. We ran the same experiment for 100 PanUKBB GWAS files with continuous traits and found that STABIX was faster than its equivalent Tabix workflow for 99.15% of gene queries, with an average speedup of 5.41×. STABIX was faster than Tabix for 93.31% of genes with significant SNVs, with an average speedup of 1.70× over Tabix (see Supplementary Fig. 1). While bgzip/Tabix is the most widely used format and index standard for GWAS data, other formats, such as HDF5 (http://www.hdfgroup.org/HDF5/doc/index.html) and SQLite (https://www.sqlite.org/), can also be used. Although both formats require about four times more storage than STABIX (see Supplementary Fig. 2), and STABIX outperforms HDF5 in 97.4% of gene queries (see Supplementary Fig. 3), SQLite is almost always faster than STABIX (see Supplementary Fig. 4). However, the average difference in query time between STABIX and SQLite is only 0.006 s.
Figure 2.
STABIX vs Tabix performance for 10 PanUKBB GWAS per-phenotype files. A Significant SNV query speed by gene for Tabix and STABIX with hexagonal binning. B Uncompressed, bgzip/Tabix compressed, and STABIX compressed file sizes. Bgzip/Tabix compressed file sizes include the bgzip compressed file (bgz) and Tabix index file (tbi). STABIX file sizes include the STABIX compressed file, the genomic index (gidx) and the statistical index (sidx) files. C Speedup by gene for significance SNV queries for STABIX over Tabix + filtering step for all genes (top) and only genes with significant hits (bottom). A vertical dashed line (orange) is drawn at x = 1 to separate where Tabix wins (left of line) and STABIX wins (right of line). All figures report data for 10 per-phenotype files from PanUKBB, STABIX compression using default block size 2000 and codec ensemble xzb.
To better understand which features of each gene impacted the STABIX and Tabix query time, we investigate various components of the query. Figure 3A demonstrates that while there is a slight increase in decompression speed for both STABIX and Tabix queries as the uncompressed data grows larger, file size is not a limiting factor for decompression time, nor an advantage to either method. Figure 3B shows that the number of STABIX blocks also does not account for the difference in STABIX and Tabix decompression times. While STABIX and Tabix design blocks with different approaches, for both, using more blocks typically aligns with slower decompression times. Figure 3C shows that STABIX is significantly faster than Tabix when there are fewer significant p-value hits found in a block. This is because STABIX avoids unnecessary decompression when there are no p-value hits found. Likewise, Fig. 3D shows that STABIX offers significant speedup over Tabix in cases where there are few significant SNVs to return for a genomic query. To explore this phenomenon further, we select a highly heritable and polygenic trait [standing height (Silventoinen et al. 2003)] and plot comparisons for genes with no hits, p-value hits, and p-value hits within a genomic query in Supplementary Fig. 6A, B and C, respectively.
Figure 3.
STABIX vs Tabix decompression speed by gene with coloring by query information. Decompression speed by gene colored by A uncompressed file size, B number of blocks decompressed with STABIX, C number of p-value hits found during search, D number of significant SNVs returned in query. The frequency of genes plotted at each y-location is plotted in a log-axis histogram to the right of each row. Histograms for STABIX and Tabix decompression times are shown on the right and top of grids, respectively.
3.2 Reducing the GWAS file storage burden
In addition to enabling the indexing of statistical columns of interest, the STABIX column-based strategy provides a mechanism to enhance compression beyond what bgzip offers by employing additional codecs specifically optimized for various data types. This approach allows for more efficient storage and retrieval, as each column is compressed using methods best suited to its unique composition. On average, STABIX decreased file sizes by an additional 4% of the uncompressed size compared to bgzip (Fig. 2B, Table 1). To put that into perspective, if the PanUKBB uncompressed GWAS files required 77 TB, then bgzip would reduce the burden to about 20 TB, and STABIX would reduce it to 17 TB. That 3 TB savings would come in addition to the retrieval speedups discussed above. More detailed information regarding Table 1 can be found in our supplementary material.
Table 1.
Performance of STABIX. Performance measurements and basic statistics for 10 per-phenotype files from PanUKBB. STABIX compression using default parameters. For the column reporting the percent of STABIX wins (% STABIX wins), we note that a win is awarded when STABIX’s query time per gene is faster than Tabix’s. The percent (%) win is calculated as the average number of wins over 19,181 genes in the 10 files. Compression ratios include index files when available.
| trait | Shellfish intake | Sickle cell anemia | Psoriasis | Size of red wine glass drunk | Standing height | Monocyte count | Type 2 diabetes | Diabetes mellitus | Snoring | Smoking status |
|---|---|---|---|---|---|---|---|---|---|---|
| no. sig genes | 977 | 42 | 143 | 0 | 5776 | 1803 | 141 | 156 | 59 | 168 |
| no. sig SNVs | 2189 | 392 | 2232 | 0 | 165522 | 41759 | 3514 | 4039 | 2120 | 5004 |
| Avg. STABIX speedup (×) | 5.095 | 6.549 | 7.1693 | 7.1773 | 4.2627 | 6.7368 | 8.3546 | 8.1861 | 8.4399 | 8.2152 |
| % STABIX wins | 98.4412 | 99.9791 | 99.9583 | 100.0 | 99.2649 | 98.4777 | 99.9009 | 99.8853 | 99.9583 | 99.9531 |
| Uncompressed (GB) | 1.2091 | 1.8225 | 4.1587 | 4.17 | 8.0126 | 8.7106 | 9.3567 | 9.3676 | 10.4422 | 10.6391 |
| bgzip (GB) | 0.2715 | 0.5107 | 1.0985 | 1.0982 | 2.0483 | 2.2893 | 2.4672 | 2.4732 | 2.7342 | 2.7522 |
| Tabix index (MB) | 1.6885 | 1.7465 | 1.9274 | 1.9279 | 2.1941 | 2.2362 | 2.2698 | 2.2704 | 2.3161 | 2.3221 |
| STABIX (GB) | 0.1838 | 0.4277 | 0.8771 | 0.8804 | 1.8838 | 1.9746 | 2.2061 | 2.2118 | 2.4149 | 2.4858 |
| STABIX genomic index (MB) | 0.506 | 0.511 | 0.513 | 0.513 | 0.5207 | 0.5211 | 0.522 | 0.522 | 0.5226 | 0.5228 |
| STABIX statistical index (MB) | 0.1981 | 0.1534 | 0.1553 | 0.1537 | 0.2522 | 0.2024 | 0.1605 | 0.1604 | 0.1613 | 0.1698 |
| Compression ratio: raw text / STABIX | 6.5534 | 4.2542 | 4.7379 | 4.7328 | 4.2516 | 4.4097 | 4.2401 | 4.234 | 4.3229 | 4.2787 |
| Compression ratio: bgzip+tabix / STABIX | 1.4805 | 1.1963 | 1.2536 | 1.2486 | 1.088 | 1.1601 | 1.1191 | 1.1189 | 1.1329 | 1.1078 |
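
The two compression ratios in Table 1 can be approximately recomputed from the size rows, as in the sketch below for the shellfish intake column. It assumes that index files are added to the compressed sizes, as stated in the table caption, and that 1 GB = 1000 MB; the conversion factor and the function name `compression_ratios` are assumptions. Because the table values are rounded, agreement is only expected to a few decimal places.

```python
def compression_ratios(uncompressed_gb, bgzip_gb, tbi_mb,
                       stabix_gb, gidx_mb, sidx_mb, mb_per_gb=1000):
    """Return (raw text / STABIX, bgzip+tabix / STABIX), indexes included."""
    stabix_total = stabix_gb + (gidx_mb + sidx_mb) / mb_per_gb
    bgzip_total = bgzip_gb + tbi_mb / mb_per_gb
    return uncompressed_gb / stabix_total, bgzip_total / stabix_total


# Shellfish intake column of Table 1.
raw_ratio, bgzip_ratio = compression_ratios(
    uncompressed_gb=1.2091, bgzip_gb=0.2715, tbi_mb=1.6885,
    stabix_gb=0.1838, gidx_mb=0.506, sidx_mb=0.1981)
print(f"raw text / STABIX: {raw_ratio:.3f}")
print(f"bgzip+tabix / STABIX: {bgzip_ratio:.3f}")
```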
3.3 Selecting default parameters for STABIX
STABIX splits data into blocks and then performs compression by column, allowing for different columns (i.e. different data types) to be compressed with different codecs. We evaluate a set of six popular codecs on three data types (i.e. integer, floating point, and string data). Selecting a single file (GWAS trait: shellfish intake), we perform STABIX compression, indexing, and decompression for each codec ensemble described in Supplementary Table 1 and block sizes 1000, 2000, 5000, 10000, and 1 centimorgan (cM). We query the set of protein-coding genes, record decompression times and compressed data sizes for each column, and show results in Fig. 4.
Figure 4.
Codec performance by data type. Decompression time (log scale) vs. compressed size of a single column. Block sizes 1000, 2000, 5000, 10000, and cM-based blocks are plotted in descending rows, respectively. Column data types integer, float, and string are plotted in columns left to right, respectively. We note that the fpfVB codec only functions on integer data.
Noting patterns of quickest decompression and smallest size of compressed data, we select two ensemble codec configurations with which to run complete STABIX compression, indexing, and decompression for the same file (GWAS trait=shellfish intake) and show a subset of results from these experiments in Fig. 5. We include all 40 combinations of block size and codec ensembles from this experiment in Supplementary Fig. 5.
Figure 5.
Performance of varying block sizes and codec ensembles for a single file. STABIX vs. Tabix gene-based query times for PanUKBB GWAS file trait: shellfish intake. Times are shown for four codec configurations from left to right: bz2, xz, zlib, and ensemble-xzb; and across three block sizes from top to bottom: 1000, 2000, and 10000. Bgzip + Tabix: STABIX + genomic index + statistical index compression ratios, STABIX average speedup, and % STABIX wins are reported for each of the twelve configurations.
From these results, we report the overall best performing configurations for compression ratio, average speedup, and percent (%) of queries with STABIX speedup over Tabix (% STABIX wins) in Table 2. In an effort to maximize compression ratio, average speedup, and % STABIX wins, we select block size 2000 and codec ensemble ensemble-xzb (int = xz, floating point = zlib, string = bz2) as the default configuration. Alongside these observations, we note the following three things. First, ensemble-xbb (int = xz, floating point = bz2, string = bz2) has comparable results depending on file size (see Supplementary Table 2). Second, while block size 10000 with codec ensemble-xzb achieves the best compression ratio and average speedup, it performs poorly for % STABIX wins (where Tabix beats STABIX for more genes). In this case, STABIX can achieve a very high speedup for a few genes, which makes STABIX faster than Tabix on average, but for the majority of genes, Tabix is faster. Finally, we recall that a user has the option to select their own block sizes and codecs depending on their data and desired result.
Table 2.
Best performing codec ensemble and default parameters. For a single file, the best performance for compression ratio, average STABIX speedup over Tabix, and % STABIX wins. The winning values are highlighted in bold, and the default configuration results are italicized.
| Codec(s) | Block size | Compression ratio | Avg. speedup | % STABIX wins |
|---|---|---|---|---|
| ensemble-xzb | 10000 | **1.6401** | **13.0046** | 39.1742 |
| zlib | 1000 | 1.0055 | 2.8836 | **99.0094** |
| *ensemble-xzb* | *2000* | *1.4805* | *5.0950* | *98.4412* |
4 Discussion
STABIX is a novel compression and indexing framework that builds upon the widely used combination of bgzip and Tabix by incorporating data-type-dependent compression codecs and built-in indexing for statistical-based queries. Both features allow for STABIX to improve upon decompression speed (Fig. 2A) and compression ratio (Fig. 2B) as compared to bgzip and Tabix. With this in mind, we emphasize that STABIX’s addition of the statistical index provides its most significant advantage over Tabix as it allows for sophisticated searching beyond block-based decompression. To improve usability, we have also provided Python bindings.
4.1 Unique features of STABIX
Column-based compression offers customized compression for different data types, optimizing compressed file sizes and query speeds over a universal codec for all data. As shown in Fig. 4, different codec methods offer better or worse performance over different data types. Furthermore, column-based compression allows for sophisticated filtering of records during decompression. Unlike Tabix, which must decompress all columns to check if a hit occurs (e.g., a p-value in some range), STABIX only needs to decompress a single column, reducing the time spent on data filtering and genomic position-based boundary checking. This feature allows STABIX to improve on Tabix significantly when there are no p-value hits present in a block, or when the p-value hits do not fall within the boundaries of a query. In these cases, STABIX can exit early and avoid unnecessary decompression, while Tabix queries necessarily still perform decompression before filtering (e.g. see GWAS file trait: size of red wine glass drunk). This phenomenon is demonstrated in Fig. 3C, where STABIX decompression times for few p-value hits (e.g. 0–2) are much shorter than the same queries for Tabix. Figure 3D shows a similar phenomenon: even when p-value hits are found in a block, if they occur outside of the genomic position query boundaries, STABIX can exit during the second filtering step and avoid full block decompression, while Tabix must decompress the entire block before p-value filtering.
STABIX’s configuration file exposes tunable parameters (e.g. block size and codec ensembles) to meet varying needs for different tasks (e.g. extreme compression for small file sizes or quicker decompression for frequent or numerous queries). We provide baseline performance for 40 configurations (5 block sizes combined with 8 codec ensembles) in Fig. 5, Table 2, and Supplementary Fig. 1. While we report these baseline performances for a single file and a p-value-based statistical index, we expect similar results for files of varying size and for other columns of interest on which one might want to construct the index. Depending on the expected number of hits, the number of columns in the file, the data types of those columns, and the type of queries, different configurations can optimize for different performance goals. For example, on the same 10 files, STABIX ensemble-xzb achieves a compression ratio of 1.13, while STABIX ensemble-xbb achieves a compression ratio of 1.4 (see Supplementary Table 2).
Finally, we restate that STABIX’s statistical index is not designed solely for p-value queries on tab-delimited files. While we demonstrate results with a STABIX query over genes/SNVs with significant p-values with tab-delimited input, the STABIX statistical index can be custom-generated for any association metric included in a GWAS file (e.g. allele frequency, effect sizes, z-scores, etc). Additionally, STABIX detects the delimiter upon reading the input file and can detect delimiters for tab, comma, or white-space separated files. We discuss the generalizability of this index in our current limitations and future work section below.
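As an illustration of how delimiter detection of this kind could work, the following Python sketch checks which candidate delimiter splits a sample of lines into a consistent number of fields. The function name and the detection heuristic are assumptions for illustration and are not taken from the STABIX source.

```python
def detect_delimiter(lines):
    """Guess whether a sample of lines is tab, comma, or whitespace separated.

    A candidate is accepted if it splits every non-empty sample line into the
    same number of fields and that number is greater than one. Candidates are
    tried from most to least specific.
    """
    sample = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if not sample:
        raise ValueError("no non-empty lines to inspect")

    candidates = (
        ("tab", lambda s: s.split("\t")),
        ("comma", lambda s: s.split(",")),
        ("whitespace", lambda s: s.split()),
    )
    for name, split in candidates:
        field_counts = {len(split(ln)) for ln in sample}
        if len(field_counts) == 1 and next(iter(field_counts)) > 1:
            return name
    raise ValueError("could not detect a consistent delimiter")
```

Checking tab before comma and whitespace means that a tab-delimited file whose fields contain commas or spaces is still classified as tab-delimited.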
4.2 Advantages and applications of STABIX in downstream analysis
Since GWAS files correspond to a single trait and often include statistics on millions of sites, most downstream analyses begin by identifying subsets of loci relevant to specific traits. To do this, STABIX organizes GWAS files into position-sorted blocks and then uses a position index to map genomic ranges (e.g. genes) to specific blocks. Additionally, STABIX optionally generates a statistical index to identify blocks containing at least one SNP within a given statistical significance range (e.g. p-value at or below 5e-8). The combination of position and statistical indices can improve the data collection step in many GWAS summary-statistic-based methods. For example, in polygenic risk prediction, the commonly used prune-and-threshold algorithm (e.g. Choi and O'Reilly 2019) iteratively selects the most significant SNP and excludes others in linkage disequilibrium (LD) with it. STABIX optimizes this process by first using its position index to identify unexcluded blocks, then applying the significance index to locate blocks containing significant SNPs, and finally leveraging the position index again to exclude blocks in LD with the selected SNP. Using just its indices, STABIX can also quickly determine whether a trait has a significant SNP in a gene: the position index identifies the relevant blocks, and the significance index tests whether those blocks contain SNPs that meet the significance threshold.
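The way a position index and a statistical index can be combined to select candidate blocks without decompressing any data can be sketched as follows. This is a simplified illustration under stated assumptions: the `Block` summary, the per-block minimum p-value, and the function names are hypothetical and do not reflect the on-disk index format used by STABIX.

```python
from dataclasses import dataclass


@dataclass
class Block:
    chrom: str
    start: int    # smallest position in the block
    end: int      # largest position in the block
    min_p: float  # smallest p-value in the block (assumed summary statistic)


def summarize(records):
    """Summarize one block of position-sorted (chrom, pos, pval) records."""
    return Block(records[0][0], records[0][1], records[-1][1],
                 min(r[2] for r in records))


def build_index(records, block_size):
    """Group position-sorted records into blocks that never span chromosomes."""
    blocks, current = [], []
    for rec in records:
        if current and (len(current) == block_size or rec[0] != current[0][0]):
            blocks.append(summarize(current))
            current = []
        current.append(rec)
    if current:
        blocks.append(summarize(current))
    return blocks


def candidate_blocks(blocks, chrom, start, end, max_p):
    """Ids of blocks that overlap the region and hold at least one p <= max_p."""
    return [
        i for i, b in enumerate(blocks)
        if b.chrom == chrom and b.start <= end and b.end >= start  # position index
        and b.min_p <= max_p                                       # statistical index
    ]
```

An empty result from `candidate_blocks` answers the question of whether a gene contains a significant SNP for a trait using the indices alone; a non-empty result lists the only blocks that would need to be decompressed.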
STABIX can also be effective in pleiotropy discovery, where the goal is to identify traits with significant SNPs in a target gene, or for efficient generation of phenotype-wide association study (PheWAS) data for a single variant or gene (Bastarache et al. 2022). Visualizing locus-specific or gene-centered associations, using tools like LocusZoom (Pruim et al. 2010), can be greatly accelerated with the use of the efficient indices in STABIX. Further, these indices can speed up the identification of causal variants or genes via fine mapping (Kichaev et al. 2017) or TWAS (Lu et al. 2022), or across traits using colocalization (e.g. Giambartolomei et al. 2014).
4.3 Current limitations and future work
We have demonstrated improvement over Tabix in compressed file size and query speed with our initial release of STABIX. Compression time, as expected, is longer than that of bgzip. Like bgzip, STABIX uses block compression, but it also isolates and rotates the columns within each block before compression, resulting in a process that is approximately eight times slower (40 minutes for STABIX vs. 5 minutes for bgzip). Indexing is similar: in addition to the position index created by Tabix, STABIX generates a statistical index, which extends the indexing time by a factor of eight (4 minutes for STABIX vs. 30 seconds for Tabix). Much of the statistical indexing algorithm (e.g. binning boundaries, statistical value threshold, etc) is determined directly by the user's configuration file. While default parameters have been defined, we believe that a more thorough analysis of default parameters for many types of indices (i.e. more than just p-value) would make STABIX’s statistical index more broadly usable. Furthermore, we note that STABIX query times beat those of Tabix most often when there are few query values to return (i.e. few SNVs with p-value at or below 5e-8), while Tabix query times are quicker when there are many values to return (e.g. many blocks to decompress). Future implementations of STABIX should evaluate specific test cases across many statistical index types to find ways of competing with Tabix even as the number of returnable SNVs increases.
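To make the role of user-defined binning boundaries concrete, the following Python sketch builds a binned statistical index and queries it. The boundary values, function names, and index layout are hypothetical and chosen only for illustration; they are not the STABIX defaults or its actual index structure.

```python
from bisect import bisect_left


def build_bin_index(block_values, boundaries):
    """Map each bin to the set of blocks holding at least one value in that bin.

    `block_values` has one list of statistic values per block. `boundaries` is
    a sorted list of bin edges (a hypothetical configuration parameter).
    Bin k covers values v with boundaries[k-1] < v <= boundaries[k]; the last
    bin covers values above the largest boundary.
    """
    index = [set() for _ in range(len(boundaries) + 1)]
    for block_id, values in enumerate(block_values):
        for v in values:
            index[bisect_left(boundaries, v)].add(block_id)
    return index


def blocks_at_or_below(index, boundaries, threshold):
    """Blocks that may contain a value <= threshold.

    The answer is exact when `threshold` equals one of the boundaries;
    otherwise it is a superset that still needs per-record filtering.
    """
    last_bin = bisect_left(boundaries, threshold)
    hits = set()
    for bin_blocks in index[: last_bin + 1]:
        hits |= bin_blocks
    return sorted(hits)
```

The sketch shows why the choice of boundaries matters: a query threshold that coincides with a bin edge can be answered exactly from the index, whereas a threshold that falls inside a bin returns additional candidate blocks that must be decompressed and filtered.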
Contributor Information
Kristen Schneider, Department of Computer Science, University of Colorado, Boulder, Boulder, 80309 CO, United States; BioFrontiers Institute, University of Colorado, Boulder, Boulder, 80303 CO, United States.
Simon Walker, Department of Computer Science, University of Colorado, Boulder, Boulder, 80309 CO, United States.
Chris Gignoux, Colorado Center for Personalized Medicine, University of Colorado, Anschutz Medical Campus, Aurora, 80045 CO, United States; Department of Biomedical Informatics, University of Colorado, Anschutz Medical Campus, Aurora, 80045 CO, United States.
Ryan Layer, Department of Computer Science, University of Colorado, Boulder, Boulder, 80309 CO, United States; BioFrontiers Institute, University of Colorado, Boulder, Boulder, 80303 CO, United States; Department of Biomedical Informatics, University of Colorado, Anschutz Medical Campus, Aurora, 80045 CO, United States.
Author contributions
Kristen Schneider (Conceptualization [equal], Formal analysis [lead], Investigation [lead], Methodology [lead], Software [lead], Visualization [lead], Writing – original draft [lead], Writing – review & editing [lead]), Simon Walker (Software [supporting]), Chris R Gignoux (Investigation [supporting], Resources [supporting], Supervision [supporting], Writing – review & editing [supporting]), and Ryan M Layer (Conceptualization [equal], Funding acquisition [lead], Investigation [supporting], Methodology [supporting], Resources [lead], Software [supporting], Supervision [lead], Writing – review & editing [supporting]).
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest: RML is a cofounder of Codebreaker Tx.
Funding
This work has been supported by NIH/NHGRI R01HG011774 (RML).
Data availability
Data derived from a source in the public domain.
References
- Abdellaoui A, Yengo L, Verweij KJH et al. 15 Years of GWAS discovery: realizing the promise. Am J Hum Genet 2023;110:179–94.
- Bastarache L, Denny JC, Roden DM. Phenome-wide association studies. JAMA 2022;327:75–6.
- Buniello A, MacArthur JAL, Cerezo M et al. The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res 2019;47:D1005–12.
- Chang W-L, Yun X-C, Fang B-X et al. The Block LZSS Compression Algorithm. In: 2009 Data Compression Conference, 439–439. IEEE, 2009.
- Choi SW, O'Reilly PF. PRSice-2: polygenic risk score software for biobank-scale data. Gigascience 2019;8:giz082.
- de los Campos G, Gianola D, Allison DB. Predicting genetic predisposition in humans: the promise of whole-genome markers. Nat Rev Genet 2010;11:880–6.
- Giambartolomei C, Vukcevic D, Schadt EE et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet 2014;10:e1004383.
- Karczewski KJ, Gupta R, Kanai M et al. Pan-UK biobank GWAS improves discovery, analysis of genetic architecture, and resolution into ancestry-enriched effects. bioRxiv 2024; 10.1101/2024.03.13.24303864
- Kichaev G, Roytman M, Johnson R et al. Improved methods for multi-trait fine mapping of pleiotropic risk loci. Bioinformatics 2017;33:248–55.
- Kullo IJ, Lewis CM, Inouye M et al. Polygenic scores in biomedical research. Nat Rev Genet 2022;23:524–32.
- Li H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics 2011;27:718–9.
- Loos RJF. 15 Years of genome-wide association studies and no signs of slowing down. Nat Commun 2020;11:5900–3.
- Lu Z, Gopalan S, Yuan D et al. Multi-ancestry fine-mapping improves precision to identify causal genes in transcriptome-wide association studies. Am J Hum Genet 2022;109:1388–404.
- MacArthur J, Bowler E, Cerezo M et al. The new NHGRI-EBI catalog of published genome-wide association studies (GWAS catalog). Nucleic Acids Res 2017;45:D896–901.
- Nakamura Y. The BioBank Japan project. Clin Adv Hematol Oncol 2007;5:696–7.
- Pruim RJ, Welch RP, Sanna S et al. LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics 2010;26:2336–7.
- Silventoinen K, Sammalisto S, Perola M et al. Heritability of adult body height: a comparative study of twin cohorts in eight countries. Twin Res 2003;6:399–408.
- Sudlow C, Gallacher J, Allen N et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med 2015;12:e1001779.
- Tam V, Patel N, Turcotte M et al. Benefits and limitations of genome-wide association studies. Nat Rev Genet 2019;20:467–84.
- Uffelmann E, Huang QQ, Munung NS et al. Genome-wide association studies. Nat Rev Methods Primers 2021;1:1–21.
- Visscher PM, Brown MA, McCarthy MI et al. Five years of GWAS discovery. Am J Hum Genet 2012;90:7–24.
- Visscher PM, Wray NR, Zhang Q et al. 10 Years of GWAS discovery: biology, function, and translation. Am J Hum Genet 2017;101:5–22.
- Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007;447:661–78.
- Zhou W, Kanai M, Wu K-HH, UK Biobank et al. Global biobank meta-analysis initiative: powering genetic discovery across human disease. Cell Genom 2022;2:100192.





