Bioinformatics. 2023 Feb 16;39(2):btad092. doi: 10.1093/bioinformatics/btad092

LDmat: efficiently queryable compression of linkage disequilibrium matrices

Rockwell J Weiner, Chirag Lakhani, David A Knowles, Gamze Gürsoy
Editor: Christina Kendziorski
PMCID: PMC9969815  PMID: 36794924

Abstract

Motivation

Linkage disequilibrium (LD) matrices derived from large populations are widely used in population genetics, for example in fine-mapping, LD score regression, and linear mixed models for Genome-wide Association Studies (GWAS). However, these matrices can become very large when they are derived from millions of individuals; hence, moving, sharing and extracting granular information from this large amount of data can be cumbersome.

Results

We sought to address the need for compressing and easily querying large LD matrices by developing LDmat. LDmat is a standalone tool to compress large LD matrices in an HDF5 file format and query these compressed matrices. It can extract submatrices corresponding to a sub-region of the genome, a list of select loci, and loci within a minor allele frequency range. LDmat can also rebuild the original file formats from the compressed files.

Availability and implementation

LDmat is implemented in Python and can be installed on Unix systems with the command ‘pip install ldmat’. It can also be accessed through https://github.com/G2Lab/ldmat and https://pypi.org/project/ldmat/.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Linkage disequilibrium (LD) is a measure of how often alleles at different loci appear together in a population (Collins, 2007). There are several alternative methods of computing LD between two loci, all of which provide values between −1 and 1 (Kijas et al., 2014; Mueller, 2004; Myers et al., 2020). High LD values between two alleles correspond to them occurring together frequently in the population, and consecutive regions of the genome with high LD values are dubbed haplotype blocks or haploblocks (Gabriel et al., 2002). These haploblocks are known to be associated with hotspots of recombination (Slatkin, 2008). Along these lines, LD can provide powerful insights into population genetics, as these values are associated with natural selection, genetic drift, and other genome-altering events (Cutter, 2019; Ennis, 2007; Hudson, 2004). LD scores are used in fine-mapping, LD score regression, and linear mixed models for Genome-wide Association Studies (GWAS). Since LD values can be calculated between every pair of variants in a chromosome, matrices are the most natural representation. However, even for a small chromosome, the total number of distinct data points is on the order of 10^15. Typically, these values are only significantly non-zero for loci which are somewhat close to one another (i.e., in the same haploblock), so only nearby values may be calculated in practice. However, this only reduces the total number of data points by a few orders of magnitude, depending on the chosen genomic distance. For example, LD matrices calculated using the genotypes in the UK Biobank (UKBB) are publicly available in compressed numpy array format (Harris et al., 2020) and the total size of the data ranges from approximately 45 GB (chromosome 21) to 250 GB (chromosome 2), even with values given only for pairs of SNPs or indels that are within 3 MB of each other (Weissbrod et al., 2020).
Moving, sharing and extracting granular information from this large amount of data can be cumbersome. Compounding the problem, there is no standard file format for storing these LD matrices. This means that LD matrices for different cohorts often use different ad hoc formats [e.g. LDStore2 (Benner et al., 2017) or Hail] and custom downstream analysis tools are required. Both the large file size and the lack of standardization make it difficult to extract useful information from these files. Extracting a small sub-matrix from these large matrices requires access to machines capable of holding all of the data in memory, along with a bespoke script to find and query the appropriate file(s) containing the relevant information. In particular, the memory needed to read these large files and the file IO time become major bottlenecks in downstream analysis. This prevents scientists with scarce resources from accessing and working on LD matrices from large population genetics studies, hence hampering advances in biomedical research. To address these issues, we developed a user-friendly tool called LDmat that can effectively compress LD matrices at a compression rate of up to 90%. We also provide functionality that can query compressed LD matrices by desired loci and minor allele frequency (MAF) threshold and visualize the resulting sub-matrices. This tool is similar to TABIX (Li, 2011), but operates on data in matrix format.
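The size estimates above can be checked with a little arithmetic. The sketch below counts unordered pairs of positions on a chromosome, first for all pairs and then only for pairs within a maximum distance of each other. The chromosome length of roughly 46.7 million base pairs is an approximate figure for chromosome 21 assumed here for illustration, and the function names are hypothetical; neither is taken from LDmat.

```python
def all_pairs(n: int) -> int:
    """Number of unordered pairs among n positions."""
    return n * (n - 1) // 2


def banded_pairs(n: int, max_dist: int) -> int:
    """Number of unordered pairs (i, j) with 0 < j - i <= max_dist."""
    w = min(max_dist, n - 1)  # distances beyond n - 1 cannot occur
    # For each distance d in 1..w there are n - d pairs.
    return w * n - w * (w + 1) // 2


if __name__ == "__main__":
    n = 46_700_000  # assumed approximate length of chromosome 21 (base pairs)
    print(f"all pairs:          {all_pairs(n):.2e}")
    print(f"pairs within 3 MB:  {banded_pairs(n, 3_000_000):.2e}")
```

Both counts are over base-pair positions rather than observed variants, so they are upper bounds on the number of LD values that would actually be stored.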

2 LDmat functionalities

The tool includes two main modules.

2.1 Compress

We used Hierarchical Data Format version 5 (HDF5) for our compression mechanism. LDmat can compress a large LD matrix and associated MAF values (optional) down to a single HDF5 file (see Supplementary Information). Within this file, there exist many ‘groups’, each one covering a non-overlapping section of chromosome positions (Fig. 1A). These groups contain pointers to the data arrays indexed in HDF5. An appropriate group size is chosen automatically based on the overlap in the input files (although a different size can be manually specified). For the UKBB .npz files, the automatically chosen size is 1 MB.
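As a rough illustration of this grouping, the sketch below assigns chromosome positions to non-overlapping groups of a fixed width. The `chunk_<start>` naming follows the ‘chunk_1000001’ label in Fig. 1A, but the 1-based group boundaries, the function names and the example positions are assumptions made for illustration and are not LDmat's actual implementation.

```python
from collections import defaultdict


def chunk_start(pos: int, group_size: int = 1_000_000) -> int:
    """Start of the group containing a 1-based position (assumed layout)."""
    return ((pos - 1) // group_size) * group_size + 1


def group_positions(positions, group_size: int = 1_000_000):
    """Map hypothetical group names to the sorted positions they cover."""
    groups = defaultdict(list)
    for pos in sorted(positions):
        groups[f"chunk_{chunk_start(pos, group_size)}"].append(pos)
    return dict(groups)


if __name__ == "__main__":
    example = [14_500, 999_999, 1_000_001, 1_250_000, 2_000_001]
    for name, members in group_positions(example).items():
        print(name, members)
```

Because each position belongs to exactly one group, a lookup only needs to open the groups whose ranges cover the requested loci.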

Fig. 1.

(A) Internal structure of the HDF5 file. Rectangles represent groups, which are the HDF5 equivalent of dictionaries. The datasets and metadata attributes are shown in detail for ‘chunk_1000001’ only, although all chunks have the same set of entries. These chunks correspond to the trapezoids in Supplementary Figure S2. (B) The size of the compressed files, expressed as a percentage of the size of the original files, at different parameter settings. ‘Lossless’ is shown as ‘full’. (C) Heritability of the height trait in UKBB, calculated from the LD matrices obtained after truncation to different numbers of decimal places and with different minimum LD thresholds.

2.2 Query

When making a query, the tool must first find the groups within the HDF5 file that contain the desired data by checking for overlap with the start and end locus of each group (see Supplementary Information). In order to accommodate large queries without running out of memory, the tool can write the results to disk as they are calculated. This feature turns on automatically when the queried sub-matrix passes a size threshold.
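The overlap check described above can be sketched as a test on closed intervals. The group names and boundaries in this example are hypothetical, and the function is a minimal illustration of the idea rather than LDmat's own code.

```python
def overlapping_groups(groups, query_start, query_end):
    """Return names of groups whose [start, end] range overlaps the query.

    `groups` maps a group name to its inclusive (start, end) positions.
    """
    hits = []
    for name, (start, end) in groups.items():
        # Two closed intervals overlap unless one ends before the other begins.
        if start <= query_end and end >= query_start:
            hits.append(name)
    return hits


if __name__ == "__main__":
    groups = {
        "chunk_1": (1, 1_000_000),
        "chunk_1000001": (1_000_001, 2_000_000),
        "chunk_2000001": (2_000_001, 3_000_000),
    }
    print(overlapping_groups(groups, 900_000, 1_100_000))
```

Only the groups returned by such a check need to be read, which is what keeps memory use small for narrow queries.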

3 Results

3.1 Compression and querying

We ran a series of tests on Chromosome 21 (and Chromosome 1, see Supplementary Information), compressing the full set of UKBB LD matrices into a single file. The total size of the raw data is 45,185 MB. Figure 1B (see Supplementary Fig. S4 for Chromosome 1) shows the results of compressing these files while varying the minimum LD value and decimal place parameters.
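To make the two parameters concrete, the sketch below applies a minimum absolute LD threshold and a decimal place limit to a short list of values. It is a simplified illustration under stated assumptions: the function names are hypothetical, the threshold is assumed to be applied before rounding, and the input values are made up rather than taken from UKBB.

```python
def truncate_ld(values, decimals=4, min_ld=0.1):
    """Zero out values below a minimum absolute LD, then round the rest."""
    out = []
    for v in values:
        if abs(v) < min_ld:
            out.append(0.0)  # assumed: sub-threshold entries become zero
        else:
            out.append(round(v, decimals))
    return out


def nonzero_fraction(values):
    """Fraction of entries that are still non-zero after truncation."""
    if not values:
        return 0.0
    return sum(1 for v in values if v != 0.0) / len(values)


if __name__ == "__main__":
    raw = [0.123456, -0.04, 0.9, 0.04, -0.731219]  # made-up LD values
    kept = truncate_ld(raw, decimals=2, min_ld=0.1)
    print(kept, nonzero_fraction(kept))
```

The threshold changes only the entries below it, whereas the decimal place limit changes every retained value.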

Notably, the data can be compressed down to less than 1% of its original size when four decimal places are kept and an absolute LD value threshold of 0.06 is applied (Supplementary Fig. S5). Note that lossless compression with HDF5, that is, when all decimal places and LD values are retained, results in a 59% reduction in file sizes (Fig. 1B). These compressed files also include the MAF values, which are not present in the original LD matrices. Furthermore, if we apply the same decimal places and minimum LD value threshold to the original .npz files, they are still over 3-fold larger than the corresponding HDF5 files. Moreover, they do not contain any metadata or auxiliary data such as the MAF values. We calculated the compression ratio as a function of the minimum LD score threshold and found that the compression rate plateaus for threshold values larger than 0.06 (Supplementary Fig. S5a), which can provide guidance on how to select the threshold.

We also tested the computational time of running a set of randomized queries on our compressed LD matrices. We showed that the LDmat query functionality can return a sub-matrix for a 1 MB locus in under 2 seconds when tested on both Chromosomes 1 and 21 (Supplementary Table S1). The query runtime for consecutive loci depends on the group sizes in HDF5. If the queried locus is larger than the group size, then LDmat has to search more than one hash table, increasing the time to query. To demonstrate how this works, we created a compressed matrix from Chromosome 21 with a group size of 0.5 MB. We then showed that querying a 1 MB locus in this matrix takes about twice as long as in a matrix with a group size of 1 MB (2.4 seconds versus 1.25 seconds, Supplementary Table S1). We also showed that it takes around 2 and 6 min to return LD scores between a list of 106 non-consecutive loci on Chromosomes 21 and 1, respectively. See Supplementary Information for examples and Supplementary Figure S6 for usage.
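The dependence of query time on group size can be illustrated by counting how many fixed-width groups a query range spans. The sketch assumes 1-based positions and groups that start at 1, group_size + 1 and so on; this layout and the function name are assumptions for illustration, not a description of LDmat internals.

```python
def groups_touched(query_start: int, query_end: int, group_size: int) -> int:
    """Number of fixed-width groups a closed query range [start, end] spans.

    Assumes 1-based positions and groups starting at 1, group_size + 1, ...
    """
    first = (query_start - 1) // group_size
    last = (query_end - 1) // group_size
    return last - first + 1


if __name__ == "__main__":
    start, end = 250_001, 1_250_000  # a hypothetical 1 MB query
    for size in (1_000_000, 500_000):
        print(size, groups_touched(start, end, size))
```

Halving the group size increases the number of groups that a query of fixed width must open, which is consistent with the longer runtime reported for the 0.5 MB group size.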

3.2 Utility

In order to assess the accuracy of the compressed LD matrices, we first looked at the distribution of LD values after removing scores that are below the LD value threshold. As expected, the overall distribution and statistics of the LD values remain the same, while the number of LD values that are zero (which is the mean in all cases) changes (Supplementary Fig. S7). We then calculated the heritability of the height trait using UKBB data for different minimum LD value and decimal place thresholds using LD score regression (Bulik-Sullivan et al., 2015). We found that both the heritability estimate and its standard error (Fig. 1C) are significantly affected if we keep fewer than three decimal places, while they remain the same at every minimum LD value threshold up to 0.2. We observed the same pattern when we compared the LD score regression coefficients of annotations across thresholds (Supplementary Fig. S8). This is because the minimum LD threshold affects only a subset of LD values, whereas the decimal place threshold affects all of the LD values.

4 Conclusions

In conclusion, we recommend that users set the decimal place threshold to 4 and the minimum LD value threshold to 0.1 for the most accurate results. This still shrinks the LD matrices to ∼1% of their original size (Fig. 1B). Since the compression rate is large even with very small minimum LD value thresholds and a large number of decimal places, we recommend that users choose these parameters based on utility, that is, to minimize information loss.

Supplementary Material

btad092_Supplementary_Data

Contributor Information

Rockwell J Weiner, Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA; Department of Computer Science, Columbia University, New York, NY 10027, USA.

Chirag Lakhani, New York Genome Center, New York, NY 10013, USA.

David A Knowles, New York Genome Center, New York, NY 10013, USA; Department of Computer Science, Columbia University, New York, NY 10027, USA; Department of Systems Biology, Columbia University, New York, NY 10032, USA.

Gamze Gürsoy, Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA; Department of Computer Science, Columbia University, New York, NY 10027, USA.

Funding

This work was supported by the National Institutes of Health grants [U01AG068880 to D.A.K. and C.L. and R00HG010909 and R35GM147004 to G.G.].

Conflict of Interest: none declared.

Data availability

The code and the test data can be accessed through https://github.com/G2Lab/ldmat.

References

  1. Benner C. et al. (2017) Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet., 101, 539–551.
  2. Bulik-Sullivan B. et al.; Schizophrenia Working Group of the Psychiatric Genomics Consortium. (2015) LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet., 47, 291–295.
  3. Collins A.R. (2007) Linkage Disequilibrium and Association Mapping. Springer.
  4. Cutter A.D. (2019) Recombination and linkage disequilibrium in evolutionary signatures. In: Cutter A.D. (Ed.) A Primer of Molecular Population Genetics. Oxford University Press, Oxford, p. 113.
  5. Ennis S. (2007) Linkage Disequilibrium as a Tool for Detecting Signatures of Natural Selection. Humana Press, Totowa, NJ, pp. 59–70.
  6. Gabriel S.B. et al. (2002) The structure of haplotype blocks in the human genome. Science, 296, 2225–2229.
  7. Harris C. et al. (2020) Array programming with numpy. Nature, 585, 357–362.
  8. Hudson R. (2004) Linkage Disequilibrium and Recombination, Chapter 22. John Wiley & Sons, Ltd.
  9. Kijas J.W. et al.; The International Sheep Genomics Consortium. (2014) Linkage disequilibrium over short physical distances measured in sheep using a high-density SNP chip. Anim. Genet., 45, 754–757.
  10. Li H. (2011) Tabix: Fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics, 27, 718–719.
  11. Mueller J.C. (2004) Linkage disequilibrium for different scales and applications. Brief. Bioinformatics, 5, 355–364.
  12. Myers T.A. et al. (2020) LDlinkR: An R package for rapidly calculating linkage disequilibrium statistics in diverse populations. Front. Genet., 11, 157.
  13. Slatkin M. (2008) Linkage disequilibrium—understanding the evolutionary past and mapping the medical future. Nat. Rev. Genet., 9, 477–485.
  14. Weissbrod O. et al. (2020) Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat. Genet., 52, 1355–1363.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
