Summary
Understanding the relationship between genotypes and phenotypes is crucial for advancing personalized medicine. Expression quantitative trait loci (eQTL) mapping plays a significant role by correlating genetic variants to gene expression levels. Despite the progress made by large-scale projects, eQTL mapping still faces challenges in statistical power and privacy concerns. Multi-site studies can increase sample sizes but are hindered by privacy issues. We present privateQTL, a novel framework leveraging secure multi-party computation for secure and federated eQTL mapping. When tested in a real-world scenario with data from different studies, privateQTL outperformed meta-analysis by accurately correcting for covariates and batch effect and retaining higher accuracy and precision for both eGene-eVariant mapping and effect size estimation. In addition, privateQTL is modular and scalable, making it adaptable for other molecular phenotypes and large-scale studies. Our results indicate that privateQTL is a practical solution for privacy-preserving collaborative eQTL mapping.
Keywords: eQTL mapping, genomic privacy, multi-party computation, security
Graphical abstract

Highlights
• privateQTL is a novel tool for multi-center eQTL mapping studies that protects privacy
• Cryptographic techniques are used for data pre-processing and eQTL mapping
• privateQTL is more accurate than meta-analysis in a real-world study with batch effects
• privateQTL is a scalable and practical solution for real-world applications
Choi et al. developed a novel tool for privacy-preserving cross-institutional eQTL mapping studies. The authors benchmarked their tool against meta-analysis and demonstrated that it achieves higher accuracy in a real-world multi-center study with batch effects. Their tool demonstrates scalability and practical applicability for a wide range of users.
Introduction
Understanding the relationships between genotypes and phenotypes is essential to facilitate the development of personalized medicine and therapies.1,2 A key component of this understanding involves interpreting the functional roles of the genetic variants involved. As large-scale omics projects increasingly generate molecular data from tissue samples, we are now able to interpret the functional roles of variants that influence molecular phenotypes, such as gene expression, through expression quantitative trait loci (eQTL) mapping. eQTLs are genomic regions where genetic variation is associated with gene expression.3 Despite the significant achievements of large-scale initiatives like the GTEx project in cataloging genome-wide eQTLs across various tissues,4 there remains a vast scope for further exploration that necessitates the aggregation of datasets from multiple sources.
Conducting robust eQTL mapping with sufficient statistical power requires a substantial number of samples, particularly for less-common alleles. eQTL studies typically include a minimum of 60 samples, often ranging from a few hundred to several thousand samples.5,6 Obtaining sufficiently large sample sizes becomes an even more pronounced challenge when dealing with tissues that are difficult to obtain, such as brain tissue. To address this issue, researchers often undertake multi-site studies, aggregating data from multiple institutions to increase sample size.7,8,9 Efforts such as eQTL Catalogue10 and eQTLGen11 highlight the crucial value of compiling and meta-analyzing eQTL data from multiple studies. Despite their significance, these projects are labor intensive and present substantial challenges when it comes to scaling.
Privacy concerns related to genomic and transcriptomic data pose significant challenges to integrating datasets from diverse institutions. These concerns can even hinder the re-use of data from a single institution, complicating the necessity for recalculations when complete summary statistics are unavailable. There are restrictions by institutions and funding agencies on data sharing due to the known privacy issues with anonymized genomic data.12,13,14 While current standards allow public sharing of gene expression data, privacy concerns still persist with linking attacks that are able to predict individuals’ genotypes using gene expression values.15,16 Moreover, the diversity of legislative frameworks, such as the General Data Protection Regulation (GDPR) in Europe, along with differing institutional policies at each data center, further complicate collaborative research efforts by imposing additional restrictions on data sharing across institutions, countries, or continents. Consequently, there is a critical need for privacy-preserving techniques to safeguard data while enabling collaborative eQTL mapping.
Recent advancements in cryptographic techniques have proven effective in alleviating data privacy and security concerns in genomics. These techniques offer mathematically guaranteed security, yet their implementation and practical use can be challenging due to the complexity of algorithms and the scale of data in genomics. Among these, secure multi-party computation (MPC)17,18 is particularly well suited for collaborative research. It leverages techniques such as secret sharing to maintain data privacy while ensuring computational efficiency, offering a significant advantage over other cryptographic tools such as homomorphic encryption.19 As such, many tools have been developed to utilize MPC for genome-wide association studies (GWASs).20,21 However, to date, no study has successfully developed MPC tools specifically for eQTL mapping. This gap likely stems from the greater complexity of privacy-preserving eQTL mapping compared with GWASs. The challenges include a larger phenotype space, with over 20,000 genes to analyze, and the need for extensive pre-processing of gene expression data, such as normalization and covariate correction, to mitigate batch effects. In addition, the high computational costs required for phenotype permutation to model the null distributions hinder the straightforward application of QTL mapping using MPC or other cryptographic methods.
To address the challenges with genotype-phenotype associations in collaborative settings and to increase statistical power in eQTL mapping in a responsible way, we present a novel framework for federated secure eQTL mapping, called privateQTL, that leverages MPC. We developed two variations tailored to specific needs: the first version adheres to current data-sharing practices by keeping genotypes private and allowing the public sharing of gene expression data, while the second version ensures the privacy of both genotypes and gene expression data. We show that privateQTL achieves superior accuracy compared with meta-analysis, especially when tested under batch effects on data collected from three different settings. Our approach is fully modular, enabling researchers to utilize our MPC functionalities (such as gene expression normalization, inverse normal transformation, and correlation matrix generation) in applications beyond eQTL mapping, and it can easily be adapted to other molecular phenotypes (e.g., splicing, chromatin accessibility).
Results
privateQTL is a secure multi-center QTL mapping strategy
Achieving large sample sizes in eQTL mapping is important, especially for tissues that are difficult to obtain. The imperative to protect patient privacy and institutional data-sharing policies impede the combining of datasets spread across different sources. A prevailing strategy to circumvent these constraints is the implementation of meta-analysis, where each site locally conducts their own eQTL mapping and aggregates the summary statistics at the end.7,9,11 However, meta-analysis often results in spurious associations or missing important associations between genotypes and phenotypes.22 In some cases, a single institution may not even have enough power to perform their own QTL mapping to be part of a meta-analysis.23 Moreover, due to the large phenotype space, meta-analysis can be resource inefficient for QTL mapping studies.
We designed privateQTL for multi-center cis-eQTL mapping studies, where both genotype and gene expression data are spread across multiple institutions. privateQTL utilizes MPC to conduct traditional QTL mapping protocols that regress genotypes against gene expression while using a secret sharing mechanism24,25 to ensure security of the shared data. MPC is a cryptographic technique that allows multiple parties to jointly compute a function over their inputs while keeping those inputs private. This means that even though the parties work together to achieve a result, none of them learns anything about the others’ inputs beyond what can be inferred from the outcome.17,18 Please see MPC protocols in the STAR Methods for more details.
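To make the idea of secret sharing concrete, the following minimal sketch uses plain additive secret sharing over a prime modulus. It is an illustration only: privateQTL relies on a replicated secret sharing scheme (see the STAR Methods), and the modulus, the function names, and the toy values here are assumptions chosen for readability rather than details taken from the implementation.

```python
import secrets

# Illustrative modulus (a Mersenne prime); an assumption, not the paper's setting.
MODULUS = 2**61 - 1

def share(value, n_parties=3, modulus=MODULUS):
    """Split an integer into n additive shares that sum to value mod modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]

def reconstruct(shares, modulus=MODULUS):
    """Recombine all shares; any strict subset looks uniformly random."""
    return sum(shares) % modulus

def add_shared(a_shares, b_shares, modulus=MODULUS):
    """Each party adds its own two shares locally; no communication is needed."""
    return [(a + b) % modulus for a, b in zip(a_shares, b_shares)]

# Two sites secret-share one private value each among three computing servers.
site_a = share(12)
site_b = share(30)
total = reconstruct(add_shared(site_a, site_b))
```

Only the final sum is revealed when the shares are recombined; no single server sees either input. Multiplication of shared values, which QTL mapping also needs, requires interaction between the servers and is not shown here.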
privateQTL is an end-to-end protocol that includes federated and privacy-preserving (1) genotype and gene expression data pre-processing for population stratification, hidden covariate correction for gene expression, normalization, and batch effect correction, (2) regression for both nominal (finds associations) and permutation runs (generates null distribution for cis-associations), and (3) post-processing for multiple hypothesis testing correction.
We offer two variations of privateQTL, each designed to meet specific requirements. One version is compatible with existing data-sharing protocols, while the other is forward looking and enhances privacy protections to guard against re-identification through linking attacks on gene expression data. That is, our first scenario (privateQTL-I) enables the privacy preservation of the genotype data, and our second scenario (privateQTL-II) preserves the privacy of both genotype and gene expression data (Figure 1).
Figure 1.
Schematics of the protocol
(A) In both privateQTL-I and -II, the genotype pre-processing is done using a local projection method for population stratification. This allows sites to project their data onto the same reference PCA space while keeping the data local. In privateQTL-I, gene expression data are shared across sites and pre-processing is done on the aggregated data without privacy preservation. In privateQTL-II, gene expression data are shared via secret sharing and normalization takes place with MPC. The normalized gene expression data are then sent back to the sites for covariate correction.
(B) In both privateQTL-I and -II, eQTL mapping and null distribution generation are done using MPC via secret sharing of both gene expression and genotype data. Double brackets indicate secretly shared data.
We first demonstrated the accuracy of our methods by using GTEx whole-blood data that are composed of 670 individuals with matching gene expression and genotypes by distributing the data to three centers with a 300/250/120 split. We then demonstrated the accuracy of our methods when there is a batch effect present in the splits. For that, we applied it to datasets collected from three different studies: GTEx project, Geuvadis project, and Taylor et al.4,26,27
We compared eGenes discovery and eGene-eVariant pair findings of both privateQTL versions with meta-analysis relative to the mapping strategy of the GTEx consortium4 that uses tensorQTL28 with an ideal scenario, where there are no privacy and security concerns to consider; hence, the data are aggregated in a central location. We used METAL29 for meta-analysis as our privacy-preserving comparator (see the STAR Methods for details). Both privateQTL and METAL assume that all centers use the same gene annotations for gene expression and the same set of SNPs for genotype data. If needed, genotypes should be imputed with similar panels.
privateQTL ensures security while delivering accurate results
privateQTL-I ensures the confidentiality of genotypes
Before performing the eQTL mapping, we need to make sure that genotype data are pre-processed for population stratification and that gene expression data are pre-processed for batch effect normalization and hidden covariate correction. In privateQTL-I, any computation on the genotype data needs to be privacy preserving. For genotype pre-processing, we adopted an accurate and privacy-preserving population stratification algorithm,30 in which each site locally projects its genotype matrix to principal components from a reference population. This is done after each site performs its own quality control on its genotype data. For gene expression pre-processing, since in this version the gene expression data are considered to be safe to share, we can aggregate data from all sites in one of the data sites and perform pre-processing. We implemented both quantile normalization (QN)31 and relative log expression (RLE) normalization32 for users to choose from. Normalized aggregated gene expression data are then inverse normal transformed and corrected for hidden variables using principal-component analysis (PCA), as this method is suggested to be the most accurate for covariate correction of gene expression data.33
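As a plaintext illustration of two of the pre-processing steps named above, the sketch below implements quantile normalization across samples and a rank-based inverse normal transform for one gene. It is not the privateQTL code: the function names, the rank offset of 0.5, and the arbitrary handling of ties are assumptions made for this example.

```python
from statistics import NormalDist

def quantile_normalize(samples):
    """Quantile-normalize a list of per-sample expression lists (equal length)."""
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across samples.
    rank_means = [sum(s[k] for s in sorted_samples) / len(samples)
                  for k in range(n_genes)]
    normalized = []
    for s in samples:
        order = sorted(range(n_genes), key=lambda i: s[i])
        out = [0.0] * n_genes
        for rank, gene_idx in enumerate(order):
            out[gene_idx] = rank_means[rank]  # ties are broken by position
        normalized.append(out)
    return normalized

def inverse_normal_transform(values, offset=0.5):
    """Map one gene's values across individuals to standard normal quantiles."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    z = [0.0] * n
    standard_normal = NormalDist()
    for rank, idx in enumerate(order, start=1):
        z[idx] = standard_normal.inv_cdf((rank - offset) / n)
    return z

qn = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
z = inverse_normal_transform([5.0, 1.0, 3.0])
```

After quantile normalization every sample has the same set of values, assigned according to each gene's rank within that sample; the inverse normal transform then depends only on the ranks of a gene's values across individuals.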
We developed the eQTL mapping to be done on a per-gene basis. That is, before sending secret shares of the genotype matrix, each site creates a submatrix from the variants that are located within the 1 Mb cis-window of the gene. This genotype matrix (genotypes by individuals) and the gene expression vector (expression of the gene by individuals) are secretly shared to three computing servers with a linear replicative secret sharing scheme.34 We then developed a strategy that allows us to compute the QTL mapping for both nominal and permutation runs in a single matrix operation (for a depiction of this strategy, please see Figure S1). The secretly shared gene expression vector is shuffled 1,000 times with secretly shared random permutations to create a secretly shared gene expression matrix. In this matrix, the first column represents the true gene expression values, while the remaining 1,000 columns represent random permutations to be used to calculate the null distribution. One thousand permutations were previously shown to be enough to model the null distribution.35 The computing servers then perform one matrix multiplication per gene in MPC for both the nominal and permutation runs. The modularized structure of our privateQTL pipeline that stems from gene-based mapping and the single-shot computation for nominal and permutation runs allow for parallel and extremely scalable computation (Figure 1B). Please see the STAR Methods for details.
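The single-operation strategy can be sketched in plaintext as follows. This example runs in the clear, without any secret sharing, and uses the Pearson correlation between standardized genotypes and standardized expression as the association statistic. The helper names, the toy data, and the use of 10 permutations instead of 1,000 are assumptions for illustration.

```python
import random
from statistics import fmean, pstdev

def standardize(vec):
    """Center and scale a vector to mean 0 and population SD 1."""
    mu, sd = fmean(vec), pstdev(vec)
    return [(v - mu) / sd for v in vec]

def build_expression_matrix(expr, n_perm, rng):
    """Column 0 is the observed expression; columns 1..n_perm are shuffles."""
    columns = [list(expr)]
    for _ in range(n_perm):
        shuffled = list(expr)
        rng.shuffle(shuffled)
        columns.append(shuffled)
    return columns

def correlations(genotypes, expr_columns):
    """One matrix product: (variants x samples) times (samples x columns) / n."""
    n = len(expr_columns[0])
    g_std = [standardize(g) for g in genotypes]
    e_std = [standardize(e) for e in expr_columns]
    return [[sum(a * b for a, b in zip(g, e)) / n for e in e_std]
            for g in g_std]

rng = random.Random(0)
genotypes = [[0, 1, 2, 1, 0, 2],   # variant strongly associated with the gene
             [1, 1, 0, 2, 0, 1]]   # unrelated variant
expr = [0.1, 1.0, 2.1, 0.9, 0.2, 1.9]
columns = build_expression_matrix(expr, n_perm=10, rng=rng)
corr = correlations(genotypes, columns)
```

The first column of `corr` holds the nominal correlations and the remaining columns hold the permutation correlations, so the nominal and permutation runs share one product per gene.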
After we perform the above calculation for all genes, the resulting correlation matrix (genotypes by gene expression) is revealed to one of the computing servers for downstream statistical analysis for p value adjustment and false discovery rate (FDR) control (please see security analysis in the STAR Methods for details on how privacy and security are maintained). We use the permutation run to model a beta distribution of p values for correcting the nominal p values based on this beta distribution. We compare the most significant variant from the nominal run with those from the permutation run to adjust the nominal p value for the most significant variant. This adjusted p value for each gene is used to calculate q values across genes, and eGenes are designated as genes with q values less than 0.05.
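The post-processing logic can be approximated with the simplified sketch below. Two substitutions are made and should be read as assumptions: the gene-level p value is computed empirically from the permutation maxima instead of through the fitted beta distribution described above, and the Benjamini-Hochberg procedure stands in for the q value calculation. All function names are hypothetical.

```python
def empirical_pvalue(observed_stat, null_stats):
    """Share of permutation statistics at least as extreme as the observed one."""
    exceed = sum(1 for s in null_stats if s >= observed_stat)
    return (exceed + 1) / (len(null_stats) + 1)

def gene_level_pvalue(corr):
    """corr: variants x (1 + n_perm) matrix; column 0 is the nominal run."""
    observed = max(abs(row[0]) for row in corr)
    n_cols = len(corr[0])
    # Best association per permutation, taken over all cis-variants.
    null = [max(abs(row[k]) for row in corr) for k in range(1, n_cols)]
    return empirical_pvalue(observed, null)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values (stand-in for q values)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

toy_corr = [[0.9, 0.1, -0.95, 0.2],
            [0.3, -0.2, 0.1, 0.05]]
p_gene = gene_level_pvalue(toy_corr)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
egenes = [i for i, q in enumerate(adjusted) if q < 0.05]
```

With only a handful of permutations the empirical p value is coarse, which is one motivation for a parametric approximation of the null such as the beta distribution used in the text.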
privateQTL-II ensures the confidentiality of genotypes and gene expression values
Recent studies have shown that privacy of research participants can be breached using anonymized gene expression values.15,16,36 These studies have demonstrated that publicly available eQTL datasets can be correlated with gene expression values to infer patients’ private genotypes.15,16 In addition, haplotype-resolved gene expression data have been shown to pinpoint the locations of heterozygous SNPs in patient genomes.36 As such, we developed privateQTL-II, an end-to-end eQTL mapping framework keeping both genotype and gene expression data private. privateQTL-II’s genotype pre-processing and secret sharing scheme are equivalent to those of privateQTL-I. We developed additional MPC-based methodologies for gene expression pre-processing (normalization and hidden covariate correction) for privateQTL-II. We developed two normalization modules for users to choose from: federated QN and RLE as well as an inverse normal transform step in MPC (see Figure S2 for the schematics of these calculations and the STAR Methods for details). This privacy-preserving gene expression pre-processing and sharing framework ensures that neither computing parties nor data centers can access the gene expression values of patients from other data centers. As a result, the gene expression data cannot be exploited to infer patients’ private genotypes.
After privacy-preserving and federated normalization of the user’s choice and inverse normal transform, the normal transformed gene expression values are returned to the participating sites. They then locally compute hidden variable correction with PCA33 with their own samples and residualize their gene expression values based on their local principal components. The fully pre-processed and residualized gene expression values and pre-processed genotype matrix are then secretly shared to the computing servers on a per-gene basis for QTL mapping as in privateQTL-I.
privateQTL delivers accurate results
We first compared the overlap of eGenes between privateQTL-I, privateQTL-II, meta-analysis, and the GTEx pipeline using a 5% FDR cutoff. Overall, privateQTL-I eGenes overlapped with 93.2% of GTEx pipeline eGenes, privateQTL-II eGenes overlapped with 91.3% of GTEx pipeline eGenes, while meta-analysis eGenes overlapped with only 76.1% of GTEx pipeline eGenes (Figures 2A and 2B). For all methods, the effect sizes of the most significant eGene-eVariant pairs were identical to those obtained with the GTEx pipeline for the same pairs (Figures 2C–2E). Note that some of the most significant eVariants found by both privateQTL versions and meta-analysis did not match the most significant eVariants found by the GTEx pipeline. Therefore, we asked whether these non-overlapping most significant eVariants from different methods are in high linkage disequilibrium (LD) with each other. We found that the most significant eVariants that are produced by both privateQTL versions and meta-analysis are in high LD with the most significant eVariant of the same eGene from the GTEx pipeline (Figures 2F–2H). As a sanity check, we also found that most significant eVariants are in close proximity to the transcription start site (TSS) of the corresponding eGene for meta-analysis and both versions of privateQTL (Figure S3). Overall, meta-analysis had less power in finding eGenes, while both versions of privateQTL had a higher overlap with the GTEx pipeline.
Figure 2.
Results for GTEx data
(A) Overlap of eGenes found in privateQTL-I, meta-analysis, and the GTEx pipeline (tensorQTL). privateQTL-I recovers more significant eGenes than meta-analysis.
(B) Overlap of eGenes found in privateQTL-II, meta-analysis, and the GTEx pipeline (tensorQTL). privateQTL-I and -II recover more significant eGenes than meta-analysis.
(C–E) Comparison of top eVariant-eGene pairs that are found by both the GTEx pipeline and the given method—privateQTL-I, privateQTL-II, or meta-analysis.
(F–H) LD between top eVariants found by the GTEx pipeline and each method, as represented by LD score. All three methods consistently found top eVariants that are in LD with the top eVariant found by the GTEx pipeline.
(I) Recall comparison between both versions of privateQTL, meta-analysis, and the GTEx pipeline (used as gold standard) as a function of significance of association, adjusted for LD correlation of 0.8 or above. Meta-analysis consistently recovers fewer eVariants regardless of eGene significance (as denoted by q value rank).
(J) Precision comparison between both versions of privateQTL, meta-analysis, and the GTEx pipeline (used as gold standard) as a function of significance of association, adjusted for LD correlation of 0.8 or above. Meta-analysis finds fewer false-positive eVariants than both privateQTL versions for eGenes of higher significance, but precision drops steeply around q value rank of 4,500.
Next, we analyzed false-positive and false-negative eGenes found in privateQTL-I (Figure S4) and privateQTL-II (Figure S5). There were 492 and 636 eGenes that could not pass the significance threshold in privateQTL-I and -II, respectively, but are present as significant in the GTEx pipeline. Although these eGenes did not pass the significance threshold in privateQTL, privateQTL produced effect sizes for the most significant eVariants of these eGenes that are in excellent agreement with the GTEx pipeline effect sizes (R = 0.999, Figures S4E and S5E). The absolute effect sizes of the most significant eVariant of these eGenes are small (<0.5). We also found the gene expression levels of these eGenes and the minor allele frequency (MAF) of their most significant eVariants to be small (Figures S4G, S4H, S5G, and S5H). Interestingly, the majority of these eGenes could not pass the significance threshold in meta-analysis either. We believe that due to these small effect sizes and low expression levels, these eGenes and their associated most significant eVariants could not pass the p value and FDR threshold in privateQTL. As can be seen in Figures S4F and S5F, their q values fall just short of the significance threshold in privateQTL and just pass it in the GTEx pipeline.
We also found that there were 282 and 337 eGenes that are deemed significant in privateQTL-I and -II, respectively, but did not pass the significance threshold in the GTEx pipeline. privateQTL-I and -II produced effect sizes for the most significant eVariants of these eGenes that are in excellent agreement with the GTEx pipeline effect sizes (R = 0.999 and 0.998, Figures S4A and S5A). The absolute effect sizes of the most significant eVariant of these eGenes are also small (<0.5, Figures S4A and S5A). We believe that due to these small effect sizes, these eGenes and their associated most significant eVariants could not pass the p value and FDR threshold in the GTEx pipeline. We also found the gene expression levels of these eGenes and the MAF of their most significant eVariants to be small (Figures S4C, S4D, S5C, and S5D). As can be seen in Figures S4B and S5B, their q values just pass the significance threshold in privateQTL and fall just short of it in the GTEx pipeline.
We also investigated the expected deviations when we use two different non-secure algorithms for eQTL mapping by comparing the results to another QTL mapping method, LIMIX (Figure S6).37 We compared eGene discovery for privateQTL-I, privateQTL-II, meta-analysis, and the GTEx pipeline to LIMIX. GTEx pipeline eGenes overlapped with 99.97% of LIMIX eGenes. privateQTL-I and -II were able to recover 99.34% and 98.24% of LIMIX eGenes, respectively. Meta-analysis recovered only 88.79% of LIMIX eGenes (Figures S6A–S6C). On the other hand, LIMIX overlapped with fewer GTEx pipeline eGenes (84.81%) compared to privateQTL (93.25% for privateQTL-I and 91.27% for privateQTL-II). Both privateQTL versions and LIMIX had low numbers of eGenes deemed significant that are not found in the GTEx pipeline (4% for privateQTL-I, 5% for privateQTL-II, and 0.1% for LIMIX, Figures S6A–S6C). We also compared the effect sizes of the overlapping most significant eVariants for eGenes between these two non-secure methods. LIMIX showed trends similar to privateQTL, reproducing the exact effect sizes obtained by the GTEx pipeline (Figures S6D–S6F). We also showed that the most significant eVariants from LIMIX and the GTEx pipeline are in high LD with each other (Figure S6G) and the significant eVariants from LIMIX are in close proximity to their corresponding eGene TSSs (Figure S3E). Overall, this shows that the small deviations between privateQTL and the GTEx pipeline are within the expected range, as two non-secure, non-federated algorithms also show similar deviations.
We next assessed the overlap of all significant eGene-eVariant pairs between both privateQTL versions and the GTEx pipeline. This was achieved by asking the question of whether a significant eVariant or a SNP within high LD of it (with an LD score threshold of 0.8) in both versions of privateQTL or meta-analysis is also deemed as a significant eVariant for the same eGene in the GTEx pipeline. This is denoted as the recall of the captured eGene-eVariant pairs and calculated as a function of the q value rank of the eGene. We show that both privateQTL-I and privateQTL-II maintain higher recall even at low significance levels (i.e., high q value rank), while meta-analysis has consistently lower recall at all significance levels (Figure 2I). We also calculated the precision of significant eGene-eVariant pairs. This is calculated by asking whether an eVariant is found to be significant by privateQTL, but neither the eVariant nor a SNP in high LD with it is found to be significant for the same eGene in the GTEx pipeline. We show that both privateQTL-I and privateQTL-II maintain comparable precision levels with meta-analysis at high significance levels, while meta-analysis had steep drops in precision at low significance levels compared with both privateQTL versions (Figure 2J). The steep drop of precision toward less-significant pairs and the consistently low recall indicate that meta-analysis is unreliable in finding significant associations, especially when the significance levels are lower. Although our secure implementation for both privateQTL versions and additional secure computations in privateQTL-II slightly affected eGene discovery (Figure 2), both privateQTL versions are accurate in finding significant eVariant-eGene pairs. 
The additional gene expression pre-processing, i.e., the additional privacy measures introduced in privateQTL-II, does not affect the significant association discovery, as precision and recall trends of both privateQTL versions are highly overlapping (Figures 2I and 2J).
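One plausible reading of the LD-adjusted recall and precision described above is sketched below. The data layout (a dictionary from eGene to a set of significant eVariants), the `ld_partners` lookup of SNPs that pass the 0.8 LD threshold, and all identifiers are hypothetical; the exact bookkeeping used in the study may differ.

```python
def matches(variant, other_set, ld_partners):
    """True if the variant, or any SNP in high LD with it, is in other_set."""
    candidates = {variant} | ld_partners.get(variant, set())
    return not candidates.isdisjoint(other_set)

def recall_precision(method, reference, ld_partners):
    """method/reference map each eGene to its set of significant eVariants."""
    method_pairs = [(g, v) for g, vs in method.items() for v in vs]
    reference_pairs = [(g, v) for g, vs in reference.items() for v in vs]
    # Method pairs confirmed by the reference for the same eGene.
    confirmed = sum(matches(v, reference.get(g, set()), ld_partners)
                    for g, v in method_pairs)
    # Reference pairs recovered by the method for the same eGene.
    recovered = sum(matches(v, method.get(g, set()), ld_partners)
                    for g, v in reference_pairs)
    recall = recovered / len(reference_pairs) if reference_pairs else 0.0
    precision = confirmed / len(method_pairs) if method_pairs else 0.0
    return recall, precision

reference = {"geneA": {"rs1", "rs2"}, "geneB": {"rs5"}}   # gold standard
method = {"geneA": {"rs1", "rs3"}, "geneB": {"rs9"}}      # method under test
ld_partners = {"rs3": {"rs2"}, "rs2": {"rs3"}}            # high-LD pairs
recall, precision = recall_precision(method, reference, ld_partners)
```

In this toy example rs3 counts as a correct call for geneA because it is in high LD with rs2, while rs9 for geneB is a false positive and rs5 is missed.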
We further analyzed this trend by looking at the effect size landscape of discovered eVariants for a set of eGenes. We chose the eGenes based on their q value rank to fairly assess the eVariant discovery across different eGene significance levels. Three known eGenes from blood tissues (obtained from the GTEx QTL catalog) are HSPA12B (q value rank 1,024), C1orf145 (q value rank 11,518), and SSSCA1 (q value rank 18,059). We used the nominal p values (thresholded by the permutation run) for the GTEx pipeline and both privateQTL versions and Bonferroni-adjusted p values for meta-analysis, following the multiple testing correction method outlined in a previous eQTL meta-analysis study.9 Note that this is because it is not possible to model the null distribution via permutation in meta-analysis, as the assumption is that we do not have access to the data of individual participating sites. privateQTL-I and privateQTL-II were able to accurately represent the effect size landscape for all cis-eVariants (Figure 3) for all the eGenes tested regardless of their q value rank. Meta-analysis deemed fewer variants as significant. Additional analysis for other example genes can be found in Figure S7. Note that privateQTL p values are kept at , as we work in integer space in MPC and floating-point precision is bounded by modular arithmetic.
Figure 3.
p value landscape for significant eQTLs of eGenes
(A) HSPA12B (q value rank 1,024), (B) C1orf145 (q value rank 11,518), and (C) SSSCA1 (q value rank 18,059). Green dashed line denotes the transcription start site. Colored circles are significant variants for the specific gene of interest with changing effect sizes from −1 (blue) to 1 (red). Epsilon was added to p values deemed as 0 under before computing log values for privateQTL.
Demonstration of privateQTL capabilities in a real-world setting with batch effects highlights its practical adaptability
We next showed the utility of our approach when the data are collected from different centers and subject to batch effects. For this, we used gene expression data from blood tissue and lymphoblastoid cell lines (LCLs) collected by three different studies: 250 blood tissue samples from the GTEx project,4 150 LCL samples from the Geuvadis project,26 and 300 LCL samples from Taylor et al.27 Not only were the gene expression data collected in different laboratories, but there were also differences in how the expression values were quantified. Both GTEx and Geuvadis projects used RNA-SeQC v.1.1.938 and RSEM v.1.3.0,39 while Taylor et al.27 used Salmon v.1.5.240 for the quantification. Most importantly, data from each data source represent an uneven population distribution. The GTEx project contains data from primarily European and a small number of African American and Asian populations; the Geuvadis project contains mostly European samples and a small number of African American samples; while Taylor et al.27 contains samples from African American, European, East Asian, Americas, and South Asian populations. We tested both versions of privateQTL and meta-analysis in a three-center setting. We also aggregated the data and used the GTEx pipeline to compare against our methods. We first compared the overlap of eGenes using a 5% FDR cutoff. Overall, both versions of privateQTL had more overlapping eGenes (91.4% and 82.1%) with the GTEx pipeline compared with meta-analysis (76.4%, Figures 4A and 4B). The effect sizes for the most significant eGene-eVariant pairs were identical to those obtained with the GTEx pipeline for both versions of privateQTL, while there were small differences between meta-analysis and the GTEx pipeline (R = 0.9996, 0.9989, 0.956 for privateQTL-I, -II, and meta-analysis, respectively, Figures 4C–4E).
We found that most significant eVariants that are produced by both privateQTL versions are identical or in higher LD with the most significant eVariant from the GTEx pipeline compared with meta-analysis (Figures 4F–4H). We again also found that the most significant eVariants are in close proximity to the TSSs of the corresponding eGenes (Figure S8). Overall, we found that, under batch effects, meta-analysis failed to capture a large number of significant eGenes, and both versions of privateQTL had higher precision than meta-analysis.
Figure 4.
Results for real-world setting
(A) Overlap of eGenes found in privateQTL-I, meta-analysis, and the GTEx pipeline (tensorQTL). privateQTL-I recovers more significant eGenes than meta-analysis.
(B) Overlap of eGenes found in privateQTL-II, meta-analysis, and the GTEx pipeline (tensorQTL). privateQTL-II recovers more significant eGenes than meta-analysis.
(C–E) Comparison of top eVariant-eGene pairs that are found by both the GTEx pipeline and privateQTL-I, privateQTL-II, and meta-analysis.
(F–H) LD between top eVariants found by the GTEx pipeline and each method, as represented by LD score. Both privateQTL versions found top eVariants that were identical or in higher LD with GTEx pipeline eVariants than meta-analysis.
(I) Recall comparison between both versions of privateQTL, meta-analysis, and the GTEx pipeline (used as gold standard) as a function of significance of association, adjusted for LD correlation of 0.8 or above. Both versions of privateQTL found more correct eVariants than meta-analysis, regardless of eGene significance rank.
(J) Precision comparison between both versions of privateQTL, meta-analysis, and the GTEx pipeline (used as gold standard) as a function of significance of association, adjusted for LD correlation of 0.8 or above. Both versions of privateQTL are more precise in finding eVariants than meta-analysis, regardless of eGene significance rank.
Next, we asked the question whether a significant eVariant or a SNP within high LD of it in both versions of privateQTL or meta-analysis is also deemed a significant eVariant for the same eGene in the GTEx pipeline. Meta-analysis recovers fewer eVariants than both privateQTL versions for eGenes that are more significant and performs similarly to privateQTL-II for less significant eGenes (Figure 4I). When comparing extra eVariants found by each method (precision), we found that meta-analysis consistently finds more eVariants that were not deemed significant in the GTEx pipeline, regardless of eGene significance (Figure 4J). These results highlight that both versions of privateQTL are more accurate and precise than meta-analysis when samples were collected in different centers and present batch effects.
We further investigated why meta-analysis fails to identify certain eGenes and eGene-eVariant pairs that are detected by both versions of privateQTL. Meta-analysis allows each data center to independently perform eQTL mapping on its own datasets, with effect sizes and significance levels subsequently aggregated using sample sizes and standard errors of each data center. Our analysis revealed that the eGenes identified by each data center were proportional to their respective sample sizes. However, there was even less overlap in eGenes consistently detected across two or more data centers. This underscores the unreliability of locally mapped eGenes for aggregated significance assessment (Table S1; Figure S9). A more detailed analysis of this issue is provided in the meta-analysis vs. privateQTL in eQTL mapping: a comparative analysis section in the STAR Methods. Lastly, we examined whether our hidden covariate correction strategy in privateQTL-II, where each site performs a local PCA on gene expression data after federated normalization, leads to inaccurate removal of batch effects and showed that our methods were able to correct for batch effects (Figure S10).
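To make the aggregation step of meta-analysis described above concrete, the following minimal Python sketch combines per-site effect sizes using inverse-variance weights, which is one common fixed-effect scheme. The exact weighting used in the benchmark is not restated in this section, so the function name `fixed_effect_meta` and the toy numbers are assumptions for illustration only.

```python
import math
from statistics import NormalDist

def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted combination of per-site effect sizes.

    betas: per-site effect size estimates for one SNP-gene pair
    ses:   per-site standard errors of those estimates
    Returns the pooled effect size, its standard error, and a two-sided p value.
    """
    weights = [1.0 / se ** 2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = beta / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return beta, se, p

# Hypothetical estimates from two sites with equal precision.
pooled_beta, pooled_se, pooled_p = fixed_effect_meta([0.5, 0.3], [0.1, 0.1])
print(pooled_beta, pooled_se, pooled_p)
```

Because each site contributes only its own locally estimated effect size and standard error, any site-specific bias (for example, an uncorrected batch effect) is carried directly into the pooled estimate.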
privateQTL is robust
We also showcased the capabilities of privateQTL in a simulated setting, where we know all the hidden covariates. privateQTL is modularized to provide users with options for their gene expression data pre-processing needs. For example, in both versions, we provide both QN and RLE normalization options. To understand how these different pre-processing choices available in privateQTL affect the overall accuracy and sensitivity of the results, we conducted a federated simulation study. Our simulation draws upon the methods outlined by Zhou et al.33 In addition, we divided the data randomly into three distinct sites and mimicked batch effects. We repeated this for two different splitting scenarios: one had 300, 250, and 120 samples across three different sites (denoted as 300/250/120), and the other had 300, 300, and 70 samples (denoted as 300/300/70) across three different sites. The site information was simulated via addition of noise from different distributions and was incorporated as a covariate in our correction for hidden variables. The goal is to assess whether different gene expression pre-processing steps are able to remove covariates and find the correct SNP as the effect SNP (i.e., an eVariant). Thus, the gene expression matrix was modeled as a linear combination of genotype matrix, binary effect indicator matrix (indicates whether a SNP is an eVariant for a given eGene), hidden and known covariates (continuous hidden covariates modeled from normal distribution and known covariates include site information and sex), effect size matrices, and noise (please see the STAR Methods for more details). eQTL mapping accuracy was compared for a different set of strategies described in Table 1. We then compared the p values with the ground truth effect indicator matrix. That is, if the p value obtained from association is small, then the probability of the SNP to be an eVariant is high.
This allows us to model the task as a binary classification, thus allowing us to calculate area under the precision-recall curve (AUPRC) and area under the receiver operating characteristic curve (AUROC) of each strategy compared with the ideal scenario, where all covariates are known. Note that there is no real-world scenario where we know all the covariates affecting gene expression values. To understand the true effect of each strategy on the eQTL mapping, we turned off the secret sharing scheme, thus the accuracy loss can be solely attributed to the choice of strategy for pre-processing (Figures 5A and 5B).
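The binary-classification view described above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the study: it assumes that each SNP is scored by -log10(p), that AUROC is computed through the rank-sum identity, and that AUPRC is estimated by average precision. The function names and the toy p values are hypothetical.

```python
import math

def auroc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank shared by tied scores
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def auprc(scores, labels):
    """Average precision: mean precision at each true eVariant, by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank
    return total / sum(labels)

# Hypothetical p values and ground-truth effect indicators for six SNPs.
p_values = [1e-8, 0.5, 1e-4, 0.03, 0.9, 0.2]
is_evariant = [1, 0, 1, 0, 0, 1]
scores = [-math.log10(p) for p in p_values]  # small p value -> high score
print(auroc(scores, is_evariant), auprc(scores, is_evariant))
```

Dividing each strategy's AUROC and AUPRC by the values obtained in the ideal scenario would give the relative metrics reported in this section.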
Table 1.
Simulation strategies for phenotype pre-processing comparison
| Label | Description |
|---|---|
| privateQTL-I with QN | gene expression data are aggregated in one of the sites. QN normalization and covariate correction via PCA is performed on the aggregated data. |
| privateQTL-I with RLE | gene expression data are aggregated in one of the sites. RLE normalization and covariate correction via PCA is performed on the aggregated data. |
| privateQTL-II with QN | gene expression data are split across sites. Federated QN is performed, and local PCA covariate correction is performed for each site. |
| privateQTL-II with RLE | gene expression data are split across sites. Federated RLE is performed, and local PCA covariate correction is performed for each site. |
| Unadjusted | gene expression data are aggregated in one of the sites. No normalization or hidden variable correction was performed. |
p values were compared with ground truth effect indicator matrix to compare AUROC and AUPRC.
Figure 5.
Effect of gene expression pre-processing on QTL mapping performance and scalability analysis
(A) Data are split to even numbers of samples per site. The strategy used in privateQTL-I outperforms all other strategies, followed by the strategies used in privateQTL-II.
(B) Data are split into uneven numbers of samples per site. This does not affect the performance of either privateQTL version. The strategy used in privateQTL-I outperforms all other strategies, followed by the strategies used in privateQTL-II.
(C and D) (C) privateQTL-I and (D) privateQTL-II. Runtime in hours was measured separately for (1) genotype pre-processing, (2) phenotype pre-processing, and (3) eQTL mapping including downstream FDR control. Samples were subset from GTEx whole-blood data.
The strategy we used for privateQTL-I produced the highest relative AUPRC and AUROC (0.998 and 0.904 for privateQTL-I with QN and 0.998 and 0.908 for privateQTL-I with RLE, Figure 5A). This strategy represents a gene expression data-sharing scheme, where each site’s data are shared publicly with a participating site to perform the normalization and hidden variable correction on aggregated data. The methods with the next highest relative AUPRC and AUROC were the sharing schemes equivalent to that from privateQTL-II, with different normalization methods (0.995 and 0.869 for privateQTL-II with QN and 0.996 and 0.873 for privateQTL-II with RLE). Note that we did not find significant differences between RLE and QN normalizations in either privateQTL version. We confirmed these trends with 300/300/70 data split, where all privateQTL versions outperformed the unadjusted case (Figure 5B). While the AUROC was not affected by different strategies, there was an expected small drop in AUPRC in settings where gene expression data were also considered private (privateQTL-II).
We performed a similar benchmarking for the genotype pre-processing. While secure and federated PCA techniques for population stratification are available,20,41 they are computationally demanding and can require extensive time to execute. Therefore, we chose to employ a projection-based method, which has demonstrated great accuracy in GWASs.30 We showed that this projection-based method and regular PCA for population stratification had a 94.58% overlap in eGene discovery (Table S2). For a detailed evaluation of different genotype and phenotype pre-processing choices on eGene discovery, please see the plaintext pre-processing method comparison section in the STAR Methods and Table S1.
Next, we measured the degree of precision loss stemming from MPC operations by comparing the output of both privateQTL versions with their plaintext equivalent (Figure S11). privateQTL-I differed minimally from its plaintext version (effect size R = 0.999). The inaccuracy arises from floating-point precision limitations in matrix operations during the mapping step, since the phenotype pre-processing step does not use MPC. privateQTL-II showed a slight reduction in accuracy (effect size R = 0.959). The additional inaccuracy in privateQTL-II is likely due to differences in the tie-breaking methods of sorting algorithms between the plaintext version used in the GTEx pipeline and the one employed for secure sorting.
privateQTL is practical and scalable
We compared runtime and memory usage of privateQTL versions and meta-analysis. privateQTL-I used 90.64 GB and completed all steps in 18.26 h, including preprocessing, MPC-based correlation mapping, FDR control, and significant pair detection, while meta-analysis used 35.89 GB and took 118.60 h for equivalent tasks. privateQTL-I requires more peak memory than meta-analysis but streamlines eQTL analysis in one step, unlike meta-analysis, which requires separate analyses for FDR control and pair detection. privateQTL-I optimizations allow for the entire analysis to be done in approximately 6.5× less time than meta-analysis, at the cost of higher peak memory consumption. privateQTL-II consumed 90.63 GB peak memory during 60.10 h of runtime, which includes genotype preprocessing, MPC-based gene expression normalization and inverse normal transform, MPC-based genotype-gene expression correlation mapping, variant- and gene-level FDR control, and the detection of all significant eGenes and eQTL-eGene pairs. The additional MPC-based gene expression preprocessing steps added 41.84 h compared with privateQTL-I, but privateQTL-II was still faster than meta-analysis (Table 2). Note that the same analysis took 2.05 h and 82.63 GB of peak memory in a non-secure setting using tensorQTL.
Table 2.
Comprehensive summary of privateQTL compared with meta-analysis
| | | privateQTL-I | privateQTL-II | Meta-analysis |
|---|---|---|---|---|
| Privacy standard | genotype | private | private | private |
| | phenotype | public | private | private |
| eGenes overlap compared with GTEx pipeline (%) | GTEx | 93.2 | 91.3 | 76.1 |
| | GTEx, Geuvadis, Taylor et al.27 | 91.4 | 82.1 | 76.4 |
| Runtime (h) | | 18.26 | 60.10 | 118.60 |
| Peak memory consumption (GB) | | 90.64 | 90.63 | 35.89 |
Comparison of privateQTL with meta-analysis for privacy, number of overlapping eGenes with plaintext pipeline, runtime, and peak memory consumption. GTEx corresponds to our simulated study, where we divided the GTEx data into three data centers. GTEx, Geuvadis, Taylor et al.27 correspond to real-world studies, where we used each dataset as separate data centers collaborating with each other to perform privacy-preserving eQTL mapping. Total runtime was measured for (1) genotype and phenotype preprocessing, (2) eQTL mapping, (3) variant-level and gene-level false discovery control.
We then measured the scalability of our methods. We subset an increasing number of samples from 670 GTEx whole-blood samples from 100 to 600, with 100 sample increments, and measured the runtime in hours for privateQTL-I and privateQTL-II. Runtime was separately measured for (1) genotype correction and pre-processing, (2) phenotype pre-processing, and (3) eQTL mapping including variant-level and gene-level FDR control and aggregated as total runtime in the end. We found that both privateQTL-I and privateQTL-II have a linear increase in runtime with increasing number of samples, with fitted linear slope suggesting approximately a 47-s and a 3-min increase per sample, respectively (Figures 5C and 5D). All CPU runtime measurements were performed on an Intel Xeon Gold 6132 CPU at 2.60 GHz, and GPU measurements for plaintext tensorQTL were performed on an NVIDIA Tesla V100. We also present a comprehensive evaluation of both versions of privateQTL and the meta-analysis approach. This includes evaluation on the simulation study, where the GTEx dataset was partitioned into separate centers, and the real-world study utilizing data from three independent studies in a federated framework (Table 2). Lastly, we note that we designed the code such that large data chunks are sent over communications in fixed batches of 1.6 MB. The network bandwidth of our server environment was 711.6 megabits per second. The batch size to be sent over the network can be changed based on the per communication bandwidth limitations of computing parties. Details on how communication interruptions are modeled can be found in the STAR Methods in the security assumption section.
Discussion
In this paper, we introduce privateQTL, an MPC-based framework to perform secure, privacy-preserving, and federated QTL mapping across multiple data centers. One of the significant challenges in collaborative eQTL mapping is maintaining the privacy of sensitive genomic and transcriptomic data. privateQTL addresses this by leveraging MPC to ensure that data remain confidential throughout the computation process. privateQTL-I preserves the privacy of genotype data while sharing gene expression data, whereas privateQTL-II ensures the privacy of both genotype and gene expression data, providing a forward-looking approach. Our results demonstrate that privateQTL not only is more accurate than meta-analysis, especially under batch effects, but also offers comparable computational efficiency and scalability. privateQTL employs a single-shot secure matrix multiplication, enabling simultaneous computation of nominal association and null distribution estimation for FDR correction on a per-gene basis, thus allowing for scalability and parallelization.
The practical utility of privateQTL was demonstrated using a real-world setting, where datasets were collected from different studies (GTEx, Geuvadis, and Taylor et al.27). These studies not only contain batch effects due to gene expression data being collected in different laboratories but also samples that are quantified using different tools and are representative of different population breakdowns. We demonstrated that both versions of privateQTL were better at correcting for hidden covariates and batch effects compared with meta-analysis, evident from the higher overlap of eGene and eVariant discovery with standard pipelines, better accuracy in capturing effect sizes, as well as better precision of captured eVariant-eGene pairs. We showed that although meta-analysis performs comparably to privateQTL when there is no batch effect, when the data are presented with large batch effects (see Figure S10A), meta-analysis seems to find false-positive associations and inaccurate effect sizes. In these important scenarios, where getting all the data in one place is a significant challenge, privateQTL is a privacy-preserving and more accurate and precise alternative to meta-analysis.
The scalability analysis shows that privateQTL can handle increasing sample sizes with a linear increase in runtime, making it suitable for large-scale genomic studies. Both privateQTL-I and privateQTL-II exhibited manageable computational demands, with runtime and memory usage comparable to or better than traditional meta-analysis. This scalability is critical for the practical application of privateQTL in real-world settings, where datasets from multiple centers need to be integrated and analyzed securely. We developed privateQTL to be modular and flexible to expand its applications to other types of QTL studies, such as splicing QTLs and chromatin accessibility QTLs, in the future. We also posit that addressing the challenges posed by different legislative frameworks and institutional policies on data sharing will be crucial for the widespread adoption of privateQTL in global research collaborations.
Limitations of the study
The inherent communication overhead of MPC operations requires computing servers to have adequate network bandwidth and significantly delays runtime compared with its non-secure plaintext counterpart. We expect this overhead to become trivial with increasing network bandwidth. The floating-point inaccuracies introduced by the scaling required for integer-based MPC secret sharing can affect the precision of associations with low effect sizes and of associations that fall just above or below the significance threshold. In addition, the secure sorting algorithms used in the inverse normal transform step may introduce small inaccuracies compared with non-secure methods. However, we envision that future research that focuses on refining these computational steps will further minimize inaccuracies and enhance the robustness of privateQTL. Another limitation to consider is the lack of efficient MPC-based PCA methods for population stratification of genotype data and covariate correction of gene expression data. Current state-of-the-art MPC-based PCA methods require a large amount of computational overhead; thus, privateQTL employs local genotype projection and local phenotype PCA for practicality and scalability. Future research on practical MPC-based PCA methods can alleviate this issue. Lastly, we note that privateQTL was designed for cross-institutional studies that require ample sample sizes. Utilizing privateQTL with small sample sizes will lead to not only insufficient statistical power but also security concerns, as discussed in security analysis in the STAR Methods.
Resource availability
Lead contact
Requests for further information and resources should be directed to and will be fulfilled by the lead contact, Gamze Gürsoy (gamze.gursoy@columbia.edu).
Materials availability
This study did not generate new materials.
Data and code availability
All original code for privateQTL and benchmarking analysis is publicly available at https://github.com/G2Lab/privateQTL and Zenodo (https://doi.org/10.5281/zenodo.14648851). This paper utilizes publicly available datasets for benchmarking analysis. These datasets are listed in the dataset section of the STAR Methods. Additional information or analysis reported in this paper is available upon request from the lead contact.
Acknowledgments
This work was funded by NIH grants R00HG010909 and R56HG013319 to G.G.; NSF CNS Award 2247352, Brown Data Science seed grant, Meta research award, Google Research Scholar award, and Amazon research award to P.M.; and NIH grant U24HG012090 to T.L. The authors would like to thank Vasileios P. Kemerlis for valuable discussions and feedback.
Author contributions
Conceptualization, G.G.; methodology, Y.A.C., P.M., and G.G.; investigation, Y.A.C., Y.K., and T.L.; writing – original draft, Y.A.C. and G.G.; writing – review & editing, Y.A.C., G.G., T.L., and P.M.; funding acquisition, G.G., P.M., and T.L.; resources, G.G.; supervision, G.G.
Declaration of interests
T.L. is a scientific advisor, has equity in Variant Bio, and has received speaker honoraria from Abbvie.
STAR★Methods
Key resources table
Method details
MPC protocols
Secure multi-party computation with secret sharing
Secure Multi-party Computation (MPC) is a subfield of cryptography that focuses on enabling multiple parties to jointly compute a function over their inputs while keeping those inputs private. In essence, MPC allows parties to collaborate and compute a result without revealing their individual inputs to each other.17,18 This is particularly useful in scenarios where privacy and confidentiality are important. There are a number of techniques MPC utilizes for security including secret sharing,24,25 garbled circuits,17 homomorphic encryption,19 and oblivious transfer.46
privateQTL uses MPC with secret sharing. In secret sharing, inputs are split into shares distributed among the parties. Computation is done on these shares, and the final result is reconstructed from them. Secret sharing allows private data such as genotype and phenotype values to be shared to a pre-determined number of parties such that each party cannot recover the private value without communicating with each other.24,25 Our protocol relies on three-party linear replicative secret sharing based on the Ring modulo protocol as described by Araki et al.47 For example, a secret value $x$ can be shared among three parties $P_1$, $P_2$, and $P_3$ by first dissecting the secret $x$ into three uniformly-sampled random numbers $x_1$, $x_2$, $x_3$ where $x = x_1 + x_2 + x_3$ modulo $2^n$. These are then distributed among computing parties such that $P_1$’s share is $(x_1, x_2)$, $P_2$’s share is $(x_2, x_3)$, and $P_3$’s share is $(x_3, x_1)$. Each computing party holds a partial share of the secret and therefore cannot recover the private data. We denote $[x]$ to represent secret sharing of $x$.
With this secret sharing scheme, the computing parties can jointly compute operations such as additions and multiplications while keeping the data secret. Operations such as addition of two secrets can be done locally: two secrets $[x]$ and $[y]$ can be added locally such that $P_1$’s share becomes $(x_1 + y_1, x_2 + y_2)$, $P_2$’s share becomes $(x_2 + y_2, x_3 + y_3)$, and $P_3$’s share becomes $(x_3 + y_3, x_1 + y_1)$. This allows each computing party to own replicative shares of $[x + y]$. Multiplication and other non-linear operations such as sorting require communication between parties and are often non-trivial. Please see Supplementary Material for more details on MPC multiplication and our usage of the MPC sorting protocol. privateQTL uses secret-sharing-based addition, multiplication, and sorting as building blocks.
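The sharing and local addition steps described above can be illustrated with a short plaintext simulation. This is a minimal sketch under stated assumptions, not the privateQTL implementation: it assumes a 64-bit ring and the replicated layout in which party $P_i$ holds the pair $(x_i, x_{i+1})$, and the helper names `share`, `reconstruct`, and `add` are hypothetical.

```python
import secrets

N_BITS = 64          # assumed ring size; arithmetic is modulo 2**N_BITS
MOD = 1 << N_BITS

def share(x):
    """Split x into three additive shares and give each party a replicated pair."""
    x1 = secrets.randbelow(MOD)
    x2 = secrets.randbelow(MOD)
    x3 = (x - x1 - x2) % MOD
    return [(x1, x2), (x2, x3), (x3, x1)]  # pairs held by P1, P2, P3

def reconstruct(shares):
    """Recombine the secret from the first element of each party's pair."""
    return sum(pair[0] for pair in shares) % MOD

def add(shares_x, shares_y):
    """Each party adds its own pairs element-wise; no communication is needed."""
    return [((a1 + b1) % MOD, (a2 + b2) % MOD)
            for (a1, a2), (b1, b2) in zip(shares_x, shares_y)]

sx, sy = share(12), share(30)
print(reconstruct(add(sx, sy)))  # 42
```

Any single pair is uniformly distributed and reveals nothing about the secret, while any two parties together hold all three additive shares.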
Protocol setup
Our protocol is based on a three-party linear replicative secret sharing scheme, where each data owner secretly shares their data to three computing servers $P_i$. For a given number $n$, we denote $\mathbb{Z}_{2^n}$ as integers modulo $2^n$. Given a secret $x \in \mathbb{Z}_{2^n}$, data owners create three uniformly-sampled random numbers $x_1$, $x_2$, $x_3$ where $x = x_1 + x_2 + x_3$ modulo $2^n$. These are then distributed among computing parties such that $P_1$’s share is $(x_1, x_2)$, $P_2$’s share is $(x_2, x_3)$, and $P_3$’s share is $(x_3, x_1)$. We denote $[x]$ to represent secret sharing of $x$. In addition, $P_i$ shares pseudo-random number generators (PRNG) with neighboring parties: $\mathrm{PRNG}_i$ with $P_{i-1}$, and $\mathrm{PRNG}_{i+1}$ with $P_{i+1}$. These PRNGs are used to mask shared values with randomness throughout the protocol.
While addition is trivial and can be done locally by each $P_i$ adding their local shares, multiplication relies on communication between parties. Given secretly shared $[x]$ and $[y]$, we can compute multiplication as the following.
Protocol 1. MPC Multiplication
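Input. $[x]$, $[y]$
Output. $[z] = [xy]$
each $P_i$
1. Compute $z_i = x_i y_i + x_i y_{i+1} + x_{i+1} y_i$
2. $z_i \mathrel{+}= r_i$
3. $z_i \mathrel{-}= r_{i+1}$
4. Send $z_i$ to $P_{i-1}$
5. Receive $z_{i+1}$ from $P_{i+1}$
Where $r_i$ is generated from $\mathrm{PRNG}_i$, and $r_{i+1}$ is generated from $\mathrm{PRNG}_{i+1}$. Here $P_i$ holds $(x_i, x_{i+1})$ and $(y_i, y_{i+1})$, with indices taken cyclically over $\{1, 2, 3\}$.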
We can confirm that all computing parties indeed have a linear replicative share of $z = xy$ at the end of the protocol. This protocol can be extended to matrix multiplication, given two secretly shared matrices $[X]$ and $[Y]$. The security of addition and multiplication under the ring modulo with semi-honest adversaries has been proven by Araki et al.47
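The correctness claim above can be checked with a plaintext simulation of the multiplication step. The sketch below is illustrative only and runs all three parties in one process: it assumes a 64-bit ring, the replicated layout in which $P_i$ holds $(x_i, x_{i+1})$, and fresh random values standing in for the outputs of the shared PRNGs. The function names are hypothetical.

```python
import secrets

MOD = 1 << 64  # assumed ring size

def share(x):
    """Replicated three-party sharing: P_i holds (x_i, x_{i+1})."""
    x1, x2 = secrets.randbelow(MOD), secrets.randbelow(MOD)
    x3 = (x - x1 - x2) % MOD
    return [(x1, x2), (x2, x3), (x3, x1)]

def reconstruct(shares):
    return sum(pair[0] for pair in shares) % MOD

def multiply(shares_x, shares_y):
    """Simulate one multiplication round on replicated shares."""
    # r[i] stands in for the value drawn from the PRNG that P_i shares with P_{i-1}.
    r = [secrets.randbelow(MOD) for _ in range(3)]
    z = []
    for i in range(3):
        (xi, xn), (yi, yn) = shares_x[i], shares_y[i]
        zi = (xi * yi + xi * yn + xn * yi) % MOD      # local cross terms
        zi = (zi + r[i] - r[(i + 1) % 3]) % MOD       # masks cancel when summed
        z.append(zi)
    # Communication step: P_i sends z_i to P_{i-1} and receives z_{i+1} from P_{i+1}.
    return [(z[i], z[(i + 1) % 3]) for i in range(3)]

product = multiply(share(6), share(7))
print(reconstruct(product))  # 42
```

The three local terms together cover all nine cross products $x_a y_b$, and the masks $r_i - r_{i+1}$ sum to zero over the three parties, so the masked values form a valid sharing of the product.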
Dataset
In this study we used GTEx, Geuvadis and Taylor et al. datasets.4,26,27
We obtained both the transcripts per million (TPM) and read count data for whole blood from the GTEx public data portal (GTEx Portal).4 The controlled-access post-QC genotype was obtained from dbGAP with 46,569,704 total variants (accession code: phs000424). Genotype data was filtered to retain bi-allelic and autosomal SNPs. We only retained blood samples that have matched gene expression data (670 samples). Additionally, we filtered the genotypes with missingness of 0.15, variant quality <15, and minor allele frequency of <5%. We processed the gene expression data according to the pipeline outlined by the GTEx consortium.4 This involved retaining the expression data based on the following thresholds: TPM in at least 20% of samples, read counts in at least 20% of samples. To mimic a multi-site eQTL study, we split the samples into three sites using two split ratios: (300,300,70) and (300,250,120).
The Geuvadis project26 utilized lymphoblastoid cell lines (LCL) from 462 individuals of the 1000 Genomes Project.48 From their publicly available data deposit,49 we acquired read count data encompassing 58,813 genes across 462 samples. After aligning these data with the 1000 Genomes Project genotype data and excluding samples used in the Taylor et al., 2024 study, we selected a subset of 150 samples. Gene expression data was filtered based on the following criteria: TPM in at least 20% of the samples and read counts in at least 20% of the samples.
Similarly, Taylor et al.27 used LCLs from 731 individuals from the 1000 Genomes Project. We obtained their filtered read count data, which included 20,154 genes and 734 samples. We selected 300 samples that have matching 1000 Genomes genotype data and are different than the Geuvadis samples we selected. The gene expression data for this subset were filtered based on the same thresholds: TPM in at least 20% of samples and read counts in at least 20% of samples.
For the genotype data, we matched 5,751,798 SNPs from the 1000 Genomes Project with GTEx, selecting only autosomal and biallelic SNPs. Duplicated SNPs in the vcf were removed from the GTEx, Geuvadis, and Taylor et al. genotype data to ensure consistency across datasets.
For the demonstration of the accuracy of our tools, we first used all 670 samples from GTEx. For the investigation of accuracy under batch effect, we used 250 samples from GTEx samples to participate in a federated study with Geuvadis and Taylor et al. datasets.
Plaintext pre-processing method comparison
The GTEx study uses PCA for genotype covariate correction, TMM for expression data normalization, and PEER for gene expression covariate correction. We chose to implement RLE normalization in MPC instead of TMM normalization, because a systematic comparison of normalization methods for eQTL analysis showed similar results for these two methods,50 and RLE normalization chooses a pseudo-reference as a point of reference while TMM uses a real sample to compare library size factors. In order to protect sample privacy across all samples, we chose a method that will not reveal a real sample at any point during normalization. Principal components from PCA are used instead of PEER factors for gene expression covariate correction, as previous studies report that it is more scalable, interpretable, and robust compared to PEER factors.33 In addition, we conducted an experiment to compare eGenes overlap with different combinations of genotype-gene expression pre-processing methods: (1) projection onto reference genotype or aggregated PCA for genotype correction, (2) TMM, RLE, or QN for gene expression normalization, and (3) PCA or PEER for gene expression covariate correction (see Table S1). We found that projecting genotypes onto reference for genotype corrections results in a 95% eGenes overlap with aggregated PCA, RLE normalization results in a 97.24% overlap with TMM normalization, and PCA for gene expression covariate correction results in 95.66% overlap with PEER.
privateQTL
There are two versions of privateQTL, based on different privacy needs: privateQTL-I assumes phenotype data is open to the public, and privateQTL-II considers phenotype data as private. Both versions of privateQTL keep genotype data as private. Our pipeline for eQTL mapping can be broken down as (1) genotype population stratification correction, (2) phenotype data normalization, (3) phenotype covariate correction, and (4) eQTL mapping. privateQTL-I computes (1) and (4) in a privacy-preserving manner, while privateQTL-II assumes all data is kept private throughout the entire process. We describe detailed methodologies of each step below.
In the beginning of the protocol, each data client sends its set of gene and SNP IDs to a designated computing party, which calculates the intersection and shares it with the clients. This functionality also includes automatic filtering of each client’s data based on the common genes and SNPs identified. It is important to note that gene and SNP names or IDs do not contain private information.
Privacy-preserving population stratification
Both versions of privateQTL assume genotype data is private. To ensure genotype privacy while correcting for population stratification, we adapted a local projection method30 where each data owner locally projects their genotypes onto the principal components of a publicly available reference genotype data (e.g., 1000 genomes project48). Given a public reference of size $N \times M$, where $N$ is the number of samples and $M$ is the number of SNPs in the reference genotype, PCA is performed on the reference genotype to create an $M \times k$ loading matrix for $k$ PCs. Local genotypes of size $L \times M$ are then projected onto this loading matrix to retrieve an $L \times k$ matrix that becomes the covariate matrix for $L$ local samples and $k$ PCs for that data owner. The genotypes were then locally residualized using the covariate matrix. For privateQTL, we utilized the 1000 Genomes Project (1KGP)48 genotype data as our reference, and PLINK PCA43 to obtain the loading matrix.
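The projection and residualization steps can be sketched with NumPy as follows. This is a simplified illustration under several assumptions: the reference and local genotypes are random toy matrices, the loading matrix comes from a plain SVD of the mean-centered reference rather than from PLINK, and residualization is an ordinary least-squares fit with an intercept. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical public reference panel: N samples x M SNPs, genotypes in {0, 1, 2}.
reference = rng.integers(0, 3, size=(50, 20)).astype(float)
ref_mean = reference.mean(axis=0)

# PCA on the centered reference; rows of vt are principal axes in SNP space.
k = 3
_, _, vt = np.linalg.svd(reference - ref_mean, full_matrices=False)
loadings = vt[:k].T                           # M x k loading matrix

# A data owner's private genotypes: L samples x M SNPs (these never leave the site).
local = rng.integers(0, 3, size=(12, 20)).astype(float)
covariates = (local - ref_mean) @ loadings    # L x k projected PCs

# Residualize each SNP on the projected PCs plus an intercept.
design = np.column_stack([np.ones(len(local)), covariates])
beta, *_ = np.linalg.lstsq(design, local, rcond=None)
residuals = local - design @ beta

print(loadings.shape, covariates.shape, residuals.shape)
```

Only the public loading matrix and reference means are needed at each site, so no genotype data is exchanged during this step.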
Privacy-preserving gene expression pre-processing for privateQTL-II
privateQTL-II protects the privacy of both genotypes and gene expression values. For that, gene expression data should be pre-processed (normalization and covariate correction) in a federated and privacy-preserving fashion across data centers. Here we developed two different MPC-based options for gene expression normalization for users to choose from. We also developed an MPC-based inverse normal transform, which is a required step for both normalization methods.
Privacy-preserving and federated Quantile Normalization (QN)
QN requires computing the rank of the genes based on their expression value for each sample and then replacing each rank with their across-sample rank average in a non-secure setting. Since all participating institutions have the same number of genes in their gene expression matrix, we can perform the across gene ranking for each sample locally. This ranked matrix is summed up across samples according to rank such that for M genes, there are M rank sums that are the local sum of expression values for that data owner’s samples. They then secretly share the sum of gene expression values for each rank and the number of samples with the computing parties. The servers use this information to calculate the gene expression values for each rank. They do this by combining the secretly shared sums and the number of samples per site. Afterward, the servers return the average gene expression values per rank to each site. Each site can then re-order their gene expression values based on this global rank (Figure S2A).
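The data flow of federated QN can be sketched in plaintext as follows. This is a simplified illustration, not the secure implementation: the aggregation in `global_rank_means` is what the computing parties would evaluate on secret shares, ties are broken by sort order, and all function names and toy values are hypothetical.

```python
def local_rank_sums(site):
    """site: list of samples, each a list of M expression values.
    Returns the per-rank sums over this site's samples and the sample count."""
    m = len(site[0])
    sums = [0.0] * m
    for sample in site:
        for rank, value in enumerate(sorted(sample)):
            sums[rank] += value
    return sums, len(site)

def global_rank_means(contributions):
    """Combine per-site rank sums and sample counts into per-rank averages."""
    total_n = sum(n for _, n in contributions)
    m = len(contributions[0][0])
    return [sum(sums[r] for sums, _ in contributions) / total_n for r in range(m)]

def apply_rank_means(site, means):
    """Replace each value with the global mean of its within-sample rank."""
    out = []
    for sample in site:
        order = sorted(range(len(sample)), key=lambda g: sample[g])
        normalized = [0.0] * len(sample)
        for rank, gene in enumerate(order):
            normalized[gene] = means[rank]
        out.append(normalized)
    return out

# Two hypothetical sites with three genes each.
site_a = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
site_b = [[3.0, 9.0, 6.0]]
means = global_rank_means([local_rank_sums(site_a), local_rank_sums(site_b)])
normalized_b = apply_rank_means(site_b, means)
print(means, normalized_b)
```

Each site only contributes M rank sums and its sample count, so individual expression values are never sent in the clear.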
Privacy-preserving and federated RLE normalization
RLE normalization finds a pseudo-reference representative of the samples to find library size factors.32 This pseudo-reference is defined as a geometric mean across samples where $n$ is the total number of samples (see denominator in Equation 1). In our privacy-preserving version, this pseudo-reference sample can be shared in plaintext among sites as it does not represent an actual sample. Each data owner locally calculates the log values of their gene expression values, $\log X_k$ for data owner $k$ ($D_k$), and secretly shares the log matrix to the three computing parties such that the computing parties have shares of the phenotype matrix; thus $P_1$’s share becomes ($[\log X]_1$, $[\log X]_2$). The computing parties then calculate the sum of log values across the total samples $n$ as $[\sum_{v=1}^{n} \log x_{iv}]$. This log sum vector is revealed in plaintext to $P_1$, which then calculates the average of the log values to obtain the geometric mean across samples. This vector is designated as the pseudo-reference sample, and is returned in plaintext back to the data owners. The data owners then locally compute the log ratio between the gene expression values of their samples and the pseudo-reference. The scaling factor is then the median of these ratios for each sample, defined as below for sample $j$ and gene $i$ (Figure S2B).
$$s_j = \operatorname*{median}_{i} \frac{x_{ij}}{\left(\prod_{v=1}^{n} x_{iv}\right)^{1/n}}$$ (Equation 1)
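Protocol 2. RLE normalization
Input. $D_k$’s phenotype matrix $X_k$
Output. per-sample size factors $s_j$ as in (1)
each $D_k$
1. Compute $\log X_k$.
2. Create shares of matrix $\log X_k$ and send to $P_i$s.
each $P_i$
1. Receive shares from $D_k$s.
2. Calculate $[\sum_{v=1}^{n} \log x_{iv}]$ for $n$ total samples and reveal to $P_1$.
$P_1$ do
1. Aggregate from $P_2$ and $P_3$.
2. Return pseudo-reference vector $\frac{1}{n}\sum_{v=1}^{n} \log x_{iv}$ to $D_k$s.
each $D_k$
1. Exponentiate to obtain $\left(\prod_{v=1}^{n} x_{iv}\right)^{1/n}$.
2. Locally calculate $s_j$.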
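The arithmetic of the RLE protocol can be followed in a plaintext sketch. This is an illustration only: the cross-site sum of log values, which the computing parties would evaluate on secret shares, is computed directly here, all expression values are assumed to be strictly positive, and the function names and toy matrices are hypothetical.

```python
import math
from statistics import median

def log_matrix(site):
    """Local step: element-wise log of a site's expression values (samples x genes)."""
    return [[math.log(v) for v in sample] for sample in site]

def mean_log_reference(log_matrices):
    """Per-gene mean of log expression over all samples from all sites.
    In the protocol, the underlying sum is computed on secret shares."""
    samples = [s for site in log_matrices for s in site]
    n, m = len(samples), len(samples[0])
    return [sum(s[g] for s in samples) / n for g in range(m)]

def size_factors(site, mean_logs):
    """Local step: median over genes of expression / geometric-mean pseudo-reference."""
    reference = [math.exp(v) for v in mean_logs]
    return [median(x / ref for x, ref in zip(sample, reference)) for sample in site]

# Two hypothetical sites, three genes each.
site_a = [[1.0, 2.0, 8.0], [4.0, 2.0, 1.0]]
site_b = [[16.0, 2.0, 8.0]]
mean_logs = mean_log_reference([log_matrix(site_a), log_matrix(site_b)])
factors_a = size_factors(site_a, mean_logs)
factors_b = size_factors(site_b, mean_logs)
print(factors_a, factors_b)
```

In this toy example the geometric means per gene are 4, 2, and 4, so the third sample, whose values are mostly twice the pseudo-reference, receives a size factor close to 2.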
Privacy-preserving inverse normal transform
Inverse normal transform refers to the process of converting values into z-scores such that the resulting data follows a standard normal distribution. For eQTL mapping, after normalization, each gene vector is ranked across samples, and their rank percentiles are converted into corresponding z-scores. There are two challenges to consider when implementing inverse normal transform in MPC: (1) retrieving the rank vector, which requires comparison of secretly shared values, and (2) converting our rank vector into z-scores through a non-trivial process that often requires look-up tables even in plaintext. In our setting, look-ups are not possible since our query is kept in secret.
In order to obtain the rank vector while the expression vector is secretly shared, we adapt a secure sorting algorithm from Asharov et al.44 that outputs a destination permutation vector. Here, the destination permutation vector is the permutation that sorts the original vector when applied: its i-th entry gives the position of the i-th element in the sorted order. For example, ρ = (3, 4, 1, 2, 5) sorts the original vector (3.2, 5.6, 1.7, 3.1, 6.2) into (1.7, 3.1, 3.2, 5.6, 6.2). Our protocol uses this destination permutation vector as our rank vector. Briefly, the protocol takes in a bit-wise decomposed secret vector and conducts a secure radix sort starting from the least significant bit. Instead of generating rank percentiles from the permutation vector, we use the permutation vector output by this protocol to reorder a given set of ordered z-scores.
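The destination permutation in the example above can be reproduced in plaintext. The sketch below uses an ordinary sort in place of the secure radix sort, and the helper names are hypothetical.

```python
def destination_permutation(values):
    """Return 1-based destinations: rho[i] is the sorted position of values[i]."""
    # Indices of the elements in ascending order (Python's sort is stable).
    order = sorted(range(len(values)), key=lambda i: values[i])
    rho = [0] * len(values)
    for position, original_index in enumerate(order, start=1):
        rho[original_index] = position
    return rho

def apply_destination(rho, values):
    """Move values[i] to slot rho[i], which yields the sorted vector."""
    out = [None] * len(values)
    for i, destination in enumerate(rho):
        out[destination - 1] = values[i]
    return out

x = [3.2, 5.6, 1.7, 3.1, 6.2]
rho = destination_permutation(x)      # [3, 4, 1, 2, 5]
x_sorted = apply_destination(rho, x)  # [1.7, 3.1, 3.2, 5.6, 6.2]
```

Because each entry of the destination permutation is the rank of the corresponding element, it can serve directly as the rank vector.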
Our MPC-based Inverse Normal Transform protocol takes in the resulting secretly shared permutation vector from secure sorting method, a secretly shared identity permutation , and secretly shared ordered z-scores as input. With a known number of total samples N, ordered z-scores can be viewed as a set of N values that can be designated prior to the protocol. The output of our protocol is a secretly shared set of re-ordered z-scores. We first apply onto to obtain . can be applied to the ordered z-scores to reorder the z-scores to their data-corresponding position. We use the APPLYPERM protocol detailed in Asharov et al.44 to apply secretly shared permutations on secretly shared vectors. This requires masking both and by a random secretly shared permutation , and revealing the randomly masked vector to apply on . The resulting secretly shared reordered Z score vector is used for the eQTL mapping step below (Figure S2C).
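In plaintext, reordering the pre-computed z-scores by the rank vector amounts to a simple index lookup, as in the minimal sketch below. The quantile rule (r - 0.5) / N used to generate the ordered z-scores is an assumption made for illustration, since other offsets are also in common use. The function names are hypothetical, and nothing here is secret shared.

```python
from statistics import NormalDist

def ordered_z_scores(n):
    """N ordered z-scores that can be fixed before the protocol starts.

    Assumption: rank r is mapped to the quantile (r - 0.5) / n.
    """
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.5) / n) for r in range(1, n + 1)]

def reorder_z_scores(rho, z_sorted):
    """Give sample i the z-score found at its 1-based rank rho[i]."""
    return [z_sorted[rank - 1] for rank in rho]

# Rank vector of (3.2, 5.6, 1.7, 3.1, 6.2), as in the sorting example.
rho = [3, 4, 1, 2, 5]
z = reorder_z_scores(rho, ordered_z_scores(len(rho)))
```

The sample with the median expression value (rank 3 of 5) receives a z-score of zero, and the smallest and largest samples receive symmetric negative and positive z-scores.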
Privacy-preserving and federated eQTL mapping
privateQTL maps eQTLs one gene at a time, and the mapping is parallelized so that multiple genes can be processed at once. To do so, each data center locally reduces its genotype matrix to a cis-window genotype matrix, which contains only the SNPs within the 1 Mb cis-window of the gene. After fully pre-processing the cis-window genotype matrix and gene expression vector, the data owners center and normalize them to have zero mean and unit sum of squares before secretly sharing them with the three computing servers. This ensures that the correlation matrix can be obtained by matrix multiplication, as detailed in Matrix eQTL.45 The centering and normalization step requires the mean and variance of each matrix to be shared publicly with the data owners. For the inverse normal transformed phenotype vector, the mean and variance are already known because the expression values were replaced by z-scores. The data owners (s) can locally normalize gene expression values as the following: . For the genotype matrix, each sums up the genotype values (G) for their local samples (M) per variant and shares to . aggregates the sum values and returns the total aggregate sum back to the s, who then calculate the mean . In a similar manner, each aggregates their local sum of , sends to , and receives the total aggregate . The s locally calculate from this aggregate, and ultimately calculate for each variant.
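The reason for scaling to zero mean and unit sum of squares can be seen in a small plaintext sketch: once both vectors are scaled this way, their dot product equals the Pearson correlation. The aggregation party is simulated here by plain sums over per-site lists, and the function names are hypothetical.

```python
import math

def federated_mean_and_scale(site_vectors):
    """Combine per-site sums into a global mean and root sum of squares."""
    n_total = sum(len(v) for v in site_vectors)
    # Round 1: each site contributes its local sum.
    mean = sum(sum(v) for v in site_vectors) / n_total
    # Round 2: each site contributes its local sum of squared deviations.
    sum_sq = sum(sum((x - mean) ** 2 for x in v) for v in site_vectors)
    return mean, math.sqrt(sum_sq)

def standardize(site_vectors):
    """Scale every site's values to global zero mean and unit sum of squares."""
    mean, scale = federated_mean_and_scale(site_vectors)
    return [[(x - mean) / scale for x in v] for v in site_vectors]

def dot(site_a, site_b):
    """Dot product over all samples of two per-site standardized vectors."""
    return sum(a * b
               for va, vb in zip(site_a, site_b)
               for a, b in zip(va, vb))

genotype = standardize([[0.0, 1.0], [2.0, 1.0, 0.0]])
expression = standardize([[3.0, 5.0], [7.0, 5.0, 3.0]])
r = dot(genotype, expression)  # Pearson correlation across all five samples
```

Only sums leave each site in this sketch. Applying the same scaling to every SNP column turns the full correlation matrix into a single matrix product.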
The fully pre-processed genotype matrix and gene expression vector are then secretly shared to the computing parties. The computing parties first shuffle the gene vector 1,000 times with secretly shared permutation π, where and represents composing permutations where one permutation is applied on another. We denote as secretly shared permutation π where holds . The shuffling protocol was adapted from Asharov et al.,44 where two of the three computing parties iteratively apply their permutation and reshare to the third computing party, after masking with randomness generated by their shared pseudorandom number generator (PRNG). The resulting gene expression matrix has 1,001 columns, where the first column is the original gene expression vector and the remaining columns are the shuffled gene expression vectors used to build the null distribution (Figure S1). The computing parties then multiply these two secretly shared matrices to obtain the correlation matrix R (see MPC multiplication protocol in supplemental information).
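The layout of the permuted phenotype matrix and the resulting correlation matrix can be sketched in plaintext as follows. Here a seeded `random.Random` stands in for the secretly shared permutations, vectors are assumed to be already centered and scaled to unit sum of squares, and the function names are hypothetical.

```python
import random

def build_permuted_matrix(expression, n_perm, seed=0):
    """Return n_perm + 1 columns: the original vector, then shuffled copies."""
    rng = random.Random(seed)
    columns = [list(expression)]
    for _ in range(n_perm):
        shuffled = list(expression)
        rng.shuffle(shuffled)  # plaintext stand-in for a secret permutation
        columns.append(shuffled)
    return columns

def correlation_matrix(snp_columns, phenotype_columns):
    """R[s][k] is the dot product of SNP s with phenotype column k.

    For vectors with zero mean and unit sum of squares, each entry is a
    Pearson correlation.
    """
    return [[sum(g * p for g, p in zip(snp, pheno))
             for pheno in phenotype_columns]
            for snp in snp_columns]
```

With `n_perm=1000`, `build_permuted_matrix` returns the 1,001 columns described above, and column 0 of `R` holds the observed correlations while the remaining columns form the null.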
This R matrix is revealed to , and computes the effect sizes, p values, beta-distribution modeling, true degrees of freedom, permutation adjusted p values, and beta-distribution adjusted p values. These downstream statistics were computed as in tensorQTL.28
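Two of these downstream statistics can be sketched directly from R. The code below is a minimal illustration and not the tensorQTL implementation: it assumes the first column of R holds the observed correlations, uses the standard conversion of a correlation to a t statistic, and computes a plain empirical permutation p value from the strongest association per column. The beta-distribution approximation is omitted, and the function names are hypothetical.

```python
import math

def t_statistic(r, dof):
    """Convert a correlation r into a t statistic with dof degrees of freedom."""
    return r * math.sqrt(dof / (1.0 - r * r))

def permutation_p_value(R):
    """Empirical gene-level p value from the correlation matrix R.

    R[s][0] is the observed correlation for SNP s; R[s][1:] are the
    correlations with the shuffled phenotype vectors. The best squared
    correlation across SNPs is compared against the best value in each
    permutation.
    """
    n_perm = len(R[0]) - 1
    best_observed = max(row[0] ** 2 for row in R)
    best_null = [max(row[k] ** 2 for row in R) for k in range(1, n_perm + 1)]
    exceed = sum(1 for value in best_null if value >= best_observed)
    return (exceed + 1) / (n_perm + 1)

R_example = [[0.9, 0.1, -0.2, 0.3],
             [0.2, 0.5, 0.1, -0.95]]
p_perm = permutation_p_value(R_example)  # 1 of 3 permutations exceeds: 2/4
```

Adding one to both the numerator and the denominator keeps the empirical p value strictly positive.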
Benchmarking privateQTL with other eQTL mapping techniques
We used tensorQTL on aggregated data to benchmark our tool, where genotype PCA was done without projection onto reference and phenotype was RLE normalized and corrected for covariates with PCA (mimicking the GTEx pipeline). We chose RLE normalization instead of the TMM normalization used by the GTEx pipeline since RLE and TMM have similar performance for eQTL mapping and RLE does not require an actual sample as a reference.51
We set meta-analysis as our privacy-preserving comparator, in which each data owner conducts their own eQTL mapping study and summary statistics are aggregated afterward. In the meta-analysis, each genotype dataset was projected onto the reference genome and residualized, and the phenotype data were locally RLE normalized and residualized with their own PCs. We assumed that each study used tensorQTL as its eQTL mapping method. For summary statistic aggregation, we used inverse-variance-based METAL with fixed-effect modeling.29 For multiple hypothesis testing correction in meta-analysis, we adapted the method used in Zeng et al.,9 which uses Bonferroni correction for variant-level false discovery control, since permutation-based modeling is not an option when data cannot be shared across sites. It then uses the Benjamini-Hochberg method52 for gene-level false discovery control. Genes with Benjamini-Hochberg adjusted p values less than 0.05 were deemed eGenes, and variants with Bonferroni adjusted p values less than 0.05 were deemed eVariants.
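The building blocks of this comparator are standard and can be written compactly. The sketch below shows textbook versions of inverse-variance-weighted fixed-effect pooling, Bonferroni adjustment, and Benjamini-Hochberg adjustment. It is not the METAL code or the Zeng et al. pipeline, and the function names are hypothetical.

```python
import math

def ivw_fixed_effect(betas, standard_errors):
    """Pool per-study effect sizes with inverse-variance weights."""
    weights = [1.0 / (se * se) for se in standard_errors]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled_beta, pooled_se, pooled_beta / pooled_se  # beta, SE, z

def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Step-up FDR adjustment; returns adjusted p values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In a two-level scheme of the kind described above, `bonferroni` would be applied to the variants tested for one gene and `benjamini_hochberg` to the resulting per-gene minimum p values.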
We also compared privateQTL results with another widely used non-secure mapping method, LIMIX, which employs a univariate association test within a linear mixed model framework.37 In our study, we used the LIMIX (v.2.0.x) st_scan function with the same 1 Mb cis-window as in the other methods used in this work. We used eigenMT42 to correct for gene-level multiple hypothesis testing and Storey’s q values53 for variant-level false discovery control.
Meta-analysis vs. privateQTL in eQTL mapping: A comparative analysis
Meta-analysis allows each data center to independently perform eQTL mapping on their respective datasets, followed by the aggregation of effect sizes and significance levels using sample sizes and standard errors from individual studies. However, because each mapping instance relies on a smaller sample size than the total dataset, the statistical power of each study is inherently lower than that of privateQTL, which utilizes the entire dataset without sharing raw data. Our hypothesis is that a privacy-preserving approach, such as privateQTL, which integrates all samples into a single high-powered eQTL mapping analysis, will outperform meta-analysis that aggregates multiple low-powered studies.
Evaluating scenarios of low power in meta-analysis
To investigate scenarios where meta-analysis fails to identify significant eQTLs, we compared the eGenes identified by the plaintext GTEx pipeline and privateQTL with those missed by meta-analysis.
Comparison of eGenes across methods
- GTEx and privateQTL-I vs. Meta-Analysis: We identified 497 eGenes that were consistently detected by both the GTEx pipeline and privateQTL-I but were missed by meta-analysis. When plotting the p-values of the top matching eVariants (beta-adjusted for GTEx and privateQTL, Bonferroni-adjusted for meta-analysis), meta-analysis showed unreliability in predicting significance (Figures S9A–S9C). Despite analyzing the same gene-variant pairs, GTEx and privateQTL exhibited high conformity, whereas meta-analysis predicted most top eVariants as insignificant.
- GTEx and privateQTL-II vs. Meta-Analysis: Similarly, among 384 eGenes detected as significant by GTEx and privateQTL-II but not by meta-analysis, the p-values of the top eVariants were predominantly insignificant in the meta-analysis (Figures S9D–S9F). These findings further demonstrate that aggregating summary statistics alone is insufficient to compensate for the reduced statistical power resulting from smaller sample sizes in individual studies.
Statistical implications of sample size in eQTL mapping
We examined the number of eGenes identified by individual data centers and by combinations of centers, comparing their overlap with the eGenes identified by the aggregated GTEx pipeline (Table S2).
- Proportional Detection by Sample Size: The number of eGenes identified by each data center correlated with their sample sizes. However, there was minimal overlap in eGenes consistently detected across two or more data centers. This shows the unreliability of using locally mapped eGenes for robust significance assessment in meta-analysis.
- Meta-Analysis vs. Larger Sample Datasets: Although the meta-analysis identified more eGenes than any combination of individual data centers, its overlap with eGenes detected by the Taylor et al. dataset (which had the largest sample size) was notably smaller. This suggests that meta-analysis disproportionately penalizes larger datasets, as summary statistics from smaller data centers negatively impact overall significance predictions.
Our analysis strongly supports the idea that aggregating summary statistics alone is insufficient to overcome the limitations imposed by small sample sizes during eQTL mapping. Even when incorporating sample size information, meta-analysis does not reliably predict statistical significance. A unified, privacy-preserving approach like privateQTL, which leverages the full dataset without data sharing, offers a more reliable and statistically robust solution for eQTL mapping.
Security analysis
Security assumption
We rely on a semi-honest security model in which the computing parties are assumed to be honest-but-curious. The parties are non-colluding and will abide by protocol instructions, but may try to gain information based on their knowledge.
Network communication is essential for MPC operations, but our three-party secret sharing scheme ensures that throughout computation, the computing parties only access random shares of the secret (i.e., original data). If the network connection is interrupted, the parties can simply re-establish communication channels and seamlessly resume operations. Additionally, our MPC framework supports resharing methods. This allows the framework to enable the remaining two parties to reshare the secret using their own shares after masking them with randomness if one of the parties’ connection is disrupted.
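The property that any two of the three parties hold enough shares to recover or reshare a secret can be illustrated with a small sketch of replicated additive secret sharing. The ring size, the pairing of shares, and all function names are assumptions made for this example in the style of Araki et al.47, and they do not reproduce the privateQTL codebase.

```python
import secrets

MOD = 2 ** 64  # assumed ring size, for illustration only

def share(secret):
    """Split a secret into x1 + x2 + x3 (mod MOD); party i holds (x_i, x_{i+1})."""
    x1 = secrets.randbelow(MOD)
    x2 = secrets.randbelow(MOD)
    x3 = (secret - x1 - x2) % MOD
    return [(x1, x2), (x2, x3), (x3, x1)]

def reconstruct(i, share_i, j, share_j):
    """Recover the secret from the pairs held by two distinct parties i and j."""
    if (i + 1) % 3 != j:
        # Reorder so that the second party follows the first in the ring.
        i, share_i, j, share_j = j, share_j, i, share_i
    return (share_i[0] + share_i[1] + share_j[1]) % MOD

def add_shares(shares_a, shares_b):
    """Addition needs no communication: every party adds its own pairs."""
    return [((a0 + b0) % MOD, (a1 + b1) % MOD)
            for (a0, a1), (b0, b1) in zip(shares_a, shares_b)]

shares = share(42)
recovered = reconstruct(0, shares[0], 2, shares[2])  # any two parties suffice
```

A single pair is missing one uniformly random summand and therefore reveals nothing about the secret on its own, while two remaining parties can jointly recover the value and share it again after a disconnection.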
As mentioned above, we assume honest or semi-honest parties in our MPC protocol, where each party adheres to the protocol and remains online, albeit curious to recover information given their shares. Parties that are interrupted from the connection and thus deviate from the protocol can be modeled as an adversary. We note that our three-party secret sharing scheme does not achieve “guaranteed output delivery (GOD)”, which guarantees successful completion of the MPC operations amongst the honest parties despite the adversary’s behavior. However, it achieves “security with abort,” where malicious parties can cause honest parties to abort without learning the secret.
Software security
We rigorously tested our codebase for potential issues such as memory leaks, buffer overflows, undefined memory access, heap corruption, and dangling pointers using the industry-standard tool Valgrind. These tests revealed no errors or leaks (detailed logs are available on our GitHub page). Furthermore, the use of MPC with secret sharing provides robust security guarantees under the honest-but-curious security model, ensuring that computing servers only receive randomized, partial shares of the original data. As a result, any potential information leakage during the pipeline would only expose randomness to an attacker. Below, we also evaluated the security assumptions of both versions of our tool separately by examining the conditions under which they remain secure.
Security of privateQTL-I
privateQTL-I follows current data sharing standards, where the full gene expression matrix P is shared in public. According to the current NIH data sharing standards (e.g., those adopted by GTEx Consortium), the resulting eQTL list is also shared in public. In privateQTL-I, the gene expression data is revealed in full to the data owners, but not to the computing parties as they are secretly shared to the s at the beginning of the protocol. The data owners also know the mean and variance of the genotype and gene expression data (, and , ). However, this is equivalent to (1) knowing the allele frequencies of SNPs, and (2) μ and σ from a standard normal distribution for post-processed gene expression matrix. At the end of eQTL mapping, the correlation matrix is revealed to to calculate downstream statistics in plaintext. Because G and P are secretly shared to , cannot recover either matrix from . More specifically, knows their share of the genotype and phenotype , and the final result . Using the distributive properties of the dot product, we can expand the final result as the following:
| (Equation 2) |
From (2), we can determine that has more unknown terms than known terms . Therefore, cannot recover the values of the full genotype G and phenotype P. The summary statistics that are returned to the data owners are: effect size, nominal p value, empirical (beta) p value, learned degree of freedom, and shape of empirical beta distribution.
Here we analyze whether any private information about the genotypes of individuals in other data centers leaks when a single data center has access to: (1) their own share of the genotype matrix and the full phenotype matrix in public, (2) mean and variance of both matrices across all data centers, and (3) total number of samples N. For data owner ’s local M samples, they know the dot product of their own genotype and phenotype for SNP i and gene j: and . The genotype and gene expression of other centers are denoted as X and Y, respectively. Given that slope (effect size) is calculated as where is the dot product of particular genotype i and gene expression vector j, can recover the dot product from known and , and ultimately (3) by eliminating their own data’s dot product. From the mean of the genotype matrix, can recover (4). From the of genotype matrix and with information from (4), can recover (5).
| (Equation 3) |
| (Equation 4) |
| (Equation 5) |
(3), (4) and (5) together can be represented as (6), where values A, B, C are known.
| (Equation 6) |
Although all gene expression values are known to the data owner in privateQTL-I, the system of equations above cannot be uniquely solved. Because the left-hand-side matrix is not square, it cannot be inverted to directly solve for the unknown genotype values , , … .
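A small numerical example makes the non-uniqueness concrete. It assumes, following the description of (3)–(5), that the three known quantities are a dot product with the known phenotype values, a sum, and a sum of squares. The vectors below are toy numbers chosen for illustration and are not derived from any dataset.

```python
def known_summaries(x, y):
    """Quantities assumed available to the attacker: dot product, sum, sum of squares."""
    dot_product = sum(xi * yi for xi, yi in zip(x, y))
    total = sum(x)
    sum_of_squares = sum(xi * xi for xi in x)
    return dot_product, total, sum_of_squares

y_known = [0, 1, 2, 3]        # phenotype values of four other-site samples
x_first = [3, -1, 1, 5]       # one candidate genotype vector
x_second = [1, 3, -1, 5]      # a different candidate genotype vector

summary_first = known_summaries(x_first, y_known)    # (16, 8, 36)
summary_second = known_summaries(x_second, y_known)  # (16, 8, 36)
```

Both candidates reproduce the same three summaries even though they differ in three of the four positions, so the summaries alone cannot identify the underlying values. With only four unknowns there is already a continuum of such solutions, and the number of unknowns in practice is far larger.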
The number of unknowns may change if SNP i is within the cis-window of multiple genes. If there is another gene k with unknown phenotypes that has SNP i in its cis-window, then we can retrieve the following equation:
| (Equation 7) |
(3), along with (7), provides more information about , , … . To assess how many such equations can exist and thereby reduce the number of unknowns, we calculated the maximum number of genes for which a given SNP is included within a cis-window. With the GTEx genotype, the maximum number of genes whose cis-window contains a given SNP is 123. This number represents the maximum number of equations like (3) or (7) that can be produced in the worst-case scenario. The average number of genes that share an SNP is 15. If is greater than 123 in the worst case and 15 in the average case, there are more unknowns than equations and the system cannot be uniquely determined. In the worst case, the genotype cannot be recovered as long as each site holds at least 62 samples. The assumption of privateQTL is that data is aggregated from multiple sites such that is at least 250–300 samples. Therefore, the system is underdetermined, and must use approximation to guess the values of X. Even then, because the values of X are continuous real numbers (due to centering and normalizing the genotype matrix), has too many unknowns.
Security of privateQTL-II
As is the case with privateQTL-I, in privateQTL-II, data owners have access to (1) their own share of the genotype matrix, (2) mean and variance of both genotype and gene expression matrices across all data centers, and (3) total number of samples N. Unlike privateQTL-I, privateQTL-II assumes that the gene expression matrix is private, and are unknown to . (6) now has unknown variables, which makes it nearly impossible to recover the values of X. QN and RLE normalization steps require revealing sums of gene expression values across samples, which requires an attacker to determine continuous numbers that add up to the sum. Because the phenotypes in (3) and (7) are also unknown, the number of unknowns remains throughout each iteration of genes. Therefore, the number of unknowns is consistently higher than the number of equations.
Quantification and statistical analysis
Simulation study
We conducted a simulation study to directly compare the phenotype normalization and covariate correction methods used in privateQTL against an aggregated method (GTEx pipeline) and meta-analysis. The data simulation is modeled after Zhou et al.’s study,33 with added data splits of (300, 250, 120) and (300, 300, 70) for 670 GTEx whole blood samples. To simulate data, we chose 1000 random genes from the GTEx whole blood data and chose the 1000 SNPs that are closest in distance to each gene. For n = 670 samples, p = 1000 genes, q = 1000 local SNPs per gene and k covariates, we simulate the expression matrix Y as a linear combination of the genotype matrix G, covariate matrix X, and error E as in Zhou et al.
| (Equation 8) |
is denoted as gene-wise multiplication. Each component of the simulation is detailed below.
- G is a 670 (samples) × 1000 (SNPs) × 1000 (genes) matrix created from real GTEx whole blood data, where 1000 genes were taken from the phenotype matrix, and genotypes of the 1000 closest SNPs were chosen based on distance from gene position.
- I is a binary effect indicator matrix modeled from for each gene.
- is an effect size matrix for I, drawn from .
- X is a covariate matrix consisting of both known (including site information) and unknown covariates.
- is an effect size matrix for X, drawn from .
- E is an added error matrix, drawn from N with different noise parameters according to site.
The resulting expression matrix Y was then corrected using combinations of (1) normalization (QN or RLE) and (2) covariate correction (aggregated or separate). The objective of each experiment was to assess how accurately each pre-processing method corrects the phenotype matrix and ultimately identifies effect SNPs as compared with the effect indicator matrix , which is I corrected for linkage disequilibrium such that any SNP that has high correlation () with any of the effect SNPs in I is marked as 1. Both AUPRC and AUROC were calculated from the ground truth and the p values as accuracy measurements. The simulation was repeated 100 times with different parameters for the number of covariates, the proportion of variance explained by genotype, and the proportion of variance explained by covariates. The reported AUPRC and AUROC per pre-processing method are the averages over those 100 simulations.
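The two accuracy measures can be computed from a binary ground truth and a vector of p values as in the sketch below. It assumes that -log10(p) is used as the ranking score and that AUPRC is estimated by average precision; both are choices made for this illustration, as are the function names and the toy inputs.

```python
import math

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precision_sum = 0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / k
    return precision_sum / hits

truth = [1, 0, 1, 0]                # toy effect indicator for four SNPs
p_values = [0.001, 0.2, 0.03, 0.01]
scores = [-math.log10(p) for p in p_values]

roc = auroc(truth, scores)              # 3 of 4 positive-negative pairs ordered correctly
prc = average_precision(truth, scores)  # (1/1 + 2/3) / 2
```

Averaging `roc` and `prc` over repeated simulations then gives one summary value per pre-processing method.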
Published: February 12, 2025
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.xgen.2025.100769.
References
- 1. Klein R.J., Zeiss C., Chew E.Y., Tsai J.Y., Sackler R.S., Haynes C., Henning A.K., SanGiovanni J.P., Mane S.M., Mayne S.T., et al. Complement factor h polymorphism in age-related macular degeneration. Science. 2005;308:385–389. doi: 10.1126/science.1109557.
- 2. Wellcome Trust Case Control Consortium Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
- 3. Aguet F., Alasoo K., Li Y.I., Battle A., Im H.K., Montgomery S.B., Lappalainen T. Molecular quantitative trait loci. Nat. Rev. Methods Primers. 2023;3:4–22.
- 4. GTEx Consortium The GTEx consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369:1318–1330. doi: 10.1126/science.aaz1776.
- 5. Lonsdale J., Thomas J., Salvatore M., Phillips R., Lo E., Shad S., Hasz R., Walters G., Garcia F., Young N., et al. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 2013;45:580–585. doi: 10.1038/ng.2653.
- 6. Huang Q.Q., Ritchie S.C., Brozynska M., Inouye M. Power, false discovery rate and winner’s curse in eQTL studies. Nucleic Acids Res. 2018;46:e133. doi: 10.1093/nar/gky780.
- 7. Sieberts S.K., Perumal T.M., Carrasquillo M.M., Allen M., Reddy J.S., Hoffman G.E., Dang K.K., Calley J., Ebert P.J., Eddy J., et al. Large eQTL meta-analysis reveals differing patterns between cerebral cortical and cerebellar brain regions. Sci. Data. 2020;7:340. doi: 10.1038/s41597-020-00642-8.
- 8. Kim Y., Xia K., Tao R., Giusti-Rodriguez P., Vladimirov V., van den Oord E., Sullivan P.F. A meta-analysis of gene expression quantitative trait loci in brain. Transl. Psychiatry. 2014;4:e459. doi: 10.1038/tp.2014.96.
- 9. Zeng B., Bendl J., Kosoy R., Fullard J.F., Hoffman G.E., Roussos P. Multi-ancestry eQTL meta-analysis of human brain identifies candidate causal variants for brain-related traits. Nat. Genet. 2022;54:161–169. doi: 10.1038/s41588-021-00987-9.
- 10. Kerimov N., Hayhurst J.D., Peikova K., Manning J.R., Walter P., Kolberg L., Samoviča M., Sakthivel M.P., Kuzmin I., Trevanion S.J., et al. A compendium of uniformly processed human gene expression and splicing quantitative trait loci. Nat. Genet. 2021;53:1290–1299. doi: 10.1038/s41588-021-00924-w.
- 11. Võsa U., Claringbould A., Westra H.J., Bonder M.J., Deelen P., Zeng B., Kirsten H., Saha A., Kreuzhuber R., Yazar S., et al. Large-scale cis-and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression. Nat. Genet. 2021;53:1300–1310. doi: 10.1038/s41588-021-00913-z.
- 12. Arellano A.M., Dai W., Wang S., Jiang X., Ohno-Machado L. Privacy policy and technology in biomedical data science. Annu. Rev. Biomed. Data Sci. 2018;1:115–129. doi: 10.1146/annurev-biodatasci-080917-013416.
- 13. Mittos A., Malin B., De Cristofaro E. Systematizing genome privacy research: A privacy-enhancing technologies perspective. Proc. Priv. Enhanc. Technol. 2019;2019:87–107.
- 14. Erlich Y., Narayanan A. Routes for breaching and protecting genetic privacy. Nat. Rev. Genet. 2014;15:409–421. doi: 10.1038/nrg3723.
- 15. Harmanci A., Gerstein M. Quantification of private information leakage from phenotype-genotype data: linking attacks. Nat. Methods. 2016;13:251–256. doi: 10.1038/nmeth.3746.
- 16. Schadt E.E., Woo S., Hao K. Bayesian method to predict individual SNP genotypes from gene expression data. Nat. Genet. 2012;44:603–608. doi: 10.1038/ng.2248.
- 17. Yao A.C.-C. 27th Annual Symposium on Foundations of Computer Science (Sfcs 1986) IEEE; 1986. How to generate and exchange secrets; pp. 162–167.
- 18. Goldreich O., Micali S., Wigderson A. Providing Sound Foundations for Cryptography: On the Work of Shafi Goldwasser and Silvio Micali. Association for Computing Machinery; 2019. How to play any mental game, or a completeness theorem for protocols with honest majority; pp. 307–328.
- 19. Gentry C. STOC ’09: Proceedings of the forty-first annual ACM symposium on Theory of computing. Association for Computing Machinery; 2009. Fully homomorphic encryption using ideal lattices; pp. 169–178.
- 20. Cho H., Wu D.J., Berger B. Secure genome-wide association analysis using multiparty computation. Nat. Biotechnol. 2018;36:547–551. doi: 10.1038/nbt.4108.
- 21. Dong C., Weng J., Liu J.N., Yang A., Liu Z., Yang Y., Ma J. Maliciously secure and efficient large-scale genome-wide association study with multi-party computation. IEEE Trans. Dependable Secure Comput. 2023;20:1243–1257.
- 22. de Vlaming R., Okbay A., Rietveld C.A., Johannesson M., Magnusson P.K.E., Uitterlinden A.G., van Rooij F.J.A., Hofman A., Groenen P.J.F., Thurik A.R., Koellinger P.D. Meta-GWAS accuracy and power (MetaGAP) calculator shows that hiding heritability is partially due to imperfect genetic correlations across studies. PLoS Genet. 2017;13 doi: 10.1371/journal.pgen.1006495.
- 23. Zhang Z. Michigan State University; 1998. Power and accuracy of detecting linkage between quantitative trait loci and genetic markers. PhD thesis.
- 24. Ben-Or M., Goldwasser S., Wigderson A. Proceedings of the twentieth annual ACM symposium on Theory of computing. STOC ’88. Association for Computing Machinery; 1988. Completeness theorems for non-cryptographic fault-tolerant distributed computation; pp. 1–10.
- 25. Shamir A. How to share a secret. Commun. ACM. 1979;22:612–613.
- 26. Lappalainen T., Sammeth M., Friedländer M.R., 't Hoen P.A.C., Monlong J., Rivas M.A., Gonzàlez-Porta M., Kurbatova N., Griebel T., Ferreira P.G., et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501:506–511. doi: 10.1038/nature12531.
- 27. Taylor D.J., Chhetri S.B., Tassia M.G., Biddanda A., Yan S.M., Wojcik G.L., Battle A., McCoy R.C. Sources of gene expression variation in a globally diverse human cohort. Nature. 2024;632:122–130. doi: 10.1038/s41586-024-07708-2.
- 28. Taylor-Weiner A., Aguet F., Haradhvala N.J., Gosai S., Anand S., Kim J., Ardlie K., Van Allen E.M., Getz G. Scaling computational genomics to millions of individuals with GPUs. Genome Biol. 2019;20:228. doi: 10.1186/s13059-019-1836-7.
- 29. Willer C.J., Li Y., Abecasis G.R. METAL: fast and efficient meta-analysis of genome-wide association scans. Bioinformatics. 2010;26:2190–2191. doi: 10.1093/bioinformatics/btq340.
- 30. Li W., Chen H., Jiang X., Harmanci A. Federated generalized linear mixed models for collaborative genome-wide association studies. iScience. 2023;26 doi: 10.1016/j.isci.2023.107227.
- 31. Hicks S.C., Okrah K., Paulson J.N., Quackenbush J., Irizarry R.A., Bravo H.C. Smooth quantile normalization. Biostatistics. 2018;19:185–198. doi: 10.1093/biostatistics/kxx028.
- 32. Anders S., Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106. doi: 10.1186/gb-2010-11-10-r106.
- 33. Zhou H.J., Li L., Li Y., Li W., Li J.J. PCA outperforms popular hidden variable inference methods for molecular QTL mapping. Genome Biol. 2022;23:210. doi: 10.1186/s13059-022-02761-4.
- 34. Ito M., Saito A., Nishizeki T. Secret sharing scheme realizing general access structure. Electron. Comm. Jpn. Pt. III. 1989;72:56–64.
- 35. Ongen H., Buil A., Brown A.A., Dermitzakis E.T., Delaneau O. Fast and efficient QTL mapper for thousands of molecular phenotypes. Bioinformatics. 2016;32:1479–1485. doi: 10.1093/bioinformatics/btv722.
- 36. Gürsoy G., Lu N., Wagner S., Gerstein M. Recovering genotypes and phenotypes using allele-specific genes. Genome Biol. 2021;22:263. doi: 10.1186/s13059-021-02477-x.
- 37. Casale F.P., Rakitsch B., Lippert C., Stegle O. Efficient set tests for the genetic analysis of correlated traits. Nat. Methods. 2015;12:755–758. doi: 10.1038/nmeth.3439.
- 38. DeLuca D.S., Levin J.Z., Sivachenko A., Fennell T., Nazaire M.D., Williams C., Reich M., Winckler W., Getz G. RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics. 2012;28:1530–1532. doi: 10.1093/bioinformatics/bts196.
- 39. Li B., Dewey C.N. RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinf. 2011;12:323–416. doi: 10.1186/1471-2105-12-323.
- 40. Patro R., Duggal G., Love M.I., Irizarry R.A., Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods. 2017;14:417–419. doi: 10.1038/nmeth.4197.
- 41. Froelicher D., Cho H., Edupalli M., Sousa J.S., Bossuat J.P., Pyrgelis A., Troncoso-Pastoriza J.R., Berger B., Hubaux J.P. 2023 IEEE Symposium on Security and Privacy (SP) IEEE; 2023. Scalable and privacy-preserving federated principal component analysis; pp. 1908–1925.
- 42. Davis J.R., Fresard L., Knowles D.A., Pala M., Bustamante C.D., Battle A., Montgomery S.B. An efficient multiple-testing adjustment for eQTL studies that accounts for linkage disequilibrium between variants. Am. J. Hum. Genet. 2016;98:216–224. doi: 10.1016/j.ajhg.2015.11.021.
- 43. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A.R., Bender D., Maller J., Sklar P., de Bakker P.I.W., Daly M.J., Sham P.C. Plink: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795.
- 44. Asharov G., Hamada K., Ikarashi D., Kikuchi R., Nof A., Pinkas B., Takahashi K., Tomida J. Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. CCS ’22. Association for Computing Machinery; 2022. Efficient secure three-party sorting with applications to data analysis and heavy hitters; pp. 125–138.
- 45. Shabalin A.A. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics. 2012;28:1353–1358. doi: 10.1093/bioinformatics/bts163.
- 46. Rabin M.O. Cryptology ePrint Archive; 2005. How to Exchange Secrets with Oblivious Transfer.
- 47. Araki T., Furukawa J., Lindell Y., Nof A., Ohara K. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. Association for Computing Machinery; 2016. High-throughput semi-honest secure three-party computation with an honest majority; pp. 805–817.
- 48. The 1000 Genomes Project Consortium A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
- 49. Fairley S., Lowy-Gallego E., Perry E., Flicek P. The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Res. 2020;48:D941–D947. doi: 10.1093/nar/gkz836.
- 50. Yang J., Wang D., Yang Y., Yang W., Jin W., Niu X., Gong J. A systematic comparison of normalization methods for eQTL analysis. Brief. Bioinform. 2021;22 doi: 10.1093/bib/bbab193.
- 51. Yang J., Wang D., Yang Y., Yang W., Jin W., Niu X., Gong J. A systematic comparison of normalization methods for eQTL analysis. Brief. Bioinform. 2021;22 doi: 10.1093/bib/bbab193.
- 52. Benjamini Y., Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B. 1995;57:289–300.
- 53. Storey J.D., Tibshirani R. Statistical significance for genome-wide studies. Proc. Natl. Acad. Sci. USA. 2003;100:9440–9445. doi: 10.1073/pnas.1530509100.
Data Availability Statement
All original code for privateQTL and benchmarking analysis is publicly available at https://github.com/G2Lab/privateQTL and Zenodo (https://doi.org/10.5281/zenodo.14648851). This paper utilizes publicly available datasets for benchmarking analysis. These datasets are listed in dataset under the STAR Methods. Additional information or analysis reported in this paper is available upon request from the lead contact.





