
This is a preprint.

It has not yet been peer reviewed by a journal.


[Preprint]. 2023 Feb 9:2023.02.09.527747. [Version 1] doi: 10.1101/2023.02.09.527747

Multivariate adaptive shrinkage improves cross-population transcriptome prediction for transcriptome-wide association studies in underrepresented populations

Daniel S Araujo 1, Chris Nguyen 2, Xiaowei Hu 3, Anna V Mikhaylova 4, Chris Gignoux 5, Kristin Ardlie 6, Kent D Taylor 7, Peter Durda 8, Yongmei Liu 9, George Papanicolaou 10, Michael H Cho 11, Stephen S Rich 3, Jerome I Rotter 7; NHLBI TOPMed Consortium, Hae Kyung Im 12, Ani Manichaikul 3, Heather E Wheeler 1,2,*
PMCID: PMC9934635  PMID: 36798214

Abstract

Transcriptome prediction models built on European-descent individuals’ data are less accurate when applied to different populations because of differences in linkage disequilibrium patterns and allele frequencies. We hypothesized that multivariate adaptive shrinkage may improve cross-population transcriptome prediction, as it leverages effect size estimates across different conditions - in this case, different populations. To test this hypothesis, we made transcriptome prediction models for use in transcriptome-wide association studies (TWAS) using different methods (Elastic Net, Matrix eQTL and Multivariate Adaptive Shrinkage in R (MASHR)) and tested their out-of-sample transcriptome prediction accuracy in population-matched and cross-population scenarios. Additionally, to evaluate model applicability in TWAS, we integrated publicly available multi-ethnic genome-wide association study (GWAS) summary statistics from the Population Architecture using Genomics and Epidemiology Study (PAGE) and PanUK Biobank (PanUKBB) with our transcriptome prediction models. With regard to transcriptome prediction accuracy, MASHR models performed similarly to the other methods when the training population ancestry closely matched the test population, but outperformed them in cross-population predictions. Furthermore, in multi-ethnic TWAS, MASHR models yielded more discoveries that replicate in both PAGE and PanUKBB than the other methods analyzed, including loci previously mapped in GWAS and loci not previously found in GWAS. Overall, our study demonstrates the importance of using methods that benefit from different populations’ effect size estimates in order to improve TWAS for multi-ethnic or underrepresented populations.

Keywords: genetics, genomics, human genetics, transcriptome-wide association studies

1. INTRODUCTION

Through genome-wide association studies (GWAS), many associations between single nucleotide polymorphisms (SNPs) and diverse phenotypes have been uncovered1. However, most GWAS to date have been conducted on individuals of European descent, even though they make up less than one fifth of the total global population2,3. Ancestry diversity in human genetic studies is important because linkage disequilibrium and allele frequencies differ among populations, so associations found in European ancestry individuals may not reflect associations in individuals of other ancestries, and vice versa3. Efforts to increase ancestry diversity in human genetics studies include the NHLBI Trans-Omics for Precision Medicine (TOPMed) consortium4, the Population Architecture using Genomics and Epidemiology (PAGE) study5, the Human Heredity and Health in Africa (H3Africa) initiative6, and the Pan-ancestry genetic analysis of the UK Biobank (PanUKBB)7.

Alongside GWAS, transcriptome-wide association studies (TWAS) test predicted gene expression levels for association with complex traits of interest, identifying gene-trait associated pairs8. Different TWAS methods, such as PrediXcan and FUSION, work by estimating gene expression through genotype data using transcriptomic prediction models built on expression quantitative trait loci (eQTL) data9,10. Similarly to GWAS, TWAS are also negatively affected by ancestry underrepresentation, as gene expression prediction models for use in TWAS are often trained in European descent datasets, which reduces the power of studies conducted with individuals of other ancestries11,12. Still, we expect the underlying biological mechanisms of complex traits to be shared across human populations11, and thus prediction methods that account for allelic heterogeneity and better estimate effect sizes can improve the discovery rate and interpretation of TWAS across populations.

Here, we used genomic and transcriptomic data from the Multi-Ethnic Study of Atherosclerosis (MESA)13 multi-omics pilot study of TOPMed to build TWAS prediction models (Figure 1). Using three different methods to estimate effect sizes, Elastic-Net14,15, Matrix eQTL16, and multivariate adaptive shrinkage (MASHR)17, we built population-specific transcriptomic prediction models for four MESA-defined populations – African American, Chinese, European, and Hispanic/Latino – across three blood cell types and evaluated their prediction performance in the Geuvadis18 cohort using PrediXcan9. From there, we used S-PrediXcan19 to apply our models to GWAS summary statistics of 28 complex traits from the multi-ethnic PAGE5 study and PanUKBB7. We hypothesized that MASHR may improve transcriptome prediction and increase the number of TWAS hits in comparison to the other methods, as it leverages effect size estimates across different conditions - in this case, different populations - to adjust effect sizes. In agreement with this hypothesis, our results indicated that in cross-population predictions, MASHR models have higher transcriptome prediction accuracy than Elastic Net and Matrix eQTL models. Furthermore, in our TWAS, MASHR models discovered the highest number of associated gene-trait pairs across all population models. These findings illustrate that leveraging genetic diversity and effect size estimates across populations can help improve current transcriptome prediction models, which may increase discovery and replication in association studies in underrepresented populations or multi-ethnic cohorts.

Figure 1: Overall study methodology.


Using TOPMed MESA as a training dataset, we built population-based transcriptome prediction models using three different methods (Elastic Net, Matrix eQTL, and Multivariate adaptive shrinkage). With these transcriptome models, we evaluated their out-of-sample transcriptome prediction accuracy using the GEUVADIS dataset. Additionally, we assessed their applicability in multi-ethnic TWAS using GWAS summary statistics from the PAGE Study and PanUKBB. AFA = African American, CHN = Chinese, EUR = European, HIS = Hispanic/Latino.

2. METHODS

a. Training dataset

To build our transcriptome prediction models, we used data from the Multi-Ethnic Study of Atherosclerosis (MESA)13 multi-omics pilot study of the NHLBI Trans-Omics for Precision Medicine (TOPMed) consortium. This data set includes genotypes derived from whole genome sequencing and transcripts per million (TPM) values derived from RNA-Seq for individuals of four different populations – African American (AFA), Chinese (CHN), European (EUR), and Hispanic/Latino (HIS) – for three different blood cell types: peripheral blood mononuclear cells (PBMC, ALL n = 1287, AFA n = 334, CHN n = 104, EUR n = 528, HIS n= 321), CD16+ monocytes (Mono, ALL n = 395, AFA n = 75, EUR n = 221, HIS n = 99), and CD4+ T-cells (T cells, ALL n = 397, AFA n = 75, EUR n = 224, HIS n = 98).

b. Genotype and RNA-Seq QC

We performed QC on each MESA tissue-population pair separately. For the genotype data4 (Freeze 8, phs001416.v2.p1), we excluded INDELs, multi-allelic SNPs, and ambiguous-strand SNPs (A/T, C/G), and removed the remaining variants with MAF < 0.01 or HWE P-value < 1 × 10⁻⁶ using PLINK20 v1.9. For chromosome X, filtering by HWE was applied only to variants within the pseudoautosomal regions, based on GRCh38 positions. Furthermore, for the non-pseudoautosomal region of X, male dosages were coded as either 0 or 2. After QC, the numbers of non-ambiguous SNPs remaining were: AFA = 15.7M; CHN = 8.4M; EUR = 9.7M; HIS = 13.2M.
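
The variant filters above were run in PLINK; the sketch below only illustrates the per-variant logic in Python. It assumes the standard definition of a strand-ambiguous SNP (complementary allele pairs) and treats the MAF and HWE cut-offs as keep-if-at-or-above thresholds. The function names are hypothetical, and the chromosome X rules are not handled.

```python
# Illustrative only: the study applied these filters with PLINK v1.9.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(ref, alt):
    """True when the two alleles are complementary, so the strand cannot be inferred."""
    return COMPLEMENT.get(ref.upper()) == alt.upper()


def passes_qc(ref, alt, maf, hwe_p, maf_min=0.01, hwe_min=1e-6):
    """Keep biallelic, non-ambiguous SNPs that pass the MAF and HWE thresholds."""
    ref, alt = ref.upper(), alt.upper()
    # Single-base alleles only: this drops INDELs (multi-base or missing alleles).
    is_snp = ref in COMPLEMENT and alt in COMPLEMENT and ref != alt
    if not is_snp or is_strand_ambiguous(ref, alt):
        return False
    return maf >= maf_min and hwe_p >= hwe_min
```

Multi-allelic sites are assumed to have been removed upstream, since a single (ref, alt) pair cannot represent them.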

For the RNA-Seq data, we also performed QC separately by tissue-population. First, we removed genes with average TPM values < 0.1. For some individuals, RNA expression levels were measured at two different time points (Exam 1 and Exam 5); thus, after log-transforming each measurement and adjusting for age and sex as covariates, we took the mean of the two time points (or the single adjusted log-transformed value, if expression levels were only measured once), performed rank-based inverse normal transformation, and adjusted for the first 10 genotype and 10 expression PCs. To estimate genotype and expression principal components, we used PC-AiR21, which accounts for sample relatedness, known or not. For each tissue, we removed genes absent in at least one population. After QC, we had 17,585 genes in PBMC, 14,503 in Mono, and 16,647 in T cells.
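
As an illustration of the rank-based inverse normal transformation step, here is a minimal standard-library sketch. The rank offset of 0.5 and the average-rank handling of ties are assumptions, since the text does not specify them, and `rank_inverse_normal` is a hypothetical name.

```python
from statistics import NormalDist


def rank_inverse_normal(values, offset=0.5):
    """Map values to standard normal quantiles of their (tie-averaged) ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # 1-based average rank of the run
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    standard_normal = NormalDist()
    # (rank - offset) / n lies strictly between 0 and 1 for offset = 0.5.
    return [standard_normal.inv_cdf((r - offset) / n) for r in ranks]


transformed = rank_inverse_normal([3.2, 1.1, 2.5])
```

The transformation preserves the ordering of the expression values while forcing an approximately normal distribution, which is why it is applied before adjusting for principal components.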

c. Gene expression cis-heritability estimation

We estimated gene expression heritability (h2) using cis-SNPs within the 1Mb region upstream of the transcription start site and the 1Mb region downstream of the transcription end site. Using the genotype data filtered only by HWE P-value > 1 × 10⁻⁶, for each tissue-population pair, we first performed LD-pruning with a window of 500 variants, a step of 50 variants, and an r2 threshold of 0.2 using PLINK20 v1.9. Then, for each gene, we extracted cis-SNPs and excluded SNPs with MAF < 0.01. Finally, to assess cis-SNP expression heritability, we estimated the genetic relationship matrix and h2 using GCTA-GREML22 with the “--reml-no-constrain” option. We considered a gene heritable if it had a positive h2 estimate (h2 - 2*S.E. > 0.01 and p-value < 0.05) in at least one MESA population. In total, 9,206 genes were heritable in PBMC, 3,804 in Mono, and 4,053 in T cells. Only these genes were included in the final models and analyzed in the results.
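
The heritability criterion can be written as a small predicate. This is a sketch with hypothetical function names and made-up example values; the thresholds are those stated above.

```python
def is_heritable(h2, se, p_value, lower_bound_min=0.01, alpha=0.05):
    """Criterion from the text: h2 - 2*S.E. > 0.01 and p-value < 0.05."""
    return (h2 - 2 * se) > lower_bound_min and p_value < alpha


def heritable_in_any_population(estimates):
    """estimates: dict population -> (h2, se, p_value); populations without an estimate are omitted."""
    return any(is_heritable(h2, se, p) for h2, se, p in estimates.values())


# Hypothetical estimates for one gene (not taken from the study).
example = {
    "AFA": (0.08, 0.05, 0.04),  # lower bound -0.02: not heritable
    "EUR": (0.12, 0.03, 0.001),  # lower bound 0.06: heritable
}
gene_is_kept = heritable_in_any_population(example)
```

Because a gene is kept when any single population passes, the final gene set is the union across populations.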

d. Transcriptome prediction models

With the aforementioned genotype and gene expression data, we built transcriptome prediction models for each MESA tissue-population pair, and for each gene we considered cis-SNPs as defined in the previous section. Additionally, we only considered SNPs present in the GWAS summary statistics of the Population Architecture using Genomics and Epidemiology (PAGE) study5 to build our prediction models to make sure that there would be a high overlap between SNPs in the transcriptome models and SNPs in the GWAS summary statistics. After merging with PAGE SNPs, the average numbers of SNPs left in our dataset were: AFA = 12.8M; CHN = 6.2M; EUR = 7.4M; HIS = 10.5M.

We built our population-based models using three different approaches. The first one consists of a cross-validated elastic-net (EN) regression using the glmnet package in R14,15, with mixing parameter α = 0.5. We considered EN as our baseline model, as it has been previously used to make transcriptome prediction models for the TOPMed MESA data23.
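
For reference, the objective that glmnet minimizes for a Gaussian response is shown below. The notation is introduced here rather than taken from the text: N is the number of individuals, y_i the adjusted expression of the gene, x_i the vector of cis-SNP dosages, and λ the penalty strength chosen by cross-validation.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
+ \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right],
\qquad \alpha = 0.5
```

With α = 0.5 the penalty is an equal mix of ridge and lasso terms, so the fitted model is sparse but tends to retain groups of correlated SNPs.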

The second method implemented was mash (Multivariate Adaptive Shrinkage)17 in R (MASHR). Unlike EN, MASHR does not estimate weights by itself; rather, it takes z-score (or weight and standard error) matrices as input and adjusts them based on correlation patterns present in the data, allowing for both shared and population-specific effects. We ran MASHR one gene at a time, using cis-SNP weights estimated by Matrix eQTL16 and treating the MESA populations as different conditions (Figure 2A). Then, we split the MASHR-adjusted weights by population and selected the top SNP (lowest local false sign rate) per gene in each population to determine which SNPs would be included in the final models (Figure 2B). To make population-based models, we used population-specific effect sizes, taken from the corresponding MASHR output matrices.

Figure 2: Design of the methodology implemented to make MASHR models.


(A) Effect sizes estimated with Matrix eQTL within each population dataset were combined across genes, with the different populations as conditions, to use as input for MASHR. The output matrices contain adjusted effect sizes. (B) For each population, we selected the top SNP (lowest local false sign rate) per gene. Then, we concatenated the gene-top SNP pairs across populations to determine which SNPs would be included in the final models. Lastly, to make our population-based transcriptome prediction models, we used population-specific effect sizes, taken from the corresponding MASHR output matrices. AFA = African American, CHN = Chinese, EUR = European, HIS = Hispanic/Latino.

The third and last method was based on the effect sizes estimated by Matrix eQTL16 using the linear regression model. We used the same approach taken to build the MASHR models, but the key difference is that we made the models using the unadjusted effect sizes.
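
The SNP-selection step shared by the MASHR and Matrix eQTL models can be sketched as follows. This is a simplified illustration, not the authors' code: the nested-dictionary layout and the function name are hypothetical, and the ranking statistic is passed in as a generic `score` where lower is better (the local false sign rate for MASHR; the statistic used for the Matrix eQTL models is not stated in the text).

```python
def build_population_models(score, effects):
    """Select top SNPs per population, then give every population the union.

    score, effects: dict gene -> dict snp -> dict population -> float
    Returns: dict population -> dict gene -> dict snp -> weight
    """
    models = {}
    for gene, snp_scores in score.items():
        populations = sorted({pop for per_pop in snp_scores.values() for pop in per_pop})
        # Step 1: the top SNP (lowest score) for this gene in each population.
        selected = set()
        for pop in populations:
            candidates = [(vals[pop], snp) for snp, vals in snp_scores.items() if pop in vals]
            if candidates:
                selected.add(min(candidates)[1])  # ties broken by SNP name (arbitrary)
        # Step 2: each population's model holds the union of selected SNPs,
        # weighted by that population's own effect size estimates.
        for pop in populations:
            for snp in sorted(selected):
                weight = effects[gene][snp].get(pop)
                if weight is not None:
                    models.setdefault(pop, {}).setdefault(gene, {})[snp] = weight
    return models
```

Under this scheme a gene's model contains at most one SNP per training population, with the same SNP set but different weights in each population's model.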

e. Assessing transcriptome prediction performance

To evaluate the gene expression prediction performance of all our transcriptome prediction models, we used DNA and lymphoblastoid cell lines RNA-Seq data from 449 individuals in the Geuvadis18 study. Individuals within the testing dataset belong to five different populations (Utah residents with Northern and Western European ancestry (CEU), n = 91; Finnish in Finland (FIN), n = 92; British in England and Scotland (GBR), n = 86; Toscani in Italy (TSI), n = 91; Yoruba in Ibadan, Nigeria (YRI), n = 89), which we analyzed both separately and together (ALL). Similarly to our training dataset, we performed rank-based inverse normal transformation on the gene expression levels, and adjusted for the first 10 genotype and 10 expression PCs. With the Geuvadis genotype data and our transcriptome prediction models, we used PrediXcan9 to estimate gene expression levels, and compared the estimated values to the adjusted, measured expression levels using Spearman correlation.
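
The evaluation amounts to a weighted sum of dosages followed by a rank correlation. The sketch below is not PrediXcan itself: it uses hypothetical function names and toy data, skips model SNPs that are absent from the genotype data, and computes Spearman correlation as the Pearson correlation of tie-averaged ranks.

```python
def predict_expression(weights, dosages):
    """Predicted expression per individual: sum over SNPs of weight * dosage.

    weights: dict snp -> weight from one gene's prediction model
    dosages: dict snp -> list of allele dosages (0 to 2), one per individual
    """
    shared = [snp for snp in weights if snp in dosages]
    if not shared:
        return None  # no model SNP was genotyped, so the gene cannot be predicted
    n = len(dosages[shared[0]])
    return [sum(weights[snp] * dosages[snp][i] for snp in shared) for i in range(n)]


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / (var_x * var_y) ** 0.5


# Toy data for one gene in four individuals (not from the study).
weights = {"rs1": 0.5, "rs2": -0.2}
dosages = {"rs1": [0, 1, 2, 1], "rs2": [2, 0, 1, 1]}
observed = [-1.0, 0.2, 1.5, 0.1]
predicted = predict_expression(weights, dosages)
rho = spearman_rho(predicted, observed)
```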

f. Applications in association studies

To test the applicability of our transcriptome prediction models in multi-ethnic association studies, we applied S-PrediXcan19 to GWAS summary statistics from the Population Architecture using Genomics and Epidemiology (PAGE) study5. The PAGE study consists of 28 different phenotypes tested for association with variants within a multi-ethnic, non-European cohort of 49,839 individuals (Hispanic/Latino [n=22,216], African American [n=17,299], Asian [n=4,680], Native Hawaiian [n=3,940], Native American [n=652] or Other [n=1,052]). Since we tested multiple phenotypes and transcriptome prediction models, we considered genes as significantly associated with a phenotype if the association p-value was less than the Bonferroni corrected GWAS significance threshold of 5e-8.

To replicate the associations found in PAGE, we also applied S-PrediXcan19 to PanUKBB7 GWAS summary statistics (N=441,331; European [n=420,531], Central/South Asian [n=8,876], African [n=6,636], East Asian [n=2,709], Middle Eastern [n=1,599] or Admixed American [n=980]). For comparability, we selected summary statistics of phenotypes that overlap with the ones tested in PAGE (Table S1). As previously described, a gene-trait pair association was considered significant if its p-value was less than the Bonferroni corrected GWAS significance threshold of 5e-8. Furthermore, we deemed a significant gene-trait pair association replicated if it was detected by the same MESA tissue-population model and had the same direction of effect in PAGE and PanUKBB. To assess whether the gene-trait associations reported in our study are novel, we compared them to studies found in the GWAS Catalog1 (All associations v1.0.2 file downloaded on 11/9/2022).
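
The replication rule can be expressed as a short predicate. This is a sketch with a hypothetical function name; it assumes both results come from the same MESA tissue-population model and the same gene-trait pair. The example values are the YJEFN3-triglycerides S-PrediXcan results reported in the Discussion.

```python
GWAS_THRESHOLD = 5e-8


def is_replicated(page, panukbb, threshold=GWAS_THRESHOLD):
    """page, panukbb: (effect_size, p_value) for one gene-trait pair from the same model."""
    (effect_page, p_page), (effect_ukbb, p_ukbb) = page, panukbb
    significant_in_both = p_page < threshold and p_ukbb < threshold
    same_direction = effect_page * effect_ukbb > 0
    return significant_in_both and same_direction


# YJEFN3 and triglycerides: PAGE (-0.522, 6.07e-16), PanUKBB (-0.860, 7.12e-86).
yjefn3_replicated = is_replicated((-0.522, 6.07e-16), (-0.860, 7.12e-86))
```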

3. RESULTS

a. Increased sample sizes improve gene expression cis-heritability estimation

With the goal of improving transcriptome prediction in diverse populations, we first determined which gene expression traits were heritable and thus amenable to genetic prediction, using genome-wide genotype and RNA-Seq data from three blood cell types (PBMCs, monocytes, T cells) in TOPMed MESA. We estimated cis-heritability (h2) using data from four different populations (African American - AFA, Chinese - CHN, European - EUR, and Hispanic/Latino - HIS). Variation in h2 estimation between populations is expected due to differences in allele frequencies and LD patterns; however, we show that larger population sample sizes yield more h2 estimates (Figure 3). For instance, with the EUR dataset (n = 528), we assessed h2 for 10,228 genes, whereas with the AFA dataset (n = 334) we estimated h2 for 8,765 genes (Figure 3A). The impact is greatest in the CHN population, which has the smallest sample size (n = 104): for that population, we managed to estimate h2 for only 3,448 genes. The same pattern holds when analyzing only the heritable genes (h2 lower bound > 0.01). In EUR, 6,902 genes were deemed heritable, whereas in AFA and CHN the numbers of heritable genes were 5,537 and 1,367, respectively (Figure 3B). Thus, larger sample sizes are needed to better pinpoint h2 estimates, especially in non-European populations. In total, taking the union of all populations’ results, we detected 9,206 heritable genes in PBMCs, 3,804 in monocytes, and 4,053 in T cells.

Figure 3: PBMC gene expression cis-heritability estimates across MESA populations.


(A) Gene expression cis-heritability (h2) estimated for different genes across different MESA population datasets. Only genes with significant estimated h2 (p-value < 0.05) are shown. Gray bars represent the standard errors (2*S.E.). Genes are ordered on the x-axis in ascending h2 order, and colored according to the h2 lower bound (h2 - 2*S.E.). (B) Number of significant heritable genes (p-value < 0.05 and h2 lower bound > 0.01) within each population dataset, by sample size. AFA = African American, CHN = Chinese, EUR = European, HIS = Hispanic/Latino.

b. MASHR models improve cross-population prediction performance

To improve TWAS power for discovery and replication across all populations, we sought to improve cross-population transcriptome prediction accuracy. For this, we used data from four different populations and built gene expression prediction models using three different methods (Elastic Net (EN), Matrix eQTL, and multivariate adaptive shrinkage in R (MASHR)). We chose EN as a baseline approach for comparison in our analysis, as it has been previously shown to perform better than other common machine learning methods such as random forest, K-nearest neighbor, and support vector regression24. Matrix eQTL estimates univariate effect sizes for each cis-SNP-gene relationship, and we developed an algorithm that includes the top SNPs from every population in each population’s model while using that population’s own estimated effect sizes (Figure 1). Matrix eQTL effect sizes are the input for MASHR, which we hypothesized might better estimate cross-population effect sizes, due to its flexibility in allowing both shared and population-specific effects17,25. After filtering our models to include only genes with positive h2 (h2 lower bound > 0.01) in at least one population, we obtained more gene models with Matrix eQTL and MASHR than with EN, especially in the CHN population model (Figure 4A).

Figure 4: Comparison of MESA population transcriptome prediction models.


(A) The number of genes in each MESA population model, by method and tissue. (B) Prediction performance (Spearman’s rho) of MASHR and EN PBMC MESA population models in Geuvadis GBR and YRI populations. Only genes with expression predicted by both methods for each MESA-Geuvadis population pair are shown. Differences in performance assessed through Wilcoxon rank sum tests; ns = not significant, *** = p-value ≤ 0.001, **** = p-value ≤ 0.0001.

To evaluate model performance in population-matched and cross-population transcriptome predictions, we used data from the Geuvadis study, which comprises individuals of West African or European descent. We defined “population-matched predictions” as the scenarios in which the MESA training population is the one genetically closest to the Geuvadis test population among the available data, and we defined “cross-population predictions” as any other pairs (Figure S1). Focusing on the Geuvadis GBR and YRI populations, which have similar sample sizes and are of distinct continental ancestries, we observed that MASHR models significantly outperform EN models in cross-population transcriptome predictions, considering genes with expression predicted by both methods, as seen in the AFA-GBR and EUR-YRI MESA-Geuvadis population pairs (Figure 4B). We also see higher prediction performance by the CHN and HIS MASHR models in comparison to EN, regardless of the Geuvadis population analyzed. However, in population-matched scenarios (AFA-YRI and EUR-GBR), prediction performance does not significantly differ between the MASHR and EN methods. Similar results were obtained when comparing Matrix eQTL and EN (Figure S2A). Regarding MASHR and Matrix eQTL models, both methods perform the same in almost all cases, except for EUR-YRI and all CHN predictions, in which MASHR performed better (Figure S2B). Overall, across all Geuvadis populations, MASHR models performed either better than or the same as EN and Matrix eQTL models in both population-matched and cross-population transcriptome prediction scenarios (Table S2).

c. Leveraging effect sizes across different populations improves discovery rate in multi-ethnic TWAS

To investigate the applicability of the models we built in multi-ethnic TWAS, we used S-PrediXcan with GWAS summary statistics of 28 complex traits from PAGE and PanUKBB. We show that across all tissue-population models, MASHR identified the highest number of gene-trait pair associations (205) that replicated in both PAGE and PanUKBB (P < 5e-8), followed by Matrix eQTL (172) and EN (93) (Table S3). When analyzing the total number of discoveries separately for each population, MASHR had the highest number of gene-trait pairs in most population models, with large discrepancies between MASHR and EN in the AFA and CHN models (Figure 5A). Additionally, when comparing gene-trait pairs, we saw that most MASHR hits were shared between population models (Figure 5B), whereas the EN models had more population-specific discoveries (Figure 5C). These findings suggest that MASHR models are highly consistent and that TWAS results depend less on the population model used with MASHR than with EN.

Figure 5: Number of significant S-PrediXcan gene-trait pairs in PAGE and PanUKBB GWAS summary statistics.


(A) Total number of significant gene-trait pairs discovered by each MESA population model (considering the union of the three tissues), by method. (B) Number of significant gene-trait pairs discovered by MASHR MESA population models (considering the union of the three tissues). (C) Number of significant gene-trait pairs discovered by EN MESA population models (considering the union of the three tissues).

To contextualize our models’ findings, we investigated whether the discovered gene-trait pairs had been previously reported in any studies in the GWAS Catalog (https://www.ebi.ac.uk/gwas/home). Of the 72 distinct gene-trait pair associations found (475 in total across all models), 19 (26.39%) have not been reported in the GWAS Catalog, and therefore may be novel associations that require further investigation (Table S3). Most of those potential new biological associations (13) were discovered with MASHR AFA models (Table S3). Furthermore, of the 53 distinct known GWAS Catalog associations discovered, MASHR models identified most of them (Table S3). For instance, MASHR EUR models found 34 known associations, followed by MASHR AFA with 33, and Matrix eQTL with 32 (Figure S3).

4. DISCUSSION

In this work, we sought to build population-based transcriptome prediction models for TWAS from TOPMed MESA cohort data using three distinct approaches. We saw that although the AFA and HIS populations’ datasets contained the highest numbers of SNPs after quality control, EUR yielded the highest number of gene expression traits with significant heritability estimates across all tissues analyzed. This is most likely due to the larger sample size in EUR in comparison to AFA and HIS, as larger sample sizes provide higher statistical power to detect eQTLs with smaller effects26. Test data sample size has also been shown to positively correlate with gene expression prediction accuracy27.

In addition to sample size, gene expression prediction accuracy is known to be greater when the training and testing datasets have similar ancestries12,23,27,28; however, non-European ancestries are vastly underrepresented in human genetics studies2,3, which compromises the ability to build accurate TWAS models for them. Thus, using data from the Geuvadis cohort, we evaluated the transcriptome prediction performance of our models and found that MASHR models either significantly outperformed EN and Matrix eQTL models or had similar performance. Previous studies have shown that by borrowing information across different conditions, such as tissues17 or cell types29, MASHR identifies shared or condition-specific eQTLs, which can enhance causal gene identification25, as well as improve effect size estimation accuracy17. Similarly, by leveraging effect size estimates across multiple populations, MASHR improved cross-population transcriptome prediction without compromising population-matched prediction accuracy.

Discovery and replication of TWAS associations are also related to the ancestries of the transcriptome prediction model training dataset and the ancestries of the TWAS sample dataset11. Thus, we assessed the applicability of our models in TWAS using S-PrediXcan on PAGE and PanUKBB GWAS summary statistics and found that across all tissues and populations, MASHR models yielded the highest total number of gene-trait pair associations, with MASHR AFA reporting the highest number. It therefore seems that although MASHR improved gene expression prediction accuracy for all populations analyzed, using transcriptome prediction models that match the ancestries of the GWAS dataset still yields the highest number of TWAS discoveries, which is in agreement with many previous works11,30–33. Furthermore, by investigating which associations had been previously reported in the GWAS Catalog, we saw that most new discoveries were found by MASHR models. Some of these possible new discoveries are unique to MASHR models and have been corroborated previously, such as YJEFN3 (also known as AIBP2) and triglycerides, whose low expression in zebrafish increases cellular unesterified cholesterol levels34, consistent with our S-PrediXcan effect size directions (PAGE effect size = −0.522, p-value = 6.07e-16; PanUKBB effect size = −0.860, p-value = 7.12e-86). Additionally, we saw that MASHR models showed higher consistency than EN models, which means that TWAS results depend less on the population model used with MASHR than with EN.

One limitation of our TWAS is that we used transcriptome prediction models trained in PBMCs, monocytes and T cells, and those tissues might not be the most appropriate for some phenotypes in PAGE or PanUKBB. Additionally, because of the smaller sample sizes for some populations in our training dataset, h2 and eQTL effect size estimates have large standard errors, which may affect the ability of MASHR to adjust effect sizes across different conditions based on correlation patterns present in the data. Nevertheless, our results demonstrate that a method first applied to leverage effect sizes across tissues can also leverage them across populations, and that doing so improves cross-population transcriptome prediction accuracy. Increasing sample sizes for underrepresented populations is therefore expected to improve the performance of current MASHR TWAS models, as well as increase genetic diversity in the data. MASHR is most useful when population effects are shared, as demonstrated by the more consistent S-PrediXcan results, but population-specific effects are also relevant. For instance, a study in a large African American and Latino cohort discovered eQTLs only present at appreciable allele frequencies in African ancestry populations33. Moreover, since our MASHR and Matrix eQTL models focus on the top SNPs, we might not be including enough eQTLs in the models, especially for those genes whose expression is genetically regulated by multiple eQTLs with small effects.

In conclusion, our results demonstrate the importance and the benefits of increasing ancestry diversity in the field of human genetics, especially regarding association studies. As shown, sample size is valuable for assessing gene expression heritability and for accurately estimating eQTL effect sizes, and thus some populations are negatively affected by the lack of data. However, by making transcriptome prediction models that leverage effect size estimates across different populations using multivariate adaptive shrinkage, we were able to increase gene expression prediction performance in “cross-population” scenarios, in which the training and test data are genetically distant. Additionally, when applied to multi-ethnic TWAS, these models yielded more discoveries than the other methods analyzed, even detecting well-known associations that the other methods missed. Thus, in order to further improve TWAS in multi-ethnic or underrepresented populations and possibly reduce health care disparities, it is necessary to use methods that consider shared and population-specific effect sizes, as well as to increase the available data from underrepresented populations.

Supplementary Material

Supplement 1
media-1.tif (1,022.4KB, tif)
Supplement 2
media-2.tif (1.8MB, tif)
Supplement 3
media-3.tif (511.8KB, tif)
Supplement 4
media-4.xlsx (14.6KB, xlsx)
Supplement 5
media-5.xlsx (27.2KB, xlsx)
Supplement 6
media-6.xlsx (111.2KB, xlsx)
Supplement 7
media-7.xlsx (17.4KB, xlsx)
Supplement 8

5. ACKNOWLEDGEMENTS

This work is supported by the NIH National Human Genome Research Institute Academic Research Enhancement Award R15 HG009569 (HEW). Whole genome sequencing (WGS) for the Trans-Omics in Precision Medicine (TOPMed) program was supported by the National Heart, Lung and Blood Institute (NHLBI). WGS for “NHLBI TOPMed: Multi-Ethnic Study of Atherosclerosis (MESA)” (phs001416.v1.p1) was performed at the Broad Institute of MIT and Harvard (3U54HG003067-13S1). Centralized read mapping and genotype calling, along with variant quality metrics and filtering were provided by the TOPMed Informatics Research Center (3R01HL-117626-02S1). Phenotype harmonization, data management, sample-identity QC, and general study coordination, were provided by the TOPMed Data Coordinating Center (3R01HL-120393-02S1), and TOPMed MESA Multi-Omics (HHSN2682015000031/HSN26800004). The MESA projects are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA is provided by contracts 75N92020D00001, HHSN268201500003I, N01-HC-95159, 75N92020D00005, N01-HC-95160, 75N92020D00002, N01-HC-95161, 75N92020D00003, N01-HC-95162, 75N92020D00006, N01-HC-95163, 75N92020D00004, N01-HC-95164, 75N92020D00007, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, UL1-TR-000040, UL1-TR-001079, UL1-TR-001420, UL1TR001881, DK063491, and R01HL105756. The MESA Epigenomics and Transcriptomics Studies were funded by National Institutes of Health grants 1R01HL101250, 1RF1AG054474, R01HL126477, R01DK101921, and R01HL135009. The authors thank the other investigators, the staff, and the participants of the MESA study for their valuable contributions.
A full list of participating MESA investigators and institutes can be found at http://www.mesa-nhlbi.org.

Footnotes

7. DECLARATION OF INTERESTS

All authors declare that they have no conflicts of interest.

6. DATA AVAILABILITY

All scripts used for analyses are available at https://github.com/danielsarj/TOPMed_MESA_crosspop_portability. MESA population prediction models and raw S-PrediXcan TWAS output files are available at https://doi.org/10.5281/zenodo.7551845. TOPMed MESA data are under controlled access in dbGaP at https://www.ncbi.nlm.nih.gov/gap/ through study accession phs001416.v2.p1. Geuvadis expression data are available at ArrayExpress (E-GEUV-1) and genotype data are available at http://www.internationalgenome.org/. PAGE GWAS summary statistics are available in the GWAS Catalog at https://www.ebi.ac.uk/gwas/publications/31217584. PanUKBB GWAS summary statistics are available at https://pan.ukbb.broadinstitute.org/phenotypes/index.html.



Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
