Abstract
Genetic variation that influences gene expression and splicing is a key source of phenotypic diversity1–5. Although invaluable, studies investigating these links in humans have been strongly biased towards participants of European ancestries, which constrains generalizability and hinders evolutionary research. Here to address these limitations, we developed MAGE, an open-access RNA sequencing dataset of lymphoblastoid cell lines from 731 individuals from the 1000 Genomes Project6, spread across 5 continental groups and 26 populations. Most variation in gene expression (92%) and splicing (95%) was distributed within versus between populations, which mirrored the variation in DNA sequence. We mapped associations between genetic variants and expression and splicing of nearby genes (cis-expression quantitative trait loci (eQTLs) and cis-splicing QTLs (sQTLs), respectively). We identified more than 15,000 putatively causal eQTLs and more than 16,000 putatively causal sQTLs that are enriched for relevant epigenomic signatures. These include 1,310 eQTLs and 1,657 sQTLs that are largely private to underrepresented populations. Our data further indicate that the magnitude and direction of causal eQTL effects are highly consistent across populations. Moreover, the apparent ‘population-specific’ effects observed in previous studies were largely driven by low resolution or additional independent eQTLs of the same genes that were not detected. Together, our study expands our understanding of human gene expression diversity and provides an inclusive resource for studying the evolution and function of human genomes.
Subject terms: Gene expression, Genetic variation, Quantitative trait loci, RNA splicing
A new open-access RNA sequencing dataset, MAGE, of 731 individuals across geographically diverse human populations provides a valuable resource to study genetic diversity and evolution and expands the capacity to identify new genetic associations.
Main
Genetic variation that affects gene expression and splicing accounts for a large proportion of phenotypic differences within and between species1. By correlating patterns of expression and splicing with variation at the level of DNA, past research has helped reveal the genetic basis of these molecular traits and their relationships with higher-order phenotypes2–5. Previous molecular association studies in humans have been strongly biased towards individuals of European ancestries, which potentially constrains generalizability and hinders our understanding of human gene expression diversity and evolution7–9. Research has also demonstrated that the inclusion of diverse samples breaks up linkage disequilibrium (LD), which improves resolution for identifying causal variants10.
Motivated by these points, several studies have profiled gene expression in geographically diverse samples11–13. These studies have generally observed that gene expression and splicing differences between populations are rare and that divergence in these molecular phenotypes does not clearly reflect patterns of population divergence. Studies have also revealed an abundance of genetic variants associated with levels of gene expression (termed eQTLs). Promoter proximal eQTLs possessed larger effects, on average, and tended to be shared across populations12. Although foundational, these studies were generally characterized by small sample sizes and/or assayed gene expression using microarrays. This limits statistical power and resolution for molecular QTL mapping and hinders integration and comparison with modern sequencing-based datasets. Meanwhile, recent work by consortia such as MESA, GALA II and SAGE have generated RNA sequencing (RNA-seq) data from thousands of samples and include representation from African American and Latin American populations14,15, but their controlled access poses barriers to re-use, and in some cases are restricted to disease-related research.
To address this gap, here we developed MAGE, an open resource for multi-ancestry analysis of gene expression. MAGE comprises RNA-seq data from a large sample of lymphoblastoid cell lines (LCLs) derived from individuals across geographically diverse human populations. Using these data, we performed the following analyses: (1) quantified the distribution of gene expression and splicing diversity; (2) mapped genetic variation that influences gene expression and splicing at high resolution; and (3) examined the evolutionary forces that shape such variation and the causes of apparent heterogeneity in its effects across populations. Together, our work offers a more complete view of the magnitude, distribution and genetic sources of human gene expression and splicing diversity.
A multi-ancestry RNA-seq resource
We performed RNA-seq of LCLs from 731 individuals from the 1000 Genomes Project6 (1KGP), which represented 26 globally distributed populations (27–30 individuals per population) across 5 continental groups (Fig. 1a). Although we emphasize the greater genetic diversity within African populations—a point obscured by ADMIXTURE analysis and principal component analysis (PCA)16—these visualizations demonstrate that our study includes data from several non-African ancestry groups that were poorly represented in previous studies (Fig. 1b–d and Supplementary Figs. 2 and 3; also see the African Functional Genomics Resource (AFGR)17). All 731 samples were sequenced in a single laboratory across 17 batches, and sample populations were stratified across batches to avoid confounding between population and batch (Supplementary Fig. 5). We quantified gene expression levels using gene annotations from GENCODE (v.38) and used an annotation agnostic approach implemented by LeafCutter18 to quantify alternative splicing patterns (Supplementary Methods and Supplementary Fig. 4).
Gene expression and splicing diversity
The majority of variation in DNA sequence is distributed within as opposed to between human populations19,20. Previous studies have explored the extent to which this pattern holds for gene expression diversity, finding that population labels explain 3–25% of the total variation in gene expression4,13. However, these studies were limited by either sample size or diversity, motivating our analysis within MAGE.
To this end, we fit a linear model relating the expression level of each gene with continental group and population labels from the 1KGP. After regressing out sequencing batch and sex effects, continental group explained an average of 2.92% of variance in gene expression level across tested genes (s.d. = 3.18%), whereas population label explained an average of 8.40% of variance (s.d. = 4.43%; Fig. 2a). Although small, these proportions exceed null expectations assuming no population structure (one-tailed permutation test: Pcontinental group, Ppopulation < 1 × 10−3). Notably, the proportion of variance explained was smaller, on average, than reported in a previous study that included samples from the San population, whose ancestors diverged (with subsequent gene flow) from other populations in the dataset >100,000 years ago13,21.
We observed similar patterns for alternative splicing, whereby—after regressing out technical variation—continental group and population explained an average of only 1.23% (s.d. = 1.93%) and 4.58% (s.d. = 2.24%) of variance, respectively (Fig. 2c; one-tailed permutation test: Pcontinental group, Ppopulation < 1 × 10−3). The proportions of variance in gene expression and splicing explained by population label are not directly comparable because of differences in their units of measurement. However, our observations are qualitatively consistent with previous reports that expression level varies more between populations than splicing13.
Notably, we also observed that within-population variance in expression (one-tailed analysis of deviance: χ2 (4, N = 100,890) = 17,623, P < 1 × 10−10) and splicing (one-tailed analysis of deviance: χ2 (4, N = 164,335) = 1550.6, P < 1 × 10−10) differed among continental groups. That is, there were higher average variances (across all tested genes) observed within the African continental group compared with the Admixed American continental group (Fig. 2b,d and Supplementary Fig. 7). These results are consistent with the demonstrated decline in genetic diversity resulting from serial founder effects during human global migrations22,23. Although significant, the magnitudes of these differences in variances were smaller than the magnitude of the decline in genetic diversity, which probably reflect the non-genetic environmental and stochastic contributions to gene expression and splicing variance that similarly affect all samples.
Genetic effects on gene expression
Mapping eQTLs and sQTLs at high resolution
MAGE offers a valuable resource for uncovering the genetic factors that drive variation in gene expression and splicing, including genetic variation that is largely private to historically underrepresented populations. By intersecting published genotype data from the same set of samples24, we mapped cis-eQTLs and cis-sQTLs within 1 Mb of the transcription start site (TSS) of each gene. We define eGenes and sGenes as genes with an eQTL or sQTL, respectively, and eVariants and sVariants as the individual genetic variants defining an eQTL or sQTL signal, respectively. We note that although we performed QTL mapping for genes on the autosomes and the X chromosome, we focus on results from the autosomes here owing to several methodological details that are specific to the X chromosome (Supplementary Methods). Across 19,539 autosomal genes that passed expression-level filtering thresholds (Supplementary Methods), we discovered 15,022 eGenes and 1,968,788 unique eVariants (3,538,147 significant eVariant–eGene pairs; 5% false discovery rate (FDR)). Additionally, across 11,912 autosomal genes that passed splicing-filtering thresholds, we discovered 7,727 sGenes and 1,383,540 unique sVariants (2,416,177 significant sVariant–sGene pairs; 5% FDR).
The inclusion of genetically diverse samples in association studies reduces the extent of LD and improves mapping resolution8,10 (Supplementary Fig. 11). With this advantage in mind, we used SuSiE25 to perform fine mapping for all eGenes and the introns of all sGenes to identify causal variants that drive each QTL signal. For each gene and intron, SuSiE identifies one or more credible sets, representing independent causal eQTL and sQTL signals and whereby each credible set contains as few variants as possible while maintaining a high probability of containing the causal variant. To obtain a gene-level summary of the sQTL fine-mapping results, we collapsed intron-level credible sets into gene-level credible sets by iteratively merging intron-level credible sets for each sGene (Supplementary Methods). We identified at least one credible set for 9,807 (65%) eGenes and 6,604 (85%) sGenes, which we define as fine-mapped eGenes and sGenes, respectively. Consistent with previous results4,26,27, we observed widespread allelic heterogeneity across fine-mapped genes, with 3,951 (40%) of fine-mapped eGenes and 3,490 (53%) of fine-mapped sGenes exhibiting more than one distinct credible set (Fig. 3a and Extended Data Fig. 2c). We also achieved high resolution in identifying putative causal variants that drive expression changes. That is, of 15,664 eQTL credible sets, 3,992 (25%) contained a single variant (median 5 variants per credible set; mean = 15.8, s.d. = 65.7; Fig. 3b). Similarly, for sQTLs, 3,569 out of 16,451 (22%) credible sets contained a single variant (median 7 variants per credible set; mean = 23.6, s.d. = 99.1; Extended Data Fig. 2d). For downstream analyses, we selected a single representative ‘lead QTL’ from each eGene and sGene gene-level credible set.
For each lead eQTL, we calculated its effect size using an implementation of the allelic fold change (aFC)28 statistic that quantifies eQTL effect sizes conditional on all other lead eQTLs for that gene (Supplementary Methods). We observed that 2,031 (13%) lead eQTLs had a greater than twofold effect on gene expression (median |log2(aFC)| = 0.30; mean = 0.51, s.d. = 0.64; Extended Data Fig. 1). This was a slightly smaller proportion than previously reported by GTEx26, but we propose that this is partially explained by the small sample sizes in some GTEx tissues, which drives a stronger ‘winner’s curse’, whereby effects are systematically overestimated29.
Evidence of selective constraint
Previous studies of large population cohorts have identified sets of genes under strong mutational constraint, whereby negative selection has depleted loss-of-function point mutations and copy number variation30. One metric for quantifying mutational constraint on genes is the probability of intolerance to loss-of-function mutations (pLI)30. In our data, we observed that eGenes possessed significantly lower mean pLI scores (mean = 0.261, s.d. = 0.395) than non-eGenes (mean = 0.304, s.d. = 0.409; two-tailed Wilcoxon rank-sum test: W = 11,596,590, P = 3.89 × 10−7). Additionally, highly constrained eGenes (top 10% of pLI) tended to possess fewer credible sets (mean = 0.80, s.d. = 0.82) than other eGenes (mean = 1.12, s.d. = 1.04; two-tailed quasi-Poisson generalized linear model: = −0.354, P = 5.91 × 10−25; Fig. 3c). Moreover, the average effect size of lead eQTLs within highly constrained genes (mean |log2(aFC)| = 0.25; s.d. = 0.36) was smaller than that of other genes (mean |log2(aFC)| = 0.53; s.d. = 0.65; two-tailed Wilcoxon rank sum test: W = 3,789,053, P = 1.87 × 10−96; Fig. 3d). This difference was apparent regardless of whether the minor allele is associated with higher (Δmean |log2(aFC)| = −0.277; two-tailed Wilcoxon rank-sum test: W = 928,592, P = 1.39 × 10−50) or lower expression (Δmean |log2(aFC) = −0.268; two-tailed Wilcoxon rank sum test: W = 967,228, P = 2.97 × 10−47), consistent with a model of stabilizing selection whereby gene expression is maintained within an optimal range. These results indicate an association between constraint against loss-of-function protein-coding sequence variation (that is, pLI) and constraint against expression-altering variation (that is, number of credible sets and eQTL effect sizes). This association held for several other metrics of mutational constraint that include intolerance to copy number variation (that is, pHaplo and pTriplo) as well as divergence-based estimates of sequence conservation in putative promoter elements (Extended Data Fig. 3). Together, our results are consistent with previous analyses demonstrating weak, but measurable, selection against expression-altering variation31.
Functional enrichment of QTLs
Taking advantage of the high resolution of putative causal signals, we quantified the enrichment of fine-mapped lead eQTLs in 15 predicted chromatin-state annotations across 127 reference epigenomes from the Roadmap Epigenomics chromHMM model32. Enrichment was most pronounced within promoter regions, specifically at active TSSs (TssA) and flanking regions (TssAFlnk), but modest enrichments were also apparent within enhancer regions (Enh and EnhG), especially for blood cell types (Fig. 4a and Supplementary Fig. 12B). Conversely, quiescent, repressive and heterochromatic regions were depleted of eQTLs. We further extended our analysis to primary DNase hypersensitivity site (DHS) annotations, and we observed a strong enrichment of lead eQTLs in DHSs of blood and T cell samples (Supplementary Fig. 12C).
Focusing on data from LCLs, we next explored the relationship between epigenomic enrichments and eQTL effect sizes (|log2(aFC)|). Promoter-associated enrichment was consistent across eQTL effect size deciles, and enrichment within poised regulatory regions such as bivalent TSS (TSSBiv) and bivalent enhancers (EnhBiv) was most apparent for eQTLs of large effect sizes (Supplementary Fig. 13A,B). By contrast, eQTLs located within chromatin states associated with transcribed regions (Tx, TxWk and TxFlnk) predominantly exhibited lower effect sizes (Supplementary Fig. 13C). These qualitative trends were replicated in other primary blood cell types (Supplementary Figs. 14–17). Using additional DHS-based annotations from Roadmap Epigenomics32, we observed larger median eQTL effect sizes in promoter regions relative to enhancers and dyadic (that is, acting as both promoter and enhancer) regions (Fig. 4b). This pattern was similarly replicated across other primary blood-related cell types (Supplementary Figs. 14–17). Using chromatin immunoprecipitation followed by sequencing data from ENCODE33, we also observed that lead eQTLs were significantly enriched within 312 (92.30%; Bonferroni-adjusted P < 0.05) transcription factor (TF) binding sites, including canonical promoter-associated TFs such as POLR2A, TAF1, JUND, ATF2 and KLF5, as well as TFs such as HDACs, EP300 and YY1, which are typically associated with enhancers (Supplementary Fig. 12A).
We also investigated the genomic context of our fine-mapped cis-sQTLs. We observed strong enrichment of lead sQTLs in several key splicing-relevant annotations, including splice donor sites (log2(fold enrichment) = 6.07, 95% confidence interval (CI) = 4.09–8.04) splice acceptor sites (log2(fold enrichment) = 5.52, 95% CI = 3.54–7.50) and nearby regions (log2(fold enrichment) = 4.15, 95% CI = 3.70–4.62) at intron–exon boundaries (Fig. 4c). Despite their magnitude of enrichment, variants in canonical splice sites and splice regions represented a minority of lead sQTLs, with a greater abundance of sQTLs falling within 5′ and 3′ untranslated regions (UTRs), as well as exons of both coding and noncoding genes. Although exhibiting weaker enrichments, these annotation categories together covered a much larger mutational target size and may encompass splicing enhancers and cryptic splice sites. By contrast, intergenic regions were strongly depleted of lead sQTLs (log2(fold enrichment) = −2.51, 95% CI = −2.58 to −2.43). Together, these findings provide support for the biological validity of the fine-mapped cis-QTLs and insight into the mechanisms by which these variants affect gene expression and splicing.
Colocalization of eQTLs and sQTLs and GWAS hits
To explore the role of expression-associated genetic variation in human complex traits, we next sought to discover shared signals between fine-mapped MAGE cis-eQTLs and cis-sQTLs and results from genome-wide association studies (GWAS). As a multi-ancestry resource, we anticipate that MAGE will facilitate the interpretation of GWAS from underrepresented populations. One such cohort is the Population Architecture using Genomics and Epidemiology (PAGE) study8, which comprises 49,839 non-European individuals, including large samples of individuals who self-reported as Hispanic/Latin American or African American, as well as smaller samples of individuals who self-reported as Asian, Native Hawaiian or Native American. We performed colocalization analysis to identify shared signals between GWAS of 25 complex traits from PAGE and cis-eQTLs and cis-sQTLs from MAGE. PAGE GWAS data include quantitative biomedical traits such as platelet count and cholesterol levels, as well as diseases such as type 2 diabetes (see Supplementary Table 1 for a full list of the traits included in this analysis).
Across these 25 traits, we identified 384 independent GWAS signals. For each independent GWAS signal, we tested for eQTL colocalization with each eGene within 500 kbp. We implemented this analysis using a combination of SuSiE25 and coloc34,35 to allow for multiple causal variants at each signal and to allow for different patterns of LD between the two datasets. We defined moderate colocalizations as those with posterior probabilities ≥ 0.5 and strong colocalizations as those with posterior probabilities ≥ 0.8.
Using this approach, we identified moderate colocalizations with MAGE cis-eQTLs for 39 independent GWAS signals across 14 traits and strong colocalizations for 25 independent GWAS signals across 13 traits (Supplementary Fig. 18). These included 6 GWAS signals across 6 traits for which the GWAS variant was rare (minor allele frequency (MAF) < 0.05) or unobserved in the European continental group in the 1KGP. Among these, one notable result involved colocalization (Pcoloc = 0.998) between a platelet count GWAS hit (sentinel variant rs73517714) and an eQTL hit of the tropomyosin gene TPM4, whereby the lead eQTL variant (rs143558304) falls within the 3′ UTR. Previous work has implicated rare missense variants in TPM4 with platelet abnormalities and excessive bleeding36, findings that provide support for a role of this gene in platelet function. The MAGE lead eQTL and the GWAS sentinel variant were in strong LD (R2 = 0.874 in MAGE) and were rare (MAF < 0.05) in the European continental group of the 1KGP but more common in the African continental group.
We repeated this colocalization analysis for MAGE sQTLs. Across the same set of 384 GWAS signals, we identified moderate colocalizations with MAGE cis-sQTLs for 30 independent GWAS signals across 12 traits and strong colocalizations for 24 independent GWAS signals across 10 traits (Supplementary Fig. 18). These included three GWAS signals across two traits for which the GWAS variant was rare or unobserved in the European continental group in the 1KGP. Together these results highlight the utility of paired globally diverse gene expression and WGS datasets like MAGE and 1KGP, respectively, in interpreting complex trait GWAS of non-European cohorts.
Population-specificity of QTLs
A fundamental question in association studies is the extent to which genetic associations replicate across human groups and the underlying factors that drive heterogeneity between groups. Several previous studies have demonstrated that although QTL effects are strongly correlated across populations12,37, the predictive power of association study summary statistics (for example, polygenic scores) declines when applied to groups whose ancestry does not match that of the discovery sample9,14. The underlying causes of such poor portability is a topic of active debate37,38, and several non-mutually exclusive explanations have been proposed: (1) differences in the allele frequency (AF) of causal variants between groups can lead to differential statistical power; (2) differences in patterns of LD (either between a tag variant and a causal variant or between multiple causal variants) between groups can lead to nominal effect size heterogeneity; and (3) epistatic interaction between multiple causal variants, one or both of which vary in AF across groups can lead to nominal effect size heterogeneity. Gene-by-environment interactions may also drive effect size heterogeneity, but we anticipate that such interactions are less relevant to our data given the common conditions used for deriving and culturing immortalized LCLs, as well as the block-randomized nature of the experimental design (Supplementary Methods).
To gain insight into the relative importances of these phenomena, we identified and characterized two broad classes of population-specific QTLs: (1) QTLs for which AF differs between continental groups (which we term frequency differentiated QTLs (fd-QTLs)) and (2) QTLs that exhibit effect size heterogeneity between continental groups (which we term heterogeneous effect QTLs (he-QTLs)). We consider each class in turn in the subsequent sections.
Frequency differentiation of QTLs
We proposed that the diversity of our sample would facilitate discovery of new QTLs that are private to populations that were underrepresented in previous molecular association studies. To test this hypothesis, we evaluated the frequency distribution across continental groups of the 15,664 fine-mapped lead eQTLs in MAGE. We observed that 8,837 (56.4%) lead eQTLs are ‘globally common’ (MAF > 0.05 in each continental group), a result consistent with the fact that statistical power for eQTL discovery scales with MAF and that most common variation is shared across human populations39,40 (Extended Data Fig. 4a,b). However, we also identified 1,310 (8.3%) lead eQTLs that are unobserved in the European continental group but present in one or more other continental groups (Extended Data Fig. 4c). An additional 115 (0.6%) lead eQTLs are unobserved in both European and African ancestry groups (Extended Data Fig. 4d). Qualitatively similar patterns were also apparent for sQTLs (Extended Data Fig. 5).
To further contextualize our results, we compared our eQTL fine-mapping data to that from GTEx, which largely comprises individuals of European ancestries and some African American individuals. To account for the multi-tissue nature of GTEx, we took the union of credible sets across tissues for a focal gene to compare with the credible sets for that same gene in MAGE (Supplementary Methods). Overall, we found that 8,069 MAGE credible sets (6,421 genes) replicated in GTEx compared with 7,595 credible sets (5,545 genes) that did not replicate (Fig. 5a). We additionally identified 701 genes with at least one credible set in MAGE but no apparent credible set in GTEx. Notably, we observed that lead eQTLs in MAGE that did not replicate in GTEx tended to exhibit greater geographical differentiation, with higher frequencies outside Europe relative to variants that replicated between studies, which tended to be common across all populations (Fig. 5a). Moreover, the 79,915 GTEx lead eQTLs that were not replicated in MAGE (7,913 lead eQTLs replicated) are enriched for tissue-specific effects (two-tailed Mann–Whitney U-test: P < 10−10; Extended Data Fig. 6). This was despite showing qualitatively similar patterns of functional enrichment that support their biological validity (Supplementary Fig. 19). Together, these results highlight important aspects of experimental design across multiple axes of diversity, such as ancestry and tissue composition, that shape the statistical findings of molecular QTL studies.
One example of a fd-eQTL that we identified was rs115070172, for which the T allele was common (AF > 0.05) only within the Admixed American continental group in MAGE and was at high frequency (AF = 0.63) in the Peruvian population (Fig. 5b). This variant was the lead eQTL for one of two credible sets of GSTP1, a tumour suppressor gene for which its expression has been implicated in breast cancer41–43. The T allele of the rs115070172 variant was significantly associated with lower expression of GSTP1 (Fig. 5c). Intersection with epigenomic data indicated that this fine-mapped lead eQTL lies within a putative enhancer region (Supplementary Fig. 20). Notably, the expression of GSTP1 was significantly lower in individuals from the Peruvian population compared with other global populations (Fig. 5d), and we propose that this eQTL signal may explain this trend.
To more broadly examine the role of fd-eQTLs in driving differential gene expression between continental groups, we quantified FST values44 for each lead eQTL and intersected these values with the differential expression results for the respective eGene for each eQTL (Fig. 5e). Among continental groups, differentially expressed eGenes (FDR-adjusted P ≤ 0.05) possessed higher FST values than non-differentially expressed eGenes (two-tailed Mann–Whitney U-test: Z = 0.022 ± 0.001, 95% CI; P < 0.05). This result suggests that gene expression differences across populations can be attributed to frequency differentiation of causal eQTLs.
Consistency of eQTL effects
We sought to test for he-eQTLs in MAGE given recent debates about their prevalence and causes8,15,37,38,45. Because the genotypes are derived from high-coverage whole-genome sequencing in the 1KGP, MAGE should be robust to effect size heterogeneity resulting from population-specific LD patterns with untyped casual variants (which commonly affects microarray data), barring large structural variation that may escape detection with short-read sequencing. This enables investigation of other sources of effect size heterogeneity.
For each fine-mapped eGene, we first assessed whether its top nominal pass eQTL exhibited effect size heterogeneity between continental groups by fitting a model that included a genotype-by-continental group interaction term. Across 8,376 top nominal pass eQTLs that passed filtering (MAF ≥ 0.05 in at least two continental groups), 70 (0.84%) exhibited a significant genotype-by-continental group interaction after Bonferroni correction (Fig. 5f). Notably, we observed that eGenes with more fine-mapped credible sets were more likely to exhibit significant interaction effects, which suggested that the additive effects of multiple causal variants may drive apparent interaction effects.
To test this hypothesis, we discovered he-eQTLs from among our fine-mapped eQTLs. For each fine-mapped eGene, we included the lead eQTL from each of its credible set (or sets) as predictors and tested for a genotype-by-continental group interaction effect for one lead eQTL at a time. Supporting our hypothesis, 64 (91%) of eGenes with a significant interaction effect had no significant interaction effects after controlling for the additive effects of multiple causal signals for that gene (Fig. 5f). The few remaining interaction effects (9 eQTLs; 0.07% of all eQTLs that passed filtering) may be driven by non-additive epistatic interactions between variants, by additional untested causal variants that did not meet nominal MAF thresholds or by population-specific LD patterns with untyped causal variants. Qualitatively similar patterns were observed when testing for interactions between genotype and global genotype principal components (Extended Data Fig. 7).
An alternative approach based on stratified eQTL mapping and fine mapping within each continental group (Supplementary Methods) likewise indicated high consistency in effect sizes, such that 97.5–99.8% of credible sets had similar effect sizes between pairs of continental groups (Extended Data Fig. 8 and Supplementary Fig. 21). Together, these results indicate that effect size heterogeneity of eQTLs between populations is rare, and apparent heterogeneity may instead reflect the failure to control for the additive effects of multiple independent causal signals.
Discussion
Combined with existing whole-genome sequencing data from the same samples24, MAGE offers a large open-access dataset for studying the diversity and evolution of human gene expression and splicing. Our study also offers insight into the genetic sources of variation in these key molecular phenotypes, which may in turn mediate variation in organismal traits. By evenly spanning samples from all 26 populations of the 1KGP6, MAGE includes several ancestry groups that were poorly represented in previous molecular association studies7.
The scale and diversity of the dataset enabled the discovery of numerous potentially new genetic associations while offering high resolution for identifying putatively causal variants and elucidating their mechanisms of action. Our study also demonstrated that the effect sizes of eQTLs are highly consistent across populations, which implies that trans-genetic effects (driven by global ancestry patterns), if adequately controlled for, generally do not have a strong impact on the effects of causal variants in cis. Although addressing a point of recent debate within the field8,15,37,38,45,46, this conclusion is in agreement with several previous studies that used orthogonal approaches for evaluating effect size heterogeneity based on analysis of admixed individuals37,45. This finding is encouraging for predictive applications such as polygenic risk scores and transcriptome-wide association studies, as it suggests that models that focus on causal signals and do not make assumptions about the number of such signals may exhibit better portability between groups. The extent to which this conclusion applies to more polygenic complex traits is an open question, but a recent study that investigated effect size heterogeneity in admixed individuals across 38 complex traits found that nominal effect sizes are consistent across local ancestries37. Such consistency of genetic effects further motivates the use of diverse samples for association studies, as a common causal variant identified in one population may inform the effect of that variant in a population in which the same variant is rare and association testing would be underpowered. Thus, all populations—not only underrepresented populations—benefit from the inclusion of greater diversity in genetic studies, which empowers more accurate and generalizable predictions for personalized medicine47.
Intersection of eQTLs and sQTLs with data from GWAS may facilitate understanding of the molecular mechanisms that link genetic variation to organismal phenotypes. Using GWAS data from the PAGE study of ancestrally diverse individuals, we identified 54 GWAS signals that colocalize with eQTL and sQTL signals. Although informative and substantial in absolute number, these reflect a minority of all GWAS hits. Limited colocalization between molecular QTLs and GWAS hits is well described, largely stemming from distinct selective pressures shaping genetic variation that can be identified (with incomplete statistical power) in the two analyses48. GWAS hits tend to occur within genes under strong purifying selection, whereas molecular QTLs are most easily identified for genes under relaxed constraint. This is consistent with our finding that genes exhibiting strong signals of selection are depleted of MAGE eQTLs. Although inclusion of additional tissues and cell types modestly increases the rate of colocalization, these qualitative observations hold even for multi-tissue studies such as GTEx48. Despite these general limitations of colocalization analyses, our results demonstrated instances in which MAGE facilitated the interpretation of GWAS results, particularly in underrepresented populations. We anticipate that this utility will further improve as GWAS continue to expand to more diverse cohorts.
By design, our study focused on a single cell type of LCLs, which offers a useful model for studying gene expression given their low somatic mutation rates and robust gene expression patterns encompassing key metabolic pathways49. Although this enabled us to mitigate the effects of environmental variation and to compare our results to related studies performed in the same cell lines4, future studies may seek to understand ancestry differences in expression across developmental, cellular and other environmental contexts, including with respect to dynamic QTLs for which effects vary based on those contexts50. Future studies of diverse cohorts may also leverage new technologies (such as long-read genome, cDNA or direct RNA-seq51–53) to achieve higher resolution for isoform detection as well as improved analysis of genes that occur within highly repetitive or structurally complex regions.
Finally, although geographically diverse, the sampling of the 1KGP is not without biases, for example, narrowly sampling the vast diversity within Africa and excluding indigenous populations from Oceania and the Americas, as well as countless other populations. Addressing these biases will require deeper community engagement and respect for the rights, interests and expectations of research participants from diverse human groups54. This expansion of diversity in functional genomics parallels efforts for improved representation of diversity in genome sequencing and assembly, including construction of pangenomes55,56. Although the current study was based on alignment to a linear representation of the reference genome, given the maturity of software tools and annotations built on this paradigm, MAGE offers a valuable data resource for testing pangenomic methods over the coming decade as they are developed by the research community.
Our work provides a more complete picture of the links between genetic variation and genome function across diverse populations, as well as the evolutionary forces that have shaped this variation within our species. Complemented by existing high-coverage whole-genome sequencing data, we anticipate that this dataset will serve as a valuable resource to facilitate future research into the complex genetic basis of variation in human genome function.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41586-024-07708-2.
Supplementary information
Acknowledgements
Thanks to S. Montgomery and members of the Montgomery Laboratory; J. Leek; members of the McCoy Lab, the Department of Biology and the Center for Computational Biology at Johns Hopkins for helpful discussion and feedback; C. Runnels for initial exploration of signatures of selection; B. Huang for initial exploration of splicing enrichments; the staff of Advanced Research Computing at Hopkins for support; and M. Mitchell and other staff at the Coriell Institute, P. Boyle, B. Kerr, A. James, A. Bronzato-Badial, M. Carter and other staff at Genewiz/Azenta for assistance in generating RNA-seq data. R.C.M., M.G.T. and A. Biddanda are supported by NIH/NIGMS award R35GM133747 and NIH Award OT2OD034190. D.J.T. is supported by NIH/NHGRI award F31HG012900. S.M.Y. is supported by NIH/NHGRI award F31HG012495. A. Battle is supported by NIH/NIGMS award R35GM139580. G.L.W. is supported by NIH/NHGRI award R35HG011944. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Extended data figures and tables
Author contributions
D.J.T. and R.C.M. initially conceived the project. Raw data generation was managed by D.J.T. and R.C.M. Alignment was done by R.C.M. Expression quantification was done by D.J.T. and M.G.T. Splicing quantification was done by M.G.T. PCA and ADMIXTURE comparisons with previous datasets were done by R.C.M. Analyses of between and within population expression and splicing diversity were done by D.J.T. and M.G.T. All eQTL and sQTL mapping and fine mapping, as well as eQTL effect size estimation were done by D.J.T. Intersection of eQTLs with selection metrics was done by R.C.M. and A. Biddanda. Functional annotation and characterization of eQTLs and sQTLs and subsequent analyses were performed by S.B.C. and A. Battle. Colocalization analysis with PAGE was performed by D.J.T. with data and guidance provided by G.L.W. Quantification of population-stratified eQTLs and sQTLs (fd-QTLs) was performed by A. Biddanda. Comparison of MAGE and GTEx fine-mapped eQTLs was performed by A. Biddanda. Analyses of the correlation between fd-eQTLs and differential gene expression between populations was performed by M.G.T. Interaction test-based discovery of he-eQTLs was performed by D.J.T. Population-stratified discovery of eQTLs and subsequent he-eQTL analysis was performed by S.M.Y. D.J.T. curated published output data. D.J.T., S.B.C., M.G.T., A. Biddanda, S.M.Y. and R.C.M. wrote the manuscript. D.J.T. and R.C.M. edited the manuscript, with the assistance of all authors.
Peer review
Peer review information
Nature thanks Heather Wheeler and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
Newly generated RNA-seq data for the 731 individuals (779 total libraries) are available from the Sequence Read Archive (accession: PRJNA851328). Processed gene expression matrices and QTL mapping results are available from Zenodo (10.5281/zenodo.10535719)60. Alignment was performed using the GRCh38 reference genome (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_000001405.15/), and gene and transcript annotations were obtained from GENCODE (v.38; https://www.gencodegenes.org/human/release_38.html). Our study also relied on published whole-genome sequencing data from 1KGP samples (https://www.internationalgenome.org/data-portal/data-collection/30x-grch38). QTL mapping results were compared with GTEx (dbGaP accession: phs000424.v9.p2) and Geuvadis (EBI ArrayExpress accessions: E-GEUV-1, E-GEUV-2 andE-GEUV−3). Functional enrichment analyses relied on annotations from ENCODE (https://www.encodeproject.org/data/annotations/) and Roadmap Epigenomics (https://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html). Colocalization analyses were conducted using data from the PAGE study (harmonized GWAS sumstats from the GWAS Catalog: https://www.ebi.ac.uk/gwas/publications/31217584; genotypes from TOPMED, dbGaP accession: phs001974.v5.p1).
Code availability
Code used for the analyses presented in this paper is available on GitHub (https://github.com/mccoy-lab/MAGE) and archived on Zenodo (10.5281/zenodo.10072080)61.
Competing interests
A. Battle is a co-founder and equity holder of CellCipher, a stockholder in Alphabet, and has consulted for Third Rock Ventures. The other authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
is available for this paper at 10.1038/s41586-024-07708-2.
Supplementary information
The online version contains supplementary material available at 10.1038/s41586-024-07708-2.
References
- 1.Li, Y. I. et al. RNA splicing is a primary link between genetic variation and disease. Science352, 600–604 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Brem, R. B., Yvert, G., Clinton, R. & Kruglyak, L. Genetic dissection of transcriptional regulation in budding yeast. Science296, 752–755 (2002). [DOI] [PubMed] [Google Scholar]
- 3.Morley, M. et al. Genetic analysis of genome-wide variation in human gene expression. Nature430, 743–747 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature501, 506–511 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.GTEx Consortium. Genetic effects on gene expression across human tissues. Nature550, 204–213 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature526, 68–74 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Popejoy, A. B. & Fullerton, S. M. Genomics is failing on diversity. Nature538, 161–164 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Wojcik, G. L. et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature570, 514–518 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Martin, A. R. et al. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet.107, 788–789 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Kita, R., Venkataram, S., Zhou, Y. & Fraser, H. B. High-resolution mapping of cis-regulatory variation in budding yeast. Proc. Natl Acad. Sci. USA114, E10736–E10744 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Storey, J. D. et al. Gene-expression variation within and among human populations. Am. J. Hum. Genet.80, 502–509 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Stranger, B. E. et al. Patterns of cis regulatory variation in diverse human populations. PLoS Genet.8, e1002639 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Martin, A. R. et al. Transcriptome sequencing from diverse human populations reveals differentiated regulatory architecture. PLoS Genet.10, e1004549 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Mogil, L. S. et al. Genetic architecture of gene expression traits across diverse populations. PLoS Genet.14, e1007586 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Kachuri, L. et al. Gene expression in African Americans, Puerto Ricans and Mexican Americans reveals ancestry-specific patterns of genetic architecture. Nat. Genet.55, 952–963 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Carlson, J., Henn, B. M., Al-Hindi, D. R. & Ramachandran, S. Counter the weaponization of genetics research by extremists. Nature610, 444–447 (2022). [DOI] [PubMed] [Google Scholar]
- 17.DeGorter, M. K. et al. Transcriptomics and chromatin accessibility in multiple African population samples. Preprint at bioRxiv10.1101/2023.11.04.564839 (2023).
- 18.Li, Y. I. et al. Annotation-free quantification of RNA splicing using LeafCutter. Nat. Genet.50, 151–158 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Lewontin, R. C. in Evolutionary Biology (eds Dobzhansky, T. et al.) 381–398 (Springer US, 1972).
- 20.Jorde, L. B. et al. The distribution of human genetic diversity: a comparison of mitochondrial, autosomal, and Y-chromosome data. Am. J. Hum. Genet.66, 979–988 (2000). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Bergström, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science367, eaay5012 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Ramachandran, S. et al. Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc. Natl Acad. Sci. USA102, 15942–15947 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Prugnolle, F., Manica, A. & Balloux, F. Geography predicts neutral genetic diversity of human populations. Curr. Biol.15, R159–R160 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Byrska-Bishop, M. et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell185, 3426–3440.e19 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Zou, Y., Carbonetto, P., Wang, G. & Stephens, M. Fine-mapping from summary data with the ‘Sum of Single Effects’ model. PLoS Genet.18, e1010299 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.The GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science369, 1318–1330 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Jansen, R. et al. Conditional eQTL analysis reveals allelic heterogeneity of gene expression. Hum. Mol. Genet.26, 1444–1451 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Mohammadi, P., Castel, S. E., Brown, A. A. & Lappalainen, T. Quantifying the regulatory effect size of cis-acting genetic variation using allelic fold change. Genome Res.27, 1872–1884 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Huang, Q. Q., Ritchie, S. C., Brozynska, M. & Inouye, M. Power, false discovery rate and Winner’s Curse in eQTL studies. Nucleic Acids Res.46, e133 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature536, 285–291 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Glassberg, E. C., Gao, Z., Harpak, A., Lan, X. & Pritchard, J. K. Evidence for weak selective constraint on human gene expression. Genetics211, 757–772 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.The Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature518, 317–330 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature489, 57–74 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Wallace, C. A more accurate method for colocalisation analysis allowing for multiple causal variants. PLoS Genet.17, e1009440 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet.10, e1004383 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Stapley, R. J. et al. Rare missense variants in tropomyosin-4 (TPM4) are associated with platelet dysfunction, cytoskeletal defects, and excessive bleeding. J. Thromb. Haemost.20, 478–485 (2022). [DOI] [PubMed] [Google Scholar]
- 37.Hou, K. et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nat. Genet.55, 549–558 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Patel, R. A. et al. Genetic interactions drive heterogeneity in causal variant effect sizes for gene expression and complex traits. Am. J. Hum. Genet.109, 1286–1297 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Visscher, P. M. et al. 10 years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet.101, 5–22 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H. & Bustamante, C. D. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet.5, e1000695 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Fang, C. et al. Aberrant GSTP1 promoter methylation is associated with increased risk and advanced stage of breast cancer: a meta-analysis of 19 case–control studies. BMC Cancer15, 920 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Louie, S. M. et al. GSTP1 is a driver of triple-negative breast cancer cell metabolism and pathogenicity. Cell Chem. Biol.23, 567–578 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Arai, T. et al. Association of GSTP1 CpG islands hypermethylation with poor prognosis in human breast cancers. Breast Cancer Res. Treat.100, 169–176 (2006). [DOI] [PubMed] [Google Scholar]
- 44.Weir, B. S. & Cockerham, C. C. Estimating F-statistics for the analysis of population structure. Evolution38, 1358–1370 (1984). [DOI] [PubMed] [Google Scholar]
- 45.Saitou, M., Dahl, A., Wang, Q. & Liu, X. Allele frequency differences of causal variants have a major impact on low cross-ancestry portability of PRS. Preprint at medRxiv10.1101/2022.10.21.22281371 (2022).
- 46.Rau, C. D. et al. Modeling epistasis in mice and yeast using the proportion of two or more distinct genetic backgrounds: evidence for ‘polygenic epistasis’. PLoS Genet.16, e1009165 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Weissbrod, O. et al. Leveraging fine-mapping and multipopulation training data to improve cross-population polygenic risk scores. Nat. Genet.54, 450–458 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Mostafavi, H., Spence, J. P., Naqvi, S. & Pritchard, J. K. Systematic differences in discovery of genetic effects on gene expression and complex traits. Nat. Genet.55, 1866–1875 (2023). [DOI] [PubMed]
- 49.Cheung, V. G. et al. Natural variation in human gene expression assessed in lymphoblastoid cells. Nat. Genet.33, 422–425 (2003). [DOI] [PubMed] [Google Scholar]
- 50.Strober, B. J. et al. Dynamic genetic regulation of gene expression during cellular differentiation. Science364, 1287–1290 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Workman, R. E. et al. Nanopore native RNA sequencing of a human poly(A) transcriptome. Nat. Methods16, 1297–1305 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Glinos, D. A. et al. Transcriptome variation in human tissues revealed by long-read sequencing. Nature608, 353–359 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Reese, F. et al. The ENCODE4 long-read RNA-seq collection reveals distinct classes of transcript structure diversity. Preprint at bioRxiv10.1101/2023.05.15.540865 (2023).
- 54.Claw, K. G. et al. A framework for enhancing ethical genomic research with Indigenous communities. Nat. Commun.9, 2957 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Liao, W.-W. et al. A draft human pangenome reference. Nature617, 312–324 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Sibbesen, J. A. et al. Haplotype-aware pantranscriptome analyses using spliced pangenome graphs. Nat. Methods20, 239–247 (2023). [DOI] [PubMed] [Google Scholar]
- 57.The International HapMap Consortium. The International HapMap Project. Nature426, 789–796 (2003). [DOI] [PubMed] [Google Scholar]
- 58.Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res.19, 1655–1664 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Marcus, J. H. & Novembre, J. Visualizing the geography of genetic variants. Bioinformatics33, 594–595 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Taylor. D & McCoy, R. MAGE: Multi-ancestry Analysis of Gene Expression v1.0. Zenodo10.5281/zenodo.10535719 (2024).
- 61.Taylor, D., McCoy, R., Biddanda, A. & Tassia, M. mccoy-lab/MAGE: MAGE v.1.0.0. Zenodo10.5281/zenodo.10072080 (2023).
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Newly generated RNA-seq data for the 731 individuals (779 total libraries) are available from the Sequence Read Archive (accession: PRJNA851328). Processed gene expression matrices and QTL mapping results are available from Zenodo (10.5281/zenodo.10535719)60. Alignment was performed using the GRCh38 reference genome (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_000001405.15/), and gene and transcript annotations were obtained from GENCODE (v.38; https://www.gencodegenes.org/human/release_38.html). Our study also relied on published whole-genome sequencing data from 1KGP samples (https://www.internationalgenome.org/data-portal/data-collection/30x-grch38). QTL mapping results were compared with GTEx (dbGaP accession: phs000424.v9.p2) and Geuvadis (EBI ArrayExpress accessions: E-GEUV-1, E-GEUV-2 andE-GEUV−3). Functional enrichment analyses relied on annotations from ENCODE (https://www.encodeproject.org/data/annotations/) and Roadmap Epigenomics (https://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html). Colocalization analyses were conducted using data from the PAGE study (harmonized GWAS sumstats from the GWAS Catalog: https://www.ebi.ac.uk/gwas/publications/31217584; genotypes from TOPMED, dbGaP accession: phs001974.v5.p1).
Code used for the analyses presented in this paper is available on GitHub (https://github.com/mccoy-lab/MAGE) and archived on Zenodo (10.5281/zenodo.10072080)61.