American Journal of Human Genetics. 2023 Sep 26;110(10):1628–1647. doi: 10.1016/j.ajhg.2023.09.001

Frequencies of pharmacogenomic alleles across biogeographic groups in a large-scale biobank

Binglan Li 1, Katrin Sangkuhl 1, Ryan Whaley 1, Mark Woon 1, Karl Keat 2, Michelle Whirl-Carrillo 1, Marylyn D Ritchie 3,4, Teri E Klein 1,5,6
PMCID: PMC10577080  PMID: 37757824

Summary

Pharmacogenomics (PGx) is an integral part of precision medicine and contributes to the maximization of drug efficacy and reduction of adverse drug event risk. Accurate information on PGx allele frequencies improves the implementation of PGx. Nonetheless, curating such information from published allele data is time and resource intensive. The limited number of allelic variants in most studies leads to an underestimation of certain alleles.

We applied the Pharmacogenomics Clinical Annotation Tool (PharmCAT) to an integrated 200K UK Biobank genetic dataset (N = 200,044). Based on PharmCAT results, we estimated PGx frequencies (alleles, diplotypes, phenotypes, and activity scores) for 17 pharmacogenes in five biogeographic groups: European, Central/South Asian, East Asian, Afro-Caribbean, and Sub-Saharan African. PGx frequencies were distinct for each biogeographic group. Even biogeographic groups with similar proportions of phenotypes were driven by different sets of dominant PGx alleles. PharmCAT also identified “no-function” alleles that were rare or seldom tested in certain groups by previous studies, e.g., SLCO1B1*31 in the Afro-Caribbean (3.0%) and Sub-Saharan African (3.9%) groups.

Estimated PGx frequencies are disseminated via the PharmGKB (The Pharmacogenomics Knowledgebase: www.pharmgkb.org). We demonstrate that genetic biobanks such as the UK Biobank are a robust resource for estimating PGx frequencies. Improving our understanding of PGx allele and phenotype frequencies provides guidance for future PGx studies and clinical genetic test panel design, and better serves individuals from wider biogeographic backgrounds.

Keywords: pharmacogenomics, pharmacogenetics, population genetics, frequency, clinical implementation, biobank


The frequencies of pharmacogenomic alleles/haplotypes, diplotypes, and phenotypes in five biogeographic groups were estimated using an integrated UK Biobank genetic data set and the Pharmacogenomics Clinical Annotation Tool (PharmCAT). Rare partial alleles and combinations of known pharmacogenomic alleles were reported. The frequencies will be disseminated via the Pharmacogenomics Knowledgebase (PharmGKB).

Introduction

Pharmacogenomics (PGx) investigates the genetic influence on drug response. Clinical implementation of PGx is made possible given extensive expert-led reviews and curations of PGx discoveries by, for example, the Pharmacogenomics Knowledgebase (PharmGKB), the Pharmacogene Variation Consortium (PharmVar), the Clinical Pharmacogenetics Implementation Consortium (CPIC), and the Dutch Pharmacogenetics Working Group (DPWG).1,2,3,4

PGx is one of the low-hanging fruits in precision medicine, and researchers and medical experts have recognized the clinical utility and benefit of preemptive PGx-guided drug-prescribing recommendations.5,6,7,8 The awareness of the clinical use of PGx is also evident in the increasing coverage of clinical PGx testing, including PGx panel testing, by medical insurance programs, such as Medicare and Medicaid.9,10,11 Moreover, studies have also suggested the wide prevalence of PGx variants (>90% of the studied individuals possessing PGx variants) and consequential clinical actionability in various cohorts.5,12,13,14,15

As PGx becomes actively implemented in routine clinical care at a global scale, it is increasingly important to ensure the inclusivity of diverse PGx haplotypes/diplotypes for robust clinical decision support. Various studies have estimated the frequencies of PGx alleles, diplotypes, and phenotypes in different populations. Due to scarcity of resources, some studies suffered from limited sample size, some focused on a finite subset of PGx alleles, some restricted themselves to a specific population, and some investigated only PGx variants rather than the PGx haplotypes/diplotypes which translate to guideline-based drug-prescribing recommendations.16,17,18 Meanwhile, the rise of genetic biobanks across the globe provides the opportunity to understand PGx frequencies in a variety of global populations.19,20,21,22,23

In this study, we performed a comprehensive estimation of PGx frequencies by taking advantage of the large-scale genetic UK Biobank.19 Using the Pharmacogenomics Clinical Annotation Tool (PharmCAT)24,25,26 and the UK Biobank, we estimated the frequencies of PGx alleles, diplotypes, phenotypes, and activity scores, if applicable, in five biogeographic groups (N = 200,044). While UK Biobank is dominated by self-reported British Europeans, there are still nontrivial numbers of participants from other biogeographic groups, including Central/South Asian, East Asian, Afro-Caribbean, and Sub-Saharan African. The Afro-Caribbean group is referred to as the African-American/Afro-Caribbean group hereafter to conform with the biogeographic grouping system described in the material and methods. We found that 100% of the participants in our dataset harbored at least one genetic variation at PGx positions. PGx frequencies (allele, diplotype, phenotypes, and/or activity scores) are reported using CPIC allele, function, and phenotype standardized nomenclature.27 The estimated frequencies are disseminated via PharmGKB for future PGx studies, clinical PGx testing panel design, and clinical decision support.

Material and methods

UK Biobank genetic data

The UK Biobank genotype data were collected following informed consent obtained from all participants.28 UK Biobank scientific protocol and operational procedures were reviewed and approved by the North West Research Ethics Committee (REC reference number: 06/MRE08/65).28 Data for this study were obtained under the UKB application number 33722.

We generated an integrated genetic dataset of 200,044 UK Biobank participants from the whole-exome sequence (WES) data generated from the OQFE pipeline (data field: 23156), microarray-based genotype calls (data field: 22418), and the imputed data (data field: 22828).19,29,30 Quality control (QC) was performed on each of the three genetic data sets before we combined the genotypes from the three data sources as the 200K integrated genetic dataset (Figure S1).

UK Biobank WES data are available in GRCh38 coordinates but not normalized. We first extracted the 200,044 samples shared with the imputed data and microarray-based genotyping dataset. QC was performed using PLINK2 and bcftools following UK Biobank’s QC procedures on the WES data.29,30,31,32,33 Samples were retained if the participant was not withdrawn from the UK Biobank and had a sample-based missing rate ≤ 10%. Genetic variants were retained if genetic positions had a genotype quality (GQ) ≥ 20, suggesting a <1% probability of false genotype call. Single-nucleotide variants (SNVs) had to meet additional QC thresholds: (1) allele balance (AB) ≥ 0.15 or ≤ 0.85, which was calculated for heterozygotes as the number of bases supporting the least-represented allele over the total number of base observations and (2) read depth (DP) ≥ 7. Indels used different QC thresholds from SNVs: (1) AB ≥ 0.2 or ≤ 0.8 and (2) DP ≥10. We identified 17 PGx allele-defining positions that had >10% variant-based missing rate or failed the Hardy-Weinberg equilibrium (HWE) test (p < 1e−15). These positions were further subjected to biogeographic group-specific QC to properly account for minor allele frequency (MAF) differences across groups. While multi-allelic positions were retained, positions that otherwise failed biogeographic group-specific variant-based missing rate filtering and HWE test were removed from the specific biogeographic group (Table S2; Figure S1).

Instead of assuming homozygous references for positions not present in the WES data, similar QC standards were imposed on the UK Biobank WES gVCF (data field: 23151) to extract homozygous reference calls. These homozygous reference genotypes were required to have GQ ≥ 20; in addition, SNVs were expected to have DP ≥ 7 and indels DP ≥ 10. Genotypes at positions that failed QC due to insufficient sequencing quality were set to missing in an individual-by-individual manner.
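The genotype-level thresholds for homozygous reference calls can be illustrated with a minimal sketch. The function names and arguments below are hypothetical and are not part of the UK Biobank, PLINK2, or bcftools tooling; only the thresholds (GQ ≥ 20, DP ≥ 7 for SNVs, DP ≥ 10 for indels) come from the text. The Phred conversion shows why GQ = 20 corresponds to an error probability of about 1%.

```python
def phred_to_error_probability(gq: float) -> float:
    """Convert a Phred-scaled genotype quality (GQ) into an error probability."""
    return 10 ** (-gq / 10)


def passes_genotype_qc(gq: float, dp: int, is_indel: bool) -> bool:
    """Apply the GQ and DP thresholds described for homozygous reference calls.

    Hypothetical helper: GQ >= 20 for every call, DP >= 7 for SNVs and
    DP >= 10 for indels. A call that fails would be set to missing.
    """
    min_dp = 10 if is_indel else 7
    return gq >= 20 and dp >= min_dp
```

A genotype for which `passes_genotype_qc` returns `False` would be treated as missing for that individual rather than assumed to be homozygous reference.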

UK Biobank microarray-based genotype data are aligned to the GRCh37 genomic coordinates. The microarray-based genotype calls had been quality controlled for batch effects, plate effects, sex effects, array effects, and discordance across control replicates. Genotypes for markers that failed at least one test in a certain batch were set to missing. In this study, we conducted further QC on the 200,044 samples that were shared among WES, imputed, and microarray-based genotype data. The following QC steps were carried out sequentially: (1) normalization to combine multiallelic loci, (2) liftover to GRCh38 (RefSeq: GCF_000001405.39), (3) removal of sample outliers with high heterozygosity, which indicated poor sample quality (data field: 22027), (4) removal of variants with sample missing rate ≥ 10%, and (5) removal of variants that failed the HWE test (p < 1e−15) in respective biogeographic groups (Figure S1). Only one position, chr16:31096368 (build 38) (rs9923231), which was located in the gene VKORC1, failed the HWE test in the “other” group. As such, the genotypes at this position were set to be missing in the “other” group.

The UK Biobank imputed data were also in GRCh37 coordinates. The 200,044 samples that are shared among the three genetic datasets were first extracted. QC on the imputed data was performed in the following sequence: (1) normalization to combine multiallelic loci, (2) removal of sample outliers with high heterozygosity or with >5% variant missing rate, which indicated poor sample quality (data field: 22027), (3) removal of variants with low imputation score (R2 < 0.3; data field: 1967), (4) removal of variants that had >10% variant-based missing rate, (5) removal of variants with sample missing rate ≥ 10%, and (6) removal of variants that failed the HWE test (p < 1e−15) in respective biogeographic groups (Figure S1). Following QC, the dataset was lifted over from GRCh37 to GRCh38 (RefSeq: GCF_000001405.39).

Generating the UK Biobank 200K integrated genetic dataset

The UK Biobank 200K integrated genetic dataset was generated from the combination of QC’d WES, genotype array, and imputed data described above. WES data took higher precedence over the microarray-based genotype calls, which, in turn, had priority over the imputed data, given the concordance rates among these data based on the preliminary analysis by the UK Biobank.31 In total, eight PGx positions fell outside the WES and were obtained from the microarray-based genotype calls or imputed data (Table S1). The UK Biobank 200K integrated genetic dataset was not phased.
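The precedence rule can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the source names, the `merge_genotype` helper, and the assumption that fallback happens genotype by genotype (rather than position by position) are all hypothetical.

```python
from typing import Optional

# Data sources in decreasing order of precedence, as described in the text.
PRECEDENCE = ("wes", "array", "imputed")


def merge_genotype(calls: dict[str, Optional[str]]) -> Optional[str]:
    """Return the genotype from the highest-precedence source that has a call.

    `calls` maps a source name to a genotype string such as "0/1", or to
    None when the genotype is missing or failed QC in that source.
    """
    for source in PRECEDENCE:
        genotype = calls.get(source)
        if genotype is not None:
            return genotype
    return None  # missing in all three sources
```

For example, `merge_genotype({"wes": None, "array": "0/1", "imputed": "0/0"})` would return the microarray call `"0/1"`.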

For more details regarding the PGx positions that failed QC and were excluded from the 200K integrated genetic dataset, please see Table S2 and supplemental methods.

A genetic variant that is not a named CPIC allele is referred to using the RefSeq assembly GCF_000001405.39 in this study.

Concordance

We calculated the non-reference concordance (NRC) rates between WES and imputed data and between WES and microarray-based genotype calls to understand the probability of observing the same alternative genotypes between two genetic datasets. The NRC rate was calculated using the following equation:

$$\mathrm{NRC} = \frac{m_{RA} + m_{AA}}{m_{RA} + m_{AA} + x_{RR} + x_{RA} + x_{AA}}$$

where RR denotes homozygous reference, RA denotes heterozygous, AA denotes homozygous alternative, m denotes matches, and x denotes mismatches.

NRC was calculated for each genetic position and by allele frequency bins using the statistics derived from the bcftools stats function.33 An NRC of 0% indicated that every alternative genotype was different between the two genetic datasets, whereas an NRC of 100% indicated that every alternative genotype was the same between the two genetic datasets. G6PD, an X-linked gene, was not included in the evaluation of non-reference concordance rate as genotypes were represented in different ways for samples with one or two X chromosomes in different genetic datasets.
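As a worked illustration of the NRC definition, the sketch below computes the rate from the five match and mismatch counts. The function name and the handling of the empty case (returning `None` when no alternative genotype is observed) are assumptions made for this example; the counts themselves would come from a tool such as bcftools stats.

```python
from typing import Optional


def non_reference_concordance(
    m_ra: int, m_aa: int, x_rr: int, x_ra: int, x_aa: int
) -> Optional[float]:
    """Non-reference concordance from match (m) and mismatch (x) counts.

    RA = heterozygous, AA = homozygous alternative, RR = homozygous reference.
    Matching homozygous reference genotypes are excluded by definition.
    """
    matches = m_ra + m_aa
    total = matches + x_rr + x_ra + x_aa
    if total == 0:
        return None  # undefined: no non-reference genotype was compared
    return matches / total
```

With 3 heterozygous matches, 1 homozygous alternative match, and 4 mismatches in total, the function returns 0.5, i.e., an NRC of 50%.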

Validating PharmCAT calls

We validated the accuracy of PharmCAT PGx allele calls by comparing these results against the consensus call from the Genetic Testing Reference Material Coordination (GeT-RM) Program available on the Centers for Disease Control and Prevention (CDC) website (https://www.cdc.gov/labquality/get-rm/index.html).34,35,36,37 The comparison of PGx allele calls was performed on 133 GeT-RM samples whose 30× Whole-Genome Sequencing (WGS) VCFs were publicly available under the International Genome Sample Resource (IGSR).38,39 The VCF files for these 133 GeT-RM samples were first standardized and normalized using the PharmCAT VCF Preprocessor, and the resulting files were then passed to PharmCAT to match for PGx alleles. We manually validated each pair of the PGx allele calls between PharmCAT and GeT-RM and the underlying genotypes (supplemental methods).

G6PD analysis

For the purpose of reporting PGx frequencies, we performed additional quality control for X-linked G6PD to appropriately calibrate the G6PD PGx allele frequencies. We took additional measures to appropriately account for the number of X chromosomes in a participant (data field: 22001). The karyotypes were inferred from the measured intensities of chromosomes X and Y described in Bycroft et al.19 Five participants were excluded at this step due to uncertain or unknown genetic sex (data field: 22001). We excluded 316 individuals for whom sex aneuploidies were detected based on the average log2Ratios for Y and X chromosomes. This resulted in 110,080 XX and 89,727 XY samples in the downstream G6PD analyses in this study.

Allele frequencies were counted in participants with one and two X chromosomes, separately, and then combined into a single frequency table.
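One way to combine the two counts is to pool them over the total number of X chromosomes. The text does not state the exact combination rule, so the pooling below is an assumption, and the function name and arguments are hypothetical.

```python
def x_linked_allele_frequency(
    copies_in_xx: int, n_xx: int, copies_in_xy: int, n_xy: int
) -> float:
    """Pooled frequency of an X-linked allele.

    copies_in_xx: allele copies observed among participants with two X
                  chromosomes (0, 1, or 2 copies per participant).
    copies_in_xy: allele copies observed among participants with one X
                  chromosome (0 or 1 copy per participant).
    """
    total_chromosomes = 2 * n_xx + n_xy
    if total_chromosomes == 0:
        raise ValueError("no X chromosomes to count")
    return (copies_in_xx + copies_in_xy) / total_chromosomes
```

Under this assumption, the 110,080 XX and 89,727 XY samples would contribute 2 × 110,080 + 89,727 = 309,887 X chromosomes to the denominator.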

PharmCAT analysis

PharmCAT was run on the UK Biobank 200K integrated genetic dataset.24,25 The PharmCAT VCF preprocessor v.2.2.1 was first applied to the integrated genetic dataset to normalize and standardize the representation format of genetic variants in the input VCF.40 Secondly, the whole PharmCAT with the research functionalities (“--research combinations,cyp2d6”) was run to call (1) partial and combination alleles and (2) CYP2D6 based on limited SNVs and indels. Nonetheless, due to the lack of representation of structural variations (e.g., copy-number variation [CNV]) in a VCF file format, PharmCAT is only able to call alleles that are based on non-structural variations. PharmCAT produced JSON files as a result of each module (Named Allele Matcher, Phenotyper, and Reporter) and a Reporter HTML file. We extracted the PGx annotations from the JSON files for downstream analyses.

Biogeographic grouping

UK Biobank collected self-reported ethnicity data upon the initial Assessment Center visit as part of the touchscreen questionnaire (data field: 21000). Multiple responses for self-reported ethnicity may be available if a participant filled out the questionnaire more than once. The UK Biobank used a different category system than the biogeographic grouping system implemented by PharmGKB and CPIC for PGx allele frequency tables.41 To identify the comparable categories between the two systems, we first matched the UK Biobank self-reported ethnicity system to the PharmGKB biogeographic grouping system (Table S3).

We mapped the level 1 (or top level if level 1 was not available) self-reported ethnicity classes to the nine PharmGKB biogeographical groups. PharmGKB “European” was matched to the combination of UK Biobank “White,” “White - British,” “White - Irish,” and “White - Any other white background.” PharmGKB “Central/South Asian” was composed of “Indian,” “Pakistani,” and “Bangladeshi” under the UK Biobank “Asian or Asian British.” PharmGKB “African-American/Afro-Caribbean” was mapped to “Black or Black British Caribbean” in the UK Biobank. PharmGKB “Sub-Saharan African” was mapped to the UK Biobank “Black or Black British - African.” PharmGKB “East Asian” was equated to the UK Biobank “Chinese.” We created a separate “other” population group to include any other self-reported ethnicity or classifications due to ambiguity of biogeographical background or adherence to more than one identity, including “do not know,” “prefer not to answer,” “other ethnic group,” any classification labeled “other” (except “Any other white background”) and certain top-level ethnicity classifications (“mixed,” “Asian or Asian British,” and “Black or Black British”). We also placed individuals who reported different ethnicities at different visits to the “other” group (n = 209).
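The mapping described above can be sketched as a lookup table. This is an illustrative simplification: the labels are copied from the text rather than from the UK Biobank numeric coding of data field 21000, and the `assign_group` helper is hypothetical.

```python
# UK Biobank self-reported ethnicity label -> PharmGKB biogeographic group.
UKB_TO_PHARMGKB = {
    "White": "European",
    "White - British": "European",
    "White - Irish": "European",
    "White - Any other white background": "European",
    "Indian": "Central/South Asian",
    "Pakistani": "Central/South Asian",
    "Bangladeshi": "Central/South Asian",
    "Black or Black British Caribbean": "African-American/Afro-Caribbean",
    "Black or Black British - African": "Sub-Saharan African",
    "Chinese": "East Asian",
}


def assign_group(responses: list[str]) -> str:
    """Assign one biogeographic group from all of a participant's responses.

    Participants with no response, with different responses across visits,
    or with a label that is not in the table are placed in "other".
    """
    distinct = set(responses)
    if len(distinct) != 1:
        return "other"
    (label,) = distinct
    return UKB_TO_PHARMGKB.get(label, "other")
```

For instance, a participant who reported “Indian” at one visit and “Chinese” at another would be assigned to the “other” group.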

Frequency estimation

We summarized the frequency distributions of PGx alleles, diplotypes, phenotypes, and activity scores (if applicable) for 18 genes and one CYP2C cluster variant. At each level, only participants with a single PharmCAT call were included in the frequency estimation. For allele and diplotype frequency estimation, only participants with a single diplotype call were used in the analysis. Similarly, for the phenotype or activity score, PGx frequencies were calculated using participants with a single phenotype call or a single activity score, respectively. Figure S2 presents the exact numbers and percentages of participants in this study who had one diplotype call, or one phenotype call, and whose results were used for estimating PGx frequencies. We used CPIC standardized nomenclature across this study.27
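The single-call rule for allele frequencies can be illustrated with a short sketch. The data layout (a list of candidate diplotypes per participant, each diplotype a pair of allele names) and the function name are assumptions made for this example and do not reflect the structure of the PharmCAT JSON output.

```python
from collections import Counter


def allele_frequencies(calls_per_participant):
    """Estimate allele frequencies from participants with one diplotype call.

    calls_per_participant: one entry per participant, each a list of
    candidate diplotypes; a diplotype is a pair of allele names.
    Participants with zero or several candidate diplotypes are skipped.
    """
    counts = Counter()
    for candidates in calls_per_participant:
        if len(candidates) != 1:
            continue  # ambiguous or missing call: excluded from the estimate
        counts.update(candidates[0])  # adds both alleles of the diplotype
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()} if total else {}


# Hypothetical example: the third participant has two candidate diplotypes
# and is therefore excluded.
example = [[("*1", "*2")], [("*1", "*1")], [("*1", "*2"), ("*1", "*3")]]
frequencies = allele_frequencies(example)
```

In this toy example only four allele copies are counted, three of which are `*1`.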

The frequency tables included special cases of annotations from PharmCAT. These special cases included combination or partial allele calls or no calls due to missing PGx variants or indeterminable diplotypes. These cases were considered valid and aggregated into the “remainder” category in the tables. The reference tabs in the frequency tables summarized the counts of each specific case.

CFTR frequencies were handled separately in this study. Cystic fibrosis (CF), a genetic disorder caused by autosomal-recessive variations in the coding region of CFTR, may be caused by more than one thousand different variants in various biogeographic groups.42,43 Meanwhile, the CPIC CFTR-ivacaftor drug-prescribing guideline is specifically indicated for individuals with CF who have a particular CFTR variant.44 Given the specific focus of the CPIC CFTR-ivacaftor guideline, understanding the frequency of CFTR variants in the general population beyond individuals with CF translates to little clinical utility. While we have identified and evaluated the frequencies of several CFTR variants (Table S6), we did not inspect the CF disease status of the UK Biobank 200K participants. As a result, the CFTR variant frequencies did not indicate the proportion of individuals with CF who will gain benefit from PGx-guided therapeutic plans. In this regard, CFTR frequencies were presented separately and not discussed in the results section in this study.

Frequencies were calculated for all DPYD variants found in this study. However, no DPYD diplotype frequency table was generated from the UK Biobank data. The DPYD-Fluoropyrimidines CPIC guideline assigns the likely activity score and phenotype of an individual based on the sum of the two lowest variant activity values regardless of how many variants are found.45 Thus, all DPYD variants may not be taken into consideration when a phenotype is inferred for an individual. We instead provide a DPYD phenotype frequency table that carries pragmatic insights into the distribution of likely DPYD metabolizer phenotype.
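The “two lowest activity values” rule can be sketched as below. This is a simplified illustration and not the CPIC or PharmCAT implementation: it ignores phasing, and it assumes that each detected variant allele copy is supplied with its activity value and that any gene copy without a detected variant has normal activity (1.0). The function name is hypothetical.

```python
def dpyd_activity_score(variant_activity_values: list[float]) -> float:
    """Sum of the two lowest activity values for an individual.

    variant_activity_values: one activity value per detected variant allele
    copy. Two normal-activity placeholders (1.0) are added so that
    individuals with fewer than two variants still receive a score
    (assumption for this sketch).
    """
    two_lowest = sorted(variant_activity_values + [1.0, 1.0])[:2]
    return sum(two_lowest)
```

Under these assumptions, an individual carrying three variants with activity values 0, 0.5, and 1 would receive a score of 0.5, because only the two lowest values contribute.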

Most CPIC genes use the star (*) allele nomenclature to refer to cataloged combinations of variants (haplotypes) that span the entire gene sequence. This is not true for DPYD, UGT1A1, RYR1, CACNA1S, and G6PD. For these genes, the CPIC guidelines mainly refer to variants that can be found in different combinations in cis on the same allele or in trans. In these cases, “reference” or “*1” alleles cannot reasonably be assumed based on the absence of the given set of variants. Therefore, reference or *1 alleles were not used in the frequency tables or figures for DPYD, UGT1A1, RYR1, CACNA1S, and G6PD.

CPIC does not assign allele functions for CYP4F2, IFNL3, and VKORC1. These genes do not have a phenotype assignment or phenotype frequencies.

The R package tidyverse was used for data reshaping and visualization.46

Results

Overview

In this study, we investigated the biogeographic-group-specific PGx frequencies for 18 genes and a CYP2C cluster variant by applying PharmCAT to the UK Biobank data (Tables S4–S22). CFTR and the CYP2C cluster variant (rs12777823) were reported separately for various reasons (see material and methods and supplemental notes). In the rest of the main text, we focused on 17 pharmacogenes: ABCG2, CACNA1S, CYP2B6, CYP2C19, CYP2C9, CYP2D6, CYP3A5, CYP4F2, DPYD, G6PD, IFNL3, NUDT15, RYR1, SLCO1B1, TPMT, UGT1A1, and VKORC1.

To obtain as comprehensive a coverage of PGx positions as possible, we constructed a UK Biobank 200K integrated genetic dataset from three UK Biobank genetic datasets (Figure 1). PharmCAT then processed this 200K integrated genetic dataset to identify PGx alleles and diplotypes and predict phenotypes for each of the 200,044 participants in this dataset. All participants had at least one genetic variation at PGx positions. The frequencies were estimated for alleles, allele functions, diplotypes, phenotypes, and activity scores, where applicable (Tables S4–S22).

Figure 1.

Overview of the study

The UK Biobank participants used in this study came from a variety of biogeographic groups, including European (N = 187,660, 93.81%), Central/South Asian (N = 3,460, 1.73%), East Asian (N = 637, 0.32%), African-American/Afro-Caribbean (N = 1,926, 0.96%), and Sub-Saharan African (N = 1,235, 0.62%). The remaining 5,126 (2.56%) participants in the dataset were grouped as “other” (see material and methods; Figure 1). The African-American/Afro-Caribbean group was composed of “Black or Black British Caribbeans” and the category conformed with the biogeographic grouping system described in the material and methods.

The UK Biobank 200K integrated genetic dataset covered all the 771 allele-defining positions used in PharmCAT (Table S23; supplemental methods). Certain genetic positions failed to pass the biogeographic-group-specific QC and therefore were removed from the specific biogeographic group (Table S2). This resulted in slightly different PGx position coverages by the 200K integrated dataset in distinct biogeographic groups, but overall, more than 98% of PGx position coverage was found in any group (Table S23).

PharmCAT performance

We validated the accuracy of PharmCAT PGx allele calls for the following 10 genes: CYP2B6, CYP2C19, CYP2C9, CYP3A5, CYP4F2, DPYD, SLCO1B1, TPMT, UGT1A1, and VKORC1. These genes were chosen because their PGx diplotypes were characterized by multiple GeT-RM research studies for certain samples (https://www.cdc.gov/labquality/get-rm/index.html). For some of these samples, their WGS VCFs had recently become publicly available on the IGSR.38,39 We obtained the WGS VCFs for in total 133 samples and called PGx diplotypes for these 10 genes using PharmCAT. Not all 133 samples had GeT-RM consensus calls for each gene.

After manually reviewing each PGx call and resolving any discrepancies driven by the PGx allele updates and the differences between GeT-RM’s genotyping assays and WGS data, we found that PharmCAT had 100% concordance rates for all 10 genes. Some of the known discrepancies between GeT-RM consensus calls and PharmCAT results have been published in detail in previous PharmCAT papers (supplemental methods).25

Coverage of pharmacogenomic alleles with non-zero frequencies

PharmCAT revealed a variety of PGx alleles in the 200K integrated genetic dataset. PharmCAT identified in total 425 distinct PGx alleles in the 17 pharmacogenes. These 425 distinct PGx alleles were found in at least one participant and covered 55.0% of the total 773 PGx alleles specified on the CPIC allele definition tables (Figures 2 and S3). These 773 PGx alleles did not include the “reference” alleles for CACNA1S, DPYD, and RYR1, or UGT1A1*1.

Figure 2.

PharmCAT on the UK Biobank 200K integrated dataset complemented the PharmGKB/CPIC frequency tables

It provided PGx frequency information for more alleles than found in the PharmGKB/CPIC allele frequency tables. While more PGx alleles were investigated in this study, only PGx alleles that were found to be possessed by at least one participant were included in this figure. The coverage of PGx allele frequencies was summed in each data source, CPIC alone (vertical bar), UK Biobank alone (dots), and the two combined (arrows), across 17 pharmacogenes. In this study, PharmCAT identified in total 425 distinct PGx alleles with non-zero frequencies in the 17 pharmacogenes (55.0% of the 773 interrogated PGx alleles). Combined with what is available in the CPIC frequency tables, we reported non-zero allele frequencies for in total 497 (64.2% of all the 773) distinct PGx alleles, a 40% increase from the CPIC frequency tables alone.

PharmCAT delineated a more comprehensive picture of PGx allele frequencies in all five biogeographic groups using the UK Biobank 200K integrated genetic dataset. PharmCAT identified 330 (42.7%), 184 (23.8%), 151 (19.5%), 165 (21.3%), and 107 (13.8%) of all the 773 distinct PGx alleles in the European, African-American/Afro-Caribbean, Sub-Saharan African, Central/South Asian, and East Asian groups, respectively (Figure 2).

Updated or new allele frequencies

The UK Biobank 200K integrated genetic dataset allowed frequency estimation for a nontrivial number of PGx alleles that had zero or missing frequency information in the CPIC frequency tables. Approximately one-third of the 425 PGx alleles present in our dataset (n = 152) had such zero or missing frequency information in one of the biogeographic groups in the CPIC frequency tables (Figure 2).

Similarly in each biogeographic group, we found 166 (21.5%), 91 (11.8%), 73 (9.5%), 72 (9.3%), and 28 (3.6% of all the 773) distinct PGx alleles in the European, African-American/Afro-Caribbean, Sub-Saharan African, Central/South Asian, and East Asian groups, respectively, in our UK Biobank 200K integrated dataset that had zero or missing frequency information in the CPIC frequency tables (Table 1).

Table 1.

Numbers of PGx alleles in each biogeographic group

| Gene | European: # alleles (CPIC updates) | European: # altered function alleles | African-American/Afro-Caribbean: # alleles | African-American/Afro-Caribbean: # altered function alleles | Sub-Saharan African: # alleles | Sub-Saharan African: # altered function alleles | Central/South Asian: # alleles | Central/South Asian: # altered function alleles | East Asian: # alleles | East Asian: # altered function alleles |
|---|---|---|---|---|---|---|---|---|---|---|
| Overall | 327 (166) | 151 (71) | 182 (91) | 80 (34) | 149 (73) | 56 (23) | 162 (72) | 71 (28) | 104 (28) | 53 (17) |
| CYP2B6 | 25 (10) | 14 (8) | 20 (10) | 12 (7) | 15 (5) | 8 (3) | 16 (7) | 7 (4) | 9 (3) | 6 (3) |
| CYP2C19 | 21 (10) | 12 (4) | 17 (6) | 10 (3) | 16 (8) | 8 (3) | 11 (7) | 6 (3) | 6 (1) | 3 (0) |
| CYP2C9 | 41 (30) | 17 (10) | 15 (8) | 10 (4) | 12 (4) | 8 (2) | 19 (10) | 8 (4) | 13 (4) | 10 (4) |
| CYP2D6 | 63 (47) | 21 (7) | 39 (20) | 14 (3) | 29 (16) | 10 (3) | 36 (24) | 13 (6) | 20 (3) | 10 (0) |
| CYP3A5 | 5 (2) | 3 (1) | 4 (0) | 3 (0) | 4 (0) | 3 (0) | 3 (1) | 2 (1) | 3 (1) | 2 (1) |
| CYP4F2 | 3 (0) | 0 (0) | 3 (1) | 0 (0) | 3 (0) | 0 (0) | 3 (0) | 0 (0) | 3 (0) | 0 (0) |
| DPYD | 50 (14) | 18 (8) | 27 (10) | 7 (3) | 24 (6) | 3 (1) | 21 (1) | 8 (1) | 13 (2) | 3 (1) |
| G6PD | 29 (16) | 25 (15) | 9 (5) | 6 (4) | 6 (6) | 3 (3) | 12 (3) | 10 (2) | 8 (4) | 7 (3) |
| NUDT15 | 12 (6) | 3 (1) | 7 (7) | 3 (3) | 5 (5) | 1 (1) | 7 (2) | 3 (1) | 6 (0) | 2 (0) |
| RYR1 | 23 (10) | 21 (10) | 1 (1) | 0 (0) | 1 (1) | 0 (0) | 1 (1) | 0 (0) | 0 (0) | 0 (0) |
| SLCO1B1 | 22 (13) | 7 (2) | 20 (17) | 6 (5) | 19 (16) | 6 (5) | 16 (11) | 5 (3) | 10 (5) | 4 (2) |
| TPMT | 24 (4) | 4 (1) | 10 (4) | 3 (0) | 6 (3) | 1 (0) | 7 (1) | 3 (0) | 4 (1) | 1 (0) |
| UGT1A1 | 7 (4) | 6 (4) | 6 (2) | 5 (2) | 5 (3) | 4 (2) | 6 (4) | 5 (3) | 5 (4) | 4 (3) |

The numbers in parentheses “()” indicated the number of the PGx alleles that had zero or missing frequency information on the CPIC frequency tables but were found to be present in at least one UK Biobank participant in this study. Altered function alleles were alleles that had decreased, increased, or no-function. “0” suggested that while we identified PGx alleles for the gene, CPIC frequency tables have already reported frequencies for the PGx alleles in this gene.

We estimated frequencies for certain PGx alleles that had increased, decreased, or no function but whose frequency information was not in the CPIC frequency tables (Table 1). Some of these PGx alleles were not rare, i.e., >1% allele frequency in certain biogeographic groups. Examples include CYP2B6*9 (decreased function) in the Sub-Saharan African group (9.1%) and CYP2B6*7 (decreased function) in the Central/South Asian group (3.0%), G6PD A- 202A_376G (class III/deficient) in the African-American/Afro-Caribbean (frequency = 9.2%) and Sub-Saharan African (frequency = 10.7%) groups, SLCO1B1*31 (no function) in the African-American/Afro-Caribbean (frequency = 3.0%) and Sub-Saharan African (frequency = 3.9%) groups, and SLCO1B1*20 (increased function) in the African-American/Afro-Caribbean (frequency = 5.9%) and Sub-Saharan African (frequency = 6.0%) groups.

Pharmacogenomic frequencies

We summarized the frequency distributions of PGx alleles, diplotypes, phenotypes, and activity scores, if applicable, for 17 pharmacogenes that had CPIC drug-prescribing guidelines (Figures 3, 4, 5, 6, and S4–S11).

Figure 3.

Frequencies for G6PD, CYP2C9, and DPYD

Figures for G6PD only included the UK Biobank participants with two X chromosomes.

(A and B) Allele (A) and phenotype (B) frequencies for G6PD. Only individuals with predicted G6PD variable, G6PD deficient, or G6PD deficient with chronic nonspherocytic hemolytic anemia (CNSHA) phenotypes are shown in (B) for visual clarity. Participants with one X chromosome were not displayed due to the rarity of G6PD alleles.

(C and D) Allele and phenotype frequencies for CYP2C9, respectively.

(E) Proportions of the UK Biobank participants in this study who had more than one DPYD variant with activity values of 0 or 0.5 in each biogeographic group.

(F) DPYD phenotype frequencies. DPYD c.2194G>A (*6), c.1896T>C, c.1601G>A (*4), c.1129-5923C>G, c.1236G>A (HapB3), and c.496A>G failed QC in the European group and thus were removed and not included in frequency estimation.

“Remainder” includes alleles of uncertain function, unassigned function, no call (due to missing genotypes), partial/combination alleles, etc.

Figure 4.

Frequencies for CYP2B6, CYP2C19, and CYP3A5

(A, C, and E) Allele frequencies for CYP2B6, CYP2C19, and CYP3A5, respectively.

(B, D, and F) Phenotype frequencies for CYP2B6, CYP2C19, and CYP3A5, respectively.

“Remainder” includes alleles of uncertain function, unassigned function, no call (due to missing genotypes), partial/combination alleles, etc. Specifically, CYP3A5*3 was not callable in the “other” group as the defining genetic variant failed QC.

Figure 5.

Frequencies for NUDT15 and TPMT

(A and C) Allele frequencies for NUDT15 and TPMT, respectively.

(B and D) Phenotype frequencies for NUDT15 and TPMT, respectively.

“Remainder” includes alleles of uncertain function, unassigned function, no call (due to missing genotypes), partial/combination alleles, etc. For TPMT, as chr6:18130687 was discarded during QC for the European group, no participants in the European group were assigned TPMT*3A, *3C, or *41, which are defined solely or partly by genetic variants found at chr6:18130687. This also affected the estimated frequency of the TPMT*1 and *3B alleles in the European group.

Figure 6.

Frequencies for SLCO1B1 and UGT1A1

(A and C) Allele frequencies for SLCO1B1 and UGT1A1, respectively.

(B and D) Phenotype frequencies for SLCO1B1 and UGT1A1, respectively.

“Remainder” includes alleles of uncertain function, unassigned function, no call (due to missing genotypes), partial/combination alleles, etc. UGT1A1*1 and normal metabolizers were omitted from (C) and (D) as these do not reflect the actual frequencies (see discussion).

PGx diplotypes for some genes were defined by only one SNV, including ABCG2, IFNL3, and VKORC1 (Figure S5; supplemental methods). We observed differences in the allele frequencies in these genes across biogeographic groups. The ABCG2 rs2231142 variant (T) had a frequency of 28.10% in the East Asian group, compared with 8.57% in the Central/South Asian group and 0.97% in the Sub-Saharan African group. Similarly, the IFNL3 rs12979860 variant (T) had a frequency of approximately 61% in the African-American/Afro-Caribbean and Sub-Saharan African groups but only 6.04% in the East Asian group. The VKORC1 rs9923231 variant (T) was observed at a frequency of 87.83% in the East Asian group, compared with 18.15% in the Central/South Asian group.

For genes defined by more than one genetic position, some genes received multiple PGx calls from PharmCAT due to missing genotypes at certain PGx allele-defining positions, which make it impossible for PharmCAT to discern among matching PGx alleles. To eliminate the influence of ambiguous PGx calls, each step of the frequency estimation only included participants with a single PGx call from PharmCAT (Figure S2). For alleles and diplotypes, only participants with a single diplotype call from PharmCAT were included in the frequency estimation. Similarly, for phenotypes, we focused on the participants with a single phenotype call. For example, for SLCO1B1, 173,241 (86.6%) individuals had only one diplotype call from PharmCAT (Figure S2J). These individuals were included for estimating PGx allele and diplotype frequencies. Next, some participants had multiple possible diplotype calls, but all possible diplotypes led to the same phenotype prediction. These participants were then included for estimating phenotype frequencies. As a result, the estimation of phenotype frequencies consisted of in total 189,656 (94.8%) participants for SLCO1B1 (Figure S2J). Statistics for other genes including CYP2B6, CYP2C19, CYP2C9, CYP2D6, G6PD, NUDT15, SLCO1B1, TPMT, and UGT1A1 are available in Figure S2.
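The inclusion rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the data layout (a list of candidate diplotypes per participant), and the toy SLCO1B1 calls and phenotype labels are hypothetical and are not PharmCAT output or the study's actual pipeline.

```python
from collections import Counter


def allele_frequencies(calls):
    """Allele frequencies from participants with exactly one diplotype call.

    `calls` maps a participant ID to a list of candidate diplotypes,
    each diplotype being a tuple of two allele names (hypothetical layout).
    """
    counts = Counter()
    for diplotypes in calls.values():
        if len(diplotypes) != 1:
            continue  # skip ambiguous (multiple calls) or uncalled participants
        counts.update(diplotypes[0])  # count both alleles of the single diplotype
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def phenotype_frequencies(calls, phenotype_of):
    """Phenotype frequencies from participants whose candidate diplotypes
    all map to the same phenotype (a single phenotype call)."""
    counts = Counter()
    for diplotypes in calls.values():
        phenotypes = {phenotype_of[d] for d in diplotypes}
        if len(phenotypes) == 1:  # unambiguous phenotype, even if diplotype is not
            counts[phenotypes.pop()] += 1
    total = sum(counts.values())
    return {phenotype: n / total for phenotype, n in counts.items()}


# Toy data (invented for illustration only).
toy_calls = {
    "p1": [("*1", "*1")],
    "p2": [("*1", "*5")],
    "p3": [("*5", "*15"), ("*5", "*46")],  # two diplotypes, same phenotype
}
toy_phenotypes = {
    ("*1", "*1"): "normal function",
    ("*1", "*5"): "decreased function",
    ("*5", "*15"): "poor function",
    ("*5", "*46"): "poor function",
}

print(allele_frequencies(toy_calls))
print(phenotype_frequencies(toy_calls, toy_phenotypes))
```

In this toy example, participant p3 is excluded from the allele and diplotype estimates but still contributes to the phenotype estimate, which mirrors why more participants were available for SLCO1B1 phenotype frequencies than for diplotype frequencies.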

G6PD had varied allele and phenotype frequency distributions in different biogeographic groups (Figure 3). Figure 3 shows the frequency distributions for the UK Biobank participants with two X chromosomes. We did not include participants with one X chromosome in Figure 3 due to the rarity of G6PD alleles in this group of participants, which consisted of one participant from the Central/South Asian group harboring Insuli (class IV/normal), one from the East Asian group with Fushan (class II/deficient), two distinct people from the European group having a Farroupilha (class II-III/deficient) and Split (III/deficient) separately, and an individual from the “other” group having a Namouru (class II/deficient).

While varied, low frequencies of G6PD-deficient alleles and G6PD-deficient phenotype were expected across biogeographic groups.47,48 In accordance with previous observations, Asian and African biogeographic groups in the UK Biobank exhibited a higher frequency of G6PD-deficient alleles and heterozygous G6PD-deficient phenotype (Figure S6A).47,48 Each biogeographic group had distinct dominant low-function alleles (G6PD-deficient alleles), and some common G6PD-deficient alleles were named after the place where the variant was reported. G6PD A- 202A_376G was the most common one in the African-American/Afro-Caribbean and Sub-Saharan African groups (9.18% and 10.65%, respectively; Figures S17A and S17B). Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham (2.02%) was the most common deficient allele in the Central/South Asian group (Figure S17C); Kaiping, Anant, Dhon, Sapporo-like, Wosera (0.8%) was the most common deficient allele in the East Asian group in our dataset (Figure S17D). A small percentage of participants from Asian, African, and other biogeographic groups were found to possess G6PD alleles of uncertain function, resulting in indeterminate phenotypes (Figure 3B). Some participants possessed undocumented variants that are different from the CPIC allele definitions (“no result” phenotype in Figure 3B). These G6PD alleles warrant future research.

For CYP2C9, different biogeographic groups had distinct sets of prevalent alleles (Figures 3C and S13). For example, CYP2C9*2, *3, *11, and *12 were the most common alleles in the European group, while in the Central/South Asian group, *14 replaced *12 among the top alleles. The East Asian group was dominated by *3, *13, *16, and *8. On the other hand, the most frequent alleles in the African-American/Afro-Caribbean and Sub-Saharan African groups were *9, *8, and *11. Differences in the most frequent CYP2C9 alleles led to different proportions of allele functions and phenotypes among biogeographic groups (Figures S6B and 3D).

We observed 19 UK Biobank participants who had been assigned multiple CYP2C9 activity scores (Figure S2B) suggesting the influence of missing PGx genotypes on clinical annotations of an individual’s PGx phenotype (supplemental methods).

The percentages of CYP2C9 poor metabolizers with an activity score of 0 were low across biogeographic groups (Figure 3D), as was expected based on the frequency of CYP2C9 no-function alleles (activity value of 0; Figure S6B).49,50,51 The frequency of CYP2C9 poor metabolizers with an activity score of 0 or 0.5 was >5% in all biogeographic groups. Currently, drug-prescribing guidelines for CYP2C9 do not differentiate poor metabolizers with an activity score of 0.5 or 0. In Figure 3D, we distinguished each CYP2C9 phenotype based on activity scores for specificity.49,50,51

We evaluated the proportions of the 200K UK Biobank participants having DPYD variants with activity values of 0 or 0.5 who may need dosage adjustment (Figure 3E). For clarity, variants with an activity value of one were not shown in Figure 3E. Each biogeographic group had their distinct distribution of DPYD variants and phenotypes (Figures 3E and 3F). For example, European and Central/South Asian groups had similar proportions of DPYD phenotypes. However, these phenotypes were driven by different underlying DPYD variants (Table S13). We observed low frequencies of DPYD intermediate metabolizers and poor metabolizers in the East Asian group as previously reported.45,52 It is unclear whether the low frequencies are due to unidentified or uncharacterized DPYD variants.52 Moreover, chr1:97573863, a genetic position that harbors one of the defining variants for the DPYD HapB3 haplotype, failed QC in the European group. As a result, DPYD HapB3 frequency was not calculated for the European group (see supplemental methods).

Almost 7% of the participants (n = 13,339, 6.67%) had one decreased or no-function variant (Figure S2G). Two hundred and sixty-four participants (0.13%) bore two decreased or no-function DPYD variants. Only four of our 200K participants were found to harbor three decreased or no-function variants (Figure S2G). All four of these participants had at least one g.97579893G>C (GenBank: NC_000001.11) variant (one of the defining variants for the HapB3 haplotype, which has a decreased function with an activity value of 0.5). Two of them were homozygous for the g.97579893G>C (GenBank: NC_000001.11) variant and had an additional DPYD c.2846A>T variant (decreased function). Another had one g.97579893G>C (GenBank: NC_000001.11) variant, an additional c.1905+1G>A (*2A) variant (no function), and a c.2846A>T variant (decreased function). The last one harbored a DPYD g.97579893G>C (GenBank: NC_000001.11) variant, a c.1905+1G>A (*2A) variant (no function), and a c.1679T>G (*13) variant (no function). Following the CPIC guideline for fluoropyrimidines and DPYD, DPYD phenotype is assigned based on the two lowest activity values.45 Based on this, the phenotypes of the four participants were intermediate metabolizers for the former two individuals who had three decreased function variants and poor metabolizers for the latter two who had one or two no-function variants.
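The “two lowest activity values” rule can be illustrated with a short sketch. The function name and input layout are hypothetical, and the score-to-phenotype cutoffs used here (2 for normal, 1 to 1.5 for intermediate, 0 to 0.5 for poor) are our reading of the CPIC scheme and should be treated as an assumption rather than a restatement of the guideline.

```python
def dpyd_phenotype(variant_activity_values):
    """Assign a DPYD activity score and phenotype from per-variant activity values.

    `variant_activity_values` lists one activity value per observed variant
    allele (a homozygous variant appears twice). Hypothetical input layout.
    """
    # Pad with two normal-function values (1.0) so that participants with
    # zero or one variant still have two values to sum.
    two_lowest = sorted(list(variant_activity_values) + [1.0, 1.0])[:2]
    score = sum(two_lowest)
    # Assumed cutoffs: 2 = normal, 1 to 1.5 = intermediate, 0 to 0.5 = poor.
    if score >= 2.0:
        phenotype = "normal metabolizer"
    elif score >= 1.0:
        phenotype = "intermediate metabolizer"
    else:
        phenotype = "poor metabolizer"
    return score, phenotype


# The three variant patterns described for the four participants above.
print(dpyd_phenotype([0.5, 0.5, 0.5]))  # homozygous HapB3 variant + c.2846A>T
print(dpyd_phenotype([0.5, 0.0, 0.5]))  # HapB3 variant + *2A + c.2846A>T
print(dpyd_phenotype([0.5, 0.0, 0.0]))  # HapB3 variant + *2A + *13
```

Under these assumptions the first pattern yields an activity score of 1 (intermediate metabolizer) and the other two yield 0.5 and 0 (poor metabolizers), consistent with the phenotypes reported for the four participants.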

For the rest of the pharmacogenes, we investigated the most prevalent alleles in each gene and estimated the PGx allele, diplotype, and phenotype frequency distributions across biogeographic groups (Figures 4, 5, and 6). The flow of participants from certain PGx alleles to their allele function assignment and finally to phenotype predictions is available in Figures S12–S21.

CYP2C19 allele definitions were extensively curated during the transition process of the content from the P450 nomenclature webpage into the PharmVar database using standardized allele designation criteria and evidence levels.53 Under the original designation, CYP2C19*1 could include the variant g.94842866A>G (GenBank: NC_000010.11) (p.Ile331Val) (legacy names CYP2C19*1B, *1C, etc.), but could also be defined without the variant (legacy name CYP2C19*1A). As part of the transition, the haplotype without the g.94842866A>G (GenBank: NC_000010.11) (p.Ile331Val) variant received its own star allele, CYP2C19*38, which matches the CYP2C19 RefSeq sequence, while the haplotype containing the g.94842866A>G (GenBank: NC_000010.11) (p.Ile331Val) variant remained *1. Since this change occurred recently, most frequency data retrieved from literature do not distinguish between these two alleles. Using the UK Biobank data, we are able to differentiate and estimate the frequencies of both *1 and *38.

We observed a higher proportion of CYP2C19 poor metabolizers in the East and Central/South Asian groups, approximately quadruple that of the European group and twice that of the African-American/Afro-Caribbean and Sub-Saharan African groups (Figure 4D). These frequencies fit the reported observations in previous studies.54,55,56 The greater frequencies of CYP2C19 poor metabolizers put individuals from the two Asian groups at a higher risk for certain drugs, such as clopidogrel.54,55,56 We also noted a nontrivial number of individuals with indeterminate phenotypes in the African-American/Afro-Caribbean (2.34%) and Sub-Saharan African (3.24%) groups (Figure 4D). These individuals harbored alleles with uncertain function, so their genotypes cannot be used for drug-prescribing recommendations. These frequencies highlight the need for future research on alleles with unknown or uncertain functions.

With the use of the UK Biobank data, we are able to differentiate between CYP2C19*1 and *38 and capture both. Otherwise, the data reflect the expected patterns (Figures 4C, 4D, and S7B). CYP2C19*17 is the most prevalent allele in the European, African-American/Afro-Caribbean, and Sub-Saharan African groups, while *2 is the most common one in the East Asian and Central/South Asian groups (Figure 4C). While *8, *4, and *6 were the most prevalent no-function alleles, secondary to *2, in the European group, *35 was the most common no-function allele, next to *2, in the African-American/Afro-Caribbean and Sub-Saharan African groups (originally reported in African populations), suggesting a different driving no-function allele underlying CYP2C19 poor metabolizers in different biogeographic groups.57

We did not estimate the frequency of CYP3A5*3 in the “other” group as the allele-defining position, chr7:99672916, failed the HWE test (Figure 4E; Table S2). This resulted in the assignment of CYP3A5*1 to all subjects in this group, though CYP3A5*3 is known to be the most prevalent allele in most global populations.
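For readers unfamiliar with HWE filtering, the following is a minimal sketch of a Hardy-Weinberg equilibrium check for a single biallelic position, using a one-degree-of-freedom chi-square test. It is not the test used in this study: the exact method and p value threshold applied during QC are described in the methods, and the function name and genotype counts below are invented for illustration.

```python
import math


def hwe_chi_square(n_ref_ref, n_ref_alt, n_alt_alt):
    """Chi-square test for Hardy-Weinberg equilibrium at a biallelic site.

    Takes observed genotype counts and returns (chi2, p_value), with the
    p value taken from a chi-square distribution with 1 degree of freedom.
    """
    n = n_ref_ref + n_ref_alt + n_alt_alt
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_ref_ref + n_ref_alt) / (2 * n)  # reference allele frequency
    q = 1.0 - p                                # alternate allele frequency
    observed = (n_ref_ref, n_ref_alt, n_alt_alt)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts: the first site matches HWE expectations exactly,
# the second has no heterozygotes at all and deviates strongly.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(50, 0, 50))
```

A position that fails such a test within a biogeographic group is removed for that group, which is why an allele defined at that position can no longer be called there.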

For TPMT, frequency estimation for the European group was affected by chr6:18130687, which failed the HWE test in this biogeographic group (Table S2). As a result of discarding this multiallelic position during QC, no participants in the European group were assigned TPMT*3A, *3C, or *41, which are defined solely or partly by the g.18130687T>C (GenBank: NC_000006.12) (p.Tyr240Cys) or g.18130687T>G (GenBank: NC_000006.12) (p.Tyr240Ser) variants (Figure 5C). The removal of this multiallelic position inflated the frequency estimate for TPMT*1 in the European group (Figure 5C).

Variants were extremely rare for CACNA1S and RYR1 (Tables S5 and S17). For CACNA1S, there were only two individuals who were found to possess CACNA1S c.520C>T. These two individuals belonged to the “other” group and no CACNA1S variant was found in any other population. As for RYR1, we found 135 participants from the European group and one participant from the “other” group who harbored one malignant hyperthermia-associated variant and were predicted to be susceptible to malignant hyperthermia (Figure S11; Table S17). Other biogeographic groups did not have any malignant hyperthermia-associated variants. Three participants (0.12%) in the Sub-Saharan African group were found to harbor RYR1 c.1598G>A. This was the only RYR1 variant found in the Sub-Saharan African group but exhibited a higher prevalence in comparison to that of the European group (0.008%, Figure S11).

Novel combinations or partial alleles

PharmCAT relies on allele definitions from established sources such as PharmVar and the TPMT nomenclature committee. Novel combination and partial alleles refer to combinations of known PGx variants not cataloged as a PGx allele in the CPIC allele definition tables. Partial alleles are a combination of genetic variants that make up part of a currently defined PGx allele. Combination alleles are more than one currently defined PGx allele on a haplotype. Partial or combination alleles were present in different genes and biogeographic groups at different proportions (Figure 7). African-American/Afro-Caribbeans had higher proportions of partial or combination alleles in CYP2B6, CYP2C19, CYP2C9, and TPMT than the rest of the groups. Some of these are predicted to be no-function alleles, such as CYP2B6 [*18 + g.41006923C>T] where *18 is a no-function allele and CYP2C19 [*2 + *30] where *2 is a no-function allele. The Sub-Saharan African group exhibited a higher proportion of partial or combination alleles in CYP2B6 and CYP2C19. The Central/South Asian group was found to possess more partial or combination alleles in SLCO1B1. The East Asian group, on the other hand, had more partial or combination alleles in UGT1A1, e.g., UGT1A1 [*80 + *27 + *28] and even [*80 + *6 + *27 + *28] where *6, *27, and *28 are all decreased-function alleles.

Figure 7.

Frequency of partial or combination alleles in each biogeographic group

Each panel shows the proportions of participants who had been identified with partial or combination alleles in each gene.

Non-reference concordance among genetic data sources

We investigated the NRC rates among WES, imputed, and microarray-based genotype data for the 200,044 participants in this study. The WES data had a greater NRC rate with the microarray-based genotype calls than with the imputed data (Figure S22). The average NRC rate between the WES and imputed data was 52.2% for the 205 shared PGx positions (Figure S22B). On the other hand, the average NRC rate between the WES and microarray-based genotype data was 76.7% for the 87 shared pharmacogenetic positions (Figure S22D). Quality control on HWE did not sway the NRC statistics (Figure S22A compared to S22B and Figure S22C compared to S22D).

NRC rates showed no systematic pattern across the PGx positions of interest: no gene had consistently higher NRC rates than the others (Figure S23).

We observed different patterns of NRC between the WES and imputed data and between the WES and microarray-based genotype calls (Figure S24). For the WES and imputed data, NRC rates declined with a decrease in MAF (Figure S24). For example, NRC rates for the overall population in this study were about 6% for extremely rare variants with MAF between 0.001% and 0.005%, approximately 36% for rare variants with MAF between 0.005% and 0.01%, and on average greater than 90% for variants with MAF > 1% (Figure S24B). This trend was observed across all the biogeographic groups (Figure S24B). For the WES and microarray-based genotype calls, on the other hand, NRC rates plummeted significantly for variants with MAF < 0.005% (Figure S24D). NRC rates were on average greater than 80% for variants with MAF > 0.005% and greater than 90% for variants with MAF > 0.5%. NRC rates, nonetheless, dropped to below 60% for variants with MAF < 0.005%, while only 81 instances of alternative genotypes were observed below this MAF threshold (Figure S24D).
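As a rough illustration of how a non-reference concordance rate can be computed for one position, the sketch below compares two genotype vectors for the same participants. The definition used here is an assumption (pairs in which both datasets report the homozygous reference genotype are ignored, and the remaining pairs are concordant when the genotypes are identical); the study's exact formula is given in its methods. The function name and the toy genotypes are hypothetical.

```python
def non_reference_concordance(genotypes_a, genotypes_b):
    """Non-reference concordance between two genotype vectors.

    Genotypes are coded as alternate-allele counts (0, 1, or 2), with None
    for a missing call. Returns None when no comparable pair is found.
    """
    concordant = 0
    compared = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue  # a missing call in either dataset cannot be compared
        if a == 0 and b == 0:
            continue  # both homozygous reference: excluded by definition
        compared += 1
        if a == b:
            concordant += 1
    return concordant / compared if compared else None


# Toy genotypes for six participants at one position (invented).
wes = [0, 1, 2, 1, 0, None]
imputed = [0, 1, 1, 0, 1, 1]
print(non_reference_concordance(wes, imputed))
```

Because homozygous reference pairs are excluded, a handful of discordant calls can dominate the rate for a very rare variant, which is one way to read the sharp drop in NRC at the lowest MAF bins.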

Discussion

This study estimated PGx frequencies for 17 pharmacogenes with CPIC drug-prescribing guidelines in five biogeographic groups (Figure 1). All participants (N = 200,044; 100%) harbored at least one genetic variation at PGx positions of interest in this study. The PGx frequencies were calculated from PharmCAT results on the UK Biobank 200K integrated genetic dataset that we generated for this study. The 200K integrated genetic dataset covered all the 771 PGx allele-defining positions used in PharmCAT. Genotypes may be missing for some participants due to genotyping or sequencing quality issues (Table S23), and not all known PGx alleles in the 17 pharmacogenes were present in the 200,044 participants we interrogated in this study.

PharmCAT identified 425 distinct PGx alleles across 17 pharmacogenes (55% of the total 773 CPIC PGx alleles) and complemented PGx frequency information in the CPIC frequency tables (Figure 2). Frequencies of approximately 36% of the PGx alleles we found in this study had not been reported for certain biogeographic groups in the CPIC frequency tables. Combining this study and the CPIC frequency tables, frequencies are available for 497 (64.2%) distinct PGx alleles, an approximately 40% increase over what is currently available in the CPIC frequency tables alone (nCPIC = 352).

For each biogeographic group, the total numbers of PGx alleles with frequency estimates rose to 389 (50.3%), 214 (27.6%), 171 (22.1%), 206 (26.6%), and 204 (26.5%) in the European, African-American/Afro-Caribbean, Sub-Saharan African, Central/South Asian, and East Asian groups, respectively (Figure 2). The surge of estimated allele frequencies varied for each pharmacogene and biogeographic group (Figure S3). These indicated an approximately 75%, 55%, 16%, 75%, and 78% increase in the allele frequency coverage in the respective biogeographic groups, in comparison with what is currently available in the CPIC frequency tables.

While biogeographic groups may have comparable proportions of clinical phenotypes, the underlying compositions of alleles or allele functions could be quite different (Figures S12–S21). Alleles that were most common and had altered functionalities varied among biogeographic groups (Figures S12–S21). For example, in SLCO1B1, the most prevalent no-function alleles were *15, *5, and *46 in the Central/South Asian and European groups, but *31 and *15 in the African-American/Afro-Caribbean and Sub-Saharan African groups (Figure S19). The variety in PGx alleles across biogeographic groups led to distinct flows from alleles to diplotypes and to phenotypes (Figures S12–S21). PGx phenotypes that result in alternative drugs or dosage adjustment recommendations in non-European biogeographic groups would be missed or underestimated if the most common alleles in these biogeographic groups are not included in clinical PGx testing panels. Such individuals will be assigned incorrect genotypes, clinical phenotypes, and PGx-guided drug-prescribing recommendations.

Of note, reference or *1 alleles indicate an absence of genetic variation at the interrogated genetic positions and are assigned by default when no variant is found at the queried positions. They do not suggest a lack of genetic variation at every position in the gene and should not be mistaken to mean an exact match to the entire reference sequence for the gene.

Concordance rates between the imputed and WES data diminished with a decrease in the variant frequency (Figure S24B). The NRC rates, which measure the concordance of alternative genotypes between two genetic datasets, dropped to, on average, less than 80% across biogeographic groups for rare variants (MAF < 0.1%) and were even lower (∼30% on average) in any biogeographic group for extremely rare variants (MAF < 0.01%). This means that imputation performed less well for rare variants, and the rarer the variant, the worse imputation performed. On the other hand, microarray-based genotype data maintained an NRC rate of, on average, greater than 80% with the WES data for variants with a MAF > 0.005%. As rare variants play an important role in determining a person’s precise PGx diplotype, which in turn is needed for accurate drug-prescribing recommendations, the PGx community may need to be wary of using imputation for predicting any individual’s variation. Genotyping may be more accurate than imputation for rare variants.

Rare PGx combinations

Partial and combination alleles were rare yet present for CYP2B6, CYP2C19, CYP2C9, CYP2D6, CYP3A5, SLCO1B1, TPMT, and UGT1A1. These were rare variant combinations, most frequent in CYP2D6 (0.66%) and CYP2B6 (0.50%). Some of these partial or combination alleles consisted of genetic variants that define no-function alleles, e.g., CYP2B6 [*18 + g.41006923C>T] where *18 is a no-function allele and CYP2D6 [*3 + *122] where *3 is a no-function allele, both of which had not been recorded in PharmVar.2,58,59 These examples of participants who did not contain a “known” allele but were found to contain one or more variants in the gene imply that there are alleles/haplotypes that have not yet been cataloged in PharmVar. After PharmVar submission, the alleles can be defined, cataloged, and assayed for in the future. Functional characterization of these alleles is also needed to gain an understanding of clinical phenotypes and applicable prescribing recommendations. Not surprisingly, we see more uncataloged alleles in underserved populations. This highlights the known need to study genetic variation in all global populations for translation and implementation for equity.

In addition, we identified a variety of instances of alternative genotypes at known PGx positions. For example, while only RYR1 c.488G>T was cataloged as a malignant hyperthermia-associated variant in RYR1, we found that three participants, two in the European group and one in the African-American/Afro-Caribbean group, had a g.38444212G>A (GenBank: NC_000019.10) (p.Arg163His) missense variant at the same genetic position (Table S17). These instances of alternative genotypes at PGx positions also warrant functional analyses to ensure an accurate representation of an individual’s phenotype for PGx clinical implementation.

A substantial proportion of participants was found to harbor CYP4F2 [*2 + *3], which recently has been assigned CYP4F2*4 by PharmVar. We have included the *4 allele in our analysis and it is not considered as a combination allele in Figure 7.

Equity and underrepresented groups

Although each group had fewer than 2,000 participants, PharmCAT found a nontrivial number of PGx alleles that had not been reported in the CPIC frequency tables for the African-American/Afro-Caribbean (n = 1,926) and Sub-Saharan African (n = 1,235) groups. These groups of participants have been underrepresented in genomic studies relative to their worldwide populations.60 This often led to unreported functional genetic variations and overlooked genetic diversity among individuals from various areas in Africa.60,61 We showed the benefits and importance of leveraging genetic data from these biogeographic groups of participants, even in datasets with limited sample sizes compared with that of the European group, which would bring us a step closer to equitable precision medicine.

We reported the frequencies of 168 PGx alleles in the non-European groups whose frequencies had not been recorded in CPIC frequency tables. Different biogeographic groups turned out to have a wider spectrum of PGx alleles in genes such as CYP4F2, SLCO1B1, and UGT1A1 than was previously reported in the CPIC frequency tables, as some PGx alleles might not have been included in a study and were subsequently assumed to be *1 or reference. Some other genes had been observed with an absence of PGx alleles in non-European groups, for example, the low numbers of RYR1 variants in non-European groups. It was unclear whether the absence of PGx alleles in non-European groups was due to a lack of PGx alleles, the scarcity of studies of functional alleles in non-European groups, or a combination of both.

We found alleles in the UK Biobank that we did not find reported in the literature that was used to create the CPIC frequency tables. This new information helps us better understand global as well as population-level frequencies. Alleles that were thought to “not exist” in certain populations have now been documented, though perhaps they are rare. Based on current PGx knowledge, however, individuals from any population who had alleles with altered functionality, no matter how rarely found in a given population, may warrant personalized prescribing. These cases should not be ignored. For this reason, testing for as many alleles as possible, regardless of population frequency, is desirable if one wants to catch as many affected people as possible. This approach may not be cost effective, but researchers and implementers should be aware that not testing for all alleles will mean the possibility that a small percentage of people will be assumed to have the reference or *1 allele and thus not be accurately reported.

Furthermore, in research settings, if alleles are not assayed, they cannot be found and cannot be reported, skewing the frequency not only for these untested alleles, but for the *1 alleles as well since anyone containing an untested allele is reported as *1. This overestimates the frequency of *1 in studies and confuses data analysis when participants thought to be *1 have unexpected phenotypes due to the unknown alleles. Identification of alleles is not enough for implementation, however. It is important to characterize alleles found in all populations, even rare alleles, in order to understand function which leads to clinical phenotypes and possible prescribing changes in order to truly realize the potential of personalized medicine.

Differences compared to the CPIC frequencies

As stated above, we found many alleles in the UK Biobank genetic dataset that had not been reported in the CPIC frequency tables based on curated literature. A few pharmacogenes exhibited differences in the UK Biobank from the reported allele frequencies. The UK Biobank integrated genetic dataset estimated a lower frequency of the CYP2B6*6 allele, which has been reported as one of the most prevalent functional alleles of CYP2B6 across a variety of populations.62,63,64 The difference in CYP2B6*6 frequencies between the CPIC frequency table and this study could be traced back to the widespread absence of information for g.41009358A>G (GenBank: NC_000019.10) (p.Lys262Arg) (missing rate = 32.3%). This SNV by itself defines CYP2B6*4 (increased function) and, if found in combination with the *9 (decreased function) variant, defines CYP2B6*6 (decreased function). The absence of g.41009358A>G (GenBank: NC_000019.10) (p.Lys262Arg) genotypes was attributable to non-called genotypes from the UK Biobank gVCF. The missing genotypes at this position make it impossible for PharmCAT to identify *4 or to distinguish between *6 and *9. A participant with a missing genotype at chr19:41009358 and no other variants will be assigned *1. For samples missing g.41009358A>G (GenBank: NC_000019.10) (p.Lys262Arg) but containing the *9 SNV, PharmCAT reports both *6 and *9 as it would be impossible to distinguish between them without g.41009358A>G (GenBank: NC_000019.10) (p.Lys262Arg). We indeed observed a great percentage (27.1%) of participants with multiple diplotype calls in this study (Figure S2). It is clear from the results of this study, compared to previous allele frequency reports, that the frequency of CYP2B6*6 is severely underestimated in the UK Biobank population due to missing g.41009358A>G (GenBank: NC_000019.10) (p.Lys262Arg). The frequency of CYP2B6*4 is largely underestimated, and CYP2B6*1 is over-inflated in this study, at a rate of about 20%. 
Interestingly, this study shows a substantially increased frequency of true CYP2B6*9 among UK Biobank participants across all populations. These are participants who indeed harbor the CYP2B6*9 variant(s) and have the reference genotypes at chr19:41009358. This is possibly because many studies test only for the *9 variant but assume a *6 genotype because *6 is so much more common. For critical positions that discern functional PGx alleles, such as chr19:41009358 in the above example, it is important to ensure the genotype data quality at each step and, importantly, retain the position regardless of the presence of any genetic variation to be able to make definitive predictions and collect accurate allele frequency.
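The effect of the missing chr19:41009358 genotype on CYP2B6 calls can be summarized as a small decision table. The sketch below is a deliberately simplified, single-haplotype illustration restricted to *1, *4, *6, and *9; it is not PharmCAT's matching algorithm, and the function name and arguments are hypothetical.

```python
def cyp2b6_candidates(has_star9_snv, k262r):
    """Candidate CYP2B6 alleles for one haplotype (simplified illustration).

    has_star9_snv: True if the *9-defining SNV is present.
    k262r: True if g.41009358A>G (p.Lys262Arg) is present, False if the
           reference base is observed, None if the genotype is missing.
    """
    if k262r is None:
        # Missing genotype: *4 can never be identified, and *6 cannot be
        # told apart from *9, so both are reported when the *9 SNV is seen.
        return {"*6", "*9"} if has_star9_snv else {"*1"}
    if has_star9_snv and k262r:
        return {"*6"}  # both variants together define *6
    if has_star9_snv:
        return {"*9"}  # *9 SNV with reference at p.Lys262Arg: true *9
    if k262r:
        return {"*4"}  # p.Lys262Arg alone defines *4
    return {"*1"}      # no variant at either position


for snv in (False, True):
    for k in (False, True, None):
        print(snv, k, sorted(cyp2b6_candidates(snv, k)))
```

The two missing-genotype rows show both distortions discussed above: a haplotype that is really *4 is reported as *1, and a haplotype that is really *6 yields an ambiguous *6 or *9 call and is therefore excluded from single-call frequency estimates.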

The UK Biobank integrated genetic dataset was unable to estimate frequency for TPMT*3A, *3C, and *41 in the European group as chr6:18130687, a genetic position harboring g.18130687T>C (GenBank: NC_000006.12) (p.Tyr240Cys) and g.18130687T>G (GenBank: NC_000006.12) (p.Tyr240Ser), failed QC and was omitted. TPMT*3A is known to be the most common TPMT variant to cause toxicity in Europeans, typically found at a frequency of around 3%. Given the typical 3% frequency of TPMT*3A in this population, we would have expected to see at least that amount of g.18130687T>C (GenBank: NC_000006.12) (p.Tyr240Cys) variants. Surprisingly, g.18138997C>T (GenBank: NC_000006.12) (p.Ala154Thr), the defining variant for TPMT*3B and the other part of the TPMT*3A allele, was not detected in the entirety of the UK Biobank European group. This variant by itself is typically found at a frequency of about 0.2% for this population. Due to these circumstances, the frequency estimate for TPMT*1 in the European group is likely inflated by up to 4%.

The CPIC frequency table for UGT1A1 reported on individual variants. Though it has previously been reported that *80 is in high linkage disequilibrium with *28,65 it was unclear with what frequency either variant was found alone versus in cis on the same gene copy. With the UK Biobank genetic dataset, we were able to assess how often this occurred across populations. UGT1A1*28 and *80 were observed together from 10% (East Asian) to 40% (Sub-Saharan African), while *28 without *80 was found in less than 0.2% across populations. However, *80 was also found in combination with *37 (up to 4% of Sub-Saharan Africans) as well as without either *28 or *37 (up to 2.2% in Sub-Saharan Africans). So while it is very rare for *28 to be found without *80, assaying for the *80 variant alone as a proxy for the more-difficult-to-assay *28 repeat variant can lead to incorrect genotype assignment for up to 6% of people.

We estimated CYP2D6*10 to be at a 56.0% frequency in East Asians rather than the 43% currently reported in the CPIC frequency table. This may also be due to the missing genotypes at key PGx allele-defining positions, which led to a large number of participants with multiple top possible CYP2D6 calls (Figure S2).

Limitations

We did not find equivalent classifications of self-reported ethnicities for four PharmGKB biogeographical groups in the UK Biobank: “American,” “Oceanian,” “Near Eastern,” and “Latino.” We defer PGx frequency estimations for these four biogeographic groups to future studies or collaborations.

This study focused on only a specific subset of genes that PharmCAT annotates. Since v.2.0, PharmCAT is capable of digesting and providing PGx reports for genes in both the CPIC and DPWG guidelines. This study, however, focused specifically on the genes included in the CPIC guidelines. The PGx frequencies estimated in this study are available on the PharmGKB and will be used to update the frequency statistics in the relevant PharmGKB/CPIC allele frequency tables. We will expand our research into other genes (for instance, CYP3A4 used in the DPWG quetiapine guideline) in the future.

A handful of genetic variants failed the biogeographic-group-specific HWE tests during the WES QC (Table S2). The removal of these genetic variants impacts the frequency estimation of certain PGx alleles and diplotypes, including (1) ABCG2 in the European group, (2) CFTR 3272-26A->G in the African-American/Afro-Caribbean, Central/South Asian, and East Asian groups, (3) CYP3A5∗3 in the “other” group, (4) DPYD c.2194G>A (∗6), c.1896T>C, c.1601G>A (∗4), c.1129-5923C>G, c.1236G>A (HapB3), and c.496A>G in the European group, and (5) TPMT∗3A, ∗3C, and ∗41 in the European group. Missing genotypes at chr19:41009358 in CYP2B6 affected multiple alleles. Several of these alleles, such as TPMT∗3A and ∗3C, have important clinical implications.
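
To illustrate what an HWE check on a single variant involves, the following Python sketch runs a one-degree-of-freedom chi-square goodness-of-fit test on genotype counts. It is a simplified illustration only: the study's QC pipeline, its test variant (for example, an exact test), and its p-value threshold are not described in this section, and the function name and example counts below are hypothetical.

```python
import math


def hwe_chi_square(n_ref_hom: int, n_het: int, n_alt_hom: int):
    """Chi-square test of Hardy-Weinberg equilibrium for a biallelic variant.

    Returns (chi2, p_value) using a chi-square distribution with 1 degree
    of freedom. Assumes at least one genotyped individual.
    """
    n = n_ref_hom + n_het + n_alt_hom
    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p                              # alternate allele frequency

    observed = (n_ref_hom, n_het, n_alt_hom)
    expected = (p * p * n, 2 * p * q * n, q * q * n)

    # Skip genotype classes with zero expectation (monomorphic sites).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical genotype counts: the first set matches HWE expectations,
# the second has no heterozygotes at all.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(50, 0, 50))
```

A variant whose p-value falls below the chosen threshold within a group would be flagged and, as described above, removed for that group, which in turn removes the evidence needed to call any allele defined at that position.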

The CYP2D6 phenotype frequency distribution was skewed because structural variations were not available. We could not call ultrarapid metabolizers based solely on SNVs and indels. We defer the estimation of CYP2D6 frequencies to future studies.

The modest number of PGx alleles observed in the East Asian group is likely attributable to its smaller sample size (n = 637) relative to the other groups in this study.
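
The effect of sample size on which alleles can be observed can be made concrete with a small calculation. The Python sketch below assumes independent sampling of chromosomes, which is a simplification; only the sample size of 637 comes from the text, and the allele frequencies in the loop are hypothetical values chosen for illustration.

```python
def min_observable_frequency(n_individuals: int) -> float:
    """Frequency of an allele seen exactly once among 2n chromosomes."""
    return 1 / (2 * n_individuals)


def detection_probability(allele_freq: float, n_individuals: int) -> float:
    """Chance of seeing at least one copy among 2n independent chromosomes."""
    return 1 - (1 - allele_freq) ** (2 * n_individuals)


n_east_asian = 637  # sample size stated in the text

print(f"Smallest observable frequency: {min_observable_frequency(n_east_asian):.4%}")
for freq in (0.0001, 0.001, 0.01):  # hypothetical allele frequencies
    prob = detection_probability(freq, n_east_asian)
    print(f"allele frequency {freq:.2%}: detection probability {prob:.3f}")
```

Under these assumptions, alleles well below a frequency of about 0.1% have a substantial chance of going unobserved in a group of this size, which is consistent with fewer distinct PGx alleles being reported for the East Asian group.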

Conclusion

In this study, we estimated the PGx frequencies (PGx alleles/haplotypes, diplotypes, phenotypes, and activity scores, if applicable) in five biogeographic groups for 17 pharmacogenes that have CPIC drug-prescribing guidelines. The PGx frequencies were based on a large-scale UK Biobank 200K integrated genetic dataset and complemented the existing expert-curated frequency information in the CPIC frequency tables. The PGx frequencies estimated from this study are integrated and disseminated via PharmGKB. We hope that a more comprehensive picture of PGx frequencies will contribute to inclusivity of diverse PGx haplotypes/diplotypes for equitable, robust clinical decision support.

Data and code availability

The UK Biobank data are available through a procedure described at https://www.ukbiobank.ac.uk/using-the-resource/. For the 133 GeT-RM individuals, their 30× whole-genome sequencing (WGS) VCFs were publicly available under the International Genome Sample Resource (IGSR) (https://www.internationalgenome.org/data-portal/data-collection/30x-grch38). PharmCAT is available at https://github.com/PharmGKB/PharmCAT. Frequencies of pharmacogenomic alleles, diplotypes, phenotypes, and/or activity scores are available on PharmGKB (https://www.pharmgkb.org/).

Acknowledgments

This work is supported by NIH NHGRI U24HG010862. We thank William Hardison from Wesleyan University for his contribution in testing and validating PharmCAT results against GeT-RM data.

Author contributions

B.L., K.S., M.W.-C., and T.E.K. contributed to the conceptualization of the study. K.S., M.W.-C., M.W., R.W., M.D.R., and T.E.K. contributed to the design of the software. B.L., K.S., M.W.-C., K.K., M.W., and R.W. contributed to the software development and application, testing of existing code components, and validation of the software outputs. B.L. performed the formal data analysis and data visualization. B.L., K.S., and M.W.-C. contributed to the investigation and the writing of the original draft. M.D.R. and T.E.K. contributed to the acquisition of the financial support for the project leading to this publication. All authors contributed to the review and editing of the manuscript.

Declaration of interests

T.E.K. is on the Scientific Advisory Board for Galatea Bio.

Published: September 26, 2023

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2023.09.001.

Web resources

Supplemental information

Document S1. Figures S1–S24, Tables S1, S3, and S23 and supplemental methods
mmc1.pdf (9.7MB, pdf)
Table S2. PGx positions that failed QC and had been removed from the specific biogeographic groups
mmc2.xlsx (74.7KB, xlsx)
Table S4. ABCG2 frequency tables
mmc3.xlsx (14.5KB, xlsx)
Table S5. CACNA1S frequency tables
mmc4.xlsx (13.8KB, xlsx)
Table S6. CFTR frequency tables
mmc5.xlsx (19KB, xlsx)
Table S7. CYP2B6 frequency tables
mmc6.xlsx (37.4KB, xlsx)
Table S8. CYP2C19 frequency tables
mmc7.xlsx (28.5KB, xlsx)
Table S9. CYP2C9 frequency tables
mmc8.xlsx (34.3KB, xlsx)
Table S10. CYP2D6 frequency tables
mmc9.xlsx (21.8KB, xlsx)
Table S11. CYP3A5 frequency tables
mmc10.xlsx (16KB, xlsx)
Table S12. CYP4F2 frequency tables
mmc11.xlsx (13.1KB, xlsx)
Table S13. DPYD frequency tables
mmc12.xlsx (20.5KB, xlsx)
Table S14. G6PD frequency tables
mmc13.xlsx (30.7KB, xlsx)
Table S15. IFNL3 frequency tables
mmc14.xlsx (12.4KB, xlsx)
Table S16. NUDT15 frequency tables
mmc15.xlsx (17.8KB, xlsx)
Table S17. RYR1 frequency tables
mmc16.xlsx (18.2KB, xlsx)
Table S18. SLCO1B1 frequency tables
mmc17.xlsx (29.4KB, xlsx)
Table S19. TPMT frequency tables
mmc18.xlsx (20.4KB, xlsx)
Table S20. UGT1A1 frequency tables
mmc19.xlsx (18KB, xlsx)
Table S21. VKORC1 frequency tables
mmc20.xlsx (12.4KB, xlsx)
Table S22. rs12777823 frequency tables
mmc21.xlsx (12.4KB, xlsx)
Document S2. Article plus supplemental information
mmc22.pdf (13.6MB, pdf)

References

  • 1. Whirl-Carrillo M., Huddart R., Gong L., Sangkuhl K., Thorn C.F., Whaley R., Klein T.E. An Evidence-Based Framework for Evaluating Pharmacogenomics Knowledge for Personalized Medicine. Clin. Pharmacol. Ther. 2021;110:563–572. doi: 10.1002/cpt.2350.
  • 2. Gaedigk A., Whirl-Carrillo M., Pratt V.M., Miller N.A., Klein T.E. PharmVar and the Landscape of Pharmacogenetic Resources. Clin. Pharmacol. Ther. 2020;107:43–46. doi: 10.1002/cpt.1654.
  • 3. Matic M., Nijenhuis M., Soree B., de Boer-Veger N.J., Buunk A.-M., Houwink E.J.F., Mulder H., Rongen G.A.P.J.M., Weide J.v.d., Wilffert B., et al. Dutch Pharmacogenetics Working Group (DPWG) guideline for the gene–drug interaction between CYP2D6 and opioids (codeine, tramadol and oxycodone). Eur. J. Hum. Genet. 2022;30:1105–1113. doi: 10.1038/s41431-021-00920-y.
  • 4. Relling M.V., Klein T.E., Gammal R.S., Whirl-Carrillo M., Hoffman J.M., Caudle K.E. The Clinical Pharmacogenetics Implementation Consortium: 10 Years Later. Clin. Pharmacol. Ther. 2020;107:171–175. doi: 10.1002/cpt.1651.
  • 5. Dunnenberger H.M., Crews K.R., Hoffman J.M., Caudle K.E., Broeckel U., Howard S.C., Hunkler R.J., Klein T.E., Evans W.E., Relling M.V. Preemptive clinical pharmacogenetics implementation: current programs in five US medical centers. Annu. Rev. Pharmacol. Toxicol. 2015;55:89–106. doi: 10.1146/annurev-pharmtox-010814-124835.
  • 6. Hoffman J.M., Haidar C.E., Wilkinson M.R., Crews K.R., Baker D.K., Kornegay N.M., Yang W., Pui C.-H., Reiss U.M., Gaur A.H., et al. PG4KDS: a model for the clinical implementation of pre-emptive pharmacogenetics. Am. J. Med. Genet. C Semin. Med. Genet. 2014;166C:45–55. doi: 10.1002/ajmg.c.31391.
  • 7. Makary M.A., Daniel M. Medical error-the third leading cause of death in the US. BMJ. 2016;353:i2139. doi: 10.1136/bmj.i2139.
  • 8. Swen J.J., van der Wouden C.H., Manson L.E., Abdullah-Koolmees H., Blagec K., Blagus T., Böhringer S., Cambon-Thomsen A., Cecchin E., Cheung K.-C., et al. A 12-gene pharmacogenetic panel to prevent adverse drug reactions: an open-label, multicentre, controlled, cluster-randomised crossover implementation study. Lancet. 2023;401:347–356. doi: 10.1016/S0140-6736(22)01841-4.
  • 9. Anderson H.D., Crooks K.R., Kao D.P., Aquilante C.L. The landscape of pharmacogenetic testing in a US managed care population. Genet. Med. 2020;22:1247–1253. doi: 10.1038/s41436-020-0788-3.
  • 10. Empey P.E., Pratt V.M., Hoffman J.M., Caudle K.E., Klein T.E. Expanding evidence leads to new pharmacogenomics payer coverage. Genet. Med. 2021;23:830–832. doi: 10.1038/s41436-021-01117-w.
  • 11. Park S.K., Thigpen J., Lee I.J. Coverage of pharmacogenetic tests by private health insurance companies. J. Am. Pharm. Assoc. 2020;60:352–356.e3. doi: 10.1016/j.japh.2019.10.003.
  • 12. McInnes G., Lavertu A., Sangkuhl K., Klein T.E., Whirl-Carrillo M., Altman R.B. Pharmacogenetics at Scale: An Analysis of the UK Biobank. Clin. Pharmacol. Ther. 2020;109:1528–1537. doi: 10.1002/cpt.2122.
  • 13. Rasmussen-Torvik L.J., Stallings S.C., Gordon A.S., Almoguera B., Basford M.A., Bielinski S.J., Brautbar A., Brilliant M.H., Carrell D.S., Connolly J.J., et al. Design and anticipated outcomes of the eMERGE-PGx project: a multicenter pilot for preemptive pharmacogenomics in electronic health record systems. Clin. Pharmacol. Ther. 2014;96:482–489. doi: 10.1038/clpt.2014.137.
  • 14. Verma S.S., Keat K., Li B., Hoffecker G., Risman M., Regeneron Genetics Center, Sangkuhl K., Whirl-Carrillo M., Dudek S., Verma A., et al. Evaluating the frequency and the impact of pharmacogenetic alleles in an ancestrally diverse Biobank population. J. Transl. Med. 2022;20:550. doi: 10.1186/s12967-022-03745-5.
  • 15. Chanfreau-Coffinier C., Hull L.E., Lynch J.A., DuVall S.L., Damrauer S.M., Cunningham F.E., Voight B.F., Matheny M.E., Oslin D.W., Icardi M.S., Tuteja S. Projected Prevalence of Actionable Pharmacogenetic Variants and Level A Drugs Prescribed Among US Veterans Health Administration Pharmacy Users. JAMA Netw. Open. 2019;2 doi: 10.1001/jamanetworkopen.2019.5345.
  • 16. El Shamieh S., Zgheib N.K. Pharmacogenetics in developing countries and low resource environments. Hum. Genet. 2022;141:1159–1164. doi: 10.1007/s00439-021-02260-9.
  • 17. Markianos K., Dong F., Gorman B., Shi Y., Dochtermann D., Saxena U., Devineni P., Moser J., Muralidhar S., Ramoni R., et al. Pharmacogenetic allele variant frequencies: An analysis of the VA’s Million Veteran Program (MVP) as a representation of the diversity in US population. PLoS One. 2023;18 doi: 10.1371/journal.pone.0274339.
  • 18. Zhou Y., Krebs K., Milani L., Lauschke V.M. Global Frequencies of Clinically Important HLA Alleles and Their Implications For the Cost-Effectiveness of Preemptive Pharmacogenetic Testing. Clin. Pharmacol. Ther. 2021;109:160–174. doi: 10.1002/cpt.1944.
  • 19. Bycroft C., Freeman C., Petkova D., Band G., Elliott L.T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O’Connell J., et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.
  • 20. Gaziano J.M., Concato J., Brophy M., Fiore L., Pyarajan S., Breeling J., Whitbourne S., Deen J., Shannon C., Humphries D., et al. Million Veteran Program: A mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol. 2016;70:214–223. doi: 10.1016/j.jclinepi.2015.09.016.
  • 21. Mulder N., Abimiku A., Adebamowo S.N., de Vries J., Matimba A., Olowoyo P., Ramsay M., Skelton M., Stein D.J. H3Africa: current perspectives. Pharmgenomics. Pers. Med. 2018;11:59–66. doi: 10.2147/PGPM.S141546.
  • 22. Nagai A., Hirata M., Kamatani Y., Muto K., Matsuda K., Kiyohara Y., Ninomiya T., Tamakoshi A., Yamagata Z., Mushiroda T., et al. Overview of the BioBank Japan Project: Study design and profile. J. Epidemiol. 2017;27:S2–S8. doi: 10.1016/j.je.2016.12.005.
  • 23. All of Us Research Program Investigators, Denny J.C., Rutter J.L., Goldstein D.B., Philippakis A., Smoller J.W., Jenkins G., Dishman E. The ‘All of Us’ Research Program. N. Engl. J. Med. 2019;381:668–676. doi: 10.1056/NEJMsr1809937.
  • 24. Li B., Sangkuhl K., Keat K., Whaley R.M., Woon M., Verma S., Dudek S., Tuteja S., Verma A., Whirl-Carrillo M., et al. How to Run the Pharmacogenomics Clinical Annotation Tool (PharmCAT). Clin. Pharmacol. Ther. 2023;113:1036–1047. doi: 10.1002/cpt.2790.
  • 25. Sangkuhl K., Whirl-Carrillo M., Whaley R.M., Woon M., Lavertu A., Altman R.B., Carter L., Verma A., Ritchie M.D., Klein T.E. Pharmacogenomics Clinical Annotation Tool (PharmCAT). Clin. Pharmacol. Ther. 2020;107:203–210. doi: 10.1002/cpt.1568.
  • 26. Klein T.E., Ritchie M.D. PharmCAT: A Pharmacogenomics Clinical Annotation Tool. Clin. Pharmacol. Ther. 2018;104:19–22. doi: 10.1002/cpt.928.
  • 27. Caudle K.E., Dunnenberger H.M., Freimuth R.R., Peterson J.F., Burlison J.D., Whirl-Carrillo M., Scott S.A., Rehm H.L., Williams M.S., Klein T.E., et al. Standardizing terms for clinical pharmacogenetic test results: consensus terms from the Clinical Pharmacogenetics Implementation Consortium (CPIC). Genet. Med. 2017;19:215–223. doi: 10.1038/gim.2016.87.
  • 28. Halldorsson B.V., Eggertsson H.P., Moore K.H.S., Hauswedell H., Eiriksson O., Ulfarsson M.O., Palsson G., Hardarson M.T., Oddsson A., Jensson B.O., et al. The sequences of 150,119 genomes in the UK Biobank. Nature. 2022;607:732–740. doi: 10.1038/s41586-022-04965-x.
  • 29. Backman J.D., Li A.H., Marcketta A., Sun D., Mbatchou J., Kessler M.D., Benner C., Liu D., Locke A.E., Balasubramanian S., et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature. 2021;599:628–634. doi: 10.1038/s41586-021-04103-z.
  • 30. Szustakowski J.D., Balasubramanian S., Kvikstad E., Khalid S., Bronson P.G., Sasson A., Wong E., Liu D., Wade Davis J., Haefliger C., et al. Advancing human genetics research and drug discovery through exome sequencing of the UK Biobank. Nat. Genet. 2021;53:942–948. doi: 10.1038/s41588-021-00885-0.
  • 31. Van Hout C.V., Tachmazidou I., Backman J.D., Hoffman J.D., Liu D., Pandey A.K., Gonzaga-Jauregui C., Khalid S., Ye B., Banerjee N., et al. Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature. 2020;586:749–756. doi: 10.1038/s41586-020-2853-0.
  • 32. Chang C.C., Chow C.C., Tellier L.C., Vattikuti S., Purcell S.M., Lee J.J. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4:7. doi: 10.1186/s13742-015-0047-8.
  • 33. Danecek P., Bonfield J.K., Liddle J., Marshall J., Ohan V., Pollard M.O., Whitwham A., Keane T., McCarthy S.A., Davies R.M., Li H. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10:giab008. doi: 10.1093/gigascience/giab008.
  • 34. Gaedigk A., Turner A., Everts R.E., Scott S.A., Aggarwal P., Broeckel U., McMillin G.A., Melis R., Boone E.C., Pratt V.M., Kalman L.V. Characterization of Reference Materials for Genetic Testing of CYP2D6 Alleles: A GeT-RM Collaborative Project. J. Mol. Diagn. 2019;21:1034–1052. doi: 10.1016/j.jmoldx.2019.06.007.
  • 35. Gaedigk A., Boone E.C., Scherer S.E., Lee S.-B., Numanagić I., Sahinalp C., Smith J.D., McGee S., Radhakrishnan A., Qin X., et al. CYP2C8, CYP2C9, and CYP2C19 Characterization Using Next-Generation Sequencing and Haplotype Analysis: A GeT-RM Collaborative Project. J. Mol. Diagn. 2022;24:337–350. doi: 10.1016/j.jmoldx.2021.12.011.
  • 36. Pratt V.M., Everts R.E., Aggarwal P., Beyer B.N., Broeckel U., Epstein-Baak R., Hujsak P., Kornreich R., Liao J., Lorier R., et al. Characterization of 137 Genomic DNA Reference Materials for 28 Pharmacogenetic Genes: A GeT-RM Collaborative Project. J. Mol. Diagn. 2016;18:109–123. doi: 10.1016/j.jmoldx.2015.08.005.
  • 37. Pratt V.M., Turner A., Broeckel U., Dawson D.B., Gaedigk A., Lynnes T.C., Medeiros E.B., Moyer A.M., Requesens D., Vetrini F., Kalman L.V. Characterization of Reference Materials with an Association for Molecular Pathology Pharmacogenetics Working Group Tier 2 Status: CYP2C9, CYP2C19, VKORC1, CYP2C Cluster Variant, and GGCX: A GeT-RM Collaborative Project. J. Mol. Diagn. 2021;23:952–958. doi: 10.1016/j.jmoldx.2021.04.012.
  • 38. Byrska-Bishop M., Evani U.S., Zhao X., Basile A.O., Abel H.J., Regier A.A., Corvelo A., Clarke W.E., Musunuri R., Nagulapalli K., et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell. 2022;185:3426–3440.e19. doi: 10.1016/j.cell.2022.08.004.
  • 39. Fairley S., Lowy-Gallego E., Perry E., Flicek P. The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Res. 2020;48:D941–D947. doi: 10.1093/nar/gkz836.
  • 40. Tan A., Abecasis G.R., Kang H.M. Unified representation of genetic variants. Bioinformatics. 2015;31:2202–2204. doi: 10.1093/bioinformatics/btv112.
  • 41. Huddart R., Fohner A.E., Whirl-Carrillo M., Wojcik G.L., Gignoux C.R., Popejoy A.B., Bustamante C.D., Altman R.B., Klein T.E. Standardized biogeographic grouping system for annotating populations in pharmacogenetic research. Clin. Pharmacol. Ther. 2019;105:1256–1262. doi: 10.1002/cpt.1322.
  • 42. Cutting G.R. Cystic fibrosis genetics: from molecular understanding to clinical application. Nat. Rev. Genet. 2015;16:45–56. doi: 10.1038/nrg3849.
  • 43. Lim R.M., Silver A.J., Silver M.J., Borroto C., Spurrier B., Petrossian T.C., Larson J.L., Silver L.M. Targeted mutation screening panels expose systematic population bias in detection of cystic fibrosis risk. Genet. Med. 2016;18:174–179. doi: 10.1038/gim.2015.52.
  • 44. Clancy J.P., Johnson S.G., Yee S.W., McDonagh E.M., Caudle K.E., Klein T.E., Cannavo M., Giacomini K.M., Clinical Pharmacogenetics Implementation Consortium. Clinical Pharmacogenetics Implementation Consortium (CPIC) Guidelines for Ivacaftor Therapy in the Context of CFTR Genotype. Clin. Pharmacol. Ther. 2014;95:592–597. doi: 10.1038/clpt.2014.54.
  • 45. Amstutz U., Henricks L.M., Offer S.M., Barbarino J., Schellens J.H.M., Swen J.J., Klein T.E., McLeod H.L., Caudle K.E., Diasio R.B., Schwab M. Clinical Pharmacogenetics Implementation Consortium (CPIC) Guideline for Dihydropyrimidine Dehydrogenase Genotype and Fluoropyrimidine Dosing: 2017 Update. Clin. Pharmacol. Ther. 2018;103:210–216. doi: 10.1002/cpt.911.
  • 46. Wickham H., Averick M., Bryan J., Chang W., McGowan L., François R., Grolemund G., Hayes A., Henry L., Hester J., et al. Welcome to the Tidyverse. J. Open Source Softw. 2019;4:1686.
  • 47. Relling M.V., McDonagh E.M., Chang T., Caudle K.E., McLeod H.L., Haidar C.E., Klein T., Luzzatto L., Clinical Pharmacogenetics Implementation Consortium. Clinical Pharmacogenetics Implementation Consortium (CPIC) guidelines for rasburicase therapy in the context of G6PD deficiency genotype. Clin. Pharmacol. Ther. 2014;96:169–174. doi: 10.1038/clpt.2014.97.
  • 48. Gammal R.S., Pirmohamed M., Somogyi A.A., Morris S.A., Formea C.M., Elchynski A.L., Oshikoya K.A., McLeod H.L., Haidar C.E., Whirl-Carrillo M., et al. Expanded Clinical Pharmacogenetics Implementation Consortium Guideline for Medication Use in the Context of G6PD Genotype. Clin. Pharmacol. Ther. 2023;113:973–985. doi: 10.1002/cpt.2735.
  • 49. Theken K.N., Lee C.R., Gong L., Caudle K.E., Formea C.M., Gaedigk A., Klein T.E., Agúndez J.A.G., Grosser T. Clinical Pharmacogenetics Implementation Consortium Guideline (CPIC) for CYP2C9 and Nonsteroidal Anti-Inflammatory Drugs. Clin. Pharmacol. Ther. 2020;108:191–200. doi: 10.1002/cpt.1830.
  • 50. Karnes J.H., Rettie A.E., Somogyi A.A., Huddart R., Fohner A.E., Formea C.M., Ta Michael Lee M., Llerena A., Whirl-Carrillo M., Klein T.E., et al. Clinical Pharmacogenetics Implementation Consortium (CPIC) Guideline for CYP2C9 and HLA-B Genotypes and Phenytoin Dosing: 2020 Update. Clin. Pharmacol. Ther. 2021;109:302–309. doi: 10.1002/cpt.2008.
  • 51. Johnson J.A., Caudle K.E., Gong L., Whirl-Carrillo M., Stein C.M., Scott S.A., Lee M.T., Gage B.F., Kimmel S.E., Perera M.A., et al. Clinical Pharmacogenetics Implementation Consortium (CPIC) Guideline for Pharmacogenetics-Guided Warfarin Dosing: 2017 Update. Clin. Pharmacol. Ther. 2017;102:397–404. doi: 10.1002/cpt.668.
  • 52. Hishinuma E., Narita Y., Obuchi K., Ueda A., Saito S., Tadaka S., Kinoshita K., Maekawa M., Mano N., Hirasawa N., Hiratsuka M. Importance of Rare DPYD Genetic Polymorphisms for 5-Fluorouracil Therapy in the Japanese Population. Front. Pharmacol. 2022;13 doi: 10.3389/fphar.2022.930470.
  • 53. Botton M.R., Whirl-Carrillo M., Del Tredici A.L., Sangkuhl K., Cavallari L.H., Agúndez J.A.G., Duconge J., Lee M.T.M., Woodahl E.L., Claudio-Campos K., et al. PharmVar GeneFocus: CYP2C19. Clin. Pharmacol. Ther. 2021;109:352–366. doi: 10.1002/cpt.1973.
  • 54. Hicks J.K., Bishop J.R., Sangkuhl K., Müller D.J., Ji Y., Leckband S.G., Leeder J.S., Graham R.L., Chiulli D.L., LLerena A., et al. Clinical Pharmacogenetics Implementation Consortium (CPIC) Guideline for CYP2D6 and CYP2C19 Genotypes and Dosing of Selective Serotonin Reuptake Inhibitors. Clin. Pharmacol. Ther. 2015;98:127–134. doi: 10.1002/cpt.147.
  • 55. Lima J.J., Thomas C.D., Barbarino J., Desta Z., Van Driest S.L., El Rouby N., Johnson J.A., Cavallari L.H., Shakhnovich V., Thacker D.L., et al. Clinical Pharmacogenetics Implementation Consortium (CPIC) Guideline for CYP2C19 and Proton Pump Inhibitor Dosing. Clin. Pharmacol. Ther. 2021;109:1417–1423. doi: 10.1002/cpt.2015.
  • 56. Moriyama B., Obeng A.O., Barbarino J., Penzak S.R., Henning S.A., Scott S.A., Agúndez J., Wingard J.R., McLeod H.L., Klein T.E., et al. Clinical Pharmacogenetics Implementation Consortium (CPIC) Guidelines for CYP2C19 and Voriconazole Therapy. Clin. Pharmacol. Ther. 2017;102:45–51. doi: 10.1002/cpt.583.
  • 57. Chaudhry A.S., Prasad B., Shirasaka Y., Fohner A., Finkelstein D., Fan Y., Wang S., Wu G., Aklillu E., Sim S.C., et al. The CYP2C19 Intron 2 Branch Point SNP is the Ancestral Polymorphism Contributing to the Poor Metabolizer Phenotype in Livers with CYP2C19∗35 and CYP2C19∗2 Alleles. Drug Metab. Dispos. 2015;43:1226–1235. doi: 10.1124/dmd.115.064428.
  • 58. Gaedigk A., Ingelman-Sundberg M., Miller N.A., Leeder J.S., Whirl-Carrillo M., Klein T.E., PharmVar Steering Committee. The Pharmacogene Variation (PharmVar) Consortium: Incorporation of the Human Cytochrome P450 (CYP) Allele Nomenclature Database. Clin. Pharmacol. Ther. 2018;103:399–401. doi: 10.1002/cpt.910.
  • 59. Gaedigk A., Casey S.T., Whirl-Carrillo M., Miller N.A., Klein T.E. Pharmacogene Variation Consortium: A Global Resource and Repository for Pharmacogene Variation. Clin. Pharmacol. Ther. 2021;110:542–545. doi: 10.1002/cpt.2321.
  • 60. Sirugo G., Williams S.M., Tishkoff S.A. The Missing Diversity in Human Genetic Studies. Cell. 2019;177:1080–1131. doi: 10.1016/j.cell.2019.04.032.
  • 61. Fan S., Spence J.P., Feng Y., Hansen M.E.B., Terhorst J., Beltrame M.H., Ranciaro A., Hirbo J., Beggs W., Thomas N., et al. Whole-genome sequencing reveals a complex African population demographic history and signatures of local adaptation. Cell. 2023;186:923–939.e14. doi: 10.1016/j.cell.2023.01.042.
  • 62. Desta Z., Gammal R.S., Gong L., Whirl-Carrillo M., Gaur A.H., Sukasem C., Hockings J., Myers A., Swart M., Tyndale R.F., et al. Clinical Pharmacogenetics Implementation Consortium (CPIC) Guideline for CYP2B6 and Efavirenz-Containing Antiretroviral Therapy. Clin. Pharmacol. Ther. 2019;106:726–733. doi: 10.1002/cpt.1477.
  • 63. Li J., Menard V., Benish R.L., Jurevic R.J., Guillemette C., Stoneking M., Zimmerman P.A., Mehlotra R.K. Worldwide variation in human drug-metabolism enzyme genes CYP2B6 and UGT2B7: implications for HIV/AIDS treatment. Pharmacogenomics. 2012;13:555–570. doi: 10.2217/pgs.11.160.
  • 64. Mukonzo J.K., Bisaso R.K., Ogwal-Okeng J., Gustafsson L.L., Owen J.S., Aklillu E. CYP2B6 genotype-based efavirenz dose recommendations during rifampicin-based antituberculosis cotreatment for a sub-Saharan Africa population. Pharmacogenomics. 2016;17:603–613. doi: 10.2217/pgs.16.7.
  • 65. Horsfall L.J., Zeitlyn D., Tarekegn A., Bekele E., Thomas M.G., Bradman N., Swallow D.M. Prevalence of clinically relevant UGT1A alleles and haplotypes in African populations. Ann. Hum. Genet. 2011;75:236–246. doi: 10.1111/j.1469-1809.2010.00638.x.

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics
