Author manuscript; available in PMC: 2023 Jun 1.
Published in final edited form as: Pharmacogenet Genomics. 2022 Feb 21;32(4):159–172. doi: 10.1097/FPC.0000000000000466

Table 1.

Description of genetic data Approaches 1–6

| Approach | Sample | Description |
| --- | --- | --- |
| Approach 1 (A1) Amplicon Exon Sequencing | EUR n=935; AFR n=506 | The sample was exon sequenced using deep amplicon sequencing targeting all exons of CYP2A6, CYP2A13, and CYP2B6, and exons 1–2 of CYP2A7, as described previously[9, 38]. Insert sizes were 262–475 bp, and paired read lengths were 2×248 bp. Sanger sequencing in a sample of 120 Japanese individuals was used to validate the exon sequencing approach and yielded 100% concordance[38]. Thus, A1 was used as the “gold standard” for exonic variant calls in subsequent analyses. Initial inspection of .bam sequence alignment files showed unexpectedly low read depth at CYP2A6 exon 9 and high read depth at CYP2A7 exon 9 (which was not targeted); the same was observed for CYP2B6 exons 4–5, with high read depth at CYP2B7 exons 4–5 (also not targeted). We speculated that these were spurious alignments due to gene homology. For example, the CYP2A6*1B variant allele results in the CYP2A6 3’-UTR being identical to the 3’-UTR of CYP2A7, potentially causing misalignment of CYP2A6*1B reads to CYP2A7 sequence (Figure 1a). To address the spurious calls, realignment was performed using Bazam[39]: sequence alignment .bam files were realigned to a modified reference chromosome 19 in which CYP2A7 exons 3–9 had been masked (i.e. sequence replaced with placeholder “N” characters) using bedtools maskfasta (Figure 1b)[40]. Variant calling was then performed on the resulting .bam files using GATK HaplotypeCaller, yielding VCF files[41]. After CYP2A7 exon 3–9 masking and realignment, CYP2A6*1B calls were compared with internal Sanger sequencing genotypes for the variant; concordance was ~97% (internal data), confirming that spurious alignment of CYP2A6*1B reads to CYP2A7 sequence had caused the poor genotype calling. The same masking and realignment technique was used for all CYP2B7 exons to prevent spurious alignment of CYP2B6 exon 4–5 reads to CYP2B7. |
| Approach 2 (A2) SNP Array | EUR n=935; AFR n=506 | Individuals were genotyped using the Illumina HumanOmniExpressExome-8 version 1.2 SNP array, supplemented with >2500 additional variants in regions known to be associated with nicotine metabolism or smoking behaviours (e.g. the CYP2ABFGST cluster on chromosome 19). A full list of added variants and a description of QC procedures are available elsewhere; array markers failing Hardy-Weinberg equilibrium tests were removed[20]. Output files in PLINK binary format (.bed/.bim/.fam) were converted to VCF files using PLINK v1.9[21]. |
| Approach 3 (A3) Haplotype Reference Consortium Panel Imputation | EUR n=935 | Missing genotypes were imputed from the A2 SNP array data against the Haplotype Reference Consortium (HRC) Version 1.1 reference panel using the Michigan Imputation Server, as described previously[42]. |
| Approach 4 (A4) 1000G Imputation | AFR n=506 | Missing genotypes were imputed from the A2 SNP array data against the 1000 Genomes Phase 3 reference panel using the Michigan Imputation Server. |
| Approach 5 (A5) TOPMED Imputation | AFR n=506 | In a separate round of imputation, missing genotypes in AFR were imputed from the A2 SNP array data against the TOPMED reference panel using the TOPMED Imputation Server[43]. |
| Approach 6 (A6) Targeted Capture Sequencing | EUR n=209; AFR n=166 | A subset of the total sample (n=209 EUR and n=166 AFR) was sequenced across the entire genomic region containing CYP2A6, CYP2A13, and CYP2B6 (including intergenic regions and introns from GRCh37 chr19:41322500–41615000). A custom hybridization target capture design next-generation sequencing (NGS) method was used, as described previously[18]. Paired-end read lengths were 2×151 bp. |

1000G: 1000 Genomes Project; AFR: African-ancestry individuals; EUR: European-ancestry individuals; HRC: Haplotype Reference Consortium; QC: Quality Control
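
The masking and concordance steps described for A1 can be illustrated with a minimal Python sketch. It is not the authors' pipeline, which used bedtools maskfasta, Bazam, and GATK HaplotypeCaller. The function names `mask_intervals` and `genotype_concordance`, the toy sequence, and the interval coordinates are hypothetical. The sketch assumes BED-style 0-based, half-open intervals, and it counts concordance only over samples called in both sets.

```python
from typing import Dict, Iterable, Optional, Tuple

# Unphased pair of alleles, e.g. ("A", "G"); None marks a missing call.
Genotype = Tuple[str, str]


def mask_intervals(sequence: str,
                   intervals: Iterable[Tuple[int, int]],
                   mask_char: str = "N") -> str:
    """Replace each 0-based, half-open [start, end) interval with mask_char.

    Illustrates the idea of hard-masking a reference so that reads can no
    longer align to the masked region (hypothetical helper).
    """
    masked = list(sequence)
    for start, end in intervals:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        masked[start:end] = mask_char * (end - start)
    return "".join(masked)


def genotype_concordance(reference: Dict[str, Optional[Genotype]],
                         test: Dict[str, Optional[Genotype]]) -> float:
    """Fraction of samples called in both sets whose unphased genotypes match."""
    compared = 0
    matched = 0
    for sample, ref_gt in reference.items():
        test_gt = test.get(sample)
        if ref_gt is None or test_gt is None:
            continue  # skip samples missing a call in either set
        compared += 1
        if sorted(ref_gt) == sorted(test_gt):  # allele order is ignored
            matched += 1
    if compared == 0:
        raise ValueError("no samples with calls in both sets")
    return matched / compared


# Toy example: mask two made-up intervals in a 12-base sequence.
toy_reference = "ACGTACGTACGT"
print(mask_intervals(toy_reference, [(2, 5), (8, 10)]))  # ACNNNCGTNNGT
```

In this sketch, a "gold standard" call set (for example, Sanger genotypes) would be passed as `reference` and the calls under evaluation as `test`. How missing calls should be treated is a design choice that the table does not specify.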