Abstract
An unexpectedly high proportion of SNPs on the X chromosome in the 1000 Genomes Project phase 3 data were identified with significant sex differences in minor allele frequencies (sdMAF). sdMAF persisted for many of these SNPs in the recently released high coverage whole genome sequence of the 1000 Genomes Project that was aligned to GRCh38, and it was consistent between the five super-populations. Among the 245,825 common (MAF>5%) biallelic X-chromosomal SNPs in the phase 3 data presumed to be of high quality, 2,039 have genome-wide significant sdMAF (p-value <5e-8). sdMAF varied by location: non-pseudo-autosomal region (NPR) = 0.83%, pseudo-autosomal regions (PAR1) = 0.29%, PAR2 = 13.1%, and X-transposed region (XTR)/PAR3 = 0.85% of SNPs had sdMAF, and they were clustered at the NPR-PAR boundaries, among others. sdMAF at the NPR-PAR boundaries are biologically expected due to sex-linkage, but have generally been ignored in association studies. For comparison, similar analyses found only 6, 1 and 0 SNPs with significant sdMAF on chromosomes 1, 7 and 22, respectively. Similar sdMAF results for the X chromosome were obtained from the high coverage whole genome sequence data from gnomAD V 3.1.2 for both the non-Finnish European and African/African American samples. Future X chromosome analyses need to take sdMAF into account.
Author summary
The human X chromosome contains over 800 genes and is the 8th largest human chromosome. Genome-wide associations studies have generally failed to examine variants on the X chromosome for association with diseases and traits, partly due to complexities of the data analysis, and challenges with genotype imputation. We examined X chromosomal variants from the 1000 Genomes Project for sex differences in allele frequency and found that many variants showed significant differences. These variants cluster at the centromeric parts of the pseudoautosomal regions 1 and 2, as well as the putative pseudo-autosomal region 3 (also termed X-transposed region). This pattern was observed in high coverage whole genome sequence data from the same subjects that was aligned to GRCh38, suggesting that is not an artefact of low coverage sequencing or problems specific to GRCh37. In addition, we replicated this phenomenon in high coverage whole genome sequence aligned to GRCh38 from the gnomAD database in both the non-Finnish European and African/African American samples. These findings have implications for the analysis of X chromosomal variants for disease and trait associations.
Introduction
After the striking observation that the X chromosome was excluded from most genome-wide association studies (GWAS) [1], there has been a slow increase in the incorporation of analysis of the X chromosome [2–5]. Several association methods have recently been developed for the X chromosome [2–12], focussing on the phenomenon of X-inactivation [13], also known as dosage compensation. However, these X chromosome specific downstream methods, similar to those developed for the autosomes, typically presume high quality data and implicitly assume that there is no sex difference in minor allele frequency (sdMAF).
Most genotype calling, imputation and sequence analyses of X chromosomal variants apply methods and tools that were developed for the autosomes [1,2]. However, there are reports that genotype missing rate is higher for SNPs on the X chromosome than autosomes [1,2]. In the non-pseudoautosomal region (NPR) of the X chromosome males are hemizygous, meaning that the intensity of allele signals from genotyping arrays, or the number of reads from sequencing, is half that of homozygous females. This may result in variant positions having higher missing rates for males than females.
In addition to higher missing rates for variants on the X chromosome, the two pseudo-autosomal regions (PARs), PAR1 and PAR2, create further challenges for the analysis at the boundaries between PARs and NPR [14]. PAR1 is 2.75 Mb at the end of the short arms of the X and Y chromosomes (Xp22.33 and Yp11.32-p11.2) containing 16 genes, while PAR2 is at the tip of the long arms (Xq28 and Yq12) and is 320 kb, containing 4 genes [15–17]. Because there is no recombination of NPR in males, but recombination occurs in PAR1 and PAR2 in both sexes, variants in PARs close to the PAR-NPR boundaries are linked to variants in NPR of the X and Y chromosomes in a sex-specific fashion. Obligatory recombination occurs in PAR1 in males, making it the region with the highest recombination rate per physical distance in the human genome, while only ~2% of meiosis feature recombination at PAR2. Although the effects of sex-specific recombination rates in PAR1 and PAR2 on linkage have been examined for non-parametric linkage analysis of affected sibpairs [18], the implications for X chromosomal data collected for association studies have not been well explored. The XTR/PAR3 in Xq21.3/Yp11.1 adds further complexity as this 3.91 Mb region on the X chromosome (3.38 Mb on the Y chromosome) is embedded in the non-pseudoautosomal region and has 98.8% sequence homology between X and Y [19–22].
Recently, sex-differences in allele frequency in PAR1 and PAR2 were described using the African super-population from the phase 3 data of the 1000 Genomes Project [23], but the rest of the X chromosome and other four super-populations of the 1000 Genomes Project were not examined. Evolutionary dynamics, including recombination within the PAR regions, were reasoned as a major contributing factor to sdMAF, but genotyping errors and the agreement with the high coverage data were not examined.
We hypothesize that sex difference in MAF exists across the X chromosome in human populations, including NPR and at the boundaries between NPR and PAR1, PAR2 and PAR3, and it is more prevalent in low-coverage whole genome sequence data. We use publicly available phase 3 data from the 1000 Genomes Project to test for sdMAF across and within each of the five super-populations. We compare the results with the recently released high coverage whole genome sequence to determine whether genotyping error is contributory. Finally, we examined genotype data from the high coverage whole genome sequence gnomAD v3.1.2 resource to further evaluate sdMAF on the X chromosome in two defined populations with the largest sample sizes.
Better understanding of the different sources of sdMAF is critical to developing X chromosome-suitable analytical strategies, from improved data collection and imputation to more robust association methods for variants on the X chromosome.
Results
1000 Genomes Project
The proportions of males and females were similar across the 26 populations of the 1000 Genomes Project (S1 Fig).
Phase 3 X chromosome-wide sdMAF across super-populations (ALL)
For biallelic SNPs with global MAF≥5% (Fig 1 and S1 Table), the Manhattan plot for sdMAF p-values (Fig 2A) shows that a non-negligible proportion of X chromosomal SNPs have significant sex difference in MAF even at the genome-wide significance threshold of p-value <5e-8: 0.83% of SNPs in NPR, 0.29% in PAR1, 13.1% in PAR2, and 0.85% in PAR3. The excesses of small p-values for sdMAF testing in all four regions of the X chromosome are also evident from both the QQ plots (S2 Fig) and histograms of the sdMAF p-values (S3 Fig). SNPs with significant sdMAF are located across the X chromosome but tend to cluster in specific regions (Fig 2A).
Super-population specific analyses
Our primary analysis (above) combined data from all five super-populations. We also performed analysis separately for each super-population to determine whether the effect magnitude and direction were consistent across super-populations. S4–S8 Figs present the results for each super-population and show that, based on the comparisons of the p values, there are generally similar sdMAF patterns across the five super-populations. The consistency is also present for the magnitude and direction of sdMAF (S9–S10 Figs).
The top row of S9 Fig shows that the ALL analysis is overall much more powerful than the super-population-specific analysis, identifying many SNPs with significant sdMAF that would be missed by the individual super-population analysis, and often the sdMAF p value is several magnitudes smaller.
In contrast, although the individual super-population (continent-stratified) analysis identified some SNPs with significant sdMAF that would be missed by the ALL analysis (S2 Table), the sdMAF p values are comparable as shown in S9 Fig. There is also a remarkable consistency in sdMAF estimates between the five super populations as shown in S10 Fig. We also performed a meta-analysis with sample-size based weighting of all five super-populations and compared the results to the primary ALL analysis (above). S11 Fig (Miami) and S12 Fig (pp plot) show that the results are consistent between the two analyses.
Although unlikely, it is possible that the above continent-stratified analysis may fail to identify sdMAF SNPs with heterogeneity (in terms of signs of the sdMAF estimates) in sdMAF between populations. Thus, for the eight X chromosomal SNPs with the most salient results (as initially discussed in S1 Data) from the 1000 Genomes phase 3 data, we further conducted a population-stratified sdMAF analysis, separately for each of the 26 populations.
Fig 3 shows that, as expected, the sdMAF results for these SNPs are consistent across the 26 populations. For example, for rs6634333 (POS = 140993859) from the NPR region (Fig 3H), the sdMAF estimate in the overall sample is 0.338 (p-value = 3.78E-151). While the 26 population-specific estimates are not identical, all 26 estimates are positive and range from 0.206 (p-value = 0.011) in STU to 0.443 (p-value = 2.90E-9) in ESN (S5 Data). Remarkable consistencies across populations are also observed for the other seven SNPs (Fig 3).
Regional variation in sdMAF
Fig 2B shows that the direction and magnitude of the sdMAF varies by genomic location. Among the analysed SNPs, 0.39%, 16.88%, 1.54% and 1.18% of the SNPs have absolute sdMAF ≥0.05, respectively in PAR1, PAR2, PAR3, and NPR. For these SNPs, the medians of absolute(sdMAF) are respectively 0.095, 0.121, 0.074, and 0.077, and the [Q1,Q3] are [0.060,0.274], [0.064,0.286], [0.060,0.099], and [0.059,0.111], respectively, in PAR1, PAR2, PAR3, and NPR regions. Of the 2,039 SNPs with significant sdMAF we observed females generally having higher MAF than males: NPR = 93%, PAR2 = 59% and PAR3 = 86% among the SNPs with genome-wide significant sdMAF, except for PAR1 = 31% (Figs 4–6). Specifically, females have higher MAF at around 30 Mb (GRCh37) and at the q-arm of the centromere (Fig 2B), as well as at the centromeric boundary of PAR3 (Fig 6). In contrast, at the region of PAR1 close to the NPR boundary, males tend to have higher MAF among the SNPs with significant sdMAF (Fig 4). Finally, there are sets of PAR2 SNPs close to the NPR boundary that have higher MAF in females, while other sets of SNPs have higher MAF in males (Fig 5).
We then examined how the sdMAF relates to the sex-combined MAF. Fig 6 provides Bland-Altman plots [24] separately for each of the four regions. Of note, for NPR and PAR3 (Fig 7A and 7D), SNPs with significant sdMAF tend to have higher MAFs in females and predominantly had sex-combined MAFs in the range 25%-40%. In contrast, for PAR1 (Fig 7B), SNPs with significant sdMAF tend to have higher MAFs in males. Finally, for PAR2 (Fig 7C), there are sets of clustered SNPs with significant sdMAF in either direction (Fig 7C).
In-depth analysis of eight SNPs with the most significant sdMAF in the phase 3 data
For the eight selected SNPs, two from each of the four regions (NPR, PAR1, PAR2, and PAR3) with the smallest sdMAF p-values in the combined sample, the population-specific sdMAF p-values remain genome-wide significant (S1 Data). Moreover, the directions of the sdMAF are consistent across the five super-populations. For example, for rs201194898 in NPR (GRCh37 position = 9,377,082), the sdMAF p-values are <1e-200, 8.82e-149, 1.12e-49, 2.50e-89, 2.91e-45, and 1.95e-101, respectively in the ALL (combined) and the EAS, EUR, AFR, AMR, and SAS super-populations; the corresponding female minus male sdMAFs are all >0.27. That is, the MAFs of rs201194898 are significantly larger in females than in males, across all five super-populations. Note that the minor allele, defined based on sex- and population-combined sample, may not have MAF less than 0.5 in a sex- and/or population-stratified sample, and for a SNP in NPR and PAR3 each male only contributes a single allele to the MAF calculation. Similar results are reported in S1 Data for rs6634333 in NPR, as well as the two PAR3 SNPs in the 88,460,295 and 88,462,611 positions (GRCh37).
For the PAR1 SNP at 2,697,599, the super-population sdMAF p-values are all <1e-200 (except 2.69e-80 in AMR); the corresponding female minus male sdMAFs are all more extreme than −0.46 (−0.34 in AMR). Similar results were observed for the PAR2 SNP at 154,934,295, for which sdMAF p-values <1e-200 in all five super-populations: MAFs are significantly smaller in females than males. In addition, all females in EAS, EUR and AMR are homozygous TT, while males in EAS, AFR and AMR are all AT heterozygotes. In other words, genotype is fixed by sex in EAS and AMR.
The population-specific HWE testing results [25–27], however, vary across the eight SNPs or across the five super-populations. For example, for rs6634333 in NPR, the HWE testing p-values in females are genome-wide significant in all but SAS: 1.71e-30, 3.21e-34, 1.09e-18, 1.97–22, and 2.41e-4 respectively in EAS, EUR, AFR, AMR and SAS. The corresponding population-specific HWD delta estimates are all negative, with an excess of heterozygous females. For rs201194898 in NPR and the two PAR3 SNPs, population-stratified HWE testing in females are genome-wide significant, with excesses of heterozygous females in all five super-populations.
For the PAR1 SNP in the 2,697,599 position where HWE testing could be performed in both females and males, HWE testing in males are all genome-wide significant, but in females the HWE p-values >0.8; consistent results were observed for the other three SNPs in PAR1 and PAR2 (S1 Data). In addition, a SNP can be monomorphic in females (HWE p-value = NA in that case), while in the male sample the heterozygous genotype may be present for some of the five super-populations.
Minor allele switch between males and females
Of the 2,039 SNPs with genome-wide significant sdMAF, 66 (3.24%) have minor allele flips between males and females. For this set of SNPs we provide S2 Data (similar in format to S1 Data).
Additionally, without requiring the sdMAF testing to be genome-wide significant, S13 Fig shows Bland-Altman plots, stratified by the four regions, for X chromosome SNPs with different minor alleles between males and females. Among this set of SNPs, only one SNP, in PAR2 (POS = 154941870), has close to sex-specific genotypes: A1A1 = 0, A1A2 = 20, A2A2 = 1251 in females, and A1A1 = 9, A1A2 = 1219, A2A2 = 5 in males, and has genotype-wide significant sdMAF (S2 Data).
Sliding window analysis of phase 3 1000 Genomes X chromosome data
We performed a sliding window analysis, using a window size of 50 SNPs, sliding 25 SNPs at a time using the mean of the single SNP–log(10) p values. This identified ten regions across the X chromosome, of which the three at the boundaries of the PARs with NPR remain (S14 Fig).
sdMAF in NPR homologs
There are 19 NPR genes with homologs on X and Y (S3 Data). Of the 2,039 SNPs with significant sdMAF, 52 were in homologs (S4 Data).
Comparison to chromosomes 1, 7 and 22 in the 1000 Genomes Project phase 3 data
Chromosomes 1, 7 and 22 were selected as the longest, most similar in length to the X chromosome, and one of the shortest. There were 530,434, 406,057 and 97,216 biallelic SNPs with global MAF ≥5% on these chromosomes, respectively. Manhattan plots of sdMAF p-values (S15–S17 Figs), as well as the histograms and QQ plots (S18 Fig), show that there are very few SNPs on these autosomes with genome-wide significant sdMAF. Specifically, only 6, 1 and 0 SNPs on chromosomes 1, 7 and 22 had sdMAF p-values <5e-8, respectively. These numbers are significantly lower than that found for the X chromosome.
To obtain better insight into the nature and source the sdMAF on the autosomes, we then further examined six SNPs, two from each of the three autosomes with the smallest sdMAF p-values (S1 Data). For example, on chromosome 1 rs10803097 (GRCh37 position = 243050350; sdMAF p-value = 2.15e-25) has higher MAFs in males than females in all five super-populations, but the population-specific sdMAFs are only genome-wide significant in EAS and AFR (S1 Data) while sdMAF p-value = 0.12 in SAS, suggesting heterogeneity in sdMAF between the super-populations at this autosomal SNP. The HWE testing are genome-wide significant in males in all super-populations (except SAS) with excess of heterozygous males, but not in females. A BLAST [28] search of a 100 nucleotide sequence flanking this SNP identified perfect match to sequence on the NPR region of the Y chromosome (GRCh38 position:11786038), with the Y chromosome having the alternate allele at the SNP, suggesting that it is a Paralogous Sequence Variant [29].
In contrast, on chromosome 7 rs78984847 (GRCh37 position = 72053830; sdMAF p-value = 3.5e-18) has higher MAFs in females than males in all five super-populations. The population-stratified sdMAF p-values are consistently small, 7.66e-5, 3.06e-4, 1.76e-5, 7.52e-3, and 1.79e-6 respectively in EAS, EUR, AFR, AMR and SAS, but not genome-wide significant as a result of reduced sample sizes. In addition, HWE p-values are much smaller in females than males for all five super-populations, with excess of heterozygous females; the HWE p-values in females are <5e-8 in EUR, AFR and SAS, and 2.92e-7 in EAS and 1.48e-5 in SAS (S1 Data). There is also evidence for departure from HWE in males; the HWE p-values in males are 2.02e-2, 1.67e-3, 1.21e-4, 2.57e-4 and 1.57e-2 respectively in EAS, EUR, AFR, AMR, and SAS, with excess of CC males. BLAST of a 100 nucleotide sequence centred on this SNP identified multiple close matches to other chromosomes, including the X chromosome.
Comparison for specific SNPs between the phase 3 and high coverage sequence data of the X chromosome
sdMAFs and sex-specific genotype agreements between the two phases were attempted for 130 SNPs, selected based on the smallest sdMAF p-values in the phase 3 data (GRCh37) from the four regions (NPR, PAR1, PAR2, and PAR3). We required success of liftover onto GRCh38, that they were biallelic in both datasets, and with no genotype missingness. The liftover failure rates differed by region: NPR = 92%, PAR1 = 0%, PAR2 = 55%, PAR3 = 82%. Using all criteria left 33 SNPs: 4 in NPR, 10 in PAR1, 10 in PAR2, and 9 in PAR3 (S2 Note with SNPs ordered by the GRCh37 positions).
For the 10 SNPs in PAR1, the high coverage data did not resolve the genome-wide significant sdMAF observed in the phase 3 data (Tables A-AD (pages 2–11) in S2 Note). In addition, they showed good genotype agreement with the phase 3 data; similar results were observed for 10 SNPs in PAR2 (Tables BR-CU (pages 25–34) in S2 Note). For the 4 SNPs in NPR (Tables AE-AM (pages 12–14) and Tables BO-BQ, and (page 24) in S2 Note), sdMAFs are no longer genome-wide significant in the high coverage data, suggesting genotyping error in phase 3. For the 9 SNPs in PAR3 (Tables AN-BN (pages 15–23) in S2 Note), two persisted with genome-wide sdMAF in the high coverage data, located at the centromeric boundary between PAR3 and NPR, while the remaining seven in PAR3 were no longer significant.
High coverage 1000 Genomes X chromosome-wide sdMAF
Results of X chromosome-wide sdMAF analysis of the high coverage data, without the liftover restriction, are reported in S3 Table and S19–S25 Figs. S19 Fig shows the analytical pipeline of selecting bi-allelic SNPs with population-pooled MAF ≥5% using the high coverage data, and S3 Table contains the counts of variants by region, MAF threshold and those excluded.
The Manhattan plot of sdMAF p-values in S20 Fig shows that, as compared to Fig 2 for the phase 3 data, the prevalence of genome-wide significant sdMAF is reduced in the high coverage data for NPR and PAR3, about a 10-fold deduction: 0.11% of SNPs in NPR and 0.07% in PAR3 have sdMAF p-values <5e-8 in the high coverage data, as compared to 0.83% in NPR and 0.85% in PAR3 in phase 3. This suggests that genotyping error is a contributing factor to some of the significant sdMAF observed in the phase 3 data. The causes of the remaining sdMAF in NPR and PAR3 in the high coverage data require further examination.
The sdMAF results for PAR1 (Figs 4 and S21) and PAR2 (Figs 5 and S22) are practically the same between the two phases: 0.30% of SNPs in PAR1 and 12.2% in PAR2 have sdMAF p-values <5e-8 in the high coverage data, as compared to 0.29% in PAR1 and 13.1% in PAR2 in phase 3. The sdMAF SNPs in the high coverage data still occur at the NPR-PAR1 and NPR-PAR2 boundaries, which are evident from the zoomed-in Manhattan plots (S21 and S22 Figs).
SNPs with significant sdMAF also remain at the centromeric NPR-PAR3 boundary (S23 Fig). The persistent presence of small sdMAF p-values, including in the NPR region, is also evident from the QQ plots (S24 Fig) and histograms of p values (S25 Fig). This strongly suggests that sex-linkage is a major driver for sdMAF at both PAR1 and PAR2, and at the centromeric boundary of PAR3. Association studies of the X chromosome thus must consider sdMAF.
X chromosome sdMAF analysis of high-coverage whole genome sequence data from gnomAD v3.1.2
To examine consistency of sdMAF in other high coverage whole genome sequence data, we used the genotype and allele counts from gnomAD separately from the two largest populations: the non-Finnish Europeans and Africans/African Americans to examine sdMAF on the X chromosome.
To be consistent with the 1000 Genomes Project analysis, the primary analyses were restricted to SNPs with MAF>5% and where the majority of samples had genotypes. This resulted in 53,002 SNPs in non-Finnish Europeans and 97,438 in African/African Americans (S26 and S27 Figs, respectively). These analyses confirm the existence of many regions with sdMAF that we documented in the 1000 Genomes Project data, including the PAR1-NPR boundary, the PAR2-NPR boundary and the centromeric boundary of PAR3 (Figs 8 and 9, respectively). Online locuszoom plot [30] for the non-Finnish European data is available: https://my.locuszoom.org/gwas/717341/?token=b784386eb4574ef7ba46c117ed711ccf
To illustrate the sdMAF results for specific SNPs, we selected eight SNPs with the most significant sdMAF from the gnomAD non-Finnish European and African/African American populations, one each from the four regions (Table 1). There is remarkably consistency of sdMAF direction and magnitude across these two populations. Of note, the SNP genotype call rate in males are generally lower than females (paired non-parametric test p value = 0.0026) for these SNPs, which may provide insight into some mechanisms resulting in sdMAF.
Table 1. SNPs with most significant sdMAF from gnomAD v 3.1.2 genome in non-Finnish European and African/African Americans.
X chr Region | SNP (GRCh38) | gnomAD Population | Sex | Call rate (%) | RR | RA | AA | Minor Allele | sdMAF | sdMAF.p | HWD.delta | HWE.p |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PAR1 | X-2779827-G-A | NFE | F | 99.72 | 19436 | 194 | 1 | A | 0.000 | 0.46 | ||
PAR1 | X-2779827-G-A | NFE | M | 99.50 | 626 | 13577 | 67 | A | -0.475 | <1E-300 | -0.226 | <1E-300 |
PAR1 | X-2779827-G-A | AFR | F | 99.16 | 6772 | 3707 | 514 | A | 0.000 | 0.82 | ||
PAR1 | X-2779827-G-A | AFR | M | 99.08 | 248 | 7346 | 1975 | A | -0.375 | <1E-300 | -0.142 | <1E-300 |
PAR1 | X-2780882-G-A | NFE | F | 99.85 | 19567 | 89 | 0 | A | 0.000 | 0.75 | ||
PAR1 | X-2780882-G-A | NFE | M | 99.64 | 91 | 14150 | 39 | A | -0.496 | <1E-300 | -0.245 | <1E-300 |
PAR1 | X-2780882-G-A | AFR | F | 99.51 | 8781 | 2118 | 133 | A | 0.000 | 0.68 | ||
PAR1 | X-2780882-G-A | AFR | M | 99.64 | 137 | 8442 | 1044 | A | -0.439 | <1E-300 | -0.191 | <1E-300 |
NPR | X-52861869-T-C | NFE | F | 98.54 | 14176 | 5222 | 1 | C | -0.018 | 1.45E-103 | ||
NPR | X-52861869-T-C | NFE | M | 85.50 | 12189 | NA | 73 | C | 0.129 | <1E-300 | NA | NA |
NPR | X-52861869-T-C | AFR | F | 99.36 | 9642 | 1368 | 5 | C | -0.003 | 6.00E-10 | ||
NPR | X-52861869-T-C | AFR | M | 90.36 | 8552 | NA | 175 | C | 0.042 | 1.43E-84 | NA | NA |
PAR3 | X-89688092-A-G | NFE | F | 99.48 | 13994 | 5106 | 483 | G | 0.001 | 0.50 | ||
PAR3 | X-89688092-A-G | NFE | M | 89.78 | 11538 | NA | 1338 | G | 0.051 | 1.37E-55 | NA | NA |
PAR3 | X-89688092-A-G | AFR | F | 99.56 | 9694 | 1300 | 43 | G | 0.000 | 0.93 | ||
PAR3 | X-89688092-A-G | AFR | M | 84.52 | 7865 | NA | 298 | G | 0.026 | 2.44E-23 | NA | NA |
PAR3 | X-90958827-T-C | NFE | F | 99.38 | 15016 | 4218 | 330 | C | 0.001 | 0.09 | ||
PAR3 | X-90958827-T-C | NFE | M | 91.08 | 12039 | NA | 1025 | C | 0.046 | 1.66E-57 | NA | NA |
PAR3 | X-90958827-T-C | AFR | F | 99.31 | 9611 | 1353 | 45 | C | 0.000 | 0.72 | ||
PAR3 | X-90958827-T-C | AFR | M | 85.60 | 7990 | NA | 277 | C | 0.032 | 3.17E-35 | NA | NA |
NPR | X-141051591-A-G | NFE | F | 99.31 | 14846 | 4700 | 5 | G | -0.014 | 6.07E-79 | ||
NPR | X-141051591-A-G | NFE | M | 79.08 | 11253 | NA | 89 | G | 0.113 | <1E-300 | NA | NA |
NPR | X-141051591-A-G | AFR | F | 98.57 | 8826 | 2095 | 7 | G | -0.009 | 2.57E-25 | ||
NPR | X-141051591-A-G | AFR | M | 84.68 | 8057 | NA | 121 | G | 0.082 | <1E-300 | NA | NA |
PAR2 | X-155706522-G-A | NFE | F | 98.23 | 19321 | 16 | 0 | A | 0.000 | 0.95 | ||
PAR2 | X-155706522-G-A | NFE | M | 97.92 | 30 | 14012 | 2 | A | -0.499 | <1E-300 | -0.249 | <1E-300 |
PAR2 | X-155706522-G-A | AFR | F | 98.71 | 10933 | 10 | 0 | A | 0.000 | 0.96 | ||
PAR2 | X-155706522-G-A | AFR | M | 99.26 | 11 | 9573 | 3 | A | -0.499 | <1E-300 | -0.249 | <1E-300 |
PAR2 | X-155712209-T-G | NFE | F | 99.63 | 19585 | 28 | 0 | G | 0.000 | 0.92 | ||
PAR2 | X-155712209-T-G | NFE | M | 99.12 | 29 | 14186 | 2 | G | -0.498 | <1E-300 | -0.249 | <1E-300 |
PAR2 | X-155712209-T-G | AFR | F | 99.58 | 10520 | 516 | 3 | G | 0.000 | 0.19 | ||
PAR2 | X-155712209-T-G | AFR | M | 99.78 | 14 | 9404 | 218 | G | -0.487 | <1E-300 | -0.238 | <1E-300 |
SNP name is from gnomAD which incorporates chromosome-nucleotide position and reference-alternate alleles; Population NFE = non-Finnish European; AFR = African/African American. Sex: F = female, M = male; call rate is proportion of individuals with non-missing genotypes; RR = reference homozygote/hemizygote genotype count; RA = heterozygote genotype count; AA = alternate allele homozygote/hemizygote genotype count. sdMAF is female-male allele frequency difference, along with p value. HWD delta is the Hardy-Weinberg equilibrium delta, along with p value, NA = not applicable.
Rarer variants in gnomAD v3.1.2 non-Finnish Europeans
To determine whether sdMAF also is present for rarer variants, we used the non-Finnish Europeans from gnomAD v3.1.2. to examine 155,828 X chromosomal variants with MAF 0.1% to 5%. These would typically have a minimum minor allele count of 50. Significant sdMAF was observed at the PAR1 and PAR2 boundaries with NPR, as well as those in PAR3 (S28 Fig). In addition, there were other regions including around 142 Mb, where there were clusters of rarer SNPs with significant sdMAF where female MAFs were greater than male. This later region was not observed strongly in the variants with MAF>5% in the same data (Fig 8). Online locuszoom of these results is available: https://my.locuszoom.org/gwas/473034/?token=78b9004c71f04892adf5f514be014044
Discussion
Our initial sdMAF analysis focused on the 1000 Genomes Project phase 3 data since it has been examined extensively for association analysis, and is one of the most commonly used imputation panels for GWAS [31]. With the recent release of the high coverage data, we first compared results between the two phases for specific SNPs. In addition, we also performed a separate X chromosome-wide analysis of the 1000 Genomes Project high coverage data aligned to GRCh38 as well as mostly high-coverage whole genome sequence aligned to GRCh38 from two populations from gnomAD.
In the 1000 Genomes Project phase 3 data, focused on eight selected SNPs, we showed that the sdMAF was robust to population stratification either coming from super-population or population levels. Yet, we cannot exclude the possibility that there could be additional SNPs with sdMAF that we have missed. Similar concerns relate to the analysis of the gnomAD data, where our analysis was performed separately for the two major super-populations (likely defined by autosomal population structure), the results could be confounded by population structure and/or admixture within them. However, S10 Fig shows that sdMAF estimates are consistent between the five super populations across the whole X chromosome in the 1000 Genomes phase 3 data. Additionally, Fig 3 shows that the sdMAF estimates are extremely consistent between the 26 populations for the eight selected SNPs with the smallest sdMAF p value.
We identified two likely sources of sdMAF: genotyping error and sex-linkage. Genotyping error (which may in part be due to differences between GRCh37 and 38) accounts for many NPR and PAR3 sdMAF in phase 3 since they were mostly resolved in the high coverage data. However, sdMAF for some NPR and PAR3 SNPs remain in the high coverage data and are also present in gnomAD. Thus, despite the recent advances in how to analyze NPR SNPs [6], our findings here show that robust X chromosomal association methods must consider sdMAF caused by genotyping error and/or sex-specific selection. For example, for sex-dimorphic traits, others have suggested sex-specific analysis genome-wide [32]. Updated reference sequence along with alternate haplotypes for regions of the X and Y chromosomes may improve the mapping of sequence reads and genotype quality [33,34].
In contrast, the impact of sex-linkage in PAR1 and PAR2 results in most sdMAF identified in phase 3 persisting in the high coverage data. Sex-linkage at PARs has previously been discussed for linkage analyses using affected sibpairs [18], but it has not been examined in the context of association studies [16,17]. Multiple authors have stated that association methods routinely used for autosomes can be applied to the PARs [3–5]. However, our results indicate that association analyses at the PAR-NPR boundaries (and at the centromeric boundary of PAR3) should consider sdMAF caused by sex-linkage. The optimal region-specific solution is an open research question.
We searched the NHGRI-EBI GWAS catalog [35] and identified multiple signals at the PAR boundaries. These include loci: for BMI, multiple lipids, red cell traits, non-syndromic metopic craniosynostosis at the PAR2 boundary [36–39]; ANCA-associated vasculitis, Alzheimer’s disease, non-syndromic metopic craniosynostosis, susceptibility to TB, 3-hydroxy-1-methylpropylmercapturic acid levels and adenosine diphosphate at PAR3 [36,40–45]; age of onset of myopia, mean corpuscular volume, eosinophil count, and asthma [43][39,46,47] close to PAR1/NPR boundary. Apart from two studies [40,45], none of others provided sex-specific results, making it difficult to determine the effect of sdMAF on these associations.
The PAR1-NPR boundary is in intron 3 of XG. The genetic basis of the Xga blood group has recently been studied [48–50] with groups independently identifying rs311103 in PAR1 as the potential causal variant. Other work has also suggested that a large deletion that spans PAR1-NPR could also be a separate causal variant for Xga [51]. Polymorphic duplications of the Y chromosome neighbouring PAR1, as well as a deletion of the overlapping region on the X chromosome have been characterized [52]. Since these CNVs are relatively rare, e.g. as described in gnomAD v 2.1 (Web Resources) [53], they are unlikely to result in the major sdMAF seen at the PAR1-NPR boundary.
A small proportion (3%) of sdMAF in phase 3 were in the 19 X/Y homologs. Further examination of allele frequencies of the Y chromosome is interesting but beyond the scope of this work.
It is unlikely that strand flips are a major cause of the difference in genotype by sex, since variant calling was performed blind to sex and typically joint-called, especially for the high coverage 1000 Genomes Project data [54] and the gnomAD data [55]. Further, inspection of Integrative Genomics Viewer (IGV) plots in gnomAD v3.1.2 shows that there are few discrepancies of sequencing reads around these SNPs, making it unlikely that they are strand flips. Sex-specific quality metrics for variants on the X chromosome could help identify variants with technical interference.
Of note, for many variants with sdMAF, there was also deviation from Hardy-Weinberg equilibrium [27] in each of the super-populations in females (NPR and PAR3) and males (PAR1 and PAR2) (S1 Data). HWD testing of variants in the 1000 Genomes phase 3 data has been examined separately in the JPT and YRI populations [56,57]. SNPs with missing rate <5% and possessing an rs identifier were used. The earlier work [56] found lower rates of deviation from HWE on the X chromosome (after exclusion of PAR1 and PAR2) than on any of the autosomes, but the sample size for the X chromosome is ~3/4 of that for the autosomes. Results from our sdMAF analysis also calls for new X chromosome-aware HWD methods that consider sdMAF.
Earlier work has examined HWD of X chromosomal variants (59), using a 2 degrees of freedom (df) Pearson Chisq-based test that jointly analyzes both females and males. The test includes, in addition to the female data, the deviation of male genotype counts from the expected, based on pooled allele frequency estimate using both male and female data. However, it has been shown that this 2 df HWD test is equivalent to testing for HWD in females alone and simultaneously testing for sdMAF between the sexes [58]. Because of this confounding between HWD and sdMAF, for variants in the PAR1 and PAR2 regions, we performed the HWD analysis stratified by the sex. And for NPR and PAR3 variants, we performed the HWD analysis in females only; the males have two hemizygous genotypes, leaving no df to perform the HWD test after using one df to estimate the MAF in males.
In addition to the 2 df Chisq test, the work of [57] also proposed and recommended an Exact test, jointly analyzing both females and males. Although the Exact test does not specify the degrees of freedom, our preliminary work suggests that it is similar to the 2 df Chisq test in spirit, because the Exact test includes males and is derived conditional on the “number of A alleles (nA) and the observed male and female frequencies.” This could lead to increased type I error HWD test in the presence of true sdMAF. For example, consider a SNP from the NPR region with significant sdMAF but no deviation from HWE in females: female MAF = 0.2 with genotype counts of (640, 320, 40) and male MAF = 0.5 with genotype counts of (500, NA, 500). In that case, the sex-combined EXACT test p = 6.41E-61 (using the R program HardyWeinberg v1.7.4 [59]), but this highly significant Exact HWD testing result is due to the significant sdMAF alone. When the HWD analysis is performed using only females and in the presence of small sample, the Exact test can be advantageous [56,57].
Beyond the X chromosome, joint and separate analyses of sdMAF and HWD have also been performed for autosomal variants [60]. Interestingly, an application to autosomal SNPs using the 104 individuals from the JPT sample of the 1000 Genomes Project showed that, in the presence of sdMAF, true HWD may go unnoticed if applying the standard HWD testing using both males and females.
How to solve the visualization of Manhattan/Miami plots for the X chromosome stratified by sex, while also highlighting the four regions, is a worthy challenge for bioinformaticians. Additionally, as both the association and sdMAF analyses differ between regions, it is crucial to assign variants at the NPR-PAR boundaries to the correct region.
Since the 2013 call for X chromosome-inclusive GWAS [1] several methods have been developed, all focused on robustifying the association analysis against the well-known phenomenon of X-inactivation uncertainty [2,3,5–12]. However, the sex differences in minor allele frequencies phenomenon documented here could affect the validity of existing X-inactivation-aware association methods against spurious sdMAF due to, for example, genotyping error. As sex is typically included in X chromosomal association analysis, it is reasonable to assume that sdMAF is accounted for through the inclusion of sex as covariate. However, further research is needed particularly for studies of traits displaying significant sexual dimorphism. For those traits, there could be a sex-ratio difference between the cases and controls, and this combined with sdMAF could have implications for association studies of the X chromosome. Finally, sdMAF at the NPR-PAR boundaries is likely a biological phenomenon. Thus, how to leverage true sdMAF to increase association power is also an open research question.
Materials and methods
The X chromosome phase 3 data of the 1000 Genomes Project
The 1000 Genomes Project generated an integrated call set of variants for phase 3 (release 5) data based on four data types: Illumina 2.5M genotyping array, Affymetrix SNP6.0, high-coverage whole exome sequence (WES), and low coverage whole genome sequence (WGS) [61]. Because only a minority of variant positions are covered by the first three, >93% of the data come from the WGS, which is generally of low quality due to its low-coverage nature (average depth of 7.4 on the autosomes).
The phase 3 data of the 1000 Genomes Project identified 3,468,093 variants on the X chromosome [61], with an average depth of 6.2. Assuming equal proportions of males and females, the sex-specific average depth is 7.0 for females and 3.5 for males.
Focusing on bi-allelic common SNPs presumed to be of high quality, we first removed variants that had >2 alleles, as well as indels. We then analysed all the bi-allelic SNPs from the whole of the X chromosome, including the pseudo-autosomal region 1, PAR2 and PAR3 [19], in addition to the non-pseudo-autosomal region. X chromosomal locations for these four regions, NPR, PAR1, PAR2, and PAR3, were obtained from The Genome Reference Consortium, based on GRCh37.
SNPs were included based on global minor allele frequency ≥5%, because they have been shown to have the highest quality in this data [61]. We defined the minor allele based on the overall sample of the 1000 Genomes Project, with males providing a single allele count in the non-PAR regions for the MAF calculation. The definition of a minor allele has no necessary relationship to the reference allele in GRCh37, and the minor allele defined in the sex- and population-pooled sample may not be minor in sex- or population-stratified samples. The counts of variants by region, MAF threshold and those excluded are provided in S1 Table; the analytic pipeline and flow-chart of the analyses are provided in Fig 1; Supplementary codes for all the analytical steps are also provided (Web Resource).
Testing for X chromosomal variants with sex difference in MAF (sdMAF)
Sex difference in MAF is an indication of either potential genotyping error or a biological phenomenon particularly at the NPR-PAR boundaries. Although sdMAF analyses could be performed separately for each of the five super-populations, or each of the 26 populations, this is not a powerful approach. As the sex ratio is similar across the populations (S1 Fig), we tested sdMAF using the whole sample of the 1000 Genomes Project. This more powerful approach can detect sex difference in MAF if it is present in any of the sub-samples. If there are no sdMAFs in any of the sub-samples, the test remains valid (i.e. accurate control of false positives) even if the MAFs differ drastically between the sub-populations. This is because the sdMAF test detects the difference in MAF between females and males, not the difference in MAF between populations.
1.1. sdMAF test
For each bi-allelic SNP in the NPR and PAR3 regions of the X chromosome, notations in S4A Table denote sex-stratified genotype counts, where, without loss of generality, allele A represents the minor allele defined in the sex-pooled whole sample.
To identify variants in the NPR and PAR3 regions with sex difference in MAF, we used the following conservative test statistic,
(1) |
where the numerator, , contrasts the frequency estimates of allele A between the female and male groups, and the denominator is the estimate of the variance of while allowing for Hardy-Weinberg disequilibrium (HWD) in females via . Specifically,
(2) |
(3) |
where
(4) |
is the estimate of HWD present in females [62]. Under the null of no sdMAF, H0, the test statistic TX is asymptotically distributed, as it is a straightforward application of the classic two-sample comparison with the consideration of HWD. Note that the H0 of interest here refers to no sex difference in MAF, while allowing for population difference in MAF.
When applied to a sample consisting of individuals from multiple populations (e.g. the whole sample of the 1000 Genomes Project), the sdMAF test based on TX is conservative because the denominator in Eq (1) contains a population-pooled HWD estimate. Note that even if each of the five super-populations or 26 populations is in Hardy-Weinberg equilibrium (HWE), the combined population may not be in HWE unless the MAFs are the same across all populations [63]. In addition, we show in the S1 Note that the bias factor for the population-pooled HWD estimate is always greater or equal to zero, resulting in TX being (slightly) conservative in testing for sdMAF.
We note that using a conservative sdMAF test is not an issue for the purpose of this study, because for SNPs declared to have significant sdMAF we are then more confident about the conclusion. Likewise, we used the genome-wide significance level of p-value <5e-8 [64] to declare sdMAF significance, which is conservative for our X chromosome-focused analysis.
S4B Table shows the genotype counts for each bi-allelic SNP in the PAR1 and PAR2 regions, for which a male has three genotypes as for an autosomal SNP. In that case, we used the following test statistic to test for sdMAF,
(5) |
where all notations with subscripts f are the same as in Eqs (3) and (4), while for males,
(6) |
where
(7) |
We note that, in finite sample, the Wald’s test shown in Eq (5) is more conservative than the Score test, where the sex-stratified MAF and HWD estimates in the denominator of [5] would be replaced with sex-pooled estimates; asymptotically the two tests are equivalent.
Super-population specific analysis and meta-analysis
As a second complimentary analysis we also performed super-population specific analysis, as well as meta-analysis of the super-populations. The meta-analysis test statistic is sample size based [65], which takes the weighted average of Z-scores with the square root of sample size as weights. We limited the analysis to bi-allelic SNPs with global MAF ≥5% in the combined sample, and also polymorphic in each of the five superpopulations to have a valid and comparable meta-analysis statistics across SNPs. Due to this constraint, there is a reduction in the number of SNPs analyzed in meta-analysis compared to the primary analysis.
In each super-population, a SNP’s alleles can be sex-specific, for example where one allele is present only in females, and the other is present only in males (MAF would be > 5%). In that case, the sdMAF test statistic would be infinity due to the variance being zero in the super-population sdMAF analysis, which leads to meta-analysis p-value being zero. In those cases, the p-values are plotted as 47,982.36 on the–log10 scale, the smallest p-value observed in the mega-analysis.
In-depth analyses of eight X chromosomal SNPs with genome-wide significant sdMAF in the phase 3 data
To better understand the patterns of sdMAF, we selected eight SNPs for additional analyses, two from each of the four regions (NPR and PAR1-3) with the smallest sdMAF p-values in phase 3 data of the 1000 Genomes Project. For each SNP, we calculated population-specific and sex-stratified allele frequency estimates; for a NPR or PAR3 SNP, each male only contributed a single allele count to the allele frequency calculation.
For the four SNPs in the NPR and PAR3 regions, we also performed population-stratified female-only HWE testing for each of the five super-populations, using the standard autosomal method as only females were analyzed here. That is, we used in Eq (4) to estimate HWD and the following Pearson’s chi-square test statistic to test for HWD in females,
(8) |
where under the null of HWE, THWE,female is asymptotically distributed.
We first note that earlier work [62], has shown that the Pearson’s chi-square-based HWD test can be reformulated as a test based on the HWD estimate, and more recent work [66] has shown that the same HWD test can be derived from a robust allele-based reverse regression. That is, THWE,female in Eq (7) can be rewritten equivalently as
(9) |
We also note that, although the expressions for and in Eq (8) are the same as those in Eqs (3) and (4), here and are calculated separately for each of the five super-populations, as HWD testing in the whole sample using the naïve population-pooled HWD estimate is not valid.
For the four SNPs in the PAR1 and PAR2 regions, HWD estimation and testing were also performed in males, as well as jointly with females using sex-pooled estimates. Briefly, we use in Eq (6) to estimate HWD and use
(10) |
to test for HWD in males, separately for each of the five super-populations. For the sex-combined analysis,
(11) |
where
(12) |
using the notations in S4B Table for a bi-allelic SNP in the PAR1 and PAR2 regions.
To make the HWD test robust to low genotype counts we additionally performed a HWD exact test [25–27] for SNPs in each of the five super-populations, which accommodates both females and males for SNPs from the PAR regions.
Analyses of autosomal 1, 7 and 22 phase 3 data for benchmarking
To compare results of variants with significant sex difference in MAF between the X chromosome and autosomes, chromosomes 1, 7, and 22 were first selected to represent the longest, similar size to the X chromosome, and one of the shortest autosomes. Biallelic and common (sex- and population-pooled MAF≥5%) SNPs were then selected for sdMAF analysis, using TA, the sdMAF test statistic shown in Eq (5). When HWE evaluation was warranted, THWE,female, THWE,male and THWE,sex-combined, shown respectively in Eqs (8), (9) and (10), were applied to the phase 3 data of the 1000 Genomes Project.
Sliding window approach
To better define regions with sdMAF we performed a simple sliding window analysis using a window size of 50 consecutive SNPs, shifted by 25 SNPs each time, and calculated the average of the -log10 p value.
The X chromosome high-coverage sequence data of the 1000 Genomes Project
To validate the results from the phase 3 data (GRCh37) of the 1000 Genomes Project, we repeated the sdMAF analyses using the recently released high coverage (GRCh38) whole genome sequence data [67].
Specifically, 50, 10, 20, and 50 SNPs, respectively from NPR, PAR1, PAR2, and PAR3, with the smallest sdMAF p-values in the phase 3 data were first selected. Among these SNPs, 4, 10, 10, and 9 SNPs, respectively from NPR, PAR1, PAR2 and PAR3, were successfully lifted-over, bi-allelic in the high coverage data, had no missingness in both sets of data. For each of the 33 SNPs, the direction and magnitude of sdMAF were examined, separately for the phase 3 and high coverage data. Genotype agreements between the two sets of data within an individual, separately by sex, were also generated.
Finally, without the liftover constraint between the two phases of the 1000 Genomes Project, we performed an X-chromosome wide sdMAF analysis for the high coverage data using the same sdMAF methods described earlier for the phase 3 data. Supplementary codes for all the analytical steps are also provided (Web Resource).
gnomAD v.3.1.2
To examine sdMAF in larger samples with high coverage whole genome sequence we used the genotype and allele counts from the genomes chr X sites VCF file from the non-Finnish European population from gnomAD v 3.1.2 [55] which has a maximum of 19,686 females and 14,343 males (Web Resources). To limit analyses to high quality variants we required them to have both MAF≥0.05 and also allele numbers >50,000 (i.e. >93% call rate in NPR). Similar analysis was performed for African/African American population with the same MAF threshold, where there are a maximum of 11,086 females and 9,658 males, and this was restricted to variants with allele numbers >30,000 (i.e. >94.2% call rate in NPR).
Web resources
The Genome Reference Consortium: https://www.ncbi.nlm.nih.gov/grc/human
The 1000 Genomes Project: https://www.internationalgenome.org
Phase 3 data of the 1000 Genomes Project: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ and the specific vcf file used:
High coverage phased data of the 1000 Genomes Project: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/CCDG_14151_B01_GRM_WGS_2020-08-05_chrX.filtered.eagle2-phased.v2.vcf.gz
gnomAD v.3.1.2 allele counts from genome sequence:
gnomAD v.3.1.2: https://gnomad.broadinstitute.org/downloads
chrX sites VCF: 95.6 GiB, MD5: 040080a18046533728fa60800eedcf4b
gnomAD v.2.1 structural variants: (https://gnomad.broadinstitute.org/gene/ENSG00000124343?dataset=gnomad_sv_r2_1)
Supporting information
Data Availability
All genotype data files are available from: Phase 3 data of the 1000 Genomes Project: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ and the specific vcf file used: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz High coverage data of the 1000 Genomes Project: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/CCDG_14151_B01_GRM_WGS_2020-08-05_chrX.filtered.eagle2-phased.v2.vcf.gz gnomAD v.3.1.2 allele counts from genome sequence: https://gnomad.broadinstitute.org/downloads chrX sites VCF: 95.6 GiB, MD5: 040080a18046533728fa60800eedcf4b gnomAD v.2.1 structural variants: (https://gnomad.broadinstitute.org/gene/ENSG00000124343?dataset=gnomad_sv_r2_1) code is here: https://github.com/ZhongWang99.
Funding Statement
LS received the awards: This research is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC, RGPIN-04934) and the Canadian Institutes of Health Research (CIHR, PJT-180460). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1.Wise AL, Gyi L, Manolio TA. eXclusion: toward integrating the X chromosome in genome-wide association analyses. Am J Hum Genet. 2013;92(5):643–7. doi: 10.1016/j.ajhg.2013.03.017 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Konig IR, Loley C, Erdmann J, Ziegler A. How to include chromosome X in your genome-wide association study. Genet Epidemiol. 2014;38(2):97–103. doi: 10.1002/gepi.21782 [DOI] [PubMed] [Google Scholar]
- 3.Loley C, Ziegler A, König IR. Association tests for X-chromosomal markers—a comparison of different test statistics. Hum Hered. 2011;71(1):23–36. doi: 10.1159/000323768 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Hickey PF, Bahlo M. X chromosome association testing in genome wide association studies. Genet Epidemiol. 2011;35(7):664–70. doi: 10.1002/gepi.20616 [DOI] [PubMed] [Google Scholar]
- 5.Clayton D. Testing for association on the X chromosome. Biostatistics. 2008;9(4):593–600. doi: 10.1093/biostatistics/kxn007 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Chen B, Craiu RV, Strug LJ, Sun L. The X factor: A robust and powerful approach to X-chromosome-inclusive whole-genome association studies. Genet Epidemiol. 2021;45(7):694–709. doi: 10.1002/gepi.22422 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Özbek U, Lin HM, Lin Y, Weeks DE, Chen W, Shaffer JR, et al. Statistics for X-chromosome associations. Genet Epidemiol. 2018;42(6):539–50. doi: 10.1002/gepi.22132 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Wang J, Yu R, Shete S. X-chromosome genetic association test accounting for X-inactivation, skewed X-inactivation, and escape from X-inactivation. Genet Epidemiol. 2014;38(6):483–93. doi: 10.1002/gepi.21814 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Wang J, Talluri R, Shete S. Selection of X-chromosome Inactivation Model. Cancer Inform. 2017;16:1176935117747272. doi: 10.1177/1176935117747272 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Gao F, Chang D, Biddanda A, Ma L, Guo Y, Zhou Z, et al. XWAS: A Software Toolset for Genetic Data Analysis and Association Studies of the X Chromosome. J Hered. 2015;106(5):666–71. doi: 10.1093/jhered/esv059 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Chen B, Craiu RV, Sun L. Bayesian model averaging for the X-chromosome inactivation dilemma in genetic association study. Biostatistics. 2020;21(2):319–35. doi: 10.1093/biostatistics/kxy049 [DOI] [PubMed] [Google Scholar]
- 12.Chen Z, Ng HK, Li J, Liu Q, Huang H. Detecting associated single-nucleotide polymorphisms on the X chromosome in case control genome-wide association studies. Stat Methods Med Res. 2017;26(2):567–82. doi: 10.1177/0962280214551815 [DOI] [PubMed] [Google Scholar]
- 13.Tukiainen T, Villani AC, Yen A, Rivas MA, Marshall JL, Satija R, et al. Landscape of X chromosome inactivation across human tissues. Nature. 2017;550(7675):244–8. doi: 10.1038/nature24265 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Graves JA, Wakefield MJ, Toder R. The origin and evolution of the pseudoautosomal regions of human sex chromosomes. Hum Mol Genet. 1998;7(13):1991–6. doi: 10.1093/hmg/7.13.1991 [DOI] [PubMed] [Google Scholar]
- 15.Helena Mangs A, Morris BJ. The Human Pseudoautosomal Region (PAR): Origin, Function and Future. Curr Genomics. 2007;8(2):129–36. doi: 10.2174/138920207780368141 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Flaquer A, Rappold GA, Wienker TF, Fischer C. The human pseudoautosomal regions: a review for genetic epidemiologists. Eur J Hum Genet. 2008;16(7):771–9. doi: 10.1038/ejhg.2008.63 [DOI] [PubMed] [Google Scholar]
- 17.Flaquer A, Fischer C, Wienker TF. A new sex-specific genetic map of the human pseudoautosomal regions (PAR1 and PAR2). Hum Hered. 2009;68(3):192–200. doi: 10.1159/000224639 [DOI] [PubMed] [Google Scholar]
- 18.Dupuis J, Van Eerdewegh P. Multipoint linkage analysis of the pseudoautosomal regions, using affected sibling pairs. Am J Hum Genet. 2000;67(2):462–75. doi: 10.1086/303008 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Veerappa AM, Padakannaya P, Ramachandra NB. Copy number variation-based polymorphism in a new pseudoautosomal region 3 (PAR3) of a human X-chromosome-transposed region (XTR) in the Y chromosome. Funct Integr Genomics. 2013;13(3):285–93. doi: 10.1007/s10142-013-0323-6 [DOI] [PubMed] [Google Scholar]
- 20.Tilford CA, Kuroda-Kawaguchi T, Skaletsky H, Rozen S, Brown LG, Rosenberg M, et al. A physical map of the human Y chromosome. Nature. 2001;409(6822):943–5. doi: 10.1038/35057170 [DOI] [PubMed] [Google Scholar]
- 21.Mumm S, Molini B, Terrell J, Srivastava A, Schlessinger D. Evolutionary features of the 4-Mb Xq21.3 XY homology region revealed by a map at 60-kb resolution. Genome Res. 1997;7(4):307–14. doi: 10.1101/gr.7.4.307 [DOI] [PubMed] [Google Scholar]
- 22.Schwartz A, Chan DC, Brown LG, Alagappan R, Pettay D, Disteche C, et al. Reconstructing hominid Y evolution: X-homologous block, created by X-Y transposition, was disrupted by Yp inversion through LINE-LINE recombination. Hum Mol Genet. 1998;7(1):1–11. doi: 10.1093/hmg/7.1.1 [DOI] [PubMed] [Google Scholar]
- 23.Monteiro B, Arenas M, Prata MJ, Amorim A. Evolutionary dynamics of the human pseudoautosomal regions. PLoS Genet. 2021;17(4):e1009532. doi: 10.1371/journal.pgen.1009532 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Altman D, JM B. Measurement in medicine: the analysis of method comparison studies. The Statistician. 1983;32(3):307–17. [Google Scholar]
- 25.Wigginton JE, Cutler DJ, Abecasis GR. A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet. 2005;76(5):887–93. doi: 10.1086/429864 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Graffelman J, Moreno V. The mid p-value in exact tests for Hardy-Weinberg equilibrium. Stat Appl Genet Mol Biol. 2013;12(4):433–48. doi: 10.1515/sagmb-2012-0039 [DOI] [PubMed] [Google Scholar]
- 27.Crow JF, Kimura M. An introduction in Population Genetics Theory. New York: Harper and Row; 1970. [Google Scholar]
- 28.Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10. doi: 10.1016/S0022-2836(05)80360-2 [DOI] [PubMed] [Google Scholar]
- 29.Fredman D, White SJ, Potter S, Eichler EE, Den Dunnen JT, Brookes AJ. Complex SNP-related sequence variation in segmental genome duplications. Nat Genet. 2004;36(8):861–6. doi: 10.1038/ng1401 [DOI] [PubMed] [Google Scholar]
- 30.Boughton AP, Welch RP, Flickinger M, VandeHaar P, Taliun D, Abecasis GR, et al. LocusZoom.js: Interactive and embeddable visualization of genetic association study results. Bioinformatics. 2021. doi: 10.1093/bioinformatics/btab186 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Das S, Abecasis GR, Browning BL. Genotype Imputation from Large Reference Panels. Annu Rev Genomics Hum Genet. 2018;19:73–96. doi: 10.1146/annurev-genom-083117-021602 [DOI] [PubMed] [Google Scholar]
- 32.Flynn E, Tanigawa Y, Rodriguez F, Altman RB, Sinnott-Armstrong N, Rivas MA. Sex-specific genetic effects across biomarkers. Eur J Hum Genet. 2021;29(1):154–63. doi: 10.1038/s41431-020-00712-w [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Miga KH, Koren S, Rhie A, Vollger MR, Gershman A, Bzikadze A, et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature. 2020;585(7823):79–84. doi: 10.1038/s41586-020-2547-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Nurk S, Koren S, Rhie A, Bzikadze A, Mikheenko A, Vollger MR, et al. The complete sequence of a human genome. https://doiorg/101101/20210526445798. 2021. [Google Scholar]
- 35.MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 2017;45(D1):D896–D901. doi: 10.1093/nar/gkw1133 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Justice CM, Cuellar A, Bala K, Sabourin JA, Cunningham ML, Crawford K, et al. A genome-wide association study implicates the BMP7 locus as a risk factor for nonsyndromic metopic craniosynostosis. Hum Genet. 2020;139(8):1077–90. doi: 10.1007/s00439-020-02157-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Hoffmann TJ, Theusch E, Haldar T, Ranatunga DK, Jorgenson E, Medina MW, et al. A large electronic-health-record-based genome-wide study of serum lipids. Nat Genet. 2018;50(3):401–13. doi: 10.1038/s41588-018-0064-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Hoffmann TJ, Choquet H, Yin J, Banda Y, Kvale MN, Glymour M, et al. A Large Multiethnic Genome-Wide Association Study of Adult Body Mass Index Identifies Novel Loci. Genetics. 2018;210(2):499–515. doi: 10.1534/genetics.118.301479 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Hu Y, Stilp AM, McHugh CP, Rao S, Jain D, Zheng X, et al. Whole-genome sequencing association analysis of quantitative red blood cell phenotypes: The NHLBI TOPMed program. Am J Hum Genet. 2021;108(6):1165. doi: 10.1016/j.ajhg.2021.04.015 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Carrasquillo MM, Zou F, Pankratz VS, Wilcox SL, Ma L, Walker LP, et al. Genetic variation in PCDH11X is associated with susceptibility to late-onset Alzheimer’s disease. Nat Genet. 2009;41(2):192–8. doi: 10.1038/ng.305 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Park SL, Carmella SG, Chen M, Patel Y, Stram DO, Haiman CA, et al. Mercapturic Acids Derived from the Toxicants Acrolein and Crotonaldehyde in the Urine of Cigarette Smokers from Five Ethnic Groups with Differing Risks for Lung Cancer. PLoS One. 2015;10(6):e0124841. doi: 10.1371/journal.pone.0124841 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Lyons PA, Rayner TF, Trivedi S, Holle JU, Watts RA, Jayne DR, et al. Genetically distinct subsets within ANCA-associated vasculitis. N Engl J Med. 2012;367(3):214–23. doi: 10.1056/NEJMoa1108735 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Tedja MS, Wojciechowski R, Hysi PG, Eriksson N, Furlotte NA, Verhoeven VJM, et al. Genome-wide association meta-analysis highlights light-induced signaling as a driver for refractive error. Nat Genet. 2018;50(6):834–48. doi: 10.1038/s41588-018-0127-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Rhee EP, Ho JE, Chen MH, Shen D, Cheng S, Larson MG, et al. A genome-wide association study of the human metabolome in a community-based cohort. Cell Metab. 2013;18(1):130–43. doi: 10.1016/j.cmet.2013.06.013 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Schurz H, Kinnear CJ, Gignoux C, Wojcik G, van Helden PD, Tromp G, et al. A Sex-Stratified Genome-Wide Association Study of Tuberculosis Using a Multi-Ethnic Genotyping Array. Front Genet. 2018;9:678. doi: 10.3389/fgene.2018.00678 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Vuckovic D, Bao EL, Akbari P, Lareau CA, Mousas A, Jiang T, et al. The Polygenic and Monogenic Basis of Blood Traits and Diseases. Cell. 2020;182(5):1214-31.e11. doi: 10.1016/j.cell.2020.08.008 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Mikhaylova AV, McHugh CP, Polfus LM, Raffield LM, Boorgula MP, Blackwell TW, et al. Whole-genome sequencing in diverse subjects identifies genetic correlates of leukocyte traits: The NHLBI TOPMed program. Am J Hum Genet. 2021;108(10):1836–51. doi: 10.1016/j.ajhg.2021.08.007 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Möller M, Lee YQ, Vidovic K, Kjellström S, Björkman L, Storry JR, et al. Disruption of a GATA1-binding motif upstream of. Blood. 2018;132(3):334–8. doi: 10.1182/blood-2018-03-842542 [DOI] [PubMed] [Google Scholar]
- 49.Yeh CC, Chang CJ, Twu YC, Chu CC, Liu BS, Huang JT, et al. The molecular genetic background leading to the formation of the human erythroid-specific Xg. Blood Adv. 2018;2(15):1854–64. doi: 10.1182/bloodadvances.2018018879 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Lane WJ, Aguad M, Smeland-Wagman R, Vege S, Mah HH, Joseph A, et al. A whole genome approach for discovering the genetic basis of blood group antigens: independent confirmation for P1 and Xg. Transfusion. 2019;59(3):908–15. doi: 10.1111/trf.15089 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Lee YQ, Storry JR, Karamatic Crew V, Halverson GR, Thornton N, Olsson ML. A large deletion spanning XG and GYG2 constitutes a genetic basis of the Xg. Transfusion. 2019;59(5):1843–9. doi: 10.1111/trf.15242 [DOI] [PubMed] [Google Scholar]
- 52.Mensah MA, Hestand MS, Larmuseau MH, Isrie M, Vanderheyden N, Declercq M, et al. Pseudoautosomal region 1 length polymorphism in the human population. PLoS Genet. 2014;10(11):e1004578. doi: 10.1371/journal.pgen.1004578 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Collins RL, Brand H, Karczewski KJ, Zhao X, Alföldi J, Francioli LC, et al. A structural variation reference for medical and population genetics. Nature. 2020;581(7809):444–51. doi: 10.1038/s41586-020-2287-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Byrska-Bishop Marta, Evani Uday S., Zhao Xuefang, Basile Anna O., Abel Haley J., Regier Allison A., et al. High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. biorxiv. 2021; 10.1101/2021.02.06.430068. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581(7809):434–43. doi: 10.1038/s41586-020-2308-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Graffelman J, Jain D, Weir B. A genome-wide study of Hardy-Weinberg equilibrium with next generation sequence data. Hum Genet. 2017;136(6):727–41. doi: 10.1007/s00439-017-1786-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Graffelman J, Weir BS. Testing for Hardy-Weinberg equilibrium at biallelic genetic markers on the X chromosome. Heredity (Edinb). 2016;116(6):558–68. doi: 10.1038/hdy.2016.20 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Sun L, Zhang L, Wang X, Paterson A. A novel regression-based method for X-chromosome-inclusive Hardy-Weinberg equilibrium test. ASHG annual meeting. 2021:3161. [Google Scholar]
- 59.Graffelman J. Exploring Diallelic Genetic Markers: The HardyWeinberg Package. Journal of Statistical Science Software. 2015;64(3). [Google Scholar]
- 60.Graffelman J, Weir BS. On the testing of Hardy-Weinberg proportions and equality of allele frequencies in males and females at biallelic genetic markers. Genet Epidemiol. 2018;42(1):34–48. doi: 10.1002/gepi.22079 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. doi: 10.1038/nature15393 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Weir BS. Genetic data analysis II: methods for discrete population genetic data. Sunderland, Mass.: Sinauer Associates; 1996. xii, 445 p. p. [Google Scholar]
- 63.Lange K. Mathematical and statistical methods for genetic analysis. New York: Springer; 1997. xii, 265 p. p. [Google Scholar]
- 64.Dudbridge F, Gusnanto A. Estimation of significance thresholds for genomewide association scans. Genet Epidemiol. 2008;32(3):227–34. doi: 10.1002/gepi.20297 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Willer CJ, Li Y, Abecasis GR. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics. 2010;26(17):2190–1. doi: 10.1093/bioinformatics/btq340 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Zhang L, Sun L. A generalized robust allele-based genetic association test. Biometrics. 2021; 10.1111/biom.13456 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Byrska-Bishop Marta, Evani Uday S., Zhao Xuefang, Basile Anna O., Abel Haley J., Regier Allison A., et al. High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. biorxiv https://doiorg/101101/20210206430068. 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]