Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2011 Mar 1.
Published in final edited form as: J Hum Genet. 2010 Jun 24;55(9):613–620. doi: 10.1038/jhg.2010.77

Examination of disease based selection, demographic history and population structure in European Y chromosome haplogroup I

Efe Sezgin 1,*, Alyssa Drosdak 1, Carl McIntosh 1, Bailey Kessing 4, James Lautenberger 1, James J Goedert 2, John P Phair 3, Jennifer L Troyer 4, Michael W Smith 5, Stephen J O’Brien 1
PMCID: PMC2945452  NIHMSID: NIHMS210305  PMID: 20574427

Abstract

We attempted to refine the understanding of an association of Y-chromosomal haplogroup I (hg-I) with enhanced AIDS progression that had been previously reported. First, we compared the progression phenotype between hg-I and its phylogenetically closest haplogroup J (hg-J). Then, we took a candidate gene approach resequencing DDX3Y, a crucial autoimmunity gene, in hg-I and other common European Y- chromosome haplogroups looking for functional variants. We extended the genetic analyses to CD24L4 and compared and contrasted the roles of disease based selection, demographic history, and population structure shaping the contemporary genetic landscape of hg-I chromosomes. Our results confirmed and refined the AIDS progression signal to hg-I, though no gene variant was identified that can explain the disease association. Molecular evolutionary and genetic analyses of the examined loci suggested a unique evolutionary history in hg-I, probably shaped by complex interactions of selection, demographic history, and high geographical differentiation leading to the formation of distinct hg-I subhaplogroups that today are associated with HIV/AIDS onset. Clearly, further studies on Y chromosome candidate loci sequencing to discover functional variants and discern the roles of evolutionary factors are warranted.

Keywords: AIDS progression, CD24L4, DDX3Y, population growth, population structure, selection, Y chromosome

Introduction

The importance of Y chromosome in human health is becoming more evident with the accumulation of data from disease association studies. The non-recombining region of the human Y chromosome (NRY) makes up 95% of the total Y chromosome and contains over 78 genes involved in a range of developmental and physiological processes such as sex determination, spermatogenesis, stature formation, enamel formation, immunological responses, and behavioral traits.1,2 Several slowly mutating bi-allelic polymorphisms on NRY can be used to group human Y chromosomes into monophyletic lineages called Y chromosome haplogroups that have been essential markers for investigating human population histories.1 Mutations on the Y chromosome ranging from a single nucleotide polymorphisms (SNPs) to large deletions and inversions have been linked to several diseases such as infertility and cancers.3 Recently, our group showed faster acquired immune deficiency syndrome (AIDS) progression associated with a European-specific Y chromosome haplogroup, the haplogroup I (hg-I) in European American patients.4 However, the same features of the Y chromosome that makes it an excellent marker for population structure/history make it rather difficult to identify causative variants in the Y chromosome associated with disease.

The lack of recombination in the NRY makes mapping of a locus/loci responsible for disease, such as the AIDS progression signal, nearly impossible with traditional approaches, since all genes can be positional candidate genes. In addition, the Y chromosome contains large sections of low complexity, repetitive sequences, making re-sequencing efforts to detect causative alleles challenging. Therefore, to further investigate the human Y chromosome in association studies, alternative evolutionary and population genetic analyses have been proposed to determine potential disease associations of specific Y chromosome haplogroups rather than specific Y chromosome loci.5

As for any fitness characteristics, Y chromosome specific phenotypic variations can be subject to selection. For example, infertility will be subjected to strong purifying selection, while weaker selection may be acting on more subtle phenotypic differences, such as the immune response, over a period of time under changing environmental conditions. These selective pressures can leave DNA polymorphism signatures detectable by population genetic analyses. However, demographic history and population structure will also effect the observed DNA sequence variation and must be considered, so that they do not confound potential disease based selection signatures.

In this study, we aimed to refine our understanding of AIDS susceptibility associated with hg-I. We first analyzed AIDS progression differences between hg-I subhaplogroups and the phylogenetically closest haplogroup, hg-J. Previously, we reported faster progression to AIDS outcomes in hg-I compared to the most common European haplogroup R, but did not analyze hg-J. As hg-I and hg-J are sister clades with similar time-depths, and likely to have relatively closely related sequence organisations, it is interesting to see if the disease susceptibility is unique to hg-I or common among the hg-IJ clades. Once we clearly identified AIDS susceptibility to be hg-I specific, we adopted a candidate gene approach and resequenced a single copy, X-degenerate, non-recombining Y chromosome locus, DDX3Y, a crucial gene in human autoimmunity, to identify potential functional variants that may influence immune response. The X chromosome paralog of DDX3Y, DDX3X has been shown to be involved in HIV-1 replication and antiviral activity.6,7 We also focused on detecting signals of potential selective pressure on the hg-I Y chromosomes that may indicate that the natural history of this haplogroup was influenced by infectious disease susceptibility/resistance. In addition, we examined the potential effects of historical population size changes and contemporary population structure on the observed distribution of DNA sequence polymorphism on the Y chromosome. For population genetics and molecular evolutionary analyses, besides DDX3Y, we chose another single copy, non-recombining Y chromosome locus with a distinct evolutionary history, CD24L4, an intronless pseudogene of a functional autosomal CD24 gene (on chromosome 6) that retrotransposed after the human chimpanzee lineage split.8 We expect fewer evolutionary constraints on pseudogenes. Therefore, the selective forces on CD24L4 and DDX3Y are expected to be different, acting as an internal control to allow for differentiation of selective and demographic inferences in our samples.

Although we did not identify a causal variant that can explain AIDS association in hg-I, based on interpreting AIDS survival associations in the context of population genetic analyses, we detected evidence for historical purifying selection, population expansion and sub-structure that have shaped the observed genetic diversity in hg-I Y chromosomes.

Materials and methods

Study subjects and assignment of Y chromosomal haplogroups

The origin of individual samples, DNA extraction and assignment to specific Y chromosome haplogroups were described in Sezgin et al.4 The Y-chromosome haplotype of each male was determined by typing ten single nucleotide polymorphisms (M60, M216, M203, M96, M89, M69, M170, M9, M45 and M207) located on the non-recombining region of the Y chromosome (NRY) using the 5′ nuclease assay (TaqMan Assay-by-Design SNP genotyping products, Applied Biosystems). The samples that were combined to form the juxtaposed F* (xI, H, K) group in the previous study were further genotyped in this study for P130 (rs17250887), P124(rs17315772), and P209 (rs17315835) and assigned to respective J and I haplogroups following the Y chromosome consortium (YCC 2008) nomenclature.9 The hg-I samples were genotyped for five additional markers, M253, M227, P37, M26, and M223 to define the I1*, I1b*, I2a*, I2a2* and I2b* subhaplogroups (the corresponding YCC 2002 names are I1a, I1a4, I1b, I1b2, and I1C, respectively). The remaining hg-I samples that lacked the derived alleles at these five markers were assigned into the I* paragroup (Figure 1).

Figure 1.

Figure 1

Presented is a Y chromosome haplogroup tree with the indicated haplogroup defining markers following the Y chromosome consortium 2008 nomenclature. Indicated below is number of individuals (n) genotyped for the –A662G allele that was discovered upon sequencing of CD24L4 region. The AIDS susceptible haplogroup I samples were further divided into the respective subhaplogroups.

The CD24L4 and DDX3Y sequencing panels consisted of 90 European Americans: 10 samples from each of the haplogroups E, J, K*, R and subhaplogroups I1*, I2a*, I2a2*, I2b*, and I*. Further genotyping of the CD24L4 –A662G polymorphism included additional 484 samples: 431 haplogroup I; 10 samples from each of the haplogroups R, P(xR), J, E, and B; and 1 sample from each of the C, D, and H haplogroups (Figure 1). This study was approved by the Protocol Review Office of the Institutional Review Board of the National Cancer Institute. Informed consent was obtained from all individuals.

AIDS progression analyses

Progression to four AIDS-related outcomes (time to CD4+ T lymphocyte count of fewer than 200 per cubic millimeter (CD4 < 200), AIDS-1993 definition, AIDS-1987 definition, and death from AIDS-defining illnesses prior to HAART use was compared in J and hg-I subhaplogroup seroconverter (date of HIV-1 infection known) subjects using Cox proportional hazard model and Kaplan-Meier analyses as described in Sezgin et al.4 The Cox models were adjusted for the known autosomal AIDS restriction/susceptibility genes (CCR5-Δ32, CCR2-64I, CCR5-P1/P1, HLA-B*27, HLA-B*57, HLA-B*35, HLA-B*35Px, KIR3DS1, TSG101, SDF, RANTES, and HLA class I heterozygosity) using a genetic propensity index.10 Statistical analyses were performed with SAS 9.1 (SAS Institute, Cary, North Carolina). P values reported were nominal and uncorrected for multiple comparisons.

Autosomal marker based population stratification/substructure discovery

700K autosomal (Single Nucleotide Polymorphism) SNP data from an Affymetrix 6.0 Genechip platform with call rates ≥ 95% was available (Troyer et al. in preparation) to perform a principal components analysis (PCA) via EIGENSOFT package.11 The distribution of samples belonging to haplogroups I and J was compared along the top two eigenvectors via an ANOVA F statistic. Moreover, the eigenvectors were used in AIDS association Cox models as covariates to adjust for potential autosomal stratification.

DNA sequencing and polymorphism screening

A DNA fragment of 6795 base pairs (bp) covering 3500 bp upstream of putative CD24L4 mRNA, 2195 bp putative CD24L4 mRNA, and 1100 bp downstream regions was selected according to NCBI Reference Sequence (RefSeq) NC_000024.8 and accession number AC009235. This region was divided into six overlapping amplicons (Supplementary Figure 1) and sequenced with primer pairs listed in supplementary table 1. Newly discovered CD24L4 – A662G polymorphism was screened in a larger set of samples using the 5′ nuclease assay (TaqMan Assay-by-Design SNP genotyping products, Applied Biosystems; Supplementary Table 2). For DDX3Y (formerly known as DBY), a DNA fragment of 22613 bp covering 3570bp upstream of first exon, 15613 bp mRNA (longest transcript covering all 17 exons), and 3430 bp downstream regions was selected according to NCBI RefSeq NC_000024.8 and RefSeq mRNA NM_001122665. The DDX3Y region was divided into eleven overlapping amplicons (Supplementary Figure 1) and sequenced with primer pairs listed in supplementary table 3. A 1900 bp low complexity region between the third and fourth amplicons was not sequenced. The Repeat Masker web server (http://www.repeatmasker.org) was used to identify the repetitive elements around and within the sequenced regions. As both CD24L4 and DDX3Y have highly similar homologs in the genome (CD24 on Chromosome 6 and DDX3X on the X chromosome), female samples were used as a control that the primer pairs only amplified the Y chromosome copies of these genes. All DNA fragments were Sanger sequenced and analyzed by ABI 3730xl DNA Analyzer. Sequence fragments were assembled and aligned with computer program Sequencher, version 4.6 (GeneCodes, Ann Arbor, MI) and MEGA version 4.0.12 The nucleotide sequence data reported are available in the GenBank databases under the accession number (XXX-XXX).

The NCBI, Seattle SNPs and ENCODE web sites were searched for previously reported CD24L4 and DDX3Y mutations. Also, the 236 resequenced (Accession codes BV678971 – BV679207) PCR fragments by Repping et al.13 were aligned and examined for DDX3Y region nucleotide changes.

Summary statistics and population genetic analyses

Two estimators, one based on the number of segregating sites in a sample, θw14, and another based on average number of pair wise nucleotide differences per site between two sequences, π15 were used to summarize the nucleotide diversity statistics. To test the neutrality of the polymorphism frequency distribution and demographic equilibrium, Tajima’s D16 and Fu and Li’s D*17 tests were applied. Further inferences of population size changes were tested using Fu’s Fs,18 Ramos-Onsins and Roza’s R2,19 and mismatch distributions.20 The significance of D, D*, Fs, and R2 was inferred from the null distributions generated by 10,000 replicates of coalescent simulations without recombination and should be interpreted as the probability of obtaining values equal or lower than the observed values. Parameter estimates and statistical tests were calculated by DNAsp version 4.50.3.21 The sum of square deviations (SSD) between the observed and expected mismatch distributions,22 and Chakraborty’s test of population amalgamation23 were calculated using ARLEQUIN version 3.11.24 A potential role for sampling bias influencing the D* and D estimates was investigated by sampling hg-I subhaplogroups and constructing five, ten, eight, five, and one sample-populations made up of one (individual I subhaplogroups), two, three, four, and five (all hg-I subhaplogroups) subhaplogroups, respectively. The D* and D for each constructed sample-population were estimated as mentioned above.

Results

Autosomal and Y haplogroup population substructure/stratification in the study population

As human Y chromosome variation shows high levels of geographic differentiation, disease association studies involving the Y chromosome are particularly prone to population stratification/substructure. Moreover, in European populations the gene variants associated with susceptibility to/protection against AIDS may also be susceptible to population structure. For example, the AIDS protective CCR5-Δ32 variant shows a north south gradient in European populations, 25 a trend similar to that of hg-I distribution in Europe.26 Moreover, the geographical distributions of hg-J and hg-I are also different in Europe, where the hg-J is mostly found in southern European populations.27 In order to prevent autosomal population stratification from confounding Y chromosome based AIDS association signal, we applied a PCA approach to 700K autosomal SNP (Single Nucleotide Polymorphism) data and observed that the distribution of hg-I and hg-J samples were significantly different along the first axis (P = 0.001; Figure 2). Moreover, when only hg-I subhaplogroups were compared against each other, I* differed significantly compared to other subhaplogroups along the first axis (P = 0.02; Figure 2). Although the recovered eigen axes were informative about the genomic ancestry differences between the (sub)haplogroups, they were not significantly correlated with the AIDS progression outcome phenotypes (i.e. for AIDS-1987, Eigen 1: r2 = 0.02, P = 0.09; Eigen 2: r2 = 0.01, P = 0.15) upon a regression analysis. For completeness, we still corrected the Cox models of AIDS progression analyses for these top two most informative eigen vectors.

Figure 2.

Figure 2

Distribution of individuals belonging to (A) haplogroup I, haplogroup J and (B) haplogroup I subhaplogroups among the top principal components (PCs) based on 700K autosomal SNP data.

AIDS progression analyses

There were significant differences between the hg-I subhaplogroups and J haplogroup samples for AIDS progression, where hg-I subhaplogroups depleted CD4+ T-cells and progressed to AIDS 1993 and AIDS 1987 faster (Table 1). However, the faster AIDS progression signal (compared to the J haplogroup samples) was not significant for every individual hg-I subhaplogroup, probably due to small sample sizes resulting in lack of statistical power. When all hg-I subhaplogroups were combined (representing hg-I) and compared against haplogroup J, the significant AIDS progression differences were more evident (P ≤ 0.02 for each AIDS end point; Table 1).

Table 1.

Comparison of progression to AIDS outcomes among haplogroup J and hg-I subhaplogroup seroconverters

CD4 < 200 AIDS-1993 AIDS-1987
(Sub)haplogroupa n/events RHa (95% CI) P n/events RHa (95% CI) P n/events RHa (95% CI) P
 J 97/48 1.00 - 97/49 1.00 - 98/47 1.00 -
 I1* 55/30 1.95 (1.00–3.81) 0.05 55/30 1.83 (0.96–3.48) 0.07 55/29 1.62 (0.84–3.13) 0.15
 I2a* 12/6 1.97 (0.71–5.48) 0.19 12/6 1.74 (0.64–4.78) 0.28 12/5 1.97 (0.65–5.96) 0.22
 I2b* 23/11 2.85 (1.21–6.70) 0.02 23/11 2.56 (1.11–5.93) 0.03 24/10 2.13 (0.93–4.89) 0.07
 I* 4/1 1.70 (0.22–13.16) 0.61 4/2 3.21 (0.71–14.50) 0.36 4/3 4.96 (1.30–18.91) 0.02
 I (all) 97/48 2.07 (1.12–3.81) 0.02 97/49 2.02 (1.12–3.64) 0.02 98/47 2.02 (1.11–3.68) 0.02
Within hg-I subhaplogroups
 I1* 55/30 0.74 (0.37–1.48) 0.40 55/30 0.77 (0.39–1.49) 0.43 55/29 0.71 (0.36–1.39) 0.32
 I2a* 12/6 0.98 (0.38–2.54) 0.97 12/6 0.88 (0.34–2.27) 0.79 12/5 1.06 (0.37–3.02) 0.91
 I2b* 23/11 1.60 (0.74–3.46) 0.23 23/11 1.44 (0.67–3.10) 0.35 24/10 1.28 (0.60–2.73) 0.53
 I* 4/1 0.85 (0.09–8.27) 0.89 4/2 2.30 (0.39–13.62) 0.36 4/3 3.69 (0.52–26.21) 0.19
a

Haplogroup J is the reference group. For within hg-I subhaplogroup comparisons, the reference group is all other subhaplogroups combined against the subhaplogroup of interest. Due to small sample size (n=3) there were no events in the I2a2

*

group and similar analyses were not conducted.

In contrast, when the hg-I subhaplogroup samples were compared against each other, the Cox analyses did not show a significant AIDS progression differences between them (Table 1).

Sequencing to identify functional variants and population genetic analyses of DDX3Y

No mutation in DDX3Y corresponded with AIDS outcome. However, the number of polymorphic sites in and around this gene allowed for population genetic analyses within and between Y haplogroups. There were 14 polymorphic sites defining 13 haplotypes in the full set of 90 European American samples selected for DNA sequencing (Figure 3). No polymorphism was observed in the coding regions. Only two of the polymorphisms, -2303 deletion (-T2303D) and –A1850G were in the low complexity/repeat regions and all segregating sites mutated only once, thus the data fit the infinite sites model quite well. The deletion site was excluded from further analyses. The summary statistics of polymorphisms in the 15003 bp of noncoding region are presented in Table 2. The estimates of nucleotide diversity, negative D result and the lack of polymorphism in the coding region are in agreement with previous reports.28-30 Five of the polymorphisms T2766G, G4675A, A9577C, G10682T and C13905T correspond to the binary NRY markers M191, P170, M173, M201 and M198 respectively. The M201 marker was used to assign three of the samples to the haplogroup G.

Figure 3.

Figure 3

Distribution of DDX3Y polymorphisms among 90 individuals belonging to six Y chromosome haplogroups and five hg-I subhaplogroups (see Figure 1). The numbers indicate the position of changes. Negative base numbers are with respect to the start position of the first exon. None of the changes are in the exons. Chimpanzee (Chimp) sequence is used to show the ancestral states of the observed polymorphisms in human DDX3Y sequences.

Table 2.

Summary statistics for the 15003bp noncoding region of DDX3Y

Samples Na Sb π (%) θw (%) hdc nsd Tajima’s D Fu-Li’s D* Fu’s Fs R2 SSD Pe
All 90 14 0.010 0.018 0.60 6 −1.260 −1.781 −4.915# 0.052 0.33
Iallf 50 7 0.004 0.010 0.36 5 −1.577# −2.651# −3.710# 0.053# 0.20
Non-If 40 7 0.012 0.011 0.77 1 0.212 0.512 −0.388 0.125 0.15
E 10 3 0.006 0.007 0.51 1 −0.507 0.174 0.300 0.155 0.37
G 3 0 0.000 0.000 0.00 0
J 7 0 0.000 0.000 0.29 0
I1* 10 1 0.001 0.002 0.20 1 −1.111 −1.243 −0.339 0.300 0.09
I2a* 10 1 0.001 0.002 0.20 1 −1.111 −1.243 −0.339 0.300 0.12
I2a2* 10 0 0.000 0.000 0.00 0
I2b* 10 1 0.001 0.002 0.20 1 −1.111 −1.243 −0.339 0.300 0.10
I* 10 4 0.010 0.009 0.71 2 0.143 −0.338 −0.077 0.167 0.46
R 10 1 0.002 0.002 0.36 0 0.015 0.804 0.417 0.178 0.46
K *(xP) 10 1g 0.000 0.000 0.00 0g
a

Number of individuals.

b

Number of polymorphic sites.

c

Haplotype diversity.

d

Number of singletons.

e

Probability of Sum of Square Deviations (mismatch distribution).

f

Iall: All haplogroup I samples analyzed together; Non-I: E, G, J, R and K* haplogroup samples analyzed together.

g

The -T2303deletion site was excluded in the following statistical analyses.

#

P < 0.05

Excess of singleton mutations in the hg-I samples: selection or population expansion

Analyses of all samples (n =90) showed that six of the polymorphisms were singletons driving negative but not significant D and D* results. However, the more powerful test for detecting an excess of rare alleles, Fs, showed that the number of singletons observed was significantly higher than neutral expectations, suggesting either positive selection or population growth. The samples were then assigned to their corresponding haplogroups and hg-I samples (n=50) were compared against non-I haplogroup (n=40) samples. Although both groups of samples have the same number of segregating sites (S=7) generating seven haplotypes, in the hg-I samples most of the polymorphisms were singletons (5/7) scaling up to five times more singletons compared to the non-I haplogroup samples (1/7). The excesses of low frequency polymorphisms in the hg-I samples were also detected by the rather negative and significant D, D*, and Fs values (Table 2) suggesting either positive selection or a population history out of demographic equilibrium. The conservative mismatch distribution based tests also did not reject the null hypothesis of population growth for any of the haplogroups including hg-I (Table 2). Interestingly, the unimodality of mismatch distributions was much more evident in the non-I haplogroup samples (Supplementary Figure 2) suggesting a larger population expansion compared to hg-I samples. The small and significant R2 test value in the hg-I samples was also consistent with a recent population growth. Further examination of each I subhaplogroup showed that, except the I*, all others had negative but not significant D, D*, and Fs values. The R2 is a superior test for detecting population growth when the sample sizes and the number of segregating sites are small. However, again, none of the individual I haplogroups indicated population growth upon the R2 test. The D, D*, and Fs estimates for E, and R samples were also not significantly different from zero and population demographic equilibrium cannot be rejected by the R2 test. Similar analyses were not conducted in the G, J, and K samples as there were no segregating sites. The I* samples, with positive D value, showed the highest haplotype diversity in the hg-I samples, where two distinct haplotypes characterized by C15368T and A16775G were evident (Table 2 and Figure 3).

We also calculated D and D* values for DDX3X (X chromosome homolog of DDX3Y) and CD24 (chromosome 6 homolog of CD24L4) using the HapMap data from 60 CEU samples. The original DDX3X variation data based on 18 segregating sites generated significantly positive D (2.36, P < 0.05) and D* (1.68, P < 0.05) values. However, it is known that the HapMap genotype data is biased towards high frequency polymorphisms and excludes singleton changes (http://www.hapmap.org/). After applying a correction for the ascertainment bias in the HapMap data generation,31 the D and D* values were not significantly different from neutral expectation of zero.

Next, we examined if sampling from non-homogeneous populations, in addition to population growth, can bias the D and D* results. Firstly, we tested if our samples originated from a single homogeneous population following Chakraborty’s23 approach and compared the observed number of alleles against their expected values given the observed homozygosity in our samples. For all samples (n = 90), the observed number of alleles (k = 14) was significantly higher than the expected number (k = 6.8, P = 0.0003) indicating non-homogeneous population sampling. When hg-I (n = 50) and other (non-I, n = 40) haplogroup samples are examined separately, the observed number of alleles (k = 7) was significantly higher than the expected number (k = 3.4, P = 0.004) for the hg-I, but not significantly different for the non-I haplogroup samples (8 vs. 6.2, P = 0.63). The results were not significant for individual I subhaplogroups and haplogroups E, G, J, R and K*. Following, the number of subhaplogroups sampled vs. D and D* estimates were analyzed. A significant negative correlation between the number of subhaplogroups sampled and the estimated D and D* values were observed (r2 = 0.95, P < 0.001 and r2 = 0.84, P < 0.01, respectively. Supplementary Figure 3). A similar analysis using only the non-I haplogroup samples did not indicate a significant correlation between the number of populations sampled and the estimated D and D* values (For D: r2 = 0.01, P = 0.74; For D*: r2 = 0.001, P = 0.97).

Population genetic analyses of CD24L4

An initial DNA sequence analysis of 50 samples (10 from each of the haplogroups E, I, J, K*, and R) found an A to G transition at 662 bp upstream of CD24L4 mRNA starting site (-A662G, Supplementary Figure 1) segregating only in the haplogroup I (hg-I) individuals. The -A662G change was in a repeat region characteristic of LINEs. This high LINE resembling stretch of sequence is most probably a remnant of the retrotransposition event that generated the CD24L4 on the Y chromosome. Further sequencing of 10 samples from each of the I1*, I2a*, I2a2*, I2b*, and I* subhaplogroups and additional genotyping of 484 samples from ten haplogroups showed that the derived G allele was found only in the I2a*, I2a2*, I2b*, and I* samples, whereas all I1* samples had the ancestral A allele (Figure 1). This observation is in agreement with the recently published YCC 2008 Y chromosomal haplogroup tree9 in which the P215 marker shows that I2a* and I2b* lineages share a more recent common ancestor with each other than with the I1* lineages. When I2a* and I2b* lineages were combined and compared against I1* lineage in terms of progression to AIDS outcomes, no significant progression difference was observed (P > 0.3).

Discussion

In a previous study we discovered faster AIDS progression associated with European specific haplogroup I individuals when compared against the individuals belonging to the most common European haplogroup R.4 A standing question was whether AIDS susceptibility was specific to hg-I or it was also shared by its sister clade hg-J. In this study, using new genotyping data, we firstly identified individuals belonging to haplogroup J, the phylogenetically closest lineage to hg-I, and we confirmed the fast AIDS progression signal is specific to hg-I individuals using the hg-J as a reference group. The robustness of hg-I disease susceptibility association to correction for known effects of several autosomal loci and genome-wide allelic variants suggests that the disease signal seen here reflects Y chromosome specific loci. Moreover, we observed that none of the hg-I subhaplogroups can account for the faster progression signal alone and the AIDS signal can be best explained due to genetic variants present in all hg-I Y chromosomes. However, we note that generalization of our conclusions to individual hg-I subhaplogroups should be interpreted cautiously given the small sample sizes in some of these groups. Finally, a candidate gene approach of resequencing DDX3Y, a crucial autoimmunity gene, did not identify a causal variant that can explain the disease association.

Although resequencing of DDX3Y did not identify a causal variant, the sequence data indicated a significant deviation from neutrality in the frequency distribution of DDX3Y polymorphisms in the hg-I samples. Therefore, we conducted comprehensive comparative evolutionary and population genetic analyses on Y chromosomes belonging to hg-I and other common European haplogroups to examine potential roles of disease based selection, demographic history and geographic differentiation affecting the evolutionary history of hg-I.

The significantly excess number of singletons observed in the DDX3Y among the hg-I samples can be caused by positive/directional selection or presence of weakly deleterious mutations. However, none of the observed mutations are in the coding regions. It is highly unlikely that these mutations will be advantageous (or weakly deleterious) and give selective advantage (or disadvantage) to the hg-I Y chromosome. It is more likely that purifying selection is acting on DDX3Y. A recent survey of X-degenerate, single-copy protein coding genes from Y chromosomes representing worldwide diversity also found very little variation in the exons of these genes (including DDX3Y) suggesting purifying selection as the primary factor in the evolutionary history of these genes.32 Therefore, an explanation is needed for the excess of presumably neutral intonic polymorphisms observed in the hg-I samples.

Next, we examined if a population growth model can account for the observed patterns of variation in DDX3Y and applied three more population genetic tests in addition to the D and D* estimates. All of the tests indicated population growth for hg-I, although population equilibrium could not be rejected for the non-I samples. However, individual hg-I subhaplogroups did not significantly deviate from population equilibrium. We acknowledge a sampling caveat that may be inflating the population growth estimators. As we were interested in a thorough genetic investigation of hg-I, we sampled representatives of all common hg-I subhaplogroups for sequencing. Due to the unique topology of Y haplogroup phylogenetic tree, this sampling scheme means that we are more prone to sampling from the tips of the Y haplogroup phylogenetic tree for the hg-I samples, capturing more singleton changes and inflating the D and D* estimates (Supplementary Figure 3).

The hg-I makes up only around 18% of the European Y chromosome haplogroups, 26 therefore it is unlikely that hg-I have experienced a more drastic exponential growth compared to the other haplogroups. Moreover, when non-I haplogroup samples are pooled together, they did not show any evidence of population growth. It is expected that exponential population growth should affect polymorphism patterns throughout the genome. Some studies suggested that the DNA sequence data from autosomes and X chromosome did not show a rapid expansion in Europeans,33-35 whereas others argued that the data was consistent with population growth after a past bottleneck.36,37 To see what kind of population demography can be inferred from the X chromosome and autosomal homologs of DDX3Y and CD24L4, we calculated D and D* values for DDX3X and CD24 using the HapMap data from 60 CEU samples. After considering for the ascertainment bias inherent in the HapMap genotype data, we cannot reject demographic equilibrium hypothesis for these loci. However, we note that Northern European descendant HapMap (CEU) DDX3X and CD24 results may not be representative of genotype variation in all European populations. Adequate sampling of geographically diverse European populations and unbiased resequencing are crucial for understanding the demographic history of European populations.38

Due to biological forces (such as drift) and cultural practices (such as patrilocatility) Y chromosome shows the most geographically structured variation pattern in the human genome. Recently, high levels of population substructure within European populations has been shown using high density marker studies.39,40 How much of this European population structure can be traced by Y chromosome haplogroups is currently unknown. Hg-I shows strong geographic differentiation in Europe. Subhaplogroups I1* and I2b* are mostly found in north and west Europe, respectively. I2a* is most frequent in Eastern Europe and Balkans, whereas I2a2* is the predominant subhaplogroup in Sardinia.26 Our high density autosomal marker analyses with European descendant Americans suggest that there are genome-wide allelic differences between hg-I and other haplogroups and within hg-I subhaplogroups. These differences may be tracing the different geographic origin of European American individuals with different Y (sub)haplogroups.

A different and more complex population and demographic history within the hg-I subhaplogroups was also evident. For the DDX3Y region, the I* samples had at least twice the number of segregating sites and more than three times the haplotype diversity compared to the other I subhaplogroups, where two distinct haplotypes were evident. Also, the positive D estimate in I* samples was in the opposite direction compared to other subhaplogroups. These observations suggest that I* samples represent an older and/or more mixed population compared to other hg-I subhaplogroups that is in demographic equilibrium. Moreover, the presence of CD24L4 -A662G change in I2a*, I2b* and I* samples suggests that either this mutation originated independently on the I2* and I* Y chromosomes or that, at least some, I* chromosomes share a more recent common ancestor with the I2* clade. The first scenario of recurrent mutations on different Y chromosomes is unlikely given the low mutation rates observed on the Y chromosome and lack of observation of any other change on other examined (older) haplogroups (as they have existed for longer periods of time they would have had more chances to accumulate mutations). The paragroup I* is a collection of hg-I Y chromosomes defined by the lack of derived I subhaplogroup markers. Although the paragroup represents the oldest hg-I lineages, it makes up only about 5% of the hg-I in European populations26 and has not been thoroughly studied. Highest frequency I* samples are found in southwestern Europe (2.9%–5.8%) and Balkans (3.6%), the most probable last glacial period refuge areas of I1*, I2b*, and I2a*, respectively.26 However, whether these geographically separate I* populations represent a panmictic population or a long term separated, and structured population is currently unknown. Considering that the high density SNP data also suggest genome wide allelic differences between I* and the rest of the subhaplogroups, testing a scenario of historic geographical population division for the two major haplotypes observed in I* samples should be interesting using larger samples from Iberian peninsula, western and southeastern Europe.

The participants of this study were self reported European descendants from the U.S. population. We assumed that we will be able to detect the pre-and post-Neolithic evolutionary factors that had been acting on the European Y chromosome lineages, without any sigificant influence of the post-Columbian expansion in the New World. This should be a reasonable assumption. First, the mutation rates on the human Y chromosome are rather slow and it is unlikely that there had been enough genetic diversity accumulation on the paternal lineages since post-Columbian expansion in the New World. Second, previously we observed that haplogroups I, J, and R made up 17%, 10%, and 63% of the 2000 Y chromosomes studied from U.S. population, respectively.4 These haplogroup frequencies sampled from U.S. population are rather similar to the frequency distribution of these haplogroups in Europe, not indicating any bias for or against hg-I expansion in the New World.

In conclusion, we present a unique evolutionary history of hg-I, the first human Y chromosome haplogroup showing infectious disease association. Our analyses suggest that, genetic changes occurred on the ancestor of hg-I Y chromosomes, after its split from its sister clade hg-J, may be responsible for AIDS susceptibility. However, we were not able to identify a causative locus. We also observed that even within a continent, Y chromosomes belonging to the same haplogroup may have very different life histories representing their diverse genomic backgrounds. This is an important observation that needs to be considered in designing of future Y chromosome related disease association and replication studies. Determining the roles of disease based selection, demographic history and population structure during the evolutionary history of a haplogroup is obviously complex. Although AIDS is a recent epidemic and might not have had enough time to affect Y chromosome sequence diversity, overall infectious disease susceptibility might be acting on certain Y chromosomes for longer periods of time. Our analyses could not show a clear selection against (or for) the hg-I. However, this study should be considered as early stages of a relatively new approach and our conclusions should serve as a stimulant for further research.

Supplementary Material

01

Acknowledgements

We thank the individuals and staff of all the participating cohorts in the study. San Francisco City Cohort samples were provided by Susan Buchbinder. We also thank Michael Malasky, and Mary Thompson for their assistance. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. government. The project described in this manuscript has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under contract N01-CO-12400.

Footnotes

Conflict of interest: Authors declare no conflict of interest

References

  • 1.Jobling MA, Tyler-Smith C. The human Y chromosome: an evolutionary marker comes of age. Nat. Rev. Genet. 2003;4:598–612. doi: 10.1038/nrg1124. [DOI] [PubMed] [Google Scholar]
  • 2.Skaletsky H, Kuroda-Kawaguchi T, Minx PJ, Cordum HS, Hillier L, Brown LG, et al. The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature. 2003;423:825–837. doi: 10.1038/nature01722. [DOI] [PubMed] [Google Scholar]
  • 3.Krausz C, Quintana-Murci L, Forti G. Y chromosome polymorphisms in medicine. Ann. Med. 2004;36:573–583. doi: 10.1080/07853890410018853. [DOI] [PubMed] [Google Scholar]
  • 4.Sezgin E, Lind JM, Shrestha S, Hendrickson S, Goedert JJ, Donfield S, et al. Association of Y chromosome haplogroup I with HIV progression, and HAART outcome. Hum. Genet. 2009;125:281–294. doi: 10.1007/s00439-008-0620-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Tyler-Smith C. An evolutionary perspective on Y-chromosomal variation and male infertility. Int. J. Androl. 2008;31:376–382. doi: 10.1111/j.1365-2605.2008.00889.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Soulat D, Burckstummer T, Westermayer S, Goncalves A, Bauch A, Stefanovic A, et al. The DEAD-box helicase DDX3X is a critical component of the TANK-binding kinase 1-dependent innate immune response. Embo J. 2008;27:2135–2146. doi: 10.1038/emboj.2008.126. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Yedavalli VS, Neuveut C, Chi YH, Kleiman L, Jeang KT. Requirement of DDX3 DEAD box RNA helicase for HIV-1 Rev-RRE export function. Cell. 2004;119:381–392. doi: 10.1016/j.cell.2004.09.029. [DOI] [PubMed] [Google Scholar]
  • 8.Hughes JF, Skaletsky H, Rozen S, Wilson RK, Page DC. Has the chimpanzee Y chromosome been sequenced? Nat. Genet. 2006;38:853–854. doi: 10.1038/ng0806-853b. author reply 854-855. [DOI] [PubMed] [Google Scholar]
  • 9.Karafet TM, Mendez FL, Meilerman MB, Underhill PA, Zegura SL, Hammer MF. New binary polymorphisms reshape and increase resolution of the human Y chromosomal haplogroup tree. Genome Res. 2008;18:830–838. doi: 10.1101/gr.7172008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.O’Brien SJ, Nelson GW. Human genes that limit AIDS. Nat. Genet. 2004;36:565–574. doi: 10.1038/ng1369. [DOI] [PubMed] [Google Scholar]
  • 11.Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006;38:904–909. doi: 10.1038/ng1847. [DOI] [PubMed] [Google Scholar]
  • 12.Tamura K, Dudley J, Nei M, Kumar S. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol. Biol. Evol. 2007;24:1596–1599. doi: 10.1093/molbev/msm092. [DOI] [PubMed] [Google Scholar]
  • 13.Repping S, van Daalen SK, Brown LG, Korver CM, Lange J, Marszalek JD, et al. High mutation rates have driven extensive structural polymorphism among human Y chromosomes. Nat. Genet. 2006;38:463–467. doi: 10.1038/ng1754. [DOI] [PubMed] [Google Scholar]
  • 14.Watterson GA. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 1975;7:256–276. doi: 10.1016/0040-5809(75)90020-9. [DOI] [PubMed] [Google Scholar]
  • 15.Nei M. Molecular evolutionary genetics, x. Columbia University Press; New York: 1987. [Google Scholar]
  • 16.Tajima F. Evolutionary relationship of DNA sequences in finite populations. Genetics. 1983;105:437–460. doi: 10.1093/genetics/105.2.437. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Fu YX, Li WH. Statistical tests of neutrality of mutations. Genetics. 1993;133:693–709. doi: 10.1093/genetics/133.3.693. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Fu YX. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics. 1997;147:915–925. doi: 10.1093/genetics/147.2.915. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Ramos-Onsins SE, Rozas J. Statistical properties of new neutrality tests against population growth. Mol. Biol. Evol. 2002;19:2092–2100. doi: 10.1093/oxfordjournals.molbev.a004034. [DOI] [PubMed] [Google Scholar]
  • 20.Rogers AR, Harpending H. Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 1992;9:552–569. doi: 10.1093/oxfordjournals.molbev.a040727. [DOI] [PubMed] [Google Scholar]
  • 21.Rozas J, Sanchez-DelBarrio JC, Messeguer X, Rozas R. DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics. 2003;19:2496–2497. doi: 10.1093/bioinformatics/btg359. [DOI] [PubMed] [Google Scholar]
  • 22.Schneider S, Excoffier L. Estimation of past demographic parameters from the distribution of pairwise differences when the mutation rates vary among sites: application to human mitochondrial DNA. Genetics. 1999;152:1079–1089. doi: 10.1093/genetics/152.3.1079. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Chakraborty R. Mitochondrial DNA polymorphism reveals hidden heterogeneity within some Asian populations. Am. J. Hum Genet. 1990;47:87–94. [PMC free article] [PubMed] [Google Scholar]
  • 24.Schneider S, Roessli D, Excoffier L. A software for population genetics data analysis. Genetics and Biometry Laboratory, University of Geneva; Switzerland: 2000. [Google Scholar]
  • 25.Libert F, Cochaux P, Beckman G, Samson M, Aksenova M, Cao A, et al. The deltaccr5 mutation conferring protection against HIV-1 in Caucasian populations has a single and recent origin in Northeastern Europe. Hum. Mol. Genet. 1998;7:399–406. doi: 10.1093/hmg/7.3.399. [DOI] [PubMed] [Google Scholar]
  • 26.Rootsi S, Magri C, Kivisild T, Benuzzi G, Help H, Bermisheva M, et al. Phylogeography of Y-chromosome haplogroup I reveals distinct domains of prehistoric gene flow in europe. Am. J. Hum. Genet. 2004;75:128–137. doi: 10.1086/422196. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Semino O, Magri C, Benuzzi G, Lin AA, Al-Zahery N, Battaglia V, et al. Origin, diffusion, and differentiation of Y-chromosome haplogroups E and J: inferences on the neolithization of Europe and later migratory events in the Mediterranean area. Am. .J Hum. Genet. 2004;74:1023–1034. doi: 10.1086/386295. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Sun C, Skaletsky H, Birren B, Devon K, Tang Z, Silber S, et al. An azoospermic man with a de novo point mutation in the Y-chromosomal gene USP9Y. Nat. Genet. 1999;23:429–432. doi: 10.1038/70539. [DOI] [PubMed] [Google Scholar]
  • 29.Shen P, Wang F, Underhill PA, Franco C, Yang WH, Roxas A, et al. Population genetic implications from sequence variation in four Y chromosome genes. Proc. Natl. Acad. Sci. USA. 2000;97:7354–7359. doi: 10.1073/pnas.97.13.7354. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Thomson R, Pritchard JK, Shen P, Oefner PJ, Feldman MW. Recent common ancestry of human Y chromosomes: evidence from DNA sequence data. Proc. Natl. Acad. Sci. USA. 2000;97:7360–7365. doi: 10.1073/pnas.97.13.7360. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Nielsen R, Hubisz MJ, Clark AG. Reconstituting the frequency spectrum of ascertained single-nucleotide polymorphism data. Genetics. 2004;168:2373–2382. doi: 10.1534/genetics.104.031039. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Rozen S, Marszalek JD, Alagappan RK, Skaletsky H, Page DC. Remarkably little variation in proteins encoded by the Y chromosome’s single-copy genes, implying effective purifying selection. Am. J. Hum. Genet. 2009;85:923–928. doi: 10.1016/j.ajhg.2009.11.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Voight BF, Adams AM, Frisse LA, Qian Y, Hudson RR, Di Rienzo A. Interrogating multiple aspects of variation in a full resequencing data set to infer human population size changes. Proc. Natl. Acad. Sci USA. 2005;102:18508–18513. doi: 10.1073/pnas.0507325102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Pluzhnikov A, Di Rienzo A, Hudson RR. Inferences about human demography based on multilocus analyses of noncoding sequences. Genetics. 2002;161:1209–1218. doi: 10.1093/genetics/161.3.1209. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Wall JD, Przeworski M. When did the human population size start increasing? Genetics. 2000;155:1865–1874. doi: 10.1093/genetics/155.4.1865. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Adams AM, Hudson RR. Maximum-likelihood estimation of demographic parameters using the frequency spectrum of unlinked single-nucleotide polymorphisms. Genetics. 2004;168:1699–1712. doi: 10.1534/genetics.104.030171. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Marth GT, Czabarka E, Murvai J, Sherry ST. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics. 2004;166:351–372. doi: 10.1534/genetics.166.1.351. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Wall JD, Cox MP, Mendez FL, Woerner A, Severson T, Hammer MF. A novel DNA sequence database for analyzing human demographic history. Genome Res. 2008;18:1354–1361. doi: 10.1101/gr.075630.107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Heath SC, Gut IG, Brennan P, McKay JD, Bencko V, Fabianova E, et al. Investigation of the fine structure of European populations with applications to disease association studies. Eur. J. Hum. Genet. 2008;16:1413–1429. doi: 10.1038/ejhg.2008.210. [DOI] [PubMed] [Google Scholar]
  • 40.Tian C, Plenge RM, Ransom M, Lee A, Villoslada P, Selmi C, et al. Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 2008;4:e4. doi: 10.1371/journal.pgen.0040004. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

01

RESOURCES