Abstract
Objective
To determine the relationship between highly-conserved extended-haplotypes (CEHs) in the major histocompatibility complex (MHC) and MS-susceptibility.
Background
Among the ~200 MS-susceptibility regions, which are known from genome-wide analyses of single nucleotide polymorphisms (SNPs), the MHC accounts for roughly a third of the currently explained variance and the strongest MS-associations are for certain Class II alleles (e.g., HLA-DRB1*15:01; HLA-DRB1*03:01; and HLA-DRB1*13:03), which frequently reside on CEHs within the MHC.
Design/Methods
Autosomal SNPs (441,547) from 11,376 MS cases and 18,872 controls in the WTCCC dataset were phased. The most significant MS-associated SNP haplotype was composed of 11 SNPs in the MHC Class II region surrounding the HLA-DRB1 gene. We also phased alleles at the HLA-A, HLA-C, HLA-B, HLA-DRB1, and HLA-DQB1 loci. These data were used to probe the relationship between CEHs and MS susceptibility.
Results
We phased a total of 59,884 extended haplotypes (HLA-A, HLA-C, HLA-B, HLA-DRB1, HLA-DQB1 and SNP haplotypes) from 29,942 individuals. Of these, 10,078 unique extended haplotypes were identified. The 10 most common CEHs accounted for 22% (13,302) of the total. By contrast, the 8,446 least common extended haplotypes also accounted for approximately 20% (12,298) of the total. This extreme frequency-disparity among extended haplotypes necessarily complicates interpretation of reported disease-associations with specific HLA alleles. In particular, the HLA motif HLA-DRB1*15:01~HLA-DQB1*06:02 is strongly associated with MS risk. Nevertheless, although this motif is almost always found on the a1 SNP haplotype, it can rarely be found on others (e.g., a27 and a36), and, in these cases, it seems to have no apparent disease-association (OR = 0.7; CI = 0.3–1.3 and OR = 0.7; CI = 0.2–2.2, respectively). Furthermore, single copy carriers of the a1 SNP-haplotype without this HLA motif still have an increased disease risk (OR = 2.2; CI = 1.2–3.8). In addition, even among the set of CEHs, which carry the Class II motif of HLA-DRB1*15:01~HLA-DQB1*06:02~a1, different CEHs have differing strengths in their MS-associations.
Conclusions
The MHC in diverse human populations consists, primarily, of a very small collection of very highly-selected CEHs. Our findings suggest that the MS-association with the HLA-DRB1*15:01~HLA-DQB1*06:02 haplotype may be due primarily to the combined attributes of the CEHs on which this particular HLA-motif often resides.
Introduction
The basis of genetic susceptibility to multiple sclerosis (MS) is complex [1–3]. Currently, more than 200 common risk variants associated with MS have been identified in different genomic regions by genome-wide association screens (GWAS) comparing MS patients to controls [4–12]. These GWAS typically evaluate the disease associations for ~500,000 single nucleotide polymorphisms (SNPs) scattered throughout the genome [4–12]. Although GWAS have defined a large number of genetic associations, the associations with several alleles of the human leukocyte antigens (HLA), located in the major histocompatibility complex (MHC) on the short arm of chromosome 6 (6p21.3), were identified more than four decades ago. The most prominent of these HLA associations (by far) is with the HLA-DRB1*15:01 allele, which typically has an odds ratio (OR) of more than three for heterozygotes and more than six for homozygotes [9, 13–20]. Other alleles at the DRB1 locus (e.g., HLA-DRB1*03:01 and HLA-DRB1*13:03) are also known to be associated with an increased risk of getting MS [1, 11, 21]. However, even with the large number of defined genetic associations with MS, most of the genetic risk in MS remains unexplained. In addition, as shown in Figure A in S3 File, the large majority of the population does not even belong to the subgroup of individuals who are “genetically susceptible” to getting MS [3]. Observations such as these have created a so-called “heritability gap”. Such a gap is a common finding in many complex genetic disorders [1, 2] and is likely due (at least in part) to the phenomenon of “synthetic association” [22], in which a reported association is simply tagging a genomic region rather than identifying a causal variant. Indeed, both single SNPs and single alleles can be associated with several haplotypes, sometimes spread over a considerable genetic distance [23–34].
For example, despite the apparently well-established association of MS susceptibility with the HLA-DRB1*15:01 allele, this association might be due to a synthetic association [18, 19]. Moreover, as demonstrated in Figure A in S3 File, even for the HLA-DRB1*15:01 allele, the large majority of its carriers do not even belong to the subset of individuals who are “genetically susceptible” to getting MS [3].
Some of the haplotypes in the MHC region are highly conserved extended haplotypes (CEHs), which span more than 2.7 megabases (mb) [23–28, 30, 32–36]. These CEHs exist even though the MHC region encompasses several recombination hotspots and the region as a whole has an average recombination rate of ~0.4 centimorgans (cM) per mb [27, 34, 37, 38]. Proposed mechanisms to account for this kind of extended linkage are: “frozen blocks” of DNA, preservation of ancestral lineages, haplotype-specific suppression of recombination / mutation in parts of the MHC region, or some form of balancing evolution, in which heterozygosity is favored [24, 39–43]. Several of these CEHs include HLA-DRB1*15:01, HLA-DRB1*03:01, HLA-DRB1*13:03, or other alleles. For example, the haplotypes:
and:
have been consistently observed in Caucasian populations [23–28, 32, 35, 38]. Necessarily, the existence of such CEHs in the MHC region complicates the interpretation of the disease association with any specific HLA allele. We recently explored a method for reducing the size of the heritability gap by analyzing SNP haplotypes (rather than single SNPs) throughout the genome [32]. In addition to improving significantly the explained genetic risk, this method also provides an opportunity to explore in greater depth the genetic associations of the MHC reported previously.
For example, using the Wellcome Trust Case Control Consortium dataset (WTCCC), we found an 11-SNP haplotype in the MHC region, which had the greatest MS disease association of any, and which we labeled the a1 SNP haplotype (OR [single copy] ≈ 3; p<10^−300) [29, 30]. This SNP haplotype represents a specific string of 11 SNPs spanning a total of 246.3 kilobases (kb) surrounding the HLA-DRB1 gene (Fig 1) and includes the SNPs rs2395173, rs2395174, rs3129871, rs7192, rs3129890, rs9268832, rs532098, rs17533090, rs2187668, rs1063355, and rs9275141. These 11 SNPs define 174 haplotypes in this region (e.g., Table 1), with each SNP haplotype having its own Class II HLA haplotype specificity (e.g., Table 1; Fig 2). As with other previously reported SNP “hits” in this genomic region [9, 13–17], the a1 SNP haplotype is tightly coupled to the MHC Class II haplotype of HLA-DRB1*15:01~HLA-DQB1*06:02. In the present paper, we have analyzed the haplotype structure of the MHC (including both HLA alleles and SNP haplotypes) to better understand the specific genetic relationship of this genomic region to MS.
Table 1. Selected SNP haplotypes in the Class II region of chromosome 6†.
Name | SNP haplotype | HLA association | Frequency (WTCCC) | Frequency (EPIC) |
---|---|---|---|---|
a1 | 10110100010 | HLA-DRB1*15:01~HLA-DQB1*06:02 | 0.12 | 0.11 |
a2 | 00000000100 | HLA-DRB1*03:01~HLA-DQB1*02:01 | 0.02 | 0.02 |
a3 | 00000010001 | multiple haplotypes†† | 0.19 | 0.21 |
a4 | 00000000001 | HLA-DRB1*11:01~HLA-DQB1*03:01 | 0.11 | 0.13 |
a5 | 10100010001 | HLA-DRB1*07:01~HLA-DQB1*02:02 | 0.09 | 0.08 |
a6 | 01011100100 | HLA-DRB1*03:01~HLA-DQB1*02:01 | 0.10 | 0.09 |
a8 | 10110100011 | HLA-DRB1*15:01~HLA-DQB1*05:02 | 0.00 | 0.00 |
a9 | 01000001010 | HLA-DRB1*01:01~HLA-DQB1*05:01 | 0.11 | 0.11 |
a11 | 00000010010 | HLA-DRB1*13:01~HLA-DQB1*06:03 | 0.02 | 0.03 |
a14 | 10111111001 | HLA-DRB1*13:03~HLA-DQB1*03:01 | 0.01 | 0.01 |
a27 | 10100100011 | two haplotypes§ | 0.00 | 0.00 |
a34 | 10111100010 | HLA-DRB1*15:01~HLA-DQB1*06:02§§ | 0.00 | 0.00 |
a36 | 10100100010 | HLA-DRB1*15:01~HLA-DQB1*06:02§§ | 0.00 | 0.00 |
a43 | 00000100010 | HLA-DRB1*15:01~HLA-DQB1*06:02 | 0.00 | 0.00 |
† The "Name" is arbitrary and indicates the order of haplotype identification in the EPIC dataset [29, 30]. The SNP haplotype represents the haplotypes identified using the set of 11 SNPs shown in Fig 1 and provided in text. The number “0” indicates the presence of the major allele and the number “1” indicates the presence of the minor allele (in the control population) at the particular SNP location. Only 14 selected SNP-haplotypes (of the 174 present in the WTCCC) are listed. Haplotype frequencies found in two independent datasets (EPIC and WTCCC) are shown [29, 30]. Frequencies are rounded to two decimal places; those listed as 0.00 were less than 0.005. Each of the 174 haplotypes had very specific associations with specific Class II haplotypes. For example, each of the associations (shown in the Table) of specific SNP-haplotypes with specific HLA haplotypes was highly significant. Almost all had a p-value (by Chi square analysis) of p<10^−300. The only two exceptions were HLA-DRB1*07:01~HLA-DQB1*02:02~a3 (p<10^−151) and HLA-DRB1*15:01~HLA-DQB1*06:02~a34 (p<10^−290). Moreover, both the EPIC and the WTCCC datasets had the same Class II HLA associations with the different SNP-haplotypes.
†† In both EPIC and the WTCCC, a3 was equally associated with four HLA haplotypes: HLA-DRB1*04:01~HLA-DQB1*03:01, HLA-DRB1*04:01~HLA-DQB1*03:02, HLA-DRB1*04:04~HLA-DQB1*03:02, and HLA-DRB1*07:01~HLA-DQB1*02:02.
§ In both EPIC and WTCCC, a27 is associated with two haplotypes: HLA-DRB1*15:01~HLA-DQB1*06:02 and HLA-DRB1*15:01~HLA-DQB1*05:02. In WTCCC, 58% (28/48) were HLA-DRB1*15:01~HLA-DQB1*06:02, whereas, in EPIC, none of the five a27 SNP haplotypes were associated with this particular HLA haplotype.
§§ The single example of the a34 SNP haplotype in EPIC was associated with the HLA-DRB1*15:01~HLA-DQB1*06:02 HLA haplotype. None of the EPIC participants with HLA information carried the a36 SNP haplotype.
Results
Highly conserved haplotypes of the MHC
Some of the CEHs in the MHC region, which are highly conserved, involve both Class I and Class II loci [24–38]. The different combinations of alleles at three Class I loci (HLA-A, HLA-B, and HLA-C) and two Class II loci (HLA-DRB1 and HLA-DQB1) together with a specific 11-SNP haplotype represent more than 4 billion possible unique haplotypes spanning a genomic distance of 2.7 mb. Despite this huge number of possibilities, the frequency distribution for these extended haplotypes in the WTCCC is definitely non-Gaussian, with many very rare haplotypes together with a small number of very common haplotypes (e.g., Fig 3; Figure B in S3 File; S1 Table; S2 Table). Thus, there were just 10,078 unique haplotypes represented within the 29,942 individuals of the WTCCC accounting for 59,884 total observed haplotypes. Of these, 13,302 (22%) were accounted for by the most common 10 CEHs, 30% by the most common 25 CEHs, 48% by the 146 CEHs with 50 or more representations in the WTCCC, and 71% by the most common 810 CEHs (S1 Table). On the other end, 6,016 (60%) of the unique extended haplotypes were observed only once in the WTCCC dataset. An additional 1,397 (14%) had only 2 representations so that 7,413 (74%) of the unique haplotypes had two or fewer representations. However, these 74% of the unique haplotypes accounted for only 8,810 (15%) of the total number of observed haplotypes in the WTCCC dataset. Consequently, there exists a small set of very common CEHs, which have been strongly selected (see S2 File), and which, nonetheless, have notably different compositions in different populations, even among relatively nearby geographic regions (Fig 4; S1 and S2 Tables). Moreover, there also appears to be a substantial amount of mixing between specific Class I and Class II motifs (see S1 File).
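The proportions quoted in this paragraph follow directly from the reported counts. The short Python sketch below is only an arithmetic check; the variable names are illustrative and the inputs are the counts given in the text.

```python
# Counts reported in the text for the WTCCC dataset.
total_haplotypes = 59_884      # all observed extended haplotypes
unique_haplotypes = 10_078     # distinct extended haplotypes
top10_count = 13_302           # observed haplotypes carried by the 10 most common CEHs
singletons = 6_016             # unique haplotypes observed exactly once
doubletons = 1_397             # unique haplotypes observed exactly twice

# Unique haplotypes with two or fewer representations.
rare_unique = singletons + doubletons
# Observed haplotypes accounted for by those rare unique haplotypes.
rare_observed = singletons * 1 + doubletons * 2


def pct(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


summary = {
    "top10_share_pct": pct(top10_count, total_haplotypes),
    "singleton_share_of_unique_pct": pct(singletons, unique_haplotypes),
    "rare_share_of_unique_pct": pct(rare_unique, unique_haplotypes),
    "rare_share_of_observed_pct": pct(rare_observed, total_haplotypes),
}
print(rare_unique, rare_observed, summary)
```

The check gives 7,413 unique haplotypes with two or fewer representations, which together account for 8,810 observed haplotypes, and the four percentages come out as 22%, 60%, 74%, and 15%, matching the values stated above.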
In addition, the prevalence of individuals in the WTCCC who were homozygous for these CEHs was increased relative to expectations (expected = 269; observed = 383; z = 6.97; p<10^−11). Such an increase was found for both the cases (expected = 152; observed = 208; z = 4.59; p<10^−5) and the controls (expected = 138; observed = 175; z = 3.13; p = 0.0018).
Haplotype associations with MS susceptibility
The fact that much (possibly most) of the MHC is composed of a small group of CEHs necessarily complicates the interpretation of any disease associations previously reported for specific alleles such as HLA-DRB1*15:01, HLA-DRB1*03:01, and HLA-DRB1*13:03 [1, 9, 13–17, 19, 21, 29, 30]. For example, it is unclear to what extent the effect of HLA-DRB1*15:01 on disease susceptibility can be separated from an effect of the full CEHs (comprising both the 5 HLA alleles and the SNP haplotypes) on which this allele resides. To investigate this, we undertook two alternative approaches. First, we examined the disease association of different CEHs, which contained HLA-DRB1*15:01~HLA-DQB1*06:02~a1, HLA-DRB1*03:01~HLA-DQB1*02:01~a2, HLA-DRB1*03:01~HLA-DQB1*02:01~a6, or HLA-DRB1*13:03~HLA-DQB1*03:01~a14. Second, we examined the disease associations for haplotypes that either contained these same Class II HLA motifs but a different SNP haplotype motif or contained the same SNP haplotype motif but a different Class II HLA motif.
HLA-DRB1*15:01~HLA-DQB1*06:02
The HLA-DRB1*15:01~HLA-DQB1*06:02 haplotype is very closely associated with the (a1) SNP haplotype; 99% of all (a1)-carriers also carry HLA-DRB1*15:01~HLA-DQB1*06:02 and the reciprocal statement is true as well (Fig 2). The disease associations of all CEHs containing HLA-DRB1*15:01~HLA-DQB1*06:02~a1 with 50 or more representations in the WTCCC dataset are shown in Table 2. Each of these extended haplotypes is significantly associated with an increased disease risk (Table 2). However, for several of them, the magnitude of the association with disease risk varied significantly (Figure C in S3 File). For example, the disease-association for haplotype (c2) was significantly greater than that for both the (c3) and the (c11) haplotypes (Figure C in S3 File). By contrast, the haplotype (c3) had a significantly smaller disease-association than that of several other haplotypes (Figure C in S3 File). Especially notable, however, was haplotype (c282), consisting of HLA-A*03:01~HLA-C*15:02~HLA-B*51:01~HLA-DRB1*15:01~HLA-DQB1*06:02~a1, which had an extremely strong disease association (OR = 20.3; CI = 6.1–67.3; p<10^−11), and which differed significantly from every other haplotype with the exception of the (c173) haplotype (Figure C in S3 File). However, even though the magnitude of disease association depends upon the particular CEH on which the HLA-DRB1*15:01~HLA-DQB1*06:02~a1 motif resides, some disease risk seems to be attributable to the HLA-DRB1*15:01~HLA-DQB1*06:02~a1 haplotype by itself, because the disease risk is still significantly increased for those individuals who carry this complete Class II motif and, yet, whose full CEH has only a single representation in the WTCCC (OR = 3.0; CI = 2.7–3.4; p<10^−10).
Table 2. Common a1-containing extended haplotypes in the WTCCC††.
Name† | HLA haplotype (A~C~B~DRB1~DQB1~SNP) | Frequency | OR* | p-value** |
---|---|---|---|---|
c2§ | 03:01~07:02~07:02~15:01~06:02~a1 | 2961 | 3.2 (3.0–3.5) | < E-168 |
c3§ | 02:01~07:02~07:02~15:01~06:02~a1 | 1465 | 2.2 (2.0–2.5) | < E-38 |
c6 | 24:02~07:02~07:02~15:01~06:02~a1 | 728 | 2.8 (2.4–3.3) | < E-36 |
c11 | 25:01~12:03~18:01~15:01~06:02~a1 | 440 | 3.9 (3.1–4.8) | < E-39 |
c13 | 01:01~07:02~07:02~15:01~06:02~a1 | 405 | 3.4 (2.7–4.2) | < E-29 |
c16 | 01:01~07:01~08:01~15:01~06:02~a1 | 320 | 3.7 (2.9–4.8) | < E-27 |
c19 | 02:01~05:01~44:02~15:01~06:02~a1 | 289 | 2.1 (1.6–2.7) | < E-7 |
c22 | 11:01~07:02~07:02~15:01~06:02~a1 | 229 | 2.5 (1.9–3.4) | < E-9 |
c28 | 01:01~06:02~37:01~15:01~06:02~a1 | 178 | 4.5 (3.2–6.3) | < E-20 |
c44 | 31:01~07:01~18:01~15:01~06:02~a1 | 135 | 2.9 (2.0–4.2) | < E-9 |
c50 | 02:01~03:04~40:01~15:01~06:02~a1 | 124 | 3.1 (2.0–4.7) | < E-7 |
c58 | 02:01~03:03~15:01~15:01~06:02~a1 | 105 | 3.2 (2.1–5.0) | < E-7 |
c78 | 29:02~16:01~44:03~15:01~06:02~a1 | 84 | 3.7 (2.2–6.1) | < E-7 |
c87 | 31:01~07:02~07:02~15:01~06:02~a1 | 73 | 3.4 (2.0–5.6) | < E-6 |
c91 | 26:01~07:02~07:02~15:01~06:02~a1 | 71 | 2.6 (1.6–4.3) | < E-3 |
c108 | 32:01~07:02~07:02~15:01~06:02~a1 | 64 | 3.1 (1.8–5.4) | < E-4 |
c116 | 31:01~15:02~51:01~15:01~06:02~a1 | 60 | 4.3 (2.4–7.9) | < E-6 |
c120 | 03:01~04:01~35:01~15:01~06:02~a1 | 58 | 4.5 (2.5–8.1) | < E-7 |
c125 | 11:01~03:03~55:01~15:01~06:02~a1 | 57 | 1.9 (1.1–3.3) | < 0.05 |
c128 | 68:01~07:04~44:02~15:01~06:02~a1 | 55 | 2.9 (1.6–5.1) | < E-3 |
c132 | 01:01~06:02~57:01~15:01~06:02~a1 | 54 | 1.8 (1.0–3.3) | < 0.05 |
c139§§ | 02:01~03:04~15:01~15:01~06:02~a1 | 52 | 3.2 (1.6–6.3) | < E-3 |
c140 | 11:01~15:02~51:01~15:01~06:02~a1 | 52 | 3.3 (1.7–6.4) | < E-3 |
c143 | 68:01~07:02~07:02~15:01~06:02~a1 | 51 | 3.0 (1.6–5.6) | < E-3 |
c173 | 23:01~07:01~49:01~15:01~06:02~a1 | 43 | 5.5 (2.8–10.9) | < E-7 |
c282 | 03:01~15:02~51:01~15:01~06:02~a1 | 29 | 20.3 (6.1–67.3) | < E-11 |
†† a1 containing haplotypes with ≥ 50 representations in the WTCCC. Two additional haplotypes with fewer representations are also shown.
† Arbitrary name for haplotype (sorted in descending order of frequency) for the entire WTCCC population.
* Odds ratio (OR) of disease for individuals having 1 copy of the listed haplotype compared to having no other copies of the HLA-DRB1*15:01~HLA-DQB1*06:02~a1 Class II haplotype (95% CI range in parenthesis). A Bonferroni correction for the number of haplotypes with 50 or more representations (146) would require a significance level of p<3*E-4.
** Significance of the association between having 1 copy of the specific allele and the disease (MS) compared to having no copies. The p-values are expressed in scientific notation as powers of 10 (E). All observations with (p<0.001) still demonstrated a statistically significant effect even after adjustment for population stratification, geographic stratification, and gender. Moreover, including each of these haplotypes in the same regression equation demonstrated that each of the listed CEHs was independently associated with having MS.
§ These two haplotypes also differed (non-significantly) in their disease-association for having two copies of each allele compared to having no copies of the HLA-DRB1*15:01~HLA-DQB1*06:02~a1 Class II haplotype. Thus, these ORs are
For c2: OR [two copies] = 5.8 (3.4–9.9)
And, for c3: OR [two copies] = 2.7 (1.3–5.5)
§§ The Class I and Class II portions of each listed haplotype were significantly associated with each other beyond the Bonferroni-adjusted level of significance. The only exception to this rule was for the haplotype c139. In this case, the association had a p-value of: p = 4.42*E−8
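The Bonferroni threshold mentioned in the table notes can be reproduced with a one-line calculation. The sketch below assumes a conventional family-wise error rate of 0.05, which is not stated explicitly in the text.

```python
# Number of haplotypes with 50 or more representations (from the table notes).
n_tests = 146
# Family-wise error rate; 0.05 is an assumption, not a value taken from the text.
alpha = 0.05

bonferroni_threshold = alpha / n_tests
print(f"{bonferroni_threshold:.1e}")
```

Under that assumption the threshold is roughly 3.4 × 10^−4, in line with the p<3*E-4 level given in the notes.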
Despite the extremely strong association of the (a1) SNP-haplotype with this particular HLA haplotype, some HLA-DRB1*15:01~HLA-DQB1*06:02 motifs occur in association with other SNP-haplotypes and some of these combinations seem not to have any disease-association (Fig 5A). Thus, for example, single-copy carriers of either the HLA-DRB1*15:01~HLA-DQB1*06:02~a27 or the HLA-DRB1*15:01~HLA-DQB1*06:02~a36 haplotype seem not to have any increase in their disease-risk (OR = 0.7; CI = 0.3–1.3 and OR = 0.7; CI = 0.2–2.2, respectively). These ORs are significantly different for both the (a27)-containing haplotype (z = 2.5; p = 0.01) and for the (a36)-containing haplotype (z = 4.2; p<10^−4) compared to the same HLA-haplotype containing (a1). Similarly, as shown in Fig 5A, when all non-(a1)-containing haplotypes carrying the HLA-DRB1*15:01~HLA-DQB1*06:02 motif were considered together, they also had significantly smaller ORs than the (a1)-containing haplotypes (z = 3.9; p<10^−4). By contrast, single-copy carriers of the (a1) SNP haplotype who lack the HLA-DRB1*15:01~HLA-DQB1*06:02 HLA haplotype still have a significantly increased disease risk (OR = 2.2; CI = 1.2–3.8). Moreover, although this OR is less than that found for single-copy carriers of the HLA-DRB1*15:01~HLA-DQB1*06:02~a1 haplotype, the confidence intervals overlap and the two ORs did not differ significantly (z = 1.2; p = 0.24).
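One common way to compare two odds ratios is a z-test on the difference of their logarithms, with standard errors recovered from the reported 95% confidence intervals. The Python sketch below illustrates that general approach; it is not necessarily the procedure used to obtain the z-scores quoted above. It assumes the two groups are independent and that the intervals are symmetric on the log scale, and the function names are illustrative. The example inputs are the OR for (a1)-carriers lacking the HLA motif (2.2; CI 1.2–3.8) and, as a stand-in for the motif-carrying comparison group, the OR reported earlier for singleton CEHs carrying the complete motif (3.0; CI 2.7–3.4), so the result need not equal the reported z = 1.2.

```python
import math


def log_or_se(ci_low, ci_high, z_crit=1.96):
    """Standard error of log(OR) recovered from a 95% confidence interval."""
    return (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)


def compare_odds_ratios(or_a, ci_a, or_b, ci_b):
    """Two-sided z-test for the difference between two independent log odds ratios."""
    se_a = log_or_se(*ci_a)
    se_b = log_or_se(*ci_b)
    z = (math.log(or_a) - math.log(or_b)) / math.sqrt(se_a**2 + se_b**2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under the normal distribution
    return z, p


# Illustrative inputs taken from ORs and CIs reported in the text (rounded values).
z, p = compare_odds_ratios(3.0, (2.7, 3.4), 2.2, (1.2, 3.8))
print(round(z, 2), round(p, 2))
```

With these rounded inputs the test statistic is close to 1 and the two-sided p-value is well above 0.05, which agrees qualitatively with the statement that the two ORs do not differ significantly.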
In the WTCCC dataset, the HLA alleles were imputed [44] and, thus, it is possible that either errors of imputation or errors in SNP identification could have influenced these findings. We addressed these possibilities in two ways. First, we compared the HLA associations of the different SNP haplotypes in the imputed WTCCC dataset with the HLA haplotype associations in the Expression, Proteomics, Imaging, and Clinical (EPIC) Study dataset, which had been determined by sequence-based typing methods [30]. There was excellent agreement in the corresponding Class II SNP haplotype associations found in the two datasets (Table 1). In addition, several of the rare HLA-DRB1*15:01~HLA-DQB1*06:02-containing SNP haplotypes were found in both datasets (Table 1). Second, we analyzed the Hamming distance between the various HLA-DRB1*15:01~HLA-DQB1*06:02-containing SNP haplotypes to assess how close these haplotypes were to each other (Figs 6 and 7). Presumably, if errors in SNP identification were responsible for occasionally assigning the HLA-DRB1*15:01~HLA-DQB1*06:02 haplotype to rare SNP haplotypes, the percentage of these errors would tend to be higher for haplotypes at short Hamming distances from (a1). However, no such relationship was evident (Figs 6 and 7).
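The Hamming distance between two SNP haplotypes is simply the number of SNP positions at which they differ. As a minimal illustration, the Python sketch below computes the distance from (a1) to the other HLA-DRB1*15:01-associated SNP haplotypes listed in Table 1; the function and variable names are illustrative, and the full analysis in Figs 6 and 7 covers more haplotypes than are shown here.

```python
# 11-SNP haplotype strings as listed in Table 1 (0 = major allele, 1 = minor allele).
snp_haplotypes = {
    "a1":  "10110100010",
    "a8":  "10110100011",
    "a27": "10100100011",
    "a34": "10111100010",
    "a36": "10100100010",
    "a43": "00000100010",
}


def hamming(h1, h2):
    """Number of SNP positions at which two equal-length haplotypes differ."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(h1, h2))


# Distance of each listed haplotype from the a1 reference haplotype.
distances = {
    name: hamming(snp_haplotypes["a1"], hap)
    for name, hap in snp_haplotypes.items()
    if name != "a1"
}
print(distances)  # {'a8': 1, 'a27': 2, 'a34': 1, 'a36': 1, 'a43': 3}
```

In this subset, (a8), (a34), and (a36) each differ from (a1) at a single SNP, (a27) at two, and (a43) at three.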
HLA-DRB1*03:01~HLA-DQB1*02:01
The haplotype HLA-DRB1*03:01~HLA-DQB1*02:01 is divided between the (a2) and the (a6) SNP haplotypes (Figs 2 & 5B; Table 3). These two haplotypes seem to have distinctive disease associations. Thus, the (a2)-containing haplotypes show dominance (or dose dependence), such that both the heterozygotes and homozygotes have an increased disease risk (Fig 5B). This is the case for all the common (a2)-containing extended haplotypes (Table 3). By contrast, the (a6)-containing haplotypes, for the most part, show a recessive pattern such that heterozygotes seem not to have any increased risk (Fig 5B). Thus, the increased risk in (a2)-containing heterozygotes is significantly different from that in the (a6)-containing heterozygotes (z = 5.9; p<10^−8), and, in addition, the (a6)-containing homozygotes have a substantially increased disease risk, which is significantly greater than that found for (a6)-containing heterozygotes (z = 8.0; p<10^−14). Again, the lack of any increased risk for heterozygotes seems to be true for most of the (a6)-containing CEHs (Table 3). However, this was not the case for the extended haplotype HLA-A*24:02~HLA-C*07:01~HLA-B*08:01~HLA-DRB1*03:01~HLA-DQB1*02:01~a6 (c68). For this haplotype, the disease risk for the heterozygote was both significantly increased (Table 3) and, with the exception of (c27) and (c90), significantly greater than that for other (a6)-containing CEHs (range of z-scores: 2.2–4.6; range of p-values: 0.03–10^−5).
Table 3. Common a2-, a6-, or a14-containing (or other) extended haplotypes††.
Name† | HLA haplotype (A~C~B~DRB1~DQB1~SNP) | Frequency | OR* | p-value** |
---|---|---|---|---|
c23 | 30:02~05:01~18:01~03:01~02:01~a2 | 212 | 2.0 (1.4–2.7) | < E-4 |
c46 | 01:01~07:01~08:01~03:01~02:01~a2 | 128 | 2.1 (1.5–3.0) | < E-4 |
c85 | 02:01~05:01~18:01~03:01~02:01~a2 | 75 | 1.7 (1.0–2.9) | < 0.05 |
c1§ | 01:01~07:01~08:01~03:01~02:01~a6 | 3782 | 1.1 (1.0–1.2) | < 0.05 |
c14 | 02:01~07:01~08:01~03:01~02:01~a6 | 397 | 0.9 (0.7–1.2) | ns |
c27 | 03:01~07:01~08:01~03:01~02:01~a6 | 181 | 1.7 (1.2–2.3) | < E-2 |
c51 | 68:01~07:01~08:01~03:01~02:01~a6 | 121 | 0.6 (0.4–1.0) | < 0.05 |
c68 | 24:02~07:01~08:01~03:01~02:01~a6 | 91 | 3.0 (1.8–4.9) | < E-5 |
c90 | 03:01~07:02~07:02~03:01~02:01~a6 | 71 | 1.6 (0.9–2.6) | ns |
c97 | 32:01~07:01~08:01~03:01~02:01~a6 | 68 | 1.1 (0.6–2.0) | ns |
c110 | 25:01~07:01~08:01~03:01~02:01~a6 | 63 | 1.3 (0.7–2.3) | ns |
c34 | 68:02~08:02~14:02~13:03~03:01~a14 | 161 | 1.9 (1.3–2.8) | < E-3 |
c96 | 66:01~17:01~41:02~13:03~03:01~a14 | 69 | 2.6 (1.5–4.5) | < E-3 |
c107 | 02:01~17:01~41:02~13:03~03:01~a14 | 64 | 1.9 (1.1–3.4) | < 0.05 |
c5§§ | 02:01~05:01~44:02~04:01~03:01~a3 | 906 | 0.5 (0.4–0.6) | < E-11 |
c15 | 02:01~06:02~13:02~07:01~02:02~a3 | 361 | 0.5 (0.3–0.6) | < E-5 |
c18 | 02:01~06:02~57:01~07:01~03:03~a5 | 293 | 0.5 (0.3–0.7) | < E-4 |
c24 | 02:01~01:02~27:05~01:01~05:01~a9 | 211 | 0.5 (0.3–0.7) | < E-3 |
c30 | 02:01~05:01~44:02~11:01~03:01~a4 | 173 | 0.6 (0.4–0.9) | < 0.05 |
c32 | 03:01~07:02~07:02~13:01~06:03~a18 | 166 | 0.6 (0.4–0.9) | < E-2 |
c73 | 02:01~15:02~51:01~09:01~03:03~a4 | 87 | 0.4 (0.2–0.8) | < E-2 |
c81 | 24:02~07:02~39:06~08:01~04:02~a16 | 79 | 3.1 (1.8–5.5) | < E-4 |
†† haplotypes with ≥ 50 representations in the WTCCC. All such haplotypes carrying the a2, a6, or a14 SNP haplotype are included. For each of the listed haplotypes, the Class I and Class II portions were significantly associated with each other far beyond the Bonferroni-adjusted level of significance.
† Arbitrary name for haplotype (sorted in descending order of frequency) for the entire WTCCC population.
* Odds ratio (OR) of disease for individuals having 1 copy of the listed haplotype compared to having no copies of the particular HLA-DRB1~HLA-DQB1~SNP Class II haplotype (95% CI range in parenthesis). All haplotypes carrying the HLA-DRB1*15:01~HLA-DQB1*06:02~a1 Class II motif were excluded in this analysis. A Bonferroni correction for the number of haplotypes with 50 or more representations (146) would require a significance level of (p<3*E-4).
** Significance of the association between having 1 copy of the specific allele and the disease (MS) compared to having no copies. The p-values are expressed in scientific notation as powers of 10 (E); ns = not significant. With the exception of c23 and c46, all observations with p<0.001 still showed a statistically significant effect even after adjustment for population stratification, geographic stratification, and gender. Moreover, even c23 and c46 trended in this direction (p≈0.10).
§ Only the c1 haplotype had enough observations to explore the disease association for having two copies of an allele compared to having no copies of the HLA-DRB1*03:01~HLA-DQB1*02:01~a6 Class II haplotype. Thus, this OR was
For c1: OR [two copies] = 2.1 (1.5–2.9); p = 2.1*E-6
This effect was still statistically significant even after adjustment for population stratification (p = 3.13*E-6).
The other Class II haplotypes containing HLA-DRB1*03:01~HLA-DQB1*02:01~a6, combined, had an OR of:
OR [two copies] = 0.8 (0.1–3.4); p = ns
§§ This group of haplotypes is composed of those that also had a significant association with this disease. Most of these haplotypes seem to be protective and this protective effect remained significant (p<0.05) even after excluding all individuals who carried the HLA-DRB1*15:01~HLA-DQB1*06:02~a1 haplotype.
HLA-DRB1*13:03~HLA-DQB1*03:01
The haplotype HLA-DRB1*13:03~HLA-DQB1*03:01 is essentially confined to the (a14) SNP haplotype (Figs 2 & 5B; Table 3). This haplotype was clearly associated with an increased disease risk in the heterozygote (Fig 5B), and the risk was roughly similar for all of the most common (a14)-containing extended haplotypes (Fig 5). The disease risk may also be increased in individuals homozygous for this haplotype, although there were too few observations to be sure (Fig 5B).
Other extended haplotypes
Several other CEHs also seemed to be associated with disease risk (Table 3). Many of these were protective and this protective effect was evident even though those individuals who carried the HLA-DRB1*15:01~HLA-DQB1*06:02~a1 haplotype were removed from the analysis (Table 3). By contrast, as is also shown in Table 3, the extended haplotype HLA-A*24:02~HLA-C*07:02~HLA-B*39:06~HLA-DRB1*08:01~HLA-DQB1*04:02~a16 was associated with a significant increase in disease risk (OR = 3.1; CI = 1.8–5.5).
Regression analysis confirmed the significance of these observations and no significant interactions were identified. Moreover, adjustment for population stratification, geographic stratification and for gender did not alter these findings (Tables 2 and 3).
The EPIC cohort
The cohort of patients from the EPIC study was considerably smaller than that in the WTCCC study and, consequently, only a limited amount of comparative information is available. For example, only 6 CEHs (c1, c2, c3, c5, c6, and c11) had 20 or more representations in the EPIC dataset (S3 Table). Nevertheless, all four of the HLA-DRB1*15:01~HLA-DQB1*06:02~a1-containing haplotypes (c2, c3, c6, and c11) were significantly associated with MS and had ORs [single copy] ranging from 2.5 to 3.9, with the largest being for (c11) and the smallest being for (c2). The haplotype (c1) had a non-significant OR [single copy] of 1.3 and the haplotype (c5) had an OR that was significantly less than one (OR [single copy] = 0.2). In general, these findings are consistent with those reported above for the WTCCC cohort (Tables 2 and 3; S3 Table).
Discussion
In the WTCCC dataset, the MHC region seemed to be composed, largely, of a relatively small collection of very highly-selected CEHs (see S1 File) stretching, at least, from the HLA-A locus to beyond the HLA-DQB1 locus (a distance spanning more than 2.7 mb of DNA). The occurrence of homozygous CEHs was increased both in cases and controls. Such an increase might be expected in the patient population, where the homozygotes of certain haplotypes have an especially high disease risk [9,13–20]. However, it should not be the case for the control population if a balancing selection (i.e., one in which some heterozygous combinations have higher fitness than homozygous combinations) was expected [41]. Alternatively, such a finding might be due to population stratification effects. Thus, such an increase might be expected if local sub-populations had different CEH frequencies (e.g., like Fig 4, but with finer grained population subdivisions) and if individuals from these sub-populations had a propensity to find mates within their same sub-population [45].
Also, and as developed more fully in S2 File, when classifying the WTCCC haplotypes into “rare” and “frequent” CEHs (i.e., those found once or more than once, respectively), there is a significant excess in the number of homozygotes observed for both “rare” and “frequent” CEHs compared to Hardy-Weinberg equilibrium (HWE) expectations. For this analysis, homozygotes are considered “rare−rare” and “frequent−frequent” individuals regardless of the actual CEHs that make up the haplotype pair. The conversion of CEHs from “rare” to “frequent” or vice versa can be caused either by biologic mechanisms (e.g., recombination or mutation) or by mistakes (e.g., typing, imputation, or phasing errors). Such mistakes cannot be avoided entirely due to the marked similarity of many HLA alleles [46]. However, regardless of the underlying mechanism, haplotype conversion, by itself, does not produce any deviation from HWE (S2 File). Also, mistakes do not produce actual changes in CEH frequencies that accumulate over time. By contrast, over time, actual haplotype conversions (e.g., those caused by biologic mechanisms), which are unopposed, would reach a stable state in the population only once the net conversion rate is zero, i.e., when the probabilities of frequent→rare and rare→frequent transitions are equal (S2 File). This, however, is decidedly not the state of the WTCCC, the EPIC, or the other populations examined here, each of which is composed predominantly of a small number of very common CEHs (Fig 3; Figure B in S3 File). Consequently, it must be that the force of actual haplotype conversion is being opposed by another force (i.e., selection) that both retains “frequent” CEHs in the population and also perturbs HWE (S2 File). Such a selection is already strongly suggested just based on the typical CEH composition of the different human populations (Fig 3; Figure B in S3 File).
Indeed, using the observed magnitude of the deviation from HWE, and presuming the forces of selection and haplotype conversion balance each other, leads both to the conclusion that the relative probability of survival for individuals with homozygous “rare” CEHs is less than 80% of that for individuals with homozygous “frequent” CEHs and also that the net frequent → rare haplotype conversion rate is on the order of 3−6% for the MHC region in each generation (S2 File).
Naturally, there are possible explanations, other than selection, which could also produce a deviation from HWE expectations. Most conspicuous and widely recognized among these is the possibility that the WTCCC population is composed of two or more sub-populations, each of which is in HWE but with each sub-population having different haplotype frequencies. Such a circumstance would violate the HWE assumption of random mating and would lead to the circumstance in which homozygotes are in excess of expectations (as we observed). Moreover, there is no doubt that the exact CEH composition of the WTCCC varies considerably from region to region (e.g., Fig 4; S2 Table). Nevertheless, as discussed in S2 File, there are several reasons that even this simple mechanism seems inadequate to account for our observed deviations from HWE. Most importantly, we examined the impact that the observed differences in the percentage of “rare” CEHs among the sub-populations would have had on the HWE deviation. This analysis indicated that these differences could account for only about a quarter of the difference in HWE that we actually observed (S2 File). Consequently, our observations seem likely to be the result of a combination of both haplotype conversion and haplotype selection, each representing processes that take place in every generation.
Moreover, the strong selection of CEHs implies that certain allelic combinations “work well together” whereas other combinations do not (S2 File). Presumably, this “working well together”, in a biological sense, means that a particular combination of these five alleles (but likely also including other specific alleles of the many intervening genes) permits the host to respond to a variety of abiotic and biotic threats (or opportunities) in a manner that improves fitness (regardless of whether these come from the external environment, the internal microbiome, or both). However, it is also clear from these findings that no single allelic combination is being selected above all others. Rather, a relatively small number (in the hundreds) of combinations are being selected simultaneously (e.g., Tables 2 and 3; S1 Table). Perhaps this is because the nature of these abiotic and biotic threats (or opportunities) results in a very complex “fitness landscape”, which is highly variable both in space and in time and, thus, in which fitness depends upon the precise environmental context of the individual, including specific host factors such as the exact location of their residence, their particular micro-environment, their diet, their lifestyle, or other individual idiosyncrasies. In such a case, no single CEH may be favored in all circumstances and, consequently, such highly variable landscape topography might help to rationalize why so many haplotypes seem to be selected simultaneously. It might also help to rationalize why the group composition of the selected CEHs seems to be so fluid between separated populations (e.g., Fig 4; S1 Table).
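The sub-population argument above can be illustrated with a minimal sketch. It assumes two hypothetical sub-populations, each internally in HWE, with invented haplotype frequencies; the function name and the numbers are illustrative only and are not taken from the WTCCC data or from S2 File.

```python
def pooled_homozygote_fractions(subpops):
    """subpops: list of (relative_size, {haplotype: frequency}) pairs.

    Returns (observed, expected) homozygote fractions for the pooled sample:
    'observed' assumes HWE inside each sub-population, 'expected' assumes the
    pooled sample were a single random-mating population.
    """
    total = sum(size for size, _ in subpops)
    haplotypes = set()
    for _, freqs in subpops:
        haplotypes.update(freqs)
    # Homozygotes actually present: weighted sum of within-group p^2 terms
    observed = sum(size / total * sum(p * p for p in freqs.values())
                   for size, freqs in subpops)
    # Homozygotes expected from the pooled haplotype frequencies
    pooled = {h: sum(size / total * freqs.get(h, 0.0) for size, freqs in subpops)
              for h in haplotypes}
    expected = sum(p * p for p in pooled.values())
    return observed, expected


# Hypothetical example: two equally sized groups with mirrored frequencies
observed, expected = pooled_homozygote_fractions([
    (1.0, {"A": 0.8, "B": 0.2}),
    (1.0, {"A": 0.2, "B": 0.8}),
])
print(observed, expected)  # pooled sample shows more homozygotes than expected
```

In this toy case the excess arises purely from pooling, with no selection at all, which is why the size of the stratification contribution has to be estimated before attributing the remainder to other forces.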
Thus, even within European populations, the beginning of such a divergence can already be recognized (Fig 4; S2 Table) and, based on limited data, this divergence in the group composition of the selected haplotypes in long separated populations (i.e., Africans, AmerIndians, Asians, and Caucasians) seems to be substantially greater (S1 File; S1 Table).
The main hypothesis of the present study was that any observed allelic disease association is a reflection of those CEHs, which confer MS disease risk. The present study sheds considerable light on this hypothesis. For example, although many CEHs, which include the Class II motif HLA-DRB1*15:01~HLA-DQB1*06:02~a1, are associated with an increased disease risk (Table 2), the actual risk varies significantly among the different extended haplotypes (Table 2; Figure C in S3 File). Moreover, some haplotypes, which include the motif HLA-DRB1*15:01~HLA-DQB1*06:02 but don’t include the SNP-haplotype a1, seem not to carry any risk (Fig 5A). By contrast, the (a1)-containing haplotypes, which don’t include this Class II motif, still carry substantial risk (Fig 5A). These observations suggest that the motif of HLA-DRB1*15:01~HLA-DQB1*06:02, by itself, does not fully account for the disease risk associated with these extended haplotypes. Regardless of this conclusion, however, some disease risk seems to be attributable to some aspect of the HLA-DRB1*15:01~HLA-DQB1*06:02~a1 haplotype by itself. Thus, even correcting for population stratification effects, the disease risk is still significantly increased for those individuals who both carry this Class II haplotype and, yet, whose full extended haplotype had only a single representation in the WTCCC.
In the case of the Class II HLA motif of HLA-DRB1*03:01~HLA-DQB1*02:01, this dependence on the extended haplotype is even more evident. Thus, most of the common extended haplotypes, which include the Class II motif of HLA-DRB1*03:01~HLA-DQB1*02:01~a2 seem to associate with a disease risk that is either dominant or dose dependent (Table 3; Fig 5B). By contrast, those haplotypes, which include the motif of HLA-DRB1*03:01~HLA-DQB1*02:01~a6, as a group, seem to associate with a disease risk that is recessive (Fig 5B). Nevertheless, at least one of these (a6)-containing haplotypes (i.e., HLA-A*24:02~C*07:01~HLA-B*08:01~HLA-DRB1*03:01~HLA-DQB1*02:01~a6) is associated with a disease risk, which is either dominant or dose dependent (Table 3).
In summary, the MHC is organized into a relatively small group of extended haplotypes (CEHs), which seem to be under a very strong selection pressure, presumably based upon favorable biological properties of the complete haplotype. If so, then, of necessity, this means that disease susceptibility is probably not attributable to any specific HLA allele but rather susceptibility is likely to be dependent upon the nature of each CEH. This conclusion seems to be borne out by the data. Moreover, it is of note that the most highly selected of these CEHs (in Caucasians) also seem to be the ones most likely to be associated with an increased risk of MS. The reasons for this apparent relationship are unclear. However, it is a fact that for the WTCCC population as a whole, for each of the WTCCC regions individually (Fig 4), and also for the EPIC cohort, the three most common CEHs (and 11 of the most common 25 CEHs) were associated with a significantly increased risk of MS (Tables 2 and 3; S3 Table). This observation that the most highly-selected CEHs also carry the greatest MS risk presumably indicates that there must be a net survival advantage for individuals carrying these CEHs, which outweighs the small increased chance of getting MS, a circumstance that is also suggested by the observation (Figure A in S3 File) that only a very small proportion of the individuals who carry these disease-associated CEHs are even within the set of individuals who are “genetically susceptible” to getting the disease [3].
Materials & methods
Ethics statement
This research has been approved by the University of California, San Francisco's Institutional Review Board (IRB) and has been conducted according to the principles expressed in the Declaration of Helsinki.
Study participants
Wellcome Trust Case Control Consortium (WTCCC)
The study cohort was assembled as a prospective multicenter, multinational effort. This study population has been described in detail previously [12,14,16,17]. However, in brief, this cohort included 18,872 controls and 11,376 cases with MS, although SNP haplotype data was unavailable for 380 controls and 232 cases. Of the cases, 72.9% were women, the average age-of-clinical-onset was 33.1 years, and the mean Expanded Disability Status Scale (EDSS) score was 3.7 [12]. Fifteen different countries from around the world participated (Australia, Belgium, Denmark, Finland, France, Germany, Ireland, Italy, Poland, New Zealand, Norway, Spain, Sweden, the United Kingdom, and the United States). The data from Australia and New Zealand were combined so that data from 14 different world regions was available. Consequently, the patients enrolled in this study (except for a few African Americans from the United States) were of European ancestry. Although all clinical MS subtypes were included, the large majority (89%) had a relapsing-remitting onset [11]. The diagnosis of MS was made based upon internationally recognized criteria [47–49]. Control subjects were composed of a combined group, which consisted of several different cohorts of healthy individuals with European ancestry [11]. The Ethical Committees or Institutional Review Boards at each of the participating centers approved the protocol and informed consent was obtained from each study participant. The WTCCC granted data access for this study.
Expression, Proteomics, Imaging, and Clinical (EPIC) study
An independent cohort, for certain comparative purposes, consisted of the patients and controls enrolled in the EPIC study of MS genetics at UCSF and this cohort, also, has been described in detail previously [8]. Briefly, this study included data from 964 patients with MS and 868 controls. Both patients and controls were matched for age and gender, and all participants provided their informed consent [8]. The cohort was drawn from the United States and, essentially, all participants were of European ancestry. The diagnosis of MS, also, was made using internationally recognized criteria [47–49].
Genotyping and quality control
The genotyping methods and quality control for the WTCCC have been described in detail previously [11,12]. All genotyping was performed on the Illumina Infinium platform at the Wellcome Trust Sanger Institute. Case samples were genotyped using a customized Human660-Quad chip. Common controls were genotyped on a second customized Human1M-Duo chip (utilizing the same probes). After quality control, this provided data on 441,547 autosomal SNPs scattered throughout the genome in both MS patients and controls [17]. The identities of the five HLA alleles in the MHC region (A, C, B, DRB1 and DQB1) were determined for each participant by imputation using the HIBAG method [44].
Genotyping and quality control methods for the EPIC cohort have also been described in detail previously [7]. In this study, SNP genotyping was done at the Illumina facilities using the Sentrix HumanHap550 Bead Chip. This analysis provided genotype information on 551,642 SNPs. The identities of the five HLA alleles in the MHC region (HLA-A, HLA-C, HLA-B, HLA-DRB1 and HLA-DQB1) were determined by sequence based typing methods [28].
Statistical methods
Phasing
The phasing of alleles at each of five HLA loci (HLA-A, HLA-C, HLA-B, HLA-DRB1 and HLA-DQB1) was accomplished using a previously published probabilistic phasing algorithm [50,51]. Phased SNP haplotypes were constructed using a previously published probabilistic method [29,30] at sliding windows of 2 to 15 SNPs throughout the 1 Mb span surrounding the Class II region of the DRB1 gene. The SNP-window of the most significant MS-associated SNP haplotype was carried forward as a haplotype locus, a multi-allelic gene to be phased with the 5 classic HLA genes. As discussed earlier, this haplotype locus consisted of 11 phased SNPs surrounding the HLA-DRB1 gene (Fig 1). The accuracy of the phasing was confirmed by the method of SHAPEIT2 [52–54], with better than 99% correspondence between methods.
Phasing was accomplished by determining the probability of each possible combination and assigning the phasing to the most likely combination. At times, however, there were several possible combinations and this method, potentially, might designate a haplotype pair in circumstances where several compatible haplotype pairs existed and each pair had a very similar posterior probability. Such a situation did occur, but rarely. Thus, for the HLA-A~HLA-C~HLA-B~HLA-DRB1~HLA-DQB1 haplotypes, 98% of the designations had a posterior probability of more than 0.5, 92% had a posterior probability of more than 0.6, and 85% had a posterior probability of more than 0.7. For the Class II haplotypes (HLA-DRB1~HLA-DQB1~SNP), these same respective percentages were 100%, 99.997%, and 99.95%.
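The idea of assigning the most likely haplotype pair can be sketched as follows. This is a simplified illustration, not the published algorithm [50,51]: it assumes that population haplotype frequencies are already known (in practice they would be estimated, for example iteratively), that genotypes are complete, and that haplotype pairs occur in HWE proportions. The function name and the frequencies in the example are hypothetical.

```python
from itertools import product


def phase_posteriors(genotype, hap_freqs):
    """genotype: list of unphased (allele1, allele2) tuples, one per locus.
    hap_freqs: {tuple_of_alleles: assumed population frequency}.

    Returns {(hap1, hap2): posterior probability} over the compatible
    unordered haplotype pairs.
    """
    first, rest = genotype[0], genotype[1:]
    weights = {}
    # Fix the orientation of the first locus; try both orientations elsewhere
    for flips in product((False, True), repeat=len(rest)):
        h1, h2 = [first[0]], [first[1]]
        for (a, b), flip in zip(rest, flips):
            h1.append(b if flip else a)
            h2.append(a if flip else b)
        pair = tuple(sorted((tuple(h1), tuple(h2))))
        p1 = hap_freqs.get(pair[0], 0.0)
        p2 = hap_freqs.get(pair[1], 0.0)
        # HWE weight: p^2 for identical haplotypes, 2*p1*p2 otherwise
        weights[pair] = p1 * p2 if pair[0] == pair[1] else 2 * p1 * p2
    total = sum(weights.values())
    if total == 0:
        return {}
    return {pair: w / total for pair, w in weights.items()}


# Hypothetical frequencies for a doubly heterozygous Class II genotype
hap_freqs = {
    ("DRB1*15:01", "DQB1*06:02"): 0.14,
    ("DRB1*03:01", "DQB1*02:01"): 0.12,
    ("DRB1*15:01", "DQB1*02:01"): 0.001,
    ("DRB1*03:01", "DQB1*06:02"): 0.001,
}
genotype = [("DRB1*15:01", "DRB1*03:01"), ("DQB1*06:02", "DQB1*02:01")]
posteriors = phase_posteriors(genotype, hap_freqs)
best_pair = max(posteriors, key=posteriors.get)
print(best_pair, posteriors[best_pair])
```

When the posterior of the best pair is close to that of a competing pair, the designation is ambiguous, which is the situation the percentages above are meant to quantify.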
Haplotype frequencies and association testing
Disease association tests, as measured by ORs and confidence intervals (CIs), were undertaken for each of the HLA haplotypes and HLA plus SNP haplotypes. Because of the very strong association between the HLA-DRB1*15:01~HLA-DQB1*06:02~a1 haplotype and MS, all other associations were assessed after excluding those individuals who carried the HLA-DRB1*15:01~HLA-DQB1*06:02~a1 haplotype. Similarly, when the association of a specific CEH carrying the HLA-DRB1*15:01~HLA-DQB1*06:02~a1 haplotype was assessed, all other HLA-DRB1*15:01~HLA-DQB1*06:02~a1 carriers were excluded from the analysis.
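As an illustration of how an OR and its CI can be obtained from carrier counts, the sketch below uses the standard log-OR (Woolf) approximation. The text does not state which CI method was used, so this choice is an assumption, and the counts in the example are invented.

```python
import math


def odds_ratio_ci(case_carriers, case_noncarriers,
                  control_carriers, control_noncarriers, z=1.96):
    """Odds ratio with an approximate 95% CI (Woolf log-OR method).

    All four counts must be positive; z=1.96 corresponds to a 95% interval.
    """
    odds_ratio = (case_carriers * control_noncarriers) / (
        case_noncarriers * control_carriers)
    # Standard error of ln(OR) from the four cell counts
    se = math.sqrt(1 / case_carriers + 1 / case_noncarriers
                   + 1 / control_carriers + 1 / control_noncarriers)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts: 30 of 100 cases and 10 of 100 controls carry a haplotype
print(odds_ratio_ci(30, 70, 10, 90))
```

The exclusion rule described above would be applied before counting, by dropping the excluded carriers from both the case and control groups.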
In our previous study [30] we found an association of certain Class I alleles with MS (i.e., HLA-A*02:01, HLA-C*05:01, HLA-B*37:01, HLA-B*38:01, and HLA-B*44:02). Consequently, for each of the reported Class II associations (Fig 5), we undertook a regression analysis using these Class I alleles as covariates in the regression equations. This analysis confirmed that the reported Class II associations (Fig 5) were unaffected.
In our previous report [30] we assessed the significance of the association of each SNP haplotype with MS and adjusted these associations for the millions of comparisons made across the entire genome using the Benjamini-Hochberg method [55]. In the present manuscript, by contrast, we analyzed the 174 distinct SNP haplotypes composed of variants at 11 SNP locations (rs2395173, rs2395174, rs3129871, rs7192, rs3129890, rs9268832, rs532098, rs17533090, rs2187668, rs1063355, and rs9275141). Among these haplotypes was the (a1) SNP-haplotype (Table 1), which had the single largest disease-association with MS of any in the genome. In the present manuscript, however, these 174 SNP haplotypes in this genomic region served simply (and only) as an additional genetic marker to be included in the haplotype analysis with the other 5 HLA loci and, thus, no additional statistical adjustment is necessary (or appropriate) as a consequence of their inclusion in the analysis. Nevertheless, because only haplotypes with 50 or more representations in the WTCCC dataset were analyzed, and because there were 146 such haplotypes, a Bonferroni correction for these multiple comparisons would require a significance of (p < 0.05/146 = 0.0003) to be achieved.
Because of the tight linkage that exists among the Class II loci (HLA-DRB1, HLA-DQB1, and SNP haplotype) as well as among the Class I loci (HLA-A, HLA-C, and HLA-B), the association of the different Class I and Class II haplotype combinations (with more than 2 representations in the WTCCC dataset) was determined by the association of specific HLA-A~HLA-C~HLA-B combinations with specific HLA-DRB1~HLA-DQB1~SNP haplotype combinations. The p-values for the association of different Class I with different Class II combinations were determined using a Fisher exact test if any expected cell frequency was 5 or less and otherwise using a Chi square test [56]. The Benjamini-Hochberg method was used to correct for multiple testing of the different possible Class I / Class II combinations [55].
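The test-selection rule and the multiple-testing correction described above can be sketched in plain Python. This is an illustrative sketch only: it assumes 2x2 tables (presence or absence of a Class I combination against presence or absence of a Class II combination), a two-sided Fisher exact test, and an uncorrected Chi square test with 1 degree of freedom. The function names are hypothetical.

```python
import math


def expected_counts(a, b, c, d):
    """Expected cell counts of a 2x2 table [[a, b], [c, d]] under independence."""
    n = a + b + c + d
    return [row * col / n for row in (a + b, c + d) for col in (a + c, b + d)]


def chi_square_p(a, b, c, d):
    """P-value of the Pearson Chi square test (1 df, no continuity correction)."""
    expected = expected_counts(a, b, c, d)
    stat = sum((o - e) ** 2 / e for o, e in zip((a, b, c, d), expected))
    return math.erfc(math.sqrt(stat / 2))  # upper tail of chi-square, 1 df


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value: sum of all tables no more likely than observed."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return math.comb(r1, x) * math.comb(r2, c1 - x) / math.comb(n, c1)

    p_obs = prob(a)
    support = range(max(0, c1 - r2), min(r1, c1) + 1)
    return min(1.0, sum(p for p in map(prob, support) if p <= p_obs * (1 + 1e-9)))


def association_p(a, b, c, d):
    """Fisher exact test if any expected cell count is 5 or less, else Chi square."""
    if min(expected_counts(a, b, c, d)) <= 5:
        return fisher_exact_p(a, b, c, d)
    return chi_square_p(a, b, c, d)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A typical use would be to compute `association_p` for every Class I / Class II pair and then pass the resulting list to `benjamini_hochberg`.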
Significance of the difference in ORs for disease association between any two haplotypes was determined by z-scores calculated from the difference in the natural logarithm of the ORs for the haplotypes. Also, because of the marked predominance of the MS association with the HLA-DRB1*15:01~HLA-DQB1*06:02~a1 haplotype, all disease association tests for other haplotypes were assessed after persons carrying the HLA-DRB1*15:01~HLA-DQB1*06:02~a1 haplotype were excluded from the analysis. Similarly, in the case of disease association tests for individual CEHs that carried the HLA-DRB1*15:01~HLA-DQB1*06:02~a1 Class II motif, all other persons carrying this same Class II motif were excluded from the analysis.
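A minimal sketch of a z-score for the difference between two log ORs follows. It assumes that each OR comes from its own 2x2 table, that the standard error of each log OR is the usual Woolf estimate, and that the two estimates are independent; the last assumption would not hold exactly if the two haplotypes share the same control group. The function names are hypothetical.

```python
import math


def log_or_and_se(a, b, c, d):
    """ln(OR) and its standard error for a 2x2 table of positive counts.

    a, b = case carriers, case non-carriers;
    c, d = control carriers, control non-carriers.
    """
    return math.log((a * d) / (b * c)), math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


def or_difference_z(table1, table2):
    """z-score and two-sided p-value for ln(OR1) - ln(OR2)."""
    log_or1, se1 = log_or_and_se(*table1)
    log_or2, se2 = log_or_and_se(*table2)
    z = (log_or1 - log_or2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p


# Hypothetical counts for two haplotypes
print(or_difference_z((300, 700, 100, 900), (150, 850, 100, 900)))
```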
Significance of disease associations was also confirmed using a regression analysis equating phenotype (case or control) with the dose (0, 1, or 2) of each of the haplotypes identified as being disease associated. An analysis of the potential interactions between the haplotypes was also undertaken with these regression equations.
The expected occurrence of individuals homozygous for the different CEHs (or different CEH-types) was calculated from the measured CEH (or CEH-type) frequencies. These individual expectations were then summed and the expected total compared to the observed total number of homozygous individuals using a z-score.
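The homozygote calculation can be sketched as follows. Under HWE, the expected number of individuals homozygous for a CEH with frequency p among N individuals is N·p², and these expectations are summed over CEHs. The text does not give the variance used for the z-score, so the binomial variance below is an assumption, as are the function name and the example data.

```python
import math
from collections import Counter


def homozygote_excess_z(haplotype_pairs):
    """haplotype_pairs: list of (hap1, hap2) tuples, one per individual.

    Returns (observed, expected, z) for the total number of homozygotes,
    with the expectation summed over haplotypes under HWE.
    """
    n = len(haplotype_pairs)
    counts = Counter(h for pair in haplotype_pairs for h in pair)
    # Probability that a random individual is homozygous: sum of p_i^2
    q = sum((c / (2 * n)) ** 2 for c in counts.values())
    expected = n * q
    observed = sum(1 for h1, h2 in haplotype_pairs if h1 == h2)
    sd = math.sqrt(n * q * (1 - q))  # binomial standard deviation (assumed)
    z = (observed - expected) / sd if sd > 0 else 0.0
    return observed, expected, z


# Hypothetical sample: 30 x/x, 40 x/y, 30 y/y individuals
pairs = [("x", "x")] * 30 + [("x", "y")] * 40 + [("y", "y")] * 30
print(homozygote_excess_z(pairs))
```

The same function applies to CEH-types if each haplotype label is first replaced by its type (e.g., "rare" or "frequent").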
Population stratification
We used principal components (PC) analysis excluding MHC SNPs (Eigensoft) to correct for population stratification within the WTCCC cohort [57]. There was evidence of considerable population structure in the WTCCC data. An analysis of variance test carried out between cases and controls demonstrated a significant difference for most of the first 10 PCs (which accounted for 84% of the population stratification). None of the other PCs were significantly different between cases and controls (nor were PC4 or PC10). The potential impact of this population structure on our findings was assessed by the inclusion of these 10 PCs in the final regression equation.
Geographic, gender, and age stratification
We also adjusted for geographic heterogeneity (in addition to our adjustment for population stratification) by using dummy variable coding for each of the different geographic regions and including these in the final regression equation. Similarly, an adjustment for gender (male = 1; female = 0) was also included in the final regression equation. Neither information about the individual chronological age nor information about individual age-at-clinical-onset was available for either the WTCCC or EPIC data sets. Nevertheless, because this study analyzed only DNA-based haplotypes (which are independent of chronological age), chronological age is not a relevant factor. It is possible, however, that the age at disease-onset could be more relevant. Certainly, some authors have argued that “childhood-onset” MS cases might somehow be different (either genetically or environmentally, or both) from “adult-onset” cases. Nevertheless, within an “adult-onset” MS population (e.g., the WTCCC population), there is no evidence to suggest genetic heterogeneity with respect to age-at-clinical-onset. Also, it is worth pointing out that many patients with “adult-onset” MS can be demonstrated to have MRI evidence of disease activity that precedes, by many years (oftentimes decades), the clinical-onset of MS. Moreover, there is no established (or suggested) relationship between the age-at-clinical-onset and the age of disease-onset. Consequently, any analysis regarding the impact of the age at disease-onset, based solely upon the age observed at the clinical-onset of disease activity, would be unreliable, even if such data were available.
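The covariate coding described above (haplotype dose, PCs, regional dummy variables, and gender) can be sketched as one row of a regression design matrix. This is only an illustration of the coding scheme: the function name, the choice of the first region as the reference category, and the example values are assumptions, and fitting the regression itself is left to a statistics package.

```python
def design_row(dose, pcs, region, regions, is_male):
    """Build one row of predictors for the final regression equation.

    dose:    haplotype dose (0, 1, or 2)
    pcs:     principal component scores for the individual
    region:  the individual's geographic region
    regions: all regions; the first is treated as the reference category
    is_male: True for male (coded 1), False for female (coded 0)
    """
    # One dummy variable per non-reference region
    dummies = [1.0 if region == r else 0.0 for r in regions[1:]]
    return [1.0, float(dose), *pcs, *dummies, 1.0 if is_male else 0.0]


# Hypothetical individual: two copies of a haplotype, two PCs, three regions
regions = ["United Kingdom", "Sweden", "Italy"]
print(design_row(2, [0.1, -0.2], "Sweden", regions, True))
```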
Supporting information
Data Availability
All relevant data are within the paper and its Supporting Information files.
Funding Statement
The authors received no specific funding for this work.
References
1. Gourraud PA, Harbo HF, Hauser SL, Baranzini SE. (2012) The genetics of multiple sclerosis: an up-to-date review. Immunol Rev 248:87–103. doi: 10.1111/j.1600-065X.2012.01134.x
2. Hofker MH, Fu J, Wijmenga C. (2014) The genome revolution and its role in understanding complex diseases. Biochim Biophys Acta 1842:1889–1895. doi: 10.1016/j.bbadis.2014.05.002
3. Goodin DS. The nature of genetic susceptibility to multiple sclerosis: Constraining the Possibilities. BMC Neurology 2016;16:56. doi: 10.1186/s12883-016-0575-6
4. GAMES, the Transatlantic Multiple Sclerosis Genetics Cooperative. (2003) A meta-analysis of whole genome linkage screens in multiple sclerosis. J Neuroimmunol 143:39–46.
5. de Bakker PIW, Yelensky R, Pe’er I, Gabriel SB, Daly MJ, Altshuler D. (2005) Efficiency and power in genetic association studies. Nat Genet 37:1217–1223. doi: 10.1038/ng1669
6. The Wellcome Trust Case Control Consortium & The Australo-Anglo-American Spondylitis Consortium. (2007) Association scan of 14,500 nonsynonymous SNPs in four diseases identifies autoimmunity variants. Nature Genet 39:1329–1337. doi: 10.1038/ng.2007.17
7. The ANZgene Consortium. (2009) Genome-wide association study identifies new multiple sclerosis susceptibility loci on chromosomes 12 and 20. Nature Genet 41:824–828. doi: 10.1038/ng.396
8. Baranzini SE, Wang J, Gibson RA, Galwey N, Naegelin Y, Barkhof F, et al. (2009) Genome-wide association analysis of susceptibility and clinical phenotype in multiple sclerosis. Hum Mol Genet 18:767–778. doi: 10.1093/hmg/ddn388
9. De Jager PL, Jia X, Wang J, de Bakker PI, Ottoboni L, Aggarwal NT, et al. (2009) Meta-analysis of genome scans and replication identify CD6, IRF8 and TNFRSF1A as new multiple sclerosis susceptibility loci. Nature Genet 41:776–782. doi: 10.1038/ng.401
10. Sanna S, Pitzalis M, Zoledziewska M, Zara I, Sidore C, Murru R, et al. (2010) Variants within the immunoregulatory CBLB gene are associated with multiple sclerosis. Nature Genet 42:495–497. doi: 10.1038/ng.584
11. The International Multiple Sclerosis Genetics Consortium & the Wellcome Trust Case Control Consortium. Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis. Nature 2011;476:214–219. doi: 10.1038/nature10251
12. International Multiple Sclerosis Genetics Consortium (IMSGC). (2014) Analysis of immune-related loci identifies 48 new susceptibility variants for multiple sclerosis. Nat Genet 45:1353–60.
13. Dyment DA, Herrera BM, Cader MZ, Willer CJ, Lincoln MR, Sadovnick AD, et al. (2005) Complex interactions among MHC haplotypes in multiple sclerosis: susceptibility and resistance. Hum Mol Genet 14:2019–2026. doi: 10.1093/hmg/ddi206
14. Hafler DA, Compston A, Sawcer S, Lander ES, Daly MJ, De Jager P, et al. (2007) Risk alleles for multiple sclerosis identified by a genomewide study. N. Engl. J. Med. 357, 851–862. doi: 10.1056/NEJMoa073493
15. Ramagopalan SV, Anderson C, Sadovnick AD, Ebers GC. (2007) Genomewide study of multiple sclerosis. N. Engl. J. Med. 357, 2199–2200. doi: 10.1056/NEJMc072836
16. Link J, Kockum I, Lorentzen AR, Lie BA, Celius EG, Westerlind H, et al. (2012) Importance of Human Leukocyte Antigen (HLA) Class I and II Alleles on the Risk of Multiple Sclerosis. PLoS One 7(5):e36779. doi: 10.1371/journal.pone.0036779
17. Patsopoulos NA, Barcellos LF, Hintzen RQ, Schaefer C, van Duijn CM, Noble JA, et al. (2014) Fine-Mapping the Genetic Association of the Major Histocompatibility Complex in Multiple Sclerosis: HLA and Non-HLA Effects. PLoS Genet 9(11):e1003926.
18. Chao MJ, Barnardo MC, Lincoln MR, Ramagopalan SV, Herrera BM, Dyment DA, et al. HLA class I alleles tag HLA-DRB1*1501 haplotypes for differential risk in multiple sclerosis susceptibility. Proc Natl Acad Sci USA 2008;105:13069–74. doi: 10.1073/pnas.0801042105
19. Lincoln MR, Ramagopalan SV, Chao MJ, Herrera BM, Deluca GC, Orton SM, et al. Epistasis among HLA-DRB1, HLA-DQA1, and HLA-DQB1 loci determines multiple sclerosis susceptibility. Proc Natl Acad Sci USA 2009;106:7542–7. doi: 10.1073/pnas.0812664106
20. Multiple Sclerosis Genetics Group (1998) Linkage of the MHC to familial multiple sclerosis suggests genetic heterogeneity. Hum Molec Genet 7:1229–1234.
21. McElroy JP, Cree BAC, Caillier SJ, Gregersen PK, Herbert J, Khan OA, et al. Refining the association of MHC with multiple sclerosis in African Americans. Hum Mol Genet 2010;19:3080–3088. doi: 10.1093/hmg/ddq197
22. Dickson SP, Wang K, Krantz I, Hakonarson H, Goldstein DB. (2010) Rare Variants Create Synthetic Genome-Wide Associations. PLoS Biol 8(1):e1000294. doi: 10.1371/journal.pbio.1000294
23. Bugawan TL, Klitz W, Blair A, Erlich HA. High-resolution HLA class I typing in the CEPH families: analysis of linkage disequilibrium among HLA loci. Tissue Antigens 2000;56:392–404.
24. Ahmad T, Neville M, Marshall SE, Armuzzi A, Mulcahy-Hawes K, Crawshaw J, et al. Haplotype-specific linkage disequilibrium patterns define the genetic topography of the human MHC. Hum Mol Genet 2003;12:647–656.
25. Yunis EJ, Larsen CE, Fernandez-Viña M, Awdeh ZL, Romero T, Hansen JA, et al. Inheritable variable sizes of DNA stretches in the human MHC: conserved extended haplotypes and their fragments or blocks. Tissue Antigens 2003;62:1–20.
26. Horton R, Gibson R, Coggill P, Miretti M, Allcock RJ, Almeida J, et al. Variation analysis and gene annotation of eight MHC haplotypes: The MHC Haplotype Project. Immunogenetics 2008;60:1–18. doi: 10.1007/s00251-007-0262-2
27. Wennerström A, Vlachopoulou E, Lahtela LE, Paakkanen R, Eronen KT, Seppänen M, et al. Diversity of Extended HLA DRB1 Haplotypes in the Finnish Population. PLoS One 2013;8(11):e79690. doi: 10.1371/journal.pone.0079690
28. Zúñiga J, Yu N, Barquera R, Alosco S, Ohashi M, Lebedeva T, et al. HLA Class I and Class II Conserved Extended Haplotypes and Their Fragments or Blocks in Mexicans: Implications for the Study of Genetic Diversity in Admixed Populations. PLoS One 2013;8(9):e74442. doi: 10.1371/journal.pone.0074442
29. Goodin DS, Khankhanian P. Single Nucleotide Polymorphism (SNP)-Strings: An Alternative Method for Assessing Genetic Associations. PLoS One 2014;9(4):e90034. doi: 10.1371/journal.pone.0090034
30. Khankhanian P, Gourraud PA, Lizee A, Goodin DS. Haplotype-based approach to known MS-associated regions increases the amount of explained risk. J Med Genet 2015;52:587–594. doi: 10.1136/jmedgenet-2015-103071
31. Testi M, Battarra M, Lucarelli G, et al. HLA-A-B-C-DRB1-DQB1 phased haplotypes in 124 Nigerian families indicate extreme HLA diversity and low linkage disequilibrium in Central-West Africa. Tissue Antigens 2015;86:285–292. doi: 10.1111/tan.12642
32. Isobe N, Keshavan A, Gourraud PA, Zhu AH, Datta E, Schlaeger R, et al. Association of HLA Genetic Risk Burden With Disease Phenotypes in Multiple Sclerosis. JAMA Neurol May 31, 2016. (E-pub ahead of print).
33. Sanchez-Mazas A, Djoulah M, Busson M, Le Monnier de Gouville I, Poirier JC, Dehay C, et al. A linkage disequilibrium map of the MHC region based on the analysis of 14 loci haplotypes in 50 French families. Eur J Hum Genet 2000;8:33–41. doi: 10.1038/sj.ejhg.5200391
34. Arnheim A, Calabrese P, Nordborg M. Hot and cold spots of recombination in the human genome: the reason we should find them and how this can be achieved. Am J Hum Genet 2003;73:5–16. doi: 10.1086/376419
35. Gragert L, Madbouly A, Freeman F, Maiers M. Six-locus high resolution HLA haplotype frequencies derived from mixed-resolution DNA typing for the entire US donor registry. Hum Immunol 2013:1313–1320. doi: 10.1016/j.humimm.2013.06.025
36. Pappas DP, Tomich A, Garnier F, Marry E, Gourraud PA. Comparison of high-resolution human leukocyte antigen haplotype frequencies in different ethnic groups: Consequences of sampling fluctuation and haplotype frequency distribution tail truncation. Hum Immunol 2015:374–380. doi: 10.1016/j.humimm.2015.01.029
37. Taylan F, Altiok E. Meiotic recombinations within major histocompatibility complex of human embryos. Immunogenetics 2012;64:839–44. doi: 10.1007/s00251-012-0644-y
38. Vandiedonck C, Knight JC. The human major histocompatibility complex as a paradigm in genomics research. Brief Funct Genomic Proteomic 2009;8:379–394. doi: 10.1093/bfgp/elp010
39. Dawkins R, Leelayuwat C, Gaudieri S, Tay G, Hui J, Cattley S, et al. Genomics of the major histocompatibility complex: haplotypes, duplication, retroviruses and disease. Immunol Rev 1999;167:275–304.
40. Trowsdale J, Knight JC. Major histocompatibility complex genomics and human disease. Annu Rev Genomics Hum Genet 2013;14:301–323. doi: 10.1146/annurev-genom-091212-153455
41. van Oosterhout C. A new theory of MHC evolution: beyond selection on the immune genes. Proc. R. Soc. B 2009;276:657–665. doi: 10.1098/rspb.2008.1299
42. Ardlie KG, Kruglyak L, Seielstad M. Patterns of linkage disequilibrium in the human genome. Nat Rev Genet 2002;3:299–309. doi: 10.1038/nrg777
43. Paul P, Nag D, Chakraborty S. Recombination hotspots: Models and tools for detection. DNA Repair 2016;40:47–56. doi: 10.1016/j.dnarep.2016.02.005
44. Zheng X, Shen J, Cox C, Wakefield JC, Ehm MG, Nelson MR, et al. (2014) HIBAG–HLA genotype imputation with attribute bagging. Pharmacogenom J 14:192–200.
45. Gillespie JH. Population Genetics: A Concise Guide. Johns Hopkins University Press (Baltimore and London), 1998.
46. IPD-IMGT/HLA Database. http://www.ebi.ac.uk/ipd/imgt/hla
47. Poser CM, Paty DW, Scheinberg L, McDonald WI, Davis FA, Ebers GC, et al. New diagnostic criteria for multiple sclerosis: guidelines for research protocols. Ann Neurol 13, 227–231 (1983). doi: 10.1002/ana.410130302
48. McDonald WI, Compston A, Edan G, Goodkin D, Hartung HP, Lublin FD, et al. Recommended diagnostic criteria for multiple sclerosis: guidelines from the International Panel on the diagnosis of multiple sclerosis. Ann Neurol 50, 121–127 (2001).
49. Polman CH, Reingold SC, Edan G, Filippi M, Hartung HP, Kappos L, et al. Diagnostic criteria for multiple sclerosis: 2005 revisions to the "McDonald Criteria". Ann Neurol 58, 840–846 (2005). doi: 10.1002/ana.20703
50. Gourraud PA, Lamiraux P, El-Kadhi N, Raffoux C, Cambon-Thomsen A. Inferred HLA haplotype information for donors from hematopoietic stem cells donor registries. Hum Immunol 2005;66:563–70. doi: 10.1016/j.humimm.2005.01.011
51. Gourraud PA, Khankhanian P, Cereb N, Yang SY, Feolo M, Maiers M, et al. HLA diversity in the 1000 genomes dataset. PLoS One 2014;9:e9782.
52. Delaneau O, Zagury JF, Marchini J (2012) Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods 10:5–6.
53. Delaneau O, Marchini J, Zagury JF (2011) A linear complexity phasing method for thousands of genomes. Nat Methods 9:179–81. doi: 10.1038/nmeth.1785
54. Howie B, Marchini J, Stephens M (2011) Genotype imputation with thousands of genomes. G3 (Bethesda) 1(6):457–470.
55. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Statist Soc B. (1995) 57:289–300.
56. Lydersen S, Fagerland MW, Laake P. Tutorials in biostatistics: Recommended tests for association in 2x2 tables. Statist Med. (2009) 28:1159–1175.
57. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38:904–909. doi: 10.1038/ng1847