American Journal of Human Genetics. 2025 Jul 10;112(8):1833–1851. doi: 10.1016/j.ajhg.2025.06.011

Haplotype analysis reveals pleiotropic disease associations in the HLA region

Courtney J Smith 1,2, Satu Strausz 1,2,3,4; FinnGen, Jeffrey P Spence 1, Hanna M Ollila 2,5,6,7, Jonathan K Pritchard 1,8,∗∗
PMCID: PMC12414721  PMID: 40645183

Summary

The human leukocyte antigen (HLA) region plays an important role in human health through its involvement in immune cell recognition and maturation. While genetic variation in the HLA region is associated with many diseases, the pleiotropic patterns of these associations have not been systematically investigated. Here, we developed a haplotype approach to investigate disease associations phenome wide for 412,181 Finnish individuals and 2,459 diseases. Across the 1,035 diseases with a genome-wide association study association, we found a 17-fold average per-SNP enrichment of hits in the HLA region. Altogether, we identified 7,649 HLA associations across 647 diseases, including 1,750 associations uncovered by haplotype analysis. We found that some haplotypes show both risk-increasing and protective associations across different diseases, while others consistently increase risk across diseases, indicating a complex pleiotropic landscape involving a range of diseases. This study highlights the extensive impact of HLA variation on disease risk and underscores the importance of classical and non-classical genes as well as non-coding variation.

Keywords: genomics, GWAS, HLA, phenome-wide, biobank, summary statistics, UKBB, FinnGen


The HLA region plays an important role in human health but is often excluded in large-scale GWASs. We performed haplotype-based associations from genotype data to investigate pleiotropy across thousands of diseases. This comprehensive phenome-wide disease association catalog serves as a resource for future studies of HLA and complex diseases.

Introduction

The major histocompatibility complex (MHC) plays a crucial role in mediating tissue graft compatibility and immune system recognition of pathogens and self.1,2,3 The human MHC, referred to as the human leukocyte antigen (HLA) region, has been found to be associated with numerous diseases.2,3,4,5,6,7,8 The tension between being able to recognize a diverse array of pathogens while avoiding autoimmunity suggests that variants within the HLA region may affect multiple distinct diseases simultaneously, yet little work has been done to characterize the patterns of pleiotropy across diseases within the region.

The HLA region is approximately 5 Mb in length and contains hundreds of genes but is most known for the classical HLA genes, which are involved in response to infection and autoimmunity.9 The classical HLA genes, which include class I genes (HLA-A, -B, and -C) and class II genes (HLA-DR, -DQ, and -DP), encode cell surface proteins that present peptides to immune cells resulting in activation and maturation.10

The classical HLA genes are highly polymorphic, with each gene having multiple distinct alleles. These alleles are functionally diverse: some act as generalists, and others are specific to particular types of peptides.11,12,13 Different HLA alleles vary in their ability to recognize certain pathogens, and thus genetic variation modulating this ability can result in a variety of disease associations.9,14 Meanwhile, some pathogens have evolved to avoid common HLA alleles in a host-pathogen arms race.15,16 This arms race has resulted in long-term balancing selection at classical HLA genes, leading to trans-species polymorphisms and extreme nucleotide diversity—more than 70 times the genome-wide average.17,18,19

At the individual level, this genetic variation in the classical HLA genes affects the ability of the immune system to detect pathogens, fight infections, and attack cancerous cells, as well as the ability to limit inappropriate immune responses, such as autoimmune diseases.2,3,4,5 Furthermore, genetic variation in the HLA region can influence the balance between these conflicting goals of pathogen response and the prevention of autoimmunity, resulting in potential risk trade-offs.20,21,22 On the other hand, the risk trade-offs between autoimmunity, infection, and other diseases can be more complicated, as demonstrated by Epstein-Barr virus (EBV) infection. Chronic EBV infection is known to cause various cancers, including nasopharyngeal carcinoma and Hodgkin lymphoma,23,24,25 and it has also been shown to play a role in the development of multiple sclerosis, a degenerative demyelinating disease of the central nervous system caused by immune-mediated inflammation.26,27 Previous studies have explored associations with these and other diseases using common genetic variants or HLA-allele association testing.8,28,29,30,31,32,33 These studies have identified both risk-increasing and protective disease associations with these common genetic variants and HLA alleles, suggesting that a systematic characterization across different types of genetic variants and haplotypes might provide additional insight.

Association studies have implicated particular HLA alleles in many diseases.6,7 These HLA association studies have provided countless biologically and clinically informative associations. For example, seronegative spondyloarthritis has been associated with the HLA-B27 allele family, type 1 diabetes with the HLA-DR3 allele family, and rheumatoid arthritis with the HLA-DR4 allele family.34,35 In addition to providing biological insight into disease mechanisms, these studies have resulted in the use of HLA allele associations in the clinical setting.36,37,38

While there has been much focus on protein-coding variation within the classical HLA genes, there has been less work characterizing the majority of the genetic variation in the region, which falls outside of the coding regions of the classical HLA genes. Disease-associated variants are typically presumed to be protein coding, affecting the peptide-binding groove of a classical HLA gene, but variation in regulatory regions may also be a major risk factor in a subset of diseases by influencing gene expression.39,40,41,42,43 Recent experimental studies have demonstrated that for some traits, regulatory variation in the region confers more risk than HLA coding variation.44 There is also evidence for disease associations with variation in non-HLA genes within the locus, including C4A,45 SLC44A4,46 and NOTCH4.47 Prior work has also shown that many diseases are associated with extended haplotypes in the HLA region.48,49,50,51,52,53 In fact, the HLA region has previously been delineated into four main genomic blocks, referred to as the alpha, beta, gamma, and delta blocks, with some haplotypes in these blocks being associated with disease risk.48,49,50,51,54,55,56 Therefore, investigation of genetic variation throughout the entire HLA region has the potential to reveal additional contributions beyond those found by HLA allele analysis alone.

Analyses of the HLA region in genome-wide association studies (GWASs) in large cohorts such as FinnGen,32 UK Biobank (UKBB),57 and Japan Biobank58 have identified many trait associations with single-nucleotide polymorphisms (SNPs) in the HLA region.59 These traits span a variety of systems, including infections such as HIV60 and hepatitis B,61 and autoimmune diseases, including neurological diseases (such as multiple sclerosis62), gastrointestinal diseases (such as celiac disease and inflammatory bowel disease63), and rheumatic diseases (such as systemic lupus erythematosus [SLE]64). These studies typically either investigate associations with many traits across the entire genome,8,30,47 treating the HLA region as just another locus, or they specifically focus on the HLA region but consider only a small number of traits at a time.65,66 However, to understand how genetic variation in the HLA region contributes to the complicated interplay between different disease risks, it is crucial to study associations for many traits simultaneously. These findings of extensive disease associations with HLA motivate a systematic characterization of the role of HLA loci in these disease associations at the phenome-wide scale.

In this study, we quantified how genetic variation and pleiotropy at the HLA region contribute to disease risk across a broad range of diseases. We analyzed data from 412,181 Finnish individuals for 2,459 diseases in FinnGen. We focused on understanding the spatial distribution of disease associations throughout the HLA region and the nature of pleiotropy between different diseases. FinnGen is particularly suited for this work because it has relatively high case counts for many diseases compared to other cohorts.67 We developed a haplotype-based approach to robustly characterize patterns of disease associations throughout the entire HLA region, including non-coding variation and variation outside of classical HLA genes. We applied our approach at a phenome-wide scale and evaluated the role of HLA in modulating risk across a broad range of diseases in the context of the full complexity and breadth of HLA genetic variation.

Subjects and methods

Samples and participants

The FinnGen study is a large-scale genomics initiative that has analyzed over 500,000 Finnish biobank samples and correlated genetic variation with health data to understand disease mechanisms and predispositions. The project is a collaboration between research organizations, biobanks within Finland, and international industry partners. Here, we used data from FinnGen Data Freeze 10, which comprises samples from 412,181 Finnish individuals, 21,311,942 variants, and 2,459 traits (Table S1). For consistency, the term “disease” is used to refer to all traits throughout the paper, including those that reflect an abnormal state of health but do not yet have an identifiable cause. All but three traits in FinnGen (height, weight, and body mass index) were binary traits. To focus on disease, we excluded the association study results for these three quantitative traits from all downstream analyses in the paper.

FinnGen identification of SNP associations

The summary statistics used in this study were generated using Regenie v.2.2.4 and the FinnGen Regenie pipeline.68 Current age or age at death, sex, genotyping chip, genetic relationship, and the first ten principal components of the genome-wide genotype matrix were included as covariates.67 Fine-mapping was performed using the SuSiE “sum of single effects” model,69 excluding the HLA region. Further details are available at https://www.finngen.fi/en. The full FinnGen R10 GWAS summary statistics from the genome-wide SNP associations used for the enrichment analyses are publicly available at the FinnGen browser https://r10.finngen.fi/ and can be downloaded from https://elomake.helsinki.fi/lomakkeet/124935/lomake.html.

Defining the HLA region

The HLA region was defined as 28,510,120–33,480,577 based on the Genome Reference Consortium Assembly GRCh38.p14 (hg38) (https://www.ncbi.nlm.nih.gov/grc/human/regions/MHC). Protein-coding genes were identified by overlapping FinnGen annotated genes with the protein-coding gene file from HGNC (https://www.genenames.org/download/statistics-and-files). The linkage disequilibrium (LD) plot represents LD as measured by r² and D for 41,183 SNPs covering the HLA region. This set of SNPs corresponds to the subset of the 41,234 SNPs (minor allele frequency [MAF] > 1%) within the HLA boundaries remaining after pruning with “plink --ld-window 999999 --ld-window-kb 1000 --ld-window-r2 0.1”.

Genotype data and imputation

HLA alleles were imputed using the HIBAG R library (v.1.18.1)70 and a Finnish population-specific HLA reference panel from 1,150 individuals as part of the earlier effort in the FinnGen study.32 The imputation genotype panel included approximately 4,500 SNPs within the MHC region (chr6:28.51–33.48 Mb [GRCh38/hg38]). HLA alleles with imputation posterior probabilities >0.5 were kept in the analysis.

The individuals in FinnGen were genotyped with Illumina and Affymetrix chip arrays (Illumina, San Diego and Thermo Fisher Scientific, Santa Clara, CA, USA). The chip genotype data were then imputed using the population-specific SISu v.4.2 imputation reference panel of 8,554 whole genomes (25× coverage). In total, 21,311,942 variants were imputed throughout the genome in reference assembly GRCh38/hg38.67

GWAS association processing

GWAS results were filtered to include all diseases with at least one hit in the HLA region with p < 10⁻⁶. LD score regression (LDSC)71 was used to generate genetic correlation estimates, with relevant eur_w_ld_chr files downloaded from https://data.broadinstitute.org/alkesgroup/LDSCORE/. To remove essentially redundant diseases, we further filtered to diseases with LDSC genetic correlation <0.95 with all remaining diseases. We filtered to the most significant SNP (MAF > 1%) in the HLA region for each of the remaining diseases. We then used stepwise forward conditional analysis with Plink2 (https://www.cog-genomics.org/plink/2.0/) for each disease to identify additional independent significant SNPs (MAF > 1%) in the HLA region with p < 10⁻⁶. The significance threshold of p < 10⁻⁶ was relaxed from the genome-wide significance threshold of 5 × 10⁻⁸ because here we consider only SNPs in the HLA region.

In the conditional analysis, we considered only unrelated individuals, reducing the sample size to 259,802. We adjusted for age, sex, and ten principal components of the genome-wide genotype matrix. Z scores were calculated from the GWAS results for the associations of the 428 hits in the HLA region with all the HLA-associated diseases. For visualizing effects across diseases, we normalized squared Z scores for each disease by the maximum Z² for that disease. The sign of each SNP’s effect was assigned such that the SNP had a positive median Z score across diseases.
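The sign assignment and normalization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `orient_and_normalize` and the nested-dictionary input layout are assumptions made for the example.

```python
from statistics import median

def orient_and_normalize(z_by_snp):
    """z_by_snp: {snp_id: {disease: Z}} (hypothetical layout).

    Returns (oriented Z scores, squared Z scores scaled by each disease's maximum).
    """
    # Flip each SNP's sign so that its median Z across diseases is non-negative.
    oriented = {}
    for snp, z_row in z_by_snp.items():
        sign = -1.0 if median(z_row.values()) < 0 else 1.0
        oriented[snp] = {d: sign * z for d, z in z_row.items()}

    # For each disease, find the largest Z^2 over all SNPs.
    diseases = {d for row in oriented.values() for d in row}
    max_z2 = {
        d: max(row[d] ** 2 for row in oriented.values() if d in row)
        for d in diseases
    }

    # Divide every Z^2 by that disease-specific maximum (values fall in [0, 1]).
    normalized = {
        snp: {d: (z ** 2 / max_z2[d] if max_z2[d] > 0 else 0.0) for d, z in row.items()}
        for snp, row in oriented.items()
    }
    return oriented, normalized
```

Flipping the sign changes which allele is treated as the effect allele but leaves Z² untouched, so the normalized values do not depend on the orientation step.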

Enrichment analyses

We performed two enrichment analyses. In one analysis, for the 1,035 diseases in FinnGen that had at least one genome-wide significant association anywhere in the genome, we identified independent genome-wide significant SNP associations for each disease. To do this, we used the fine-mapped GWAS summary statistics from FinnGen for SNPs outside the HLA region and conditional analysis for SNPs within the HLA boundaries. We then binned these SNPs into 100-kb bins and compared the number of independent significant SNP associations in each bin. We also repeated the analysis with window sizes of 50 kb, 200 kb, and 500 kb instead of 100 kb to determine the robustness of this result across window sizes.
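The binning step can be illustrated with a short sketch. The `(chromosome, position)` input format and the function name `count_hits_per_bin` are assumptions for the example; only the bin sizes come from the text.

```python
from collections import Counter

BIN_SIZE = 100_000  # 100-kb bins; the text also reports 50-, 200-, and 500-kb windows

def count_hits_per_bin(hits, bin_size=BIN_SIZE):
    """hits: iterable of (chromosome, base-pair position) for independent significant SNPs.

    Returns a Counter keyed by (chromosome, bin start position).
    """
    counts = Counter()
    for chrom, pos in hits:
        bin_start = (pos // bin_size) * bin_size  # left edge of the bin holding this SNP
        counts[(chrom, bin_start)] += 1
    return counts
```

Rerunning with `bin_size=50_000`, `200_000`, or `500_000` gives the counts needed for the robustness check across window sizes.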

In an additional enrichment analysis, diseases with at least one associated SNP (MAF > 1% and p < 10⁻⁶) anywhere in the genome were included and grouped into disease groups. A threshold of p < 10⁻⁶ was chosen for ascertaining SNP associations in the genome outside HLA to conservatively match the significance threshold used to identify significant associations in the HLA region via the method described above. Enrichment was calculated for each disease group by dividing the number of independent SNP associations per SNP in the HLA region by the number of independent SNP associations per SNP outside the HLA region. In FinnGen there are 41,234 SNPs within the boundaries of the HLA region and 9,727,032 SNPs in the rest of the genome. These numbers were determined by using all SNPs present in FinnGen and filtering to MAF > 1%. The HLA region was defined as 28,510,120–33,480,577, based on the Genome Reference Consortium Assembly GRCh38.p14.
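As a worked illustration of the per-SNP enrichment ratio, the sketch below uses the two SNP counts stated in the text as defaults. The function name and the handling of a zero denominator are assumptions for the example.

```python
N_HLA_SNPS = 41_234        # SNPs (MAF > 1%) inside the HLA boundaries, as stated in the text
N_OTHER_SNPS = 9_727_032   # SNPs (MAF > 1%) in the rest of the genome, as stated in the text

def per_snp_enrichment(hits_in_hla, hits_outside_hla,
                       n_hla=N_HLA_SNPS, n_other=N_OTHER_SNPS):
    """Ratio of (associations per SNP in HLA) to (associations per SNP outside HLA)."""
    if hits_outside_hla == 0:
        # Assumed convention: the ratio is undefined or unbounded without outside hits.
        return float("inf") if hits_in_hla > 0 else float("nan")
    return (hits_in_hla / n_hla) / (hits_outside_hla / n_other)
```

The same function covers other denominators: passing gene counts or region lengths in base pairs as `n_hla` and `n_other` gives a per-gene or per-base-pair ratio.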

To verify that this enrichment was not driven by SNP density, this process was also repeated using enrichment per gene and per base pair. The per-gene enrichment was calculated for each disease by dividing the number of independent SNP associations per protein-coding gene in the HLA region by the number of independent SNP associations per protein-coding gene outside the HLA region. The per-base-pair enrichment was calculated for each disease by dividing the number of independent SNP associations per base pair of the HLA region by the number of independent SNP associations per base pair of the genome outside the HLA region.

We also repeated both enrichment analyses described above using a list of the 644 diseases that remain after we filtered the 1,035 diseases included in the original analyses to randomly remove one disease for each pair with an LDSC genetic correlation >0.95.
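One way to drop one disease at random from each highly correlated pair is sketched below. The text does not specify the order in which pairs were visited or how chains of correlated diseases were handled, so the single pass in input order, the fixed seed, and the function name are all assumptions.

```python
import random

def prune_correlated(diseases, rg_pairs, threshold=0.95, seed=0):
    """diseases: list of disease names.
    rg_pairs: {(disease_a, disease_b): LDSC genetic correlation} (hypothetical layout).

    For each pair above the threshold where both diseases are still kept,
    remove one of the two at random. Returns the kept diseases in input order.
    """
    rng = random.Random(seed)  # fixed seed so the random choice is reproducible
    kept = set(diseases)
    for (a, b), rg in rg_pairs.items():
        if rg > threshold and a in kept and b in kept:
            kept.discard(rng.choice((a, b)))
    return [d for d in diseases if d in kept]
```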

Defining haplotype groups

Three regions (“blocks”) in the HLA region were selected based on the density of signal from the significant SNP associations, overlapping LD patterns, and functional relevance. The first block was defined as 100 kb below the start of the gene boundary of HLA-F to 100 kb past the end of the gene boundary of HLA-A, 29,622,820–30,045,616 (GRCh38.p14) and contained 5,022 SNPs. The second block was defined as 100 kb below the start of the gene boundary of HLA-C to 100 kb past the end of the gene boundary of MICB, 31,168,798–31,611,071 (GRCh38.p14) and contained 8,073 SNPs. The third block was defined as 100 kb below the start of the gene boundary of NOTCH4 to 100 kb past the end of the gene boundary of HLA-DQA2, 32,094,910–32,847,125 (GRCh38.p14) and contained 11,027 SNPs.

Each block was then subset down to 1,000 randomly selected biallelic SNPs with MAF > 1% due to computational constraints of the clustering process. Each individual’s two phased haplotypes at these 1,000 positions were identified. Haplotypes were clustered by first removing rare haplotypes (defined as <10 total copies across all participants), generating a dendrogram, and recursively splitting the dendrogram at each branchpoint from the root toward the tips until the total number of haplotypes below each node was less than the maximum threshold (defined as 80,000 copies or the maximum in a single haplotype, whichever was greater). Once the haplotype groups were identified, the rare haplotypes were then added to the group with which they clustered.
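The recursive splitting rule can be sketched on a toy tree. This illustrates only the stopping rule described above; how the dendrogram itself was built is not covered here, and the `Node` class, the function names, and the example counts are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str = ""                                  # haplotype label (leaves only)
    copies: int = 0                                 # copies of this haplotype (leaves only)
    children: list = field(default_factory=list)    # empty for a leaf

def leaves(node):
    """All leaf haplotypes below a node."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]

def split_into_groups(root, base_cap=80_000):
    """Walk from the root toward the tips, cutting until each group is below the cap."""
    # Cap: 80,000 copies or the most common single haplotype, whichever is greater.
    cap = max(base_cap, max(leaf.copies for leaf in leaves(root)))
    groups = []

    def visit(node):
        members = leaves(node)
        total = sum(leaf.copies for leaf in members)
        # Stop when the subtree is small enough, or at a leaf that cannot be split further.
        if total < cap or not node.children:
            groups.append([leaf.name for leaf in members])
        else:
            for child in node.children:
                visit(child)

    visit(root)
    return groups
```

Taking the larger of the two caps means a single very common haplotype can still form its own group instead of forcing a split that is not possible.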

Performing haplotype regression analysis

Logistic regression was then performed separately for each block for each of the 269 diseases with at least one SNP association in the HLA region for all haplotype groups, leaving out the haplotype group with the highest frequency. Sex, age, and the first ten principal components of the genome-wide genotype matrix were included as covariates. The haplotype group left out was then set to 0, and the Z scores of the regression results were then rescaled for each disease to have a mean of 0. A significance threshold of |Z| > 4 was chosen based approximately on the Bonferroni correction for the number of regressions (one for each of the 269 diseases) for each block at a significance level of 0.05.

In a follow-up analysis, we additionally performed haplotype regression analysis for all diseases regardless of whether there was a GWAS hit in the HLA region for that disease, and for these regressions we applied a more stringent significance threshold of p < 6.7 × 10⁻⁶ to account for the additional diseases tested (2,459 diseases × 3 blocks).
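The arithmetic behind these thresholds can be checked with a few lines using only the standard library. The helper names are illustrative, and a two-sided test is assumed.

```python
from statistics import NormalDist

def bonferroni_p(alpha, n_tests):
    """Per-test p-value threshold under a Bonferroni correction."""
    return alpha / n_tests

def two_sided_z(p):
    """|Z| cutoff corresponding to a two-sided p-value threshold."""
    return NormalDist().inv_cdf(1 - p / 2)

# Main analysis: one regression per disease within a block (269 diseases).
p_main = bonferroni_p(0.05, 269)      # about 1.9e-4
z_main = two_sided_z(p_main)          # about 3.7, which the text rounds up to |Z| > 4

# Follow-up analysis: all diseases across the three blocks.
p_all = bonferroni_p(0.05, 2459 * 3)  # about 6.8e-6; the text quotes 6.7e-6
```

Both reported cutoffs are slightly more stringent than the exact Bonferroni values, which fits the text's description of the |Z| > 4 threshold as approximate.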

Permutation analysis

We performed permutation analysis to identify the false-positive rate for the main haplotype group regression analysis. We did this two ways: in permutation analysis 1 we repeated the analysis after randomizing the haplotype group assignments, and in permutation analysis 2 we repeated the analysis after randomizing the disease assignments. In total, we permuted the data 100 times (50 of each scrambling approach) and performed haplotype group associations in each of the three blocks with the 269 diseases for a total of 2,609,300 haplotype group-disease associations tested. From this analysis, there were nine significant haplotype group-disease associations (p < 1e−6), resulting in an estimated false-positive rate of 3.4e−6, close to the nominal level of 1e−6.
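The false-positive rate quoted above follows directly from the reported counts. The sketch below reproduces that arithmetic and shows a generic label shuffle of the kind used in both permutation schemes; the function names and the seeding are assumptions for the example.

```python
import random

def permute_labels(labels, seed):
    """Return a shuffled copy of labels (haplotype group or disease assignments)."""
    rng = random.Random(seed)  # seeded so each permutation is reproducible
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

def false_positive_rate(n_significant, n_tests):
    """Fraction of permuted tests that pass the significance threshold."""
    return n_significant / n_tests

# Counts reported in the text: 9 significant results among 2,609,300 permuted tests.
fpr = false_positive_rate(9, 2_609_300)  # about 3.4e-6
```

Shuffling haplotype group assignments breaks the link between genotype and every disease at once, while shuffling disease assignments breaks it one phenotype at a time; both should leave only chance associations.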

When comparing the permutation analysis results for rare associations (defined as the 25% most rare traits and haplotype groups) to those that are non-rare associations (associations without either the rare haplotype groups or rare traits), the genomic inflation factor (lambda) is close to or less than 1 for both and often even closer to 0 for the rare associations, indicating that there is not a trend of inflated p values for the rare associations. Specifically, the inflation factor for block 1 is 0.04 for rare associations and 0.95 for non-rare associations, for block 2 is 0.98 for rare associations and 0.99 for non-rare associations, and for block 3 is 0.98 for rare associations and 1.0 for non-rare associations.

We also used permutation analysis to determine the false-positive rate for the analysis identifying haplotype groups with at least one positive and one negative disease association. As in the original analyses, our permutation testing included all haplotype groups with significant associations regardless of haplotype group size. We found that of the 2,609,300 haplotype group-disease associations tested across permutations, none of the haplotype groups that had at least one significant disease association (p < 1e−6) also had a second nominal disease association (p < 0.001) with an effect in the opposite direction, thus corresponding to a false-positive rate of less than 1e−6. This is in contrast to the findings using the unpermuted data, where we found that 35 haplotype groups had a significant disease association in one direction and at least one nominal disease association in the other direction, for a total of 349 positive haplotype group-disease associations and 242 negative associations.

Repeating haplotype regression analysis with varied SNP inputs

To determine the robustness of the main haplotype group association results, we performed this process again while varying the SNPs used to define the haplotypes. For these additional analyses, we repeated the full pipeline ten times using the same SNP selection criteria as in the original analysis (1,000 biallelic SNPs with MAF > 1%) to define haplotypes for each block. We then also repeated the process while varying the MAF cutoff for SNP selection, first with a set of 1,000 SNPs with MAF > 5%, then again for 1,000 SNPs where half had MAF > 5% and half had MAF > 1%, and finally again for 1,000 SNPs with MAF between 1% and 5%. We then re-ran our entire pipeline including clustering haplotypes into haplotype groups, performing haplotype group association analysis, and visualizing the processed results as a heatmap analogous to that in Figure 5 made from the original haplotype group analysis. While varying the SNPs used to define the haplotypes means that the haplotypes are not exactly the same between the additional analyses, qualitatively many of the same diseases were found to be associated with the haplotypes from a given block regardless of the input SNPs used to define them.

Figure 5. Haplotype group regression results

A dendrogram showing the clustering of the 40 most frequent haplotypes per haplotype group, with white representing the reference allele and black representing the effect allele. Genes are labeled below the corresponding SNPs overlapping their genome position, indicating which are within gene boundaries and which are intergenic. Heatmap showing the Z scores from the haplotype group regression analysis across associated diseases for (A) block 1, (B) block 2, and (C) block 3, including all diseases with at least one association |Z| > 4 in that block, and all haplotype groups with at least one disease association or total copies greater than the minimum cutoff of 20,000 copies. For visualization purposes, diseases are clustered, and Z scores were set to a maximum |Z| of 5. MS, multiple sclerosis; CTDs, connective tissue diseases; SLE, systemic lupus erythematosus; RA, rheumatoid arthritis; Rx, medical prescription; T1D, type 1 diabetes; CKD, chronic kidney disease; STD, sexually transmitted disease. See Table S1 for the full description of all diseases.

Analysis of haplotype regression results

Subsequent analyses investigating patterns of pleiotropy of these haplotype groups focused on only the subset of diseases with at least one association |Z| > 4 in that block and the subset of haplotype groups with at least one disease association or total copies greater than the minimum cutoff of 20,000 copies. For these analyses, a significance threshold of |Z| > 3 was chosen based on the Bonferroni correction for the number of regressions (41 diseases for block 1, 46 for block 2, and 36 for block 3) for each block at a significance level of 0.05. See “permutation analysis” for further details on the selection of the significance threshold.

To calculate the overall disease burden proportion for each haplotype group, we defined the set of relevant diseases for each block as any disease that was significantly associated with at least one of the haplotype groups in that block. Then for each haplotype group in a given block, when considering all haplotype groups included in Figure 5, we identified the proportion of individuals in the haplotype group that had a diagnosis of at least one of the block’s relevant diseases. An individual was considered to be in a haplotype group if they were a carrier for at least one haplotype in the haplotype group. To identify the overall disease proportion as a baseline comparison, for each block we identified the proportion of all 412,181 individuals who had a diagnosis of at least one of the block’s relevant diseases. To evaluate the heterogeneity in disease burden across haplotype groups in the three blocks, we performed a chi-squared test of homogeneity jointly comparing, for each haplotype group, the number of people with at least one of the block’s associated diseases to the number of people without any of the block’s diseases.
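A chi-squared test of homogeneity on a 2 × k table of affected and unaffected carriers can be computed as in the sketch below. It returns only the statistic and degrees of freedom, since a p-value would need a chi-squared distribution function from a statistics library. The function name and the example counts are assumptions. Because an individual can carry haplotypes from two groups, treating the groups as independent columns is a simplification made for this example.

```python
def chi2_homogeneity(affected, unaffected):
    """affected[i], unaffected[i]: carrier counts for haplotype group i.

    Returns (chi-squared statistic, degrees of freedom) for a 2 x k table.
    """
    col_totals = [a + u for a, u in zip(affected, unaffected)]
    total_affected = sum(affected)
    total_unaffected = sum(unaffected)
    n = total_affected + total_unaffected

    stat = 0.0
    for a, u, col in zip(affected, unaffected, col_totals):
        # Expected counts if every group shared the same disease proportion.
        exp_a = col * total_affected / n
        exp_u = col * total_unaffected / n
        stat += (a - exp_a) ** 2 / exp_a + (u - exp_u) ** 2 / exp_u

    dof = len(affected) - 1  # (2 rows - 1) * (k columns - 1)
    return stat, dof
```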

Pearson’s correlation across haplotype associations

We computed the Pearson’s correlations for disease pairs across haplotype associations for each block. Specifically, for two diseases, we correlated the vector of Z scores for the associations between each haplotype group and the first disease with the vector of Z scores for the associations between each haplotype group and the second disease. In essence, we are leveraging the multi-allelic nature of the locus to determine whether a haplotype that has an effect on one disease is likely to have an effect on the other disease in the same direction. Since the correlation is effectively across the heatmap for the haplotype group association results, this should account for LD to the extent that the analysis does. This approach is subtly different from local genetic correlation, which relates to individual variants within the locus.
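A minimal sketch of this correlation is shown below, where each disease is represented by its vector of Z scores across the same ordered set of haplotype groups. The function name and the example vectors are illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length vectors of Z scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical Z scores for two diseases across four haplotype groups in one block.
z_disease_1 = [4.2, -1.0, 0.3, -3.5]
z_disease_2 = [3.8, -0.5, 0.1, -2.9]
r = pearson(z_disease_1, z_disease_2)  # strongly positive: shared direction of effect
```

A positive value means haplotype groups that raise risk for one disease tend to raise it for the other; a negative value points to opposing effects.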

Haplotype regression analysis conditioned on HLA alleles

To determine the extent to which the haplotype group signal remained after adjusting for the classical HLA alleles, we re-ran the haplotype group regressions while adjusting for the HLA alleles in each block (frequency >1% and variance inflation factor <5). A variance inflation factor (VIF) of less than 5 was used to remove issues of multi-collinearity. To do this, the regression was first performed with all HLA alleles and then, if any had VIF > 5, they were iteratively removed until all had VIF < 5. The remaining HLA alleles were then included in the regression with the haplotype groups for that block. We performed Firth’s bias-reduced logistic regression for all haplotype groups and alleles for each block and each disease using logistf (https://cran.r-project.org/web/packages/logistf/index.html). We then compared the Z scores from the regression before and after adjusting for the alleles, using |Z| > 4 for the significance threshold. A significance threshold of |Z| > 4 was chosen based approximately on the Bonferroni correction for the number of regressions (one for each of the 269 diseases) for each block at a significance level of 0.05.
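The iterative VIF filter can be sketched with NumPy. The text does not state which allele was removed at each step, so dropping the allele with the largest VIF first is an assumption, as are the function names. The sketch uses the standard identity that the VIFs are the diagonal of the inverse correlation matrix of the predictors.

```python
import numpy as np

def vif(X):
    """VIF for each column of X: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def drop_high_vif(X, names, max_vif=5.0):
    """Remove columns one at a time until every remaining VIF is at most max_vif.

    X: (individuals x alleles) dosage matrix; names: allele labels per column.
    Assumption: the column with the largest VIF is dropped at each step.
    """
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        values = vif(X)
        worst = int(np.argmax(values))
        if values[worst] <= max_vif:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names
```

Perfectly collinear columns make the correlation matrix singular, so a real pipeline would need to handle that case before inverting.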

Specifically, classical HLA alleles with frequency >1% in the dataset within the respective block were included in that block’s regression analysis. In block 1, the classical HLA alleles included were HLA-A*01:01, HLA-A*25:01, HLA-A*26:01, HLA-A*31:01, HLA-A*32:01, and HLA-A*68:01.

In block 2, the classical HLA alleles included were HLA-B*07:02, HLA-B*13:02, HLA-B*15:01, HLA-B*18:01, HLA-B*27:05, HLA-B*37:01, HLA-B*38:01, HLA-B*39:01, HLA-B*40:01, HLA-B*40:02, HLA-B*44:02, HLA-B*44:03, HLA-B*44:27, HLA-B*47:01, HLA-B*51:01, HLA-B*56:01, HLA-B*57:01, HLA-C*02:02, HLA-C*03:03, HLA-C*04:01, HLA-C*05:01, HLA-C*07:04, HLA-C*12:03, HLA-C*14:02, and HLA-C*15:02.

In block 3, the classical HLA alleles included were HLA-DRB3*02:02, HLA-DRB4*01:01, HLA-DRB4*01:03N, HLA-DRB5*02:02, HLA-DRB1*04:01, HLA-DRB1*04:04, HLA-DRB1*04:08, HLA-DRB1*08:01, HLA-DRB1*09:01, HLA-DRB1*11:01, HLA-DRB1*12:01, HLA-DRB1*14:54, HLA-DQA1*01:05, HLA-DQA1*03:03, HLA-DQB1*02:02, HLA-DQB1*03:01, HLA-DQB1*05:03, HLA-DQB1*06:02, HLA-DQB1*06:03, and HLA-DQB1*06:04.

In addition, we performed haplotype group regression analysis specifically on individuals without the HLA-B*27:05 allele because it has well-known strong associations with many of the relevant diseases. The main haplotype group analysis on the 269 HLA SNP-associated diseases was repeated on just the 349,950 individuals negative for the HLA-B*27:05 allele using the same haplotype groups.

Allele regression analysis

We also performed allele associations on all diseases, regardless of whether there was a GWAS hit in the HLA region, using two approaches with sex, age, and ten principal components included as covariates. For the first approach, we performed logistic regression separately for each block and each disease with one allele included in each regression, with a significance threshold of p < 2 × 10⁻⁷. This threshold was chosen to account for the additional diseases tested (2,459 diseases × 98 alleles). For the second approach, we modeled all alleles within a block together jointly after we iteratively removed one regression variable at a time until all remaining had variance inflation factor <5 to minimize issues of multi-collinearity, and we applied a significance threshold of p < 6.7 × 10⁻⁶. This threshold was chosen to account for the additional diseases tested (2,459 diseases × 3 blocks).

UK Biobank haplotype-disease association analysis

The UKBB contains approximately 500,000 individuals of mainly European ancestry who were recruited to the study between 2006 and 2010, were aged between 37 and 73 years, and were residents of the United Kingdom. The UKBB combines lifestyle measures, genotypes, electronic health record data, blood count data, and questionnaire data, and the health record data are updated frequently to capture the health trajectories of participating individuals.

To replicate the haplotype association method in the UKBB, we utilized disease diagnosis data and phased genotype data to perform haplotype analysis in the HLA region for 337,138 unrelated white British individuals. Phenotypes were defined using hospital inpatient records and self-reported disease status based on prior work.72 The same boundaries of the HLA blocks used in the FinnGen analysis were used after lifting over from hg38 to hg19: chr6:29590597–30013393, chr6:31136575–31578848, and chr6:32062687–32814902 for blocks 1–3, respectively. We included 500 randomly sampled variants per haplotype in block 1 (of the 632 phased variants within the block’s boundaries) and 1,000 randomly sampled variants per haplotype in blocks 2 and 3 (of the 1,153 and 1,265 variants, respectively) with MAF > 1%.

Haplotypes were clustered by first removing rare haplotypes (defined as <10 total copies across all participants), generating a dendrogram, and recursively splitting the dendrogram at each branchpoint from the root toward the tips until the total number of haplotypes below each node was less than the maximum threshold (defined as 80,000 copies or the maximum in a single haplotype, whichever was greater) but not smaller than the minimum threshold (defined as 1,000 copies). Once the haplotype groups were identified, the rare haplotypes were then added to the group with which they clustered.

Logistic regression was then performed separately for each block for all haplotype groups, leaving out the haplotype group with the highest frequency. Sex, age, and the first ten principal components of the genome-wide genotype matrix were included as covariates. As in the FinnGen main haplotype association analysis, a significance threshold of |Z| > 4 was used.
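
The predictors for such a regression can be sketched as per-individual dosages (0, 1, or 2 copies) of each haplotype group, with the most frequent group dropped as the reference. The input format and function name below are hypothetical. The covariates and the logistic fit itself are omitted.

```python
from collections import Counter

def haplotype_dosages(individuals):
    """Build dosage columns for every haplotype group except the most frequent.

    `individuals` is a list of (group_a, group_b) pairs, one haplotype group
    label per chromosome copy. Returns (reference, columns, rows).
    """
    counts = Counter(g for pair in individuals for g in pair)
    reference = counts.most_common(1)[0][0]           # highest-frequency group
    columns = sorted(g for g in counts if g != reference)
    rows = [[pair.count(g) for g in columns] for pair in individuals]
    return reference, columns, rows

# Made-up haplotype group labels for four individuals.
individuals = [("g1", "g1"), ("g1", "g2"), ("g2", "g3"), ("g1", "g3")]
reference, columns, rows = haplotype_dosages(individuals)
print(reference, columns, rows)
```

Dropping one group avoids perfect collinearity, because the dosages across all groups sum to two for every individual. Each fitted effect is then interpreted relative to the reference group.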

We also performed UKBB replication analyses by mapping the UKBB haplotypes onto the haplotype groups defined in FinnGen. Specifically, when we overlapped the SNPs used to define the FinnGen haplotypes with those present in the UKBB phased data, there were 112 SNPs for block 1, 133 SNPs for block 2, and 109 SNPs for block 3. We extracted UKBB haplotypes using these SNPs and filtered to individual UKBB haplotypes with >5 total doses, then assigned the UKBB haplotypes to the FinnGen haplotype groups based on the FinnGen haplotype group assignment of the individual FinnGen haplotype (after filtering to all FinnGen haplotypes with total doses >10 due to FinnGen privacy policy) with the minimum Hamming distance away from the individual UKBB haplotype. We then ran the UKBB haplotype-disease associations for all haplotype groups with greater than 1,000 total doses in the UKBB data. To analyze the results, we identified 15 traits present in both the UKBB and FinnGen that were significantly associated with at least one haplotype group in FinnGen. We compared the Z scores for the haplotype group-disease associations for the haplotype groups included in both the FinnGen and UKBB analyses.
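
The nearest-haplotype assignment can be illustrated with a minimal sketch. Haplotypes are represented here as strings of 0/1 alleles over the shared SNPs; the representation and the tie-breaking rule are assumptions, as are the function names.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same SNPs")
    return sum(x != y for x, y in zip(a, b))

def assign_group(ukbb_hap, finngen_haps):
    """Return the group of the FinnGen haplotype nearest to ukbb_hap.

    `finngen_haps` maps haplotype string -> FinnGen haplotype group label.
    Ties go to the first haplotype encountered (an arbitrary choice).
    """
    nearest = min(finngen_haps, key=lambda hap: hamming(ukbb_hap, hap))
    return finngen_haps[nearest]

# Toy five-SNP haplotypes; real haplotypes span 109 to 133 shared SNPs per block.
finngen = {"00110": "group_1", "11001": "group_2", "00111": "group_1"}
print(assign_group("01110", finngen))
```

Here the query haplotype differs from `00110` at a single site, so it inherits that haplotype's group.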

Ethics statement

Participants in FinnGen provided informed consent for biobank research based on the Finnish Biobank Act (see supplemental methods).

The North West Multi-centre Research Ethics Committee (MREC) has granted the Research Tissue Bank (RTB) approval for the UKBB that covers the collection and distribution of data and samples (http://www.ukbiobank.ac.uk/ethics/). Our work was performed under the UKBB application number 24983. All participants included in the conducted analyses have given written consent to participate.

Results

Enrichment of significant disease associations in the HLA region

To identify disease associations with genetic loci throughout the entire HLA region, we analyzed data from 412,181 Finnish individuals and 2,459 diseases (Figure 1). We used fine-mapped GWAS summary statistics released by FinnGen as well as new association data that we generated at the level of individual phased haplotypes and HLA alleles. We corrected for sex, age, and the first ten principal components of the genome-wide genotype matrix (see subjects and methods). Results from these association tests were used in subsequent analyses.

Figure 1.


Study overview

(A) An overview of the HLA region showing the nearest genes to disease-associated SNPs, colored by HLA class, spanning approximately 5 Mb.

(B) An overview of the study data and design.

While the importance of HLA variation in disease has been well established, we first sought to systematically quantify the enrichment of association signals across diseases, focusing on how enrichment varies by disease type. We considered the 1,035 diseases in FinnGen that had at least one genome-wide significant association anywhere in the genome (Table S1). We then identified independent genome-wide significant SNP associations for each disease and binned these SNPs into 100-kb bins (Figure 2A). We found the mean number of significant associations per bin was 2.75, with a median of 1. One of the bins on chromosome 6 that overlaps the class II region of the HLA region had the highest number of associations in a single bin, with 282 associations. Five of the six bins with the most associations overlapped the HLA region. The remaining bin is on chromosome 19 and has 101 associations. This bin contains an apolipoprotein gene cluster including APOE, APOC1, APOC2, and APOC4, which are involved in lipid metabolism and affect Alzheimer disease risk. We then repeated the analysis, varying the window size from 100 kb to 50 kb, 200 kb, and 500 kb, to determine the robustness of these results across window sizes. As before, the bin with the highest number of associations overlapped the HLA region across all of these, with 171, 343, and 405 associations in the top bin for each analysis, respectively (Figure S1). In addition, across each of these window sizes, at least four of the five bins with the most associations overlapped the HLA region. These results show that the HLA region harbors a higher density of disease associations than the rest of the genome.
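
The binning step amounts to counting associations per fixed-width genomic window. A minimal sketch follows; the positions are invented for illustration and the function name is hypothetical.

```python
from collections import Counter

def bin_associations(hits, bin_size=100_000):
    """Count associations per (chromosome, bin index) for a fixed bin width."""
    return Counter((chrom, pos // bin_size) for chrom, pos in hits)

# Hypothetical (chromosome, base-pair position) pairs, not real FinnGen hits.
hits = [("chr6", 32_410_000), ("chr6", 32_455_000), ("chr6", 32_610_000),
        ("chr19", 44_905_000)]

# Repeat across window sizes to check that the top bin is stable.
for size in (50_000, 100_000, 200_000, 500_000):
    top_bin, n_hits = bin_associations(hits, size).most_common(1)[0]
    print(size, top_bin, n_hits)
```

Rerunning the count with several window sizes, as in the loop, mirrors the robustness check described above.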

Figure 2.


Distribution of GWAS hits across the genome and disease group enrichment

(A) Distribution of fine-mapped GWAS hits throughout the genome across 1,035 FinnGen diseases, binned into 100-kb bins.

(B) Enrichment of association signal in the HLA region by disease group. The 1,035 diseases were categorized into 45 disease groups based on ICD codes, and the average per-SNP enrichment in the HLA region was calculated by comparing the number of independent associations in the HLA region relative to that in the rest of the genome.

(C) Classification of diseases with at least one significant SNP association in the HLA region by shared pathophysiology.

While the role of HLA in infectious disease and autoimmunity is well established, its role in other disease types is less clear. As such, while in the prior analysis we calculated the total number of fine-mapped GWAS variants across diseases in each 100-kb bin across the genome and looked at which parts of the genome contained the highest total number of GWAS variants, we next sought to quantify the average enrichment of GWAS signal in HLA relative to the rest of the genome overall and by disease group. To do this, we compared the number of independent associations inside the HLA region to the number in the rest of the genome, after adjusting for the number of SNPs in the HLA region relative to the number of SNPs in the rest of the genome (Table S2).
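
The per-SNP enrichment described here is a ratio of association densities. The sketch below shows the calculation for a single disease or disease group, using invented counts. How diseases with no associations outside the HLA region are handled, and how values are averaged across diseases, is not specified in this passage, so those parts are left out.

```python
def per_snp_enrichment(hits_hla, snps_hla, hits_rest, snps_rest):
    """Ratio of association density (hits per SNP) inside vs. outside the HLA region."""
    if min(snps_hla, snps_rest) <= 0 or hits_rest <= 0:
        raise ValueError("SNP counts and outside-HLA hits must be positive")
    return (hits_hla / snps_hla) / (hits_rest / snps_rest)

# Hypothetical counts chosen for illustration only.
example = per_snp_enrichment(hits_hla=4, snps_hla=40_000,
                             hits_rest=100, snps_rest=20_000_000)
print(example)
```

With these made-up inputs, the HLA region carries about 20 times as many associations per SNP as the rest of the genome.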

Overall, we found a 17× enrichment in the HLA region relative to the rest of the genome averaged across all 1,035 diseases that had at least one GWAS association anywhere in the genome. To quantify the enrichment of association signal stratified by disease groups, we classified the 1,035 diseases that had at least one GWAS association into 45 disease categories based on ICD codes (Table S1). We then calculated the average per-SNP enrichment of association signals for each disease category. The individual disease category with the highest enrichment was the infectious disease group, with a 396× enrichment relative to the rest of the genome (Figure 2B). The overall enrichment across all diseases remained relatively unchanged (16.6×) even after excluding all infectious diseases. In addition, most other disease groups also showed a major enrichment in the HLA region, including dental diseases (71× enriched), dermatologic diseases (63× enriched), rheumatic diseases (53× enriched), hematologic diseases (50× enriched), and ear diseases (45× enriched). In contrast, the congenital group was the only group not enriched in the HLA region. This could be because the diseases in the congenital group are oligogenic, with an average of 2.2 hits outside the HLA region and none within the HLA locus. The most enriched disease groups showed enrichment for primarily two reasons (Figure S2). First, some diseases had high enrichment because they had many associations across the genome, with proportionately even more associations in the HLA region, such as rheumatic diseases. Alternatively, a subset of the enriched diseases did not have many associations overall, but the few associations they had were in the HLA region, such as infectious diseases.

Some traits in FinnGen are very similar, so we re-ran these enrichment analyses on the subset of 644 traits with LDSC genetic correlation <0.95, and the results were essentially identical (Figure S3, Table S2, and Note S1). Further, to ensure that our results for the average disease group enrichments were robust and not driven by the unusually high gene density or by differences in genotype array coverage of the HLA region, we repeated our analyses to quantify the SNP enrichments first, relative to the number of protein-coding genes in the HLA region versus the number in the rest of the genome, and second, relative to the size of the HLA region compared to the rest of the genome. The results were qualitatively consistent, differing by factors of 0.48× and 2.4×, respectively. Overall, these results emphasize the involvement of the HLA region in a broad range of disease groups, including those from a variety of different pathologic mechanisms and organ systems.

To understand how the HLA region contributes to disease mechanisms, we next examined diseases that had associations within the locus (N = 572 diseases). To remove essentially redundant diseases, we focused on the subset of these diseases that had LDSC genetic correlation ≤0.95, which included 269 diseases. We then used forward stepwise regression to identify conditionally independent SNP associations for each disease (Table S3). This resulted in 428 unique SNPs with disease associations (MAF > 1%, p < 10⁻⁶).

Classifying disease categories by ICD code, as was done in the enrichment analysis above, primarily results in anatomical groups as opposed to groups based on shared pathophysiology. To understand the contribution of HLA to biological disease mechanisms, we manually classified the 269 HLA-associated diseases based on pathophysiology (Figure 2C and Table S1). For diseases in which the underlying mechanism is unknown or ambiguous, we classified by the organ system affected.

We calculated the number of diseases in each of these disease categories that had at least one significant association in the HLA region (Figure 2C). Two of the top disease categories were rheumatic (40 diseases) and infectious (38 diseases). In contrast to the enrichment analysis, additional multi-system disease groups beyond rheumatic and infectious diseases were well represented, including autoimmune (27 diseases) and cardiometabolic (27 diseases).

Pleiotropy and spatial structure of significant SNP association signal within the HLA region

We aimed to evaluate the spatial distribution of the significant SNP association signal across the HLA region. We first categorized the associations by assigning each variant to its nearest gene (Figure 3A). We observed association signals throughout the extended HLA region with the highest density of associations near the 12 classical HLA genes, particularly the class II genes. However, associations were spread broadly across the region, with a total of 75 genes that were the nearest gene for at least one association, 59 of which were non-HLA genes. Overall, the associations were spread relatively consistently across disease groups, although the autoimmune and rheumatic diseases had slightly higher signal near the class I genes than the other disease groups, likely driven at least in part by the well-known associations of HLA-B alleles with rheumatic diseases73,74 (Figure S4).

Figure 3.


Pleiotropic structure of the HLA region

(A) Distribution of significant SNP associations across the HLA region, binned by nearest gene. Each bar represents a different gene, and the width corresponds to the length of the gene boundaries.

(B) Heatmap of normalized Z scores for the 428 variants in the HLA region significantly associated with at least one disease. The x axis has SNPs projected on a pseudo-genome position scale, roughly proportional to the genome position of the variant, and the y axis corresponds to the HLA-associated diseases. Associations with all HLA-associated diseases are shown for all variants that had an independent significant association with at least one disease. The three blocks used in subsequent analysis are circled, and a schematic labeling key genes in and between each block is included. Simplified names for key diseases or groups of diseases are labeled in the schematic on the y axis, with the length of the bar roughly proportional to the number of diseases the label represents. A full description of all diseases can be found in Table S1.

(C) Linkage disequilibrium as measured by r² and D of the approximately 40,000 SNPs covering the HLA region (MAF > 1%).

We next evaluated the role of genetic variation in the HLA region in contributing both risk-increasing and protective effects across diseases. We calculated normalized Z scores for each association discovered in the forward stepwise analysis (sign(Z) × Z² / (max Z² of diseases); see subjects and methods) and visualized how these association signals were spread across the locus (Figure 3B). We found that 99% of the associations were also significant (p < 10⁻⁶) for one or more diseases beyond the disease for which they were identified as a conditionally independent significant association. Moreover, we found variants that significantly increased the risk for one disease while significantly decreasing risk for another disease.
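
The normalization keeps the sign of each Z score while rescaling its square to the range from −1 to 1. In the sketch below the maximum is taken over whatever set of Z scores is passed in. Which set that should be (for example, all variants for one disease) is an assumption here and is defined in subjects and methods.

```python
import math

def normalized_z(z_scores):
    """Compute sign(Z) * Z^2 / max(Z^2) over one set of Z scores."""
    max_sq = max(z * z for z in z_scores)
    if max_sq == 0:
        return [0.0 for _ in z_scores]
    # copysign attaches the sign of z to the non-negative value z * z.
    return [math.copysign(z * z, z) / max_sq for z in z_scores]

scaled = normalized_z([5.0, -10.0, 2.0])
print(scaled)
```

Squaring emphasizes the strongest associations, so a Z score half as large as the maximum maps to 0.25 rather than 0.5.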

The normalized Z scores visually clustered around three main genomic regions within the HLA locus. The first cluster spanned two non-classical and one class I HLA gene (HLA-F, HLA-G, and HLA-A). The second spanned two class I HLA genes and one non-HLA gene (HLA-C, HLA-B, and MICA). The third spanned one non-HLA gene and two sets of class II HLA genes (NOTCH4, HLA-DR, and HLA-DQ).

The overall pleiotropic structure revealed large blocks of SNPs spanning hundreds of kilobases that have similar effects across diseases. These encompass multiple genes and likely arise due to the high gene density and the extensive LD in the region (Figures 3 and S5).

Pleiotropic disease associations at the haplotype level

The HLA region is particularly challenging for standard association studies because of its strong LD, multi-allelic sites, and large-effect coding variants within the classical HLA genes. It has four main genomic blocks, referred to as the alpha, beta, gamma, and delta blocks.48,49,50,51,54,55,56 Motivated by the block-like structure of the HLA locus (Figures 3B, S6, and S7), we developed an approach to explore pleiotropy at the haplotype level from genotype data, with haplotype blocks spanning multiple genes and including non-classical HLA, non-HLA, and non-coding regions.

The three main regions (“blocks”) described above were selected based on the density of signal from the significant SNP associations, overlapping LD patterns, and functional relevance. These blocks essentially recapitulate the alpha, beta, and delta blocks, first reported decades ago, with the alpha block containing HLA-A, the beta block containing HLA-C and HLA-B, and the delta block containing HLA-DR and HLA-DQ.48,49,51,54,55,56 This demonstrates the ability of genotype-based genetic association studies to recover these blocks in a data-driven way. We defined haplotypes for each of the three regions by the phased genotypes at 1,000 randomly selected biallelic SNPs with MAF > 1% (Figure 4; see subjects and methods for additional details). We then clustered related haplotypes into groups (Table S4 and Note S2) and for each block performed association analyses between the haplotype groups and the 269 HLA-associated diseases.

Figure 4.


Haplotype group regression analysis pipeline

Overview of the pipeline for identifying the haplotype groups for each of the three blocks in the HLA region and performing disease associations. For each block, all unique phased combinations of nucleotides at 1,000 randomly selected SNPs were considered as haplotypes. We then clustered related haplotypes into groups by recursively splitting the dendrogram at each branchpoint (see subjects and methods). Finally, for each of the three blocks, we performed association analyses between the haplotype groups and the 269 HLA-associated diseases, including all haplotype groups for a given block except the most frequent in each regression, as well as sex, age, and the first ten principal components of the genome-wide genotype matrix as covariates.

We discovered 469 significant disease-haplotype group associations (|Z| > 4) across blocks (Figure 5 and Table S5), representing 64 diseases. Of these diseases, 25 had significant associations with all three blocks. Celiac disease had the most disease-haplotype group associations with 36 in total (8 in block 1, 16 in block 2, and 12 in block 3), followed by rheumatic disease prescriptions with 34, spondylopathies with 32, and iridocyclitis and type 1 diabetes with 25 each. Early work on the long-range alpha, beta, and delta HLA blocks has identified disease associations with many of these diseases including celiac disease, rheumatoid arthritis, SLE, and type 1 diabetes48,50,54; however, these findings are reported across many different individual cohort studies, motivating the need for a centralized resource systematically reporting haplotype associations conducted in a uniform way on a phenome-wide scale.

To determine the robustness of the main haplotype group association results, we repeated the associations while varying the input SNPs used to define the haplotypes (see subjects and methods for details). While varying the SNPs used to define the haplotypes means that the haplotypes are not exactly the same between the additional analyses, qualitatively many of the same diseases were found to be associated with the haplotypes from a given block regardless of the input SNPs used to define them (Figure S8).

We sought to explore the patterns of pleiotropy within these blocks. For each block, we considered all diseases with at least one association (|Z| > 4) in that block and all haplotype groups with at least one disease association or total copies greater than the minimum cutoff of 20,000 copies (Figure 5). This resulted in 41 diseases and 23 haplotype groups for block 1, 46 diseases and 25 haplotype groups for block 2, and 36 diseases and 21 haplotype groups for block 3.

The majority of the haplotype groups were significantly associated with multiple diseases. A subset of haplotype groups was associated with increased risk for some diseases but decreased risk for others. For example, in block 1, haplotype group 6 is associated with increased risk (Z > 3) for ten diseases, including gastrointestinal autoimmune diseases, thyroid diseases, and connective tissue and rheumatic diseases. However, this haplotype group is also associated with decreased risk for eight diseases, mostly other rheumatic and inflammatory diseases (Table S5). While prior studies have found extensive genetic associations in the HLA region with these diseases,32,63,64 to our knowledge this is the first study to identify pleiotropic associations with all of these diseases for an individual haplotype in the alpha block region. Overall, of the 58 haplotype groups that showed a significant disease association (|Z| > 4), the mean number of associations (|Z| > 3) per haplotype group was five risk-increasing associations and seven risk-decreasing associations (Figure S9 and Note S3).

In contrast to these haplotype groups showing both risk-increasing and protective effects across different diseases, we also observed that some haplotype groups had the same direction of effect across the majority of associated diseases (Figure 5). For example, haplotype group 49 in block 1 was one of the rarest haplotype groups (0.09% frequency), but all six of the diseases with which it was significantly (|Z| > 3) associated were in the risk-increasing direction, including depression and phobic anxiety disorder. Individuals with chronic medical conditions overall are at greater risk of developing mood disorders,75 but the extent to which shared underlying biological processes contribute is still being explored.76 Prior work has implicated the HLA region in mood disorders; a GWAS of depression identified a SNP association in the extended HLA class I region,77 and another study identified associations of classical HLA alleles with depression and anxiety,78 while another recent study investigating the role of HLA alleles found no evidence of increased risk for depression.76

This finding motivated us to calculate overall disease burden proportions for each haplotype group (Figure S10). We defined the set of relevant diseases for each block as any disease that was significantly associated with at least one of the haplotype groups in that block. Then, for each haplotype group in a given block, we identified the proportion of individuals in the haplotype group that had a diagnosis of at least one of the block’s relevant diseases. To identify the overall disease proportion as a baseline comparison, for each block we identified the proportion of all 412,181 individuals who had a diagnosis of at least one of the block’s relevant diseases. We then compared the haplotype group-disease proportion to the overall disease proportion (Figure S11). Overall, we found there was heterogeneity in disease burden across haplotype groups in all three blocks (block 1: χ²(22, N = 412,181) = 138, p < 2.2 × 10⁻¹⁶; block 2: χ²(24, N = 412,181) = 319, p < 2.2 × 10⁻¹⁶; block 3: χ²(20, N = 412,181) = 255, p < 2.2 × 10⁻¹⁶). One example of a haplotype group that demonstrated this heterogeneity was haplotype group 49 in block 1, which had one of the highest block-relevant disease burdens, with 73% of carriers having at least one of the block’s significantly associated (|Z| > 4) diseases, compared to the baseline prevalence in FinnGen of 67.5%. Our findings indicate that while some haplotypes were associated with both increased and decreased disease risk, other haplotypes had an overall net positive or net negative impact across diseases.
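
A heterogeneity test of this kind can be sketched as a Pearson chi-square statistic on a table with one row per haplotype group. The table layout and the counts below are illustrative assumptions, and the p value (which needs the chi-square distribution) is not computed. The degrees of freedom equal the number of groups minus one, which matches the reported values of 22, 24, and 20 for the 23, 25, and 21 haplotype groups in blocks 1 to 3.

```python
def chi_square_homogeneity(table):
    """Pearson chi-square statistic and degrees of freedom for a k x 2 table.

    Each row is (n_with_disease, n_without_disease) for one haplotype group.
    """
    col_totals = [sum(row[j] for row in table) for j in range(2)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in table:
        row_total = sum(row)
        for j in range(2):
            # Expected count if disease burden were equal across groups.
            expected = row_total * col_totals[j] / grand_total
            stat += (row[j] - expected) ** 2 / expected
    return stat, len(table) - 1

# Made-up counts for three haplotype groups.
stat, dof = chi_square_homogeneity([(70, 30), (60, 40), (50, 50)])
print(round(stat, 3), dof)
```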

Comparison of effects on disease pairs across haplotype groups

Many diseases have shared underlying pathology resulting in comorbidity. As a result, we expected to see sharing of associations across these diseases for the HLA haplotype groups. Indeed, our analysis recapitulated shared pathology for many diseases, such as rheumatoid arthritis and seropositive rheumatoid arthritis, with similar associations across haplotype groups. More broadly, we found that the inflammatory and rheumatic diseases, such as spondylopathies, iridocyclitis, polyarthropathies, and rheumatoid arthritis clustered together throughout the three blocks (Figure 5). This could result from phenotypic correlations, caused, for example, by being comorbid. An alternative explanation is that these diseases have a shared biological mechanism modulated by genetic variation in the HLA region. Finally, it is possible that these correlations are an artifact of long-range LD extending beyond the haplotypes.

In contrast, we observed a surprising lack of concordance for a subset of seemingly similar diseases, such as inflammatory bowel disease (IBD) and “IBD with primary sclerosing cholangitis” (IBD with PSC) (Figure 5). IBD with PSC is an idiopathic chronic liver disease complication developed by a subset of IBD patients, in which the bile ducts become inflamed and scarred, causing liver damage. IBD and IBD with PSC have a genome-wide genetic correlation of 0.45 (SE = 0.16) and have similar effects across haplotypes in block 3 (Pearson’s correlation of 0.57, SE = 0.12, p = 0.005), suggesting a shared etiology (Figure S12). However, the haplotype groups have essentially uncorrelated effects on the two diseases in block 1 (Pearson’s correlation of 0.10, SE = 0.13, p = 0.47). In fact, some haplotype groups in block 1, such as group 6, are associated with increased risk for IBD with PSC, but not IBD (Figure S12). These results build on previous work based on early serologic HLA typing that found some HLA alleles were more common in individuals with ulcerative colitis and hepatobiliary disease relative to those with just ulcerative colitis.79

The difference in haplotype group effects on IBD and IBD with PSC is particularly interesting because it is difficult for clinicians to predict which IBD patients will develop liver damage, and the mechanism leading to this damage is unknown.80 Thus, these traits serve as an example of how understanding which parts of the genome are associated with increased risk for both a disease and its complications—as opposed to loci that differentially affect a disease and its complications—may help us better understand the factors that modulate the risk of certain disease complications. Understanding these differences may help explain why individuals with the same disease can present with a wide range of symptoms and outcomes.

To better disentangle whether these pleiotropic associations were due to LD, comorbidity, or shared biological pathways, we quantified the genome-wide LDSC genetic correlation, phenotypic correlation, and Pearson’s correlation across haplotype group-disease associations for all pairwise combinations of haplotype group-associated diseases for each block (Figure 6A). We discovered 1,520 pairs of diseases with genome-wide genetic correlations greater than 0.3 where both diseases were significantly associated (|Z| > 4) with at least one block. Of these disease pairs, 408 had a Pearson’s correlation across haplotype group effects >0.3 for all three blocks and, surprisingly, 256 had a discordant Pearson’s correlation of less than −0.3 in at least one block. We also observed discordant association signals for diseases with previously well-defined genetic associations and with clinical impact,73,81,82 such as Graves disease and rheumatoid arthritis (Figure 6B).
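
The per-block comparison can be sketched as a Pearson correlation of two diseases' effects across haplotype groups, followed by a simple labeling rule using the 0.3 and −0.3 cutoffs. The function names and labels are illustrative. The example correlations are the per-block values the paper reports for Graves disease and rheumatoid arthritis.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences of effect estimates."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def classify_pair(block_correlations, cutoff=0.3):
    """Label a disease pair from its per-block correlations of haplotype effects."""
    if all(r > cutoff for r in block_correlations):
        return "concordant in all blocks"
    if any(r < -cutoff for r in block_correlations):
        return "discordant in at least one block"
    return "neither"

# Blocks 1, 2, and 3 for Graves disease versus rheumatoid arthritis.
label = classify_pair([-0.37, 0.28, -0.59])
print(label)
```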

Figure 6.


Correlation of haplotype-associated diseases

(A) Overview and comparison of the pairwise relationships between diseases that were significantly associated with the haplotype group regression analysis, comparing genome-wide LDSC genetic correlations, Pearson’s correlation across haplotype groups in each block, and phenotypic correlations.

(B) Comparison of correlation measures between Graves disease and rheumatoid arthritis. Error bars correspond to standard error.

In Graves disease, autoantibodies against the TSH receptor lead to overstimulation of the thyroid gland, resulting in hyperthyroidism. Rheumatoid arthritis is an idiopathic chronic inflammatory autoimmune disease primarily affecting the joints. Graves disease and rheumatoid arthritis have a genome-wide genetic correlation of 0.35 (p = 0.0002), despite a phenotypic correlation of approximately 0 (p = 0.6). Both Graves disease and rheumatoid arthritis have been found in many prior studies to have strong shared and individual associations in the HLA region.83,84,85,86,87 Here, we replicate these findings and also find that some haplotype groups have not just different but opposite effects on these conditions. The Pearson’s correlation of effects within block 2 is concordant with this genome-wide genetic correlation, although not significantly so (Pearson’s correlation of 0.28, p = 0.18). However, the effects in blocks 1 and 3 are significantly negatively correlated (Pearson’s correlations of −0.37 and −0.59, p = 0.006 and 0.004, respectively). A potential explanation of this discordance between the genome-wide genetic correlation and the Pearson’s correlation within HLA regions is that these discordant regions affect a biochemical mechanism that breaks shared pathology, resulting in an increased risk for one disease while decreasing the risk for another, when relevant variants elsewhere in the genome typically cause a shared increase or decrease of risk for both diseases. In previous work, we showed that such mechanisms can result in associations with opposite signs on diseases, despite a positive genome-wide genetic correlation, driven by variants acting at the shared biochemical pathways between both diseases.88

Evaluation of haplotype group signal independent of HLA alleles

While protein-coding variation within the HLA genes likely contributes significantly to the disease associations at the haplotype level, prior studies have shown the role of variation outside the classical HLA genes in multiple traits.39,40,41,42,43,44 A feature of the haplotype analysis is that it includes genetic variation beyond coding variants in classical HLA genes, including non-classical HLA genes, non-HLA genes, and non-coding variation. Therefore, we sought to determine the extent to which our haplotype analysis was able to capture signal beyond the HLA alleles. To be conservative, we only considered signal entirely independent (directly, or indirectly due to LD) of the HLA alleles by performing the haplotype group regressions while including all classical HLA alleles (frequency >1%) in each block as covariates. Overall, we found that 129 haplotype associations remained significant (|Z| > 4) after accounting for HLA allelic variation (Figures S13 and S14; Table S5). In particular, block 1 had 171 significant associations across 48 unique diseases in our original analysis, and 50 significant associations (|Z| > 4) across 18 unique diseases after adjusting for the classical HLA alleles. For example, rheumatic disease prescriptions was significantly associated (|Z| > 4) with 12 haplotype groups in block 1 in the original analysis, but after adjusting for the classical HLA alleles in block 1, it was significantly associated (|Z| > 4) with 13 haplotype groups. A possible explanation for this could be that the haplotype association analysis is picking up on signal from genetic variation in the non-coding region. Consistent with this, conditional analysis to identify independent SNP associations in the HLA region resulted in a significant association with intergenic variant rs9393984 (β = 0.13, p = 3.8e−26, nearest gene HLA-A) with rheumatic disease prescriptions (Table S3 and Note S4).

This indicates that many associations cannot be explained by HLA allele variation or signal tagged by it and demonstrates that the haplotype group analysis was able to pick up on disease associations that would have been missed in traditional allele association analyses. Block 1 overlapped only one classical HLA gene, HLA-A, suggesting that our haplotype regression approach may be particularly beneficial for regions of the HLA that cover non-classical HLA genes. Moreover, including the HLA alleles as covariates increased the strength of 42 significant haplotype-disease associations, indicating that the haplotypes explain some variation independent of that explained by the HLA alleles. We note that some of the haplotype associations can be driven by LD between our SNP-based haplotypes and long-range HLA haplotypes based on classical two-field HLA alleles or amino acid residues.51,52,53

To further disentangle the information provided by haplotypes, SNPs, and HLA alleles, we performed association analyses at each of these levels separately. For the allele associations, we performed regressions using two approaches (Table S6). The first approach used the standard method of including one allele per regression, while the second performed a multi-variable regression of all alleles (variance inflation factor <5) within a given block. Our findings from these associations largely recapitulate HLA associations that have been shown by earlier HLA studies.8,28,29,30,31,32 Furthermore, our data reveal associations at many loci not present in prior work such as Ritari et al.32 (Note S5). The results of our association analyses at the SNP, haplotype, and HLA allele level on the full cohort across all 2,459 diseases are available in Tables S3, S5, and S6, respectively. In total, we identified 7,649 associations and 647 HLA-associated diseases across the combined association analyses. In particular, we identified 1,750 significant associations within the HLA locus in the haplotype analysis, including 27 diseases not identified in the SNP or HLA allele analyses. These diseases included non-organic psychotic disorder, otorrhagia, vascular dementia, and rectal cancers. This emphasizes that analyzing variation at the haplotype level provides orthogonal information about the role of the HLA region in disease.

Haplotype-disease associations in the UK Biobank

To assess the portability of the method to an independent cohort, we applied the haplotype association method in the UKBB cohort (see subjects and methods). While Finnish haplotypes are distinct due to population-specific genetic structure, we similarly observed many disease associations with haplotypes in the UKBB (Table S7 and Figure S15). Many of the diseases identified to be associated with haplotypes in FinnGen were also associated with haplotypes in the UKBB, such as multiple sclerosis, diabetes, lupus, celiac disease, thyrotoxicosis, Sjögren syndrome, rheumatoid arthritis, iridocyclitis, and psoriatic arthropathies (Figure S15; see Note S6 for additional details). These associations across blocks are generally concordant, with the weaker signals in the UKBB consistent with its lower power due to smaller case numbers. In addition, trait clusters (traits that cluster together based on shared effects across haplotypes) in each block for these traits are similar in FinnGen and UKBB (Figure S15 and Note S6).

We additionally performed a replication analysis of the FinnGen haplotype group associations in the UKBB by mapping UKBB haplotypes to the nearest FinnGen haplotype groups (see subjects and methods). For 15 traits significantly associated with at least one haplotype group in FinnGen, we compared the Z scores for the haplotype group-disease associations in FinnGen to those from the UKBB (Table S8). The Pearson’s correlation of Z scores between all 345 haplotype group-disease associations present in both biobanks was 0.28 (p = 7.7e−8), the correlation for the 120 haplotype group-disease associations with |Z| > 2 in FinnGen was 0.43 (p = 1.3e−6), and for the 33 associations with |Z| > 4 in FinnGen the correlation was 0.55 (p = 9.8e−4). In addition, 10 of the 33 associations with |Z| > 4 in FinnGen have |Z| > 1.645 (based on a one-sided Z test at significance level of 0.05) and the same sign in the UKBB. Under a null model where there is no signal in the UKBB, the expected replication probability would be 0.05, or 1.65 out of 33 associations, significantly fewer than the ten replications we observed (binomial test: p = 3.1e−6).
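The replication arithmetic in the preceding paragraph can be checked with a short sketch. It assumes the reported binomial test is a one-sided upper-tail test with a per-association success probability of 0.05. The variable and function names are ours, and the numbers are taken directly from the text.

```python
from math import comb
from statistics import NormalDist


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_assoc = 33        # associations with |Z| > 4 in FinnGen
n_replicated = 10   # of those, replicated with the same sign in the UKBB
alpha = 0.05        # one-sided significance level

# One-sided Z threshold corresponding to alpha (about 1.645).
z_threshold = NormalDist().inv_cdf(1 - alpha)

# Expected number of replications if the UKBB carried no signal.
expected_under_null = n_assoc * alpha

# Probability of observing at least 10 replications under that null.
p_value = binomial_upper_tail(n_replicated, n_assoc, alpha)

print(f"Z threshold: {z_threshold:.3f}")
print(f"Expected replications under the null: {expected_under_null:.2f}")
print(f"Binomial upper-tail p value: {p_value:.2e}")
```

Under these assumptions the upper-tail probability comes out at about 3.1e−6, in line with the value reported above.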

We also examined the replication of particular haplotype group-disease associations. For example, in block 1, haplotype group 3 has a Z score of −3 for diabetes in the UKBB and −4.5 for its mapped trait, diabetes insulin treatment, in FinnGen, and a Z score of 3.2 for thyroiditis in the UKBB and 4.4 for its mapped trait, thyrotoxicosis, in FinnGen. For block 2, haplotype group 10 has a Z score of 16.7 in the UKBB and 15.7 in FinnGen for celiac disease. The same haplotype group is strongly associated with Sjögren syndrome in both the UKBB (Z = 6.1) and FinnGen (Z = 5.5). For block 3, haplotype group 8 has a Z score of 6.7 (UKBB) and 7.1 (FinnGen) for rheumatoid arthritis and a Z score of 3.97 for diabetes in the UKBB and 11.1 for diabetes insulin treatment in FinnGen.

We next evaluated Pearson’s correlation across haplotype groups for the 15 traits we mapped between the UKBB and FinnGen and qualitatively compared these to the original FinnGen Pearson’s correlation analysis. Specifically, of the 12 trait pairs in FinnGen that had a genetic correlation >0.3 and had a Pearson’s correlation >0.3 across the haplotype groups in all three blocks, five trait pairs also had a Pearson’s correlation >0.3 in all three blocks in the UKBB. While Graves disease was not present in our UKBB haplotype association data, we did have thyroiditis, a related trait. Therefore, we compared the correlations across the three blocks for thyroiditis and rheumatoid arthritis in the UKBB. While the Pearson’s correlation across haplotype groups is not significant in the UKBB, the direction of the correlation for these traits matches that of the mapped traits thyrotoxicosis and rheumatoid arthritis in FinnGen for all three blocks, with a negative correlation in blocks 1 and 3 (Pearson’s correlations of −0.12 and −0.28, respectively) and a positive correlation in block 2 (Pearson’s correlation of 0.19). In Table S8 we include Pearson’s correlations between all pairs of these 15 diseases mapped between FinnGen and UKBB.
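To make the per-block comparison concrete, the sketch below computes Pearson’s correlation between the Z scores of two traits across haplotype groups, separately for each block, and then checks whether the correlation exceeds 0.3 in every block. The Z scores are hypothetical values chosen only to illustrate the calculation. They are not results from FinnGen or the UKBB, and the function names are ours.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def exceeds_in_all_blocks(z_by_block, threshold=0.3):
    """True if the trait pair's correlation is above the threshold in every block."""
    return all(pearson(za, zb) > threshold for za, zb in z_by_block.values())


# Hypothetical Z scores for two traits; one value per haplotype group in each block.
z_by_block = {
    "block1": ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]),   # same direction
    "block2": ([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]),   # mostly same direction
    "block3": ([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]),   # opposite direction
}

per_block_r = {block: pearson(za, zb) for block, (za, zb) in z_by_block.items()}
for block, r in per_block_r.items():
    print(f"{block}: r = {r:.2f}")

print("r > 0.3 in all blocks:", exceeds_in_all_blocks(z_by_block))
```

A pair like the toy one here, with positive correlation in two blocks and negative correlation in the third, would not count toward the trait pairs that exceed 0.3 in all three blocks. It does show how the sign of the correlation can differ between blocks.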

Together, these results emphasize the pleiotropic patterns of the haplotypes across diseases in the HLA region.

Discussion

In this work, we investigated how genetic variation throughout the HLA region associates with disease, with a focus on broad pleiotropic patterns. We quantified the enrichment of association signal in the HLA region relative to the rest of the genome. We found a strong enrichment of disease associations across a broad range of disease groups and organ systems. Unsurprisingly, infectious diseases were almost 400-fold enriched in the HLA region compared to the rest of the genome, despite infections making up a minority of the HLA-associated diseases. We also found enrichment across multiple disease categories and organ systems including cardiovascular and neuropsychiatric diseases. Overall, these findings indicate HLA is a major locus for disease risk, not only for infectious diseases but for diseases across many organ systems and etiologies.

Even with the extreme enrichment for infection-related associations, we expect that there is still substantially more information to be gleaned about the role of HLA in mediating infection. Our enrichment analysis controls for how well powered a disease is by using the number of associations in the rest of the genome as a baseline. However, while we find a huge enrichment, the absolute number of total associations is small. Infectious diseases are often under-reported in large biobank cohorts: identifying cases requires patients to seek care for the infection, followed by testing to confirm the specific pathogen. The infectious diseases that we identified with the clearest signal tended to be those with more consistent reporting, such as sexually transmitted diseases. Therefore, our findings indicate that there is likely more signal for infectious diseases that will be discovered with larger samples or more systematic reporting.

We performed disease association testing with SNPs, HLA alleles, and haplotypes to capture disease associations throughout the entire HLA region, including non-classical HLA genes and non-coding regions. We developed a haplotype analysis approach that includes genetic variation outside of the classical HLA alleles. While many diseases strongly associate with HLA alleles, the HLA region harbors hundreds of genes, many of which also play an important role in immune response and other biological processes. Our haplotype approach discovered disease associations in the HLA region that remained after adjusting for classical HLA alleles, particularly in the region that overlaps more non-HLA and non-classical HLA genes.

Furthermore, we found some haplotype groups that displayed protective associations for some diseases and risk-increasing associations for others. Meanwhile, we found some haplotype groups that were more consistently associated with increased disease burden across tested diseases. In addition, our haplotype analysis discovered that local genetic correlation, genome-wide genetic correlation, and phenotypic correlation between disease pairs are not always concordant. This discordance suggests that the HLA region plays not only an important but also a distinct role relative to the rest of the genome in contributing to the shared biology underlying these diseases.

In total, we identified 7,649 significant disease associations across 647 unique diseases in the HLA region. Here, we highlight interesting patterns across these diseases and example associations, but we have only begun to explore the thousands of disease associations generated by these analyses. Therefore, we are releasing the association test results as a resource for future studies of the HLA region (Tables S3, S5, and S6). For example, our haplotype association results identify multiple diseases or disease complications of previously unknown pathology that cluster with diseases with known mechanism. It could be fruitful to use these clusters to generate hypotheses about the biology underlying idiopathic diseases. Further, future studies could disentangle the signal captured by the haplotype group associations in terms of the effects of individual non-coding SNPs, haplotypes, classical HLA alleles, and protein-coding variation in other genes in the region. In addition, the haplotypes present in FinnGen represent only a fraction of the genetic diversity present in the world. As more large cohort data continue to become available from regions around the world, future studies will benefit from application of these methods in other cohorts to study the HLA region at the haplotype level.

In conclusion, this work offers insights into the role of the HLA region in modulating the complex interplay between hundreds of diseases. Our findings highlight haplotype regression analysis as an additional approach for studying genetic variation in the region beyond the classical HLA alleles. Our results also provide insight into the nature of pleiotropy in the region and highlight novel pathological processes not only for the infectious and autoimmune diseases typically associated with HLA but also for a broad range of other diseases.

Data and code availability

Acknowledgments

We want to acknowledge the participants and investigators of the FinnGen study (see Table S9 for a full list of FinnGen contributors). We thank Alyssa Lyn Fortier, Mineto Ota, Roshni Patel, Matthew Aguirre, Tami Gjorgjieva, and other members of the Pritchard lab for helpful discussions. This work has been supported by the National Science Foundation Graduate Research Fellowship, Stanford’s Knight-Hennessy Scholars Program, and the Stanford Center for Computational, Evolutionary and Human Genomics (C.J.S.); the Finnish Medical Foundation (S.S.); and Instrumentarium Science Foundation and Academy of Finland #340539 (H.M.O.). The FinnGen project is funded by two grants from Business Finland (HUS 4685/31/2016 and UH 4386/31/2016) and the following industry partners: AbbVie, AstraZeneca UK, Biogen MA, Bristol Myers Squibb (and Celgene Corporation & Celgene International II Sàrl), Genentech, Merck Sharp & Dohme, Pfizer, GlaxoSmithKline Intellectual Property Development, Sanofi US Services, Maze Therapeutics, Janssen Biotech, Novartis, and Boehringer Ingelheim International. The following biobanks are acknowledged for delivering biobank samples to FinnGen: Auria Biobank (www.auria.fi/biopankki), THL Biobank (www.thl.fi/biobank), Helsinki Biobank (www.helsinginbiopankki.fi), Biobank Borealis of Northern Finland (https://www.ppshp.fi/Tutkimus-ja-opetus/Biopankki/Pages/Biobank-Borealis-briefly-in-English.aspx), Finnish Clinical Biobank Tampere (www.tays.fi/en-US/Research_and_development/Finnish_Clinical_Biobank_Tampere), Central Finland Biobank (www.ksshp.fi/fi-FI/Potilaalle/Biopankki), Biobank of Eastern Finland (www.ita-suomenbiopankki.fi/en), Finnish Red Cross Blood Service Biobank (www.veripalvelu.fi/verenluovutus/biopankkitoiminta), and Terveystalo Biobank (www.terveystalo.com/fi/Yritystietoa/Terveystalo-Biopankki/Biopankki/). All Finnish Biobanks are members of BBMRI.fi infrastructure (www.bbmri.fi). 
Finnish Biobank Cooperative - FINBB (https://finbb.fi/) is the coordinator of BBMRI-ERIC operations in Finland. The Finnish biobank data can be accessed through the Fingenious services (https://site.fingenious.fi/en/) managed by FINBB. This work was supported by NIH grants R01HG008140 and R01AG066490 (to J.K.P.).

Declaration of interests

The authors declare no competing interests.

Published: July 10, 2025

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2025.06.011.

Contributor Information

Courtney J. Smith, Email: courtrun@stanford.edu.

Jonathan K. Pritchard, Email: pritch@stanford.edu.

Web resources

Supplemental information

Document S1. Figures S1–S17, Notes S1–S6, and supplemental methods
mmc1.pdf (9.1MB, pdf)
Table S1. Description of all 2,459 FinnGen diseases included in the association analyses, including the disease categories, number of cases, and ICD codes

The first column (“NAME”) corresponds to the short name assigned by FinnGen, the second (“TAGS”) is the group tag assigned by FinnGen, “LONGNAME” is the full name of the disease, “HD_ICD_10” corresponds to the ICD-10 code, “HD_ICD_9” is the ICD-9 code, “HD_ICD_8” is the ICD-8 code, “category” is the FinnGen description of the disease tag, “num_cases” is the number of cases for that disease, “num_controls” is the number of controls, “lambda” is the lambda from the FinnGen GWAS runs, “HLA_hits” is the number of conditionally independent SNP associations identified in this study within the HLA boundaries, “non_HLA_hits” is the number of fine-mapped variants for all SNPs outside the HLA region, “group” is the groupings used for the enrichment analysis based on the FinnGen groupings, and “ManualCategory” is the manually curated groupings based on shared pathophysiology for the diseases identified as HLA associated by the SNP analysis. For all diseases where the pathophysiologic mechanism underlying the disease was unknown or where the disease can have multiple causes, the ManualCategory is recorded as “Organ.” “ManualSubcategory” is the same as “ManualCategory,” except that for diseases where the “ManualCategory” is recorded as “Organ,” it lists the part of the body or organ primarily affected. “Plot” is the concise interpretable name used to plot the 269 diseases included in the main haplotype group regression analysis, results of which are shown in Figure 5. In addition, full details of how FinnGen defined each disease are available at https://r10.risteys.finngen.fi/.

mmc2.zip (104.5KB, zip)
Table S2. Enrichment of GWAS hits in the HLA region for each disease group with at least ten diseases in the group, for all diseases in FinnGen with at least one GWAS hit (MAF > 1%) anywhere in the genome

The first column (“group”) corresponds to the disease group, the second column (“enrich”) has the enrichment for that disease group, the third column (“n”) has the number of diseases in that disease group, the fourth column (“HLA_hits”) has the mean number of independent SNP associations in the HLA region for diseases in that disease group, the fifth column (“non_HLA_hits”) has the mean number of independent SNP associations outside the HLA region for diseases in that disease group, the sixth column (“se_HLA”) has the standard error of the number of independent SNPs in the HLA region across diseases in that disease group, and the last column (“se_non_HLA”) has the standard error of the number of independent SNPs outside the HLA region across diseases in that disease group. The second tab has the same format but includes results from the repeated analysis using the 644 diseases that remained after randomly removing one disease for each pair with an LDSC genetic correlation >0.95.

mmc3.xlsx (25.5KB, xlsx)
Table S3. Regression results for the SNP-disease associations for significant SNP associations remaining after stepwise conditional analysis in the HLA region

The first column has the FinnGen short name for the disease, the second column has the longer name of the disease, the third has the SNP ID, the fourth has the position of the SNP, the fifth has the reference allele, the sixth has the alternative allele, the seventh has the allele frequency of the alternative allele, the eighth has the rsID of the SNP, and the next five columns have the beta, standard error, Z score, and p value for the association of that SNP with that disease. The next column (“nearest_genes”) has the nearest gene to the SNP, followed by (column “round”) the conditional analysis round the SNP was found to be independently significant with the disease, then (“annot”) the variant annotation for the SNP. This table includes the 1,064 disease associations from the full SNP conditional analysis with all 572 diseases. The conditional analysis focusing on just the 269 diseases included in the main analysis, after removing redundant traits, resulted in 540 disease associations across 428 unique SNPs.

mmc4.xlsx (160.8KB, xlsx)
Table S4. Haplotype group and individual haplotype statistics and assignments

The first tab (“allhaplotype_group_stats”) has the total doses of each haplotype group present in the dataset for each block. Tabs 2–4 have the haplotype information with one tab for each block. The first 1,000 columns correspond to the 1,000 SNPs in the haplotype, where 0 corresponds to the reference allele and 1 corresponds to the alternative allele, then the second to last column (“total_doses”) has the total doses of that haplotype, and the last column (“haplotype_group”) has the haplotype group to which each haplotype belonged for that block. Total doses for individual haplotypes are included for all haplotypes with >10 total doses for FinnGen privacy policy reasons.

mmc5.xlsx (61MB, xlsx)
Table S5. Results from the haplotype groups association analyses across all three blocks

The first tab (“main_hapgroup_reg_results”) has the full results from the main haplotype group association analysis, with the first column (“traits”) indicating the disease, the second (“hapgroup”) referring to the haplotype group, the next column (“Z_rescaled”) referring to the values of the regression Z scores rescaled to add back in the dropped haplotype group for each block, the next column (“plotted”) corresponding to whether or not that association is plotted in the heatmap of Figure 5, and the final column (“block”) indicating the block in the HLA region in which the haplotype group was identified. The next two tabs have the (non-rescaled) regression results for all diseases, with (tab called “allregresults_adjallelesinblock”) and without joint modeling to condition on the relevant classical HLA alleles in the block (tab called “all_hapgroup_reg_results_sig”).

mmc6.xlsx (1.1MB, xlsx)
Table S6. Regression results for all significant allele associations for all diseases, for both the approach jointly modeling alleles within a given block together (tab 1; tab called “alleles_vifindep_joint”) and for the approach with one individual allele per regression (tab 2; tab called “alleles_indiv”)

For both tabs, the first column has the disease name, the next four columns have the beta, standard error, Z score, and p value, respectively, for the association of that allele with that disease, and the sixth column has the allele. For the first tab, the last column has the number of the block in which the allele lies.

mmc7.xlsx (570.6KB, xlsx)
Table S7. Haplotype-disease association results in UK Biobank

Tab 1 (called “ukbhapassociations”) has the results for blocks 1–3 for the associations plotted in the UKBB heatmap, with the first column (“traits”) indicating the disease code, the second (“hapgroup”) referring to the haplotype group, the next column (“Z”) referring to the values of the regression Z scores, the next column (“LONG_NAME”) referring to the long name for the disease, the next column (“N”) referring to the number of cases for that disease, the next column (“Plot”) referring to the concise interpretable name used for the heatmap plot label, and the last column (“block”) indicating the block to which the haplotype group belonged. Tab 2 (called “hapgroupdoses”) has the number of haplotypes included in each haplotype group.

mmc8.xlsx (59.1KB, xlsx)
Table S8. Haplotype-disease association results in UK Biobank using UK Biobank haplotypes mapped onto the original FinnGen haplotype groups

Tab 1 has the UKBB results for blocks 1–3 comparing the Z scores for the equivalent haplotype group-disease association in FinnGen, with the first column (“block”) indicating the block the haplotype group belonged to, the second (“hapgroup”) referring to the haplotype group, the third (“finngen_trait”) indicating the disease code in FinnGen, the fourth (“finngen_Z”) referring to the values of the FinnGen regression Z scores, the fifth (“ukb_trait”) indicating the disease code in UKBB, and the last column (“ukb_Z”) indicating the values of the UKBB regression Z scores. Tab 2 has the total doses in UKBB for each of the original FinnGen haplotype groups. Tab 3 has the replication analysis results for the original Pearson’s correlation analysis across haplotype groups for all pairwise combinations of the diseases mapped between FinnGen and UKBB.

mmc9.xlsx (82KB, xlsx)
Table S9. List of FinnGen contributors
mmc10.xlsx (40KB, xlsx)
Document S2. Article plus supplemental information
mmc11.pdf (18.7MB, pdf)

References

1. Horton R., Wilming L., Rand V., Lovering R.C., Bruford E.A., Khodiyar V.K., Lush M.J., Povey S., Talbot C.C., Wright M.W., et al. Gene map of the extended human MHC. Nat. Rev. Genet. 2004;5:889–899. doi: 10.1038/nrg1489.
2. Neefjes J., Jongsma M.L.M., Paul P., Bakke O. Towards a systems understanding of MHC class I and MHC class II antigen presentation. Nat. Rev. Immunol. 2011;11:823–836. doi: 10.1038/nri3084.
3. Ishigaki K., Lagattuta K.A., Luo Y., James E.A., Buckner J.H., Raychaudhuri S. HLA autoimmune risk alleles restrict the hypervariable region of T cell receptors. Nat. Genet. 2022;54:393–402. doi: 10.1038/s41588-022-01032-z.
4. Fan W.-L., Shiao M.-S., Hui R.C.-Y., Su S.-C., Wang C.-W., Chang Y.-C., Chung W.-H. HLA Association with Drug-Induced Adverse Reactions. J. Immunol. Res. 2017;2017. doi: 10.1155/2017/3186328.
5. Parham P., Guethlein L.A. Genetics of Natural Killer Cells in Human Health, Disease, and Survival. Annu. Rev. Immunol. 2018;36:519–548. doi: 10.1146/annurev-immunol-042617-053149.
6. Butler-Laporte G., Farjoun J., Nakanishi T., Lu T., Abner E., Chen Y., Hultström M., Metspalu A., Milani L., Mägi R., et al. HLA allele-calling using multi-ancestry whole-exome sequencing from the UK Biobank identifies 129 novel associations in 11 autoimmune diseases. Commun. Biol. 2023;6:1113–1117. doi: 10.1038/s42003-023-05496-5.
7. Karnes J.H., Bastarache L., Shaffer C.M., Gaudieri S., Xu Y., Glazer A.M., Mosley J.D., Zhao S., Raychaudhuri S., Mallal S., et al. Phenome-wide scanning identifies multiple diseases and disease severity phenotypes associated with HLA variants. Sci. Transl. Med. 2017;9. doi: 10.1126/scitranslmed.aai8708.
8. Sakaue S., Kanai M., Tanigawa Y., Karjalainen J., Kurki M., Koshiba S., Narita A., Konuma T., Yamamoto K., Akiyama M., et al. A cross-population atlas of genetic associations for 220 human phenotypes. Nat. Genet. 2021;53:1415–1424. doi: 10.1038/s41588-021-00931-x.
9. Kennedy A.E., Ozbek U., Dorak M.T. What has GWAS done for HLA and disease associations? Int. J. Immunogenet. 2017;44:195–211. doi: 10.1111/iji.12332.
10. Hurley C.K. Naming HLA diversity: A review of HLA nomenclature. Hum. Immunol. 2021;82:457–465. doi: 10.1016/j.humimm.2020.03.005.
11. Pierini F., Lenz T.L. Divergent Allele Advantage at Human MHC Genes: Signatures of Past and Ongoing Selection. Mol. Biol. Evol. 2018;35:2145–2158. doi: 10.1093/molbev/msy116.
12. Manczinger M., Boross G., Kemény L., Müller V., Lenz T.L., Papp B., Pál C. Pathogen diversity drives the evolution of generalist MHC-II alleles in human populations. PLoS Biol. 2019;17. doi: 10.1371/journal.pbio.3000131.
13. Özer O., Lenz T.L. Unique Pathogen Peptidomes Facilitate Pathogen-Specific Selection and Specialization of MHC Alleles. Mol. Biol. Evol. 2021;38:4376–4387. doi: 10.1093/molbev/msab176.
14. Miyadera H., Tokunaga K. Associations of human leukocyte antigens with autoimmune diseases: challenges in identifying the mechanism. J. Hum. Genet. 2015;60:697–702. doi: 10.1038/jhg.2015.100.
15. Radwan J., Babik W., Kaufman J., Lenz T.L., Winternitz J. Advances in the Evolutionary Understanding of MHC Polymorphism. Trends Genet. 2020;36:298–311. doi: 10.1016/j.tig.2020.01.008.
16. Takahata N., Nei M. Allelic genealogy under overdominant and frequency-dependent selection and polymorphism of major histocompatibility complex loci. Genetics. 1990;124:967–978. doi: 10.1093/genetics/124.4.967.
17. Fortier A.L., Pritchard J.K. Ancient Trans-Species Polymorphism at the Major Histocompatibility Complex in Primates. eLife. 2025;14. doi: 10.7554/eLife.103547.1.
18. Arden B., Klein J. Biochemical comparison of major histocompatibility complex molecules from different subspecies of Mus musculus: evidence for trans-specific evolution of alleles. Proc. Natl. Acad. Sci. USA. 1982;79:2342–2346. doi: 10.1073/pnas.79.7.2342.
19. Mayer W.E., Jonker M., Klein D., Ivanyi P., van Seventer G., Klein J. Nucleotide sequences of chimpanzee MHC class I alleles: evidence for trans-species mode of evolution. EMBO J. 1988;7:2765–2774. doi: 10.1002/j.1460-2075.1988.tb03131.x.
20. Ohashi A., Murayama M.A., Miyabe Y., Yudoh K., Miyabe C. Streptococcal infection and autoimmune diseases. Front. Immunol. 2024;15. doi: 10.3389/fimmu.2024.1361123.
21. Hillary R.P., Ollila H.M., Lin L., Desestret V., Rogemond V., Picard G., Small M., Arnulf I., Dauvilliers Y., Honnorat J., Mignot E. Complex HLA association in paraneoplastic cerebellar ataxia with anti-Yo antibodies. J. Neuroimmunol. 2018;315:28–32. doi: 10.1016/j.jneuroim.2017.12.012.
22. Santambrogio L., Marrack P. The broad spectrum of pathogenic autoreactivity. Nat. Rev. Immunol. 2023;23:69–70. doi: 10.1038/s41577-022-00812-2.
23. Bakkalci D., Jia Y., Winter J.R., Lewis J.E., Taylor G.S., Stagg H.R. Risk factors for Epstein Barr virus-associated cancers: a systematic review, critical appraisal, and mapping of the epidemiological evidence. J. Glob. Health. 2020;10. doi: 10.7189/jogh.10.010405.
24. Khan G., Hashim M.J. Global burden of deaths from Epstein-Barr virus attributable malignancies 1990-2010. Infect. Agent. Cancer. 2014;9:38. doi: 10.1186/1750-9378-9-38.
25. Parkin D.M. The global health burden of infection-associated cancers in the year 2002. Int. J. Cancer. 2006;118:3030–3044. doi: 10.1002/ijc.21731.
26. Bjornevik K., Cortese M., Healy B.C., Kuhle J., Mina M.J., Leng Y., Elledge S.J., Niebuhr D.W., Scher A.I., Munger K.L., Ascherio A. Longitudinal analysis reveals high prevalence of Epstein-Barr virus associated with multiple sclerosis. Science. 2022;375:296–301. doi: 10.1126/science.abj8222.
27. Bjornevik K., Münz C., Cohen J.I., Ascherio A. Epstein-Barr virus as a leading cause of multiple sclerosis: mechanisms and implications. Nat. Rev. Neurol. 2023;19:160–171. doi: 10.1038/s41582-023-00775-5.
28. de Bakker P.I.W., Raychaudhuri S. Interrogating the major histocompatibility complex with high-throughput genomics. Hum. Mol. Genet. 2012;21:R29–R36. doi: 10.1093/hmg/dds384.
29. Lenz T.L., Spirin V., Jordan D.M., Sunyaev S.R. Excess of Deleterious Mutations around HLA Genes Reveals Evolutionary Cost of Balancing Selection. Mol. Biol. Evol. 2016;33:2555–2564. doi: 10.1093/molbev/msw127.
30. Canela-Xandri O., Rawlik K., Tenesa A. An atlas of genetic associations in UK Biobank. Nat. Genet. 2018;50:1593–1599. doi: 10.1038/s41588-018-0248-z.
31. Watanabe K., Stringer S., Frei O., Umićević Mirkov M., de Leeuw C., Polderman T.J.C., van der Sluis S., Andreassen O.A., Neale B.M., Posthuma D. A global overview of pleiotropy and genetic architecture in complex traits. Nat. Genet. 2019;51:1339–1348. doi: 10.1038/s41588-019-0481-0.
32. Ritari J., Koskela S., Hyvärinen K., FinnGen, Partanen J. HLA-disease association and pleiotropy landscape in over 235,000 Finns. Hum. Immunol. 2022;83:391–398. doi: 10.1016/j.humimm.2022.02.003.
33. Stokkers P.C., Reitsma P.H., Tytgat G.N., van Deventer S.J. HLA-DR and -DQ phenotypes in inflammatory bowel disease: a meta-analysis. Gut. 1999;45:395–401. doi: 10.1136/gut.45.3.395.
34. Brown M.A., Kenna T., Wordsworth B.P. Genetics of ankylosing spondylitis–insights into pathogenesis. Nat. Rev. Rheumatol. 2016;12:81–91. doi: 10.1038/nrrheum.2015.133.
35. Noble J.A., Valdes A.M. Genetics of the HLA Region in the Prediction of Type 1 Diabetes. Curr. Diab. Rep. 2011;11:533–542. doi: 10.1007/s11892-011-0223-x.
36. Ziade N. Human leucocyte antigen-B27 testing in clinical practice: a global perspective. Curr. Opin. Rheumatol. 2023;35:235–242. doi: 10.1097/BOR.0000000000000946.
37. Raiteri A., Granito A., Giamperoli A., Catenaro T., Negrini G., Tovoli F. Current guidelines for the management of celiac disease: A systematic review with comparative analysis. World J. Gastroenterol. 2022;28:154–175. doi: 10.3748/wjg.v28.i1.154.
38. Amstutz U., Shear N.H., Rieder M.J., Hwang S., Fung V., Nakamura H., Connolly M.B., Ito S., Carleton B.C., CPNDS clinical recommendation group. Recommendations for HLA-B∗15:02 and HLA-A∗31:01 genetic testing to reduce the risk of carbamazepine-induced hypersensitivity reactions. Epilepsia. 2014;55:496–506. doi: 10.1111/epi.12564.
39. D’Antonio M., Reyna J., Jakubosky D., Donovan M.K., Bonder M.-J., Matsui H., Stegle O., Nariai N., D’Antonio-Chronowska A., Frazer K.A. Systematic genetic analysis of the MHC region reveals mechanistic underpinnings of HLA type associations with disease. eLife. 2019;8. doi: 10.7554/eLife.48476.
40. Bettens F., Ongen H., Rey G., Buhler S., Calderin Sollet Z., Dermitzakis E., Villard J. Regulation of HLA class I expression by non-coding gene variations. PLoS Genet. 2022;18. doi: 10.1371/journal.pgen.1010212.
41. Dendrou C.A., Petersen J., Rossjohn J., Fugger L. HLA variation and disease. Nat. Rev. Immunol. 2018;18:325–339. doi: 10.1038/nri.2017.143.
42. International Multiple Sclerosis Genetics Consortium. Multiple sclerosis genomic map implicates peripheral immune cells and microglia in susceptibility. Science. 2019;365. doi: 10.1126/science.aav7188.
43. Mayor N.P., Hayhurst J.D., Turner T.R., Szydlo R.M., Shaw B.E., Bultitude W.P., Sayno J.-R., Tavarozzi F., Latham K., Anthias C., et al. Recipients Receiving Better HLA-Matched Hematopoietic Cell Transplantation Grafts, Uncovered by a Novel HLA Typing Method, Have Superior Survival: A Retrospective Study. Biol. Blood Marrow Transplant. 2019;25:443–450. doi: 10.1016/j.bbmt.2018.12.768.
44. Jin Y., Roberts G.H.L., Ferrara T.M., Ben S., van Geel N., Wolkerstorfer A., Ezzedine K., Siebert J., Neff C.P., Palmer B.E., et al. Early-onset autoimmune vitiligo associated with an enhancer variant haplotype that upregulates class II HLA expression. Nat. Commun. 2019;10:391. doi: 10.1038/s41467-019-08337-4.
45. Sekar A., Bialas A.R., De Rivera H., Davis A., Hammond T.R., Kamitaki N., Tooley K., Presumey J., Baum M., Van Doren V., et al. Schizophrenia risk from complex variation of complement component 4. Nature. 2016;530:177–183. doi: 10.1038/nature16549.
46. Gupta A., Thelma B.K. Identification of critical variants within SLC44A4, an ulcerative colitis susceptibility gene identified in a GWAS in north Indians. Genes Immun. 2016;17:105–109. doi: 10.1038/gene.2015.53.
47. Zhang X., Lucas A.M., Veturi Y., Drivas T.G., Bone W.P., Verma A., Chung W.K., Crosslin D., Denny J.C., Hebbring S., et al. Large-scale genomic analyses reveal insights into pleiotropy across circulatory system diseases and nervous system disorders. Nat. Commun. 2022;13:3428. doi: 10.1038/s41467-022-30678-w.
48. Dawkins R., Leelayuwat C., Gaudieri S., Tay G., Hui J., Cattley S., Martinez P., Kulski J. Genomics of the major histocompatibility complex: haplotypes, duplication, retroviruses and disease. Immunol. Rev. 1999;167:275–304. doi: 10.1111/j.1600-065x.1999.tb01399.x.
49. Alper C.A., Awdeh Z., Yunis E.J. Conserved, extended MHC haplotypes. Exp. Clin. Immunogenet. 1992;9:58–71.
50. Dawkins R.L., Christiansen F.T., Kay P.H., Garlepp M., McCluskey J., Hollingsworth P.N., Zilko P.J. Disease associations with complotypes, supratypes and haplotypes. Immunol. Rev. 1983;70:5–22. doi: 10.1111/j.1600-065x.1983.tb00707.x.
  • 51.Gabriel S.B., Schaffner S.F., Nguyen H., Moore J.M., Roy J., Blumenstiel B., Higgins J., DeFelice M., Lochner A., Faggart M., et al. The structure of haplotype blocks in the human genome. Science. 2002;296:2225–2229. doi: 10.1126/science.1069424. [DOI] [PubMed] [Google Scholar]
  • 52.de Bakker P.I.W., McVean G., Sabeti P.C., Miretti M.M., Green T., Marchini J., Ke X., Monsuur A.J., Whittaker P., Delgado M., et al. A high-resolution HLA and SNP haplotype map for disease association studies in the extended human MHC. Nat. Genet. 2006;38:1166–1172. doi: 10.1038/ng1885. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Oliveira L.C., Porta G., Marin M.L.C., Bittencourt P.L., Kalil J., Goldberg A.C. Autoimmune hepatitis, HLA and extended haplotypes. Autoimmun. Rev. 2011;10:189–193. doi: 10.1016/j.autrev.2010.09.024. [DOI] [PubMed] [Google Scholar]
  • 54.Yunis E.J., Larsen C.E., Fernandez-Viña M., Awdeh Z.L., Romero T., Hansen J.A., Alper C.A. Inheritable variable sizes of DNA stretches in the human MHC: conserved extended haplotypes and their fragments or blocks. Tissue Antigens. 2003;62:1–20. doi: 10.1034/j.1399-0039.2003.00098.x. [DOI] [PubMed] [Google Scholar]
  • 55.Askar M., Madbouly A., Zhrebker L., Willis A., Kennedy S., Padros K., Rodriguez M.B., Bach C., Spriewald B., Ameen R., et al. HLA Haplotypes In 250 Families: The Baylor Laboratory Results And A Perspective On A Core NGS Testing Model For The 17th International HLA And Immunogenetics Workshop. Hum. Immunol. 2019;80:897–905. doi: 10.1016/j.humimm.2019.07.298. [DOI] [PubMed] [Google Scholar]
  • 56.Gaudieri S., Leelayuwat C., Tay G.K., Townend D.C., Dawkins R.L. The major histocompatability complex (MHC) contains conserved polymorphic genomic sequences that are shuffled by recombination to form ethnic-specific haplotypes. J. Mol. Evol. 1997;45:17–23. doi: 10.1007/pl00006194. [DOI] [PubMed] [Google Scholar]
  • 57.Bycroft C., Freeman C., Petkova D., Band G., Elliott L.T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O’Connell J., et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Hirata J., Hosomichi K., Sakaue S., Kanai M., Nakaoka H., Ishigaki K., Suzuki K., Akiyama M., Kishikawa T., Ogawa K., et al. Genetic and phenotypic landscape of the major histocompatibility complex region in the Japanese population. Nat. Genet. 2019;51:470–480. doi: 10.1038/s41588-018-0336-0. [DOI] [PubMed] [Google Scholar]
  • 59.Mozzi A., Pontremoli C., Sironi M. Genetic susceptibility to infectious diseases: Current status and future perspectives from genome-wide approaches. Infect. Genet. Evol. 2018;66:286–307. doi: 10.1016/j.meegid.2017.09.028. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Apps R., Qi Y., Carlson J.M., Chen H., Gao X., Thomas R., Yuki Y., Del Prete G.Q., Goulder P., Brumme Z.L., et al. Influence of HLA-C Expression Level on HIV Control. Science. 2013;340:87–91. doi: 10.1126/science.1232685. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Tian C., Hromatka B.S., Kiefer A.K., Eriksson N., Noble S.M., Tung J.Y., Hinds D.A. Genome-wide association and HLA region fine-mapping studies identify susceptibility loci for multiple common infections. Nat. Commun. 2017;8:599. doi: 10.1038/s41467-017-00257-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Binder M.D., Fox A.D., Merlo D., Johnson L.J., Giuffrida L., Calvert S.E., Akkermann R., Ma G.Z.M., Perera A.A., et al. ANZgene Common and Low Frequency Variants in MERTK Are Independently Associated with Multiple Sclerosis Susceptibility with Discordant Association Dependent upon HLA-DRB1∗15:01 Status. PLoS Genet. 2016;12 doi: 10.1371/journal.pgen.1005853. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Bosca-Watts M.M., Minguez M., Planelles D., Navarro S., Rodriguez A., Santiago J., Tosca J., Mora F. HLA-DQ: Celiac disease vs inflammatory bowel disease. World J. Gastroenterol. 2018;24:96–103. doi: 10.3748/wjg.v24.i1.96. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Lundström E., Gustafsson J.T., Jönsen A., Leonard D., Zickert A., Elvin K., Sturfelt G., Nordmark G., Bengtsson A.A., Sundin U., et al. HLA-DRB1∗04/∗13 alleles are associated with vascular disease and antiphospholipid antibodies in systemic lupus erythematosus. Ann. Rheum. Dis. 2013;72:1018–1025. doi: 10.1136/annrheumdis-2012-201760. [DOI] [PubMed] [Google Scholar]
  • 65.Rioux J.D., Goyette P., Hammarström L., Hammarström L., Fernando M.M.A., Green T., De Jager P.L., Foisy S., Wang J., et al. International MHC and Autoimmunity Genetics Network Mapping of multiple susceptibility variants within the MHC region for 7 immune-mediated diseases. Proc. Natl. Acad. Sci. USA. 2009;106:18680–18685. doi: 10.1073/pnas.0909307106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Debebe B.J., Boelen L., Lee J.C., IAVI Protocol C Investigators. Thio C.L., Astemborski J., Kirk G., Khakoo S.I., Donfield S.M., Goedert J.J., Asquith B. Identifying the immune interactions underlying HLA class I disease associations. eLife. 2020;9 doi: 10.7554/eLife.54558. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Kurki M.I., Karjalainen J., Palta P., Sipilä T.P., Kristiansson K., Donner K.M., Reeve M.P., Laivuori H., Aavikko M., Kaunisto M.A., et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature. 2023;613:508–518. doi: 10.1038/s41586-022-05473-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Mbatchou J., Barnard L., Backman J., Marcketta A., Kosmicki J.A., Ziyatdinov A., Benner C., O’Dushlaine C., Barber M., Boutkov B., et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 2021;53:1097–1103. doi: 10.1038/s41588-021-00870-7. [DOI] [PubMed] [Google Scholar]
  • 69.Wang G., Sarkar A., Carbonetto P., Stephens M. A Simple New Approach to Variable Selection in Regression, with Application to Genetic Fine Mapping. J. R. Stat. Soc. Series B Stat. Methodol. 2020;82:1273–1300. doi: 10.1111/rssb.12388. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Zheng X., Shen J., Cox C., Wakefield J.C., Ehm M.G., Nelson M.R., Weir B.S. HIBAG–HLA genotype imputation with attribute bagging. Pharmacogenomics J. 2014;14:192–200. doi: 10.1038/tpj.2013.18. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Bulik-Sullivan B.K., Loh P.-R., Finucane H.K., Ripke S., Yang J., Schizophrenia Working Group of the Psychiatric Genomics Consortium. Patterson N., Daly M.J., Price A.L., Neale B.M. LD Score Regression Distinguishes Confounding from Polygenicity in Genome-Wide Association Studies. Nat. Genet. 2015;47:291–295. doi: 10.1038/ng.3211. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.DeBoever C., Tanigawa Y., Aguirre M., McInnes G., Lavertu A., Rivas M.A. Assessing Digital Phenotyping to Enhance Genetic Studies of Human Diseases. Am. J. Hum. Genet. 2020;106:611–622. doi: 10.1016/j.ajhg.2020.03.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Busch R., Kollnberger S., Mellins E.D. HLA associations in inflammatory arthritis: emerging mechanisms and clinical implications. Nat. Rev. Rheumatol. 2019;15:364–381. doi: 10.1038/s41584-019-0219-5. [DOI] [PubMed] [Google Scholar]
  • 74.Queiro R., Morante I., Cabezas I., Acasuso B. HLA-B27 and psoriatic disease: a modern view of an old relationship. Rheumatology. 2016;55:221–229. doi: 10.1093/rheumatology/kev296. [DOI] [PubMed] [Google Scholar]
  • 75.Benros M.E., Waltoft B.L., Nordentoft M., Ostergaard S.D., Eaton W.W., Krogh J., Mortensen P.B. Autoimmune diseases and severe infections as risk factors for mood disorders: a nationwide study. JAMA Psychiatry. 2013;70:812–820. doi: 10.1001/jamapsychiatry.2013.1111. [DOI] [PubMed] [Google Scholar]
  • 76.Glanville K.P., Coleman J.R.I., Hanscombe K.B., Euesden J., Choi S.W., Purves K.L., Breen G., Air T.M., Andlauer T.F.M., Baune B.T., et al. Classical Human Leukocyte Antigen Alleles and C4 Haplotypes Are Not Significantly Associated With Depression. Biol. Psychiatry. 2020;87:419–430. doi: 10.1016/j.biopsych.2019.06.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Wray N.R., Ripke S., Mattheisen M., Trzaskowski M., Byrne E.M., Abdellaoui A., Adams M.J., Agerbo E., Air T.M., Andlauer T.M.F., et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat. Genet. 2018;50:668–681. doi: 10.1038/s41588-018-0090-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Cheng B., Yang J., Cheng S., Pan C., Liu L., Meng P., Yang X., Wei W., Liu H., Jia Y., et al. Associations of classical HLA alleles with depression and anxiety. HLA. 2024;103 doi: 10.1111/tan.15173. [DOI] [PubMed] [Google Scholar]
  • 79.Schrumpf E., Fausa O., Førre O., Dobloug J.H., Ritland S., Thorsby E. HLA antigens and immunoregulatory T cells in ulcerative colitis associated with hepatobiliary disease. Scand. J. Gastroenterol. 1982;17:187–191. doi: 10.3109/00365528209182038. [DOI] [PubMed] [Google Scholar]
  • 80.Kim Y.S., Hurley E.H., Park Y., Ko S. Primary sclerosing cholangitis (PSC) and inflammatory bowel disease (IBD): a condition exemplifying the crosstalk of the gut–liver axis. Exp. Mol. Med. 2023;55:1380–1387. doi: 10.1038/s12276-023-01042-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Conigliaro P., D’Antonio A., Pinto S., Chimenti M.S., Triggianese P., Rotondi M., Perricone R. Autoimmune thyroid disorders and rheumatoid arthritis: A bidirectional interplay. Autoimmun. Rev. 2020;19 doi: 10.1016/j.autrev.2020.102529. [DOI] [PubMed] [Google Scholar]
  • 82.The China Consortium for the Genetics of Autoimmune Thyroid Disease A genome-wide association study identifies two new risk loci for Graves’ disease. Nat. Genet. 2011;43:897–901. doi: 10.1038/ng.898. [DOI] [PubMed] [Google Scholar]
  • 83.Gough S.C.L., Simmonds M.J. The HLA Region and Autoimmune Disease: Associations and Mechanisms of Action. Curr. Genomics. 2007;8:453–465. doi: 10.2174/138920207783591690. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Burton P.R., Clayton D.G., Cardon L.R., Craddock N., Deloukas P., Duncanson A., Kwiatkowski D.P., McCarthy M.I., et al., Wellcome Trust Case Control Consortium, Australo-Anglo-American Spondylitis Consortium TASC Association scan of 14,500 nonsynonymous SNPs in four diseases identifies autoimmunity variants. Nat. Genet. 2007;39:1329–1337. doi: 10.1038/ng.2007.17. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Kurkó J., Besenyei T., Laki J., Glant T.T., Mikecz K., Szekanecz Z. Genetics of rheumatoid arthritis - a comprehensive review. Clin. Rev. Allergy Immunol. 2013;45:170–179. doi: 10.1007/s12016-012-8346-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.Farid N.R., Bear J.C. The human major histocompatibility complex and endocrine disease. Endocr. Rev. 1981;2:50–86. doi: 10.1210/edrv-2-1-50. [DOI] [PubMed] [Google Scholar]
  • 87.Simmonds M.J. GWAS in autoimmune thyroid disease: redefining our understanding of pathogenesis. Nat. Rev. Endocrinol. 2013;9:277–287. doi: 10.1038/nrendo.2013.56. [DOI] [PubMed] [Google Scholar]
  • 88.Smith C.J., Sinnott-Armstrong N., Cichońska A., Julkunen H., Fauman E.B., Würtz P., Pritchard J.K. Integrative analysis of metabolite GWAS illuminates the molecular basis of pleiotropy and genetic correlation. eLife. 2022;11 doi: 10.7554/eLife.79348. [DOI] [PMC free article] [PubMed] [Google Scholar]

Supplementary Materials

Document S1. Figures S1–S17, Notes S1–S6, and supplemental methods
mmc1.pdf (9.1MB, pdf)
Table S1. Description of all 2,459 FinnGen diseases included in the association analyses, including the disease categories, number of cases, and ICD codes

The first column (“NAME”) is the short name assigned by FinnGen, the second (“TAGS”) is the group tag assigned by FinnGen, “LONGNAME” is the full name of the disease, “HD_ICD_10,” “HD_ICD_9,” and “HD_ICD_8” are the corresponding ICD-10, ICD-9, and ICD-8 codes, “category” is the FinnGen description of the disease tag, “num_cases” is the number of cases for that disease, “num_controls” is the number of controls, “lambda” is the lambda from the FinnGen GWAS runs, “HLA_hits” is the number of conditionally independent SNP associations identified in this study within the HLA boundaries, “non_HLA_hits” is the number of fine-mapped variants for all SNPs outside the HLA region, “group” is the grouping used for the enrichment analysis, based on the FinnGen groupings, and “ManualCategory” is the manually curated grouping based on shared pathophysiology for the diseases identified as HLA associated by the SNP analysis. For diseases whose underlying pathophysiologic mechanism is unknown, or that can have multiple causes, “ManualCategory” is recorded as “Organ.” “ManualSubcategory” matches “ManualCategory” except for diseases recorded as “Organ,” for which it lists the part of the body or organ primarily affected. “Plot” is the concise interpretable name used to plot the 269 diseases included in the main haplotype group regression analysis, results of which are shown in Figure 5. In addition, full details of how FinnGen defined each disease are available at https://r10.risteys.finngen.fi/.

mmc2.zip (104.5KB, zip)
Table S2. Enrichment of GWAS hits in the HLA region for each disease group with at least ten diseases in the group, for all diseases in FinnGen with at least one GWAS hit (MAF > 1%) anywhere in the genome

The first column (“group”) corresponds to the disease group, the second column (“enrich”) has the enrichment for that disease group, the third column (“n”) has the number of diseases in that disease group, the fourth column (“HLA_hits”) has the mean number of independent SNP associations in the HLA region for diseases in that disease group, the fifth column (“non_HLA_hits”) has the mean number of independent SNP associations outside the HLA region for diseases in that disease group, the sixth column (“se_HLA”) has the standard error of the number of independent SNPs in the HLA region across diseases in that disease group, and the last column (“se_non_HLA”) has the standard error of the number of independent SNPs outside the HLA region across diseases in that disease group. The second tab has the same format but includes results from the repeated analysis using the 644 diseases that remained after randomly removing one disease for each pair with an LDSC genetic correlation >0.95.
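
As an illustration of how one row of this table relates to per-disease hit counts, the following minimal Python sketch computes the group size, the mean number of hits, and a standard error for a single disease group. It assumes the standard error is the standard error of the mean (sample standard deviation divided by the square root of the number of diseases), which is not stated explicitly here. The function name `summarize_group` and the toy counts are hypothetical, and the “enrich” column is deliberately not reproduced because its formula is not given in this description.

```python
import math
import statistics


def summarize_group(hla_hits, non_hla_hits):
    """Summarize per-disease hit counts for one disease group.

    hla_hits / non_hla_hits: one count per disease in the group.
    Assumption: "standard error" means stdev / sqrt(n) across diseases.
    """
    if len(hla_hits) != len(non_hla_hits):
        raise ValueError("expected one HLA and one non-HLA count per disease")

    def mean_and_se(values):
        mean = statistics.fmean(values)
        # The sample standard deviation needs at least two observations.
        if len(values) < 2:
            return mean, float("nan")
        return mean, statistics.stdev(values) / math.sqrt(len(values))

    hla_mean, se_hla = mean_and_se(hla_hits)
    non_hla_mean, se_non_hla = mean_and_se(non_hla_hits)
    return {
        "n": len(hla_hits),
        "HLA_hits": hla_mean,
        "non_HLA_hits": non_hla_mean,
        "se_HLA": se_hla,
        "se_non_HLA": se_non_hla,
    }


# Toy group of three diseases (made-up counts, not values from the table).
row = summarize_group([2, 4, 6], [10, 20, 30])
```

The dictionary keys mirror the column names described above so that the output can be compared against a row of the table.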

mmc3.xlsx (25.5KB, xlsx)
Table S3. Regression results for the SNP-disease associations for significant SNP associations remaining after stepwise conditional analysis in the HLA region

The first column has the FinnGen short name for the disease, the second column has the longer name of the disease, the third has the SNP ID, the fourth has the position of the SNP, the fifth has the reference allele, the sixth has the alternative allele, the seventh has the allele frequency of the alternative allele, the eighth has the rsID of the SNP, and the next five columns have the beta, standard error, Z score, and p value for the association of that SNP with that disease. The next column (“nearest_genes”) has the gene nearest to the SNP, followed by the conditional analysis round in which the SNP was found to be independently significant for the disease (“round”) and the variant annotation for the SNP (“annot”). This table includes the 1,064 disease associations from the full SNP conditional analysis with all 572 diseases. The conditional analysis restricted to the 269 diseases included in the main analysis, after removing redundant traits, resulted in 540 disease associations across 428 unique SNPs.
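
For readers who want to sanity-check the association columns, the sketch below shows the usual Wald-type relationship between a beta, its standard error, the Z score, and a two-sided p value under a standard normal approximation. This is an assumption for illustration only: the description above does not state how the p values in the table were computed, so small discrepancies with the tabulated values would not be surprising. The function name `wald_z_and_p` is hypothetical.

```python
import math


def wald_z_and_p(beta, se):
    """Return (Z, two-sided p) assuming Z = beta / se is standard normal."""
    if se <= 0:
        raise ValueError("standard error must be positive")
    z = beta / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Toy values (not taken from the table).
z, p = wald_z_and_p(0.5, 0.1)
```

Using `math.erfc` rather than `1 - erf` keeps the tail probability accurate for the very small p values typical of the HLA region.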

mmc4.xlsx (160.8KB, xlsx)
Table S4. Haplotype group and individual haplotype statistics and assignments

The first tab (“allhaplotype_group_stats”) has the total doses of each haplotype group present in the dataset for each block. Tabs 2–4 have the haplotype information, with one tab for each block. The first 1,000 columns correspond to the 1,000 SNPs in the haplotype, where 0 corresponds to the reference allele and 1 corresponds to the alternative allele; the second-to-last column (“total_doses”) has the total doses of that haplotype, and the last column (“haplotype_group”) has the haplotype group to which each haplotype belonged for that block. In line with FinnGen privacy policy, total doses for individual haplotypes are included only for haplotypes with >10 total doses.
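
The layout of the per-block tabs can be illustrated with a short Python sketch that sums individual haplotype doses within each haplotype group. The record layout (a tuple of 0/1 alleles, a dose count, and a group label) and the function name `group_total_doses` are hypothetical stand-ins for the spreadsheet columns, and the toy haplotypes use 4 SNPs instead of 1,000. Because only haplotypes with >10 total doses are listed individually, sums computed this way from tabs 2–4 would presumably be lower bounds on the group totals in the first tab.

```python
from collections import defaultdict


def group_total_doses(haplotypes):
    """Sum total doses per haplotype group.

    haplotypes: iterable of (alleles, total_doses, haplotype_group), where
    alleles is a sequence of 0 (reference) / 1 (alternative) calls.
    """
    totals = defaultdict(int)
    for alleles, total_doses, group in haplotypes:
        # Guard against values outside the 0/1 encoding described above.
        if any(a not in (0, 1) for a in alleles):
            raise ValueError("alleles must be coded as 0 or 1")
        totals[group] += total_doses
    return dict(totals)


# Toy block with three haplotypes in two groups (made-up values).
toy_block = [
    ((0, 1, 1, 0), 120, "A"),
    ((0, 1, 0, 0), 30, "A"),
    ((1, 0, 0, 1), 45, "B"),
]
totals = group_total_doses(toy_block)
```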

mmc5.xlsx (61MB, xlsx)
Table S5. Results from the haplotype groups association analyses across all three blocks

The first tab (“main_hapgroup_reg_results”) has the full results from the main haplotype group association analysis, with the first column (“traits”) indicating the disease, the second (“hapgroup”) referring to the haplotype group, the next column (“Z_rescaled”) referring to the values of the regression Z scores rescaled to add back in the dropped haplotype group for each block, the next column (“plotted”) corresponding to whether or not that association is plotted in the heatmap of Figure 5, and the final column (“block”) indicating the block in the HLA region in which the haplotype group was identified. The next two tabs have the (non-rescaled) regression results for all diseases, with joint modeling to condition on the relevant classical HLA alleles in the block (tab called “allregresults_adjallelesinblock”) and without it (tab called “all_hapgroup_reg_results_sig”).

mmc6.xlsx (1.1MB, xlsx)
Table S6. Regression results for all significant allele associations for all diseases, for both the approach jointly modeling alleles within a given block together (tab 1; tab called “alleles_vifindep_joint”) and for the approach with one individual allele per regression (tab 2; tab called “alleles_indiv”)

For both tabs, the first column has the disease name, the next four columns have the beta, standard error, Z score, and p value, respectively, for the association of that allele with that disease, and the sixth column has the allele. In the first tab, the last column has the number of the block that contains the allele.

mmc7.xlsx (570.6KB, xlsx)
Table S7. Haplotype-disease association results in UK Biobank

Tab 1 (called “ukbhapassociations”) has the results for blocks 1–3 for the associations plotted in the UKBB heatmap, with the first column (“traits”) indicating the disease code, the second (“hapgroup”) referring to the haplotype group, the next column (“Z”) referring to the values of the regression Z scores, the next column (“LONG_NAME”) referring to the long name for the disease, the next column (“N”) referring to the number of cases for that disease, the next column (“Plot”) referring to the concise interpretable name used for the heatmap plot label, and the last column (“block”) indicating the block to which the haplotype group belonged. Tab 2 (called “hapgroupdoses”) has the number of haplotypes included in each haplotype group.

mmc8.xlsx (59.1KB, xlsx)
Table S8. Haplotype-disease association results in UK Biobank using UK Biobank haplotypes mapped onto the original FinnGen haplotype groups

Tab 1 has the UKBB results for blocks 1–3 comparing the Z scores for the equivalent haplotype group-disease association in FinnGen, with the first column (“block”) indicating the block the haplotype group belonged to, the second (“hapgroup”) referring to the haplotype group, the third (“finngen_trait”) indicating the disease code in FinnGen, the fourth (“finngen_Z”) referring to the values of the FinnGen regression Z scores, the fifth (“ukb_trait”) indicating the disease code in UKBB, and the last column (“ukb_Z”) indicating the values of the UKBB regression Z scores. Tab 2 has the total doses in UKBB for each of the original FinnGen haplotype groups. Tab 3 has the replication analysis results for the original Pearson’s correlation analysis across haplotype groups for all pairwise combinations of the diseases mapped between FinnGen and UKBB.
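
To make the comparison in this table concrete, the following Python sketch computes a Pearson correlation between two vectors of Z scores indexed by the same haplotype groups, for example the “finngen_Z” and “ukb_Z” values of one mapped disease pair. It is a plain textbook implementation under the assumption that both vectors are aligned by haplotype group. The function name `pearson_r` and the toy Z scores are hypothetical and are not values from the table.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Toy Z scores for four haplotype groups, aligned by group (made-up values).
finngen_z = [3.1, -2.0, 0.4, 5.2]
ukb_z = [2.5, -1.1, 0.1, 4.0]
r = pearson_r(finngen_z, ukb_z)
```

A value of `r` near 1 would indicate that haplotype groups with strong associations in one cohort tend to show same-direction associations in the other.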

mmc9.xlsx (82KB, xlsx)
Table S9. List of FinnGen contributors
mmc10.xlsx (40KB, xlsx)
Document S2. Article plus supplemental information
mmc11.pdf (18.7MB, pdf)

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics
