eLife. 2025 Jun 3;13:RP99200. doi: 10.7554/eLife.99200

Confirmation of HLA-II associations with TB susceptibility in admixed African samples

Dayna Adrienne Croock 1, Yolandi Swart 1, Haiko Schurz 1, Desiree C Petersen 1, Marlo Möller 1,2, Caitlin Uren 1,2
Editor: Bavesh D Kana
PMCID: PMC12133154  PMID: 40458991

Abstract

Previously, the International Tuberculosis Host Genetics Consortium (ITHGC) demonstrated the power of large-scale GWAS analysis across diverse ancestries in identifying tuberculosis (TB) susceptibility loci (Schurz et al., 2024). Despite identifying a significant genetic correlate in the human leukocyte antigen (HLA)-II region, this association did not replicate in the African ancestry-specific analysis, due to small sample size and the inclusion of admixed samples. Our study aimed to build upon the findings from the ITHGC and identify TB susceptibility loci in an admixed South African cohort using the local ancestry allelic adjusted association (LAAA) model. We identified a suggestive association peak (rs3117230, p-value = 5.292 × 10⁻⁶, OR = 0.437, SE = 0.182) in the HLA-DPB1 gene originating from KhoeSan ancestry. These findings extend the work of the ITHGC, underscore the need for innovative strategies in studying complex admixed populations, and confirm the role of the HLA-II region in TB susceptibility in admixed South African samples.

Research organism: Human

Introduction

Tuberculosis (TB) is a communicable disease caused by Mycobacterium tuberculosis (M.tb) (World Health Organization, 2023). M.tb infection has a wide range of clinical manifestations from asymptomatic, non-transmissible, or so-called ‘latent’, infections to active TB (Zaidi et al., 2023). Approximately 1/4 of the global population is infected with M.tb, but only 5–15% of infected individuals will develop active TB (Menzies et al., 2021). Several factors increase the risk of progressing to active TB, including co-infection with HIV and comorbidities, such as diabetes mellitus, asthma and other airway and lung diseases (Glaziou et al., 2018). Socio-economic factors including smoking, malnutrition, alcohol abuse, intravenous drug use, prolonged residence in a high burdened community, overcrowding, informal housing and poor sanitation also influence M.tb transmission and infection (Cudahy et al., 2020; Escombe et al., 2019; Laghari et al., 2019; Matose et al., 2019; Smith et al., 2023). Additionally, individual variability in infection and disease progression has been attributed to variation in the host genome (Schurz et al., 2024; Uren et al., 2020; Verhein et al., 2018; Uren et al., 2021). Numerous genome-wide association studies (GWASs) investigating TB susceptibility have been conducted across different population groups. However, findings from these studies often do not replicate across population groups (Möller and Kinnear, 2020; Möller et al., 2018; Uren et al., 2017). This lack of replication could be caused by small sample sizes, variation in phenotype definitions among studies, variation in linkage disequilibrium (LD) patterns across different population groups and the presence of population-specific effects (Möller and Kinnear, 2020). Additionally, complex LD patterns within population groups, produced by admixture, impede the detection of statistically significant loci when using traditional GWAS methods (Swart et al., 2020).

The International Tuberculosis Host Genetics Consortium (ITHGC) performed a meta-analysis of TB GWAS results including 14 153 TB cases and 19 536 controls of African, Asian and European ancestries (Schurz et al., 2024). The multi-ancestry meta-analysis identified one genome-wide significant variant (rs28383206) in the human leukocyte antigen (HLA)-II region (p = 5.2 × 10⁻⁹, OR = 0.89, 95% CI = 0.84–0.95). The association peak at the HLA-II locus encompassed several genes encoding crucial antigen presentation proteins (including HLA-DR and HLA-DQ). While ancestry-specific association analyses in the European and Asian cohorts also produced suggestive peaks in the HLA-II region, the African ancestry-specific association test did not yield any significant associations or suggestive peaks. The authors described possible reasons for the lack of associations, including the smaller sample size compared to the other ancestry-specific meta-analyses, increased genetic diversity within African individuals and population stratification produced by two admixed cohorts from the South African Coloured (SAC) population (Schurz et al., 2024). The SAC population (as termed in the South African census; Lehohla, 2012) forms part of a multi-way (up to five-way) admixed population with ancestral contributions from Bantu-speaking African (~30%), KhoeSan (~30%), European (~20%), and East (~10%) and Southeast Asian (~10%) populations (Chimusa et al., 2013). The diverse genetic background of admixed individuals can lead to population stratification, potentially introducing confounding variables. However, the power to detect statistically significant loci in admixed populations can be improved by leveraging admixture-induced local ancestry (Swart et al., 2021; Swart et al., 2022a).
Since previous computational algorithms did not include local ancestry as a covariate for GWASs, the local ancestry allelic adjusted association model (LAAA) was developed to overcome this limitation (Duan et al., 2018). The LAAA model identifies ancestry-specific alleles associated with the phenotype by including the minor alleles and the corresponding ancestry of the minor alleles (obtained by local ancestry inference) as covariates. The LAAA model has been successfully applied in a cohort of multi-way admixed SAC individuals to identify novel variants associated with TB susceptibility (Swart et al., 2021; Swart et al., 2022b).

Our study builds upon the findings from the ITHGC (Schurz et al., 2024) and aims to resolve the challenges faced in African ancestry-specific association analysis. Here, we explore host genetic correlates of TB in a complex admixed SAC population using the LAAA model.

Results

Global and local ancestry inference

After close inspection of global ancestry proportions generated using ADMIXTURE, the number of contributing ancestries (K, chosen as the value with the lowest cross-validation error) was K=3 for the Xhosa individuals and K=5 for the SAC individuals (Figure 1). This is consistent with previous global ancestry deconvolution results (Chimusa et al., 2014; Choudhury et al., 2021). It is evident that our cohort is a complex, highly admixed group with ancestral contributions from the indigenous KhoeSan (~22–30%), Bantu-speaking African (~30–72%), European (~5–24%), Southeast Asian (~11%), and East Asian (~5%) population groups.

Figure 1. Genome-wide ancestral proportions of all individuals in the merged dataset.

Ancestral proportions for each individual are plotted vertically with different colours representing different contributing ancestries.

Local ancestry was estimated for all individuals. Admixture between geographically distinct populations creates complex ancestral and admixture-induced LD blocks, which can be visualised using local ancestry karyograms. Figure 2 shows karyograms for three individuals from the merged dataset. It is evident that, despite individuals being from the same population group, each possesses unique patterns of local ancestry arising from differing numbers and lengths of ancestral segments.

Figure 2. Local ancestry karyograms of three admixed individuals from the SAC population.

Each admixed individual (A, B and C) has unique local ancestry patterns generated by admixture among geographically distinct ancestral population groups.

Local ancestry-allelic adjusted analysis

LAAA models were successfully applied for all five contributing ancestries (KhoeSan, Bantu-speaking African, European, East Asian and Southeast Asian). Although no variants reached genome-wide significance, a suggestive peak was identified in the HLA-II region of chromosome 6 when using the LAAA model and adjusting for KhoeSan ancestry (Figure 3). The QQ-plot suggested minimal genomic inflation, which was verified by calculating the genomic inflation factor (λ=1.05289; Figure 3—figure supplement 1). The lead variants identified using the LAAA model whilst adjusting for KhoeSan ancestry in this region on chromosome 6 are summarised in Table 1. The suggestive peak encompasses the HLA-DPA1/B1 (major histocompatibility complex, class II, DP alpha 1/beta 1) genes (Figure 4). It is noteworthy that without the LAAA model, this suggestive peak would not have been observed for this cohort. This highlights the importance of utilising the LAAA model in future association studies when investigating disease susceptibility loci in admixed individuals, such as the SAC population.
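The genomic inflation factor quoted above is conventionally the median observed 1-df chi-square statistic divided by its expected median under the null. The following is a minimal sketch of that calculation using only the Python standard library; the function name and the uniform example p-values are illustrative assumptions and are not taken from the study's code.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Return lambda_GC: median observed 1-df chi-square over its expected median.

    Assumes every p-value lies strictly between 0 and 1 and is not so small
    that 1 - p/2 rounds to exactly 1.0 in floating point.
    """
    std_normal = NormalDist()
    # Convert each two-sided p-value to a 1-df chi-square statistic (z squared).
    chi2 = [std_normal.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    # Median of a 1-df chi-square distribution (about 0.455).
    expected_median = std_normal.inv_cdf(0.75) ** 2
    return median(chi2) / expected_median


# Hypothetical, perfectly uniform p-values: no inflation is expected.
uniform_p = [(i + 0.5) / 1001 for i in range(1001)]
lam = genomic_inflation(uniform_p)
print(f"lambda_GC = {lam:.3f}")
```

Values close to 1, such as the λ reported here, indicate that the bulk of the test statistics follow the null distribution.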

Figure 3. Log transformation of association signals obtained for KhoeSan ancestry whilst using the LAAA model on chromosome 6.

The thresholds for genome-wide significance (p-value = 5 × 10⁻⁸) and suggestive significance (p-value = 1 × 10⁻⁵) and the significance threshold for admixture mapping (p-value = 2.5 × 10⁻⁶) are shown. The four different models are represented in black (global ancestry only - GAO), blue (local ancestry effect - LAO), orange (ancestry plus allelic effect - APA), and pink (local ancestry adjusted allelic effect - LAAA).

Figure 3—figure supplement 1. QQ-plot of expected p-values and observed p-values for the association signals obtained for KhoeSan ancestry located on chromosome 6.

Table 1. Suggestive associations (p-value < 1 × 10⁻⁵) for the LAAA analysis adjusting for KhoeSan local ancestry on chromosome 6.

Position | Marker name | Ref | Alt | AltFreq | OR (95% CI) | SE | p-value (×10⁻⁶) | Gene | Location | Imputed/typed | INFO score
33075635 | rs3117230 | A | G | 0.370 | 0.437 (0.306; 0.624) | 0.182 | 5.292 | HLA-DPB1 | Intergenic | Genotyped | NA
33048661 | rs1042151 | A | G | 0.325 | 0.437 (0.305; 0.627) | 0.184 | 6.806 | HLA-DPB1 | Exonic | Imputed | 0.992
33058874 | rs2179920 | C | T | 0.369 | 0.445 (0.313; 0.633) | 0.180 | 6.960 | HLA-DPB1 | Intergenic | Genotyped | NA
33072266 | rs2064478 | C | T | 0.371 | 0.447 (0.313; 0.637) | 0.181 | 8.222 | HLA-DPB1 | Intergenic | Imputed | 1
33072729 | rs3130210 | G | T | 0.371 | 0.447 (0.313; 0.637) | 0.181 | 8.222 | HLA-DPB1 | Intergenic | Imputed | 0.999
33073440 | rs2064475 | G | A | 0.371 | 0.447 (0.313; 0.637) | 0.181 | 8.222 | HLA-DPB1 | Intergenic | Imputed | 1
33074348 | rs3117233 | T | C | 0.371 | 0.447 (0.313; 0.637) | 0.181 | 8.222 | HLA-DPB1 | Intergenic | Imputed | 1
33074707 | rs3130213 | G | A | 0.371 | 0.447 (0.313; 0.637) | 0.181 | 8.222 | HLA-DPB1 | Intergenic | Imputed | 0.970

Ref, reference allele; Alt, alternate allele; AltFreq, alternate allele frequency; OR, odds ratio; SE, standard error.
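The odds ratios, standard errors and confidence intervals in Table 1 are linked through the usual Wald relationship on the log-odds scale. The sketch below reproduces the interval for the lead variant under the assumption that the reported SE is the standard error of the log odds ratio; the function name is hypothetical, and the Wald p-value it returns is only an approximation that need not match the model-based p-value in the table exactly.

```python
import math


def odds_ratio_summary(odds_ratio, se_log_or, z_crit=1.96):
    """Wald 95% CI, z-score and two-sided p-value from an OR and SE(log OR)."""
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z_crit * se_log_or)
    upper = math.exp(log_or + z_crit * se_log_or)
    z = log_or / se_log_or
    # Two-sided normal tail probability: P(|Z| > |z|).
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return lower, upper, z, p_two_sided


# Lead variant rs3117230 from Table 1 (OR = 0.437, SE = 0.182).
lower, upper, z, p = odds_ratio_summary(0.437, 0.182)
print(f"95% CI = ({lower:.3f}; {upper:.3f}), z = {z:.2f}, p = {p:.2e}")
```

An odds ratio below 1 with a confidence interval that excludes 1 is what underlies the protective interpretation of these variants.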

Figure 4. Regional plot indicating the nearest genes in the region of the lead variant (rs3117230) observed on chromosome 6.

SNPs in linkage disequilibrium (LD) with the lead variant are coloured red/orange. The lead variant is indicated in purple. Functional protein-coding genes are coded in red and non-functional (pseudo-genes) are indicated in black.

The lead variant within this suggestive peak lies within COL11A2P1 (collagen type XI alpha 2 pseudogene 1). COL11A2P1 is an unprocessed pseudogene (ENSG00000228688). Unprocessed pseudogenes are seldom transcribed and translated into functional proteins (Witek and Mohiuddin, 2024). HLA-DPB1 and HLA-DPA1 are the closest functional protein-coding genes to our lead variants. The lead variant identified in the ITHGC meta-analysis, rs28383206, was not present in our genotype or imputed datasets. The ITHGC imputed genotypes using the 1000 Genomes (1000G) reference panel (Schurz et al., 2024). The lead variant, rs28383206, has an alternate allele frequency of 11.26% in the African population subgroup within the 1000G dataset (https://www.ncbi.nlm.nih.gov/snp/rs28383206). However, rs28383206 is absent from our in-house whole-genome sequencing (WGS) datasets, which include Bantu-speaking African and KhoeSan individuals. This absence suggests that rs28383206 might not have been imputed in our datasets using the AGR reference panel, potentially due to its low alternate allele frequency in southern African populations. Our merged dataset contained two variants located within 800 base pairs of rs28383206: rs482205 (6:32576009) and rs482162 (6:32576019). However, these variants were not significantly associated with TB status in our cohort (Supplementary file 1).

Discussion

The LAAA analysis of host genetic susceptibility to TB, involving 952 TB cases and 592 controls, identified one suggestive association peak adjusting for KhoeSan local ancestry. The association peak identified in this study encompasses the HLA-DPB1 gene, a highly polymorphic locus, with over 2000 documented allelic variants (Robinson et al., 2020). This association is noteworthy given that HLA-DPB1 alleles have been associated with TB resistance (Dawkins et al., 2022; Ravikumar et al., 1999; Selvaraj et al., 2008). The direction of effect of the lead variants in our study (Table 1) similarly suggests a protective effect against developing active TB. However, variants in HLA-DPB1 were not identified in the ITHGC meta-analysis.

The ITHGC did not identify any significant associations or suggestive peaks in their African ancestry-specific analyses. Notably, the suggestive peak in the HLA-DPB1 region was only captured in our cohort using the LAAA model whilst adjusting for KhoeSan local ancestry. This underscores the importance of incorporating global and local ancestry in association studies investigating complex multi-way admixed individuals, as the genetic heterogeneity present in admixed individuals (produced as a result of admixture-induced and ancestral LD patterns) may cause association signals to be missed when using traditional association models (Duan et al., 2018; Swart et al., 2022b).

We did not replicate the significant association signal in HLA-DRB1 identified by the ITHGC. However, the ITHGC also did not replicate this association in their own African ancestry-specific analysis. The significant association, rs28383206, identified by the ITHGC meta-analysis appears to be tagging the HLA-DQA1*02:01 allele, which is associated with TB in Icelandic and Asian populations (Li et al., 2021; Sveinbjornsson et al., 2016; Zheng et al., 2018). It is possible that this association signal is specific to non-African populations, but additional research is required to verify this hypothesis. Both our study and the ITHGC independently pinpointed variants associated with TB susceptibility in different genes within the HLA-II locus (Figure 5). The HLA-II region spans ~0.8 Mb on chromosome 6p21.32 and encompasses the HLA-DP, -DR, and -DQ alpha and beta chain genes. The HLA-II complex is the human form of the major histocompatibility complex class II (MHC-II) proteins on the surface of antigen presenting cells, such as monocytes, dendritic cells and macrophages. The innate immune response against M.tb involves phagocytosis by alveolar macrophages. In the phagosome, mycobacterial antigens are processed for presentation on MHC-II on the surface of the antigen presenting cell. Previous studies have suggested that M.tb interferes with the MHC-II pathway to enhance intracellular persistence and delay activation of the adaptive immune response (Oliveira-Cortez et al., 2016). For example, M.tb can inhibit phagosome maturation and acidification, thereby limiting antigen processing and presentation on MHC-II molecules (Chang et al., 2005).
Given that MHC-II plays an essential role in the adaptive immune response to TB and numerous studies have identified HLA-II variants associated with TB (Cai et al., 2019; Chihab et al., 2023; de Sá et al., 2020; Harishankar et al., 2018; Schurz et al., 2024; Selvaraj et al., 2008), additional research is required to elucidate the effects of HLA-II variation on TB risk status.

Figure 5. A schematic diagram of the location of HLA-II genes associated with TB susceptibility.

Genes in red were identified by the ITHGC. Genes in blue were identified by this study.

This analysis has a few limitations. First, unlike the ITHGC manuscript, we did not validate our SNP peak in the HLA-II region through fine mapping. Although we initially considered performing HLA imputation and fine-mapping using the HIBAG R package, as described in the ITHGC article (https://hibag.s3.amazonaws.com/hlares_index.html#estimates), the African HIBAG model was trained on genotype data from African American and HapMap YRI populations, which have minimal to no KhoeSan ancestry. Since our association peak likely originates from KhoeSan ancestral haplotype blocks, using an imputation reference panel that includes individuals with KhoeSan ancestry is essential to this analysis. We acknowledge that HLA typing could validate the importance of our lead SNPs in the HLA-II region and support the LAAA model, but this was not feasible due to the absence of a suitable reference panel that includes KhoeSan ancestry. Second, our analysis has a notable case-control imbalance (cases/controls = 1.610). While many studies discuss methods for addressing case-control imbalances with more controls than cases, which can inflate type I error rates (Dai et al., 2021; Öztornaci et al., 2023; Zhou et al., 2018), few address the implications of a large case-to-control ratio like ours (952 cases to 592 controls). To assess the impact of this imbalance, we used the Michigan genetic association study (GAS) power calculator (Skol et al., 2006). Under an additive disease model with an estimated prevalence of 0.15, a disease allele frequency of 0.3, a genotype relative risk of 1.5, and a default significance level of 7×10⁻⁶, we achieved an expected power of approximately 75%. With a balanced sample size of 950 cases and 950 controls, power would exceed 90%, but it would drop significantly with a smaller balanced cohort of 590 cases and 590 controls. Given these results, we proceeded with our analysis to maximise statistical power despite the case-control imbalance.
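The ordering of the three designs compared above can be illustrated with the effective sample size commonly used for case-control studies, N_eff = 4 / (1/N_cases + 1/N_controls). This is a rough heuristic sketch only: it is not the GAS power calculator, it yields no power estimates, and the function name is hypothetical.

```python
def effective_sample_size(n_cases, n_controls):
    """Effective sample size of a case-control design: 4 / (1/cases + 1/controls)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


# The three designs discussed in the text.
designs = {
    "observed 952/592": (952, 592),
    "balanced 950/950": (950, 950),
    "balanced 590/590": (590, 590),
}
n_eff = {name: effective_sample_size(*counts) for name, counts in designs.items()}

for name, value in n_eff.items():
    print(f"{name}: N_eff = {value:.0f}")
```

Under this heuristic the imbalanced cohort sits between the two balanced designs, which agrees with the decision to retain all cases rather than downsample to 590 per group.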

In conclusion, the application of the LAAA to a highly admixed SAC cohort revealed a suggestive association signal in the HLA-II region associated with protection against TB that was not identified by the African-ancestry specific analysis performed by the ITHGC. Our study builds on the results of the ITHGC by demonstrating an alternative method to identify association signals in cohorts with complex genetic ancestry. This analysis shows the value of including individual global and local ancestry in genetic association analyses. Furthermore, we confirm HLA-II loci associations with TB susceptibility in an admixed South African population, highlighting the role of the adaptive immune system in TB susceptibility and resistance.

Materials and methods

Data

This study included the two SAC admixed datasets from the ITHGC analysis [RSA(A) and RSA(M)] as well as four additional TB case-control datasets obtained from admixed South African population groups (Table 2). Like the SAC population, the Xhosa population is admixed with Bantu-speaking African and KhoeSan ancestral contributions (Choudhury et al., 2021). All datasets were collected over the past 30 years under different research projects (Daya et al., 2013; Kroon et al., 2020; Schurz et al., 2018; Smith et al., 2023; Ugarte-Gil et al., 2020) and individuals that were included in the analyses consented to the use of their data in future research regarding TB host genetics. Across all datasets, TB cases were bacteriologically confirmed (culture positive) or diagnosed by GeneXpert. Controls were healthy individuals with no history of TB disease or treatment. However, given the high prevalence of TB in South Africa (852 cases [95% CI 679–1026] per 100,000 individuals 15 years and older; Cudahy et al., 2020), most controls have likely been exposed to M.tb at some point (Gallant et al., 2010). For all datasets, cases and controls were obtained from the same community and thus share similar socio-economic status and health care access.

Table 2. Summary of the datasets included in analysis.

Dataset | Genotyping platform | Self-reported ethnicity | Cases/controls | Reference
RSA(A) | Affymetrix 500 k | SAC | 642/91 | Daya et al., 2013
RSA(M) | MEGA array 1.1 M | SAC | 555/440 | Schurz et al., 2018; Swart et al., 2021
RSA(TANDEM) | H3Africa array | SAC and Bantu-speaking African | 161/133 | Swart et al., 2022b
RSA(NCTB) | H3Africa array | SAC | 49/111 | Oyageshio et al., 2023
RSA(Worcester) | H3Africa array | SAC | 61 cases | Unpublished
RSA(Xhosa) | Whole genome sequencing | IsiXhosa | 44/120 | Unpublished

A list of sites genotyped on the Infinium H3Africa array (https://chipinfo.h3abionet.org/browse) was extracted from the whole-genome sequenced [RSA(Xhosa)] dataset and treated as genotype data in subsequent analyses. Quality control (QC) of raw genotype data was performed using PLINK v1.9 (Purcell et al., 2007). In all datasets, individuals were screened for sex concordance and discordant sex information was corrected based on X chromosome homozygosity estimates (F estimate < 0.2 for females and F estimate > 0.8 for males). In the event that sex information could not be corrected based on homozygosity estimates, individuals with missing or discordant sex information were removed. Individuals with genotype call rates less than 90% and SNPs with more than 5% missingness were removed as described previously (Swart et al., 2021). Monomorphic sites were removed. Each SNP was tested for deviation from Hardy-Weinberg Equilibrium (HWE), and sites with HWE p-values below the threshold of 10⁻⁵ were removed. Sex chromosomes were excluded from the analysis. The genome coordinates across all datasets were checked for consistency and, if necessary, converted to GRCh37 using the UCSC liftOver tool (Kuhn et al., 2013). The number of individuals and variants remaining after genotype QC is shown in Supplementary file 2.
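The missingness, HWE and autosome filters described above correspond to standard PLINK 1.9 options. The sketch below assembles one such command line from Python; the file prefixes are placeholders, the function name is hypothetical, and combining the filters in a single call is an assumption made for brevity (the authors may have run them as separate steps, and the sex check and monomorphic-site removal are not shown).

```python
def plink_qc_command(bfile, out, mind=0.10, geno=0.05, hwe=1e-5):
    """Build a PLINK 1.9 argument list for per-sample and per-SNP QC filters."""
    return [
        "plink",
        "--bfile", bfile,        # input .bed/.bim/.fam prefix (placeholder)
        "--mind", str(mind),     # drop individuals with more than 10% missing calls
        "--geno", str(geno),     # drop SNPs with more than 5% missing calls
        "--hwe", str(hwe),       # drop SNPs with HWE p-value below the threshold
        "--autosome",            # keep autosomal variants only
        "--make-bed",            # write a filtered binary fileset
        "--out", out,            # output prefix (placeholder)
    ]


cmd = plink_qc_command("dataset_raw", "dataset_qc")
print(" ".join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` on a machine where PLINK is installed.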

Genotype datasets were pre-phased using SHAPEIT v2 (Delaneau et al., 2013) and imputed using the Positional Burrows-Wheeler Transformation (PBWT) algorithm through the Sanger Imputation Server (SIS; Durbin, 2014). The African Genome Resource (AGR) panel (n=4956), accessed via the SIS, was used as the reference panel for imputation (Gurdasani et al., 2015) since it has been shown that the AGR is the best reference panel for imputation of missing genotypes for samples from the SAC population (Schurz et al., 2019). Imputed data were filtered to remove sites with imputation quality INFO scores less than 0.95. Individual datasets were screened for relatedness using KING software (Manichaikul et al., 2010) and individuals up to second degree relatedness were removed. A total of 7,544,769 markers overlapped across all six datasets. This list of intersecting markers was extracted from each dataset using the PLINK --extract flag. The datasets were then merged using PLINK v1.9. After merging, all individuals missing more than 10% of genotypes were removed, markers with more than 5% missing data were excluded and a HWE filter was applied to controls (threshold < 10⁻⁵). The merged dataset was screened for relatedness using KING, and individuals up to second degree relatedness were subsequently removed. The final merged dataset after QC and data filtering (including the removal of related individuals) consisted of 1544 individuals (952 TB cases and 592 healthy controls). A total of 7,510,057 variants passed QC and filtering parameters.

Global ancestry inference

ADMIXTURE was used to determine the correct number of contributing ancestral proportions in our multi-way admixed population cohort (Alexander and Lange, 2011). ADMIXTURE estimates the number of contributing ancestral populations (denoted by K) and population allele frequencies through cross-validation (CV). All 1544 individuals were grouped into running groups of equal size together with 191 reference individuals (Table 3). Running groups were created to ensure approximately equal numbers of reference and admixed individuals. Xhosa and SAC samples were divided into separate running groups.

Table 3. Ancestral populations included for global ancestry deconvolution.

Population | n | Source
European (British – GBR) | 40 | 1000 Genomes (1000G) phase 3 (Auton et al., 2015)
East Asian (Chinese – CHB) | 40 | 1000G phase 3
Bantu-speaking African (Yoruba – YRI) | 40 | 1000G phase 3
Southeast Asian (Malaysian) | 38 | Singapore Sequencing Malay Project (SSMP) (Wong et al., 2013)
KhoeSan (Nama) | 33 | African Genome Variation Project (AGVP/ADRP) (Gurdasani et al., 2015)

Redundant SNPs were removed by PLINK through LD pruning by removing each SNP with LD r² > 0.1 within a 50-SNP sliding window (advanced by 10 SNPs at a time). Ancestral proportions were inferred in an unsupervised manner for K=3–6 (1 iteration). The best value of K for the data was selected by choosing the K value with the lowest CV error across all running groups. Ten iterations of K=3 and K=5 were run for the Xhosa and SAC individuals respectively. Since it has been shown that RFMix (Maples et al., 2013) outperforms ADMIXTURE in determining global ancestry proportions (Uren et al., 2020), RFMix was also used to refine inferred global ancestry proportions. Global ancestral proportions were visualised using PONG (Behr et al., 2016).
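Selecting K by lowest CV error amounts to reading the "CV error (K=...)" lines that ADMIXTURE prints when cross-validation is enabled and taking the minimum. The following is a minimal sketch for a single running group; the error values are invented for illustration and the function name is hypothetical.

```python
import re

# Matches ADMIXTURE cross-validation output lines such as "CV error (K=3): 0.5312".
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def best_k(log_lines):
    """Return (K with the lowest CV error, {K: CV error}) parsed from log lines."""
    cv_errors = {}
    for line in log_lines:
        match = CV_LINE.search(line)
        if match:
            cv_errors[int(match.group(1))] = float(match.group(2))
    if not cv_errors:
        raise ValueError("no CV error lines found")
    return min(cv_errors, key=cv_errors.get), cv_errors


# Hypothetical CV errors for K=3..6; these are not results from the study.
example_log = [
    "CV error (K=3): 0.5312",
    "CV error (K=4): 0.5287",
    "CV error (K=5): 0.5261",
    "CV error (K=6): 0.5270",
]
k, errors = best_k(example_log)
print(f"Selected K = {k}")
```

With several running groups, the same parsing could be applied per group before comparing the selected K values across groups.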

Local ancestry inference

The merged dataset and the reference file (containing reference populations from Table 3) were phased separately using SHAPEIT2. The local ancestry for each position in the genome was inferred using RFMix (Maples et al., 2013). Default parameters were used, but the number of generations since admixture was set to 15 for the SAC individuals and 20 for the Xhosa individuals (as determined by previous studies) (Uren et al., 2016). RFMix was run with three expectation maximisation iterations and the --reanalyse-reference flag.

Batch effect screening and correction

Merging separate datasets generated at different timepoints and/or facilities, as we have done here, will undoubtedly introduce batch effects. Principal component analysis (PCA) is a common method used to visualise batch effects, where the first two principal components (PCs) are plotted with each sample coloured by batch, and a separation of colours is indicative of a batch effect (Nyamundanda et al., 2017). However, it is difficult to differentiate between separation caused by population structure and separation caused by batch effect using PCA alone. An alternative method to detect batch effects (Chen et al., 2022) involves coding case/control status by batch, followed by running an association analysis testing each batch against all other batches. If any single dataset has more positive signals compared to the other datasets, then batch effects may be responsible for producing spurious results. Batch effects can be resolved by removing those SNPs which pass the genome-wide significance threshold from the merged dataset. We have adapted this batch effect correction method for application in a highly admixed cohort with complex population structure (Croock et al., 2024). Code required to execute batch effect correction procedures is publicly available (https://github.com/TBHostGenetics/data_harmonisation, copy archived at Croock, 2025). Our modified method was used to remove 36 627 SNPs affected by batch effects from our merged dataset.
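The one-batch-versus-rest screening described above can be sketched as two small steps: recode the phenotype so that membership of one batch plays the role of case status, then collect every SNP that reaches genome-wide significance in any such test. This is an illustrative outline of the generic procedure only, not the authors' modified method (which is in their repository); the function names, batch labels and SNP identifiers are hypothetical.

```python
def one_vs_rest_phenotype(batch_labels, target_batch):
    """PLINK-style coding: 2 for samples in the target batch, 1 for all others."""
    return [2 if batch == target_batch else 1 for batch in batch_labels]


def flag_batch_snps(p_values_by_batch, threshold=5e-8):
    """Return SNPs passing the significance threshold in any batch-vs-rest test."""
    flagged = set()
    for p_values in p_values_by_batch.values():
        flagged.update(snp for snp, p in p_values.items() if p < threshold)
    return flagged


# Hypothetical batches and association results, for illustration only.
batches = ["A", "A", "M", "TANDEM", "M"]
pheno_m = one_vs_rest_phenotype(batches, "M")

results = {
    "A": {"snp_1": 3e-9, "snp_2": 0.40},
    "M": {"snp_1": 0.12, "snp_2": 0.73, "snp_3": 1e-10},
}
to_remove = flag_batch_snps(results)
print(sorted(to_remove))
```

A batch that contributes far more flagged SNPs than the others is the pattern the text describes as suggestive of a batch effect.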

Local ancestry allelic adjusted association analysis

The LAAA association model was used to investigate if there are allelic, ancestry-specific or ancestry-specific allelic associations with TB susceptibility in our merged dataset. Global ancestral components inferred by RFMix, age and sex were included as covariates in the association tests (Supplementary file 3). Variants with minor allele frequency (MAF) <1% were removed to improve the stability of the association tests. A total of 784,557 autosomal markers (with MAF >1%) and 1544 unrelated individuals (952 TB cases and 592 healthy controls) were available for further analyses. Of the markers included in the final dataset, 535,193 sites were imputed. Dosage files, which code the number of alleles of a specific ancestry at each locus across the genome, were compiled. Separate regression models for each ancestral contribution were fitted to investigate which ancestral contribution is associated with TB susceptibility. Code required to execute the LAAA model is publicly available (https://github.com/TBHostGenetics/LAAA-model, copy archived at Swart, 2025). Details regarding the models have been described elsewhere (Swart et al., 2022b); but in summary, four regression models were tested to detect the source of the association signals observed:

(1) Null model or global ancestry (GA) model

The null model only includes global ancestry, sex and age covariates. This test investigates whether an additive allelic dose exerts an effect on the phenotype (without including local ancestry of the allele).

(2) Local ancestry (LA) model

This model is used in admixture mapping to identify ancestry-specific variants associated with a specific phenotype. The LA model evaluates the number of alleles of a specific ancestry at a locus and includes the corresponding marginal effect as a covariate in association analyses.

(3) Ancestry plus allelic (APA) model

The APA model simultaneously performs model (1) and (2). This model tests whether an additive allelic dose exerts an effect on the phenotype whilst adjusting for local ancestry.

(4) Local ancestry adjusted allelic (LAAA) model

The LAAA model is an extension of the APA model, which models the combination of the minor allele and ancestry of the minor allele at a specific locus and the effect this interaction has on the phenotype.
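One way to read the four models is in terms of three per-individual predictors at a locus: the minor allele dose, the dose of the ancestry of interest, and the number of minor alleles carried on that ancestry's haplotypes. The sketch below derives these from phased haplotypes and lists which predictors each model would include alongside the global ancestry, sex and age covariates. It is an interpretation of the descriptions above under that assumption, with hypothetical names, and is not the authors' implementation.

```python
def laaa_terms(haplotypes, target_ancestry):
    """Per-locus predictors for one individual.

    haplotypes: two (minor_allele, ancestry) tuples, minor_allele being 0 or 1.
    Returns (allele_dose, ancestry_dose, joint_dose).
    """
    allele_dose = sum(allele for allele, _ in haplotypes)
    ancestry_dose = sum(1 for _, anc in haplotypes if anc == target_ancestry)
    # Minor alleles that sit on a haplotype of the target ancestry.
    joint_dose = sum(allele for allele, anc in haplotypes if anc == target_ancestry)
    return allele_dose, ancestry_dose, joint_dose


# Locus-level predictors assumed for each model (covariates omitted).
MODEL_TERMS = {
    "GA": ("allele_dose",),
    "LA": ("ancestry_dose",),
    "APA": ("allele_dose", "ancestry_dose"),
    "LAAA": ("allele_dose", "ancestry_dose", "joint_dose"),
}

# Two hypothetical individuals at one SNP, testing KhoeSan ancestry.
person_1 = laaa_terms([(1, "KhoeSan"), (1, "European")], "KhoeSan")
person_2 = laaa_terms([(0, "KhoeSan"), (1, "Bantu-speaking African")], "KhoeSan")
print(person_1, person_2)
```

Both individuals carry one KhoeSan haplotype, but only the first carries the minor allele on it, which is the distinction the LAAA term is meant to capture.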

The R package STEAM (Significance Threshold Estimation for Admixture Mapping; Grinde et al., 2019) was used to determine the admixture mapping significance threshold given the global ancestral proportions of each individual and the number of generations since admixture (g=15). For the LA model, a genome-wide significance threshold of p-value < 2.5 × 10⁻⁶ was deemed significant by STEAM. The traditional genome-wide significance threshold of 5 × 10⁻⁸ was used for the GA, APA and LAAA models, as recommended by the authors of the LAAA model (Duan et al., 2018). Results from the analysis performed on chromosome 6 whilst adjusting for KhoeSan ancestry are documented in Supplementary file 4.

Acknowledgements

We acknowledge the support of the DSI-NRF Centre of Excellence for Biomedical Tuberculosis Research, South African Medical Research Council Centre for Tuberculosis Research (SAMRC CTR), Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town, South Africa. We also acknowledge the Centre for High Performance Computing (CHPC), South Africa, for providing computational resources. This research was partially funded by the South African government through the SAMRC and the Harry Crossley Research Foundation.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Caitlin Uren, Email: caitlinu@sun.ac.za.

Bavesh D Kana, University of the Witwatersrand, South Africa.

Funding Information

This paper was supported by the following grants:

  • South African Medical Research Council to Dayna Adrienne Croock.

  • Harry Crossley Foundation to Dayna Adrienne Croock.

Additional information

Competing interests

No competing interests declared.

Author contributions

Formal analysis, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing.

Resources, Supervision, Methodology, Writing – review and editing.

Conceptualization, Supervision, Methodology, Writing – review and editing.

Conceptualization, Supervision, Writing – review and editing.

Conceptualization, Data curation, Supervision, Writing – review and editing.

Conceptualization, Resources, Data curation, Supervision, Project administration, Writing – review and editing.

Ethics

Ethics approval was granted by the Health Research Ethics Committee (HREC) of Stellenbosch University, South Africa (project number S22/02/031). Individuals that were included in the analyses consented to the use of their data in future research regarding TB host genetics.

Additional files

MDAR checklist
Supplementary file 1. Summary statistics for two variants within 800 base pairs of the ITHGC lead SNP (rs28383206) on chromosome 6 for the LAAA analysis adjusting for KhoeSan and Bantu-speaking African local ancestry.
elife-99200-supp1.xlsx (9.1KB, xlsx)
Supplementary file 2. The number of individuals and variants across all array datasets following genotype QC.
elife-99200-supp2.xlsx (9.2KB, xlsx)
Supplementary file 3. Summary of the age, sex and ancestral proportions for individuals in the merged cohort.
elife-99200-supp3.xlsx (120.4KB, xlsx)
Supplementary file 4. Summary statistics for chromosome 6 from the local ancestry adjusted allelic (LAAA) model, adjusted for KhoeSan ancestry.
elife-99200-supp4.xlsx (7.1MB, xlsx)

Data availability

The current manuscript is a computational study, so no new genetic data were generated for this manuscript. Access to the retrospective genetic datasets analysed can be requested through the original studies' data access processes. Where a dataset is yet to be published, access will be considered upon reasonable request in line with the initial participant consent - please email caitlinu@sun.ac.za. Summary statistics for the covariate data for individuals in the cohort are available in Supplementary File 3, and LAAA model results for chromosome 6 (adjusted for KhoeSan ancestry) are available in Supplementary File 4. Code required to perform genotype QC, imputation, ancestry inference and batch effect procedures is publicly available (https://github.com/TBHostGenetics/data_harmonisation; copy archived at Croock, 2025). Code required to execute the LAAA model is publicly available (https://github.com/TBHostGenetics/LAAA-model; copy archived at Swart, 2025).

The following previously published dataset was used:

Oyageshio OP, Myrick JW, Saayman J, van der Westhuizen L, Al-Hindi D, Reynolds AW, Zaitlen N, Uren C, Möller M, Henn BM. 2023. Investigating Host Genetic Risk Factors for Tuberculosis in Highly Endemic South African Populations. European Genome-Phenome Archive. EGAS00001007850

References

  1. Alexander DH, Lange K. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics. 2011;12:246. doi: 10.1186/1471-2105-12-246. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR, 1000 Genomes Project Consortium A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Behr AA, Liu KZ, Liu-Fang G, Nakka P, Ramachandran S. pong: fast analysis and visualization of latent clusters in population genetic data. Bioinformatics. 2016;32:2817–2823. doi: 10.1093/bioinformatics/btw327. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Cai L, Li Z, Guan X, Cai K, Wang L, Liu J, Tong Y. The research progress of host genes and tuberculosis susceptibility. Oxidative Medicine and Cellular Longevity. 2019;2019:9273056. doi: 10.1155/2019/9273056. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Chang ST, Linderman JJ, Kirschner DE. Multiple mechanisms allow Mycobacterium tuberculosis to continuously inhibit MHC class II-mediated antigen presentation by macrophages. PNAS. 2005;102:4530–4535. doi: 10.1073/pnas.0500362102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Chen D, Tashman K, Palmer DS, Neale B, Roeder K, Bloemendal A, Churchhouse C, Ke ZT. A data harmonization pipeline to leverage external controls and boost power in GWAS. Human Molecular Genetics. 2022;31:481–489. doi: 10.1093/hmg/ddab261. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Chihab LY, Kuan R, Phillips EJ, Mallal SA, Rozot V, Davis MM, Scriba TJ, Sette A, Peters B, Lindestam Arlehamn CS, Group SS. Expression of specific HLA class II alleles is associated with an increased risk for active tuberculosis and a distinct gene expression profile. HLA. 2023;101:124–137. doi: 10.1111/tan.14880. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Chimusa ER, Daya M, Möller M, Ramesar R, Henn BM, van Helden PD, Mulder NJ, Hoal EG. Determining ancestry proportions in complex admixture scenarios in South Africa using a novel proxy ancestry selection method. PLOS ONE. 2013;8:e73971. doi: 10.1371/journal.pone.0073971. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Chimusa ER, Zaitlen N, Daya M, Möller M, van Helden PD, Mulder NJ, Price AL, Hoal EG. Genome-wide association study of ancestry-specific TB risk in the South African Coloured population. Human Molecular Genetics. 2014;23:796–809. doi: 10.1093/hmg/ddt462. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Choudhury A, Sengupta D, Ramsay M, Schlebusch C. Bantu-speaker migration and admixture in southern Africa. Human Molecular Genetics. 2021;30:R56–R63. doi: 10.1093/hmg/ddaa274. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Croock D, Swart Y, Schurz H, Petersen DC, Möller M, Uren C. Data harmonization guidelines to combine multi-platform genomic data from admixed populations and boost power in genome-wide association studies. Current Protocols. 2024;4:e1055. doi: 10.1002/cpz1.1055. [DOI] [PubMed] [Google Scholar]
  12. Croock D. Data_harmonisation. swh:1:rev:e9709c5a9257c2622637c418bc410f7b832a5cd7. Software Heritage. 2025. https://archive.softwareheritage.org/swh:1:dir:34c80bd568d8d7bf8deaa5e190c8f35f9bf2caf7;origin=https://github.com/TBHostGenetics/data_harmonisation;visit=swh:1:snp:0bc87001d37daf4994a047e4979bdc48e140b021;anchor=swh:1:rev:e9709c5a9257c2622637c418bc410f7b832a5cd7
  13. Cudahy PGT, Wilson D, Cohen T. Risk factors for recurrent tuberculosis after successful treatment in a high burden setting: a cohort study. BMC Infectious Diseases. 2020;20:789. doi: 10.1186/s12879-020-05515-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Dai X, Fu G, Zhao S, Zeng Y. Statistical learning methods applicable to genome-wide association studies on unbalanced case-control disease data. Genes. 2021;12:736. doi: 10.3390/genes12050736. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Dawkins BA, Garman L, Cejda N, Pezant N, Rasmussen A, Rybicki BA, Levin AM, Benchek P, Seshadri C, Mayanja-Kizza H, Iannuzzi MC, Stein CM, Montgomery CG. Novel HLA associations with outcomes of Mycobacterium tuberculosis exposure and sarcoidosis in individuals of African ancestry using nearest-neighbor feature selection. Genetic Epidemiology. 2022;46:463–474. doi: 10.1002/gepi.22490. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Daya M, van der Merwe L, Galal U, Möller M, Salie M, Chimusa ER, Galanter JM, van Helden PD, Henn BM, Gignoux CR, Hoal E. A panel of ancestry informative markers for the complex five-way admixed South African coloured population. PLOS ONE. 2013;8:e82224. doi: 10.1371/journal.pone.0082224. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Delaneau O, Howie B, Cox AJ, Zagury JF, Marchini J. Haplotype estimation using sequencing reads. American Journal of Human Genetics. 2013;93:687–696. doi: 10.1016/j.ajhg.2013.09.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. de Sá NBR, Ribeiro-Alves M, da Silva TP, Pilotto JH, Rolla VC, Giacoia-Gripp CBW, Scott-Algara D, Morgado MG, Teixeira SLM. Clinical and genetic markers associated with tuberculosis, HIV-1 infection, and TB/HIV-immune reconstitution inflammatory syndrome outcomes. BMC Infectious Diseases. 2020;20:59. doi: 10.1186/s12879-020-4786-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Duan Q, Xu Z, Raffield LM, Chang S, Wu D, Lange EM, Reiner AP, Li Y. A robust and powerful two-step testing procedure for local ancestry adjusted allelic association analysis in admixed populations. Genetic Epidemiology. 2018;42:288–302. doi: 10.1002/gepi.22104. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Durbin R. Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT) Bioinformatics. 2014;30:1266–1272. doi: 10.1093/bioinformatics/btu014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Escombe AR, Ticona E, Chávez-Pérez V, Espinoza M, Moore DAJ. Improving natural ventilation in hospital waiting and consulting rooms to reduce nosocomial tuberculosis transmission risk in a low resource setting. BMC Infectious Diseases. 2019;19:88. doi: 10.1186/s12879-019-3717-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Gallant CJ, Cobat A, Simkin L, Black GF, Stanley K, Hughes J, Doherty TM, Hanekom WA, Eley B, Beyers N, Jaïs J-P, van Helden P, Abel L, Alcaïs A, Hoal EG, Schurr E. Impact of age and sex on mycobacterial immunity in an area of high tuberculosis incidence. The International Journal of Tuberculosis and Lung Disease. 2010;14:952–959. [PubMed] [Google Scholar]
  23. Glaziou P, Floyd K, Raviglione MC. Global epidemiology of tuberculosis. Seminars in Respiratory and Critical Care Medicine. 2018;39:271–285. doi: 10.1055/s-0038-1651492. [DOI] [PubMed] [Google Scholar]
  24. Grinde KE, Brown LA, Reiner AP, Thornton TA, Browning SR. Genome-wide significance thresholds for admixture mapping studies. American Journal of Human Genetics. 2019;104:454–465. doi: 10.1016/j.ajhg.2019.01.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Gurdasani D, Carstensen T, Tekola-Ayele F, Pagani L, Tachmazidou I, Hatzikotoulas K, Karthikeyan S, Iles L, Pollard MO, Choudhury A, Ritchie GRS, Xue Y, Asimit J, Nsubuga RN, Young EH, Pomilla C, Kivinen K, Rockett K, Kamali A, Doumatey AP, Asiki G, Seeley J, Sisay-Joof F, Jallow M, Tollman S, Mekonnen E, Ekong R, Oljira T, Bradman N, Bojang K, Ramsay M, Adeyemo A, Bekele E, Motala A, Norris SA, Pirie F, Kaleebu P, Kwiatkowski D, Tyler-Smith C, Rotimi C, Zeggini E, Sandhu MS. The African genome variation project shapes medical genetics in Africa. Nature. 2015;517:327–332. doi: 10.1038/nature13997. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Harishankar M, Selvaraj P, Bethunaickan R. Influence of genetic polymorphism towards pulmonary tuberculosis susceptibility. Frontiers in Medicine. 2018;5:213. doi: 10.3389/fmed.2018.00213. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Kroon EE, Kinnear CJ, Orlova M, Fischinger S, Shin S, Boolay S, Walzl G, Jacobs A, Wilkinson RJ, Alter G, Schurr E, Hoal EG, Möller M. An observational study identifying highly tuberculosis-exposed, HIV-1-positive but persistently TB, tuberculin and IGRA negative persons with M. tuberculosis specific antibodies in Cape Town, South Africa. EBioMedicine. 2020;61:103053. doi: 10.1016/j.ebiom.2020.103053. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Kuhn RM, Haussler D, Kent WJ. The UCSC genome browser and associated tools. Briefings in Bioinformatics. 2013;14:144–161. doi: 10.1093/bib/bbs038. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Laghari M, Sulaiman SAS, Khan AH, Talpur BA, Bhatti Z, Memon N. Contact screening and risk factors for TB among the household contact of children with active TB: a way to find source case and new TB cases. BMC Public Health. 2019;19:1274. doi: 10.1186/s12889-019-7597-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Lehohla P. South African Census 2011 Meta-data (Report No. 03-01-47; p. 130). South African Census. Statistics South Africa; 2012. [Google Scholar]
  31. Li M, Hu Y, Zhao B, Chen L, Huang H, Huai C, Zhang X, Zhang J, Zhou W, Shen L, Zhen Q, Li B, Wang W, He L, Qin S. A next generation sequencing combined genome-wide association study identifies novel tuberculosis susceptibility loci in Chinese population. Genomics. 2021;113:2377–2384. doi: 10.1016/j.ygeno.2021.05.035. [DOI] [PubMed] [Google Scholar]
  32. Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26:2867–2873. doi: 10.1093/bioinformatics/btq559. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Maples BK, Gravel S, Kenny EE, Bustamante CD. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. American Journal of Human Genetics. 2013;93:278–288. doi: 10.1016/j.ajhg.2013.06.020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Matose MT, Poluta M, Douglas TS. Natural ventilation as a means of airborne tuberculosis infection control in minibus taxis. South African Journal of Science. 2019;115:9/10. doi: 10.17159/sajs.2019/5737. [DOI] [Google Scholar]
  35. Menzies NA, Swartwood N, Testa C, Malyuta Y, Hill AN, Marks SM, Cohen T, Salomon JA. Time since infection and risks of future disease for individuals with Mycobacterium tuberculosis infection in the United States. Epidemiology. 2021;32:70–78. doi: 10.1097/EDE.0000000000001271. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Möller M, Kinnear CJ, Orlova M, Kroon EE, van Helden PD, Schurr E, Hoal EG. Genetic resistance to Mycobacterium tuberculosis infection and disease. Frontiers in Immunology. 2018;9:2219. doi: 10.3389/fimmu.2018.02219. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Möller M, Kinnear CJ. Human global and population-specific genetic susceptibility to Mycobacterium tuberculosis infection and disease. Current Opinion in Pulmonary Medicine. 2020;26:302–310. doi: 10.1097/MCP.0000000000000672. [DOI] [PubMed] [Google Scholar]
  38. Nyamundanda G, Poudel P, Patil Y, Sadanandam A. A novel statistical method to diagnose, quantify and correct batch effects in genomic studies. Scientific Reports. 2017;7:10849. doi: 10.1038/s41598-017-11110-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Oliveira-Cortez A, Melo AC, Chaves VE, Condino-Neto A, Camargos P. Do HLA class II genes protect against pulmonary tuberculosis? A systematic review and meta-analysis. European Journal of Clinical Microbiology & Infectious Diseases. 2016;35:1567–1580. doi: 10.1007/s10096-016-2713-x. [DOI] [PubMed] [Google Scholar]
  40. Oyageshio OP, Myrick JW, Saayman J, van der Westhuizen L, Al-Hindi D, Reynolds AW, Zaitlen N, Uren C, Möller M, Henn BM. Strong effect of demographic changes on tuberculosis susceptibility in South Africa. medRxiv. 2023 doi: 10.1371/journal.pgph.0002643. https://www.medrxiv.org/content/10.1101/2023.11.02.23297990v1 [DOI] [PMC free article] [PubMed]
  41. Öztornaci RO, Syed H, Morris AP, Taşdelen B. The use of class imbalanced learning methods on ULSAM data to predict the case–control status in genome-wide association studies. Journal of Big Data. 2023;10:174. doi: 10.1186/s40537-023-00853-x. [DOI] [Google Scholar]
  42. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Ravikumar M, Dheenadhayalan V, Rajaram K, Lakshmi SS, Kumaran PP, Paramasivan CN, Balakrishnan K, Pitchappan RM. Associations of HLA-DRB1, DQB1 and DPB1 alleles with pulmonary tuberculosis in south India. Tubercle and Lung Disease. 1999;79:309–317. doi: 10.1054/tuld.1999.0213. [DOI] [PubMed] [Google Scholar]
  44. Robinson J, Barker DJ, Georgiou X, Cooper MA, Flicek P, Marsh SGE. IPD-IMGT/HLA Database. Nucleic Acids Research. 2020;48:D948–D955. doi: 10.1093/nar/gkz950. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Schurz H, Kinnear CJ, Gignoux C, Wojcik G, van Helden PD, Tromp G, Henn B, Hoal EG, Möller M. A sex-stratified genome-wide association study of tuberculosis using a multi-ethnic genotyping array. Frontiers in Genetics. 2018;9:678. doi: 10.3389/fgene.2018.00678. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Schurz H, Müller SJ, van Helden PD, Tromp G, Hoal EG, Kinnear CJ, Möller M. Evaluating the accuracy of imputation methods in a five-way admixed population. Frontiers in Genetics. 2019;10:34. doi: 10.3389/fgene.2019.00034. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Schurz H, Naranbhai V, Yates TA, Gilchrist JJ, Parks T, Dodd PJ, Möller M, Hoal EG, Morris AP, Hill AVS. International tuberculosis host genetics consortium. eLife. 2024;13:84394. doi: 10.7554/eLife.84394. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Selvaraj P, Raghavan S, Swaminathan S, Alagarasu K, Narendran G, Narayanan PR. HLA-DQB1 and -DPB1 allele profile in HIV infected patients with and without pulmonary tuberculosis of south India. Infection, Genetics and Evolution. 2008;8:664–671. doi: 10.1016/j.meegid.2008.06.005. [DOI] [PubMed] [Google Scholar]
  49. Skol AD, Scott LJ, Abecasis GR, Boehnke M. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nature Genetics. 2006;38:209–213. doi: 10.1038/ng1706. [DOI] [PubMed] [Google Scholar]
  50. Smith MH, Myrick JW, Oyageshio O, Uren C, Saayman J, Boolay S, van der Westhuizen L, Werely C, Möller M, Henn BM, Reynolds AW. Epidemiological correlates of overweight and obesity in the Northern Cape Province, South Africa. PeerJ. 2023;11:e14723. doi: 10.7717/peerj.14723. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Sveinbjornsson G, Gudbjartsson DF, Halldorsson BV, Kristinsson KG, Gottfredsson M, Barrett JC, Gudmundsson LJ, Blondal K, Gylfason A, Gudjonsson SA, Helgadottir HT, Jonasdottir A, Jonasdottir A, Karason A, Kardum LB, Knežević J, Kristjansson H, Kristjansson M, Love A, Luo Y, Magnusson OT, Sulem P, Kong A, Masson G, Thorsteinsdottir U, Dembic Z, Nejentsev S, Blondal T, Jonsdottir I, Stefansson K. HLA class II sequence variants influence tuberculosis risk in populations of European ancestry. Nature Genetics. 2016;48:318–322. doi: 10.1038/ng.3498. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Swart Y, van Eeden G, Sparks A, Uren C, Möller M. Prospective avenues for human population genomics and disease mapping in southern Africa. Molecular Genetics and Genomics. 2020;295:1079–1089. doi: 10.1007/s00438-020-01684-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Swart Y, Uren C, van Helden PD, Hoal EG, Möller M. Local ancestry adjusted allelic association analysis robustly captures tuberculosis susceptibility loci. Frontiers in Genetics. 2021;12:716558. doi: 10.3389/fgene.2021.716558. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Swart Y, Eeden G, Uren C, Spuy G, Tromp G, Moller M. Cis-eQTL mapping of TB-T2D comorbidity elucidates the involvement of African ancestry in TB susceptibility. bioRxiv. 2022a doi: 10.1101/2022.10.19.512814. [DOI]
  55. Swart Y, van Eeden G, Uren C, van der Spuy G, Tromp G, Möller M. GWAS in the Southern African context. bioRxiv. 2022b doi: 10.1101/2022.02.16.480704. [DOI] [PMC free article] [PubMed]
  56. Swart Y. LAAA-model. swh:1:rev:2026dd400d02cdea6de32d578bdab657df8d251b. Software Heritage. 2025. https://archive.softwareheritage.org/swh:1:dir:89b4ab6d74cb1bf1f1601e077db7b8e3154177d2;origin=https://github.com/TBHostGenetics/LAAA-model;visit=swh:1:snp:863ae787ed6e43134d24b9b5a2fed72c3f7a52c9;anchor=swh:1:rev:2026dd400d02cdea6de32d578bdab657df8d251b
  57. Ugarte-Gil C, Alisjahbana B, Ronacher K, Riza AL, Koesoemadinata RC, Malherbe ST, Cioboata R, Llontop JC, Kleynhans L, Lopez S, Santoso P, Marius C, Villaizan K, Ruslami R, Walzl G, Panduru NM, Dockrell HM, Hill PC, Mc Allister S, Pearson F, Moore DAJ, Critchley JA, van Crevel R. Diabetes mellitus among pulmonary tuberculosis patients from 4 tuberculosis-endemic countries: the TANDEM study. Clinical Infectious Diseases. 2020;70:780–788. doi: 10.1093/cid/ciz284. [DOI] [PubMed] [Google Scholar]
  58. Uren C, Kim M, Martin AR, Bobo D, Gignoux CR, van Helden PD, Möller M, Hoal EG, Henn BM. Fine-scale human population structure in Southern Africa reflects ecogeographic boundaries. Genetics. 2016;204:303–314. doi: 10.1534/genetics.116.187369. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Uren C, Henn BM, Franke A, Wittig M, van Helden PD, Hoal EG, Möller M. A post-GWAS analysis of predicted regulatory variants and tuberculosis susceptibility. PLOS ONE. 2017;12:e0174738. doi: 10.1371/journal.pone.0174738. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Uren C, Hoal EG, Möller M. Putting RFMix and ADMIXTURE to the test in a complex admixed population. BMC Genetics. 2020;21:40. doi: 10.1186/s12863-020-00845-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Uren C, Hoal EG, Möller M. Mycobacterium tuberculosis complex and human coadaptation: a two-way street complicating host susceptibility to TB. Human Molecular Genetics. 2021;30:R146–R153. doi: 10.1093/hmg/ddaa254. [DOI] [PubMed] [Google Scholar]
  62. Verhein KC, Vellers HL, Kleeberger SR. Inter-individual variation in health and disease associated with pulmonary infectious agents. Mammalian Genome. 2018;29:38–47. doi: 10.1007/s00335-018-9733-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Witek J, Mohiuddin SS. Biochemistry, Pseudogenes in StatPearls. StatPearls Publishing; 2024. [PubMed] [Google Scholar]
  64. Wong LP, Ong RTH, Poh WT, Liu X, Chen P, Li R, Lam KKY, Pillai NE, Sim KS, Xu H, Sim NL, Teo SM, Foo JN, Tan LWL, Lim Y, Koo SH, Gan LSH, Cheng CY, Wee S, Yap EPH, Ng PC, Lim WY, Soong R, Wenk MR, Aung T, Wong TY, Khor CC, Little P, Chia KS, Teo YY. Deep whole-genome sequencing of 100 southeast Asian Malays. American Journal of Human Genetics. 2013;92:52–66. doi: 10.1016/j.ajhg.2012.12.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. World Health Organization Global tuberculosis report 2023 (World health organization, Ed.; p. 75) World health organization. 2023. [August 4, 2023]. https://www.who.int/publications/i/item/9789240083851
  66. Zaidi SMA, Coussens AK, Seddon JA, Kredo T, Warner D, Houben R, Esmail H. Beyond latent and active tuberculosis: a scoping review of conceptual frameworks. EClinicalMedicine. 2023;66:102332. doi: 10.1016/j.eclinm.2023.102332. [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Zheng R, Li Z, He F, Liu H, Chen J, Chen J, Xie X, Zhou J, Chen H, Wu X, Wu J, Chen B, Liu Y, Cui H, Fan L, Sha W, Liu Y, Wang J, Huang X, Zhang L, Xu F, Wang J, Feng Y, Qin L, Yang H, Liu Z, Cui Z, Liu F, Chen X, Gao S, Sun S, Shi Y, Ge B. Genome-wide association study identifies two risk loci for tuberculosis in Han Chinese. Nature Communications. 2018;9:4072. doi: 10.1038/s41467-018-06539-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Zhou W, Nielsen JB, Fritsche LG, Dey R, Gabrielsen ME, Wolford BN, LeFaive J, VandeHaar P, Gagliano SA, Gifford A, Bastarache LA, Wei WQ, Denny JC, Lin M, Hveem K, Kang HM, Abecasis GR, Willer CJ, Lee S. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nature Genetics. 2018;50:1335–1341. doi: 10.1038/s41588-018-0184-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

eLife Assessment

Bavesh D Kana

This valuable study confirms the association between the human leukocyte antigen (HLA)-II region and tuberculosis (TB) susceptibility in genetically admixed South African populations, specifically identifying a near-genome-wide significant association in the HLA-DPB1 gene, which originates from KhoeSan ancestry. The evidence supporting the association between the HLA-II region and TB susceptibility is solid, and the work will be of interest to those studying the genetic basis of tuberculosis susceptibility/infection resistance.

Reviewer #2 (Public review):

Anonymous

Summary:

This manuscript is about using different analytical approaches to allow ancestry adjustments to GWAS analyses amongst admixed populations. This work is a follow-on from the recently published ITHGC multi-population GWAS (https://doi.org/10.7554/eLife.84394), with the focus on the admixed South African populations. Ancestry adjustment models detected a peak of SNPs in the class II HLA DPB1, distinct from the class II HLA DQA1 loci significant in the ITHGC analysis.

Strengths:

Excellent demonstration of GWAS analytical pipelines in highly admixed populations. Particularly the utility of ancestry adjustment to improve study power to detect novel associations. Further confirmation of the importance of the HLA class II locus in genetic susceptibility to TB.

Weaknesses:

Limited novelty compared to the group's existing publications and the body of work linking HLA class II alleles with TB susceptibility in South Africa or other African populations. This work includes only ~100 new cases and controls beyond what has already been published. High resolution HLA typing has detected significant signals in both the DQA1 and DPB1 regions identified by the larger ITHGC and in this GWAS analysis respectively (Chihab L et al. HLA. 2023 Feb; 101(2): 124-137).

Despite the availability of strong methods for imputing HLA from GWAS data (Karnes J et al., PLoS One 2017), the authors did not confirm with HLA typing the importance of their SNP peak in the class II region. This would have supported the importance of this ancestry adjustment versus prior ITHGC analysis.

The study populations comprise active TB cases and healthy controls (from high-burden, presumed exposed communities), and no QFT or other data are provided to identify latent TB infection.

Important methodological points for clarification and for readers to be aware of when reading this paper:

(1) One of the reasons cited for the lack of African ancestry-specific associations or suggestive peaks in the ITHGC study was the small African sample size. The current association test includes a larger African cohort and yields a near-genome-wide significant association in the HLA-DPB1 gene originating from the KhoeSan ancestry. Investigation is needed as to whether the increase in power is due to the increased number of African samples and not necessarily the use of the LAAA model, as stated on lines 295 and 296.

Authors' response: The Manhattan plot in Figure 3 includes the results for all four models: the traditional GWAS model (GAO), the admixture mapping model (LAO), the ancestry plus allelic (APA) model and the LAAA model. In this figure, it is evident that only the LAAA model identified the association peak on chromosome 6, which lends support to the argument that the increase in power is due to the use of the LAAA model and not solely due to the increase in sample size.

Reviewer comment: These data support the authors' conclusion that the increased power is related to the application of the LAAA model rather than simply the increased sample size.

(2) In line 256, the number of SNPs included in the LAAA analysis was 784,557 autosomal markers; the number of SNPs after quality control of the imputed dataset was 7,510,051 SNPs (line 142). It is not clear how or why ~90% of the SNPs were removed. This needs clarification.

Authors' response:

In our manuscript (line 194), we mention that "...variants with minor allele frequency (MAF) < 1% were removed to improve the stability of the association tests." A large proportion of imputed variants fell below this MAF threshold and were subsequently excluded from this analysis.

Reviewer's additional comment: The authors should specify the number of SNPs in the dataset before imputation and indicate what proportion of the 784,557 remaining SNPs were imputed. Providing this information might help the reader better understand the rationale behind the imputation process.
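To make the scale of the MAF filter discussed in point (2) concrete, the following minimal Python sketch folds an allele frequency onto the minor allele, applies the 1% cut-off quoted in the authors' response, and computes the share of SNPs retained from the two counts given in the exchange. The function names are hypothetical and this is not the authors' pipeline; it only reproduces the arithmetic behind the "~90%" figure.

```python
def minor_allele_frequency(alt_freq: float) -> float:
    """Fold an alternate-allele frequency onto the minor allele."""
    return min(alt_freq, 1.0 - alt_freq)


def passes_maf_filter(alt_freq: float, threshold: float = 0.01) -> bool:
    # Variants with MAF < 1% are removed, so a variant exactly at 1% is kept.
    return minor_allele_frequency(alt_freq) >= threshold


# Counts quoted in the review exchange.
snps_after_imputation_qc = 7_510_051
snps_in_laaa_analysis = 784_557

retained = snps_in_laaa_analysis / snps_after_imputation_qc
print(f"retained: {retained:.1%}, removed: {1 - retained:.1%}")
# retained: 10.4%, removed: 89.6%
```

The counts alone do not show how much of the reduction is attributable to the MAF filter as opposed to other exclusions; that breakdown is what the reviewer requests.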

(3) The authors have used the significance threshold estimated by STEAM (p-value < 2.5 × 10⁻⁶) in the LAAA analysis. Grinde et al. 2019 implemented their significance threshold estimation approach tailored to admixture mapping (local ancestry (LA) model), where there is a reduction in testing burden. The authors should justify why this threshold would apply to the LAAA model (a joint genotype and ancestry approach).

Authors' response: We describe in the methods (line 189 onwards) that the LAAA model is an extension of the APA model. Since the APA model itself simultaneously performs the null global ancestry only model and the local ancestry model (utilised in admixture mapping), we thus considered the use of a threshold tailored to admixture mapping appropriate for the LAAA model.

Reviewer's additional comment: While the LAAA model is an extension of the APA model, the authors describe the LAAA test as one that 'models the combination of the minor allele and the ancestry of the minor allele at a specific locus, along with the effect of this interaction,' thus a joint allele and ancestry effects model. Grinde et al. (2019) proposed the significance threshold estimation approach, STEAM, specifically for the LA approach, which tests for ancestry effects alone and benefits from the reduced testing burden. However, it remains unclear why the authors found it appropriate to apply STEAM to the LAAA model, a joint test for both allele and ancestry effects, which does not benefit from the same reduction in testing burden.

(4) Batch effect screening and correction (line 174) is a quality control check. This section is discussed after global and local ancestry inferences in the methods. Was this QC step conducted after the ancestry inference? If so, the authors should justify how the SNPs removed due to the batch effect did not affect the global and local ancestry inferences, or should order the methods section correctly to avoid confusion.

Authors' response: The batch effect correction method utilised a pseudo-case-control comparison which included global ancestry proportions. Thus, batch effect correction was conducted after ancestry inference. We excluded 36 627 SNPs that were believed to have been affected by the batch effect. We have amended line 186 to include the exact number of SNPs excluded due to batch effect.

The ancestry inference by RFMix utilised the entire merged dataset of 7 510 051 SNPs. Thus, the SNPs removed due to the batch effect make up a very small proportion of the SNPs used to conduct global and local ancestry inferences (less than 0.5%). As a result, we do not believe that the removed SNPs would have significantly affected the global and local ancestry inferences. However, we did conduct global ancestry inference with RFMix on each separate dataset as a sanity check. In the tables below, we show the average global ancestry proportions inferred for each separate dataset, the average global ancestry proportions across all datasets and the average global ancestry proportions inferred using the merged dataset. The SAC and Xhosa cohorts are shown in two separate tables due to the different number of contributing ancestral populations to each cohort. The combined average global ancestry proportions across the separate datasets do not differ significantly from the global ancestry proportions inferred using the merged dataset.

This is an excellent response and should remain accessible to readers for clarifying this issue.

Comments on revisions:

Thank you for addressing my other recommendations to authors. These have all been satisfactorily addressed.

eLife. 2025 Jun 3;13:RP99200. doi: 10.7554/eLife.99200.4.sa2

Author response

Dayna Adrienne Croock, Yolandi Swart, Haiko Schurz, Desiree C Petersen, Marlo Möller, Caitlin Uren

The following is the authors’ response to the previous reviews

Recommendations for the authors:

Reviewer #1:

First, I thank the authors for clarifying some of the confusion I had in the previous comment, and I appreciate the effort the authors put into improving the quality of the manuscript. However, my concerns about the lack of novelty of the key findings have not been fully addressed, and no additional analysis was done in this revision. In the current version of the manuscript, asserting that a p-value of 10⁻⁶ is close to genome-wide significance may be considered an overstatement. Further analysis aimed at novel, additional discoveries is necessary.

We thank the reviewer for their comments. Reviewer #2 also made a comment regarding the genome-wide threshold: “However, it remains unclear why the authors found it appropriate to apply STEAM to the LAAA model, a joint test for both allele and ancestry effects, which does not benefit from the same reduction in testing burden.” The reviewers have correctly identified our oversight, and we have amended the manuscript as follows:

(1) The abstract: “We identified a suggestive association peak (rs3117230, p-value = 5.292 × 10⁻⁶, OR = 0.437, SE = 0.182) in the HLA-DPB1 gene originating from KhoeSan ancestry.”

(2) From line 233 to 239: “The R package STEAM (Significance Threshold Estimation for Admixture Mapping) (Grinde et al., 2019) was used to determine the admixture mapping significance threshold given the global ancestral proportions of each individual and the number of generations since admixture (g = 15). For the LA model, a genome-wide significance threshold of p-value < 2.5 × 10⁻⁶ was deemed significant by STEAM. The traditional genome-wide significance threshold of 5 × 10⁻⁸ was used for the GA, APA and LAAA models, as recommended by the authors of the LAAA model (Duan et al., 2018).”

(3) We excluded the results for the signal on chromosome 20, since this also did not reach the LAAA model genome-wide significance threshold.

(4) From line 296 to 308: “LAAA models were successfully applied for all five contributing ancestries (KhoeSan, Bantu-speaking African, European, East Asian and Southeast Asian). However, no variants passed the threshold for statistical significance. Although no variants reached genome-wide significance, a suggestive peak was identified in the HLA-II region of chromosome 6 when using the LAAA model and adjusting for KhoeSan ancestry (Figure 3). The QQ-plot suggested minimal genomic inflation, which was verified by calculating the genomic inflation factor (λ = 1.05289) (Supplementary Figure 1). The lead variants identified using the LAAA model whilst adjusting for KhoeSan ancestry in this region on chromosome 6 are summarised in Table 3. The suggestive peak encompasses the HLA-DPA1/B1 (major histocompatibility complex, class II, DP alpha 1/beta 1) genes (Figure 4). It is noteworthy that without the LAAA model, this suggestive peak would not have been observed for this cohort. This highlights the importance of utilising the LAAA model in future association studies when investigating disease susceptibility loci in admixed individuals, such as the SAC population.”
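The amendments listed above rest on three simple calculations: comparing a p-value with a significance threshold, relating an odds ratio and its standard error to a p-value, and computing a genomic inflation factor. The Python sketch below illustrates each under stated assumptions: it treats the reported SE as the standard error of log(OR), uses a two-sided Wald test, and uses the common median-based definition of the inflation factor. The function names are hypothetical and this is not the authors' code, which is available from the repositories cited in the Data Availability Statement.

```python
from math import exp, log
from statistics import NormalDist, median

STD_NORMAL = NormalDist()

# Thresholds quoted in the amended methods text.
LA_THRESHOLD = 2.5e-6    # STEAM-derived threshold, LA model only
GWAS_THRESHOLD = 5e-8    # traditional threshold, used for GA, APA and LAAA


def wald_summary(odds_ratio, se_log_or, level=0.95):
    """Two-sided Wald p-value and confidence interval for an odds ratio.

    Assumes se_log_or is the standard error on the log-odds scale.
    """
    beta = log(odds_ratio)
    z = beta / se_log_or
    p_value = 2 * STD_NORMAL.cdf(-abs(z))
    crit = STD_NORMAL.inv_cdf(0.5 + level / 2)
    ci = (exp(beta - crit * se_log_or), exp(beta + crit * se_log_or))
    return p_value, ci


def genomic_inflation(p_values):
    """Median-based inflation factor: observed / expected median 1-df chi-square."""
    # Convert each two-sided p-value back to a 1-df chi-square statistic.
    chi2 = [STD_NORMAL.inv_cdf(p / 2) ** 2 for p in p_values]
    expected_median = STD_NORMAL.inv_cdf(0.75) ** 2  # median of chi-square(1)
    return median(chi2) / expected_median


# Lead variant as reported in the abstract (rounded values).
p_approx, ci_approx = wald_summary(0.437, 0.182)
reaches_gwas = p_approx < GWAS_THRESHOLD

# Evenly spaced p-values mimic a null distribution with no inflation.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
lambda_null = genomic_inflation(null_p)
```

With the rounded OR and SE from the abstract, the recomputed p-value is of the same order as the reported one and does not reach the 5 × 10⁻⁸ threshold, which is why the peak is described as suggestive. Under a null distribution the inflation factor is close to 1.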

We acknowledge that our results are not statistically significant. However, our study advances this area of research by identifying suggestive African-specific ancestry associations with TB in the HLA-II region. These findings build upon the work of the ITHGC, which did not identify any significant associations or suggestive peaks in their African-specific analyses. We have included this argument in our manuscript (from lines 425 to 432):

“The ITHGC did not identify any significant associations or suggestive peaks in their African ancestry-specific analyses. Notably, the suggestive peak in the HLA-DPB1 region was only captured in our cohort using the LAAA model whilst adjusting for KhoeSan local ancestry. This underscores the importance of incorporating global and local ancestry in association studies investigating complex multi-way admixed individuals, as the genetic heterogeneity present in admixed individuals (produced as a result of admixture-induced and ancestral LD patterns) may cause association signals to be missed when using traditional association models (Duan et al., 2018; Swart, van Eeden, et al., 2022).”

We appreciate the comment regarding additional analyses. We acknowledge that we did not validate our SNP peak in the HLA-II region through fine-mapping due to the lack of a suitable reference panel (see lines 490 to 500). Our long-term goal is to develop a HLA-imputation reference panel incorporating KhoeSan ancestry; however, this is beyond the scope and funding allowances of this study.

Reviewer #2 (Recommendations for the authors):

We think the authors have done an excellent job with their responses, and the manuscript has been substantially improved.

Thank you for taking the time to help us improve our manuscript.

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Citations

    1. Oyageshio OP, Myrick JW, Saayman J, van der Westhuizen L, Al-Hindi D, Reynolds AW, Zaitlen N, Uren C, Möller M, Henn BM. 2023. Investigating Host Genetic Risk Factors for Tuberculosis in Highly Endemic South African Populations. European Genome-Phenome Archive. EGAS00001007850

    Supplementary Materials

    MDAR checklist
    Supplementary file 1. Summary statistics for two variants within 800 base pairs of the ITHGC lead SNP (rs28383206) on chromosome 6 for the LAAA analysis adjusting for KhoeSan and Bantu-speaking African local ancestry.
    elife-99200-supp1.xlsx (9.1KB, xlsx)
    Supplementary file 2. The number of individuals and variants across all array datasets following genotype QC.
    elife-99200-supp2.xlsx (9.2KB, xlsx)
    Supplementary file 3. Summary of the age, sex and ancestral proportions for individuals in the merged cohort.
    elife-99200-supp3.xlsx (120.4KB, xlsx)
    Supplementary file 4. Summary statistics of the results for chromosome 6 using the local ancestry adjusted allelic (LAAA) model, adjusting for KhoeSan ancestry.
    elife-99200-supp4.xlsx (7.1MB, xlsx)

    Data Availability Statement

    The current manuscript is a computational study, so no new genetic data was generated for this manuscript. Access to the retrospective genetic datasets analysed can be requested through the original studies' data access processes. Where a dataset is yet to be published, access will be considered upon reasonable request in line with the initial participant consent (please email caitlinu@sun.ac.za). Summary statistics for the covariate data for individuals in the cohort are available in Supplementary File 3, and LAAA model results for chromosome 6 (adjusted for KhoeSan ancestry) are available in Supplementary File 4. Code required to perform genotype QC, imputation, ancestry inference and batch effect procedures is publicly available (https://github.com/TBHostGenetics/data_harmonisation; copy archived at Croock, 2025). Code required to execute the LAAA model is publicly available (https://github.com/TBHostGenetics/LAAA-model; copy archived at Swart, 2025).



    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd
