Abstract
Background
Latent tuberculosis infection (LTBI) is a significant reservoir for active tuberculosis development. Identifying key risk factors is crucial for prevention strategies. Machine learning techniques can uncover complex relationships between risk factors and disease outcomes.
Methods
Data were collected from China’s Tuberculosis Management Information System. LTBI was defined by positive tuberculin skin tests. A case-control design comparing LTBI (n = 669) with active tuberculosis (ATB, n = 669) patients was employed. Propensity score matching (1:1) was performed using age, gender, and education level. Four machine learning models (random forest, XGBoost, support vector machine, and neural network) were developed for feature importance analysis. Least Absolute Shrinkage and Selection Operator (LASSO) regression and logistic regression identified key risk factors. Bootstrap resampling (n = 1,000 iterations) assessed model stability with 95% confidence intervals. Shapley Additive Explanations (SHAP) analysis provided feature importance interpretation. A risk nomogram was constructed and evaluated using receiver operating characteristic curves, calibration plots, and decision curve analysis.
Results
Among 1,338 matched participants, XGBoost demonstrated superior performance (AUC = 0.898, accuracy = 85.7%, sensitivity = 84.2%, specificity = 86.9%). SHAP analysis revealed age group (mean |SHAP value| = 0.818) as the most influential predictor, followed by medical insurance type (0.599), income group (0.523), and education level (0.439). Logistic regression identified 11 significant risk factors: age (OR = 2.35, 95% CI: 1.86–2.96), BMI (OR = 0.81, 95% CI: 0.71–0.93), smoking status, occupational dust exposure, diabetes, medical insurance type, immunosuppressant use, education level, silicosis, anemia, and TB contact history. The nomogram showed good discrimination (AUC = 0.839) and clinical utility, identifying 64.44% of subjects as high-risk, with 53.62% confirmed as true positives at a 20% risk threshold.
Conclusion
This study successfully identified key LTBI risk factors using machine learning approaches. The developed nomogram provides a practical tool for targeted screening in resource-limited settings. Interventions targeting modifiable factors such as smoking cessation and occupational dust control may reduce LTBI and active TB burden.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12889-025-25844-w.
Keywords: Latent tuberculosis infection, Risk factors, Machine learning, Nomogram, SHAP analysis
Background
Tuberculosis (TB) continues to pose a significant global health challenge, with an estimated 10.8 million new cases and an incidence rate of 134 per 100,000 population in 2023. China ranks third in the world in terms of TB burden, accounting for 11% of global cases [1]. Despite implementing various control measures, the incidence of TB in China remains high, posing a considerable threat to public health [2].
Latent tuberculosis infection (LTBI) is a state of persistent immune response to stimulation by Mycobacterium tuberculosis antigens without evidence of clinically manifested active TB [3]. It is estimated that approximately one-quarter of the world’s population has LTBI, and 5–10% of infected individuals will develop active TB disease over their lifetime [3].
The progression from LTBI to active TB typically occurs when the host’s immune defenses are compromised due to factors such as HIV infection, immunosuppressive therapy, malnutrition, diabetes, or aging. During this transition, dormant mycobacteria within granulomas begin to multiply and spread, leading to clinically apparent disease. The identification and treatment of LTBI is a critical component of TB control strategies, as it can prevent the development of active disease and reduce transmission [4].
Machine learning (ML) has emerged as a powerful tool for predicting disease risk and identifying key risk factors [5]. ML algorithms excel at managing complex, high-dimensional data and capturing non-linear relationships among variables, which makes them particularly well-suited for modeling the multifactorial nature of tuberculosis (TB) [6]. Several studies have utilized ML techniques to predict TB risk and identify significant predictors, demonstrating their potential to enhance TB control efforts [7–9].
However, most existing studies have concentrated on active TB, with limited research applying ML to predict LTBI risk and its progression. Few studies have compared the performance of multiple ML algorithms or evaluated their clinical utility through decision curve analysis. Recent advances in ensemble deep learning frameworks have shown promising results in detecting various diseases, demonstrating how adaptive integration of multiple models can improve diagnostic accuracy [10]. In addition, research on optimized information fusion techniques has highlighted the importance of selective network combination for enhancing detection capabilities in respiratory conditions [11].
Consequently, this study aims to develop and validate ML models for predicting LTBI risk and for identifying key risk factors within a Chinese population, as well as to assess the clinical usefulness of these models using a risk nomogram and decision curve analysis.
Methods
Study population
This study employed a case-control design comparing individuals with latent tuberculosis infection (LTBI) to those with active tuberculosis (ATB). The rationale for this comparison is to identify factors that distinguish between latent and active forms of TB infection, which can inform targeted intervention strategies. The data for ATB patients were obtained from a public health surveillance cohort, specifically the Tuberculosis Management Information System (TBMIS) maintained by a disease prevention and control center in Xinjiang. These cases were diagnosed through laboratory tests from January to December 2022. In contrast, the data for LTBI individuals were derived from a community-based active screening cohort. Volunteers in this cohort underwent tuberculin skin testing (TST) as part of a regional tuberculosis prevention and control program, with positive results recorded from May to December 2022.
Justification for data integration and machine learning approach
While this study utilizes data from different sources (TBMIS surveillance data and community screening data), rigorous propensity score matching was implemented to minimize selection bias and ensure comparability between groups. A 1:1 nearest neighbor matching approach was employed using a caliper width of 0.1 standard deviations of the logit of the propensity score [12]. Matching variables included age, gender, and education level as potential confounders. This approach is increasingly recognized in epidemiological research as a valid method to leverage existing datasets when randomized controlled trials are not feasible or ethical.
The complex, multifactorial nature of tuberculosis risk necessitates analytical approaches that can capture non-linear relationships and interactions among risk factors. Traditional statistical methods often assume linear relationships and independence among variables, which may not reflect the true complexity of TB risk factors. Machine learning algorithms, particularly ensemble methods like Random Forest [13] and XGBoost [14], can model these complex relationships without requiring pre-specified assumptions about the functional form of relationships between predictors and outcomes.
Diagnostic criteria
ATB diagnostic criteria
Active tuberculosis was diagnosed according to the “Diagnostic Criteria for Pulmonary Tuberculosis (WS288-2017) / Health Industry Standard of the People’s Republic of China.” Patients were included if they met at least one of the following criteria: sputum smear and/or culture positive for Mycobacterium tuberculosis; chest X-ray showing exudative lesions, caseous lesions, cavities, proliferative lesions, or hematogenous disseminated tuberculosis lesions in the lungs; or clinical symptoms consistent with active tuberculosis, such as persistent cough, sputum production, fever, night sweats, or weight loss.
Patients were excluded if they had tuberculosis of hilar lymph nodes, tuberculous pleurisy, extrapulmonary tuberculosis with intrapulmonary lesions, HIV infection, or hematologic malignancies.
LTBI diagnostic criteria
The tuberculin skin test (TST) screening followed the “Diagnostic Criteria for Pulmonary Tuberculosis (WS288-2017) / Health Industry Standard of the People’s Republic of China.” Individuals were included if they met all of the following criteria: TST positivity, defined as an induration diameter ≥ 10 mm in Bacillus Calmette-Guérin (BCG)-vaccinated areas or ≥ 5 mm in non-BCG-vaccinated areas; absence of clinical respiratory or systemic manifestations such as cough, sputum production, hemoptysis, or fever; no history of mental illness; voluntary participation; and chest imaging examinations showing no abnormalities consistent with active tuberculosis.
Exclusion criteria included previous tuberculosis diagnosis, HIV infection, hematologic malignancies, extrapulmonary tuberculosis or any systemic symptoms suggestive of extrapulmonary TB manifestations including lymphadenopathy, bone or joint symptoms, genitourinary symptoms, or neurological manifestations, and current use of anti-tuberculosis treatment.
Sample size calculation
The sample size was calculated using the formula:
$$N = \frac{Z^{2}\,p\,(1-p)}{d^{2}}$$
In this formula, N represents the sample size, Z denotes the standard normal critical value (1.96 for a 95% confidence level), p indicates the probability of individuals with latent tuberculosis infection (LTBI) developing active tuberculosis (0.03375 based on previous studies), and 1-p equals 0.96625. The margin of error, d, is set at 1.5%. The calculated sample size (N) was 557. Taking into account a 20% loss to follow-up rate, the final sample size for each group was adjusted to 669. The study received approval from the Ethics Committee of the First Affiliated Hospital of Xinjiang Medical University, and informed consent was obtained from all participants.
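The arithmetic behind these figures can be checked with a short sketch. It assumes the standard single-proportion formula N = Z²p(1−p)/d² (which reproduces the reported 557) and assumes the 20% allowance was applied as a multiplier of 1.2 with rounding up; the function name is illustrative only.

```python
import math

def sample_size(z: float, p: float, d: float) -> int:
    """Single-proportion sample size, N = z^2 * p * (1 - p) / d^2, rounded up."""
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

# Values stated in the text: Z = 1.96, p = 0.03375, d = 1.5%.
n_base = sample_size(z=1.96, p=0.03375, d=0.015)

# Assumption: the 20% loss allowance is applied as N * 1.2, rounded up.
n_per_group = math.ceil(n_base * 1.2)

print(n_base, n_per_group)
```

Under these assumptions the two printed values match the 557 and 669 reported above.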
Statistical analysis
Statistical analysis was performed using SPSS version 26.0 and R version 4.2.0. Continuous variables were compared using the Mann-Whitney U test, while categorical variables were analyzed with the chi-square test to determine if there were statistically significant differences between the groups. Statistical significance was set at P < 0.05.
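As an illustration of the two group comparisons described above, the following sketch runs a Mann-Whitney U test and a chi-square test in Python with SciPy. The study itself used SPSS and R; the data here are synthetic stand-ins, and the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic stand-ins for a continuous variable (for example, age) in two groups.
age_ltbi = rng.normal(40, 15, 200)
age_atb = rng.normal(58, 14, 200)
u_stat, p_cont = mannwhitneyu(age_ltbi, age_atb, alternative="two-sided")

# Synthetic 2x2 contingency table: rows = group, columns = smoker yes / no.
table = np.array([[60, 140], [95, 105]])
chi2, p_cat, dof, expected = chi2_contingency(table)

alpha = 0.05  # significance level stated in the text
print(p_cont < alpha, p_cat < alpha)
```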
Propensity score matching
To minimize confounding bias and improve the reliability of causal inferences, a 1:1 nearest neighbor matching approach was employed for propensity score matching [12]. The matching variables included potential confounders such as age, gender, and education level. Propensity scores were estimated using logistic regression, and matching was performed with a caliper of 0.1 standard deviations of the logit of the propensity score.
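The matching procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes greedy matching without replacement, uses scikit-learn's default (lightly regularized) logistic regression to estimate propensity scores, and runs on synthetic covariates; `match_1to1` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def match_1to1(X, treated, caliper_sd=0.1):
    """Greedy 1:1 nearest-neighbour matching on the logit of the propensity score."""
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    logit = np.log(ps / (1 - ps))
    caliper = caliper_sd * logit.std()  # 0.1 SD of the logit, as in the text
    controls = list(np.flatnonzero(treated == 0))
    pairs = []
    for t in np.flatnonzero(treated == 1):
        if not controls:
            break
        dists = np.abs(logit[controls] - logit[t])
        j = int(np.argmin(dists))
        if dists[j] <= caliper:
            # Matching without replacement: remove the control once used.
            pairs.append((int(t), int(controls.pop(j))))
    return pairs

# Synthetic covariates standing in for age (standardized), gender, education level.
rng = np.random.default_rng(0)
n = 400
X = np.column_stack([rng.normal(size=n), rng.integers(0, 2, n), rng.integers(0, 4, n)])
treated = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

pairs = match_1to1(X, treated)
print(len(pairs), "matched pairs")
```

Cases with no control inside the caliper are left unmatched, which is why the matched sample can be smaller than either original group.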
Machine learning models
Four machine learning models were developed: random forest [13], XGBoost [14], support vector machine [15], and neural network [16]. Grid search and 5-fold cross-validation were utilized to assess the performance of these models. The following hyperparameters were optimized: Random Forest: n_estimators (100–500), max_depth (3–10), min_samples_split (2–10); XGBoost: n_estimators (100–500), max_depth (3–8), learning_rate (0.01–0.3); Support Vector Machine: C (0.1–100), gamma (0.001–1), kernel (rbf, linear); Neural Network: hidden_layer_sizes (50–200), learning_rate (0.001–0.1), alpha (0.0001–0.1).
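A grid search with 5-fold cross-validation of the kind described here might look like the sketch below, shown for the random forest only. The grid points are a small illustrative subset of the stated ranges, the data are synthetic, and AUC as the selection metric is an assumption; the study does not report the exact grid or scoring rule. The same pattern would apply to the other three models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary-outcome data standing in for the matched study dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative points inside the reported ranges (not the authors' actual grid).
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 10],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation, as in the text
    scoring="roc_auc",   # assumption: AUC used to rank candidates
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```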
Bootstrap validation
The Bootstrap method [17] was applied to evaluate the performance of the four machine learning models. Bootstrap resampling was performed with 1000 iterations, generating sampling distributions for each performance metric (area under the curve (AUC), sensitivity, specificity, and accuracy). Bias-corrected and accelerated (BCa) 95% confidence intervals were calculated for each metric to assess the stability and reliability of model performance estimates. This approach provides robust estimates of model performance variability and helps identify the most stable predictive models.
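One way to obtain a BCa interval for the AUC is sketched below using `scipy.stats.bootstrap` with paired resampling of labels and predicted scores. This is an assumed implementation for illustration, not necessarily the authors' workflow, and the labels and scores are synthetic.

```python
import numpy as np
from scipy.stats import bootstrap
from sklearn.metrics import roc_auc_score

# Synthetic labels and predicted probabilities standing in for model output.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 300)
y_score = np.clip(0.5 * y_true + rng.normal(0.25, 0.25, 300), 0, 1)

res = bootstrap(
    (y_true, y_score),
    roc_auc_score,
    paired=True,            # resample (label, score) pairs together
    vectorized=False,
    n_resamples=1000,       # 1,000 iterations, as in the text
    confidence_level=0.95,
    method="BCa",           # bias-corrected and accelerated interval
    random_state=0,
)

auc = roc_auc_score(y_true, y_score)
low = res.confidence_interval.low
high = res.confidence_interval.high
print(f"AUC = {auc:.3f} (95% CI {low:.3f} to {high:.3f})")
```

The same call can be repeated with sensitivity, specificity, or accuracy as the statistic after thresholding the scores.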
SHAP analysis
To understand both the importance and direction of influence for each risk factor, the Shapley Additive Explanations (SHAP) methodology [18] was applied to the machine learning models. SHAP values provide a unified measure of feature importance while also indicating the directional impact of each feature on model predictions. SHAP values for both XGBoost and Random Forest models were calculated to compare their feature importance rankings and interpretations.
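To make the quantity behind SHAP concrete, the sketch below computes exact Shapley values by brute force for a toy three-feature model, marginalizing absent features over a background sample. It is not the tree-based algorithm that SHAP libraries apply to XGBoost or Random Forest; the model, data, and function names are hypothetical. The mean absolute value across rows corresponds to the "mean |SHAP value|" importance reported in this study.

```python
from itertools import combinations
from math import factorial
import numpy as np

def shapley_values(predict, x, background):
    """Exact Shapley values for one row x, averaging absent features over background."""
    n = len(x)

    def value(subset):
        # Features in `subset` take x's values; the others keep background values.
        data = background.copy()
        cols = list(subset)
        if cols:
            data[:, cols] = x[cols]
        return predict(data).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi

def predict(d):
    # Toy risk model with an interaction term (purely illustrative).
    z = 0.8 * d[:, 0] - 0.5 * d[:, 1] + 0.3 * d[:, 0] * d[:, 2]
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
background = rng.normal(size=(200, 3))
x = np.array([1.0, -1.0, 0.5])

phi = shapley_values(predict, x, background)
# Global importance: mean absolute Shapley value over a set of rows.
phis = np.array([shapley_values(predict, row, background) for row in background[:20]])
importance = np.abs(phis).mean(axis=0)
print(phi, importance)
```

The sign of each value gives the direction of a feature's contribution for that row, and the values sum to the difference between the row's prediction and the average background prediction.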
Feature selection and logistic regression
Variables that showed statistical significance in the univariate analysis were included in the Least Absolute Shrinkage and Selection Operator (LASSO) regression model [19] to select the optimal predictors. The regularization parameter (λ) was selected using 10-fold cross-validation to minimize prediction error. The selected variables were then incorporated into a binary logistic regression model to explore the relationship between each factor and the risk of tuberculosis.
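The two-stage procedure (L1-penalized selection with 10-fold cross-validation, then an ordinary logistic model on the retained variables) can be sketched as below. This uses scikit-learn as a stand-in, since the exact software routine is not stated; scikit-learn's `C` is the inverse of the penalty strength λ, the log-loss selection criterion is an assumption, a very large `C` is used to approximate an unpenalized refit, and the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Synthetic candidate predictors standing in for the univariate-significant variables.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0, random_state=0
)
X = StandardScaler().fit_transform(X)

# Stage 1: L1-penalized logistic regression, penalty chosen by 10-fold CV.
lasso = LogisticRegressionCV(
    Cs=20, cv=10, penalty="l1", solver="liblinear",
    scoring="neg_log_loss", random_state=0,
).fit(X, y)
selected = np.flatnonzero(lasso.coef_[0] != 0)

# Stage 2: refit a (practically unpenalized) logistic model on the selected columns.
final = LogisticRegression(C=1e6, max_iter=1000).fit(X[:, selected], y)
odds_ratios = np.exp(final.coef_[0])
print(len(selected), "variables retained")
```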
Nomogram development and validation
A tuberculosis risk nomogram was developed based on the variables identified from the logistic regression analysis. The predictive performance of the nomogram model was evaluated using receiver operating characteristic (ROC) curves, calibration plots, and decision curve analysis (DCA) [20]. The DCA evaluated the clinical utility of the model by quantifying the net benefit across different risk thresholds.
Results
Univariate analysis of tuberculosis incidence
The univariate analysis identified 31 factors, including age, body mass index (BMI), average monthly income over the past two years, type of medical insurance, and education level, which demonstrated statistically significant differences between the LTBI and TB groups (P < 0.05) (Table 1 and Table S1).
Table 1.
Univariate analysis of continuous variables for tuberculosis incidence
| Factor | LTBI Group [M (P25, P75)] | TB Group [M (P25, P75)] | Z | P-value |
|---|---|---|---|---|
| Age (years) | 39 (18, 59) | 60 (47, 70) | 258.659 | <0.001 |
| BMI | 22.23 (20.00, 25.00) | 22.84 (20.45, 25.06) | 4.974 | 0.026 |
| Average Monthly Income in the Past Two Years (yuan) | 5050 (2100, 11475) | 3200 (2000, 6000) | 69.466 | <0.001 |
LTBI Latent Tuberculosis Infection, TB Tuberculosis, BMI Body Mass Index, M Median, P25 25th percentile, P75 75th percentile
Propensity score matching
To mitigate confounding bias and enhance the reliability of causal inferences, a 1:1 nearest neighbor matching approach was employed for propensity score matching. The matching variables included potential confounders such as age, gender, and education level. Following the matching process, the propensity score distributions of the two groups became more comparable, thereby reducing systematic differences (Figure S1). The Love plot further illustrated that the standardized differences of most variables were maintained within 0.1 after matching, indicating a well-balanced distribution of characteristics between the two groups (Figure S2).
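The balance criterion shown in a Love plot is the standardized mean difference (SMD). A minimal sketch for a continuous covariate follows, assuming the common pooled-standard-deviation definition; the data are synthetic and the function name is illustrative.

```python
import numpy as np

def standardized_mean_difference(a, b):
    """SMD = (mean_a - mean_b) / sqrt((var_a + var_b) / 2), using sample variances."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# Synthetic matched-group ages; real values would come from the matched cohort.
rng = np.random.default_rng(1)
age_ltbi = rng.normal(50.0, 12, 500)
age_atb = rng.normal(50.5, 12, 500)

smd = standardized_mean_difference(age_ltbi, age_atb)
balanced = abs(smd) < 0.1  # threshold used in the text
print(round(smd, 3), balanced)
```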
Machine learning model variable importance analysis
Four machine learning models were developed: random forest, XGBoost, support vector machine, and neural network. Grid search and 5-fold cross-validation were utilized to assess the performance of each model. All models demonstrated strong predictive capabilities, with XGBoost achieving the highest area under the curve (AUC) of 0.898, followed closely by random forest (AUC = 0.895), support vector machine (AUC = 0.877), and neural network (AUC = 0.797). Additionally, XGBoost exhibited the highest accuracy (85.7%), sensitivity (84.2%), and specificity (86.9%) (Fig. 1).
Fig. 1.

Machine learning model performance comparison. Note AUC = Area Under the Curve; RF = Random Forest; SVM = Support Vector Machine; NN = Neural Network. Performance metrics include AUC, sensitivity, specificity, and accuracy evaluated using 5-fold cross-validation
Bootstrap method
The Bootstrap method was employed to assess the performance of the four machine learning models. The results indicated that the XGBoost model demonstrated superior overall predictive performance compared to the other models (Figure S3 and Table 2).
Table 2.
Bootstrap results for machine learning model performance
| Model | AUC | Sensitivity | Specificity | Accuracy |
|---|---|---|---|---|
| Random Forest | 0.894(0.869–0.920) | 0.930(0.904–0.954) | 0.671(0.608–0.733) | 0.833(0.802–0.861) |
| XGBoost | 0.898(0.873–0.921) | 0.877(0.842–0.910) | 0.721(0.660–0.780) | 0.819(0.787–0.849) |
| SVM | 0.878(0.848–0.904) | 0.872(0.839–0.905) | 0.730(0.671–0.787) | 0.819(0.790–0.849) |
| Neural Network | 0.798(0.764–0.832) | 0.779(0.736–0.820) | 0.650(0.587–0.712) | 0.731(0.696–0.767) |
Calibration curve evaluation
Calibration curves were generated for each model to evaluate the reliability of the predicted probabilities. The XGBoost model exhibited superior calibration performance compared to the other models, with its calibration curve closely aligning with the diagonal line (Figure S4).
Variable feature importance selection
Feature importance analysis was conducted using the XGBoost model to pinpoint key risk factors. Variables that exhibited statistical significance in the univariate analysis were incorporated into the model. The analysis indicated that age, BMI, smoking status, occupational dust exposure, diabetes, and a family history of tuberculosis significantly impact the risk of developing tuberculosis (Figure S5).
SHAP analysis results
The SHAP analysis revealed nuanced insights into how each factor contributes to tuberculosis risk prediction. Figure 2 shows the SHAP summary plot for the XGBoost model, with features ordered by their overall importance. Age group emerged as the most influential predictor (mean |SHAP value| = 0.818), followed by type of medical insurance (0.599), income group (0.523), and education level (0.439).
Fig. 2.
SHAP Value Distribution of Key Features for Tuberculosis Risk Prediction. Note SHAP = Shapley Additive Explanations; TB = Tuberculosis. Features are ordered by mean absolute SHAP value. Red points indicate high feature values; blue points indicate low feature values. Positive SHAP values increase TB risk prediction; negative values decrease risk prediction
The SHAP values revealed not only feature importance but also the direction of impact. Higher age values consistently pushed predictions toward higher tuberculosis risk (positive SHAP values), while higher income generally reduced predicted risk (negative SHAP values). Education level showed a non-linear relationship with TB risk, where both very low and very high education levels decreased risk prediction, while mid-range education levels increased risk, as shown in Fig. 2.
When comparing Random Forest and XGBoost models (Figure S6), substantial agreement in the top features identified was found. Both models ranked age group, type of medical insurance, and income group among their top five features. However, the models differed in the importance assigned to some factors: Random Forest emphasized marital status more heavily, while XGBoost gave greater weight to education level. These differences can be attributed to Random Forest’s tendency to detect more complex feature interactions through its ensemble of decision trees, while XGBoost’s gradient boosting approach may better capture the main effects of individual features.
Individual feature contribution plots for representative cases (Figures S7-S9) demonstrate how the models arrive at predictions for specific patients. In case #1, type of medical insurance and age group were the dominant factors pushing the prediction toward tuberculosis risk, while in case #10, age group and education level were most influential, with income group providing a protective effect.
LASSO regression
The 31 variables that showed statistically significant differences in the univariate analysis were included in the LASSO regression model. When lambda (λ) was set to 0.0032, the model error was minimized, corresponding to 57 variables. After incorporating the dummy variables for the categorical variables, 28 optimal variables were identified (Figure S10).
Multivariate analysis of tuberculosis incidence
The selected 28 variables were included in the binary logistic regression model. The results indicated that age, BMI, household monthly income, smoking status, exercise frequency, personality type, occupational exposure to dust and chemical fumes, history of tuberculosis contact, presence of anemia, insomnia, silicosis, use of immunosuppressants, awareness of tuberculosis transmission routes and free policies, type of medical insurance, and education level were significantly associated with the risk of tuberculosis (Table 3).
Table 3.
Binary logistic regression analysis of tuberculosis incidence
| Factor | β | SE | Wald | P-value | OR (95% CI) |
|---|---|---|---|---|---|
| Age | 0.85 | 0.12 | 51.74 | < 0.001 | 2.35 (1.86 − 2.96) |
| BMI | −0.21 | 0.07 | 9.6 | < 0.001 | 0.81 (0.71 − 0.93) |
| Average Monthly Household Income | −0.77 | 0.09 | 78.18 | < 0.001 | 0.46 (0.39 − 0.55) |
| Smoking Status (Quit Smoking as Reference) | | | 13.46 | < 0.001 | |
| Yes | −1.68 | 0.46 | 13.17 | < 0.001 | 0.19 (0.08 − 0.46) |
| No | −1.45 | 0.43 | 11.64 | < 0.001 | 0.23 (0.10 − 0.54) |
| Occupational Dust Exposure (A Lot as Reference) | | | 8.33 | 0.02 | |
| Average | 0.7 | 0.38 | 3.41 | 0.07 | 2.02 (0.96 − 4.24) |
| History of TB Contact (Unknown as Reference) | | | 31.26 | < 0.001 | |
| No | −0.66 | 0.22 | 8.88 | < 0.001 | 0.52 (0.34 − 0.80) |
| Presence of Anemia (Yes as Reference) | −0.70 | 0.33 | 4.52 | 0.03 | 0.50 (0.26 − 0.95) |
| Presence of Insomnia (Yes as Reference) | −0.81 | 0.33 | 6.2 | 0.01 | 0.44 (0.23 − 0.84) |
| Presence of Silicosis (Yes as Reference) | −1.54 | 0.56 | 7.47 | 0.01 | 0.22 (0.07 − 0.65) |
| History of Immunosuppressant Use (Yes as Reference) | −1.95 | 0.7 | 7.81 | 0.01 | 0.14 (0.04 − 0.56) |
| Type of Medical Insurance (Commercial Medical Insurance as Reference) | | | 69.33 | < 0.001 | |
| Self-Funded | −3.91 | 0.7 | 31.69 | < 0.001 | 0.02 (0.01 − 0.17) |
| Urban Employee Basic Medical Insurance | −1.98 | 0.49 | 16.32 | < 0.001 | 0.14 (0.05 − 0.36) |
| Urban Resident Medical Insurance | −1.93 | 0.46 | 17.87 | < 0.001 | 0.15 (0.06 − 0.36) |
| New Rural Cooperative Medical Insurance | −5.43 | 0.73 | 56.12 | < 0.001 | 0.004 (0.001 − 0.018) |
| Education Level (Bachelor’s Degree and Above as Reference) | | | 22.46 | < 0.001 | |
| Primary School and Below | −0.96 | 0.33 | 8.56 | < 0.001 | 0.38 (0.20 − 0.73) |
| Junior High School | −1.11 | 0.3 | 13.55 | < 0.001 | 0.33 (0.18 − 0.60) |
| High School/Technical Secondary School | −1.45 | 0.32 | 21.09 | < 0.001 | 0.23 (0.13 − 0.44) |
BMI Body Mass Index, TB Tuberculosis, OR Odds Ratio, CI Confidence Interval, SE Standard Error
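The columns of Table 3 are linked by standard identities: OR = exp(β), the 95% CI is exp(β ± 1.96·SE), and the Wald statistic is (β/SE)². The sketch below applies them to the rounded β and SE shown for age; because the inputs are rounded to two decimals, the results are close to, but not exactly equal to, the tabulated 2.35 (1.86 − 2.96) and 51.74. The helper name is illustrative.

```python
import math

def odds_ratio_summary(beta, se, z=1.96):
    """Return OR, 95% CI bounds, and Wald chi-square from a logistic coefficient."""
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    wald = (beta / se) ** 2
    return odds_ratio, lower, upper, wald

# Rounded coefficient and standard error for age, taken from Table 3.
or_age, lo_age, hi_age, wald_age = odds_ratio_summary(beta=0.85, se=0.12)
print(f"OR = {or_age:.2f} (95% CI {lo_age:.2f} to {hi_age:.2f}), Wald = {wald_age:.2f}")
```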
Nomogram
A tuberculosis risk nomogram was developed based on the 11 variables selected from the logistic regression analysis (Fig. 3). Age was standardized and categorized into three groups using tertiles: the low age group (lowest tertile), the middle age group (middle tertile), and the high age group (highest tertile). This three-level stratification of age was incorporated into the nomogram to enhance risk assessment based on different age categories.
Fig. 3.

Tuberculosis risk nomogram
The most influential factor was the type of medical insurance, followed by the use of immunosuppressants, age group, education level, presence of silicosis, smoking status, income group, presence of anemia, mental health status, history of tuberculosis contact, and presence of insomnia. Each variable was assigned a score based on a predefined scoring scale, and the total score was calculated by summing the scores of all variables. A higher total score indicated an increased risk of tuberculosis.
Nomogram model evaluation
The area under the ROC curve (AUC) for the nomogram was 0.839, indicating good discrimination (Fig. 4). Decision curve analysis (DCA) showed that the model performed optimally at a 20% risk threshold, achieving a net benefit of 0.269. At this threshold, the prediction model identified 1,283 high-risk individuals (64.44%) among the 1,991 study subjects, of whom 688 (53.62%) were confirmed as true positive cases. The model demonstrated high sensitivity and effectively identified potential high-risk populations (Figure S11).
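The counts in this paragraph can be related to net benefit through the usual decision curve definition, net benefit = TP/N − (FP/N) × pt/(1 − pt), where pt is the risk threshold. The sketch below assumes that definition and treats every flagged subject who was not a true positive as a false positive; it yields roughly 0.271, close to but not identical to the reported 0.269, and the small gap may reflect rounding or details of the authors' software.

```python
def net_benefit(true_pos, false_pos, n_total, threshold):
    """Net benefit at a given risk threshold (standard decision curve definition)."""
    return true_pos / n_total - (false_pos / n_total) * threshold / (1 - threshold)

# Counts stated in the text.
n_total, flagged, true_pos = 1991, 1283, 688
false_pos = flagged - true_pos  # assumption: flagged but not confirmed

nb = net_benefit(true_pos, false_pos, n_total, threshold=0.20)
share_flagged = 100 * flagged / n_total   # percentage classified as high-risk
share_confirmed = 100 * true_pos / flagged  # percentage of flagged confirmed

print(round(nb, 3), round(share_flagged, 2), round(share_confirmed, 2))
```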
Fig. 4.

Nomogram model ROC curve
Discussion
This comprehensive study utilized multiple analytical approaches to identify key risk factors for tuberculosis in a large Chinese population. The integration of traditional statistical methods with advanced machine learning techniques provided robust evidence for risk factor identification and clinical prediction model development. Recent advances in machine learning applications for tuberculosis have demonstrated significant potential in improving diagnostic accuracy and risk stratification [21, 22].
The univariate analysis confirmed many established TB risk factors while identifying novel associations. Age emerged as the strongest predictor, consistent with known age-related immune decline and increased comorbidity burden [23]. The inverse association between BMI and TB risk supports the well-established relationship between malnutrition and tuberculosis susceptibility, which has been consistently demonstrated in recent studies [24]. Socioeconomic factors, including income, education level, and medical insurance type, demonstrated significant associations with TB risk. These findings highlight the social determinants of health in tuberculosis epidemiology and underscore the importance of addressing health inequities in TB control programs [25].
The comparison of four machine learning algorithms revealed that ensemble methods (Random Forest and XGBoost) outperformed individual classifiers. XGBoost’s superior performance can be attributed to its gradient boosting framework, which sequentially corrects prediction errors and effectively handles complex feature interactions [26]. This finding aligns with recent studies demonstrating the superiority of ensemble methods in tuberculosis prediction tasks [27]. The SHAP analysis provided unprecedented transparency in understanding model predictions. The identification of non-linear relationships, such as the U-shaped education effect, demonstrates the value of ML approaches over traditional linear statistical methods [28]. These complex relationships likely reflect differential occupational exposures, healthcare utilization patterns, and social factors across education levels, as evidenced by recent research utilizing SHAP analysis in tuberculosis studies [29].
The developed machine learning models and nomogram have several practical applications in healthcare systems, particularly in resource-limited settings. The nomogram provides a simple, cost-effective tool for healthcare providers to assess individual tuberculosis risk without requiring sophisticated technology or extensive laboratory testing. Recent studies have demonstrated the clinical utility of similar nomogram-based prediction tools in tuberculosis management [30, 31]. In health system implementation, the ML models could be integrated into electronic health records to automatically flag high-risk patients during routine healthcare visits. This automated risk assessment could trigger targeted screening protocols, enabling early detection and intervention [32]. For public health programs, the identified risk factors can inform the design of targeted prevention campaigns, such as smoking cessation programs for high-risk individuals or enhanced occupational safety measures in industries with dust exposure. The personalized risk assessment capability allows for resource optimization by focusing intensive screening efforts on individuals most likely to benefit. In Xinjiang and similar high-burden regions, this could significantly improve the efficiency of tuberculosis prevention programs while reducing costs [33].
The machine learning techniques employed offer several advantages over traditional statistical approaches. ML models can process large volumes of heterogeneous data and identify complex, non-linear relationships that may not be apparent through conventional analysis [34]. This capability is particularly valuable in tuberculosis research, where multiple interacting factors contribute to disease risk. The ensemble approach using multiple algorithms provides robust predictions and reduces the risk of model-specific biases. The SHAP analysis framework offers unprecedented transparency in ML-based clinical predictions, addressing the ‘black box’ criticism often leveled against complex algorithms [35]. This interpretability is crucial for clinical acceptance and regulatory approval of AI-based diagnostic tools.
Several limitations should be acknowledged. First, the retrospective design limits our ability to establish causal relationships between identified risk factors and tuberculosis outcomes. Second, the integration of data from different sources may introduce selection bias despite propensity score matching efforts [36]. The surveillance data may capture more severe cases, while community screening may identify milder or earlier-stage infections. Third, the cross-sectional nature of the risk factor assessment means that temporal relationships cannot be definitively established. Fourth, some unmeasured confounders, such as genetic susceptibility, nutritional status, or environmental factors not captured in our dataset, may influence the observed associations [37].
A significant limitation of this study is the reliance on tuberculin skin test (TST) as the sole diagnostic method for LTBI. While TST is widely used in high-burden settings due to its cost-effectiveness and accessibility, it has well-documented limitations in both sensitivity and specificity [38]. In BCG-vaccinated populations, which is common in China, TST specificity can be substantially reduced due to cross-reactivity with BCG and non-tuberculous mycobacteria, potentially leading to false-positive results [39]. Conversely, TST sensitivity may be compromised in immunocompromised individuals, potentially missing true LTBI cases. Newer interferon-gamma release assays (IGRAs), such as QuantiFERON-TB Gold or T-SPOT.TB, offer improved specificity as they are not affected by BCG vaccination and demonstrate superior performance in distinguishing LTBI from BCG vaccination or exposure to non-tuberculous mycobacteria [40, 41]. Future studies should consider incorporating IGRAs alongside or instead of TST to improve diagnostic accuracy, particularly in BCG-vaccinated populations. The use of combined testing strategies, where both TST and IGRA are employed, may further enhance the identification of true LTBI cases while reducing false-positive results.
While our study excluded HIV infection and hematologic malignancies, we did not systematically exclude all immunocompromised conditions that may affect TST interpretation and TB risk. Patients with rheumatoid arthritis or other autoimmune diseases receiving biological therapies, individuals receiving chronic corticosteroids, patients on dialysis due to end-stage renal disease, and those with solid organ transplantation should ideally have been included in the exclusion criteria [42]. The presence of such conditions could influence both TST results and the risk of progression from LTBI to active TB, potentially introducing confounding in our risk factor analysis. Immunosuppressive medications can lead to false-negative TST results due to impaired cell-mediated immunity, while simultaneously increasing the actual risk of TB reactivation. Future studies should implement more comprehensive screening for immunocompromised states and consider stratified analyses to better understand TB risk in these vulnerable populations.
The cross-sectional design of this study limits our ability to observe actual progression from LTBI to active TB over time. Prospective follow-up of individuals diagnosed with LTBI would provide valuable insights into the true predictive value of identified risk factors and validate the clinical utility of our prediction models [43]. Such follow-up studies would enable estimation of progression rates in different risk groups and identification of factors that specifically predict LTBI reactivation, which may differ from factors distinguishing prevalent LTBI from active TB at a single time point. Longitudinal studies would also allow for the assessment of time-varying exposures and the dynamic nature of risk factors, providing a more comprehensive understanding of the natural history of LTBI progression. Future research should incorporate prospective cohort designs with adequate follow-up periods (typically 2–5 years) to establish temporal relationships between risk factors and TB outcomes, validate the predictive models in real-world settings, and ultimately improve the evidence base for targeted prevention strategies.
Finally, the study population is restricted to the Xinjiang Uygur Autonomous Region, which may limit the generalizability of the findings to other populations with different demographic, genetic, or environmental characteristics.
The findings have important implications for tuberculosis control policies. The identification of modifiable risk factors such as smoking, occupational exposures, and socioeconomic factors suggests that comprehensive tuberculosis prevention should extend beyond traditional medical interventions to include social and environmental policies [44]. Future research should focus on prospective validation of the developed models in different populations and healthcare settings [45]. Additionally, intervention studies targeting the identified modifiable risk factors could provide valuable evidence for policy development. The integration of emerging technologies such as digital health platforms and mobile applications could further enhance the accessibility and impact of risk-based screening approaches.
Conclusion
This study successfully identified key risk factors for tuberculosis using advanced machine learning techniques in a large Chinese population. The developed risk prediction models and nomogram provide practical tools for healthcare providers to implement personalized, evidence-based tuberculosis prevention strategies. The integration of traditional epidemiological methods with modern ML approaches offers a robust framework for risk assessment that can be adapted to different healthcare settings.
The nomogram demonstrates good discrimination and clinical utility, particularly valuable in resource-limited settings where efficient risk stratification is essential for optimal resource allocation. The identification of both traditional risk factors and novel associations through SHAP analysis provides insights that can inform targeted intervention strategies and policy development.
These findings contribute to the growing evidence base for precision medicine approaches in tuberculosis prevention and control, offering pathways for more effective and efficient public health interventions [46]. However, future studies should address the limitations identified, particularly by incorporating more specific diagnostic tests such as IGRAs, implementing comprehensive exclusion criteria for immunocompromised states, and conducting prospective longitudinal studies to validate the predictive value of identified risk factors and assess actual progression from LTBI to active TB.
Supplementary Information
Acknowledgements
We express our deepest gratitude to the study participants, data collectors, and supervisors for their willingness to participate in this study, for data collection, and for facilitating and organizing the data throughout the study period.
Authors’ contributions
Y.W wrote the draft of the manuscript and interpreted the results. Z.L extracted and analyzed the data. Y.W and X.W selected the studies and assessed the quality of the included studies. L.Y and M.K coordinated the on-site work and distributed the questionnaires. Y.W designed the search strategy and searched the literature. Y.X provided administrative support, comments, and suggestions during revisions of the paper. Y.X provided resources used in drafting the discussion. All authors read and approved the final manuscript.
Funding
This work was supported by the Xinjiang Uygur Autonomous Region Science Foundation (No. 2022D01C203) and the 14th Five-Year Plan Distinctive Program of Public Health and Preventive Medicine in Higher Education Institutions of Xinjiang Uygur Autonomous Region.
Data availability
The data underlying this article cannot be shared publicly due to the privacy of individuals that participated in the study. The data will be shared on reasonable request to the corresponding author. Relevant R code is available upon request to the corresponding author.
Declarations
Ethical approval and consent to participate
This study was conducted in accordance with the Declaration of Helsinki and was approved by the Ethics Committee of Xinjiang Medical University (ID: XJYKDXR20211015010). All participants were informed about the purpose, procedures, potential risks, and benefits of the study, and written informed consent was obtained from each participant prior to their participation.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. World Health Organization. Global tuberculosis report 2024. Geneva: World Health Organization; 2024.
- 2. Cui X, Gao L, Cao B. Management of latent tuberculosis infection in China: exploring solutions suitable for high-burden countries. Int J Infect Dis. 2020;92S:S37–40.
- 3. Getahun H, Matteelli A, Chaisson RE, Raviglione M. Latent Mycobacterium tuberculosis infection. N Engl J Med. 2015;372(22):2127–35.
- 4. World Health Organization. Guidelines on the management of latent tuberculosis infection. Geneva: World Health Organization; 2015.
- 5. Deo RC. Machine learning in medicine. Circulation. 2015;132(20):1920–30.
- 6. Gao J, Jiang Q, Zhou B, Chen D. Convolutional neural networks for computer-aided detection or diagnosis in medical image analysis: an overview. Math Biosci Eng. 2019;16(6):6536–61.
- 7. Seixas JM, Faria J, Souza Filho JB, et al. Artificial neural network models to support the diagnosis of pleural tuberculosis in adult patients. Int J Tuberc Lung Dis. 2013;17(5):682–6.
- 8. Lakhani P, Sundaram B. Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology. 2017;284(2):574–82.
- 9. Pasa F, Golkov V, Pfeiffer F, Cremers D, Pfeiffer D. Efficient deep network architectures for fast chest X-ray tuberculosis screening and visualization. Sci Rep. 2019;9(1):6268.
- 10. Iqbal MS, Naqvi RA, Alizadehsani R, Hussain S, Moqurrab SA, Lee SW. An adaptive ensemble deep learning framework for reliable detection of pandemic patients. Comput Biol Med. 2024;168:107836.
- 11. Hamza A, Attique Khan M, Wang SH, Alhaisoni M, Alharbi M, Hussein HS, Alshazly H, Kim YJ, Cha J. COVID-19 classification using chest X-ray images based on fusion-assisted deep bayesian optimization and Grad-CAM visualization. Front Public Health. 2022;10:1046296.
- 12. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55.
- 13. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
- 14. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. ACM; 2016. 10.1145/2939672.2939785.
- 15. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
- 16. Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge: MIT Press; 2016.
- 17. Efron B, Tibshirani RJ. An introduction to the bootstrap. 1993.
- 18. Lundberg S, Lee SI. A unified approach to interpreting model predictions. 2017. 10.48550/arXiv.1705.07874.
- 19. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B Stat Methodol. 1996;58(1):267–88.
- 20. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Mak. 2006;26(6):565–74.
- 21. Chen J, Jiang Y, Li Z, et al. Predictive machine learning models for anticipating loss to follow-up in tuberculosis patients throughout anti-TB treatment journey. Sci Rep. 2024;14(1):24685.
- 22. Tang X, Qian Y, Zhang X, et al. Machine learning prediction model of tuberculosis incidence based on meteorological factors and air pollutants. Int J Environ Res Public Health. 2023;20(5):3910.
- 23. Li D, Tang SY, Lei S, Xie HB, Li LQ. A nomogram for predicting mortality of patients initially diagnosed with primary pulmonary tuberculosis in Hunan province, China: a retrospective study. Front Cell Infect Microbiol. 2023;13:1179369.
- 24. Peng AZ, Kong XH, Liu ST, et al. Explainable machine learning for early predicting treatment failure risk among patients with TB-diabetes comorbidity. Sci Rep. 2024;14(1):6814.
- 25. Emegano DI, Duwa BB, Usman AG, Ahmad H, Ozsahin DU. A comparative study on TB incidence and HIVTB coinfection using machine learning models on WHO global TB dataset. Sci Rep. 2024;14(1):11781.
- 26. Nalunjogi A, Kigozi B, Kiguli J, et al. Comprehensive study on tuberculosis prediction models: integrating machine learning into epidemiological analysis. Comput Biol Med. 2024;168:107836.
- 27. Rodrigues MSB, Rabahi MF, Araujo-Filho JAB, et al. Machine learning algorithms using National registry data to predict loss to follow-up during tuberculosis treatment. BMC Public Health. 2024;24(1):1388.
- 28. Deelder W, Christakoudi S, Phelan J, et al. Machine learning predicts accurately Mycobacterium tuberculosis drug resistance from whole genome sequencing data. Front Genet. 2019;10:922.
- 29. Nalunjogi A, Kigozi B, Kiguli J, et al. A machine learning approach to explore individual risk factors for tuberculosis treatment non-adherence in Mukono district. PLOS Glob Public Health. 2023;3(7):e0001466.
- 30. Ma JB, Zeng LC, Ren F, et al. Development and validation of a prediction model for unsuccessful treatment outcomes in patients with multi-drug resistance tuberculosis. BMC Infect Dis. 2023;23(1):289.
- 31. Ji S, Lu B, Pan X. A nomogram model to predict the risk of drug-induced liver injury in patients receiving anti-tuberculosis treatment. Front Pharmacol. 2023;14:1153815.
- 32. Bartl L, Zeeb M, Kälin M, et al. Machine learning-based prediction of active tuberculosis in people with HIV using clinical data. Clin Infect Dis. 2025;81(3):521–30.
- 33. Wang Y, Gu Y, Ren J, et al. Prediction of tuberculosis-specific mortality for older adult patients with pulmonary tuberculosis. Front Public Health. 2025;12:1497842.
- 34. Zhang A, Xing L, Zou J, Wu JC. Shifting machine learning for healthcare from development to deployment and from models to data. Nat Biomed Eng. 2022;6(12):1330–45.
- 35. Smith JP, Milligan K, McCarthy KD, et al. Machine learning to predict bacteriologic confirmation of Mycobacterium tuberculosis in infants and very young children. PLOS Digit Health. 2023;2(5):e0000249.
- 36. Cheng Q, Zhao G, Wang X, et al. Nomogram for individualized prediction of incident multidrug-resistant tuberculosis after completing pulmonary tuberculosis treatment. Sci Rep. 2020;10:13730.
- 37. Rashidi HH, Khan IH. Prediction of tuberculosis using an automated machine learning platform for models trained on synthetic data. J Pathol Inf. 2022;13:100172.
- 38. Pai M, Denkinger CM, Kik SV, et al. Gamma interferon release assays for detection of Mycobacterium tuberculosis infection. Clin Microbiol Rev. 2014;27(1):3–20.
- 39. Farhat M, Greenaway C, Pai M, Menzies D. False-positive tuberculin skin tests: what is the absolute effect of BCG and non-tuberculous mycobacteria? Int J Tuberc Lung Dis. 2006;10(11):1192–204.
- 40. Sester M, Sotgiu G, Lange C, et al. Interferon-γ release assays for the diagnosis of active tuberculosis: a systematic review and meta-analysis. Eur Respir J. 2011;37(1):100–11.
- 41. Diel R, Loddenkemper R, Nienhaus A. Evidence-based comparison of commercial interferon-gamma release assays for detecting active TB: a metaanalysis. Chest. 2010;137(4):952–68.
- 42. Hasan T, Au E, Chen S, Tong A, Wong G. Screening and prevention for latent tuberculosis in immunosuppressed patients at risk for tuberculosis: a systematic review of clinical practice guidelines. BMJ Open. 2018;8(9):e022445.
- 43. Comstock GW, Livesay VT, Woolpert SF. The prognosis of a positive tuberculin reaction in childhood and adolescence. Am J Epidemiol. 1974;99(2):131–8.
- 44. Farhat MR, Keshavjee S, Hoen AG, et al. Genomic epidemiology of rifampicin resistance in Mycobacterium tuberculosis: global analysis reveals key drivers and complex patterns. Sci Transl Med. 2024;16(731):eadj2002.
- 45. Naidoo K, Perumal R. Advances in tuberculosis control during the past decade. Lancet Respir Med. 2023;11(4):311–3.
- 46. Motta I, Boeree M, Chesov D, et al. Recent advances in the treatment of tuberculosis. Clin Microbiol Infect. 2023;29(8):1005–15.