Abstract
In this study, we investigated the correlation between air pollution indicators and pulmonary tuberculosis (TB) incidence and mortality rates across provincial administrative regions of China from January 2013 to December 2020 to develop predictive models using machine learning. Data on TB rates and six air pollution indicators were collected and analyzed for correlations. Regression models were built using six algorithms, among which the random forest (RF) model showed superior performance. SHapley Additive exPlanations analysis helped interpret the RF model’s predictions. Seasonal and lag analyses identified a 10-month optimal lag period. Seasonal autoregressive integrated moving average models were used to predict 2020 TB incidence rates, which were validated by comparing them with actual data. The results indicated significant correlations between air pollution and TB rates, highlighting that air pollution data can predict TB incidence and mortality; therefore, air pollution data can help develop public health strategies. This study emphasized the importance of integrating environmental factors into TB control efforts using artificial intelligence.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-025-08078-z.
Keywords: Tuberculosis, Air pollution, Machine learning
Subject terms: Environmental sciences, Diseases
Introduction
Pulmonary tuberculosis (TB) is a major global health challenge, particularly in regions with high levels of air pollution1,2. PM2.5 and SO2 increase TB risk by inducing lung inflammation and oxidative stress, impairing respiratory mucosal defenses, and suppressing the alveolar macrophage and T-cell responses critical for Mycobacterium tuberculosis control3. Liu et al.2 reported a 4.6% TB risk increase per 10 μg/m3 SO2 in Hubei (RR = 1.046), underscoring the role of pollution in TB epidemiology. In China, despite considerable efforts to control TB, the disease is a major public health concern, with incidence and mortality rates closely linked to environmental factors4. Recent studies have highlighted the effects of ambient air pollution on the transmission and outcomes of TB. However, despite the growing application of machine learning (ML) and time series models, a critical gap persists in systematically integrating comprehensive air pollution indicators to predict both TB incidence and mortality across China’s diverse geographic regions, where pollution and TB burdens vary significantly5.
Ambient (outdoor) air pollution from particulate matter (PM2.5, PM10), nitrogen dioxide (NO2), sulfur dioxide (SO2), carbon monoxide (CO), and ozone (O3) increases the incidence of respiratory diseases such as TB6. Researchers have focused on understanding the relationships between these pollutants and the incidence of TB, using advanced statistical and ML models to predict disease patterns5.
Previous studies have investigated various modeling approaches, such as the autoregressive integrated moving average model with exogenous inputs (ARIMAX) and ML models, to predict the incidence of TB by incorporating air pollution and meteorological factors. These studies showed that the ARIMAX model can predict the incidence of TB when air pollutants and meteorological factors are taken into account4. ML techniques have also been used to develop TB incidence prediction models based on meteorological data and air pollutants, with the random forest (RF) model showing promising prediction accuracy7. However, these efforts often focus on isolated predictors, specific regions, or single outcomes such as incidence, leaving the combined impact of multiple air pollutants on both TB incidence and mortality, as well as regional variations, underexplored across China.
Therefore, there is an urgent need to quantify the environmental drivers of TB in China. We hypothesize that a comprehensive set of air pollution indicators (PM2.5, PM10, SO2, NO2, CO, O3, and the composite AQI) significantly influences TB incidence and mortality, with effects varying by region and season due to China’s geographic and climatic diversity. We further posit that a hybrid approach combining ML and seasonal autoregressive integrated moving average (SARIMA) models, enhanced by SHAP for interpretability, outperforms standalone models in terms of predictive accuracy and provides deeper insights into pollutant-specific contributions. In this study, we investigated the associations between air pollution indicators and the incidence and mortality of TB and its subgroups to establish RF and SARIMA models via ML and time series analysis to effectively predict the incidence and mortality of TB patients. Additionally, we built models for each region, subgroup, and air pollution indicator and compared their performance to comprehensively understand TB and air pollution indicators in the ML prediction of morbidity and mortality. These findings provide new insights into the public health field and a novel perspective on ML and SARIMA models, along with predictions of disease incidence and mortality4,5,8.
The comprehensive approach adopted in this study, including model interpretation of SHAP values and nested model analysis, might help researchers understand the complex interaction between air pollution and TB in China9.
Methods
Data collection
Data on TB incidence and mortality rates from January 2013 to December 2020 were obtained for each provincial administrative region in Mainland China, excluding Hong Kong, Macau, and Taiwan. These data were sourced from the Data Center for Public Health Sciences, the Chinese Center for Disease Control and Prevention (PhSciencedata). The dataset included overall TB incidence and mortality rates, as well as rates for specific subgroups: pathogen-positive cases, pathogen-negative cases, cases without pathogen examination, and rifampicin-resistant cases. All incidence and mortality rates were reported per 100,000 people. In this study, the four municipalities directly under the central government (Beijing, Tianjin, Shanghai, and Chongqing) are treated as provincial-level administrative units and are collectively referred to as provinces. The unit of analysis is the province-month: TB incidence rates, TB mortality rates, and air pollution indicators (PM2.5, PM10, SO2, NO2, CO, O3, and composite AQI) were collected monthly for each province from January 2013 to December 2020.
Six air pollution indicators (PM2.5, PM10, SO2, NO2, CO, and O3) were obtained from the monitoring platform of the National Environmental Monitoring Center (https://www.cnemc.cn), along with the composite air quality index (AQI) for each province; together, these indicators provide a comprehensive measure of air quality across the country. The O3 indicator refers to the 8-h sliding average of ozone (O3_8H). This metric evaluates the daily ozone pollution level using the maximum average concentration over any continuous 8-h period within a day as the value for that day. The composite AQI is a standardized overall measure of air quality calculated by the National Environmental Monitoring Center on the basis of the concentrations of the six key air pollutants (PM2.5, PM10, SO2, NO2, CO, and O3). These pollutants were weighted according to their health effects and regulatory thresholds, with the weights determined according to the Chinese National Standard (GB 3095–2012). This approach ensures a comprehensive assessment of air quality and provides a solid basis for our analysis. Our study focuses exclusively on outdoor air quality data collected from outdoor monitoring stations and does not include indoor air pollution data; we state this explicitly to delimit our scope and clarify the data source.
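The daily 8-h ozone metric described above can be illustrated with a minimal sketch. It assumes 24 hourly readings for a single day and ignores the data-completeness rules and cross-midnight windows that an official calculation may apply; the function name and the hourly values are hypothetical.

```python
def max_8h_mean(hourly):
    """Return the highest mean over any 8 consecutive hourly readings."""
    window = 8
    if len(hourly) < window:
        raise ValueError("need at least 8 hourly readings")
    best = None
    for start in range(len(hourly) - window + 1):
        avg = sum(hourly[start:start + window]) / window
        if best is None or avg > best:
            best = avg
    return best


# Hypothetical hourly O3 concentrations (ug/m3) for one day (24 values)
hourly_o3 = [20] * 10 + [80, 100, 120, 140, 160, 150, 130, 110] + [40] * 6
daily_o3_8h = max_8h_mean(hourly_o3)
print(daily_o3_8h)
```

Monthly provincial values such as those analyzed here would then be aggregates of these daily values; the aggregation rule is not specified in the text.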
Data processing
Since the data had no missing values, preprocessing focused on data normalization to ensure comparability across different metrics. Air pollution indicators and TB rates were normalized via z scores10,11.
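As a minimal illustration of z-score normalization, the sketch below standardizes one series to mean 0 and standard deviation 1. The use of the population standard deviation is an assumption (the text does not say which estimator was used), and the PM2.5 values are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a sequence to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; sample SD is also common
    if sigma == 0:
        raise ValueError("z scores are undefined for a constant series")
    return [(v - mu) / sigma for v in values]


# Hypothetical monthly PM2.5 values (ug/m3) for one province
pm25 = [35.0, 42.0, 58.0, 61.0, 44.0, 30.0]
pm25_z = z_scores(pm25)
print([round(z, 3) for z in pm25_z])
```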
Statistical analysis
Descriptive analysis
The national and provincial TB incidence rates and mortality rates in China from January 2013 to December 2020 were analyzed. Subgroup proportions (pathogen-positive, pathogen-negative, and without pathogen examination) were calculated. Time trend changes were described to identify any temporal patterns or shifts in TB incidence and mortality over the study period. The relationship between provincial TB rates and the composite AQI was visualized using maps to provide a geographical perspective on the data.
Correlation analysis
Data normality was assessed with the Kolmogorov–Smirnov test12. The Spearman correlation coefficient was calculated to assess the relationships between each air pollution indicator and TB incidence and mortality rates across provinces13. The average correlation for each indicator was calculated and ranked from highest to lowest. By conducting this analysis, we identified the provinces with the highest average correlation between air pollution indicators and TB incidence and mortality rates14.
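The correlation-and-ranking step can be sketched as follows. This is a dependency-free illustration, not the code used in the study: Spearman's coefficient is computed as the Pearson correlation of average ranks, the per-province average is taken over signed coefficients (an assumption, since the text does not say whether signed or absolute values were averaged), and all data values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical monthly series for two provinces and two indicators
incidence = {
    "Province A": [5.1, 4.8, 4.2, 3.9, 4.5, 5.0],
    "Province B": [3.0, 3.4, 2.9, 3.1, 3.3, 2.8],
}
pollution = {
    "Province A": {"SO2": [30, 27, 20, 18, 24, 29], "O3": [60, 70, 95, 100, 80, 65]},
    "Province B": {"SO2": [15, 14, 16, 13, 15, 17], "O3": [90, 85, 88, 92, 80, 86]},
}

avg_rho = {}
for province, indicators in pollution.items():
    rhos = [spearman(series, incidence[province]) for series in indicators.values()]
    avg_rho[province] = sum(rhos) / len(rhos)  # assumption: signed average

# Provinces ordered from highest to lowest average correlation
ranking = sorted(avg_rho, key=avg_rho.get, reverse=True)
print(ranking, avg_rho)
```

Note that averaging signed coefficients lets a strong positive and a strong negative correlation cancel out, as happens for the hypothetical Province A; averaging absolute values would rank provinces by strength of association instead.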
Machine learning models
Feature selection
In the regression model development process, the time dimension was excluded. The model incorporated six air pollution indicators (PM2.5, PM10, SO2, NO2, CO, and O3_8H) and the composite AQI for each province. Owing to significant geographical differences across provinces, the province was included as a feature. Thus, eight features were used to predict the incidence and mortality rates of TB and its subgroups (pathogen-positive, pathogen-negative, and without pathogen examination). The province was reencoded to account for these differences. One method involves ordering provinces alphabetically and encoding them numerically, which is suitable for ordinal categories. Another method, known as one-hot encoding15, converts categorical features into binary vectors. Each category is represented by a vector with one element as 1 and the others as 0. Some learning algorithms perform differently depending on the encoding method used. To ensure consistency across machine learning algorithms when processing the province feature, one-hot encoding was initially selected after repeated comparisons. One-hot encoding transforms each province into a binary vector, preserving the categorical nature of the feature and ensuring fair model comparisons.
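The two encodings of the province feature can be compared in a short sketch. The helper names and the four-row province list are illustrative only.

```python
def label_encode(categories):
    """Map each category to an integer following alphabetical order."""
    levels = sorted(set(categories))
    index = {name: i for i, name in enumerate(levels)}
    return [index[c] for c in categories], levels


def one_hot_encode(categories):
    """Map each category to a binary vector with a single 1."""
    levels = sorted(set(categories))
    vectors = [[1 if c == level else 0 for level in levels] for c in categories]
    return vectors, levels


# Illustrative province column for four records
provinces = ["Sichuan", "Guizhou", "Hebei", "Sichuan"]
labels, levels = label_encode(provinces)
one_hot, _ = one_hot_encode(provinces)
print(levels)   # column order shared by both encodings
print(labels)
print(one_hot)
```

Label encoding yields one integer column whose order (here alphabetical) carries no epidemiological meaning, whereas one-hot encoding yields one column per province and imposes no order.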
Regression modeling
First, the dataset was split into an 80% training set and a 20% external validation set. The 80% training set was further divided into 75% for training and 25% for internal validation (that is, 60%, 20%, and 20% of the full dataset for training, internal validation, and external validation, respectively). Six regression algorithms were used to predict TB incidence and mortality rates: k-nearest neighbors (k-NN), decision tree (DT), random forest (RF), support vector machine (SVM), elastic net regularization (glmnet), and linear regression (lm). These algorithms were chosen for their complementary strengths. RF and DT are good at capturing complex interactions and nonlinearities in high-dimensional data, whereas SVM and k-NN are robust to noise and suited to local pattern recognition, respectively. Glmnet and lm provide regularization and simplicity, respectively, ensuring an extensive evaluation of model suitability. This allowed us to systematically compare predictive power and select the most effective method for modeling TB epidemiology. Default hyperparameters of the respective machine learning libraries were used for all models to ensure a standardized approach. The RF model used 100 trees, and the k-NN algorithm used k = 5 neighbors. These choices are based on default settings commonly used in epidemiological modeling. Model performance was assessed by R-squared (R2), root mean squared error (RMSE), mean squared error (MSE), and mean absolute error (MAE). Cross-validation was performed to ensure robustness. The best-performing model on the basis of these metrics was selected for subsequent analysis.
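The split proportions and the four performance metrics can be made concrete with a small sketch. It assumes simple random splits (the text does not state whether splits were stratified by province or time); the seeds, record stand-ins, and toy predictions are hypothetical.

```python
import random


def split(rows, holdout_fraction, seed):
    """Shuffle a copy of rows and hold out the last holdout_fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


def regression_metrics(actual, predicted):
    """Return R2, RMSE, MSE, and MAE for paired observations."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - (mse * n) / ss_tot  # 1 - SS_res / SS_tot
    return {"R2": r2, "RMSE": mse ** 0.5, "MSE": mse, "MAE": mae}


# 1000 stand-ins for province-month records
rows = list(range(1000))
development, external = split(rows, 0.20, seed=0)   # 80% / 20%
train, internal = split(development, 0.25, seed=1)  # 75% / 25% of the 80%
print(len(train), len(internal), len(external))

# Toy observed and predicted rates (hypothetical)
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 8.0])
print(metrics)
```

Because RMSE is the square root of MSE, the two columns of a results table can be checked against each other.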
Model interpretation
Feature importance was assessed via SHapley Additive exPlanations (SHAP) values to interpret the model’s predictions16,17. SHAP provides interpretable, feature-specific insights into the predictions of RF models, addressing a key limitation of previous TB studies, in which the contributions of air pollution metrics to outcomes remained opaque. This approach helps us understand potential pollutant-specific drivers to support targeted public health interventions. SHAP dependence plots were used to visualize the individual effects of each air pollution indicator on the model. First, we used shap.TreeExplainer to compute SHAP values for the entire dataset, providing a comprehensive understanding of feature contributions. The SHAP library automatically samples background data from the training set, preserving the statistical distribution of air pollution indicators. Second, we specified the training set as the background dataset and then used shap.TreeExplainer to compute SHAP values for the test set, ensuring robust interpretation of feature contributions.
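To show what a SHAP value represents, the sketch below computes exact Shapley values by brute force for a toy three-feature function. This is not the tree-specific algorithm used by shap.TreeExplainer, and it uses a single baseline point rather than a background dataset; the toy model, its coefficients, and the input values are all hypothetical.

```python
from itertools import permutations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Averages each feature's marginal contribution over all orderings in
    which features are switched from their baseline value to their value in x.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]
            updated = model(current)
            phi[i] += updated - previous
            previous = updated
    return [p / factorial(n) for p in phi]


def toy_model(features):
    """Hypothetical response with one interaction term (not the RF model)."""
    so2, no2, o3 = features
    return 2.0 * so2 - 1.0 * no2 + 0.5 * so2 * o3


x = [3.0, 2.0, 4.0]         # observation to explain
baseline = [1.0, 1.0, 2.0]  # reference point standing in for background data
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction for the observation and the prediction for the baseline, which is the additivity property that SHAP summary and dependence plots rely on.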
SARIMA models
Province selection
For further analysis, the provinces with the highest average correlation and best regression model performance for TB incidence and mortality rates were selected. These provinces were used for subsequent seasonal, lag, and SARIMA modeling analyses.
SARIMA analysis
Seasonal and trend decomposition using Loess (STL)18,19 was used to analyze seasonal patterns in TB incidence and mortality rates. Lag analysis was used to identify the optimal lag period. In time series models, such as the ARIMA or SARIMA models20, candidate lags are compared by calculating the Akaike information criterion (AIC) values21. AIC assesses a model’s fit quality by balancing model complexity against goodness of fit to the data. The AIC value can be calculated via the following equation:
$$\mathrm{AIC} = 2k - 2\ln(L) \tag{1}$$
Here, k is the number of parameters in the model, and L is the maximized value of the model’s likelihood function. The smaller the AIC value is, the better the model. Therefore, researchers generally choose the lag with the smallest AIC value as the optimal lag for the model.
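Lag selection by AIC can be sketched in a few lines, using the standard definition AIC = 2k - 2 ln(L). The candidate lags, parameter counts, and log-likelihoods below are hypothetical placeholders, not values from this study.

```python
def aic(num_params, log_likelihood):
    """AIC = 2k - 2 ln(L), with ln(L) supplied directly."""
    return 2 * num_params - 2 * log_likelihood


# Hypothetical (lag in months, number of parameters k, maximized ln L)
candidates = [(6, 4, -120.0), (10, 4, -112.5), (12, 5, -112.0)]

aic_by_lag = {lag: aic(k, loglik) for lag, k, loglik in candidates}
best_lag = min(aic_by_lag, key=aic_by_lag.get)  # smallest AIC wins
print(aic_by_lag, best_lag)
```

In this toy case the 12-month model fits slightly better than the 10-month model but pays for its extra parameter, so the 10-month lag has the smaller AIC.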
SARIMA models were fitted via the identified optimal lag period22. SARIMA was chosen for its ability to capture seasonality and long lags, which gives it an advantage over nonseasonal ARIMA models. This method improves the prediction accuracy for time series with periodic patterns, which is consistent with our goal of accurately predicting the TB trend in each selected province. The models were trained on data from January 2013 to December 2019 and validated on data from 2020. The accuracy of the predictions was evaluated by comparing the predicted values with the actual TB rates for 2020.
Subgroup analysis
Optimal models were also constructed for different TB subtypes (pathogen-positive, pathogen-negative, and without pathogen examination) and their incidence and mortality rates. Additionally, the differences in the model performance in terms of TB incidence rates and subgroups were compared.
Nested model analysis
A series of nested models was developed by adding air pollution indicators one at a time, starting with one indicator and progressing to all seven indicators (including the composite AQI)23. This approach was chosen to systematically evaluate the incremental predictive value of integrating comprehensive air pollution indicators, testing our hypothesis that a complete set of environmental factors improves model accuracy more than a simple framework does. By quantifying the contribution of each additional indicator, this approach supports our goal of developing robust predictive models while validating the importance of multipollutant interactions in TB epidemiology. Seven models were created in this way, and their performances were compared to identify the effect of each additional indicator on the accuracy of the model. Model performance metrics (R2, RMSE, MSE, and MAE) for each nested model were calculated and analyzed.
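The nested design can be sketched as a sequence of growing feature sets. The order in which indicators are added is an assumption (the text does not state it), and the R2 values used to illustrate incremental gains are made-up placeholders.

```python
# Assumed order of addition; the study's actual order is not stated
INDICATORS = ["PM2.5", "PM10", "SO2", "NO2", "CO", "O3_8H", "AQI"]


def nested_feature_sets(indicators):
    """Return feature lists of size 1, 2, ..., len(indicators)."""
    return [indicators[:i] for i in range(1, len(indicators) + 1)]


def incremental_gains(scores):
    """Change in a score each time one more indicator is added."""
    return [round(b - a, 6) for a, b in zip(scores, scores[1:])]


feature_sets = nested_feature_sets(INDICATORS)
for features in feature_sets:
    print(len(features), features)

# Placeholder R2 values for the first three nested models (not study results)
example_gains = incremental_gains([0.50, 0.55, 0.58])
print(example_gains)
```

Each feature set would be passed to the same model-fitting routine so that differences in R2, RMSE, MSE, and MAE reflect only the added indicator.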
Software and tools
All the statistical analyses and ML models were implemented via R (version 4.3.2; R Core Team, 2024, R Foundation for Statistical Computing, Vienna, Austria) and Python (version 3.11.8; Python Software Foundation, https://www.python.org/). The ML models were developed using the mlr3 package in R and the scikit-learn package in Python. Data manipulation and visualization were performed via dplyr, ggplot2, and matplotlib24,25. The R and Python codes containing all analyses are provided as text files in the Supplementary Files S1 for reproducibility. The geographic distributions of TB incidence rates and mortality rates in the context of the composite AQI across the abovementioned Chinese provinces were analyzed via ArcGIS Pro (version 3.0; Esri Inc., Redlands, CA, USA). The map of China used for this analysis was identified by map number GS (2022)1873. Adobe Illustrator 2023 was used to combine and finalize the images.
Results
National TB subgroup composition and time series analysis
Layered bar charts of TB incidence and mortality rates from January 2013 to December 2020 in China are shown in Fig. 1. The incidence and mortality rates were categorized into pathogen-positive, pathogen-negative, without pathogen examination, and rifampicin-resistant cases. The overall TB incidence rates (Fig. 1a) decreased over the study period, with pathogen-positive cases being the most predominant, followed by pathogen-negative cases and cases without pathogen examination. Rifampicin-resistant cases, while present, were less common. Similarly, TB mortality rates (Fig. 1b) decreased, with pathogen-positive patients having the highest mortality rates, followed by pathogen-negative patients and those without pathogen examination results. Rifampicin-resistant cases accounted for the smallest share of mortality.
Fig. 1.
National trends in TB incidence and mortality rates in China (2013–2020). (a) The figure illustrates a decrease in the incidence of TB in China from 2013 to 2020, with pathogen-positive cases consistently representing the highest proportion. (b) TB mortality rates decreased, with pathogen-positive patients having the highest mortality. Both figures show seasonal peaks in winter and troughs in summer, reflecting environmental effects.
The incidence rates and mortality rates exhibited seasonal patterns, with peaks typically occurring in winter and troughs in summer, suggesting that environmental and social factors influence TB transmission and outcomes. The overall decline in TB rates indicated improved public health measures and better access to healthcare services over the years.
Geographic distribution of TB incidence and mortality rates in the context of air quality in China
The TB incidence and mortality rates displayed on the map were recalculated as overall rates for the study period on the basis of sample data from each province covering January 2013 to December 2020.
The composite AQI for each province, as shown on the map, was calculated using the average annual AQI values over the study period. This approach more accurately represented the long-term air quality in each province.
The analysis revealed that the geographic distributions of TB incidence (Fig. 2a) and mortality rates (Fig. 2b) in the context of the composite AQI varied significantly across different regions. To capture changes over time, we also analyzed the geographic distribution of TB incidence, mortality, and composite AQI for key years (2013, 2016, and 2020), highlighting trends over time, as shown in supplementary Fig. S1a–i.
Fig. 2.
Geographic distribution of TB incidence and mortality in the context of the composite AQI. (a) The height of each province represents the TB incidence rate multiplied by 50,000 m, with the color indicating the composite AQI. Hong Kong, Macau, and Taiwan are shown in gray, as they were not included in this study. (b) The height of each province represents the TB mortality rate multiplied by 5,000,000 m, with the color indicating the composite AQI. Hong Kong, Macau, and Taiwan are shown in gray and were excluded from the analysis. The figure illustrates varying degrees of correlation between TB incidence and mortality rates and the composite AQI across different provinces.
The varying correlations between air quality and TB rates among provinces indicate that the effects of air pollution on TB epidemiology differ depending on geographic location. This finding emphasized the need for region-specific public health strategies to address TB effectively in areas with poor air quality, ensuring that interventions are tailored to the unique environmental challenges faced by each region.
Spearman correlation analysis between air pollution indicators and TB rates
Spearman correlation coefficient heatmaps were used to illustrate the relationships between various air pollution indicators and TB incidence (Fig. 3a) and mortality rates (Fig. 3b) across the provinces, with the provinces ranked by average correlation from highest to lowest. The analysis revealed that Sichuan Province had the strongest correlation between air pollution and TB incidence rates, whereas Guizhou Province had the strongest correlation with TB mortality rates. The differing correlation strengths across provinces indicated that air pollution had variable effects on TB epidemiology in different regions. These findings suggest that tailored public health interventions are needed to address specific regional air pollution issues to effectively control TB incidence and mortality.
Fig. 3.
Correlation analysis between air pollution indicators and TB rates. (a) Spearman’s correlation heat map between air pollution indicators and TB incidence rates shows that Sichuan, Gansu, and Shandong Provinces had the highest average correlations. (b) The Spearman correlation heat map between air pollution indicators and TB mortality rates indicates that Guizhou, Tianjin, and Gansu exhibited the highest average correlations. The analysis revealed varying levels of correlation across provinces, indicating that the effect of air pollution on TB epidemiology differs by region.
Regression models for TB incidence and mortality rates
Regression models were developed using six algorithms to predict TB incidence and mortality rates: k-NN, DT, RF, SVM, glmnet, and lm. Performance was evaluated via the R2, RMSE, MSE, and MAE metrics. The residual plots (Fig. 4a–f) and the actual vs. predicted plots (Fig. 4g–l) for the TB incidence rate regression models showed that the RF model had low residuals and close agreement between predicted and actual values. The corresponding plots for the TB mortality rate regression models (residuals, Fig. 5a–f; actual vs. predicted, Fig. 5g–l) showed a similar pattern. The detailed performance metrics are provided in Table 1.
Fig. 4.
Regression model analysis for TB incidence prediction. (a–f) The residual plots illustrate the differences between the observed and predicted TB incidence rates across various regression models, highlighting the lower residuals of the RF model, indicating greater accuracy. (g–l) The actual vs. predicted plots indicated the superior predictive power of the RF model, showing a closer alignment between the predicted values and actual TB incidence rates, which suggested that it was more effective than the other models.
Fig. 5.
Regression model analysis for TB mortality prediction. (a–f) The residual plots illustrate the differences between the observed and predicted TB mortality rates across various regression models, with the RF model displaying lower residuals, indicating greater accuracy. Compared to the TB incidence residuals in Fig. 4a, the mortality residuals showed slightly greater variability, suggesting that TB mortality is more challenging to predict. (g–l) The actual vs. predicted plots confirmed the strong predictive power of the RF model, closely aligning with actual mortality rates. Although the alignment was strong, it was slightly less precise than it was for incidence rates, which reinforced that predicting TB mortality is challenging.
Table 1.
Performance metrics of regression models for TB incidence and mortality predictions.
| Model | Dataset | R2 | RMSE | MSE | MAE |
|---|---|---|---|---|---|
| Incidence rate | | | | | |
| K nearest neighbor | Train | 0.9048941 | 1.0553381 | 1.1137385 | 0.4819396 |
| | Internal Validation | 0.8152901 | 1.4671295 | 2.1524689 | 0.7565232 |
| | External Validation | 0.8214972 | 1.4006284 | 1.9617598 | 0.7747783 |
| Decision tree | Train | 0.7354076 | 1.7602589 | 3.0985114 | 1.1239856 |
| | Internal Validation | 0.7243146 | 1.7923805 | 3.212628 | 1.1415151 |
| | External Validation | 0.7683863 | 1.5954475 | 2.5454527 | 1.1705087 |
| Random forest | Train | 0.8877206 | 1.1466684 | 1.3148484 | 0.5877167 |
| | Internal Validation | 0.8129191 | 1.4765158 | 2.180099 | 0.8054838 |
| | External Validation | 0.8243993 | 1.3891958 | 1.9298651 | 0.8548957 |
| Support vector machine | Train | 0.8183808 | 1.4583735 | 2.1268532 | 0.6780968 |
| | Internal Validation | 0.7931284 | 1.5526507 | 2.4107241 | 0.7054613 |
| | External Validation | 0.8371619 | 1.3377606 | 1.7896034 | 0.7572507 |
| Elastic net regularization | Train | 0.7694182 | 1.6432375 | 2.7002295 | 0.9645219 |
| | Internal Validation | 0.7726064 | 1.627843 | 2.649873 | 0.941992 |
| | External Validation | 0.8094378 | 1.4471676 | 2.0942939 | 0.9761135 |
| Linear regression | Train | 0.7956459 | 1.5469615 | 2.3930899 | 0.8172586 |
| | Internal Validation | 0.7932898 | 1.552045 | 2.4088437 | 0.7849069 |
| | External Validation | 0.8387318 | 1.3312966 | 1.7723505 | 0.8198348 |
| Mortality rate | | | | | |
| K nearest neighbor | Train | 0.821197231 | 0.010718922 | 0.000114895 | 0.005447882 |
| | Internal Validation | 0.526075564 | 0.01912715 | 0.000365848 | 0.010000171 |
| | External Validation | 0.526901554 | 0.018775802 | 0.000352531 | 0.009794139 |
| Decision tree | Train | 0.64492304 | 0.015105157 | 0.000228166 | 0.008918135 |
| | Internal Validation | 0.56952844 | 0.018229214 | 0.000332304 | 0.009923917 |
| | External Validation | 0.404625515 | 0.021062884 | 0.000443645 | 0.01006212 |
| Random forest | Train | 0.804227042 | 0.01121606 | 0.0001258 | 0.005494624 |
| | Internal Validation | 0.628972457 | 0.016923836 | 0.000286416 | 0.0087832 |
| | External Validation | 0.598537608 | 0.01729598 | 0.000299151 | 0.008810925 |
| Support vector machine | Train | 0.620911243 | 0.015607539 | 0.000243595 | 0.007448998 |
| | Internal Validation | 0.616260513 | 0.017211312 | 0.000296229 | 0.00881854 |
| | External Validation | 0.57142005 | 0.01787058 | 0.000319358 | 0.008590823 |
| Elastic net regularization | Train | 0.511172426 | 0.017723188 | 0.000314111 | 0.010482823 |
| | Internal Validation | 0.552052033 | 0.01859557 | 0.000345795 | 0.010834667 |
| | External Validation | 0.527543999 | 0.018763049 | 0.000352052 | 0.010638602 |
| Linear regression | Train | 0.60780598 | 0.015875027 | 0.000252017 | 0.008271318 |
| | Internal Validation | 0.628587938 | 0.016932603 | 0.000286713 | 0.008995218 |
| | External Validation | 0.577450282 | 0.017744413 | 0.000314864 | 0.009149828 |
Among the evaluated models, the RF model was selected as the most effective overall for predicting TB incidence and mortality rates. For mortality, it had the highest external validation R2 (0.599) and the lowest RMSE (0.017) of all six models. For incidence, its external validation R2 (0.824) and RMSE (1.389) were close to the best values in Table 1 (linear regression: R2 = 0.839, RMSE = 1.331). The k-NN model showed very similar external validation metrics for incidence (R2 = 0.821, RMSE = 1.401) and a lower MAE (0.775 versus 0.855 for RF), but it performed worse for mortality (R2 = 0.527, RMSE = 0.019), and RF selection is further justified by its stability and generalizability. RF’s ensemble approach, which aggregates multiple decision trees, mitigates overfitting, as shown by its smaller decline in incidence R2 from training (0.888) to external validation (0.824; an absolute drop of 0.063, or about 7.1% in relative terms) compared with k-NN (0.905 to 0.821; an absolute drop of 0.083, or about 9.2%), indicating k-NN’s greater risk of overfitting to the training data. Additionally, RF can capture complex, nonlinear interactions between air pollution indicators and TB outcomes across provinces, supporting robust predictions for public health applications. On this basis, the RF model was carried forward for subsequent analyses.
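The train-to-external-validation gap in incidence R2 can be recomputed directly from Table 1. The short sketch below reports both the absolute difference and the relative decline, since the two are easily confused; the function name is illustrative.

```python
def r2_decline(train_r2, external_r2):
    """Return (absolute drop, relative drop in percent) between two R2 values."""
    absolute = train_r2 - external_r2
    relative_pct = 100 * absolute / train_r2
    return absolute, relative_pct


# Incidence-rate R2 values taken from Table 1
rf_abs, rf_rel = r2_decline(0.8877206, 0.8243993)
knn_abs, knn_rel = r2_decline(0.9048941, 0.8214972)

print(f"RF:   absolute {rf_abs:.4f}, relative {rf_rel:.2f}%")
print(f"k-NN: absolute {knn_abs:.4f}, relative {knn_rel:.2f}%")
```

On either measure the RF model loses less R2 than k-NN between training and external validation.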
SHAP interpretation of the RF model
Using SHAP values, we interpreted the predictions made by the RF model for TB incidence and mortality rates. Due to the effects of encoding methods on different algorithms during the model construction phase, provinces were initially one-hot encoded to reduce errors. However, the increased number of features from one-hot encoding hindered the SHAP analysis. We then conducted RF modeling via simple numerical encoding (label encoding) and one-hot encoding for provinces and compared the MSEs of the model. The RF model with label encoding had an MSE of 1.54, whereas the one-hot encoded model had an MSE of 1.63. Although the difference was small, the label-encoded model performed slightly better. Therefore, label encoding was used for subsequent model interpretations to facilitate the SHAP analysis. Label encoding reduced features for clearer SHAP interpretation, focusing on the province feature’s impact, but may imply ordinality, oversimplifying variability. One-hot encoding preserves category distinctions yet increases dimensionality, diluting importance and raising MSE. We chose label encoding for its balance of interpretability and performance. The SHAP summary plots for the RF models of TB incidence and mortality (Fig. 6a,b) were used to illustrate the contributions of all the features to the models. Provincial features had the greatest impact on both TB incidence and mortality, reflecting significant regional variability, such as higher pollution levels and TB burdens in Sichuan and Guizhou, which modulate the effects of pollutants on TB outcomes. For TB incidence, NO2, SO2, and O3 were the air pollution indicators with the highest SHAP contributions, where SO2 and O3 presented significant positive effects, with higher concentrations associated with increased incidence, likely due to oxidative stress and inflammation impairing respiratory defenses, whereas NO2 presented a significant negative effect, with higher concentrations linked to decreased incidence. 
For TB mortality, SO2, O3, and PM2.5 were the indicators with the highest SHAP contributions, where SO2 had a significant positive effect, with higher concentrations associated with increased mortality, likely via immune suppression and physiological damage in advanced TB patients, and O3 displayed a slight positive effect. These patterns highlight the distinct pollutant-specific mechanisms driving incidence and mortality. The SHAP dependence plots for TB incidence (Fig. 6c–i) and mortality (Fig. 6j–p) show the impact of each air pollution indicator on the model predictions, with different colors representing different feature values. These plots confirmed the province’s dominant role in capturing regional variability. SHAP analyses revealed that air pollution indicators, particularly NO2, SO2, O3, and PM2.5, significantly influence TB incidence and mortality, with the province’s impact underscoring the need for localized interventions and guiding targeted strategies to mitigate these effects.
Fig. 6.
SHAP interpretation of the RF model for TB incidence and mortality. (a) SHAP summary plots for TB incidence show the influence of each air pollution indicator, with NO2 and SO2 having the greatest impact. (b) SHAP summary plots for TB mortality highlight SO2 and O3_8H as key predictors. (c–p) The dependence plots illustrate how changes in these indicators affect TB predictions, with distinct patterns for both incidence and mortality, emphasizing the importance of specific pollutants in TB outcomes.
Provincial regression model performance comparison
The RF model was used to predict TB incidence and mortality rates across the different provinces. The R2 values for the TB incidence predictions are shown in Fig. 7a, whereas the R2 values for the TB mortality predictions are shown in Fig. 7b. According to the analysis, the provincial regression models for incidence performed well, with Sichuan achieving the highest predictive performance, followed by Hebei and Jiangsu. In contrast, the models for mortality generally performed worse, with Shaanxi showing the best predictive accuracy for TB mortality, followed by Guizhou and Heilongjiang. The superior performance in some provinces can be attributed to stronger pollution-TB correlations and more consistent TB burden data, whereas the lower performance in others may result from lower pollution exposure, data sparsity, or greater heterogeneity in TB outcomes, as influenced by regional environmental and epidemiological factors. The varying R2 values across provinces indicated that the predictive power of air pollution indicators for TB incidence and mortality differed across regions. This highlights the need for localized public health strategies to address specific regional factors affecting TB outcomes. Sichuan and Guizhou were selected for further analysis on the basis of the correlation analysis and the performance of the RF model.
Fig. 7.
Provincial Model Performance in Predicting TB Rates. (a) Comparative R2 values for TB incidence prediction across provinces were determined using RF models; Sichuan and Hebei showed the highest predictive accuracy. (b) Comparative R2 values for TB mortality prediction, where Shaanxi and Guizhou showed the best performance. The analysis showed regional variations in predictive accuracy, highlighting the importance of developing localized models. Based on the correlation analysis and provincial model performance, Sichuan was selected for further incidence analysis, and Guizhou was selected for mortality analysis.
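A per-province R2 comparison of this kind can be sketched as follows. This section does not state whether the model was refitted for each province or whether one pooled model was scored province by province; the sketch assumes the latter. The province labels `A` and `B` and all numbers are made up for illustration.

```python
from collections import defaultdict

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def r_squared_by_group(records):
    """Score predictions separately within each province."""
    grouped = defaultdict(lambda: ([], []))
    for province, observed, predicted in records:
        grouped[province][0].append(observed)
        grouped[province][1].append(predicted)
    return {p: r_squared(obs, pred) for p, (obs, pred) in grouped.items()}

# Made-up (province, observed rate, predicted rate) triples.
records = [
    ("A", 10.0, 10.5), ("A", 12.0, 11.5), ("A", 14.0, 14.0),
    ("B", 3.0, 4.0), ("B", 4.0, 3.0), ("B", 5.0, 5.0),
]
scores = r_squared_by_group(records)
```

Because R2 is computed against each province's own mean, a province with little variation in its observed rates can score poorly even when absolute errors are small, which is one reason the ranking differs between incidence and mortality.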
SARIMA analysis
Seasonal and lag analyses were conducted for TB incidence in Sichuan and mortality in Guizhou, which were identified in previous analyses as having the highest predictive performance and the strongest correlation with air pollution. The seasonal decomposition for TB incidence in Sichuan is shown in Fig. 8a–c, whereas the seasonal decomposition for TB mortality in Guizhou is shown in Fig. 8e–g. The analysis revealed distinct seasonal patterns in both provinces, with peaks generally occurring in winter and troughs in summer. These seasonal fluctuations are driven by a combination of environmental, social and biological factors. Cold winter temperatures and high humidity increase indoor crowding, respiratory vulnerability and the accumulation of environmental pollutants, increasing TB transmission. The year-end surge may also reflect holiday gatherings that increase social contact or seasonal immune responses that weaken host defenses. Lag analysis revealed that a 10-month lag period provided the best model fit, as indicated by the lowest AIC values (Fig. 8d,h). These findings suggest that TB incidence and mortality are influenced by seasonal factors and that incorporating a lag period can increase predictive accuracy.
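The idea of separating a seasonal pattern from a trend can be sketched with a classical moving-average decomposition. This is a simplified stand-in: the decomposition procedure the authors used is not restated in this section, and the series below is synthetic, built with a January peak and a July trough purely for illustration.

```python
import math

def centered_moving_average(series, period=12):
    """2 x period centered moving average; None where the window does not fit."""
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]
        # The two end points get half weight so the window spans exactly one period.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend

def seasonal_component(series, period=12):
    """Average detrended value for each calendar month, centered to sum to zero."""
    trend = centered_moving_average(series, period)
    buckets = [[] for _ in range(period)]
    for t, (value, level) in enumerate(zip(series, trend)):
        if level is not None:
            buckets[t % period].append(value - level)
    means = [sum(b) / len(b) for b in buckets]
    overall = sum(means) / period
    return [m - overall for m in means]

# Synthetic monthly series (index 0 = January): slow decline plus a winter peak.
series = [50.0 - 0.1 * t + 5.0 * math.cos(2 * math.pi * (t % 12) / 12) for t in range(96)]
seasonal = seasonal_component(series)
peak_month = seasonal.index(max(seasonal))    # position of the seasonal peak
trough_month = seasonal.index(min(seasonal))  # position of the seasonal trough
```

Reading `peak_month` and `trough_month` off the seasonal component is the same kind of inspection that leads to the "winter peaks, summer troughs" description in the text.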
Fig. 8.
Seasonal and lag analyses of the TB incidence and mortality rate. (a–d) Seasonal decomposition and lag analysis for TB incidence rates in Sichuan showed prominent seasonal patterns with winter peaks and summer troughs and identified a 10-month lag period as optimal, supported by the lowest AIC values. (e–h) Similarly, for TB mortality rates in Guizhou, a 10-month lag was determined to be the best fit, with distinct seasonal variations evident. These findings emphasized the importance of incorporating seasonal and lag effects in predictive models to increase the accuracy of predicting TB incidence and mortality.
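Choosing a lag by the lowest AIC can be sketched as below. How the authors computed AIC for each candidate lag is not specified in this section; the sketch assumes a simple linear regression of the outcome on a lagged exposure, with every lag fitted on the same months so that the AIC values are comparable. The data are synthetic, and the true lag of 10 months is built in by construction.

```python
import math
import random

def ols_rss(x, y):
    """Residual sum of squares of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))

def aic_by_lag(exposure, outcome, max_lag=12):
    """Gaussian AIC (up to a constant) for each lag, on a common estimation sample."""
    n = len(outcome) - max_lag
    k = 3  # intercept, slope, error variance
    y = outcome[max_lag:]
    result = {}
    for lag in range(max_lag + 1):
        x = [exposure[t - lag] for t in range(max_lag, len(outcome))]
        result[lag] = n * math.log(ols_rss(x, y) / n) + 2 * k
    return result

rng = random.Random(0)
exposure = [rng.gauss(50.0, 10.0) for _ in range(96)]  # synthetic pollutant series
outcome = []
for t in range(96):
    lagged = exposure[t - 10] if t >= 10 else 50.0      # true lag is 10 by construction
    outcome.append(5.0 + 0.2 * lagged + rng.gauss(0.0, 0.1))

aic = aic_by_lag(exposure, outcome)
best_lag = min(aic, key=aic.get)
```

Fitting all lags on the same months matters: comparing AIC values computed on samples of different length would favor whichever lag happened to use fewer observations.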
SARIMA models were constructed to predict TB incidence in Sichuan and TB mortality in Guizhou using the optimal 10-month lag period identified in previous analyses. The SARIMA model predictions for TB incidence in Sichuan are shown in Fig. 9a,b, whereas the predictions for TB mortality in Guizhou are shown in Fig. 9c,d. The models were trained on data from November 2013 to December 2019 (74 months in total, after removing the first 10 months because of the 10-month optimal lag) and validated against actual 2020 data (12 months). For TB incidence in Sichuan Province, the mean squared error (MSE) was 0.445, and the mean absolute error (MAE) was 0.542, indicating robust prediction accuracy. For TB mortality in Guizhou, the MSE was 0.000 at three-decimal precision (that is, below 0.0005), and the MAE was 0.006, reflecting high precision in predicting mortality outcomes. The SARIMA models demonstrated strong predictive accuracy, with the predicted values closely aligning with the actual observed data for 2020. The models effectively captured the seasonal trends and temporal dynamics in TB rates, emphasizing the importance of including seasonal patterns and lag effects in predictive modeling. These results suggest that, when properly tuned, SARIMA models can significantly increase the accuracy of predictions related to TB incidence and mortality, providing valuable information for public health planning and targeted intervention strategies to mitigate TB transmission and mortality in regions with high pollution levels.
Fig. 9.
SARIMA model predictions for TB incidence and mortality rates. (a) Predicted versus observed TB incidence rates in Sichuan, trained on November 2013 to December 2019 data (74 months, after a 10-month lag) and validated with January to December 2020 data. (b) Close-up of predicted versus observed TB incidence rates in Sichuan for 2020, showing seasonal alignment. (c) Predicted versus observed TB mortality rates in Guizhou, trained on November 2013 to December 2019 data (74 months, after a 10-month lag) and validated with January to December 2020 data. (d) Close-up of predicted versus observed TB mortality rates in Guizhou for 2020, demonstrating temporal accuracy.
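The month counts and error metrics reported for the SARIMA validation can be checked with a short sketch. The date ranges are taken from the text; the observed and predicted values are made up, chosen only to show how a mortality-scale MSE can print as 0.000 while the MAE prints as 0.006.

```python
def month_index(year, month):
    """Number of months since year 0, used only for differences."""
    return year * 12 + (month - 1)

def months_between(start, end):
    """Inclusive count of months from start=(year, month) to end=(year, month)."""
    return month_index(*end) - month_index(*start) + 1

def mse(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def mae(observed, predicted):
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

total_months = months_between((2013, 1), (2020, 12))   # full study period
train_months = months_between((2013, 11), (2019, 12))  # after dropping 10 lag months
test_months = months_between((2020, 1), (2020, 12))    # hold-out year

# Made-up mortality-scale values in which every forecast misses by 0.006.
observed = [0.050, 0.048, 0.052, 0.047]
predicted = [o + 0.006 for o in observed]
print(f"MSE={mse(observed, predicted):.3f}  MAE={mae(observed, predicted):.3f}")
```

The counts confirm 96 months in total, of which 10 are consumed by the lag, 74 are used for training, and 12 for validation. Because squaring an error of about 0.006 gives roughly 0.00004, an MSE reported to three decimals carries almost no information on this scale, so the MAE is the more informative figure for mortality.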
Subgroup analysis
Subgroup analysis was conducted to evaluate the performance of RF models across different TB subtypes: pathogen-positive, pathogen-negative, and without-pathogen examination cases. The performance metrics for each subgroup model are shown in Table 2. The analysis revealed that RF models for TB incidence generally exhibited high predictive performance across all subgroups, and the without pathogen examination subgroup and overall TB incidence models presented the highest R2 values and the lowest RMSE, MSE, and MAE values. However, regarding TB mortality, the RF model for the without pathogen examination subgroup had the lowest performance, while the other subgroups demonstrated a predictive accuracy close to that of the overall TB mortality model. These lower validation performances, particularly for the without-pathogen examination mortality subgroup, stem from data sparsity, diagnostic heterogeneity, smaller sample sizes, and complex regional interactions between ambient air pollution and TB outcomes. Biologically, these subgroup differences reflect varying TB progression stages: pathogen-positive cases are likely to indicate more severe disease with stronger links to pollution-induced inflammation and immune suppression, whereas nonpathogen examination cases may suggest underdiagnosis, milder forms, or diagnostic challenges, complicating predictions. These findings indicate that the predictive performance of air pollution indicators for TB rates varies by TB subtype, highlighting the importance of considering these subgroups in TB epidemiological studies to make the model more accurate and guide effective public health interventions.
Table 2.
Performance of the random forest model across TB subgroups (pathogen-positive, pathogen-negative, without pathogen examination).
| Model | Dataset | R2 | RMSE | MSE | MAE |
|---|---|---|---|---|---|
| Incidence rate | | | | | |
| Without pathogen examination | Train | 0.888561896 | 0.228264507 | 0.052104685 | 0.122259526 |
| | Internal validation | 0.787174087 | 0.312086935 | 0.097398255 | 0.1771542 |
| | External validation | 0.790853426 | 0.298714028 | 0.089230071 | 0.181872633 |
| Pathogen positive | Train | 0.824838694 | 0.37124491 | 0.137822783 | 0.219839547 |
| | Internal validation | 0.613526626 | 0.606905297 | 0.368334039 | 0.372829038 |
| | External validation | 0.656332631 | 0.507897759 | 0.257960134 | 0.34913909 |
| Pathogen negative | Train | 0.838528499 | 0.928190109 | 0.861536879 | 0.441203721 |
| | Internal validation | 0.692713051 | 1.386016445 | 1.921041587 | 0.681837253 |
| | External validation | 0.704534377 | 1.262454915 | 1.593792413 | 0.689755704 |
| Pulmonary tuberculosis | Train | 0.895021518 | 1.123920426 | 1.263197125 | 0.575073293 |
| | Internal validation | 0.768303461 | 1.67456525 | 2.804168778 | 0.870912834 |
| | External validation | 0.756348713 | 1.526038162 | 2.328792473 | 0.843717805 |
| Mortality rate | | | | | |
| Without pathogen examination | Train | 0.599864558 | 0.002230868 | 4.97677E-06 | 0.000957703 |
| | Internal validation | 0.166132147 | 0.003318672 | 1.10136E-05 | 0.001564559 |
| | External validation | 0.193802611 | 0.003309516 | 1.09529E-05 | 0.001514524 |
| Pathogen positive | Train | 0.71411655 | 0.006557894 | 4.3006E-05 | 0.003745473 |
| | Internal validation | 0.461905209 | 0.009119009 | 8.31563E-05 | 0.005655071 |
| | External validation | 0.419598235 | 0.009177366 | 8.4224E-05 | 0.005861027 |
| Pathogen negative | Train | 0.732879685 | 0.007992073 | 6.38732E-05 | 0.003359437 |
| | Internal validation | 0.482227239 | 0.010937959 | 0.000119639 | 0.005242532 |
| | External validation | 0.451522878 | 0.011646409 | 0.000135639 | 0.005216794 |
| Pulmonary tuberculosis | Train | 0.752456437 | 0.01261217 | 0.000159067 | 0.005875755 |
| | Internal validation | 0.589983809 | 0.017790831 | 0.000316514 | 0.008874491 |
| | External validation | 0.513010489 | 0.019049454 | 0.000362882 | 0.009285554 |
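The subgroup comparison can be summarized by the gap between training and external validation R2, together with a quick internal-consistency check that each RMSE is the square root of its MSE. The sketch below uses three rows copied from Table 2; the dictionary layout and the function names are assumptions introduced for illustration.

```python
import math

# (outcome, subgroup) -> random forest metrics copied from Table 2.
table2 = {
    ("incidence", "Pulmonary tuberculosis"): {
        "train_r2": 0.895021518, "external_r2": 0.756348713,
        "train_rmse": 1.123920426, "train_mse": 1.263197125,
    },
    ("mortality", "Without pathogen examination"): {
        "train_r2": 0.599864558, "external_r2": 0.193802611,
        "train_rmse": 0.002230868, "train_mse": 4.97677e-06,
    },
    ("mortality", "Pulmonary tuberculosis"): {
        "train_r2": 0.752456437, "external_r2": 0.513010489,
        "train_rmse": 0.01261217, "train_mse": 0.000159067,
    },
}

def generalization_gap(row):
    """Absolute loss in R2 from training to external validation."""
    return row["train_r2"] - row["external_r2"]

def rmse_matches_mse(row, rel_tol=1e-4):
    """RMSE squared should reproduce MSE up to rounding in the table."""
    return math.isclose(row["train_rmse"] ** 2, row["train_mse"], rel_tol=rel_tol)

gaps = {key: generalization_gap(row) for key, row in table2.items()}
widest = max(gaps, key=gaps.get)
consistent = all(rmse_matches_mse(row) for row in table2.values())
```

Among these rows, the widest gap belongs to the without-pathogen-examination mortality model, which loses about 0.41 in R2 between training and external validation. This is the weakest subgroup identified in the text.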
Nested model analysis
To evaluate the effects of individual air pollution indicators on the predictive performance of TB incidence and mortality rates, nested models were developed by adding these indicators one at a time. The performance metrics for each model configuration are shown in Table 3. Predictive accuracy improved as indicators were added, although not monotonically: R2 dipped slightly at Model5 for both incidence (0.863 to 0.857) and mortality (0.628 to 0.611) before recovering at Model6. The model that included all indicators (PM10, PM2.5, SO2, NO2, CO, O3_8H, and the composite AQI) achieved the highest R2 for both outcomes and the lowest RMSE and MSE for incidence. The incidence MAE was marginally lower for Model6 (0.729) than for Model7 (0.735), and the mortality RMSE and MAE were unchanged at three decimals from Model4 onward.
Table 3.
Predictive impact of incrementally added air pollution indicators on TB incidence and mortality.
| Model | PM10 | PM2.5 | SO2 | NO2 | CO | O3_8H | Composite AQI | R2 | RMSE | MSE | MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Incidence rate | | | | | | | | | | | |
| Model1 | × | | | | | | | 0.787 | 1.585 | 2.513 | 0.950 |
| Model2 | × | × | | | | | | 0.815 | 1.478 | 2.183 | 0.893 |
| Model3 | × | × | × | | | | | 0.831 | 1.415 | 2.002 | 0.826 |
| Model4 | × | × | × | × | | | | 0.863 | 1.274 | 1.624 | 0.767 |
| Model5 | × | × | × | × | × | | | 0.857 | 1.302 | 1.694 | 0.777 |
| Model6 | × | × | × | × | × | × | | 0.868 | 1.249 | 1.559 | 0.729 |
| Model7 | × | × | × | × | × | × | × | 0.872 | 1.232 | 1.519 | 0.735 |
| Mortality rate | | | | | | | | | | | |
| Model1 | × | | | | | | | 0.476 | 0.018 | 0.000 | 0.011 |
| Model2 | × | × | | | | | | 0.517 | 0.017 | 0.000 | 0.010 |
| Model3 | × | × | × | | | | | 0.574 | 0.016 | 0.000 | 0.009 |
| Model4 | × | × | × | × | | | | 0.628 | 0.015 | 0.000 | 0.009 |
| Model5 | × | × | × | × | × | | | 0.611 | 0.015 | 0.000 | 0.009 |
| Model6 | × | × | × | × | × | × | | 0.640 | 0.015 | 0.000 | 0.009 |
| Model7 | × | × | × | × | × | × | × | 0.642 | 0.015 | 0.000 | 0.009 |
‘×’ indicates that the model incorporates this indicator.
This improvement suggested that the comprehensive approach that considered multiple environmental factors significantly improved the ability of the model to predict TB rates accurately. These findings underscore the importance of including a broad range of air pollution indicators in TB epidemiological models to capture the complex interactions between environmental factors and TB outcomes. This comprehensive modeling approach can facilitate more precise predictions, which are crucial for informing public health strategies and interventions aimed at controlling TB in regions with varying air pollution levels.
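The contribution of each added indicator can be read from Table 3 as the change in R2 between consecutive nested models. The sketch below uses the R2 values from the table; the function name `stepwise_gain` is an assumption introduced for illustration.

```python
# R2 for Model1 to Model7, copied from Table 3.
incidence_r2 = [0.787, 0.815, 0.831, 0.863, 0.857, 0.868, 0.872]
mortality_r2 = [0.476, 0.517, 0.574, 0.628, 0.611, 0.640, 0.642]

def stepwise_gain(scores):
    """Change in R2 when moving from each nested model to the next."""
    return [round(b - a, 3) for a, b in zip(scores, scores[1:])]

incidence_gain = stepwise_gain(incidence_r2)
mortality_gain = stepwise_gain(mortality_r2)

# Positions (0 = Model1 to Model2) where adding an indicator lowered R2.
incidence_drops = [i for i, g in enumerate(incidence_gain) if g < 0]
mortality_drops = [i for i, g in enumerate(mortality_gain) if g < 0]
```

Both lists contain a single position, the step from Model4 to Model5, so the overall trend is upward with one small reversal. The final step from Model6 to Model7 adds only 0.004 (incidence) and 0.002 (mortality) to R2.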
Discussion
In this study, advanced ML models and time series analysis were used to establish a predictive model of TB incidence and mortality, and its performance was evaluated comprehensively. We aimed to construct accurate predictive models of TB incidence and mortality and predict trends in the prevalence of TB. This study provides important insights into the relationships between air pollution and TB incidence and mortality in China. Our findings indicate that air pollution is a key environmental factor affecting TB outcomes, likely by impairing respiratory defenses and immune responses; for example, PM2.5 induces lung inflammation and suppresses macrophage control of Mycobacterium tuberculosis, increasing susceptibility, as shown in urban exposure studies3. Liu et al.2 similarly reported that SO2 was linked to increased TB risk in Hubei, suggesting that pollution amplifies susceptibility in heavily exposed regions such as Sichuan and Guizhou. The results obtained have implications for public health strategies in regions with different levels of pollution. Correlation analysis revealed that higher levels of pollutants such as PM10, PM2.5, SO2, and CO were associated with increased TB morbidity and mortality26,27. The regional differences in the strength of associations observed suggested that morbidity and mortality may also differ in their associations with air quality across geographic locations and that geographic and environmental differences play crucial roles in how air pollution affects TB incidence, indicating the need for tailored public health interventions28,29.
The RF model was identified as the most effective at predicting TB incidence and mortality, owing to its stability, generalizability, and ability to capture complex nonlinear interactions. For TB incidence, RF was stable across datasets, with a training R2 of 0.888 and an external validation R2 of 0.824 (6.33% drop). It thereby outperformed k-NN, which showed overfitting (training R2 of 0.905, external validation R2 of 0.821, 8.34% drop), and SVM and LM, which showed instability (training-to-external R2 increases: SVM +1.28%, LM +1.43%), confirming its generalizability in nonlinear, high-dimensional data. Compared with k-NN, RF’s external validation R2 (0.824) slightly exceeded that of k-NN (0.821, 0.35% improvement), with a lower RMSE (1.389 vs. 1.401), indicating greater precision, although its higher MAE (0.855 vs. 0.775) reflects sensitivity to outliers, balanced by ensemble robustness. This stable and balanced error distribution supports reliable public health predictions across diverse datasets, justifying the selection of RF for capturing complex air pollution-TB interactions. For TB mortality, RF’s external validation R2 (0.599) exceeded those of k-NN (0.527), SVM (0.571), glmnet (0.528), and LM (0.577), with the lowest RMSE (0.017) and a tight standard deviation, reflecting precision and consistency. RF’s advantage is driven by its resilience to multicollinearity among air pollutants and its robustness to noisy, sparse TB data, such as mortality records. By randomly selecting feature subsets, RF reduces bias from correlated variables, capturing complex, nonlinear interactions and ensuring stable predictions across diverse provinces. Liu et al.2 employed a Bayesian spatiotemporal model to link air pollutants with TB incidence, an approach that excels at probabilistic inference. 
Our RF model exhibits superior predictive ability (R2 = 0.824, RMSE = 1.389 for incidence) and SHAP-driven interpretability, surpassing this approach by capturing complex interactions between air pollution and tuberculosis. Future studies can further optimize model performance through hyperparameter tuning techniques, such as grid search or Bayesian optimization, to improve prediction accuracy and adaptability to regional variations.
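The training-to-external drops quoted above can be expressed either as an absolute difference in R2 (percentage points) or as a change relative to the training R2. The sketch below computes both from the rounded values in the text. The reported 6.33% and 8.34% are close to the absolute differences, which suggests, as an assumption, that they were computed in percentage points from unrounded R2 values.

```python
def r2_drop(train_r2, external_r2):
    """Return (absolute drop in percentage points, drop relative to training R2 in %)."""
    absolute_points = (train_r2 - external_r2) * 100.0
    relative_percent = (train_r2 - external_r2) / train_r2 * 100.0
    return absolute_points, relative_percent

rf_abs, rf_rel = r2_drop(0.888, 0.824)     # random forest, incidence
knn_abs, knn_rel = r2_drop(0.905, 0.821)   # k-NN, incidence

print(f"RF:   {rf_abs:.1f} points absolute, {rf_rel:.1f}% relative")
print(f"k-NN: {knn_abs:.1f} points absolute, {knn_rel:.1f}% relative")
```

Under either definition the RF model loses less between training and external validation than k-NN, which is the comparison the stability argument rests on.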
The SHAP analysis underscores the complex, region-specific interactions between air pollution and TB outcomes, with significant implications for public health. The dominant influence of the province feature highlights the critical role of geographic variability, particularly in high-pollution areas such as Sichuan and Guizhou, where environmental exposures shape TB incidence and mortality differently. Pollutants such as NO2, SO2, O3, and PM2.5 exert distinct effects, likely through mechanisms such as inflammation, oxidative stress, and immune suppression, which vary across TB progression stages and regional contexts. For example, the contrasting impacts of pollutants on incidence versus mortality suggest differential biological responses: pollutants that increase incidence may impair early defenses, whereas those that affect mortality may exacerbate late-stage severity. These findings emphasize the need for localized strategies that address specific pollutant exposures, such as reducing SO2 and O3 to mitigate incidence risks. The analysis also reveals the importance of considering regional pollution profiles, as provinces such as Sichuan and Guizhou face unique challenges due to elevated SO2 and PM2.5 levels, necessitating targeted interventions to reduce the TB burden. Overall, these insights enhance our understanding of the environmental drivers of TB, guiding public health policies to prioritize emission controls and regional adaptation. When SHAP values were used to interpret the RF model, PM2.5, NO2, O3, and SO2 were the most influential air pollution indicators, and large differences were found between regions30,31. The dominance of the province feature in the SHAP importance plots suggests that geographic fixed effects, driven by regional variability in pollution and TB burdens, may mask pollutant-specific effects. 
While we controlled for this by using label encoding, future studies could incorporate interaction terms between province and pollutants to better isolate their individual contributions. Localized models may be needed to account for differences in predicted performance across provinces to reflect different environmental exposures and TB outcomes across China32–34.
Seasonality and lag analyses were used to examine the temporal dynamics of TB incidence and mortality, and the results showed that both fluctuate seasonally35. We also found that a lag period of 10 months yielded the best forecasting accuracy36. TB progression often involves a 2–12 month incubation period, with delayed effects exacerbated by seasonal air pollution peaks that heighten transmission risks through indoor crowding and respiratory vulnerability, as observed in Sichuan and Guizhou. SARIMA models incorporating these temporal effects showed strong predictive performance, highlighting the importance of considering the seasonal and delayed effects of air pollution in TB prediction models37,38. Compared with the ARIMA model of Li et al.39, which predicts TB incidence across three cities, our SARIMA model enhances precision through explicit seasonal adjustments, outperforming ARIMA’s simpler framework. Subgroup analyses highlighted differences in the predictive performance of air pollution indicators across TB subtypes40 and between outcomes, with incidence models generally performing better than mortality models. This finding highlights the importance of considering TB subtypes in epidemiological models to improve accuracy41. Nested model analyses indicated that including a comprehensive set of air pollution indicators can improve overall predictive performance, highlighting the need for a multifactor approach in TB epidemiology42.
The regional differences in TB incidence and mortality reported in this study indicate the influence of air pollution variations across provinces, which can play an important role in guiding subsequent research. The excellent predictive performance of the ML and SARIMA models in predicting morbidity and mortality suggested that the reasonable use of ML can improve the accuracy of predicting the prevalence of TB to effectively formulate interventions and responses, which play an important role in public health43.
While our study provides robust insights into the air pollution-TB nexus, certain limitations warrant consideration. Notably, owing to data availability constraints, we could not relate climatic and demographic characteristics, hereditary traits, vaccination status, or socioeconomic indicators to TB outcomes. Our datasets, sourced from national public health and environmental monitoring systems, aggregate TB incidence and mortality at the provincial level without individual-level details, which may affect the observed associations between air pollution levels and TB rates. Provincial averages mask subgroup variability across factors such as age, sex, and comorbidities, so we could not assess differential effects on vulnerable groups such as older adults or those with comorbidities. The absence of meteorological variables, including temperature and humidity, and of data on seasonal behaviors likely introduces confounding, given their role in TB transmission and pollution dynamics. Additionally, unmeasured variables such as indoor air pollution, socioeconomic status, smoking rates, and healthcare access may further confound the air pollution-TB relationship. Future research will aim to address these limitations by accessing detailed TB registries to incorporate these covariates, potentially using stratified ML models to refine risk predictions across diverse populations.
Conclusion
In this study, we found that air pollution significantly affected the incidence and mortality of TB in China, demonstrating that pollutants such as PM10, PM2.5, SO2, and CO are closely associated with increased TB rates. Using advanced ML models, particularly the RF and SARIMA models, we developed robust predictive tools that enhance our understanding of TB epidemiology in the context of environmental factors. Our findings confirm and extend previous research, such as that of Liu et al.2, by demonstrating broader pollutant impacts across provinces. We achieved external validation R2 values of 0.824 for incidence and 0.599 for mortality, with SARIMA predictions validated by an MSE of 0.445 and an MAE of 0.542 for Sichuan incidence and an MSE of 0.000 and an MAE of 0.006 for Guizhou mortality. Compared with Li et al.39, whose ARIMA model predicts TB incidence with a simpler framework, our SARIMA model improves precision through explicit seasonal adjustments, outperforming ARIMA’s approach and capturing regional variability. These findings emphasize the need to develop localized public health strategies that consider regional differences in air pollution and TB outcomes. Improving air quality is a critical intervention that can substantially decrease the incidence and mortality of TB. Consistent with Liu et al.2, who linked SO2 increases to higher TB risk in Hubei (RR = 1.046), we recommend stricter SO2 and PM2.5 emission controls in high-risk provinces such as Sichuan and Guizhou. Our predictive models can prioritize intervention areas, guiding targeted policies to reduce the TB burden. Researchers should continue investigating the complex interactions between air pollution and TB, incorporating additional environmental and socioeconomic factors to further refine predictive models and provide more accurate predictions for the development of effective public health interventions.
Acknowledgements
We are very grateful to Prof. Chong Liu and Prof. Xinli Zhan (Spine and Osteopathy Ward, The First Affiliated Hospital of Guangxi Medical University) for support in all stages of this study.
Author contributions
Boli Qin, Rongqing He and Xiaopeng Qin participated in the conceptualization and methodology design of the study. Jiayan Jiang, Chengxing Zhou, Jichong Zhu, Jiarui Chen and Shaofeng Wu were in charge of data curation and investigation. Songze Wu, Jiang Xue, Kechang He and Chong Liu analyzed and visualized the data. Jie Ma and Xinli Zhan reviewed and edited the manuscript. All authors contributed to the article and approved the submitted version.
Funding
The present research was supported by ① The National Natural Science Foundation of China (82360422); ② Joint Project on Regional High-Incidence Diseases Research of Guangxi Natural Science Foundation (2023JJA140227); ③ Guangxi Young and Middle-aged Teacher’s Basic Ability Promoting Project (2023KY0115); ④ The “Medical Excellence Award” Funded by the Creative Research Development Grant from the First Affiliated Hospital of Guangxi Medical University; ⑤ Clinical Research Climbing Plan Project of the First Affiliated Hospital of Guangxi Medical University in 2023; ⑥ Bethune Charity Foundation’s “Constant Learning and Improvement-Medical Research” project; ⑦ Guangxi Degree and Postgraduate Education Reform Project 2024 (JGY2024094); ⑧ Guangxi Zhuang Autonomous Region “Four New” Research and Practice Project (SX202320).
Data availability
The original contributions presented in the study are included in the article and its Supplementary Material; further inquiries can be directed to the corresponding authors.
Declarations
Competing interests
The authors declare no competing interests.
Consent for publication
Not applicable.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Boli Qin, Rongqing He, and Xiaopeng Qin contributed equally to this article.
Contributor Information
Jie Ma, Email: mj-friend@163.com.
Xinli Zhan, Email: zhanxinli@stu.gxmu.edu.cn.
References
- 1.Mao, J. J. et al. Population impact of fine particulate matter on tuberculosis risk in China: A causal inference. BMC Public Health23(1), 2285 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Liu, F., Zhang, Z., Chen, H. & Nie, S. Associations of ambient air pollutants with regional pulmonary tuberculosis incidence in the central Chinese province of Hubei: A Bayesian spatial-temporal analysis. Environ. Health Global Access Sci. Source19(1), 51 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Torres, M. et al. Urban airborne particle exposure impairs human lung and blood Mycobacteriumtuberculosis immunity. Thorax74(7), 675–683 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Chen, Y. P. et al. Modeling and predicting pulmonary tuberculosis incidence and its association with air pollution and meteorological factors using an ARIMAX model: An ecological study in Ningbo of China. Int. J. Environ. Res. Public Health19(9), 5385 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Tang, N. et al. Machine learning prediction model of tuberculosis incidence based on meteorological factors and air pollutants. Int. J. Environ. Res. Public Health20(5), 3910 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Wang, X. Q. et al. Associations of exposures to air pollution and greenness with mortality in a newly treated tuberculosis cohort. Environ. Sci. Pollut. Res. Int.30(12), 34229–34242 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Zhu, Q. & Liu, J. A united model for diagnosing pulmonary tuberculosis with random forest and artificial neural network. Front. Genet.14, 1094099 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Liang, W., Hu, A., Hu, P., Zhu, J. & Wang, Y. Estimating the tuberculosis incidence using a SARIMAX-NNARX hybrid model by integrating meteorological factors in Qinghai Province, China. Int. J. Biometeorol.67(1), 55–65 (2023). [DOI] [PubMed] [Google Scholar]
- 9.Houdou, A., El Badisy, I., Khomsi, K. & Andrade, S. Interpretable machine learning approaches for forecasting and predicting air pollution: A systematic review. Mach. Learn. (2022).
- 10.Cabello-Solorzano, K., Ortigosa de Araujo, I., Peña, M., Correia, L. & Tallón-Ballesteros, J. A. The impact of data normalization on the accuracy of machine learning algorithms: a comparative analysis. In: International Conference on Soft Computing Models in Industrial and Environmental Applications, 344–353 (Springer, 2023).
- 11.Zhu, D., Cai, C., Yang, T. & Zhou, X. A machine learning approach for air quality prediction: Model regularization and optimization. Big Data Cogn. Comput.2(1), 5 (2018). [Google Scholar]
- 12.Razali, N. M. & Wah, Y. B. Power comparisons of shapiro-wilk, kolmogorov-smirnov, lilliefors and anderson-darling tests. J. Stat. Model. Anal.2(1), 21–33 (2011). [Google Scholar]
- 13.Pearson’s, C., Comparison of values of pearson’s and spearman’s correlation coefficients. Comparison of Values of Pearson’s and Spearman’s Correlation Coefficients (2011).
- 14.Zhu, S. et al. Ambient air pollutants are associated with newly diagnosed tuberculosis: a time-series study in Chengdu, China. Sci. Total Environ.631, 47–55 (2018). [DOI] [PubMed] [Google Scholar]
- 15.Poslavskaya, E. & Korolev, A. Encoding categorical data: Is there yet anything’hotter’than one-hot encoding? arXiv preprint arXiv:2312.16930 (2023).
- 16.Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell.2(1), 56–67 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Antwarg, L., Miller, R. M., Shapira, B. & Rokach, L. Explaining anomalies detected by autoencoders using shapley additive explanations. Expert Syst. Appl.186, 115736 (2021). [Google Scholar]
- 18.Clevelandc, R. B. STL: A seasonal-trend decomposition procedure based on loess. J. Off. Stat.6, 3–73 (1990). [Google Scholar]
- 19.Theodosiou, M. Forecasting monthly and quarterly time series using STL decomposition. Int. J. Forecast.27(4), 1178–1195 (2011). [Google Scholar]
- 20.Nokeri, T. C. Forecasting using arima, sarima, and the additive model. In Implementing Machine Learning for Finance: A Systematic Approach to Predictive Risk and Performance Analysis for Investment Portfolios 21–50 (Springer, 2021). [Google Scholar]
- 21.Bauer, D. Information-criterion-based lag length selection in vector autoregressive approximations for I (2) processes. Econometrics11(2), 11 (2023). [Google Scholar]
- 22. Perez-Guerra, U. H. et al. Seasonal autoregressive integrated moving average (SARIMA) time-series model for milk production forecasting in pasture-based dairy cows in the Andean highlands. PLoS ONE 18(11), e0288849 (2023).
- 23. Kirwa, K. et al. Fine-scale air pollution models for epidemiologic research: Insights from approaches developed in the multi-ethnic study of atherosclerosis and air pollution (MESA air). Curr. Environ. Health Rep. 8(2), 113–126 (2021).
- 24. Lang, M. et al. mlr3: A modern object-oriented machine learning framework in R. J. Open Source Softw. 4(44), 1903 (2019).
- 25. Risso, A., Cosulich, M. E., Rubartelli, A., Mazza, M. R. & Bargellesi, A. MLR3 molecule is an activation antigen shared by human B, T lymphocytes and T cell precursors. Eur. J. Immunol. 19(2), 323–328 (1989).
- 26. Zhu, S. et al. Ambient air pollutants are associated with newly diagnosed tuberculosis: A time-series study in Chengdu, China. Sci. Total Environ. 631–632, 47–55 (2018).
- 27. Liu, Y. et al. Effect of ambient air pollution on tuberculosis risks and mortality in Shandong, China: A multi-city modeling study of the short- and long-term effects of pollutants. Environ. Sci. Pollut. Res. Int. 28(22), 27757–27768 (2021).
- 28. Zhao, C. N. et al. Associations between air pollutants and acute exacerbation of drug-resistant tuberculosis: Evidence from a prospective cohort study. BMC Infect. Dis. 24(1), 121 (2024).
- 29. Li, H., Ge, M. & Zhang, M. Spatio-temporal distribution of tuberculosis and the effects of environmental factors in China. BMC Infect. Dis. 22(1), 565 (2022).
- 30. Cerezuela-Escudero, E., Montes-Sanchez, J. M., Dominguez-Morales, J. P., Duran-Lopez, L. & Jimenez-Moreno, G. A systematic comparison of different machine learning models for the spatial estimation of air pollution. Appl. Intell. 53(24), 29604–29619 (2023).
- 31. Wu, Y., Lin, S., Shi, K., Ye, Z. & Fang, Y. Seasonal prediction of daily PM2.5 concentrations with interpretable machine learning: A case study of Beijing, China. Environ. Sci. Pollut. Res. Int. 29(30), 45821–45836 (2022).
- 32. Nasejje, J. B., Whata, A. & Chimedza, C. Statistical approaches to identifying significant differences in predictive performance between machine learning and classical statistical models for survival data. PLoS ONE 17(12), e0279435 (2022).
- 33. Montesinos López, O. A., Montesinos López, A. & Crossa, J. Overfitting, Model Tuning, and Evaluation of Prediction Performance 109–139 (Springer, 2022).
- 34. Syfert, M. M., Smith, M. J. & Coomes, D. A. The effects of sampling bias and model complexity on the predictive performance of MaxEnt species distribution models. PLoS ONE 8(2), e55158 (2013).
- 35. Fares, A. Seasonality of tuberculosis. J. Global Infect. Dis. 3(1), 46–55 (2011).
- 36. Tedijanto, C., Hermans, S., Cobelens, F., Wood, R. & Andrews, J. R. Drivers of seasonal variation in tuberculosis incidence: Insights from a systematic review and mathematical model. Epidemiology 29(6), 857–866 (2018).
- 37. Samal, K. K. R., Babu, K. S., Das, S. K. & Acharaya, A. Time series based air pollution forecasting using SARIMA and prophet model. In Proceedings of the 2019 International Conference on Information Technology and Computer Communications, 80–85 (2019).
- 38. Bhatti, U. A. et al. Time series analysis and forecasting of air pollution particulate matter (PM2.5): An SARIMA and factor analysis approach. IEEE Access 9, 41019–41031 (2021).
- 39. Li, Z. Q., Pan, H. Q., Liu, Q., Song, H. & Wang, J. M. Comparing the performance of time series models with or without meteorological factors in predicting incident pulmonary tuberculosis in eastern China. Infect. Dis. Poverty 9(1), 151 (2020).
- 40. Sumpter, C. & Chandramohan, D. Systematic review and meta-analysis of the associations between indoor air pollution and tuberculosis. Trop. Med. Int. Health 18(1), 101–108 (2013).
- 41. Akhmetova, A. et al. Genomic epidemiology of Mycobacterium bovis infection in sympatric badger and cattle populations in Northern Ireland. Microb. Genom. 9(5), 001023 (2023).
- 42. Vega, V. et al. Risk factors for pulmonary tuberculosis recurrence, relapse and reinfection: A systematic review and meta-analysis. BMJ Open Respir. Res. 11(1), e002281 (2024).
- 43. Steyerberg, E. W. et al. Assessing the performance of prediction models: A framework for traditional and novel measures. Epidemiology 21(1), 128–138 (2010).
Associated Data
Data Availability Statement
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.









