Data-driven model analysis of the impact of environmental and socioeconomic factors on tuberculosis incidence

Yiwen Tao; Jiaxin Zhao; Hao Cui; Zhanlue Liang; Jian Li; Jingli Ren; Huaiping Zhu

doi:10.1016/j.idm.2026.02.002

2026 Feb 26;11(3):930–946. doi: 10.1016/j.idm.2026.02.002

PMCID: PMC12969114 PMID: 41810135

Abstract

Tuberculosis (TB), a global infectious disease, poses a formidable challenge to Taiwan, China, exacerbated by its aging population and the importation of pathogens from high-risk districts of Southeast Asia. In this study, we analyzed data from 19 cities and counties in Taiwan, China from 2014 to 2022, deploying four machine learning (ML) and four deep learning (DL) models to forecast monthly TB incidence from 12 drivers. The CatBoost, random forest, and gradient boosting models emerged as the top-performing models. By combining these models with post-hoc explainable ML techniques, we consistently identified population size, sulfur dioxide levels, physician count, normalized difference vegetation index, wind velocity, and precipitation level as the paramount influences on TB incidence. Additionally, we characterized the nonlinear interactions and threshold effects between these determinants and TB incidence. We further employed stepwise regression and statistical assessments to identify a model configuration that minimizes the number of required drivers while maintaining high predictive accuracy. The framework and findings of this study offer robust data support and a decision-making basis for TB mitigation initiatives on a global scale.

Keywords: Tuberculosis incidence, Machine learning, Explainable AI, Natural and socioeconomic drivers

1. Introduction

Tuberculosis (TB) is an infectious and contagious disease. Although it predominantly affects the lungs, resulting in the classic pulmonary tuberculosis syndrome, it can also involve other organs, including lymph nodes, brain, kidneys, and spine (Raviglione & Sulis, 2016). According to the World Health Organization (WHO) Global Tuberculosis Report 2023, an estimated 10.6 million new TB cases were reported globally in 2022, reflecting an incidence rate of 133 cases per 100,000 people. This represents the highest number of confirmed cases recorded since WHO began global monitoring of TB in 1995. Additionally, TB caused 1.3 million deaths in 2022, making it the second leading cause of death from a single infectious agent, surpassed only by novel coronavirus infections. The United Nations Sustainable Development Goals (Arora & Mishra, 2019) and the WHO's End TB Strategy (Organization, W.H., 2008) both aim to substantially reduce the global burden of TB by 2030. However, progress towards these objectives remains inadequate (Chakaya et al., 2022).

To effectively reduce the TB incidence and work toward the goal of eliminating this global health threat, one of the key strategies is to gain a comprehensive understanding of the various factors influencing TB transmission (Chakaya et al., 2021). Pioneering research indicates that various exogenous factors, such as meteorological conditions, air pollution, and socio-economic factors, may influence TB incidence by affecting the growth, reproduction, and transmission of Mycobacterium tuberculosis (Guo et al., 2017; Kharwadkar et al., 2022; Li et al., 2019; Murray et al., 2011). Meteorological factors, including temperature, humidity, wind speed, and precipitation, are associated with TB transmission by modifying vitamin D metabolism, ultraviolet radiation exposure, and Mycobacterium tuberculosis colonization (Talat et al., 2010; Wang et al., 2023a; Zhang et al., 2018). Air pollution may impact TB risk by impairing immune function and reducing the clearance capacity of mucosal cilia (Houtmeyers et al., 1999; Knorst et al., 1996). Socioeconomic factors, including healthcare standards, economic conditions, and population density, have been found to be closely linked to TB incidence (Tao et al., 2024). Therefore, a comprehensive understanding of how these complex external factors influence TB transmission, both individually and interactively, is crucial for deepening our understanding of TB transmission dynamics and supporting more effective prevention strategies.

Populations are often exposed to complex environments shaped by the interaction of socioeconomic and natural factors, characterized by numerous dynamic elements and highly interdependent data (Cui et al., 2024). Traditional methods, such as time series analysis and multiple regression models, when applied to study the relationship between TB transmission and these factors, may introduce biases or confounding effects due to simultaneous exposure to multiple variables (Nie et al., 2022; Yang et al., 2020; Zhu et al., 2018). Consequently, researchers have increasingly turned to machine learning (ML) and deep learning (DL) algorithms, which have demonstrated exceptional effectiveness in capturing the complex nonlinear relationships inherent in time series data. These algorithms not only provide accurate predictions of TB incidence but also capture the influence of various factors on TB transmission. Owing to their advanced algorithms and inherent strengths, the application of ML in TB predictive modeling continues to expand (Giri et al., 2019; Khan et al., 2019; Mohidem et al., 2021).

However, improving predictive accuracy often comes at the cost of increased model complexity. This complexity, combined with the large datasets required to train these sophisticated systems, can reduce model transparency and interpretability (Linardatos et al., 2020). To address these challenges, recent advances in explainable artificial intelligence (XAI) provide explanations for AI models through domain knowledge and post-hoc analysis, mitigating the "black-box" nature of complex models (Chan et al., 2022). Techniques such as Shapley additive explanations (SHAP) (Shapley, 1953) and individual conditional expectation (ICE) (Goldstein et al., 2015) plots enhance the interpretability of complex predictive models, facilitating a deeper understanding of their underlying mechanisms and improving their practical utility in TB predictive modeling.

TB remains a significant notifiable infectious disease in Taiwan, China, although its incidence is lower than the global average (Wu et al., 2023). A key challenge in controlling TB in Taiwan Province is its aging population. As individuals age, the immune system gradually weakens, reducing the body's natural defense against TB (Chen, 2023). Notably, older adults (65 years and above) account for 57% of TB cases and over 80% of TB-related deaths in Taiwan Province, highlighting their substantial contribution to the TB burden (Lin et al., 2021). Furthermore, Taiwan's foreign labor force, primarily from TB high-burden countries in Southeast Asia, such as Indonesia, the Philippines, and Thailand (Deng et al., 2020; Trinh et al., 2015), adds another layer of complexity to TB control efforts. The intersection of this demographic with Taiwan's rapidly aging population presents an increasingly serious public health challenge. By 2026, the proportion of older adults (65 years and above) is expected to reach 20% of the total population, officially marking Taiwan Province as a "super-aged society." This demographic shift suggests that the TB burden may increase further in the coming years.

To explore and address future challenges related to TB in Taiwan Province, targeted research is essential. In this paper, we apply a novel data-driven framework to (1) analyze the spatiotemporal dynamics of TB incidence and its drivers, (2) develop a predictive model for TB incidence, (3) determine the minimum combination of drivers for near-optimal predictive performance, and (4) identify the thresholds and interaction patterns between TB incidence and its drivers through XAI.

2. Material and methods

2.1. Study area

Taiwan, a provincial administrative region of the People's Republic of China, is situated on the continental shelf along China's southeastern coast (Fig. 1). Spanning both temperate and tropical zones, Taiwan Province experiences a subtropical climate in the north and a tropical climate in the south, with notable seasonal variations that influence disease patterns. The island is characterized by hot, humid summers and mild winters. This climatic diversity significantly affects the prevalence and spread of various infectious diseases, including vector-borne illnesses such as dengue fever (Wu et al., 2023) and respiratory diseases like tuberculosis (Liao et al., 2012).

Fig. 1 — A cartographic overview of cities, counties, and outlying islands of Taiwan, China.

2.2. Data source

This study uses monthly confirmed TB case data from January 2014 to December 2022 across 19 cities and counties in Taiwan Province, excluding Penghu, Kinmen, and Lienchiang counties (Taiwan Centers for Disease Control, 2024). The environmental data consist of four categories: meteorological data (10-m wind speed, 2-m temperature, total precipitation, and relative humidity), topographical data (the ratio of plain area to total land area), vegetation data (normalized difference vegetation index), and pollution data (monthly average concentrations of PM₂.₅, PM₁₀, SO₂, CO, NO₂, and O₃). The first three meteorological variables are sourced from ERA5's monthly averaged single-level data (Hersbach et al., 2023a), while relative humidity is obtained from ERA5's pressure-level data at 1000 hPa (Hersbach et al., 2023b). The topographical data are derived from (Directorate‑General of Budget, 2024), and the vegetation index from the Terra Vegetation Indices Monthly L3 Global 1 km SIN Grid V061 (Didan, 2015). Pollution data were obtained from (Ministry of Environment Environmental Information Open Platform, MOENV., 2024). Socioeconomic factors comprise population density, per capita annual disposable income, and the average number of practicing physicians per 10,000 people. These data are derived from (Directorate‑General of Budget, 2024). All abbreviations and units of the variables are listed in Table 1.

Table 1.

Abbreviation and unit of the variables.

Name	Unit	Abbreviation
Monthly confirmed TB cases	Case	TB
Monthly mean temperature	K	TEMP
Monthly mean relative humidity	%	RH
Monthly mean wind speed	m·s⁻¹	WS
Monthly mean precipitation	kg·m⁻²·s⁻¹	PRE
Normalized Difference Vegetation Index	Dimensionless	NDVI
Ratio of plain area	%	RPL
Monthly mean concentration of PM₂.₅	μg·m⁻³	PM₂.₅
Monthly mean concentration of PM₁₀	μg·m⁻³	PM₁₀
Monthly mean concentration of SO₂	μg·m⁻³	SO₂
Monthly mean concentration of NO₂	μg·m⁻³	NO₂
Monthly mean concentration of CO	μg·m⁻³	CO
Monthly mean concentration of O₃	μg·m⁻³	O₃
Per capita disposable income	TWD	PI
Population density	Inhabitants·km⁻²	POP
Number of doctors per 10,000 people	People	DOC

2.3. Data preprocessing

In this study, ArcGIS is utilized to process raster-format environmental data to determine monthly averages across Taiwan's counties and cities. The process begins with creating a grid based on administrative boundaries, followed by generating a feature class. Using the "Extract Values to Points" tool, we extracted raster data values corresponding to a point layer. The "Marker" tool then assigns these values to specific areas by calculating intersections with administrative data. Invalid data outside the province's boundaries, marked as −999, are filtered out. Subsequently, the "Summary Statistics" tool calculates the average values for each county and city. These averages are then aggregated to determine the overall environmental factor average within Taiwan Province. For annual data, cubic spline interpolation is used to estimate monthly values, ensuring a comprehensive and accurate analysis.

2.4. Augmented Dickey-Fuller test

To assess the stationarity of our time series, we employed the Augmented Dickey-Fuller (ADF) test (Harris, 1992), which detects the presence of a unit root in the series; a unit root indicates non-stationarity. Non-stationary series must be transformed into stationary forms prior to conducting linear regression analysis.

2.5. Feature selection

We refine feature selection by calculating the Spearman correlation coefficient and variance inflation factor (VIF) among features, thereby reducing model complexity and enhancing performance. Features with correlation coefficients exceeding 0.85 or a VIF greater than 10 are indicative of multicollinearity and are subsequently eliminated.
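The correlation half of this screening step can be sketched in a few lines. The following is a minimal illustration, not the authors' code: the function names and the toy monthly series are hypothetical, only the standard library is used, and the VIF check is omitted for brevity. It computes Spearman's coefficient as the Pearson correlation of average ranks and flags pairs above the 0.85 threshold stated above.

```python
from itertools import combinations

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))

def flag_correlated(features, threshold=0.85):
    """Return (name_a, name_b, rho) for pairs with |rho| above the threshold."""
    flagged = []
    for a, b in combinations(features, 2):
        rho = spearman(features[a], features[b])
        if abs(rho) > threshold:
            flagged.append((a, b, round(rho, 3)))
    return flagged

# Hypothetical toy series, for illustration only (not study data).
features = {
    "PM25": [12.0, 18.0, 25.0, 31.0, 22.0, 15.0],
    "PM10": [30.0, 41.0, 55.0, 70.0, 49.0, 33.0],
    "WS": [3.1, 2.0, 2.6, 1.9, 3.4, 2.2],
}
flagged = flag_correlated(features)
print(flagged)
```

In this toy input the two particulate series are perfectly rank-correlated, so only that pair is flagged; which member of a flagged pair to drop remains a modelling decision.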

To minimize predictive costs, we employ stepwise regression for feature selection, aiming to construct an economical predictive model. This approach incrementally introduces variables, conducting statistical tests to remove those that are non-significant or highly correlated with others, ensuring that the model comprises only significant variables with minimal multicollinearity.

2.6. Modelling

In this study, to forecast the incidence of TB in Taiwan Province, a suite of ML and DL models was employed. Within the realm of ML, we leveraged algorithms such as random forest (RF) (Breiman, 2001), gradient boosting (GB) (Friedman, 2001), eXtreme gradient boosting (XGBoost) (Chen, 2015), and categorical boosting (CatBoost) (Prokhorenkova et al., 2018). These ensemble learning methods effectively mitigate the risk of overfitting and enhance the models' generalization capabilities by constructing multiple decision trees and optimizing their structures. In the domain of DL, we implemented feedforward neural networks (FNN) (Hornik et al., 1989), backpropagation neural networks (BPNN) (Hecht-Nielsen, 1989), deep neural networks (DNN) (LeCun et al., 2015), and residual networks (ResNet) (He et al., 2016). These networks are designed to capture complex patterns through layered architectures, with ResNets specifically addressing deep learning challenges like vanishing gradients through the use of skip connections.

Hyperparameter optimization was conducted using Bayesian optimization implemented through the Optuna framework with the Tree-structured Parzen Estimator (TPE) sampler. For tree-based ensemble models, optimization targeted parameters governing model complexity and regularization. Specifically, for CatBoost, tree depth, number of boosting iterations, and L2 leaf regularization were optimized under the ordered boosting framework. For GB, tree depth, learning rate, and number of estimators were tuned. For XGBoost, maximum tree depth, learning rate, and L2 regularization were optimized, with early stopping applied during training. For RF models, the number of trees, maximum tree depth, and minimum leaf size were optimized. For neural network-based models, hyperparameter tuning focused on architectural and training-related parameters. For the DNN, network depth, number of hidden units, learning rate, and dropout rate were optimized. For the BPNN, the number of hidden neurons, learning rate, and L2 regularization were tuned. For the FNN, network width, dropout rate, and optimizer configuration were optimized. For the ResNet model, the number of residual blocks and weight decay were optimized, with early stopping applied during training to reduce overfitting. All tuned models were selected based on the optimization objective within the training data only, without access to held-out test data, ensuring consistent and unbiased comparison across all algorithms.

2.7. Model selection

To select the best-performing predictive model among all ML and DL models, we used three well-established evaluation metrics: the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE). These metrics are widely used to assess the performance of regression models. R² measures the proportion of variance in the data that the model explains, with values close to 1 indicating high predictive accuracy, while RMSE and MAE are error metrics, with smaller values indicating better model performance. The mathematical expressions for these metrics are as follows:

R^{2} = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}},

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}},

MAE = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - \hat{y}_{i} | .

Here $n$ denotes the sample size, $y_{i}$ is the observed value of the $i$-th observation, $\hat{y}_{i}$ is the value predicted for the $i$-th observation by the model, and $\bar{y}$ is the sample mean.
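As a quick check of the three definitions, the sketch below implements them directly in plain Python. The observed and predicted values are made-up numbers chosen so the arithmetic is easy to follow; they are not taken from the study.

```python
import math

def r2_score(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def rmse(y, y_hat):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

# Hypothetical monthly case counts and predictions (illustration only).
y = [10.0, 20.0, 30.0, 40.0]
y_hat = [12.0, 18.0, 33.0, 39.0]

print(r2_score(y, y_hat))  # 1 - 18/500 = 0.964
print(rmse(y, y_hat))      # sqrt(18/4), about 2.121
print(mae(y, y_hat))       # 8/4 = 2.0
```

Because RMSE squares the residuals before averaging, the single error of 3 weighs more heavily in RMSE than in MAE, which is why the two metrics can rank models differently.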

To further assess the stability and robustness of the models, five-fold cross-validation was first applied. In this process, the dataset was randomly divided into five approximately equal subsets. In each iteration, four subsets were used for model training, and the remaining subset was used for validation, ensuring that each subset served as the validation set exactly once. This procedure was consistently applied to both the default and tuned configurations of all eight models. Subsequently, the standard deviation (SD) of the R² values across the five folds was calculated, along with the coefficient of variation (CV), which is defined as the ratio of SD to the mean cross-validated R², expressed as a percentage. Models exhibiting higher mean R² together with lower SD and CV values were considered to demonstrate more robust generalization. This evaluation framework enabled identification of models achieving high peak performance at the expense of increased fold-to-fold variability, as well as models providing more stable and consistent predictions across cross-validation folds.
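The stability statistics can be illustrated with the fold-level R² values reported for CatBoost in Table 2. This sketch assumes the sample standard deviation (divisor n − 1), which the text does not state explicitly; because the tabulated fold values are rounded to three decimals, the results are close to, but not identical with, the SD and CV printed in the table.

```python
from statistics import mean, stdev

def fold_stability(r2_folds):
    """Return (mean R2, sample SD, CV in percent) across folds."""
    m = mean(r2_folds)
    sd = stdev(r2_folds)  # assumption: sample SD (n - 1 in the denominator)
    return m, sd, 100.0 * sd / m

# Fold-level R2 values as listed in Table 2.
catboost_default = [0.941, 0.922, 0.930, 0.939, 0.936]
catboost_tuned = [0.955, 0.888, 0.953, 0.899, 0.960]

mean_d, sd_d, cv_d = fold_stability(catboost_default)
mean_t, sd_t, cv_t = fold_stability(catboost_tuned)

print(f"default: mean={mean_d:.3f} SD={sd_d:.3f} CV={cv_d:.2f}%")
print(f"tuned:   mean={mean_t:.3f} SD={sd_t:.3f} CV={cv_t:.2f}%")
```

The two configurations have nearly the same mean R², so the comparison is decided almost entirely by the spread across folds.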

2.8. Explainable machine learning methods

2.8.1. Shapley additive explanations

Drawing from cooperative game theory, SHAP values offer a robust method for determining the influence of individual features on the model's output by systematically evaluating their marginal contributions across all possible feature combinations (Shapley, 1953). Formally, for a given prediction, the SHAP value $\Phi_{i}$ for feature $i$ is computed as:

\Phi_{i} = \sum_{S \subseteq \{1, \dots, m\} \setminus \{i\}} \frac{|S|! \, (m - |S| - 1)!}{m!} \left[ f(S \cup \{i\}) - f(S) \right],

where $m$ is the number of features, $S$ is a subset of features that excludes feature $i$, and $f(S \cup \{i\})$ and $f(S)$ are the model outcomes with and without the $i$-th predictor, respectively.
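For a handful of features the sum above can be evaluated exactly by enumerating every subset. The sketch below does so for a hypothetical three-feature model. One assumption needs stating: the model outcome for a subset S is obtained here by setting the features outside S to a fixed baseline value, which is only one of several conventions and not necessarily the one used by the SHAP library or by the authors.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating all subsets S that exclude feature i."""
    m = len(x)

    def f(subset):
        # Assumption: features outside the subset are replaced by the baseline.
        z = [x[k] if k in subset else baseline[k] for k in range(m)]
        return predict(z)

    phi = []
    for i in range(m):
        others = [k for k in range(m) if k != i]
        total = 0.0
        for size in range(m):
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for combo in combinations(others, size):
                s = set(combo)
                total += weight * (f(s | {i}) - f(s))
        phi.append(total)
    return phi

def toy_model(z):
    """Hypothetical model with one interaction term (illustration only)."""
    pop, so2, doc = z
    return 2.0 * pop + 3.0 * so2 - 1.0 * doc + 0.5 * pop * so2

x = [4.0, 2.0, 1.0]         # instance to explain
baseline = [1.0, 1.0, 1.0]  # reference values for absent features
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The values sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split equally between the two features involved. Exact enumeration costs on the order of 2^m model evaluations per feature, which is why practical implementations rely on model-specific or sampling-based approximations.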

2.8.2. Permutation feature importance

Permutation feature importance (PFI) (Altmann et al., 2010) is a versatile model-agnostic method that can be applied to various models. It assesses the significance of each feature by observing the change in the model's performance metric when the values of a particular feature are randomly permuted. The formula for calculating PFI for a feature $X_{j}$ is given by

I_{j} = \frac{1}{n} \sum_{i = 1}^{n} (S_{0} - S_{i}) .

Here, $S_{0}$ is the baseline performance metric of the model on the original dataset, $S_{i}$ is the performance metric after the $i$-th permutation of feature $X_{j}$, and $n$ is the total number of permutations performed. A higher value of $I_{j}$ indicates that feature $X_{j}$ has a more substantial effect on the model's predictive performance.
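The averaging in this formula can be made concrete with a small standard-library sketch. Everything here is illustrative: the toy model, the synthetic data, the choice of R² as the performance metric, and the number of repeats are assumptions, not the study's configuration.

```python
import random

def r2(y, y_hat):
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def permutation_importance(predict, X, y, j, n_repeats=20, seed=0):
    """Mean drop in R2 (S_0 - S_i) over n_repeats shuffles of column j."""
    rng = random.Random(seed)
    s0 = r2(y, [predict(row) for row in X])  # baseline score S_0
    drops = []
    for _ in range(n_repeats):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the link between feature j and the target
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        drops.append(s0 - r2(y, [predict(row) for row in permuted]))
    return sum(drops) / n_repeats

def toy_model(row):
    """Hypothetical fitted model that uses feature 0 and ignores feature 1."""
    return 3.0 * row[0] + 0.0 * row[1]

data_rng = random.Random(1)
X = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(50)]
y = [toy_model(row) for row in X]

imp_used = permutation_importance(toy_model, X, y, j=0)
imp_ignored = permutation_importance(toy_model, X, y, j=1)
print(imp_used, imp_ignored)
```

Shuffling the feature the model ignores leaves every prediction unchanged, so its importance is zero, whereas shuffling the feature the model depends on produces a large drop in R².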

2.8.3. Individual conditional expectation

ICE plots (Goldstein et al., 2015) are used to analyze regression models and to understand the relationship between specific predictor variables and the response variable. They are constructed by computing and plotting, for each observation, the change in the model's prediction as a single feature varies while the other features are held at that observation's values. These plots illustrate how the conditional expectation of the target variable evolves as the explanatory variable changes.
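A minimal sketch of how ICE curves and their average can be computed is given below. The two-feature model, the observations, and the grid are hypothetical; in practice the grid would span the observed range of the feature and the curves would be plotted rather than printed.

```python
def ice_curves(predict, X, j, grid):
    """One curve per observation: vary feature j over the grid, keep the rest fixed."""
    return [[predict(row[:j] + [g] + row[j + 1:]) for g in grid] for row in X]

def average_curve(curves):
    """Pointwise mean of the ICE curves (the partial dependence curve)."""
    n = len(curves)
    return [sum(curve[k] for curve in curves) / n for k in range(len(curves[0]))]

def toy_model(row):
    """Hypothetical model: the SO2 effect is halved once wind speed reaches 2 m/s."""
    ws, so2 = row
    so2_weight = 1.0 if ws < 2.0 else 0.5
    return 50.0 - 4.0 * ws + 6.0 * so2 * so2_weight

X = [[1.0, 2.0], [3.0, 4.0]]  # two observations: [WS, SO2]
grid = [1.0, 2.0, 3.0]        # wind-speed values to evaluate

curves = ice_curves(toy_model, X, j=0, grid=grid)
mean_curve = average_curve(curves)
print(curves)      # [[58.0, 48.0, 44.0], [70.0, 54.0, 50.0]]
print(mean_curve)  # [64.0, 51.0, 47.0]
```

The two individual curves fall by different amounts between the first two grid points because of the interaction with SO₂; the averaged curve hides that heterogeneity, which is the reason to inspect individual curves as well as their mean.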

2.8.4. Value window

To determine the optimal value ranges for managing key factors, we introduced a novel metric termed the "value window." This metric defines the interval within which the values of two specific features keep the target variable at either its maximum or its minimum. We first computed the mean of every feature and then divided each of the two selected features evenly into 100 levels spanning its minimum to maximum value, yielding 10,000 distinct combinations. Each combination of the two features was then joined with the mean values of the remaining features, and the previously validated optimal model was used to predict TB incidence for each combination. After compiling and plotting the 10,000 predictions, we identified the values of the two features that maximized or minimized predicted TB incidence, whether as a specific combination or as a value range, thereby establishing the "value window." We analyzed nine pairs of features that are potentially controllable or strongly associated with TB incidence. The model framework presented in the article is depicted in Fig. 2.
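The grid procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `value_window`, the tolerance parameter `tol` (how close to the extreme a prediction must be to count as inside the window), the toy model, and the toy data are all hypothetical.

```python
def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive."""
    return [lo + (hi - lo) * k / (n - 1) for k in range(n)]

def value_window(predict, X, a, b, levels=100, tol=0.0, minimize=True):
    """Grid two features, hold the others at their means, keep near-extreme pairs."""
    m = len(X[0])
    means = [sum(row[k] for row in X) / len(X) for k in range(m)]
    grid_a = linspace(min(r[a] for r in X), max(r[a] for r in X), levels)
    grid_b = linspace(min(r[b] for r in X), max(r[b] for r in X), levels)
    preds = {}
    for va in grid_a:
        for vb in grid_b:
            z = list(means)
            z[a], z[b] = va, vb
            preds[(va, vb)] = predict(z)
    target = min(preds.values()) if minimize else max(preds.values())
    window = [pair for pair, p in preds.items() if abs(p - target) <= tol]
    return preds, window

def toy_model(z):
    """Hypothetical stand-in for the fitted model (illustration only)."""
    so2, rh, doc = z
    return 40.0 + 10.0 * so2 + 0.05 * (rh - 72.0) ** 2 - 0.5 * doc

# Hypothetical observations: [SO2, RH, DOC].
X = [[1.0, 60.0, 5.0], [2.0, 70.0, 7.0], [5.0, 90.0, 9.0]]
preds, window = value_window(toy_model, X, a=0, b=1, tol=1.0)

so2_range = (min(p[0] for p in window), max(p[0] for p in window))
rh_range = (min(p[1] for p in window), max(p[1] for p in window))
print(len(preds), so2_range, rh_range)
```

With 100 levels per feature the grid contains 10,000 predictions, matching the count in the text. Setting `tol=0.0` returns only the single best combination, while a positive tolerance yields a range, which corresponds to the distinction drawn above between a specific combination and a value range.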

Fig. 2 — Schematic illustration of the framework.

3. Results

3.1. Spatiotemporal pattern of TB incidence

The epidemiological trajectory of TB incidence in Taiwan Province has shown a notable decline across all regions (Fig. 3). Specifically, the overall incidence decreased from 49 per 100,000 residents in 2014 to 32 per 100,000 in 2022, reflecting a significant reduction of 30.43%. The ADF test, yielding a p-value of 0.989, suggests that the time series of tuberculosis cases in Taiwan Province is non-stationary, hinting at the presence of seasonal variations. Seasonally adjusted average case counts were highest in spring (791 cases), followed by summer (782 cases), autumn (748 cases), and winter (676 cases).

Spatially, elevated tuberculosis incidence rates are predominantly concentrated in the southern and eastern regions of Taiwan Province, notably in Yunlin, Nantou, Hualien, Taitung, Kaohsiung, and Pingtung counties. Comparative epidemiological analysis from 2014 to 2022 indicates notable decreases in tuberculosis cases in Hualien, Taitung, and Pingtung counties, especially in the southeastern area.

Spatial autocorrelation analysis, utilizing the local Moran's I index, further delineates the distribution patterns of tuberculosis cases across Taiwan Province. Fig. 4 presents a cold and hot spot map, highlighting clusters of high and low incidence rates. Hotspots are primarily located in the southeast, including Hualien, Taitung, Kaohsiung, and Pingtung counties, while cold spots, such as Miaoli County, Hsinchu County, and Hsinchu City, situated northwest of the Central Mountain Range, consistently show lower incidence rates.
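For readers unfamiliar with the statistic, the sketch below computes a local Moran's I for each region in its commonly used form, I_i = (z_i / m2) Σ_j w_ij z_j, with row-standardized weights. The source does not state which weight matrix or significance test was used, so the contiguity structure, the incidence values, and the function name here are hypothetical, and the permutation test normally used to declare a hot or cold spot is omitted.

```python
def local_morans_i(values, neighbors):
    """Local Moran's I per region with row-standardized neighbor weights."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n  # second moment of the deviations
    results = []
    for i in range(n):
        # Spatial lag: average deviation of the neighbors of region i.
        lag = sum(z[j] for j in neighbors[i]) / len(neighbors[i])
        local_i = z[i] * lag / m2
        if z[i] > 0 and lag > 0:
            label = "high-high"
        elif z[i] < 0 and lag < 0:
            label = "low-low"
        else:
            label = "outlier"
        results.append((local_i, label))
    return results

# Hypothetical incidence rates for four regions arranged in a line.
rates = [10.0, 12.0, 30.0, 32.0]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

results = local_morans_i(rates, neighbors)
labels = [label for _, label in results]
print(results)
```

Positive values indicate that a region resembles its neighbors (a candidate hot spot when both are high, a cold spot when both are low); negative values mark spatial outliers.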

Fig. 4 — The cold/hot spot map of TB incidence rate in Taiwan, China, from 2014 to 2022.

3.2. Model selection

Before integrating the 15 candidate variables into our model, we performed feature screening. Spearman correlation coefficients revealed high correlations among PM₂.₅, PM₁₀, RPL, and NDVI (Fig. 5a). We also computed VIFs, noting that NO₂ and PM₁₀ showed signs of multicollinearity, with VIFs over 10 (Fig. 5b). We therefore removed NO₂, PM₁₀, and RPL. Afterward, we confirmed that the remaining 12 features had Spearman correlation coefficients below 0.75 and VIFs under 10 (Fig. 5c and d), indicating an absence of multicollinearity.

Fig. 5 — Heat map of Spearman's correlation coefficient and VIFs for candidate features before/after feature screening.

In this study, we applied four ML models and four DL models to predict TB cases using 12 selected features across 19 cities and counties from 2014 to 2022 (Fig. 6). Model performance was assessed using MAE, RMSE, and R². CatBoost performed best, with an R² of 0.934 and an RMSE of 9.615. The GB model was a close second, with an R² of 0.930 and an RMSE of 9.882, and the RF model reported the lowest MAE at 6.586. These outcomes demonstrate that ML and DL models can predict TB cases accurately, underscoring their utility in this epidemiological context.

Fig. 6 — Model performance metrics (R², MAE, and RMSE) for the eight candidate ML and DL models.

Using five-fold cross-validation, we observed that hyperparameter optimization increased performance variability for seven of the eight models (XGBoost being the exception), despite occasional improvements in peak R² (Table 2). For example, the default CatBoost model achieved a high and stable mean R² of 0.934, with a standard deviation of 0.008 and a CV of only 0.83%, indicating minimal sensitivity to data partitioning. In contrast, the tuned CatBoost model, although reaching a higher best-fold R² (0.960), exhibited substantially greater variability, with an SD of 0.034 and a CV of 3.65%, and a worst-fold R² as low as 0.888. This pattern was more pronounced for neural network-based models. For instance, the tuned ResNet and DNN models showed CV values of 8.47% and 7.71%, respectively, with large gaps between best-fold and worst-fold performance (0.962 vs. 0.781 for ResNet and 0.965 vs. 0.802 for DNN), highlighting pronounced instability. Even for the tree-based GB and RF models, tuning increased CV from approximately 0.8–1.4% in default configurations to 1.7–2.4% after optimization. Given that the primary objective of this study was to identify robust and generalizable associations between environmental factors and tuberculosis, model stability was prioritized over marginal gains in peak accuracy. The default CatBoost model provided the most favorable balance between predictive performance and robustness, combining the highest mean R² with among the lowest variability across folds. Therefore, hyperparameter-tuned models were not selected for subsequent analyses, and the default CatBoost model was adopted as the final modeling framework to ensure reliable inference and reproducibility.

Table 2.

Comparison of R² for 8 models before and after TPE optimization.

Model	Mean R²	Fold 1 R²	Fold 2 R²	Fold 3 R²	Fold 4 R²	Fold 5 R²	SD	CV (%)
CatBoost	0.934	0.941	0.922	0.93	0.939	0.936	0.008	0.83
CatBoost_TPE	0.931	0.955	0.888	0.953	0.899	0.96	0.034	3.65
GB	0.927	0.934	0.918	0.927	0.925	0.931	0.008	0.81
GB_TPE	0.922	0.922	0.915	0.93	0.94	0.897	0.016	1.76
XGB	0.91	0.921	0.892	0.914	0.919	0.902	0.012	1.35
XGB_TPE	0.918	0.925	0.931	0.912	0.919	0.904	0.011	1.16
RF	0.924	0.912	0.926	0.91	0.929	0.941	0.013	1.39
RF_TPE	0.931	0.912	0.914	0.954	0.919	0.956	0.022	2.37
ResNet	0.921	0.92	0.915	0.94	0.923	0.907	0.012	1.33
ResNet_TPE	0.893	0.9	0.962	0.96	0.781	0.861	0.076	8.47
DNN	0.916	0.928	0.918	0.898	0.912	0.926	0.012	1.32
DNN_TPE	0.919	0.802	0.963	0.963	0.9	0.965	0.071	7.71
BPNN	0.916	0.933	0.923	0.913	0.904	0.908	0.012	1.28
BPNN_TPE	0.92	0.958	0.94	0.869	0.921	0.915	0.033	3.63
FNN	0.93	0.934	0.943	0.936	0.914	0.921	0.012	1.28
FNN_TPE	0.91	0.95	0.91	0.916	0.952	0.821	0.053	5.85

3.3. Model interpretation

3.3.1. Relative importance of drivers on TB

As illustrated in Fig. 7, we evaluated the significance of various factors influencing TB incidence using the mean absolute SHAP value (SHAPABS) and PFI. Across all models, POP was unanimously identified as the most critical predictor. Notably, the CatBoost and GB models, which are founded on the boosting methodology, exhibited a high degree of congruence in their rankings of the predictive factors. Both models concurred that SO₂, DOC, NDVI, and O₃ are the four most influential factors subsequent to population. In contrast, the RF model, which is based on the bagging approach, aligned with the others regarding the primacy of population but diverged slightly in its assessment of the remaining factors, prioritizing WS and PI more highly. Despite these nuances, a consensus among the three models emerged, affirming POP, SO₂, DOC, NDVI, WS, and PI as the six most significant factors affecting TB. This concordance validates the robustness of the models and underscores the efficacy of ML techniques in forecasting public health issues.

3.3.2. Response between drivers and TB incidence

Fig. 8 presents SHAP summary plots illustrating the impact of the 12 selected features on the prediction of TB incidence across the three models. Each data point corresponds to an individual SHAP value, quantifying the degree to which a particular feature influences a specific prediction. The aggregation of these SHAP values offers a holistic perspective on the relative impact, frequency, and direction of effect of each variable. Notably, the consistency in the direction of feature influence across all three models substantiates the reliability of our analytical conclusions. For population density, the clustering of blue dots in the negative range of SHAP values indicates a positive correlation with TB incidence, suggesting that areas with higher population densities tend to have more TB cases. Similar patterns of association are observed for SO₂, NDVI, RH, and PRE, implying that these factors may contribute to the occurrence of TB. Conversely, for DOC, the blue dots predominantly fall in the positive range of SHAP values, indicating an inverse relationship in which higher DOC levels are associated with lower TB incidence. Analogous trends are discernible for WS, PI, and TEMP, suggesting a negative association with TB incidence.

To further analyze the factors influencing TB incidence, SHAP dependence plots were generated using the top-performing CatBoost model (Fig. 9). Fig. 9a reveals that POP shows greater variation in SHAP values between cities/counties than within them. In Fig. 9b, lower DOC is linked to higher SHAP values, indicating that limited medical care may increase TB incidence.

As shown in Fig. 9c, increasing SO₂ concentration corresponds to rising SHAP values, indicating a positive correlation with TB cases; the transition from negative to positive SHAP values occurs at a threshold of 2.5 μg/m³. Similarly, positive correlations are observed for NDVI and RH, with respective thresholds of 0.6 and 75% (Fig. 9d and f). Additionally, interactions are evident: when NDVI is below 0.6, higher PI at the same NDVI level leads to more TB cases, and when RH is below 75%, lower SO₂ at the same RH level increases TB incidence. WS exhibits a negative correlation with TB, with a threshold of 2 m/s marking the point where SHAP values shift from positive to negative (Fig. 9e). These findings are supported by consistent patterns observed in the ICE plots (Fig. 10).

Notably, the SHAP dependence plot (Fig. 9b) and the average ICE curve (Fig. 10c) for the DOC feature exhibit partially inconsistent local trends. This discrepancy primarily stems from the different ways in which the two methods handle feature interaction effects. The SHAP dependence plot preserves the real co-occurrence patterns of other features (such as POP or SO₂) in the data when visualizing the influence of DOC, thereby reflecting actual synergistic effects present in the observed dataset. For instance, within the high SHAP value region (i.e., where a lower number of doctors contributes more positively to TB incidence), the pattern in point coloration (representing the interacting feature) suggests a strong association with specific demographic or environmental conditions. In contrast, the average ICE curve is derived by first calculating the “parallel effect” of varying DOC while holding all other features fixed at each sample's actual values, and then averaging these individual trajectories. This approach focuses more on describing the average conditional main effect of DOC. Consequently, the more complex trend exhibited by DOC in the SHAP plot reveals that its impact is not independent but is highly contingent on specific socioeconomic and environmental contexts. This observation reinforces the finding in this study that nonlinear interactions exist among the driving factors.

3.3.3. Regulation of TB incidence using value window

A series of value window analyses were conducted to gain insight into the link between air pollution levels and TB incidence and to assess the feasibility of reducing TB risk by adjusting air pollutant levels. This analytical approach systematically identifies and visualizes actionable value ranges for feature pairs that can maintain the target variable, TB incidence, at a predefined, desirable level, such as its minimum predicted value. This identification is achieved through a comprehensive grid search combined with explicit threshold filtering. By shifting the analytical focus from understanding 'what the relationship is' toward recommending 'what specific, actionable ranges should be targeted,' the Value Window provides concrete, quantitative reference intervals that can directly inform disease control strategies and resource allocation decisions.
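The grid search with threshold filtering described above can be sketched as follows. The helper `value_window`, the `tolerance` parameter, the feature names, and the linear `toy_predict` function are all assumptions introduced for illustration; the actual analysis uses the fitted models of this study, and its filtering rule may differ.

```python
from itertools import product

def value_window(predict, base, feat_a, grid_a, feat_b, grid_b, tolerance):
    """Grid-search two features and keep the value pairs whose prediction lies
    within `tolerance` of the minimum over the grid (hypothetical helper)."""
    results = []
    for a, b in product(grid_a, grid_b):
        row = dict(base)                  # other features held at reference values
        row[feat_a], row[feat_b] = a, b
        results.append((a, b, predict(row)))
    best = min(p for _, _, p in results)
    # Explicit threshold filter: keep only pairs close to the best prediction.
    window = [(a, b) for a, b, p in results if p <= best + tolerance]
    return best, window

# Hypothetical stand-in for a fitted model; units and coefficients are arbitrary.
def toy_predict(row):
    return 50 + 10 * row["so2"] + 2 * row["co"]

base = {"so2": 2, "co": 2, "temp": 25}    # reference (for example, average) values
best, window = value_window(toy_predict, base,
                            "so2", [1, 2, 3, 4],
                            "co", [1, 2, 3],
                            tolerance=10)
so2_range = (min(a for a, _ in window), max(a for a, _ in window))
print(best, window, so2_range)
```

Reporting the range of each feature inside the retained window, as `so2_range` does here, gives the kind of actionable interval that the value window is meant to provide.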

Controlling SO₂ below 2.34 μg/m³ was the most effective measure: with the other features held at their average values, the model predicted 65 or fewer confirmed TB cases. Controlling SO₂ below 3.77 μg/m³ was the second most effective, with 80 or fewer predicted cases (Fig. 11a). For CO, concentrations between 0.17 μg/m³ and 0.45 μg/m³ corresponded to the highest number of confirmed TB cases, and concentrations below or above this range were associated with fewer cases (Fig. 11a). O₃ and PM_2.5 were positively correlated with TB, with more confirmed TB cases when O₃ exceeded 78.38 μg/m³ and PM_2.5 exceeded 39.6 μg/m³ (Fig. 11b).

Fig. 11 — Plots of the value window. (a) Dependence of CO and SO₂ on TB. (b) Dependence of PM_2.5 and O₃ on TB. (c) Dependence of TEMP and RH on TB. (d) Dependence of PI and DOC on TB. The red window shows the highest or lowest value of the TB prediction.

TEMP was negatively correlated with TB, and RH was largely positively correlated with TB; notably, however, predicted TB incidence was lower when RH was between 68.27% and 72.39% (Fig. 11c). As shown in Fig. 11d, the impact of DOC on TB is more pronounced than that of PI, and no clear optimal value window is evident. Maintaining a doctor density above 101 per 10,000 residents contributes substantially to a reduction in TB incidence.

3.3.4. Sustainable model

To improve the efficiency of our predictive model and reduce computational expense, we used stepwise regression to refine the variable selection, ultimately identifying four key indicators: SO₂, NDVI, POP, and PI (Fig. 12a). With these four indicators as inputs, the CatBoost model achieved an R² of 0.932, an RMSE of 9.749, and an MAE of 6.725, indicating high accuracy in forecasting TB cases. The direction of each feature's influence on TB cases in the SHAP summary plot (Fig. 12b) matches that of the model with all 12 features, confirming the consistency of our findings.
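For readers unfamiliar with the procedure, the sketch below illustrates forward stepwise selection together with the three reported metrics (R², RMSE, MAE). It makes several assumptions that are not stated in the text: an ordinary least squares fit as the scoring model, a minimum R² gain (`min_gain`) as the stopping rule, in-sample evaluation, and a small synthetic dataset with placeholder feature names. The actual selection criterion and the CatBoost refit used in this study are not reproduced here.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting.
    Raises ZeroDivisionError if A is singular (e.g. collinear features)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols_fitted(X, y, cols):
    """Least-squares fit of y on an intercept plus the chosen columns;
    returns the fitted values."""
    rows = [[1.0] + [x[c] for c in cols] for x in X]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return [sum(b * v for b, v in zip(beta, r)) for r in rows]

def metrics(y, pred):
    """Return (R^2, RMSE, MAE) for observed y and predictions pred."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(y, pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    mae = sum(abs(a - p) for a, p in zip(y, pred)) / n
    return 1.0 - ss_res / ss_tot, (ss_res / n) ** 0.5, mae

def forward_stepwise(X, y, names, min_gain=0.01):
    """Add, one at a time, the feature that raises R^2 the most;
    stop when the gain drops below min_gain (assumed stopping rule)."""
    selected, best_r2 = [], 0.0
    remaining = list(range(len(names)))
    while remaining:
        scored = [(metrics(y, ols_fitted(X, y, selected + [c]))[0], c)
                  for c in remaining]
        r2, c = max(scored)
        if r2 - best_r2 < min_gain:
            break
        selected.append(c)
        remaining.remove(c)
        best_r2 = r2
    return [names[c] for c in selected], best_r2

# Synthetic data only: y depends on f0 and f1, while f2 is irrelevant.
X = [[float(i + 1), 1.0 if i % 2 == 0 else -1.0, 1.0 if i % 4 < 2 else -1.0]
     for i in range(8)]
y = [3.0 * row[0] + 2.0 * row[1] for row in X]
chosen, r2 = forward_stepwise(X, y, ["f0", "f1", "f2"])
print(chosen, round(r2, 3))
```

In practice the metrics would be computed on held-out data rather than in-sample, and the retained features would then be passed to the final predictive model.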

Fig. 12 — (a) Number of features and model evaluation metrics. (b) SHAP summary plot for low-cost model.

4. Discussion

Our findings indicate that spring and summer are the peak seasons for TB incidence in Taiwan Province, aligning with available research on the seasonality of TB (Bonell et al., 2020; Fares, 2011; Liu et al., 2010). This seasonal pattern is likely due to increased transmission of the disease during indoor crowding in winter and the reactivation of latent TB infections in the warmer seasons of spring and summer (Naranbat et al., 2009; Willis et al., 2012). Furthermore, sunlight exposure is essential for the synthesis of vitamin D, and vitamin D deficiency, which is prevalent in populations at high risk for TB, is associated with an increased positivity rate in tuberculin skin tests among TB contacts (Koh et al., 2013; Wingfield et al., 2014). These factors often lead to a peak in TB infections in late winter, followed by a peak in reported cases in spring and summer. Concurrently, climatic factors, such as precipitation and temperature, significantly influence the seasonality of TB, with conditions conducive to the transmission of Mycobacterium tuberculosis, especially temperature and humidity in spring, being correlated with increased incidence rates (Zhang et al., 2018). The autocorrelation analysis depicted in Fig. 4 indicates a higher concentration of TB incidence in tropical regions south of the Tropic of Cancer, with lower concentrations observed north of this line. This pattern is not specific to Taiwan Province; it is also seen globally, where tropical countries tend to have higher TB incidence rates (Camargo, 2008; Lim & Siow, 2018). These geographic and economic disparities highlight the complex interplay between environmental, socioeconomic, and public health factors in shaping the global and regional patterns of TB incidence.

To enhance the effectiveness of TB prevention and control strategies, it is essential to explore the corresponding relationships and thresholds between natural and socio-economic factors associated with TB incidence (Fig. 7, Fig. 8, Fig. 9, Fig. 10). Population density emerges as a principal determinant of TB incidence, primarily due to its influence on contact rates, a critical factor in the transmission of infectious diseases. Numerous epidemiological models consider population density as a foundational parameter (De Jong et al., 1995; Hu et al., 2013; McCallum et al., 2001). The impact of SO₂ on TB incidence is multifaceted and significant; while it may exhibit antimicrobial properties offering protective effects against TB, it could also elevate TB incidence due to increased absorption during human activity and its effects on alveolar macrophages (Hwang et al., 2014; Liu et al., 2021; Zhu et al., 2018). Furthermore, air pollutants such as PM_2.5, CO, and O₃ are correlated with TB incidence. The availability of healthcare resources, including the ratio of practicing physicians, directly affects TB detection and treatment outcomes (Wang et al., 2007). There is an inverse correlation between per capita disposable income and TB incidence, highlighting the link between poverty and health outcomes, often attributed to substandard housing, overcrowding, and inadequate sanitation (Gupta et al., 2004). Natural factors, including precipitation and humidity, are positively correlated with TB incidence, while higher temperature and wind speed generally have an inhibitory effect (Desalu, 2011; Fares, 2011; Wang et al., 2023b).

While medical advancements, particularly in vaccines, have increased confidence in eradicating tuberculosis (Bo et al., 2023; Organization, 2024), there is a notable gap in integrating social determinants into TB control strategies. These strategies often emphasize pharmaceutical interventions, potentially neglecting the broader social issues essential for comprehensive TB management. Our research aims to incorporate critical socioeconomic and environmental factors to develop effective TB control strategies. Fig. 11a and b illustrate the impact of four pollution factors on TB and identify key thresholds, aiding governments in establishing more effective and cost-efficient TB control goals. Fig. 11c reveals a negative correlation between temperature and TB incidence, suggesting that warming trends may not, by themselves, exacerbate TB. Additionally, Fig. 11d indicates that a threshold of 101 or more practicing doctors per 10,000 residents is important for effective TB management, offering valuable insights for public health agencies seeking to optimize healthcare resource allocation. Our findings call for comprehensive strategies that integrate socioeconomic, environmental, and healthcare dimensions to combat TB effectively and to meet the global health goals for its elimination.

5. Conclusion

In this study we examined epidemiological trends and drivers of TB in Taiwan Province from 2014 to 2022, considering air pollution, natural factors, and socioeconomic conditions. The main findings are as follows.

1. The top-performing and sustainable models have been established, consistently highlighting the most significant drivers affecting TB incidence.
2. XAI elucidates the nonlinear responses and thresholds between TB incidence and its drivers.
3. The value window identifies optimal levels of air pollution, environmental conditions, and economic factors for reducing TB incidence.

This study has several limitations. First, model performance and generalizability depend on the quality and completeness of historical data; potential data gaps or reporting biases at finer spatiotemporal resolutions could affect analytical precision. Second, despite the inclusion of a broad set of drivers, unmeasured confounding factors, such as individual behavioral patterns, pathogen genetic variations, or local differences in health policy implementation, may still exist. Third, although XAI techniques were employed, the framework remains predictive and associative in nature, and caution is warranted when drawing causal inferences. While the identified thresholds and “value windows” provide strong suggestive evidence, they require further validation through dedicated causal study designs. Future work should integrate multi-source data (e.g., high-resolution remote sensing, electronic health records, mobility data) with advanced modeling (e.g., spatiotemporal graph neural networks) to better capture the complex environment-society-infection interactions while mitigating confounding effects.

CRediT authorship contribution statement

Yiwen Tao: Writing – review & editing, Writing – original draft, Methodology, Conceptualization. Jiaxin Zhao: Writing – original draft, Software, Data curation. Hao Cui: Software, Formal analysis, Data curation. Zhanlue Liang: Software, Formal analysis, Data curation. Jian Li: Visualization. Jingli Ren: Validation, Supervision, Funding acquisition. Huaiping Zhu: Writing – review & editing, Supervision, Methodology.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and materials

Data and materials will be made available on request.

Funding

This work is supported by the National Key Research and Development Program of China (2024YFB3411500), the National Natural Science Foundation of China (12201577, U23A2065, 42001405), the Key Scientific Research Projects of Higher Education Institutions in Henan Province, and the China Postdoctoral Science Foundation (2024M762971).

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Handling Editor: Dr. Raluca Eftimie

Footnotes

Peer review under the responsibility of KeAi Communications Co., Ltd.

Contributor Information

Jingli Ren, Email: renjl@zzu.edu.cn.

Huaiping Zhu, Email: huaiping@yorku.ca.

References

Altmann A., Toloşi L., Sander O., et al. Permutation importance: A corrected feature importance measure. Bioinformatics. 2010;26(10):1340–1347. doi: 10.1093/bioinformatics/btq134.
Arora N.K., Mishra I. United Nations sustainable development goals 2030 and environmental sustainability: Race against time. Environmental Sustainability. 2019;2(4):339–342.
Bo H., Moure U.A.E., Yang Y., et al. Mycobacterium tuberculosis-macrophage interaction: Molecular updates. Frontiers in Cellular and Infection Microbiology. 2023;13. doi: 10.3389/fcimb.2023.1062963.
Bonell A., Contamin L., Thai P.Q., et al. Does sunlight drive seasonality of TB in Vietnam? A retrospective environmental ecological study of tuberculosis seasonality in Vietnam from 2010 to 2015. BMC Infectious Diseases. 2020;20:1–11. doi: 10.1186/s12879-020-4908-0.
Breiman L. Random forests. Machine Learning. 2001;45:5–32.
Camargo E.P. Tropical diseases. Estudos Avançados. 2008;22:95–110.
Chakaya J., Khan M., Ntoumi F., et al. Global tuberculosis report 2020–reflections on the global TB burden, treatment and prevention efforts. International Journal of Infectious Diseases. 2021;113:S7–S12. doi: 10.1016/j.ijid.2021.02.107.
Chakaya J., Petersen E., Nantanda R., et al. The WHO global tuberculosis 2021 report–not so good news and turning the tide back to end TB. International Journal of Infectious Diseases. 2022;124:S26–S29. doi: 10.1016/j.ijid.2022.03.011.
Chan M., Pai K., Su S., et al. Explainable machine learning to predict long-term mortality in critically ill ventilated patients: A retrospective study in central Taiwan. BMC Medical Informatics and Decision Making. 2022;22(1):75. doi: 10.1186/s12911-022-01817-6.
Chen T. Xgboost: Extreme gradient boosting. R package version 0.4-2. 2015;1(4).
Chen C. Taiwan’s rapidly aging population: A crisis in the making? MPRA Paper 116543. University Library of Munich, Germany; 2023.
Cui H., Li J., Sun Y., et al. A novel framework for quantitative attribution of particulate matter pollution mitigation to natural and socioeconomic drivers. Science of The Total Environment. 2024;926:171910. doi: 10.1016/j.scitotenv.2024.171910.
De Jong M.C., Diekmann O., Heesterbeek H. How does transmission of infection depend on population size. In: Epidemic models: Their structure and relation to data. Vol. 84. 1995.
Deng J., Wahyuni H., Yulianto V. Labor migration from southeast Asia to Taiwan: Issues, public responses and future development. Asian Education and Development Studies. 2020;10(1):69–81.
Desalu O. Seasonal variation in hospitalisation for respiratory diseases in the tropical rain forest of south Western Nigeria. The Nigerian Postgraduate Medical Journal. 2011;18(1):39–43.
Didan K. 2015. MOD13A3 MODIS/Terra vegetation indices monthly L3 global 1km SIN grid V006.
Directorate-General of Budget, Accounting and Statistics, Executive Yuan, Taiwan, Province of China. 2024. https://winstacity.dgbas.gov.tw/DgbasWeb/ZWeb/StateFile_ZWeb.aspx
Fares A. Seasonality of tuberculosis. Journal of Global Infectious Diseases. 2011;3(1):46–55. doi: 10.4103/0974-777X.77296.
Friedman J.H. Greedy function approximation: A gradient boosting machine. Annals of Statistics. 2001:1189–1232.
Giri N., Chavan S., Heda R., et al. Disease migration, mitigation, and containment: Impact of climatic conditions & air quality on tuberculosis for India. 2019 IEEE Pune Section International Conference (PuneCon). 2019; pp. 1–6.
Goldstein A., Kapelner A., Bleich J., et al. Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational & Graphical Statistics. 2015;24(1):44–65.
Guo C., Du Y., Shen S., et al. Spatiotemporal analysis of tuberculosis incidence and its associated factors in mainland China. Epidemiology and Infection. 2017;145(12):2510–2519. doi: 10.1017/S0950268817001133.
Gupta D., Das K., Balamughesh T., et al. Role of socio-economic factors in tuberculosis prevalence. Indian Journal of Tuberculosis. 2004;51(1):27–32.
Harris R.I. Testing for unit roots using the augmented Dickey-Fuller test: Some issues relating to the size, power and the lag structure of the test. Economics Letters. 1992;38(4):381–386.
He K., Zhang X., Ren S., et al. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016; pp. 770–778.
Hecht-Nielsen R. Theory of the backpropagation neural network. International 1989 Joint Conference on Neural Networks. 1989;1:593–605.
Hersbach H., Bell B., Berrisford P., et al. 2023. ERA5 monthly averaged data on single levels from 1940 to present. Published online 2023.
Hersbach H., Bell B., Berrisford P., et al. 2023. ERA5 monthly averaged data on pressure levels from 1940 to present. Published online 2023.
Hornik K., Stinchcombe M., White H. Multilayer feedforward networks are universal approximators. Neural Networks. 1989;2(5):359–366.
Houtmeyers E., Gosselink R., Gayan-Ramirez G., et al. Regulation of mucociliary clearance in health and disease. European Respiratory Journal. 1999;13(5):1177–1188. doi: 10.1034/j.1399-3003.1999.13e39.x.
Hu H., Nigmatulina K., Eckhoff P. The scaling of contact rates with population density for the infectious disease models. Mathematical Biosciences. 2013;244(2):125–134. doi: 10.1016/j.mbs.2013.04.013.
Hwang S., Lee J., Lee J., et al. Impact of outdoor air pollution on the incidence of tuberculosis in the Seoul metropolitan area, South Korea. Korean Journal of Internal Medicine (English Edition). 2014;29(2):183–190. doi: 10.3904/kjim.2014.29.2.183.
Khan M.T., Kaushik A.C., Ji L., et al. Artificial neural networks for prediction of tuberculosis disease. Frontiers in Microbiology. 2019;10:395. doi: 10.3389/fmicb.2019.00395.
Kharwadkar S., Attanayake V., Duncan J., et al. The impact of climate change on the risk factors for tuberculosis: A systematic review. Environmental Research. 2022;212. doi: 10.1016/j.envres.2022.113436.
Knorst M., Kienast K.J., Ferlinz R. Effect of sulfur dioxide on cytokine production of human alveolar macrophages in vitro. Archives of Environmental Health: An International Journal. 1996;51(2):150–156. doi: 10.1080/00039896.1996.9936009.
Koh G.C., Hawthorne G., Turner A.M., Kunst H., Dedicoat M. Tuberculosis incidence correlates with sunshine: An ecological 28-year time series study. PLoS One. 2013;8(3). doi: 10.1371/journal.pone.0057752.
LeCun Y., Bengio Y., Hinton G. Deep learning. Nature. 2015;521(7553):436–444. doi: 10.1038/nature14539.
Li Q., Liu M., Zhang Y., et al. The spatio-temporal analysis of the incidence of tuberculosis and the associated factors in mainland China, 2009-2015. Infection, Genetics and Evolution. 2019;75. doi: 10.1016/j.meegid.2019.103949.
Liao C., Hsieh N., Huang T., et al. Assessing trends and predictors of tuberculosis in Taiwan. BMC Public Health. 2012;12:1–12. doi: 10.1186/1471-2458-12-29.
Lim T.K., Siow W.T. Pneumonia in the tropics. Respirology. 2018;23(1):28–35. doi: 10.1111/resp.13137.
Lin S., Chien J., Chiang H., et al. Ambulatory independence is associated with higher incidence of latent tuberculosis infection in long-term care facilities in Taiwan. Journal of Microbiology, Immunology, and Infection. 2021;54(2):319–326. doi: 10.1016/j.jmii.2019.07.008.
Linardatos P., Papastefanopoulos V., Kotsiantis S. Explainable AI: A review of machine learning interpretability methods. Entropy. 2020;23(1):18. doi: 10.3390/e23010018.
Liu Y., Zhao S., Li Y., et al. Effect of ambient air pollution on tuberculosis risks and mortality in Shandong, China: A multi-city modeling study of the short-and long-term effects of pollutants. Environmental Science and Pollution Research. 2021;28:27757–27768. doi: 10.1007/s11356-021-12621-6.
Liu L., Zhao X., Zhou Y. A tuberculosis model with seasonality. Bulletin of Mathematical Biology. 2010;72(4):931–952. doi: 10.1007/s11538-009-9477-8.
McCallum H., Barlow N., Hone J. How should pathogen transmission be modelled? Trends in Ecology & Evolution. 2001;16(6):295–300. doi: 10.1016/s0169-5347(01)02144-9.
Ministry of Environment Environmental Information Open Platform, MOENV. (2024) Concentration of Air Pollutants. https://data.moenv.gov.tw/
Mohidem N.A., Osman M., Muharam F.M., et al. Prediction of tuberculosis cases based on sociodemographic and environmental factors in Gombak, Selangor, Malaysia: A comparative assessment of multiple linear regression and artificial neural network models. International Journal of Mycobacteriology. 2021;10(4):442–456. doi: 10.4103/ijmy.ijmy_182_21.
Murray M., Oxlade O., Lin H. Modeling social, environmental and biological determinants of tuberculosis. International Journal of Tuberculosis & Lung Disease. 2011;15(6):S64–S70. doi: 10.5588/ijtld.10.0535.
Naranbat N., Nymadawa P., Schopfer K., et al. Seasonality of tuberculosis in an Eastern-Asian country with an extreme continental climate. European Respiratory Journal. 2009;34(4):921–925. doi: 10.1183/09031936.00035309.
Nie Y., Lu Y., Wang C., et al. Effects and interaction of meteorological factors on pulmonary tuberculosis in Urumqi, China, 2013–2019. Frontiers in Public Health. 2022;10. doi: 10.3389/fpubh.2022.951578.
Organization W.H. 2024. WHO Consolidated Guidelines on tuberculosis. Module 3: Diagnosis–rapid diagnostics for tuberculosis detection.
Organization W.H. 2008. Implementing the WHO stop TB Strategy: A handbook for national TB control programmes.
Prokhorenkova L., Gusev G., Vorobev A., et al. CatBoost: Unbiased boosting with categorical features. Advances in Neural Information Processing Systems. 2018;31.
Raviglione M., Sulis G. Tuberculosis 2015: Burden, challenges and strategy for control and elimination. Infectious Disease Reports. 2016;8(2):6570. doi: 10.4081/idr.2016.6570.
Shapley L.S. A value for n-person games. Contributions to the Theory of Games II (Annals of Mathematics Studies). 1953;28:307–317.
Taiwan Centers for Disease Control. (2024) Taiwan infectious disease statistics. https://nidss.cdc.gov.tw/en/
Talat N., Perry S., Parsonnet J., et al. Vitamin D deficiency and tuberculosis progression. Emerging Infectious Diseases. 2010;16(5):853. doi: 10.3201/eid1605.091693.
Tao Y., Zhao J., Cui H., et al. Exploring the impact of socioeconomic and natural factors on pulmonary tuberculosis incidence in China (2013–2019) using explainable machine learning: A nationwide study. Acta Tropica. 2024;253. doi: 10.1016/j.actatropica.2024.107176.
Trinh Q., Nguyen H., Nguyen V., et al. Tuberculosis and HIV co-infection—Focus on the Asia-Pacific region. International Journal of Infectious Diseases. 2015;32:170–178. doi: 10.1016/j.ijid.2014.11.023.
Wang Q., Li Y., Yin Y., et al. Association of air pollutants and meteorological factors with tuberculosis: A national multicenter ecological study in China. International Journal of Biometeorology. 2023;67(10):1629–1641. doi: 10.1007/s00484-023-02524-1.
Wang Q., Li Y., Yin Y., et al. Association of air pollutants and meteorological factors with tuberculosis: A national multicenter ecological study in China. International Journal of Biometeorology. 2023;67(10):1629–1641. doi: 10.1007/s00484-023-02524-1.
Wang L., Liu J., Chin D.P. Progress in tuberculosis control and the evolving public-health system in China. The Lancet. 2007;369(9562):691–696. doi: 10.1016/S0140-6736(07)60316-X.
Willis M.D., Winston C.A., Heilig C.M., et al. Seasonality of tuberculosis in the United States, 1993–2008. Clinical Infectious Diseases. 2012;54(11):1553–1560. doi: 10.1093/cid/cis235.
Wingfield T., Schumacher S.G., Sandhu G., et al. The seasonality of tuberculosis, sunlight, vitamin D, and household crowding. The Journal of Infectious Diseases. 2014;210(5):774–783. doi: 10.1093/infdis/jiu121.
Wu M.H.H., Chu P., et al. Surveillance of multidrug-resistant tuberculosis in Taiwan, 2008–2019. Journal of Microbiology, Immunology, and Infection. 2023;56(1):120–129. doi: 10.1016/j.jmii.2022.08.004.
Yang J., Zhang M., Chen Y., et al. A study on the relationship between air pollution and pulmonary tuberculosis based on the general additive model in Wulumuqi, China. International Journal of Infectious Diseases. 2020;96:42–47. doi: 10.1016/j.ijid.2020.03.032.
Zhang Q., Yan L., He J., et al. Time series analysis of correlativity between pulmonary tuberculosis and seasonal meteorological factors based on theory of human-environmental inter relation. Journal of Traditional Chinese Medical Sciences. 2018;5(2):119–127.
Zhu S., Xia L., Wu J., et al. Ambient air pollutants are associated with newly diagnosed tuberculosis: A time-series study in Chengdu, China. Science of the Total Environment. 2018;631–632:47–55. doi: 10.1016/j.scitotenv.2018.03.017.

[bib1] Altmann A., Toloşi L., Sander O., et al. Permutation importance: A corrected feature importance measure. Bioinformatics. 2010;26(10):1340–1347. doi: 10.1093/bioinformatics/btq134. [DOI] [PubMed] [Google Scholar]

[bib2] Arora N.K., Mishra I. United nations sustainable development goals 2030 and environmental sustainability: Race against time. Environmental Sustainability. 2019;2(4):339–342. [Google Scholar]

[bib3] Bo H., Moure U.A.E., Yang Y., et al. Mycobacterium tuberculosis-macrophage interaction: Molecular updates. Frontiers in Cellular and Infection Microbiology. 2023;13 doi: 10.3389/fcimb.2023.1062963. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib4] Bonell A., Contamin L., Thai P.Q., et al. Does sunlight drive seasonality of TB in Vietnam? A retrospective environmental ecological study of tuberculosis seasonality in Vietnam from 2010 to 2015. BMC Infectious Diseases. 2020;20:1–11. doi: 10.1186/s12879-020-4908-0. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib5] Breiman L. Random forests. Machine Learning. 2001;45:5–32. [Google Scholar]

[bib6] Camargo E.P. Tropical diseases. Estudos Avançados. 2008;22:95–110. [Google Scholar]

[bib7] Chakaya J., Khan M., Ntoumi F., et al. Global tuberculosis report 2020–reflections on the global TB burden, treatment and prevention efforts. International Journal of Infectious Diseases. 2021;113:S7–S12. doi: 10.1016/j.ijid.2021.02.107. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] Chakaya J., Petersen E., Nantanda R., et al. The WHO global tuberculosis 2021 report–not so good news and turning the tide back to end TB. International Journal of Infectious Diseases. 2022;124:S26–S29. doi: 10.1016/j.ijid.2022.03.011. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib9] Chan M., Pai K., Su S., et al. Explainable machine learning to predict long-term mortality in critically ill ventilated patients: A retrospective study in central Taiwan. BMC Medical Informatics and Decision Making. 2022;22(1):75. doi: 10.1186/s12911-022-01817-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] Chen T. Xgboost: Extreme gradient boosting. R package version 04-2. 2015;1(4) [Google Scholar]

[bib11] Chen C. Taiwan’s rapidly aging population: A crisis in the making? MPRA Paper 116543. University Library of; Munich, Germany: 2023. [Google Scholar]

[bib14] Cui H., Li J., Sun Y., et al. A novel framework for quantitative attribution of particulate matter pollution mitigation to natural and socioeconomic drivers. Science of The Total Environment. 2024;926:171910. doi: 10.1016/j.scitotenv.2024.171910. [DOI] [PubMed] [Google Scholar]

[bib15] De Jong M.C., Diekmann O., Heesterbeek H. Vol. 84. 1995. How does transmission of infection depend on population size. (Epidemic models: Their structure and relation to data). [Google Scholar]

[bib16] Deng J., Wahyuni H., Yulianto V. Labor migration from southeast Asia to Taiwan: Issues, public responses and future development. Asian Education and Development Studies. 2020;10(1):69–81. [Google Scholar]

[bib17] Desalu O. Seasonal variation in hospitalisation for respiratory diseases in the tropical rain forest of south Western Nigeria. The Nigerian Postgraduate Medical Journal. 2011;18(1):39–43. [PubMed] [Google Scholar]

[bib18] Didan K. 2015. MOD13A3 MODIS/Terra vegetation indices monthly L3 global 1km SIN grid V006. [DOI] [Google Scholar]

[bib12] Directorate‑General of Budget . Accounting and Statistics, Executive Yuan, Taiwan, Province of China. 2024. https://winstacity.dgbas.gov.tw/DgbasWeb/ZWeb/StateFile_ZWeb.aspx [Google Scholar]

[bib19] Fares A. Seasonality of tuberculosis. Journal of Global Infectious Diseases. 2011;3(1):46–55. doi: 10.4103/0974-777X.77296. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib20] Friedman J.H. Greedy function approximation: A gradient boosting machine. Annals of Statistics. 2001:1189–1232. [Google Scholar]

[bib21] Giri N., Chavan S., Heda R., et al. 2019 IEEE pune section international conference (PuneCon) 2019. Disease migration, mitigation, and containment: Impact of climatic conditions & air quality on tuberculosis for India; pp. 1–6. [Google Scholar]

[bib22] Goldstein A., Kapelner A., Bleich J., et al. Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational & Graphical Statistics. 2015;24(1):44–65. [Google Scholar]

[bib23] Guo C., Du Y., Shen S., et al. Spatiotemporal analysis of tuberculosis incidence and its associated factors in mainland China. Epidemiology and Infection. 2017;145(12):2510–2519. doi: 10.1017/S0950268817001133. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib24] Gupta D., Das K., Balamughesh T., et al. Role of socio-economic factors in tuberculosis prevalence. Indian Journal of Tuberculosis. 2004;51(1):27–32. [Google Scholar]

[bib25] Harris R.I. Testing for unit roots using the augmented dickey-fuller test: Some issues relating to the size, power and the lag structure of the test. Economics Letters. 1992;38(4):381–386. [Google Scholar]

[bib26] He K., Zhang X., Ren S., et al. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. Deep residual learning for image recognition; pp. 770–778. [Google Scholar]

[bib27] Hecht-Nielsen R. Theory of the backpropagation neural network. International 1989 Joint Conference on Neural Networks. 1989;1:593–605. [Google Scholar]

[bib28] Hersbach H., Bell B., Berrisford P., et al. 2023. ERA5 monthly averaged data on single levels from 1940 to present. Published online 2023. [Google Scholar]

[bib29] Hersbach H., Bell B., Berrisford P., et al. 2023. ERA5 monthly averaged data on pressure levels from 1940 to present. Published online 2023. [Google Scholar]

[bib30] Hornik K., Stinchcombe M., White H. Multilayer feedforward networks are universal approximators. Neural Networks. 1989;2(5):359–366.

[bib31] Houtmeyers E., Gosselink R., Gayan-Ramirez G., et al. Regulation of mucociliary clearance in health and disease. European Respiratory Journal. 1999;13(5):1177–1188. doi: 10.1034/j.1399-3003.1999.13e39.x.

[bib32] Hu H., Nigmatulina K., Eckhoff P. The scaling of contact rates with population density for the infectious disease models. Mathematical Biosciences. 2013;244(2):125–134. doi: 10.1016/j.mbs.2013.04.013.

[bib33] Hwang S., Lee J., Lee J., et al. Impact of outdoor air pollution on the incidence of tuberculosis in the Seoul metropolitan area, South Korea. Korean Journal of Internal Medicine (English Edition). 2014;29(2):183–190. doi: 10.3904/kjim.2014.29.2.183.

[bib34] Khan M.T., Kaushik A.C., Ji L., et al. Artificial neural networks for prediction of tuberculosis disease. Frontiers in Microbiology. 2019;10:395. doi: 10.3389/fmicb.2019.00395.

[bib35] Kharwadkar S., Attanayake V., Duncan J., et al. The impact of climate change on the risk factors for tuberculosis: A systematic review. Environmental Research. 2022;212. doi: 10.1016/j.envres.2022.113436.

[bib36] Knorst M., Kienast K.J., Ferlinz R. Effect of sulfur dioxide on cytokine production of human alveolar macrophages in vitro. Archives of Environmental Health: An International Journal. 1996;51(2):150–156. doi: 10.1080/00039896.1996.9936009.

[bib37] Koh G.C., Hawthorne G., Turner A.M., Kunst H., Dedicoat M. Tuberculosis incidence correlates with sunshine: An ecological 28-year time series study. PLoS One. 2013;8(3). doi: 10.1371/journal.pone.0057752.

[bib38] LeCun Y., Bengio Y., Hinton G. Deep learning. Nature. 2015;521(7553):436–444. doi: 10.1038/nature14539.

[bib39] Li Q., Liu M., Zhang Y., et al. The spatio-temporal analysis of the incidence of tuberculosis and the associated factors in mainland China, 2009–2015. Infection, Genetics and Evolution. 2019;75. doi: 10.1016/j.meegid.2019.103949.

[bib40] Liao C., Hsieh N., Huang T., et al. Assessing trends and predictors of tuberculosis in Taiwan. BMC Public Health. 2012;12:1–12. doi: 10.1186/1471-2458-12-29.

[bib41] Lim T.K., Siow W.T. Pneumonia in the tropics. Respirology. 2018;23(1):28–35. doi: 10.1111/resp.13137.

[bib42] Lin S., Chien J., Chiang H., et al. Ambulatory independence is associated with higher incidence of latent tuberculosis infection in long-term care facilities in Taiwan. Journal of Microbiology, Immunology, and Infection. 2021;54(2):319–326. doi: 10.1016/j.jmii.2019.07.008.

[bib43] Linardatos P., Papastefanopoulos V., Kotsiantis S. Explainable AI: A review of machine learning interpretability methods. Entropy. 2020;23(1):18. doi: 10.3390/e23010018.

[bib44] Liu Y., Zhao S., Li Y., et al. Effect of ambient air pollution on tuberculosis risks and mortality in Shandong, China: A multi-city modeling study of the short- and long-term effects of pollutants. Environmental Science and Pollution Research. 2021;28:27757–27768. doi: 10.1007/s11356-021-12621-6.

[bib45] Liu L., Zhao X., Zhou Y. A tuberculosis model with seasonality. Bulletin of Mathematical Biology. 2010;72(4):931–952. doi: 10.1007/s11538-009-9477-8.

[bib46] McCallum H., Barlow N., Hone J. How should pathogen transmission be modelled? Trends in Ecology & Evolution. 2001;16(6):295–300. doi: 10.1016/s0169-5347(01)02144-9.

[bib13] Ministry of Environment Environmental Information Open Platform, MOENV. (2024) Concentration of Air Pollutants. https://data.moenv.gov.tw/

[bib47] Mohidem N.A., Osman M., Muharam F.M., et al. Prediction of tuberculosis cases based on sociodemographic and environmental factors in Gombak, Selangor, Malaysia: A comparative assessment of multiple linear regression and artificial neural network models. International Journal of Mycobacteriology. 2021;10(4):442–456. doi: 10.4103/ijmy.ijmy_182_21.

[bib48] Murray M., Oxlade O., Lin H. Modeling social, environmental and biological determinants of tuberculosis. International Journal of Tuberculosis & Lung Disease. 2011;15(6):S64–S70. doi: 10.5588/ijtld.10.0535.

[bib49] Naranbat N., Nymadawa P., Schopfer K., et al. Seasonality of tuberculosis in an Eastern-Asian country with an extreme continental climate. European Respiratory Journal. 2009;34(4):921–925. doi: 10.1183/09031936.00035309.

[bib50] Nie Y., Lu Y., Wang C., et al. Effects and interaction of meteorological factors on pulmonary tuberculosis in Urumqi, China, 2013–2019. Frontiers in Public Health. 2022;10. doi: 10.3389/fpubh.2022.951578.

[bib52] World Health Organization. 2024. WHO consolidated guidelines on tuberculosis. Module 3: Diagnosis – rapid diagnostics for tuberculosis detection.

[bib51] World Health Organization. 2008. Implementing the WHO Stop TB Strategy: A handbook for national TB control programmes.

[bib54] Prokhorenkova L., Gusev G., Vorobev A., et al. CatBoost: Unbiased boosting with categorical features. Advances in Neural Information Processing Systems. 2018;31.

[bib55] Raviglione M., Sulis G. Tuberculosis 2015: Burden, challenges and strategy for control and elimination. Infectious Disease Reports. 2016;8(2):6570. doi: 10.4081/idr.2016.6570.

[bib56] Shapley L.S. A value for n-person games. Contributions to the Theory of Games II (Annals of Mathematics Studies). 1953;28:307–317.

[bib57] Taiwan Centers for Disease Control. (2024) Taiwan infectious disease statistics. https://nidss.cdc.gov.tw/en/

[bib58] Talat N., Perry S., Parsonnet J., et al. Vitamin D deficiency and tuberculosis progression. Emerging Infectious Diseases. 2010;16(5):853. doi: 10.3201/eid1605.091693.

[bib59] Tao Y., Zhao J., Cui H., et al. Exploring the impact of socioeconomic and natural factors on pulmonary tuberculosis incidence in China (2013–2019) using explainable machine learning: A nationwide study. Acta Tropica. 2024;253. doi: 10.1016/j.actatropica.2024.107176.

[bib60] Trinh Q., Nguyen H., Nguyen V., et al. Tuberculosis and HIV co-infection—Focus on the Asia-Pacific region. International Journal of Infectious Diseases. 2015;32:170–178. doi: 10.1016/j.ijid.2014.11.023.

[bib61] Wang Q., Li Y., Yin Y., et al. Association of air pollutants and meteorological factors with tuberculosis: A national multicenter ecological study in China. International Journal of Biometeorology. 2023;67(10):1629–1641. doi: 10.1007/s00484-023-02524-1.

[bib63] Wang L., Liu J., Chin D.P. Progress in tuberculosis control and the evolving public-health system in China. The Lancet. 2007;369(9562):691–696. doi: 10.1016/S0140-6736(07)60316-X.

[bib64] Willis M.D., Winston C.A., Heilig C.M., et al. Seasonality of tuberculosis in the United States, 1993–2008. Clinical Infectious Diseases. 2012;54(11):1553–1560. doi: 10.1093/cid/cis235.

[bib65] Wingfield T., Schumacher S.G., Sandhu G., et al. The seasonality of tuberculosis, sunlight, vitamin D, and household crowding. The Journal of Infectious Diseases. 2014;210(5):774–783. doi: 10.1093/infdis/jiu121.

[bib66] Wu M.H.H., Chu P., et al. Surveillance of multidrug-resistant tuberculosis in Taiwan, 2008–2019. Journal of Microbiology, Immunology, and Infection. 2023;56(1):120–129. doi: 10.1016/j.jmii.2022.08.004.

[bib67] Yang J., Zhang M., Chen Y., et al. A study on the relationship between air pollution and pulmonary tuberculosis based on the general additive model in Wulumuqi, China. International Journal of Infectious Diseases. 2020;96:42–47. doi: 10.1016/j.ijid.2020.03.032.

[bib68] Zhang Q., Yan L., He J., et al. Time series analysis of correlativity between pulmonary tuberculosis and seasonal meteorological factors based on theory of human-environmental inter relation. Journal of Traditional Chinese Medical Sciences. 2018;5(2):119–127.

[bib69] Zhu S., Xia L., Wu J., et al. Ambient air pollutants are associated with newly diagnosed tuberculosis: A time-series study in Chengdu, China. Science of the Total Environment. 2018;631–632:47–55. doi: 10.1016/j.scitotenv.2018.03.017.
