Abstract
Background
The burden of pulmonary tuberculosis in China continues to increase, and the potential impact of environmental changes warrants serious attention. While the association between meteorological factors and pulmonary tuberculosis has garnered increasing interest, relatively few studies have examined the effects of air pollutants on the disease. Leveraging real-world evidence, this study aims to investigate the potential long-term effects of exposure to both meteorological variables and air pollutants on the incidence of various forms of pulmonary tuberculosis.
Methods
We obtained daily data on meteorological factors and air pollutants from the National Oceanic and Atmospheric Administration (2014–2022) and pulmonary tuberculosis case counts from the Jining Center for Disease Control and Prevention (2009–2022). We used several time series models (the single-factor Seasonal Autoregressive Integrated Moving Average (SARIMA) model, the Holt-Winters model and the Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model) and machine learning models to construct predictive models of pulmonary tuberculosis, followed by distributed lag non-linear modelling (DLNM) to explore the chronic effects of meteorological conditions and pollutant exposure on the risk of pulmonary tuberculosis among different age and gender subgroups. Bayesian kernel machine regression (BKMR) models were used to screen pollutant drivers for different classifications of pulmonary tuberculosis.
Results
SARIMA and GARCH models demonstrated different advantages in capturing variations in disease incidence rates. Extremely low levels of PM₁₀ and very high levels of SO₂ had a hazardous effect on pulmonary tuberculosis at the maximum number of lagged days (22 d), with a relative risk (RR) (95% CI) of 1.186 (1.045, 1.345) and 1.591 (1.186, 2.135), respectively. Patients under 12 years of age exhibited heightened sensitivity to elevated levels of PM₁₀, while females demonstrated greater susceptibility to the pollutant compared to males. SO₂ emerged as the primary environmental driver associated with pulmonary tuberculosis cases that were either bacteria-negative or lacked sputum test results. In contrast, PM₁₀ was identified as the main environmental factor influencing non-sputum and culture-positive pulmonary tuberculosis cases.
Conclusions
Different time series models can predict disease incidence rates by capturing fluctuations across various temporal scales. Long-term exposure to air pollutants such as SO₂ and PM₁₀ has been shown to increase susceptibility to pulmonary tuberculosis, exerting significant lagged effects over time. Notably, individuals of younger age and those with different subtypes of pulmonary tuberculosis display varying degrees of sensitivity to specific pollutants.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12889-026-26257-z.
Keywords: Pulmonary tuberculosis, Time series analysis, Air pollutants, DLNM, BKMR, Pathogenetic classification
Introduction
Pulmonary tuberculosis is a chronic respiratory infectious disease caused by the Mycobacterium tuberculosis complex. It is primarily transmitted through the inhalation of cough aerosols [1, 2]. Pulmonary tuberculosis is a Category B infectious disease, diagnosed and classified into the following categories: Bacteria (-), Smear (+), No sputum test, Positive only, and Rifampicin resistant [3]. Pulmonary tuberculosis ranks among the top 10 causes of death globally and is the leading cause of death from a single infectious agent, surpassing Human Immunodeficiency Virus/Acquired Immune Deficiency Syndrome (HIV/AIDS). Globally in 2023, an estimated 10.8 million people (95% uncertainty interval [UI]: 10.1–11.7 million) fell ill with pulmonary tuberculosis (incident cases), a further increase from 10.7 million (95% UI: 10.0–11.5 million) in 2022, 10.4 million (95% UI: 9.7–11.1 million) in 2021 and 10.1 million (95% UI: 9.5–10.7 million) in 2020 [4]. Despite sustained efforts to combat the disease, pulmonary tuberculosis continues to pose a significant global health challenge because of its high incidence, substantial medical expenses, drug resistance, and co-infections [5]. By 2020, China ranked second in the world in pulmonary tuberculosis burden. Although the rate of decline has slowed in recent years, the disease remains a significant public health challenge.
In addressing health issues related to pulmonary tuberculosis infection, priority should be given to optimizing the allocation of healthcare resources and medical services, exploring the temporal patterns of its prevalence, and forecasting future trends. To this end, it is essential to develop statistical models that can serve as early warning systems for pulmonary tuberculosis outbreaks. Several studies have successfully utilized time series approaches—such as the Seasonal Autoregressive Integrated Moving Average (SARIMA) model and the Holt-Winters exponential smoothing method—to forecast the incidence of pulmonary tuberculosis and tuberculous pleurisy, yielding favorable predictive performance [5, 6]. Nevertheless, relatively few studies have comparatively evaluated the forecasting efficacy of different time series methodologies, either individually or in combination, specifically for infectious diseases like pulmonary tuberculosis. In addition, there remains a research gap regarding the fluctuating characteristics of infectious diseases.
A growing body of recent research has demonstrated that statistical models such as the Distributed Lag Non-Linear Model (DLNM) and the Generalized Additive Model (GAM) have been widely applied across various regions of China to investigate the lagged effects between pulmonary tuberculosis and environmental exposures, including meteorological factors and air pollutants. For instance, studies in Beijing revealed a positive association between pulmonary tuberculosis and NO₂ and wind speed, while an inverse association was observed with ozone levels. In Shijiazhuang, PM₁₀ was identified as having a significant impact on pulmonary tuberculosis [7, 8]. Furthermore, a multi-region study conducted in Nanjing found that sulfur dioxide (SO₂), nitrogen dioxide (NO₂), PM₁₀, and fine particulate matter (PM₂.₅) were all positively correlated with pulmonary tuberculosis risk, exhibiting notable lagged effects [9]. The mentioned studies have certain limitations. For instance, the parameters of previous lag models (such as degrees of freedom and lag days) are mostly derived from literature references or empirical experience, which may lead to deviations in the results. Beyond lagged associations, the independent effects of environmental factors also warrant attention. For air pollutants, Liu Yao et al. employed the logistic regression model to assess the correlations between PM₁₀, carbon monoxide (CO), and pulmonary tuberculosis [10]. In previous studies on different subtypes of infectious diseases, research conducted in multiple regions of China has found that various influenza subtypes (such as Influenza A and Influenza B) exhibit a certain degree of susceptibility to air pollutants, yet the differences between these influenza subtypes are not significant [11, 12]. Additionally, there is a paucity of research on the susceptibility of different pulmonary tuberculosis subtypes to environmental pollutants. 
For meteorological factors, some scholars have incorporated the geographically weighted regression (GWR) model into environmental science research to assess spatial variations in the association between pulmonary tuberculosis distribution and climatic variables [13]. However, most existing studies have examined the effects of individual environmental factors in isolation, with limited efforts to integrate temporal lag structures and the cumulative or interactive impacts of complex environmental mixtures on pulmonary tuberculosis.
The overall goal of this study is to explore the epidemiological characteristics of pulmonary tuberculosis, the lag and mixture effects of climate and pollutants, and the subsequent development of models applicable to predicting pulmonary tuberculosis outbreaks. Our specific objectives were as follows: (a) to conduct a direct comparison of SARIMA and Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models for forecasting pulmonary tuberculosis, with particular emphasis on capturing volatility patterns; (b) to apply DLNM to assess the effects of environmental exposures on pulmonary tuberculosis, stratified by age and sex; (c) to employ Bayesian kernel machine regression (BKMR) to investigate the joint effects of environmental pollutant mixtures on specific pulmonary tuberculosis subtypes; (d) to develop predictive models for pulmonary tuberculosis onset, selecting the most appropriate modeling approach based on population characteristics.
Materials and methods
Data collection
The National Health Commission of the People’s Republic of China classifies pulmonary tuberculosis as a Category B notifiable infectious disease. Case reporting and individual case management must be completed within 24 h in accordance with standard guidelines. Surveillance data on reported pulmonary tuberculosis cases in the study region from 2009 to 2022 were provided by the Jining Center for Disease Control and Prevention (CDC) for our analysis. We extracted the symptom onset dates of the cases and aggregated them into daily pulmonary tuberculosis counts for Jining City, Shandong Province. The total number of cases was 38,667, with no missing or excluded data. To facilitate subsequent analyses, the case counts were used as the outcome variable and organized into a monthly dataset and a daily dataset.
In 2013, the national air pollution population health impact monitoring project was officially launched [14]. Daily meteorological and pollutant data from 2014 to 2022 were publicly provided by the National Oceanic and Atmospheric Administration (NOAA) (https://www.noaa.gov/). Meteorological parameters included mean temperature, air pressure, humidity, wind speed and rainfall. Air pollution parameters included PM₂.₅, PM₁₀, NO₂, SO₂, CO and O₃. Univariate prediction was performed using the monthly case report dataset from 2009 to 2022, while all other analyses employed the daily case report dataset and concurrent environmental factor data from 2014 to 2022.
Classification of pulmonary tuberculosis cases
There are various classification systems for pulmonary tuberculosis; however, based on the needs of epidemic surveillance and prevention and control, China’s CDC system has adopted 5 surveillance classifications derived from etiological test results and drug resistance status (Bacteria (-), Smear (+), No sputum test, Positive only, Rifampicin resistant), which can be cross-layered with official clinical classifications. Their core function is to guide prevention and control rather than serve as mere clinical diagnostic classifications. All data were sourced from the Jining CDC, with diagnostic criteria for pulmonary tuberculosis derived from laboratory diagnostics of the Jining CDC and Chinese national standards: Diagnostic Criteria for Pulmonary Tuberculosis (WS 288–2017) and Diagnostic Criteria for Drug-Resistant Pulmonary Tuberculosis (WS 288–2019) [15]. The classification basis and infectivity of the 5 main pulmonary tuberculosis categories are as follows:
Smear-positive pulmonary tuberculosis (hereinafter abbreviated as Smear (+)): Positive for acid-fast bacilli (AFB) on sputum smear microscopy, with strong infectivity.
Culture-positive-only pulmonary tuberculosis (hereinafter abbreviated as Positive only): Negative on sputum smear but positive for Mycobacterium tuberculosis (MTB) in sputum culture, with weak infectivity (low risk).
Bacteriologically negative pulmonary tuberculosis (hereinafter abbreviated as Bacteria (-)): Negative results in both sputum smear and culture; diagnosis requires clinical symptoms and other auxiliary evidence, with no infectivity.
Pulmonary tuberculosis without sputum testing (hereinafter abbreviated as No sputum test): Clinically suspected or confirmed cases without core etiological tests (e.g., sputum smear, sputum culture), with no definite infectivity.
Rifampicin-resistant pulmonary tuberculosis (hereinafter abbreviated as Rifampicin resistant): MTB confirmed to be resistant to rifampicin via drug susceptibility testing (DST) or molecular testing; infectivity depends on the etiological status.
Holt-Winters, SARIMA and GARCH model construction
The Holt-Winters model, proposed by Holt and Winters in 1960, is a forecasting technique based on exponential smoothing [16]. It extends Holt’s linear equations with a seasonal factor equation, so that the seasonal trend in the original data is captured directly. The Holt-Winters model is widely applied to time series characterized by seasonal fluctuations. It uses three smoothing equations to estimate the deseasonalized series, the trend, and the seasonal factors [17].
The SARIMA model, which is used to forecast seasonal and non-stationary time series, combines a non-seasonal ARIMA(p, d, q) component with a seasonal ARIMA(P, D, Q)S component. The model, denoted generally as ARIMA(p, d, q) × (P, D, Q)S, can be written as

$$\Phi(B)\,\Phi_S(B)\,\nabla^{d}\,\nabla_S^{D}\,X_t = \Theta(B)\,\Theta_S(B)\,\varepsilon_t$$

where $X_t$ is the observed series, $\varepsilon_t$ is the error term, $B$ is the backshift operator, $\nabla^{d} = (1 - B)^{d}$ is the difference operator, $\nabla_S^{D} = (1 - B^{S})^{D}$ is the seasonal difference operator, $\Theta(B) = 1 - \theta_1 B - \cdots - \theta_q B^{q}$ is the moving average polynomial, $\Theta_S(B) = 1 - \Theta_1 B^{S} - \cdots - \Theta_Q B^{QS}$ is the seasonal moving average polynomial, $\Phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^{p}$ is the autoregressive polynomial, and $\Phi_S(B) = 1 - \Phi_1 B^{S} - \cdots - \Phi_P B^{PS}$ is the seasonal autoregressive polynomial [18].

In selecting the model orders, this study adopted a hybrid approach that integrates the auto.arima() function with minimum information criterion values (AIC/BIC). The procedure consisted of the following steps. First, an initial order was automatically determined using the auto.arima() function. Next, the estimated coefficients and corresponding standard errors for each term were examined. If any coefficient was smaller in absolute value than twice its standard error, the model order was reduced by one level and the model was refitted. This iterative process continued until all coefficients were at least twice their respective standard errors. Finally, among the candidate models satisfying this condition, the model order at which both the AIC and BIC reached their minimum was selected as the optimal configuration.
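The two difference operators in the SARIMA definition can be illustrated with a minimal Python sketch. The helper names and the synthetic monthly series are hypothetical and are not taken from the study; the point is only that applying (1 − B^S)^D and then (1 − B)^d removes a fixed seasonal pattern and a linear trend.

```python
def difference(series, lag=1):
    """Apply (1 - B^lag) once: x_t - x_{t-lag}."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def sarima_difference(series, d=1, D=1, s=12):
    """Apply (1 - B)^d (1 - B^s)^D to a series (illustrative helper)."""
    out = list(series)
    for _ in range(D):          # seasonal differencing, period s
        out = difference(out, lag=s)
    for _ in range(d):          # ordinary differencing
        out = difference(out, lag=1)
    return out


# Hypothetical monthly counts: a linear downward trend plus a fixed 12-month pattern.
pattern = [5, 8, 12, 15, 14, 11, 9, 7, 6, 5, 4, 4]
counts = [100 - t + pattern[t % 12] for t in range(48)]

# With d = 1, D = 1 and s = 12 both the trend and the seasonal pattern cancel.
stationary = sarima_difference(counts, d=1, D=1, s=12)
```

Each seasonal difference shortens the series by s observations and each ordinary difference by one, which is why long monthly records are helpful when D ≥ 1.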
The GARCH (p, q) model is capable of accurately modeling the heteroscedastic function with long-term memory, and can effectively fit and forecast non-stationary time series data. The model is represented by the following equation:
$$\sigma_t^{2} = \alpha_0 + \sum_{j=1}^{q} \alpha_j\,\varepsilon_{t-j}^{2} + \sum_{i=1}^{p} \beta_i\,\sigma_{t-i}^{2}$$

In the formula above, $\sigma_t^{2}$ is the conditional variance at time t, $\varepsilon_{t-j}$ is the residual at lag j, α0 is the constant term, αj is the coefficient of the ARCH process, βi is the coefficient of the GARCH process, i = 1, 2, ⋯, p, j = 1, 2, ⋯, q, p is the order of the GARCH terms, and q is the order of the ARCH terms. The lag at which the autocorrelation function (ACF) of the residual series became mostly stable was used to determine the threshold for p, while the lag at which the partial autocorrelation function (PACF) of the residual series became mostly stable was used to determine the threshold for q [19].
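As an illustration of the variance equation, the following Python sketch evaluates the GARCH(p, q) recursion for a given residual series. The coefficient values, the residuals, and the choice to start the recursion from the sample variance are assumptions made for the example and are not the fitted values from this study.

```python
def garch_variance(residuals, alpha0, alphas, betas):
    """Conditional variances sigma_t^2 for a GARCH(p, q) model.

    alphas: ARCH coefficients alpha_1..alpha_q (applied to squared residuals).
    betas:  GARCH coefficients beta_1..beta_p (applied to past variances).
    """
    q, p = len(alphas), len(betas)
    # Assumption: pre-sample terms are replaced by the sample variance.
    start = sum(e * e for e in residuals) / len(residuals)
    sigma2 = []
    for t in range(len(residuals)):
        arch = sum(
            alphas[j] * (residuals[t - 1 - j] ** 2 if t - 1 - j >= 0 else start)
            for j in range(q)
        )
        garch = sum(
            betas[i] * (sigma2[t - 1 - i] if t - 1 - i >= 0 else start)
            for i in range(p)
        )
        sigma2.append(alpha0 + arch + garch)
    return sigma2


# Hypothetical residuals and coefficients, for illustration only.
resid = [1.0, -2.0, 0.5, 3.0]
variances = garch_variance(resid, alpha0=0.1, alphas=[0.2], betas=[0.7])
```

A large squared residual raises the next period's variance through the ARCH term, and the GARCH term lets that increase decay gradually, which is the volatility clustering the model is meant to capture.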
Among these, the Holt-Winters and SARIMA models were primarily used for time series prediction and for the analysis of seasonal and stationarity characteristics. The seasonal characteristics were obtained mainly by decomposing the data with the decompose() function, while the Box-Ljung test was adopted for stationarity testing, with p > 0.05 taken to indicate stationarity. The GARCH model was mainly used to capture the volatility of the disease in the time series. Its pre-test employed the LM ARCH test; p < 0.05 confirmed the presence of an ARCH effect and hence the suitability of a GARCH model. For residual diagnosis of the GARCH model, the Ljung-Box test was combined with the LM ARCH test, and p > 0.05 for both tests indicated the validity of the variance equation.
Lag and mixture effects: DLNM and BKMR models
The DLNM model can be written as follows:
![]() |
In this context, Yt represents the daily count of pulmonary tuberculosis cases, α1 is the intercept of the overall model, NS denotes a natural cubic spline serving as a smooth function within the model, and M represents the climate or pollutant variable whose effect on pulmonary tuberculosis is being estimated. Xt signifies the other climate and pollutant variables involved in the pathogenesis of pulmonary tuberculosis that require adjustment for nonlinear confounding effects. These adjustment variables are not included in the meteorological model but are present in the pollution model, in which meteorological factors serve as confounders. NS was used to adjust for daily confounding in the model; Day is a binary variable used to control for the effect of time, and β denotes the regression coefficients. We estimated the optimal degrees of freedom (df) and lag days for the spline function using the Akaike information criterion for quasi-Poisson models (Q-AIC) and the minimum partial autocorrelation (PACFmin) criterion. For climate and pollutant factors, we used NS with 5 df, while the lag space was set to 4 df [14, 20].

Sensitivity analyses were conducted for different lag durations and degrees of freedom. The degrees of freedom were chosen with the PACFmin criterion: the lag days and the degrees of freedom of the relevant environmental factors were entered into the model one at a time from 1 to 5, the PACF was calculated for each candidate, and the degree of freedom corresponding to the minimum PACF value was selected. After the degrees of freedom had been fixed, the maximum lag was determined: QAIC values were computed for lag days from 1 to 30, and 22 days, which gave the minimum QAIC, was selected as the maximum lag.
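For orientation, a quasi-Poisson DLNM consistent with the symbols described above could take the following general form. This is a sketch under stated assumptions, not necessarily the authors' exact specification: the cross-basis notation cb(·), the index k over adjustment variables, and the placement of each term are assumed.

$$
Y_t \sim \text{quasi-Poisson}(\mu_t), \qquad
\log(\mu_t) = \alpha_1 + cb(M_t;\ \text{lag}) + \sum_{k} NS(X_{k,t};\ df) + \beta\,\text{Day}_t
$$

Here cb(M_t; lag) stands for the cross-basis of the exposure of interest over the lag window (up to 22 days in this study), and the NS terms adjust for the remaining environmental variables.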
In our study, the reference value for calculating the relative risk (RR) was the median of each environmental variable. We assessed the impact of extreme environmental conditions on pulmonary tuberculosis by comparing the 5th, 25th, 75th, and 95th percentiles of each environmental variable against its median. To identify susceptible populations and their sensitivities, we examined the impact of environmental factors across different sex and age groups.
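The percentile comparison can be sketched in Python as follows. The exposure series, the log-RR estimate, and its standard error are hypothetical values chosen for illustration; the conversion from a log relative risk to an RR with a 95% confidence interval uses the usual normal approximation.

```python
import math
import statistics


def exposure_percentiles(values):
    """Return the P5, P25, P50, P75 and P95 of an exposure series."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    return {p: cuts[p - 1] for p in (5, 25, 50, 75, 95)}


def relative_risk(log_rr, se, z=1.96):
    """Convert a log relative risk and its standard error to (RR, lower, upper)."""
    return (
        math.exp(log_rr),
        math.exp(log_rr - z * se),
        math.exp(log_rr + z * se),
    )


# Hypothetical daily exposure values 1..101, so that the median is easy to check.
pm10 = list(range(1, 102))
pct = exposure_percentiles(pm10)          # pct[50] is the reference (median)

# Hypothetical estimate for one percentile contrast against the median.
rr, lower, upper = relative_risk(log_rr=0.10, se=0.04)
```

An effect is usually described as hazardous when the whole interval lies above 1, as it does for the hypothetical estimate in this example.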
Next, we conducted BKMR analysis using the R package “bkmr” to examine potential non-linear exposure-response and interaction relationships between pulmonary tuberculosis and environmental indicators [21] among different types of pulmonary tuberculosis. In the BKMR model, environmental chemical mixtures suspected of exhibiting nonlinear effects or interaction patterns were selected based on Spearman correlation analysis, focusing on pollutants with statistically significant associations. Potential confounders were identified as meteorological variables that showed significant correlations in univariate analyses. The Markov Chain Monte Carlo (MCMC) simulation was configured with the following parameters: a total of 20,000 iterations, a burn-in period of 4,000 iterations (accounting for 20% of the total), and a thinning interval of 5. Following the burn-in phase, 16,000 samples remained. After applying thinning, 3,200 effective samples were retained for posterior inference. Convergence diagnostics indicated satisfactory performance of the MCMC algorithm: trace plots of key parameters—including the kernel weight (r), covariate coefficients (β), and error variance (σ²)—displayed stable and randomly fluctuating patterns after the burn-in period. All R-hat values were below 1.05, the effective sample size (ESS) for each parameter exceeded 2,000, and autocorrelation coefficients dropped below 0.1 within 10 lags. These results collectively suggest that the MCMC chains achieved convergence and that the parameter estimates are reliable.
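The sample accounting for the MCMC configuration above can be checked with a few lines of Python. The function name is hypothetical; the numbers are those reported in the text.

```python
def retained_draws(total_iter, burn_in, thin):
    """Number of posterior draws kept after discarding burn-in and thinning."""
    kept_after_burn_in = total_iter - burn_in
    # Keep every `thin`-th draw of the post-burn-in chain.
    return len(range(0, kept_after_burn_in, thin))


total_iter, burn_in, thin = 20_000, 4_000, 5
burn_in_share = burn_in / total_iter            # 0.20 of the chain
after_burn_in = total_iter - burn_in            # 16,000 draws remain
n_retained = retained_draws(total_iter, burn_in, thin)
```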
Construction of machine learning algorithms
Because the environmental factors were widely dispersed and not normally distributed, we used Spearman rank correlation to select the meteorological and pollutant factors correlated with the incidence of pulmonary tuberculosis. To keep model training consistent, each model was trained on 75% of the sample with 10-fold cross-validation. The remaining 25% was kept as a common holdout set for all models, and the statistics obtained from this set were used to compare the models. We then employed several machine learning methods to predict the incidence of pulmonary tuberculosis, including Decision Tree (DT), Bagging, Random Forest (RF), and support vector machine (SVM) models with four different kernel functions (linear, polynomial, radial, and sigmoid). The algorithms are described in detail below:
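The 75%/25% split with 10-fold cross-validation on the training portion might be set up as in the sketch below. It uses only the Python standard library; the function names, the random seed, and the assumption of one record per calendar day from 2014 to 2022 (3,287 days) are illustrative and not taken from the study's code.

```python
import random


def holdout_split(n, test_fraction=0.25, seed=0):
    """Shuffle indices 0..n-1 and return (train_indices, test_indices)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold(indices, k=10):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation


# Assumption: one observation per day, 2014-01-01 to 2022-12-31 (3,287 days).
train_idx, test_idx = holdout_split(n=3287)
splits = list(k_fold(train_idx, k=10))
```

A random split ignores temporal ordering; for forecasting tasks a chronological split is a common alternative, although the text does not state which was used.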
DT: A decision tree (DT) partitions observations into groups according to ordered splits on their attribute values. A DT consists of multiple branches and nodes. Each node represents a test on an attribute, and choosing the attribute for the root node at each level is a key challenge. The primary objective of a DT is to find the split with the highest information gain, that is, the minimum entropy. Entropy determines how data splits are selected and therefore how the tree draws its decision boundaries. The entropy (E) is calculated as

$$E = -\sum_{i=1}^{m} p_i \log_2 p_i$$

where $p_i$ is the probability of event i among the m categories [22].
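The entropy criterion can be made concrete with a short Python sketch. The class counts are hypothetical, and the information-gain helper is an illustrative addition that shows how entropy is used to compare candidate splits.

```python
import math


def entropy(probabilities):
    """Shannon entropy (in bits) of a discrete class distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def information_gain(parent, children):
    """Entropy reduction from splitting `parent` into `children`.

    Each argument holds class counts, e.g. [n_low_incidence, n_high_incidence].
    """
    def h(counts):
        total = sum(counts)
        return entropy(c / total for c in counts)

    n = sum(parent)
    weighted = sum(sum(child) / n * h(child) for child in children)
    return h(parent) - weighted


# Hypothetical node with 5 low-incidence and 5 high-incidence days,
# split perfectly into two pure child nodes.
gain = information_gain([5, 5], [[5, 0], [0, 5]])
```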
Bagging: Bagging generates multiple derived subsets from the main dataset through bootstrap sampling. Each subset is used to train an individual model, and the final prediction aggregates the predictions of all models. Let D denote the original training set of size n, and let Q(X, Y | D) represent the probability distribution for uniformly and randomly sampling training samples (x_i, y_i) from D. Based on Q^n, m data subsets D_1, D_2, …, D_m can be sampled, where each subset has the same size as the original dataset (i.e., n) and is obtained through bootstrap sampling (sampling with replacement). For each subset D_j, a model h_j is trained. The final prediction for a new input sample x is obtained by averaging the predictions of all models [23]:

$$H(x) = \frac{1}{m} \sum_{j=1}^{m} h_j(x)$$
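A minimal Python sketch of bootstrap sampling and prediction averaging follows. The base learner (a model that always predicts the mean outcome of its bootstrap sample), the data, and m = 25 are assumptions chosen to keep the example self-contained; any regression model could take the place of the toy learner.

```python
import random
import statistics


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def fit_mean_model(sample):
    """Toy base learner (an assumption): always predicts the sample mean of y."""
    mean_y = statistics.fmean(y for _, y in sample)
    return lambda x: mean_y


def bagging_predict(models, x):
    """Average the predictions of all base models."""
    return sum(h(x) for h in models) / len(models)


rng = random.Random(42)
D = [(x, 2.0 * x) for x in range(10)]                     # hypothetical (x_i, y_i) pairs
subsets = [bootstrap_sample(D, rng) for _ in range(25)]   # m = 25 subsets of size n
models = [fit_mean_model(Dj) for Dj in subsets]
prediction = bagging_predict(models, x=3)
```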
RF: The random forest (RF) method is an ensemble learning approach based on DT. Following the principle of “majority voting,” each tree votes on the classification of a sample, and the class with the highest number of votes across all decision trees is selected as the final classification. To maximize the discriminative power of each split, the split with the minimum Gini index is chosen. The Gini index is calculated as

$$\text{Gini} = 1 - \sum_{i} P_i^{2}$$

where $P_i$ is the probability of class i [24, 25].
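The Gini index can be computed as in the following Python sketch. The weighted split score is an illustrative addition showing how candidate splits are compared, and the class counts are hypothetical.

```python
def gini_index(probabilities):
    """Gini index of a class distribution: 1 - sum(P_i^2)."""
    return 1.0 - sum(p * p for p in probabilities)


def split_gini(children):
    """Size-weighted Gini index of a candidate split; children hold class counts."""
    n = sum(sum(child) for child in children)
    total = 0.0
    for child in children:
        size = sum(child)
        total += size / n * gini_index(c / size for c in child)
    return total


pure_split = split_gini([[5, 0], [0, 5]])    # perfectly separated classes
mixed_split = split_gini([[3, 2], [2, 3]])   # poorly separated classes
```

The split with the lower score is preferred, so the pure split would be chosen here.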
SVM: By adjusting the weights of positive and negative samples in the loss function, SVM can assign different penalty coefficients to positive and negative samples, i.e., imposing distinct misclassification costs on the two classes of samples. The loss function of SVM consists of the sum of the hinge loss function and a regularization term, with the calculation formula as follows:
, where xi denotes the i-th sample; yi is the class label of xi; w and b are the parameters of the hyperplane; ||·|| represents the L2 norm; the sample weight equals w1 if xi ∈ P and w2 if xi ∈ N. Four kernel functions were considered, namely linear, polynomial, radial basis function (RBF), and sigmoid [26].
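A class-weighted hinge loss with an L2 penalty, of the kind described above, might be evaluated as in this Python sketch for a linear decision function. The exact form used in the study is not reproduced here: the parameter names cost_pos and cost_neg (standing for the class weights of P and N), the regularization strength lam, and the toy data are all assumptions.

```python
def weighted_hinge_loss(X, y, w, b, cost_pos=1.0, cost_neg=1.0, lam=1.0):
    """Class-weighted hinge loss plus an L2 penalty on w.

    y must contain labels +1 (class P) or -1 (class N).
    """
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        cost = cost_pos if yi == 1 else cost_neg   # per-class misclassification cost
        loss += cost * max(0.0, 1.0 - margin)      # hinge term
    return loss + lam * sum(wj * wj for wj in w)   # regularization term


# Hypothetical two-feature samples; only the second one violates the margin.
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
y = [1, 1, -1]
loss = weighted_hinge_loss(X, y, w=[1.0, 0.0], b=0.0,
                           cost_pos=2.0, cost_neg=1.0, lam=0.5)
```

Raising cost_pos relative to cost_neg penalizes margin violations on the positive class more heavily, which is the mechanism the text describes for handling unequal misclassification costs.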
Statistical methods
Multiple time series models were constructed to evaluate the predictive performance for pulmonary tuberculosis. Spearman rank correlation analysis was employed for feature selection to identify key climate and air pollution factors associated with disease occurrence. Subsequently, a variety of machine learning models were applied to generate incidence forecasts. To assess the lagged, stratified, and non-linear effects of environmental exposures on pulmonary tuberculosis cases, we developed a DLNM with a maximum lag of 22 days. Furthermore, we applied BKMR to investigate the joint effects of pollutant mixtures on different subtypes of pulmonary tuberculosis. All analyses were conducted using R software (version 4.1.3).
Results
Descriptive characteristics of pulmonary tuberculosis cases
A total of 38,667 pulmonary tuberculosis cases were reported in Jining City, Shandong Province, China, from 2009 to 2022, with annual counts declining over the period (Table 1). Cases were concentrated among people aged 18–59 years, who accounted for 62.76% of all cases. The male-to-female ratio was 2.58:1. By occupation, most cases occurred among farmers (78.52%). By season, case counts were highest in spring and summer. Among the classifications of pulmonary tuberculosis, the predominant type was Bacteria (-) (52.44%), followed by Smear (+) (42.79%). The case report data from 2014 to 2022 were similar to those of the overall period (2009–2022) in both seasonal and long-term trends.
Table 1.
Characteristics of reported pulmonary tuberculosis cases by occupation, age, sex, disease classification and season, 2009–2022
| Characteristic | Subgroup | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No. of tuberculosis cases, n (%) | | | | | | | | | | | | | | | | |
| Occupation | Farmers | 3155 (80.42%) | 3179 (82.64%) | 2575 (79.80%) | 2750 (79.53%) | 2558 (83.27%) | 2275 (82.67%) | 2207 (79.56%) | 1907 (77.84%) | 1920 (75.35%) | 1913 (76.70%) | 1847 (73.32%) | 1563 (73.69%) | 1305 (73.11%) | 1206 (71.07%) | 30,360 (78.52%) |
| | Homemaker | 90 (2.29%) | 96 (2.50%) | 86 (2.67%) | 124 (3.59%) | 114 (3.71%) | 79 (2.87%) | 136 (4.90%) | 121 (4.94%) | 149 (5.85%) | 156 (6.26%) | 208 (8.26%) | 164 (7.73%) | 150 (8.40%) | 196 (11.55%) | 1869 (4.83%) |
| | Students | 147 (3.75%) | 128 (3.33%) | 106 (3.28%) | 120 (3.47%) | 66 (2.15%) | 69 (2.51%) | 106 (3.82%) | 106 (4.33%) | 117 (4.59%) | 132 (5.29%) | 147 (5.84%) | 122 (5.75%) | 96 (5.38%) | 73 (4.30%) | 1535 (3.97%) |
| | Workers | 201 (5.12%) | 198 (5.15%) | 127 (3.94%) | 140 (4.05%) | 83 (2.70%) | 85 (3.09%) | 88 (3.17%) | 64 (2.61%) | 75 (2.94%) | 59 (2.37%) | 86 (3.41%) | 86 (4.05%) | 43 (2.41%) | 47 (2.77%) | 1382 (3.57%) |
| | Retirees | 82 (2.09%) | 64 (1.66%) | 69 (2.14%) | 91 (2.63%) | 77 (2.51%) | 78 (2.83%) | 99 (3.57%) | 76 (3.10%) | 112 (4.40%) | 67 (2.69%) | 78 (3.10%) | 72 (3.39%) | 83 (4.65%) | 75 (4.42%) | 1123 (2.90%) |
| | Others | 248 (6.32%) | 182 (4.73%) | 264 (8.18%) | 233 (6.74%) | 174 (5.66%) | 166 (6.03%) | 138 (4.97%) | 176 (7.18%) | 175 (6.87%) | 167 (6.70%) | 153 (6.07%) | 114 (5.37%) | 108 (6.05%) | 100 (5.89%) | 2398 (6.20%) |
| Age | Age < 12 | 6 (0.15%) | 6 (0.16%) | 5 (0.15%) | 5 (0.14%) | 6 (0.20%) | 4 (0.15%) | 9 (0.32%) | 10 (0.41%) | 10 (0.39%) | 7 (0.28%) | 11 (0.44%) | 3 (0.14%) | 12 (0.67%) | 7 (0.41%) | 101 (0.26%) |
| | Age 12–17 | 68 (1.73%) | 41 (1.07%) | 49 (1.52%) | 63 (1.82%) | 57 (1.86%) | 58 (2.11%) | 118 (4.25%) | 82 (3.35%) | 81 (3.18%) | 70 (2.81%) | 73 (2.90%) | 61 (2.88%) | 49 (2.75%) | 40 (2.36%) | 910 (2.35%) |
| | Age 18–59 | 2548 (64.95%) | 2518 (65.45%) | 2078 (64.39%) | 2218 (64.14%) | 1920 (62.50%) | 1801 (65.44%) | 1716 (61.86%) | 1493 (60.94%) | 1546 (60.68%) | 1504 (60.30%) | 1555 (61.73%) | 1317 (62.09%) | 1094 (61.29%) | 959 (56.51%) | 24,267 (62.76%) |
| | Age ≥ 60 | 1301 (33.16%) | 1282 (33.32%) | 1095 (33.93%) | 1172 (33.89%) | 1089 (35.45%) | 889 (32.30%) | 931 (33.56%) | 865 (35.31%) | 911 (35.75%) | 913 (36.61%) | 880 (34.93%) | 740 (34.89%) | 630 (35.29%) | 691 (40.72%) | 13,389 (34.63%) |
| Sex | Male | 2854 (72.75%) | 2752 (71.54%) | 2354 (72.95%) | 2516 (72.76%) | 2276 (74.09%) | 1996 (72.53%) | 2034 (73.32%) | 1769 (72.20%) | 1806 (70.88%) | 1795 (71.97%) | 1834 (72.81%) | 1467 (69.17%) | 1234 (69.13%) | 1195 (70.42%) | 27,882 (72.11%) |
| | Female | 1069 (27.25%) | 1095 (28.46%) | 873 (27.05%) | 942 (27.24%) | 796 (25.91%) | 756 (27.47%) | 740 (26.68%) | 681 (27.80%) | 742 (29.12%) | 699 (28.03%) | 685 (27.19%) | 654 (30.83%) | 551 (30.87%) | 502 (29.58%) | 10,785 (27.89%) |
| Disease classification | Bacteria (-) | 1125 (28.68%) | 1120 (29.11%) | 1355 (41.99%) | 1929 (55.78%) | 1943 (63.25%) | 1874 (68.10%) | 1944 (70.08%) | 1614 (65.88%) | 1689 (66.29%) | 1546 (61.99%) | 1500 (59.55%) | 1098 (51.77%) | 708 (39.66%) | 833 (49.09%) | 20,278 (52.44%) |
| | Smear (+) | 2591 (66.05%) | 2603 (67.66%) | 1745 (54.07%) | 1321 (38.20%) | 999 (32.52%) | 786 (28.56%) | 653 (23.54%) | 620 (25.31%) | 624 (24.49%) | 708 (28.39%) | 987 (39.18%) | 1002 (47.24%) | 1055 (59.10%) | 851 (50.15%) | 16,545 (42.79%) |
| | No sputum test | 205 (5.23%) | 119 (3.09%) | 122 (3.78%) | 202 (5.84%) | 113 (3.68%) | 69 (2.51%) | 162 (5.84%) | 209 (8.53%) | 191 (7.50%) | 104 (4.17%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 1496 (3.87%) |
| | Positive only | 2 (0.05%) | 5 (0.13%) | 5 (0.15%) | 6 (0.17%) | 17 (0.55%) | 23 (0.84%) | 15 (0.54%) | 7 (0.29%) | 28 (1.10%) | 111 (4.45%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 219 (0.57%) |
| | Rifampicin resistant | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 16 (0.63%) | 25 (1.00%) | 32 (1.27%) | 21 (0.99%) | 22 (1.23%) | 13 (0.77%) | 129 (0.33%) |
| Seasons | Spring (Mar–May) | 1048 (26.71%) | 1023 (26.59%) | 914 (28.32%) | 971 (28.08%) | 734 (23.89%) | 758 (27.54%) | 742 (26.75%) | 732 (29.88%) | 717 (28.14%) | 673 (26.98%) | 695 (27.59%) | 596 (28.10%) | 459 (25.71%) | 434 (25.57%) | 10,496 (27.14%) |
| | Summer (Jun–Aug) | 989 (25.21%) | 1102 (28.65%) | 747 (23.15%) | 863 (24.96%) | 808 (26.30%) | 681 (24.75%) | 672 (24.22%) | 583 (23.80%) | 695 (27.28%) | 626 (25.10%) | 654 (25.96%) | 561 (26.45%) | 504 (28.24%) | 535 (31.53%) | 10,020 (25.91%) |
| | Autumn (Sep–Nov) | 943 (24.04%) | 807 (20.98%) | 692 (21.44%) | 703 (20.33%) | 718 (23.37%) | 556 (20.20%) | 584 (21.05%) | 509 (20.78%) | 534 (20.96%) | 491 (19.69%) | 505 (20.05%) | 470 (22.16%) | 391 (21.90%) | 285 (16.79%) | 8188 (21.18%) |
| | Winter (Dec–Feb) | 943 (24.04%) | 915 (23.78%) | 874 (27.08%) | 921 (26.63%) | 812 (26.43%) | 757 (27.51%) | 776 (27.97%) | 626 (25.55%) | 602 (23.63%) | 704 (28.23%) | 665 (26.40%) | 494 (23.29%) | 431 (24.15%) | 443 (26.10%) | 9963 (25.77%) |
| Total | | 3923 | 3847 | 3227 | 3458 | 3072 | 2752 | 2774 | 2450 | 2548 | 2494 | 2519 | 2121 | 1785 | 1697 | 38,667 |
Time-series analysis of pulmonary tuberculosis among Holt-Winters, SARIMA and GARCH model in monthly prediction
Pulmonary tuberculosis cases in Jining City from 2009 to 2022 showed a single seasonal peak during the spring and summer months (March to July) and an overall declining trend (see Figure S1). On this basis, we developed monthly time series models using the Holt–Winters three-parameter method, SARIMA, and GARCH. The model fitting and prediction results in Fig. 1A, B, and Table S1 show that the predicted values are not perfectly consistent with the actual values, but the actual values fall within the corresponding 95% confidence intervals of the predictions. Model comparison indicates that the SARIMA(1,1,1)(2,1,0)[12] model has better fit and accuracy. The fitting plot of the GARCH model in Fig. 1C indicates that, except for a few months, the actual values generally fall within the 95% prediction interval of the GARCH(2,0) model, which suggests that the GARCH model tracks the trend well. The model parameters and tests are shown in Table S2, which gives the forecasting accuracy of the two models for the pulmonary tuberculosis series. The SARIMA model has lower Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE), which means that SARIMA is more accurate. Finally, we tested the residual sequence and the standardized residuals. As shown in Table S3, the sequence exhibits significant non-constant variance, and the squared residuals are significantly autocorrelated. Thus, ARCH effects of the first to third order were all significant.
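The three accuracy measures used for model comparison can be computed as in the following Python sketch. The observed and predicted values are hypothetical and are not taken from Table S2.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical monthly case counts and forecasts.
actual = [100, 200, 400]
predicted = [110, 190, 380]
scores = {
    "RMSE": rmse(actual, predicted),
    "MAE": mae(actual, predicted),
    "MAPE": mape(actual, predicted),
}
```

Lower values indicate better accuracy on all three measures; MAPE is scale-free, which makes it convenient for comparing series with different case volumes.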
Fig. 1.
Comparison of predicted and actual values of three models. A Holt-Winters model. B SARIMA model. C GARCH model
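The three accuracy measures used to compare the models can be computed directly from paired actual and predicted values. The following is a minimal sketch, assuming the standard definitions of RMSE, MAE and MAPE. The function name `forecast_errors` and the monthly counts are hypothetical and are not taken from the study.

```python
import math


def forecast_errors(actual, predicted):
    """Return RMSE, MAE and MAPE (in percent) for paired observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by the actual value, so it is undefined if any actual is 0
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape}


# Hypothetical monthly case counts and forecasts (illustration only)
actual = [200, 250, 300, 250]
predicted = [210, 240, 290, 260]
metrics = forecast_errors(actual, predicted)
```

Lower values on all three measures indicate a closer fit. MAPE is scale-free, which makes it convenient for comparing years with different case totals.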
The stratified lag relationship between environmental variables and the incidence of pulmonary tuberculosis
According to the correlation analysis in Figure S2, the daily number of pulmonary tuberculosis cases was negatively correlated with daily air pressure (r < 0) and positively correlated with daily mean temperature, rainfall, PM10, SO2, CO, and O3 (r > 0). A total of 21,140 cases of pulmonary tuberculosis were reported in Jining City from 2014 to 2022, with a daily mean of 6.43 cases and a daily maximum of 80 cases. Patients were concentrated in the Bacteria pulmonary tuberculosis group (see Table S4 for details). The overall effects of both meteorological factors and pollutants on the incidence of pulmonary tuberculosis were non-linear and lagged, with different lag days corresponding to different effects. Specifically, high temperature and high levels of SO2 and CO had a rapid onset and a short duration of effect, whereas high air pressure, high rainfall, and high levels of PM10 acted over longer lags and had a greater impact (Figure S3 and S4). The dose-response relationships show that the risk of incidence increases gradually as temperature and air pressure decrease. The risk of incidence is highest at 20 mm for rainfall, 150 µg/m3 for SO2, and 8 µg/m3 for CO. For PM10 and O3, the relationship with the risk of incidence is U-shaped (Figure S5).
Regarding extreme weather effects (Figure S6), extremely high temperature and high air pressure showed an early hazardous effect at a lag of 0 days, whereas extremely low temperature and low air pressure showed a late hazardous effect at lags of around 11–18 days. Extremely low and extremely high rainfall showed hazardous effects at lags of 0–3 days and 12–20 days, respectively. No significant gender differences were observed in the lagged effects of meteorological factors. Among age groups, children under 12 years of age showed no discernible lag effect, whereas adolescents aged 12–17 years deviated from the overall pattern: under extremely high atmospheric pressure, a hazardous effect was observed at a prolonged lag of 16 days. At the maximum lag, low levels of rainfall (P5, P25) had a hazardous lag effect on incidence among all patients (RR (95% CI) P5 = 1.002 (1.000, 1.004), RR (95% CI) P25 = 1.004 (1.000, 1.008)). Among patients aged 60 and above, low levels of air pressure (P25) had a hazardous lag effect on incidence (RR (95% CI) P5 = 1.256 (1.053, 1.497)) (Table S6).
As shown in Fig. 2, extremely high and extremely low levels of PM₁₀, as well as extremely low levels of O₃, generally showed hazardous effects at lags of 2–4 days, while extreme levels of CO, O₃, and SO₂ showed hazardous effects at lags of 0 days, 1–2 days, and 4–8 days, respectively. No significant differences in the lagged effects of air pollution were observed between genders. Among age groups, however, children under 12 years of age differed from the overall population: under extremely high PM₁₀ concentrations and extremely low O₃ levels, a hazardous effect was observed at a prolonged lag of 15 days. At the maximum lag, low levels of PM₁₀ (P5, P25) and high levels of SO₂ (P95) had a hazardous lag effect on disease incidence among all patients (RR (95% CI) P5 = 1.190 (1.050, 1.350), RR (95% CI) P25 = 1.090 (1.030, 1.150), RR (95% CI) P95 = 1.590 (1.190, 2.140)), whereas high levels of PM₁₀ (P75, P95) had a protective lag effect. In terms of gender, males showed patterns consistent with the overall findings, while females showed a significant hazardous lagged effect on pulmonary tuberculosis under extremely high SO₂ levels (≥ 95th percentile, P95). Among age groups, individuals aged 12–17 and 18–59 years showed a markedly increased risk of disease incidence associated with exposure to extremely high SO₂ levels. Conversely, high PM₁₀ exposure was associated with a protective effect on pulmonary tuberculosis among those aged 60 years and above (Table 2).
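To make the idea of a lagged relationship concrete, the sketch below builds a simple lag matrix, in which each row holds the exposure on the current day and on each of the preceding days. This is only the raw ingredient of a distributed lag model. It is not the cross-basis construction used by DLNM, and the function name `lag_matrix` and the PM10 values are hypothetical.

```python
def lag_matrix(exposure, max_lag):
    """Build rows [x_t, x_{t-1}, ..., x_{t-max_lag}] for every day t >= max_lag."""
    if max_lag < 0 or max_lag >= len(exposure):
        raise ValueError("max_lag must be between 0 and len(exposure) - 1")
    return [
        [exposure[t - lag] for lag in range(max_lag + 1)]
        for t in range(max_lag, len(exposure))
    ]


# Hypothetical daily PM10 concentrations (illustration only)
pm10 = [80, 95, 110, 70, 60]
rows = lag_matrix(pm10, max_lag=2)
# rows[0] pairs day 2 with its lag-0, lag-1 and lag-2 exposures
```

With daily data and a maximum lag of 22 days, the same function would produce 23 columns per day, and a model can then estimate a separate effect for each column or smooth across them.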
Fig. 2.
Relationship between lag time and the relative risk (RR) of tuberculosis at extreme pollutant levels (5%, 95%) in total, by sex and by age group, estimated with the DLNM model. The median value of each pollutant (PM10: 89 µg/m3, SO2: 19 µg/m3, CO: 0.9 µg/m3, O3: 105 µg/m3) is used as the reference level
Table 2.
Summary of estimated extreme effects for pollutants on tuberculosis cases in different groups at 22 days
| Series | Variables | Low PM10 effect (95% CI) | High PM10 effect (95% CI) | Low SO2 effect (95% CI) | High SO2 effect (95% CI) |
|---|---|---|---|---|---|
| Sex | Total cases | 1.19 (1.05, 1.35) | 0.89 (0.82, 0.95) | 0.93 (0.86, 1.01) | 1.12 (1.00, 1.26) |
| | | 1.09 (1.03, 1.15) | 0.70 (0.56, 0.87) | 0.96 (0.92, 1.01) | 1.59 (1.19, 2.14) |
| | Male | 1.20 (1.03, 1.39) | 0.88 (0.81, 0.96) | 0.92 (0.84, 1.02) | 1.13 (0.99, 1.29) |
| | | 1.09 (1.02, 1.17) | 0.72 (0.55, 0.93) | 0.96 (0.90, 1.01) | 1.47 (1.04, 2.08) |
| | Female | 1.16 (0.92, 1.47) | 0.88 (0.77, 1.01) | 0.96 (0.82, 1.11) | 1.10 (0.89, 1.37) |
| | | 1.08 (0.97, 1.20) | 0.63 (0.42, 0.96) | 0.97 (0.89, 1.06) | 2.00 (1.14, 3.50) |
| Age | Age < 12 | 4.64 (0.25, 86.5) | 0.75 (0.15, 3.82) | 0.45 (0.06, 3.37) | 1.75 (0.13, 23.9) |
| | | 1.88 (0.49, 7.25) | 29.5 (0.32, 275) | 0.64 (0.20, 2.04) | 0.00 (0.00, 76.8) |
| | Age 12–17 | 1.54 (0.73, 3.27) | 0.85 (0.55, 1.30) | 1.13 (0.72, 1.79) | 0.87 (0.45, 1.66) |
| | | 1.21 (0.85, 1.71) | 1.27 (0.37, 4.41) | 1.07 (0.83, 1.40) | 1.26 (1.24, 6.55) |
| | Age 18–59 | 1.15 (0.98, 1.35) | 0.90 (0.82, 0.99) | 0.90 (0.81, 1.00) | 1.18 (1.02, 1.37) |
| | | 1.07 (0.99, 1.15) | 0.70 (0.53, 0.93) | 0.94 (0.89, 1.00) | 1.79 (1.23, 2.61) |
| | Age 60- | 1.21 (1.00, 1.49) | 0.87 (0.77, 0.98) | 0.99 (0.87, 1.14) | 1.03 (0.85, 1.25) |
| | | 1.10 (1.00, 1.21) | 0.64 (0.44, 0.92) | 1.00 (0.92, 1.08) | 1.31 (0.79, 2.15) |
Bold font indicates statistical significance at the 0.05 level. Cumulative effects are given at the 5th and 25th percentiles (low level) and at the 75th and 95th percentiles (high level) for total cases and for each sex and age group. The median value of each pollutant is used as the reference level
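One way to read the entries of Table 2 is to check whether the 95% CI excludes the null value of 1, and whether the point estimate sits near the geometric mean of the interval bounds, as expected when an interval is computed on the log scale. The sketch below applies both checks to four total-case estimates quoted above. It is a reading aid under those assumptions, not the authors' procedure, and the helper names are hypothetical.

```python
import math


def excludes_null(lower, upper, null=1.0):
    """True when the whole confidence interval lies on one side of the null."""
    return lower > null or upper < null


def log_scale_midpoint(lower, upper):
    """Geometric mean of the bounds, i.e. the midpoint on the log scale."""
    return math.sqrt(lower * upper)


# (RR, lower, upper) for total cases, copied from the text and Table 2
estimates = {
    "PM10 P5": (1.19, 1.05, 1.35),
    "PM10 P25": (1.09, 1.03, 1.15),
    "SO2 P95": (1.59, 1.19, 2.14),
    "Low SO2, first row": (0.93, 0.86, 1.01),
}

significant = {k: excludes_null(lo, hi) for k, (_, lo, hi) in estimates.items()}
midpoint_gap = {
    k: abs(rr - log_scale_midpoint(lo, hi)) for k, (rr, lo, hi) in estimates.items()
}
```

An interval that touches or crosses 1 is compatible with no effect, and a point estimate far from the log-scale midpoint is worth checking against the source table.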
Sensitivity analysis of the incidence of different pulmonary tuberculosis in relation to mixtures of pollutants
Overall, the protective effect of the environmental pollutant mixture on the incidence of pulmonary tuberculosis (OR at the 0.25–0.45 quantiles = (−1.53, −0.55)) weakened gradually as exposure concentrations increased (Fig. 3A). Regarding individual pollutants, a marginally hazardous effect was observed for SO₂ at its median concentration (P50), with an RR of 2.090 (95% CI: 0.130–4.050) (Fig. 3B). Dose-response analyses revealed non-linear associations between all four pollutants and pulmonary tuberculosis, with the relationship being most pronounced for SO₂ (Fig. 3C). Interaction analysis further indicated significant interaction effects between CO and O₃, as well as between SO₂ and PM₁₀ (Fig. 3D).
Fig. 3.
Associations between pollutant mixtures and tuberculosis in the study population estimated by the BKMR model. Models were adjusted for climate indicators, including mean temperature, air pressure and rainfall. A The cumulative effect of the pollutant mixture (estimates and 95% credible intervals) when all pollutants are at a particular percentile (X-axis) compared with when all exposures are at the 50th percentile. B The single-exposure effect (estimates and 95% credible intervals). C Univariate exposure-response functions and 95% confidence bands for each pollutant with the other pollutants fixed at the median. D Bivariate exposure-response functions for one pollutant when a second pollutant is fixed at the 25th, 50th, or 75th percentile and the rest of the pollutant mixture is fixed at the median
The effects of the pollutant mixture on the different pulmonary tuberculosis classifications are as follows. The two classifications “no sputum test” and “bacteria (−)” have similar overall effects: the protective effect of the environmental pollutant mixture on the incidence of pulmonary tuberculosis decreases gradually with increasing concentration. Among the remaining classifications, although the mixture risk is not statistically significant, the smear (+) and positive only classifications show a trend of increasing risk with increasing concentration (Figure S7). The effects of individual pollutant exposures on the different classifications are as follows. In patients with “no sputum test” and “bacteria (−)” pulmonary tuberculosis, a hazardous effect (RRrange = (0.340, 2.850)) was observed at low-to-moderate levels of SO2 exposure, and of the two, bacteria (−) pulmonary tuberculosis was the more sensitive to SO2. In the no sputum test and positive only classifications, a protective effect (RRrange = (−0.210, −0.150)) was observed at high levels of O3 exposure. However, in patients with positive only pulmonary tuberculosis, a hazardous effect (RR (95% CI) = 0.150 (0.050, 0.260)) was observed at moderate levels of PM10 exposure (Figure S8).
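BKMR represents the joint exposure-response surface through a kernel that measures how similar two exposure profiles are. The sketch below shows the Gaussian kernel commonly used for this purpose. Whether the study used exactly this specification is not stated in this section, and the weights `r` and the standardized exposure profiles are hypothetical values chosen for illustration.

```python
import math


def gaussian_kernel(z_i, z_j, r):
    """Similarity exp(-sum_m r[m] * (z_i[m] - z_j[m])**2) of two exposure profiles."""
    if not (len(z_i) == len(z_j) == len(r)):
        raise ValueError("profiles and weights must have the same length")
    return math.exp(-sum(w * (a - b) ** 2 for w, a, b in zip(r, z_i, z_j)))


def kernel_matrix(profiles, r):
    """Pairwise kernel values for a list of exposure profiles."""
    return [[gaussian_kernel(a, b, r) for b in profiles] for a in profiles]


# Hypothetical standardized exposures in the order (PM10, SO2, CO, O3)
profiles = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],  # differs from the first profile in PM10 only
    [0.0, 1.0, 0.0, 0.0],  # differs from the first profile in SO2 only
]
weights = [0.5, 2.0, 0.0, 0.0]  # assumed: a larger weight means a more influential pollutant
K = kernel_matrix(profiles, weights)
```

With these assumed weights, a one-unit difference in SO2 reduces similarity more than the same difference in PM10, which is how a kernel of this form lets one pollutant dominate the fitted surface. A weight of zero removes a pollutant from the surface entirely.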
Comparison of prediction effect in machine learning algorithms
Table S5 summarizes the performance of the seven machine learning algorithms in predicting pulmonary tuberculosis from environmental indicators. The RF model performed best on the training set (R2Training = 0.924), with RMSE and MAE values well below those of the other models, although its performance on the test set was not ideal.
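A gap between training and test performance of the kind described above can be quantified by computing R2 on each set separately. The following is a minimal sketch, assuming the usual coefficient of determination. The function name `r_squared` and all values are hypothetical and do not reproduce Table S5.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when the actual values are constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical daily case counts and model predictions (illustration only)
train_actual, train_pred = [5, 7, 9, 6, 8], [5.2, 6.8, 8.9, 6.1, 8.0]
test_actual, test_pred = [4, 10, 6, 8], [6, 7, 7, 6]

r2_train = r_squared(train_actual, train_pred)
r2_test = r_squared(test_actual, test_pred)
overfit_gap = r2_train - r2_test  # a large positive gap suggests overfitting
```

A model that fits the training data closely but generalizes poorly will show a high training R2 and a much lower test R2, so reporting both values side by side makes the overfitting visible.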
Discussion
Summary and literature comparison of prediction
Among the time series models used to predict the incidence of pulmonary tuberculosis, existing research has shown that the SARIMA model performs better. Specifically, optimization of the SARIMA model has led to a significant decrease in both MAPE and RMSE, thereby greatly enhancing prediction performance [27]. However, most previous studies have largely overlooked the magnitude of fluctuations in the incidence rates of infectious diseases. This study found that although the SARIMA model predicts more accurately than the Holt-Winters exponential smoothing model, its advantage lies in periods with relatively small fluctuations in disease incidence, as shown in Fig. 1A. Specifically, as shown in Table S1, the standard error (SE) of the SARIMA model during periods with small fluctuations (autumn and winter) is much lower than that of the other models. Furthermore, this study incorporated the GARCH model, which performed well in forecasting periods with larger fluctuations in the incidence rate, as depicted in Fig. 1C. This model addresses an inherent limitation of conventional models. By accounting for fluctuations at multiple levels, it provides new insights for improving the accuracy and reliability of predictive models in infectious disease surveillance and early warning systems. This study also evaluated the predictive effect of environmental factors on pulmonary tuberculosis using machine learning algorithms and found that the traditional RF algorithm is more suitable than the more complex SVM algorithm. This finding is somewhat inconsistent with the study by Hansong Zhu et al., which suggested that the deep learning random forest and long short-term memory (RF-LSTM) model is more effective for predicting the incidence and outbreaks of infectious diseases such as influenza [28]. 
This discrepancy suggests that models need to be tested and compared across regions and diseases so that the chosen model meets the specific requirements of each setting. The specific reasons may be as follows. Low-frequency data have weak temporal dependence, which makes it difficult for the temporal memory module of LSTM to capture useful information, while the added model complexity may introduce a risk of overfitting. In contrast, RF, which is based on ensemble learning over decision trees, is better suited to coarse-grained data with weak temporal correlation [29]. Additionally, because the variables in this study are core environmental indicators (low-dimensional features), RF performs optimally in scenarios with a moderate number of variables. Regarding model performance, although the models employed in this study are relatively fixed and highly interpretable, they offer certain advantages for the initial selection of algorithms. A U.S.-based study used multiple algorithms to predict potential carcinogens; except for the Hybrid Neural Network (HNN), all the models were traditional machine learning algorithms [30], which served to highlight the novelty and superiority of the hybrid model. That study found that complex models are more suitable for large-scale and diverse scenarios (such as qualitative judgment of the carcinogenicity of non-congeneric chemicals), which differs from the scenario of this study. In contrast, traditional algorithms (e.g., RF and Bagging models) are suited to moderate-sized datasets (not ultra-large samples) where stable results need to be obtained quickly, which is similar to the data form and application scenario of this study. 
Moreover, accumulating evidence indicates that both modified and original RF algorithms often demonstrate significant advantages across multiple industries (e.g., healthcare, agriculture) [31, 32].
Potential epidemiological mechanisms of lag effects in different subgroups
In the analyses of meteorology and air pollution, we found that prolonged exposure to extremely low temperature and low air pressure was associated with a significantly increased risk of pulmonary tuberculosis. This is consistent with the findings of Kai Huang et al., who examined the correlation between meteorological factors and pulmonary tuberculosis in Anhui Province and found that exposure to low temperatures was associated with an acute increase in the risk of hospitalization for pulmonary tuberculosis [33]. Similarly, high levels of SO2 and low levels of PM10 were also associated with a significantly increased risk of pulmonary tuberculosis. A possible explanation is that pulmonary tuberculosis is mainly transmitted through the respiratory tract. Tubercle bacilli can survive in dry sputum (under low temperatures) for 6–8 months, and if they adhere to dust particles (such as PM10) they can remain infectious for 8–10 days [34]. In addition, in the age-group sensitivity analysis, although most patients were older individuals, younger patients still showed a more pronounced risk of pulmonary tuberculosis associated with prolonged high temperatures and high levels of PM10. This differs from our earlier pre-epidemic (2013–2018) analysis of pulmonary tuberculosis susceptibility by age group, in which the older age group was more susceptible to pulmonary tuberculosis across multiple levels of pollution [35]. The difference may be partly attributable to the inclusion of the COVID-19 pandemic period in the study timeframe. During this period, younger individuals tended to engage in more frequent outdoor activities, which may have led to increased oxidative stress and reduced immune resilience. These physiological changes could heighten susceptibility to the adverse effects of elevated temperatures and fine particulate matter exposure [36, 37]. 
Meanwhile, no significant differences were observed in meteorological sensitivity between males and females. However, female patients exhibited greater sensitivity to air pollutants. This disparity may be attributed to inherent physiological regulatory differences between genders, which render women more responsive to physiological changes induced by exposure to fine particulate matter. Based on this, we hypothesize that exposure to low temperatures and fine particulate matter can induce physiological changes in the cardiovascular and respiratory systems. This may be one of the reasons why meteorological and pollution factors influence the incidence of pulmonary tuberculosis.
Potential biological mechanisms of sensitivity among different types
In addition to investigating the lag effects of meteorological and pollution factors by gender and age group, this study also explored the combined effects of pollutant mixtures on different classifications of pulmonary tuberculosis. In China, where pulmonary tuberculosis is classified by pathogenic characteristics, the main types are bacteria-negative and smear-positive, followed by non-sputum-tested. Only a small minority are culture-positive and drug-resistant types. The distribution of classifications observed in Jining City in this study is consistent with the overall distribution in China. The classifications differ in their sensitivity to the effects of pollutants on disease incidence. Among them, the non-sputum-tested type is more sensitive to the increased risk associated with low to moderate levels of SO2 and moderate levels of PM10. Yingdan Wang et al. investigated the association between pollutants and pulmonary tuberculosis in the capital city of Xinjiang Province and found a positive correlation for SO2 and PM10 with a lag of 8–9 months, which is similar to the findings of this study [38]. Many patients with the non-sputum-tested type of pulmonary tuberculosis have a dry cough and produce little or no sputum. Exposure to fine particulate matter may worsen this condition and reduce sputum production, which could increase the incidence of this pulmonary tuberculosis subtype [39]. Furthermore, although patients with bacteria-negative pulmonary tuberculosis are not contagious, this study found that these patients are also sensitive to SO2. This may lead patients to minimize their environmental exposure and go outdoors less often, which can result in irregular or interrupted anti-tuberculosis treatment. 
Over time, bacteria-negative pulmonary tuberculosis cases may become bacteria-positive, raising the risk of transmission. This underscores the need to study how environmental pollutants relate to bacteria-negative pulmonary tuberculosis [3, 40]. This study also found that sputum culture-positive only pulmonary tuberculosis is sensitive to moderate levels of PM10. Although these results are preliminary and the number of patients with this type of pulmonary tuberculosis is small, identifying culture-positive cases remains difficult, since most diagnoses rely on sputum smear microscopy. Such patients are also the main type involved in recent transmission of pulmonary tuberculosis. These results highlight the need to study culture-positive pulmonary tuberculosis and offer a basis for considering environmental factors when developing future diagnostic methods for these patients.
Public health and policy implications
The findings of this study have practical public health significance. The implications are as follows: (a) Seasonal screening: Prioritize pulmonary tuberculosis screening in Jining City during late winter and early spring (the period with the strongest lag effect of low temperatures and SO₂ exposure), covering communities with high pollution exposure. This recommendation is also applicable to other similar warm temperate regions. (b) Targeted interventions for high-risk populations: Conduct health education, respiratory protection guidance, and regular health check-ups for the highly susceptible populations identified in this study (outdoor-working farmers and elderly residents). (c) Integration with air quality early warning systems: Incorporate the pollutant risk thresholds determined in this study (e.g., moderate PM₁₀ concentrations) into Jining City’s air quality early warning system. When pollution exceeds the threshold, promptly issue alerts to populations susceptible to pulmonary tuberculosis and implement preventive measures.
Limitations and implications for future research
Although the incidence trend of pulmonary tuberculosis in this study is consistent with the overall trend in China, the findings are subject to regional limitations. In addition, the predictive analysis of environmental factors is limited by overfitting. For instance, for the RF algorithm, the R2 value of the training set is significantly higher than that of the other models, while the R2 value of the test set is the lowest. This may be attributed to unavoidable constraints associated with the single-region setting and the fixed model specifications used in this study. Future multi-center predictive studies with improved statistical models are needed. Furthermore, BKMR does not directly estimate coefficients for individual exposures; instead, it fits the nonlinear surface of the “exposure mixture-outcome” relationship through kernel functions. Even when exposures are highly correlated, the surface fitting remains stable, and only the “combined effect” (rather than individual coefficients) needs to be interpreted. Nevertheless, the model still has limitations related to potential multicollinearity. Because of the small number of cases of rifampicin-resistant pulmonary tuberculosis, environmental epidemiological evidence on this type of pulmonary tuberculosis is lacking, which leaves the association between drug-resistant pulmonary tuberculosis and pollutants open for future study. Despite adjustment for common confounding factors such as meteorological variables in the risk assessment of pollutants, there may still be unmeasured residual confounding (e.g., individual-level disease history, occupational exposure, socioeconomic status, or area-level healthcare accessibility). These factors could interfere with the estimation of the true association between pollutants and pulmonary tuberculosis. 
In this study, pollutant exposure levels were primarily derived from environmental monitoring station data, without incorporating details such as individual actual exposure duration and exposure pathways (e.g., differences between indoor and outdoor exposure). This may lead to exposure misclassification, making it difficult to accurately reflect the true exposure status of the study participants and thereby affecting the validity of the association analysis. Furthermore, this study adopted a city-level ecological analysis approach, which can only demonstrate the association between pollutant exposure and infectious disease incidence at the population level but cannot infer causality at the individual level. Meanwhile, there is a risk of “ecological fallacy,” meaning that the association observed at the population level may not be applicable to individuals, and it is challenging to distinguish the differential effects of confounding factors between the population and individual levels. We hope future studies will use more experiments to confirm how air pollutants affect pulmonary tuberculosis.
Conclusion
Pulmonary tuberculosis in Jining exhibits a distinct seasonal epidemic pattern, along with a consistent year-on-year decline in incidence. Different time series models can be used to predict the fluctuation range of different incidence rates across different temporal scales. Long-term exposure to air pollutants such as SO₂ and PM₁₀ has been shown to increase susceptibility to pulmonary tuberculosis, with evident lagged effects. Moreover, individuals of younger age and those with different pulmonary tuberculosis subtypes demonstrate varying degrees of sensitivity to specific pollutants. Moving forward, it is crucial to strengthen the surveillance of susceptible populations, particularly during the spring and summer seasons, to mitigate the risk of pulmonary tuberculosis transmission. Enhanced monitoring of temperature variations and stricter control of SO₂ emissions are also imperative to reduce environmental contributions to disease burden.
Supplementary Information
Authors’ contributions
Haoyue Cao: Software, Conceptualization, Methodology, Formal analysis, Investigation, Resources, Writing-original draft, Writing-review & editing. Wei Liu: Conceptualization, Methodology, Formal analysis, Writing-original draft, Writing-review & editing, Supervision. Juxiang Yuan: Software, Conceptualization, Methodology, Formal analysis, Investigation, Writing-original draft, Writing-review & editing. Wenjun Wang: Conceptualization, Methodology, Formal analysis, Writing-review & editing, Supervision. Weiming Hou: Conceptualization, Methodology, Formal analysis, Writing-review & editing, Funding acquisition, Supervision. All authors had full access to the data, contributed to the study, approved the final version for publication, and take responsibility for its accuracy and integrity. All authors read and approved the final manuscript.
Funding
None.
Data availability
The disease data that support the findings of this study are available on request from the Jining Center for Disease Control and Prevention. The data are not publicly available due to privacy or ethical restrictions. Interested parties can apply for the data by contacting the Ethics Committee of Jining Center for Disease Control and Prevention. The daily meteorological and pollutant data are publicly available from the National Oceanic and Atmospheric Administration (NOAA) (https://www.noaa.gov/).
Declarations
Ethics approval and consent to participate
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. The study was approved by the ethics committee of the Jining Center for Disease Control and Prevention. Written informed consent for publication was obtained from all participants.
Consent for publication
All authors have consented to publication of this research.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Wenjun Wang, Email: wwjun1973@163.com.
Weiming Hou, Email: hwm100908@163.com.
References
- 1. Dheda K, Barry CE 3rd, Maartens G. Tuberculosis. Lancet. 2016;387(10024):1211–1226.
- 2. Sharma A, Bloss E, Heilig CM, Click ES. Tuberculosis caused by Mycobacterium africanum, United States, 2004–2013. Emerg Infect Dis. 2016;22(3):396–403.
- 3. Achkar JM, Jenny-Avital ER. Incipient and subclinical tuberculosis: defining early disease states in the context of host immune response. J Infect Dis. 2011;204(Suppl 4):S1179–1186.
- 4. Motta I, Boeree M, Chesov D, Dheda K, Günther G, Horsburgh CR Jr., Kherabi Y, Lange C, Lienhardt C, McIlleron HM, et al. Recent advances in the treatment of tuberculosis. Clin Microbiol Infect. 2024;30(9):1107–14.
- 5. Mao Q, Zhang K, Yan W, Cheng C. Forecasting the incidence of tuberculosis in China using the seasonal auto-regressive integrated moving average (SARIMA) model. J Infect Public Health. 2018;11(5):707–12.
- 6. Zhou Y, Luo D, Liu K, Chen B, Chen S, Pan J, Liu Z, Jiang J. Trend of the tuberculous pleurisy notification rate in Eastern China during 2017–2021: Spatiotemporal analysis. JMIR Public Health Surveill. 2023;9:e49859.
- 7. Wang W, Guo W, Cai J, Guo W, Liu R, Liu X, Ma N, Zhang X, Zhang S. Epidemiological characteristics of tuberculosis and effects of meteorological factors and air pollutants on tuberculosis in Shijiazhuang, China: A distribution lag non-linear analysis. Environ Res. 2021;195:110310.
- 8. Sun S, Chang Q, He J, Wei X, Sun H, Xu Y, Soares Magalhaes RJ, Guo Y, Cui Z, Zhang W. The association between air pollutants, meteorological factors and tuberculosis cases in Beijing, China: A seven-year time series study. Environ Res. 2023;216(Pt 2):114581.
- 9. Li Z, Liu Q, Chen L, Zhou L, Qi W, Wang C, Zhang Y, Tao B, Zhu L, Martinez L, et al. Ambient air pollution contributed to pulmonary tuberculosis in China. Emerg Microbes Infect. 2024;13(1):2399275.
- 10. Yao L, LiangLiang C, JinYue L, WanMei S, Lili S, YiFan L, HuaiChen L. Ambient air pollution exposures and risk of drug-resistant tuberculosis. Environ Int. 2019;124:161–9.
- 11. Huang X, Wang D, Zhang Q, Wang D, Shu Y, Xiao S. Impact of ambient air pollutants on influenza-like illness, influenza A and influenza B: A nationwide time-series study in China. Atmos Environ. 2026;367:121729.
- 12. Gao X, Chen S, Zhong Z, Li J, Chen J, Li B, Lin K, Hua Q, Zhang R, Liu D, et al. Global association between air pollution and risk of influenza-related outcomes: a systematic review and meta-analysis. Int J Environ Health Res. 2025:1–16. Online ahead of print.
- 13. Zhang Y, Liu M, Wu SS, Jiang H, Zhang J, Wang S, Ma W, Li Q, Ma Y, Liu Y, et al. Spatial distribution of tuberculosis and its association with meteorological factors in Mainland China. BMC Infect Dis. 2019;19(1):379.
- 14. Chen Y, Hou W, Dong J. Time series analyses based on the joint lagged effect analysis of pollution and meteorological factors of hemorrhagic fever with renal syndrome and the construction of prediction model. PLoS Negl Trop Dis. 2023;17(7):e0010806.
- 15. Liu J, Qian H, Jin J, Du M, Wang C, Yu J, Pang P, Shen M, Mei Z, Shi Y, et al. Use of metagenomic next-generation sequencing for accurate diagnosis of tuberculous pleurisy: a retrospective cohort study. J Thorac Dis. 2025;17(9):6771–8.
- 16. Mills ES, Holt CC, Modigliani F, Muth JF, Simon HA. Planning Production, Inventories, and Work Force. J Am Stat Assoc. 1962;57(297):222.
- 17. Zhu Y, Zhao Y, Zhang J, Geng N, Huang D. Spring onion seed demand forecasting using a hybrid Holt-Winters and support vector machine model. PLoS ONE. 2019;14(7):e0219889.
- 18. Liu J, Yu F, Song H. Application of SARIMA model in forecasting and analyzing inpatient cases of acute mountain sickness. BMC Public Health. 2023;23(1):56.
- 19. Zhang H, Liu H, Cui D, Zhang F. A Height Nonlinear Velocity Field Algorithm for CORS Station Based on GARCH Model. Sensors (Basel). 2022;22(19):7589.
- 20. Chen Y, Hou W, Hou W, Dong J. Lagging effects and prediction of pollutants and their interaction modifiers on influenza in Northeastern China. BMC Public Health. 2023;23(1):1826.
- 21. Cheng X, Wei Y, Wang R, Jia C, Zhang Z, An J, Li W, Zhang J, He M. Associations of essential trace elements with epigenetic aging indicators and the potential mediating role of inflammation. Redox Biol. 2023;67:102910.
- 22. Islam R, Sultana A, Tuhin MN, Saikat MSH, Islam MR. Clinical decision support system for diabetic patients by predicting type 2 diabetes using machine learning algorithms. J Healthc Eng. 2023;2023:6992441.
- 23. Tusher EH, Ismail MA, Akib A, Gabralla LA, Ibrahim AO, Som HM, Remli MA. Comparative investigation of bagging enhanced machine learning for early detection of HCV infections using class imbalance technique with feature selection. PLoS ONE. 2025;20(6):e0326488.
- 24. Wang X, Zhai M, Ren Z, Ren H, Li M, Quan D, Chen L, Qiu L. Exploratory study on classification of diabetes mellitus through a combined random forest classifier. BMC Med Inf Decis Mak. 2021;21(1):105.
- 25. Kruppa J, Liu Y, Biau G, Kohler M, König IR, Malley JD, Ziegler A. Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biom J. 2014;56(4):534–63.
- 26. Liu L, Wu X, Li S, Li Y, Tan S, Bai Y. Solving the class imbalance problem using ensemble algorithm: application of screening for aortic dissection. BMC Med Inf Decis Mak. 2022;22(1):82.
- 27. Wang Y, Xu C, Li Y, Wu W, Gui L, Ren J, Yao S. An advanced data-driven hybrid model of SARIMA-NNNAR for tuberculosis incidence time series forecasting in Qinghai Province, China. Infect Drug Resist. 2020;13:867–80.
- 28. Zhu H, Qi F, Wang X, Zhang Y, Chen F, Cai Z, Chen Y, Chen K, Chen H, Xie Z, et al. Study of the driving factors of the abnormal influenza A (H3N2) epidemic in 2022 and early predictions in Xiamen, China. BMC Infect Dis. 2024;24(1):1093.
- 29. Park K. Performance comparison of machine learning and deep learning models for supply chain tier order quantity prediction: Emphasis on tree-based and CNN-BILSTM approaches. J Infras Policy Dev. 2024;8(14):9683.
- 30.Limbu S, Dakshanamurthy S. Predicting chemical carcinogens using a hybrid neural network deep learning method. Sensors (Basel). 2022;22(21):8185. [DOI] [PMC free article] [PubMed]
- 31.Song J, Gao Y, Yin P, Li Y, Li Y, Zhang J, Su Q, Fu X, Pi H. The random forest model has the best accuracy among the four pressure ulcer prediction models using machine learning algorithms. Risk Manag Healthc Policy. 2021;14:1175–87. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Wang T. Improved random forest classification model combined with C5.0 algorithm for vegetation feature analysis in non-agricultural environments. Sci Rep. 2024;14(1):10367. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Huang K, Hu CY, Yang XY, Zhang Y, Wang XQ, Zhang KD, Li YQ, Wang J, Yu WJ, Cheng X, et al. Contributions of ambient temperature and relative humidity to the risk of tuberculosis admissions: A multicity study in central China. Sci Total Environ. 2022;838(Pt 3):156272. [DOI] [PubMed] [Google Scholar]
- 34.Nnoaham KE, Clarke A. Low serum vitamin D levels and tuberculosis: a systematic review and meta-analysis. Int J Epidemiol. 2008;37(1):113–9. [DOI] [PubMed] [Google Scholar]
- 35.Huang K, Ding K, Yang XJ, Hu CY, Jiang W, Hua XG, Liu J, Cao JY, Zhang T, Kan XH, et al. Association between short-term exposure to ambient air pollutants and the risk of tuberculosis outpatient visits: A time-series study in Hefei, China. Environ Res. 2020;184:109343. [DOI] [PubMed] [Google Scholar]
- 36.Yang J, Liu HZ, Ou CQ, Lin GZ, Zhou Q, Shen GC, Chen PY, Guo Y. Global climate change: impact of diurnal temperature range on mortality in Guangzhou, China. Environ Pollut. 2013;175:131–6. [DOI] [PubMed] [Google Scholar]
- 37.Wang S, Wu G, Du Z, Wu W, Ju X, Yimaer W, Chen S, Zhang Y, Li J, Zhang W, et al. The causal links between long-term exposure to major PM(2.5) components and the burden of tuberculosis in China. Sci Total Environ. 2023;870:161745. [DOI] [PubMed] [Google Scholar]
- 38.Wang Y, Gao C, Zhao T, Jiao H, Liao Y, Hu Z, Wang L. A comparative study of three models to analyze the impact of air pollutants on the number of pulmonary tuberculosis cases in Urumqi, Xinjiang. PLoS ONE. 2023;18(1):e0277314. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Denkinger CM, Kik SV, Cirillo DM, Casenghi M, Shinnick T, Weyer K, Gilpin C, Boehme CC, Schito M, Kimerling M, et al. Defining the needs for next generation assays for tuberculosis. J Infect Dis. 2015;211(Suppl 2):S29–38. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Service HKC, Madras TC, Council BM. A study of the characteristics and course of sputum smear-negative pulmonary tuberculosis. Tubercle. 1981;62(3):155–67. [DOI] [PubMed] [Google Scholar]
Associated Data
Data Availability Statement
The disease data that support the findings of this study are available on request from the Jining Center for Disease Control and Prevention. These data are not publicly available because of privacy and ethical restrictions. Interested parties can apply for access by contacting the Ethics Committee of the Jining Center for Disease Control and Prevention. The daily meteorological and air pollutant data are publicly available from the National Oceanic and Atmospheric Administration (NOAA) (https://www.noaa.gov/).






