Early warning of hepatitis B epidemics in Henan Province, China, from 2014 to 2023 based on Baidu Index and Bayesian Structural Time Series model

Yongbin Wang; Xianxiang Lan; Pan Hu; Fei Lin; Chunjie Xu

doi:10.1186/s13690-026-01837-y

2026 Jan 19;84:36. doi: 10.1186/s13690-026-01837-y

PMCID: PMC12895623 PMID: 41555371

Abstract

Background

Hepatitis B (HB) poses a significant disease burden in Henan due to its high morbidity and prevalence. Traditional surveillance systems often suffer from reporting delays, limiting real-time epidemic monitoring. Baidu Index (BI) has emerged as a valuable tool for gathering disease-related information, potentially enhancing disease surveillance capabilities. This study aimed to estimate HB epidemics by integrating BI into the traditional surveillance systems.

Methods

Monthly HB incidence data from January 2014 to September 2023 in Henan were collected. The dataset was divided into a training set (from January 2014 to September 2022) and a test set (from October 2022 to September 2023). The training set was utilized to develop Bayesian structural time series (BSTS) and seasonal autoregressive integrated moving average (SARIMA) models, including BI as a covariate (SARIMAX). Model performance was evaluated using mean absolute deviation (MAD), root mean square error (RMSE), mean absolute percentage error (MAPE), and mean error rate (MER). A sensitivity analysis was conducted to ensure robustness.

Results

A total of 739,386 HB cases were reported. HB showed an increasing trend before 2020, followed by fluctuations influenced by the COVID-19 pandemic, eventually stabilizing at a high level. A distinct seasonal pattern was observed, with a peak in March and a trough in December. The BSTS model incorporating BI demonstrated superior forecasting performance. For the 12-month ahead forecast, the BSTS model with BI achieved a MAPE of 0.112, a MAD of 615.94, an RMSE of 777.99, and a MER of 0.113, outperforming the BSTS model without BI (MAPE: 0.246; MAD: 998.22; RMSE: 1305.37; MER: 0.183). Similarly, the SARIMAX model outperformed the standard SARIMA model. Moreover, the best BSTS, with or without BI, resulted in lower forecasting error rates compared to the best SARIMA and SARIMAX. Sensitivity analyses confirmed the stability of these findings.

Conclusions

Integrating BI significantly improves HB incidence forecasting accuracy in Henan. The BSTS model demonstrates particular superiority over traditional SARIMA approaches, offering a more reliable tool for real-time epidemic monitoring. These findings support incorporating internet search data into public health early warning systems to enhance HB surveillance and facilitate timely interventions toward elimination goals.

Supplementary Information

The online version contains supplementary material available at 10.1186/s13690-026-01837-y.

Keywords: Hepatitis B, Bayesian structural time series, Seasonal autoregressive integrated moving average, Baidu index, Early warning, Time series analysis

Text box 1. Contributions to the literature
• Demonstrates Baidu Index’s value in enhancing Hepatitis B incidence prediction accuracy, offering a novel supplement to traditional surveillance.
• BSTS models, particularly when integrated with Baidu Index, outperform SARIMA and SARIMAX in forecasting performance and adaptability.
• COVID-19 impact integration provides realistic epidemic assessment during public health disruptions, highlighting external factor consideration.
• Recommends public health departments incorporate internet search data into advanced prediction systems for improved monitoring and early warning.
• Supports building more robust forecasting tools through multi-source data integration for timely interventions.

Background

Hepatitis B (HB), caused by the hepatitis B virus (HBV), is transmitted through contact with infected blood or body fluids and remains a significant global health threat [1]. An estimated 254 million people were chronically infected worldwide in 2022, resulting in approximately 1.1 million deaths [2]. While vaccination is effective, its timely implementation, particularly at birth, remains a challenge in many regions, and epidemiological data from low-income countries is often insufficient [3, 4]. China has made substantial progress in combating HBV over the past three decades through comprehensive measures, including immunization programs and prevention of mother-to-child transmission. Nevertheless, due to its large population base, China still bears a significant number of infections [1, 5–7]. Furthermore, progress is challenged by an aging population [8], HBV reactivation [9], and co-infections with diseases like HIV and COVID-19 [10], creating ongoing hurdles for achieving the goal of eliminating HBV by 2030.

Henan Province, located in central China, has the largest population and number of HBV-infected patients in China, consistently ranking high in reported cases and incidence rates [11, 12]. A major concern is that adults in Henan often do not prioritize HB vaccination, making HBV infection a significant public health problem in this demographic [13]. Currently, HB cases in Henan are primarily reported through a traditional infectious disease surveillance system, which can cause reporting delays and a lack of real-time data, thereby hindering effective and timely public health policymaking [14]. This latency has motivated research into complementary data sources for early outbreak detection.

The emergence of digital epidemiology and infodemiology has opened new possibilities, with internet search data representing a particularly promising avenue for real-time surveillance [15–18]. The paradigm of using search engine data for disease surveillance was pioneered by Google Flu Trends, demonstrating the potential of analyzing search query volumes to track infectious disease activity [19]. In China, where Google is not available, Baidu serves as the dominant search engine [20], and the Baidu Index (BI) provides data analogous to Google Trends [21, 22]. Previous studies have explored BI’s utility for monitoring various diseases including chickenpox [23], COVID-19 [24], and dengue fever [25]. The fundamental premise is that population-level search behavior for disease-related information correlates with actual disease incidence, potentially offering real-time insights when traditional surveillance data are unavailable. This approach aligns with broader advancements in healthcare data analytics, which leverage diverse data sources, including digital traces, to model, predict, and understand disease dynamics, as exemplified during the COVID-19 pandemic [18].

However, the utility of digital data streams extends beyond correlation. The field of infodemics examines the overload of information, including misinformation, during outbreaks, and highlights the need for robust algorithms to discern true epidemic signals from noise [18]. Advanced analytical frameworks, including network algorithms and mathematical models for contagion source detection, are crucial for interpreting complex, multi-source data in both epidemic and infodemic contexts [18, 26]. These approaches emphasize the importance of integrating and validating digital data within structured analytical models to generate reliable insights for public health action [18, 26].

Despite these advances, critical research gaps remain in the application of internet search data for HB surveillance. First, while previous studies have established correlational relationships between search volumes and disease incidence, few have systematically developed and validated predictive models specifically for HB. Second, existing research has predominantly relied on traditional statistical models like autoregressive integrated moving average (ARIMA) or seasonal autoregressive integrated moving average (SARIMA) [27, 28], which assume time series stability and may lack flexibility to adapt to sudden disruptions such as those caused by the COVID-19 pandemic. Third, there is limited research integrating multiple data sources (traditional surveillance and digital trace data) within advanced modeling frameworks that can quantify uncertainty and incorporate complex temporal structures.

The rapid advancement of artificial intelligence applications in healthcare offers new opportunities for addressing these limitations [15]. Studies have demonstrated the value of machine learning approaches in analyzing diverse data streams for public health surveillance [15, 16]. Particularly for infectious disease forecasting, models that can handle structural breaks, incorporate multiple data sources, and provide uncertainty quantification are increasingly valuable. Within this context, Bayesian structural time series (BSTS) models present a flexible alternative that can better handle dynamic changes and incorporate complex covariate structures while providing full posterior inference [29–31].

Our study addresses these research gaps through several key contributions. First, we systematically develop and validate a comprehensive BI for HB surveillance, identifying the most predictive search terms through rigorous correlation and causality testing. Second, we introduce BSTS modeling to HB forecasting, demonstrating its advantages over traditional SARIMA approaches, particularly during periods of epidemiological disruption. Third, we integrate the impact of the COVID-19 pandemic into our models, providing a more realistic assessment of HB trends during this extraordinary period. Finally, we offer a comparative analysis of model performance across different forecasting horizons, providing practical guidance for public health applications.

We selected BSTS over alternative modern forecasting approaches for several theoretical and practical reasons. While models like Prophet, long short-term memory (LSTM), and various ensemble methods have shown promise in epidemiological forecasting, BSTS offers particular advantages for our context. Compared to Prophet, which uses a decomposable time series model with interpretable parameters [30, 31], BSTS provides more flexible handling of uncertainty through full Bayesian inference and more sophisticated covariate selection via spike-and-slab priors. Against LSTM networks, which excel at capturing complex nonlinear patterns [32], BSTS offers greater interpretability and better performance with limited training data, a crucial consideration given our monthly data points [30, 31]. BSTS also automatically handles multiple seasonality and provides natural uncertainty quantification through posterior predictive distributions, features particularly valuable for public health decision-making where understanding forecast uncertainty is as important as point predictions [30, 31].

Therefore, this study aims to develop and validate an integrated forecasting framework for HB incidence in Henan Province, China, by combining BI with BSTS modeling, comparing its performance against traditional SARIMA approaches and rigorously assessing its utility for real-time epidemic monitoring.

Materials and methods

Data collection

The monthly HB incidence data from January 2014 to September 2023 was obtained from the Henan Provincial Health Commission. The annual resident population data came from the Henan Province Statistical Yearbook and Henan Province Statistical Bulletin. The HB related keywords were mainly obtained through keyword mining in Aizhan.com (https://ci.aizhan.com/), 5118 platform (https://www.5118.com/), and the demand map of BI Platform (https://index.baidu.com/); the BI of HB related keywords was mainly obtained through the BI download website (https://www.cmshj.com/).

HB related keywords mining and BI acquisition

Two methods were employed to identify HB-related keywords: ① Long-tail keywords were mined using Aizhan.com and the 5118 platform, with ‘HB’ as the core query. Keywords widely searched by internet users and indexed across the web were selected. ② Relevant keywords were also obtained using the BI demand map. The BI, reflecting the search frequency by Baidu users, was calculated for each HB-related keyword. Using data from January 2014 to September 2023, the BI for selected keywords was determined on the BI platform, considering both PC and mobile devices. Keywords not included in the BI thesaurus were excluded. Finally, BI data for the remaining keywords were downloaded and aggregated from daily to monthly values using Microsoft Excel 2021 for further analysis.

Keyword selection and analysis

Establishing the keyword database: A keyword database was created using Microsoft Excel 2021, categorizing a total of 95 identified keywords into six distinct groups: the HB comprehensive category (17 keywords), immune indicator category (25 keywords), symptom category (4 keywords), examination category (8 keywords), treatment category (11 keywords), and prevention category (30 keywords) (Table S1).

Keyword exclusion criteria: The following criteria were applied to exclude certain keywords: 1) Spearman rank correlation analysis was conducted between HB and the BI, excluding keywords with a Spearman’s rank correlation (r_s) of less than 0.5 or a p-value greater than 0.05; 2) Keywords with a maximum cross-correlation coefficient of less than 0.5 were excluded based on time-lag cross-correlation analysis. Ultimately, seven keywords were selected for further analysis (Table S2).
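The two exclusion criteria can be sketched in a few lines of standard-library Python. This is an illustration only: the study ran these analyses in SPSS, the p-value filter is omitted, a lagged Spearman coefficient stands in for the SPSS cross-correlation, and the function names, the `max_lag=3` window, and the toy series are all hypothetical.

```python
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman r_s is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def lagged_spearman(search, cases, lag):
    """Correlate search volume in month t - lag with cases in month t."""
    if lag == 0:
        return spearman(search, cases)
    return spearman(search[:-lag], cases[lag:])

def keep_keyword(search, cases, max_lag=3, threshold=0.5):
    """Keep a keyword only if both correlation criteria reach the threshold."""
    best = max(lagged_spearman(search, cases, k) for k in range(max_lag + 1))
    return spearman(search, cases) >= threshold and best >= threshold

# Toy monthly series (hypothetical values, not study data)
cases = [10, 12, 15, 11, 9, 14, 16, 13]
search_a = [100, 120, 150, 110, 90, 140, 160, 130]  # moves with cases
search_b = [-v for v in cases]                      # moves against cases
print(keep_keyword(search_a, cases), keep_keyword(search_b, cases))
```

In this toy run the first keyword is retained and the second is dropped, mirroring how 95 candidate keywords were reduced to seven.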

Construction of BI

The BI was constructed by integrating the search volumes of selected keywords, with weights assigned based on their temporal correlation with HB incidence. The calculation method was as follows:

In the formula, Weight_ki represents the weight for keyword i at lag k; ρ_ki is the rank correlation coefficient between keyword i at lag k and HB incidence; N represents the number of keywords contained in each time lag; and BI_ki is the Baidu Index for keyword i at lag k.
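The displayed formula is not reproduced in this version of the text. One form consistent with the definitions above is given here as an assumption, not as the authors' confirmed expression: each keyword's weight is its rank correlation divided by the sum of the correlations of all N keywords at the same lag, and the composite index is the weighted sum of the keyword indices. The symbols ρ_ki (rank correlation of keyword i at lag k with HB incidence) and BI_ki (Baidu Index of keyword i at lag k) are illustrative names.

```latex
\mathrm{Weight}_{ki} = \frac{\rho_{ki}}{\sum_{j=1}^{N} \rho_{kj}},
\qquad
\mathrm{BI}_{k} = \sum_{i=1}^{N} \mathrm{Weight}_{ki}\,\mathrm{BI}_{ki}
```

Under this reading the weights at each lag sum to one, so the composite index stays on the same scale as the individual keyword indices.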

SARIMA and SARIMAX models

The general structure of a SARIMA model is expressed as SARIMA(p,d,q)(P,D,Q)_s, where p represents the autoregressive (AR) order, P the seasonal AR order, d the degree of non-seasonal differencing, D the degree of seasonal differencing, q the moving average (MA) order, Q the seasonal MA order, and s the seasonal period [27, 28]. SARIMAX is an extension of the SARIMA model that incorporates exogenous variables. Typically, the modeling process for SARIMA or SARIMAX involves four key steps (Figure S1): ① Stationarity check: The Augmented Dickey-Fuller (ADF) test is applied to assess the stationarity of the HB incidence series. If the series is non-stationary, differencing is used to transform it into a stationary sequence [27, 28]. ② Parameter identification and estimation: Based on the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of the stationary series, a trial-and-error approach is employed to preliminarily identify plausible model parameters [27, 28]. The optimal model is selected by minimizing information criteria such as the Akaike information criterion (AIC), corrected AIC (CAIC), and Bayesian information criterion (BIC), while simultaneously maximizing the log-likelihood (LL) [27, 28]. ③ Model diagnostics: A t-test is conducted to evaluate the significance of the coefficients in the selected model. In addition, the Ljung–Box Q statistic is used to test whether the residuals resemble a white noise process [27, 28]. ④ Forecasting: The best-fitting model is used to forecast HB incidence in the test set.
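The differencing in step ① can be illustrated with a short Python sketch. It shows what d = 1 and D = 1 with s = 12 do to a monthly series; the synthetic series and function names are hypothetical, and the study itself used the R "tseries" and "forecast" packages.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def seasonal_and_regular_difference(series, season=12):
    """Apply one seasonal difference (D = 1) and then one regular difference (d = 1)."""
    return difference(difference(series, lag=season), lag=1)

# Synthetic 105-month series: linear trend plus a fixed monthly pattern
pattern = [3, 5, 9, 7, 6, 6, 5, 6, 4, 3, 2, 1]
toy = [100 + 2 * t + pattern[t % 12] for t in range(105)]

stationary = seasonal_and_regular_difference(toy)
print(len(stationary))   # 92: 12 months are lost to the seasonal step, 1 to the regular step
print(set(stationary))   # {0}: trend and seasonal pattern are both removed
```

A real incidence series would leave non-zero residual variation after differencing, which is what the AR and MA terms are then fitted to.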

BSTS models

The BSTS is a stochastic state-space method that integrates three statistical techniques: the Kalman filter, spike-and-slab regression, and Bayesian model averaging [30]. During modeling, the state components (trend, seasonality, regression, autocorrelation, and static intercept) must be specified [30]. Trend components are categorized into local level (horizontal) trends, local linear trends, semi-local linear trends, and shared local level trends [30]. Seasonal components can be monthly, biannual, or annual [30]. The regression component incorporates covariates (e.g., BI) and allows for multiple control variables. The trend and seasonality of the target series are estimated using the Kalman filter, and variable selection is performed using spike-and-slab priors, which estimate the effect size of each covariate in each Markov chain Monte Carlo (MCMC) iteration [30] (for a non-technical explanation of these terms, see Supplementary Appendix A). The “spike” prior assigns a positive probability to zero effect, while the “slab” prior is a weakly informative Gaussian prior for non-zero effects. During prediction, BSTS combines the prior information with the likelihood function to obtain the posterior distribution [33]. The MCMC algorithm is used to estimate the posterior distribution of each BSTS parameter, and Bayesian model averaging is then applied to weight and smooth the posterior results of the sampled models for prediction.
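To make the Kalman filter step concrete, the following Python sketch filters a local level model, the simplest trend component listed above. It is a simplified illustration and not the "bsts" implementation: the two variances are fixed by hand rather than sampled by MCMC, seasonal and regression components are left out, and all names and numbers are hypothetical.

```python
def local_level_filter(y, sigma2_obs, sigma2_level, m0=0.0, p0=1e7):
    """Kalman filter for y_t = mu_t + eps_t, mu_t = mu_{t-1} + eta_t.

    Returns the filtered levels plus the mean and variance of the
    one-step-ahead forecast of the next observation.
    """
    m, p = m0, p0          # diffuse prior on the initial level
    filtered = []
    for obs in y:
        p_pred = p + sigma2_level              # predict: level follows a random walk
        gain = p_pred / (p_pred + sigma2_obs)  # Kalman gain
        m = m + gain * (obs - m)               # update the level with the new observation
        p = (1 - gain) * p_pred                # update the level variance
        filtered.append(m)
    forecast_mean = m
    forecast_var = p + sigma2_level + sigma2_obs
    return filtered, forecast_mean, forecast_var

# Toy monthly counts with an abrupt drop, loosely mimicking a shock to the baseline
toy = [6000.0] * 12 + [4000.0] * 12
levels, f_mean, f_var = local_level_filter(toy, sigma2_obs=250000.0, sigma2_level=40000.0)
print(round(levels[11]), round(levels[-1]), round(f_mean))
```

The level drifts toward the new baseline after the drop without any assumption about trend direction, which is the behaviour that makes a local level component suitable for series affected by external shocks.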

Statistical analysis

The HB incidence series from January 2014 to September 2022 (105 observations) was used as the training set for model development, and the data from October 2022 to September 2023 (12 observations) served as the test set to assess the short-term forecasting accuracy of the models. We applied Seasonal and Trend decomposition using Loess (STL) to decompose the HB incidence series into trend, seasonal, and residual components. Spearman rank correlation (r_s) and cross-correlation analyses between the BI of each HB-related keyword and the actual HB incidence were performed using SPSS software (version 26). To detect multicollinearity among the keywords, r_s and the variance inflation factor (VIF) were calculated; r_s > 0.8 or VIF > 10 was taken to indicate multicollinearity [34]. The Granger causality test was conducted to examine the causal relationship between the BI of HB-related keywords and reported HB case numbers [35]. SARIMA and SARIMAX were built using the “tseries” and “forecast” packages in RStudio software (version 4.1.1), while the BSTS was constructed using the “bsts” package. The Lagrange multiplier (LM) test was employed to examine conditional heteroskedasticity (ARCH effect) in the model residuals. In addition, a sensitivity analysis was performed to evaluate the mid-term prediction accuracy of the models, using data from January 2014 to December 2020 as the training set and the remaining data (January 2021 to September 2023) as the test set. As depicted in Figure S2, the COVID-19 epidemic had a significant impact on HB in February 2020, January 2022, November 2022, and December 2022. A dummy covariate (coded “1” in these months and “0” otherwise) was therefore included in the models to mitigate the effect of COVID-19 on forecasting performance. The mean absolute deviation (MAD), root mean square error (RMSE), mean absolute percentage error (MAPE), and mean error rate (MER) were used to assess the prediction accuracy of the models. A two-sided p<0.05 was considered statistically significant.
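The data partition and the COVID-19 indicator described above can be laid out as follows. This is a minimal Python sketch with hypothetical variable names; it only constructs the month index and the 0/1 covariate and does not touch the incidence data.

```python
def month_range(start, end):
    """Inclusive list of (year, month) tuples from start to end."""
    months = []
    y, m = start
    while (y, m) <= end:
        months.append((y, m))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return months

months = month_range((2014, 1), (2023, 9))

# Main analysis: train through September 2022, test on the following 12 months
train = [ym for ym in months if ym <= (2022, 9)]
test = [ym for ym in months if ym > (2022, 9)]

# Sensitivity analysis: train through December 2020, test on the remainder
sens_train = [ym for ym in months if ym <= (2020, 12)]
sens_test = [ym for ym in months if ym > (2020, 12)]

# COVID-19 dummy covariate: 1 in the four affected months, 0 otherwise
covid_months = {(2020, 2), (2022, 1), (2022, 11), (2022, 12)}
covid_dummy = [1 if ym in covid_months else 0 for ym in months]

print(len(months), len(train), len(test))   # 117 105 12
print(len(sens_train), len(sens_test))      # 84 33
print(sum(covid_dummy))                     # 4
```

Two of the four flagged months (November and December 2022) fall inside the 12-month test window, so the covariate has to be supplied for the forecast period as well as for the training period.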

Results

Descriptive analysis

Between January 2014 and September 2023, Henan Province recorded 739,386 newly diagnosed HB cases, with annual and monthly average incidence rates of 70.49 and 5.87 per 100,000 population, respectively. As illustrated in Figs. 1A-B, HB incidence demonstrated a consistent upward trajectory prior to 2020. The post-2020 period saw alternating cycles of decline and resurgence, ultimately stabilizing at elevated levels. The peak incidence occurred in 2018 with 82,101 reported cases (83.23 per 100,000 population), representing a 1.4-fold increase compared to the 2022 trough of 60,669 cases (61.46 per 100,000 population). Figures 1C-D reveal distinct seasonal patterns, with March-August constituting the high-risk epidemic period (peaking in March) and December marking the annual nadir (Fig. 2).

Fig. 1 — Decomposition of the monthly Hepatitis B incidence time series in Henan Province (January 2014–September 2023) using Seasonal and Trend decomposition using Loess (STL). (A) Original series: The observed monthly HB incidence data; (B) Trend component: The long-term trend of HB incidence, with key periods labeled (e.g., pre-2020 increase, COVID-19 fluctuations, post-2021 stabilization); (C) Seasonal component: The repeating seasonal pattern, with annotations indicating high-risk months (March–August, seasonal factor > 1) and low-risk months (September–February, seasonal factor < 1). The peak (March) and trough (December) are explicitly marked; and (D) Residual component: The irregular fluctuations after removing trend and seasonality. The plot showed that there was an overall descending trend and then a rising trend and remarkable seasonal behavior in HB incidence in Henan

Fig. 2 — Consistent seasonal pattern of monthly Hepatitis B incidence in Henan Province, China, January 2014–September 2023. The plot illustrates stable annual epidemic cycles with consistent March peaks (seasonal factor > 1) and December troughs (seasonal factor < 1), despite disruptions from the COVID-19 pandemic

Correlation analysis between HB related keywords BI and HB cases

Spearman analysis identified seven keywords with a strong temporal correlation to HB incidence (HBsAg: r_s = 0.598; HB five items: r_s = 0.594; HB five checks: r_s = 0.574; HB DNA: r_s = 0.562; How to get HB vaccine: r_s = 0.505; HBV-DNA: r_s = 0.503; HB surface antibody weakly positive: r_s = 0.500) (Table S2). The BI of the remaining keywords demonstrated weak correlation or no statistical significance (Table S1). Figure 3A illustrates the time series of these seven keywords and HB, indicating that their fluctuation trends were relatively consistent. Additionally, the overall trend of each keyword was in line with the epidemic trend of HB.

Fig. 3 — Time series comparison between Baidu Index (BI) and reported Hepatitis B cases in Henan Province, China, January 2014–September 2023. A Search volumes for seven selected Hepatitis B-related keywords show synchronous fluctuations with case numbers. B Composite BI closely mirrors actual epidemic trends, supporting its utility as a predictive indicator

Time-lag cross-correlation analysis and construction of BI

Time-lag cross-correlation analysis showed that each of the seven selected HB keywords had a maximum cross-correlation coefficient >0.5 (p<0.001) within the lag range. The final keyword set covers diverse aspects of public engagement with HB, including diagnostic testing (HBsAg, HBV-DNA), vaccination inquiries (“how to get hepatitis B vaccine”), clinical understanding (“Hepatitis B surface antibody weakly positive”), and comprehensive searches (“Hepatitis B five items/checks”) (Table S2). This diversity ensures broad representation of search behaviors related to disease awareness, prevention, and management. Collinearity diagnostics showed that no r_s between keywords exceeded 0.8 and no VIF exceeded 5, indicating no collinearity among the keywords (Table S3). The results in Table S4 and Fig. 3A suggested a synchronization effect between HB-related keywords and HB: the cross-correlation coefficient was largest (>0.5) at lag 0. This indicated a strong correlation between the BI of these seven keywords and HB. The Granger causality test results indicated that all keywords demonstrated significant Granger causality with HB cases (Table S4). Then, the weight of each keyword was calculated using the formula, and the composite BI was constructed as the weighted sum of the keyword BIs. Spearman rank correlation analysis demonstrated a strong correlation between HB and BI (r_s=0.639). Moreover, cross-correlation analysis revealed a synchronization effect between BI and HB (Table S4), and the changing trends of the two time series were highly consistent (Fig. 3B). Therefore, BI could be considered as an indicator for predictive models.
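The weighting step can be illustrated with a short Python sketch under two stated assumptions: that each weight is a keyword's correlation divided by the sum of all seven correlations, and that the Spearman coefficients reported in the previous subsection can stand in for the lag-0 coefficients of Table S4, which is not reproduced here. The monthly search volumes are invented for the example.

```python
# Spearman coefficients reported in the text (used here as stand-ins for Table S4)
r_s = {
    "HBsAg": 0.598,
    "HB five items": 0.594,
    "HB five checks": 0.574,
    "HB DNA": 0.562,
    "How to get HB vaccine": 0.505,
    "HBV-DNA": 0.503,
    "HB surface antibody weakly positive": 0.500,
}

# Assumed weighting rule: normalise the correlations so that the weights sum to one
total = sum(r_s.values())
weights = {keyword: value / total for keyword, value in r_s.items()}

def composite_index(monthly_volumes, weights):
    """Weighted sum of the keyword search volumes for one month."""
    return sum(weights[k] * monthly_volumes[k] for k in weights)

# Hypothetical search volumes for a single month (not study data)
volumes = {keyword: 1000.0 for keyword in r_s}
volumes["HBsAg"] = 1500.0

for keyword, w in weights.items():
    print(f"{keyword}: {w:.3f}")
print(round(composite_index(volumes, weights), 1))
```

Because the seven coefficients lie between 0.500 and 0.598, the resulting weights are close to equal, so under this rule the composite index behaves much like a simple average of the keyword indices.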

Construction and prediction of SARIMA and SARIMAX models

This study constructed two main types of models: 1) SARIMA without BI; 2) SARIMAX with BI. The fitting and prediction results of both models are presented below. The ADF test on the original HB incidence series (ADF=-3.315, p=0.072) indicated that the series was not stationary. Seasonal differencing was applied first to reduce seasonal effects, but the result (ADF=-3.377, p=0.063) indicated that the series needed further differencing. After both seasonal and non-seasonal differencing (ADF=-5.768, p<0.001), the series was stationary, so d and D were both set to 1. The ACF and PACF plots of the stationary series are shown in Figure S3, and the candidate models are compared in Tables 1 and 2. By trial and error, SARIMA(1,1,1)(0,1,1)₁₂ was selected as the optimal model, with the smallest AIC (1476.342) and CAIC (1476.580) and the largest LL (-734.171). For the SARIMAX containing BI, the same method was used to select SARIMAX(1,1,1)(0,1,1)₁₂ as the best model, with the smallest AIC (1423.112), CAIC (1423.718), and BIC (1438.242) and the largest LL (-705.556). Residual diagnosis was then performed on the two best models, and the results are shown in Figure S4. Most values in the ACF and PACF analysis were within two standard deviations. The p value for the Ljung-Box Q statistic was >0.05 at all lags tested, and there was no ARCH effect in the residuals (Table 3). All diagnostic results demonstrated that the above-mentioned models were appropriate for fitting HB incidence. Finally, both models were used to predict the HB incidence from October 2022 to September 2023 (Table 4 and Figs. 4A-B).

Table 1.

Model selection criteria for seasonal autoregressive integrated moving average (SARIMA) models fitted to the training set of monthly hepatitis B incidence data in Henan Province, China, January 2014–September 2022

Models	AIC	CAIC	BIC	LL
SARIMA(1,1,1)(0,1,1)₁₂	1476.342	1476.580	1486.429	-734.171
SARIMA(0,1,1)(0,1,1)₁₂	1478.335	1478.453	1485.900	-736.168
SARIMA(2,1,0)(0,1,1)₁₂	1483.341	1483.578	1493.428	-737.670
SARIMA(1,1,0)(0,1,1)₁₂	1488.036	1488.154	1495.601	-741.018
SARIMA(1,1,1)(1,1,0)₁₂	1499.316	1499.553	1509.403	-745.658
SARIMA(0,1,1)(1,1,0)₁₂	1502.350	1502.468	1509.915	-748.175
SARIMA(2,1,0)(1,1,0)₁₂	1506.668	1506.906	1516.755	-749.334
SARIMA(1,1,0)(1,1,0)₁₂	1510.585	1510.703	1518.151	-752.293

AIC Akaike information criterion, CAIC Corrected AIC, BIC Bayesian information criterion, LL Log-likelihood

The model with the lowest AIC, CAIC, and BIC, and the highest LL was selected as optimal
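The relationship between the log-likelihood and the information criteria in Table 1 can be checked with a small Python calculation. The parameter count (k = 4: one AR, one MA, and one seasonal MA coefficient plus the innovation variance) and the effective sample size (n = 92 observations remaining after seasonal and regular differencing of 105 months) are assumptions inferred here, not values stated in the text.

```python
from math import log

def aic(loglik, k):
    """Akaike information criterion: -2 * LL + 2 * k."""
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    """Bayesian information criterion: -2 * LL + k * ln(n)."""
    return -2 * loglik + k * log(n)

loglik = -734.171    # LL of SARIMA(1,1,1)(0,1,1)12 from Table 1
k = 4                # assumed: AR(1), MA(1), seasonal MA(1), innovation variance
n = 105 - 12 - 1     # assumed: observations left after D = 1 (s = 12) and d = 1

print(round(aic(loglik, k), 3))     # 1476.342
print(round(bic(loglik, k, n), 3))  # 1486.429
```

Both results match the first row of Table 1, which supports the assumed k and n for these two criteria; the tabulated CAIC is not reproduced by the same k and n and is left out of the sketch.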

Table 2.

Model selection criteria for seasonal autoregressive integrated moving average with exogenous variables (SARIMAX) models fitted to the training set of monthly hepatitis B incidence data in Henan Province, China, January 2014–September 2022

Models	AIC	CAIC	BIC	LL
SARIMAX(1,1,1)(0,1,1)₁₂	1423.112	1423.718	1438.242	-705.556
SARIMAX(0,1,1)(0,1,1)₁₂	1433.610	1434.010	1446.219	-711.805
SARIMAX(2,1,0)(0,1,1)₁₂	1441.903	1442.509	1457.034	-714.952
SARIMAX(1,1,1)(1,1,0)₁₂	1447.526	1448.132	1462.657	-717.763
SARIMAX(0,1,1)(1,1,0)₁₂	1458.485	1458.885	1471.094	-724.242
SARIMAX(2,1,0)(1,1,0)₁₂	1465.737	1466.343	1480.868	-726.869
SARIMAX(1,1,0)(1,1,0)₁₂	1468.587	1468.987	1481.196	-729.294

AIC Akaike information criterion, CAIC Corrected AIC, BIC Bayesian information criterion, LL Log-likelihood

The model with the lowest AIC, CAIC, and BIC, and the highest LL was selected as optimal

Table 3.

Box-Ljung Q statistics and Lagrange multiplier test results for residuals from the seasonal autoregressive integrated moving average (SARIMA) and seasonal autoregressive integrated moving average with exogenous variables (SARIMAX) models for hepatitis B incidence in Henan Province, China, January 2014–September 2022

Lags	SARIMA(1,1,1)(0,1,1)₁₂				SARIMAX(1,1,1)(0,1,1)₁₂
Lags	Q	p	LM	p	Q	p	LM	p
1	0.001	0.976	2.558	0.110	0.082	0.774	0.964	0.326
2	0.039	0.981	2.523	0.283	0.554	0.758	0.908	0.635
3	0.284	0.963	2.567	0.463	1.250	0.741	0.854	0.837
4	0.937	0.919	2.525	0.640	4.685	0.321	0.897	0.925
5	0.937	0.968	2.482	0.779	8.038	0.154	1.154	0.949
6	0.946	0.988	2.455	0.874	8.194	0.224	2.077	0.913
7	1.554	0.980	2.792	0.904	8.195	0.316	3.043	0.881
8	2.564	0.959	2.757	0.949	8.615	0.376	4.211	0.838
9	3.897	0.918	3.110	0.960	9.389	0.402	4.277	0.892
10	3.922	0.951	3.170	0.977	9.915	0.448	4.694	0.911

Q Box-Ljung statistic, LM Lagrange multiplier statistic

A p-value > 0.05 indicates that the residuals are consistent with white noise and no ARCH effect is present

Table 4.

Forecasted versus observed monthly hepatitis B incidence in Henan Province, China, from October 2022 to September 2023, using seasonal autoregressive integrated moving average (SARIMA) and seasonal autoregressive integrated moving average with exogenous variables (SARIMAX) models

Time	Original observations	SARIMA(1,1,1)(0,1,1)₁₂		SARIMAX(1,1,1)(0,1,1)₁₂
Time	Original observations	Forecasts	95% CI	Forecasts	95% CI
2022-10	4178	5385	4114 ~ 6656	5135	4191 ~ 6079
2022-11	3105	5370	3966 ~ 6775	3133	2118 ~ 4147
2022-12	2778	5402	3944 ~ 6861	3036	2008 ~ 4064
2023-01	4429	5324	3830 ~ 6817	5381	4351 ~ 6410
2023-02	6795	5120	3594 ~ 6645	5641	4611 ~ 6672
2023-03	6657	6634	5078 ~ 8190	6560	5530 ~ 7591
2023-04	6282	5913	4328 ~ 7499	5927	4897 ~ 6958
2023-05	6013	5817	4203 ~ 7431	5623	4593 ~ 6654
2023-06	5521	5587	3944 ~ 7229	5469	4438 ~ 6500
2023-07	6459	5981	4310 ~ 7651	5681	4650 ~ 6711
2023-08	7020	5630	3932 ~ 7328	5551	4520 ~ 6582
2023-09	6158	5098	3373 ~ 6822	5024	3993 ~ 6055

CI Confidence interval

Forecasts are presented with 95% confidence intervals
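The four accuracy measures can be computed directly from the point forecasts in Table 4. The Python sketch below assumes the usual definitions, with MER taken as MAD divided by the mean of the observed values; that reading is consistent with the abstract, where 615.94 divided by the test-set mean of about 5449.6 gives 0.113. Function and variable names are illustrative.

```python
from math import sqrt

# Observed cases and point forecasts, October 2022 to September 2023 (Table 4)
observed = [4178, 3105, 2778, 4429, 6795, 6657, 6282, 6013, 5521, 6459, 7020, 6158]
sarima = [5385, 5370, 5402, 5324, 5120, 6634, 5913, 5817, 5587, 5981, 5630, 5098]
sarimax = [5135, 3133, 3036, 5381, 5641, 6560, 5927, 5623, 5469, 5681, 5551, 5024]

def accuracy(obs, pred):
    """Return MAD, RMSE, MAPE and MER for paired observations and forecasts."""
    n = len(obs)
    errors = [p - o for o, p in zip(obs, pred)]
    mad = sum(abs(e) for e in errors) / n
    rmse = sqrt(sum(e * e for e in errors) / n)
    mape = sum(abs(e) / o for e, o in zip(errors, obs)) / n
    mer = mad / (sum(obs) / n)   # assumed definition: MAD relative to the observed mean
    return {"MAD": mad, "RMSE": rmse, "MAPE": mape, "MER": mer}

for name, forecast in (("SARIMA", sarima), ("SARIMAX", sarimax)):
    result = accuracy(observed, forecast)
    print(name, {key: round(value, 3) for key, value in result.items()})
```

On these rows the SARIMAX forecasts give a smaller MAD than the SARIMA forecasts, in line with the comparison reported in the text; most of the gap comes from November and December 2022.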

Fig. 4 — Comparison of 12-month and 33-month ahead forecasts for monthly Hepatitis B incidence in Henan Province, China, 2014–2023, using seasonal autoregressive integrated moving average, seasonal autoregressive integrated moving average with exogenous variables, Bayesian structural time series without Baidu Index, and Bayesian structural time series with Baidu Index models. A Seasonal autoregressive integrated moving average model for 12-month ahead forecasts. B Seasonal autoregressive integrated moving average with exogenous variables model for 12-month ahead forecasts. C Bayesian structural time series model without Baidu Index for 12-month ahead forecasts. D Bayesian structural time series model with Baidu Index for 12-month ahead forecasts. E Seasonal autoregressive integrated moving average model for 33-month ahead forecasts. F Seasonal autoregressive integrated moving average with exogenous variables model for 33-month ahead forecasts. G Bayesian structural time series model without Baidu Index for 33-month ahead forecasts. H Bayesian structural time series model with Baidu Index for 33-month ahead forecasts. The Bayesian structural time series models, especially with Baidu Index, demonstrate superior predictive performance

Construction and prediction of BSTS models

The HB incidence series from January 2014 to September 2022 (105 observations) served as the training set, while data from October 2022 to September 2023 (12 months) were used as the test set to evaluate short-term (12-month ahead) forecast accuracy. During the development of the BSTS, the local level trend was ultimately selected because 1) different tests have revealed that the local horizontal trend component is more useful for prediction than those of others; and 2) it provided the best fit for our data, which exhibited stable seasonal patterns in early years followed by substantial disruptions during the COVID-19 pandemic. This component captures gradual stochastic changes in the baseline level without imposing strong assumptions about trend direction, making it suitable for series affected by external shocks. Additionally, adding more state components proved to more accurately capture other detailed changing characteristics in the training data, such as seasonality and regression components, thereby improving model prediction performance. Therefore, we incorporated both 6-month and 12-month seasonal components to capture the semi-annual and annual patterns observed in the HB incidence data. The dual seasonality specification proved essential for modeling the disrupted seasonal patterns during the pandemic period while maintaining the underlying annual cycle. For the BSTS without BI, the local level was used to capture trend characteristics, while seasonal components with periods of 6 and 12 were added to adapt to the disruption of seasonal patterns caused by the COVID-19 epidemic. For the BSTS model with BI, multiple different state components (local level, seasonality with periods of 6 and 12, autocorrelation, horizontal intercept) were added to the BSTS. When 3500 and 5000 MCMC samplings were performed for the BSTS without BI and the BSTS with BI, respectively, it was found that the sampling was in a stable state after 2540 and 1234 iterations. 
The relevant parameters generated are shown in Table S5. The probability of inclusion for BI was 100%. Further diagnosis of the BSTS residuals showed that most correlation coefficients in the ACF and PACF plots of the residuals were within the confidence interval (Figure S5). The Ljung-Box Q and LM tests were not statistically significant at any of the lags examined (Table 5), indicating that the model residuals were white noise with no ARCH effect. Therefore, these two BSTS models were used to predict the HB incidence from October 2022 to September 2023 (Table 6 and Figs. 4C-D).
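As a rough illustration of the local level component described above, the sketch below runs a Kalman filter for the model y_t = mu_t + eps_t, mu_{t+1} = mu_t + eta_t. It is a minimal sketch under stated assumptions, not the fitted BSTS: the variances `sigma2_obs` and `sigma2_level` are fixed, hypothetical inputs (a BSTS would sample them by MCMC), the seasonal, autocorrelation, and regression components are omitted, and the function name is illustrative only.

```python
def local_level_filter(y, sigma2_obs, sigma2_level, level0=0.0, var0=1e9):
    """Kalman filter for y_t = mu_t + eps_t, mu_{t+1} = mu_t + eta_t.

    sigma2_obs and sigma2_level are assumed known (hypothetical values);
    a large var0 stands in for a diffuse prior on the initial level.
    """
    level, var = level0, var0
    filtered = []
    for obs in y:
        gain = var / (var + sigma2_obs)       # weight given to the new observation
        level = level + gain * (obs - level)  # updated estimate of the level
        var = (1.0 - gain) * var              # posterior variance of the level
        filtered.append(level)
        var = var + sigma2_level              # random-walk step to the next month
    return filtered


# Observed monthly counts for the test period (Table 6), used only as sample input.
observed = [4178, 3105, 2778, 4429, 6795, 6657, 6282, 6013, 5521, 6459, 7020, 6158]

smooth = local_level_filter(observed, sigma2_obs=1.0, sigma2_level=0.0)
reactive = local_level_filter(observed, sigma2_obs=1.0, sigma2_level=1e6)
```

With `sigma2_level = 0` the filtered level collapses to the running mean of the series, whereas a very large `sigma2_level` makes it follow each observation almost exactly. The level variance therefore controls how quickly the baseline adapts to shocks such as those seen during the pandemic.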

Table 5.

Box-Ljung Q statistics and Lagrange multiplier test results for residuals from the Bayesian Structural Time Series (BSTS) models for hepatitis B incidence in Henan Province, China, January 2014–September 2022

Lags	BSTS (without BI)				BSTS (with BI)
Lags	Q	p	LM	p	Q	p	LM	p
1	0.056	0.812	3.304	0.069	0.003	0.956	0.845	0.845
2	2.964	0.227	3.360	0.186	1.494	0.474	0.519	0.771
3	3.864	0.277	3.371	0.338	2.410	0.492	0.591	0.899
4	6.987	0.137	3.403	0.493	5.630	0.229	1.801	0.772
5	7.480	0.187	3.387	0.641	8.605	0.126	2.231	0.816
6	8.116	0.230	3.382	0.760	8.769	0.187	2.745	0.840
7	9.097	0.246	3.873	0.794	8.826	0.265	4.238	0.752
8	10.401	0.238	3.836	0.872	9.428	0.308	4.165	0.842
9	12.670	0.178	4.154	0.901	10.494	0.312	4.363	0.886
10	12.715	0.240	4.235	0.936	11.421	0.326	4.915	0.897

Q Box-Ljung statistic, LM Lagrange multiplier statistic, BI Baidu Index

A p-value > 0.05 indicates that the residuals are consistent with white noise and no ARCH effect is present
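The relationship between the Q statistics and p-values in Table 5 can be checked with a short standard-library sketch. It assumes the usual Ljung-Box form Q = n(n+2) Σ ρ_k²/(n−k) and a chi-square reference distribution whose degrees of freedom equal the number of lags; that choice is an assumption here, although it agrees with the tabulated values to within rounding. The function names are illustrative.

```python
import math


def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic accumulated over lags 1..max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    y = x / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-y)              # closed form for df = 2
    else:
        a, tail = 0.5, math.erfc(math.sqrt(y))   # closed form for df = 1
    while a < df / 2.0:                          # step df up by 2 each pass
        tail += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return tail


# Q values for the BSTS without BI, copied from Table 5 as (lag, Q).
table5_q = [(1, 0.056), (2, 2.964), (4, 6.987), (10, 12.715)]
p_values = {lag: chi2_sf(q, lag) for lag, q in table5_q}
```

All four recomputed p-values exceed 0.05, which is the criterion the table note uses for white-noise residuals.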

Table 6.

Forecasted versus observed monthly hepatitis B incidence in Henan Province, China, from October 2022 to September 2023, using Bayesian Structural Time Series (BSTS) models with and without Baidu index

Time	Original observations	BSTS (without BI)		BSTS (with BI)
Time	Original observations	Forecasts	95% CI	Forecasts	95% CI
2022-10	4178	5441	4043 ~ 6838	5191	4204 ~ 6158
2022-11	3105	5464	3914 ~ 6864	3102	1909 ~ 4339
2022-12	2778	5449	4036 ~ 6923	3054	1789 ~ 4280
2023-01	4429	5377	3925 ~ 6800	5473	4429 ~ 6504
2023-02	6795	5182	3605 ~ 6903	5680	4653 ~ 6722
2023-03	6657	6677	5142 ~ 8423	6584	5564 ~ 7641
2023-04	6282	6007	4374 ~ 7689	5947	4887 ~ 6983
2023-05	6013	5895	4302 ~ 7531	5675	4647 ~ 6732
2023-06	5521	5691	3982 ~ 7550	5555	4530 ~ 6603
2023-07	6459	6074	4346 ~ 7942	5761	4711 ~ 6827
2023-08	7020	5771	3853 ~ 7615	5595	4544 ~ 6655
2023-09	6158	5252	3505 ~ 7255	5119	4091 ~ 6154

CI Confidence interval, BI Baidu index

Forecasts are presented with 95% confidence intervals

Sensitivity analysis

To assess mid-term (33-month ahead) predictive performance, a sensitivity analysis was conducted using data from January 2014 to December 2020 for training and the remaining period (January 2021–September 2023) for testing. The construction process of SARIMA and SARIMAX led to the selection of SARIMA(0,1,1)(0,1,1)₁₂ and SARIMAX(1,1,1)(2,1,0)₁₂ as the best models for fitting the HB incidence series. Tables S6-S10 and Figs. 4E-F and S6 summarize the parameter coefficients of both models, their significance testing, model selection, residual diagnosis, and prediction results. For the BSTS, the training data combined a relatively stable seasonal pattern in the early period with large trend fluctuations caused by the COVID-19 epidemic in the later period. After multiple attempts, we found that the local linear trend captured these fluctuations more readily. The BSTS without BI therefore combined the local linear trend, 12-month and 6-month seasonality, and autocorrelation components; for the BSTS with BI, a static intercept component was further added. With 3500 MCMC iterations for the BSTS without BI and 5000 for the BSTS with BI, the sampling stabilized after 1015 and 559 iterations, respectively. The relevant parameters are shown in Table S5, and the probability of inclusion for BI was 100%. Figure S7 and Table S11 indicate that the BSTS residuals were white noise with no ARCH effect. Finally, the BSTS models were used for prediction (Table S12 and Figs. 4G-H).
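The two train/test partitions can be made concrete with a small sketch that enumerates the monthly time points and splits them at the stated cut-off dates. The helper names `month_range` and `split_at` are hypothetical; only the dates come from the text.

```python
def month_range(start, end):
    """Inclusive list of (year, month) tuples from start to end."""
    months = []
    year, month = start
    while (year, month) <= end:
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def split_at(months, last_train):
    """Split time points into training (<= last_train) and test (> last_train)."""
    train = [ym for ym in months if ym <= last_train]
    test = [ym for ym in months if ym > last_train]
    return train, test


all_months = month_range((2014, 1), (2023, 9))

# Main analysis: train through September 2022, test on the following 12 months.
main_train, main_test = split_at(all_months, (2022, 9))

# Sensitivity analysis: train through December 2020, test on the remaining months.
sens_train, sens_test = split_at(all_months, (2020, 12))
```

The main split yields 105 training and 12 test months, and the sensitivity split yields 84 training and 33 test months, matching the horizons described in the text.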

Prediction effect evaluation

Table 7 and Figs. 4A-H compare the prediction accuracy and reliability of the optimal SARIMA, SARIMAX, BSTS without BI, and BSTS with BI models for both training sets. Each BSTS model exhibited lower prediction errors than its SARIMA-family counterpart (BSTS without BI versus SARIMA, and BSTS with BI versus SARIMAX), indicating superior predictive performance. Furthermore, incorporating BI (BSTS with BI and SARIMAX) reduced prediction errors relative to the counterparts without BI (BSTS without BI and SARIMA); in the 12-month ahead forecasts, for example, the MAPE of the BSTS fell from 0.246 to 0.112. These findings were consistently observed in the sensitivity analysis, confirming the enhanced predictive accuracy of models integrating BI.

Table 7.

Predictive performance metrics of seasonal autoregressive integrated moving average (SARIMA), seasonal autoregressive integrated moving average with exogenous variables (SARIMAX), Bayesian Structural Time Series (BSTS), and Bayesian structural time series with Baidu index models for hepatitis B incidence forecasting in Henan Province, China

Models	Testing Horizons
Models	MAPE	MAD	RMSE	MER
12-month ahead forecasts
SARIMA	0.247	1020.713	1308.543	0.187
SARIMAX	0.115	635.231	793.726	0.117
BSTS without BI	0.246	998.223	1305.365	0.183
BSTS with BI	0.112	615.942	777.992	0.113
33-month ahead forecasts
SARIMA	0.157	699.763	1025.342	0.124
SARIMAX	0.104	519.565	697.182	0.092
BSTS without BI	0.153	692.886	996.436	0.123
BSTS with BI	0.094	492.951	663.745	0.087

MAPE Mean absolute percentage error, MAD Mean absolute deviation, RMSE Root mean square error, MER Mean error rate

Lower values indicate better predictive performance
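To show how the four metrics relate to the forecasts, the sketch below recomputes them from the rounded values in Table 6. The formulas are the usual definitions and are an assumption here, since this section does not restate them: MAD is the mean absolute error, RMSE the root of the mean squared error, MAPE the mean of |error|/observed, and MER the MAD divided by the mean of the observations. Because Table 6 reports rounded forecasts, the results match the 12-month rows of Table 7 only to within rounding.

```python
import math

# Observed counts and BSTS point forecasts, October 2022 to September 2023 (Table 6).
observed = [4178, 3105, 2778, 4429, 6795, 6657, 6282, 6013, 5521, 6459, 7020, 6158]
bsts_without_bi = [5441, 5464, 5449, 5377, 5182, 6677, 6007, 5895, 5691, 6074, 5771, 5252]
bsts_with_bi = [5191, 3102, 3054, 5473, 5680, 6584, 5947, 5675, 5555, 5761, 5595, 5119]


def forecast_metrics(obs, pred):
    """Return MAPE, MAD, RMSE, and MER for paired observations and forecasts."""
    n = len(obs)
    abs_err = [abs(o - p) for o, p in zip(obs, pred)]
    mad = sum(abs_err) / n
    rmse = math.sqrt(sum(e * e for e in abs_err) / n)
    mape = sum(e / o for e, o in zip(abs_err, obs)) / n
    mer = mad / (sum(obs) / n)
    return {"MAPE": mape, "MAD": mad, "RMSE": rmse, "MER": mer}


metrics_without_bi = forecast_metrics(observed, bsts_without_bi)
metrics_with_bi = forecast_metrics(observed, bsts_with_bi)

for name, m in [("BSTS without BI", metrics_without_bi), ("BSTS with BI", metrics_with_bi)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

MAPE weights each month by its own observed count, so the large relative misses in the low-incidence months of November and December 2022 dominate the score for the model without BI. MER scales the total absolute error by the total observed count and is less sensitive to those months.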

Discussion

Impact of the COVID-19 pandemic on HB epidemiology

This study underscores the significant interplay between public health emergencies and the epidemiology of existing diseases like HB. The consistent growth trend in HB incidence in Henan before 2020, likely driven by its large population and increasing migrant numbers [1], was abruptly disrupted by the COVID-19 pandemic. The sharp decline in HB cases in 2020 can be attributed to non-pharmaceutical interventions and their knock-on effects: an overburdened health system reduced non-emergency services, and public fear of contracting COVID-19 discouraged hospital visits [11, 36]. The subsequent gradual return of HB incidence to pre-2020 levels by 2021 likely reflects the effective efforts of health authorities to resume normal services, including HB screening and resource allocation.

Notably, the pandemic’s influence was multifaceted. While control measures initially suppressed HB reporting, treatments for severe COVID-19, such as tocilizumab and corticosteroids, may have increased the risk of HBV reactivation [37]. Furthermore, subsequent waves of COVID-19, including the Omicron variant emergence and the sudden shift in China’s pandemic policy in late 2022, repeatedly shifted medical resources and public focus [38], leading to transient decreases in HB detection rates. After 2023, with the cessation of specific COVID-19 responses, HB testing and reporting normalized. This period highlights the complex and dynamic impact of a major public health event on disease surveillance and incidence patterns.

Seasonal patterns of HB

Throughout the study period, HB incidence in Henan exhibited a distinct seasonal pattern, peaking in March and reaching a trough in December. The March peak aligns with findings from Zhao et al. [1] and may be associated with the massive population movement following the Spring Festival holiday, potentially increasing transmission opportunities. The consistent low point in December, however, may represent a specific characteristic of the studied period. The COVID-19 pandemic has been shown to alter the seasonal characteristics of other infectious diseases [39]. It is plausible that the pandemic response measures also influenced HB’s seasonality, making the December trough a particular feature from 2014 to 2021, though this requires further investigation.

Baidu index as a supplementary surveillance tool

Traditional disease surveillance systems often face challenges like reporting delays. This study demonstrates the value of incorporating internet search data, specifically the BI, to overcome these limitations. We identified seven HB-related keywords (e.g., HBsAg, HB vaccine, HB-DNA) that showed a strong positive correlation with official HB case numbers. This indicates heightened public attention to these specific topics as the epidemic evolves, consistent with previous research in China [25, 28].

Cross-correlation analysis revealed no time lag between these keywords and case reports, demonstrating a synchronization effect [20]. Importantly, the Granger causality test results indicate that all keywords demonstrate significant Granger causality with HB cases. This supports the potential use of these search terms as leading indicators for HB surveillance. This real-time nature of search data contrasts with some studies on COVID-19, where BI showed leading or lagging effects depending on the case definition (new vs. cumulative) [40]. The synchronization observed here likely stems from the public’s use of search engines for immediate health information. The ability of BI to mirror epidemic trends in real-time supports its utility as an early warning indicator and a tool for monitoring the dynamic changes of the HB epidemic. Furthermore, it is crucial to acknowledge that BI trends can be influenced by external factors not directly related to disease prevalence, such as media coverage, public health campaigns, holidays, or social events, which could temporarily inflate search volumes for specific keywords. Therefore, continuous refinement of keyword selection and validation against gold-standard surveillance data remains essential.
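The cross-correlation check described above can be sketched as follows: compute the Pearson correlation between the search series shifted by k months and the case series, then see which k gives the largest value. This is a minimal illustration on synthetic data, not the study's BI or case series; the function names and the sign convention (k > 0 means searches lead cases) are assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def cross_correlation(search, cases, max_lag):
    """Correlate search[t - k] with cases[t] for k in -max_lag..max_lag."""
    n = len(cases)
    result = {}
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            x, y = search[: n - k], cases[k:]   # searches precede cases by k
        else:
            x, y = search[-k:], cases[: n + k]  # searches follow cases by |k|
        result[k] = pearson(x, y)
    return result


# Synthetic monthly "cases": an irregular component plus an annual cycle.
cases = [(t * 37) % 17 + 5 * math.sin(2 * math.pi * t / 12) for t in range(60)]

# Synchronous search series: a rescaled copy of the cases.
search_sync = [2 * c + 10 for c in cases]
# Leading search series: moves two months ahead of the cases (tail padded).
search_lead = cases[2:] + [cases[-1], cases[-1]]

sync_cc = cross_correlation(search_sync, cases, max_lag=3)
lead_cc = cross_correlation(search_lead, cases, max_lag=3)
best_sync = max(sync_cc, key=sync_cc.get)
best_lead = max(lead_cc, key=lead_cc.get)
```

In this synthetic setup the synchronous series peaks at lag 0 and the leading series at lag 2. A peak at lag 0, as reported for the HB keywords, means the searches move with the reported cases within the same month.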

Model performance and comparison

Our modeling results highlight the superiority of the BSTS model for this dataset. The standalone SARIMA model produced less accurate predictions. Incorporating BI as a covariate significantly improved the prediction accuracy of both SARIMAX and, more notably, BSTS models. This aligns with studies using BI to predict other diseases [24, 41]. Crucially, our study is the first to integrate the COVID-19 epidemic context into the model, thereby more realistically capturing the HB trends during this disruptive period and underscoring the value of BI in future infectious disease prediction systems.

The BSTS model demonstrated smaller prediction errors and greater robustness than SARIMAX. Theoretically, several inherent features of the BSTS framework account for its superior performance: 1) Handling of time-varying parameters: Unlike SARIMA, which relies on fixed parameters, BSTS allows model parameters to evolve dynamically over time via a state-space approach [42]. This flexibility is critical for capturing stochastic behaviors and adapting to structural changes in the data, such as the profound disruptions caused by the COVID-19 pandemic. 2) Modularity and resistance to structural changes: The BSTS model is modular, systematically combining components like trend, seasonality, and covariates. This structure allows it to be more robust to structural breaks. The Spike-and-Slab priors facilitate an automated and effective variable selection process, enhancing model reliability and interpretability while preventing overfitting [31, 43]. 3) Integration of complex covariates [31]: BSTS can seamlessly incorporate complex covariate structures and their potential time-lagged effects, making it inherently superior to traditional models like SARIMAX for integrating dynamic external regressors like the Baidu Index.

For short-term predictions, the performance difference between SARIMAX (MAPE: 0.115) and BSTS (MAPE: 0.112) was marginal, confirming that SARIMAX remains a satisfactory and mainstream tool for short-term forecasting [44]. However, for medium-term predictions, the BSTS model demonstrated a distinct advantage, achieving a MAPE of less than 10%, which is consistent with its documented strengths in longer-term forecasts [30, 33, 45]. It is important to acknowledge the limitations of BSTS, primarily its relatively high computational complexity. This complexity may make SARIMAX a more practical choice for small sample sizes or when only short-term predictions are required. As novel combination models continue to emerge, further comparisons with BSTS are warranted to validate predictive accuracy.

Limitations

This study has several limitations. First, the BI keywords are susceptible to changes in personal search behavior and may not encompass all relevant terms, potentially leading to an underestimation of relevance. Keyword volatility can also be influenced by media coverage or social events, necessitating continuous updates to the keyword list. Second, HB incidence is influenced by multifaceted factors like climate and social conditions, which were not fully integrated, potentially affecting prediction accuracy. Future work should incorporate multiple data sources. Third, the study is based solely on data from Henan Province, which may limit the generalizability of the findings. Expanding the research to other regions is essential for validation. Fourth, the analysis is based on observational data and cannot establish causality. Finally, the relatively small sample size, coupled with the disrupted seasonal patterns due to COVID-19, introduced significant errors in the training data. Extending the research period to provide more stable data would further enhance the model’s performance.

Conclusions

This study establishes that combining BI with BSTS modeling creates a superior framework for HB incidence forecasting in Henan Province, outperforming traditional SARIMA models. Our work provides three principal contributions: 1) a novel methodology integrating internet search data with advanced time series analysis for HB surveillance, 2) empirical validation of BI’s utility as an effective real-time indicator, and 3) a flexible modeling approach that successfully captures complex epidemic disruptions, including those caused by the COVID-19 pandemic. For public health practice, these findings advocate for the formal integration of internet search data into infectious disease early warning systems to enable more proactive interventions. Future research should focus on several promising directions: incorporating mobile-specific BI data streams, accounting for media coverage variables that may influence search behavior, and validating this framework across multiple regions to establish generalizability and build more robust national surveillance capabilities.

Supplementary Information

Supplementary Material 1. (477.8 KB, docx)

Acknowledgements

We thank the Henan Provincial Health Commission for sharing the HB morbidity data.

Authors' contributions

Wang YB conceived, initiated, and performed this work. Lan XX, Hu P, Li L and Xu CJ collected, analyzed, and interpreted the data for this study. Wang YB, Lan XX, Hu P, Li L and Xu CJ edited and improved the original manuscript. All authors reviewed and approved the manuscript.

Funding

This work was supported by the Natural Science Foundation in Henan Province and the Open Project Program of Priority funding of The First Hospital of Xinxiang Medical University (222300420265 and XZZX2022002).

Data availability

The data supporting the findings of this study are publicly available through the CHARLS repository. Researchers can access the datasets by registering and submitting a request via the official CHARLS website: http://charls.pku.edu.cn. The use of CHARLS data complies with the terms and conditions outlined by the CHARLS team, including restrictions on redistribution and requirements for proper citation.

Declarations

Ethics approval and consent to participate

The institutional review board of Xinxiang Medical University approved this study protocol (No: XYLL-2019072). All methods were carried out under relevant guidelines and regulations. The need for informed consent was waived by the study Ethics Committee of Xinxiang Medical University because the HB cases were shared anonymously and we cannot access any identifying information of the patients (available from: https://wsjkw.henan.gov.cn/).

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Yongbin Wang, Email: wybwho@163.com.

Chunjie Xu, Email: xuchunjie@imb.pumc.edu.cn.

References

1. Zhao D, Zhang H, Cao Q, Wang Z, Zhang R. The research of SARIMA model for prediction of hepatitis B in Mainland China. Med (Baltim). 2022;101(23):e29317.
2. WHO, Hepatitis B. Available from: https://www.who.int/news-room/fact-sheets/detail/hepatitis-b. Accessed 13 Nov 2025.
3. Lin CL, Kao JH, Hepatitis B. Immunization and impact on natural history and cancer incidence. Gastroenterol Clin North Am. 2020;49(2):201–14.
4. Zampino R, Boemio A, Sagnelli C, Alessio L, Adinolfi LE, Sagnelli E, et al. Hepatitis B virus burden in developing countries. World J Gastroenterol. 2015;21(42):11941–53.
5. Liu J, Liang W, Jing W, Liu M. Countdown to 2030: eliminating hepatitis B disease, China. Bull World Health Organ. 2019;97(3):230–8.
6. Chen S, Mao W, Guo L, Zhang J, Tang S. Combating hepatitis B and C by 2030: achievements, gaps, and options for actions in China. BMJ Global Health. 2020;5(6):e002306.
7. Wang YW, Shen ZZ, Jiang Y. Comparison of ARIMA and GM(1,1) models for prediction of hepatitis B in China. PLoS ONE. 2018;13(9):e0201987.
8. Kemp L, Clare KE, Brennan PN, Dillon JF. New horizons in hepatitis B and C in the older adult. Age Ageing. 2019;48(1):32–7.
9. Shi Y, Zheng M. Hepatitis B virus persistence and reactivation. BMJ. 2020;370:m2200.
10. Kim AY, Hepatitis B, Virus, Coinfection HIV. Fibrosis, Fat, and future directions. Am J Gastroenterol. 2019;114(5):710–2.
11. Li X, Li Y, Xu S, Wang P, Hu M, Li H, et al. Evaluation of the impact of COVID-19 on hepatitis B in Henan Province and its epidemic trend based on bayesian structured time series model. BMC Public Health. 2025;25(1):1312.
12. Guo YH, Chen YP, Dou QH, Liu Q, Yang JH, Seng MH, et al. Seroepidemiological analysis of hepatitis B virus infection among adolescents aged 0–14 years in Henan Province and preliminary evaluation of the effectiveness of childhood hepatitis B vaccine immunization program. Zhonghua Yu Fang Yi Xue Za Zhi. 2024;58(2):202–7.
13. Yong Hao G, Da Xing F, Jin X, Xiu Hong F, Pu Mei D, Jun L, et al. The prevalence of hepatitis B infection in central china: an adult population-based serological survey of a large sample size. J Med Virol. 2017;89(3):450–7.
14. Wang Y, Zhou H, Zheng L, Li M, Hu B. Using the Baidu index to predict trends in the incidence of tuberculosis in Jiangsu Province, China. Front Public Health. 2023;11:1203628.
15. Hang CN, Yu PD, Chen S, Tan CW, Chen G. MEGA: machine Learning-Enhanced graph analytics for infodemic risk management. IEEE J Biomedical Health Inf. 2023;27(12):6100–11.
16. Gallotti R, Valle F, Castaldo N, Sacco P, De Domenico M. Assessing the risks of ‘infodemics’ in response to COVID-19 epidemics. Nat Hum Behav. 2020;4(12):1285–93.
17. Fallatah DI, Adekola HA. Digital epidemiology: Harnessing big data for early detection and monitoring of viral outbreaks. Infect Prev Pract. 2024;6(3):100382.
18. Tan CW, Yu PD. Contagion source detection in epidemic and infodemic outbreaks: mathematical analysis and network algorithms. Found Trends® Netw. 2023;13(2–3):106–251.
19. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. Detecting influenza epidemics using search engine query data. Nature. 2009;457(7232):1012–4.
20. Li K, Liu M, Feng Y, Ning C, Ou W, Sun J, et al. Using Baidu search engine to monitor AIDS epidemics inform for targeted intervention of HIV/AIDS in China. Sci Rep. 2019;9(1):320.
21. Kachlik Z, Walaszek M, Nazar W, Sokolowska M, Karbiak A, Pilarska E, et al. Predicting suicide attempt trends in youth: A machine learning analysis using Google trends and historical data. J Clin Med. 2025;14(18):6373.
22. Ayyoubzadeh SM, Ayyoubzadeh SM, Zahedi H, Ahmadi M, Kalhori SR. Predicting COVID-19 incidence through analysis of google trends data in Iran: data mining and deep learning pilot study. JMIR Pub Health Surveill. 2020;6(2):e18828.
23. Wang Z, He J, Jin B, Zhang L, Han C, Wang M, et al. Using Baidu index data to improve chickenpox surveillance in Yunnan, china: infodemiology study. J Med Internet Res. 2023;25:e44186.
24. Zhou W, Zhong L, Tang X, Huang T, Xie Y. Early warning and monitoring of COVID-19 using the Baidu search index in China. J Infect. 2022;84(5):e82–4.
25. Li Z, Liu T, Zhu G, Lin H, Zhang Y, He J, et al. Dengue Baidu search index data can improve the prediction of local dengue epidemic: A case study in Guangzhou, China. PLoS Negl Trop Dis. 2017;11(3):e0005354.
26. Tran NK, Kretsch C, LaValley C, Rashidi HH. Machine learning and artificial intelligence for the diagnosis of infectious diseases in immunocompromised patients. Curr Opin Infect Dis. 2023;36(4):235–42.
27. Qiu H, Zhao H, Xiang H, Ou R, Yi J, Hu L, et al. Forecasting the incidence of mumps in Chongqing based on a SARIMA model. BMC Public Health. 2021;21(1):373.
28. Liu J, Yu F, Song H. Application of SARIMA model in forecasting and analyzing inpatient cases of acute mountain sickness. BMC Public Health. 2023;23(1):56.
29. Gianacas C, Liu B, Kirk M, Di Tanna GL, Belcher J, Blogg S, et al. Bayesian structural time series, an alternative to interrupted time series in the right circumstances. J Clin Epidemiol. 2023;163:102–10.
30. Scott SL, Varian HR. Predicting the present with bayesian structural time series. Int J Math Modelling Numer Optimisation. 2014;5(1/2):4.
31. McQuire C, Tilling K, Hickman M, de Vocht F. Forecasting the 2021 local burden of population alcohol-related harms using bayesian structural time-series. Addiction. 2019;114(6):994–1003.
32. Jang G, Seo J, Lee H. Analyzing the impact of COVID-19 on seasonal infectious disease outbreak detection using hybrid SARIMAX-LSTM model. J Infect Public Health. 2025;18(7):102772.
33. Vavilala H, Yaladanda N, Krishna Kondeti P, Rafiq U, Mopuri R, Gouda KC, et al. Weather integrated malaria prediction system using bayesian structural time series model for Northeast States of India. Environ Sci Pollut Res Int. 2022;29(45):68232–46.
34. Mason CH, Perreault WD. Collinearity, Power, and interpretation of multiple regression analysis. J Mark Res. 1991;28(3):268–80.
35. Şanlıtürk D. The causal link between air pollution and respiratory diseases: evidence from Granger causality test. Thorac Res Pract. 2025;26(6):314–22.
36. Sun X, Xu Y, Zhu Y, Tang F. Impact of non-pharmaceutical interventions on the incidences of vaccine-preventable diseases during the COVID-19 pandemic in the Eastern of China. Hum Vaccin Immunother. 2021;17(11):4083–9.
37. Alqahtani SA, Buti M. COVID-19 and hepatitis B infection. Antivir Ther. 2021;25(8):387–97.
38. Xiao J, Liu L, Peng Y, Wen Y, Lv X, Liang L, et al. Anxiety, depression, and insomnia among nurses during the full liberalization of COVID-19: a multicenter cross-sectional analysis of the high-income region in China. Front Public Health. 2023;11:1179755.
39. Wu H, Xue M, Wu C, Lu Q, Ding Z, Wang X, et al. Trend of hand, foot, and mouth disease from 2010 to 2021 and Estimation of the reduction in enterovirus 71 infection after vaccine use in Zhejiang Province. China. 2022;17(9):e0274421.
40. Gong X, Han Y, Hou M, Guo R. Online public attention during the early days of the COVID-19 pandemic: infoveillance study based on Baidu index. JMIR Public Health Surveillance. 2020;6(4):e23098.
41. Zhao C, Yang Y, Wu S, Wu W, Xue H, An K, et al. Search trends and prediction of human brucellosis using Baidu index data from 2011 to 2018 in China. Sci Rep. 2020;10(1):5896.
42. Navas Thorakkattle M, Farhin S, Khan AA. Forecasting the trends of Covid-19 and causal impact of vaccines using bayesian structural time series and ARIMA. Annals Data Sci. 2022;9(5):1025–47.
43. Brodersen KH, Gallusser F, Koehler J, Remy N, Scott SL. Inferring causal impact using bayesian structural time-series models. Annals Appl Stat. 2015;9(1):247–74.
44. Chen H, Lin MX, Wang LP, Huang YX, Feng Y, Fang LQ, et al. Driving role of Climatic and socioenvironmental factors on human brucellosis in china: machine-learning-based predictive analyses. Infect Dis Poverty. 2023;12(1):36.
45. Ding W, Li Y, Bai Y, Li Y, Wang L, Wang Y. Estimating the effects of the COVID-19 outbreak on the reductions in tuberculosis cases and the epidemiological trends in china: A causal impact analysis. Infect Drug Resist. 2021;14:4641–55.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Material 1.^{(477.8KB, docx)}

Data Availability Statement

[CR1] 1.Zhao D, Zhang H, Cao Q, Wang Z, Zhang R. The research of SARIMA model for prediction of hepatitis B in Mainland China. Med (Baltim). 2022;101(23):e29317. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR2] 2.WHO, Hepatitis B. Available from: https://www.who.int/news-room/fact-sheets/detail/hepatitis-b. Accessed 13 Nov 2025.

[CR3] 3.Lin CL, Kao JH, Hepatitis B. Immunization and impact on natural history and cancer incidence. Gastroenterol Clin North Am. 2020;49(2):201–14. [DOI] [PubMed] [Google Scholar]

[CR4] 4.Zampino R, Boemio A, Sagnelli C, Alessio L, Adinolfi LE, Sagnelli E, et al. Hepatitis B virus burden in developing countries. World J Gastroenterol. 2015;21(42):11941–53. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR5] 5.Liu J, Liang W, Jing W, Liu M. Countdown to 2030: eliminating hepatitis B disease, China. Bull World Health Organ. 2019;97(3):230–8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR6] 6.Chen S, Mao W, Guo L, Zhang J, Tang S. Combating hepatitis B and C by 2030: achievements, gaps, and options for actions in China. BMJ Global Health. 2020;5(6):e002306. [DOI] [PMC free article] [PubMed]

[CR7] 7.Wang YW, Shen ZZ, Jiang Y. Comparison of ARIMA and GM(1,1) models for prediction of hepatitis B in China. PLoS ONE. 2018;13(9):e0201987. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR8] 8.Kemp L, Clare KE, Brennan PN, Dillon JF. New horizons in hepatitis B and C in the older adult. Age Ageing. 2019;48(1):32–7. [DOI] [PubMed] [Google Scholar]

[CR9] 9.Shi Y, Zheng M. Hepatitis B virus persistence and reactivation. BMJ. 2020;370:m2200. [DOI] [PubMed] [Google Scholar]

[CR10] 10.Kim AY, Hepatitis B, Virus, Coinfection HIV. Fibrosis, Fat, and future directions. Am J Gastroenterol. 2019;114(5):710–2. [DOI] [PubMed] [Google Scholar]

[CR11] 11.Li X, Li Y, Xu S, Wang P, Hu M, Li H, et al. Evaluation of the impact of COVID-19 on hepatitis B in Henan Province and its epidemic trend based on bayesian structured time series model. BMC Public Health. 2025;25(1):1312. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR12] 12.Guo YH, Chen YP, Dou QH, Liu Q, Yang JH, Seng MH, et al. Seroepidemiological analysis of hepatitis B virus infection among adolescents aged 0–14 years in Henan Province and preliminary evaluation of the effectiveness of childhood hepatitis B vaccine immunization program. Zhonghua Yu Fang Yi Xue Za Zhi. 2024;58(2):202–7. [DOI] [PubMed] [Google Scholar]

[CR13] 13.Yong Hao G, Da Xing F, Jin X, Xiu Hong F, Pu Mei D, Jun L, et al. The prevalence of hepatitis B infection in central china: an adult population-based serological survey of a large sample size. J Med Virol. 2017;89(3):450–7. [DOI] [PubMed] [Google Scholar]

[CR14] 14.Wang Y, Zhou H, Zheng L, Li M, Hu B. Using the Baidu index to predict trends in the incidence of tuberculosis in Jiangsu Province, China. Front Public Health. 2023;11:1203628. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR15] 15.Hang CN, Yu PD, Chen S, Tan CW, Chen G. MEGA: machine Learning-Enhanced graph analytics for infodemic risk management. IEEE J Biomedical Health Inf. 2023;27(12):6100–11. [DOI] [PubMed] [Google Scholar]

[CR16] 16.Gallotti R, Valle F, Castaldo N, Sacco P, De Domenico M. Assessing the risks of ‘infodemics’ in response to COVID-19 epidemics. Nat Hum Behav. 2020;4(12):1285–93. [DOI] [PubMed] [Google Scholar]

[CR17] 17.Fallatah DI, Adekola HA. Digital epidemiology: Harnessing big data for early detection and monitoring of viral outbreaks. Infect Prev Pract. 2024;6(3):100382. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR18] 18.Tan CW, Yu PD. Contagion source detection in epidemic and infodemic outbreaks: mathematical analysis and network algorithms. Found Trends^® Netw. 2023;13(2–3):106–251. [Google Scholar]

[CR19] 19.Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. Detecting influenza epidemics using search engine query data. Nature. 2009;457(7232):1012–4. [DOI] [PubMed] [Google Scholar]

[CR20] 20.Li K, Liu M, Feng Y, Ning C, Ou W, Sun J, et al. Using Baidu search engine to monitor AIDS epidemics inform for targeted intervention of HIV/AIDS in China. Sci Rep. 2019;9(1):320. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR21] 21.Kachlik Z, Walaszek M, Nazar W, Sokolowska M, Karbiak A, Pilarska E et al. Predicting suicide attempt trends in youth: A machine learning analysis using Google trends and historical data. J Clin Med. 2025;14(18):6373. [DOI] [PMC free article] [PubMed]

22. Ayyoubzadeh SM, Ayyoubzadeh SM, Zahedi H, Ahmadi M, Kalhori SR. Predicting COVID-19 incidence through analysis of Google Trends data in Iran: data mining and deep learning pilot study. JMIR Public Health Surveill. 2020;6(2):e18828.

23. Wang Z, He J, Jin B, Zhang L, Han C, Wang M, et al. Using Baidu index data to improve chickenpox surveillance in Yunnan, China: infodemiology study. J Med Internet Res. 2023;25:e44186.

24. Zhou W, Zhong L, Tang X, Huang T, Xie Y. Early warning and monitoring of COVID-19 using the Baidu search index in China. J Infect. 2022;84(5):e82–4.

25. Li Z, Liu T, Zhu G, Lin H, Zhang Y, He J, et al. Dengue Baidu search index data can improve the prediction of local dengue epidemic: a case study in Guangzhou, China. PLoS Negl Trop Dis. 2017;11(3):e0005354.

26. Tran NK, Kretsch C, LaValley C, Rashidi HH. Machine learning and artificial intelligence for the diagnosis of infectious diseases in immunocompromised patients. Curr Opin Infect Dis. 2023;36(4):235–42.

27. Qiu H, Zhao H, Xiang H, Ou R, Yi J, Hu L, et al. Forecasting the incidence of mumps in Chongqing based on a SARIMA model. BMC Public Health. 2021;21(1):373.

28. Liu J, Yu F, Song H. Application of SARIMA model in forecasting and analyzing inpatient cases of acute mountain sickness. BMC Public Health. 2023;23(1):56.

29. Gianacas C, Liu B, Kirk M, Di Tanna GL, Belcher J, Blogg S, et al. Bayesian structural time series, an alternative to interrupted time series in the right circumstances. J Clin Epidemiol. 2023;163:102–10.

30. Scott SL, Varian HR. Predicting the present with Bayesian structural time series. Int J Math Modelling Numer Optimisation. 2014;5(1/2):4.

31. McQuire C, Tilling K, Hickman M, de Vocht F. Forecasting the 2021 local burden of population alcohol-related harms using Bayesian structural time-series. Addiction. 2019;114(6):994–1003.

32. Jang G, Seo J, Lee H. Analyzing the impact of COVID-19 on seasonal infectious disease outbreak detection using hybrid SARIMAX-LSTM model. J Infect Public Health. 2025;18(7):102772.

33. Vavilala H, Yaladanda N, Krishna Kondeti P, Rafiq U, Mopuri R, Gouda KC, et al. Weather integrated malaria prediction system using Bayesian structural time series model for Northeast States of India. Environ Sci Pollut Res Int. 2022;29(45):68232–46.

34. Mason CH, Perreault WD. Collinearity, power, and interpretation of multiple regression analysis. J Mark Res. 1991;28(3):268–80.

35. Şanlıtürk D. The causal link between air pollution and respiratory diseases: evidence from Granger causality test. Thorac Res Pract. 2025;26(6):314–22.

36. Sun X, Xu Y, Zhu Y, Tang F. Impact of non-pharmaceutical interventions on the incidences of vaccine-preventable diseases during the COVID-19 pandemic in the Eastern of China. Hum Vaccin Immunother. 2021;17(11):4083–9.

37. Alqahtani SA, Buti M. COVID-19 and hepatitis B infection. Antivir Ther. 2021;25(8):387–97.

38. Xiao J, Liu L, Peng Y, Wen Y, Lv X, Liang L, et al. Anxiety, depression, and insomnia among nurses during the full liberalization of COVID-19: a multicenter cross-sectional analysis of the high-income region in China. Front Public Health. 2023;11:1179755.

39. Wu H, Xue M, Wu C, Lu Q, Ding Z, Wang X, et al. Trend of hand, foot, and mouth disease from 2010 to 2021 and estimation of the reduction in enterovirus 71 infection after vaccine use in Zhejiang Province, China. PLoS One. 2022;17(9):e0274421.

40. Gong X, Han Y, Hou M, Guo R. Online public attention during the early days of the COVID-19 pandemic: infoveillance study based on Baidu index. JMIR Public Health Surveill. 2020;6(4):e23098.

41. Zhao C, Yang Y, Wu S, Wu W, Xue H, An K, et al. Search trends and prediction of human brucellosis using Baidu index data from 2011 to 2018 in China. Sci Rep. 2020;10(1):5896.

42. Navas Thorakkattle M, Farhin S, Khan AA. Forecasting the trends of Covid-19 and causal impact of vaccines using Bayesian structural time series and ARIMA. Ann Data Sci. 2022;9(5):1025–47.

43. Brodersen KH, Gallusser F, Koehler J, Remy N, Scott SL. Inferring causal impact using Bayesian structural time-series models. Ann Appl Stat. 2015;9(1):247–74.

44. Chen H, Lin MX, Wang LP, Huang YX, Feng Y, Fang LQ, et al. Driving role of climatic and socioenvironmental factors on human brucellosis in China: machine-learning-based predictive analyses. Infect Dis Poverty. 2023;12(1):36.

45. Ding W, Li Y, Bai Y, Li Y, Wang L, Wang Y. Estimating the effects of the COVID-19 outbreak on the reductions in tuberculosis cases and the epidemiological trends in China: a causal impact analysis. Infect Drug Resist. 2021;14:4641–55.
