BMC Public Health. 2025 Aug 18;25:2827. doi: 10.1186/s12889-025-24188-9

Hospitalisations in Brazil: an ecological time series analysis of the impact of medical decision support data as an exogenous variable

Dayanna Quintanilha 1, Eduardo Moura 1,#, Danielly Xavier 1,#
PMCID: PMC12359944  PMID: 40826054

Abstract

Purpose

Public health surveillance depends on continuous monitoring to guide interventions and allocate resources effectively. This study aimed to evaluate whether structured medical search data from the Afya Whitebook®, a clinical decision-support platform, can serve as exogenous variables to enhance the explanatory capacity of time series models characterising hospitalisation patterns within Brazil’s public health system.

Methods

An ecological time series analysis was conducted using hospitalisation data (SIH/SUS) and Afya Whitebook® search volumes from 2021 to 2024. SARIMAX models assessed temporal associations between search activity and hospital admissions across Brazilian states, compared to univariate SARIMA models to evaluate the added value of search data.

Results

In 278 of the 478 time series, SARIMAX models provided a better fit than univariate SARIMA models, particularly for conditions such as chronic obstructive pulmonary disease, dengue, urinary tract infections, type 2 diabetes, asthma, depression, and chronic kidney disease. Model fit varied by disease and region, underscoring the influence of contextual factors in the association between search behaviour and hospital admissions.

Conclusion

This study demonstrates that structured medical search data can serve as exogenous variables to improve the explanatory capacity of time series models of hospitalisation patterns. Despite variation between diseases and regions, this approach shows promise in supporting public health surveillance and could be strengthened by incorporating contextual data in future studies.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12889-025-24188-9.

Keywords: Health system surveillance, Point-of-care information tool, Search engine, Hospitalisations, Disease outbreaks

Introduction

Public health surveillance is defined as “the continuous and systematic collection, orderly consolidation, and evaluation of pertinent data with prompt dissemination of results to those who need to know, particularly those who are in a position to take action” [1]. This process involves collecting, analysing, and interpreting health data to quickly identify and respond to emerging threats, ultimately protecting the health of the population [2]. In Brazil, a country of continental dimensions with nearly 6,000 municipalities characterized by substantial heterogeneity, it is essential to monitor key health indicators and risk factors to ensure effective responses to regional needs [3].

With technological advancements, new sources of data have emerged that can complement traditional public health surveillance systems, including medical research applications and clinical decision support summaries. These applications are increasingly integrated into healthcare professionals’ daily routines, enabling quick access to evidence-based information and facilitating clinical decision-making [4]. Research has demonstrated that search trends on digital platforms may reflect health behaviours and increases in medical care demands, functioning as indirect indicators of clinical conditions and disease outbreaks [5–7].

The integration of digital data, such as search trends, provides real-time disease activity estimates, helping to bridge data gaps caused by delays in traditional surveillance systems, especially in resource-limited areas [8]. This study aimed to evaluate whether structured medical search data, used as exogenous variables, enhance the explanatory capacity of time series models in characterising hospitalisation patterns across Brazilian states. The analysis focused on identifying consistent temporal associations between search activity and hospital admissions for specific health conditions.

Materials and methods

This is an analytical ecological time series study designed to investigate the association between access to a decision support tool and hospitalisations in Brazil, covering the period between 2021 and 2024. The analysis included medical research access data from the Afya Whitebook® application, aiming to identify temporal patterns and their relationships with disease-related hospitalisations, which served as the primary outcome of interest.

Population, setting, and timeframe

The study population comprised all searches conducted on the Afya Whitebook® platform and all hospital admissions for selected climate-sensitive diseases recorded across all federative units (UFs) of Brazil between January 2021 and August 2024. This spatiotemporal framework provides a comprehensive basis for examining associations between access to digital clinical decision-support tools and hospitalisation patterns across diverse socioeconomic contexts within the country.

Afya Whitebook® is a widely adopted clinical decision-support platform among healthcare professionals in Brazil. Designed to facilitate evidence-based medical practice, the platform offers rapid access to structured, topic-specific content, including disease overviews, diagnostic algorithms, therapeutic guidelines, drug dosing information, and clinical calculators. The platform is optimized for use on both mobile and desktop devices and is intended to support clinical decision-making in real-time across a variety of care settings.

Although primarily developed for use by licensed physicians, including those engaged in postgraduate clinical training (e.g., residents), the user base also encompasses medical students and allied health professionals such as nurses and physician assistants [9]. For this analysis, only search data generated by users verified as licensed physicians with active professional credentials were included. The Afya Whitebook® platform recorded approximately 150,000 active physician users in December 2024. As a point of reference, Brazil had 597,428 registered physicians in the same year [10]. The geographic distribution of unique active users between February 2023 and January 2024 is detailed in Supplementary File 1, alongside the number of registered physicians per state.

Data sources

This study utilised two primary data sources: (i) the Hospital Information System of the Unified Health System (SIH/SUS) [11] and (ii) the research database of the Afya Whitebook® platform [9]. The SIH/SUS database, maintained by the Department of Informatics of the Unified Health System (DATASUS), provides anonymized, publicly available data on hospital admissions throughout Brazil. These data are updated monthly, with a standard release delay of one to two months from the date of care. However, under Portaria SAES n° 1.110/2021 [12], this delay may extend up to four months to accommodate administrative processing and data corrections.

The Afya Whitebook® research database contains metadata on user search behaviour within the platform, which is collected in real time and can be accessed retrospectively for research purposes via a structured query interface. To ensure confidentiality and compliance with ethical standards, all user data were anonymised, preventing the identification of individual healthcare professionals.

Population estimates used to calculate hospitalisation rates per 100,000 inhabitants were obtained from the DATASUS TABNET system [13], which provides official demographic projections and census-adjusted figures.
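The rate calculation can be sketched as follows. This is a minimal illustration, assuming the admission count and the population estimate refer to the same federative unit and year; the function name and the figures in the usage line are hypothetical and are not study data.

```python
def rate_per_100k(admissions: int, population: int) -> float:
    """Return the hospitalisation rate per 100,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return admissions / population * 100_000


# Illustrative numbers only: 1,500 admissions in a population of 2,000,000.
example_rate = rate_per_100k(1_500, 2_000_000)
print(f"{example_rate:.2f} admissions per 100,000 inhabitants")
```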

Study variables

The study variables included the primary International Classification of Diseases, 10th Revision (ICD-10) code associated with each hospital admission, which was used to establish thematic correspondence with search topics recorded on the Afya Whitebook® platform. The selection of diseases analysed was based on established evidence of their sensitivity to climatic and environmental factors, including temperature, precipitation, and humidity. For presentation purposes, these conditions were grouped into six broader categories: vector-borne diseases (arboviral diseases — dengue, chikungunya fever — and malaria) [14]; respiratory diseases (such as influenza, COVID-19, pneumonia, tuberculosis, asthma, and chronic obstructive pulmonary disease) [15, 16]; heart diseases (acute myocardial infarction and heart failure) [15]; renal and urinary tract conditions (acute kidney injury, chronic kidney disease, and urinary tract infections) [17]; metabolic disorders (type 2 diabetes mellitus and diabetic ketoacidosis) [18]; and mental health conditions (depressive disorders and generalised anxiety disorder) [19]. To ensure clarity and reproducibility, a detailed list of all individual diseases and their corresponding ICD-10 codes is provided in Supplementary File 2. The ICD codes reported in our table reflect those that returned hospital admission records in the database at the time of our query.

Biases

Ecological time-series studies that rely on secondary data are inherently subject to multiple sources of epidemiological bias, which may affect both the interpretation of results and the performance of statistical models. In the present study, four primary categories of bias were considered: ecological fallacy, selection bias, information bias, and confounding.

Ecological fallacy refers to the erroneous inference of individual-level associations based on aggregate-level data. As this analysis was conducted at the federative unit (state) level, none of the findings should be interpreted as reflecting individual-level risk. The observed associations between clinical search activity and hospital admissions represent population-level trends and are not generalisable to smaller geographic units, such as municipalities, or to individual patients.

Selection bias may also be a consideration, given that the Afya Whitebook® user base comprises physicians who choose to engage with a digital clinical support tool. Although the platform is widely used across Brazil, we did not explicitly assess the geographic or institutional distribution of its users. It is therefore unclear whether certain regions or healthcare settings, such as large urban centres or hospitals, are overrepresented. This uncertainty may limit the generalisability of findings to the entire physician population or to underrepresented contexts such as rural areas.

Information bias may result from variability or misclassification within the data sources. Hospital diagnoses, recorded using ICD-10 codes, may vary in coding accuracy and practices across institutions and regions. Similarly, the mapping between search terms entered in the Afya Whitebook® and specific diagnostic codes is not always exact, potentially introducing noise and imprecision into the analysis.

Although it was not possible to adjust for individual-level confounders—such as age, comorbidities, or disease severity—due to the use of aggregated, daily-level time-series data, this study did not aim to establish causal relationships. The observed temporal patterns may be shaped by shared underlying factors, including seasonal variation, healthcare access, or disease burden, which should be considered when interpreting the results.

While these sources of bias cannot be entirely eliminated, several methodological approaches were adopted to mitigate their influence. These included the use of standardised national data repositories, the restriction of search activity to verified physician users, the application of consistent preprocessing procedures across all time series, and the comprehensive inclusion of hospitalisations recorded in the public health system. Taken together, these strategies enhance data quality and improve the comparability of findings across disease groups and geographic regions.

Data analysis

The data analysis was conducted in sequential stages to ensure a comprehensive understanding of the phenomenon under investigation. Initially, a detailed descriptive analysis was performed to characterise the study population and hospital admissions. Hospitalisation counts and rates per 100,000 inhabitants were calculated and stratified by disease and federative unit (UF), aiming to identify potential disparities and regional patterns. Based on these data, a total of 478 time series were constructed, corresponding to 18 disease categories across the 27 federative units of Brazil (26 states and the Federal District). Not all possible disease–region combinations resulted in a time series due to the absence or insufficiency of data in certain cases.

In the subsequent phase, association and correlation analyses were employed to explore the relationships between access to medical information and hospital admissions. Time series decomposition was conducted for each medical condition, revealing consistent seasonal components across the data (Supplementary Material, File 3). These patterns reinforced the appropriateness of the Seasonal Autoregressive Integrated Moving Average (SARIMA) model, which is specifically designed to capture both trend and seasonality in time series data.

Finally, the SARIMA model was applied to examine the relationship between hospitalisation trends and the frequency of searches in the Afya Whitebook®. In this model, the number of hospitalisations was considered the dependent variable, while the volume of searches on the Afya Whitebook® platform served as the independent variable.

Descriptive tables were generated using Microsoft Excel® (version 16.89.1) [20], facilitating the presentation of key statistics and graphical representations of regional trends and disease-specific patterns.

SARIMA

The SARIMA models employed in this study were implemented using the SARIMAX class from the statsmodels Python library [21], which extends the traditional ARIMA framework to accommodate both seasonal components and optional exogenous regressors. The general formulation is denoted as SARIMA(p, d, q)(P, D, Q)_s, where p, d, and q refer to the non-seasonal autoregressive order, differencing order, and moving average order, respectively, and P, D, Q, and s correspond to their seasonal analogs and the length of the seasonal cycle.

Mathematically, the model is represented as:

$$\phi_p(B)\,\Phi_P(B^s)\,(1-B)^d\,(1-B^s)^D\,y_t = \theta_q(B)\,\Theta_Q(B^s)\,\varepsilon_t \tag{1}$$

where $y_t$ is the observed series; B is the backshift operator; $\phi_p(B)$, $\Phi_P(B^s)$ and $\theta_q(B)$, $\Theta_Q(B^s)$ represent the non-seasonal and seasonal polynomials for autoregressive and moving average terms; and $\varepsilon_t$ is the white noise error term. This structure enables the SARIMA model to jointly capture short-term fluctuations, long-term trends, and periodic seasonal effects in time series data [21, 22].
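The role of the backshift operator in the differencing terms can be illustrated with a short sketch. It applies (1 - B) d times and (1 - B^s) D times to a toy series; the helper names and the weekly period s = 7 are assumptions made for illustration and do not reflect the orders selected in the study.

```python
def difference(series, lag=1):
    """Apply (1 - B^lag): subtract the value observed `lag` steps earlier."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def sarima_difference(series, d=0, D=0, s=1):
    """Apply (1 - B)^d followed by (1 - B^s)^D to a sequence of numbers."""
    out = list(series)
    for _ in range(d):
        out = difference(out, 1)
    for _ in range(D):
        out = difference(out, s)
    return out


# Toy series: linear trend plus a spike every 7th day (hypothetical data).
raw = [t + (5 if t % 7 == 0 else 0) for t in range(28)]
stationary = sarima_difference(raw, d=1, D=1, s=7)
print(stationary)
```

In this toy case one regular and one seasonal difference remove both the trend and the weekly spike; each difference shortens the series by its lag.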

The models were fitted using Python (version 3.12.3) [23] with the statsmodels package (version 0.14.2) [21], specifically through the statsmodels.tsa.statespace.SARIMAX module. This open-source library provides a comprehensive implementation of time series models, including functionality for diagnostics, forecasting, and inclusion of exogenous variables. All analyses were performed in Python to ensure reproducibility and transparency.

Each time series corresponded to a specific health condition and contained 1,461 daily observations spanning four years, including one leap year—yielding approximately 30 data points per month. Days with no hospitalisation events were recorded as zero. Similarly, search activity data were compiled daily, with zero values assigned to days without queries related to the corresponding health condition.
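The construction of a zero-filled daily series can be sketched as follows. The date range 1 January 2021 to 31 December 2024 is an assumption that reproduces the 1,461 daily observations; the function name and the example events are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def daily_series(event_dates, start, end):
    """Count events per calendar day, recording zero for days without events."""
    counts = Counter(event_dates)
    n_days = (end - start).days + 1
    return [counts.get(start + timedelta(days=i), 0) for i in range(n_days)]


# Hypothetical events: two on the first day and one on the leap day.
events = [date(2021, 1, 1), date(2021, 1, 1), date(2024, 2, 29)]
series = daily_series(events, date(2021, 1, 1), date(2024, 12, 31))
print(len(series), series[0], sum(series))
```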

The modelling process involved a series of preparatory steps, including testing for stationarity via the Augmented Dickey–Fuller (ADF) test and identifying autocorrelation structures using autocorrelation function (ACF) and partial autocorrelation function (PACF) plots. These diagnostics informed the selection of both seasonal and non-seasonal SARIMA parameters.
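To make the autocorrelation diagnostic concrete, the sample autocorrelation function can be computed directly. This is a plain-Python sketch on a toy weekly pattern, not the plotting routine used in the study, and it does not reproduce the ADF test; the function name and data are illustrative.

```python
def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    if c0 == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]


# Toy series with a peak every 7th day (hypothetical data).
weekly = [10 if t % 7 == 0 else 2 for t in range(70)]
r = acf(weekly, 14)
print([round(v, 2) for v in r])
```

A pronounced value at lag 7 relative to neighbouring lags is the kind of pattern that would point towards a seasonal term with s = 7.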

Model performance was assessed through standard error metrics—mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE)—calculated on both training and test sets. Diagnostic validation involved the Ljung–Box test for autocorrelation and the Jarque–Bera test for residual normality, supplemented by residual plots and histograms. All tests adopted a significance threshold of 0.05.
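The three error metrics can be written out as below. Because the daily series contain zeros, MAPE is undefined for some days; this sketch skips zero-valued observations when computing MAPE, which is an assumption, as the source does not state how such days were handled. Function names and numbers are illustrative.

```python
def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error over non-zero observations (assumption)."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when all observations are zero")
    return 100 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


# Hypothetical observed and fitted daily admissions.
actual = [10, 0, 20, 40]
predicted = [12, 1, 18, 30]
print(mse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```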

To address the issue of multiple hypothesis testing arising from evaluating the statistical significance of the exogenous variable’s coefficients across the time series, the associated p-values were adjusted using the Holm–Bonferroni correction. This method controls the family-wise error rate while maintaining statistical power. The adjusted p-values are reported in Supplementary File 3 [24].
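The Holm–Bonferroni step-down adjustment can be sketched in a few lines. This is a generic implementation of the procedure, not the authors' code; the function name and the example p-values are illustrative.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical coefficient p-values from three time series.
adjusted = holm_adjust([0.01, 0.04, 0.03])
print(adjusted)
```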

For a SARIMAX model to be considered preferable to the univariate SARIMA model, we required all of the following conditions to be met: (1) the exogenous variable was statistically significant; (2) the SARIMAX model yielded a higher Log-Likelihood; (3) both AIC and BIC values were lower than those of the SARIMA model; and (4) the residuals of the SARIMAX model did not exhibit significant autocorrelation, as assessed by the Ljung–Box test (p > 0.05). This comprehensive set of criteria ensured that the inclusion of the exogenous variable led to statistically robust and parsimonious improvements in model fit.

Although the Hannan–Quinn Information Criterion (HQIC) and residual normality (assessed via the Jarque–Bera test, skewness, and kurtosis) were calculated, they were not used for model selection or acceptance decisions. When AIC and BIC disagreed, the univariate model was retained: the SARIMAX model was considered preferable only if both criteria yielded lower values than the univariate model. This strategy was designed to balance model fit and parsimony in a consistent and transparent manner [25, 26].
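The acceptance rule can be expressed as a single predicate. This is a sketch under stated assumptions: the container and function names are hypothetical, and the source does not specify whether the raw or the Holm-adjusted p-value was used for the significance criterion, so the caller supplies whichever applies.

```python
from dataclasses import dataclass


@dataclass
class FitSummary:
    llf: float  # log-likelihood
    aic: float
    bic: float


def prefer_sarimax(sarima, sarimax, exog_p, ljung_box_p, alpha=0.05):
    """Return True only if all four criteria favour the SARIMAX model."""
    return (
        exog_p < alpha                  # (1) exogenous coefficient significant
        and sarimax.llf > sarima.llf    # (2) higher log-likelihood
        and sarimax.aic < sarima.aic    # (3) lower AIC ...
        and sarimax.bic < sarima.bic    #     ... and lower BIC
        and ljung_box_p > alpha         # (4) no significant residual autocorrelation
    )


# Hypothetical fit summaries for one time series.
base = FitSummary(llf=-510.0, aic=1026.0, bic=1036.0)
with_exog = FitSummary(llf=-490.0, aic=988.0, bic=1001.0)
print(prefer_sarimax(base, with_exog, exog_p=0.01, ljung_box_p=0.40))
```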

Finally, all models were trained and evaluated using a walk-forward validation strategy, preserving the temporal order of the data. Each time series was split into 70% for training (1,022 days) and 30% for testing (439 days) [26].
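The chronological split and a walk-forward loop can be sketched as follows. The naive last-value forecaster is a placeholder assumption standing in for a fitted model, and the source does not describe how often models were refitted; the function names are hypothetical.

```python
def chronological_split(series, train_fraction=0.7):
    """Split a series into leading training and trailing test segments."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def walk_forward(train, test, forecast_one_step):
    """Forecast each test point from all data observed before it."""
    history = list(train)
    predictions = []
    for observed in test:
        predictions.append(forecast_one_step(history))
        history.append(observed)  # reveal the true value only after forecasting
    return predictions


# 1,461 daily observations, as in the study; values are placeholders.
series = list(range(1461))
train, test = chronological_split(series)
predictions = walk_forward(train, test, lambda history: history[-1])
print(len(train), len(test), len(predictions))
```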

Ethical approval declarations

This study was approved by the Research Ethics Committee of the Centro Universitário Presidente Tancredo de Almeida Neves (UNIPTAN), under substantiated opinion number 6.991.028 and CAAE 81119724.6.0000.9667. The research complies with the ethical principles established by Resolution CNS 466/12 of the Brazilian National Health Council, ensuring the exclusive use of anonymized data obtained from public databases, respecting individuals’ privacy, and without direct intervention in participants. Furthermore, all security measures have been adopted to ensure the protection and integrity of the analysed information.

Results

Access and hospitalisation by federative unit

Hospitalisation rates consistently decreased in most Brazilian states between 2021 and 2024, as shown in Table 1. Averaged over the period (2021–2024), the highest hospitalisation rates were observed in Rio Grande do Sul (930.48), Paraná (868.67), Santa Catarina (846.17), Mato Grosso do Sul (814.85), and Roraima (808.70).

Table 1.

Hospitalisation rates by federative unit (2021–2024)

Year columns: Hospitalisations (%)/Rate per 100,000 inhabitants

| State | 2021 | 2022 | 2023 | 2024 | Mean rate |
|---|---|---|---|---|---|
| Rio Grande do Sul | 152738 (7.17%)/1333.39 | 121209 (7.92%)/1055.33 | 111799 (7.72%)/970.82 | 41884 (7.54%)/362.38 | 930.48 |
| Paraná | 150522 (7.07%)/1300.09 | 109483 (7.16%)/940.04 | 104368 (7.21%)/891.27 | 40435 (7.28%)/343.26 | 868.66 |
| Santa Catarina | 88698 (4.16%)/1206.80 | 69303 (4.53%)/931.26 | 66359 (4.58%)/883.55 | 27496 (4.95%)/363.07 | 846.17 |
| Mato Grosso do Sul | 33595 (1.58%)/1191.97 | 26220 (1.71%)/904.16 | 25285 (1.75%)/866.85 | 8720 (1.57%)/296.40 | 814.84 |
| Roraima | 8854 (0.42%)/1354.86 | 6307 (0.41%)/943.01 | 5503 (0.38%)/806.90 | 911 (0.16%)/130.02 | 808.70 |
| Minas Gerais | 251946 (11.83%)/1172.86 | 191341 (12.51%)/886.40 | 177788 (12.28%)/817.93 | 71447 (12.86%)/327.37 | 801.14 |
| Piauí | 38351 (1.8%)/1162.69 | 28695 (1.88%)/866.73 | 26476 (1.83%)/798.15 | 10014 (1.8%)/301.57 | 782.29 |
| Rondônia | 22777 (1.07%)/1264.56 | 15421 (1.01%)/833.92 | 14474 (1.0%)/777.19 | 4422 (0.8%)/235.19 | 777.72 |
| Ceará | 90976 (4.27%)/978.98 | 70374 (4.6%)/753.67 | 67904 (4.69%)/721.52 | 27421 (4.94%)/290.34 | 686.13 |
| Mato Grosso | 38872 (1.83%)/1092.87 | 24134 (1.58%)/673.77 | 23726 (1.64%)/649.56 | 8199 (1.48%)/222.95 | 659.79 |
| Espírito Santo | 43000 (2.02%)/1041.80 | 29970 (1.96%)/716.98 | 27002 (1.87%)/640.96 | 9692 (1.74%)/227.43 | 656.79 |
| São Paulo | 457929 (21.5%)/982.98 | 302272 (19.76%)/641.30 | 284250 (19.64%)/598.48 | 110781 (19.94%)/232.06 | 613.71 |
| Goiás | 76449 (3.59%)/1057.87 | 43977 (2.88%)/598.13 | 43195 (2.98%)/580.80 | 15961 (2.87%)/212.19 | 612.25 |
| Amazonas | 38342 (1.8%)/894.89 | 25940 (1.7%)/597.57 | 29122 (2.01%)/662.33 | 10433 (1.88%)/231.77 | 596.64 |
| Pará | 76040 (3.57%)/864.55 | 58485 (3.82%)/659.46 | 55037 (3.8%)/610.81 | 18434 (3.32%)/202.96 | 584.45 |
| Bahia | 113564 (5.33%)/758.41 | 91344 (5.97%)/609.56 | 88877 (6.14%)/592.31 | 34768 (6.26%)/231.02 | 547.83 |
| Tocantins | 13768 (0.65%)/852.79 | 8955 (0.59%)/546.94 | 8878 (0.61%)/537.30 | 4639 (0.83%)/279.26 | 554.07 |
| Pernambuco | 93209 (4.38%)/961.22 | 62196 (4.07%)/634.18 | 57959 (4.0%)/587.11 | 21598 (3.89%)/217.48 | 600.00 |
| Rio de Janeiro | 135490 (6.36%)/778.43 | 95393 (6.24%)/546.69 | 91821 (6.34%)/523.89 | 35560 (6.4%)/202.05 | 512.76 |
| Paraíba | 29173 (1.37%)/711.62 | 23762 (1.55%)/576.47 | 25195 (1.74%)/610.70 | 11326 (2.04%)/273.42 | 543.05 |
| Rio Grande do Norte | 24943 (1.17%)/703.97 | 17059 (1.12%)/480.63 | 15625 (1.08%)/437.98 | 6289 (1.13%)/172.21 | 448.70 |
| Alagoas | 22459 (1.05%)/670.85 | 14712 (0.96%)/439.32 | 11761 (0.81%)/349.49 | 3713 (0.67%)/110.37 | 392.51 |
| Sergipe | 14125 (0.66%)/603.09 | 8649 (0.57%)/361.82 | 8272 (0.57%)/344.88 | 3556 (0.64%)/147.70 | 364.37 |

Rates are per 100,000 inhabitants. Source: Unified Health System (SUS) - DATASUS records

The southeastern region reported the greatest number of accesses to the Afya Whitebook®, as detailed in Table 2 and illustrated in Fig. 1. São Paulo led in both hospitalisations (1,155,232) and access to the Afya Whitebook® (1,192,175). States such as Rio Grande do Sul, Paraná, and Minas Gerais, which also reported high hospitalisation rates, presented significant numbers of accesses to the application.

Table 2.

Accesses to the Afya Whitebook® Application by State (2021–2024)

| UF | 2021 | 2022 | 2023 | 2024 | Total |
|---|---|---|---|---|---|
| São Paulo | 322926 | 347774 | 328043 | 193432 | 1192175 |
| Minas Gerais | 156786 | 194127 | 192228 | 117607 | 660748 |
| Rio de Janeiro | 149538 | 141348 | 124752 | 67998 | 483636 |
| Paraná | 87045 | 115100 | 108534 | 60880 | 371559 |
| Rio Grande do Sul | 84135 | 116080 | 103684 | 59216 | 363115 |
| Bahia | 94444 | 103724 | 94239 | 48928 | 341335 |
| Santa Catarina | 45713 | 68574 | 66055 | 38852 | 219194 |
| Pernambuco | 72379 | 66154 | 53158 | 24241 | 215932 |
| Ceará | 60603 | 65736 | 55264 | 31837 | 213440 |
| Distrito Federal | 55555 | 59036 | 50184 | 24154 | 188929 |
| Goiás | 30045 | 42279 | 40888 | 24039 | 137251 |
| Pará | 38331 | 39428 | 33854 | 18347 | 129960 |
| Mato Grosso | 28944 | 31259 | 29795 | 16087 | 106085 |
| Amazonas | 30882 | 29438 | 25320 | 11669 | 97309 |
| Mato Grosso do Sul | 22684 | 25596 | 20653 | 11064 | 79997 |
| Maranhão | 21799 | 23399 | 19923 | 10255 | 75376 |
| Paraíba | 20731 | 22414 | 17356 | 9482 | 69983 |
| Espírito Santo | 10740 | 16456 | 24855 | 16855 | 68906 |
| Rio Grande do Norte | 15552 | 20711 | 18451 | 9318 | 64032 |
| Piauí | 10779 | 15187 | 13844 | 6760 | 46570 |
| Alagoas | 8895 | 11165 | 12496 | 6580 | 39136 |
| Rondônia | 8679 | 10558 | 11596 | 6279 | 37112 |
| Sergipe | 9527 | 11480 | 8983 | 4687 | 34677 |
| Tocantins | 6535 | 8133 | 8978 | 5380 | 29026 |
| Acre | 2800 | 3098 | 3356 | 2191 | 11445 |
| Amapá | 2161 | 2175 | 1839 | 927 | 7102 |
| Roraima | 290 | 1199 | 2311 | 1841 | 5641 |

Source: Afya Whitebook®

Fig. 1. Geographic Distribution of Hospitalisations and Accesses to the Afya Whitebook by UF (Brazil, 2021–2024)

In contrast, states such as Roraima (hospitalisations: 21,575, accesses: 5,641) and Acre (hospitalisations: 18,948, accesses: 11,445) reported low values for both hospitalisations and application accesses.

Hospitalisations by health condition

As detailed in Table 3, hospitalisations for COVID-19 declined sharply in both number and percentage after 2021. Conversely, arboviral disease hospitalisations increased markedly over time, from 17,005 in 2021 to 144,558 in 2024. Heart diseases and pneumonia remained the categories with the highest impact in absolute and percentage terms. Despite being an acute rather than chronic condition, pneumonia accounted for more than 38% of admissions in 2023 and 2024, indicating a significant burden on health systems.

Table 3.

Hospitalisation rates by health condition (2021–2024)

Year columns: Admissions (%)/Rate per 100,000 inhabitants

| Health condition | 2021 | 2022 | 2023 | 2024 |
|---|---|---|---|---|
| Arboviral diseases | 17005 (0.78%)/1.99 | 46753 (2.91%)/5.44 | 48682 (3.19%)/4.50 | 144558 (14.39%)/13.28 |
| Asthma | 51628 (2.37%)/6.05 | 75815 (4.71%)/8.82 | 79377 (5.19%)/9.17 | 37715 (3.75%)/4.33 |
| COVID-19 | 1151929 (52.87%)/540.01 | 152982 (9.51%)/71.21 | 13881 (0.91%)/6.42 | 3850 (0.38%)/1.77 |
| Chronic Obstructive Pulmonary Disease | 52535 (2.41%)/6.15 | 81649 (5.08%)/9.50 | 79736 (5.22%)/9.21 | 47532 (4.73%)/5.45 |
| Diabetes | 21663 (0.99%)/0.78 | 23544 (1.46%)/0.84 | 26317 (1.72%)/0.93 | 15950 (1.59%)/0.56 |
| Chronic Kidney Disease | 111640 (5.12%)/5.23 | 133942 (8.33%)/6.23 | 149172 (9.76%)/6.89 | 86428 (8.60%)/3.97 |
| Heart Diseases | 306981 (14.09%)/10.27 | 357867 (22.25%)/11.89 | 368758 (24.13%)/12.17 | 214400 (21.34%)/7.03 |
| Urinary Tract Infection | 28703 (1.32%)/1.68 | 31516 (1.96%)/1.83 | 33541 (2.19%)/1.93 | 18563 (1.85%)/1.06 |
| Influenza | 19935 (0.91%)/1.55 | 29886 (1.86%)/2.31 | 23619 (1.55%)/1.82 | 13415 (1.34%)/1.02 |
| Malaria | 1370 (0.06%)/0.05 | 1601 (0.10%)/0.06 | 1582 (0.10%)/0.07 | 716 (0.07%)/0.03 |
| Pneumonia | 369814 (16.97%)/6.42 | 619935 (38.55%)/10.68 | 642602 (42.05%)/11.00 | 386118 (38.43%)/6.56 |
| Mental Disorders | 28715 (1.32%)/0.70 | 33910 (2.11%)/0.83 | 39465 (2.58%)/0.95 | 24156 (2.40%)/0.58 |
| Tuberculosis | 16903 (0.78%)/0.41 | 18795 (1.17%)/0.48 | 21367 (1.40%)/0.52 | 11299 (1.12%)/0.27 |

Rates are per 100,000 inhabitants. Source: Unified Health System (SUS) - DATASUS records

Time-series analysis of hospitalisations using SARIMA

Among the 478 time series analysed, models incorporating exogenous variables demonstrated a better fit in 278 cases, whereas univariate models provided a better fit in 200 cases. Notably, in 12 cases where the univariate model showed superior fit, the exogenous variable in the alternative model remained significant, suggesting that—even if not directly improving model fit—it may still capture external influences relevant to hospitalisation dynamics.

As detailed in Fig. 2, exogenous models demonstrated a better explanatory fit for specific health conditions. For chronic obstructive pulmonary disease (COPD), these models demonstrated a better fit than univariate models in 26 states, with statistical significance observed in most comparisons. Other conditions, such as dengue, urinary tract infection, depression, asthma, chronic kidney disease, and type 2 diabetes mellitus, also showed notable associations in 25 states each. Conversely, no series favouring the exogenous model were identified for chikungunya fever, heart failure, influenza, pneumonia, or tuberculosis, suggesting that, within this dataset, exogenous variables may have limited utility in modelling these conditions.

Fig. 2. Performance of the Exogenous vs. Univariate Models by Disease

The comparison of the exogenous model with the univariate approach, stratified by federative unit, revealed substantial variation across states. Overall, most states exhibited a high proportion of series in which the inclusion of exogenous variables resulted in superior performance, most notably Mato Grosso, Rio Grande do Sul, and Santa Catarina, where over 70% of the series benefited from the exogenous model. Conversely, states such as Roraima and Amapá demonstrated considerably lower effectiveness, with proportions below 25%. These findings indicate that the utility of exogenous variables in SARIMA modelling may be sensitive to regional specificities, suggesting the need for tailored approaches according to the epidemiological and socio-economic contexts of each state. A complete breakdown of state-level results and statistical details is provided in Supplementary File 4.

Discussion

The findings of this study underscore the potential of exogenous variables to enhance the explanatory capacity of SARIMA models, particularly for conditions such as dengue, urinary tract infections, chronic obstructive pulmonary disease (COPD), type 2 diabetes mellitus, asthma, depression, and chronic kidney disease, where exogenous models consistently provided a better fit across multiple states. These results suggest that incorporating external variables may help to capture underlying factors related to health information demand and hospital admission dynamics that are not accounted for by univariate models.

Among these conditions, dengue was the only infectious disease for which the inclusion of exogenous variables consistently showed a stronger association. This may be explained by the disease’s well-established seasonal behaviour in Brazil, with incidence closely tied to climatic factors such as rainfall and temperature, which affect mosquito vector dynamics [27]. Physicians may anticipate or respond to outbreak periods with increased search activity, leading to synchronised patterns between information-seeking behaviour and hospital admissions. Additionally, dengue is often documented with specific ICD-10 codes and elicits targeted clinical queries, unlike broader respiratory syndromes. These characteristics may have contributed to a stronger alignment between search data and hospitalisation trends.

Conversely, conditions such as influenza and pneumonia exhibited limited or no improvement with the inclusion of exogenous variables. Although influenza can cause spikes in hospitalisations, these are likely to be more accurately reflected in epidemiological surveillance data than in hospital admission records coded with ICD classifications. Influenza-related hospitalisations are frequently subsumed under broader respiratory diagnoses, such as pneumonia or acute respiratory distress syndrome, potentially diluting the relationship between search trends and recorded admissions.

Similarly, pneumonia was another condition where univariate models more often provided a better fit. One possible explanation is that pneumonia hospitalisations are influenced by multiple factors unrelated to information-seeking behaviour, such as community-acquired infections, bacterial coinfections, and pre-existing chronic diseases (e.g., COPD and heart failure). These influences may not be adequately reflected in search data. The absence of a consistent, well-defined search pattern may explain why univariate SARIMA models, relying solely on historical hospitalisation trends, better captured the temporal dynamics of this condition.

For other conditions such as chikungunya fever, heart failure, and tuberculosis, exogenous models also did not demonstrate a consistent improvement. These diseases may be influenced by complex and heterogeneous factors, including variations in clinical management, disease chronicity, and patterns of health information-seeking behaviour, which may not be adequately captured by the exogenous variables considered in this study.

Taken together, these findings highlight the heterogeneity in the relevance of exogenous variables across different health conditions. They suggest that while exogenous data may enhance the explanatory capacity of models for certain chronic and seasonally driven diseases, their utility is limited for others. This underscores the need for disease-specific considerations when integrating external variables into time series analyses and cautions against a one-size-fits-all approach in public health modelling.

The variation in the performance of exogenous models across Brazilian states further underscores the importance of considering regional specificities when applying time series modelling. Whilst in some states the incorporation of exogenous variables substantially improved model fit, in others their contribution was limited. These findings reinforce the relevance of including exogenous variables in time series analyses, as they can play an important role in characterising hospital admission patterns for specific health conditions. Nevertheless, the observed heterogeneity across states and diseases emphasises the necessity of context-specific adjustments in modelling approaches.

Time series models, such as ARIMA and SARIMA, have been extensively used in Brazil to monitor hospitalisations and seasonal disease dynamics. For example, SARIMA models have been applied to analyse dengue in Campinas, São Paulo [28], and visceral leishmaniasis in Maranhão [29], illustrating their applicability in monitoring infectious diseases with seasonal patterns. Internationally, SARIMA models have also been widely used to examine hospitalisations related to respiratory diseases, integrating both seasonal and environmental factors [30]. These studies highlight the relevance of time series models and the potential contribution of exogenous variables to characterising disease dynamics across diverse epidemiological contexts.

The study results align with the literature on the use of search data to monitor health events. Carneiro and Mylonakis [5] demonstrated that Google search queries related to influenza could signal outbreaks earlier than traditional surveillance systems, such as that of the CDC. Similarly, Ginsberg et al. [6] validated the use of search engine data for identifying epidemic trends, underscoring their potential as complementary tools for public health monitoring.

While prior research, such as that of Santillana et al., has demonstrated that clinician search behaviour can reflect early signals of disease activity, specifically in the context of influenza outbreaks [31], this study offers a complementary contribution by examining temporal associations between search activity and hospital admissions across a broader range of conditions. Unlike previous work focused on outbreak detection, our analysis leveraged data from a clinical decision-support platform widely used by physicians in Brazil’s publicly funded healthcare system, capturing routine, evidence-based queries at the point of care. This setting allows for the exploration of population-level clinical demand in a middle-income country, across multiple disease categories and geographic regions marked by socioeconomic heterogeneity.

Although our study focused exclusively on model fit metrics, the results suggest that including an exogenous variable based on search data from clinical decision support software may improve the explanatory capacity of time series models for hospital admissions. We modelled 478 time series across 18 diseases (both infectious and chronic) in Brazil’s 27 federative units (26 states and the Federal District), comparing univariate models with those incorporating the exogenous variable. In 278 of these series, models with the exogenous input were considered preferable based on a comprehensive set of criteria. While we did not directly assess forecasting performance, prior evidence suggests that well-fitted models may yield more accurate forecasts. For example, Wang et al. demonstrated that the SARIMA-NARX model, which achieved the best fit among the methods tested, also produced the most accurate forecasts for scarlet fever cases in China [32].
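As an illustration of how a fit-based comparison between a univariate and an exogenous model can be framed, the sketch below computes AIC and BIC from each model's log-likelihood and parameter count. It is a minimal sketch under stated assumptions: the log-likelihood values, the parameter counts, the 48 observations, and the rule that both criteria must improve are hypothetical, and the fuller set of criteria applied in the study is not reproduced here.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def prefer_exogenous(ll_uni: float, k_uni: int,
                     ll_exo: float, k_exo: int, n_obs: int) -> bool:
    """Hypothetical decision rule: prefer the exogenous model only if it
    has a lower value on both AIC and BIC."""
    return (aic(ll_exo, k_exo) < aic(ll_uni, k_uni)
            and bic(ll_exo, k_exo, n_obs) < bic(ll_uni, k_uni, n_obs))


# Hypothetical values for a single disease-by-state series with 48 observations.
# The exogenous model has one extra parameter (the search-volume coefficient).
print(prefer_exogenous(ll_uni=-210.4, k_uni=3, ll_exo=-204.9, k_exo=4, n_obs=48))
```

Because both criteria penalise the extra coefficient, a small gain in log-likelihood is not sufficient on its own for the exogenous model to be preferred.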

Naturally, several limitations must be considered. As an ecological analysis based on aggregated data, individual-level inference is not possible. Variability in search behaviour across physician profiles, regions, and care settings may also introduce distortions. Although the Afya Whitebook® platform is widely used across Brazil, our data were aggregated at the state level, preventing disaggregation by urban versus rural areas. As such, we were unable to assess whether search activity is disproportionately concentrated in more urbanized settings—a factor that may influence the spatial representativeness of the results.

In addition, while our time series were sufficiently long overall, some disease–region combinations presented sparse events, which may have limited model flexibility or stability in those specific contexts. Finally, the SARIMA framework, although well suited to capturing trends and seasonality, does not account for latent or non-linear interactions. Accordingly, our findings should be interpreted as context-specific statistical associations, not as predictive forecasts, since predictive modelling was beyond the scope of this study.

To strengthen analytical validity, we employed rigorous data preprocessing, uniform parameter estimation, and residual diagnostics. Data sources were official, standardized, and restricted to verified physicians, enhancing semantic consistency between search terms and ICD-coded outcomes. These methodological choices increase transparency and help ensure that the observed associations reflect structured patterns rather than random fluctuations.
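One common form of residual diagnostic is a check for leftover autocorrelation. The sketch below computes the Ljung-Box Q statistic from a list of model residuals; the choice of this particular test and the example residuals are assumptions made for illustration, not a description of the diagnostics run in this study.

```python
def autocorrelation(resid: list[float], lag: int) -> float:
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(resid)
    mean = sum(resid) / n
    denom = sum((r - mean) ** 2 for r in resid)
    num = sum((resid[t] - mean) * (resid[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box_q(resid: list[float], max_lag: int) -> float:
    """Ljung-Box statistic: Q = n (n + 2) * sum_{k=1..h} rho_k^2 / (n - k)."""
    n = len(resid)
    return n * (n + 2) * sum(
        autocorrelation(resid, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


# Hypothetical residuals that alternate in sign, i.e. strongly autocorrelated.
example_residuals = [1.0, -1.0] * 4
print(ljung_box_q(example_residuals, max_lag=1))
```

A large Q relative to the reference chi-square distribution suggests that the residuals still contain temporal structure that the model has not captured.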

Conclusion

This study highlights the potential use of exogenous variables—such as structured medical search data from clinical decision-support platforms—to inform time series models exploring hospitalisation patterns. Promising results were observed for conditions such as chronic obstructive pulmonary disease, dengue, urinary tract infections, depression, asthma, chronic kidney disease, and type 2 diabetes mellitus, suggesting that physicians’ information-seeking behaviour may reflect changes in healthcare demand. The models captured consistent temporal associations between search behaviour and hospital admissions across multiple states, demonstrating improved explanatory fit.

Importantly, this study contributes by leveraging real-time clinical data generated by healthcare professionals in a publicly funded, middle-income health system—an application not widely explored in the literature. Nonetheless, limitations such as reduced performance for certain conditions, regional disparities, and reliance on aggregated data point to the need for complementary strategies.

Future research should incorporate additional modelling approaches, such as machine learning, and integrate contextual data—climatic, demographic, and socioeconomic—to enhance model accuracy and generalizability. These refinements may support the development of more responsive and locally tailored public health surveillance systems.

Supplementary Information

12889_2025_24188_MOESM1_ESM.zip (10.8MB, zip)

Supplementary Material 1. The supplementary materials include aggregated data on user access to the Afya Whitebook® platform (Supplementary File 1), the categorization of ICD-10 codes used in the analysis (Supplementary File 2), graphical representations of the decomposed components (trend, seasonality, and residuals) for each disease at the national level (Supplementary File 3), the complete statistical outputs of the SARIMA and SARIMAX models for all time series (Supplementary File 4), and time series plots illustrating hospitalisation trends and search activity across Brazilian states (Supplementary File 5).

Acknowledgements

The authors thank Marcela Motta de Castro for her support in extracting user research data from the Afya Whitebook® platform.

Abbreviations

ACF

Autocorrelation Function

ADF

Augmented Dickey–Fuller

AIC

Akaike Information Criterion

ARIMA

Autoregressive Integrated Moving Average

BIC

Bayesian Information Criterion

CAAE

Certificate of Presentation for Ethical Consideration

CDC

Centers for Disease Control and Prevention

CNS

Brazilian National Health Council

COPD

Chronic Obstructive Pulmonary Disease

DATASUS

Department of Informatics of the Unified Health System

ICD-10

International Classification of Diseases, 10th Revision

MAE

Mean Absolute Error

MAPE

Mean Absolute Percentage Error

MSE

Mean Squared Error

PACF

Partial Autocorrelation Function

SARIMA

Seasonal Autoregressive Integrated Moving Average

SARIMAX

SARIMA with exogenous variables

SIH/SUS

Hospital Information System of the Unified Health System

UF

Federative Unit

UNIPTAN

Presidente Tancredo de Almeida Neves University Center

Authors’ contributions

D.Q., D.X., and E.M. conceptualized and designed the study. D.Q. and D.X. conducted the investigation and performed data collection. Formal analysis was carried out by D.Q. and D.X., while data curation was led by D.X. D.Q. wrote the original draft of the manuscript, with all authors contributing to the review and editing process. E.M. supervised the study and was responsible for funding acquisition. All authors reviewed and approved the final manuscript.

Funding

This research received no external funding.

Data availability

The SIH/SUS data used in this study are anonymized and publicly available at https://datasus.saude.gov.br/acesso-a-informacao. In contrast, Afya Whitebook® data are proprietary and not publicly accessible due to commercial restrictions. However, the aggregated time-series data used in the analysis are provided in the supplementary materials. These datasets allow for the replication of key analyses and validation of findings while ensuring compliance with data privacy and commercial confidentiality. Researchers requiring additional details may request access to specific data upon reasonable inquiry, subject to confidentiality agreements.

Declarations

Ethics approval and consent to participate

This study was approved by the Research Ethics Committee of the Centro Universitário Presidente Tancredo de Almeida Neves (UNIPTAN), under substantiated opinion number 6.991.028 and CAAE 81119724.6.0000.9667. The research complies with the ethical principles established by Resolution CNS 466/12 of the Brazilian National Health Council. All data used were anonymized and derived from public or internal databases, with no individual-level identifiers or direct participant interaction.

Consent for publication

Not applicable.

Competing interests

The authors are employees of Afya®, the company that owns the Afya Whitebook® application. However, Afya® had no role in the study design, data collection, analysis, interpretation, or decision to publish this manuscript. We emphasize the importance of transforming the data we generate into useful information for society, reinforcing the role of digital health tools in improving public health surveillance and decision-making.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Eduardo Moura and Danielly Xavier contributed equally to this work.

References

  • 1. WHO Regional Office for the Eastern Mediterranean. Public health surveillance. Cairo. 2023. https://www.emro.who.int/health-topics/public-health-surveillance/index.html. Cited 2023-07-07.
  • 2. Thacker SB, Berkelman RL. Public health surveillance in the United States. Epidemiol Rev. 1988;10(1):164–90.
  • 3. Martins TCDF, da Silva JHCM, Máximo GDC, Guimarães RM. Transition of morbidity and mortality in Brazil: A challenge on the thirtieth anniversary of the SUS. Cienc Saude Coletiva. 2021;26(10):4483–96.
  • 4. Kwag KH, González-Lorenzo M, Banzi R, Bonovas S, Moja L. Providing doctors with high-quality information: An updated evaluation of web-based point-of-care information summaries. J Med Internet Res. 2016;18(1):e15.
  • 5. Carneiro HA, Mylonakis E. Google Trends: A Web-Based Tool for Real-Time Surveillance of Disease Outbreaks. Clin Infect Dis. 2009;49(10):1557–64.
  • 6. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. Detecting influenza epidemics using search engine query data. Nature. 2009;457(7232):1012–4.
  • 7. Choi H, Varian H. Predicting the Present with Google Trends. Econ Rec. 2012;88(s1):2–9.
  • 8. Aiken EL, McGough SF, Majumder MS, Wachtel G, Nguyen AT, Viboud C, et al. Real-time estimation of disease activity in emerging outbreaks using internet search information. PLoS Comput Biol. 2020;16(8):e1008117. doi:10.1371/journal.pcbi.1008117.
  • 9. Afya Whitebook. Clinical decision support platform. https://whitebook.pebmed.com.br/. Accessed 13 May 2025.
  • 10. Federal Council of Medicine (CFM). Observatório CFM – Demografia Médica. 2024. https://observatorio.cfm.org.br/demografia/dashboard/. Accessed 20 Feb 2025.
  • 11. Ministry of Health of Brazil. Hospital Morbidity of SUS – SIH/SUS. 2025. https://datasus.saude.gov.br/acesso-a-informacao/morbidade-hospitalar-do-sus-sih-sus/.
  • 12. Ministério da Saúde, Brasil. Portaria SAES n° 1.110, de 18 de novembro de 2021. 2021. Estabelece diretrizes para o reenvio e processamento das bases nacionais do SIH/SUS e SIA/SUS. https://bvsms.saude.gov.br/bvs/saudelegis/saes/2021/prt1110_18_11_2021.html.
  • 13. Ministério da Saúde (Brasil). TABNET – Informações de saúde: População residente. 2024. https://datasus.saude.gov.br/populacao-residente/. Accessed 15 Jan 2024.
  • 14. de Souza WM, Weaver SC. Effects of climate change and human activities on vector-borne diseases. Nat Rev Microbiol. 2024;22(8):476–91. doi:10.1038/s41579-024-01026-0. Epub 2024 Mar 14.
  • 15. Danesh Yazdi M, Wei Y, Di Q, Requia WJ, Shi L, Sabath MB, et al. The effect of long-term exposure to air pollution and seasonal temperature on hospital admissions with cardiovascular and respiratory disease in the United States: a difference-in-differences analysis. Sci Total Environ. 2022;843:156855. doi:10.1016/j.scitotenv.2022.156855. Epub 2022 Jun 21.
  • 16. Maharjan B, Gopali RS, Zhang Y. A scoping review on climate change and tuberculosis. Int J Biometeorol. 2021;65(10):1579–95. doi:10.1007/s00484-021-02117-w. Epub 2021 Mar 16.
  • 17. Wen B, Xu R, Wu Y, Coêlho MdSZS, Saldiva PHN, Guo Y, et al. Association between ambient temperature and hospitalization for renal diseases in Brazil during 2000–2015: a nationwide case-crossover study. Lancet Reg Health Americas. 2021;6:100101. Published online 2021 Oct 31. doi:10.1016/j.lana.2021.100101.
  • 18. Al-Shihabi F, Moore A, Chowdhury TA. Diabetes and climate change. Diabet Med. 2023;40(3):e14971. doi:10.1111/dme.14971. Epub 2022 Oct 21.
  • 19. Rosenbaum D, Levitt A. Mental health and the climate crisis: a call to action for Canadian psychiatrists. Can J Psychiatry. 2023;68(11):870–5. doi:10.1177/07067437231175532. Epub 2023 Jun 5.
  • 20. Microsoft Corporation. Microsoft Excel. 2024. Version 16.89.1, Microsoft Excel for Mac. https://www.microsoft.com/excel.
  • 21. Seabold S, Perktold J, contributors. statsmodels.tsa.statespace.sarimax.SARIMAX — statsmodels v0.14.2 documentation. 2024. https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html. Accessed 14 May 2025.
  • 22. Hyndman RJ, Athanasopoulos G. Forecasting: principles and practice. 2nd ed. OTexts; 2018. https://otexts.com/fpp2/.
  • 23. Python Software Foundation. Python language reference, version 3.12.3. 2024. https://www.python.org. Accessed 14 May 2025.
  • 24. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979;6(2):65–70.
  • 25. Dai J, Xiao Y, Sheng Q, Zhou J, Zhang Z, Zhu F. Epidemiology and SARIMA model of deaths in a tertiary comprehensive hospital in Hangzhou from 2015 to 2022. BMC Public Health. 2024;24(1):2549. doi:10.1186/s12889-024-20033-7.
  • 26. Fávero LPL, Belfiore PP. Manual de Análise de Dados: Estatística e Modelagem Multivariada com Excel, SPSS e Stata. Rio de Janeiro: Elsevier; 2017.
  • 27. Lowe R, Gasparrini A, Van Meerbeeck CJ, Lippi CA, Mahon R, Trotman A, et al. Spatio-temporal dynamics of dengue in Brazil: Seasonal travelling waves and implications for control. PLOS Negl Trop Dis. 2018;12(3):e0007012. doi:10.1371/journal.pntd.0007012.
  • 28. Martinez EZ, Silva EAS, Fabbro ALD. A SARIMA forecasting model to predict the number of cases of dengue in Campinas, State of São Paulo, Brazil. Rev Soc Bras Med Trop. 2011;44(4):436–40.
  • 29. Pimentel KBA, Oliveira RS, Aragão CF, Aquino Júnior J, Moura MES, Guimarães-e Silva AS. Prediction of visceral leishmaniasis incidence using the Seasonal Autoregressive Integrated Moving Average model (SARIMA) in the state of Maranhão, Brazil. Braz J Biol. 2024;84. doi:10.1590/1519-6984.257402.
  • 30. Xavier JMdV, Silva FDdS, Olinda RAd, Querino LAL, Araujo PSB, Lima LFC. Climate seasonality and lower respiratory tract diseases: a predictive model for pediatric hospitalizations. Rev Bras Enfermagem. 2022;75(2). doi:10.1590/0034-7167-2021-0680.
  • 31. Santillana M, Nguyen AT, Dredze M, Loskill DE, Shah NH, Brownstein JS. Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance. PLoS Comput Biol. 2015;11(10):e1004513. doi:10.1371/journal.pcbi.1004513.
  • 32. Wang Y, Xu C, Wang Z, Yuan J. Seasonality and trend prediction of scarlet fever incidence in mainland China from 2004 to 2018 using a hybrid SARIMA-NARX model. PeerJ. 2019;7:e6165.


Articles from BMC Public Health are provided here courtesy of BMC
