PLOS One. 2026 Feb 18;21(2):e0332093. doi: 10.1371/journal.pone.0332093

Comparison of machine learning methods in forecasting and characterizing the birch and grass pollen season

Daniel Bulanda 1,*, Małgorzata Bulanda 2, Małgorzata Sacha 2, Adrian Horzyk 1, Dorota Myszkowska 2
Editor: Rafael Duarte Coelho dos Santos
PMCID: PMC12915917  PMID: 41706687

Abstract

The primary approach to the treatment of seasonal allergic diseases involves minimizing exposure to allergens and initiating early personalized therapy. The medication should be introduced about 7 days before the start of the pollen season and intensified during the period of the highest concentrations of sensitizing pollen. Therefore, forecasts for the concentration of pollen that causes clinical symptoms are of indisputable value to both doctors and patients. The study was carried out in Krakow (Southern Poland) with birch (Betula) and grasses (Poaceae) pollen data collected using the volumetric method in 1991-2024. The following meteorological data were collected and used in the study: temperature (mean, minimum and maximum), humidity, cloud cover, sunshine duration, mean wind speed, mean pressure at sea level, global radiation and snow depth. Eight machine learning models from four distinct families (lazy, linear, tree-based, and deep learning) were chosen to estimate the probability of the occurrence of pollen concentration within specified categories. These predictions were based on meteorological data combined with pollen concentration levels in the preceding days. Using the occurrence of pollen concentration in the selected categories as the target variable, the top-performing models achieved accuracies of 92.2%, 88.3%, and 87.2% for 1-day, 4-day, and 7-day forecasts of Betula pollen, respectively. Similarly, for Poaceae pollen, the models achieved 86.1%, 81.8%, and 80.0% accuracy for predictions of 1 day, 4 days, and 7 days ahead, respectively. In addition, a feature importance analysis and an association rule mining were performed to explain the dependencies between pollen concentration and meteorological variables. The tested machine learning methods achieve results that allow for satisfactory efficiency in predicting pollen concentration for up to seven consecutive days. 
The best-performing machine learning methods were boosted trees, associative knowledge graphs, and deep neural networks with memory cells.

Introduction

According to epidemiological studies, the current prevalence of allergic rhinitis (AR) worldwide ranges from 5% to 50%, depending on geographical latitude, while pollen-induced allergic asthma affects 1% to 18% of the population [1–4]. In recent years, an increase in the incidence of atopic allergic diseases has been observed worldwide [5,6], which may be related to the changing environmental factors that influence plant pollen seasons [5].

The prediction and early within-season characterization of pollen dynamics are of great importance for patients with pollen allergies and medical doctors. Seasonal AR negatively impacts overall quality of life, causes absences from work and school, and generates enormous costs to the healthcare system. Untreated AR increases the risk of poor asthma control and exacerbations [7]. The basic management of seasonal allergic diseases consists of avoiding exposure to the allergen and early, individually selected therapy, including allergen-specific immunotherapy [8]. The medication to control symptoms should be introduced approximately 7 days before the pollen season starts and intensified during the periods of the highest concentrations of sensitizing pollen [9].

Modern machine learning methods can help characterize pollen seasons, determine which environmental factors influence them strongly, and, most importantly, predict the changes in pollen concentration during pollen seasons [10,11]. In such scenarios, commonly used methods include linear models, boosted trees, and deep neural networks [10,12–15].

Traditionally, aerobiological time-series have been modeled using statistical approaches such as multiple linear regression, generalized additive models (GAMs), ARIMA, and dynamic regression [12,16,17]. While these methods are straightforward and interpretable, they rely on assumptions of linearity, normality, and homoscedasticity, and may struggle to capture complex, non-linear interactions between meteorological drivers and pollen release dynamics.

In recent years, machine learning techniques have emerged as powerful alternatives, capable of flexibly modeling non-linear relationships and high-order interactions without strict distributional assumptions. Astray et al. [18] developed Random Forest, Support Vector Machine, and neural network models on 24 years of Parietaria pollen data in northwest Spain, achieving a mean absolute error of approximately one day in peak-date prediction and an RMSE of 5.55 - 5.84 for one-day-ahead prediction. Cordero et al. [19] applied LightGBM and neural network ensembles to 20 years of Olea pollen data, accurately forecasting season-peak timing (mean peak date error < 1 day) and daily concentrations (RMSE 25.03 - 29.27). In Switzerland, Shokouhi et al. [20] compared linear models (LASSO, Ridge, and Elastic Net), nonlinear models (XGBoost, Random Forest, and neural networks), and ensembling approaches for birch and grass pollen, finding that tree-based and hybrid models outperformed linear approaches. For the single-model approach, they achieved an RMSE of 23.0 - 27.7 for grass and 105.7 - 140.7 for birch. Zewdie et al. [21] similarly demonstrated that Random Forests, Support Vector Machines, and neural networks could predict Ambrosia pollen with R2 between 0.21 and 0.37, with Random Forest outperforming the other models tested.

Despite these promising results, few studies have systematically benchmarked a broad suite of both linear and non-linear algorithms on long-term pollen datasets spanning multiple taxa. The aim of our work was to select the optimal machine learning methods for the prediction and characterization of pollen seasons. Here, we focus on several popular and representative models that have shown their effectiveness in pollen season forecasting. Our goal is to reveal their strengths and weaknesses, as well as their application scenarios.

Materials and methods

Data

The study was performed in Krakow (Southern Poland; near the grid point 50°N, 20°E). The city is surrounded by farmland and forests, which prevail west of the city. The study area corresponds to the municipality of Krakow, which in 2021 covered an area of 327 km2 and had a population of 780 796 inhabitants [22]. Krakow has a moderate, warm climate, transitional between maritime and continental air masses.

Daily pollen concentrations were obtained within the framework of regular airborne pollen analyses performed over 34 years by the Aerobiological Monitoring Station at the Department of Clinical and Environmental Allergology of the Jagiellonian University Medical College in Krakow. A Hirst-type volumetric sampler was used, in accordance with the European recommendations [23]. The station is located on the roof of the Collegium Śniadeckiego building, 20 meters above ground level and 200 meters above sea level (50° 3’ 49” N; 19° 57’ 19” E).

The pollen grains were sucked into a rotating drum covered with transparent tape (Melinex tape) with an adhesive fluid, which was changed once a week and then divided into seven segments corresponding to 24-hour periods. The tape fragments prepared in this way were placed on microscope slides, secured in a mixture of glycerin and gelatin with phenol (gelvatol) added, and then stained with basic fuchsin. The samples were examined using a light microscope at 400× magnification. Pollen grains were counted along 4 horizontal transects. This method meets the requirement that the examined surface covers at least 10% of the entire deposition area [24]. The number of pollen grains counted in all horizontal lines is uploaded to the online database stored on the server of Jagiellonian University Medical College [25]. Pollen concentrations were automatically recalculated and expressed as pollen grains per cubic meter of air per 24 hours (Pollen/m3).

The daily birch and grass pollen concentrations obtained during the pollen seasons, defined as the periods when pollen is present in the air, were used in the study. The first day with a pollen concentration above zero was taken as the season start, and the last day with pollen as the season end; this definition of the season is consistent with [26]. Days with a zero pollen count within the pollen seasons were also included in the analyses.

Fig 1 shows the distribution of daily birch and grass pollen concentrations in the years 1991 - 2024. The distribution of pollen concentration during the pollen season is more heterogeneous for grasses, so the creation of separate predictive models for birch and grasses was required. In addition, the pollen season dynamics of both studied taxa differ significantly over the years, making the prediction more challenging.

Fig 1. Daily birch and grass pollen concentration.


Birch (A) and grasses (B) pollen concentration data collected at the Jagiellonian University Collegium Medicum in 1991 - 2024. The solid line shows the median pollen count for each day of the year, while the dashed lines denote the first (25th percentile) and third (75th percentile) quartiles, illustrating the typical seasonal peak and interannual variability. Periods outside the pollen seasons have been shortened to improve readability.

Table 1 describes the basic statistics of the pollen concentration data. The differing numbers of data points result from the omission of data outside the pollen seasons and from the longer duration of the grass pollen seasons.

Table 1. Descriptive statistics of the daily birch and grass pollen concentrations used in the study.

taxon | data points | feature | mean | std | min | median | max | distribution
Betula | 3425 | date | – | – | 1991-04-03 | 2003-08-15 | 2024-05-11 | uniform
Betula | 3425 | Pollen/m3 | 59.902 | 232.874 | 0 | 1 | 4199 | non-normal (p < 0.001)
Poaceae | 5395 | date | – | – | 1991-05-10 | 2008-05-05 | 2024-08-31 | uniform
Poaceae | 5395 | Pollen/m3 | 15.892 | 29.6273 | 0 | 3 | 437 | non-normal (p < 0.001)

std - standard deviation, min - minimum value, max - maximum value.

The meteorological data fully cover the pollen concentration data in daily intervals. Data were collected from the European Climate Assessment and Dataset website [27] as blended data in ASCII file format. All observations were obtained from the Krakow-Balice station, located 233 meters above sea level at 50° 4’ 49” N; 19° 48’ 6” E. The horizontal distance between the weather station and the pollen collection station is 11.11 kilometers, and the height difference is 30 meters.

To our knowledge, based on reports on the meteorological data at both stations (Balice and the Kraków city center), the differences between them are relatively low. According to [28], in 2001-2010 the annual temperature in the Kraków center was higher by 0.7 °C than at the Balice station, with the differences observed mainly in winter, when the pollen seasons are over. Other variables, such as sunshine, cloud cover, and annual relative humidity (77% vs 78% for Kraków and Balice, respectively), differed only slightly.

We have selected 10 meteorological features that are statistically described in Table 2.

Table 2. Statistical description of meteorological data for birch and grass pollen seasons.

taxon | feature | unit | mean | std | min | median | max | distribution
Betula | mean temperature | 0.1 °C | 132.301 | 59.494 | -136.0 | 139.0 | 281.0 | non-normal (p < 0.001)
Betula | minimum temperature | 0.1 °C | 79.4 | 56.151 | -166.0 | 85.0 | 206.0 | non-normal (p < 0.001)
Betula | maximum temperature | 0.1 °C | 188.682 | 69.186 | -112.0 | 196.0 | 351.0 | non-normal (p < 0.001)
Betula | humidity | 1% | 37.747 | 36.838 | 0.0 | 53.0 | 98.0 | non-normal (p < 0.001)
Betula | cloud cover | okta | 4.269 | 2.154 | 0.0 | 5.0 | 8.0 | non-normal (p < 0.001)
Betula | sunshine duration | 0.1 hour | 41.804 | 45.016 | 0.0 | 21.0 | 152.0 | non-normal (p < 0.001)
Betula | mean wind speed | 0.1 m/s | 15.786 | 20.193 | 0.0 | 10.0 | 242.0 | non-normal (p < 0.001)
Betula | mean sea level pressure | 0.1 hPa | 10061.021 | 984.711 | 47.0 | 10159.0 | 10412.0 | non-normal (p < 0.001)
Betula | global radiation | W/m2 | 152.026 | 81.385 | 15.0 | 125.0 | 341.0 | non-normal (p < 0.001)
Betula | snow depth | 1 cm | 0.157 | 1.39 | 0.0 | 0.0 | 20.0 | non-normal (p < 0.001)
Poaceae | mean temperature | 0.1 °C | 160.739 | 48.353 | -29.0 | 165.0 | 281.0 | non-normal (p < 0.001)
Poaceae | minimum temperature | 0.1 °C | 108.355 | 45.422 | -78.0 | 112.0 | 222.0 | non-normal (p < 0.001)
Poaceae | maximum temperature | 0.1 °C | 218.343 | 58.706 | 2.0 | 224.0 | 373.0 | non-normal (p < 0.001)
Poaceae | humidity | 1% | 39.94 | 37.581 | 0.0 | 59.0 | 98.0 | non-normal (p < 0.001)
Poaceae | cloud cover | okta | 4.091 | 2.147 | 0.0 | 4.0 | 8.0 | non-normal (p < 0.001)
Poaceae | sunshine duration | 0.1 hour | 32.784 | 43.318 | 0.0 | 12.0 | 153.0 | non-normal (p < 0.001)
Poaceae | mean wind speed | 0.1 m/s | 15.002 | 19.031 | 0.0 | 10.0 | 275.0 | non-normal (p < 0.001)
Poaceae | mean sea level pressure | 0.1 hPa | 10063.074 | 982.175 | 56.0 | 10160.0 | 10368.0 | non-normal (p < 0.001)
Poaceae | global radiation | W/m2 | 134.308 | 77.854 | 15.0 | 102.0 | 343.0 | non-normal (p < 0.001)
Poaceae | snow depth | 1 cm | 0.002 | 0.058 | 0.0 | 0.0 | 3.0 | non-normal (p < 0.001)

std - standard deviation, min - minimum value, max - maximum value.

The Shapiro-Wilk test from the HypothesisTests.jl package [29] was used to test the normality of the data distributions; it tests the null hypothesis that the data come from a normal distribution. The test indicated that both the daily pollen concentration data and the meteorological data used in this study deviate significantly from a normal distribution.

To provide a minimal data set necessary to replicate our study, two CSV files with Betula pollen data and two files with Poaceae pollen data obtained in 2022 and 2023, together with the meteorological data for the ten selected factors, were uploaded to the Open Science Framework (DOI: 10.17605/OSF.IO/9YZCF) [30].

Input data preparation.

Each data file was converted to the CSV format and pre-processed using the DataFrames.jl package [31].

We used two data pre-processing strategies depending on the machine learning method being used.

For models that can take time-series data as input (for example, recurrent neural networks), we used the past pollen concentration data for an appropriate interval (1-20 days, depending on the experimental variant) preceding the predicted day by n days. For example, for a time window of 14 days, to predict 4 days into the future, we used data from the interval between 18 and 4 days in the past. We used the same approach for meteorological data, including the weather forecast simulation, so we included meteorological data for the predicted day and the n preceding days. In the experiments, the weather forecast simulation used historical meteorological data rather than an external forecast model. Specifically, for time window w and prediction horizon n, we treated the recorded daily meteorological values on the target day and the preceding n + w days as forecast inputs. This design isolates the performance of the pollen-forecasting algorithms from uncertainties in real-time weather predictions.
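The windowing scheme above can be sketched in a few lines; this is an illustrative sketch, and the function name make_window is ours, not from the study's code:

```python
# Illustrative sketch: building a lagged input window for a prediction
# horizon of n days. For time window w and horizon n, the inputs are the
# values recorded between n + w and n days before the target day.

def make_window(series, target_index, horizon, window):
    """Return the lagged input slice and the target value.

    series       - list of daily values (pollen or meteorology)
    target_index - index of the day to predict
    horizon      - n, number of days ahead
    window       - w, length of the historical interval
    """
    start = target_index - horizon - window
    stop = target_index - horizon + 1  # inclusive of day (target - n)
    if start < 0:
        raise ValueError("not enough history")
    return series[start:stop], series[target_index]

# Example: window w = 14, horizon n = 4 uses days 18 .. 4 before the target.
daily = list(range(30))  # stand-in for daily concentrations
inputs, target = make_window(daily, 25, horizon=4, window=14)
```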

We used rolling windows and moving averages for models that prefer tabular data (for example, decision trees). For each feature, additional columns were prepared that contained exponential moving averages (EMAs) [32] and their derivatives. The exponential moving average is given by Eq (1), where t is the index of the most recent point in the window, α is a smoothing factor, and x_t is the feature value at point t.

EMA_t = (x_t + (1−α)·x_{t−1} + (1−α)²·x_{t−2} + … + (1−α)^t·x_0) / (1 + (1−α) + (1−α)² + … + (1−α)^t)   (1)

Let i be the index of a data point and f a feature name; the following features were then constructed:

  1. ema3 = EMA(data[f][(i−2):i])

  2. ema7 = EMA(data[f][(i−6):i])

  3. ema20 = EMA(data[f][(i−19):i])

  4. ema1/3 = data[f][i] / ema3

  5. ema3/7 = ema3 / ema7

  6. ema7/20 = ema7 / ema20

Similarly, given a time window w, to predict n days in the future, we used data from the interval between n + w and n days in the past. For meteorological data, we also included preprocessed meteorological data for the predicted day and n days before.
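As a minimal Python sketch of the EMA of Eq (1) and the six derived columns; two assumptions are ours and not taken from the study's code: the slash in the feature names (e.g., ema3/7) denotes a ratio, and the smoothing factor is the common choice α = 2/(window + 1):

```python
# A minimal sketch of the EMA from Eq (1) and the six derived columns.
# Assumptions (ours, not from the study's code): the slash in the feature
# names denotes a ratio, and alpha = 2 / (window + 1).

def ema(values, alpha=None):
    """Weighted-average form of the exponential moving average, Eq (1)."""
    if alpha is None:
        alpha = 2.0 / (len(values) + 1)
    num = den = 0.0
    for k, x in enumerate(reversed(values)):  # k = 0 is the newest point
        w = (1.0 - alpha) ** k
        num += w * x
        den += w
    return num / den

def ema_features(data, i):
    """The six derived columns for data point i."""
    e3 = ema(data[i - 2 : i + 1])
    e7 = ema(data[i - 6 : i + 1])
    e20 = ema(data[i - 19 : i + 1])
    return {
        "ema3": e3, "ema7": e7, "ema20": e20,
        "ema1/3": data[i] / e3, "ema3/7": e3 / e7, "ema7/20": e7 / e20,
    }

# For a rising series, shorter windows track the latest values more closely.
feats = ema_features([float(v) for v in range(1, 40)], 30)
```

For a monotonically increasing series, the derived ratios exceed 1, which is how these columns encode the local trend of a feature.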

Target variable and measures.

In this work, the main goal of predictive models is to forecast pollen concentrations 1, 4, and 7 days into the future using supervised learning. Since the target variable is numeric, this is a regression problem. To assess the quality of the regression, the mean absolute error (MAE) described by Eq (2) and root mean square error (RMSE) described by Eq (3) metrics were used.

MAE(y, ŷ) = (1/N) · Σ_{i=1}^{N} |y_i − ŷ_i|   (2)
RMSE(y, ŷ) = √( (1/N) · Σ_{i=1}^{N} (y_i − ŷ_i)² )   (3)

In these equations, y_i, ŷ_i, and N denote a reference target value, a predicted target value, and the number of observations, respectively.
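For reference, Eqs (2) and (3) in plain Python; a didactic sketch, not the evaluation code used in the study:

```python
import math

# Plain-Python versions of Eq (2) and Eq (3); y is the observed series
# and y_hat the predictions.

def mae(y, y_hat):
    """Mean absolute error, Eq (2)."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean square error, Eq (3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))
```

Note that RMSE penalizes large errors more heavily than MAE, which matters for peak pollen days.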

However, from a clinical point of view, it is more important to classify the pollen concentration into a specific category, allowing appropriate preventive measures to be taken. Three categories of daily pollen concentrations that cause allergic rhinitis symptoms were selected and adjusted based on personal observation of symptoms in Krakow patients [14]. Thus, from a technical point of view, this is a classification problem in which the target variable is categorical, with three possible classes for each taxon. The categories, along with the frequency of occurrence of each class, are as follows:

  1. Betula
    • (a) low: 1 – 10 Pollen/m3, 76.0%,
    • (b) medium: 11 – 75 Pollen/m3, 12.8%,
    • (c) high: >75 Pollen/m3, 11.2%,
  2. Poaceae
    • (a) low: 1 – 10 Pollen/m3, 68.3%,
    • (b) medium: 11 – 50 Pollen/m3, 22.6%,
    • (c) high: >50 Pollen/m3, 9.1%.

To assess the quality of classification for each class, we used the accuracy defined by Eq (4) where TP, TN, FP, FN denote a true positive value, a true negative value, a false positive value, and a false negative value, respectively.

accuracy = (TP + TN) / (TP + TN + FP + FN)   (4)
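The class assignment and Eq (4) can be sketched as follows; the thresholds are those listed above, while the function and constant names are illustrative only:

```python
# Sketch of the class assignment and the per-class accuracy of Eq (4).
# The thresholds follow the Betula/Poaceae categories listed above; the
# names BINS, pollen_class, and accuracy are ours.

BINS = {
    "Betula": [(10, "low"), (75, "medium")],
    "Poaceae": [(10, "low"), (50, "medium")],
}

def pollen_class(taxon, concentration):
    """Map a daily concentration (Pollen/m3) to low / medium / high."""
    for upper, label in BINS[taxon]:
        if concentration <= upper:
            return label
    return "high"

def accuracy(y_true, y_pred, positive):
    """Per-class accuracy: (TP + TN) / (TP + TN + FP + FN), Eq (4)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return (tp + tn) / len(y_true)
```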

Machine learning methods

Machine learning is used to model phenomena based only on data. It is particularly useful for problems in which the relationship between parameters is complicated or the functional relationship between input and output variables is unknown [21].

There is a huge number of machine learning methods that can make predictions and analyze pollen concentrations based on historical data. Popular choices are linear models, boosted trees, and deep neural networks [10,12–15]. We selected several popular and representative methods that have proved useful in pollen season prediction. In addition, we included the Multi-Associative Graph Network (MAGN), which has proven utility in data mining and prediction problems [33]. Table 3 describes the selected models along with information about the origin of the source code. Except for linear regression, the machine learning methods used in this study are nonparametric and do not require parametric assumptions, making them more robust to violations such as nonnormality.

Table 3. Overview of the machine learning methods evaluated in this study.

For each model, the table lists its methodological family, canonical theoretical reference, and the specific software implementation used in our experiments. The theoretical references point readers to foundational publications describing each method, while implementation references provide links to the exact packages and libraries used to ensure reproducibility.

family | name | theory | implementation
lazy | K-Nearest Neighbors | [34] | NearestNeighborModels.jl [35]
lazy | MAGN | [33] | witchnet [36]
linear | Linear Regression | [37] | MLJLinearModels.jl [38]
tree-based | Decision Trees | [39] | DecisionTree.jl [40]
tree-based | Random Forest | [41] | DecisionTree.jl [40]
tree-based | XGBoost | [42] | XGBoost.jl [43]
deep learning | Convolution | [44] | Flux.jl [45]
deep learning | LSTM | [46] | Flux.jl [45]
deep learning | GRU | [47] | Flux.jl [45]

The models were used with default hyperparameter settings, and their values for individual models are available in the supplementary material (S2 File).

The following subsections briefly describe each of the models listed above.

Multi-associative graph network.

The Multi-Associative Graph Network (MAGN) [33] is a novel graph-based approach to representing and processing large-scale training data along with the key relationships between data elements. Unlike conventional feedforward neural networks, where backpropagation is the primary training mechanism, MAGN uses a recursive, feedback-oriented graph structure inspired by how the human brain stores and retrieves information. This design makes it easier to incorporate new data on the fly and to adapt existing models without a complete retraining phase.

The features are represented as sensory fields composed of sensory neurons using a dedicated structure called ASA-graphs [48]. This data structure vertically relates feature values, aggregates duplicates, and is seamlessly combined with other MAGN neurons. The data neurons in the graph represent objects from the database, whereas the connections capture existing and newly discovered relationships. A tuning algorithm further refines the graph by learning which neurons (and their associated objects) should be prioritized, improving the network capacity for more accurate classification and more substantial relational dependencies. Overall, MAGN aims to offer a flexible, brain-inspired architecture that can quickly handle new information, preserve and exploit relational structures, and enable more efficient computational intelligence processes than standard feed-forward systems.

Fig 2 illustrates how MAGN encodes numerical and categorical information into a structured, multi-layer graph that supports learning through association rather than gradient-based optimization. Each sensory field (A and B) corresponds to a specific input feature, and the sensory neurons within them (A.1, A.2, A.3; B.1, B.2, B.3) represent distinct observed feature values aggregated across the dataset. The duplicate counter beneath each sensory neuron reflects the frequency of that value, allowing MAGN to prioritize more statistically relevant information. Object neurons (O.1, O.2, O.4) serve as integrative units connecting the individual feature values that co-occur within a training instance. Defining connections (solid arrows) establish the composition of an object in the feature space, whereas similarity connections (dotted lines) capture the statistical co-occurrence of feature values across many objects. This relational structure enables MAGN to retrieve patterns by following associative paths rather than performing iterative optimization, which explains its strong performance across accuracy, MAE, execution time, and memory metrics. The interactions between sensory neurons, object neurons, and associations enable MAGN to store training data, discover new relationships, and generalize over previously seen data.

Fig 2. MAGN structure.


Schematic structure of the Multi-Associative Graph Network (MAGN). Sensory fields (rectangular nodes A and B) contain sensory neurons representing unique feature values (green circles), each with an associated duplicate counter indicating how many times the value appeared in the training data. Object neurons (blue circles) represent individual data samples and are linked to the sensory neurons through defining connections, which encode the feature composition of each object. Similarity connections between sensory neurons capture statistical associations between feature values across the dataset, while duplicate counters on object neurons reflect the frequency of feature values and their patterns. Together, these components form a hierarchical associative graph that enables incremental learning, relational reasoning, and efficient retrieval.

This work used the lazy regressor and classifier based on neural activation propagation in MAGN for predictions. Similarity connections were used to fuzzify the inputs represented by the sensory neurons. With fuzzy activation, the signal propagates to the data neurons through the defining connections. Then, an algorithm called similarity voting estimates the target value using different techniques depending on the type of the target variable (classification for categorical variables and regression for numerical variables).

In addition, due to its efficient structure, MAGN was used to calculate mutual information [49] to analyze the relationships between the input and target features. Mutual information measures how much knowledge of one random variable reduces the uncertainty about another. Mathematically, mutual information is defined by Eq (5) [50], where P_(X,Y) is the joint probability distribution of the pair of random variables X and Y, both defined in the same probability space, and P_X and P_Y are the marginal distributions of X and Y, respectively.

MI(X, Y) = Σ_{x∈X} Σ_{y∈Y} P_(X,Y)(x, y) · log[ P_(X,Y)(x, y) / (P_X(x) · P_Y(y)) ]   (5)
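Eq (5) can be computed directly from empirical counts, as in the following sketch; it mirrors the definition, not MAGN's internal computation:

```python
import math
from collections import Counter

# Direct implementation of Eq (5) from empirical joint counts.

def mutual_information(xs, ys):
    """Mutual information (in nats) of two discrete samples of equal length."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        # pxy / (px/n * py/n) simplifies to c * n / (px * py)
        mi += pxy * math.log(c * n / (px[x] * py[y]))
    return mi
```

For identical variables MI equals the entropy of the variable, and for independent variables it is zero.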

MAGN was also used to extract frequent patterns and association rules to characterize the pollen season. A frequent pattern is a combination of feature values (captured by the sensory neurons) that appears at least as often as a specified minimum support threshold [51]. The support of a pattern A measures how frequently the pattern appears in the dataset and is defined by Eq (6).

support(A) = counter(A) / counter(objects)   (6)

An association rule is an if-then statement of the form A ⇒ B, where A and B are frequent patterns [52]. It indicates that whenever pattern A appears in an object, pattern B is also likely to appear. Frequent patterns and association rules are highly effective in data mining for uncovering hidden relationships within datasets. For a given rule A ⇒ B, the confidence measures the strength of the implication as the conditional probability that pattern B occurs given that pattern A occurs [53]. Confidence is a measure of the certainty of a rule and is defined by Eq (7), where P(B|A) is the conditional probability (the probability of an event occurring given that another event is already known to have occurred).

confidence(A ⇒ B) = P(B|A)   (7)

Lift measures the strength of association between the feature values in a rule compared to what would be expected if they were independent [54]. Lift is defined by Eq (8).

lift(A ⇒ B) = P(B|A) / P(B)   (8)
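A toy example of Eqs (6)-(8), with each "object" represented as a set of discretized feature values; the data and the pattern encoding are invented for illustration:

```python
# Worked illustration of support (6), confidence (7), and lift (8) on a
# toy set of daily "objects". The transactions and value labels are
# invented; real patterns come from MAGN's sensory neurons.

transactions = [
    {"temp=high", "humidity=low", "pollen=high"},
    {"temp=high", "humidity=low", "pollen=high"},
    {"temp=high", "humidity=high", "pollen=low"},
    {"temp=low", "humidity=high", "pollen=low"},
]

def support(pattern):
    """Fraction of objects containing the pattern, Eq (6)."""
    return sum(pattern <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """P(B|A) estimated from counts, Eq (7)."""
    return support(a | b) / support(a)

def lift(a, b):
    """Ratio of observed co-occurrence to independence, Eq (8)."""
    return confidence(a, b) / support(b)
```

A lift above 1 indicates that the antecedent and consequent co-occur more often than expected under independence.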

K-Nearest neighbors.

The k-Nearest Neighbors (k-NN) algorithm [34] is a non-parametric classification and regression method that assigns target values to predicted data based on the target-variable values of the k closest training samples in the n-dimensional feature space. It calculates the distance (commonly the Euclidean distance) between the query point and all training points to determine the nearest neighbors. k-NN is a lazy algorithm, meaning that it does not build a model but instead makes predictions at runtime based on the raw training data. The choice of k is crucial: a small k may lead to overfitting, while a large k may oversmooth the decision boundaries [55]. Despite its simplicity, k-NN performs well in low-dimensional spaces but struggles with high-dimensional data due to the curse of dimensionality [56], because all dimensions influence the result equally, even if some of them are irrelevant; there is no attention mechanism that could prioritize the more essential data features.

The k-NN algorithm has also been widely analyzed in the context of practical data pre-processing requirements. In many applications, the performance of k-NN depends strongly on feature scaling, because distance metrics such as Euclidean or Manhattan distance are sensitive to differences in feature ranges. Standardization or normalization is therefore typically applied to ensure that no single variable disproportionately influences the computed distances [57]. Additionally, various distance metrics can be used to better adapt k-NN to specific data distributions or to reduce the impact of correlated features. These enhancements enable the algorithm to be more robust in heterogeneous datasets, such as meteorological time series combined with biological variables.
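A minimal sketch of k-NN regression with z-score standardization, as discussed above; this is our illustration, not the NearestNeighborModels.jl implementation:

```python
import math

# Minimal k-NN regressor with z-score standardization. Standardizing first
# keeps features with large ranges (e.g., pressure in 0.1 hPa) from
# dominating the Euclidean distance.

def standardize(rows):
    """Return standardized rows and a function to scale new points."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]  # guard against zero variance
    scale = lambda r: [(v - m) / s for v, m, s in zip(r, means, stds)]
    return [scale(r) for r in rows], scale

def knn_predict(X, y, query, k=3):
    """Mean target of the k nearest training points (lazy prediction)."""
    dists = sorted((math.dist(x, query), target) for x, target in zip(X, y))
    return sum(t for _, t in dists[:k]) / k

X_raw = [[0.0], [1.0], [2.0], [10.0]]
y = [0.0, 1.0, 2.0, 10.0]
X, scale = standardize(X_raw)
pred = knn_predict(X, y, scale([1.0]), k=3)
```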

Linear regression.

Linear regression is a fundamental machine learning method used to model the relationship between a dependent variable y and n independent variables x_1, ..., x_n by fitting the linear equation Eq (9) [37]. The parameters (β coefficients) are usually estimated by minimizing the sum of squared residuals (a residual being the difference between the observed and predicted value, y_i − ŷ_i). Despite its simplicity and interpretability, linear regression assumes linearity, independence, homoscedasticity (equal or similar variances in different groups), and normality of residuals, making it sensitive to outliers and multicollinearity [58].

y = β_0 + β_1·x_1 + β_2·x_2 + … + β_n·x_n   (9)

In practical applications, several extensions of classical linear regression are used to mitigate its limitations and improve predictive performance. Techniques such as regularization, including Ridge (L2) and Lasso (L1) regression [59], add penalty terms to the loss function to reduce overfitting and address multicollinearity by shrinking or eliminating coefficients. These modifications help stabilize parameter estimates and improve generalization, particularly when predictors are highly correlated or when the number of features is large relative to the number of observations. Regularized linear models have been widely used in environmental and atmospheric sciences due to their robustness and ability to handle noisy real-world data [15].
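As a worked illustration of the L2 penalty, here is a closed-form one-predictor ridge fit; a didactic sketch under our own naming, not MLJLinearModels.jl:

```python
# One-predictor ridge regression in closed form, illustrating the L2
# penalty: the slope is shrunk toward zero as lam grows, and lam = 0
# recovers ordinary least squares.

def ridge_1d(xs, ys, lam=0.0):
    """Return (beta0, beta1) for y = beta0 + beta1 * x with L2 penalty lam."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    beta1 = sxy / (sxx + lam)   # shrinkage: larger lam -> smaller |beta1|
    beta0 = my - beta1 * mx     # intercept is left unpenalized
    return beta0, beta1

b0, b1 = ridge_1d([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], lam=0.0)
```

Leaving the intercept unpenalized is the standard convention, since shifting all targets by a constant should not change the fit.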

Tree-based methods.

A decision tree is a supervised machine learning algorithm structured as a tree where each internal node represents a decision based on a feature, each branch represents a subset of feature values, and each leaf node represents a predicted class or value [39]. The tree is constructed using algorithms such as ID3 or CART, which recursively split data based on criteria such as information gain, Gini impurity, or variance reduction [60]. Decision trees are interpretable and can handle both numerical and categorical data. However, they are prone to overfitting, especially with deep trees, which can be mitigated using pruning or more advanced models such as gradient boosting and random forests.

Random Forest is an ensemble learning method that builds multiple decision trees and combines their results to improve predictive accuracy and reduce overfitting [41]. It works by training each tree on a random subset of the data and selecting a random subset of features at each split to enhance diversity. Predictions are made by voting among trees. The model is highly robust to overfitting and works well with large datasets, but it is less interpretable than simpler models. Random Forest is widely used in various applications due to its versatility, scalability, and ability to effectively handle missing data and imbalanced datasets [61].

XGBoost (Extreme Gradient Boosting) [42] is a machine learning model based on gradient boosting. This technique assumes that the next model minimizes the overall prediction error when combined with previous models. It builds an ensemble of decision trees sequentially, where each new tree corrects the errors of the previous ones using gradient-based optimization. XGBoost incorporates regularization techniques (such as L1 and L2 penalties) to prevent overfitting, making it more robust than traditional gradient boosting methods. Due to its high predictive accuracy and execution speed, XGBoost has become a dominant choice in machine learning competitions and real-world applications, including medical diagnosis [62].
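The sequential error-correction idea can be illustrated with a toy squared-error booster that fits a decision stump to the residuals of the current ensemble at each stage; this sketches gradient boosting in general, not XGBoost's regularized objective:

```python
# Toy gradient boosting for squared error: each stage fits a decision
# stump to the residuals of the current ensemble predictions.

def fit_stump(xs, residuals):
    """Best single-split predictor minimizing squared error on residuals."""
    best = None
    for split in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, split, lm, rm)
    _, split, lm, rm = best
    return lambda x: lm if x <= split else rm

def boost(xs, ys, rounds=20, lr=0.5):
    """Return an additive predictor: base value plus lr-scaled stumps."""
    base = sum(ys) / len(ys)
    pred = [base] * len(xs)
    models = []
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]   # current errors
        stump = fit_stump(xs, resid)
        models.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    def predict(x):
        return base + sum(lr * m(x) for m in models)
    return predict

predict = boost([0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 10.0, 10.0])
```

The learning rate lr shrinks each correction, trading more rounds for better generalization; XGBoost additionally penalizes tree complexity and leaf weights.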

Deep neural networks.

Deep neural networks (DNNs) are multilayer artificial neural networks that consist of multiple hidden layers between the input and output layers. Using techniques such as backpropagation and activation functions, DNNs have achieved state-of-the-art performance in various fields, including time series prediction [44].

Convolution is a mathematical operation used in signal processing and deep learning, where a filter (kernel) slides over input data to extract features by computing element-wise multiplications and summing the results [44]. In deep learning, 1-dimensional convolutional layers help capture temporal dependencies in sequences. The convolution operation is efficient because it enables weight sharing and local connectivity, significantly reducing the number of parameters compared to fully connected layers [63].

Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) designed to overcome the vanishing gradient problem, allowing it to learn long-term dependencies in sequential data [46]. LSTMs use input, forget, and output gates to regulate the flow of information, ensuring that relevant data is retained while irrelevant information is discarded. The cell state acts as a memory unit, allowing LSTMs to store and update information across long sequences, making them well-suited for time series forecasting. Despite their effectiveness, LSTMs can be computationally expensive, which can limit their practical applicability.

Similarly to LSTMs, Gated Recurrent Units (GRUs) are a type of RNN designed to capture long-term dependencies in sequential data while addressing the problem of vanishing gradients [47]. Unlike LSTMs, GRUs use only two gates: the reset and update gates, making them computationally more efficient. The update gate determines how much past information should be carried forward, while the reset gate controls how much new information should be incorporated. GRUs have been widely applied in time series forecasting, often performing comparably to LSTMs with fewer parameters [64], reducing computational complexity without sacrificing performance.
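The gate equations can be written directly from this description; below is one step of a scalar GRU cell in Python, with made-up weights rather than trained values:

```python
import math

# One step of a scalar GRU cell (hidden size 1) showing the two gates.
def sigmoid(v):
    return 1 / (1 + math.exp(-v))

def gru_step(x, h, w):
    z = sigmoid(w["wz"] * x + w["uz"] * h + w["bz"])  # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h + w["br"])  # reset gate
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h) + w["bh"])
    return (1 - z) * h + z * h_cand                   # blend old and new state

# Illustrative (untrained) weights.
w = dict(wz=0.5, uz=0.1, bz=0.0, wr=0.8, ur=-0.2, br=0.0,
         wh=1.0, uh=0.5, bh=0.0)

h = 0.0
for x in [0.2, 0.9, -0.3]:  # a short input sequence
    h = gru_step(x, h, w)
```

Because the new state is a convex combination of the old state and a bounded candidate, the hidden value stays in (-1, 1), which is one reason gated units train more stably than plain RNNs.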

Fig 3 shows the architecture of the deep neural network used in the experiments.

Fig 3. Deep neural network architecture.


The input features, such as pollen concentration or meteorological data, are in the form of time series. Convolutional layers, LSTM/GRU layers, and dense layers were applied sequentially. The output is a time series predicting the pollen concentration in the following days.

Experiments

Two experiments were carried out independently for birch and grass pollen:

  1. Experiment 1: prediction of predefined pollen concentration classes (low, medium, high) 1, 4, and 7 days ahead,

  2. Experiment 2: characterization of pollen seasons using machine learning methods.

The experiments were carried out using the MLJ.jl package [65]. The following subsections describe each experiment in detail.

Experiment 1.

The goal of Experiment 1 was to systematically compare the forecasting accuracy of the eight machine learning algorithms listed in Table 3 across three forecast horizons (1, 4, and 7 days ahead).

For each forecast horizon, feature vectors were prepared as detailed in the Input Data Preparation section.

We evaluated each model on two parallel tasks:

  • Regression: predicting the continuous pollen concentration, assessed via Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).

  • Classification: predicting discrete pollen concentration categories, assessed by overall accuracy as described in the Target Variable and Measures subsection.
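For reference, the three measures can be computed as follows (a short Python sketch with illustrative values; the study's implementation used Julia):

```python
import math

# RMSE and MAE for the regression task, accuracy for the classification task.
def mae(obs, pred):
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def accuracy(obs, pred):
    return sum(o == p for o, p in zip(obs, pred)) / len(obs)

# Illustrative values, not study data.
observed = [10.0, 50.0, 120.0]
predicted = [12.0, 45.0, 100.0]
classes_obs = ["low", "medium", "high", "high"]
classes_pred = ["low", "medium", "medium", "high"]
```

Note that RMSE penalizes large errors more heavily than MAE, which is why the two metrics can rank models differently on peaky pollen series.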

To prevent temporal leakage, we grouped records by calendar year and randomly assigned 70% of years to the training set and the remaining 30% to the independent test set, ensuring no overlap in years between partitions.
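A minimal Python sketch of such a year-grouped split (the seed and the per-year dummy records are illustrative):

```python
import random

# Whole calendar years, not individual days, are assigned to train or test,
# so no season is split across the two partitions.
random.seed(0)
years = list(range(1991, 2025))  # 34 seasons, as in the study period
random.shuffle(years)
cut = int(round(0.7 * len(years)))
train_years, test_years = set(years[:cut]), set(years[cut:])

# Dummy (year, pollen-value) records standing in for the daily feature vectors.
records = [(y, d) for y in range(1991, 2025) for d in (100, 150, 200)]
train = [r for r in records if r[0] in train_years]
test = [r for r in records if r[0] in test_years]
```

Splitting day-level records at random instead would place adjacent, highly autocorrelated days in both partitions and inflate the test scores.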

All algorithms were trained using their default hyperparameter settings to maintain fair comparability. Performance metrics were computed exclusively on the held-out test set for each horizon.

This design allowed us to identify which methods most effectively capture both the numeric trends and categorical thresholds of birch and grass pollen seasons under varying lead times.

Experiment 2.

Building on the predictive performance established in Experiment 1, Experiment 2 aimed to interpret and characterize the relationships between the input and target variables. It comprised three components:

  • Feature importance analysis: the feature importance learned by the top-performing model was used to rank each predictor’s contribution to model accuracy.

  • Mutual information assessment: the normalized mutual information between each input variable and the target (pollen concentration or category) was computed to quantify their statistical dependence without assuming linear relationships.

  • Association rule mining: association rules mined by MAGN were used to uncover patterns of meteorological and pollen variables that precede high- or low-pollen days.

These analyses collectively reveal which factors and combinations of them most strongly influence pollen forecasts and offer an interpretable characterization of pollen season dynamics.
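As an illustration of the second component, normalized mutual information between two discrete variables can be computed as below, using the common 2·I(X;Y)/(H(X)+H(Y)) normalization (the paper's exact normalization variant is not restated here, and the category values are invented):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def nmi(xs, ys):
    """Normalized mutual information: 2*I(X;Y) / (H(X) + H(Y))."""
    n = len(xs)
    hx, hy = entropy(xs), entropy(ys)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = sum(c / n * math.log((c / n) / (px[a] / n * py[b] / n))
             for (a, b), c in joint.items())
    return 2 * mi / (hx + hy) if hx + hy > 0 else 0.0

# Invented categories: perfectly dependent variables give NMI = 1.
temp_class = ["warm", "warm", "cold", "cold", "warm", "cold"]
pollen_class = ["high", "high", "low", "low", "high", "low"]
```

Because mutual information is defined on joint frequencies rather than a fitted line, it captures nonlinear dependence, which is why the study uses it alongside model-based feature importance.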

Software and hardware.

All experiments were implemented in the Julia programming language v1.11.3 [66]. The packages used are listed in the appropriate subsections above. The source code is publicly available under the MIT license [67].

The experiments were carried out on a dedicated workstation with one AMD Ryzen Threadripper Pro 5965WX CPU, 128 GB RAM, and two NVIDIA GeForce 3090 GPUs. To ensure stable thermal conditions during testing, the CPU and GPUs were cooled with a custom water cooling system, and the workstation was placed in a server cabinet with a dedicated connection for mechanical ventilation.

Results

An overview of the study design and main findings is shown in the Supporting information (S1 Fig).

Experiment 1

Table 4 presents the results of the experiment separately for each taxon and each forecast horizon. The results are sorted by accuracy, and the winning model for each metric is marked in bold. For birch pollen, XGBoost and MAGN deliver the highest classification accuracy. For grass pollen, XGBoost, Random Forest, and MAGN lead in the 1- and 4-day forecasts. However, for 7-day predictions, the Conv-LSTM DNN outperforms the others, likely reflecting its superior ability to capture complex, multidimensional temporal patterns.

Table 4. Pollen concentration forecasting results.

Taxon  Days ahead  Model  Accuracy  MAE  RMSE  Time  Memory
Betula 1 XGBoost 0.922 27.15 136.764 17.336 7340
MAGN 0.916 26.937 130.66 7.332 837
Decision Trees 0.894 26.625 130.846 13.999 7336
K-Nearest Neighbors 0.867 26.874 126.844 10.595 7338
Random Forest 0.841 28.494 124.383 9.314 7536
Conv-LSTM DNN 0.84 33.279 158.074 1602.9 214337
Conv-GRU DNN 0.719 37.767 159.795 3148.41 228738
Linear Regression 0.602 71.349 173.272 11.58 7337
4 MAGN 0.883 39.327 160.51 4.342 561
XGBoost 0.879 40.024 176.681 12.709 7324
Decision Trees 0.864 47.092 207.021 8.078 7321
Conv-LSTM DNN 0.844 31.051 152.126 1602.9 2995
K-Nearest Neighbors 0.836 42.101 178.778 9.867 7435
Conv-GRU DNN 0.814 36.706 159.676 3148.41 228738
Random Forest 0.715 42.368 151.122 9.950 7477
Linear Regression 0.516 78.464 180.295 9.49 7325
7 MAGN 0.872 42.069 168.523 2.02 484
XGBoost 0.872 51.319 208.498 12.586 7304
Decision Trees 0.862 64.129 257.572 11.525 7301
K-Nearest Neighbors 0.844 54.072 197.97 12.024 7377
Conv-LSTM DNN 0.83 40.005 159.748 2183.45 208284
Conv-GRU DNN 0.808 39.95 165.033 3148.41 228738
Random Forest 0.7 50.885 167.056 11.605 7398
Linear Regression 0.503 79.971 182.744 8.773 7306
Poaceae 1 Random Forest 0.861 7.864 20.464 20.245 10579
MAGN 0.857 8.852 25.32 14.549 737
XGBoost 0.848 8.109 21.109 25.60 10239
Decision Trees 0.827 9.313 22.651 12.987 10234
Linear Regression 0.805 11.373 23.236 23.80 10237
Conv-LSTM DNN 0.787 12.325 31.441 4509.94 390109
K-Nearest Neighbors 0.77 11.071 24.437 19.86 10237
Conv-GRU DNN 0.725 12.979 29.396 5493.77 422627
4 XGBoost 0.818 9.895 25.43 22.58 10216
Random Forest 0.802 9.972 24.43 15.1031 10482
MAGN 0.802 10.598 27.36 7.851 665
K-Nearest Neighbors 0.799 10.725 26.53 19.424 10214
Conv-LSTM DNN 0.796 10.525 27.265 2995.14 391027
Decision Trees 0.786 11.318 27.232 20.315 10211
Conv-GRU DNN 0.762 11.83 27.84 5493.77 422627
Linear Regression 0.73 12.606 25.807 14.131 10213
7 Conv-LSTM DNN 0.8 10.416 27.833 4509.94 390109
Random Forest 0.775 11.497 26.905 20.177 10351
MAGN 0.771 12.235 29.771 3.985 419
K-Nearest Neighbors 0.768 12.072 28.719 17.976 10183
XGBoost 0.766 11.876 28.316 18.549 10184
Conv-GRU DNN 0.754 12.268 29.294 5493.77 422627
Decision Trees 0.747 13.131 30.837 13.0 10182
Linear Regression 0.666 14.537 28.169 21.908 10182

Time unit: seconds; memory unit: megabytes.

Fig 4 shows the comparison of models based on the average results for accuracy, MAE, execution time, and total memory used during training and prediction. XGBoost and MAGN performed similarly well and outperformed the other models overall, while MAGN proved the most efficient in terms of memory usage and runtime. This efficiency comes from the algorithm-as-a-structure approach, where multiple relations are represented directly in the structure of the neural network, eliminating the need to compute these relations in the training phase.

Fig 4. Comparison of machine learning models based on metric averages.


The machine learning methods are compared using the averages of all variants of Experiment 1 for accuracy (A), MAE (B), execution time in seconds (C), and total memory used during training and prediction in megabytes.

Fig 5 shows the observed and predicted birch and grass pollen concentrations produced by the XGBoost model for sample seasons. This model was selected because it had the best prediction accuracy for this sample period.

Fig 5. Prediction examples.


The observed (gray) and predicted (green or orange) Betula (A) and Poaceae (B) pollen concentrations. The predictions were made by the XGBoost model for 3 randomly selected pollen seasons from the test set.

Experiment 2

Fig 6 shows feature importance and normalized mutual information for all input variables in predicting pollen concentration. Across both birch and grass, past pollen levels are by far the strongest predictor, with temperature ranking second (relative normalized mutual information = 36.1%–65.0%).

Fig 6. Feature importance and normalized mutual information.


The top row shows the importance of the feature from the XGBoost model for Betula (A1) and Poaceae (B1). The bottom row shows the normalized mutual information for Betula (A2) and Poaceae (B2).

A deeper look at relative normalized mutual information for seven lower-ranked meteorological drivers reveals that humidity contributes 7.7%–8.1%, sunshine duration 6.4%–9.5%, radiation 4.7%–6.4%, cloud cover 4.3%–5.3%, wind speed 2.0%–3.9%, sea level pressure 2.2%–3.0%, and snow depth 0.5%–1.6%. Although these secondary factors fall below past pollen and temperature in overall importance, they still make non-negligible contributions. This demonstrates that including a broader suite of environmental variables can further refine model performance under varying weather conditions.

Fig 7 illustrates the association rules linking combinations of meteorological variables from the previous three days with the resulting pollen concentration classes for Betula (A) and Poaceae (B). Several notable patterns emerge from the plot. First, the large circles located in the upper-right area of each panel represent rules with simultaneously high support and high confidence, indicating that certain combinations of temperature, humidity, cloud cover, wind speed, and sunshine duration frequently precede specific pollen concentration levels and do so with high reliability. The color scale further shows that many of these high-confidence rules also exhibit elevated lift values, meaning that the occurrence of these meteorological conditions increases the likelihood of the pollen class more than expected by chance.

Fig 7. Association rules analysis for meteorological data.


The association rules where the antecedent is the combination of average temperature, cloud cover, humidity, mean wind speed, and duration of the sunshine of the previous 3 days and the consequence is the pollen concentration class for Betula (A) and Poaceae (B). The bigger the dot, the bigger the support.

A key point of interest is the presence of a few large bubbles with high confidence and support in both pollen types, suggesting that some meteorological patterns are strongly and consistently associated with typical seasonal peaks. For Betula (panel A), the highest-lift rules tend to cluster at moderate support values, implying that particularly strong meteorological triggers occur less frequently but are highly predictive when they appear. In contrast, Poaceae (panel B) shows a broader spread of medium-to-high support rules, indicating that grass pollen is influenced by a wider variety of meteorological combinations. Together, these patterns demonstrate that association rule mining can effectively uncover both common and rare but highly predictive meteorological configurations, providing insights into how weather conditions shape pollen dynamics.
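The three rule metrics discussed above (support, confidence, and lift) can be computed by hand on a toy set of daily records; the records below are illustrative, not the study data:

```python
# Each day is a set of items: discretized weather conditions (antecedent)
# plus the resulting pollen class (consequent).
days = [
    {"warm", "sunny", "high_pollen"},
    {"warm", "sunny", "high_pollen"},
    {"warm", "cloudy", "high_pollen"},
    {"cold", "cloudy", "low_pollen"},
    {"cold", "sunny", "low_pollen"},
    {"warm", "sunny", "low_pollen"},
]

def support(itemset):
    """Fraction of days containing all items in the set."""
    return sum(itemset <= day for day in days) / len(days)

def confidence(antecedent, consequent):
    """P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to the consequent's base rate."""
    return confidence(antecedent, consequent) / support(consequent)

rule_conf = confidence({"warm", "sunny"}, {"high_pollen"})
rule_lift = lift({"warm", "sunny"}, {"high_pollen"})
```

A lift above 1 means the antecedent raises the probability of the consequent relative to its base rate, which is exactly what the color scale in Fig 7 encodes.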

Discussion

Predicting pollen concentrations and characterizing and modeling upcoming pollen seasons are among the most critical goals of aerobiology [15] and are helpful for allergologists in clinical practice. Furthermore, tracking pollen concentrations is essential in clinical trials of specific immunotherapy [68]. The European Medicines Agency guideline on the clinical development of allergen immunotherapy products states that, for seasonal allergies, trials must record exposure to relevant allergens and specify in the study protocol the minimum pollen count necessary to establish both the baseline and evaluation periods [69]. Until now, machine learning methods have been used to predict changes in daily pollen concentrations [14,18,19,21], the severity of pollen-induced symptoms [70], and daily alarm thresholds [70]. As rightly emphasized in [15], studies on this topic differ significantly in terms of modeling techniques, predictor variables, and validation methods.

Building on these advances, our study is the first to compare seven widely used machine learning algorithms (K-Nearest Neighbors, Linear Regression, Decision Trees, Random Forest, XGBoost, Conv-LSTM DNN, Conv-GRU DNN) and Multi-Associative Graph Networks for retrospective daily birch and grass pollen forecasting over 34 years in Krakow. Unlike many studies, this work did not use temporal predictors such as day of the year, month of the year, and season of the year. However, a relatively large number of meteorological variables (ten) were used.

Our work’s innovative nature is demonstrated by using a large original database containing a long pollen data series from 1991 to 2024 and by comparing the usefulness of different machine-learning methods in predicting and characterizing pollen seasons. The work was created by an interdisciplinary team consisting of an aerobiologist, machine learning experts, and practicing allergists. We compared the usefulness of many machine learning methods from various orthogonal families. To our knowledge, this is the only work that examines the use of associative knowledge graphs (MAGNs) to predict pollen concentration and to mine association rules that characterize pollen seasons [71].

The proposed experiments have shown that the best average efficiency in pollen concentration classification is achieved by XGBoost, followed by MAGN. These are effective and fast classifiers with a low memory footprint, but they are sensitive to proper feature engineering. In turn, the best representation of the pollen concentration curve (the regression task) was achieved by a deep neural network with convolutional layers and LSTM cells. Such networks can effectively model time series, but their training and maintenance costs are high. The most time- and memory-efficient classifier was the one based on MAGN.

In order to guide future aerobiological modeling efforts, we have synthesized the practical trade-offs among the eight algorithms evaluated in this study. Table 5 provides a concise comparison of each method’s principal advantages (such as predictive accuracy, interpretability, and resource efficiency) and their corresponding limitations, including distributional assumptions, computational cost, and ease of implementation. This overview enables researchers to select the most appropriate algorithm based on specific project requirements, data availability, and hardware constraints.

Table 5. Comparative Overview of Machine Learning Methods for Pollen Forecasting.

Model Advantages Disadvantages
Linear Regression •Highly interpretable •Assumes linearity and normality
•Very fast training and inference •Low accuracy
K-Nearest Neighbors •No training phase •Prediction slow for large datasets
•Captures local non-linear patterns •Moderate accuracy
Decision Trees •Interpretable via tree splits •Prone to overfitting
•Fast inference •Sensitive to data changes
Random Forest •Good out-of-the-box performance •Higher memory and compute than a single tree
•Easy variable importance estimation •Less interpretable than a single tree
XGBoost •State-of-the-art accuracy on pollen prediction •Complex hyperparameter tuning
•Built-in regularization •Higher computational cost than Random Forest
Conv-LSTM DNN •Models complex, multidimensional temporal patterns •Long training times
•End-to-end feature extraction •High memory and hardware requirements
Conv-GRU DNN •Similar benefits to Conv-LSTM with fewer parameters •Still long training times
•Faster convergence •Still high memory and hardware requirements
MAGN •Comparable accuracy to XGBoost •Novel, fewer off-the-shelf tools
•Fastest runtime and lowest memory footprint •Requires graph construction step

In our study, we used default hyperparameter settings for all machine learning models, as the number of possible hyperparameter combinations exceeds our computational resources and the default settings are generally already well chosen. This choice was made to ensure fairness and comparability across a wide range of models and configurations. While more extensive hyperparameter tuning could potentially lead to improved performance for some methods, the high accuracy achieved with default settings demonstrates the robustness of the proposed framework. Future work could incorporate hyperparameter optimization techniques, such as grid search, evolutionary search, or Bayesian optimization, to further enhance performance.

In our work, we analyzed the use of different machine learning methods in assessing the impact of weather factors on the concentration of birch and grass pollen. The most important variable determining daily total pollen concentrations is the past daily total pollen concentration, as shown in our work and that of other authors [10,15]. The literature also emphasizes the influence of the mean daily temperature on the current or preceding day [72–74], as well as of the maximum and minimum daily temperatures [75,76]. These findings depend on the taxon studied: for example, in the case of birch, previous studies performed in Krakow confirmed that mean and minimum temperature and relative humidity explained the variability in daily birch pollen concentration most effectively [13,14], while temperature was the dominant independent variable in the models that predicted daily grass pollen concentration [14]. Local microclimate conditions can strongly influence pollen release [77,78]. Therefore, the best solution is to develop forecasting models on a limited spatial scale; only after a thorough evaluation of their effectiveness may they prove to be more universal [79]. In Switzerland, grass pollen concentration exhibits a positive relationship with temperature, as pollen levels tend to increase when the temperature exceeds 10 °C. At the same time, both precipitation and humidity have a negative relationship with grass pollen, leading to decreased predicted pollen concentrations at higher rainfall and at humidity levels exceeding 70%.

In addition, the analysis of association rules showed that there are many rules with high confidence, strong support, and elevated lift. This suggests that these combinations of parameters can effectively help predict pollen concentrations.

A key limitation of this study is the reliance on data collected solely from the Krakow region over a span of 34 years. While the long-term nature of the dataset strengthens temporal analysis, the geographic constraint may limit the direct applicability of the results to other regions with differing climatic and environmental conditions. However, the machine learning framework developed in this study is flexible and can be applied to other datasets, provided that region-specific pollen and meteorological data are available.

In the context of the limitations of the presented data pre-processing, it is also worth noting that the sliding-window approach inherently produces overlapping feature intervals across adjacent forecast dates, leading to high autocorrelation among lagged predictors. While nonparametric methods such as tree-based models and neural networks are relatively robust to multicollinearity, this overlap may still inflate apparent feature importance for highly autocorrelated variables. Future work could explore orthogonalization techniques or feature selection methods that mitigate such bias.

The practical implication of predicting pollen concentration is the possibility of intensifying antiallergic treatment and avoiding outdoor activities at critical moments of the pollen season to prevent the exacerbation of clinical symptoms in patients. An additional benefit of monitoring the pollen season is the precise assessment of the effectiveness of allergen immunotherapy (AIT). The guidelines of the European Academy of Allergy and Clinical Immunology emphasize that monitoring daily pollen concentrations is necessary to determine the appropriate time window for assessing the effectiveness of AIT for seasonal allergic rhinoconjunctivitis (ARC) [68]. The first stage of our work was the construction and evaluation of models, which is presented in this publication. In the further stages of the project, we plan to use the presented models to build a mobile application that will predict pollen concentrations and will be made available to patients.

Conclusion

Predicting and characterizing pollen seasons is a crucial aspect of aerobiology, with significant implications for allergy sufferers and public health. Our study contributes to this field by using a comprehensive historical database (1991–2024) and systematically comparing multiple machine learning methods for pollen concentration prediction. Our results confirm that past pollen concentration is the single most powerful predictor. Among the meteorological predictors used, temperature emerges as the leading driver across taxa, whereas the remaining predictors (humidity, cloud cover, sunshine duration, mean wind speed, mean pressure at sea level, global radiation, and snow depth) showed variable and considerably lower importance. The application of associative knowledge graphs (MAGNs) and association rules for the characterization of the pollen season is an approach that has not been explored previously in this context.

Our study’s principal methodological advance is the comprehensive evaluation of diverse machine-learning approaches, including the first application of MAGNs to pollen forecasting. Consistent with previous research, tree-based models (especially XGBoost) achieved the highest predictive accuracy. Crucially, MAGNs achieved comparable performance while requiring the least memory and exhibiting the fastest runtimes among all models tested, making them particularly well suited for general deployment as well as for use on edge devices or in hardware-constrained settings. Conversely, deep neural architectures (Conv-LSTM and Conv-GRU) achieved the highest fidelity in long-term predictions, albeit at the cost of greater computational resources. This broad comparative framework provides practical guidance for selecting appropriate algorithms based on accuracy requirements, interpretability, and hardware constraints.

Our findings underscore the transferability of the proposed modeling pipeline: although trained on data from a single mid-latitude city, the approach requires only local pollen and meteorological inputs and can be applied directly to other regions. Future research should therefore (1) validate generalizability across diverse climatic zones, (2) incorporate automated hyperparameter optimization to further boost model performance, and (3) explore real-time deployment using operational weather forecasts. Ultimately, these efforts will pave the way for user-friendly tools, such as mobile applications to deliver personalized pollen forecasts, support allergists in timing interventions, and help individuals manage exposure effectively.

Supporting information

S1 Fig. Graphical abstract.

(TIF)

S2 File. Supplementary material.

(DOCX)


Acknowledgments

We acknowledge the data providers in the ECA&D project. Klein Tank, A.M.G. and Coauthors, 2002. Daily dataset of 20th-century surface air temperature and precipitation series for the European Climate Assessment. Int. J. of Climatol., 22, 1441-1453. Data and metadata available at https://www.ecad.eu.

Data Availability

The meteorological data files are available from the European Climate Assessment & Dataset project database: https://www.ecad.eu/dailydata/predefinedseries.php. To provide a minimal data set necessary to replicate our analyses, two csv files with Betula pollen data, two files with Poaceae pollen data obtained in 2022 and 2023, and the meteorological data of ten selected factors were uploaded to the Open Science Framework with DOI 10.17605/OSF.IO/9YZCF (https://osf.io/9yzcf). Requests for access to the whole dataset underlying this study should be directed to the Department of Clinical and Environmental Allergology, Jagiellonian University Medical College, via email: zaklad.alergologii@cm-uj.krakow.pl. Researchers seeking access must meet the criteria for access to confidential data, i.e., provide a research plan and confirm compliance with data use and confidentiality policies: the data may be used only for the specified study purpose and may not be shared with third parties.

Funding Statement

The study was supported by the statutory project of the Ministry of Science and Higher Education in Poland N41/DBS/001323. Initials of the authors who received the award: MB. URL of the funder: https://www.gov.pl/web/science. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Alqahtani JM. Atopy and allergic diseases among Saudi young adults: a cross-sectional study. J Int Med Res. 2020;48(1):300060519899760. doi: 10.1177/0300060519899760 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Oliveira TB, Persigo ALK, Ferrazza CC, Ferreira ENN, Veiga ABG. Prevalence of asthma, allergic rhinitis and pollinosis in a city of Brazil: a monitoring study. Allergol Immunopathol (Madr). 2020;48(6):537–44. doi: 10.1016/j.aller.2020.03.010 [DOI] [PubMed] [Google Scholar]
  • 3.Masoli M, Fabian D, Holt S, Beasley R, Asthma (GINA) Program GI. The global burden of asthma: executive summary of the GINA dissemination committee report. Allergy. 2004;59(5):469–78. [DOI] [PubMed] [Google Scholar]
  • 4.Global Initiative for Asthma (GINA). Global strategy for asthma management and prevention. 2024. https://ginasthma.org/reports/
  • 5.Wise SK, Damask C, Roland LT, Ebert C, Levy JM, Lin S, et al. International consensus statement on allergy and rhinology: allergic rhinitis–2023. International forum of allergy & rhinology. Wiley Online Library. 2023. p. 293–859. [DOI] [PubMed]
  • 6.Hoque F, Nayak R. Focused overview of the 2024 global initiative for asthma guidelines. APIK Journal of Internal Medicine. 2024;13(1):4–12. doi: 10.4103/ajim.ajim_76_24 [DOI] [Google Scholar]
  • 7.Cohen B. Allergic rhinitis. Pediatr Rev. 2023;44(10):537–50. doi: 10.1542/pir.2022-005618 [DOI] [PubMed] [Google Scholar]
  • 8.Patel N, Bhattacharyya A. Rhinitis in primary care. Prim Care. 2025;52(1):37–45. doi: 10.1016/j.pop.2024.09.006 [DOI] [PubMed] [Google Scholar]
  • 9.Zhang Y, Lan F, Zhang L. Advances and highlights in allergic rhinitis. Allergy. 2021;76(11):3383–9. doi: 10.1111/all.15044 [DOI] [PubMed] [Google Scholar]
  • 10.Makra L, Coviello L, Gobbi A, Jurman G, Furlanello C, Brunato M, et al. Forecasting daily total pollen concentrations on a global scale. Allergy. 2024;79(8):2173–85. doi: 10.1111/all.16227 [DOI] [PubMed] [Google Scholar]
  • 11.Sofiev M, Palamarchuk J, Kouznetsov R, Abramidze T, Adams-Groom B, Antunes CM, et al. European pollen reanalysis 1980 -2022, for alder, birch, and olive. Sci Data. 2024;11(1):1082. doi: 10.1038/s41597-024-03686-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Zhong J, Xiao R, Wang P, Yang X, Lu Z, Zheng J, et al. Identifying influence factors and thresholds of the next day’s pollen concentration in different seasons using interpretable machine learning. Sci Total Environ. 2024;935:173430. doi: 10.1016/j.scitotenv.2024.173430 [DOI] [PubMed] [Google Scholar]
  • 13.Myszkowska D. Predicting tree pollen season start dates using thermal conditions. Aerobiologia (Bologna). 2014;30(3):307–21. doi: 10.1007/s10453-014-9329-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Myszkowska D, Majewska R. Pollen grains as allergenic environmental factors–new approach to the forecasting of the pollen concentration during the season. Ann Agric Environ Med. 2014;21(4):681–8. doi: 10.5604/12321966.1129914 [DOI] [PubMed] [Google Scholar]
  • 15.Nowosad J, Stach A, Kasprzyk I, Chłopek K, Dabrowska-Zapart K, Grewling Ł, et al. Statistical techniques for modeling of Corylus, Alnus, and Betula pollen concentration in the air. Aerobiologia. 2018;34(3):301–13. doi: 10.1007/s10453-018-9514-x [DOI] [Google Scholar]
  • 16.Bringfelt B, Engström I, Nilsson S. An evaluation of some models to predict airborne pollen concentration from meteorological conditions in stockholm, sweden. Grana. 1982;21(1):59–64. doi: 10.1080/00173138209427680 [DOI] [Google Scholar]
  • 17.Cotos-Yáñez TR, Rodríguez-Rajo F, Jato M. Short-term prediction of Betula airborne pollen concentration in Vigo (NW Spain) using logistic additive models and partially linear models. International Journal of Biometeorology. 2004;48:179–85. [DOI] [PubMed] [Google Scholar]
  • 18.Astray G, Amigo Fernandez R, Fernandez-Gonzalez M, Dias-Lorenzo DA, Guada G, Rodrıguez-Rajo FJ. Machine learning to forecast airborne parietaria pollen in the North-West of the Iberian Peninsula. Sustainability. 2025;17(4):1528. [Google Scholar]
  • 19.Cordero JM, Rojo J, Gutiérrez-Bustillo AM, Narros A, Borge R. Predicting the olea pollen concentration with a machine learning algorithm ensemble. International Journal of Biometeorology. 2021;65(4):541–54. [DOI] [PubMed] [Google Scholar]
  • 20.Valipour Shokouhi B, de Hoogh K, Gehrig R, Eeftens M. Spatiotemporal modelling of airborne birch and grass pollen concentration across Switzerland: a comparison of statistical, machine learning and ensemble methods. Environ Res. 2024;263(Pt 1):119999. doi: 10.1016/j.envres.2024.119999 [DOI] [PubMed] [Google Scholar]
  • 21.Zewdie GK, Lary DJ, Levetin E, Garuma GF. Applying deep neural networks and ensemble machine learning methods to forecast airborne Ambrosia Pollen. Int J Environ Res Public Health. 2019;16(11):1992. doi: 10.3390/ijerph16111992 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Srodowiska Szoosp K. Program ochrony środowiska dla miasta Krakowa na lata 2020–2030 [Environmental protection program for the city of Krakow for 2020–2030]. 2021. https://obywatelski.krakow.pl/zalacznik/385031
  • 23.Galán C, Smith M, Thibaudon M, Frenguelli G, Oteros J, Gehrig R, et al. Pollen monitoring: minimum requirements and reproducibility of analysis. Aerobiologia. 2014;30(4):385–95. doi: 10.1007/s10453-014-9335-5 [DOI] [Google Scholar]
  • 24.U. Ambient air-sampling and analysis of airborne pollen grains and fungal spores for networks related to allergy-volumetric Hirst method. ICS. 2019. p. 20.
  • 25.College JUM. Pollen concentration database. 2025. https://www.pylek.cm-uj.krakow.pl
  • 26.Dahl A, Galán C, Hajkova L, Pauling A, Sikoparija B, Smith M. The onset, course and intensity of the pollen season. Allergenic pollen: A review of the production, release, distribution and health impacts. Springer; 2012. p. 29–70.
  • 27.European Climate Assessment & Dataset (ECA&D). Daily data. 2025. https://www.ecad.eu/dailydata/predefinedseries.php
  • 28.Matuszko D, Piotrowicz K. Cechy klimatu miasta a klimat Krakowa [Characteristics of urban climate and the climate of Krakow]. 2015. https://ruj.uj.edu.pl/bitstreams/8f6e1bce-6c57-48b0-8930-ff76500c120b/download
  • 29.JuliaStats. HypothesisTests.jl. 2025. https://github.com/JuliaStats/HypothesisTests.jl
  • 30.Myszkowska D. Comparison of machine learning methods in forecasting and characterizing the birch and grasses pollen season - minimal data set necessary to replicate the study. 2025. https://osf.io/9yzcf/?view_only=d4ccc99d460c46bfa775d72ba6ba3a1b
  • 31.Bouchet-Valat M, Kamiński B. DataFrames.jl: flexible and fast tabular data in Julia. J Stat Soft. 2023;107(4):1–32. doi: 10.18637/jss.v107.i04 [DOI] [Google Scholar]
  • 32.Brown RG. Smoothing, forecasting and prediction of discrete time series. Courier Corporation; 2004.
  • 33.Horzyk A, Bulanda D, Starzyk JA. Construction and training of multi-associative graph networks. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. 2023. p. 277–92.
  • 34.Fix E, Hodges JL. Discriminatory analysis: nonparametric discrimination, consistency properties. USAF School of Aviation Medicine; 1985.
  • 35.JuliaAI. Nearest neighbor models. 2025. https://github.com/JuliaAI/NearestNeighborModels.jl
  • 36.Bulanda D. Witchnet. 2025. https://github.com/danbulnet/witchnet
  • 37.Galton F. Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland. 1886;15:246. doi: 10.2307/2841583 [DOI] [Google Scholar]
  • 38.JuliaAI. MLJLinearModels.jl. 2025. https://github.com/JuliaAI/MLJLinearModels.jl
  • 39.Kamiński B, Jakubczyk M, Szufel P. A framework for sensitivity analysis of decision trees. Cent Eur J Oper Res. 2018;26(1):135–59. doi: 10.1007/s10100-017-0479-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Sadeghi B, Chiarawongse P, Squire K, Jones DC, Noack A, St-Jean C. DecisionTree.jl - A Julia implementation of the CART Decision Tree and Random Forest algorithms. 2022.
  • 41.Breiman L. Random forests. Machine Learning. 2001;45(1):5–32. doi: 10.1023/a:1010933404324 [DOI] [Google Scholar]
  • 42.Chen T, Guestrin C. Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. p. 785–94.
  • 43.dmlc. XGBoost.jl. 2025. https://github.com/dmlc/XGBoost.jl
  • 44.Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep learning. Cambridge: MIT Press; 2016.
  • 45.Innes M, Saba E, Fischer K, Gandhi D, Rudilosso MC, Joy NM. Fashionable modelling with Flux. arXiv preprint arXiv:1811.01457. 2018. [Google Scholar]
  • 46.Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. doi: 10.1162/neco.1997.9.8.1735 [DOI] [PubMed] [Google Scholar]
  • 47.Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint 2014. https://arxiv.org/abs/1406.1078 [Google Scholar]
  • 48.Horzyk A, Bulanda D, Starzyk JA. ASA-graphs for efficient data representation and processing. International Journal of Applied Mathematics and Computer Science. 2020;30(4). doi: 10.34768/amcs-2020-0053 [DOI] [Google Scholar]
  • 49.Kreer J. A question of terminology. IRE Trans Inf Theory. 1957;3(3):208–208. doi: 10.1109/tit.1957.1057418 [DOI] [Google Scholar]
  • 50.Zhou H, Wang X, Zhu R. Feature selection based on mutual information with correlation coefficient. Appl Intell. 2021;52(5):5457–74. doi: 10.1007/s10489-021-02524-x [DOI] [Google Scholar]
  • 51.Sornalakshmi M, Balamurali S, Venkatesulu M, Krishnan MN, Ramasamy LK, Kadry S, et al. An efficient apriori algorithm for frequent pattern mining using mapreduce in healthcare data. Bulletin EEI. 2021;10(1):390–403. doi: 10.11591/eei.v10i1.2096 [DOI] [Google Scholar]
  • 52.Kabir MR, Vaid S, Sood N, Zaiane OR. Deep associative classifier. In: 2022 IEEE International Conference on Knowledge Graph (ICKG). 2022. p. 113–22. 10.1109/ickg55886.2022.00022 [DOI]
  • 53.Sivanantham S, Mohanraj V, Suresh Y, Senthilkumar J. Rule precision index classifier: an associative classifier with a novel pruning measure for intrusion detection. Pers Ubiquit Comput. 2021;27(3):1395–403. doi: 10.1007/s00779-021-01599-0 [DOI] [Google Scholar]
  • 54.Tufféry S. Data mining and statistics for decision making. John Wiley & Sons; 2011.
  • 55.Duda RO, Hart PE, Stork DG. Pattern classification. Toronto: John Wiley & Sons; 2001.
  • 56.Beyer K, Goldstein J, Ramakrishnan R, Shaft U. When is “nearest neighbor” meaningful? In: Database Theory - ICDT’99: 7th International Conference, Jerusalem, Israel, January 10–12, 1999, Proceedings. 1999. p. 217–35.
  • 57.Han J, Kamber M. Data mining: concepts and techniques. 2nd ed. Morgan Kaufmann; 2006. [Google Scholar]
  • 58.Seber GA, Lee AJ. Linear regression analysis. John Wiley & Sons; 2012.
  • 59.Goeman J, Meijer R, Chaturvedi N. L1 and L2 penalized regression models. Vignette, R package penalized. 2018. https://cran.r-project.org/web/packages/penalized/vignettes/penalized.pdf
  • 60.Singh S, Gupta P. Comparative study ID3, CART and C4.5 decision tree algorithm: a survey. International Journal of Advanced Information Science and Technology. 2014;27(27):97–103. [Google Scholar]
  • 61.Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. 2nd ed. Springer; 2009.
  • 62.XGBoost: The Champion of Competitive Machine Learning. 2025. https://datascientest.com/en/xgboost-the-champion-of-competitive-machine-learning
  • 63.Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems. 2012;25. [Google Scholar]
  • 64.Jozefowicz R, Zaremba W, Sutskever I. An empirical exploration of recurrent network architectures. In: International Conference on Machine Learning. 2015. p. 2342–50.
  • 65.Blaom AD, Kiraly F, Lienart T, Simillides Y, Arenas D, Vollmer SJ. MLJ: a Julia package for composable machine learning. 2020. 10.5281/zenodo.4178918 [DOI]
  • 66.Bezanson J, Edelman A, Karpinski S, Shah VB. Julia: a fresh approach to numerical computing. SIAM Rev. 2017;59(1):65–98. doi: 10.1137/141000671 [DOI] [Google Scholar]
  • 67.Bulanda D. Comparison of machine learning methods in forecasting and characterizing the birch and grasses pollen season - source code. 2025. https://github.com/bionetlabs/PollenForecasting.git
  • 68.Pfaar O, Bastl K, Berger U, Buters J, Calderon MA, Clot B, et al. Defining pollen exposure times for clinical trials of allergen immunotherapy for pollen-induced rhinoconjunctivitis - an EAACI position paper. Allergy. 2017;72(5):713–22. doi: 10.1111/all.13092 [DOI] [PubMed] [Google Scholar]
  • 69.European Medicines Agency, Committee for Medicinal Products for Human Use (CHMP). Guideline on the clinical development of products for specific immunotherapy for the treatment of allergic diseases. 2008. https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-clinical-development-products-specific-immunotherapy-treatment-en.pdf
  • 70.Csépe Z, Makra L, Voukantsis D, Matyasovszky I, Tusnády G, Karatzas K, et al. Predicting daily ragweed pollen concentrations using computational intelligence techniques over two heavily polluted areas in Europe. Sci Total Environ. 2014;476–477:542–52. doi: 10.1016/j.scitotenv.2014.01.056 [DOI] [PubMed] [Google Scholar]
  • 71.Vélez-Pereira AM, De Linares C, Belmonte J. Aerobiological modeling I: a review of predictive models. Science of The Total Environment. 2021;795:148783. [DOI] [PubMed] [Google Scholar]
  • 72.Schäppi GF, Taylor PE, Kenrick J, Staff IA, Suphioglu C. Predicting the grass pollen count from meteorological data with regard to estimating the severity of hayfever symptoms in Melbourne (Australia). Aerobiologia. 1998;14(1):29–37. doi: 10.1007/bf02694592 [DOI] [Google Scholar]
  • 73.Matyasovszky I, Makra L, Guba Z, Pátkai Z, Páldy A, Sümeghy Z. Estimating the daily Poaceae pollen concentration in Hungary by linear regression conditioning on weather types. Grana. 2011;50(3):208–16. doi: 10.1080/00173134.2011.602984 [DOI] [Google Scholar]
  • 74.Rodríguez-Rajo FJ, Valencia-Barrera RM, Vega-Maray AM, Suárez FJ, Fernández-Gonzales D, Jato V. Prediction of airborne Alnus pollen concentration by using ARIMA models. Annals of Agricultural and Environmental Medicine. 2006;13(1). [PubMed] [Google Scholar]
  • 75.Green BJ, Dettmann M, Yli-Panula E, Rutherford S, Simpson R. Atmospheric Poaceae pollen frequencies, associations with meteorological parameters in Brisbane and Australia: a 5-year record 1994 -1999. Int J Biometeorol. 2004;48(4):172–8. doi: 10.1007/s00484-004-0204-8 [DOI] [PubMed] [Google Scholar]
  • 76.Aboulaich N, Achmakh L, Bouziane H, Trigo MM, Recio M, Kadiri M, et al. Effect of meteorological parameters on Poaceae pollen in the atmosphere of Tetouan (NW Morocco). Int J Biometeorol. 2013;57(2):197–205. doi: 10.1007/s00484-012-0566-2 [DOI] [PubMed] [Google Scholar]
  • 77.Sofiev M, Siljamo P, Ranta H, Linkosalo T, Jaeger S, Rasmussen A, et al. A numerical model of birch pollen emission and dispersion in the atmosphere. Description of the emission module. Int J Biometeorol. 2013;57(1):45–58. doi: 10.1007/s00484-012-0532-z [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Dbouk T, Visez N, Ali S, Shahrour I, Drikakis D. Risk assessment of pollen allergy in urban environments. Sci Rep. 2022;12(1):21076. doi: 10.1038/s41598-022-24819-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Valipour Shokouhi B, de Hoogh K, Gehrig R, Eeftens M. Estimation of historical daily airborne pollen concentrations across Switzerland using a spatio temporal random forest model. Sci Total Environ. 2024;906:167286. doi: 10.1016/j.scitotenv.2023.167286 [DOI] [PubMed] [Google Scholar]

Decision Letter 0

Manlio Milanese

2 May 2025

PONE-D-25-12434: Comparison of machine learning methods in forecasting and characterizing the birch and grasses pollen season. PLOS ONE

Dear Dr. Bulanda,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jun 16 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Manlio Milanese

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Thank you for uploading your study's underlying data set. Unfortunately, the repository you have noted in your Data Availability statement does not qualify as an acceptable data repository according to PLOS's standards.

At this time, please upload the minimal data set necessary to replicate your study's findings to a stable, public repository (such as figshare or Dryad) and provide us with the relevant URLs, DOIs, or accession numbers that may be used to access these data. For a list of recommended repositories and additional information on PLOS standards for data deposition, please see https://journals.plos.org/plosone/s/recommended-repositories.

Additional Editor Comments:

This manuscript employs various machine learning techniques to predict and analyze the pollen seasons of birch and grasses. While the manuscript is well developed from a methodological point of view, there are several comments to address before it can be considered for publication. The background and objectives need to be better described.

Please answer point by point to the reviewers' comments.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This manuscript employs various machine learning techniques to predict and analyze the pollen seasons of birch and grasses. While the manuscript has the potential to enhance research in this area, there are several minor comments to address before it can be considered for publication

MINOR COMMENTS

1. While meteorological data is included, there is no mention of potential measurement errors or quality variations, which might affect the model's precision. The methodology section could benefit from greater clarity and detail. For example, the description of data preprocessing techniques and modeling strategies could be expanded to include more information on the specific parameters used. In particular, the Authors could clarify: - The distance between the weather station and the pollen sampling point (over 11 km) could create discrepancies between collected data and actual conditions relevant to the study area.

2. The model relies solely on data from Krakow over a span of 34 years, limiting its applicability to regions with different climatic conditions. It could be useful to comment about that in section weakness or discussion for example.

3. Machine Learning Models: The "default settings" approach for model parameters may restrict performance. Exploring diverse hyperparameter settings would have been beneficial for optimization. In particular, the manuscript does not discuss in detail the impact of data quality (e.g., measurement errors) or the necessity of hyperparameter optimization, which could improve the models' performance.

4. Variable Interpretation: - The feature importance analysis appears detailed but could be expanded to include overlooked or underestimated influential variables

5. Clinical Goal vs Results: Despite the clinical focus on predicting pollen concentrations for personalized treatment, the manuscript does not assess the practical impact of these forecasts on patient health outcomes. It could be useful to comment about that in section weakness or discussion for example.

6. Temporal Overlap: A clear explanation is missing regarding how previous pollen and weather data are synchronized for predictions. Potential temporal overlaps between variables may introduce bias into the models. It could be useful to comment about that in section weakness or discussion for example.

7. Influence of Meteorological Factors: In some parts of the manuscript, it is stated that temperature is the most important environmental factor influencing pollen concentration. However, in other sections, it is suggested that humidity and cloud cover may be of lesser importance. This contradiction could be resolved by providing a clearer and more consistent explanation.

Reviewer #2: Comparison of machine learning methods in forecasting and characterizing the birch and grasses pollen season

General comments

The paper is well developed from a methodological point of view. However, I feel that the manuscript is sometimes not expressed in aerobiologically precise language. Also, the authors should correctly expose the background of this study and better explain the objectives, the results obtained and the main findings of this work.

- The introduction is very reduced and exposed in very general lines. I think the authors could analyse a brief background of the machine learning models used in aerobiological studies to predict pollen concentrations, their main advantages in comparison to traditional statistical methods, and the accuracy achieved in forecasting. Below are several interesting and recent related studies using machine learning methods, or combinations of these methods, which is now the practice. After this background, the discussion section should incorporate the main novelty and the most important findings of this study.

- The authors show how pollen time series and meteorological series follow a normal distribution which is strange in the aerobiological field due to the high frequency of zeros and low pollen concentrations during the year. Could you explain this? Anyway, normality assumptions would be only necessary in the case of linear models, are machine learning methods sensitive to the non-normal distribution of data? Discuss this aspect. Perhaps an advantage for using this type of forecasting methods.

- Check the aerobiological terminology of the entire manuscript following standardised terminology in this scientific field (Galán et al., 2017).

Galán, C., Ariatti, A., Bonini, M., Clot, B., Crouzy, B., Dahl, A., Fernandez-González, D., Frenguelli, G., Gehrig, R., Isard, S., Levetin, E., Li, D.W., Mandrioli, P., Rogers, C.A., Thibaudon, M., Sauliene, I., Skjoth, C., Smith, M., Sofiev, M., 2017. Recommended terminology for aerobiological studies. Aerobiologia 33, 293–295. https://doi.org/10.1007/s10453-017-9496-0

- In my opinion, the experiment 2 is not well explained. Indicate better the objectives for each experiment followed.

- The results section in its current form is a simple description of figures and tables. In my opinion, the results section should be a more elaborated presentation of the most important findings of the models.

- Only as a suggestion. It would be interesting to generate a comparative table based on the advantages and disadvantages of the different machine learning methods used in this work. It is a good and useful result for new similar studies.

- Conclusion section should remark the most important findings and the advances in the knowledge, but in the current form, part of the conclusions are general aspects already previously indicated.

Specific comments

- Abstract: "pollens": The word "pollen" has both a singular and plural sense.

- Abstract: "among others": What about the rest of meteorological parameters? Are not relevant?

- Materials and methods (line 30): It is ambiguous. I guess this is the extension of the municipality area.

- Materials and methods (line 48): In "isuploaded" a space is required.

- Materials and methods (Figure 1): Figure 1 is not informative due to the long time series. Perhaps, it would be more useful representing the daily average during the entire series adding lines for the first and third daily quartile to show the variability, or a similar graph.

- Materials and methods (Table 1): Replace "seeds/m3" by "pollen grains/m3".

- Materials and methods (line 79-80): What type of meteorological forecasts are used to predict pollen in the future? This is relevant.

- Materials and methods (line 113-120): Replace "seeds/m3" by "pollen grains/m3".

- Materials and methods (line 113-120): What is the meaning of the percentages of each pollen threshold? Is it related to the calculation of the threshold explained in lines 108-110.

- Materials and methods (lines 206-207): Is linear regression the unique method requiring parametric assumptions?

- Materials and methods (line 267): I think "appropriate pollen concentration categories" is not the best statistical term.

- Results: Caption of Figure 5. "Betula" in italics.

- Results (lines 302-303): Why the specific results from the XGBoost are remarked and detailed, and not for the rest of the methods used?

- Results: Caption of Figure 6. "Betula" in italics.

- Results (lines 355-356): A reference is required.

Related interesting literature

Astray et al., 2025. Machine Learning to Forecast Airborne Parietaria Pollen in the North-West of the Iberian Peninsula. Sustainability 17, 1528. https://doi.org/10.3390/su17041528

Cordero et al., 2021. Predicting the Olea pollen concentration with a machine learning algorithm ensemble. Int J Biometeorol 65, 541–554. https://doi.org/10.1007/s00484-020-02047-z

Shokouhi et al., 2024. Spatiotemporal modelling of airborne birch and grass pollen concentration across Switzerland: A comparison of statistical, machine learning and ensemble methods. Environmental Research 263, 119999. https://doi.org/10.1016/j.envres.2024.119999

Zewdie et al., 2019. Applying machine learning to forecast daily Ambrosia pollen using environmental and NEXRAD parameters. Environ Monit Assess 191, 261. https://doi.org/10.1007/s10661-019-7428-x

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2026 Feb 18;21(2):e0332093. doi: 10.1371/journal.pone.0332093.r002

Author response to Decision Letter 1


29 May 2025

Response to Reviewers

Manuscript title: Comparison of machine learning methods in forecasting and characterizing the birch and grasses pollen season

Manuscript ID: PONE-D-25-12434

Journal: PLOS ONE

Dear Editor and Reviewers,

We would like to thank the Academic Editor and both reviewers for their constructive feedback on our manuscript. We appreciate the time and effort invested in the review process, and we have revised the manuscript accordingly to address all points raised.

Below, we provide a detailed point-by-point response to the reviewers’ and editor's comments. All changes have been incorporated into the revised version of the manuscript, which is submitted along with a tracked-changes file for reference.

________________________________________

General Editorial Requirements

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

We have revised the manuscript to comply with PLOS ONE style and formatting guidelines. In particular, the Supporting information section has been moved to the end of the manuscript and the description of the corresponding author has been corrected.

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

We have deposited all necessary author-generated code to GitHub (https://github.com/bionetlabs/PollenForecasting) to ensure unrestricted public access. Appropriate information has been placed in the Materials and methods section.

3. Thank you for uploading your study's underlying data set. Unfortunately, the repository you have noted in your Data Availability statement does not qualify as an acceptable data repository according to PLOS's standards.

At this time, please upload the minimal data set necessary to replicate your study's findings to a stable, public repository (such as figshare or Dryad) and provide us with the relevant URLs, DOIs, or accession numbers that may be used to access these data. For a list of recommended repositories and additional information on PLOS standards for data deposition, please see https://journals.plos.org/plosone/s/recommended-repositories.

The whole material, including raw data, pollen data, meteorological data and all the obtained results, was stored as project documentation by the Jagiellonian University Medical College.

The meteorological data were collected from the European Climate Assessment and Dataset website [https://www.ecad.eu] as blended data in ASCII file format. All observations were obtained from the Krakow-Balice station. The direct link to the data Policy is as follows: https://knmi-ecad-assets-prd.s3.amazonaws.com/documents/ECAD_datapolicy.pdf.
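For illustration only (this sketch is not part of the original correspondence): a minimal Python example of how an ECA&D blended daily series file of this kind can be parsed, assuming the usual layout of station/source identifiers, a YYYYMMDD date column, a value stored in tenths of a unit, and a quality flag. The sample rows, station IDs, and column names below are hypothetical and stand in for a real downloaded file.

```python
from io import StringIO

import pandas as pd

# Hypothetical excerpt of an ECA&D blended daily series (mean temperature, TG).
# Real files carry a free-text header before this table; values are given in
# 0.1 degrees C, and the quality flag Q_TG is 0 = valid, 1 = suspect, 9 = missing.
sample = """STAID, SOUID,     DATE,    TG, Q_TG
  250, 12345, 20220401,    85,    0
  250, 12345, 20220402, -9999,    9
  250, 12345, 20220403,   102,    0
"""

df = pd.read_csv(StringIO(sample), skipinitialspace=True)
df.columns = df.columns.str.strip()                               # defensive header cleanup
df["DATE"] = pd.to_datetime(df["DATE"].astype(str), format="%Y%m%d")
df["TG"] = df["TG"].mask(df["Q_TG"] == 9)                         # blank out flagged-missing values
df["TG_celsius"] = df["TG"] / 10.0                                # stored in tenths of a degree

print(df[["DATE", "TG_celsius"]])
```

A table prepared this way can then be joined by date with the daily pollen concentration series before model training.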

We acknowledge the data providers in the ECA&D project.

Klein Tank, A.M.G. and Coauthors, 2002. Daily dataset of 20th-century surface air temperature and precipitation series for the European Climate Assessment. Int. J. of Climatol., 22, 1441-1453.

Moreover, to provide a minimal data set necessary to replicate our study, two CSV files with Betula pollen data and two files with Poaceae pollen data obtained in 2022 and 2023, together with the meteorological data for ten selected factors, were uploaded to the Open Science Framework (DOI: 10.17605/OSF.IO/9YZCF; direct link: https://osf.io/9yzcf/?view_only=d4ccc99d460c46bfa775d72ba6ba3a1b). Appropriate information has been placed in the Materials and methods section.

4. While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

We have uploaded all main-text figure files to the PACE diagnostic tool and confirmed that they now comply with PLOS figure requirements.

________________________________________

Academic Editor Comments

1. The background and objectives need to be better described.

We have significantly revised the Introduction to better describe the background, the motivations for applying machine learning to aerobiological forecasting, and the gap in current literature. Specific objectives of the study are now clearly defined at the end of the Materials and Methods section.

________________________________________

Reviewer #1 Comments

1. While meteorological data is included, there is no mention of potential measurement errors or quality variations, which might affect the model's precision. The methodology section could benefit from greater clarity and detail. For example, the description of data preprocessing techniques and modeling strategies could be expanded to include more information on the specific parameters used. In particular, the Authors could to clarify: - The distance between the weather station and the pollen sampling point (over 11 km) could create discrepancies between collected data and actual conditions relevant to the study area.

Thank you for pointing out the problem related to the distance between the meteorological and aerobiological stations. We are aware that the microclimate conditions can differ between the two stations, but the main reason for using the meteorological data from the Balice station was that these data are commonly available. This matters not only during model preparation, but also in the near future, when the models will be implemented in the aerobiological database to produce up-to-date predictions of pollen concentrations.

To the best of our knowledge, based on reports on the meteorological data at both stations (Balice and the Kraków city center), the differences between them are relatively small. According to Matuszko et al. (2015) and Matuszko and Piotrowicz (2015), in 2001-2010 the annual temperature in the Kraków center was higher by 0.7 °C than at the Balice station, with the differences observed mainly in winter, when the pollen seasons are over. The other variables, such as sunshine, cloud cover, and annual relative humidity (77% vs 78% for Kraków and Balice, respectively), differed only slightly.

We now discuss the potential impact of measurement errors and spatial discrepancy between the meteorological station and pollen trap in the Materials and Methods under Data subsection. In addition, we have added detailed descriptions of the experimental design and validation procedures to the Materials and Methods section.

Matuszko D., Piotrowicz K., Kowanetz L., 2015, Klimat, [w:] Środowisko przyrodnicze Krakowa. Zasoby, Ochrona, Kształtowanie, Baścik M., Degórska B. (red.), Instytut Geografii i Gospodarki Przestrzennej UJ, Kraków, 81-108.

Matuszko D., Piotrowicz K., 2015, Cechy klimatu miasta a klimat Krakowa, [w:] Miasto w badaniach geografów, Trzepacz P., Więcław-Michniewska J., Brzosko-Sermak A., Kołoś A. (red.), T. 1, Instytut Geografii i Gospodarki Przestrzennej UJ., Kraków, 221-241.

2. The model relies solely on data from Krakow over a span of 34 years, limiting its applicability to regions with different climatic conditions. It could be useful to comment about that in section weakness or discussion for example.

This limitation is acknowledged and discussed in the Discussion section.

3. Machine Learning Models: The "default settings" approach for model parameters may restrict performance. Exploring diverse hyperparameter settings would have been beneficial for optimization. In particular, the manuscript does not discuss in detail the impact of data quality (e.g., measurement errors) or the necessity of hyperparameter optimization, which could improve the models' performance.

We appreciate the Reviewer’s insightful comment. Indeed, exploring the full hyperparameter space could further improve model performance. However, given the large number of models and variants (forecast horizons, taxa), we adopted a standardized pipeline with default parameters to ensure consistency and comparability across all methods.

To address the reviewer’s suggestion, we have now added a discussion in the manuscript highlighting this limitation and acknowledging that more exhaustive hyperparameter tuning could yield better results in some cases. Nonetheless, the models already achieve high performance (e.g., up to 92.2% accuracy), which we believe is sufficient to demonstrate their practical applicability in the context of daily pollen forecasting.

Moreover, regarding data quality, we ensured that the input dataset was thoroughly preprocessed following standardized aerobiological protocols, and we employed robust methods such as moving averages and mutual information analysis to mitigate the effect of noise. This has been clarified in the revised discussion section.

4. Variable Interpretation: - The feature importance analysis appears detailed but could be expanded to include overlooked or underestimated influential variables

We appreciate the reviewer’s suggestion to further examine secondary predictors. Accordingly, we analyzed relative normalized mutual information for the lowest‐ranking meteorological variables: humidity, sunshine duration, radiation, cloud cover, wind speed, sea level pressure, and snow depth. Although these factors fall below the primary drivers in overall rank, they nonetheless contribute meaningfully to model performance. A new paragraph in the Results section now summarizes these insights.
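For readers unfamiliar with the normalized mutual information measure used in this secondary analysis, a minimal sketch is shown below. The variable names and the synthetic binned data are purely illustrative assumptions, not the study's actual dataset; only the `normalized_mutual_info_score` call from scikit-learn reflects a standard way to compute NMI between two discrete labelings.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical illustration: NMI between a binned meteorological
# variable and binned pollen concentration categories. These arrays
# are synthetic stand-ins, not the study's real data.
rng = np.random.default_rng(0)
humidity_bin = rng.integers(0, 4, size=300)                       # 4 humidity bins
pollen_bin = (humidity_bin + rng.integers(0, 2, size=300)) % 4    # partially dependent

# NMI is 1.0 for identical labelings and near 0 for independent ones,
# so it ranks how strongly each predictor relates to the target.
nmi = normalized_mutual_info_score(humidity_bin, pollen_bin)
```

In a feature-ranking setting such as the one described above, this score would be computed per meteorological variable and the variables sorted by it.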

5. Clinical Goal vs Results: Despite the clinical focus on predicting pollen concentrations for personalized treatment, the manuscript does not assess the practical impact of these forecasts on patient health outcomes. It could be useful to comment about that in section weakness or discussion for example.

Thank you very much for drawing attention to the clinical aspect of our work. The practical implication of predicting pollen concentration is the possibility of intensifying antiallergic treatment and avoiding outdoor activities at critical moments of high pollen concentration, to prevent the exacerbation of clinical symptoms in patients. The first stage of our work was the construction and evaluation of models, which is presented in this publication. In the further stages of the project, we plan to use the presented models to build a mobile application that will predict pollen concentrations and will be made available to patients. An appropriate note on this topic has been added to the Discussion section.

6. Temporal Overlap: A clear explanation is missing regarding how previous pollen and weather data are synchronized for predictions. Potential temporal overlaps between variables may introduce bias into the models. It could be useful to comment about that in section weakness or discussion for example.

We thank the reviewer for highlighting the need to clarify how our input windows avoid using future information. In the revised manuscript, we have discussed the potential for autocorrelation bias in the Discussion.

7. Influence of Meteorological Factors: In some parts of the manuscript, it is stated that temperature is the most important environmental factor influencing pollen concentration. However, in other sections, it is suggested that humidity and cloud cover may be of lesser importance. This contradiction could be resolved by providing a clearer and more consistent explanation.

We have revised the relevant paragraphs to present a more coherent interpretation. In particular, we have expanded the description of Experiment 2 to clearly define its aims and analytical steps, and we have reworked the appropriate paragraphs in the Results section.

________________________________________

Reviewer #2 Comments

1. The introduction is very brief and written in very general terms. I think the authors could review a brief background of the machine learning models used in aerobiological studies to predict pollen concentrations, their main advantages in comparison to traditional statistical methods, and the accuracy achieved in forecasting. Below are several interesting and recent related studies using machine learning methods, or combinations of such methods, which are now in practice. After this background, the discussion section should incorporate the main novelty and the most important findings of this study.

After correction, the Introduction includes a concise review of the use of machine learning in aerobiology, with emphasis on prior applications and comparative performance. We have also enriched and emphasized the main novelty and the most important findings of this study in the Discussion section.

2. The authors show how pollen time series and meteorological series follow a normal distribution which is strange in the aerobiological field due to the high frequency of zeros and low pollen concentrations during the year. Could you explain this? Anyway, normality assumptions would be only necessary in the case of linear models, are machine learning methods sensitive to the non-normal distribution of data? Discuss this aspect. Perhaps an advantage for using this type of forecasting methods.

We thank the reviewer for this important observation. To clarify, we first excluded all calendar days outside the defined birch and grass pollen seasons, i.e., the days before the pollen season began and after it ended, when pollen concentrations are zero. After this seasonal trimming, the remaining daily pollen and meteorological values deviate significantly from a normal distribution, as confirmed by the Shapiro–Wilk test. We owe you an apology and a clarification regarding our discussion of data normality in the submitted manuscript. Due to a misunderstanding of the Shapiro–Wilk test output and the associated package documentation, we incorrectly reported that both the pollen concentration and meteorological datasets were normally distributed. In fact, the very low p-values (p ≪ 0.05) from our Shapiro–Wilk tests demonstrate that these data strongly deviate from normality. We regret this error and have revised the manuscript to remove the incorrect statements about normality, instead applying and describing appropriate non-parametric analyses. Thank you for catching this mistake, and we appreciate your understanding as we correct our interpretation.

Furthermore, we agree that strict normality is only required for parametric inference in linear regression. In contrast, the remaining models used in our study are non-parametric and thus do not assume any particular distribution of inputs or residuals.

3. Check the aerobiological terminology of the entire manuscript following standardized terminology in this scientific field (Galán et al., 2017).

Thank you for this comment. The aerobiological terminology was aligned with the recommendations proposed by Galán et al. (2017).

4. In my opinion, the experiment 2 is not well explained. Indicate better the objectives for each experiment followed.

We appreciate this feedback and have expanded the description of Experiment 2 to clearly define its aims and analytical steps. Specifically, Experiment 2 was designed to (1) identify which features drive the predictions of our best-performing model, (2) quantify the strength of association between inputs and pollen outcomes via mutual information, and (3) uncover characteristic combinations of meteorological and pollen variables using an association rule mining algorithm based on MAGNs.

Attachment

Submitted filename: Response to Reviewers.pdf

pone.0332093.s003.pdf (296.1KB, pdf)

Decision Letter 1

Manlio Milanese

22 Jul 2025

PONE-D-25-12434R1
Comparison of machine learning methods in forecasting and characterizing the birch and grasses pollen season
PLOS ONE

Dear Dr. Bulanda,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Sep 05 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Manlio Milanese

Academic Editor

PLOS ONE

Journal Requirements:

If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments :

Thank you for submitting your above-mentioned manuscript to Plos One.

It has now been evaluated by our experts and we are pleased to inform you that it is principally acceptable for publication in our journal, subject to minor changes.

To assist you in making your alterations, you will find the reviewers' remarks below.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Dear Editor and Authors,

I have carefully examined the authors’ point‐by‐point responses to our minor comments and find that they have addressed each concern in a coherent and satisfactory manner:

Measurement error and station‐distance – The authors acknowledge potential microclimate discrepancies between the Balice weather station and the pollen trap and cite two local climatological studies quantifying the differences (≈0.7 °C in annual temperature, ~1 % in humidity). They have now explicitly discussed these limitations and their likely minimal impact in the Materials & Methods.

Geographic generalizability – They have added a clear caveat in the Discussion regarding the single‐city scope and its implications for broader climatic regimes.

Hyperparameter tuning and data quality – The authors explain their rationale for using default settings to maintain comparability across dozens of model‐horizon–taxa combinations. They have also inserted a discussion admitting that more extensive hyperparameter searches might improve performance, and they detail the standardized preprocessing steps (moving averages, mutual information) used to mitigate noise.

Feature‐importance analysis – In response to the suggestion, they carried out and summarized a secondary mutual‐information analysis on the lower‐ranked meteorological predictors, demonstrating that even “minor” factors contribute to overall accuracy.

Clinical implications – They clarify that the forecasting models will underpin a future mobile app for personalized allergy management, and they explain how advance warnings could guide treatment intensification and behavior changes. This note has been added to the Discussion.

Temporal synchronization and bias – The revised manuscript now explicitly states how input windows are constructed to prevent look-ahead bias and addresses potential autocorrelation in the Discussion.

Consistency in meteorological factor interpretation – All relevant paragraphs have been harmonized to present a unified narrative on the primacy of temperature while acknowledging the secondary roles of humidity and cloud cover.

Overall, the authors’ revisions are both thorough and well integrated. They not only justify their methodological choices but also acknowledge remaining limitations, exactly as requested. I believe these responses fully resolve the reviewer’s technical and substantive points.

Reviewer #2: Comparison of machine learning methods in forecasting and characterizing the birch and grasses pollen season

General comments

The authors have considerably improved the manuscript and they have addressed most of my suggestions. I consider that the first round of review has been very positive, but several minor issues should be addressed yet. Figure 5 is a good outcome from my point of view.

- The authors restrict the application of the model to the pollen season, but the pollen seasons of birch and grasses were not defined. Have they used any of the common methods to define the pollen seasons? This is relevant, as the method applied influences the period selected (Tasioulis et al., 2022).

Tasioulis, T., Karatzas, K., Charalampopoulos, A., Damialis, A., Vokou, D., 2022. Five ways to define a pollen season: exploring congruence and disparity in its attributes and their long-term trends. Aerobiologia. https://doi.org/10.1007/s10453-021-09735-2.

- Based on the previous comment, check the clinical method due to the orientation of this manuscript in public health (Pfaar et al., 2017).

Pfaar, O., Bastl, K., Berger, U., Buters, J., Calderon, M.A., Clot, B., Darsow, U., Demoly, P., Durham, S.R., Galán, C., Gehrig, R., Gerth van Wijk, R., Jacobsen, L., Klimek, L., Sofiev, M., Thibaudon, M., Bergmann, K.C., 2017. Defining pollen exposure times for clinical trials of allergen immunotherapy for pollen-induced rhinoconjunctivitis - an EAACI position paper. Allergy 72, 713–722. https://doi.org/10.1111/all.13092

Specific comments

- Abstract: "mean sea level", perhaps you mean "mean pressure at sea level".

- Abstract: "Ambrosia" genus in italics.

- Materials and methods: Replace "pollen counts" by "pollen concentrations".

- Results: "Betula" genus in italics.

- Discussion: In which fields were MAGNs models used previously? Examples.

- Discussion: Replace "grasses pollen" by "grass pollen".

- Conclusion: "mean sea level", perhaps you mean "mean pressure at sea level".

- Discussion: Replace "signWificance" by "significance".

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Vincenzo Patella

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]


PLoS One. 2026 Feb 18;21(2):e0332093. doi: 10.1371/journal.pone.0332093.r004

Author response to Decision Letter 2


31 Jul 2025

Response to Reviewers

Manuscript title: Comparison of machine learning methods in forecasting and characterizing the birch and grass pollen season

Manuscript ID: PONE-D-25-12434R1

Journal: PLOS ONE

Dear Editor and Reviewers,

We appreciate your constructive feedback and are grateful for the opportunity to revise our manuscript. Thank you for your positive and valuable evaluation.

Below we address each point raised by the reviewers. All changes in the revised manuscript are highlighted in the “Revised Manuscript with Track Changes” file, and the clean version reflects these changes.

________________________________________

Academic Editor Comments

Thank you for your careful evaluation and for confirming that our manuscript is fundamentally suitable for publication pending minor revisions. We appreciate the insightful feedback from you and the reviewers. We will diligently address all comments and submit our revised manuscript by the deadline.

________________________________________

Reviewer #1 Comments

We are very grateful for your thorough re‐evaluation and kind assessment of our revisions. Your confirmation that we have coherently addressed concerns around measurement error and station-distance discrepancies, geographic generalizability, hyperparameter tuning, feature‐importance analysis, clinical implications, temporal synchronization, and consistency in meteorological factor interpretation is incredibly reassuring.

Your detailed recognition of our efforts to quantify microclimate differences, clarify methodological caveats, justify our modeling choices, and integrate secondary analyses means a great deal to us. Thank you for helping us strengthen the manuscript and for your constructive guidance throughout the review process.

________________________________________

Reviewer #2 Comments

We appreciate Reviewer #2’s encouraging feedback and thoughtful recommendations. Each of your points has been addressed in detail below. We are pleased that Figure 5 resonated with you. We’ve also implemented the minor clarifications listed to enhance clarity and precision.

1. The authors restrict the application of the model to the pollen season, but the pollen seasons of birch and grasses were not defined. Have they used any of the common methods to define the pollen seasons? This is relevant, as the method applied influences the period selected (Tasioulis et al., 2022).

Thank you for pointing out the need for greater clarity in our definition of the pollen season. We have rephrased the Materials and methods section accordingly to make explicit that:

- The birch and grass seasons begin on the first day with a pollen concentration > 0 grains/m³ and end on the last day with any detectable pollen, following Dahl et al. (2013).

- All zero‐count days that fall between these start and end dates were retained in the analysis.

Since our method does not employ any of the other widely used pollen-season calculation algorithms, we elected to omit a detailed comparison of those approaches from the Materials and Methods.
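The season definition described above (first to last day with any detectable pollen, zero-count days in between retained) can be sketched as a small helper. The function name and the toy concentration series are assumptions for illustration only.

```python
from typing import Optional, Tuple
import numpy as np

def pollen_season_bounds(daily: np.ndarray) -> Optional[Tuple[int, int]]:
    """Return (start_idx, end_idx) of the pollen season: the first and
    last day with a detectable concentration (> 0 grains/m^3), following
    the definition above. Returns None if no pollen was recorded."""
    nonzero = np.flatnonzero(daily > 0)
    if nonzero.size == 0:
        return None
    return int(nonzero[0]), int(nonzero[-1])

# Zero-count days that fall between the start and end dates are kept.
season = pollen_season_bounds(np.array([0, 0, 3, 0, 12, 5, 0, 0]))
# season == (2, 5): days 0-1 and 6-7 are trimmed, day 3 is retained.
```

Only the days outside the returned bounds would be excluded from the modeling dataset.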

2. Based on the previous comment, check the clinical method due to the orientation of this manuscript in public health (Pfaar et al., 2017).

Thank you for highlighting the importance of the clinical definition of pollen exposure. In response, we have added new sentences in the Discussion section summarizing the EAACI position paper by Pfaar et al. (2017), which recommends daily monitoring of pollen concentrations in immunotherapy trials.

We have added a statement noting that daily pollen monitoring is crucial for accurately timing and assessing the effectiveness of specific immunotherapy for seasonal allergic rhinoconjunctivitis.

________________________________________

Reviewer #2 Specific Comments

1. Abstract: "mean sea level", perhaps you mean "mean pressure at sea level".

Corrected.

2. Abstract: "Ambrosia" genus in italics.

Corrected in the Introduction section; the word does not appear in the Abstract.

3. Materials and methods: Replace "pollen counts" by "pollen concentrations".

Corrected.

4. Results: "Betula" genus in italics.

All occurrences of the word Betula are in italics.

5. Discussion: In which fields were MAGNs models used previously? Examples.

Multi-Associative Graph Networks (MAGNs) have been evaluated across a wide range of benchmarking tasks in the introductory publication (Horzyk A, Bulanda D, Starzyk JA. Construction and Training of Multi-Associative Graph Networks. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer; 2023. p. 277–292). The authors evaluated over 70 classification datasets from various fields, including medicine, technology, and biology, demonstrating MAGNs’ broad applicability across disciplines.

Beyond benchmarking, MAGNs have also been applied in real-world, domain-specific projects. In the automotive industry, MAGNs have been used as a recommendation engine and pattern mining framework at GrapeUp Ltd. In the medical domain, they are applied to the classification and analysis of electrocardiogram (ECG) signals in collaboration with Prof. M. Jastrzębski (Jagiellonian University) and Prof. J.A. Starzyk (Ohio University), with the goal of improving diagnostic accuracy and classification mechanisms.

In the field of psychiatry, MAGNs are being implemented within the framework of the project MENTALIO – a decision-support system for diagnostics and therapy in adolescent mental health based on artificial intelligence algorithms, funded by the Polish Medical Research Agency (ABM/2022/7). The project is carried out in collaboration with Prof. M. Pilecki (Jagiellonian University) and the Nivalit company. It focuses on modeling complex psychological and behavioral data to predict children's and adolescents’ behaviors and to support suicide prevention. Notably, MENTALIO received the highest ranking in the national ABM competition for innovative AI-based medical technologies, with total funding of 3,768,206 PLN.

These practical implementations further underscore the flexibility of MAGNs in handling diverse, high-dimensional, and interrelated data structures across a wide range of application areas.

6. Discussion: Replace "grasses pollen" by "grass pollen".

Corrected.

7. Conclusion: "mean sea level", perhaps you mean "mean pressure at sea level".

Corrected.

8. Discussion: Replace "signWificance" by "significance".

Corrected.

________________________________________

We trust that these revisions address all the remaining concerns. Thank you again for your valuable feedback. We look forward to your decision.

Sincerely,

Daniel Bulanda

AGH University of Krakow

daniel@bulanda.net

Attachment

Submitted filename: Response to Reviewers.docx

pone.0332093.s004.docx (40.1KB, docx)

Decision Letter 2

Rafael dos Santos

11 Dec 2025

PONE-D-25-12434R2
Comparison of machine learning methods in forecasting and characterizing the birch and grass pollen season
PLOS One

Dear Dr. Bulanda,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 25 2026 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Rafael Duarte Coelho dos Santos, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Please note the additional information sent by the editors and resubmit the new version.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: Comparison of machine learning methods in forecasting and characterizing the birch and grasses pollen season

The authors have addressed all of my suggestions. I consider this manuscript ready to be published, at a very high scientific quality.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

To ensure your figures meet our technical requirements, please review our figure guidelines: https://journals.plos.org/plosone/s/figures

You may also use PLOS’s free figure tool, NAAS, to help you prepare publication quality figures: https://journals.plos.org/plosone/s/figures#loc-tools-for-figure-preparation.

NAAS will assess whether your figures meet our technical requirements by comparing each figure against our figure specifications.

PLoS One. 2026 Feb 18;21(2):e0332093. doi: 10.1371/journal.pone.0332093.r006

Author response to Decision Letter 3


24 Dec 2025

Response to Reviewers

Manuscript title: Comparison of machine learning methods in forecasting and characterizing the birch and grass pollen season

Manuscript ID: PONE-D-25-12434R1

Journal: PLOS ONE

Dear Academic Editor,

We thank the Academic Editor for the thorough and constructive evaluation of our manuscript from the machine learning perspective. We appreciate the opportunity to improve the clarity, balance, and interpretability of our work. Below we respond point-by-point and describe all corresponding revisions implemented in the revised manuscript.

________________________________________

Academic Editor Comments

Below is a list of individual comments along with the authors' responses.

1. There is an emphasis on the MAGN method, and more space is dedicated to explain it when compared with the other methods. I understand that it is a recently-developed method therefore most people won't know about it, but even so the difference on the details on it and the other methods is noteworthy. One of the authors of this paper is also an author of another MAGN paper used as reference to this one.

We appreciate the Editor’s observation regarding the relative emphasis placed on the MAGN method. Our original intention in providing a more detailed description of MAGN was to ensure that readers unfamiliar with this recently developed approach could understand its structure and rationale, since unlike boosted trees or neural networks, it does not yet have widely available canonical references or textbook-style explanations. MAGN is conceptually different from the other methods evaluated and requires additional clarification to ensure transparency and reproducibility.

Importantly, our goal was not to privilege MAGN but rather to prevent ambiguity in the description of a method that most readers would be encountering for the first time. We also acknowledge that two of the co-authors previously contributed to the development of MAGN, which further motivated us to provide an especially clear and self-contained explanation so that the method could be independently assessed.

To address the concern regarding imbalance, in the revised manuscript we have added one additional paragraph to both the linear regression and k-NN method descriptions to provide clearer methodological balance, as these two models were previously underrepresented relative to the others. This ensures that all methods are presented with a more comparable level of detail while still providing the clarity necessary for readers to understand and replicate the MAGN approach.

2. Figure 2, that describes MAGN, is not really useful, more details are needed. The text that refers to the figure is also not so clear. MAGN shows best or second-best results in mean accuracy, mean MAE, mean execution time and mean memory consumption. From the results it seems very impressive, but the lack of details on the method may hinder its adoption by other researchers.

We thank the reviewer for pointing out that Figure 2 and its accompanying description lacked sufficient detail. In the revised manuscript, we have substantially improved the caption of Figure 2 to clearly explain the roles of sensory fields, sensory neurons, object neurons, duplicate counters, defining connections, and similarity connections within the MAGN architecture. In addition, we have added a new explanatory paragraph in the text that walks the reader through the structure step-by-step and clarifies how these components interact during learning and inference. These revisions enhance the clarity and interpretability of the figure and provide a more accessible explanation for researchers who may not yet be familiar with MAGN.

3. Table 3 shows the source code file name for each method, with references to the source code (e.g. 35, 36, 38) that are not really useful. Why not use a reference to the canonical paper or book that describes the method? Links to the source code could be provided as additional references or footnotes.

We thank the reviewer for this helpful suggestion. In the revised manuscript, we have added canonical theoretical references for each machine learning method listed in Table 3, ensuring that readers can easily locate the foundational literature describing each model. We have also included a link to the MAGN implementation, consistent with the other methods, and extended the table caption to clarify the purpose of the theoretical references and implementation links. These changes improve both the scientific rigor and the reproducibility of the presented methods.

4. Explanation for figure 7 could be more clear: not only comment on the graphical representation of the plot but also point to points of interest on it. In other words, which conclusions are supported by that chart's features?

We thank the reviewer for noting that the interpretation of Figure 7 required greater clarity. In the revised manuscript, we have expanded the description of Figure 7 to go beyond the graphical explanation and now explicitly highlight the most important patterns visible in the plot. We identify which meteorological combinations lead to high-confidence and high-lift rules, describe the differences between Betula and Poaceae, and explain how the size, position, and color of the points support our conclusions. This updated interpretation makes the chart more informative and demonstrates how the association rules contribute to understanding the meteorological drivers of pollen concentration.

5. Figure 6 shows that pollen concentration is the main predictor for pollen concentration... this happens for seasonal predictions. The most important applications are related to prediction of when something will start and stop, which I expected to be the case for this paper (see introduction, line 8). What would happen if we didn't use pollen concentration as an input variable?

We appreciate the reviewer’s insightful comment regarding the use of past pollen concentrations as predictors and the relevance of forecasting the beginning and end of the pollen season. We agree that onset and offset prediction is an important application. However, the scope of the present study is limited to within-season forecasting and early-season characterization, as stated in line 8 of the introduction. The phrase “early characterization of the upcoming pollen seasons” refers specifically to modeling pollen behaviour after the first measurable occurrence, not to predicting the initial onset itself.

For this reason, experiments excluding pollen concentration as an input variable, while highly valuable for onset/offset prediction, fall outside the objectives of this work. Nevertheless, we acknowledge the importance of the reviewer’s suggestion. Our team is currently developing a follow-up study focused explicitly on pollen season start and end prediction and on evaluating models without historical pollen inputs. We expect this new study to be completed within this year and believe it will address the reviewer’s points in full.

To improve clarity, we revised the sentence to explicitly state that our work focuses on the prediction and characterization of pollen levels after the season has begun. This modification aligns the introduction with the scope of the study and prevents misunderstanding regarding onset prediction.

6. About the title: "Comparison of machine learning methods in forecasting and characterizing the birch and grass pollen season" -- I didn't see much of "characterizing of the seasons", as I expected from Experiment 2, which deals with "characterize the relationships between the input and target variables" but without a good, interpretable explanation -- just a description on lines 360-362, which seems to leave the task to the reader.

Overall, considering the ML methods and applications, I think clearer explanations would help. The problem the authors want to solve is clear (predict pollen concentration), the variables are well explained, but the paper title ("Comparison of machine learning methods in forecasting and characterizing the birch and grass pollen season") seems too broad and results could be better explained.

We appreciate the reviewer’s observation regarding the “characterizing” component referenced in the title. In the revised manuscript, we have expanded the interpretation of Experiment 2 to provide a clearer and more explicit characterization of the relationships between meteorological variables and pollen concentration classes. Additional text now highlights the most informative association rules, explains how these rules reveal underlying seasonal patterns, and connects these findings to established aerobiological knowledge. These enhancements ensure that the characterization aspect of the study is more transparent and accessible to the reader.

We acknowledge that the original presentation placed more emphasis on forecasting results, which may have contributed to the perception that the title was broader than the content. With the strengthened explanations, the paper now more fully reflects both elements, forecasting and characterizing, as indicated in the title.

7. Access to the whole dataset is somehow restricted, rules and contacts to getting to it are listed in the documents (but not on the paper?), I sort of expected a better explanation on why the full data is not available, but I also think this is not an issue that would hinder the publication of the work since there are links to a smaller version of the dataset that allows for replicability.

We thank the reviewer for raising this point. The long-term pollen monitoring data are owned and curated by the Department of Clinical and Environmental Allergology at the Jagiellonian University Medical College, and due to institutional data-governance policies and confidentiality considerations, cannot be made openly available in their entirety. Access requires a formal request, a brief research plan, and confirmation of compliance with data-use and confidentiality rules. To the best of our knowledge, the Data Availability section is the proper and expected location for providing these explanations.

To ensure reproducibility despite these restrictions, we have provided a publicly accessible minimal dataset (including 2022-2023 data points). This subset is sufficient to replicate all analyses and modeling procedures presented in the paper.

8. There are still minor issues with the format of the text, notably in tables.

We thank the reviewer for noting the remaining formatting issues. In the revised manuscript, we have reviewed and corrected all identified formatting problems in the tables and surrounding text.

________________________________________

We hope these revisions resolve the remaining issues. Thank you for your thoughtful feedback. We look forward to your decision.

Sincerely,

AGH University of Krakow

daniel@bulanda.net

Attachment

Submitted filename: Response_to_Reviewers_auresp_3.docx

pone.0332093.s005.docx (40.7KB, docx)

Decision Letter 3

Rafael dos Santos

7 Jan 2026

Comparison of machine learning methods in forecasting and characterizing the birch and grass pollen season

PONE-D-25-12434R3

Dear Dr. Bulanda,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging in to Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Rafael Duarte Coelho dos Santos, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Thanks for replying to the reviewers' questions!


Acceptance letter

Rafael dos Santos

PONE-D-25-12434R2

PLOS ONE

Dear Dr. Bulanda,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Manlio Milanese

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Graphical abstract.

    (TIF)

    pone.0332093.s001.tif (516.1KB, tif)
    S2 File. Supplementary material.

    (DOCX)

    pone.0332093.s002.docx (20.1KB, docx)
    Attachment

    Submitted filename: Response to Reviewers.pdf

    pone.0332093.s003.pdf (296.1KB, pdf)
    Attachment

    Submitted filename: Response to Reviewers.docx

    pone.0332093.s004.docx (40.1KB, docx)
    Attachment

    Submitted filename: Response_to_Reviewers_auresp_3.docx

    pone.0332093.s005.docx (40.7KB, docx)

    Data Availability Statement

    The meteorological data files are available from the European Climate Assessment & Dataset project database: https://www.ecad.eu/dailydata/predefinedseries.php. To provide a minimal data set necessary to replicate our analyses, two CSV files with Betula pollen data and two files with Poaceae pollen data obtained in 2022 and 2023, together with the meteorological data for ten selected factors, were uploaded to the Open Science Framework under DOI 10.17605/OSF.IO/9YZCF (https://osf.io/9yzcf). Requests for access to the whole dataset underlying this study should be directed to the Department of Clinical and Environmental Allergology, Jagiellonian University Medical College, via email: zaklad.alergologii@cm-uj.krakow.pl. Researchers seeking access must meet the criteria for access to confidential data: they must provide a research plan and confirm compliance with data-use and confidentiality policies, i.e., use the data only for the stated study purpose and not share them with a third party.


    Articles from PLOS One are provided here courtesy of PLOS

    RESOURCES