PLOS One. 2022 Dec 8;17(12):e0278112. doi: 10.1371/journal.pone.0278112

A quantitative analysis of the impact of explicit incorporation of recency, seasonality and model specialization into fine-grained tourism demand prediction models

Amir Khatibi 1,*,#, Ana Paula Couto da Silva 1,#, Jussara M Almeida 1,#, Marcos A Gonçalves 1,#
Editor: Ali Safaa Sadiq2
PMCID: PMC9731488  PMID: 36480566

Abstract

Forecasting is of utmost importance for the Tourism Industry. The development of models to predict visitation demand to specific places is essential to formulate adequate tourism development plans and policies. Yet, only a handful of models deal with the hard problem of fine-grained (per attraction) tourism demand prediction. In this paper, we argue that three key requirements of this type of application should be fulfilled: (i) recency—forecasting models should consider the impact of recent events (e.g. weather change, epidemics and pandemics); (ii) seasonality—tourism behavior is inherently seasonal; and (iii) model specialization—individual attractions may have very specific idiosyncratic patterns of visitations that should be taken into account. These three key requirements should be considered explicitly and in conjunction to advance the state-of-the-art in tourism prediction models. In our experiments, considering a rich set of indoor and outdoor attractions with environmental and social data, the explicit incorporation of such requirements as features into the models improved the rate of highly accurate predictions by more than 320% when compared to the current state-of-the-art in the field. Moreover, they also help to solve very difficult prediction cases, previously poorly solved by the current models. We also investigate the performance of the models in the (simulated) scenarios in which it is impossible to fulfill all three requirements—for instance, when there is not enough historical data for an attraction to capture seasonality. All in all, the main contributions of this paper are the proposal and evaluation of a new information architecture for fine-grained tourism demand prediction models as well as a quantification of the impact of each of the three aforementioned factors on the accuracy of the learned models. Our results have both theoretical and practical implications towards solving important touristic business demands.

1 Introduction

According to the 2019 annual research of the World Travel and Tourism Council (WTTC), which covers 185 countries and economies, travel and tourism contributes 10.3% of global Gross Domestic Product (GDP: a monetary measure of the market value of all the final goods and services produced in a specific time period) and supports 319 million jobs, which corresponds to 10% of global employment. Considering new jobs across the world, the contribution of the travel and tourism industry is even higher, accounting for 25% of all new jobs created globally over the last five years (as reported in Global Economic Impact and Trends 2020, available at https://wttc.org/Research/Economic-Impact). Thus, estimates of future tourism demand in the weeks, months, and years ahead can serve as a basis for preparing the activities needed to create comprehensive tourism policies [1].

The importance of accurate tourism prediction becomes indisputable when one realizes that tourism products are generally perishable: unsold flight seats, empty hotel rooms and unsold tickets to a tourism attraction are just a few examples. In addition, tourism demand is sensitive to factors such as exchange rates [2], fuel prices, climate changes [3], local and global financial crises [4] and even epidemics and pandemics. The Coronavirus Disease (COVID-19) pandemic (more information about this disease at https://www.who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance) completely shut down the Tourism Industry worldwide. Accurate forecasting could have helped to deal with the crisis by enabling better management of the sector’s initial recovery. There is a real need for robust prediction models that not only forecast future visits well by considering seasonal aspects of tourism behaviour, but also adapt to recent trends and events and to idiosyncratic aspects of the attractions.

In our previous work [5], the effects of environmental and social media data over tourism visitation have been studied in two scenarios—indoor and outdoor attractions. Visitation census, environmental features and social media data, for 27 museums and galleries in the United Kingdom (indoor attractions) as well as 76 national parks in the United States (outdoor attractions) have been exploited. Our proposal showed superior prediction accuracy when compared to the State-Of-The-Art (SOTA) results using features from both social media and environmental data adopting various prediction models and exploiting different modeling approaches.

In our previous analysis, it was observed that for outdoor attractions, environmental features have better predictive power while the social media features have more influence in the case of indoor attractions. In any case, best results, in all scenarios, were obtained when using both types of features jointly as input to a Support Vector Regression (SVR) prediction model obtaining moderate or highly accurate prediction results for around 93% of the attractions.

In this work, we propose a new tourism prediction methodology that explicitly incorporates aspects related to recency, seasonality and specialization into the prediction models. Beyond explicitly including these requirements in our information architecture (something that previous work has not done), we quantify the impact of each of these effects, as well as their interactions, on the fine-grained, high-accuracy tourism demand prediction task, while also improving our previous (state-of-the-art) results in tourism prediction [5]. We do this by arguing and demonstrating that three key requirements, namely (1) recency, (2) seasonality and (3) model specialization, should be fulfilled by an accurate model. These requirements should be captured as explicit features or properties of tourism forecasting models. Though these characteristics have been considered to different extents, and in isolation, in different solutions, we are the first to treat them together as essential aspects that should be explicitly incorporated in conjunction into prediction models for fine-grained (attraction-level) tourism demand prediction, and the first to measure and quantify their isolated and combined impact. These requirements are briefly introduced next.

Recency considers the impact of recent events on the prediction models. Prior work mostly focuses on seasonality as the main temporal aspect for tourism prediction. We argue that other temporal aspects should be considered to assess whether and how recent events, such as financial crises, new trends, epidemics and pandemics, or new infrastructure, may impact predictions.

A few prior studies analyze the effect of recency on tourism demands. In [6] and [7], the authors study temporal aspects considered as important to predict tourism visits. Particularly, in [6], an algorithm for recommending personalized tours is proposed using users’ recent preferences as one of the variables of their model. In their tour recommendation algorithm, they enhance the models by a weighted update of user interests based on the recency of their visits giving more emphasis to more recent Point of Interest (PoI: an entity of interest with well-defined location for example museums, churches, waterfalls and coffee shops) visits. They show improvements upon earlier tour recommendation work.

Though prediction models such as those based on the Auto Regressive Integrated Moving Average (ARIMA) [8] indirectly exploit recency by means of time series modeling, our argument is that recency should be promoted to an explicit first-class feature incorporated into the prediction models. By first-class features we mean input features given explicitly to the model, which other models, such as ARIMA-based ones, capture only implicitly.

Seasonality focuses on the inherently cyclic behaviour of tourism demands, and several studies in the literature treat it as the main temporal aspect for tourism prediction. The authors of [9] state that seasonality is one of the main phenomena affecting tourism. According to them, seasonality is the systematic, although not necessarily regular, intra-year movement caused by changes in the weather, the calendar, and the timing of decisions made by the agents of the economy, directly or indirectly through production and consumption decisions. The authors of [10], instead, explain seasonality as a temporal imbalance in the phenomenon of tourism, which may be expressed in terms of dimensions of such elements as numbers of visitors, expenditure of visitors, traffic on highways and other forms of transportation, employment, and admissions to attractions.

In [11], the authors state that, regarding periodicity, the main focus of interest had been annual seasonality, with studies that show the differences in tourism activity between different seasons. In contrast, in their work, they perform a decomposition analysis of yearly, monthly and weekly seasonalities of tourism demand. They do so by conducting an in-depth analysis of intra-monthly and intra-weekly tourism demands using entropy and relative redundancy measures. The authors show that seasonality is present in annual, monthly and weekly frequencies using the Balearic Islands airports as their case study. In addition, they show that monthly and weekly seasonality differs across geographical markets. Since variations during the year are often caused by the climate or other social factors, intra-monthly and intra-weekly changes in tourism demand should be more closely associated with institutional or social factors, due to non-working days during the week, work holidays and other events that take place at specific times, such as Christmas, school or university holidays and work vacations.

The authors of [12] focus their study on the impact of seasonality on cultural tourism—defined as tourism focused on cultural motivations, including visits to museums and archaeological sites. They analyze tourism seasonality in some selected destinations in Sicily, concluding that cultural destinations are less impacted by seasonality in tourism flows.

In a recent survey on tourism forecasting [7], the authors state that the most widely adopted statistical time series prediction method is the seasonal auto-regressive integrated moving average (SARIMA). They claim that SARIMA is able to capture seasonality and recency implicitly in its forecasts. SARIMA is one of our baselines, serving the purpose of comparing explicit versus implicit modeling of such requirements.

Our experiments aim to demonstrate that explicitly exploiting seasonality can greatly improve the prediction accuracy.

Model specialization advocates creating specialized individual models for each touristic attraction. The main motivation is that particular attractions may have very specific intrinsic patterns of visitations. Studies such as [13, 14] explore tourists’ motivations in the process of attraction selection. For instance, the authors of [13] identify different motivations and behavioral patterns in visits to different types of museums. For example, tourists visiting the historic Rembrandt House were more likely to be accompanying other people, more likely to want to learn new things, and more likely to be in search of entertainment or local culture and history than those interviewed at the Stedelijk museum of modern art. They also find that the distance of the visitor from the origin, the geographical origin of tourists, their socio-demographic characteristics, the form of travel and the length of stay at the destination are important factors affecting the choice of attractions to visit. In [14], in turn, the authors study the preferences of generation Y (the generation born in the 1980s and 1990s, comprising primarily the children of the baby boomers and typically perceived as increasingly familiar with digital and electronic technology), finding that this generation has its own profile and patterns of consumption. They discuss spending preferences, the technology facilities in the attractions, the design of the place and the presence of information in social media as some of the motivational differences. All in all, these factors can affect the visitation patterns of different attractions differently, motivating specialized models of visitation for each attraction: creating a specialized individual model for each attraction can be advantageous since individual attractions may have very specific idiosyncratic patterns of visitations. On the other hand, there may be cases in which there is not enough data to train an individual model for each site. In such cases, it is more viable to train a single model for all attractions of a given type, so that training benefits from the vast amount of available social, climate and official data. Our experiments analyze this trade-off in depth.

In our work, we aim to demonstrate that, by explicitly exploiting the three proposed key requirements of tourism prediction as features, our models can greatly improve prediction accuracy over the SOTA results. Indeed, our experimental results, considering a rich set of indoor and outdoor attractions with environmental and social data, show that the explicit incorporation of such requirements into the models can improve the rate of highly accurate predictions by more than 320% against the current SOTA [5]. Our proposed models can even help to solve difficult prediction cases that are poorly solved by current solutions. For instance, the National Portrait Gallery in the U.K. saw a huge increase in social media reviews (over 50% by April 2015) that was not accompanied by real-world visits, causing the models to mistakenly follow the social patterns and ultimately resulting in low accuracy. Another example is the Bryce Canyon national park in the U.S., where visits experienced a period of atypical increase (more than 20% from Feb. to Sep. 2016 in comparison with the same period in 2015). That increase was reflected neither in the environmental features nor in the social media reviews, both inputs exploited by the SOTA model. These situations can be dealt with by explicitly incorporating recency and seasonality features.

1.1 Research questions

The main hypotheses of our work, posed as explicit research questions to be answered, are:

  • RQ 1: Do recency, seasonality and model specialization (characteristic of attraction) influence the accuracy of predicting visits to tourist sites? Using our collected rich set of indoor and outdoor attractions, our specialized and global prediction models explicitly exploit recency and seasonality features in order to evaluate whether each of the key requirements of tourism prediction influences the accuracy of the models.

  • RQ 2: What is the impact of each of recency, seasonality and model specialization in tourism demand prediction? In order to quantify the isolated and combined impact of each of these three requirements (recency, seasonality, and specialization), a factorial design analysis [15] is applied. In this analysis, the impact of each of the three requirements is evaluated in two different scenarios of attractions: outdoors (parks) and indoors (museums). In both scenarios seasonality, model specialization and the interaction between them have the largest impact in the prediction accuracy, with seasonality being more important in the case of outdoors. It was observed that model specialization is the most prominent factor to improve results, mainly for highly accurate predictions.

  • RQ 3: How do scenarios with data scarcity hinder the accuracy of prediction models that exploit recency, seasonality and model specialization? Our results show that the absence of recency or seasonality features drastically reduces the accuracy of prediction models in scenarios with data scarcity. Recency features are not as important as the seasonality ones, but they still have a relevant impact on prediction accuracy, mainly in situations in which there is not enough historical data to capture seasonality for a given attraction.
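The factorial design analysis mentioned in RQ 2 can be illustrated with a minimal sketch. It assumes a full 2^k design in which each factor (here recency, seasonality and specialization) is coded as -1 (absent) or +1 (present), and it computes each effect from the sign table together with the share of total variation that effect explains. The function names and the synthetic responses in the usage note are hypothetical and are not taken from the paper's experiments.

```python
from itertools import combinations, product
from typing import Dict, Tuple


def factorial_effects(
    responses: Dict[Tuple[int, ...], float], names: Tuple[str, ...]
) -> Dict[str, float]:
    """Main and interaction effects of a full 2^k design (levels coded -1/+1)."""
    k = len(names)
    runs = list(product((-1, 1), repeat=k))
    if set(responses) != set(runs):
        raise ValueError("need exactly one response per level combination")
    effects = {}
    for size in range(1, k + 1):
        for idx in combinations(range(k), size):
            label = "*".join(names[i] for i in idx)
            total = 0.0
            for run in runs:
                sign = 1
                for i in idx:
                    sign *= run[i]  # sign-table entry for this effect
                total += sign * responses[run]
            effects[label] = total / len(runs)
    return effects


def variation_explained(effects: Dict[str, float]) -> Dict[str, float]:
    """Fraction of total variation attributed to each effect."""
    total = sum(q * q for q in effects.values())
    if total == 0:
        return {name: 0.0 for name in effects}
    return {name: q * q / total for name, q in effects.items()}
```

As a usage illustration with made-up numbers, a response of the form `10 + 2*recency + 3*seasonality` over all eight level combinations yields effects of 2 and 3 for the two active factors and 0 for specialization and for every interaction, so seasonality accounts for 9/13 of the variation.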

1.2 Related work

Table 1 summarizes the main related studies that exploit, in one way or another, the aforementioned key tourism requirements. The table highlights, for each work: (i) whether the work applies the proposed techniques to multiple attractions or concentrates on a single case study, such as a country or a single touristic site; (ii) whether the work uses external features (data) as a proxy to predict visitation, for instance socio-economic or environmental features; and (iii) whether it explicitly explores recency and/or seasonality.

Table 1. Related work and our contributions.

Related Work | Work Domain | Our work
(The check-mark columns Multiple Attractions, External Data, Explicit Recency and Explicit Seasonality are not reproduced here.)
[16] | Predicting coarse-grained tourism demand for entire country Turkey using multiple socio-economic features | Use of environmental and social media features in a fine-grained prediction level for multiple attractions
[17] | Use of Wikipedia usage trends in order to forecast tourism demand of Hawaii reporting the accuracy of their prediction results only by Root Mean Squared Error (RMSE) | Use of environmental and social media features in more than 100 outdoor and indoor attractions
[18] | Analyses the relationship between the internet search data (Baidu in China) and the actual tourist flow only for a single city, Beijing Forbidden City | Extended study of tens of attractions divided into two groups, studying the performance of different classes of features
[19] | Use of Location Based Social Networks (LBSN) to study mobility of tourists and citizens in a coarse-grained fashion | Fine-grained analysis in social media networks
[20] | Uses the locations of photographs in Flickr to estimate visitation counts in some recreational sites | Improving the accuracy of prediction models exploiting environmental features alongside explicit use of recency and seasonality factors
[21] | Analyses the climate and visitation data for the U.S. national parks using a single model of third-order polynomial temperature model with an accuracy of 69% | Use of multiple prediction models exploiting social media features alongside explicit use of recency and seasonality factors
[22] | Exploits travellers’ Google web search and history of tourism arrivals to analyze temporal relationships between search terms and tourist arrivals in a single attraction (a Swedish mountain) | Quantification of performance of the explicit use of recency and seasonality factors for more than 100 attractions while improving the accuracy of results by adding environmental features and other types of external data
[23] | Use of search engine data with a de-noising step in order to avoid misleading or invalid predictions by comparing the performance of different noise-processing techniques only in Jiuzhaigou park in China | Analysis of more than 100 attractions with different characteristics in two categories of indoors and outdoors in a fine-grained manner
[24] | Tourism demand prediction of top five most visited museums in London with free admission evaluating different algorithms exploiting the Google Trends index as the main feature | Compared to our work, the former is very limited in terms of the type of attraction, location and exploited features, while their main feature Google Trends index is a black-box with proved probability of overestimation problem
[9] | Analyse seasonality as one of the main phenomena affecting tourism, i.e. systematic, although not necessarily regular, intra-year movement caused by changes in the weather, the calendar, and timing of decisions made by the agents of the economy | In contrast to this work, our contribution is in explicit use of recency factor besides exploiting external data to improve the accuracy of fine-grained prediction models
[10] | Analyse seasonality as a temporal imbalance in the phenomenon of tourism, which may be expressed in terms of dimensions of such elements as numbers of visitors, expenditure of visitors, traffic on highways, employment, and admissions to attractions | Fine-grained analysis and quantification of the effects of the recency factor, isolatedly and conjointly with seasonal factors in two different classes of indoor and outdoor attractions
[11] | Perform a decomposition analysis of yearly, monthly and weekly seasonalities of tourism demand showing that seasonality is present in annual, monthly and weekly frequencies using the Balearic Islands airports as their single case study | A complete factorial design analysis for quantifying the effect of the recency factor in specialized and global models separately for two types of attractions
[12] | Study the impact of seasonality on cultural tourism, defined as tourism focused on cultural motivations, including visits to museums and archaeological sites. They analyze tourism seasonality in some selected destinations in Sicily, concluding that cultural destinations are less impacted by seasonality in tourism flows | Our work not only quantifies the effects of recency and seasonality factors in isolation and conjointly but also presents improvements in performance for our predictions exploiting external features regarding social media and environmental data using dozens of attractions
[7] | Analyse and state that the most widely adopted statistical time series prediction method is the SARIMA which is able to capture seasonality and recency implicitly in forecasts | Analysis of multiple prediction models exploiting external features of social media and environmental alongside the explicit use of recency and seasonality in specialized models. SARIMA is used as one of our baselines.
[6] | Propose an algorithm for recommending personalized tours based on users’ recent preferences as one of the variables of their model enhancing their models by a weighted update of user interests based on the recency of their visits giving more emphasis to more recent visits | Our focus is in the task of tourism demand prediction and not on recommendation of touristic sites, though recency and seasonality features are explored as in that work. Our work also quantifies the effect of each of the factors for two classes of indoor and outdoor attractions
[25] | Investigate experiences of Chinese economy hotel guests using online reviews as proxy. Applies a deep learning fine-grained sentiment analysis to rank each of positive and negative sentiments associated with tourists sentiments such as location, facilities, service, price, image, sound insulation. | Similarly to our work, the authors use external data for fine-grained predictions. However, the focus of our work is on prediction of visits instead of sentiment analysis. Our work also exploits recent and seasonal behaviors explicitly in our feature-set.
[26] | Develops a scalable online platform for extracting, analyzing, and sharing multi-source multi-scale human mobility flows to assist human mobility monitoring and analysis during disaster events such as the ongoing COVID-19 pandemic in understanding human mobility dynamics. | The focus of this work is mostly on providing and monitoring fine-grained spatio-temporal mobility data while our work analyses multiple prediction models exploiting external data alongside explicit use of seasonality and recency in order to predict tourism demand.
[27] | Builds a fine-grained tourist satisfaction prediction model based on deep learning, using features such as “location, service, cost performance, environment, facilities and others” of the destination, and their division into several fine-grained dimensions. | Similarly to our work, the authors use external data for fine-grained prediction. However the focus of our work is in visits prediction instead of tourist satisfaction prediction. In addition, our work exploits recent and seasonal features to obtain more accurate results.

1.3 Contributions and outline of the paper

The remainder of this article is organized as follows:

  • Section 2 presents the specification of our collected dataset, followed by the problem definition, a short description of the exploited prediction techniques, the experimental methodology, the features exploited in the prediction models, and the evaluation metrics. The investigation of all proposed RQs requires a rich dataset to permit in-depth analysis of the effects of the tourism requirements in multiple categories of attractions.

  • Section 3 aims to experimentally answer our posed Research Questions (RQs) for tourism attractions. Sections 3.1 and 3.2 investigate RQ1, demonstrating that the three key tourism requirements, i.e. recency, seasonality and model specialization, are essential for the fine-grained, high-accuracy tourism demand prediction task. More than that, these requirements should be incorporated as explicit features into the learning models. Our experimental evaluation confirms our hypotheses, with observed gains over the other solutions. We also show that the explicit incorporation of such requirements into the models helps to solve very hard cases.

    Regarding RQ2, a factorial analysis is performed in order to quantify the impact of each of the three requirements on the accuracy of the learned models for indoor and outdoor attractions. The analyses show that the requirements with the largest impact are model specialization and seasonality, but recency is effective when there is not enough historical data about a specific attraction.

    To investigate RQ3, a study on the performance of each of the tourism prediction requirements is performed for cases with no recent or historical data for a given attraction. The study shows that the absence of recency or seasonality features drastically reduces the accuracy of prediction models in different scenarios.

  • Section 4 discusses and analyses our experimental results, connecting them with the posed Research Questions. It provides insights into why our proposed methods behave the way they do, based on our interpretation of all performed analyses.

  • Section 5 summarizes our results to answer the RQs. It emphasizes that the best results are obtained when using all types of features, i.e. external data features jointly with key tourism requirements. Conclusions and directions for future work are also provided.

All in all, the main contributions of this paper are (i) a new information architecture that explicitly incorporates three new factors into the state-of-the-art fine-grained tourism prediction models, greatly improving model accuracy and (ii) a quantification of the impact of each of the three factors on the accuracy of the learned models. Our results have both theoretical and practical implications towards solving important touristic business demands.

2 Materials and methods

In this section, we elaborate our experimental methodology for the task of evaluating the role of recency, seasonality and model specialization in tourism demand prediction. We first review the adopted datasets and then present the problem formulation. Next, the exploited prediction techniques are explained, offering a brief description of each of them. The learning and parameterization of the prediction models are discussed next. Finally, the prediction architecture is illustrated along with the definition of the factors for the factorial design analysis.

2.1 Datasets

Our present work relies on our previously published FISETIO dataset [28] for experimental analysis. This dataset was collected from five official and governmental sources: (1) the U.S. National Park Service, the main source of official data on outdoor tourism demand; (2) TripAdvisor, the source of the social media features for both the outdoor and indoor datasets; (3) the U.S. National Climate Data Center, the source of climate data for the outdoor attractions; (4) the Department for Digital, Culture, Media and Sport of England, which provides the official visit counts for the indoor attractions; and (5) the U.K. national weather service (Met Office), the source of weather conditions for the indoor dataset.

We collected, cleaned and merged all data into two categories of attractions, namely outdoors and indoors. Fig 1 illustrates the data collection phases used to obtain the indoor and outdoor datasets for our analysis. Table 2 briefly lists the sources and features (social media and environmental) in our datasets. The reader is referred to our published dataset paper [28] for a detailed description of the data sources and of the data collection, data cleaning and data integration processes. The data collection method complies with the terms and conditions of the data provider, in this case the Mendeley Repository.

Fig 1. Data collection phases.


Table 2. Overview of indoors (I) and outdoors (O) datasets and features (VIS: visits, SOC: social media features, ENV: environmental features).

Dataset | Provider | Granularity | Features | Data Range
I.VIS | Department for Digital, Culture, Media and Sport of England | monthly | total number of visitors to museums and galleries | 2004-03 to 2018-07
I.ENV | U.K. national weather service (Met Office) | monthly | min and max temperature, rainfall, sunny hours and days of air frost | 1996-01 to 2018-08
I.SOC | TripAdvisor travel website | monthly | number of reviews, average ratings | 2001-08 to 2018-08
O.VIS | U.S. National Park Service | monthly | total number of visitors | 1996-01 to 2016-08
O.ENV | U.S. National Climate Data Center | monthly | minimum, average, maximum temperature, average precipitation | 2000-01 to 2016-10
O.SOC | TripAdvisor travel website | monthly | number of reviews, average ratings | 2011-01 to 2016-09

2.2 Problem Definition

The problem we address is forecasting visitation for fine-grained touristic points. In addition to the social media and environmental features exploited in our previous work [5], we here also incorporate both recency and seasonality requirements. Moreover, we consider per-attraction model specialization. Given a touristic attraction, the prediction problem is defined as follows.

First, equally spaced non-overlapping time windows are created with the same temporal granularity (e.g., a month, a week, a day or an hour) for each time series of the variables in the social media, environmental, recency and seasonality features, in the format $X = \{X_1, X_2, \ldots, X_m\}$, where $m$ is the number of features. These time series (e.g., number of reviews, average temperature, visits in the last month, visits in the last year) serve as the input of the prediction models. A time series $X_i$ is a sequence $\{x_i(1), x_i(2), \ldots, x_i(t)\}$, where $x_i(t)$ denotes the value of variable $X_i$ measured (time-lagged) in time window $t$ for the specific touristic attraction that is the target of prediction. Social media and environmental features are variables measured at timestamp $t$, while recency and seasonality features correspond to the history of visitation counts (i.e. the response variable) in recent months or in the last year of visitation, which are added as extra inputs and time-lagged to the time window $t$.

The objective is to learn a function $f$ that forecasts $y(t)$, the tourism visitation at timestamp $t$ in a target attraction, with the lowest prediction error, given as input the feature set $X$ (social media, environmental, recency and seasonality features) for each time window in the interval $[1, t-k]$ (for $k > 0$), i.e., $\{x_1(1), x_2(1), \ldots, x_m(1), x_1(2), x_2(2), \ldots, x_m(2), \ldots, x_1(t-k), x_2(t-k), \ldots, x_m(t-k)\}$, where $m$ is the number of available features (in some cases $f$ combines an input vector $X$ and the response variable $y$).
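The feature construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes monthly data, a single recency lag of k windows, a single seasonality lag of one year (12 windows), and external series lagged by the same k. The function name `build_lagged_rows` and the feature names in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple


def build_lagged_rows(
    visits: List[float],
    exog: Dict[str, List[float]],
    k: int = 1,
    season: int = 12,
) -> List[Tuple[List[float], float]]:
    """Return (feature_vector, target) pairs for every window t with enough history.

    Assumptions (not from the paper): one recency lag (t - k), one
    seasonality lag (t - season), and external features lagged by k.
    """
    if k < 1 or season < k:
        raise ValueError("need 1 <= k <= season")
    names = sorted(exog)  # fixed feature order across rows
    rows = []
    for t in range(season, len(visits)):
        features = [exog[name][t - k] for name in names]  # social/environmental inputs
        features.append(visits[t - k])       # recency: visits k windows ago
        features.append(visits[t - season])  # seasonality: same window last season
        rows.append((features, visits[t]))
    return rows
```

For example, with 14 months of visits and two external series named `reviews` and `avg_temp`, the call `build_lagged_rows(visits, exog, k=1, season=12)` yields two rows, one for month 13 and one for month 14, each holding the two lagged external values followed by the recency and seasonality values.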

2.3 Prediction techniques

This section briefly describes the techniques exploited for forecasting fine-grained tourist visit counts. Building on our previous work, the current state-of-the-art in the field [5], the learning models with the best prediction accuracy were selected, namely Support Vector Regression (SVR) [29] and the Auto Regressive Integrated Moving Average (ARIMA) based models: Seasonal ARIMA (SARIMA) [8] and Seasonal ARIMA with eXogeneous variables (SARIMAX) [30]. A deep neural network based method, Long Short Term Memory (LSTM) [31], is also considered here; it was not included in [5] and is currently a popular method. Finally, we introduce two naive models, which are simple models based exclusively on historical observation [32].

The objective of all techniques is to estimate y(t), the number of visits to a given touristic place at timestamp t, given the input vector X as the feature set, which includes social media, environmental, recency and seasonality features for each time window in the interval [1, t−k]. However, the models differ in how they adopt the features. For instance, SARIMA models can exploit only the history of the number of visits, while SARIMAX exploits not only that history but also the complete feature set. The other models use the complete feature set, i.e. social media, environmental, recency and seasonality features.

2.3.1 Support Vector Regression

Support Vector Regression (SVR) is an extension of Support Vector Machines (SVM) widely used for regression tasks [29]. SVR performs a “linear regression” in a high-dimensional feature space resulting from a (nonlinear) mapping provided by a kernel function. The linear model (in the feature space) is given by:

$$f(X, W) = \sum_{j=1}^{m} W_j\, g_j(X) + b, \qquad (1)$$

where W is the weight vector to be “learned”, gj(X) denotes a set of nonlinear transformations on the input feature set, and b is the “bias” term. SVR pursues the best trade-off between the model’s empirical error and the model complexity by constraining SVR regression function f(,) to the hyper-planes function class, and employing a margin around the hyper-plane. Moreover, f(,) only depends on a reduced set of the training data called the Support Vectors (SV), those which correspond to the active constraints in the optimization problem [29] defined as:

$$L(y, f(X, W)) = \begin{cases} 0 & \text{if } |y - f(X, W)| \le \epsilon \\ |y - f(X, W)| - \epsilon & \text{otherwise} \end{cases} \qquad (2)$$

where y is the value to estimate.

The key parameters of SVR are the kernel function K, the margin of tolerance ϵ, and the trade-off C between the model complexity and the degree to which deviations larger than ϵ are tolerated.
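The ϵ-insensitive loss of Eq 2 can be illustrated with a few lines of Python. This is only a sketch of the loss itself, not of the SVR solver used in the experiments; the function name and the numeric values are hypothetical and chosen purely for illustration.

```python
def epsilon_insensitive_loss(y, y_pred, epsilon):
    """Loss of Eq 2: zero inside the epsilon tube, linear outside it."""
    residual = abs(y - y_pred)
    return 0.0 if residual <= epsilon else residual - epsilon


# Illustrative values only (not taken from the paper's data).
inside_tube = epsilon_insensitive_loss(100.0, 100.2, epsilon=0.3)
outside_tube = epsilon_insensitive_loss(100.0, 101.0, epsilon=0.3)
```

A prediction that deviates from the target by less than ϵ contributes no loss, which is why only the points on or outside the tube (the support vectors) shape the learned function.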

2.3.2 Seasonal Auto Regressive Integrated Moving Average (SARIMA)

The Auto Regressive Integrated Moving Average (ARIMA) model is a classical time series forecasting method first proposed by Box and Jenkins [33]. In this model, the future value of a time series is a linear function of previous values of the original series and random errors. In other words, ARIMA projects the future values of a series based entirely on its own inertia. Thus, the set of predictor variables X used by ARIMA consists of the past measurements of the response variable y(t), that is, X = {y(1), y(2), …, y(t−k)}, k > 0. When a seasonal effect is observed, a generalization of the ARIMA model is used, i.e. the SARIMA model. A SARIMA model is an equation of the following form:

$$f(X, t) = \frac{\Theta\, \theta\, \epsilon(t)}{\Phi\, \varphi\, \Delta\, \delta}, \qquad (3)$$

where Φ, Δ and Θ are the polynomials that compute the seasonal auto-regressive, differencing and moving average components, respectively; φ, δ and θ are the respective regular (non-seasonal) polynomials; and ϵ(t) is the estimation error.

2.3.3 Seasonal Auto Regressive Integrated Moving Average with eXogenous variables (SARIMAX)

Due to the importance of exogenous data (i.e., social media, environmental data, recency and seasonality features) in our experiments, SARIMAX models (SARIMA with exogenous variables) [30] are applied. In addition to the history of the response variable, SARIMAX takes into account the input features, i.e. the external predictor variables, that is, the set of time series X={xi(1),xi(2),...,xi(t)}, where xi(t) denotes the value of variable Xi measured (time-lagged) in time window t. The SARIMAX model can be formulated as:

$$f(X, t) = \frac{\Theta\, \theta\, \epsilon(t)}{\Phi\, \varphi\, \Delta\, \delta} + \beta X, \qquad (4)$$

where the parameters Θ, Φ, Δ, θ, φ and δ are defined as in Eq 3.

2.3.4 Neural network

Introduced in [31], Long Short-Term Memory (LSTM) neural network models are well-suited to classification and regression as well as prediction tasks based on time series data. LSTMs have a notion of memory that may help capturing past trends in the data. The use of LSTMs in the context of tourism prediction is not new; in [34] the authors apply LSTM to tourism flow prediction, presenting interesting results.

An LSTM network consists of a chain of cells, and each LSTM cell is configured by four gates: input gate, input modulation gate, forget gate and output gate. Input gates take new inputs from outside and process newly incoming data. Memory gates take inputs from the output of the LSTM cell in the last iteration. Forget gates decide when to forget the output results, thus selecting the optimal time lag for the input sequence. Output gates take all the calculated results and generate the final output [31].

Consider a time-series input represented as X={xi(1),xi(2),...,xi(t)}, where xi(t) denotes the value of variable Xi measured (time-lagged) in time window t, and hidden state cells H = {h(1), h(2), …, h(t)}. For t = 1, …, T, the LSTM computes:

$$f(X, t) = W_{hy} h_t + b_y \qquad (5)$$
$$h_t = H(W_{hy} x_t + W_{hh} h_{t-1} + b_h), \qquad (6)$$

where W and b are respectively weight matrices and bias vector parameters which need to be learned during model training.

2.3.5 Naive models

In general, naive forecasting models are simple models that are based exclusively on historical observation [32]. Our work defines two naive models as our baselines based on seasonality and on recency of tourism activities—Naive-Seasonality and Naive-Recency. For a naive prediction with seasonality, a simple approach to determine y(t) in a time window t, is to pick the number of visits at y(t−12) in the available past data. Similarly, for recency, the naive model predicts y(t) based on the number of visits at y(t−1).

$$f(t) = \begin{cases} y(t-12) & \text{naive seasonality} \\ y(t-1) & \text{naive recency} \end{cases} \qquad (7)$$
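The two naive baselines of Eq 7 amount to a single lookup into the visitation history. The sketch below assumes monthly data stored oldest first; the function name and the toy series are hypothetical and only serve to illustrate the rule.

```python
def naive_forecast(history, mode):
    """Forecast y(t) from history = [y(1), ..., y(t-1)] (monthly, oldest first).

    mode "seasonality" returns y(t-12); mode "recency" returns y(t-1).
    """
    lag = {"seasonality": 12, "recency": 1}[mode]
    if len(history) < lag:
        raise ValueError(f"need at least {lag} past observations")
    return history[-lag]


# Toy series: y(1)=1, ..., y(24)=24, so the forecast target is y(25).
toy_history = list(range(1, 25))
recency_forecast = naive_forecast(toy_history, "recency")          # y(24)
seasonality_forecast = naive_forecast(toy_history, "seasonality")  # y(13)
```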

Table 3 lists all prediction techniques presented in this section, summarizing their main characteristics. A rich and diverse set of techniques is exploited, aiming at investigating how each of them performs in our target prediction problem.

Table 3. Prediction techniques comparison.
Method Exploit history of visits Exploit social media and environmental features Consider temporal dependency among data observations
SVM No
SARIMAX Yes
SARIMA Yes
LSTM Yes
Naive Models Yes

2.4 Model learning and parameterization

Note that a specific model is learned (and later evaluated) for each attraction (for each park, museum or gallery), and thus, there is a different parameter choice for each of them. Nonetheless, for the sake of brevity, the values of the best parameters reported next are averages over all attractions. Parameter tuning for the models was performed as follows.

2.4.1 Support Vector Regression

For Support Vector Regression (SVR), the kernel function is set to “linear”, because in preliminary experiments it produced the best results, besides being more efficient (lower execution time). The cost parameter C was varied in the interval [2^−5, 2^10], and the best value varied across attractions; on average, however, the best value was C = 116. The tolerance ϵ was tested in the range (0, 1) with steps of 0.1, and 0.3 was found to be the best value of ϵ (again on average across all attractions).
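As a rough illustration of the search space described above, the candidate values can be enumerated as follows. The text gives only the bounds for C, so sweeping integer powers of two is an assumption made here; the variable names are hypothetical.

```python
from itertools import product

# Assumption: C is swept over powers of two with integer exponents,
# from 2^-5 up to 2^10 (the text only states the interval bounds).
c_grid = [2.0 ** exponent for exponent in range(-5, 11)]

# Steps of 0.1 strictly inside the open interval (0, 1).
epsilon_grid = [round(0.1 * step, 1) for step in range(1, 10)]

# Every (C, epsilon) pair that a per-attraction tuning run would evaluate.
search_grid = list(product(c_grid, epsilon_grid))
```

Because a separate model is tuned for each attraction, a grid of this size would be evaluated once per attraction, which is consistent with the best values being reported as averages.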

2.4.2 Seasonal Auto Regressive Integrated Moving Average and Seasonal Auto Regressive Integrated Moving Average with eXogeneous variables

Regarding Seasonal Auto Regressive Integrated Moving Average (SARIMA) and Seasonal Auto Regressive Integrated Moving Average with eXogeneous variables (SARIMAX) models, the forecast package in R (available at https://github.com/robjhyndman/forecast) was used in order to optimally find the best parameters (order of each polynomial) of the SARIMA model, as well as to find the seasonality pattern of the data.

2.4.3 Long Short-Term Memory

For Long Short-Term Memory (LSTM), different network architectures were explored, applying the ADAptive Moment estimation (ADAM) optimizer [35] for parameter optimization. ADAM is an adaptive learning rate optimization algorithm designed specifically for training deep neural networks. The best results were obtained by: (i) normalizing all the variables to the range (-1,1); (ii) using the mean-squared-error metric for the loss function; (iii) using a sequential model with one dense layer consisting of 100 neurons, built with the Keras library (an open-source neural-network library written in Python); and (iv) the following settings: 1000 epochs, dropout of 0.2 and batch size of 30.

2.5 Prediction architecture

Fig 2 depicts the methodology, with the division of the datasets into training and test sets, having social media, environmental, recency and seasonality features as inputs and the number of visits as the response variable (y). Recency features consist of the visit counts in the previous 4 months (y-1, y-2, y-3, y-4) and their log values (log y-1, log y-2, log y-3, log y-4), while seasonality features are the number of visits in the same period of the last year, i.e. (y-12, y-13, y-14, y-15), and their log values (log y-12, log y-13, log y-14, log y-15). Furthermore, in our prediction architecture, as can be seen in Fig 2, since social media and environmental data may not be available at prediction time, we exploit the values of the input features in the last year (Xi−12) as the input of the models in the test case. This strategy has been used because of the annual seasonality behaviour of the tourism domain, as discussed in Section 1.
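The construction of the recency and seasonality features can be sketched as below. This is a minimal illustration, assuming a 0-indexed list of strictly positive monthly visit counts and the natural logarithm (the base of the log is not stated in the text); the function and constant names are hypothetical.

```python
import math

RECENCY_LAGS = (1, 2, 3, 4)          # previous 4 months
SEASONALITY_LAGS = (12, 13, 14, 15)  # same period of the last year


def lag_features(visits, t, lags):
    """Return lagged visit counts and their logs for target month index t.

    visits[i] is the visit count of month i (0-indexed, oldest first).
    Assumes every lagged count is strictly positive so the log is defined.
    """
    if t - max(lags) < 0:
        raise ValueError("not enough history for the requested lags")
    features = {}
    for lag in lags:
        value = visits[t - lag]
        features[f"y-{lag}"] = value
        features[f"log y-{lag}"] = math.log(value)  # natural log (assumption)
    return features


# Toy series of 20 months; features are built for the month at index 16.
toy_visits = [100 * (i + 1) for i in range(20)]
toy_features = lag_features(toy_visits, 16, RECENCY_LAGS + SEASONALITY_LAGS)
```

With 15 months of lag required by the seasonality features, the first 15 observations of each series can serve only as feature sources and not as training targets.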

Fig 2. Tourism demand prediction methodology adopting social media, environmental, recency and seasonality features.


Fig 3 illustrates the construction of specialized and global models, built in order to compare the prediction accuracy of the specialized models with that of a global model trained with all attractions of each category (indoors or outdoors). A specialized model is trained only with the features of a single attraction, while a global model is built from the feature observations of all attractions of a given type.

Fig 3. Model specialization—illustration of global vs. specialized models.


In our experiments, cross-validation was performed to learn and optimize the prediction models. For each attraction, each time series was first divided into two parts: the training set, consisting of the first m months of data (m = 30 for outdoor attractions and m = 76 in indoor attractions), and the test set, consisting of the remaining months of data (4 months for both outdoor and indoor attractions). The training set is used to learn the prediction model and optimize the model’s parameters, while the test set is used for evaluating the learned model and reporting effectiveness results. For models requiring parameter tuning, the training set is further split randomly into additional parts.

Cross-validation with k = 10 (k-fold cross validation) was employed. In k-fold cross-validation, the training data is randomly partitioned into k equal sized sub-samples. Of the k sub-samples, a single sub-sample is retained as the validation data for testing the model, and the remaining k-1 sub-samples are used as training data. The cross-validation process is then repeated k times, with each of the k sub-samples used exactly once as the validation data. The k results can then be averaged to produce a single estimation. Note that as the process of choosing the validation sets is random, it can cause different models and model parameters (but very similar) in each execution and consequently slightly different prediction results.
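The k-fold procedure described above can be sketched in plain Python. This is a generic illustration, not the R routine used in the experiments (which the text does not name); the function name and the seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=None):
    """Yield (training, validation) index lists for k-fold cross-validation.

    Indices are shuffled once, dealt into k folds whose sizes differ by at
    most one, and each fold serves exactly once as the validation set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]


# Example: the 30 training months of an outdoor attraction, k = 10.
splits = list(k_fold_indices(30, 10, seed=0))
```

Changing the seed changes the partition, which is the source of the small run-to-run differences in the learned parameters mentioned above.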

In order to evaluate the accuracy of the prediction techniques, the Mean Absolute Percentage Error (MAPE) [36] is used, being defined as:

$$\mathrm{MAPE}(\%) = \frac{1}{M} \sum_{t=1}^{M} \left| \frac{y(t) - \hat{y}(t)}{y(t)} \right| \qquad (8)$$

where M is the number of forecasting periods, y(t) is the actual visitation count and ŷ(t) is the predicted visitation count, both for time window t. A lower MAPE(%) value indicates a smaller percentage of errors produced by the prediction model. One commonly used interpretation of MAPE(%) values was suggested by [36] as follows: less than 10% is highly accurate forecasting, 10%-25% is good forecasting, 25%-50% is reasonable forecasting, and 50% or more is inaccurate forecasting.
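The MAPE computation and its interpretation bands can be sketched as follows. Two points are assumptions made for this illustration: the mean in Eq 8 is multiplied by 100 so that the value is on the percentage scale used by the bands, and each band includes its lower bound. The function names and the toy values are hypothetical.

```python
def mape_percent(actual, predicted):
    """Mean absolute percentage error over M periods, on a 0-100 scale."""
    m = len(actual)
    total = sum(abs((y - y_hat) / y) for y, y_hat in zip(actual, predicted))
    return 100.0 * total / m  # factor 100 is an assumption about the scale


def interpret_mape(value):
    """Map a MAPE(%) value to the bands attributed to [36] in the text."""
    if value < 10:
        return "highly accurate"
    if value < 25:
        return "good"
    if value < 50:
        return "reasonable"
    return "inaccurate"


# Toy 4-month test window (illustrative values, not the paper's data).
toy_actual = [100, 200, 400, 500]
toy_predicted = [110, 180, 400, 450]
toy_mape = mape_percent(toy_actual, toy_predicted)
toy_band = interpret_mape(toy_mape)
```

Since y(t) appears in the denominator, the measure is undefined for any month with zero recorded visits, which is worth checking before applying it to a new attraction.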

2.6 Factorial design of tourism key requirements

To further investigate how recency, seasonality and model specialization requirements impact the prediction accuracy of different techniques, a factorial design analysis is performed over the correspondent features of each requirement to quantify the relative importance of each individual feature as well as their interactions on prediction accuracy.

A 2^k experimental design technique was employed, since we are interested in determining the effect of k factors, each of which has two alternatives or levels. Such a design can be analyzed using a regression model. To compute the main effect of a given factor xi, the average response of all experimental runs for which xi was at its low (False) level is subtracted from the average response of all experimental runs for which xi was at its high (True) level [15]. The importance of a factor is measured by the proportion of the total variation in the response variable that is explained by this factor.

Specifically, a 2^k factorial design was employed with k = 3 factors (i.e., recency, seasonality and model specialization), each one with two levels (true or false). This design allows us to estimate the relative importance of each factor as well as of all factor interactions on the response variable. This importance is estimated by the fraction of the total variation observed in the response that can be explained by each factor (or factor interaction). In the following, we define the considered factors and factor levels. Note that a 2^k factorial analysis is performed for each type of attraction (indoors and outdoors):

  • Recency factor (R): two levels are defined: (1) True, if the visit counts in the previous last 4 months (y-1, y-2, y-3, y-4) and their log values (log y-1, log y-2, log y-3, log y-4) are used for training the model and; (2) False, otherwise.

  • Seasonality factor (S): two levels are defined: (1) True, if the visit counts in last year (y-12, y-13, y-14, y-15) and their log values (log y-12, log y-13, log y-14, log y-15) are used for training the model and; (2) False, otherwise.

  • Model Specialization factor (M): two levels are defined: (1) True, if an individual model is trained for each indoor/outdoor venue and; (2) False, if a unique model is learned for all venues of each attraction class.
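The allocation of variation in a 2^k design can be sketched with the usual sign-table computation. The example is fed with the single-run percentages of museums with MAPE < 25 reported in Tables 6 and 14, so it is an unreplicated illustration: the paper's analysis uses 5 replications per configuration, and the shares computed here are therefore not expected to reproduce Table 8. The function and variable names are hypothetical.

```python
from itertools import combinations, product
from math import prod


def factorial_2k_variation(factors, responses):
    """Fraction of total variation explained by each factor and interaction.

    responses maps a tuple of booleans (one level per factor, in the order
    of `factors`) to the observed response of that configuration.
    """
    k = len(factors)
    runs = list(product((False, True), repeat=k))
    mean = sum(responses[run] for run in runs) / len(runs)
    total_ss = sum((responses[run] - mean) ** 2 for run in runs)
    explained = {}
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            # Effect q = (1 / 2^k) * sum(sign * y), where sign is the
            # product of the +1/-1 levels of the factors in the subset.
            q = sum(
                prod(1 if run[i] else -1 for i in subset) * responses[run]
                for run in runs
            ) / len(runs)
            name = ", ".join(factors[i] for i in subset)
            explained[name] = len(runs) * q ** 2 / total_ss
    return explained


# Keys are (recency, seasonality, model specialization) levels; values are
# the % of museums with MAPE < 25 taken from Tables 14 (global) and 6.
museums_mape_lt_25 = {
    (False, False, False): 11.11,
    (True, False, False): 48.15,
    (False, True, False): 77.78,
    (True, True, False): 74.07,
    (False, False, True): 92.59,
    (True, False, True): 96.30,
    (False, True, True): 96.30,
    (True, True, True): 92.59,
}
shares = factorial_2k_variation(
    ("Recency", "Seasonality", "Model spec."), museums_mape_lt_25
)
```

Even in this simplified, unreplicated form, model specialization accounts for the largest share of the variation, in line with the ordering reported in the factorial analysis.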

3 Experimental results

Our analysis starts by introducing the baselines for both indoors and outdoors attractions. Then, we discuss the impact on the results of the incorporation of the two tourism requirements—seasonality and recency—as features into the current SOTA prediction model (from [5]). RQ1 is answered at the end of these two sections.

In Section 3.3, RQ2 is investigated by applying a factorial design analysis in order to evaluate the impact of each factor (requirements) as well as of their interactions on the prediction accuracy of the models. A finer evaluation of the impact of each of the features is provided by an analysis of the coefficients of those features in the final models (Section 3.4).

In order to answer RQ3, the impact on prediction accuracy of each of the recency and seasonality features in different scenarios of scarcity of historical data is analyzed (Section 3.5). Finally, in Section 3.6, some examples of attraction are presented for which only by exploiting the complete set of tourism requirements, good prediction results could be obtained.

3.1 Comparison among baselines

In our previous work [5], we provided evidence of the importance of considering model specialization as an explicit requirement for tourism prediction. Our proposed prediction model—a specialized SVR method adopting social media and climate data—outperformed alternative models in the literature. We also compared the prediction accuracy of the specialized SVR models trained separately for each attraction with that of a global SVR model trained with all attractions of each type—indoors and outdoors.

In this section, for the sake of self-containment, the main results from [5] are summarized and extended with a comparison against a method not exploited in that work: a Long Short-Term Memory (LSTM) neural network model, a popular method used here as a baseline. For the sake of completeness, our analysis also includes naive and classical models for recency and seasonality, also as baselines. The naive models are included since they directly reflect the behaviour of recency and seasonality features without any additional information in their predictions. Finally, considering the great popularity of neural network models in recent years, our analysis includes LSTM, one of the most successful neural network models in the domain of time-series and tourism prediction [31]. In the following, we elaborate on the baselines and their respective results.

3.1.1 Naive models

Two naive models are defined based on seasonality and on recency of tourism activities—Naive-Seasonality and Naive-Recency. For a naive prediction with seasonality, a simple approach to determine y(t) in a time window t, is to pick the number of visits at y(t−12) in the available past data. Similarly, for recency, y(t) is predicted based on the number of visits at y(t−1).

3.1.2 Classical Models

ARIMA models are one of the classical time-series prediction techniques widely used in the tourism prediction task. Accordingly, results of SARIMAX and SARIMA models are also reported where Seasonal-ARIMA (SARIMA) incorporates the known seasonality (periodicity) of the data into an ARIMA model, enhancing its predictive power. The SARIMAX baseline adds social media and climate features as the exogenous features. Both models were considered in [5], however for the sake of completeness, those results are replicated in this paper.

3.1.3 State-of-the-art (SOTA) model

In our previous work [5], we observed that training specialized models (spe.) for each attraction using SVR outperformed the case where a single global model (glo.) is trained for all attractions of each type. We report both results (SVR spe. and SVR glo.) alongside the other baselines.

3.1.4 Neural network model

A robust neural network model that is actively used in time-series prediction tasks, LSTM, is also exploited here. LSTM models are known for their memory-based architecture, which is capable of capturing past trends in the data.

Accuracy results for each of these models are presented in Tables 4 and 5 for indoor and outdoor attractions, respectively. Since our goal is to predict the number of visits with the best possible accuracy, we focus our attention on the cases where MAPE is lower than 25% (accurate predictions). In this scenario, the specialized SVR model with Environmental and Social features is the best model, predicting accurately for the highest percentage of attractions (almost 93% of Museums and 95% of Parks) and considerably outperforming the other models. However, the best results for MAPE below 10% (highly accurate results) are achieved by naive-recency (26%) for the indoor attractions and by naive-seasonality (42%) for the outdoor attractions. The success of the naive methods in the highly accurate range (MAPE < 10%) is one of the reasons that motivates the adoption of recency and seasonality in our specialized models. The success of naive seasonality in parks is associated with the seasonal-cyclic behavior of climate in this type of outdoor attraction, as seasonality has been considered one of the main phenomena affecting tourism, principally due to changes in the weather conditions [9]. Climate conditions are less important for indoor attractions such as museums and galleries. (Naive) Recency, in turn, proved to be a very good predictor for specific cases at indoor sites. Next, the performance of recency and seasonality is evaluated when explicitly incorporated as features into the specialized SVR model.

Table 4. Baseline results for 27 museums in U.K. (indoors).

The values in the table represent the percentage of attractions with MAPE in each specified range.

Museums
MAPE | naive recency | naive seasonality | SARIMAX | SARIMA | LSTM | SVR spe. | SVR glo.
MAPE<10 | 25.93% | 18.51% | 11.11% | 11.11% | 11.11% | 14.81% | 3.7%
MAPE<25 | 70.37% | 81.48% | 85.19% | 74.07% | 74.07% | 92.59% | 11.11%

Table 5. Baseline results for 76 national parks in U.S. (outdoors).

The values in the table represent the percentage of attractions with MAPE in each specified range.

Parks
MAPE | naive recency | naive seasonality | SARIMAX | SARIMA | LSTM | SVR spe. | SVR glo.
MAPE<10 | 6.58% | 42.11% | 13.16% | 7.89% | 23.68% | 22.37% | 5.26%
MAPE<25 | 56.58% | 82.89% | 69.74% | 39.47% | 84.21% | 94.74% | 18.42%

3.1.5 Computational complexity and execution time

The time complexity of Support Vector Regression (SVR) is O(N^3) and its space complexity is O(N^2), where N is the number of training points [37]. In our experiments, since few training points are used (N = 30 for outdoor attractions and N = 76 for indoor attractions) and the model is trained once a month, the execution time of the prediction task is not a major concern.

The computational complexity of the other algorithms is as follows. For SARIMA models, the complexity is in the order of O(n). Neural networks take more time due to the numerous iterations of forward and back-propagation; back-propagation, in the order of O(n^5), is much slower than forward propagation, in the order of O(n^4). Finally, for the naive models the complexity is O(1), since they only pick the value at the defined index of the historical data and return it as the prediction.

All in all, for specialized SVR models, the mean execution time is about 45 seconds (min exec. time: 21 seconds and max exec. time: 92 seconds), while for generalized models with different sets of features the mean execution time was around 4 hours (min exec. time: 1 hour and 33 minutes and max exec. time: 20 hours). After the training step, the prediction phase is quite fast: 4 seconds on average for all 100 attractions, regardless of whether global or specialized models are used. The machine used in our experiments was a desktop PC with 4 CPUs and 16 GB of RAM, and the experiments were run in the R programming language.

3.2 Augmentation with RECENCY and SEASONALITY

Our focus now shifts to demonstrating how adding the other two tourism requirements, i.e. recency and seasonality, to the state-of-the-art specialized SVR models with Social and Environmental features, hereafter called SpecES (Specialized with Environmental and Social), can improve the accuracy of forecasting the visitations for fine-grained touristic points. Results of adding recency and seasonality to global models can be found in the Appendix A. Training specialized models for each individual attraction allows the models to learn specific patterns of visitation at each touristic point. Tables 6 and 7 show the prediction performance of the models when model specialization is combined with seasonality and/or recency. As previously discussed, the specialized models without the new features (SpecES) already perform well (MAPE < 25%) for over 92% of museums and almost 95% of parks (column SpecES in Tables 6 and 7 repeats column SVR spe. in Tables 4 and 5 to facilitate comparison).

Table 6. SpecES prediction results augmented with the other two tourism requirements—seasonality and/or recency, trained for each of the 27 museums in U.K. (indoors).

The best prediction models are in bold face.

Museums
MAPE | SpecES | SpecES+recency | SpecES+seasonality | SpecES+recency+seasonality
MAPE<10 | 14.81% | 29.63% | 48.15% | 48.15%
MAPE<25 | 92.59% | 96.30% | 96.30% | 92.59%

Table 7. SpecES prediction results augmented with the other two tourism requirements—Seasonality and/or recency trained for each of the 76 national parks in U.S. (outdoors).

The best prediction models are in bold face.

Parks
MAPE | SpecES | SpecES+recency | SpecES+seasonality | SpecES+recency+seasonality
MAPE<10 | 22.37% | 40.79% | 48.68% | 50.00%
MAPE<25 | 94.74% | 94.74% | 96.05% | 96.05%

Considering the results in Tables 6 and 7, it can be noted that the combination of only seasonality and model specialization for museums (fourth column in Table 6) results in a slightly higher accuracy (96% for MAPE < 25 and 48% for MAPE < 10) than when all features are used. For parks, instead, the combination of all tourism requirements (fifth column in Table 7) performs the best (96% for MAPE < 25 and 50% for MAPE < 10). This aspect will be further analyzed in the next section when we perform a factorial analysis over the tourism requirements.

Regarding the high accuracy cases (MAPE < 10), recall that the combination of SpecES with the other two key tourism requirements, recency and seasonality, produced the best overall results. In more detail, for the indoor attractions (Table 6), SpecES+recency+seasonality produced highly accurate predictions for about 48% of the museums, compared to about 26% obtained by naive-recency (Table 4), the best baseline in this category. A similar behavior is seen for the outdoor attractions: comparing the results in Tables 5 and 7 for MAPE < 10, SpecES+recency+seasonality yields highly accurate predictions for 50% of the parks, compared to around 42% using naive-seasonality.

3.3 Factorial analysis

This section investigates the impact of each of tourism prediction requirements, i.e. recency, seasonality and model specialization by means of a factorial design analysis. Factorial design techniques help to analyze the effect of each factor (requirement) as well as the effects of their interactions on the tourism demand (visits count) in each touristic attraction.

We employ a regression analysis to evaluate the amount of variation in the prediction results that can be explained by each factor (and interaction). A 2^k r experimental design technique was adopted to estimate the effect of k = 3 factors (recency, seasonality and model specialization), each of which has two levels (the requirement is or is not incorporated into the model for the prediction task), with r replications per configuration. As reported in Section 2.5, applying cross-validation along with the SVR model produces small variations in prediction results due to the stochastic nature of the task. In order to reduce this variation and increase the accuracy of the results, each experiment was executed several times to calculate the average and standard deviation of the results. The adequate number of runs, estimated for a 95% confidence level and an accepted error percentage of 2%, was 5. In our factorial analysis, the response variable is the % of attractions (indoors/outdoors) that fall in each MAPE range. The goal is to estimate the importance of each factor (interaction) on the variation observed in those % of touristic attractions. When all three requirements are turned off, the global SVR model (non-specialized model trained for all attractions of each type, indoors and outdoors) is used with only the Environmental and Social media features, i.e. absence of all three factors. Results of adding recency and seasonality to global models can be found in the Appendix B in Tables 14 and 15.

Table 14. Prediction results adopting seasonality and/or recency tourism requirements for a global model trained with 27 museums in U.K. (indoors).

The best prediction models are in bold face.

Museums (Global Model)
MAPE | global mdl. | global mdl.+recency | global mdl.+seasonality | global mdl.+recency+seasonality
MAPE<10 | 3.7% | 7.41% | 25.93% | 7.41%
MAPE<25 | 11.11% | 48.15% | 77.78% | 74.07%

Table 15. Prediction results adopting seasonality and/or recency tourism requirements for a global model trained with 76 national parks in U.S. (outdoors).

The best prediction models are in bold face.

Parks (Global Model)
MAPE | global mdl. | global mdl.+recency | global mdl.+seasonality | global mdl.+recency+seasonality
MAPE<10 | 5.26% | 3.95% | 30.26% | 30.26%
MAPE<25 | 18.42% | 46.05% | 81.58% | 86.84%

Table 8 shows the variation explained by each tourism requirement on the prediction results in each category of attractions. It can be observed that, for both indoor and outdoor attractions, model specialization and then seasonality have the largest contributions. In the case of MAPE < 25, there is also a significant contribution of the interaction between these two factors, seasonality and model specialization (24.4% for Parks and 13.3% in the case of Museums). The analysis of the statistical significance of our numerical results is presented in detail in the Appendix B in Tables 16 and 17.

Table 8. Contribution of each of tourism prediction requirements: Recency, seasonality, model specialization and their interactions into the response variable in each category of attractions: Parks and museums; results for MAPE < 10 and MAPE < 25 in 5 runs.

The contributions higher than 5% are in bold face. The analysis of the statistical significance of our numerical results is presented in detail in the Appendix B in Tables 16 and 17.

Requirements contribution (%)
Requirement | MAPE < 25, Museums | MAPE < 25, Parks | MAPE < 10, Museums | MAPE < 10, Parks
Recency | 1.0 | 1.1 | 0 | 1.1
Seasonality | 21.1 | 24.6 | 19.8 | 32.6
Model spec. | 55.9 | 46.0 | 70.0 | 58.5
Recency, Seasonality | 5.5 | 0.7 | 1.8 | 2.0
Recency, Model spec. | 0.1 | 1.2 | 1.3 | 2.0
Seasonality, Model spec. | 13.3 | 24.4 | 2.9 | 1.8
Recency, Seasonality, Model spec. | 0.7 | 1.8 | 1.6 | 0.9
Residuals | 2 | 0.2 | 3 | 1

In addition, it can be observed that model specialization, in relative terms, is more important to the variation observed for MAPE < 10 than for the results for MAPE < 25 (explains 70% versus 56% of result variation for Museums and 59% versus 46% for Parks). One may say that if highly accurate prediction results (MAPE < 10%) are needed, the use of specialized models becomes even more important.

Table 8 also shows that the impact of recency and its interactions with other factors on the prediction results is almost negligible. Despite that, recency can improve results (see, for instance, the second and third columns in Tables 6 and 7), indicating that it should be used, mainly if the seasonality features are not available.

Seasonality alone has more than 20% of contribution in both parks and museums, for MAPE < 25. This indicates that when only the historical data for an attraction is present, significant improvements in accuracy can be obtained by injecting seasonality features into the model as input variables. It can also be observed that seasonality has even a higher impact (32.6% versus 19.8%) in outdoor attractions for very accurate prediction results (MAPE < 10). This is in alignment with what was discovered in [5] when we showed that in outdoor attractions the impact of climate features is much higher than in indoor attractions, considering that the climate features have a high correlation with seasonality [10].

3.4 A drill down analysis of encapsulated features in recency and seasonality factors

In the previous section, the impact of each of the tourism prediction requirements was quantified. In the following, we delve further into the role that each of the recency and seasonality features (introduced in Section 2) plays in the accuracy of the prediction task. We do so by analyzing the learned coefficients of the global models in the indoor and outdoor scenarios. In other words, the global models are used only as an analytical tool. This option was chosen to avoid the complexity of analysing all the 103 models produced with specialization (one for each attraction).

As can be seen in Tables 14 (indoors) and 15 (outdoors), placed in Appendix A for the sake of space and flow of discourse, the incorporation of the recency and seasonality features into the global models has an impact similar to that observed for the specialized models: results improve significantly over the case in which such features are not used, for both MAPE < 10 and MAPE < 25, although they are not as good as those of the specialized models.

Tables 9 and 10 show the learned coefficients of global models in indoor and outdoor scenarios, respectively. In more detail, for this analysis, global models were built for all attractions of each type, adopting each time a different feature-set: (I) soc + env: global model trained with only social media and environmental features in the feature-set; (II) recency (soc + env + rec): global model with recency features in addition to the social media and environmental features; (III) seasonality (soc + env + seas): global model with seasonality features in addition to the social media and environmental features; and (IV) seasonality+recency (soc + env + rec + seas): global model with all features, including social media, environmental, recency and seasonality features.
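The four feature-sets can be sketched as follows. This is a minimal illustration, assuming that visits are stored as a monthly list, that recency corresponds to lags 1 to 4 and seasonality to lags 12 to 15 (as suggested by the feature names in Tables 9 and 10), and that the "log" features are natural logarithms of the lagged visits. The helper names lag_features and build_row are hypothetical.

```python
import math

RECENCY_LAGS = (1, 2, 3, 4)          # y-1 ... y-4
SEASONALITY_LAGS = (12, 13, 14, 15)  # y-12 ... y-15


def lag_features(visits, t, lags):
    """Lagged visits (and their logs) for the month at index t."""
    feats = {}
    for lag in lags:
        if t - lag < 0:
            raise ValueError(f"not enough history for lag {lag} at index {t}")
        value = visits[t - lag]
        feats[f"y-{lag}"] = value
        # Assumption: visits are strictly positive, so the log is defined.
        feats[f"log y-{lag}"] = math.log(value)
    return feats


def build_row(visits, base_features, t, use_recency, use_seasonality):
    """Assemble one input row for feature-sets (I) to (IV)."""
    row = dict(base_features)  # social media + environmental features
    if use_recency:
        row.update(lag_features(visits, t, RECENCY_LAGS))
    if use_seasonality:
        row.update(lag_features(visits, t, SEASONALITY_LAGS))
    return row


# Hypothetical monthly visits and base features for one attraction.
visits = [100 + i for i in range(20)]
base = {"revs": 35, "tavg": 11.2}
row_i = build_row(visits, base, 15, use_recency=False, use_seasonality=False)
row_iv = build_row(visits, base, 15, use_recency=True, use_seasonality=True)
print(len(row_i), len(row_iv))
```

Feature-sets (II) and (III) are obtained by switching on only one of the two flags.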

Table 9. The coefficients of features of global (single) model for all 27 U.K museums adopting each time a different set of features: (I) only social media and environmental features (soc+env), (II) social media, environmental and recency features (soc+env+rec), (III) social media, environmental and seasonality features (soc+env+seas), (IV) complete feature set: Social media, environmental, seasonality and recency feature (soc+env+rec+seas).

The bold face shows the top 2 features in each column.

Features soc+env soc+env+rec soc+env+seas soc+env+rec+seas
tmin -0.093 0.025 -0.004 -0.031
tavg 0.024 0.005 -0.001 0.001
tmax 0.116 -0.011 0.002 0.026
air_frost_days 0.004 0.005 0.000 0.008
rain -0.022 0.006 0.003 0.018
sunny_hr -0.037 0.020 0.004 0.022
revs 0.517 0.010 0.004 0.002
rating -0.051 -0.001 0.000 -0.004
month -0.007 -0.060 -0.002 -0.026
y-1 - 0.511 - 0.407
y-2 - 0.258 - 0.158
y-3 - 0.156 - 0.033
y-4 - 0.054 - 0.026
log y-1 - 0.025 - 0.089
log y-2 - -0.003 - -0.033
log y-3 - -0.032 - -0.015
log y-4 - 0.003 - -0.017
y-12 - - 0.764 0.658
y-13 - - 0.038 -0.291
y-14 - - 0.084 -0.051
y-15 - - 0.089 0.034
log y-12 - - 0.011 -0.028
log y-13 - - -0.002 -0.027
log y-14 - - -0.003 0.036
log y-15 - - -0.007 0.003

Table 10. The coefficients of features of global (single) model for all 76 U.S. National Parks adopting each time a different set of features: (I) only social media and environmental features (soc+env), (II) social media, environmental and recency features (soc+env+rec), (III) social media, environmental and seasonality features (soc+env+seas), (IV) complete feature set: Social media, environmental, seasonality and recency feature (soc+env+rec+seas).

The bold face shows the top 2 features in each column.

Features soc+env soc+env+rec soc+env+seas soc+env+rec+seas
tmin 0.006 -0.290 -0.001 -0.011
tavg 0.021 0.580 0.002 0.022
tmax 0.002 -0.278 0.000 -0.010
temp_dif -0.015 0.003 0.003 0.003
pcp(rain) 0.000 -0.002 -0.003 -0.001
revs 0.278 0.019 0.001 0.000
rating -0.007 0.005 0.009 0.005
month -0.007 -0.038 -0.001 -0.005
y-1 - 1.218 - 0.340
y-2 - -0.297 - 0.014
y-3 - -0.070 - 0.018
y-4 - 0.066 - 0.014
log y-1 - -0.007 - 0.034
log y-2 - 0.023 - 0.004
log y-3 - -0.023 - -0.013
log y-4 - 0.001 - -0.001
y-12 - - 0.947 0.928
y-13 - - 0.026 -0.262
y-14 - - 0.031 -0.018
y-15 - - -0.010 -0.038
log y-12 - - 0.009 -0.013
log y-13 - - -0.003 -0.024
log y-14 - - -0.012 -0.005
log y-15 - - 0.003 0.013

Regarding indoor attractions (Table 9), the learned coefficients indicate the high importance of the number of reviews, followed by the maximum temperature, in the simplest model. In the recency model (soc + env + rec), higher weights are instead given to the number of visits in the last two months (y-1 and y-2 features). The number of visits in the same month of the previous year (y-12) and 15 months earlier (y-15) are more relevant when seasonality is incorporated into the model (soc + env + seas). Finally, visits in the last year and in the last month (y-12 and y-1) contribute most to the accuracy of the complete model (soc + env + rec + seas). Interestingly, visits in the same period of the last year (y-12) have a larger weight than visits in the last month (y-1), which is aligned with what was observed in the factorial analysis of the impact of tourism prediction requirements: seasonal features contribute more to the model than recency ones.

A similar pattern can be seen in the learned coefficients of outdoor attractions (Table 10). The model has larger weights for average temperature and number of reviews in the simple model; y-1 and average temperature in the recency model; y-12 and y-14 in the seasonality model; and finally y-12 and y-1 in the complete feature-set model, which is consistent with our previous discussions in the factorial design analysis.
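The ranking of features discussed above can be reproduced from the published coefficients. The sketch below uses the complete feature-set column (soc+env+rec+seas) of Table 9 and ranks features by absolute coefficient value; ranking by absolute value is our assumption for this illustration, and the helper name top_k is hypothetical.

```python
# Coefficients of the complete global model for museums
# (Table 9, column soc+env+rec+seas).
table9_complete = {
    "tmin": -0.031, "tavg": 0.001, "tmax": 0.026, "air_frost_days": 0.008,
    "rain": 0.018, "sunny_hr": 0.022, "revs": 0.002, "rating": -0.004,
    "month": -0.026,
    "y-1": 0.407, "y-2": 0.158, "y-3": 0.033, "y-4": 0.026,
    "log y-1": 0.089, "log y-2": -0.033, "log y-3": -0.015, "log y-4": -0.017,
    "y-12": 0.658, "y-13": -0.291, "y-14": -0.051, "y-15": 0.034,
    "log y-12": -0.028, "log y-13": -0.027, "log y-14": 0.036,
    "log y-15": 0.003,
}


def top_k(coefficients, k):
    """Names of the k features with the largest absolute coefficient."""
    ranked = sorted(coefficients, key=lambda name: abs(coefficients[name]),
                    reverse=True)
    return ranked[:k]


print(top_k(table9_complete, 2))
```

With these values, the two highest-ranked features are y-12 and y-1, in that order, matching the discussion of the complete model.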

3.5 Impact of historical data scarcity on the prediction task

As discussed in the previous sections, learning specialized models trained with complete social, environmental, recency, and seasonality information considerably improves the accuracy of the prediction models. However, having full information regarding recency and seasonality is not always guaranteed. In the following, we further investigate the individual impact of recency and seasonality on the prediction task in scenarios without full availability of historical information on (number of) visits, social media and environmental data for touristic attractions. For these analyses, we revisit the prediction architecture and redefine the training and test sets when necessary.

3.5.1 Only recency—scarcity in seasonal data

In scenarios in which there is not enough historical information for an attraction, i.e., there is only very recent data on visits, social media and environmental data of a touristic place, the model can exploit recency features in order to improve the prediction of the future visitation. This situation may occur, for instance, for new attractions or attractions that have only started to collect (visitation) data very recently. To simulate this scenario in our datasets, the model only uses the last four (4) months of the historical data of each attraction to train each prediction model while filtering out the rest of the data. Fig 4 presents our revised prediction architecture to deal with this new prediction scenario.

Fig 4. Tourism demand prediction methodology in scarcity of seasonal data adopting social media, environmental and recency features.

Since the features of the last 12 months are not available to evaluate the prediction model, we adopted two different scenarios for defining the input value of each feature in the test-set: (i) last month case, in which the previous month’s information is used as the input of the model; and (ii) mean of 4-months case, in which the mean of each feature over the train-set is used as the input feature value of the models. Tables 11 and 12 (two leftmost columns) show the results. The percentage of parks with an accurate prediction (MAPE<10) is quite low in both cases (about 2%), while the percentages are somewhat higher (≈ 15%) for museums. Regarding good predictions (MAPE<25), using the last month as the input yields the same results as using the mean of the 4-months features for museums (37% in both cases), whereas for parks it performs slightly better (25% versus 21%).
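The two test-input cases and the error metric can be sketched as follows. This is an illustration only: MAPE is written in its standard form (mean absolute percentage error), each training row is assumed to be a dictionary of feature values for one month, and the function names are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def last_month_input(train_rows):
    """Case (i): reuse the feature values of the most recent month."""
    return dict(train_rows[-1])


def mean_of_months_input(train_rows):
    """Case (ii): average each feature over the available training months."""
    return {
        name: sum(row[name] for row in train_rows) / len(train_rows)
        for name in train_rows[0]
    }


# Hypothetical 4-month training window for one attraction.
train_rows = [
    {"revs": 10, "tavg": 5.0},
    {"revs": 20, "tavg": 7.0},
    {"revs": 30, "tavg": 9.0},
    {"revs": 40, "tavg": 11.0},
]
print(last_month_input(train_rows))
print(mean_of_months_input(train_rows))
print(mape([100, 200], [110, 180]))
```

Either dictionary would then be passed to the trained model as the feature vector for the month to be predicted.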

Table 11. Scarcity in seasonal historical and recent data: evaluation of performance of recency and seasonality features in 27 museums in U.K.

MAPE | Only Recency: last month case | Only Recency: mean of 4-months case | Only Seasonality: unavailable last 4 months | Only Seasonality: unavailable last year
MAPE<10 | 14.81% | 14.81% | 44.44% | 29.63%
MAPE<25 | 37.00% | 37.00% | 81.48% | 74.00%

Table 12. Scarcity in seasonal historical and recent data: evaluation of performance of recency and seasonality features in 76 national parks in U.S.

MAPE | Only Recency: last month case | Only Recency: mean of 4-months case | Only Seasonality: unavailable last 4 months | Only Seasonality: unavailable last year
MAPE<10 | 1.32% | 2.63% | 43.00% | 0.00%
MAPE<25 | 25.00% | 21.00% | 85.00% | 21.00%

3.5.2 Only seasonality—scarcity in recent data

As with the recency features, we analyze the performance of seasonality features when the most recent data is not available. This may happen when data collection is periodical (or seasonal), spans long periods, and the most recent data is not yet available for prediction. In this scenario, seasonality features (i.e., number of visits, social media and environmental data from previous years) can be exploited to predict future visitation, if this information is available. Fig 5 shows our revised prediction architecture to deal with this prediction scenario.

Fig 5. Tourism demand prediction methodology in scarcity of recent data adopting social media, environmental and seasonality features.

For this, the most recent historical data of each attraction are withheld and only the remaining historical data are used for training the prediction model. To construct the training-set, two cases of unavailable historical data are defined: (i) the last 4 months of each feature are unavailable; (ii) the last 12 months (last year) of each feature are unavailable. The first case corresponds to the situation in which the recency lags (y-1, y-2, y-3, y-4) are not available, while in the second case one complete cycle of historical data (annual seasonality) is missing [11].

Tables 11 and 12 (two rightmost columns) present the results for indoor and outdoor attractions. The percentage of museums with an accurate prediction (MAPE<10) is much higher in the first case, when only the last 4 months of the historical data are unavailable, than in the case when the complete historical data of the last year is missing (44% versus 30%). The difference is even more dramatic for parks (43% versus 0%). This behaviour is similar for good predictions (MAPE<25) for both indoor (81% versus 74%) and outdoor (85% versus 21%) attractions. These results again suggest the importance of the historical data. In other words, having the latest trends of visitation in addition to the periodical/historical behaviors is essential for an accurate prediction.
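The percentages in Tables 11 and 12 are shares of attractions whose MAPE falls below a threshold. A minimal sketch of that computation is given below. The per-attraction MAPE values are hypothetical, and the counts (12 and 22 out of 27 museums) are inferred here from the reported 44.44% and 81.48%, not taken from the per-attraction results.

```python
def share_below(mapes, threshold):
    """Percentage of attractions whose MAPE is strictly below the threshold."""
    return 100.0 * sum(1 for m in mapes if m < threshold) / len(mapes)


# Hypothetical per-museum MAPE values: 12 accurate, 10 more that are good,
# and 5 poor predictions, for 27 museums in total.
museum_mapes = [5.0] * 12 + [20.0] * 10 + [40.0] * 5

print(round(share_below(museum_mapes, 10), 2))  # share with MAPE < 10
print(round(share_below(museum_mapes, 25), 2))  # share with MAPE < 25
```

With 27 attractions, each museum accounts for about 3.7 percentage points, which is why the indoor figures move in relatively coarse steps.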

3.6 Improving accuracy of difficult cases by incorporating explicit tourism prediction requirements

In our previous work [5], we identified a small set of indoor and outdoor tourism attractions for which the best prediction models performed poorly. In this section, we evaluate whether incorporating seasonality and recency features into the specialized models for these attractions can help to mitigate these problems. In particular, we focus on two attractions: the National Portrait Gallery in the U.K. and Bryce Canyon National Park in the U.S.

In the case of the National Portrait Gallery, social media reviews showed an atypical major increase by April 2015, while the number of visits gradually decreased (Fig 6). This atypical behaviour could be explained by the annual report published by the National Portrait Gallery (available at https://www.npg.org.uk/assets/files/pdf/accounts/npgaccounts2015—16.pdf), which states that the virtual audience grew at a national and international level during 2015/16, with an increased number of people accessing exhibitions, displays and the collection online through the gallery’s website. As a result, more social media activity but fewer on-site visits were observed. The incorporation of the recency and seasonality features helped to detect this behavior change and consequently improved the model accuracy (a reduction in MAPE from 137.90% to 13.34%, Table 13).

Fig 6. Temporal evolution of number of visits and social media comments in National Portrait Gallery in U.K.: far more accurate predictions using the specialized model with all recency and seasonality features.

Table 13. Accuracy of difficult cases incorporating explicit tourism prediction requirements in indoor and outdoor attractions.

Attraction MAPE-SOTA results (from [5]) MAPE—our results
U.K. National Portrait Gallery (indoor) 137.90% 13.34%
U.S. Bryce Canyon National Park (outdoor) 35.19% 24.43%

Regarding Bryce Canyon National Park, the difficulty was that the considerable increase in the number of visits (more than 20% from February to September 2016 in comparison with the same period in 2015) was not accompanied by a corresponding increase in social media reviews, which followed the same behavior as in previous years, with a slight decrease in May 2016 compared to May 2015 (Fig 7). A possible reason was the waiving of entrance fees in 2016. Again, by explicitly exploiting seasonality and recency, such anomalies can be captured, reducing the MAPE of the models from 35.19% to 24.43% (Table 13).
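For reference, the relative error reductions implied by Table 13 can be computed directly from the reported MAPE values. The helper name relative_reduction is introduced here only for illustration.

```python
def relative_reduction(old_error, new_error):
    """Relative reduction of an error measure, in percent."""
    return 100.0 * (old_error - new_error) / old_error


# MAPE values from Table 13: (SOTA result from [5], our result).
table13 = {
    "U.K. National Portrait Gallery (indoor)": (137.90, 13.34),
    "U.S. Bryce Canyon National Park (outdoor)": (35.19, 24.43),
}

for attraction, (sota, ours) in table13.items():
    print(f"{attraction}: {relative_reduction(sota, ours):.2f}% lower MAPE")
```

This gives a reduction of roughly 90% for the National Portrait Gallery and roughly 31% for Bryce Canyon.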

Fig 7. Temporal evolution of number of visits and social media comments in Bryce Canyon National Park in U.S.—significant gain using specialized model with all recency and seasonality features.

4 Discussion

4.1 Answering the posed research questions

This section discusses the answers to the Research Questions posed in Section 1.1, based on the experimental results presented in Section 3.

RQ1 focused on analyzing whether key tourism prediction requirements could influence the prediction accuracy of fine-grained tourists’ visits. Our experimental evaluation confirmed a positive answer to this question, corroborating our initial hypotheses. Significant gains have been observed over the previous state-of-the-art (SOTA) results [5], by explicitly incorporating the three requirements into the models.

In general, our analyses demonstrated a more prominent role of the seasonality features in the case of outdoor attractions (national parks). This can be explained by the seasonal behaviour of the weather, which affects visitors’ decisions to visit an outdoor attraction. The models also gave significant importance to recency features, especially in the case of indoor attractions (national museums). This can be explained by the smaller importance of weather conditions for indoor attractions and the relatively significant impact of recent events, such as festivals or pandemics, on this type of attraction.

The impact of incorporating recency and seasonality as explicit first-class features into specialized models is perhaps best demonstrated by the capacity of the new models to deal with cases that the SOTA could not previously solve. The National Portrait Gallery in the U.K., for instance, had a huge increase in social media reviews that was not accompanied by real-world visits, causing the SOTA models to mistakenly follow the social patterns and consequently resulting in low accuracy. The explicit incorporation of recency features helped the models to capture recent trends in visitation, diminishing the importance of the tendency observed in social media.

Another example is Bryce Canyon National Park in the U.S., where visits experienced a period of atypical increases. That increase could be explained neither by the environmental features nor by the social media reviews, the sole inputs exploited by the SOTA model. In this example, the explicit incorporation of seasonality features helped the models to capture variations in the cyclic behaviour of tourists, enhancing the prediction accuracy for this attraction.

RQ2 aimed at quantifying the combined and isolated influence of each of the key tourism prediction requirements on prediction accuracy. In general, we observed that our specialized models with recency and seasonality features gave much higher importance to these features than to the social or environmental features exploited by the SOTA models. Intuitively, recency features capture the impact of recent events that may deviate from the historical patterns. We have indeed observed that the incorporation of recent trends allowed small adjustments of the model, helping to avoid sharp drops in effectiveness. Let us take again the hard-to-solve case of Bryce Canyon, an outdoor national park in the U.S., to illustrate our argument. An evaluation of the feature weights of the proposed model for this attraction showed relatively higher weights for recency features (e.g., weight 0.74 for feature y-1: last month’s number of visits) than for environmental and social features (e.g., weight 0.018 for feature tmax: maximum temperature, and weight 0.002 for feature revs: number of reviews). This resulted in a quick response to the deviations in visitation in the testing period, helping to produce a 30% reduction in prediction error compared to the SOTA model.

Similarly, seasonality features capture the inherently cyclic behaviour of tourism demand. To exemplify, the explicit use of seasonality features helped to obtain more accurate predictions in the hard-to-solve case of the National Portrait Gallery, U.K. (indoor). The specialized model trained for this attraction could learn the seasonal cyclic behaviour of visitation over many years of historical data, without being misled by the increase of visitations in a short period of time. In a drill-down analysis of feature importance, a relatively higher weight was given to the seasonal feature y-12: last year’s number of visits (weight of 0.63) than to the social feature revs: number of reviews (weight of 0.46), which had been very important for the global model. As a consequence, our model could correct for the gradual dissociation between social media review counts and number of visits, resulting in a model accuracy 9 times better than the SOTA model in the period of evaluation.

Finally, model specialization allowed us to capture very specific idiosyncratic patterns of visitation of individual parks that global models could not. For several attractions, such as Aztec Ruins national park, the specialized model obtained much more accurate results (up to 8 times better than the global model). In these cases, particular recent and seasonal visitation behaviors, alongside other social media and environmental features, were better captured by the specialized model, which was not confounded by the general/global visitation patterns of other attractions. In other words, the feature weights assigned to each individual attraction, in our example Aztec Ruins national park, were better adjusted to the time series of visitation for that particular attraction.

An interesting general pattern that deserves attention is the larger improvement for the high-accuracy cases (MAPE < 10), mostly because the easier cases were already solvable by the SOTA models, as explained above. Specialization, recency and seasonality stand out, as discussed above, for the hard cases. But there is still a lot of room for improvement, as we are still at a rate of 50% for these very accurate predictions.

To answer RQ3, we explored how scarcity in historical data—recent and seasonal features—impacted the prediction accuracy of models. This analysis aimed to simulate real world scenarios with lack of official census for some periods of time due to intermittent survey of visitation counts in touristic attractions. We observed that absence of recency or seasonality features drastically reduces the accuracy of prediction models. Although scarcity of recent data did not jeopardize the results as much as the absence of seasonality data did, recency features still had a significant impact on prediction accuracy, mainly for situations in which there was not enough historical data to capture seasonality for a given attraction.

This behaviour can be justified since for most touristic attractions, the seasonal touristic activities happen repetitively and repetition helps the learned model to emphasize certain behaviors as captured by the features. The lack of such seasonal data, consequently, may significantly reduce accuracy.

On the other hand, unusual or unexpected deviations from the historical touristic patterns because of recent events are rare and may not necessarily cause large declines in prediction accuracy. These findings are in line with the good performance of the naive seasonal models.
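For readers unfamiliar with the term, a naive seasonal model simply repeats the value observed one seasonal cycle earlier. The sketch below shows the textbook formulation for monthly data with a 12-month period; it is an illustration under that assumption, and the function name seasonal_naive and the sample series are hypothetical.

```python
def seasonal_naive(history, horizon, period=12):
    """Forecast each future month with the value from one period earlier.

    For horizons longer than one period, the last observed cycle is repeated.
    """
    if len(history) < period:
        raise ValueError("need at least one full seasonal cycle of history")
    last_cycle_start = len(history) - period
    return [history[last_cycle_start + (h % period)] for h in range(horizon)]


# Hypothetical two years of monthly visits (in thousands).
history = [float(v) for v in range(1, 25)]
print(seasonal_naive(history, 3))
```

The first three forecasts are the values observed in the corresponding months of the final year of the history.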

4.2 Limitations and practical applications

Our analyses, though rich, have limitations. The main one relates to the lack of official data for some attractions, which prevented us from testing our methodology on finer time-grained data (weekly or daily basis). Predictions were evaluated on a monthly basis. This granularity of time was selected because the official ground-truth data was available and aggregated at this level. However, some preliminary experiments suggest that the same methodology could be successfully applied at a finer granularity of time.

Regarding the practical application of our results in real-world scenarios, although exploiting data from social media is fascinating, especially recent and seasonal data, a critical question that will determine their utility for forecasting future visitation is: how well do they reflect on-the-ground visitor surveys and records? In our work, we showed that there is a strong relationship between the number of reviews and field-based visitation records for a large fraction of the attractions, particularly those that are outdoors. This may provide a powerful new tool for forecasting tourism demand, helping tourism accommodations to prepare even when there is no prior survey for their regions (or one is not even possible), using only freely available social media data complemented by environmental records. However, correlating environmental and social data, including recent and seasonal data, with official visits proved to be key to motivating the simplicity of our prediction model. Such an analysis needs to be performed at a much larger scale in the future to determine the real practical and economic benefits of the proposed techniques.

5 Conclusions

We have investigated the impact of exploiting recency and seasonality features alongside social media and environmental data to improve the performance of specialized prediction models for touristic attractions (indoor and outdoor). Our experiments showed that specialized SVR models that include all the tourism requirements, especially the explicit use of recency and seasonality features, outperform all the baselines, including state-of-the-art solutions [5]. Improvements were obtained in all scenarios, mainly for highly accurate predictions (MAPE < 10%), with gains of more than 300% over the previous solutions.

We have also analyzed the impact of each of the tourism prediction requirements individually, as well as their interactions, by applying a 2^k factorial design analysis. We quantified the contribution of each of the three tourism prediction factors (requirements) to the learned models, observing the higher impact of model specialization and seasonality features on model accuracy. But even the less impactful recency features can increase the accuracy of the models, mainly when there is no available seasonal data for a given attraction.

Furthermore, to gain a deeper understanding of the impact of the recency and seasonality aspects, we analyzed how scarcity in recent and seasonal historical data impacts the prediction accuracy of models. The general observation was that recent trends of visitation are essential to the accuracy of the models. Finally, we showed how the explicit incorporation of seasonality and recency features into the specialized models of indoor and outdoor attractions could improve the accuracy of tourism demand prediction for attractions for which the state-of-the-art models could not provide an accurate prediction.

5.1 Future work

In future work, we intend to continue improving accuracy, mainly of highly accurate predictions (MAPE < 10%), by evaluating the contents and sentiments of the reviews of each attraction. We intend to apply text analysis techniques such as Temporal Topic Modeling and Sentiment Analysis in order to extract useful information from visitors’ daily reviews and their possible visiting behaviour trends. Another possible research direction is to cluster attractions into a few groups in order to create specific prediction models for each cluster, making it simpler and more practical for business owners to use our solutions. This could bring many benefits, especially by producing robust forecasting models for touristic places with low availability of visitation census data, due to reasons such as the high cost of surveys or the difficulty of collecting data in remote places.

Appendices

This section includes the results of Global model application in the first part while the second part presents the statistical analysis of our factorial design experiments.

A. Global model (Model specialization = OFF)

The application of model specialization may be considerably jeopardized when there is not enough data to train individual models for each site. In this case, it is more viable to train and apply a single global model taking advantage of the complete social and environmental (training) data for multiple attractions. Here, we evaluate the predictive power of global models augmented with seasonality and recency features for each type of tourism attraction (indoors and outdoors). Tables 14 and 15 show these results. In the case of indoor attractions, the global model with only social and environmental features (global mdl.) achieves a good MAPE (MAPE < 25%) for only about 11% of museums. The scenario is similar for outdoor attractions, where only about 18% of parks have good prediction results using a global model.

It can also be observed in the Tables that introducing recency and seasonality as features into the global models significantly improves the accuracy of the prediction task. Global models produce good predictions (MAPE < 25) for about 74% of the museums and 87% of the parks. Those results, however, are worse than when specialization is applied (if data availability allows), mainly for highly accurate predictions (MAPE < 10). In any case, the good accuracy provided by the global models with recency and seasonality encourages their application in cases in which there is not enough training data for specific attractions.
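The difference between a global model and specialized models can be illustrated with a deliberately simplified sketch. Note the assumptions: ordinary least squares on a single feature (visits 12 months earlier, y-12) stands in for the SVR models used in our experiments, the two attractions and their values are hypothetical, and the function name fit_line is introduced only for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: x = visits 12 months earlier, y = current visits
# (both in thousands), for two attractions with different dynamics.
data = {
    "attraction_a": ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]),
    "attraction_b": ([1.0, 2.0, 3.0, 4.0], [10.5, 11.0, 11.5, 12.0]),
}

# Model specialization = ON: one model per attraction.
specialized = {name: fit_line(xs, ys) for name, (xs, ys) in data.items()}

# Model specialization = OFF: a single model trained on the pooled data.
pooled_x = [x for xs, _ in data.values() for x in xs]
pooled_y = [y for _, ys in data.values() for y in ys]
global_model = fit_line(pooled_x, pooled_y)

print(specialized)
print(global_model)
```

In this toy setting each specialized model recovers its own attraction's slope, while the pooled model settles on a compromise slope that fits neither attraction exactly, which is the effect that specialization is meant to avoid when enough per-attraction data is available.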

B. Statistical analyses

The statistical significance of our results is presented (according to a t-test with significance level of α = 0.05). Specifically, we report 95% confidence intervals for the effect of each tourism requirement factor on the prediction task (according to our factorial analysis presented in Section 3.3) for both considered scenarios of indoor and outdoor attractions. These are shown in Tables 16 and 17, respectively. In general terms, we find that, with the specified statistical significance, model specialization and seasonality have the largest contributions.

Table 16. Contribution of each of tourism prediction requirements: Recency, seasonality, model specialization and their interactions into the response variable in indoor attractions (U.K. Museums and Galleries); results for MAPE < 10 and MAPE < 25 in 5 runs.

Minimum and maximum confidence interval for 95% confidence are reported.

Requirements/Factors | MAPE < 25: contribution (%) | MAPE < 25: CI min | MAPE < 25: CI max | MAPE < 10: contribution (%) | MAPE < 10: CI min | MAPE < 10: CI max
A = recency | 1.0 | 0.1 | 3.6 | 0.0 | 0.0 | 0.3
B = seasonality | 21.1 | 12.4 | 38.1 | 19.8 | 12.5 | 31.3
C = model spec. | 55.9 | 36.8 | 92.2 | 70.0 | 51.3 | 98.2
AB | 5.5 | 4.4 | 6.8 | 1.8 | 0.6 | 3.3
AC | 0.1 | 0.1 | 0.6 | 1.3 | 0.3 | 3.9
BC | 13.3 | 13.2 | 14.0 | 2.9 | 1.0 | 6.5
ABC | 0.7 | 0.0 | 2.9 | 1.6 | 0.3 | 4.2
error | 2 | - | - | 3 | - | -

Table 17. Contribution of each of tourism prediction requirements: Recency, seasonality, model specialization and their interactions into the response variable in outdoor attractions (U.S. National Parks); results for MAPE < 10 and MAPE < 25 in 5 runs.

Minimum and maximum confidence interval for 95% confidence are reported.

Requirements/Factors | MAPE < 25: contribution (%) | MAPE < 25: CI min | MAPE < 25: CI max | MAPE < 10: contribution (%) | MAPE < 10: CI min | MAPE < 10: CI max
A = recency | 1.1 | 0.7 | 1.5 | 1.1 | 0.5 | 2.2
B = seasonality | 24.6 | 21.5 | 28.3 | 32.6 | 26.0 | 41.2
C = model spec. | 46.0 | 40.9 | 52.0 | 58.5 | 48.1 | 71.9
AB | 0.7 | 0.5 | 0.9 | 2.0 | 2.8 | 1.4
AC | 1.2 | 1.0 | 1.0 | 2.0 | 1.1 | 3.5
BC | 24.4 | 24.0 | 24.8 | 1.8 | 1.2 | 2.6
ABC | 1.8 | 1.4 | 2.5 | 0.9 | 0.5 | 1.5
error | 0.2 | - | - | 1.0 | - | -

Data Availability

All the data exploited in our paper is already published and publicly available in Mendeley Repository: https://data.mendeley.com/datasets/t7bfhtzhxg/1.

Funding Statement

This research is partially funded by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG). The authors and grant numbers are as follows:

  • Amir Khatibi: (CNPq: 169823/2017-2)
  • Ana Paula Couto da Silva and Marcos A. Gonçalves: (FAPEMIG: PPM-00543-17 and PPM-00177-18), (CNPq: 422593/2018-4, 310538/2020-3, 310668/2020-4, 402711/2021-1 and 403184/2021-5)
  • Jussara Almeida: (CNPq: 305683/2019-5 and 403106/2021-4)

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Chetty P. Advantages of demand forecast for the tourism industry. projectguru. 2011.
  • 2. Webber AG. Exchange rate volatility and cointegration in tourism demand. Journal of Travel Research. 2001;39(4):398–405. doi: 10.1177/004728750103900406
  • 3. Li H, Song H, Li L. A Dynamic Panel Data Analysis of Climate and Tourism Demand: Additional Evidence. Journal of Travel Research. 2016;1–14.
  • 4. Maditinos Z, Vassiliadis C. Crises and disasters in tourism industry: happen locally, affect globally. In: MIBES; 2008. p. 67–76.
  • 5. Khatibi A, Belém F, da Silva APC, Almeida JM, Gonçalves MA. Fine-grained tourism prediction: Impact of social and environmental features. Information Processing & Management. 2019; p. 102057.
  • 6. Lim KH, Chan J, Leckie C, Karunasekera S. Personalized trip recommendation for tourists based on user interests, points of interest visit durations and visit recency. Knowledge and Information Systems. 2018;54(2):375–406. doi: 10.1007/s10115-017-1056-y
  • 7. Moro S, Rita P. Forecasting tomorrow’s tourist. Worldwide Hospitality and Tourism Themes. 2016;8(6):643–653.
  • 8. Hillmer SC. Time Series Analysis Univariate and Multivariate Methods. Journal of the American Statistical Association. 1991; p. 245.
  • 9. Hylleberg S. Modelling seasonality. Econometrics. 1992; p. 163–170.
  • 10. Butler R. Seasonality in tourism: Issues and implications. Seasonality in tourism. 2001; p. 5–21. doi: 10.1016/B978-0-08-043674-6.50005-2
  • 11. Rosselló J, Sansó A. Yearly, monthly and weekly seasonality of tourism demand: A decomposition analysis. Tourism Management. 2017;60:379–389. doi: 10.1016/j.tourman.2016.12.019
  • 12. Cuccia T, Rizzo I. Tourism seasonality in cultural destinations: Empirical evidence from Sicily. Tourism Management. 2011;32(3):589–595. doi: 10.1016/j.tourman.2010.05.008
  • 13. Richards G. Tourism attraction systems: Exploring cultural behavior. Annals of Tourism Research. 2002;29(4):1048–1064. doi: 10.1016/S0160-7383(02)00026-9
  • 14. Leask A, Fyall A, Barron P. Generation Y: An agenda for future visitor attraction research. International Journal of Tourism Research. 2014;16(5):462–471. doi: 10.1002/jtr.1940
  • 15. Jain Rea. A test of goodness of fit. The Art of Computer Systems Performance Analysis: techniques for experimental design, measurement, simulation, and modeling. 1991;49(268):765–769.
  • 16. Cankurt S, Subasi A. Developing tourism demand forecasting models using machine learning techniques with trend, seasonal, and cyclic components. Balkan Journal of Electrical and Computer Engineering. 2015;Vol.3, No.1.
  • 17. Khadivi P, Ramakrishnan N. Wikipedia in the Tourism Industry: Forecasting Demand and Modeling Usage Behavior. In: ICWSM; 2016. p. 4016–4021.
  • 18. Huang X, Zhang L, Ding Y. The Baidu Index: Uses in predicting tourism flows–A case study of the Forbidden City. Tourism Management. 2017;58:301–306. doi: 10.1016/j.tourman.2016.03.015
  • 19. Li N, Chen G. Analysis of a location-based social network. In: Computational Science and Engineering. vol. 4; 2009. p. 263–270.
  • 20. Wood SA, Guerry AD, Silver JM, Lacayo M. Using social media to quantify nature-based tourism and recreation. Scientific Reports. 2013;3.
  • 21. Fisichelli NA, Schuurman GW, Monahan WB, Ziesler PS. Protected Area Tourism in a Changing Climate: Will Visitation at US National Parks Warm Up or Overheat? PLoS ONE. 2015;10(6). doi: 10.1371/journal.pone.0128226
  • 22. Höpken W, Eberle T, Fuchs M, Lexhagen M. Google Trends data for analysing tourists’ online search behaviour and improving demand forecasting: the case of Åre, Sweden. Information Technology & Tourism. 2018.
  • 23. Xiaoxuan L, Qi W, Geng P, Benfu L. Tourism forecasting by search engine data with noise-processing. African Journal of Business Management. 2016;10(6):114–130. doi: 10.5897/AJBM2015.7945
  • 24. Volchek K, Liu A, Song H, Buhalis D. Forecasting tourist arrivals at attractions: Search engine empowered methodologies. Tourism Economics. 2018; p. 1354816618811558.
  • 25. Luo J, Huang SS, Wang R. A fine-grained sentiment analysis of online guest reviews of economy hotels in China. Journal of Hospitality Marketing & Management. 2021;30(1):71–95. doi: 10.1080/19368623.2020.1772163
  • 26. Li ea Zhenlong. ODT FLOW: Extracting, analyzing, and sharing multi-source multi-scale human mobility. Journal of Plos one. 2021;16(8):e0255259. doi: 10.1371/journal.pone.0255259 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Zhang H, Wang Z, Ke M, Cai M, Sun Q. Fine-grained Tourist Satisfaction Prediction Based on Deep Learning. In: 2021 IEEE 3rd International Conference on Frontiers Technology of Information and Computer (ICFTIC); 2021. p. 30–35.
  • 28. Khatibi A, da Silva APC, Almeida JM, Gonçalves MA. FISETIO: A FIne-grained, Structured and Enriched Tourism Dataset for Indoor and Outdoor attractions. Mendeley Data, Information Processing & Management. 2019;1. doi: 10.1016/j.dib.2019.104906 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Drucker H, Burges CJC, Kaufman L, Smola A, Vapnik V. Support Vector Regression Machines. In: NIPS; 1996. p. 155–161. [Google Scholar]
  • 30. Vagropoulos ea Stylianos I. Comparison of SARIMAX, SARIMA, modified SARIMA and ANN-based models. IEEE International Energy Conference (ENERGYCON). 2016.
  • 31. Hochreiter S, Schmidhuber J. Long Short-term Memory. Neural computation. 1997;9:1735–80. doi: 10.1162/neco.1997.9.8.1735 [DOI] [PubMed] [Google Scholar]
  • 32. McLaughlin RL. Forecasting models: Sophisticated or naive? Journal of Forecasting. 1986;2(3):274. [Google Scholar]
  • 33. Box GE, Jenkins GM. Time series analysis: Forecasting and control San Francisco. Calif: Holden-Day. 1976;. [Google Scholar]
  • 34. Li Y, Cao H. Prediction for tourism flow based on LSTM neural network. Procedia Computer Science. 2018;129:277–283. doi: 10.1016/j.procs.2018.03.076 [DOI] [Google Scholar]
  • 35. Tato A, Nkambou R. Improving adam optimizer. In: ICLR 2018; 2018. [Google Scholar]
  • 36. Lewis CD. Industrial and business forecasting methods: A practical guide to exponential smoothing and curve fitting. Butterworth-Heinemann; 1982. [Google Scholar]
  • 37. Hui Y, Wenzhu S, Xiuzhi Z, Guotao Z, Wenting H. Heuristic sample reduction based support vector regression method. In: 2016 IEEE International Conference on Mechatronics and Automation; 2016. p. 2065–2069.

Decision Letter 0

Ali Safaa Sadiq

2 May 2022

PONE-D-22-04884

A Quantitative Analysis of the Impact of Explicit Incorporation of Recency, Seasonality and Model Specialization into Fine-Grained Tourism Demand Prediction Models

PLOS ONE

Dear Dr. Khatibi,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jun 04 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Ali Safaa Sadiq

Academic Editor

PLOS ONE

Journal Requirements:

1. When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. In your Methods section, please include additional information about your dataset and ensure that you have included a statement specifying whether the collection method complied with the terms and conditions for the website.

Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Thank you for stating in your Funding Statement: 

Please include your amended Funding Statement within your cover letter. We will change the online submission form on your behalf.

(Applying to all authors, this work is partially supported by CNPq, CAPES, and Fapemig)

Please provide an amended statement that declares *all* the funding or sources of support (whether external or internal to your organization) received during this study, as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now.  Please also include the statement “There was no additional external funding received for this study.” in your updated Funding Statement. 

4. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. 

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

5. Thank you for stating the following in the Acknowledgments Section of your manuscript: 

(This work is partially supported by CNPq, CAPES, and Fapemig)

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. 

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: 

(Applying to all authors, this work is partially supported by CNPq, CAPES, and Fapemig)

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

7. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

8. We note you have included a table to which you do not refer in the text of your manuscript. Please ensure that you refer to Table 12 and 13 in your text; if accepted, production will need this reference to link the reader to the Table.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The contribution of this paper is good and I am happy to endorse its acceptance at some point. However, there are several major and minor comments to address. I have listed them as follows:

• First off, please clearly state the gap targeted in this paper at the end of introduction and list down the hypotheses

• In terms of research method and design, there is not much in the paper.

• The comparative algorithms in the experiments are not properly acknowledged and cited

• I also suggest adding some figures to better articulate the content, as the paper looks very dry at the moment.

• Analysis of the results is missing in the paper. There is a big gap between the results and conclusion. There should be the result analysis between these two sections. After comparing the numerical methods, you have to be able to analyse the results and relate them to their structures. It would be interesting to have your thoughts on why the method works that way? Such analyses would be the core of your work where you prove your understanding of the reason behind the results. You can also link the findings to the hypotheses of the paper. Long story short, this paper requires a very deep analysis from different perspectives

• There is no statistical test to judge about the significance of the numerical method’s results. Without such a statistical test, the conclusion cannot be supported

• There is no discussion on the cost effectiveness of the proposed method. What is the computational complexity? What is the runtime? Please include such discussions. You can also use the big oh notation to show the computation complexity.

• Some mathematical notations and Lemma presentations are not rigorous enough to correctly understand the contents of the paper. The authors are requested to recheck all the definition of variables and further clarify these equations.

Reviewer #2: This is a novel work on tourist prediction with the pre-requirements. However, several issues should be critically improved before resubmission. I encourage the authors to make a thorough revision and resubmit.

1. This topic is interesting for tourism management or visit prediction. However, the research question, or the focused scientific/technical question, is not clearly stated.

2. Structure of manuscript is vague. For example, works on investigating were in Section 1 and Section 2. Some discussion content was included in Section Result.

3. The references provide a narrow and insufficient review. The newest reference was published in 2019. In many journals, such as PLOS ONE and IEEE Access, you can find many works about tourism with keyword="Fine-Grained" or "tourist" or "tourism".

4.The proposed methods with three key requirements for prediction are helpful to improve grained prediction. However, the suitability and limitations of this work should be tested and written.

Reviewer #3: The abstract is too short and not structured according to the standard format.

The manuscript lacks a literature review on the topic under discussion. I recommend adding a literature review section to highlight past related work and how your work differs.

Conclusions and Future Work should be separated into two sections.

Add more up-to-date references.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Dec 8;17(12):e0278112. doi: 10.1371/journal.pone.0278112.r002

Author response to Decision Letter 0


23 Jul 2022

We are also sending the complete response letter as a PDF file with our submission.

################################################################

Paper title: A Quantitative Analysis of the Impact of Explicit Incorporation of Recency, Seasonality and Model Specialization into Fine-Grained Tourism Demand Prediction Models

Authors: Amir Khatibi, Ana Paula Couto da Silva, Jussara Almeida, Marcos A. Gonçalves

Dear Editor in Chief and Reviewers

We are sending a revised version of our paper under submission to PLOS ONE Journal as well as the responses to the reviewers’ comments.

We would like to sincerely thank all reviewers for their thoughtful and detailed comments that helped improve our paper. We did our best to address all their comments. The original reviewers' comments are in bold, each followed by our responses in italic.

We also performed all the requested editorial changes in our revised manuscript.

Best regards,

Amir Khatibi,

Ana Paula Couto da Silva, Jussara Almeida, Marcos A. Gonçalves

################################################################

REVIEWER #1

COMMENT #1:

The contribution of this paper is good and I am happy to endorse its acceptance at some point. However, there are several major and minor comments to address. I have listed them as follows:

First off, please clearly state the gap targeted in this paper at the end of introduction and list down the hypotheses

RESPONSE: We thank the reviewer for the positive feedback on our paper.

As requested, we better elaborated the hypotheses that we investigate in this work as explicit research questions we aim to answer throughout the manuscript.

The questions are explicitly stated in a new subsection (Subsection 1.1, pages 4 and 5) alongside the main contributions (Section 1.3, pages 9 and 10) at the end of the Introduction.

In more detail, in Research Question 1 (RQ1), we investigate whether recency, seasonality and model specialization (characteristic of attraction) do indeed influence the accuracy of predicting visits in tourist sites. In RQ2, we aim to quantify the impact of each of these factors on tourism demand prediction. Finally in RQ3, we study the extent to which scenarios with data scarcity hinder the accuracy of prediction models while exploiting recency, seasonality and model specialization.

The combined answers to these questions advance the state-of-the-art in the field by providing not only a better understanding of the role of the aforementioned factors in the prediction models but also by allowing us to produce prediction models that are more accurate than the current state-of-the-art.

COMMENT #2:

In terms of research method and design, there is not much in the paper.

RESPONSE: We better defined our methodology and experimental design in the revised version by including new Figures and expanded explanations. For instance, in Figure 2 (page 15) we discuss our tourism demand prediction methodology that adopts social media and environmental data as well as embedded recency and seasonality features in both specialized and global models. In Figures 4 and 5 (pages 24 and 25 respectively), we illustrate how we refine our prediction methodology for two scenarios of scarcity in recent and historical data. We also added a new paragraph in Section 2.5 (page 15) alongside Figure 3 (in the same page) to elaborate the differences between specialized and global models in our methodology.

We hope that, with these clarifications, it is clearer that our methodology and experimental design are well defined and statistically sound. They include experiments with cross-validation over multiple folds and statistical tests, presented in the Appendix (Tables 16 and 17), to better assess how the learned models generalize across different training and test sets. We also used a factorial design analysis, as explained in Section 2.6, to quantify the impact of the analyzed factors.
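As a rough illustration of what embedding recency and seasonality as explicit features can look like, the sketch below builds lagged inputs from a monthly visitation series. The lag choices (1 month for recency, 12 months for seasonality), the function name and the sample values are assumptions made for illustration; they are not the exact feature set of the manuscript.

```python
from typing import List, Tuple


def build_lag_features(visits: List[float],
                       recency_lag: int = 1,
                       seasonal_lag: int = 12) -> Tuple[List[List[float]], List[float]]:
    """Turn a monthly series into (X, y) pairs.

    Each row of X holds [value recency_lag months earlier,
                         value seasonal_lag months earlier];
    y holds the value to be predicted for that month.
    """
    start = max(recency_lag, seasonal_lag)  # first month with both lags available
    X: List[List[float]] = []
    y: List[float] = []
    for t in range(start, len(visits)):
        X.append([visits[t - recency_lag], visits[t - seasonal_lag]])
        y.append(visits[t])
    return X, y


# Hypothetical monthly visit counts for 14 months (made-up values).
series = [float(v) for v in range(100, 114)]
X, y = build_lag_features(series)
```

Under the same assumptions, a specialized model would be fit on the (X, y) pairs of a single attraction, whereas a global model would be fit on the pairs pooled from all attractions.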

COMMENT #3:

The comparative algorithms in the experiments are not properly acknowledged and cited

RESPONSE: We thank the reviewer for the comment. We have revised the manuscript to properly acknowledge and cite all algorithms.

Notably, in our Prediction Techniques Section (Section 2.3), we presented five prediction algorithms (Support Vector Regression [Drucker 1996], Seasonal ARIMA [Hillmer 1991], Seasonal ARIMAX [Vagro. 2016], Neural Network [Hoch. 1997], Naive Models [McLau. 1986]). As the reviewer suggested, we added a new paragraph properly acknowledging all models, with a detailed explanation of the calculations, and we also revised the description of the SARIMAX model, adding the necessary references.

Moreover, we also added a paragraph in the Factorial Design analysis section (Section 2.6) citing one of the most widely used books describing its use [Jain 1991].

[Drucker 1996]: Drucker H, Burges CJC, Kaufman L, Smola A, Vapnik V. Support Vector Regression Machines. In: NIPS; 1996. p. 155–161

[Hillmer 1991]: Hillmer SC. Time Series Analysis Univariate and Multivariate Methods. Journal of the American Statistical Association. 1991; p. 245.

[Vagro. 2016]: Vagropoulos ea Stylianos I. Comparison of SARIMAX, SARIMA, modified SARIMA and ANN-based models. IEEE International Energy Conference (ENERGYCON). 2016

[Hoch. 1997]: Hochreiter S, Schmidhuber J. Long Short-term Memory. Neural computation. 1997; vol 9:1735–80.

[McLau. 1986]: McLaughlin, Robert L. Forecasting models: Sophisticated or naive? Journal of Forecasting (pre-1986) 2.3 (1983): 274.

[Jain 1991]: Jain Rea. A test of goodness of fit. The Art of Computer Systems Performance Analysis: techniques for experimental design, measurement, simulation, and modeling. 1991;49(268):765–769

COMMENT #4:

I also suggest adding some figures to better articulate the content, as the paper looks very dry at the moment.

RESPONSE: We thank the reviewer for the suggestion.

We have added new figures to make it easy for the reader to follow our manuscript. Specifically, we added:

a new Figure 1 in Section 2 (page 11) to better illustrate the phases of our data collection methodology, and

a new Figure 3 in Section 2 (page 15) to better illustrate the differences between the specialized and global models in an illustrative demonstration.

We have also improved the figure that described the temporal evolution of tourism features (previously Figure 4), breaking it into two new, more informative figures (now Figures 6 and 7).

We believe that these new figures, along with the figures already available in the original version of the manuscript describing various methodological aspects of our study will make the reading much easier.

COMMENT #5:

Analysis of the results is missing in the paper. There is a big gap between the results and conclusion. There should be the result analysis between these two sections. After comparing the numerical methods, you have to be able to analyze the results and relate them to their structures. It would be interesting to have your thoughts on why the method works that way?

RESPONSE: As suggested, we added a new Section (Section 4 - Discussions) following the Results sections, to discuss and analyze our achieved results in a more qualitative fashion.

In the new section, we better relate the obtained experimental results, described in Section 3, with the posed research questions in Section 1, connecting the hypotheses with experimental results more clearly. Finally, we offered insights into why the method works the way it did based on our interpretations of all performed analyses.

COMMENT #6:

Such analyses would be the core of your work where you prove your understanding of the reason behind the results. You can also link the findings to the hypotheses of the paper. Long story short, this paper requires a very deep analysis from different perspectives

RESPONSE: As we mentioned in our response to the previous comment, we have included a new Section (Section 4 - Discussions) where we perform a deeper analysis of our findings, better associating our research questions (a translation of our hypotheses) with the experimental results reported in the paper.

COMMENT #7:

There is no statistical test to judge about the significance of the numerical method’s results. Without such a statistical test, the conclusion cannot be supported

RESPONSE: We have added a new section to the Appendix (Statistical analysis) on the statistical significance of our results, according to a t-test with significance level alpha = 0.05. Specifically, we report 95% confidence intervals for the effect of each tourism requirement factor on the prediction task (according to our factorial analysis presented in Section 3.3) for both considered scenarios of indoor and outdoor attractions. These are shown in Tables 16 and 17, respectively. In general terms, we find that, at the specified significance level, model specialization and seasonality have the largest contributions.
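For readers unfamiliar with this kind of analysis, the following is a minimal sketch of how main effects and their confidence intervals can be computed for a two-level factorial design with replications, following the standard textbook formulation. It is not the code used for Tables 16 and 17: the function name is hypothetical, the response values are synthetic, and the critical value 2.306 (the 0.975 quantile of Student's t with 8 degrees of freedom) is supplied by hand.

```python
from itertools import product
from math import sqrt
from typing import Dict, Sequence, Tuple

Config = Tuple[int, ...]  # one level per factor: -1 (off) or +1 (on)


def main_effects(factors: Sequence[str],
                 responses: Dict[Config, Sequence[float]],
                 t_crit: float) -> Dict[str, Tuple[float, float, float]]:
    """Return {factor: (effect, ci_low, ci_high)} for a 2^k design with r replications."""
    runs = len(responses)                       # 2^k configurations
    reps = len(next(iter(responses.values())))  # r replications per configuration
    means = {cfg: sum(ys) / reps for cfg, ys in responses.items()}
    # Experimental error: squared deviations of replications around their mean.
    sse = sum((y - means[cfg]) ** 2 for cfg, ys in responses.items() for y in ys)
    s_e = sqrt(sse / (runs * (reps - 1)))       # error standard deviation
    s_q = s_e / sqrt(runs * reps)               # standard deviation of each effect
    result = {}
    for j, name in enumerate(factors):
        # Sign-table contrast: half the gap between the means at the +1 and -1 levels.
        q = sum(cfg[j] * means[cfg] for cfg in responses) / runs
        result[name] = (q, q - t_crit * s_q, q + t_crit * s_q)
    return result


# Synthetic responses (NOT results from the paper): 2^3 design, 2 replications each.
factors = ["specialization", "seasonality", "recency"]
responses = {
    cfg: [10 + 3 * cfg[0] + 1 * cfg[1] + 0.5, 10 + 3 * cfg[0] + 1 * cfg[1] - 0.5]
    for cfg in product((-1, 1), repeat=3)
}
effects = main_effects(factors, responses, t_crit=2.306)
```

In this formulation an effect is deemed significant at the chosen level when its confidence interval does not contain zero.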

COMMENT #8:

There is no discussion on the cost effectiveness of the proposed method. What is the computational complexity? What is the runtime? Please include such discussions. You can also use the big oh notation to show the computation complexity.

RESPONSE: We added a new paragraph in Section 3.1 (pages 18 and 19), reporting computational complexity of the main prediction algorithm and execution times for specialized and global models.

Support Vector Regression (SVR) has a time complexity of O(n^3) and a space complexity of O(n^2), where n is the number of training points [Hui 2016]. In our experiments, since we have few training points (n = 30 for outdoor attractions and n = 76 for indoor attractions), and our model is trained once a month, the execution time for the prediction task was not a major concern.
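To put these training-set sizes in perspective, the short calculation below evaluates the cubic time and quadratic space terms for both cases. The numbers are abstract operation and storage counts up to constant factors, not measured running times, and the helper name is hypothetical.

```python
def growth_term(n: int, exponent: int) -> int:
    """Value of n**exponent, i.e. the dominant term of a polynomial cost."""
    return n ** exponent


outdoor_n, indoor_n = 30, 76  # training points reported for each attraction type

time_outdoor = growth_term(outdoor_n, 3)   # 27,000
time_indoor = growth_term(indoor_n, 3)     # 438,976
space_outdoor = growth_term(outdoor_n, 2)  # 900
space_indoor = growth_term(indoor_n, 2)    # 5,776

# Relative growth in the time term when moving from outdoor to indoor sizes.
time_ratio = time_indoor / time_outdoor
```

Even for the larger case the cubic term stays below half a million, which is consistent with training time not being a concern at this scale.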

The computational complexity of the other algorithms is as follows. For SARIMA models, the complexity is on the order of O(n). Neural network models, in turn, take more time because of the many iterations of forward and backward propagation: backpropagation, on the order of O(n^5), is much slower than forward propagation, on the order of O(n^4). Finally, for naive models the complexity is O(1), since they only pick the value at a predefined index of the historical data and return it as the prediction.
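The constant-time behavior of the naive models can be seen in a minimal sketch such as the one below, which assumes a monthly series ordered from oldest to newest. The function name and the two lags shown (1 for the previous month, 12 for the same month of the previous year) are illustrative assumptions rather than the exact baselines of the manuscript.

```python
from typing import Sequence


def naive_forecast(history: Sequence[float], lag: int = 1) -> float:
    """Predict the next month as the value observed `lag` months before it.

    A single index lookup, so the cost does not grow with len(history).
    """
    if lag < 1 or lag > len(history):
        raise ValueError("lag must be between 1 and len(history)")
    return history[-lag]


# Hypothetical 24 months of visit counts (made-up values: 1, 2, ..., 24).
history = list(range(1, 25))
last_month_guess = naive_forecast(history, lag=1)    # previous month's value
same_month_guess = naive_forecast(history, lag=12)   # value 12 months before the target
```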

Regarding execution time, for specialized SVR models the mean execution time is about 45 seconds (minimum: 21 seconds; maximum: 92 seconds), while for generalized models with different sets of features the mean execution time was around 4 hours (minimum: 1 hour and 33 minutes; maximum: 20 hours). Once training is complete, the prediction phase is quite fast (an average of 4 seconds for all 100 attractions, regardless of whether global or specialized models are used).

Observation: As we also reported in the revised version of the paper, we ran our experiments on a desktop PC with 4 CPUs and 16 GB of RAM, using the R programming language.

[Hui 2016]: Y. Hui, S. Wenzhu, Z. Xiuzhi, Z. Guotao and H. Wenting, Heuristic sample reduction based support vector regression method, IEEE International Conference on Mechatronics and Automation, 2016, pp. 2065-2069.

COMMENT #9:

Some mathematical notations and Lemma presentations are not rigorous enough to correctly understand the contents of the paper. The authors are requested to recheck all the definition of variables and further clarify these equations.

RESPONSE: We revisited all the mathematical definitions in Sections 2.2 Problem Definition and 2.3 Prediction Techniques in order to make sure that all notations are correct and consistent. There are no lemmas in our manuscript.

################################################################

REVIEWER #2

COMMENT #1:

This is a novel work on tourist prediction with the pre-requirements. However, several issues should be critically improved before resubmission. I encourage the authors to make a thorough revision and resubmit.

This topic is interesting for tourism management or visit prediction. However, the research question, or the focused scientific/technical question, is not clearly stated.

RESPONSE: We thank the reviewer for the comment.

As stated in our responses to Comments #1 and #6 of Reviewer #1, to clarify our research questions, we added a new Section at the end of the introduction of the revised manuscript where we explicitly state and explain our Research Questions – RQs (new Section 1.1, pages 4 and 5). We then connect these RQs to the experimental results in the paper with a qualitative analysis in Section 4 - Discussions.

COMMENT #2:

Structure of manuscript is vague. For example, works on investigating were in Section 1 and Section 2. Some discussion content was included in Section Result.

RESPONSE: We followed the PLOS One required structure.

According to the PLOS ONE submission guidelines, we placed our review of related work in the Introduction (Section 1):

PLOS ONE:

- The introduction should: Include a brief review of the key literature.

- The Materials and Methods section: If materials, methods, and protocols are well established, authors may cite articles where those protocols are described in detail, but the submission should include sufficient information to be understood independent of these references.

Regarding the inclusion of discussion content in Section Results, as indicated in our response to Comment #5 of Reviewer #1, we added a new section, Section 4 - Discussions where we concentrated (and expanded) all the discussions and qualitative analysis of the results.

We also changed the title of Section ‘Experimental Methodology’ to ‘Materials and Methods’ to closely comply with the PLOS ONE guidelines.

COMMENT #3:

The references provide a narrow and insufficient review. The newest reference was published in 2019. In many journals, such as PLOS ONE and IEEE Access, you can find many works about tourism with keyword="Fine-Grained" or "tourist" or "tourism".

RESPONSE: As the reviewer suggested, we searched for recent works in the area of tourism demand prediction. We included the related papers in well-known journals such as IEEE Access and PLOS ONE published in recent years. The citations to three recently published works were included in the Related Work (Section 1.2), describing each work and their similarities and differences to ours.

In more detail, in [Luo 2021], the authors investigate the experiences of Chinese economy hotel guests using online reviews as a proxy; similarly to our work, they use external data for fine-grained predictions. However, the focus of our work is on visit prediction rather than sentiment analysis. We also explicitly exploit recent and seasonal behaviors in our feature set.

[Li 2021] develops a scalable online platform for extracting, analyzing, and sharing multi-source multi-scale human mobility flows to assist human mobility monitoring. The focus of this work is mostly on providing and monitoring fine-grained spatio-temporal mobility data while in our work we analyze multiple prediction models exploiting external data alongside explicit use of seasonality and recency in order to predict tourism demand.

In [Zhang 2021], the authors build a fine-grained tourist satisfaction prediction model based on deep learning, using destination features as a proxy; similarly to our work, the authors use external data for fine-grained prediction. However, the focus of our work is on visit prediction rather than tourist satisfaction prediction. In addition, in our work, we exploit recent and seasonal features in order to obtain more accurate results.

[Luo 2021]: Luo J, Huang SS, Wang R. A fine-grained sentiment analysis of online guest reviews of economy hotels in China. Journal of Hospitality Marketing & Management. 2021;30(1):71–95

[Li 2021]: Li Z, et al. ODT FLOW: Extracting, analyzing, and sharing multi-source multi-scale human mobility. PLOS ONE. 2021;16(8)

[Zhang 2021]: Zhang H, Wang Z, Ke M, Cai M, Sun Q. Fine-grained Tourist Satisfaction Prediction Based on Deep Learning, 3rd International Conference on Frontiers Technology of Information and Computer (ICFTIC); IEEE 2021. p. 30–35.

COMMENT #4:

The proposed methods with three key requirements for prediction are helpful to improve grained prediction. However, the suitability and limitations of this work should be tested and written.

RESPONSE: We added a new Limitations subsection (Section 4.1) to the Discussion Section specifying the limitations of our analysis, such as the lack of official data needed to test our methodology at a finer time granularity (weekly or daily). Another limitation is the scale of our study, which, though larger than most similar studies, still needs to be expanded to assess the practicality of our proposals in the real world.

################################################################

REVIEWER #3

COMMENT #1:

The abstract is too short and not structured according to the standard format.

RESPONSE: According to the PLOS ONE submission guideline (link), an abstract should not exceed 300 words. Our abstract was structured to comply with this guideline.

COMMENT #2:

The manuscript lacks a literature review on the topic under discussion. I recommend adding a literature review section to highlight the past related work and how your work is different.

RESPONSE: We analyzed more than sixteen (16) works from the tourism prediction literature, all of which are classified and presented in Table 1 in the Introduction section. As stated in our response to Comment #2 of Reviewer #2, according to the PLOS ONE guideline (link), the literature review should be included in the Introduction Section.

Now, in the revised version we have an explicit literature review Section (Section Related Work, 1.2, pages 5, 6, 7, 8 and 9) as a subsection of the Introduction as required by the PLOS One format guidelines. We also expanded the number of works in this Section, including three recent, very related studies.

COMMENT #3:

Conclusions and Future Work should be separated into two sections.

RESPONSE: According to the PLOS ONE submission guideline (link), we could not have a separate Section for Future Work:

PLOS ONE: Results, Discussion, Conclusions These sections may all be separate, or may be combined to create a mixed Results/Discussion section.

Authors should explain how the results relate to the hypothesis presented as the basis of the study and provide a succinct explanation of the implications of the findings, particularly in relation to previous related studies and potential future directions for research.

We separated Conclusions and Future Work into two subsections in the same Section of Conclusion, following the PLOS ONE submission guideline.

COMMENT #4:

Add more up-to-date references.

RESPONSE: As stated in our response to Comment #3 of Reviewer #2, after searching for recent tourism demand prediction publications in well-known journals, we added three more citations to Table 1 in our Related Work (Section 1.2, pages 5, 6, 7, 8 and 9).

Luo J, Huang SS, Wang R. A fine-grained sentiment analysis of online guest reviews of economy hotels in China. Journal of Hospitality Marketing & Management. 2021;30(1):71–95

Li Z, et al. ODT FLOW: Extracting, analyzing, and sharing multi-source multi-scale human mobility. PLOS ONE. 2021;16(8)

Zhang H, Wang Z, Ke M, Cai M, Sun Q. Fine-grained Tourist Satisfaction Prediction Based on Deep Learning, 3rd International Conference on Frontiers Technology of Information and Computer (ICFTIC); IEEE 2021. p. 30–35.

Attachment

Submitted filename: Response Letter- PloS One.pdf

Decision Letter 1

Ali Safaa Sadiq

16 Aug 2022

PONE-D-22-04884R1

A Quantitative Analysis of the Impact of Explicit Incorporation of Recency, Seasonality and Model Specialization into Fine-Grained Tourism Demand Prediction Models

PLOS ONE

Dear Dr. Khatibi,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Sep 30 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Ali Safaa Sadiq

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments (if provided):

Authors are invited to submit their second revision of their manuscript to address some of the minor changes suggested by the first reviewer.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: (No Response)

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: (No Response)

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: (No Response)

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: (No Response)

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Some final cosmetic comments:

* The results of your comparative study should be discussed in-depth and with more insightful comments on the behaviour of your algorithm on various case studies. Discussing results should not mean reading out the tables and figures once again.

* Avoid lumping references as in [x, y] and all other. Instead summarize the main contribution of each referenced paper in a separate sentence. For scientific and research papers, it is not necessary to give several references that say exactly the same. Anyway, that would be strange, since then what is innovative scientific contribution of referenced papers? For each thesis state only one reference.

* Avoid using first person.

* Avoid using abbreviations and acronyms in title, abstract, headings and highlights.

* Please avoid having heading after heading with nothing in between, either merge your headings or provide a small paragraph in between.

* The first time you use an acronym in the text, please write the full name and the acronym in parenthesis. Do not use acronyms in the title, abstract, chapter headings and highlights.

* The results should be further elaborated to show how they could be used for the real applications.

Reviewer #3: Wish a very good luck for all authors. The manuscript now is ready for publication after addressing all the comments.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Dec 8;17(12):e0278112. doi: 10.1371/journal.pone.0278112.r004

Author response to Decision Letter 1


3 Oct 2022

Paper title: A Quantitative Analysis of the Impact of Explicit Incorporation of Recency, Seasonality and Model Specialization into Fine-Grained Tourism Demand Prediction Models

Authors: Amir Khatibi, Ana Paula Couto da Silva, Jussara Almeida, Marcos A. Gonçalves

Dear PLOS ONE Editor-in-Chief and Reviewers,

We are sending the second revision of our manuscript, under submission to PLOS ONE, together with our responses to the reviewers’ comments.

We would like to sincerely thank all reviewers for their thoughtful and detailed comments that helped improve our paper. We did our best to address all their comments. The original reviewers' comments are in bold, each followed by our responses in italic.

We also performed all the requested editorial changes in our revised manuscript.

Best regards,

Amir Khatibi,

Ana Paula Couto da Silva, Jussara Almeida, Marcos A. Gonçalves

REVIEWER #1

Some final cosmetic comments:

COMMENT #1:

The results of your comparative study should be discussed in-depth and with more insightful comments on the behavior of your algorithm on various case studies. Discussing results should not mean reading out the tables and figures once again.

RESPONSE: We thank the reviewer for the feedback on our paper.

As requested, we expanded the Discussion Section with more insightful comments on our experimental results. We also avoided re-reading previous tables and figures and focused on providing more qualitative comments based on our case studies, as suggested.

COMMENT #2:

Avoid lumping references as in [x, y] and all others. Instead summarize the main contribution of each referenced paper in a separate sentence. For scientific and research papers, it is not necessary to give several references that say exactly the same. Anyway, that would be strange, since then what is the innovative scientific contribution of referenced papers? For each thesis state only one reference.

RESPONSE: We identified all lumped references in the Related Work Section and split them up, explaining the contribution of each work separately.

COMMENT #3:

Avoid using first person.

RESPONSE: We considerably reduced the use of the first person in the revised manuscript, although we did not remove all uses, since at some points in the text doing so would sound unnatural or artificial.

COMMENT #4:

Avoid using abbreviations and acronyms in title, abstract, headings and highlights.

RESPONSE: We resolved the cases in which acronyms were used in the sub-section headings of Section 2.4.

COMMENT #5:

Avoid having heading after heading with nothing in between, either merge your headings or provide a small paragraph in between.

RESPONSE: We reviewed the paper to ensure that the issue pointed out by the reviewer no longer occurs. In addition, to avoid consecutive headings in the Appendices Section, we divided it into parts A and B, each introduced by a short paragraph explaining the sub-section.

COMMENT #6:

The first time you use an acronym in the text, please write the full name and the acronym in parenthesis. Do not use acronyms in the title, abstract, chapter headings and highlights.

RESPONSE: We revised the paper and corrected three instances of this type. Now every acronym correctly appears in parentheses after the full name.

COMMENT #7:

The results should be further elaborated to show how they could be used for the real applications.

RESPONSE: We expanded the discussion in Section 4.2, emphasizing limitations of our study and discussing issues related to the practical applications of our results, analyses and findings.

REVIEWER #2

There were no comments from Reviewer #2.

REVIEWER #3

Wish a very good luck for all authors. The manuscript now is ready for publication after addressing all the comments.

RESPONSE: We thank Reviewer #3 for the kind feedback and good wishes.

Attachment

Submitted filename: Response_Letter_2nd_revision_PloSOne.pdf

Decision Letter 2

Ali Safaa Sadiq

10 Nov 2022

A Quantitative Analysis of the Impact of Explicit Incorporation of Recency, Seasonality and Model Specialization into Fine-Grained Tourism Demand Prediction Models

PONE-D-22-04884R2

Dear Dr. Khatibi,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Ali Safaa Sadiq

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

The authors have addressed all the comments given by the reviewers; hence, I am happy to recommend their paper for publication.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: all comments have been addressed.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

Acceptance letter

Ali Safaa Sadiq

17 Nov 2022

PONE-D-22-04884R2

A Quantitative Analysis of the Impact of Explicit Incorporation of Recency, Seasonality and Model Specialization into Fine-Grained Tourism Demand Prediction Models

Dear Dr. Khatibi:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Ali Safaa Sadiq

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Response Letter- PloS One.pdf

    Attachment

    Submitted filename: Response_Letter_2nd_revision_PloSOne.pdf

    Data Availability Statement

    All the data exploited in our paper is already published and publicly available in Mendeley Repository: https://data.mendeley.com/datasets/t7bfhtzhxg/1.


    Articles from PLOS ONE are provided here courtesy of PLOS
