Abstract
Predicting the spatial behaviors of an individual (e.g., frequent visits to specific locations) is important for improving our understanding of the complexity of human mobility patterns, and for capturing anomalous behaviors in an individual’s spatial movements, which can be particularly useful in situations such as those induced by the COVID-19 pandemic. We propose a system called Deep Spatio-Temporal Predictor (DST-Predict) that can predict the future visit frequency of an individual based on one’s past mobility behavior patterns, using GPS trace data collected from mobile phones. Predicting such spatial behavior is challenging, primarily because each individual’s pattern of location visits consists of both systematic and random components, which vary across the spatial and temporal scales of analysis. To address these issues, we propose a novel multi-view sequence-to-sequence model based on Convolutional Long Short-Term Memory (ConvLSTM), in which the history of frequent visit patterns is used to predict individuals’ future visit patterns in a multi-step manner. Using GPS survey data obtained from 1,464 participants in western New York, US, we demonstrate that the proposed system is capable of predicting individuals’ frequency of visits to common places in an urban setting with high accuracy.
INDEX TERMS: Human mobility, deep learning, predictive learning
I. INTRODUCTION
Understanding human mobility patterns is important for solving problems related to public health [1], [2], emergency event detection [3], urban planning [4], and transportation engineering [5]. The literature [6] shows that sequencing the locations that individuals visit frequently is an effective means of capturing daily human mobility patterns, as individuals’ frequently visited locations are the basis of essential travel activities, such as going from home to the workplace in the morning and returning home from work in the evening. Despite individual differences, previous studies [7], [8] have shown that large-scale human mobility patterns are highly regular and thus predictable, due to circadian patterns and routine daily activities such as one’s journey to work or home.
Prediction of an individual’s mobility over time, i.e., on an hourly, daily or weekly basis, enables us to better understand the general behavioral patterns of individuals, and has been used in various practical applications such as crowd flow prediction [9] and location-based advertising [10]. In the context of the COVID-19 pandemic, individual-level mobility patterns are crucial for understanding and controlling the spread of the disease, especially in dense urban settings [11]. Predicting recurrent visits to a finite set of locations over time requires understanding both the spatial and the temporal aspects of human movements. Previous studies [6] have demonstrated that mobility patterns can be captured by an exploration and preferential return model with a displacement distribution, in which individuals return to a limited number of places over time and trips to places outside a regularly traveled region are rare. However, most previous studies are based on trajectory data extracted from mobile phone data logs, referred to as call detail records (CDR), and focus only on large-scale mobility patterns. Prediction of individuals’ visit counts at frequently visited locations across multiple spatial and temporal resolutions has not yet been investigated.
GPS-enabled mobile phone data, in which phone location is determined by special queries with pre-determined sampling intervals (“active mobile phone data” hereafter), have been increasingly used in human mobility studies [12]. A unique advantage of active mobile phone data over other data modalities frequently used in human mobility studies [13], such as CDRs or geo-tagged Twitter posts, is that they provide precise spatial locations, compared with the closest antenna location for CDRs or the limited information present in geo-tagged tweets [14]. The rapid increase in the availability of data arriving from heterogeneous sensors such as cameras [15], loop detectors [16], [17], and standalone GPS devices [18], [19] has offered an opportunity for deep learning approaches to produce novel and effective models that leverage huge amounts of data and turn them into information useful to society. Several models [20], [21] have been applied to different applications and use-case scenarios, depending on the specific problem and the type of data available. Applications that have used deep learning approaches include traffic flow prediction [22]–[24], traffic incident detection [25]–[27] and crime incident prediction [28], [29]. A few applications in the domain of human mobility pattern mining have also employed deep learning approaches, for estimating migratory flows and for human trajectory data mining [30]–[32].
In the present paper, we propose a system, DST-Predict, that employs a novel multi-step deep learning architecture to predict an individual’s visit frequency at a finite set of locations using the individual’s past active mobile phone data and other relevant information. We evaluate the model architecture at different spatial scales (i.e., resolutions) and demonstrate its capability to forecast short-term visit patterns. Lastly, we show how the proposed model captures both spatial and temporal dependencies, along with individual-specific characteristics such as age, gender and employment status.
II. RELATED WORK
We investigate the problem of predicting individuals’ frequent visits, which can be considered a special instance of the count prediction problem explored in related contexts. Applications of count prediction can be found elsewhere, including crowd count prediction in videos [33], taxi demand prediction [34], forecasting crowd flows in a city [35] and tweet count prediction [36] within a specific geographic region. Other useful applications related to count prediction include counting in microscopic images [37], vehicle counting in images related to traffic congestion [38] and counting animals in the wild [39]. However, to the best of our knowledge, using GPS trace data to predict future visit frequencies has not been directly explored in the literature.
With the rapid increase in the use of deep learning approaches, the field of spatio-temporal data mining has recently undergone substantial changes. Conventional applications of deep learning algorithms can be found in natural language processing [40], [41] and computer vision [42], [43], although these algorithms have also been used extensively in the modeling and analysis of human mobility [33]–[35], [44] in more recent years. Convolutional neural networks (CNN) and recurrent neural networks (RNN) have been used extensively to capture spatial and temporal movements in human trajectory data mining studies [35], [45]. For example, the study in [46] presented the use of both CNNs and RNNs to capture spatio-temporal movements. Similarly, the study in [47] provided a unique convolutional long short-term memory network (ConvLSTM) for precipitation nowcasting on a radar echo dataset, capturing both spatial and temporal correlations effectively. Some studies have applied multiple machine learning techniques to count prediction problems in different settings, including deep learning approaches to forecast crime incidents across different spatio-temporal scales. For example, [48] used deep learning to forecast crime in a fine-grained city partition, while [29] used ST-ResNet [35] to forecast crime distributions over the Los Angeles area. Deep learning approaches have also been used to understand traffic flow and forecast traffic accidents. For instance, Yuan et al. [49] used ConvLSTM on heterogeneous urban data to forecast traffic accidents, while Liu et al. [50] used ConvLSTM along with a bidirectional LSTM to predict short-term traffic flow on urban daily traffic data. Crowd counting is another problem for which several deep learning approaches have been employed in the past.
For example, Zhang et al. [51] used a deep convolutional neural network to solve the cross-scene crowd counting problem, while the use of a bidirectional ConvLSTM for crowd counting in videos is presented by Xiong et al. [33]. However, the neural network architectures employed by these related solutions cannot handle the unique challenges associated with predicting individuals’ visit frequency from GPS trace data. Most previous studies forecasted only one timestep ahead, which provides a limited view of model accuracy. To overcome this limitation, some studies [24], [52] provided sequence-to-sequence learning approaches for traffic prediction problems. Our proposed solution handles these challenges through customized architectural and procedural modifications that allow forecasting many timesteps into the future.
III. PROBLEM FORMULATION
For each individual, the raw GPS data is available as a series of chronologically ordered GPS locations (latitude and longitude), where the index in the superscript, i, denotes the ith individual. We transform this data into a gridded representation by first grouping the locations by a pre-specified temporal window, e.g., hourly, daily or weekly. For each window, e.g., a day, we construct an M × N matrix X_t^(i), where M and N are the number of rows and columns, respectively, of a uniform spatial grid of a particular scale applied to the target spatial area, and t denotes the index of the temporal window. Each entry of X_t^(i) is equal to the number of times the ith individual “visits” the corresponding grid cell during the tth window. We will refer to the matrix X_t^(i) as the visit count matrix for the ith individual for the tth time window.
Figure 1 illustrates this transformation for a randomly selected participant in the target area, as discussed in the subsequent sections. Note that we use a daily window as the temporal unit of analysis (i.e., predictions from DST-Predict are obtained daily), though the same methodology is applicable to any window length, depending on the target application.
FIGURE 1.

Transforming GPS trace data (left) to a gridded representation (right). Each grid cell is 2km × 2km for this example.
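As a concrete illustration of this transformation, the sketch below bins an individual's GPS fixes into daily visit count matrices with NumPy. The function name, grid dimensions and bounding box are illustrative assumptions, not part of the original system:

```python
import numpy as np

def daily_visit_counts(lats, lons, day_idx, bounds, M=40, N=80):
    """Bin GPS fixes into per-day M x N visit count matrices.

    lats, lons : arrays of GPS coordinates for one individual
    day_idx    : integer day index (0 .. T-1) of each fix
    bounds     : (lat_min, lat_max, lon_min, lon_max) of the target area
    Returns an array of shape (T, M, N); entry [t, m, n] is the number
    of fixes recorded in grid cell (m, n) on day t.
    """
    lat_min, lat_max, lon_min, lon_max = bounds
    lat_edges = np.linspace(lat_min, lat_max, M + 1)
    lon_edges = np.linspace(lon_min, lon_max, N + 1)
    T = int(day_idx.max()) + 1
    counts = np.zeros((T, M, N))
    for t in range(T):
        mask = day_idx == t
        # histogram2d assigns each (lat, lon) pair to its grid cell
        H, _, _ = np.histogram2d(lats[mask], lons[mask],
                                 bins=[lat_edges, lon_edges])
        counts[t] = H
    return counts
```

Stacking the per-day matrices along the first axis yields the input tensor consumed by the sequence model.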
In summary, the visit frequency prediction problem can be defined as follows: given the historical visit count matrices until time t, denoted as {X_1^(i), …, X_t^(i)}, predict the future visit count matrices X_f^(i) for f = t + 1, …, t + d, where d is the number of forecasting time steps.
The core engine of DST-Predict is a recurrent neural network based forecasting model that can capture the sequential and temporal dependencies in the data and use them for future predictions. A key aspect of the solution is that we treat the visit frequency matrix as an image with M × N pixels. This allows us to utilize a convolutional architecture [42], the state-of-the-art approach for modeling the spatial relationships among image pixels. The images in this context are spatially sparse, as illustrated in Figure 2, which shows the distribution of unique grid cells visited by each individual in the target urban area. The study area was represented by 3200 grid cells, but each participant visited only 15 grid cells on average. On the other hand, we found a strong spatial correlation in the patterns of visited places, as a significant proportion of the grid cells visited by an individual were adjacent to one another, as shown in Figure 3. We handle the sparsity challenge by utilizing a loss function that can account for it.
FIGURE 2.

Distribution of unique grid cells visited by each individual for the target data set. Each grid cell is 2km × 2km and there are 3200 unique cells.
FIGURE 3.

Number of visits for each grid cell (1 km × 1 km) in the target spatial area (left). The urban counties are shown as shaded regions. The target geographical area is shown in the map on the right. 50% of the grid cells were visited at least once during the study period.
IV. METHODS
In this section, we provide a brief overview of the individual components of the proposed model, and present the proposed deep learning based model architecture, DST-Predict in detail.
A. CONVOLUTION LONG SHORT-TERM MEMORY NETWORKS (ConvLSTM)
We use the LSTM network, a widely used recurrent neural network, to solve sequence modeling problems while modeling temporal dependencies in sequence data. To accommodate both the temporal and the spatial dependencies present in the data, Shi et al. [47] proposed the Convolutional LSTM (ConvLSTM), which is similar to the fully connected LSTM (FC-LSTM) but uses convolution operators in the state-to-state and input-to-state transitions. The computations inside a ConvLSTM cell are as follows:
i_t = σ(W_xi ∗ X_t + W_hi ∗ h_{t−1} + W_ci ◦ c_{t−1} + b_i) (1)

f_t = σ(W_xf ∗ X_t + W_hf ∗ h_{t−1} + W_cf ◦ c_{t−1} + b_f) (2)

c_t = f_t ◦ c_{t−1} + i_t ◦ tanh(W_xc ∗ X_t + W_hc ∗ h_{t−1} + b_c) (3)

o_t = σ(W_xo ∗ X_t + W_ho ∗ h_{t−1} + W_co ◦ c_t + b_o) (4)

h_t = o_t ◦ tanh(c_t) (5)
where ∗ denotes the convolution operation and ◦ denotes the Hadamard (elementwise) product. Here, i_t, f_t and o_t are the outputs of the input, forget and output gates, respectively; c_t is the cell state at time step t, while h_t is the hidden state at time step t; σ(·) is the logistic sigmoid function. The W symbols denote the weight matrices, with the subscripts indicating the mapping each weight matrix performs; for example, W_hi is the weight matrix that maps the hidden state to the input gate. b_i, b_f, b_c and b_o are the bias parameters associated with the input gate, forget gate, cell and output gate, respectively. Also note that the input X_t is “flattened” into an (M × N)-length vector, denoted by x_t. We have also dropped the superscript (i) from the notation when referring to the data for individuals as a whole.
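To make the gate computations concrete, the following is a minimal single-channel NumPy sketch of one ConvLSTM step following Eqs. (1)-(5). This is an illustration under simplifying assumptions (one input channel, one filter, scalar peephole weights); the actual model stacks multi-filter Keras ConvLSTM layers:

```python
import numpy as np

def conv2d_same(x, k):
    """'Same'-padded single-channel 2D cross-correlation."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(X, h_prev, c_prev, W, b):
    """One ConvLSTM step per Eqs. (1)-(5); W maps names like 'xi' to
    convolution kernels and 'ci'/'cf'/'co' to (simplified) peephole weights."""
    i = sigmoid(conv2d_same(X, W['xi']) + conv2d_same(h_prev, W['hi'])
                + W['ci'] * c_prev + b['i'])                               # Eq. (1)
    f = sigmoid(conv2d_same(X, W['xf']) + conv2d_same(h_prev, W['hf'])
                + W['cf'] * c_prev + b['f'])                               # Eq. (2)
    c = f * c_prev + i * np.tanh(conv2d_same(X, W['xc'])
                                 + conv2d_same(h_prev, W['hc']) + b['c'])  # Eq. (3)
    o = sigmoid(conv2d_same(X, W['xo']) + conv2d_same(h_prev, W['ho'])
                + W['co'] * c + b['o'])                                    # Eq. (4)
    h = o * np.tanh(c)                                                     # Eq. (5)
    return h, c
```

Because the transitions are convolutions, each cell's state is updated from its spatial neighborhood, which is what lets the model exploit the adjacency of visited grid cells.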
B. PROPOSED MODEL ARCHITECTURE
To account for the historical visit counts at different locations at different time instants, we need a mechanism that handles both the spatial and the temporal aspects of the data. We use ConvLSTM [47] as the basic unit to address this issue effectively. An individual’s future visit to a specific location is likely affected by both the recent and the more distant history of visit patterns. To effectively capture such complex spatio-temporal patterns in the visit counts of each individual, we use two weeks of historical visit count observations for the first component, i.e., p1 = 14, and one week of historical visit count observations for the second component, i.e., q1 = 7. For a better representation of both spatial and temporal dependencies, we propose the multi-component sequence-to-sequence architecture DST-Predict, presented in Figure 4. The architecture consists of the following two components:
- Component 1 uses the past p1 days of visit count data in matrix format: {X_{t−p1+1}, …, X_t}.
- Component 2 uses the past q1 days of visit count data in matrix format: {X_{t−q1+1}, …, X_t}.
FIGURE 4.

Proposed model architecture DST-Predict.
1). COMPONENT 1
For predicting visit counts as a sequence, Component 1 adopts an encoder-decoder architecture, in which the input sequence is processed and encoded into a latent vector of fixed length using one or more neural network layers. We expect this latent vector to provide a summary of the complete input sequence. The latent vector is then passed to the decoder phase, where the decoder uses this vector to start producing the output sequence using one or more neural network layers.
The input for this component first goes to the encoder ConvLSTM block shown in Fig. 5, which consists of three ConvLSTM layers, of which the first two are each followed by a batch normalization (BN) layer, a non-linear LeakyReLU activation layer and a dropout layer. Batch normalization helps reduce internal covariate shift while speeding up the training process, whereas LeakyReLU was employed to avoid the “dying ReLU” problem [53], [54] in training deep neural networks. The dying ReLU problem arises when no gradient flows backwards, so a neuron becomes inactive and outputs 0 for any input. To avoid this issue, we use LeakyReLU activation layers instead of other activations such as tanh or ReLU. Dropout [55] prevents overfitting by providing regularization in neural networks. The third ConvLSTM layer is followed only by a BN layer, after which the output of the encoder is the encoded state vector that is passed to the decoder. To enhance the representational and learning power of the model in extracting high-level features from the inputs, we include a “shortcut” connection [56] that takes the output of the first ConvLSTM layer and adds it to the input of the final ConvLSTM layer. We created these connections to stabilize training when stacking more layers, avoiding the performance degradation that may be caused by vanishing/exploding gradients [57], [58]. The decoder architecture is similar to the encoder, with two differences. First, there is an extra final ConvLSTM layer; second, the shortcut connection adds the output of the second ConvLSTM layer to the input of the final ConvLSTM layer.
It is important to note here that we transfer the last cell state c (also called long-term memory) and the last hidden state h (also called short-term memory) from each of the ConvLSTM layers in the encoder ConvLSTM block to all the ConvLSTM layers except the last one in the decoder block, as shown in Fig. 5.
FIGURE 5.

Component 1 of the proposed model.
2). COMPONENT 2
For Component 2, we modified the overall architecture relative to the first component. Even though the encoder-decoder architecture used in the first component provides relatively satisfactory results, it can suffer when encoding a good summary of very long sequences, because it is restricted to a fixed-length latent vector. To overcome this limitation, we use an attention mechanism [59], which allows the model to account for every position of the input sequence when predicting the output at each timestep, weighting the contribution of the data at each input position to each output.
The attention mechanism works as follows. Suppose there are T_x inputs in the sequence; the annotations (encoder hidden state outputs) are denoted by h_1, …, h_{T_x}. In the simple encoder-decoder model, only the last state of the encoder (h_{T_x}) is used as the context vector and passed to the decoder. In the attention mechanism [59], however, we compute a context vector c_i for each target output y_i. Each context vector c_i is generated as a weighted sum of the annotations:
c_i = Σ_{j=1}^{T_x} α_ij h_j (6)
Here, the weight αij of each annotation hj is computed by a softmax function given by the following equation:
α_ij = exp(e_ij) / Σ_{k=1}^{T_x} exp(e_ik) (7)
where
e_ij = a(s_{i−1}, h_j) (8)
is an alignment model responsible for scoring how well the inputs around position j and the output at position i match. It is important to note that the score depends on the hidden state s_{i−1}, which precedes the output y_i, and on the j-th annotation h_j of the input sequence.
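The context computation in Eqs. (6)-(8) can be sketched in NumPy as follows, using the additive (Bahdanau-style) alignment model of [59]. The projection matrices Wa and Ua and the scoring vector va are illustrative names for the alignment model's parameters:

```python
import numpy as np

def additive_attention(s_prev, H, Wa, Ua, va):
    """Context vector for one decoder step, per Eqs. (6)-(8).

    s_prev : previous decoder state s_{i-1}, shape (d,)
    H      : encoder annotations h_1..h_Tx, shape (Tx, d)
    Wa, Ua : (d_a, d) projections inside the alignment model a(.)
    va     : (d_a,) scoring vector
    """
    # Eq. (8): e_ij = a(s_{i-1}, h_j), additive alignment model
    e = np.tanh(s_prev @ Wa.T + H @ Ua.T) @ va          # shape (Tx,)
    # Eq. (7): softmax over the Tx input positions (shifted for stability)
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()
    # Eq. (6): context vector as the weighted sum of annotations
    c = alpha @ H
    return c, alpha
```

The weights alpha form a probability distribution over input positions, so the decoder can draw on the whole input sequence rather than a single fixed-length summary.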
In terms of the encoder-decoder ConvLSTM block architecture for this component, shown in Fig. 6, the last ConvLSTM layer is followed by a batch normalization, a Leaky ReLU activation and a dropout layer before feeding into the attention layer. Similarly, on the decoder side, there is a ConvLSTM layer at the beginning followed by the same set of layers. The rationale for these extra layers is to better capture high-level spatial features over time, which the attention layer can then exploit to improve the representation of the past week’s input sequence and generate the relevant output sequence for the following week.
FIGURE 6.

Component 2 of the proposed model.
Lastly, we include a final fusion layer that combines the sequence predictions coming from the two components to produce the final output sequence. We compute this output sequence by fusing the sequence outputs of the two components with associated learnable weight parameters as:

Ŷ = W1 ◦ Ŷ(1) + W2 ◦ Ŷ(2)

Here, Ŷ(1) and Ŷ(2) are the predicted sequence outputs of the two components, while W1 and W2 are the trainable weight parameters that indicate the degree of influence each component has on the final sequence prediction.
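The fusion step can be illustrated with a short NumPy example. The weights are fixed here for demonstration, whereas in the model they are learned during training, and the shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, M, N = 7, 4, 4                # a 7-day output sequence on a 4x4 grid
Y1 = rng.random((d, M, N))       # Component 1 predicted sequence
Y2 = rng.random((d, M, N))       # Component 2 predicted sequence
W1 = np.full((M, N), 0.6)        # learnable in the model; fixed here
W2 = np.full((M, N), 0.4)
Y = W1 * Y1 + W2 * Y2            # per-cell (Hadamard) weighted fusion
```

Because the weights are applied per grid cell, the model can learn, for each location, which component's prediction to trust more.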
V. DATA AND EXPERIMENTAL SET-UP
A. DATA
The data used in the experiments was collected as part of a larger project. A total of 1,464 participants, all Apple iPhone users, were recruited from 1 December 2016 to 31 May 2017. The study area encompasses the Buffalo-Niagara region within Erie and Niagara counties of western New York, US. During the study, participants’ locations were collected using their own mobile phones and an application developed by our research team. The data were collected with careful consideration of the privacy of each study participant. The data set primarily comprises the following information:
Demographics: participants’ personal information such as gender, age group, home and work address, and employment status. In this study, we only use the employment status as an individual-specific feature. In this data set, approximately 17% of the individuals have non-working status.
Global Positioning System (GPS) data: the movement locations of participants, collected at approximately 35-minute intervals using the application installed on their mobile phones. The data were collected over a period of 32 weeks in 2016–2017.
B. EXPERIMENTAL SET-UP
The raw data contained missing values that needed to be handled before the training phase, including missing days in a person’s sequence. We imputed the missing values with the mean value across the corresponding day of the week. The observed visit counts at each location were scaled to the range [0, 1]; for evaluation against the ground truth, the predicted values are re-scaled back to the original range. The experiments were conducted on a computing cluster available through the Center for Computational Research (CCR) at the University at Buffalo. The nodes were equipped with NVIDIA Tesla V100 GPUs with 16GB memory. We used the Keras library [60] with the TensorFlow library [61] as the backend.
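The imputation and scaling steps can be sketched as follows. Representing missing days as NaN and aligning the sequence to day-of-week slots via the day index modulo 7 are assumptions made for illustration:

```python
import numpy as np

def impute_and_scale(counts):
    """Day-of-week mean imputation followed by [0, 1] min-max scaling.

    counts : (T, M, N) array of daily visit count matrices, with np.nan
             marking missing days (an assumed encoding).
    Returns the scaled array and the max value, kept so predictions can
    later be re-scaled back to the original range.
    """
    counts = counts.copy()
    for dow in range(7):                      # one slot per day of the week
        days = np.arange(counts.shape[0]) % 7 == dow
        dow_mean = np.nanmean(counts[days], axis=0)
        block = counts[days]
        missing = np.isnan(block)
        block[missing] = np.broadcast_to(dow_mean, block.shape)[missing]
        counts[days] = block
    vmax = counts.max()
    return (counts / vmax if vmax > 0 else counts), vmax
```

At evaluation time, multiplying the model outputs by the stored `vmax` recovers counts in the original range.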
1). MODEL TRAINING
Each of the 1,464 participants has GPS data records over a maximum period of 221 days (approx. 32 weeks), although some participants had fewer than 221 days. On average, participants’ GPS data were available on 179 days, with a minimum of 53 days. Since only 17% of the 1,464 participants had non-working status, we selected 485 of the 1,464 participants across the 32 weeks such that the selected participants showed a well-balanced distribution of working and non-working status: participants were selected alternately, so that every other selected participant had non-working status. We trained our model on 80% of the data for each selected individual and validated it on the remaining 20%.
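The alternating balanced selection described above can be sketched as follows; the function and variable names are illustrative, not from the original pipeline:

```python
import numpy as np

def balanced_alternating_selection(ids, non_working, n_select):
    """Pick participants so that every alternate selected participant
    has non-working status, for as long as that pool lasts."""
    nw = list(ids[non_working])    # non-working participants
    w = list(ids[~non_working])    # working participants
    selected = []
    while len(selected) < n_select:
        # even positions draw from the non-working pool when possible
        pool = nw if (len(selected) % 2 == 0 and nw) else (w or nw)
        if not pool:
            break
        selected.append(pool.pop(0))
    return selected
```

With the study's numbers (about 249 non-working participants out of 1,464), selecting 485 participants this way yields a roughly even working/non-working split.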
2). CHOOSING HYPERPARAMETERS
In the encoder ConvLSTM block of Component 1, the first two ConvLSTM layers have 40 filters each, while the third layer has 1 filter. The first ConvLSTM layer in the decoder block has 1 filter, the next two ConvLSTM layers have 40 filters each, and the final ConvLSTM layer has 1 filter. For Component 2, all ConvLSTM layers have 40 filters on both the encoder and the decoder, with the final ConvLSTM layer in the decoder having 1 filter. Each filter is of size 3 × 3, extracting the relevant spatial features from both the input and the output of the previous timesteps. Between ConvLSTM layers we employed a batch normalization layer followed by Leaky ReLU and dropout layers; the dropout rate is set to 0.25. We trained our model on the training data with a batch size of 16 for 300 epochs, using the Adam [62] optimizer with a learning rate of 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e−07 and a clip value of 1.0. We also used model checkpointing to save only the best weights during training.
C. EVALUATION METRICS
To evaluate the predictive power of the proposed model to correctly identify the visit locations for a given individual, we need evaluation metrics that can measure the following two aspects:
Recall - What fraction of actual visits were correctly predicted by the model?
Precision - What fraction of the predicted visits corresponded to the actual visits made by the individual?
Mathematically, the two quantities can be calculated as follows. Consider an (M × N) test image matrix, X, at a given spatial scale, and let X̂ be the corresponding prediction matrix obtained from the model. Note that we have dropped the time subscript, t, for clarity. The recall and precision are defined as:
recall = Σ_{m,n} min(X_{mn}, X̂_{mn}) / Σ_{m,n} X_{mn} (9)

precision = Σ_{m,n} min(X_{mn}, X̂_{mn}) / Σ_{m,n} X̂_{mn} (10)
Note that, for both recall and precision, the numerator is the same: it counts the overlap between the true and predicted visit counts for each grid cell. In this study, we report the average recall and precision over all daily visit count matrices in the test data set.
An issue with the recall and precision metrics, as defined in (9) and (10), is that they are dependent on a spatial scale (i.e. resolution) at which the matrices are created. Clearly, the task of predicting visit counts at a coarser resolution is easier than predicting visit counts at a finer resolution, and the expected recall and precision values at a coarser resolution are higher than at finer resolution. Consequently, the results obtained at different scales are incomparable. This is a clear shortcoming in the present context, since we are interested in understanding the performance of the proposed model as a function of the spatial scale. To address this issue, we propose scale-invariant versions of the above defined recall and precision metrics.
We first calculate the recall and precision of a naive predictor, which simply distributes the total visits in X uniformly across all the grid cells. The output of the naive predictor, denoted as X̄, is calculated as:
X̄_{mn} = (1 / (M · N)) Σ_{m′=1}^{M} Σ_{n′=1}^{N} X_{m′n′}, for all m, n (11)
The base recall and precision for this naive predictor are defined as:
base_recall = Σ_{m,n} min(X_{mn}, X̄_{mn}) / Σ_{m,n} X_{mn} (12)

base_precision = Σ_{m,n} min(X_{mn}, X̄_{mn}) / Σ_{m,n} X̄_{mn} (13)
One can verify that the values of the base_recall and base_precision metrics tend to increase as the spatial scale becomes coarser, because the probability of the naive predictor placing a randomly assigned visit in the correct grid cell is 1/(MN), which increases as the scale becomes coarser, i.e., as M and N become smaller. We use the performance of the naive predictor to “normalize” the recall and precision of the proposed model as follows:
norm_recall = (recall − base_recall) / (1 − base_recall) (14)

norm_precision = (precision − base_precision) / (1 − base_precision) (15)
Both the normalized recall and precision values are reported when comparing the performance of the proposed model across different scales.
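A NumPy sketch of these metrics is given below. Interpreting the "overlap" numerator as the cellwise minimum, and the normalization as the skill-score form (metric − base) / (1 − base), are our readings of Eqs. (9)-(15) rather than verbatim reproductions:

```python
import numpy as np

def overlap(A, B):
    """Cellwise overlap between two visit count matrices."""
    return np.minimum(A, B).sum()

def recall_precision(X, X_hat):
    """Overlap-based recall and precision, per Eqs. (9)-(10)."""
    ov = overlap(X, X_hat)
    return ov / X.sum(), ov / X_hat.sum()

def normalized_recall_precision(X, X_hat):
    """Scale-invariant metrics: normalize against the naive predictor
    that spreads X's total count uniformly over the grid (Eq. (11))."""
    M, N = X.shape
    X_bar = np.full((M, N), X.sum() / (M * N))        # naive predictor
    rec, prec = recall_precision(X, X_hat)
    base_rec, base_prec = recall_precision(X, X_bar)  # Eqs. (12)-(13)
    return ((rec - base_rec) / (1 - base_rec),
            (prec - base_prec) / (1 - base_prec))
```

Under this form, a perfect prediction scores 1 and the naive predictor itself scores 0 at every spatial scale, which makes results comparable across scales.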
1). LOSS FUNCTION
The loss function for training the model is composed of the mean squared error (MSE), the square of the mean absolute percentage error (MAPE), and the structural dissimilarity (DSSIM). For a single training vector Y (a “flattened” version of the input I × J matrix, where N = I × J) and the corresponding prediction Ŷ, the loss is defined as:
L(θ) = λ1 · MSE(Y, Ŷ) + λ2 · MAPE(Y, Ŷ)² + λ3 · DSSIM(Y, Ŷ) (16)
Here, θ denotes all the parameters that need to be learned in the network. For training on the given data, we chose λ1 = 10 and λ2 = λ3 = 1 as the hyperparameters of the loss function. These values worked well for the given data; however, one can further experiment with them to obtain improved performance on different related problems.
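The combined loss can be sketched as follows. Two simplifications are assumed here: the DSSIM term uses a single global SSIM window rather than the sliding-window form common in image losses, and a small eps is added to the MAPE denominator to guard against zero counts:

```python
import numpy as np

def dssim(y, y_hat, c1=1e-4, c2=9e-4):
    """Global (single-window) structural dissimilarity: (1 - SSIM) / 2."""
    mu_y, mu_p = y.mean(), y_hat.mean()
    var_y, var_p = y.var(), y_hat.var()
    cov = ((y - mu_y) * (y_hat - mu_p)).mean()
    ssim = ((2 * mu_y * mu_p + c1) * (2 * cov + c2)) / (
        (mu_y**2 + mu_p**2 + c1) * (var_y + var_p + c2))
    return (1 - ssim) / 2

def combined_loss(y, y_hat, lam1=10.0, lam2=1.0, lam3=1.0, eps=1e-7):
    """Weighted sum lam1*MSE + lam2*MAPE^2 + lam3*DSSIM."""
    mse = np.mean((y - y_hat) ** 2)
    mape = np.mean(np.abs((y - y_hat) / (y + eps)))
    return lam1 * mse + lam2 * mape**2 + lam3 * dssim(y, y_hat)
```

The MSE term penalizes large count errors, the squared MAPE term emphasizes relative errors at cells with small counts, and the DSSIM term rewards predictions whose spatial structure matches the target, which is what makes the loss tolerant of the sparsity noted earlier.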
VI. RESULTS
We summarize the overall performance evaluation of the proposed system and discuss the effect of different spatial grid sizes and forecasting horizons on the performance. Capturing the visit counts of participants during weekends may be more difficult for the model than during weekdays, since there are fewer weekend days than weekdays in the data. This motivates us to present and discuss the model’s performance separately for weekdays and weekends.
We then discuss the model’s performance with respect to the type of region in the study area, i.e., rural versus urban. Since most participants tend to move around more in urban than in rural areas, and since the areas covered by urban and rural regions are uneven, it is important to check the consistency of the model’s performance with respect to region type.
Lastly, we present comparative results of the proposed model against state-of-the-art and competitive baseline approaches. Here and in what follows, we use the normalized recall and precision metrics defined in Section V-C as evaluation metrics, referred to simply as recall and precision.
A. IMPACT OF SPATIAL GRID SIZES AND FORECASTING HORIZON DURING WEEKDAYS AND WEEKENDS
In this section, we present a quantitative evaluation of the model for different spatial grid sizes – 2 × 2, 3 × 3, 4 × 4 and 5 × 5 km – and forecasting horizons. We tested the proposed model on the 20% hold-out validation data for each of the 485 participants. We also evaluate the forecasting results with respect to the day of the week, i.e., weekdays versus weekends. Table 1 provides a tabulated summary of results for different forecasting horizons and spatial grid sizes, including the performance of the model during weekdays and weekends.
TABLE 1.
Evaluation of prediction for different forecasting horizons (indicated as f) on different grid sizes. Each value represents the mean ± standard deviation.
| Grid size | Metric | f = 1 | f = 2 | f = 3 | f = 4 | f = 5 | f = 6 | f = 7 | |
|---|---|---|---|---|---|---|---|---|---|
| 2×2 | Overall | norm_Precision | 0.20 ± 0.08 | 0.30 ± 0.09 | 0.35 ± 0.10 | 0.35 ± 0.09 | 0.31 ± 0.09 | 0.32 ± 0.10 | 0.34 ± 0.11 |
| norm_Recall | 0.21 ± 0.10 | 0.25 ± 0.10 | 0.27 ± 0.11 | 0.25 ± 0.10 | 0.26 ± 0.10 | 0.22 ± 0.09 | 0.22 ± 0.10 | ||
| Weekdays | norm_Precision | 0.21 ± 0.08 | 0.31 ± 0.10 | 0.37 ± 0.10 | 0.36 ± 0.10 | 0.32 ± 0.09 | 0.33 ± 0.11 | 0.36 ± 0.11 | |
| norm_Recall | 0.21 ± 0.11 | 0.26 ± 0.10 | 0.27 ± 0.11 | 0.26 ± 0.11 | 0.26 ± 0.10 | 0.23 ± 0.10 | 0.22 ± 0.11 | ||
| Weekends | norm_Precision | 0.17 ± 0.10 | 0.26 ± 0.10 | 0.32 ± 0.12 | 0.31 ± 0.13 | 0.27 ± 0.11 | 0.29 ± 0.12 | 0.30 ± 0.12 | |
| norm_Recall | 0.20 ± 0.12 | 0.23 ± 0.11 | 0.25 ± 0.13 | 0.23 ± 0.12 | 0.24 ± 0.12 | 0.20 ± 0.11 | 0.20 ± 0.12 | ||
| 3×3 | Overall | norm_Precision | 0.53 ± 0.18 | 0.57 ± 0.17 | 0.58 ± 0.19 | 0.57 ± 0.20 | 0.56 ± 0.21 | 0.55 ± 0.21 | 0.53 ± 0.21 |
| norm_Recall | 0.22 ± 0.13 | 0.20 ± 0.10 | 0.17 ± 0.13 | 0.16 ± 0.13 | 0.17 ± 0.14 | 0.18 ± 0.15 | 0.19 ± 0.15 | ||
| Weekdays | norm_Precision | 0.56 ± 0.19 | 0.59 ± 0.18 | 0.60 ± 0.20 | 0.59 ± 0.21 | 0.57 ± 0.22 | 0.57 ± 0.21 | 0.57 ± 0.21 | |
| norm_Recall | 0.22 ± 0.14 | 0.20 ± 0.14 | 0.18 ± 0.14 | 0.16 ± 0.14 | 0.17 ± 0.14 | 0.19 ± 0.15 | 0.19 ± 0.16 | ||
| Weekends | norm_Precision | 0.48 ± 0.21 | 0.51 ± 0.21 | 0.52 ± 0.23 | 0.52 ± 0.25 | 0.50 ± 0.25 | 0.49 ± 0.24 | 0.47 ± 0.25 | |
| norm_Recall | 0.20 ± 0.14 | 0.18 ± 0.14 | 0.16 ± 0.14 | 0.15 ± 0.15 | 0.15 ± 0.14 | 0.17 ± 0.15 | 0.17 ± 0.14 | ||
| 4×4 | Overall | norm_Precision | 0.31 ± 0.14 | 0.42 ± 0.13 | 0.42 ± 0.11 | 0.45 ± 0.10 | 0.47 ± 0.11 | 0.49 ± 0.13 | 0.49 ± 0.12 |
| norm_Recall | 0.26 ± 0.13 | 0.28 ± 0.12 | 0.35 ± 0.13 | 0.35 ± 0.12 | 0.32 ± 0.12 | 0.28 ± 0.12 | 0.30 ± 0.11 | ||
| Weekdays | norm_Precision | 0.33 ± 0.14 | 0.43 ± 0.13 | 0.44 ± 0.11 | 0.47 ± 0.11 | 0.49 ± 0.12 | 0.50 ± 0.14 | 0.51 ± 0.13 | |
| norm_Recall | 0.27 ± 0.13 | 0.29 ± 0.13 | 0.36 ± 0.13 | 0.36 ± 0.14 | 0.33 ± 0.12 | 0.29 ± 0.13 | 0.31 ± 0.12 | ||
| Weekends | norm_Precision | 0.28 ± 0.15 | 0.37 ± 0.15 | 0.38 ± 0.13 | 0.41 ± 0.14 | 0.43 ± 0.14 | 0.45 ± 0.16 | 0.44 ± 0.15 | |
| norm_Recall | 0.24 ± 0.15 | 0.27 ± 0.14 | 0.33 ± 0.15 | 0.32 ± 0.14 | 0.30 ± 0.14 | 0.26 ± 0.13 | 0.28 ± 0.13 | ||
| 5×5 | Overall | norm_Precision | 0.35 ± 0.15 | 0.45 ± 0.14 | 0.54 ± 0.12 | 0.57 ± 0.13 | 0.59 ± 0.12 | 0.58 ± 0.13 | 0.56 ± 0.15 |
| norm_Recall | 0.25 ± 0.13 | 0.31 ± 0.13 | 0.30 ± 0.12 | 0.30 ± 0.12 | 0.29 ± 0.12 | 0.26 ± 0.12 | 0.22 ± 0.11 | ||
| Weekdays | norm_Precision | 0.37 ± 0.15 | 0.47 ± 0.15 | 0.56 ± 0.12 | 0.58 ± 0.13 | 0.61 ± 0.12 | 0.59 ± 0.14 | 0.58 ± 0.16 | |
| norm_Recall | 0.26 ± 0.14 | 0.31 ± 0.14 | 0.30 ± 0.12 | 0.31 ± 0.13 | 0.29 ± 0.13 | 0.26 ± 0.13 | 0.23 ± 0.12 | ||
| Weekends | norm_Precision | 0.32 ± 0.17 | 0.40 ± 0.16 | 0.49 ± 0.16 | 0.51 ± 0.16 | 0.54 ± 0.16 | 0.54 ± 0.18 | 0.52 ± 0.19 | |
| norm_Recall | 0.23 ± 0.15 | 0.29 ± 0.16 | 0.28 ± 0.14 | 0.28 ± 0.14 | 0.27 ± 0.14 | 0.25 ± 0.14 | 0.21 ± 0.13 |
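The exact definitions of the normalized precision and recall metrics are not reproduced in this section; a minimal sketch of one plausible count-based interpretation (with the cell-wise minimum of predicted and actual visit counts taken as the overlap, and the function name chosen here for illustration) is:

```python
import numpy as np

def norm_precision_recall(pred, actual):
    """Count-based precision/recall over a visit-frequency grid.

    `pred` and `actual` are 2-D arrays of per-cell visit counts for one
    user and one timestep. The overlap is the cell-wise minimum of the
    two grids; precision normalizes it by the total predicted visits,
    recall by the total actual visits.
    """
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    overlap = np.minimum(pred, actual).sum()
    precision = overlap / pred.sum() if pred.sum() > 0 else 0.0
    recall = overlap / actual.sum() if actual.sum() > 0 else 0.0
    return precision, recall

# A perfect prediction yields precision = recall = 1.
p, r = norm_precision_recall([[2, 0], [1, 3]], [[2, 0], [1, 3]])
```

Under this reading, both metrics lie in [0, 1] regardless of the grid size, which is consistent with comparing performance across spatial scales.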
A graphical comparison of the model performance at different forecast horizons is shown in Figure 8.
FIGURE 8. Evaluation of the model prediction for different forecast horizons (left: Normalized Recall; right: Normalized Precision; higher values indicate better performance for both metrics). Results are shown separately for different grid sizes.
As shown in Figure 8, the recall performance of the model is stable as the forecast horizon increases from 1 day to 7 days. Recall is best for the 4 × 4 grid and worse at both coarser and finer scales. For the precision metric, however, model performance improves as the scale becomes coarser and is best for the 5 × 5 grid. Moreover, for the finer-scale grids (2 × 2 and 4 × 4), precision improves as the forecasting horizon increases. Interestingly, for the 3 × 3 spatial grid size, precision remains stable while recall generally remains low.
We also notice how performance changes as the forecast horizon grows. For the 2 × 2 grid, there is a clear increase in all the evaluation measures as the forecasting horizon increases. For the other spatial grid sizes (3 × 3, 4 × 4, and 5 × 5), performance increases until the third forecasting timestep, after which a slight decrease can be clearly seen. The latter is an expected result, as the predictive power of a model generally decreases as the forecasting horizon is extended.
A comparison of the model performance for seven-step-ahead prediction between weekdays and weekends across different spatial grids is shown in Figure 7. Across the different spatial grid sizes, the model clearly performs better in forecasting visit counts on weekdays than on weekends. Moreover, precision and recall increase as the grid size (spatial scale) goes from finer to coarser, with the exception of recall for the 3 × 3 grid, which is lower than for any other spatial grid size.
FIGURE 7. Evaluation of the model prediction for different grid sizes for f = 7 (left: Normalized Recall; right: Normalized Precision; higher values indicate better performance for both metrics). Results are shown separately for Weekdays and Weekends.
B. PERFORMANCE EVALUATION FOR URBAN AND RURAL AREAS
In this section, we evaluate the performance of our model for urban and rural areas. Figure 9 shows the performance of the proposed model in urban and rural areas for a forecasting horizon of 7 days, with respect to different spatial grid sizes. We found a similar trend in both settings, although prediction performance was significantly better in urban areas. This difference might be attributed to the fact that only a small number of observations were available for rural areas (see Figure 3); it also suggests that the movement of individuals in urban areas is more predictable than in rural areas.
FIGURE 9. Comparison of the model performance on urban and rural areas, for different grid sizes for f = 7 (left: Normalized Recall; right: Normalized Precision; higher values indicate better performance for both metrics).
C. COMPARISON WITH OTHER COMPETITIVE AND BASELINE APPROACHES
In Table 2, the performance of our proposed model is compared with that of other approaches for the 5 × 5 spatial grid size at f = 7. We used the data for 10 participants for training and testing; for each participant, 80% of the data was used for training and the remaining 20% for validation. The methods used for comparison are:
ARIMA – The autoregressive integrated moving average (ARIMA) model, also known as the Box–Jenkins model, is a popular model for time-series forecasting. It uses the historical time series to predict future values in the series.
STResNet [35] – State-of-the-art approach that makes use of convolutional layers and residual networks for spatio-temporal prediction.
Res-ConvLSTM [36] – STResNet based variant that makes use of ConvLSTM layers for spatio-temporal predictions.
DST-Predict-Ext – Here, we incorporate external metadata, such as weekday/weekend and employment status, into our model. This variant checks whether prediction results improve when the external features are fused into the model right after the fusion layer in a sequential manner.
DST-Predict-without-Attention – Here, we switch off the attention mechanism to check the performance of the model.
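The ARIMA baseline above is typically fit with a forecasting library such as statsmodels; as a dependency-free illustration of the underlying idea, the sketch below fits an AR(1) model (a special case, ARIMA(1, 0, 0)) to one grid cell's daily visit counts by least squares and rolls it forward. The function name and history values are illustrative, not taken from the paper:

```python
import numpy as np

def ar1_forecast(series, steps):
    """Fit x_t = c + phi * x_{t-1} by least squares, then roll the
    recurrence forward `steps` timesteps from the last observation."""
    x = np.asarray(series, dtype=float)
    # Design matrix: intercept column plus the lag-1 values.
    X = np.column_stack([np.ones(len(x) - 1), x[:-1]])
    c, phi = np.linalg.lstsq(X, x[1:], rcond=None)[0]
    preds = []
    last = x[-1]
    for _ in range(steps):
        last = c + phi * last
        preds.append(last)
    return np.array(preds)

# Forecast the next 7 daily visit counts for a single grid cell.
history = [3, 4, 3, 5, 4, 4, 5, 3, 4, 5, 4, 4]
week_ahead = ar1_forecast(history, steps=7)
```

A full ARIMA baseline would additionally difference the series and add moving-average terms, but the per-cell, univariate character of the baseline is the same.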
TABLE 2.
Performance evaluation of our proposed model (in bold) in comparison with other approaches. Each value represents mean ± stddev. This comparison uses the GPS data of 10 users for the 5 × 5 grid size at f = 7.
The results clearly indicate that our proposed model outperforms the state-of-the-art and competitive baseline approaches in terms of normalized recall and normalized precision.
VII. CONCLUSION
Our specific contributions are as follows. First, we propose DST-Predict, a system that uses a sequence-to-sequence deep learning approach to predict the visit frequency of an individual based on historical GPS location data.
Second, we propose a customized loss function that combines variably weighted mean squared error (MSE), mean absolute percentage error (MAPE), and structural similarity (SSIM) terms. Third, we propose scale-invariant evaluation metrics that effectively compare the performance of the model across different spatial grid sizes. Lastly, experimental results on real GPS traces from over 485 individuals, collected over a period of 32 weeks in the western New York area of the United States, indicate that the proposed system effectively forecasts visit counts over future forecasting horizons.
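The customized loss combining MSE, MAPE, and SSIM is described only at a high level here; a minimal NumPy sketch of one such weighted combination is shown below. The global single-window SSIM (with dynamic range assumed to be 1), the epsilon guard for MAPE, and the default weights are all assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def combined_loss(y_true, y_pred, w_mse=1.0, w_mape=1.0, w_ssim=1.0, eps=1e-6):
    """Weighted sum of MSE, MAPE, and an SSIM-dissimilarity term.

    SSIM is computed in its global (single-window) form with the
    standard stabilizing constants; 1 - SSIM is used so that a perfect
    match contributes 0 to the loss.
    """
    t = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    mse = np.mean((t - p) ** 2)
    mape = np.mean(np.abs(t - p) / (np.abs(t) + eps))
    c1, c2 = 0.01 ** 2, 0.03 ** 2  # assumes dynamic range L = 1
    mu_t, mu_p = t.mean(), p.mean()
    var_t, var_p = t.var(), p.var()
    cov = ((t - mu_t) * (p - mu_p)).mean()
    ssim = ((2 * mu_t * mu_p + c1) * (2 * cov + c2)) / (
        (mu_t ** 2 + mu_p ** 2 + c1) * (var_t + var_p + c2))
    return w_mse * mse + w_mape * mape + w_ssim * (1.0 - ssim)
```

The SSIM term penalizes structural differences between the predicted and actual visit-frequency grids that per-cell errors like MSE do not capture, which is the usual motivation for such a combination.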
One motivation for this system was to test how accurately we can predict an individual’s future mobility behavior based on past data, and our experimental results show that the model can indeed produce highly accurate predictions at different spatial scales. This task is challenging, and the proposed deep learning architecture handles the various modeling challenges through specific customizations, including the use of a residual block and a specialized loss function. Given the need for accurate mobility predictions for a variety of important applications, including understanding the impact of mobility on the spread of infectious diseases and the privacy implications of mobile tracking, the DST-Predict system can provide a vital predictive capability. One shortcoming of DST-Predict is the lack of geographic awareness in the model training. Each visit frequency matrix is treated as an image, so geographic information, such as the presence of water bodies and other hazards, is lost when making the predictions. In the future, we plan to develop customized loss functions that can explicitly incorporate such constraints into the model.
ACKNOWLEDGMENT
Computing facilities were provided by the University at Buffalo’s Center for Computational Research.
This work was supported in part by the National Science Foundation, Office of Advanced Cyberinfrastructure (OAC) under Grant 1910539, and in part by the National Institutes of Health under Grant R01GM108731.
This work involved human subjects or animals in its research. The authors confirm that all human/animal subject research procedures and protocols are exempt from review board approval.
Biography

SYED MOHAMMED ARSHAD ZAIDI received the B.Tech. degree from the Institute of Engineering and Technology, Lucknow, India, and the M.Sc. degree in computer science from the University of St Andrews, St Andrews, U.K. He is currently pursuing the Ph.D. degree under supervision of Dr. Varun Chandola with the Department of Computer Science and Engineering, University at Buffalo. His research interests include machine learning and data mining. The applications of his research works include temporal modeling, spatio-temporal modeling, and uncertainty quantification.

VARUN CHANDOLA received the Ph.D. degree in computer science and engineering from the University of Minnesota. He is currently an Associate Professor with the Center for Computational and Data-Enabled Science and Engineering (CDSE), Computer Science Department, University at Buffalo (UB). Before joining UB, he was a Scientist with the Oak Ridge National Laboratory, Computational Sciences and Engineering Division. His research interests include application of data mining and machine learning to problems involving big and complex data, focusing on anomaly detection from big and complex data.

EUN-HYE YOO received the B.A. and M.A. degrees in geography from Seoul National University, Seoul, South Korea, and the Ph.D. degree in geography from the University of California at Santa Barbara. She is currently an Associate Professor with the Department of Geography, University at Buffalo. Her research interests include exploring spatial and temporal scale issues and quantifying spatial uncertainties in various fields, including population density, human spatial behavior, air pollution, human mobility, and environmental impact on human health.
REFERENCES
- [1] Alberdi A, Weakley A, Schmitter-Edgecombe M, Cook DJ, Aztiria A, Basarab A, and Barrenechea M, “Smart home-based prediction of multidomain symptoms related to Alzheimer’s disease,” IEEE J. Biomed. Health Inform., vol. 22, no. 6, pp. 1720–1731, Nov. 2018.
- [2] Barros JM, Duggan J, and Rebholz-Schuhmann D, “Disease mentions in airport and hospital geolocations expose dominance of news events for disease concerns,” J. Biomed. Semantics, vol. 9, no. 1, p. 18, Dec. 2018.
- [3] Gray K, Smolyak D, Badirli S, and Mohler G, “Coupled IGMM-GANs for deep multimodal anomaly detection in human mobility data,” 2018, arXiv:1809.02728.
- [4] Xia F, Wang J, Kong X, Wang Z, Li J, and Liu C, “Exploring human mobility patterns in urban scenarios: A trajectory data perspective,” IEEE Commun. Mag., vol. 56, no. 3, pp. 142–149, Mar. 2018.
- [5] Huang Z, Ling X, Wang P, Zhang F, Mao Y, and Lin T, “Modeling real-time human mobility based on mobile phone and transportation data fusion,” Transp. Res. C, Emerg. Technol., vol. 96, pp. 251–269, Nov. 2018.
- [6] Song C, Koren T, Wang P, and Barabási A-L, “Modelling the scaling properties of human mobility,” Nature Phys., vol. 6, no. 10, p. 818, 2010.
- [7] Cuttone A, Lehmann S, and González MC, “Understanding predictability and exploration in human mobility,” EPJ Data Sci., vol. 7, no. 1, p. 2, Dec. 2018.
- [8] Barbosa H, Barthelemy M, Ghoshal G, James CR, Lenormand M, Louail T, Menezes R, Ramasco JJ, Simini F, and Tomasini M, “Human mobility: Models and applications,” Phys. Rep., vol. 734, pp. 1–74, Mar. 2018.
- [9] Jiang R, Song X, Huang D, Song X, Xia T, Cai Z, Wang Z, Kim K-S, and Shibasaki R, “DeepUrbanEvent: A system for predicting citywide crowd dynamics at big events,” in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2019, pp. 2114–2122.
- [10] Palos-Sanchez P, Saura JR, Reyes-Menendez A, and Esquivel IV, “Users acceptance of location-based marketing apps in tourism sector: An exploratory analysis,” J. Spatial Org. Dyn., vol. 6, no. 3, pp. 258–270, 2018.
- [11] Kraemer MUG, Yang C-H, Gutierrez B, Wu C-H, Klein B, Pigott DM, Plessis L, Faria NR, Li R, Hanage WP, Brownstein JS, and Layan M, “The effect of human mobility and control measures on the COVID-19 epidemic in China,” Science, vol. 368, no. 6490, pp. 493–497, May 2020.
- [12] Kazagli E, Chen J, and Bierlaire M, “Individual mobility analysis using smartphone data,” in Intelligent Transportation and Planning: Breakthroughs in Research and Practice. Hershey, PA, USA: IGI Global, 2018, pp. 332–354.
- [13] Williams NE, Thomas TA, Dunbar M, Eagle N, and Dobra A, “Measures of human mobility using mobile phone records enhanced with GIS data,” PLoS ONE, vol. 10, no. 7, Jul. 2015, Art. no. e0133630, doi: 10.1371/journal.pone.0133630.
- [14] Sloan L and Morgan J, “Who tweets with their location? Understanding the relationship between demographic characteristics and the use of geoservices and geotagging on Twitter,” PLoS ONE, vol. 10, no. 11, Nov. 2015, Art. no. e0142209, doi: 10.1371/journal.pone.0142209.
- [15] Hadavi S, Rai HB, Verlinde S, Huang H, Macharis C, and Guns T, “Analyzing passenger and freight vehicle movements from automatic-number plate recognition camera data,” Eur. Transp. Res. Rev., vol. 12, no. 1, pp. 1–17, Dec. 2020.
- [16] Ariannezhad A and Wu Y-J, “Large-scale loop detector troubleshooting using clustering and association rule mining,” J. Transp. Eng., A, Syst., vol. 146, no. 7, Jul. 2020, Art. no. 04020064.
- [17] Rizvi SMA, Ahmed A, and Shen Y, “Real-time incident detection and capacity estimation using loop detector data,” J. Adv. Transp., vol. 2020, pp. 1–14, Oct. 2020.
- [18] Yu Q, Zhang H, Li W, Song X, Yang D, and Shibasaki R, “Mobile phone GPS data in urban customized bus: Dynamic line design and emission reduction potentials analysis,” J. Cleaner Prod., vol. 272, Nov. 2020, Art. no. 122471.
- [19] Allahbakhshi H, Conrow L, Naimi B, and Weibel R, “Using accelerometer and GPS data for real-life physical activity type detection,” Sensors, vol. 20, no. 3, p. 588, Jan. 2020.
- [20] Atluri G, Karpatne A, and Kumar V, “Spatio-temporal data mining: A survey of problems and methods,” ACM Comput. Surv., vol. 51, no. 4, pp. 1–41, Sep. 2018.
- [21] Wang S, Cao J, and Yu P, “Deep learning for spatio-temporal data mining: A survey,” IEEE Trans. Knowl. Data Eng., early access, Sep. 22, 2020, doi: 10.1109/TKDE.2020.3025580.
- [22] Du S, Li T, Gong X, and Horng S-J, “A hybrid method for traffic flow forecasting using multimodal deep learning,” 2018, arXiv:1803.02099.
- [23] Huang W, Song G, Hong H, and Xie K, “Deep architecture for traffic flow prediction: Deep belief networks with multitask learning,” IEEE Trans. Intell. Transp. Syst., vol. 15, no. 5, pp. 2191–2201, Oct. 2014.
- [24] Liao B, Zhang J, Wu C, McIlwraith D, Chen T, Yang S, Guo Y, and Wu F, “Deep sequence learning with auxiliary information for traffic prediction,” in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2018, pp. 537–546.
- [25] Ren H, Song Y, Wang J, Hu Y, and Lei J, “A deep learning approach to the citywide traffic accident risk prediction,” in Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), 2018, pp. 3346–3351.
- [26] Zhang Z, He Q, Gao J, and Ni M, “A deep learning approach for detecting traffic accidents from social media data,” Transp. Res. C, Emerg. Technol., vol. 86, pp. 580–596, Jan. 2018.
- [27] Zhu L, Guo F, Krishnan R, and Polak JW, “A deep learning approach for traffic incident detection in urban networks,” in Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), 2018, pp. 1011–1016.
- [28] Duan L, Hu T, Cheng E, Zhu J, and Gao C, “Deep convolutional neural networks for spatiotemporal crime prediction,” in Proc. Int. Conf. Inf. Knowl. Eng. (IKE), 2017, pp. 61–67.
- [29] Wang B, Zhang D, Zhang D, Jeffery Brantingham P, and Bertozzi AL, “Deep learning for real time crime forecasting,” 2017, arXiv:1707.03340.
- [30] Ouyang X, Zhang C, Zhou P, Jiang H, and Gong S, “DeepSpace: An online deep learning framework for mobile big data to understand human mobility patterns,” 2016, arXiv:1610.07009.
- [31] Yao D, Zhang C, Zhu Z, Hu Q, Wang Z, Huang J, and Bi J, “Learning deep representation for trajectory clustering,” Expert Syst., vol. 35, no. 2, Apr. 2018, Art. no. e12252.
- [32] Jiang R, Song X, Fan Z, Xia T, Chen Q, Chen Q, and Shibasaki R, “Deep ROI-based modeling for urban human mobility prediction,” Interact., Mobile, Wearable Ubiquitous Technol., vol. 2, no. 1, pp. 1–29, 2018.
- [33] Xiong F, Shi X, and Yeung D-Y, “Spatiotemporal modeling for crowd counting in videos,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5151–5159.
- [34] Yao H, Wu F, Ke J, Tang X, Jia Y, Lu S, Gong P, Ye J, and Li Z, “Deep multi-view spatial-temporal network for taxi demand prediction,” in Proc. AAAI Conf. Artif. Intell., 2018, pp. 2588–2595.
- [35] Zhang J, Zheng Y, and Qi D, “Deep spatio-temporal residual networks for citywide crowd flows prediction,” in Proc. 21st AAAI Conf. Artif. Intell., 2017, pp. 1655–1661.
- [36] Wei H, Zhou H, Sankaranarayanan J, Sengupta S, and Samet H, “Residual convolutional LSTM for tweet count prediction,” in Proc. Web Conf., 2018, pp. 1309–1316.
- [37] Walach E and Wolf L, “Learning to count with CNN boosting,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 660–676.
- [38] Guerrero-Gómez-Olmedo R, Torre-Jiménez B, López-Sastre R, Maldonado-Bascón S, and Onoro-Rubio D, “Extremely overlapping vehicle counting,” in Proc. Iberian Conf. Pattern Recognit. Image Anal. Cham, Switzerland: Springer, 2015, pp. 423–431.
- [39] Arteta C, Lempitsky V, and Zisserman A, “Counting in the wild,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 483–498.
- [40] Young T, Hazarika D, Poria S, and Cambria E, “Recent trends in deep learning based natural language processing,” IEEE Comput. Intell. Mag., vol. 13, no. 3, pp. 55–75, Aug. 2018.
- [41] LeCun Y, Bengio Y, and Hinton GE, “Deep learning,” Nature, vol. 521, pp. 436–444, Dec. 2015.
- [42] Voulodimos A, Doulamis N, Doulamis A, and Protopapadakis E, “Deep learning for computer vision: A brief review,” Comput. Intell. Neurosci., vol. 2018, Feb. 2018, Art. no. 7068349.
- [43] Ioannidou A, Chatzilari E, Nikolopoulos S, and Kompatsiaris I, “Deep learning advances in computer vision with 3D data: A survey,” ACM Comput. Surv., vol. 50, no. 2, p. 20, 2017.
- [44] Yuan Q, Zhang W, Zhang C, Geng X, Cong G, and Han J, “PRED: Periodic region detection for mobility modeling of social media users,” in Proc. 10th ACM Int. Conf. Web Search Data Mining, 2017, pp. 263–272.
- [45] Clark R, Wang S, Markham A, Trigoni N, and Wen H, “VidLoc: A deep spatio-temporal model for 6-DoF video-clip relocalization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6856–6864.
- [46] Li C, Wang P, Wang S, Hou Y, and Li W, “Skeleton-based action recognition using LSTM and CNN,” in Proc. IEEE Int. Conf. Multimedia Expo. Workshops (ICMEW), Jul. 2017, pp. 585–590.
- [47] Xingjian S, Chen Z, Wang H, Yeung D-Y, Wong W-K, and Woo W-C, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 802–810.
- [48] Stec A and Klabjan D, “Forecasting crime with deep learning,” 2018, arXiv:1806.01486.
- [49] Yuan Z, Zhou X, and Yang T, “Hetero-ConvLSTM: A deep learning approach to traffic accident prediction on heterogeneous spatio-temporal data,” in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2018, pp. 984–992.
- [50] Liu Y, Zheng H, Feng X, and Chen Z, “Short-term traffic flow prediction with Conv-LSTM,” in Proc. 9th Int. Conf. Wireless Commun. Signal Process. (WCSP), Oct. 2017, pp. 1–6.
- [51] Zhang C, Li H, Wang X, and Yang X, “Cross-scene crowd counting via deep convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 833–841.
- [52] Liao B, Zhang J, Cai M, Tang S, Gao Y, Wu C, Yang S, Zhu W, Guo Y, and Wu F, “Dest-ResNet: A deep spatiotemporal residual network for hotspot traffic speed prediction,” in Proc. 26th ACM Int. Conf. Multimedia, 2018, pp. 1883–1891.
- [53] Agarap AF, “Deep learning using rectified linear units (ReLU),” 2018, arXiv:1803.08375.
- [54] Lau MM and Lim KH, “Investigation of activation functions in deep belief network,” in Proc. 2nd Int. Conf. Control Robot. Eng. (ICCRE), Apr. 2017, pp. 201–206.
- [55] Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, and Salakhutdinov RR, “Improving neural networks by preventing co-adaptation of feature detectors,” 2012, arXiv:1207.0580.
- [56] He K, Zhang X, Ren S, and Sun J, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Dec. 2016, pp. 770–778.
- [57] Bengio Y, Simard P, and Frasconi P, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994.
- [58] Glorot X and Bengio Y, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010, pp. 249–256.
- [59] Bahdanau D, Cho K, and Bengio Y, “Neural machine translation by jointly learning to align and translate,” 2014, arXiv:1409.0473.
- [60] Chollet F et al., “Keras,” 2015. [Online]. Available: https://keras.io
- [61] Abadi M. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. [Online]. Available: http://tensorflow.org/
- [62] Kingma DP and Ba J, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980.
