PGSFormer: traffic flow prediction based on joint optimization of progressive graph convolutional networks with subseries transformer

Linlong Chen

doi:10.1038/s41598-026-35643-x

2026 Feb 4;16:7200. doi: 10.1038/s41598-026-35643-x

PMCID: PMC12923748 PMID: 41639147

Abstract

Traffic flow prediction is challenging due to its complex spatio-temporal correlations and the graph-structured nature of traffic networks. Adaptive graph construction methods have gained attention for their superiority over static graph models. However, most existing methods only adjust the graph structure during training, failing to reflect real-time dynamics in the testing phase. This limitation is particularly significant in traffic flow prediction, where data are often affected by abrupt changes and irregularities in time series. To address this, a traffic flow prediction model named Progressive Graph Convolutional Networks with Subseries Transformer (PGSFormer) is proposed, which jointly optimizes a progressive graph convolutional network and a subsequence Transformer. PGSFormer constructs a progressive adjacency matrix by learning trend similarity between nodes, integrates it with dilated causal convolution and gated recurrent units to extract temporal features, and uses parameterized cosine similarity to dynamically update edge weights in real time. Additionally, the Transformer is enhanced with a mask reconstruction task to generate context-aware subsequence representations and extracts long-term trends using stacked 1D convolutional layers. Experiments on two real-world datasets show that PGSFormer significantly outperforms existing baselines in prediction accuracy.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-026-35643-x.

Keywords: Spatial-Temporal traffic forecasting, Progressive graph convolutional networks, Adaptive graph convolution, Mask subseries strategy

Subject terms: Engineering, Mathematics and computing

Introduction

The rapid growth of motorized transportation has substantially improved human mobility, accelerated logistics, and shortened interregional travel times. However, this advancement has simultaneously intensified urban congestion, extended commuting durations, and resulted in significant economic and environmental costs. With the evolution of sensing and communication technologies, massive volumes of real-time traffic data can now be collected, laying the foundation for data-driven intelligent transportation systems (ITS) and spurring the advancement of traffic flow forecasting techniques¹. Traffic flow prediction aims to anticipate future traffic conditions based on historical observations, enabling proactive strategies to alleviate congestion and optimize resource allocation². Accurate forecasting allows transportation authorities to dynamically regulate signals, schedule routes, and deploy control strategies that mitigate traffic jams. Meanwhile, commuters benefit from intelligent route planning and adaptive travel decisions, thereby improving public transport utilization and reducing overall waiting time. Hence, traffic flow prediction has emerged as a vital component of smart city infrastructure with broad practical and societal implications³.

In the past decades, a wide range of modeling approaches have been explored. Early traffic flow prediction models, such as historical average and ARIMA⁴, relied on simple statistical methods. These approaches assumed linearity and stationarity in traffic data, which does not hold true for real-world traffic flow. ARIMA, for example, can only capture linear temporal dependencies, making it ill-suited for traffic flow, which exhibits nonlinear and time-varying behavior. Moreover, these models do not consider the spatial dependencies between road segments, which are crucial for accurately forecasting traffic conditions in urban areas. With the rapid progress of deep learning, graph convolutional networks (GCNs)⁵ have been widely adopted to model non-Euclidean traffic networks through neighborhood aggregation, enabling effective extraction of latent spatial features for traffic flow prediction. Early spatiotemporal models, such as STGCN⁶, coupled GCN-based spatial modeling with convolutional operations along the temporal dimension to capture dependencies among road segments and their temporal variations. Nevertheless, the limited receptive field of temporal convolutions restricted their ability to represent long-range temporal dependencies. To address this limitation, DCRNN⁷ introduced recurrent neural networks into graph-based frameworks, thereby strengthening long-term temporal modeling capacity. Despite this improvement, DCRNN remains insufficient in characterizing spatiotemporal heterogeneity, as it lacks explicit mechanisms to cope with dynamic and region-specific traffic evolution. More recently, methods like Traffic Transformer⁸ and STAEformer⁹ incorporated Transformers to model spatiotemporal dependencies, leveraging self-attention mechanisms to capture global dependencies and improve the handling of long-range correlations. 
However, these approaches still rely on static graph structures, which do not adapt to the changing relationships between road segments over time.

With the advent of deep learning, convolutional neural networks (CNNs) were introduced to extract spatial features from grid-structured data, achieving noticeable accuracy improvements. Nevertheless, CNN-based approaches fail to effectively handle the non-Euclidean topology of transportation networks. To fully capture the complex interplay between spatial and temporal dependencies, researchers have reformulated traffic flow forecasting as an extrapolation problem on discrete-time dynamic graphs (DTDGs)¹⁰, where traffic nodes and edges evolve over time to represent dynamic correlations. Given the outstanding representation power of graph neural networks (GNNs), numerous studies have integrated GNNs with sequential models to build DTDG-based forecasting frameworks¹¹. GNNs are adept at capturing complex spatial information related to node connectivity, while sequence models focus on temporal dynamics¹². Because traffic data often exhibit multi-scale periodicities (daily, weekly, seasonal), an accurate and reliable prediction model must handle long sequences of temporal input. Existing architectures, however, tend to capture only short-term variations while failing to preserve long-range dependencies. This temporal limitation leads to accumulated prediction errors and weak adaptability to nonstationary traffic conditions. The Transformer¹³, introduced by Google in 2017, has become the sequence model of choice for DTDG traffic flow prediction in recent years because of its superior ability to handle long sequences¹⁴. While Transformers have improved performance in traffic forecasting, two major challenges persist:

(1) Insufficient modeling of spatial dependencies. Most prior works depend on graph convolutional networks (GCNs) or their variants to capture local connectivity patterns. Yet, these methods are inherently restricted by fixed adjacency matrices, making it difficult to assign precise importance weights to dynamically changing neighboring nodes¹⁵. Although graph attention networks (GATs) partially address this by introducing learnable weights, their reliance on static road topology prevents them from representing dynamic and indirect spatial interactions. In realistic urban environments, traffic congestion or incidents at one node can propagate rapidly through multiple nonadjacent nodes, requiring adaptive and efficient mechanisms to model dynamic spatial diffusion⁵. The reason is that the feature update of a central node usually only considers its directly connected neighboring nodes, as shown in Fig. 1(a), whereas in real scenarios, congestion at one node often affects many surrounding nodes, as shown in Fig. 1(b). Although GAT can eventually propagate congestion information to all nodes, this propagation method is obviously not efficient and practical enough in traffic environments that require real-time updates and high-precision predictions. Therefore, the current model still needs to be improved in accurately capturing and expressing the complex spatial dependencies in the transportation network.

Fig. 1 — Spatial dependence. (a) Only directly connected neighboring nodes exert influence on the center node. The thickness of the connecting lines indicates the weights assigned to the connections between nodes. (b) Traffic congestion and accidents affect all nodes and the interaction is dynamic spatial dependence.

(2) Inadequate treatment of spatial heterogeneity and node-specific temporal behaviors. Existing spatio-temporal graph neural networks (STGNNs) often utilize predefined or adaptive graphs that remain static during inference, failing to account for heterogeneous temporal evolution patterns among nodes. Predefined graphs, based on geographic distances or fixed connectivity¹¹, capture only coarse relationships, whereas adaptive graphs parameterize node embeddings but remain static after training¹⁶. Dynamic graph construction methods have recently emerged to address this issue by learning data-driven topologies that evolve with input features¹⁷. In these studies, the graph structures are adapted to changes in the input spatio-temporal data, thus enhancing their ability to capture spatial correlations evolving over time. Although existing data-driven dynamic graph construction methods have shown success, these studies fail to adequately consider the different spatio-temporal patterns exhibited by each node due to spatial heterogeneity. Traffic patterns at each node are not only influenced by real-time traffic flows in other regions, but also by spatial heterogeneity, such as specific traffic patterns due to regional or roadway differences.

To address the above challenges, this study focuses on developing a unified spatio-temporal modeling framework (PGSFormer) capable of capturing both long-range temporal dependencies and dynamically evolving spatial interactions in traffic networks. Instead of relying on static or training-stage adaptive graph structures, the framework emphasizes progressive structural evolution driven by online traffic dynamics, together with subsequence-aware temporal representation learning. By jointly considering dynamic spatial dependency propagation and long-horizon temporal pattern extraction, the proposed framework aims to improve prediction robustness under nonstationary traffic conditions and heterogeneous node behaviors, while supporting multi-node forecasting within a single unified model.

The contributions of this paper are as follows:

A traffic flow prediction model is proposed that fuses progressive graph convolution with a subsequence Transformer. It captures time-varying spatial correlations through progressive graph convolution and preprocesses the Transformer via masked reconstruction to generate contextual subsequence representations. Stacked 1D convolutional layers then extract long-term trend features from these representations.
A progressive graph construction strategy is employed to update spatial relationships online during both training and inference, enabling the graph structure to adapt continuously to evolving traffic patterns.
A progressive dynamic graph convolution mechanism is developed by combining the progressive graph with diffusion-based convolution and dilated causal temporal modeling, providing a practical extension to Graph WaveNet-style architectures.
Extensive experiments on two real datasets validate that the proposed model exhibits the best prediction performance compared to the baseline method.

Related work

Traffic flow prediction

Traffic flow prediction has long been a fundamental topic in ITS. Early research predominantly employed traditional time-series models such as ARIMA⁴, which rely on linear assumptions and stationary data distributions. These classical methods, while mathematically interpretable, often fail to capture the nonlinear and spatially correlated nature of real-world traffic data. With the rise of machine learning, approaches such as support vector regression (SVR)¹⁸ and random forest regression were developed to address some of these limitations. However, most traditional machine learning models require manual feature engineering, making them sensitive to data scale and insufficiently robust to dynamic traffic changes¹⁹.

With the rapid development of deep learning, Wu et al.²⁰ used CNNs to mine the spatial features of traffic flow for grid-based traffic flow prediction. STGCN⁶ and DCRNN⁷ further utilized graph neural networks to extract spatial information from non-Euclidean structures, but they were limited by the local receptive field of predefined graphs. Subsequently, STFGNN²¹ and STGODE²² were introduced, using the Dynamic Time Warping (DTW) algorithm to capture long-term spatial features in traffic flow by analyzing time series correlations. However, computing DTW over the entire time series is slow, which limits the performance of the prediction model. Xu et al.²³ proposed learning a better graph network through a data-driven approach, and Fang et al.²⁴ used a self-attention mechanism instead of GCN to capture global spatial dependencies. STA-LSTM²⁵ and ASTGNN²⁶ further apply attention along the time dimension to form spatio-temporal attention for traffic prediction. In addition, many knowledge-driven techniques can be combined with these models; for example, KST-GCN²⁷ and MPGCN²⁸ use knowledge graphs and mobility knowledge, respectively, to improve prediction performance.

GNNs for Spatial-Temporal traffic forecasting

Graph neural networks (GNNs) serve as the core framework for extracting spatial dependencies within non-Euclidean traffic networks. In these architectures, adjacency matrices are typically employed to encode structural connectivity between nodes. However, conventional GNNs based on fixed adjacency matrices fail to adapt to temporal changes in topology. To enhance representational flexibility, researchers have extended GNNs in multiple directions. For example, Chen et al.²⁹ incorporated geodesic distances into adjacency definitions to preserve spatial proximity, whereas the DDP-GCN³⁰ model enriched structural information by integrating factors such as distance, heading, and joint angle into graph construction. The MW-TGC framework³¹ further demonstrated that distinct structural cues should be selectively emphasized under different traffic scenarios. However, these approaches rely on static information such as distance and joint angles between node pairs and speed limits on road segments, whereas nodes may share similar characteristics rather than being physically close or connected in the transportation network. Several research works have shown that learning the adaptive graph during the training phase can further improve the performance of the model³². To overcome these static constraints, adaptive graph learning has become a dominant research trend. Graph WaveNet³³ constructed an adaptive adjacency matrix by multiplying self-learning node embeddings, and DMSTGCN²⁸ showed that constructing an adaptive graph for each time slot of the day can further improve the performance. Bai et al.³² proposed a model that generates node-specific parameters as well as an adaptive graph based on the data. However, these methods have defined the graph prior to the validation and testing phases. 
Some studies detail updating the connectivity information during the testing phase, such as Z-GCNETs²⁹ which uses zigzag persistence images to capture the underlying topology of dynamic traffic data that persists over time, and ASTGNN²⁶ which uses the attention output to extract spatial correlation matrices to update the graph weights during the testing period. Nevertheless, these approaches often fail to fully adapt to the diverse spatio-temporal characteristics exhibited by heterogeneous traffic nodes, limiting their robustness in nonstationary conditions³⁴.

Since trends in spatio-temporal data may face changes in daily trends and other contingencies during the testing period, a method is needed to adapt to the online input data during the training and testing phases³⁵. Therefore, this paper proposes a progressive graph convolutional network that updates the edge connections of the traffic graph as well as the weights of the edges in such a way as to better capture the changes in spatio-temporal dynamics in the traffic flow data in order to improve the accuracy and robustness of the predictions.

Unlike conventional masking strategies that operate directly on raw spatial or temporal dimensions, the proposed framework introduces a masked subsequence reconstruction mechanism that partitions long temporal sequences into fixed-length subseries and selectively masks a large portion of them. This design compels the Transformer to recover contextual temporal patterns and long-range dependencies more effectively than element-wise or step-wise masking used in earlier studies. In addition, the model employs a progressive graph construction strategy that dynamically updates adjacency relationships using parameterized cosine similarity, allowing the graph to evolve in response to changing traffic patterns. This mechanism differs fundamentally from prior work that relies on static graphs or fixed adaptive graphs that remain unchanged during inference. To demonstrate the advantages of these designs, extensive comparisons with representative baselines such as STGCN and ASTGCN are provided. The results consistently show that the proposed model achieves superior accuracy and stability across multiple datasets, with particularly strong improvements in long-horizon prediction tasks.

In addition to the aforementioned studies, several recent works have further advanced spatiotemporal traffic forecasting through dynamic graph learning, multi-scale temporal modeling, and structural disentanglement. With the rapid growth of urban-scale traffic data, recent research has explored the efficiency and scalability of Transformer-based architectures, emphasizing optimized spatial data management and attention computation to handle high-dimensional and long-sequence forecasting tasks more effectively³⁶. Meanwhile, the emergence of spatiotemporal foundation models has brought a paradigm shift toward unified forecasting frameworks that integrate representation learning, data preprocessing, and task adaptation into a cohesive pipeline, offering greater generalization and transferability across domains³⁷. SSGCRTN³⁸ enhances spatial specificity by integrating graph convolutional, recurrent, and Transformer components to jointly capture localized and global dependencies, while MSTDFGRN³⁹ introduces a multi-view dynamic fusion mechanism to accommodate heterogeneous spatial relationships and varying traffic patterns. From a structural perspective, PSTCGCN⁴⁰ incorporates causal priors through principal spatiotemporal graph construction to better characterize the interactions between long-term trends and short-term events, whereas TIIDGCN⁴¹ employs temporal-identity interaction to stabilize periodic variations across time-evolving graphs. In the temporal domain, SDSINet⁴² focuses on dual-scale spatiotemporal interaction for fine-grained feature representation, and MTEGCRN⁴³ improves temporal adaptability by integrating multi-scale temporal cues within a recurrent graph framework. Orthogonally, GDGCRN⁴⁴ and DMFGCRN⁴⁵ adopt decoupled architectures that disentangle spatial and temporal dependencies, thereby improving robustness and interpretability. 
Despite their effectiveness, these approaches still face limitations in modeling ultra-long dependencies, maintaining computational efficiency, and achieving unified dynamic structural reasoning.

Table 1 presents a comparison of representative spatiotemporal traffic forecasting models. In this table, the Structure column specifies the spatiotemporal coupling strategy adopted by each method: sequential coupling (In Series), embedded interaction (In Embedded), or parallel fusion (In Parallel). The Spatial, Temporal, and Feature columns describe the relationship between layers, that is, how the corresponding representations are connected, with interaction patterns denoted as Interactive (IT), Shared (S), or Independent (I). The Relationship type column characterizes the form of adjacency or dependency modeling, which can be pre-defined (P), adaptive (A), or dynamic (D). Compared with existing approaches, the proposed PGSFormer exhibits fully interactive parameter sharing across spatial, temporal, and feature dimensions, while simultaneously incorporating multiple graph structures encompassing pre-defined, adaptive, and dynamic relationships. This integrated design reflects a more systematic modeling paradigm than prior methods.

Table 1.

Comparative analysis of Spatiotemporal traffic forecasting models.

Methods	Structure	Spatial	Temporal	Feature	Relationship type
STGCN⁶	In Series	S	I	I	P
DCRNN⁷	In Embedded	S	I	I	P
Graph WaveNet³³	In Series	S	I	I	A
ASTGCN⁴⁶	In Series	S & I	I	I	A & D
STFGNN²¹	In Parallel	S	I	I	–
STGODE²²	In Series	S	I	I	D
DSTAGNN⁴⁷	In Series	S	I	I	D
SSGCRTN³⁸	In Embedded	S & I	IT	I	A & D
MSTDFGRN³⁹	In Parallel	S & I	IT & S	I	A & D
PSTCGCN⁴⁰	In Series	S	I	I	P & A
TIIDGCN⁴¹	In Series	S & I	IT & S & I	I	A & D
SDSINet⁴²	In Parallel	S	IT	I	A
MTEGCRN⁴³	In Parallel	S & I	IT & S	I	A & D
GDGCRN⁴⁴	In Series	S	I	I	A & D
DMFGCRN⁴⁵	In Parallel	IT & S	IT & S & I	IT & S & I	A & D
PGSFormer (Proposed)	In Series	IT & S & I	IT & S & I	IT & S & I	P & A & D

Methodology

Notations and problem definitions

1) Transportation Network: The transportation network can be constructed from spatial data collected by an internal sensor network or by identifying stations and road segments. It is defined as an undirected graph $G = (V, E, A)$, where $V$ denotes the set of nodes, each representing an observation point in the transportation network, such as a sensor or a road segment, and $E$ denotes the set of edges, which represent the connections between nodes. $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix of the graph, which quantifies the strength of connections between nodes.

2) Traffic Signal: The traffic signal $X_t \in \mathbb{R}^{N \times C}$ represents the data collected by all observation points in the traffic network at time step $t$, where $N$ denotes the number of nodes (sensors) and $C$ denotes the initial number of feature channels (e.g., demand, flow, or speed).

The goal of traffic flow prediction is to predict a future traffic signal sequence $[X_{t+1}, \ldots, X_{t+Q}]$ based on a historical traffic signal sequence $[X_{t-P+1}, \ldots, X_t]$. Here, $P$ represents the length of the historical time series and $Q$ represents the length of the predicted time series. The task can be formalized as a function $f$ that maps the historical sequence to the predicted sequence, i.e.:

$$[X_{t-P+1}, \ldots, X_t] \xrightarrow{\; f \;} [X_{t+1}, \ldots, X_{t+Q}]$$
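The mapping from a history window to a prediction window can be made concrete with a small sliding-window routine. The following is a minimal sketch rather than the paper's data pipeline: the function name `make_windows`, the arguments `p` (history length) and `q` (prediction length), and the nested-list layout of the signals are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Signal = List[List[float]]  # one time step: N nodes x C channels


def make_windows(series: List[Signal], p: int, q: int) -> List[Tuple[List[Signal], List[Signal]]]:
    """Split a series of time steps into (history, target) training pairs.

    Each history holds p consecutive steps and each target holds the q steps
    that immediately follow it (hypothetical helper, not from the paper).
    """
    samples = []
    for t in range(p, len(series) - q + 1):
        history = series[t - p:t]   # the p most recent observations
        target = series[t:t + q]    # the q observations to be predicted
        samples.append((history, target))
    return samples


if __name__ == "__main__":
    # Toy series: 6 time steps, 2 nodes, 1 channel (values are made up).
    toy = [[[float(t)], [float(10 * t)]] for t in range(6)]
    pairs = make_windows(toy, p=3, q=2)
    print(len(pairs), "samples")
```

A forecasting model then plays the role of the function that takes each history and outputs an estimate of the corresponding target.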

To clearly illustrate the dimensional consistency across different components of the proposed framework, Table 2 summarizes the input and output dimensions of each module in PGSFormer. All components follow a unified tensor representation based on the number of nodes $N$, historical length $P$, prediction horizon $Q$, and feature dimension $C$. This design ensures seamless information flow between spatial graph modeling and temporal representation learning modules.

Table 2.

Input and output dimensions of different components in PGSFormer.

Component	Output	Description
Historical traffic signal	-	Raw traffic observations from all nodes
Progressive Graph Construction		Time-varying progressive adjacency matrix
Progressive Graph Convolution (PGC)		Dynamic spatial feature aggregation
Dilated Causal Convolution		Temporal dependency modeling
Subseries Partition		Non-overlapping temporal subseries
Subseries Temporal Representation Learner (STRL)		Subseries-level temporal representations
Masked Reconstruction Head		Reconstructed temporal sequence
Prediction Head		Final traffic flow prediction

The PGSFormer framework proposed in this paper is illustrated in Fig. 2. It is mainly composed of a Progressive Graph Convolutional Network (PGCN) and a Subsequence Transformer. The core idea of PGCN lies in its graph construction mechanism and convolution operation: progressive graphs are first constructed to simulate the evolution of the traffic network by dynamically updating the adjacency matrices, and graph convolution is then performed with these progressive adjacency matrices to capture the dynamic interactions among nodes. In addition, PGCN deepens the analysis of historical sequences by integrating the graph convolution module with dilated causal convolution to ensure the temporal coherence of the prediction results. Finally, the Subsequence Transformer is combined with these modules to form the PGSFormer framework, which handles complex spatio-temporal dependencies and improves the accuracy of traffic flow prediction.

Progressive graph convolution module (PGC)

The progressive graph construction method proposed in this paper takes into account the dynamics of node signal similarity over time. For example, traffic patterns near schools are similar to office areas during the morning rush hour and very different during the afternoon rush hour. While modeling such correlations based on static features (e.g., POI categories) is an intuitive approach, it may not be the most efficient given the scale and complexity. Therefore, this paper learns rich semantic information from online traffic data by measuring similarities between nodes, rather than relying only on simple spatial adjacencies, which captures the dynamic interactions between nodes more accurately and provides deeper insights for traffic prediction.

Figure 3 illustrates the concept of the progressive graph. Given four nodes, each row in the matrix represents the signal of a single feature of one node observed over the last 5 time steps. Two matrices are provided for two different time steps: at the first, one pair of nodes exhibits strong similarity, while at the second a different pair is more similar. It is also noteworthy that connections between nodes may emerge or disappear over time.

Parameterized cosine similarity is first used to define the trend similarity between node signals, and a progressive graph (p-graph) is then proposed that gradually adapts to traffic changes based on this trend similarity. The p-graph dynamically updates node relationships as the correlation between nodes evolves. The p-graph $G_P = \{G_{P,t}\}$ is a collection of graphs indexed by the time step $t$. The progressive adjacency matrix at time $t$ is denoted as $A_{P,t}$ and contains pairwise weights derived from the similarity of the nodes' signals. This paper aims to assign higher weights between nodes with similar signals, rather than relying solely on their spatial proximity. To achieve this, the cosine similarity of signals is employed to compare the nodes in the graph. Here, it is assumed that each node signal $x_{i,t}$ has one input feature. The cosine similarity between two nodes $i$ and $j$ is defined as:

$$\cos(x_{i,t}, x_{j,t}) = \hat{x}_{i,t} \cdot \hat{x}_{j,t}, \qquad \hat{x}_{i,t} = \frac{\bar{x}_{i,t}}{\lVert \bar{x}_{i,t} \rVert}$$

where $\hat{x}_{i,t}$ is the unit vector of $\bar{x}_{i,t}$, and $\bar{x}_{i,t}$ is the min-max normalized signal of node $i$ at time $t$. Normalization is used so that the similarity reflects trends rather than the absolute values of the signals: even if the signal values of two nodes differ, they may exhibit similar trends. For example, two nodes may produce the signal vectors [10,20,10,30,10] and [50,60,50,70,50], which have very similar trends and ranges despite their different values. After normalization, the cosine similarity of these two signals reaches 1, indicating that they are identical in trend.
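The effect of min-max normalization on cosine similarity can be checked numerically. The sketch below is illustrative only: the helper names `min_max`, `cosine`, and `trend_similarity` are hypothetical, the two example vectors are assumed values chosen to share a trend at different magnitudes, and the handling of constant signals (mapped to zeros) is a choice made here, not one stated in the paper.

```python
import math
from typing import List


def min_max(x: List[float]) -> List[float]:
    """Rescale a signal to [0, 1]; a constant signal maps to all zeros."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return [0.0 for _ in x]
    return [(v - lo) / (hi - lo) for v in x]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(u * v for u, v in zip(a, b)) / (norm_a * norm_b)


def trend_similarity(x_i: List[float], x_j: List[float]) -> float:
    """Cosine similarity of the min-max normalized signals."""
    return cosine(min_max(x_i), min_max(x_j))


if __name__ == "__main__":
    a = [10, 20, 10, 30, 10]   # assumed example values
    b = [50, 60, 50, 70, 50]
    print("raw cosine:      ", round(cosine(a, b), 4))
    print("trend similarity:", round(trend_similarity(a, b), 4))
```

On these two vectors the raw cosine similarity stays below 1 because the offsets differ, whereas the normalized version treats the signals as having the same trend.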

To better accommodate the noisy patterns and randomness present in real-world datasets, the cosine similarity computation is made tunable by introducing a learnable parameter $W_{adj}$. The elements of the progressive adjacency matrix at time $t$ are defined using $W_{adj}$:

$$A_{P,t}[i,j] = \mathrm{softmax}\big(\mathrm{ReLU}\big(\cos(x_{i,t} W_{adj},\; x_{j,t} W_{adj})\big)\big)$$

The softmax function normalizes the progressive adjacency matrix, the ReLU activation eliminates negative connections, and $W_{adj}$ linearly transforms the signals before the final similarity values are computed.
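The construction of the progressive adjacency matrix can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the learnable parameter is represented by a plain square matrix `w_adj`, the softmax is taken row-wise, the input signals are assumed to be already min-max normalized, and all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(u * v for u, v in zip(a, b)) / (norm_a * norm_b)


def transform(x: List[float], w: Matrix) -> List[float]:
    """Row vector x (length T) times matrix w (T x T)."""
    return [sum(x[k] * w[k][j] for k in range(len(x))) for j in range(len(w[0]))]


def progressive_adjacency(signals: Matrix, w_adj: Matrix) -> Matrix:
    """Row-wise softmax(ReLU(cos(x_i W, x_j W))) over all node pairs.

    signals: N rows, each a normalized node signal of length T.
    w_adj:   T x T matrix standing in for the learnable parameter.
    """
    z = [transform(x, w_adj) for x in signals]
    n = len(z)
    adj = []
    for i in range(n):
        scores = [max(0.0, cosine(z[i], z[j])) for j in range(n)]  # ReLU
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
        total = sum(exps)
        adj.append([e / total for e in exps])
    return adj
```

Because the matrix depends only on the current input window and on `w_adj`, it can be recomputed for every new window at inference time, which is what allows the graph to keep evolving after training. Note that a softmax applied after ReLU yields strictly positive weights, so in this sketch dissimilar nodes receive small rather than exactly zero weights.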

The core idea of a graph convolution module is to aggregate information from neighboring nodes while extracting the spatial features of a target node; in its basic form, the graph signal is multiplied by learnable parameters and by an adjacency matrix processed in a defined way. One of the most commonly used graph convolution modules in traffic prediction is diffusion convolution, in which traffic flow on a traffic network is treated as a diffusion process. Using the transition matrices $P_f$ and $P_b$, the diffusion convolution on a directed graph with a $K$-step diffusion process and filter $f_\theta$ can be defined as:

$$X \star_G f_\theta = \sum_{k=0}^{K-1} \left( \theta_{k,1} P_f^{k} + \theta_{k,2} P_b^{k} \right) X$$

where $\star_G$ is the graph convolution operation with filter $f_\theta$, and $\theta_{k,1}$ and $\theta_{k,2}$ are learnable parameters. $P_f$ and $P_b$ reflect the forward and backward diffusion processes, respectively.

Using diffusion convolution as the base graph convolution module for progressive graph convolution, this paper adds the product of the progressive adjacency matrix, the graph signal matrix and the additional weight parameters to the diffusion convolution.
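Written out, this addition might take the following form. It is a sketch under assumed notation rather than the paper's exact equation: $P_f$ and $P_b$ stand for the forward and backward transition matrices, $A_{P,t}$ for the progressive adjacency matrix at time $t$, $X_t$ for the graph signal matrix, $K$ for the number of diffusion steps, and $W_{k,1}$, $W_{k,2}$, $W_{k,3}$ for weight matrices in the style used by Graph WaveNet-type models.

```latex
Z_t = \sum_{k=0}^{K-1} \left( P_f^{k} X_t W_{k,1} + P_b^{k} X_t W_{k,2} + A_{P,t}^{k} X_t W_{k,3} \right)
```

The first two terms are the usual bidirectional diffusion over the fixed road graph, and the third term lets information flow along the data-driven progressive edges.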

Dilated causal Convolution

Traditional causal convolution is built from convolutional layers and therefore does not require the step-by-step recursive computation of recurrent units, which keeps the model simpler (Fig. 4(a)). However, causal convolution needs many layers to enlarge its receptive field. Dilated causal convolution overcomes this limitation (Fig. 4(b)): by skipping inputs at a fixed step within each layer, a stack of dilated layers reaches a much larger receptive field at the same depth. In addition, dilated causal convolution processes long time sequences with non-recursive parallel computation, which speeds up learning and alleviates the vanishing gradient problem.

Fig. 4 — Causal convolution and dilated causal convolution.

Using a dilated causal convolution with kernel size 2 and dilation factor $d$, the inputs are selected every $d$ steps and a standard 1D convolution is applied to the selected inputs. Given a 1D input sequence $x \in \mathbb{R}^{T}$ and a filter $f \in \mathbb{R}^{K}$, the dilated causal convolution of $x$ with $f$ at step $t$ is given by Eq. 2:

$$x \star f(t) = \sum_{s=0}^{K-1} f(s)\, x(t - d \times s)$$

where $d$ is the dilation factor that controls the skip step size. By stacking dilated causal convolutional layers with dilation factors in increasing order, the receptive field of the temporal convolutional network grows exponentially. Given the input $X$, the gated TCN takes the form:

$$h = g(\Theta_1 \star X + b) \odot \sigma(\Theta_2 \star X + c)$$

where $\Theta_1$, $\Theta_2$, $b$, and $c$ are the model parameters, $\odot$ is the element-wise product, $g(\cdot)$ is the activation function of the output, and $\sigma(\cdot)$ is the sigmoid function, which determines the ratio of information passed to the next layer.
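A single-channel version of the dilated causal convolution and the gating step can be sketched as follows. This is a simplified illustration, not the model's layer: it works on one scalar sequence, uses implicit left zero-padding, uses scalar biases, and assumes tanh as the output activation; the function names are hypothetical.

```python
import math
from typing import List


def dilated_causal_conv(x: List[float], f: List[float], d: int) -> List[float]:
    """y[t] = sum_s f[s] * x[t - d*s], treating x as zero before its start."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for s, w in enumerate(f):
            idx = t - d * s
            if idx >= 0:  # only current and past inputs are used (causality)
                acc += w * x[idx]
        out.append(acc)
    return out


def gated_tcn(x: List[float], f_filter: List[float], f_gate: List[float],
              d: int, b: float = 0.0, c: float = 0.0) -> List[float]:
    """tanh(filter branch) multiplied element-wise by sigmoid(gate branch)."""
    a = dilated_causal_conv(x, f_filter, d)
    g = dilated_causal_conv(x, f_gate, d)
    return [math.tanh(u + b) * (1.0 / (1.0 + math.exp(-(v + c))))
            for u, v in zip(a, g)]


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Number of input steps seen by the top of a stack of dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    print(dilated_causal_conv([1, 2, 3, 4, 5], [1, 1], d=2))
    print(receptive_field(2, [1, 2, 4, 8]))
```

With kernel size 2, doubling the dilation at each layer doubles the receptive field per added layer, whereas undilated layers would each add only one step.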

Subseries Temporal representation learner (STRL)

For time series, the Transformer differs from temporal models such as RNNs and 1D CNNs in that the inputs at all time steps are directly connected to one another. The Transformer can therefore take the representation of earlier temporal features into account regardless of how far apart the time steps are. This paper uses the Transformer encoder as the STRL. As shown in Fig. 5, the Masked Subseries Transformer consists of two parts: the STRL and the self-supervised task head. The STRL learns the temporal representations of the subsequences, and the self-supervised task head then reconstructs the complete long sequence from the temporal representations of the unmasked subsequences and the mask tokens.

Fig. 5 — The structure of masked subseries Transformer.

Specifically, the long history sequence $X$ of $T_l$ time steps is first partitioned into non-overlapping subsequences containing $T_s$ time steps each, hence the number of subsequences is $T_l / T_s$. In this paper, 75% of the subsequences are randomly masked, and the set of masked subsequences is denoted as $\mathcal{M}$. The remaining unmasked data, called $X_u$, is used as input to the STRL:

$$H_u = \mathrm{STRL}(X_u)$$

where $H_u$ denotes the representation of $X_u$, the output of the STRL.

The self-supervised task head consists of a Transformer layer and a linear output layer that reconstructs the complete long sequence from $H_u$ and the learnable mask token $m$ as follows:

$$\hat{X} = \mathrm{Linear}\big(\mathrm{Transformer}(H_u, m)\big)$$

The goal of the pre-training phase is to keep the error between the reconstructed $\hat{X}_{\mathcal{M}}$ and the true values $X_{\mathcal{M}}$ of the masked subsequences as small as possible, so only the masked subsequences are considered when computing the loss:

$$\mathcal{L}(\theta) = \ell\big(\hat{X}_{\mathcal{M}}, X_{\mathcal{M}}\big)$$

where $\ell$ is the reconstruction error and $\theta$ denotes the learnable parameters of the entire Transformer.
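The masking procedure described above (partition into non-overlapping subsequences, mask 75% at random, compute the loss on masked subsequences only) can be sketched as follows. The helper names and the fixed random seed are hypothetical, and mean absolute error is assumed for the reconstruction error because the text does not specify the error type.

```python
import random


def split_subseries(series, sub_len):
    """Partition a long sequence into non-overlapping subsequences of sub_len steps."""
    if len(series) % sub_len != 0:
        raise ValueError("sequence length must be a multiple of sub_len")
    return [series[i:i + sub_len] for i in range(0, len(series), sub_len)]


def sample_mask(num_sub, mask_ratio=0.75, seed=0):
    """Randomly choose which subsequences to mask; returns (masked, unmasked) index lists."""
    rng = random.Random(seed)
    masked = sorted(rng.sample(range(num_sub), int(num_sub * mask_ratio)))
    masked_set = set(masked)
    unmasked = [i for i in range(num_sub) if i not in masked_set]
    return masked, unmasked


def masked_loss(recon, target, masked_idx):
    """Mean absolute error over masked subsequences only (error type is an assumption)."""
    errors = [abs(r - t) for i in masked_idx for r, t in zip(recon[i], target[i])]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    subs = split_subseries(list(range(24)), sub_len=3)   # 8 subsequences
    masked, unmasked = sample_mask(len(subs))            # 6 masked, 2 visible
    encoder_input = [subs[i] for i in unmasked]          # what the encoder would see
    fake_recon = [[v + 1 for v in s] for s in subs]      # stand-in for the task head output
    print(len(masked), len(encoder_input), masked_loss(fake_recon, subs, masked))
```

Because the loss ignores unmasked positions, the encoder cannot lower it by copying visible inputs and has to infer the masked subsequences from context.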

Experiment

Datasets

The prediction performance of the PGSFormer model is validated on four public traffic datasets: METR-LA, PEMS-BAY, PeMS04 and PeMS08. METR-LA consists of traffic speed statistics recorded by 207 sensors on freeways in Los Angeles County over a four-month period. PEMS-BAY consists of traffic speed information recorded by 325 sensors on roads in the San Francisco Bay Area over a six-month period. Both METR-LA and PEMS-BAY record the detection vector, detection date, and data type. PeMS04 and PeMS08 contain information collected from PeMS⁴⁶. The details of the experimental datasets are shown in Table 3:

Table 3.

Description of the experimental dataset.

Datasets	# Samples	# Nodes	Frequency	Data type
METR-LA	34,272	207	5 min	Speed
PEMS-BAY	52,116	325	5 min	Speed
PeMS04	16,992	307	5 min	Traffic Flow, Speed
PeMS08	17,856	170	5 min	Traffic Flow, Speed


The METR-LA dataset contains traffic speed information collected by loop detectors on the Los Angeles County roadway network. It covers 207 sensors over a four-month period from March 1, 2012 to June 30, 2012¹⁴.

The PEMS-BAY dataset, on the other hand, is derived from the Caltrans Performance Measurement System (PeMS) and focuses on traffic data collected in the Bay Area of California. During the six-month period from January 1, 2017 to May 31, 2017, this dataset recorded traffic speed data through 325 sensors, supporting traffic flow forecasting and pattern analysis¹⁴.

Experimental settings and evaluation metrics

The experiments are implemented in the PyTorch deep learning framework, and the traffic flow prediction model is built and trained in the PyCharm development environment. The model is trained with the Adam optimizer for 100 epochs with an initial learning rate of 0.001, and each dataset is split in a 7:2:1 ratio into training, testing and validation sets, respectively. For the dilated causal convolution, the dilation factors are 1, 2, 1, 2, 1, 2, 1 and 2, and the convolution kernel size is 2. To analyze the experimental results and evaluate the prediction performance of the model, the error between the actual traffic values and the predictions is measured with the following evaluation metrics:

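Mean Absolute Error (MAE):

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$

Root Mean Square Error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

Mean Absolute Percentage Error (MAPE):

$$\mathrm{MAPE} = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

where $y_i$ and $\hat{y}_i$ are the real traffic information and the predicted value of the model at the $i$-th time step, respectively, and $N$ denotes the number of nodes on the traffic network.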
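The three metrics can be computed as in the short sketch below, written with the standard library only. The function names are illustrative, and skipping zero ground-truth values in MAPE is an assumption added here to avoid division by zero; the paper does not state how such values are handled.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; zero targets are skipped (assumption)."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)


if __name__ == "__main__":
    truth = [10.0, 20.0, 40.0]
    pred = [12.0, 18.0, 44.0]
    print(round(mae(truth, pred), 3), round(rmse(truth, pred), 3), round(mape(truth, pred), 2))
```

RMSE penalizes large individual errors more strongly than MAE, while MAPE expresses the error relative to the magnitude of the true value, which is why all three are reported together.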

Baselines

To evaluate the performance of PGSFormer, this paper compares it with 24 baseline models widely used for traffic flow prediction.

The baseline models are divided into two categories:

1. The first category is the classical time series forecasting models based on statistical analysis, which only consider the time or spatial dimension, mainly including HA, FNN, ARIMA⁴.

HA: The average of historical traffic flow is used as the prediction for the next step. In the baseline method, the average of the past 12 time slices in the same period one week earlier is used to predict the current time slice.
FNN: Two hidden layer feedforward neural network using L2 regularization.
ARIMA: A popular model used in time series prediction. The orders of autoregression, difference, and moving average are the three crucial parameters for the ARIMA model. In the baseline method, (p, d, q) is set to (4, 1, 1).

2. The second category comprises more recent forecasting models that also consider spatio-temporal correlation, covering FC-LSTM, G-WN³³, STID⁴⁹, STGCN⁶, GCRNN⁵⁰, ST-MetaNet⁵¹, ASTGCN⁴⁶, Bi-STAT⁵², PGCN³⁵, DSTAGNN⁴⁷, Trafformer⁴⁸, FedAGAT⁵³, LEISN-ED⁵⁴, DCRNN⁷, STFGNN²¹, STGODE²², Z-GCNETs²⁹, Traffic Transformer⁸, ASTRformer⁵⁵, PDG2seq⁵⁶, DGMA⁵⁷.

FC-LSTM: Which is a classic RNN to learn time series and make predictions by fully connected neural networks. In the baseline method, the number of hidden layers is set to 1 and the number of hidden units is set to 64.
G-WN: Constructed from the GCN and the gated temporal convolution layer (gated TCN). Each layer in this model contains a gated TCN and a spatial GCN. The number of stacked layers is set to 8 with dilation rates {1,1,1,1,1,2,2,2,2,2}, and the hidden dimension is set to 64.
STID: The temporal and spatio ID information is directly embedded into the model as additional features, and the prediction ability is enhanced in a simple and effective way to avoid the computational burden caused by the complex architecture and achieve efficient and accurate prediction. The hidden dimension is set to 32. The number of MLP layers is set to 3. The learning rate is set to 0.001.
STGCN: Which employs the graph convolutional layers and convolutional sequence layers. The number of spatio-temporal cell is set as 2 and the hidden dimension is set as 64.
GCRNN: Learns roadway interactions and forecasts network-wide traffic states using a traffic graph convolution defined by the physical network topology, while also discussing its relationship to spectral graph convolution. The size of hops in the graph convolution can vary; it is set to 3 for the model evaluation and comparison in this experiment.
ST-MetaNet: Which adopts a sequence-to-sequence architecture with identical encoder-decoder modules; each integrates meta graph attention networks (diverse spatial correlations) and meta recurrent neural networks (diverse temporal correlations), with weights derived from geo-graph embeddings and dynamic traffic context, to address complex spatio-temporal traffic correlation challenges.
ASTGCN: Which employs the attention mechanism to capture the spatiotemporal dynamic correlations. Similar to STGCN, there are two spatio-temporal cells in this model and the hidden dimension is set as 64.
Bi-STAT: Adopts an encoder-decoder architecture in which both the encoder and the decoder maintain a spatial-adaptive transformer and a temporal-adaptive transformer structure; it also includes a structurally simple recollection decoder for regularization and auxiliary information provision. Its structural parameters are the number of layers, the number of heads, and the dimension, set separately for the recollection decoder (on all datasets) and for the encoder/prediction decoder (for PeMSD4/8). The loss weight α for the recollection decoder is 0.001 (PeMSD8) and 0.01 (PeMSD4).
PGCN: Constructs a set of graphs by progressively adapting to online input data during the training and testing phases. The model was implemented with 8 spatial-temporal layers and a hidden dimension of 32. The dilated causal convolutions used a kernel size of 2 with dilation factors {1,1,1,1,2,2,2,2}. Training was conducted using the Adam optimizer with an initial learning rate of 0.001.
DSTAGNN: The dynamic spatio-temporal attention mechanism is introduced to adaptively learn the importance of spatial and temporal dimensions. It has a high degree of adaptability to changes in network topology and time patterns, and improves the robustness of the model to dynamic traffic environments. The number of terms of the Chebyshev polynomial . The window size of the pooling layer is 2. The number of attention heads in the temporal attention module is 3. All graph convolutional layers and time convolution layers use 32 convolution kernels.
Trafformer: A deep learning architecture that employs self-attention as its core mechanism, composed of an encoder-decoder structure. The model configuration uses an encoder with 2 layers and a decoder with 1 layer. Each attention layer has a hidden dimension of 32, and all learnable parameters are initialized from a normal distribution. Training employs the Adam optimizer with an initial learning rate of 0.002. After testing dropout rates of 0, 0.05, and 0.2, the optimal rate was set to 0.
FedAGAT: Which consists of AGAT and federated learning components: AGAT captures intricate spatial-temporal dependencies in traffic flow, while federated learning enables decentralized learning (local data processing, only model updates shared with server) to enhance privacy.
LEISN-ED: Consists of (1) a Long-term Dependency Module, which stores multi-step historical hidden states to transmit long-term features, and (2) two graph convolution-based spatial branches, which extract explicit spatial features via a topology-based adjacency matrix and implicit ones via a trend similarity-based adjacency matrix; the features are fused to generate the next state. The derived encoder framework LEISN-ED addresses long-term temporal dependence and local/non-local spatial dependence issues.
DCRNN: A diffusion convolutional recurrent neural network that captures spatial dependencies and temporal dynamics through recursive forecasting. Both encoder and decoder contain two recurrent layers. Each recurrent layer has 64 units, the initial learning rate is 1e-2, and the maximum number of random walk steps, K, is set to 3.
STFGNN: Fuses data-driven temporal graphs with spatial graphs using Dynamic Time Warping. The model architecture comprises 3 STFGNLs, each containing 8 independent STFGNMs and a gated convolution module with a dilation rate of 3, corresponding to the spatial-temporal fusion graph size. For simplification, all elements within the three graph types are binarized to 0 or 1. Each convolutional layer employs 64 filters. The model is trained using the Adam optimizer with a learning rate of 0.001.
STGODE: Uses tensor-based ordinary differential equations for continuous spatiotemporal dynamics. The hidden dimensions of TCN blocks are set to 64, 32, 64, and 3 STGODE blocks are contained in each layer.
Z-GCNETs: Zigzag persistence provides a systematic and mathematically rigorous framework to track the most important topological features of the observed data that tend to manifest themselves over time. For token networks, Z-GCNETs contains 2 layers, each with 16 hidden units. For PeMSD4 and PeMSD8, Z-GCNETs contains 2 layers, each with 64 hidden units.
DSTAGNN: Which adopts a data-driven dynamic spatial-temporal aware graph (replacing static graphs) to capture intrinsic dynamic spatial structure from historical data, an improved multi-head attention mechanism for dynamic spatial relevance among nodes, and multi-scale gated convolution to acquire wide-range dynamic temporal dependency from multi-receptive field features. Experimental hyperparameters: Chebyshev polynomial terms (spatial attention heads) ; pooling window ; temporal attention heads = 3, spatial-temporal attention = 32; 32 convolution kernels for all graph/time convolution layers, 4 stacked ST blocks; Huber loss (threshold = 1), Adam optimizer (epochs = 100, learning rate = 0.0001, batch size = 32).
Traffic Transformer: Which adopts multi-head attention and stacking layers to learn dynamic and hierarchical sequential data features, and integrates a global encoder and a global-local decoder to extract and fuse spatial patterns globally and locally, and is configured with 3 temporal embedding layers, 64 hidden feature channels after temporal embedding, 6 blocks each for the global encoder and global-local decoder, 8 attention heads in these blocks, and a feedforward network of dimension 256 with a 10% dropout rate in the traffic Transformer.
ASTRformer: Dynamically integrates spatial and temporal information through an adaptive relation learning mechanism and learnable fusion networks to generate enriched input representations. The embedding dimension is 24, and the window size is 80. The number of layers is 3 for both spatial and temporal transformers, with 4 heads. Adam is chosen as the optimizer with the learning rate decaying from 0.001, and the batch size is 16.
PDG2seq: Exploits periodicity with periodic feature selection and trend transitions. The model configuration uses a random seed of 10, a batch size of 64 for 200 training epochs, with an initial learning rate of 0.003.
DGMA: The DGMA model captures detailed spatio-temporal patterns, while the CATP model uses meta-learning to transfer knowledge from data-rich areas to data-scarce regions, enhancing prediction accuracy in low-data scenarios. The model is configured with 110 training epochs, an input and prediction sequence length of 12, a batch size of 32, 3 attention heads across 4 attention sublayers, and an initial learning rate of 1e-4.

Comparison results

Table 4 reports the results of the proposed PGSFormer model and the other state-of-the-art models for 15-minute, 30-minute, and 60-minute traffic flow prediction on the METR-LA and PEMS-BAY datasets. The traditional statistical methods (e.g., HA, ARIMA) perform poorly on nonlinear traffic flow data: traffic flow prediction is not a simple linear time-series analysis problem but an intricate spatio-temporal prediction task, for which such models are not well suited.

Table 4.

Comparison of experimental results of different models on two datasets.

Data	Models	15 min			30 min			60 min
Data	Models	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE
METR-LA	HA	4.16	7.80	13.00%	4.16	7.80	13.00%	4.16	7.80	13.00%
	FNN	3.99	7.94	9.90%	4.23	8.17	12.90%	4.49	8.69	14.00%
	ARIMA	3.99	8.21	9.60%	5.15	10.45	12.70%	6.90	13.23	17.40%
	FC-LSTM	3.44	6.30	9.60%	3.77	7.23	10.90%	4.37	8.69	13.20%
	G-WN	2.98	5.90	7.92%	3.59	7.29	10.26%	4.43	8.97	13.64%
	STGCN	2.88	5.74	7.62%	3.47	7.24	9.57%	4.59	9.40	12.70%
	GCRNN	3.55	7.32	9.30%	4.21	8.54	11.20%	5.78	8.65	13.40%
	ST-MetaNet	2.69	5.17	6.91%	3.10	6.28	8.57%	3.59	7.52	10.63%
	ASTGCN	4.86	9.27	9.21%	5.43	10.61	10.13%	6.51	12.52	11.64%
	STID	2.80	5.53	7.70%	3.18	6.60	9.40%	3.45	7.54	10.95%
	Bi-STAT	2.83	5.60	7.54%	3.19	6.55	9.03%	3.57	7.41	10.51%
	PGCN	2.70	5.16	6.98%	3.08	6.22	8.38%	3.54	7.36	9.94%
	DSTAGNN	3.76	9.52	8.65%	4.78	11.96	10.54%	6.12	14.93	13.03%
	Trafformer	2.79	5.36	7.28%	3.15	6.37	8.87%	3.66	7.43	10.05%
	FedAGAT	2.70	5.28	7.02%	3.06	6.27	8.35%	3.45	7.28	9.88%
	LEISN-ED	2.77	5.29	7.18%	3.13	6.33	8.48%	3.52	7.40	9.97%
	PGSFormer	2.60	5.11	6.90%	3.08	6.23	8.38%	3.51	7.33	10.04%
PEMS-BAY	HA	2.88	5.59	6.80%	2.88	5.59	6.80%	2.88	5.59	6.80%
	FNN	2.20	4.42	5.19%	2.30	4.63	5.43%	2.46	4.98	5.89%
	ARIMA	1.62	3.30	3.50%	2.33	4.76	5.40%	3.38	6.50	8.30%
	FC-LSTM	2.05	4.19	4.80%	2.20	4.55	5.20%	2.37	4.96	5.70%
	G-WN	1.39	3.01	2.89%	1.83	4.21	4.11%	2.35	5.43	5.78%
	STGCN	1.36	2.96	2.90%	1.81	4.27	4.17%	2.49	5.69	5.79%
	GCRNN	1.77	3.53	3.70%	1.78	4.33	4.70%	2.51	5.49	5.50%
	ST-MetaNet	1.36	2.90	2.82%	1.76	4.02	4.00%	2.20	5.06	5.45%
	ASTGCN	1.52	3.13	3.22%	2.01	4.27	4.48%	2.61	5.42	6.00%
	STID	1.30	2.81	2.73%	1.62	3.72	3.68%	1.89	4.40	4.47%
	Bi-STAT	1.36	2.88	2.81%	1.68	3.78	3.70%	1.99	4.51	4.61%
	PGCN	1.30	2.73	2.72%	1.62	3.67	3.63%	1.92	4.45	4.45%
	DSTAGNN	1.40	2.97	3.02%	1.72	3.86	3.97%	2.14	4.82	5.09%
	Trafformer	1.35	2.91	2.88%	1.66	3.76	3.82%	2.06	4.58	4.67%
	FedAGAT	1.30	2.79	2.73%	1.62	3.72	3.69%	1.86	4.39	4.44%
	LEISN-ED	1.36	2.81	2.86%	1.68	3.74	3.81%	1.98	4.52	4.67%
	PGSFormer	1.21	2.67	2.66%	1.55	3.68	3.64%	1.86	4.38	4.42%


For short-horizon prediction (15 and 30 minutes), Graph WaveNet (G-WN) clearly outperforms the FC-LSTM model. This is because G-WN embeds GCN into temporal convolutional networks (TCNs), which strengthens the extraction of temporal features. ST-MetaNet generates a graph attention mechanism from meta-graph information and combines it with RNN weight optimization to improve accuracy without significantly increasing the number of parameters.

Among the deep learning methods, the PGSFormer model proposed in this paper shows excellent performance. It has significant advantages over methods such as STGCN and ASTGCN, which use predefined adjacency matrices and do not take the hidden properties of road networks into account. Compared with models such as G-WN, which use adaptive adjacency matrices but ignore the dynamic spatial dependence at each time step, PGSFormer achieves significant improvements by introducing dynamic information.

In addition, the PGSFormer model outperforms models such as ST-MetaNet, STID, and Bi-STAT on nearly all evaluation metrics; the only exception in Table 4 is the 60-minute MAE on METR-LA, where STID (3.45) is slightly lower than PGSFormer (3.51). In particular, the PGSFormer model captures the dynamic spatio-temporal correlations at each time period well and achieves strong results in the 15-, 30- and 60-minute predictions, indicating that the model can efficiently explore unseen dynamic associations among road network nodes and thus capture hidden spatial correlations. The METR-LA and PEMS-BAY datasets differ significantly in node number and time span, yet the PGSFormer model maintains high prediction accuracy on both. This demonstrates that the PGSFormer model can perform the prediction task on different road networks, showing strong generalization ability.

Table 5 presents the prediction results of PGSFormer against 14 baselines on PeMS04 and PeMS08, evaluated by MAE, RMSE, and MAPE. PGSFormer achieves the lowest RMSE on both datasets and the lowest MAPE on PeMS08, while ASTRformer attains a slightly lower MAE and MAPE on PeMS04 and PDG2seq a lower MAE on PeMS08.

Table 5.

Performance comparison on PeMS04 and PeMS08 datasets.

Model	PeMS04			PeMS08
Model	MAE	RMSE	MAPE (%)	MAE	RMSE	MAPE (%)
HA	38.03	59.24	27.88	34.86	59.24	27.88
ARIMA	33.73	48.80	24.18	31.09	44.32	22.73
FC-LSTM	26.77	40.65	18.23	23.09	35.17	14.99
STGCN	21.16	34.89	13.83	17.50	27.09	11.29
DCRNN	21.22	33.44	14.17	16.82	26.36	10.92
ASTGCN	22.93	35.33	16.56	18.25	28.06	11.64
STFGNN	19.83	31.88	13.02	16.64	26.22	10.60
STGODE	20.84	32.82	13.77	16.81	25.97	10.62
Z-GCNETs	19.50	31.61	12.78	15.76	25.11	10.01
DSTAGNN	19.30	31.46	12.70	15.67	24.77	9.94
Traffic Transformer	21.15	32.12	14.01	17.52	24.01	10.45
ASTRformer	18.25	30.09	11.98	14.64	23.82	9.29
PDG2seq	19.74	31.89	13.50	13.83	23.46	9.43
DGMA	19.32	31.42	12.87	15.77	25.05	9.97
PGSFormer	18.55	28.62	12.17	14.26	23.09	9.12


In order to demonstrate the performance of the PGSFormer model more intuitively, this paper visualizes and compares its prediction results with those of HA, ARIMA, GCRNN, ST-MetaNet, STID, and Bi-STAT models on the METR-LA dataset (see Fig. 6). It is clear from the figure that the prediction performance of PGSFormer is significantly better than the other models, especially in capturing the dynamic spatio-temporal characteristics of the traffic flow. As the prediction duration increases, although the prediction errors of all models increase, the error growth of PGSFormer is smaller. When the prediction time exceeds 15 min, the error of PGSFormer is significantly lower than that of the other compared models, indicating that the model has more stability and superiority in long-term prediction. This further demonstrates the excellent performance of PGSFormer in traffic flow prediction tasks and its efficiency and accuracy in real-time traffic prediction.

Fig. 6 — Visualization of MAE, RMSE and MAPE results on METR-LA dataset.

In further analysis, Fig. 7 shows the prediction results of the PGSFormer model on the PEMS-BAY dataset at the 15th minute (Horizon 3) and the 60th minute (Horizon 12) compared with the true values. The results show that PGSFormer is able to accurately capture the fluctuating trends of traffic flow data and exhibits high prediction accuracy. Although the complex dynamic spatio-temporal features make the prediction more difficult as the prediction time is extended, PGSFormer is still able to effectively track the real trend of traffic flow in the 60-minute long-time prediction. This fully proves the accuracy and robustness of PGSFormer in long time series traffic flow prediction. Whether it is a short-time prediction or a long-time prediction, PGSFormer can accurately capture the trend of traffic flow, predict the occurrence of traffic congestion, and identify the start and end time of the peak period of traffic flow. This not only demonstrates PGSFormer’s excellent performance in traffic flow prediction tasks, but also highlights its efficiency and accuracy in real-time traffic prediction.

As shown in Figs. 8 and 9, the training and validation loss curves of PGSFormer on both the METR-LA and PEMS-BAY datasets exhibit a smooth and consistent downward trend, indicating stable optimization behavior during training. On the METR-LA dataset, the training and validation losses decrease steadily and remain closely aligned throughout the entire training process, without obvious divergence, which suggests that the model achieves good generalization without severe overfitting. A similar convergence pattern can be observed on the PEMS-BAY dataset, where both loss curves gradually stabilize after sufficient epochs and show no abrupt oscillations. These results demonstrate that the proposed model maintains stable learning dynamics across different traffic networks and that the reported performance improvements are supported by a well-behaved and reliable training process.

Fig. 8 — Training and validation loss curves of PGSFormer on the METR-LA dataset.

Fig. 9 — Training and validation loss curves of PGSFormer on the PEMS-BAY dataset.

To further validate the statistical reliability of the performance improvements, we conducted statistical testing and average ranking analysis across different models. As shown in Fig. 10, the MAE distributions of PGSFormer and the best baseline exhibit clear separation, indicating that the proposed model consistently achieves lower prediction errors with statistical significance. Moreover, the average ranking results in Fig. 11 show that PGSFormer achieves the best overall rank among all compared methods, demonstrating its superior and stable performance across different datasets and evaluation metrics.

Fig. 10 — Statistical Testing Visualization.

Fig. 11 — Average Ranking Visualization.
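One common way to obtain an average ranking such as the one mentioned above is to rank the models separately on each metric and average the ranks. The sketch below does this for four models using their 60-minute PEMS-BAY values from Table 4. The exact procedure behind Fig. 11 is not described in the text, so this is an assumed, simplified variant that does not handle ties.

```python
def average_rank(scores):
    """scores maps model name -> list of metric values (lower is better).

    Returns model name -> mean rank over all metrics. Ties are not handled.
    """
    models = list(scores)
    n_metrics = len(next(iter(scores.values())))
    totals = {m: 0 for m in models}
    for j in range(n_metrics):
        ordered = sorted(models, key=lambda m: scores[m][j])
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_metrics for m in models}


# [MAE, RMSE, MAPE %] at the 60-minute horizon on PEMS-BAY, copied from Table 4.
PEMS_BAY_60 = {
    "STID": [1.89, 4.40, 4.47],
    "Bi-STAT": [1.99, 4.51, 4.61],
    "PGCN": [1.92, 4.45, 4.45],
    "PGSFormer": [1.86, 4.38, 4.42],
}

if __name__ == "__main__":
    for model, rank in sorted(average_rank(PEMS_BAY_60).items(), key=lambda kv: kv[1]):
        print(f"{model}: {rank:.2f}")
```

On this subset PGSFormer ranks first on all three metrics, while STID and PGCN swap places between RMSE and MAPE, which is the kind of disagreement an average rank smooths over.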

As shown in Fig. 12, the RMSE violin plots on the METR-LA and PEMS-BAY datasets demonstrate that PGSFormer consistently achieves the lowest median error and the most compact distribution among all compared models, indicating both superior prediction accuracy and stronger stability. In contrast, traditional graph-based models such as STGCN and DCRNN exhibit noticeably wider distributions, reflecting higher variance and less robust predictive behavior across different samples. Moreover, the error distributions of ASTGCN and MTGNN remain more dispersed, especially on the METR-LA dataset, suggesting their sensitivity to complex spatiotemporal fluctuations. Overall, the concentrated error distribution and lower central tendency of PGSFormer on both datasets confirm that the proposed model not only achieves higher average accuracy, but also delivers more reliable and stable predictions under diverse traffic conditions.

Ablation study

Table 6 reports the ablation results on the METR-LA, PEMS-BAY, PeMS04, and PeMS08 datasets across three prediction horizons. The results show that the full PGSFormer consistently achieves the best performance across all datasets. Removing the progressive graph convolution (NPGC) or the subseries temporal representation learner (NSTRL) leads to notable performance degradation, especially at longer horizons, indicating that both dynamic spatial modeling and subsequence-aware temporal representation are essential. Compared with these module-level ablations, the mechanism-level variants (NPCS and NDiffusion) show more moderate but consistent performance drops, suggesting that parameterized cosine similarity and diffusion-based convolution play complementary roles in refining spatial dependency modeling. Notably, the performance gaps become more pronounced at the 60-minute horizon, highlighting that the proposed components are particularly critical for long-term traffic flow prediction under nonstationary conditions.

Table 6.

Ablation study.

Data	Models	15 min			30 min			60 min
Data	Models	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE
METR-LA	NPGC	2.74	5.19	6.96%	3.18	6.30	8.45%	3.60	7.41	10.13%
	NSTRL	2.72	5.17	6.99%	3.15	6.29	8.44%	3.59	7.39	10.12%
	NPCS	2.68	5.18	7.05%	3.15	6.31	8.56%	3.60	7.42	10.18%
	NDiffusion	2.71	5.23	7.12%	3.18	6.37	8.63%	3.66	7.50	10.31%
	PGSFormer	2.60	5.11	6.90%	3.08	6.23	8.38%	3.51	7.33	10.04%
PEMS-BAY	NPGC	1.28	2.72	2.73%	1.61	3.75	3.72%	1.96	4.48	4.53%
	NSTRL	1.25	2.71	2.70%	1.59	3.72	3.69%	1.93	4.43	4.49%
	NPCS	1.26	2.73	2.72%	1.59	3.75	3.71%	1.93	4.46	4.50%
	NDiffusion	1.29	2.76	2.78%	1.62	3.79	3.76%	1.97	4.51	4.57%
	PGSFormer	1.21	2.67	2.66%	1.55	3.68	3.64%	1.86	4.38	4.42%
PeMS04	NPGC	18.92	29.47	12.68%	19.94	30.86	13.21%	21.63	33.91	14.20%
	NSTRL	18.81	29.21	12.54%	19.72	30.55	13.07%	21.41	33.52	14.07%
	NPCS	18.68	29.04	12.38%	19.55	30.32	12.92%	21.23	33.18	13.91%
	NDiffusion	18.74	29.18	12.46%	19.61	30.49	12.98%	21.36	33.41	13.98%
	PGSFormer	17.47	28.32	12.01%	18.40	29.74	12.66%	19.98	31.89	13.76%
PeMS08	NPGC	14.02	22.51	9.18%	14.83	23.95	9.69%	16.21	26.42	10.38%
	NSTRL	13.94	22.37	9.12%	14.70	23.71	9.58%	16.03	26.11	10.25%
	NPCS	13.81	22.19	8.98%	14.55	23.49	9.47%	15.87	25.88	10.14%
	NDiffusion	13.88	22.28	9.04%	14.62	23.60	9.53%	15.95	26.01	10.19%
	PGSFormer	13.21	21.30	8.69%	14.01	23.00	9.22%	15.18	25.15	9.98%


From the quantitative results in Table 6, PGSFormer consistently achieves the best performance across all horizons on METR-LA and PEMS-BAY. On METR-LA, removing the progressive graph convolution (NPGC) increases MAE from 2.60 to 2.74 at 15 min and from 3.51 to 3.60 at 60 min, with the 60-minute RMSE rising from 7.33 to 7.41, indicating clear degradation in long-horizon prediction. The NSTRL variant shows a similar trend, with MAE increasing to 3.59 and MAPE to 10.12% at 60 min, confirming the importance of subsequence-aware temporal modeling. For mechanism-level ablations, replacing parameterized cosine similarity (NPCS) raises MAE by approximately 0.07–0.09 across horizons, while removing diffusion convolution (NDiffusion) causes even larger degradation, particularly at 60 min where RMSE increases from 7.33 to 7.50. A consistent pattern is observed on PEMS-BAY, where PGSFormer reduces MAE by 0.07–0.11 compared with ablated variants at 60 min, and achieves the lowest RMSE (4.38) and MAPE (4.42%). Overall, the numerical differences generally widen as the prediction horizon increases, quantitatively demonstrating that progressive spatial modeling and diffusion-based propagation are especially critical for long-term traffic forecasting.
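The per-horizon MAE increases discussed above can be recomputed directly from Table 6. The short script below does this for METR-LA; the dictionary layout and function name are illustrative choices, and the values are copied from the table.

```python
# MAE on METR-LA from Table 6; keys are prediction horizons in minutes.
FULL = {15: 2.60, 30: 3.08, 60: 3.51}
ABLATED = {
    "NPGC": {15: 2.74, 30: 3.18, 60: 3.60},
    "NSTRL": {15: 2.72, 30: 3.15, 60: 3.59},
    "NPCS": {15: 2.68, 30: 3.15, 60: 3.60},
    "NDiffusion": {15: 2.71, 30: 3.18, 60: 3.66},
}


def mae_increase(variant, full=FULL):
    """Absolute MAE increase of an ablated variant over the full model, per horizon."""
    return {h: round(variant[h] - full[h], 2) for h in full}


if __name__ == "__main__":
    for name, values in ABLATED.items():
        print(name, mae_increase(values))
```

On METR-LA the NDiffusion gap grows from 0.11 at 15 min to 0.15 at 60 min, whereas the NPGC gap is largest at 15 min (0.14) and smaller at 60 min (0.09), so the horizon trend differs between variants on this dataset.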

In summary, the progressive graph convolution module, progressive graph evolution, the Subseries Temporal Representation Learner and the other modules in the PGSFormer model each improve the prediction performance on spatio-temporal data from a different perspective, and none of them can be omitted without a loss in accuracy.

As shown in Figs. 13, 14 and 15, PGSFormer consistently achieves the best performance across all metrics and prediction horizons on the PEMS-BAY dataset. Replacing the parameterized cosine similarity or removing the diffusion-based progressive graph convolution leads to consistent error increases, indicating that these mechanisms are essential for accurately modeling dynamic spatial dependencies. The performance degradation becomes more pronounced at longer horizons, highlighting their importance for long-term forecasting under nonstationary traffic conditions. Overall, the results confirm that the proposed mechanisms complement each other and jointly contribute to the robustness and accuracy of PGSFormer.

Fig. 13 — Visualization of ablation results for MAE on the PEMS-BAY dataset.

Fig. 14 — Visualization of ablation results for RMSE on the PEMS-BAY dataset.

Fig. 15 — Visualization of ablation results for MAPE on the PEMS-BAY dataset.

Conclusion

This paper proposes PGSFormer, a novel framework for traffic flow prediction that integrates the Progressive Graph Convolutional Network (PGCN) and the Subsequence Transformer (STRL) to address the complex spatio-temporal dependence in traffic data. PGSFormer captures dynamic spatial correlations with the PGCN module and extracts long-term temporal features with the STRL. Experimental results on real-world datasets show that PGSFormer outperforms the baseline models on most metrics. The comparative analysis shows that PGSFormer not only provides more accurate predictions, but also perceives traffic flow trends more quickly. The effectiveness of the components in PGSFormer is verified through ablation experiments, visualization and analysis.

This study provides a new approach for progressive graph structure learning, which is particularly applicable to traffic flow prediction with spatio-temporal data. Although PGSFormer achieves superior performance under normal traffic conditions, its robustness under extreme scenarios such as severe weather, traffic accidents, and large-scale public events has not yet been fully investigated. Under such abnormal conditions, sudden non-stationarity and distribution shifts may challenge the stability of progressive graph construction and temporal representation learning. In addition, the current model relies solely on historical traffic data and does not explicitly incorporate external factors such as weather and incident information. Future work will therefore focus on enhancing the robustness of PGSFormer under abnormal traffic conditions through event-aware graph adaptation, multimodal data fusion, and lightweight model optimization to support real-time large-scale deployment.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (10.4 KB, docx)

Author contributions

All content was independently completed by Linlong Chen, who reviewed the results and approved the final version of the manuscript.

Funding

No funding was obtained.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request. https://github.com/Haku-zx/PGSFormer.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
