Abstract
Time series classification finds widespread applications in civil, industrial, and military fields, and the classification performance of time series models has been improving with the recent development of deep learning. However, issues of feature extraction effectiveness, model complexity, and model design uncertainty constrain the further development of time series classification. To address these issues, we propose a Lightweight Spatio-Temporal Decoupling Transformer framework based on Automated Machine Learning techniques (AutoLDT). The framework introduces a novel lightweight Transformer with fuzzy position encoding, a TS-separable linear self-attention mechanism, and a convolutional feedforward network, which mine the temporal and spatial features, as well as the local and global relationships, of time series. Fuzzy position encoding integrates fuzzy ideas to enhance the generalization performance of the model's information mining. The TS-separable linear self-attention mechanism and convolutional feedforward network achieve feature extraction in a lightweight way by decoupling the temporal and spatial features of time series. Notably, we adopt the Covariance Matrix Adaptation Evolution Strategy and a global adaptive pruning technique to realize automated network structure design, which further improves model training efficiency and automation and avoids the uncertainty of manual network design. Finally, we validate the effectiveness of the proposed framework on the publicly available UCR and UEA time series datasets. The experimental results show that the proposed framework not only improves classification performance in a lightweight way but also dramatically improves model training efficiency.
Keywords: Time series classification, TS-separable linear self-attention mechanism, Automated machine learning, Fuzzy position encoding, Covariance matrix adaptation evolution strategy
Subject terms: Computational science, Computer science, Information technology, Electrical and electronic engineering
Introduction
Time series are the most common and essential form of data in the real world, with extensive applications in finance, healthcare, energy, transportation, the military, etc1. Time series are data sequences arranged in chronological order, which record the state or behavior of a system or phenomenon over time and contain rich information and potential value. In particular, time series classification is an essential problem in time series analysis, which assigns time series to different categories or labels to better understand and predict their properties and behaviors. Time series classification has been widely used in many fields, such as target recognition2,3, disease diagnosis4, intrusion detection5, anomaly detection6, etc. Accurately classifying time series data can support decision-making, risk management, and resource optimization.
As the communication and big data fields rapidly develop, time series classification faces new challenges. First, time series are characterized by high dimensionality, complexity, and nonlinearity, and traditional classification methods can no longer handle current classification tasks. Second, time dependence, periodic patterns, and feature variable correlations in time series significantly impact classification results, but how to effectively capture and utilize these patterns remains a complex problem. Recently, the application of deep learning to time series classification has become more extensive. Models such as the Recurrent Neural Network (RNN)7, Long Short-Term Memory network (LSTM)8–10, and Convolutional Neural Network11,12 have shown excellent performance in processing time series data. In particular, the Transformer13 has rapidly risen in popularity, and its self-attention mechanism can efficiently capture long-term dependencies in time series, thus improving classification accuracy. As deep learning techniques continue to develop, we foresee that more deep learning methods will be applied to time series classification. Motivated by existing work, we study the problems of feature extraction and network architecture design for time series and propose a lightweight Spatio-Temporal Decoupling Transformer framework based on automated machine learning technology. The main contributions of this paper are as follows:
To enhance the characterization of time series and reduce the model complexity, we propose a Lightweight Spatio-Temporal Decoupling Transformer, which can effectively extract the spatio-temporal features of the time series and reduce the computational complexity by using the spatio-temporal feature separation strategy.
To improve the efficiency of model design and classification accuracy, we propose an automated machine learning framework for time series classification, which utilizes the covariance matrix adaptation evolution strategy (CMA-ES) algorithm and global adaptive pruning technique to achieve automated iterative optimization, avoiding the uncertainty and inefficiency of manual network design.
We evaluate the effectiveness of the proposed framework on the time series standard datasets UCR and UEA, and the experimental results show that the proposed framework achieves superior classification performance with low complexity compared to other remarkable methods, and the ablation experiments validate the advantages of the model structure.
Related work
Recently, researchers have proposed many methods and models for time series classification, which can be roughly categorized into feature-based methods, model-based methods, and deep learning-based methods. Among them, feature-based methods perform classification by extracting statistical, time-frequency, or shape features of time series data. Arul et al.14 propose a time series representation method based on the local similarity of the shapes of the subsequences and combine the shape-based representation with standard machine learning algorithms to make the algorithms more interpretable. Zuo et al.15 propose an SVP-T algorithm that takes time series subsequences as inputs and improves the self-attention mechanism by utilizing variable position encoding to enhance the attention weights of overlapping shapes, then validate the effectiveness of the proposed self-attention mechanism through experiments. However, feature-based methods are strongly sensitive to the selection of features, and there is still room for improvement in terms of classification accuracy and generalization performance.
Furthermore, model-based methods utilize the generation mechanism or pattern of time series data to build classification models, such as the Hidden Markov Model (HMM), Dynamic Time Warping (DTW), etc. Lahreche et al.16 propose a Local Extreme Value Dynamic Time Warping (LE-DTW) algorithm for the time series similarity metrics, which first utilizes local extremes to downsize the given time series, physically separating the minimum and maximum points and then adjusts the DTW metric to evaluate the similarity scores between the generated representations, providing more competitive results. Feremans et al.17 propose a pattern embedding-based time series classification method that can efficiently enumerate variable-length sequence patterns with intervals, construct embeddings based on the sequence patterns, and learn a linear model to use the discovered pattern compositions for classification interpretability. Model-based approaches are more adapted to the task of classifying large amounts of data, but the model feature extraction capability is still limited by manual model design, which requires researchers to have rich domain knowledge and experience in feature engineering.
In contrast, deep learning-based methods leverage neural networks to automatically learn feature representations and classification decision boundaries for time series data. Convolutional Neural Networks (CNNs) can automatically extract useful features from time series data through convolutional and pooling layers. Wang et al.18 propose a classification method named T-CNN based on Gram matrices, which converts the time series into time-domain images to preserve the temporal information, then utilizes an improved convolutional neural network and introduces a triplet network to compute the similarity between events and events of different categories, as well as to optimize the squared loss function of the CNN. Chen et al.19 propose a Multi-scale Attention Convolutional Neural Network (MACNN), which first utilizes multi-scale convolution to generate feature maps that capture patterns at different scales along the time axis and then enhances the useful feature maps by automatically learning their importance through the attention mechanism. Dempster et al.20 propose the MiniRocket method, which achieves remarkable time series classification accuracy by transforming the input time series using random convolutional kernels and utilizing the transformed features to train a linear classifier, with much lower computational overhead than most existing methods. CNNs are limited by the size of the receptive field of the convolutional kernel: although they can extract detail-dependent information, they are ineffective at extracting global features from long-range time series. Xiao et al.21 propose a temporal feature network (RTFN) that further enhances temporal feature extraction by combining a temporal network with an attention network based on long short-term memory (LSTMaN), which mine local features and correlations between features from the data, respectively.
In addition, to take full advantage of the temporal and spatial information in time series, Zhao et al.22 propose LSTM-MFCN, a multi-modal network composed of a multi-scale FCN and an LSTM. The gate-based LSTM network is used to extract the temporal dependence of the time series, while the FCN with multi-scale filters perceives different ranges of spatial features from the time series curves. Both temporal and spatial features are utilized to achieve better accuracy.
Notably, the Transformer has achieved remarkable success in natural language processing23,24 and computer vision25–27, has emerged in the time series field in recent years, and has gradually attracted more attention. The Transformer can effectively extract the global information of a time series by using the self-attention mechanism to obtain more distinguishable features. Wu et al.28 propose a network combining Transformer and convolutional structures, which incorporates the inductive bias of the convolutional neural network, extracts early features through convolutional layers, and proposes a squeeze-and-excitation convolutional encoder (BC-Encoder) that enhances the extraction of global information from time series. Chen et al.29 propose a dual-attention-based multivariate time series classification network (DA-Net) for mining local-global features. DA-Net consists of a Squeeze-Excitation Window Attention (SEWA) layer and a Sparse Self-Attention within Windows (SSAW) layer. The SEWA layer prioritizes key windows by explicitly establishing window dependencies to capture local window information, and the SSAW layer retains rich activation scores with less computation to expand the range of windows, capturing global long-range dependencies. Zhao et al.30 propose a flexible multi-head linear attention (FMLA) architecture that enhances local perception through layer-by-layer interaction with deformable convolutional blocks and online knowledge distillation, helps to reduce the effect of noise in the time series, and reduces redundancy in FMLA by probabilistically selecting and masking positions of each given sequence.
Foumani et al.31 propose a method that combines temporal Absolute Position encoding (tAPE), relative position encoding (eRPE), and convolutional encoding in a multivariate time series classification model named ConvTrans to improve the location and data embedding performance of time series data, which can further improve the classification accuracy. Yao et al.32 propose a multivariate time series analysis method named Contextual Dependency Vision Transformer (CD-ViT), which generates multi-grained semantic information based on the spectrogram and explores mutual dependencies between multi-variable and multi-temporal representations, constructing parallel structures to extract information at various grains based on periods.
Although existing methods and models have made significant progress in time series classification tasks, some issues and challenges still need to be further investigated and solved: (1) capturing and utilizing temporal and variable structural information, long-term correlation, and local dependence in time series data to achieve more efficient feature extraction; (2) reducing the computational complexity of the network and making the model more lightweight; (3) avoiding the uncertainty and inefficiency of manual network design and constructing an automated network structure design framework. Through a deep analysis of the characteristics of time series data and the challenges of classification tasks, we propose a lightweight Spatio-Temporal Decoupling Transformer framework based on automated machine learning technology, which effectively enhances the spatio-temporal feature characterization ability of the model and reduces model complexity, in particular improving the automated design and optimization of the model by using the CMA-ES algorithm and the global adaptive pruning algorithm.
Proposed architecture
In this section, we first describe the model structure of the Lightweight Spatio-Temporal Decoupling Transformer, then analyze the principles and details of the critical modules and the network structure search framework, focusing on the iterative optimization process of the network structure based on automated machine learning. A time series is a set of observations arranged in chronological order, which records the values of a variable or a set of variables at different points in time. The time points contain rich temporal relevance, as well as rich structural information between the feature variables at each time point. However, most existing methods mainly consider the temporal information of the time series and ignore the spatial structure information among the variables33, which leaves them with limitations in feature extraction. Liu et al.34 emphasize that by embedding time series variable features as tokens, the relationships between variable features can effectively characterize the temporal information, providing better performance than using only temporal relationships. Wang et al.2 argue that the temporal relationship is as important as the correlation of variable features and extract temporal and spatial features in parallel by building a temporal Transformer and a spatial Transformer architecture, which greatly improves model performance despite a significant increase in complexity. Meanwhile, to overcome the over-parameterization problem and reduce model complexity, we decouple the extraction of temporal and structural information and use depthwise convolution to independently extract the temporal information of the time series and the structural information between variable features, which effectively reduces the computational complexity of modeling temporal and structural information.
The model structure of the Lightweight Spatio-Temporal Decoupling Transformer is shown in Fig. 1.
Fig. 1.
Schematic diagram of lightweight spatio-temporal decoupling transformer structure. Depthwise convolution (DW Conv) is a separate convolution operation for each channel, and pointwise convolution (PW Conv) performs a 1 × 1 kernel to perform feature extraction on a single time point.
As shown in the upper part of Fig. 1, the Lightweight Spatio-Temporal Decoupling Transformer consists of fuzzy position encoding, a spatio-temporal separable linear self-attention mechanism (TS-separable linear self-attention), a convolutional feedforward network (Conv FFN), and a Softmax classifier. In detail, fuzzy position encoding increases the stochasticity of the input by introducing a Gaussian regularization term into the position encoding to improve generalization performance. The TS-separable linear self-attention mechanism decouples the temporal and spatial features of the time series and extracts nonlinear, diversified local temporal information separately; the local temporal and spatial information is then aggregated for global feature enhancement by the linear self-attention mechanism. Furthermore, the Conv FFN extracts temporal and spatial information through one-dimensional convolution and depthwise convolution. Finally, the categories of the samples are output by the Softmax classifier.
Fuzzy position encoding
Since the Transformer model discards the Recurrent Neural Network and Convolutional Neural Network as the basic models for sequence learning and instead relies exclusively on the self-attention mechanism, the model cannot directly capture the relative positional relationships between elements in a sequence. Position encoding provides an essential module for the Transformer to introduce temporal position information, encoding and characterizing temporal information by embedding position vectors. Traditional position encoding is generally deterministic, i.e., the encoding is fixed for each position in the sequence. However, real-world time series are affected by equipment and other sources of interference noise, so traditional position encoding alone cannot accurately reflect the true position information of the sequence. The work in35 introduces fuzzy ideas into labels to enhance generalization performance. Inspired by this work, we propose a fuzzy position encoding method that combines the fuzzy concept with a Gaussian noise regularization term to improve position encoding and enhance the generalization and robustness of the model.
We assume that the time series is $X = (x_1, x_2, \ldots, x_N)$, $X \in \mathbb{R}^{N \times d}$, where $N$ denotes the length of the sequence and $d$ denotes the vector embedding dimension. A Gaussian noise vector $\epsilon_i \in \mathbb{R}^{d}$ is generated for each position $i$ of the time series, where its components $\epsilon_{i,j} \sim \mathcal{N}(0, \sigma^{2})$ are sampled independently from a Gaussian distribution, and the strength of the regularization is controlled by adjusting the parameter $\sigma$. The position encoding $\mathrm{PE}_i$ of position $i$ is

$$\mathrm{PE}_i = x_i + p_i + \epsilon_i \tag{1}$$

where $x_i$ and $p_i$ denote the time series encoding vector and the encoding vector of the position, respectively. The explicit presentation of fuzzy position encoding increases the stochasticity of the model, facilitates control of the noise intensity and of the model's sensitivity to noise, and can effectively improve the generalization ability. More complex noise patterns or correlation structures could also be considered to better model the uncertainty or noise patterns in real data.
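As a concrete illustration, the fuzzy position encoding of Eq. (1) can be sketched as a standard sinusoidal position encoding plus an independent Gaussian noise term. The sinusoidal form of $p_i$, the function name, and the default noise strength `sigma` are illustrative assumptions, not details fixed by the paper:

```python
import torch

def fuzzy_position_encoding(x, sigma=0.1):
    """Fuzzy position encoding: sinusoidal encoding plus Gaussian noise.

    x:     (batch, N, d) embedded time series (d assumed even here).
    sigma: noise strength controlling the regularization (Eq. (1)).
    """
    batch, n, d = x.shape
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)           # (N, 1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32)
                    * (-torch.log(torch.tensor(10000.0)) / d))
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(pos * div)                                # p_i, even dims
    pe[:, 1::2] = torch.cos(pos * div)                                # p_i, odd dims
    noise = sigma * torch.randn(batch, n, d)                          # epsilon_i
    return x + pe.unsqueeze(0) + noise                                # Eq. (1)

out = fuzzy_position_encoding(torch.zeros(2, 16, 8), sigma=0.05)
print(out.shape)  # torch.Size([2, 16, 8])
```

Setting `sigma=0` recovers deterministic position encoding, which makes the role of the noise term as a tunable regularizer explicit.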
TS-separable linear self-attention
Time series consist of a series of time points, which contain rich temporal correlations among the time points in addition to rich structural information among the features at each time point. Motivated by previous works on the temporal and spatial structural features of time series2,32,33, we argue that the temporal information between time points and the structural information between variables should be utilized simultaneously. Deviating from the work of2, we modify the attention mechanism of the Transformer by first decoupling the extraction of temporal and spatial structural features of the embedded time series using depthwise convolution, then performing spatio-temporal feature fusion, and finally performing global perception of the fused features using a linear attention mechanism. Thus, we can use a model with lower complexity to construct deeper network structures and thereby extract deep global features.
As shown in the lower left of Fig. 1, we propose a TS-separable linear self-attention mechanism consisting of the depth-separable convolution and a linear attention mechanism. The proposed method decouples the time series data from the temporal and structural perspectives and independently extracts the temporal and variable structural information using Depthwise convolution. Then, the temporal and structural information are fused using Pointwise convolution. Finally, the fusion features are globally modeled by linear self-attention to enhance the critical information of the fusion features adaptively. Firstly, Depthwise convolution is utilized to extract the temporal and variable structural information of the time series. The local information is processed as
$$T = \mathrm{DWConv}(X), \quad S = \mathrm{DWConv}\left(X^{\top}\right) \tag{2}$$

where $T$ and $S$ denote local temporal and local structural features, respectively, and $\mathrm{DWConv}$ denotes depthwise convolution. To maintain feature consistency and obtain a richer feature representation, the local temporal and local structural features are fused using pointwise convolution. The computation is performed as

$$F = \mathrm{PWConv}\left(\mathrm{Concat}(T, S)\right) \tag{3}$$

where $F$ denotes the temporal and structural fusion features and $\mathrm{PWConv}$ denotes the pointwise convolution. The concept of separable convolution can be utilized to obtain diversified local fusion features, which contain rich temporal and structural information.
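A minimal PyTorch sketch of this decoupled extraction and fusion might look as follows; the module name `TSSeparableLocal`, the kernel size, and fusing by channel concatenation before the pointwise convolution are assumptions for illustration:

```python
import torch
import torch.nn as nn

class TSSeparableLocal(nn.Module):
    """Decoupled local feature extraction, a sketch in the spirit of Eqs. (2)-(3).

    Temporal branch: depthwise Conv1d over the time axis.
    Spatial branch:  depthwise Conv1d over the variable axis (via transpose).
    Fusion:          pointwise (1x1) convolution over the concatenated branches.
    """
    def __init__(self, d_model, seq_len, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # depthwise over time: one filter per feature channel
        self.dw_time = nn.Conv1d(d_model, d_model, kernel_size,
                                 padding=pad, groups=d_model)
        # depthwise over variables: one filter per time step
        self.dw_var = nn.Conv1d(seq_len, seq_len, kernel_size,
                                padding=pad, groups=seq_len)
        self.pw = nn.Conv1d(2 * d_model, d_model, kernel_size=1)  # Eq. (3)

    def forward(self, x):                         # x: (B, N, d)
        t = self.dw_time(x.transpose(1, 2))       # (B, d, N) temporal features
        s = self.dw_var(x).transpose(1, 2)        # (B, d, N) structural features
        fused = self.pw(torch.cat([t, s], dim=1))
        return fused.transpose(1, 2)              # back to (B, N, d)

m = TSSeparableLocal(d_model=8, seq_len=16)
y = m(torch.randn(2, 16, 8))
print(y.shape)  # torch.Size([2, 16, 8])
```

The `groups` argument is what makes each convolution depthwise: every channel (feature dimension or time step, depending on the branch) is filtered independently, which is the source of the complexity reduction.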
Subsequently, the fusion features are mapped to obtain the queries $Q = F W^{Q}$, keys $K = F W^{K}$, and values $V = F W^{V}$, respectively, using the embeddings in the linear self-attention mechanism, where $W^{Q}$, $W^{K}$, and $W^{V}$ are the parameter matrices.
To enhance the global modeling capability for long time series and reduce the quadratic complexity of the traditional self-attention mechanism, we use the associative law of matrix multiplication to replace the original calculation order, so that $\varphi(K)^{\top}$ and $V$ are combined first, which reduces the computational complexity to $O(N)$ in the sequence length. The traditional self-attention mechanism is

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V \tag{4}$$
and the linear attention value vector operation yields an output attention feature as

$$\mathrm{LinAttention}(Q, K, V) = \frac{\varphi(Q)\left(\varphi(K)^{\top} V\right)}{\varphi(Q)\left(\varphi(K)^{\top} \mathbf{1}\right)} \tag{5}$$

where $\varphi(\cdot)$ denotes the feature mapping acting on $Q$ and $K$, which uses a Gaussian kernel function. To extract richer feature patterns, the single head is extended to a multi-head attention mechanism, computed as

$$F_{\mathrm{MH}} = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^{O} \tag{6}$$

where $F_{\mathrm{MH}}$ denotes the multi-head self-attention output feature, $h$ denotes the number of self-attention heads, and $W^{O}$ denotes the parameter matrix.
From Eq. (4), the computational cost of the traditional self-attention mechanism is $O(N^{2})$, as is the memory requirement, since the complete attention matrix must be stored to compute the gradients for the queries, keys, and values. In contrast, the linear attention of Eq. (5) has a complexity of $O(N)$, since we can compute $\varphi(K)^{\top} V$ and $\varphi(K)^{\top} \mathbf{1}$ once and then reuse them for each query. We use depthwise convolution for extracting temporal and spatial information and pointwise convolution for fusion; both have a complexity of $O(N)$. Therefore, the complexity of TS-separable linear self-attention is lower than that of the traditional self-attention mechanism.
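The associativity trick behind this complexity reduction can be demonstrated with a short sketch. Note that the feature map here is the commonly used `elu(x) + 1` rather than the Gaussian kernel the paper adopts, and the function name is hypothetical:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Linear self-attention via the associative law of matrix multiplication.

    Instead of materializing the N x N matrix softmax(QK^T) (O(N^2) cost),
    apply a positive feature map phi and compute phi(K)^T V once, so the
    cost is linear in the sequence length N.
    """
    phi = lambda x: F.elu(x) + 1.0                       # positive feature map
    kv = phi(k).transpose(-2, -1) @ v                    # phi(K)^T V, computed once
    norm = phi(k).sum(dim=-2, keepdim=True)              # phi(K)^T 1, computed once
    return (phi(q) @ kv) / (phi(q) @ norm.transpose(-2, -1) + eps)

q, k, v = (torch.randn(2, 16, 8) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)  # torch.Size([2, 16, 8])
```

The intermediate `kv` has shape `(d, d)` per batch, independent of the sequence length, which is exactly why the memory footprint no longer grows quadratically with `N`.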
The TS-separable linear self-attention mechanism combines the advantages of depth-separable convolution and linear self-attention mechanism, which decouples the extraction of diversified nonlinear features in the time series by depth-separable convolution and realizes the aggregation of local temporal and variable structural features and then utilizes the linear self-attention mechanism for global perception and enhancement of the fused features.
Convolutional Feedforward Neural Network
TS-separable linear self-attention decouples the extraction and fusion of the temporal and structural information of the time series and realizes global information perception via the linear self-attention mechanism, effectively utilizing temporal and structural as well as local and global information. To realize lightweight feature extraction, we adopt a Convolutional Feedforward Neural Network (ConvFFN) that improves on the traditional FFN by extracting correlation information of the fused features along different dimensions while reducing the complexity of feature extraction. As shown in the lower right of Fig. 1, ConvFFN mainly consists of a one-dimensional convolutional neural network, depthwise convolution, a GELU activation function, and batch normalization layers. To extract the temporal and variable structure information of the attention fusion features, one-dimensional convolution and depthwise convolution are used to construct a three-layer network. The first layer of one-dimensional convolution in ConvFFN extracts the temporal features, and the computational process is as follows
$$Z_1 = \mathrm{BN}\!\left(\delta\!\left(\mathrm{Conv1D}\left(F_{\mathrm{att}}\right)\right)\right) \tag{7}$$

where $F_{\mathrm{att}}$ is the attention fusion feature, $\mathrm{BN}(\cdot)$ denotes the batch normalization layer, $\delta(\cdot)$ denotes the nonlinear activation function, and $\mathrm{Conv1D}(\cdot)$ denotes the one-dimensional convolutional layer. Then, the structural features among the variables are extracted by depthwise convolution after dimensional transformation, which is computed as
$$Z_2 = \mathrm{BN}\!\left(\delta\!\left(\mathrm{DWConv}\left(Z_1^{\top}\right)\right)\right) \tag{8}$$

where $\mathrm{DWConv}(\cdot)$ is the depthwise convolution, $\mathrm{BN}(\cdot)$ denotes the batch normalization layer, and $\delta(\cdot)$ denotes the nonlinear activation function. Finally, the fused features of the first and second layers are re-encoded by one-dimensional convolution, which is computed as
$$Z_3 = \mathrm{Conv1D}\!\left(Z_1 + Z_2^{\top}\right) \tag{9}$$

When the Transformer encoder stacks $N$ layers, the output features are $Z^{(N)}$, and finally, using the Softmax classifier, the output is

$$\hat{y}_c = \frac{e^{z_c}}{\sum_{k=1}^{C} e^{z_k}} \tag{10}$$

where $Z^{(N)}$ is the output of the $N$-th layer Transformer encoder, $e$ denotes the exponential function, and $C$ denotes the total number of categories. The deep abstract features of the time series data are mined by the stacked feature encoders to improve the classification performance.
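A possible PyTorch sketch of the three-layer ConvFFN described by Eqs. (7)-(9); the kernel sizes, the exact placement of batch normalization, and fusing the two branches by addition are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """Convolutional feedforward network, a sketch in the spirit of Eqs. (7)-(9).

    Layer 1: Conv1d over the time axis + GELU + BatchNorm -> temporal features
    Layer 2: depthwise Conv1d over the variable axis       -> structural features
    Layer 3: Conv1d re-encoding the fused (summed) branches.
    """
    def __init__(self, d_model, seq_len, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(d_model, d_model, kernel_size, padding=pad)
        self.bn1 = nn.BatchNorm1d(d_model)
        self.dw = nn.Conv1d(seq_len, seq_len, kernel_size,
                            padding=pad, groups=seq_len)       # depthwise
        self.bn2 = nn.BatchNorm1d(seq_len)
        self.conv3 = nn.Conv1d(d_model, d_model, kernel_size, padding=pad)
        self.act = nn.GELU()

    def forward(self, x):                                      # x: (B, N, d)
        z1 = self.bn1(self.act(self.conv1(x.transpose(1, 2))))    # Eq. (7)
        z2 = self.bn2(self.act(self.dw(z1.transpose(1, 2))))      # Eq. (8)
        out = self.conv3(z1 + z2.transpose(1, 2))                 # Eq. (9)
        return out.transpose(1, 2)                             # (B, N, d)

ffn = ConvFFN(d_model=8, seq_len=16)
y = ffn(torch.randn(2, 16, 8))
print(y.shape)  # torch.Size([2, 16, 8])
```

The transpose between layers is the "dimensional transformation" of Eq. (8): it swaps the roles of the time and variable axes so that the depthwise convolution filters across variables instead of time steps.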
Optimization procedure of AutoML method
To promote the efficiency of model design and training, we propose a network structure optimization method based on the CMA-ES algorithm to achieve automated design and optimization of the model structure, avoiding the uncertainty and inefficiency of manual design. The automated network training and testing architecture is built using the CMA-ES algorithm, which uses a Gaussian probability distribution to model the current optimal region of the search space and adaptively adjusts the search strategy based on the history of the search points. The search process follows the maximum entropy principle and uses a multivariate normal distribution to generate new search points. CMA-ES achieves fast convergence through evolution-path-based step-size control, while covariance matrix adaptation increases the likelihood of successful steps and scales well with problem size. Notably, the CMA-ES algorithm mainly involves updating the mean vector, the covariance matrix, and the step size, with the mean calculated as a weighted average of selected individuals in the current population. The weights are determined by the fitness ranking of the individuals, and the mean update is calculated as
$$m^{(t+1)} = m^{(t)} + c_m \sum_{i=1}^{\mu} w_i \left(x_i^{(t+1)} - m^{(t)}\right) \tag{11}$$

where $m^{(t)}$ is the previous generation mean, $c_m$ is the learning rate for the mean update, $w_i$ is the weight of the $i$-th individual, and $x_i^{(t+1)}$ is the $i$-th best individual in the current population (i.e., a model structural parameter configuration). The update of the covariance matrix $C$ considers the old covariance matrix, the evolutionary path, and the difference between the selected individuals in the current population and the mean value, which is calculated as
$$C^{(t+1)} = \left(1 - c_1 - c_\mu\right) C^{(t)} + c_1\, p_c\, p_c^{\top} + c_\mu \sum_{i=1}^{\mu} w_i\, y_i\, y_i^{\top} \tag{12}$$

where $C^{(t)}$ is the covariance matrix of the current generation, $c_1$ and $c_\mu$ are the learning rates of the covariance matrix update, $y_i = \left(x_i^{(t+1)} - m^{(t)}\right)/\sigma^{(t)}$, and $p_c$ is the evolutionary path accumulating successful optimization directions over successive steps. The update of the step size $\sigma$ aims to adjust the scope of exploration, and the step size is updated as
$$\sigma^{(t+1)} = \sigma^{(t)} \exp\!\left(\frac{c_\sigma}{d_\sigma} \left(\frac{\left\lVert p_\sigma \right\rVert}{E\left\lVert \mathcal{N}(0, I) \right\rVert} - 1\right)\right) \tag{13}$$

where $\sigma^{(t)}$ is the step size of the current generation, $c_\sigma$ and $d_\sigma$ are the parameters of the step-size update, $p_\sigma$ is the evolutionary path of the step size, and $E\lVert \mathcal{N}(0, I) \rVert$ is the expected Euclidean length of a standard normal distribution vector, used for normalization. The pseudo-code of the CMA-ES algorithm is shown in Algorithm 1.
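The spirit of the search loop can be conveyed with a deliberately simplified optimizer: it implements the rank-weighted mean update of Eq. (11) (with $c_m = 1$) but replaces the full covariance and path-based step-size adaptation of Eqs. (12)-(13) with a fixed geometric step-size decay, so it is an illustration rather than a faithful CMA-ES implementation:

```python
import numpy as np

def evolve(f, x0, sigma0=0.5, iters=80, lam=12, seed=0):
    """Rank-weighted evolutionary search in the spirit of Eq. (11).

    Samples lam candidates around the mean, keeps the best mu, updates the
    mean with log-rank weights, and geometrically decays the step size in
    place of the path-based adaptation of Eqs. (12)-(13).
    """
    rng = np.random.default_rng(seed)
    mean, sigma = np.asarray(x0, dtype=float), sigma0
    n, mu = len(mean), lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))    # rank-based weights
    w /= w.sum()
    for _ in range(iters):
        pop = mean + sigma * rng.standard_normal((lam, n))  # sample population
        order = np.argsort([f(x) for x in pop])             # rank by fitness
        mean = w @ pop[order[:mu]]                          # Eq. (11), c_m = 1
        sigma *= 0.97                                       # simplified step decay
    return mean

sphere = lambda x: float(np.sum(x ** 2))
sol = evolve(sphere, [2.0, -1.5])
print("final objective:", sphere(sol))
```

No gradient of `f` is ever evaluated, which is exactly what allows the same loop to optimize discrete-ish quantities such as a model's structural parameters, with accuracy as the objective.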
The architecture of the neural network structure optimization algorithm based on the CMA-ES algorithm is shown in Fig. 2.
Fig. 2.
Schematic of the network structure optimization method.
As shown in Fig. 2, the network structure optimization method based on the CMA-ES algorithm contains two critical stages: model training and model testing. The training stage includes the training set, the search space, the CMA-ES optimization algorithm, model building, model training and evaluation, training condition judgment, and parameter transfer. The inference stage includes the test set, model reconstruction, and model inference parts. Taking the designed lightweight Transformer as the base model and utilizing the CMA-ES-based algorithm for network parameter design, the specific training and inference steps are as follows:
- Training stage
- Define the optimization problem
- Define the model structural parameters as the search space of CMA-ES.
- Define the model accuracy as the objective function of CMA-ES, which is used to evaluate the model performance.
- Parameter initialization
- Initialize the CMA-ES parameters: population size $\lambda$, parent size $\mu$, initial step size $\sigma_0$, covariance matrix $C$, learning rate $c$, and maximum number of iterations $T_{\max}$; use default values for the other parameters.
- Initialize the structural parameters of the Transformer model.
- Define the monitoring metric parameters of the pruning algorithm.
- Train and evaluate
- Build the Transformer network using the initial structural parameter search space.
- Train the model and monitor, through the pruning strategy, whether training reaches the early-stop condition: if the accuracy has not improved for several consecutive epochs, the current round of model training is completed early; otherwise, training continues to the maximum number of iterations $T_{\max}$.
- Evaluate the performance of the candidate solutions on the test set, then determine and save the current optimal model.
- Iteration condition determination
- Determine whether the maximum number of iterations has been reached. If so, output the model structure parameters to the testing stage; if not, update the model structure parameters according to the CMA-ES algorithm and return to the train-and-evaluate step for iterative training.
- Model reconstruction
- Rebuild the Transformer network based on the model structure parameters transmitted from the training stage.
- Model inference
- Evaluate the classification performance of the model using a test set.
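The pruning step in the train-and-evaluate stage can be sketched as a patience-based early-stopping loop; `build_model` and `train_eval` are hypothetical callables standing in for model construction and one epoch of training plus validation:

```python
def train_with_pruning(build_model, train_eval, max_epochs=50, patience=5):
    """Train one candidate with patience-based early stopping (a sketch of
    the pruning strategy). train_eval(model) is assumed to run one epoch
    and return the current validation accuracy.
    """
    model = build_model()
    best_acc, stale = 0.0, 0
    for epoch in range(max_epochs):
        acc = train_eval(model)
        if acc > best_acc:
            best_acc, stale = acc, 0       # improvement: reset the counter
        else:
            stale += 1
            if stale >= patience:          # accuracy stalled: prune this trial
                break
    return best_acc

# toy example: accuracy rises, then plateaus, so the loop stops early
history = iter([0.5, 0.6, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.9])
best = train_with_pruning(lambda: None, lambda m: next(history))
print(best)  # 0.7
```

In the full framework, the returned accuracy would be the fitness fed back to CMA-ES, so pruned trials cost only a few epochs instead of a full training run.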
The network structure optimization method based on the CMA-ES algorithm optimizes the structural parameters of Transformer without relying on gradient information, effectively improves the model performance through global search, and realizes the automation and intelligence of the model structural parameter search, which not only reduces the burden of manual tuning, but also improves the reliability and consistency of the optimization results.
Experiments and results
In this section, we thoroughly present the datasets that were employed throughout the experimental process, along with the evaluation metrics utilized to assess the performance of our experiments. Furthermore, we develop an in-depth analysis of the results obtained from the comparison experiments, ablation studies, and model complexity assessments.
Datasets and experiment setup
Datasets. We rigorously validate our proposed model using both the univariate time series classification dataset and the multivariate time series classification dataset: UCR Archive 201836 and UEA Archive 201837. Details of our datasets are presented in Tables 1 and 2.
Table 1.
Details of UCR dataset.
| Datasets | Type | Train | Test | Length | Classes |
|---|---|---|---|---|---|
| ACSF1 | Device | 100 | 100 | 1460 | 10 |
| BME | Simulated | 30 | 150 | 128 | 3 |
| ECG5000 | ECG | 500 | 4500 | 140 | 5 |
| LargeKitchenAppliances | Device | 375 | 375 | 720 | 3 |
| Plane | Sensor | 105 | 105 | 144 | 7 |
| SyntheticControl | Simulated | 300 | 300 | 60 | 6 |
| SemgHandGenderCh2 | Spectrum | 300 | 600 | 1500 | 2 |
| UWaveGestureLibraryAll | Motion | 896 | 3582 | 945 | 8 |
| Wafer | Sensor | 1000 | 6164 | 315 | 2 |
| Wine | Spectro | 57 | 54 | 234 | 2 |
Table 2.
Details of UEA dataset.
| Datasets | Type | Train | Test | Dimensions | Length | Classes |
|---|---|---|---|---|---|---|
| EthanolConcentration | Spectra | 261 | 263 | 9 | 1751 | 4 |
| Handwriting | Motion | 150 | 850 | 3 | 152 | 26 |
| SelfRegulationSCP1 | EEG | 268 | 293 | 6 | 896 | 2 |
| UWaveGestureLibrary | Human Activity Recognition | 120 | 320 | 3 | 315 | 8 |
Baseline. We selected several methods that have demonstrated strong performance on time series classification tasks in recent years as our baselines: Transformer [13], Informer [38], Pyraformer [39], Crossformer [40], TimesNet [41], iTransformer [34], and MiniRocket [20].
Evaluation. Accuracy is a widely adopted metric for evaluating model performance in time series classification tasks. It provides a high-level overview of a model’s classification ability by quantifying the overall correctness of its predictions [41].
Implementation Details. All experiments are executed on the same device, equipped with an Intel® Xeon® Silver 4214R CPU, an NVIDIA A100 80GB GPU, and 128GB of RAM. The software environment is configured with Python 3.9, PyTorch 1.12.1, and CUDA 12.1 to ensure a reproducible research environment.
Comparison results and discussion
As shown in Table 3, our proposed AutoLDT is superior to the other baseline models on the time series classification task, achieving an average accuracy of 81.21%. We conduct comprehensive experiments on 14 different datasets; AutoLDT achieves optimal performance on 5 of them, namely BME, LargeKitchenAppliances, SyntheticControl, Wine, and UWaveGestureLibrary, and improves average accuracy by at least 0.63% over the seven competing methods. Second place goes to iTransformer, which achieves optimal performance on 4 datasets with an average accuracy of 80.58%, further demonstrating the effectiveness of extracting spatial structural information. TimesNet, despite achieving optimal performance on only 1 dataset, reaches an average accuracy of 80.41%, demonstrating that intra-periodic and inter-periodic data contain valid time-series information. The average accuracy of the other five methods falls below 80%, so their classification performance is lower than that of the proposed method. The proposed method fully utilizes the temporal and spatial information of the time series: it improves average accuracy by 1.76% over the traditional Transformer, which uses temporal information, and by more than 0.63% over iTransformer, which uses spatial information, demonstrating the effectiveness of AutoLDT in exploiting fused temporal-spatial information. In future work, lightweight spatio-temporal feature fusion methods need further study to extract more efficient fused features.
Table 3.
Performance of different models on UCR and UEA datasets. We report classification accuracy (%) as the primary metric to evaluate the performance of models.
| Datasets / Models | Transformer 2017 | Informer 2021 | MiniRocket 2021 | Pyraformer 2022 | Crossformer 2023 | TimesNet 2023 | iTransformer 2024 | AutoLDT (Ours) |
|---|---|---|---|---|---|---|---|---|
| ACSF1 | 78 | 78 | 82.8 | 80 | 69 | 80 | 81 | 80 |
| BME | 97.33 | 97.33 | 99 | 98.67 | 98 | 98 | 98 | 99.33 |
| ECG5000 | 94.04 | 93.87 | 93.7 | 94.09 | 93.73 | 94.17 | 94.38 | 94.33 |
| LargeKitchenAppliances | 48.8 | 43.73 | 46.72 | 44.53 | 46.4 | 47.78 | 50.8 | 51.2 |
| Plane | 98.1 | 97.14 | 99.1 | 99.05 | 99.05 | 98.2 | 98.46 | 98.1 |
| SyntheticControl | 92.33 | 94.33 | 92.4 | 96 | 91.33 | 96 | 94.33 | 96 |
| SemgHandGenderCh2 | 88.17 | 86.5 | 87.3 | 87.33 | 87 | 87.47 | 88.03 | 87 |
| UWaveGestureLibraryAll | 95.14 | 94.78 | 93.57 | 95.42 | 88.5 | 94.41 | 94.33 | 94.86 |
| Wafer | 99.28 | 99.5 | 99.1 | 99.51 | 99.21 | 99.51 | 99.59 | 99.51 |
| Wine | 79.63 | 81.48 | 80.74 | 85.19 | 85.19 | 85.33 | 83.33 | 88.89 |
| EthanolConcentration | 32.7 | 31.6 | 38.20 | 30.8 | 38 | 35.7 | 41.22 | 37.40 |
| Handwriting | 32 | 32.8 | 30.5 | 29.4 | 28.8 | 32.1 | 28.56 | 30.86 |
| SelfRegulationSCP1 | 91.2 | 90.1 | 91.44 | 88.1 | 92.1 | 91.8 | 91.1 | 91.75 |
| UWaveGestureLibrary | 85.6 | 85.6 | 77.29 | 85.9 | 85.3 | 85.3 | 84.95 | 87.77 |
| Average Accuracy | 79.45 | 79.05 | 79.42 | 79.57 | 78.69 | 80.41 | 80.58 | 81.21 |
Ablation studies
In this section, we aim to validate the effectiveness of the various modules within our proposed model: PE, Attention, and FFN. We conducted experiments on the Wafer dataset, replacing each module individually while keeping the other components unchanged. The results obtained from these experiments are presented in Table 4.
Table 4.
Results of the ablation experiments.
| Model | PE | Attention | FFN | Accuracy (%) | Parameters |
|---|---|---|---|---|---|
| AutoLDT (ours) | Fuzzy PE | TS-SLA | Conv FFN | 99.51 | 0.67 M |
| Replace PE | PE | TS-SLA | Conv FFN | 99.27 | 0.67 M |
| Replace attention | Fuzzy PE | Multi-head attention | Conv FFN | 99.48 | 3.48 M |
| Replace FFN | Fuzzy PE | TS-SLA | FFN | 99.25 | 1.65 M |
| Replace all | PE | Multi-head attention | FFN | 99.28 | 4.97 M |
Fuzzy Position Encoding. We conducted an experiment in which the fuzzy position encoding in the AutoLDT model was substituted with standard position encoding. This modification reduced the model’s accuracy to 99.27%, with no significant change in parameter count. Our analysis suggests that fuzzy position encoding, by injecting controlled noise into the positional information, pushes the model to extract more abstract and highly distinguishable features rather than overfitting to noise contained in the original data, thus enhancing the generalization ability of AutoLDT. The observed accuracy degradation suggests that removing the fuzzy encoding hinders the model’s ability to capture such fine-grained sequence information.
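One plausible form of such an encoding, sketched below under the assumption that it is a standard sinusoidal encoding evaluated at Gaussian-perturbed positions (the paper’s exact formula is not given in this section), illustrates the idea of "blurring" positional information:

```python
import numpy as np

def fuzzy_position_encoding(seq_len, d_model, sigma=0.1, rng=None):
    """Sketch of a fuzzy position encoding (assumed form): sinusoidal
    encoding evaluated at positions perturbed by Gaussian noise, so the
    model sees slightly blurred positions during training. d_model must
    be even; sigma=0 recovers the standard encoding."""
    rng = np.random.default_rng(rng)
    pos = np.arange(seq_len, dtype=float)[:, None]
    pos = pos + rng.normal(0.0, sigma, size=pos.shape)  # fuzzify positions
    div = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(pos * div)
    pe[:, 1::2] = np.cos(pos * div)
    return pe
```

At inference time one would typically set `sigma=0` so that positions are deterministic.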
TS-Separable Linear Self-Attention. We conducted an experimental study wherein the TS-Separable Linear Self-Attention was substituted with a multi-head attention mechanism. This substitution led to a marginal decrease in the model’s accuracy, to 99.48%, yet a significant increase in the model’s parameter count of 2.81 M. Upon analyzing these outcomes, we deduce that TS-Separable Linear Self-Attention significantly enhances the model’s capability to extract spatio-temporal features from time-series data, aiding the model in capturing global feature information and thereby elevating its overall performance. Moreover, the results underscore the notable advantage of TS-Separable Linear Self-Attention in terms of model lightweighting, as it maintains high performance with a lower parameter count.
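The decoupling idea can be sketched as follows, assuming a standard softmax-free linear attention applied independently along the time axis and the variable (spatial) axis, with the two branch outputs summed. The `w_t`/`w_s` projection weights and the elu+1 kernel feature map are illustrative assumptions; the paper’s exact parameterization may differ.

```python
import numpy as np

def linear_attention(q, k, v):
    """Softmax-free linear attention: O(n d^2) instead of O(n^2 d).
    elu(x)+1 keeps the kernel feature map positive."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))
    kv = phi(k).T @ v                                     # (d, d_v) summary
    norm = phi(q) @ phi(k).sum(axis=0, keepdims=True).T   # (n, 1)
    return (phi(q) @ kv) / (norm + 1e-6)

def ts_separable_attention(x, w_t, w_s):
    """Sketch of TS-separable attention on x of shape (time, channels):
    a temporal branch mixes time steps and a spatial branch mixes
    channels; tying w_s to sequence length is a simplification."""
    temporal = linear_attention(x @ w_t, x @ w_t, x)          # mix time steps
    spatial = linear_attention(x.T @ w_s, x.T @ w_s, x.T).T   # mix channels
    return temporal + spatial
```

Because each branch only ever forms a `(d, d)` summary rather than an `(n, n)` attention map, the cost grows linearly in sequence length, which is where the parameter and compute savings over multi-head attention come from.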
Convolutional Feedforward Neural Network. In this experiment, we substituted the Convolutional Feedforward Neural Network with a conventional feedforward neural network. The accuracy of the substituted model diminished to 99.25%, while the model’s parameter count rose to 1.65 M. Replacing the primary fully-connected layers with lightweight convolutional layers reduces the number of parameters, since convolutional layers require far less parameter storage. The Convolutional Feedforward Neural Network contains three convolutional layers that extract the globally enhanced features from different dimensions; compared with the traditional perceptron network, this deeper structure is conducive to extracting deeper abstract features. This outcome underscores the efficacy of the Convolutional Feedforward Neural Network in extracting profound spatio-temporal features, while its lightweight architecture improves the model’s operational efficiency, demonstrating its potential for optimizing both performance and resource utilization.
Besides, when we replace all three modules (PE, Attention, and FFN) simultaneously, the accuracy drops to 99.28% and the parameter count grows roughly sevenfold to 4.97 M, which further demonstrates the lightness and effectiveness of AutoLDT.
Complexity analysis
As dataset lengths can vary and subsequently impact model parameters and complexity, we undertook a comprehensive analysis by standardizing all models under the same dataset and hyper-parameter settings. This approach allowed us to compare the number of model parameters and complexity across the board, leading to the results presented in Table 5.
Table 5.
Complexity analysis of models on the Wafer dataset.
| Model | Parameters | MACs | Training time (s) |
|---|---|---|---|
| AutoLDT (Ours) | 0.67 M | 1.58 M | 21 |
| Transformer | 4.97 M | 4.97 M | 105 |
| MiniRocket | 0.59 M | 1.74 M | 32 |
| Informer | 4.72 M | 4.74 M | 108.7 |
| Pyraformer | 5.17 M | 14.64 M | 154.4 |
| Crossformer | 13.92 M | 1.6 G | 238 |
| TimesNet | 1.6 G | 1.57 G | 287.5 |
| iTransformer | 3.31 M | 479.09 M | 167.1 |
We rigorously assessed all models within a standardized environment, evaluating model parameter count, computational complexity, and training duration. The results demonstrate the superior performance of AutoLDT in terms of both parameter efficiency and computational simplicity. Compared to AutoLDT, the TimesNet model exhibits far higher parameter counts and computational requirements, with approximately 1.6 G parameters and 1.57 G MACs. This overhead primarily stems from its approach of decomposing complex temporal variations into multiple intra- and inter-periodic components: while this enhances the model’s ability to accurately capture multiple periodic patterns within time series data, it also increases model complexity, elevating both parameter count and computational demand. Meanwhile, the Crossformer model underscores the significance of cross-dimensional dependencies in multivariate time series analysis. To capture and harness these intricate relationships, Crossformer employs dimensional segmentation embedding, a two-stage attention mechanism, and a hierarchical encoder-decoder framework; these advancements, while improving model performance, lead to a substantial increase in computational complexity, particularly evident when dealing with cross-dimensional data. In our experiments, although Crossformer’s parameter count of 13.92 M is far lower than TimesNet’s, its computational complexity stands at 1.6 G MACs, highlighting the trade-off between performance and computational requirements. AutoLDT, in contrast, requires a mere 0.67 M parameters and 1.58 M MACs. This remarkable efficiency distinguishes AutoLDT from its contemporaries and underscores its validity and computational feasibility.
Notably, although MiniRocket has fewer parameters than AutoLDT, its computation and training time are still slightly worse. Because AutoLDT uses the more lightweight depthwise and pointwise convolutions, its computational efficiency exceeds that of traditional convolution.
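As a rough illustration of why the depthwise + pointwise factorization is cheaper, compare the parameter counts of a standard 1D convolution and its depthwise-separable counterpart (the 64-channel, kernel-3 figures below are illustrative, not AutoLDT’s actual layer sizes):

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard 1D convolution (bias omitted)."""
    return c_out * c_in * k

def separable_conv_params(c_in, c_out, k):
    """Depthwise (one k-tap filter per input channel) followed by a
    pointwise (1x1) convolution mapping c_in -> c_out channels."""
    return c_in * k + c_in * c_out

full = conv_params(64, 64, 3)           # 64 * 64 * 3 = 12288
sep = separable_conv_params(64, 64, 3)  # 64 * 3 + 64 * 64 = 4288
```

Here the factorized version needs roughly a third of the parameters, and the gap widens with larger kernels or more channels.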
Furthermore, AutoLDT excels in training efficiency, requiring only 21 s to complete the entire training process. The experimental results reveal notable disparities between the baselines and AutoLDT in parameter counts, computational complexity, and training time, underscoring AutoLDT’s distinct advantages in compactness and efficiency. In future work, we will explore other efficient time series models, such as CapsNets [42] and Mamba [43], to further improve processing efficiency.
Conclusion
In this paper, we propose a lightweight Transformer classification framework based on automated machine learning, named AutoLDT, which effectively enhances spatio-temporal feature characterization while reducing model complexity. Besides, we adopt the covariance matrix adaptation evolution strategy and a global adaptive pruning technique to realize efficient automated network structure design, further improving the model’s training efficiency and level of automation. The comparison experiments demonstrate that the proposed method achieves superior classification performance at a more desirable lightweight level: an average accuracy of 81.21% on 14 datasets, with a parameter count reduced by a factor of 7 compared to the traditional Transformer.
However, because linear attention discards the Softmax operation, it loses information diversity; to minimize this loss, we introduce additional parallel convolution operations in time and space, so the proposed method still has limitations in spatio-temporal feature fusion. In future work, more efficient feature fusion and model pruning methods need to be explored to improve the accuracy and inference efficiency of time series classification with lower information loss.
Acknowledgements
This work was supported by the National Natural Science Foundation of China [Grant Numbers 61876189, 61703426, 61273275]; the Young Talent Fund of University Association for Science and Technology in Shaanxi, China [Grant Number 20190108]; and the Innovation Talent Supporting Project of Shaanxi, China [Grant Number 2020KJXX-065].
Author contributions
Peng Wang: Conceptualization, methodology, resources, writing—original draft. Ke Wang: Software, validation, writing—review and editing. Yafei Song: Funding acquisition, project administration, supervision. Xiaodan Wang: Methodology, funding acquisition, project administration, supervision.
Data availability
The datasets and code used during the current study are available from the corresponding author on reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Foumani, N. M. et al. Deep learning for time series classification and extrinsic regression: A current survey. ACM Comput. Surv. 56(9), 1–54 (2024).
- 2. Wang, X. et al. High-resolution range profile sequence recognition based on transformer with temporal–spatial fusion and label smoothing. Adv. Intell. Syst. 5(11), 2300286 (2023).
- 3. Wang, X. et al. Recognition of high-resolution range profile sequence based on TCN with sequence length-adaptive algorithm and elastic net regularization. Expert Syst. Appl., 123417 (2024).
- 4. Zhou, H.-Y., Yu, Y. et al. A transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics. Nat. Biomed. Eng. 7(6), 743–755 (2023).
- 5. Saheed, Y. K., Abiodun, A. I., Misra, S. et al. A machine learning-based intrusion detection for detecting internet of things network attacks. Alexandria Eng. J. 61(12), 9395–9409 (2022).
- 6. Saba, T. et al. Anomaly-based intrusion detection system for IoT networks through deep learning model. Comput. Electr. Eng. 99, 107810 (2022).
- 7. Hüsken, M. & Stagge, P. Recurrent neural networks for time series classification. Neurocomputing 50, 223–235 (2003).
- 8. Karim, F. & Majumdar, S. Insights into LSTM fully convolutional networks for time series classification. IEEE Access 7, 67718–67725 (2019).
- 9. Karim, F. et al. Multivariate LSTM-FCNs for time series classification. Neural Netw. 116, 237–245 (2019).
- 10. Yu, Y. et al. LSTM-based intrusion detection system for VANETs: a time series classification approach to false message detection. IEEE Trans. Intell. Transp. Syst. 23(12), 23906–23918 (2022).
- 11. Fauvel, K. & Lin, T. XCM: an explainable convolutional neural network for multivariate time series classification. Mathematics 9(23), 3137 (2021).
- 12. Hssayni, E. H., Joudar, N. E. & Ettaouil, M. A deep learning framework for time series classification using normal cloud representation and convolutional neural network optimization. Comput. Intell. 38(6), 2056–2074 (2022).
- 13. Vaswani, A. et al. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), 6000–6010 (2017).
- 14. Arul, M. Applications of shapelet transform to time series classification of earthquake, wind and wave data. Eng. Struct. 228, 111564 (2021).
- 15. Zuo, R. et al. SVP-T: A shape-level variable-position transformer for multivariate time series classification. In Proceedings of the AAAI Conference on Artificial Intelligence 37(9), 11497–11505 (2023).
- 16. Lahreche, A. & Boucheham, B. A fast and accurate similarity measure for long time series classification based on local extrema and dynamic time warping. Expert Syst. Appl. 168, 114374 (2021).
- 17. Feremans, L. & Cule, B. PETSC: Pattern-based embedding for time series classification. Data Min. Knowl. Disc. 36(3), 1015–1061 (2022).
- 18. Wang, J. et al. A T-CNN time series classification method based on Gram matrix. Sci. Rep. 12(1), 15731 (2022).
- 19. Chen, W. Multi-scale attention convolutional neural network for time series classification. Neural Netw. 136, 126–140 (2021).
- 20. Dempster, A., Schmidt, D. F. & Webb, G. I. MiniRocket: a very fast (almost) deterministic transform for time series classification. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 248–257 (2021).
- 21. Xiao, Z. & Xu, X. RTFN: A robust temporal feature network for time series classification. Inf. Sci. 571, 65–86 (2021).
- 22. Zhao, L., Mo, C., Ma, J. & Chen, Z. LSTM-MFCN: A time series classifier based on multi-scale spatial–temporal features. Comput. Commun. 182, 52–59 (2022).
- 23. Geneva, N. & Zabaras, N. Transformers for modeling physical systems. Neural Netw. 146, 272–289 (2022).
- 24. Nassiri, K. & Akhloufi, M. Transformer models used for text-based question answering systems. Appl. Intell. 53(9), 10602–10635 (2023).
- 25. Li, G. et al. TransGait: Multimodal-based gait recognition with set transformer. Appl. Intell. 53(2), 1535–1547 (2023).
- 26. Su, W. et al. Hybrid token transformer for deep face recognition. Pattern Recogn. 139, 109443 (2023).
- 27. Zhu, S. et al. R2Former: Unified retrieval and reranking transformer for place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023).
- 28. Wu, Y. et al. An aggregated convolutional transformer based on slices and channels for multivariate time series classification. IEEE Trans. Emerg. Top. Comput. Intell. 7(3), 768–779 (2023).
- 29. Chen, R. et al. DA-Net: Dual-attention network for multivariate time series classification. Inf. Sci. 610, 472–487 (2022).
- 30. Zhao, B. et al. Rethinking attention mechanism in time series classification. Inf. Sci. 627, 97–114 (2023).
- 31. Foumani, N. M. et al. Improving position encoding of transformers for multivariate time series classification. Data Min. Knowl. Disc. 38(1), 22–48 (2024).
- 32. Yao, J. et al. Contextual dependency vision transformer for spectrogram-based multivariate time series analysis. Neurocomputing 572, 127215 (2024).
- 33. Middlehurst, M. & Schäfer, P. Bake off redux: A review and experimental evaluation of recent time series classification algorithms. Data Min. Knowl. Disc. 38, 1958–2031 (2024).
- 34. Liu, Y. et al. iTransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625 (2023).
- 35. Müller, R., Kornblith, S. & Hinton, G. When does label smoothing help? In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 4694–4703 (2019).
- 36. Dau, H. et al. The UCR time series archive. IEEE/CAA J. Autom. Sin. 6(6), 1293–1305 (2019).
- 37. Bagnall, A., Dau, H. A., Lines, J. et al. The UEA multivariate time series classification archive. arXiv preprint arXiv:1811.00075 (2018).
- 38. Zhou, H. & Peng, J. Informer: Beyond efficient transformer for long sequence time-series forecasting. In AAAI Conference on Artificial Intelligence 35(12), 11106–11115 (2021).
- 39. Liu, S., Yu, H., Cong, L. et al. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations (ICLR), 1–20 (2022).
- 40. Zhang, Y. & Yan, J. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In International Conference on Learning Representations (ICLR), 1–21 (2023).
- 41. Wu, H. et al. TimesNet: Temporal 2D-variation modeling for general time series analysis. In International Conference on Learning Representations (ICLR), 1–23 (2023).
- 42. Liu, Y., Cheng, D., Zhang, D., Xu, S. & Han, J. Capsule networks with residual pose routing. IEEE Trans. Neural Netw. Learn. Syst., 1–14 (2024).
- 43. Zhang, D. et al. Mamba capsule routing towards part-whole relational camouflaged object detection. arXiv preprint arXiv:2410.03987 (2024).