Scientific Reports
. 2026 Jan 28;16:6436. doi: 10.1038/s41598-026-36456-8

Semantic-aware fault diagnosis of heavy-duty railway maintenance machinery and its potential in multisensor fusion systems

Yuying Zhang 1,2, Chunlei Gao 1,2,, Runzhe Wang 3, Huize Liang 4, Guohua He 1,2, Chuan Liu 1,2, Shangkun Liu 1,2, Zhe He 1,2, Jiaqing Zhang 1,2
PMCID: PMC12910104  PMID: 41606053

Abstract

To address the semantic gap in physical sensor data for fault diagnosis of heavy-duty railway maintenance machinery and the underuse of semantic information in maintenance logs, this study proposes a model that treats fault-related text as a virtual semantic sensor. The goal is to explore a semantic-aware approach to fault diagnosis and its role in multisensor fusion. A classification model combining a BERT pretrained model with a convolutional neural network (BERT-CNN) was built. To improve the focus on key semantic units and strengthen links between textual features and sensor modalities, a dual self-attention (DSA) mechanism was added, forming the BERT-DSA-CNN model. It extracts structured semantic feature vectors from unstructured logs, which serve as outputs of the virtual semantic sensor. Experiments show that (1) incorporating DSA significantly improves performance, with BERT-DSA-CNN and Word2vec-DSA-CNN outperforming their baselines (BERT-CNN and Word2vec-CNN) in accuracy, precision, recall, and F1-score; (2) BERT’s contextual embeddings clearly surpass Word2vec, as BERT-DSA-CNN consistently outperforms Word2vec-DSA-CNN; (3) the CNN effectively captures local features of short fault texts, as BERT-CNN outperforms BERT-BiLSTM on most metrics; and (4) deep semantic feature learning substantially outperforms traditional machine-learning baselines. This study validates that the proposed semantic-aware model can efficiently transform fault texts into semantic features for fault identification. More importantly, the structured semantic features extracted by this model can be fused with physical sensor data in future work, providing a foundation for more accurate, robust, and interpretable intelligent fault diagnosis systems for heavy-duty railway maintenance machinery.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-026-36456-8.

Keywords: Semantic awareness, Physical sensors, Bidirectional encoder representations from transformers (BERT), Convolutional neural network (CNN), Dual self-attention (DSA), Machine learning

Subject terms: Engineering, Mathematics and computing


Heavy-duty railway maintenance machinery is the core equipment for track repair and maintenance operations. Its high-speed, efficient maintenance capability is significantly superior to traditional manual operation and can greatly shorten the track closure time caused by maintenance work [1,2]. Such machinery carries out regular inspection and maintenance of railway infrastructure, effectively identifies and corrects potential safety hazards such as irregular tracks and subgrade settlement [3], and plays a crucial role in ensuring railway operational safety, punctuality, and service quality.

In actual engineering, monitoring the status of these complex systems and diagnosing faults largely depend on the fusion of data from multiple physical sensors (such as vibration, temperature, and pressure) [4–6]. This approach integrates various signals to build a comprehensive picture of the health status of the equipment [7]. However, a significant limitation remains: although the numerical data provided by sensors can effectively capture the physical state of the machine, they fall short of explaining the intrinsic meaning of complex failures [8,9]. For example, while an abnormal high-frequency vibration can be detected, the sensor cannot translate this signal into specific fault semantics like “bearing wear” or “oil line blockage”. Similarly, recorded temperature gradients cannot describe experiential observations such as “sluggish operation” or “abnormal noise” noted by maintenance staff. This semantic gap ultimately compromises the decision-making accuracy of maintenance personnel and the interpretability of repair outcomes in complex failure scenarios.

Heavy-duty railway maintenance machinery often operates in harsh environments with high temperature, high humidity, and excessive dust. Its system structure is complex, and its fault modes are diverse [10,11]. During troubleshooting, maintenance personnel usually record the observed fault symptoms (such as excessive vibration or insufficient pressure), the troubleshooting steps, and the final conclusions in written logs. These logs essentially constitute a valuable but underutilized “semantic sensor” data stream that contains rich experience and knowledge. Traditional diagnostic methods that rely only on manual experience struggle to meet the needs of efficient fault identification, yet historical fault logs and physical sensor signals carry complementary information. The core of solving this problem is therefore to convert this unstructured semantic information into standardised, machine-interpretable signals that can be deeply integrated with physical sensor data, laying the foundation for a new generation of intelligent fault diagnosis systems.

In this work, we treat fault text maintenance records as Virtual Semantic Sensors (VSS)—a computational module that converts unstructured natural language into standardized, structured semantic feature vectors, thereby emulating the signal transduction of a physical sensor. Its processing pipeline consists of signal perception (text input), feature extraction (via BERT‑DSA‑CNN), and signal output (fixed‑dimensional feature vector). The output is designed to align with physical sensor features in dimensionality and semantics, serving as a ready‑to‑fuse interface for feature‑level multimodal integration within a broader diagnostic system.
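The three-stage pipeline just described can be sketched as a minimal, runnable interface. Everything here is illustrative: the class name, the 768-dimensional output, and especially the character-hash feature extractor, which merely stands in for the actual BERT-DSA-CNN module:

```python
import numpy as np

class VirtualSemanticSensor:
    """Illustrative sketch of the VSS pipeline: raw maintenance-log text in,
    fixed-dimensional feature vector out. The real extractor is BERT-DSA-CNN;
    a hash-based bag-of-characters stands in so the sketch is runnable."""

    def __init__(self, output_dim=768):
        self.output_dim = output_dim  # chosen to align with sensor feature size

    def perceive(self, text):
        # Stage 1: signal perception -- accept raw maintenance-log text.
        return text.strip()

    def extract(self, text):
        # Stage 2: feature extraction (placeholder for BERT-DSA-CNN).
        vec = np.zeros(self.output_dim)
        for ch in text:
            vec[hash(ch) % self.output_dim] += 1.0
        return vec

    def output(self, text):
        # Stage 3: signal output -- a normalized, fixed-dimensional vector
        # ready for feature-level fusion with physical sensor features.
        vec = self.extract(self.perceive(text))
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec

vss = VirtualSemanticSensor(output_dim=768)
feature = vss.output("gearbox pressure low, oil pump failure suspected")
```

Regardless of the input text length, the output is a fixed-length unit vector, which is what makes it interface-compatible with physical sensor feature streams.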

In summary, while existing studies have advanced text-based or sensor-based diagnostics, they largely remain separate endeavors: text classification aims to improve accuracy, while multisensor fusion focuses on processing physical signals. This work bridges the gap systematically. Our core contributions are threefold: (1) Conceptually, we frame unstructured fault texts as a “Virtual Semantic Sensor” tasked with outputting standardized, structured semantic feature vectors—not just classification. (2) Methodologically, we design a Dual Self-Attention (DSA) mechanism not only to boost performance but to learn structured feature representations aligned with potential physical-sensor dimensions. (3) Practically, the entire BERT-DSA-CNN model acts as a “semantic signal processor,” outputting features ready for feature-level multimodal fusion. Thus, this study provides a key module and framework for integrating textual and physical sensor data.

Related works

Research on intelligent fault diagnosis of equipment generally follows two main technical routes: (1) analysis on the basis of physical sensor data and (2) mining of maintenance text records. In recent years, both approaches have shown a trend of evolving from single-model methods to fusion architectures, aiming to achieve more comprehensive and accurate diagnostic capabilities.

Fault diagnosis based on multisensor data fusion

A classical and widely studied approach in this field is to improve diagnostic accuracy by integrating heterogeneous physical sensor data, such as vibration, temperature, and pressure signals [12–14]. The core advantage of this approach lies in leveraging the complementarity among different sensor signals to construct a comprehensive perception of equipment health. For example, in the railway domain, Niu [15] applied probabilistic principal component analysis to extract features from sensor data and used support vector machines (SVMs) and k-nearest neighbour models to realize intelligent diagnosis of turnout faults. Tan et al. [16], targeting SDH optical fibre communication networks, proposed an SVR-based fault classification method: they first extracted steady-state features of fault data transmission via a Poisson distribution-based model and then applied SVR for classification, achieving an accuracy of 99%. Dai et al. [17] proposed a fault diagnosis method combining the Beluga Whale Optimisation (BWO) algorithm with Random Forest (RF) for rolling bearing issues in wind turbine drive systems, effectively optimising the processing of high-dimensional sensor features. In other industrial domains, Kougiatsos et al. [18] proposed a nonlinear estimation method for monitoring faults in marine fuel oil engines based on distributed sensors, demonstrating the advantages of distributed fusion architectures. Li et al. [19] employed BERT to process transient operational parameters in nuclear power plants, achieving anomaly detection while reducing false alarm rates, thereby highlighting the potential of natural language models for handling structured system status data. Such research typically relies on signal processing, machine learning, and deep learning methodologies to extract and fuse feature information from raw sensor data. However, these approaches are heavily dependent on sensor deployment density and data quality, and they struggle to capture deeper semantic information about faults, often lacking direct connections to maintenance practices.

Fault diagnosis based on text data mining

In addition to physical sensors, maintenance records constitute another valuable data source. These logs document fault phenomena, causes, and corrective actions in a qualitative but semantically rich manner. Mining such data requires natural language processing (NLP) techniques, which have evolved from traditional machine learning to deep learning and, more recently, to pretrained language models [20–23]. Early studies primarily used bag-of-words and TF-IDF representations combined with shallow classifiers for fault categorization. For example, Xie [24] applied the sLDA topic model to semantically cluster ground equipment fault logs in urban rail systems and classified them via a naïve Bayes model. Liu [25] used the XGBoost algorithm to classify eight typical fault types in train control onboard equipment. Feng [26] proposed a BP-SVM-based classification and prediction method for shearer faults, in which features are extracted via a BP neural network and classified with an SVM. While such methods offer interpretability, they suffer from high-dimensional sparse representations and limited semantic capture. With the development of deep learning, embedding techniques such as Word2vec and GloVe have mitigated the sparsity problem, whereas CNN and RNN architectures have improved the modelling of local features and contextual dependencies. Zhou [27] employed Word2Vec with a CBOW model to generate embeddings, followed by a CNN for train control onboard fault classification. Yang [28] proposed an IBTM-TMW method that combines improved topic models with word embeddings to handle the short, domain-specific fault texts of railway signal equipment. Deng [29] applied co-word analysis for text processing and multivariate statistical methods for failure term classification and visualization in the context of aero-engines. More recently, pretrained models such as BERT have greatly advanced performance in short and unstructured text classification. Lin [30] developed a BERT-MHA-BiLSTM-CRF (BMBC) model for entity recognition in high-speed railway signal fault texts. Xu [31] proposed a BERT-CNN model with focal loss for handling imbalanced datasets of urban rail train control fault logs. Zhang [32] demonstrated the superiority of BERT-based representations in railway safety supervision text classification. Zhang [33] further combined a BERT-based short text classification model with a knowledge graph for fault localization in railway CIR equipment. Jiao [34] applied fuzzy NLP techniques, using BERT to enable intelligent testing of CTC system interface logs. Xia [35] also validated BERT’s advantages in short-text analysis.

Towards integrating physical and semantic modalities

Notably, a few recent studies have begun to explore the preliminary integration of numerical sensor data and textual data. Xu [36] performed ontology modelling on multisource diesel engine fault data and applied a BERT-BiLSTM-CRF framework to mine expert knowledge, which was subsequently incorporated into a knowledge graph and Bayesian network for fault diagnosis. Li [37] adopted a federated edge–cloud approach to train BERT models, which demonstrated strong performance in rail transit fault detection tasks and highlighted the potential of multimodal and distributed architectures. Furthermore, in the broader field of intelligent fault diagnosis, advanced neural network frameworks such as Metric-Encapsulation (ME) and Deep Coupled Metric Learning (DCMLDF) have been proposed to learn joint representations and measure cross-modal similarities, offering sophisticated paradigms for multimodal data fusion. Nevertheless, most existing studies on railway equipment remain fragmented. On the one hand, research on multisensor fusion emphasizes the optimization of physical signal processing but neglects fault semantics. On the other hand, text mining studies focus on extracting fault knowledge from human language but lack systematic integration into sensor-based diagnostic frameworks.

This fragmentation, together with the absence of work applying these advanced fusion paradigms to the fusion of unstructured maintenance logs with physical sensor streams for heavy-duty machinery, reveals a significant research gap: very few studies have examined how to deeply fuse the rich semantic information in fault texts with physical sensor signals. Texts, as qualitative data, remain underexplored in terms of how their semantic features can complement and validate quantitative sensor data at the feature level. To address this gap, the present study conceptualizes fault texts as virtual semantic sensors (VSSs) and introduces a BERT-CNN-based framework as the signal processing module, enabling the extraction of standardized, integrable semantic features from text. Our objective is to demonstrate how these semantic features can be mapped to and complemented with physical sensor signals, thereby providing the key technical components and theoretical foundation for a multimodal intelligent diagnostic system that integrates both numerical and semantic information.

Equipment structure and fault text mining framework

Structure of the heavy-duty railway maintenance machine

Heavy-duty railway maintenance machinery can be divided into four major systems: the central control and operation system, the functional execution system, the status monitoring and diagnostic system, and the auxiliary support system. These systems work in coordination and are highly integrated to jointly complete efficient and precise line operations [38–40].

Among them, the central control and operation system is mainly responsible for the command, control, and coordination of the entire equipment. Its components include the onboard main control MCU, the driver’s touch console, and the operation control subsystems. The status of this system is usually monitored by internal sensors such as current, voltage, and communication.

The functional execution system is mainly responsible for specifically executing the action instructions issued by the central control system, realizing the physical functions of machine movement and operation. Its components include the power transmission system, travelling system, braking system, and working devices. This is a high-incidence fault area, where many vibration, temperature, pressure, and rotational speed sensors are deployed.

The status monitoring and diagnostic system is responsible for real-time monitoring of the overall operating status of the machine to ensure operational safety and provide early warnings of potential failures. Its components include the data processing unit and the status display screen in the cab.

The auxiliary support system is mainly responsible for providing necessary support and services for other core systems, ensuring that personnel and equipment can continue to work stably in various environments. Its components include battery packs, distribution boxes, voltage regulation units, intervehicle communication antennas, BeiDou modules, onboard air conditioning, wipers, lighting, and so on.

The structure of heavy-duty railway maintenance machinery is shown in Fig. 1.

Fig. 1. Structure of heavy-duty railway maintenance machinery.

Fault text data mining

Data collection

In this study, the fault text data of heavy-duty railway maintenance machinery are obtained from equipment fault maintenance logs. The logs consist of fault location, faulty component, fault phenomenon, fault level, fault cause, and fault handling information. In this study, we regard the fault text recorded by maintenance personnel as a type of virtual semantic sensor, which contains abundant state information that is difficult to capture by physical sensors. Some of the log information is shown in Table 1.

Table 1. Example of fault text data.

| Category | Component | Fault phenomenon | Fault level | Cause | Treatment |
|---|---|---|---|---|---|
| Power transmission | Gearbox | Abnormal disengagement of gearbox | 2 | Fracture of ZF final-stage clutch soft shaft / abnormal clearance of connecting rod | Replace clutch soft shaft / readjust connecting rod clearance |
| Power transmission | Gearbox | Low gearbox pressure | 3 | Oil pump failure | Replace oil pump |
| Braking system | Independent brake | Insufficient independent brake pressure | 1 | Pressure regulating valve failure / shuttle valve air leakage / brake valve failure / pipeline leakage / distributor valve failure | Overhaul pressure regulating valve / clean shuttle valve / overhaul or replace brake valve / repair leakage points / overhaul distributor valve |
| Braking system | Indirect brake | Indirect brake cannot maintain pressure | 1 | Working air reservoir leakage / distributor valve failure / safety valve leakage / shuttle valve air leakage / single-lap pipeline leakage / internal leakage of automatic brake valve | Inspect pipeline leakage points / replace distributor valve / overhaul safety valve / overhaul shuttle valve / inspect single-lap pipeline leakage points / inspect lower single-lap valve of automatic brake valve |
| Running gear | Shock absorber | Car body swaying | 3 | Low static stiffness coefficient of chevron-type shock absorber | Replace |
| Running gear | Side bearing | Abnormal oil volume in side bearing automatic lubricator | 3 | Aging | Replace |

The reasons why fault-handling information easily leads to low accuracy in traditional text processing algorithms include the following:

Short texts cannot cover the complete content. The amount of information carried by individual words is limited, yet those words express important content: key local features and contextual cues determine the fault category, and fault descriptions belonging to different fault categories may contain highly similar features.

Semantic ambiguity. Owing to differences in how maintenance personnel understand and describe faults, synonymy and ambiguity exist among multiple words.

Text analysis

Text analysis refers to the computational process of extracting meaningful patterns and valuable insights from unstructured textual sources. This technology has been successfully applied across multiple domains, including intelligent e-commerce and search engine technologies. Within the domain of heavy-duty railway maintenance machinery, fault classification requires analysing vast volumes of maintenance logs to identify and extract pertinent failure information, a process falling within the conceptual and technical scope of text data mining. Capitalising on this alignment, this study employs text data mining methodologies to construct fault classification models for such machinery. The proposed framework’s overall architecture is illustrated below.

Fault texts constitute natural language, and we require natural language processing techniques to convert textual information into numerical data. This typically involves three steps: word segmentation, stopword removal, and vectorisation. Word segmentation entails decomposing sentence information into word-level information based on statistical analysis of standard corpora. For instance, the sentence: “After lifting the tamping device, it cannot be locked or unlocked normally” can be segmented as: “tamping device”, “lifting”, “cannot”, “normally”, “lock”, and “unlock”. Stopword removal involves eliminating high-frequency words that do not impact textual analysis, thereby enhancing the model’s generalisation capability. Examples include modal particles and conjunctions. Vectorisation converts unstructured textual information into structured vector data, providing input variables for deep learning models.
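As a concrete illustration of these three steps, the sketch below implements a greedy longest-match segmenter, stopword filtering, and bag-of-words vectorisation over a toy English vocabulary. A production pipeline would instead rely on a tool such as jieba or BERT's own tokenizer on the Chinese fault texts; the vocabulary, stopword list, and example sentence here are purely illustrative:

```python
# Toy preprocessing pipeline: segmentation -> stopword removal -> vectorisation.
# The vocabulary, stopword list, and example sentence are illustrative only;
# a real system would use jieba or BertTokenizer on the Chinese fault texts.
VOCAB = ["tamping device", "lifting", "cannot", "normally", "lock", "unlock"]
STOPWORDS = {"after", "the", "it", "be", "or"}

def segment(sentence, vocab):
    """Greedy longest-match segmentation: prefer multi-word vocabulary entries."""
    tokens = sentence.lower().replace(",", "").split()
    words, i = [], 0
    while i < len(tokens):
        for span in (2, 1):  # try the two-word entry first, then single words
            cand = " ".join(tokens[i:i + span])
            if cand in vocab:
                words.append(cand)
                i += span
                break
        else:
            words.append(tokens[i])  # out-of-vocabulary token kept as-is
            i += 1
    return words

def remove_stopwords(words):
    return [w for w in words if w not in STOPWORDS]

def vectorise(words, vocab):
    """Bag-of-words count vector: the simplest stand-in for real embeddings."""
    vec = [0] * len(vocab)
    for w in words:
        if w in vocab:
            vec[vocab.index(w)] += 1
    return vec

sentence = "After lifting the tamping device, it cannot be locked normally"
words = remove_stopwords(segment(sentence, VOCAB))
bow = vectorise(words, VOCAB)
```

Note how "tamping device" survives as a single unit because the longest match wins, while stopwords such as "the" are discarded before vectorisation.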

A deep learning model is constructed and trained by adjusting the model parameters and objective functions, and its performance is evaluated on a test set.

Research methodology

BERT functions as a pretrained language encoder that employs the masked language model (MLM) as its pretraining objective, combined with a fine-tuning strategy inspired by ULMFiT (Universal Language Model Fine-tuning). It has strong transferability and can effectively handle polysemy [41–43]. For downstream tasks, BERT requires only simple fine-tuning to adapt, and its performance on multiple downstream tasks is significantly better than that of static embedding techniques such as Word2Vec and GloVe [44–46]. BERT has therefore been increasingly applied in text processing. The CNN, in turn, is an effective method for processing fault logs, which are short, non-time-series texts. Compared with BiLSTM, the CNN has a simpler structure and faster training speed, and it facilitates timely updates of the case database.

In this study, the fault classification model is constructed as follows. First, the BERT model is used to vectorize the fault text data of heavy-duty railway maintenance machinery. The resulting feature representations are then fed into a convolutional neural network (CNN) for model training. To better capture the intrinsic mapping between fault descriptions and multisensor data, the standard CNN architecture is enhanced with a dual self-attention (DSA) mechanism.

Fault text vectorization based on the BERT model

BERT is a character-level language model based on the transformer architecture [47]. Using the built-in BertTokenizer tool, the text is segmented into the smallest units, called tokens. To enhance the feature representation and generalization capability of sentences, BERT employs masked language modelling (MLM) and next sentence prediction (NSP) during training. In MLM, a portion of the input tokens is randomly masked, and the model predicts only the masked tokens, introducing a noise mechanism that improves robustness. NSP randomly selects two sentences A and B and determines whether B follows A. Both training objectives are self-supervised and require no labelled data. After pretraining via self-supervised learning, the model can be fine-tuned with only a small amount of data for various downstream tasks.
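The MLM corruption step can be illustrated in a few lines. This sketch performs only the basic random masking (BERT's additional 80/10/10 keep-or-replace refinement is omitted), and the character-level Chinese tokens are an illustrative example:

```python
import random

def mlm_mask(tokens, mask_rate=0.15, seed=0):
    """Randomly replace a fraction of tokens with [MASK]; the model is then
    trained to predict only the masked positions (self-supervised, no labels)."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok          # position -> original token to predict
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

# Character-level tokens, as used by Chinese BERT models.
tokens = list("齿轮箱压力低，更换油泵")
masked, targets = mlm_mask(tokens, mask_rate=0.3, seed=7)
```

The `targets` mapping is exactly what the pretraining loss is computed against: only masked positions contribute to the objective.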

The input representation of the BERT model integrates three distinct types of embeddings: token, segment, and position embeddings [48].

As illustrated in Fig. 2, tokens (e1, …, en)—derived from tokenization and stop-word removal—undergo a feature extraction process. For each token (e.g., e1), its semantic (v11), segment (v12), and positional (v13) feature vectors are summed to form a composite representation (v1). These integrated vectors are subsequently fed into the Transformer’s bidirectional encoder, which outputs the final contextualized word representations (r1…, rn) for downstream deep learning tasks.

Fig. 2. Structure of the BERT model.
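The summation of the three embedding types described above amounts to a simple element-wise addition. The sizes below are toy values (BERT-base actually uses a 768-dimensional hidden size), and the random matrices stand in for learned embedding tables:

```python
import numpy as np

n, d = 6, 8                              # sequence length, embedding dimension
rng = np.random.default_rng(42)
token_emb = rng.normal(size=(n, d))      # v_i1: semantic (token) embedding
segment_emb = np.zeros((n, d))           # v_i2: all tokens belong to sentence A
position_emb = rng.normal(size=(n, d))   # v_i3: position embedding

# Composite representation v_i fed into the bidirectional Transformer encoder.
composite = token_emb + segment_emb + position_emb
```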

The original transformer model consists of an encoder and a decoder. BERT uses only the encoder component, which includes multihead attention, residual connections with layer normalization, a feedforward layer, and a second residual connection with layer normalization. Compared with models of similar scale (e.g., GPT-2), BERT demonstrates superior performance on Chinese text. In this study, the Chinese pretrained model BERT-base Chinese (bert-base-chinese), open-sourced by Google, is selected. This model supports both simplified and traditional Chinese, meeting the basic requirements for understanding the Chinese language.

DSA-CNN deep learning model

The convolutional neural network (CNN) is a deep learning model that specializes in extracting local features. Compared with traditional neural networks, CNNs have fewer training parameters and better overfitting suppression [49,50]. They are well suited to mining the associative features between contextually related fault text information to identify the category of fault events [51]. The standard CNN architecture comprises four fundamental layers: an input layer, convolutional layers, pooling layers, and a fully connected layer. In textual applications, the input layer accepts a matrix of word embeddings. The convolutional layer then performs local feature extraction, the pooling layer conducts dimensionality reduction to preserve essential information while mitigating overfitting, and the fully connected layer serves as the final classifier for category prediction.

In this study, the output of BERT is used as the word embedding layer for the CNN. The concatenated word embedding matrix is passed through convolutional layers to extract feature vectors at multiple levels. Finally, the feature vectors are averaged through the pooling layer to reduce dimensionality, and the resulting vectors are fed into the fully connected layer and a Softmax classifier to generate the final fault category.

In traditional CNN models for text processing, the convolutional kernels cannot clearly distinguish the physical sensor modalities corresponding to text features. As a result, the learned features lack explicit engineering semantics and cannot be directly associated with physical sensor data. To address this, a dual self-attention (DSA) mechanism [52–54] is introduced and defined as a sensor-modality-guided attention layer. Its core idea is to guide the model in extracting multidimensional, discriminative semantic features from the text through a set of learnable query vectors [55]. Conceptually, these features can be mapped to different physical monitoring dimensions, such as vibration or temperature, thereby providing a structurally compatible interface for future feature-level fusion with real physical sensor data. In the current work, these query vectors are parameters learned end-to-end entirely from textual data and are not connected to any real physical sensor signals during training or inference.

Attention mechanisms are widely used in deep learning. An attention mechanism is underpinned by a key-value paired data structure: its core operation calculates affinity scores between a query vector and all keys, which determine the weighting coefficients for the corresponding values; a weighted sum then produces attention values reflecting the importance of each element [56–58]. The structure of the DSA-CNN model is shown in Fig. 3.

Fig. 3. Structure of the DSA-CNN model.

Input Layer: The input data of this layer are the vectorized fault text information matrix $M \in \mathbb{R}^{n \times d}$, where n denotes the number of tokens in the fault text and d represents the embedding dimensionality. The matrix M can be represented as shown in Eq. (1).

$$M = \begin{bmatrix} r_1 & r_2 & \cdots & r_n \end{bmatrix}^{\top} \in \mathbb{R}^{n \times d} \tag{1}$$

In the given formulation, the term ri denotes the d-dimensional embedding vector of the i-th input word. The input layer subsequently serves to channel this vectorized textual information into the DSA-CNN architecture, establishing its initial interface with the subsequent attention mechanism.

Attention Layer: This layer employs a self-attention scoring mechanism to compute a contextualized representation for each word embedding in the fault text. The resulting contextual vector is then concatenated with the original word embedding before being passed to the convolutional layer. The transformed contextual vector corresponding to word embedding $r_i$ is formally given by Eq. (2). Let the sequence representation output by BERT be denoted as $X \in \mathbb{R}^{B \times L \times H}$, where B, L, and H represent the batch size, sequence length, and hidden layer dimension (with H = 768), respectively. The attention weight $m_{i,j}$ is computed using the learnable query vectors and X.

$$m_{i,j} = \frac{\exp(s_{i,j})}{\sum_{k=1}^{n} \exp(s_{i,k})}, \qquad g_i = \sum_{j=1}^{n} m_{i,j}\, r_j \tag{2}$$

where $s_{i,j}$ denotes the compatibility score between tokens i and j computed from the learnable query vectors, and $g_i$ is the resulting contextual vector.

In Eq. (2), $m_{i,j}$ represents the attention weight coefficient, computed via the Softmax function, a normalized exponential transformation widely adopted in deep learning. The mathematical property of Softmax inherently amplifies the significance of critical elements by assigning them proportionally larger weights in the resulting distribution. This equation integrates contextual information through attention weights to enhance the representation of each token, highlighting its semantic role within the global sequence.

Finally, the transformed contextual vector is concatenated with the original word embedding and used as the new input for the convolutional layer.

$$I' = \begin{bmatrix} r_1 \oplus g_1 & r_2 \oplus g_2 & \cdots & r_n \oplus g_n \end{bmatrix}^{\top} \tag{3}$$

In the equation, I′ represents the input to the convolutional layer, and ⊕ denotes the concatenation operation.
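A minimal numerical sketch of this first attention stage follows. The dot-product compatibility score used here is an assumption made for concreteness (the paper computes the score from learnable query vectors), and the shapes are toy values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 5, 4
R = rng.normal(size=(n, d))        # token embeddings r_1 .. r_n

scores = R @ R.T                   # pairwise compatibility scores (assumed form)
M_att = softmax(scores, axis=1)    # attention weights m_ij; each row sums to 1
G = M_att @ R                      # contextual vectors g_i = sum_j m_ij * r_j
I_prime = np.concatenate([R, G], axis=1)   # r_i concatenated with g_i
```

Concatenation doubles the per-token feature width, so the convolutional layer that follows sees both the original embedding and its context-weighted counterpart.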

Convolutional Layer: This layer incorporates multiple parallel feature maps to detect diverse patterns within the input representation. Each feature map specializes in identifying distinct local features through convolutional operations. For the attention layer matrix $I'$, a convolution kernel $W \in \mathbb{R}^{h \times d}$ with h rows and d columns is convolved with submatrices of $I'$ of the same size. The kernel W slides over $I'$ from top to bottom, repeatedly performing the convolution operation. The resulting convolution can be represented as shown in Eq. (4).

$$c_i = W \otimes I'_{i:i+h-1} \tag{4}$$

In the equation, $\otimes$ represents the convolution calculation, and $I'_{i:i+h-1}$ denotes the submatrix of $I'$ formed by rows i through i + h − 1.

The convolution operation yields a feature map C of dimensions (n-h + 1) × 1. This resulting matrix subsequently serves as the input to the pooling layer.

$$\hat{c}_i = \mathrm{ReLU}(c_i + b_i) \tag{5}$$

$$C = \begin{bmatrix} \hat{c}_1 & \hat{c}_2 & \cdots & \hat{c}_{n-h+1} \end{bmatrix}^{\top} \tag{6}$$

In the corresponding equations, ReLU functions as the activation mechanism, whereas bi represents the learnable bias parameter. The convolution operation aims to extract discriminative local feature patterns from the text sequence; the feature map C encodes the activation states of these patterns.
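The sliding-window convolution, bias, and ReLU steps can be checked with a toy example; a kernel height of h = 3 over n = 7 rows gives a feature map of length n − h + 1 = 5, matching the stated dimensions. All values here are random stand-ins for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, h = 7, 4, 3
I_prime = rng.normal(size=(n, d))    # attention-layer output matrix
W = rng.normal(size=(h, d))          # convolution kernel (h rows, d columns)
b = 0.1                              # learnable bias (a fixed toy value here)

# ReLU(W convolved with rows i..i+h-1 of I', plus bias) for each window.
C = np.array([
    max(0.0, float(np.sum(W * I_prime[i:i + h])) + b)
    for i in range(n - h + 1)
])
```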

Pooling Layer: This layer is designed to decrease feature dimensionality, compress data and parameters, and enhance generalizability. In the proposed architecture, a secondary self-attention mechanism is incorporated within the pooling layer to assign adaptive weights to the convolutional output matrix. The weighted features undergo transformation and concatenation before being processed via the pooling operation. Specifically, max pooling is applied to produce the final feature vector for the fully connected layer. It is worth noting that this second attention stage serves a different function from that of Eq. (2). The first stage operates at the token level, aiming to enhance the semantic representation of key words. In contrast, the second stage operates at the convolutional feature level, utilizing another set of learnable query vectors (in this study, M = 4) to evaluate the discriminative importance of different local features before pooling.

a_{i,j} = exp(q_jᵀ c_i) / Σ_{k=1}^{n−h+1} exp(q_jᵀ c_k)    (7)
P = MaxPool([a_{·,1} ⊙ C; a_{·,2} ⊙ C; …; a_{·,M} ⊙ C])    (8)

In the formulation, the term a_{i,j} corresponds to the attention weight coefficient, whereas P designates the input to the fully connected layer. This stage applies a second attention weighting to the convolutional features, screening and reinforcing the most important local features for classification, thereby forming the final feature representation P.
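A minimal numpy sketch of this second attention stage, under the assumption that each of the M learnable queries scores every convolutional feature position before max pooling (shapes and random values here are illustrative, not the paper's implementation):

```python
import numpy as np

def attention_pool(C, Q):
    """Second-stage attention over convolutional features (cf. Eqs. 7-8).
    C: (L, F) -- L feature positions, F channels; Q: (M, F) learnable queries.
    Each query scores every position, the weighted features are concatenated,
    and max pooling over positions yields the vector P."""
    scores = C @ Q.T                              # (L, M): q_j . c_i
    e = np.exp(scores - scores.max(axis=0))       # numerically stable softmax
    A = e / e.sum(axis=0)                         # attention weights a_{i,j} (Eq. 7)
    weighted = A[:, :, None] * C[:, None, :]      # (L, M, F) weighted features
    return weighted.max(axis=0).reshape(-1)       # max pool + concatenate (Eq. 8)

rng = np.random.default_rng(1)
C = rng.normal(size=(8, 128))   # 8 positions, 128 channels
Q = rng.normal(size=(4, 128))   # M = 4 queries, as in this study
P = attention_pool(C, Q)
print(P.shape)  # (512,)
```

With M queries over F-channel features, P has length M·F, which the fully connected layer then consumes.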

Fully Connected Layer: This layer serves as the final classifier in the architecture, mapping the extracted features to the target output categories. The input vector P passes through the fully connected layer and is classified via the Softmax function. The result can be expressed as shown in Eq. (9).

p = Softmax(W₀P + b₀)    (9)

In the given equation, p corresponds to the predicted class probability, W₀ designates the weight matrix of the fully connected layer applied to the input vector P, and b₀ represents the corresponding bias term. This layer maps the learned comprehensive features to a probability distribution over fault categories, completing the classification decision.

Model design and construction

In this study, a BERT-DSA-CNN-based coupled analysis model integrating fault-text semantic features with physical sensors is constructed. The model construction approach is illustrated in Fig. 4, and the specific steps are as follows:

Fig. 4.

Fig. 4

Construction flowchart of the BERT-DSA-CNN model.

Construction of the Fault Text Database: A fault text database is established by collecting maintenance log data of heavy-duty railway maintenance machinery and extracting relevant information categories for the model.

Data Sample Augmentation: The original fault data may have very few samples for certain component categories, which can cause the model to be biased toward fault categories with more samples. Therefore, this paper utilizes text augmentation techniques, employing back-translation and random synonym replacement to expand the samples of each category to a specified quantity. For example, “oil pump failure” can be back-translated as “pump oil failure,” and “abnormal” can be replaced with the synonym “not normal.” Augmentation is applied exclusively to samples in the training set.
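The synonym-replacement half of this step can be sketched in pure Python; the synonym table and helper names below are illustrative (back-translation is omitted, since it requires an external translation service):

```python
import random

# Illustrative synonym table, in the spirit of the paper's example
# "abnormal" -> "not normal".
SYNONYMS = {"abnormal": "not normal", "failure": "malfunction"}

def synonym_augment(text, rng, p=0.5):
    """Create a new sample by randomly replacing known words with synonyms."""
    return " ".join(SYNONYMS[w] if w in SYNONYMS and rng.random() < p else w
                    for w in text.split())

def augment_to_min(samples, min_samples, rng):
    """Expand a category to min_samples by augmenting random originals."""
    out = list(samples)
    while len(out) < min_samples:
        out.append(synonym_augment(rng.choice(samples), rng))
    return out

rng = random.Random(0)
train = ["oil pump failure", "abnormal vibration of tamping device"]
augmented = augment_to_min(train, min_samples=5, rng=rng)
print(len(augmented))  # 5
```

Each minority category is padded out to the target count while the original samples are kept intact.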

Text Preprocessing: This step aims to convert text data into a format suitable for the model. Special characters in the fault texts are first removed. Then, the sentences are segmented into individual words via the JiebaTokenizer. Finally, stop words are removed to filter out irrelevant text, retain the core semantic words, and output space-separated word sequences for subsequent BERT encoding.
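A minimal sketch of this preprocessing step; the stop-word list is illustrative, and a trivial whitespace tokenizer stands in for the jieba tokenizer used in the paper:

```python
import re

STOPWORDS = {"的", "了", "在"}   # illustrative stop-word list

def preprocess(text, tokenize):
    """Strip special characters, tokenize, drop stop words, and return a
    space-separated word sequence ready for BERT encoding."""
    cleaned = re.sub(r"[^\w\u4e00-\u9fff]+", " ", text)   # remove special chars
    tokens = [t for t in tokenize(cleaned) if t and t not in STOPWORDS]
    return " ".join(tokens)

# str.split stands in here for jieba.cut, which segments Chinese text.
out = preprocess("oil pump failure!! (abnormal noise)", str.split)
print(out)  # oil pump failure abnormal noise
```

Passing the tokenizer as a parameter keeps the cleaning and stop-word logic independent of the segmentation library.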

Dataset Construction: Texts and labels are packaged into batch data readable by the model.

BERT-Based Fault Text Vectorization: BERT is employed to produce dense, contextualized representations of fault descriptions. These high-dimensional embeddings capture rich semantic and syntactic features from the text, forming the input representations for the subsequent deep learning model.

DSA-CNN Model Training: The model uses the preprocessed fault text vectors as its input features, with the corresponding fault categories from historical maintenance records serving as output labels for the supervised learning process. Optimization algorithms are applied to adjust model parameters so that the input fault texts are accurately mapped to the corresponding fault categories. In addition, the high-dimensional abstract feature vector outputs before the final classification layer are regarded as the signal outputs of the virtual semantic sensor. These vectors structurally represent the semantic information of faults, providing a potential interface for subsequent feature-level fusion with physical sensor data.

Fault Text Classification Testing: The recognition performance of the DSA-CNN model is tested on the test set.

Model evaluation: A series of standardized classification metrics are utilized to comprehensively evaluate the model’s recognition capability and validate its predictive accuracy.

Experiment analysis

Experimental environment and dataset

The experiments in this study were conducted in Python 3.9 with the PyTorch 1.8.0 framework. First, the BERT-based Chinese pretrained model was used to extract semantic features, and the BERT output was transposed from the original [B, L, H] dimensions to [B, H, L], where B is the batch size, L is the sequence length, and H = 768 is the hidden layer dimension. Second, a feature extraction module composed of three 1D-CNN layers was constructed, with convolution kernel sizes of 3, 4, and 5, each layer outputting 128 channels. Finally, max pooling and feature concatenation were applied to obtain a 384-dimensional combined feature vector, and a dropout layer with a probability of 0.7 was introduced for regularization to prevent overfitting, forming the classification prediction model. The numbers of learnable query vectors in the DSA module were set to K = 8 and M = 4, both initialized with the Xavier uniform distribution.
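The shape flow just described — transpose to [B, H, L], three kernel sizes with 128 channels each, max pooling over time, and concatenation to 384 dimensions — can be traced with a numpy sketch; the weights here are random stand-ins, not the trained PyTorch model:

```python
import numpy as np

def conv1d_relu(x, kernels):
    """Valid 1-D convolution over the length axis with ReLU.
    x: (H, L) channels-first input; kernels: (K, H, h)."""
    K, H, h = kernels.shape
    L = x.shape[1]
    out = np.empty((K, L - h + 1))
    for i in range(L - h + 1):
        out[:, i] = np.tensordot(kernels, x[:, i:i + h], axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)

rng = np.random.default_rng(3)
B, L, H = 2, 32, 768                     # batch, sequence length, hidden size
bert_out = rng.normal(size=(B, L, H))    # stand-in for BERT embeddings
x = bert_out.transpose(0, 2, 1)          # [B, L, H] -> [B, H, L]

features = []
for h in (3, 4, 5):                      # three kernel sizes, 128 channels each
    k = rng.normal(size=(128, H, h)) * 0.01
    conv = np.stack([conv1d_relu(x[b], k) for b in range(B)])  # (B, 128, L-h+1)
    features.append(conv.max(axis=2))    # max pooling over time -> (B, 128)

combined = np.concatenate(features, axis=1)  # 384-dim combined feature vector
print(combined.shape)  # (2, 384)
```

The 384-dimensional vector is what the dropout layer and classifier consume, and is also the representation treated as the virtual semantic sensor output.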

The dataset used in this study was derived from the 2023–2025 fault maintenance logs of a certain type of heavy-duty railway maintenance machine. The original dataset contains 1,253 fault records. According to the classification of fault systems, seven categories of fault texts (labelled F1–F7) were extracted, as shown in Table 2. The preprocessed dataset in this experiment was split into training and test sets at a 7:3 ratio. To avoid model bias caused by imbalanced sample sizes, we employed the back-translation and random synonym replacement methods described in the Model Design and Construction section to ensure that the minimum number of fault texts in each category reached 100 (min_samples = 100). Furthermore, data augmentation was performed only after the train/test split and applied exclusively to the minority-class samples in the training set. The test set remained entirely unaffected by augmentation, thereby eliminating any risk of data leakage.
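The split-then-augment ordering that prevents leakage can be sketched in pure Python; the augmentation call is a placeholder marker standing in for back-translation and synonym replacement:

```python
import random

def split_then_augment(records, min_samples, test_ratio=0.3, seed=42):
    """Split 7:3 first, then augment only minority classes in the training
    set, so augmented samples never reach the test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    train, test = shuffled[:cut], shuffled[cut:]

    by_label = {}
    for text, label in train:
        by_label.setdefault(label, []).append(text)
    for texts in by_label.values():
        while len(texts) < min_samples:
            # placeholder for back-translation / synonym replacement
            texts.append(rng.choice(texts) + " (augmented)")
    train = [(t, lab) for lab, ts in by_label.items() for t in ts]
    return train, test

records = [(f"fault {i}", "F3") for i in range(20)] + \
          [(f"fault {i}", "F6") for i in range(200)]
train, test = split_then_augment(records, min_samples=30)
print(len(test))  # 66
```

Because the split happens before any augmentation, every test sample is an untouched original record.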

Table 2.

Fault text classification statistics.

Label | Fault system | Original sample size
F1 | Power Transmission System | 243
F2 | Running System | 167
F3 | Braking System | 76
F4 | Stabilizing Device | 64
F5 | Sleeper-End Compaction Device | 237
F6 | Tamping Device | 262
F7 | Switch Lifting Device | 204

As the service life of heavy-duty railway maintenance machinery increases, more maintenance records accumulate for similar types of equipment and under different track conditions, leading to the aggregation of sporadic faults. The BERT-DSA-CNN model in this study is capable of analysing large volumes of data. In theory, an increase in data volume can further improve the fault classification performance of the model and help establish a more comprehensive fault case library. In practice, owing to the stringent safety requirements of heavy-duty railway maintenance machinery, system vulnerabilities are continuously corrected during operation, preventing a significant increase in the number of fault text categories. Therefore, fault classification for heavy-duty railway maintenance machinery remains a small-sample, imbalanced classification problem with important practical value.

Experiment design and performance metrics

To ensure fair comparison of classification performance across different models, each experimental group used the same number of training and test samples drawn from identical train/validation/test splits. The following control groups were selected to validate the effectiveness of the BERT-DSA-CNN model: BERT-CNN (without the dual self-attention mechanism), SVM (TF-IDF), BERT-BiLSTM, Word2vec-CNN, and Word2vec-DSA-CNN. All deep learning models were monitored on the validation set with early stopping to ensure comparability, and the hyperparameters (C, gamma) for the SVM model were optimized via grid search on the validation set. The detailed hyperparameter settings and training protocols are summarized in Table 3. Model evaluation was based on four metrics: accuracy (A), precision (P), recall (R), and F1 score, all ranging from 0 to 1, where higher values indicate better classification performance59–61.

Table 3.

Core hyperparameter settings and training protocol.

Configuration/Model | BERT-DSA-CNN/BERT-CNN | BERT-BiLSTM | Word2vec-DSA-CNN/Word2vec-CNN | SVM (TF-IDF)
Text Representation | BERT (bert-base-chinese) | BERT (bert-base-chinese) | Word2vec (CBOW, 300-dim) | TF-IDF
Optimizer & Learning Rate | AdamW (2e-5) | AdamW (2e-5) | Adam (1e-3) | N/A
Batch Size & Max Epochs | 32/20 | 32/20 | 32/50 | N/A
Early Stopping (Patience) | Yes (5) | Yes (5) | Yes (10) | N/A
Dropout Rate | 0.7 | 0.5 | 0.5 | N/A
BERT-BiLSTM Architecture | N/A | Hidden Size = 256, Layers = 2, Bidirectional | N/A | N/A
SVM Configuration | N/A | N/A | N/A | RBF kernel; hyperparameters (C, gamma) optimized via grid search
Parameter Initialization | BERT: Pretrained, Others: Xavier | BERT: Pretrained, BiLSTM: Xavier | Word2vec: Pretrained, Others: Xavier | N/A

To evaluate the stability and reliability of the models, all experiments were conducted with five different random seeds (controlling both parameter initialization and data shuffling) for independent training and testing runs. In the final report, each evaluation metric is presented in the form of mean ± standard deviation. The calculation formulas for these metrics are as follows.

A = (TP + TN) / (TP + TN + FP + FN)    (10)
P = TP / (TP + FP)    (11)
R = TP / (TP + FN)    (12)
F1 = 2 × P × R / (P + R)    (13)

In the given formulations, TP, FN, FP, and TN denote, for each fault category, the respective confusion matrix components, as presented in Table 4; the metrics are computed per category and then averaged across categories.
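Eqs. (10)–(13) plus the mean ± standard deviation reporting can be sketched as follows; the confusion-matrix counts below are made-up illustrations, not results from the paper:

```python
import numpy as np

def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F1 (Eqs. 10-13) from one
    category's confusion-matrix counts (Table 4)."""
    a = (tp + tn) / (tp + tn + fp + fn)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return a, p, r, 2 * p * r / (p + r)

# Illustrative counts from three hypothetical seeded runs; each metric is
# then reported as mean +/- standard deviation across runs.
runs = np.array([metrics(95, 5, 3, 97),
                 metrics(93, 7, 4, 96),
                 metrics(96, 4, 2, 98)])
mean, std = runs.mean(axis=0), runs.std(axis=0)
print(f"accuracy: {mean[0]*100:.1f} +/- {std[0]*100:.1f} %")
```

In the experiments, the same aggregation is applied per fault category over the five seeded runs.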

Table 4.

Confusion matrix of classification.

Classification: this event Classification: other events
Actual: This Event TP FN
Actual: Other Event FP TN

Experimental results analysis

The evaluation metrics for each model are presented in Table 5, and Fig. 5 visually compares the average performance of each model across the five independent experimental runs. On the basis of the comparative analysis, the following conclusions can be drawn:

Table 5.

Evaluation metrics of different models.

(a) Comparison of accuracy of each model
Accuracy (%) F1 F2 F3 F4 F5 F6 F7
BERT-DSA-CNN 97.3 ± 0.4 94.8 ± 0.6 100.0 ± 0.0 100.0 ± 0.0 95.4 ± 0.5 95.7 ± 0.4 96.3 ± 0.3
BERT-CNN 94.2 ± 0.7 93.5 ± 0.9 99.8 ± 0.3 99.9 ± 0.2 91.8 ± 1.0 94.9 ± 0.7 93.8 ± 0.8
BERT-BiLSTM 93.4 ± 0.8 93.3 ± 1.0 98.6 ± 0.5 99.8 ± 0.3 92.1 ± 1.1 94.3 ± 0.9 92.6 ± 1.0
SVM (TF-IDF) 85.3 ± 1.5 90.6 ± 1.3 89.6 ± 1.4 88.3 ± 1.5 91.5 ± 1.2 91.2 ± 1.3 87.9 ± 1.5
Word2vec-CNN 95.1 ± 0.6 91.6 ± 1.1 99.0 ± 0.4 98.2 ± 0.6 92.5 ± 1.0 92.8 ± 0.9 92.3 ± 0.9
Word2vec-DSA-CNN 95.7 ± 0.5 92.8 ± 0.8 99.9 ± 0.2 99.3 ± 0.4 92.7 ± 0.9 93.4 ± 0.8 94.4 ± 0.6
(b) Comparison of precision of each model
Precision (%) F1 F2 F3 F4 F5 F6 F7
BERT-DSA-CNN 98.2 ± 0.3 96.3 ± 0.5 99.7 ± 0.2 99.5 ± 0.3 95.7 ± 0.6 96.0 ± 0.5 97.8 ± 0.3
BERT-CNN 94.2 ± 0.8 95.4 ± 0.7 99.3 ± 0.4 99.5 ± 0.3 92.1 ± 1.0 92.9 ± 0.9 93.7 ± 0.8
BERT-BiLSTM 91.9 ± 1.1 90.7 ± 1.2 98.9 ± 0.5 99.4 ± 0.4 92.2 ± 1.1 95.8 ± 0.7 95.1 ± 0.8
SVM (TF-IDF) 89.4 ± 1.3 88.7 ± 1.4 98.7 ± 0.6 98.3 ± 0.7 90.1 ± 1.3 89.3 ± 1.4 92.5 ± 1.1
Word2vec-CNN 90.8 ± 1.0 89.1 ± 1.2 99.0 ± 0.5 98.6 ± 0.6 92.8 ± 1.0 91.3 ± 1.1 93.0 ± 0.9
Word2vec-DSA-CNN 94.4 ± 0.7 93.3 ± 0.9 99.1 ± 0.4 99.2 ± 0.4 95.6 ± 0.7 92.1 ± 1.0 93.2 ± 0.9
(c) Comparison of recall of each model
Recall (%) F1 F2 F3 F4 F5 F6 F7
BERT-DSA-CNN 98.2 ± 0.3 97.9 ± 0.4 98.7 ± 0.5 98.1 ± 0.6 95.6 ± 0.7 97.4 ± 0.5 95.8 ± 0.6
BERT-CNN 93.8 ± 0.9 92.4 ± 1.0 97.3 ± 0.8 96.7 ± 0.9 90.8 ± 1.2 91.3 ± 1.1 94.0 ± 0.9
BERT-BiLSTM 92.5 ± 1.0 91.8 ± 1.1 92.3 ± 1.3 94.1 ± 1.0 88.1 ± 1.4 87.9 ± 1.4 90.3 ± 1.2
SVM (TF-IDF) 88.4 ± 1.4 90.8 ± 1.2 91.0 ± 1.3 91.2 ± 1.3 90.1 ± 1.3 87.9 ± 1.4 88.6 ± 1.4
Word2vec-CNN 90.5 ± 1.1 93.4 ± 0.9 97.7 ± 0.7 95.3 ± 0.9 91.6 ± 1.1 92.8 ± 1.0 91.7 ± 1.0
Word2vec-DSA-CNN 95.7 ± 0.7 93.6 ± 0.9 97.9 ± 0.7 97.1 ± 0.8 91.6 ± 1.1 96.9 ± 0.6 95.1 ± 0.7
(d) Comparison of F1-Score of each model
F1-Score (%) F1 F2 F3 F4 F5 F6 F7
BERT-DSA-CNN 98.7 ± 0.2 97.9 ± 0.3 99.0 ± 0.3 99.2 ± 0.3 97.2 ± 0.4 97.8 ± 0.3 98.4 ± 0.2
BERT-CNN 95.6 ± 0.5 93.8 ± 0.8 98.3 ± 0.4 97.9 ± 0.5 94.6 ± 0.7 94.8 ± 0.6 96.1 ± 0.5
BERT-BiLSTM 94.8 ± 0.6 94.1 ± 0.7 97.3 ± 0.5 98.0 ± 0.4 94.9 ± 0.7 95.4 ± 0.6 95.9 ± 0.5
SVM (TF-IDF) 94.1 ± 0.7 93.8 ± 0.8 95.6 ± 0.6 94.8 ± 0.7 94.7 ± 0.7 95.0 ± 0.7 95.3 ± 0.6
Word2vec-CNN 95.1 ± 0.6 94.9 ± 0.7 97.1 ± 0.5 97.0 ± 0.5 94.9 ± 0.7 97.0 ± 0.5 96.6 ± 0.5
Word2vec-DSA-CNN 95.7 ± 0.5 96.1 ± 0.6 97.7 ± 0.4 98.7 ± 0.3 95.7 ± 0.6 95.2 ± 0.7 97.3 ± 0.4

Fig. 5.

Fig. 5

Visualisation of the mean value of cross-model evaluation metrics.

The dual self-attention (DSA) mechanism demonstrates significant promise in multisensor fusion applications. Experimental validation shows that the BERT-DSA-CNN architecture achieves superior performance compared with the BERT-CNN baseline across all fault categories, with consistent improvements observed in accuracy (A), precision (P), recall (R), and F1 score metrics. Similarly, the Word2vec-DSA-CNN model also demonstrated improvements over the Word2vec-CNN model across these metrics. These results indicate that incorporating the DSA mechanism effectively weights the semantic units of the physical sensor features within the fault texts, significantly enhancing the fault classification performance.

Pretrained language models demonstrate superior capacity for capturing nuanced semantic information. Consequently, the BERT-DSA-CNN architecture achieved consistently better performance than the Word2vec-DSA-CNN variant across all evaluation metrics. This demonstrates that BERT provides higher-quality word embeddings; its dynamic contextual embeddings better capture polysemy and complex grammatical structures in fault descriptions. This richer and more precise semantic representation improves CNN feature extraction and DSA weight allocation, increasing the overall performance ceiling of the model.

The CNN architecture offers unique advantages for fault text classification. For most metrics, the BERT-CNN model outperforms the BERT-BiLSTM model, indicating that for short-text, highly distinctive fault logs of large-scale track maintenance machinery, CNNs are more efficient at capturing local key n-gram features than are BiLSTMs, which are better at long-distance dependencies. CNNs can directly match local language patterns closely related to fault categories, aligning well with the requirements of fault diagnosis tasks.

Compared with deep learning models, the SVM (TF-IDF) performs inferiorly. Traditional machine learning methods exhibit clear disadvantages in text classification tasks. Deep learning models automatically learn deep semantic feature representations, overcoming vocabulary gaps, dimensionality issues, and data sparsity inherent in TF-IDF approaches, resulting in a substantial performance improvement.

Beyond the quantitative metrics, we performed an interpretability analysis to examine how the model’s attention aligns with fault semantics and to identify its limitations. Qualitatively, the DSA mechanism consistently assigned higher attention weights to tokens describing core fault components (e.g., “bearing,” “oil pump”) and key symptom words (e.g., “abnormal noise,” “leakage”) in correctly classified samples. The most frequent misclassifications occurred between the Power Transmission System (F1) and the Running System (F2), primarily because their log descriptions often share generic symptom terms such as “excessive vibration.” Cases that led to errors typically contained vague descriptions (e.g., “sluggish operation”) or mixed causal statements spanning multiple subsystems. These observations confirm that the model’s attention is semantically meaningful, while also highlighting the inherent ambiguity in text-only descriptions—a limitation that motivates the proposed feature-level fusion with physical sensor data in future work.

Conclusion

This study addresses the needs of semantic awareness and multisource information fusion in the fault diagnosis of large-scale track maintenance machinery and constructs and validates a fault text semantic feature extraction framework. First, a BERT-DSA-CNN–based semantic awareness model was proposed. The model obtains deep contextual word embeddings via BERT, effectively captures local key n-gram features in fault descriptions through a CNN, and introduces a dual self-attention (DSA) mechanism to weight semantic units associated with physical sensor modalities. The experimental results demonstrate that this model significantly outperforms alternatives in fault text classification tasks.

Second, systematic experiments reveal performance differences among various models. The results clearly show (1) the effectiveness of the DSA mechanism in enhancing the model’s focus on discriminative features; (2) the significant semantic representation advantage of the pretrained BERT model over static Word2vec embeddings; (3) the efficiency advantage of CNN architectures over BiLSTM when processing short-text, high-feature fault logs; and (4) the performance superiority of deep learning methods over traditional machine learning approaches. These findings provide empirical guidance for technical selection in similar industrial text intelligence applications.

Finally, this study highlights the potential of semantic features in multisensor fusion systems. The structured semantic feature vectors output by the BERT-DSA-CNN model are designed to be complementary and compatible with physical sensor signals at the feature level, laying a technical foundation for the development of more comprehensive and reliable intelligent diagnostic systems in the future. However, the actual fusion with physical sensor data remains a subject for future experimental validation.

Limitations and future work

The primary contribution of this work lies in validating a semantic feature extraction model (BERT-DSA-CNN) for text-based fault classification. The proposed feature-level fusion between the extracted semantic features and physical sensor data, while a core motivation of this work, is not experimentally demonstrated in this study. Performance evaluation is also constrained by the scale and specificity of the single-equipment dataset used.

Based on these limitations, future research will focus on: (1) Experimental validation of multimodal fusion architectures by integrating the structured semantic features with real physical sensor signals to build and test a complete diagnostic system; (2) Developing an online diagnostic system with incremental learning capabilities to process real-time logs and sensor streams; (3) Enhancing the model’s generalizability across different equipment types and operational environments through domain adaptation and larger-scale data collection.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (3.3MB, docx)

Author contributions

Y.Z. and C.G. conceived the research and designed the experiments. R.W. and H.L. conducted the experiments and collected the data. G.H., C.L., and S.L. performed data analysis and visualization. Z.H. and J.Z. provided critical materials and analysis tools. Y.Z. and C.G. wrote the manuscript. All authors reviewed and approved the final manuscript.

Funding statement

This research was funded by the Scientific Research Project of the China Academy of Railway Sciences under Grant 2024YJ294.

Data availability

The data supporting the findings of this study are available from the corresponding author upon reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Zhang, Z. et al. Research on online oil monitoring technology for rail heavy-duty maintenance machinery. Adv. Mater. High Speed Rail. 4 (4), 51–55 (2025).
2. Zhang, H. Design and application research of the condition monitoring system for the running gear of large track maintenance machinery. Adv. Mater. High Speed Rail. 14 (20), 25–30 (2024).
3. Yang, G., Chen, X., Guan, D. & Xu, Y. Reflection on improving the supply guarantee capacity of large railroad maintenance machinery parts. Rail Purch. Logist. 19 (2), 20–23 (2024).
4. Qi, J., Lu, X. & Sun, J. Multi-radar track fusion method based on parallel track fusion model. Electronics 14, 3461 (2025).
5. Kim, M. & Kim, H. Movable wireless sensor-enabled waterway surveillance with enhanced coverage using multi-layer perceptron and reinforced learning. Electronics 14, 3295 (2025).
6. Peng, H. & Cao, X. Research conflict problems of D-S evidence and its application in multi-sensor information fusion technology. In Proceedings of the IEEE International Conference on Information Theory and Information Security, Beijing, China, 16 August 2011, 747–750 (2011).
7. Han, L. Abnormal detection of multi-sensor fusion data and early warning of automatic driving safety. Auto Electr. Parts 9, 16–18 (2025).
8. Fang, J., Xie, K., He, P., Huang, T. & Shi, L. Latent fault detection method for switchgear based on multi-sensor fusion and machine learning. Electron. Des. Eng. 33 (18), 26–30 (2025).
9. Yang, X., Song, C. & Wu, X. Gearbox fault diagnosis method based on multi-sensor data fusion and GAN. J. Mech. Strength 47 (6), 37–47 (2025).
10. Song, J., Li, Y. & Shi, S. Application analysis and research on JZT laser alignment system for railway large maintenance machinery. China Plant Eng. S1, 155–156 (2024).
11. Shi, G., Li, J. & Wang, H. Research on the daily operational data analysis model for large-scale track maintenance machinery in rail transit. Sci. Technol. Innov. 23, 68–70, 73 (2023).
12. Xiong, D. et al. Fast identification of series arc faults based on singular spectrum statistical features. Electronics 14, 3337 (2025).
13. Si, X. et al. Helicopter rotor fault diagnosis based on IPSO and RVM. In Proceedings of the 2023 Global Reliability and Prognostics and Health Management Conference (PHM-Hangzhou), Hangzhou, China, 12–15 October 2023, 1–6 (2023).
14. Zhang, S. et al. SVD-SDP-CNN fault diagnosis method. In Proceedings of the International Conference on Sensing, Diagnostics, Prognostics, and Control (SDPC), Shijiazhuang, China, 1, 52–56 (2024).
15. Niu, T. Railway turnout fault recognition based on machine learning. Henan Sci. Technol. 40 (6), 33–35 (2021).
16. Tan, Z. & Xu, L. Fault classification method for SDH optical fiber communication network based on SVR. Technol. IoT AI 56 (6), 43–46 (2024).
17. Dai, T. et al. Bearing feature selection and fault diagnosis based on beluga whale optimization algorithm and random forest algorithm. Ind. Control Comput. 38 (6), 43–45 (2025).
18. Nikos, K., Ruby, N. & Vasso, R. Distributed model-based sensor fault diagnosis of marine fuel engine. IFAC-PapersOnLine 55 (6), 347–353 (2022).
19. Li, X., Cheng, K., Huang, T. & Tan, S. Research on false alarm detection algorithm of nuclear power system based on BERT-SAE-iForest combined algorithm. Ann. Nucl. Energy 170, 108985 (2022).
20. Sun, L., Chen, H., Fu, J. & Wang, Q. Engine fault text classification based on BERT model. Equip. Manuf. Technol. 51 (8), 282–286 (2023).
21. Cheng, J. et al. Aspect-level sentiment classification with HEAT (hierarchical attention) network. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM ’17), New York, NY, USA, 6 November 2017, 97–106 (2017).
22. Devlin, J., Chang, M., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, Minneapolis, Minnesota, USA, 2–7 June 2019, 4171–4186 (2019).
23. Cinelli, L. P. et al. Automatic event identification and extraction from daily drilling reports using an expert system and artificial intelligence. J. Pet. Sci. Eng. 205, 108939 (2021).
24. Xie, M., He, J., Hu, X. & Cao, Y. Fault diagnosis for urban rail transit trackside signaling equipment based on fault logs. J. Beijing Jiaotong Univ. 44 (5), 27–35 (2020).
25. Liu, J., Xu, K., Cai, B., Guo, Z. & Wang, J. XGBoost-based fault prediction method for on-board train control equipment. J. Beijing Jiaotong Univ. 45 (4), 95–106 (2021).
26. Feng, X. Research on fault classification and prediction of shearer based on BP-SVM. J. Mine Autom. 51 (S1), 47–49 (2025).
27. Zhou, L., Dang, W., Wang, Y. & Zhang, Z. Fault diagnosis for on-board equipment of train control system based on CNN-CSRF hybrid model. J. China Railw. Soc. 42 (11), 94–101 (2020).
28. Yang, N., Zhang, Y., Zuo, J. & Zhao, B. Research on fault text clustering method of signal equipment based on IBTM-TMW. China Railw. Sci. 45 (6), 194–201 (2024).
29. Deng, S., Wang, W., Zhang, Y., Zhang, C. & Tu, S. Case study of aeroengine system failure based on co-word analysis. J. East China Univ. Sci. Technol. 47 (5), 635–646 (2021).
30. Lin, H. et al. Named entity recognition of fault information of high-speed railway turnout from BMBC model. J. Railw. Sci. Eng. 40 (5), 529–538 (2023).
31. Xu, Q., Zhang, L., Ou, D. & He, Y. Fault classification method for on-board equipment of metro train control system based on BERT-CNN. J. Shenzhen Univ. (Sci. Eng.) 40 (5), 529–538 (2023).
32. Zhang, S. BERT and BiLSTM based text classification method for railway safety supervision system. SmartTech Innov. 22, 38–42 (2021).
33. Zhang, Y., Ye, H., Zhang, L. & Xue, Y. BERT-based short text classification model and its application in fault diagnosis of CIR equipment. J. Syst. Sci. Math. Sci. 44 (1), 115–131 (2024).
34. Jiao, Y., Li, R. & Wang, J. Intelligent testing method for railway CTC interface data based on fuzzy natural language processing. Chin. J. Intell. Sci. Technol. 6 (2), 201–209 (2024).
35. Xia, L. et al. Short text automatic scoring system based on BERT-BiLSTM model. J. Shenzhen Univ. (Sci. Eng.) 39 (3), 349–354 (2022).
36. Xu, J. et al. Construction and application of knowledge graph in diesel engine fault field. Comput. Syst. Appl. 31 (7), 66–76 (2022).
37. Li, Z., Lin, S. & Zhang, Q. Edge cloud computing approach for intelligent fault detection in rail transit. Comput. Sci. 51 (9), 331–337 (2024).
38. Xu, J. et al. Research on state evaluation of large railroad maintenance machinery based on Fuzzy-AHP. Railw. Transp. Econ. 45 (11), 168–174 (2023).
39. Yuan, H., Zhao, P. & Zhu, L. Design and implementation of heavy-duty maintenance machinery assignment management system. China Instrum. 43 (4), 66–69 (2023).
40. Li, H. & Ma, L. Common mechanical faults of large-scale road maintenance machinery and their handling measures. Technol. Mark. 28 (2), 181–182 (2021).
41. Wang, L. et al. Image captioning model based on multi-step cross-attention cross-modal alignment and external commonsense knowledge augmentation. Electronics 14, 3325 (2025).
42. Patra, C., Giri, D., Maitra, T. & Kundu, B. A comparative study on detecting phishing URLs leveraging pre-trained BERT variants. In Proceedings of the 2024 6th International Conference on Computational Intelligence and Networks (CINE), Bhubaneswar, India, 19–21 December 2024, 1–6 (2024).
43. Olaniyan, J., Verkijika, S. F. & Obagbuwa, I. C. NLP-based restoration of damaged student essay archives for educational preservation and fair reassessment. Electronics 14, 3189 (2025).
44. Chen, K., Wang, Z. & Zhou, X. Text classification based on improved BERT-sPTT model with space reduced attention. Comput. Eng. Appl. 62 (14), 1–18 (2025).
45. Liu, Y., Huang, H., Gao, J. & Gai, S. A study of Chinese text classification based on a new type of BERT pre-training. In Proceedings of the 2023 5th International Conference on Natural Language Processing (ICNLP), Guangzhou, China, 303–307 (2023).
46. Jiang, K. et al. A student social network text sentiment classification model based on ensemble learning and BERT architecture. In Proceedings of the 2024 2nd International Conference on Sensors, Electronics and Computer Engineering (ICSECE), Jinzhou, China, 24–26 August 2024, 359–362 (2024).
47. Tian, B. et al. Adaptive understanding framework and key technology of power grid fault disposal information. Electr. Power 57 (7), 188–195 (2024).
48. Rogers, A., Kovaleva, O. & Rumshisky, A. A primer in BERTology: what we know about how BERT works. Trans. Assoc. Comput. Linguist. 8, 842–866 (2020).
49. Kido, S., Hirano, Y. & Hashimoto, N. Detection and classification of lung abnormalities by use of convolutional neural network (CNN) and regions with CNN features (R-CNN). In Proceedings of the 2018 International Workshop on Advanced Image Technology (IWAIT), Chiang Mai, Thailand, 7–9 January 2018, 1–4 (2018).
50. Peng, T. & Wen, F. Photovoltaic panel fault detection based on improved Mask R-CNN. In Proceedings of the 2023 International Conference on Control, Electronics and Computer Technology (ICCECT), Jilin, China, 28–30 April 2023, 1187–1191 (2023).
51. Wang, Y. & Zhu, Y. Deep residual shrinkage network recognition method for transformer partial discharge. Electronics 14, 3181 (2025).
52. Zhou, X., Wang, K., Zhou, X., Zhang, Q. & Xue, G. DSA-YOLOv8 algorithm for traffic sign detection. Comput. Eng. Des. 46 (8), 2320–2327 (2025).
53. Chaudhari, P., Xiao, Y. & Li, T. Translution: A hybrid Transformer–Convolutional architecture with adaptive gating for occupancy detection in smart buildings. Electronics 14, 3323 (2025).
54. Xiong, K. Research on text classification method based on deep learning and attention mechanism. Master’s Thesis, Jiangxi Normal University, Jiangxi, China (2020).
55. Luo, Q. et al. Self-attention and Transformers: driving the evolution of large language models. In Proceedings of the 2023 6th IEEE International Conference on Electronic Information and Communication Technology (ICEICT), Qingdao, China, 401–405 (2023).
56. Zheng, W., Yao, Y., Dai, B., Chang, Y. & Sun, X. Economic load dispatch of coal-fired power plant based on data mining technology. Therm. Power Gener. 50 (7), 78–83 (2021).
57. Yang, X., Cao, Z., Zhang, M., Hu, Z. & Li, L. Energy disaggregation of commercial buildings based on Attention-LSTM. Smart Power 48 (9), 89–95 (2020).
58. Zhang, H., Tian, H., Wang, L., Xu, B. & Duan, Z. A power system transient stability assessment method based on Seq2Seq technology. Power Syst. Clean Energy 37 (4), 23–31 (2021).
59. Wang, W., Yasenjiang, J., Xiao, Y., Lv, L. & Lan, Z. Variable speed fault model of bearings based on improved BiTCN and BiGRU. J. Mech. Electr. Eng. 42 (12), 2343–2353 (2025).
60. Wu, Y., Guo, H., Yan, X., Meng, F. & Luo, L. A fault classification method based on knowledge graph and TransGNN model. Mod. Mach. Tool Autom. Manuf. Tech. 8, 74–79, 86 (2025).
61. Li, X. et al. Fault diagnosis method based on improved auxiliary classifier generative adversarial network combined with model migration strategy. Chin. Hydraul. Pneum. 49 (8), 21–34 (2025).
