Abstract
The current power grid business handles massive data operations where data retrieval frequently encounters redundancy issues. Conventional decision tree-based methods struggle to achieve accurate data acquisition when facing redundant interference. To address this challenge, this study proposes a multi-level redundant data retrieval method using an improved decision tree algorithm for grid resource business center platforms. The methodology first establishes a multi-level data decision tree using grid resource business middle-platform data, then applies a decision tree pruning algorithm based on Akaike information criterion. The ant colony algorithm optimizes the pruning parameters of the decision tree model, and after obtaining optimal pruning parameters, processes the grid resource business middle-platform data decision tree to generate an improved version. Subsequently, the multi-level redundant data retrieval method based on the improved decision tree implements fast retrieval of hierarchical redundant data in grid resource business through designed repetitive data processing flows and multi-level redundant data discrimination mechanisms. The experimental results demonstrate that the improved decision tree algorithm improves multi-level redundant data retrieval accuracy by 14%. The optimized decision tree model for middle-platform data achieves more comprehensive representation of grid resource service data hierarchies and enables effective retrieval of multi-level redundant data including both image and text categories from the middle-platform data. The maximum F1-score reaches 0.99 with retrieval time of only 4.5 s, which is 1.5 s below the predefined threshold, confirming excellent practical performance.
Keywords: Improved decision tree, Grid resource, Business middle-platform, Multi-level redundant data, Fast retrieval
Subject terms: Engineering, Mathematics and computing
Introduction
Research background
The middle platform of power grid resource business has evolved into an indispensable component of modern power systems. This platform not only manages massive power data but also handles complex business logic and data processing workflows. Among these data operations, redundant data represents a prevalent challenge1 that not only consumes storage capacity but may also degrade processing efficiency and potentially compromise grid security and stability. Traditional redundant data storage strategies typically employ single-level redundancy, simply replicating identical data copies across different storage nodes. In contrast, multi-level redundant data implements a more sophisticated hierarchical redundancy architecture. This approach stores data copies not only across different physical locations but also utilizes diverse storage media within the same location, and further applies varying storage formats or encoding schemes on identical media. The multi-level redundancy strategy enhances data reliability and recovery capacity through its hierarchical structure, addressing more complex and diverse failure scenarios. Consequently, achieving efficient retrieval of multi-level redundant data in grid resource services has emerged as both a research hotspot and technical challenge. As the central component of power systems, the grid resource business center processes enormous datasets with complex structures. Throughout data generation, transmission, storage, and processing workflows, redundant data inevitably accumulates due to various factors including equipment failures, human errors, and system upgrades2. These redundancies manifest in multiple forms such as duplicate records, invalid entries, and obsolete data, which not only complicate data management but may also threaten normal power system operations. Efficient retrieval represents the critical solution for redundant data management, enabling timely identification and subsequent cleanup or optimization measures. 
However, the substantial scale and structural complexity of grid resource data often render conventional retrieval methods inadequate for meeting speed and accuracy requirements3. Therefore, developing an efficient multi-level redundant data retrieval method specifically designed for grid resource service platforms holds significant theoretical importance and practical value.
Related research work
Research on efficient redundant data retrieval in power grid resource services has achieved notable progress. Raji et al.4 develop a cloud database security forensic data transmission algorithm that analyzes MySQL binary log structures and employs the KMP algorithm for keyword matching to retrieve user-required information. However, this approach presents implementation challenges due to complex parameter tuning requirements and susceptibility to local optima. While the KMP algorithm offers linear time complexity for string matching, its performance may degrade when processing massive datasets, particularly with extremely large binary logs. Spea5 introduces a social network search algorithm for cogeneration economic scheduling that utilizes concept similarity and attribute similarity as semantic detection metrics, accomplishing network data retrieval through semantic feature extraction and similarity computation. Nevertheless, this method overlooks multi-level redundant data characteristics, leading to incomplete feature extraction and compromised redundant data search accuracy. Seo6 proposes a weighted quantum search algorithm that examines how superposition state initialization amplitude variations affect iteration outcomes, establishing necessary weight coefficient conditions and constructing quantum superposition states containing target weight information for quantum information retrieval following feature extraction. Despite considering superposition state initialization amplitude influences and incorporating weight coefficients, the algorithm’s weight coefficient determination for multi-level redundant data retrieval may lack precision, adversely affecting redundant data acquisition accuracy. 
Olaide et al.7 present a nature-inspired heuristic optimization search algorithm employing probability distribution domain classification to model user interests, subsequently providing similarity calculations and interest model updates to satisfy information retrieval needs. However, when handling multi-level redundant data with numerous features and hierarchies, this method struggles to precisely identify and locate optimal data copies, diminishing retrieval accuracy. Lo et al.8 devise a context-embedded entity relationship retrieval method that computes gray-level co-occurrence matrices to extract texture features (e.g., energy) for initial retrieval, then applies Canny operators for edge detection and projection methods for vertical/horizontal edge profile calculations to refine results based on shape characteristics. Although combining texture and shape features, this approach’s feature extraction remains vulnerable to noise interference from gray-level co-occurrence matrices and Canny operators, limiting retrieval accuracy. Furthermore, the method demonstrates poor adaptability to image complexity, yielding suboptimal retrieval performance for intricate images.
In multi-level redundant data environments, retrieval methods must dynamically select and transition between various redundancy tiers, potentially introducing complexity and latency to the retrieval process. Furthermore, conventional classifiers often lack optimization for multi-level redundant data characteristics, compromising their ability to precisely identify and locate optimal data replicas, which consequently diminishes retrieval accuracy and operational efficiency. These technical challenges necessitate the development of specialized classifiers that explicitly account for the distinctive attributes of multi-level redundant data architectures. Such tailored classifier designs can better accommodate the inherent complexity and diversity of hierarchical redundancy structures, ultimately enhancing both retrieval performance and system reliability.
In contrast, the decision tree algorithm, which employs tree-based structures for decision-making, represents a fundamental machine learning methodology. This approach provides an intuitive and interpretable machine learning framework, with an architecture resembling flowcharts where each node constitutes a decision point and each branch denotes a potential outcome. This transparent structure offers significant advantages for multi-level redundant data retrieval in grid resource business centers, enabling administrators and users to readily comprehend both the retrieval processes and outcomes. Importantly, decision trees demonstrate exceptional capability in processing data with multiple features and hierarchical levels, making them particularly suitable for multi-level redundant data retrieval scenarios. Within such environments, decision trees can efficiently navigate and locate target data by systematically processing different data replica levels through distinct decision nodes. Furthermore, decision tree methods exhibit remarkable flexibility and adaptability, allowing customization according to varying data characteristics and retrieval requirements. This adaptability proves especially valuable in grid resource business centers where data attributes and retrieval needs may evolve over time and operational contexts, as decision trees can dynamically adjust to maintain effective retrieval strategies. The decision tree construction methodology comprises three critical phases: feature selection, tree generation, and pruning. Feature selection determines the optimal classification attributes for each node, employing established metrics including information gain, gain ratio, and Gini index. During the tree generation phase, the algorithm recursively identifies optimal feature segmentation points until meeting predefined termination criteria (e.g., complete sample homogeneity, exhaustive feature utilization, or sample quantity falling below specified thresholds).
To address the limitations of inadequate retrieval accuracy and suboptimal performance in existing multi-level redundant data retrieval approaches, this paper presents an enhanced decision tree-based fast retrieval method for multi-level redundant data in power grid resource services. The proposed methodology accomplishes efficient retrieval of hierarchical redundant information in grid resource operations, significantly enhancing data processing efficiency while minimizing storage requirements, thereby providing robust support for power system security and stability. The technical framework involves constructing a multi-level data decision tree and applying an Akaike information criterion-based pruning algorithm to effectively reduce model complexity while improving classification performance. The ant colony algorithm is subsequently employed to optimize decision tree pruning parameters, incorporating three key enhancements: (1) introduction of dynamic selection factors to refine ant state transitions and ensure algorithmic convergence, (2) implementation of an optimal worst-case pheromone update strategy with reward-punishment mechanisms to improve pheromone concentration updates, and (3) comprehensive optimization of decision tree pruning parameters to enhance model generalization capability and retrieval efficiency. These algorithmic improvements collectively enable intelligent retrieval of multi-level redundant data. The implemented solution features a specialized retrieval workflow for processing hierarchical redundant data in grid resource business platforms, incorporating discriminative mechanisms for multi-level redundancy identification, ultimately achieving high-performance retrieval of complex redundant data structures in power grid operations.
Rapid retrieval of multi-level redundant data in the grid resource business middle platform
To achieve efficient retrieval of multi-level redundant data in grid resource business centers, operational data is first acquired from the platform to construct a multi-level data decision tree. While traditional decision tree pruning typically employs criteria like information gain and Gini impurity, this study proposes an Akaike information criterion-based pruning algorithm9 to effectively reduce decision tree complexity while enhancing classification performance. Specifically addressing the characteristics of multi-level redundant data, the methodology incorporates an ant colony algorithm to optimize pruning parameter selection. The ant colony algorithm implementation includes three key enhancements: (1) introduction of dynamic selection factors to improve ant state transitions and ensure convergence, (2) development of a reward-punishment mechanism based on optimal-worst pheromone update strategies to enhance pheromone concentration updates, and (3) resolution of the limitation caused by fixed pheromone concentration fluctuation coefficients during pruning parameter solution searches in grid resource service data decision tree models. These algorithmic improvements collectively enhance the decision tree’s generalization capability and retrieval efficiency. The optimized decision tree subsequently enables intelligent retrieval of multi-level redundant data through systematic generation of corresponding classification structures.
Multi-level data decision tree generation for the grid resource business middle platform
Operational data acquired from the grid resource business center over a given period serves as the foundation for constructing a multi-level data decision tree through the following technical process:
Let X and Y respectively represent the input and output variables of the decision tree for multi-level data in the grid resource business middle platform, where Y denotes continuous operational multi-level data variables. Given the training dataset:

$$D=\{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}\tag{1}$$
The regression tree structure corresponds to a partition of the input space for multi-level data in the grid resource business center, with assigned output values for each partitioned region. The implementation employs a heuristic partitioning method10, selecting the j-th variable $x^{(j)}$ and its threshold value s as the splitting variable and cut point, thereby defining two distinct regions:

$$R_1(j,s)=\{x\mid x^{(j)}\le s\}\tag{2}$$

$$R_2(j,s)=\{x\mid x^{(j)}> s\}\tag{3}$$

where x indicates the multi-level data.
The input space is partitioned into L distinct modules $R_1,R_2,\ldots,R_L$, with each unit $R_l$ containing a fixed output value $c_l$ for the grid resource business middle platform’s multi-level data. Consequently, the regression tree model is formally expressed as:

$$f(x)=\sum_{l=1}^{L}c_l\,I(x\in R_l)\tag{4}$$

where $I(\cdot)$ is the indicator function.
Given a predetermined partition of the multi-level data input space in the grid resource business center, the loss function can be quantified using the squared error term $\sum_{x_i\in R_l}(y_i-f(x_i))^2$, where the optimal output value for each partitioned cell is derived through squared error minimization. In accordance with the least squares principle11, the optimal value $\hat{c}_l$ of $c_l$ for the unit $R_l$ corresponds to the arithmetic mean of all output values $y_i$ associated with input multi-level data points $x_i$ within $R_l$, formally expressed as:

$$\hat{c}_l=\operatorname{ave}(y_i\mid x_i\in R_l)\tag{5}$$
The optimization objective involves determining the optimal splitting variable j and cut point s through the following minimization formulation:

$$\min_{j,s}\left[\min_{c_1}\sum_{x_i\in R_1(j,s)}(y_i-c_1)^2+\min_{c_2}\sum_{x_i\in R_2(j,s)}(y_i-c_2)^2\right]\tag{6}$$
For a specified splitting variable j, the optimal cut point s is derived. Based on this, the input space partitions into two distinct regions whose optimal output values are computed as follows:

$$\hat{c}_1=\operatorname{ave}(y_i\mid x_i\in R_1(j,s))\tag{7}$$

$$\hat{c}_2=\operatorname{ave}(y_i\mid x_i\in R_2(j,s))\tag{8}$$
The algorithm systematically examines all input variables of the grid resource business middle platform’s multi-level data to identify the optimal splitting variable j and its corresponding cut point s, forming the optimal pair $(j,s)$. This pair partitions the input space into two distinct regions12, with the partitioning process iteratively applied to each resulting region until meeting termination conditions. The formal algorithmic procedure is structured as follows:

Input: Training dataset D comprising multi-level data from the grid resource business middle platform;

Output: Optimized decision tree model $f(x)$.
Within the input space encompassing all training data from the grid resource business middle platform’s multi-level dataset, a binary decision tree is constructed through recursive bipartitioning of each region into two sub-regions, with subsequent determination of optimal output values for each resulting partition.
(1) The optimal splitting variable j and cut point s are determined by solving the minimization problem in Eq. (6). For each candidate variable j, the optimal cut point s is identified through exhaustive search, selecting the pair $(j,s)$ that minimizes the objective function.

(2) Using the optimal pair $(j,s)$, partition the input space into regions $R_1(j,s)$ and $R_2(j,s)$, and compute the corresponding optimal output value $\hat{c}_l$:

$$\hat{c}_l=\frac{1}{N_l}\sum_{x_i\in R_l(j,s)}y_i,\qquad x\in R_l,\ l=1,2\tag{9}$$

where $R_1(j,s)=\{x\mid x^{(j)}\le s\}$, $R_2(j,s)=\{x\mid x^{(j)}>s\}$, and $N_l$ is the number of samples falling in $R_l$.
(3) Recursively apply steps (1) and (2) to each subregion until meeting the stopping criteria.
(4) The final decision tree partitions the multi-level data input space of the grid resource business middle platform into L regions $R_1,R_2,\ldots,R_L$, completing the model construction.
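The exhaustive split search in steps (1) and (2) above can be sketched as follows. This is an illustration only, not the platform's implementation; the dataset and variable names are invented for the example:

```python
# Sketch of the least-squares split search (Eqs. 6-8): for every candidate
# variable j and cut point s, partition the data into two regions, take each
# region's mean as its optimal output, and keep the (j, s) with minimal
# squared error.
import numpy as np

def best_split(X, y):
    """Return (j, s, loss) minimizing the two-region squared error."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):                      # candidate splitting variable
        for s in np.unique(X[:, j]):        # candidate cut point
            left = y[X[:, j] <= s]
            right = y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # optimal outputs c1, c2 are the region means (Eqs. 7-8)
            loss = ((left - left.mean()) ** 2).sum() \
                 + ((right - right.mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, s, loss)
    return best

# toy single-variable data with an obvious break between 3.0 and 10.0
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
j, s, loss = best_split(X, y)   # -> j = 0, s = 3.0
```

Applied recursively to each resulting region, this search yields the binary regression tree of step (3).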
Decision tree pruning design based on the Akaike information criterion
Decision trees predict target attributes through hierarchical splitting rules, while logistic regression models can be evaluated by jointly weighing classification accuracy against model complexity via the Akaike Information Criterion (AIC)13. By transforming a decision tree into an equivalent logistic regression model (where splitting rules become explanatory variables and the decision attribute serves as the response variable), the AIC can likewise assess decision tree quality. The optimal decision tree thus corresponds to the logistic regression configuration with minimal AIC. The pruning process involves full tree growth followed by bottom-up pruning, where a node is eliminated only if its removal reduces the model's AIC. Iteration continues until no further pruning improves the AIC, yielding the AIC-optimal tree. Generally, superior decision trees exhibit higher classification accuracy and, at equivalent accuracy levels, fewer nodes (lower complexity). This AIC-based pruning algorithm enhances decision tree performance by systematically balancing accuracy and complexity, often improving classification rates while maintaining interpretability.
Based on the preceding analysis, the Akaike Information Criterion (AIC)-based decision tree pruning algorithm implements the following procedure:
Input: Training dataset D;
Output: Pruned decision tree. The algorithm proceeds as follows:
Step 1: Using the data of training set D, generate a fully-grown decision tree as the initial model. To control overfitting and enhance generalization, a complexity parameter is applied. The tree is then decomposed into subtrees, with each subtree’s splitting rules transformed into equivalent logistic regression representations.

Step 2: For each terminal leaf node: extract all splitting rules along the path from the root node; evaluate these rules on test set s to generate explanatory variables $x_1,x_2,\ldots,x_p$, where p denotes the variable count.
Step 3: For each subtree: designate $x_1,x_2,\ldots,x_p$ as explanatory variables; set decision attribute y as the response variable; construct the logistic regression model with expression:

$$\ln\frac{P(y=1)}{1-P(y=1)}=\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_px_p\tag{10}$$

Estimate the parameters $\beta_0,\beta_1,\ldots,\beta_p$ via maximum likelihood and compute the model’s AIC value, $\mathrm{AIC}=2(p+1)-2\ln\hat{L}$, where $\hat{L}$ is the maximized likelihood.
Step 4: Eliminate terminal nodes sequentially from bottom-up; Recalculate AIC for each modified logistic model; Retain configuration with minimal AIC; Preserve corresponding explanatory variables as post-pruning rules.
Step 5: Repeat Steps 2–4 for all subtrees until no further AIC improvement.
Step 6: Aggregate all optimally pruned subtrees to produce final decision tree.
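The core pruning decision of Steps 3–4 reduces to an AIC comparison. The sketch below illustrates this; the log-likelihood values are hypothetical placeholders standing in for fitted logistic regression models, which in practice come from maximum-likelihood estimation:

```python
# Minimal sketch of the AIC pruning decision: a pruned configuration is
# retained only if it attains a lower AIC than the unpruned one.

def aic(log_likelihood, n_params):
    # AIC = 2k - 2 ln(L_hat); lower is better
    return 2 * n_params - 2 * log_likelihood

# full rule set vs. the same subtree with one rule pruned (assumed fits)
aic_full = aic(log_likelihood=-120.0, n_params=6)
aic_pruned = aic(log_likelihood=-120.5, n_params=5)

# Step 4: retain the configuration with minimal AIC
keep_pruned = aic_pruned < aic_full
```

Here the small loss in likelihood is outweighed by the drop in parameter count, so the pruned configuration wins; with a larger likelihood loss the comparison would go the other way.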
Improved decision tree algorithm based on ant colony algorithm
The conventional decision tree algorithm utilizes a multinomial tree structure for classification tasks, which exhibits inherent inefficiencies in computational performance and space utilization. To address these limitations, this study enhances the decision tree through integration with the ant colony optimization algorithm, specifically targeting the pruning process optimization. This hybrid approach achieves dual objectives: maintaining precise classification accuracy for training samples while systematically reducing overfitting risks, ultimately yielding a more robust decision tree model. The technical implementation framework proceeds through the following sequence:
The ant colony algorithm treats the pruning parameters of the grid resource business center’s data decision tree model as feasible solutions within a parameter space, where each mapping relationship represents a distinct feasible solution. In this framework, every possible combination of pruning parameters serves as a node traversed by artificial ants during their search for the optimal solution. During exploration, ants deposit pheromones14 along their paths, with the pheromone concentration at each node guiding subsequent ants toward optimal pruning parameters. This biologically inspired process ultimately identifies the most efficient foraging path (i.e., the optimal pruning configuration), enabling derivation of the ideal decision tree model for grid resource business operations. Key algorithmic parameters include: α (pheromone concentration coefficient) and β (heuristic factor weight). The ants’ movement direction is determined by the state transition probability15, calculated as follows:
$$p_{ij}^{k}(t)=\begin{cases}\dfrac{[\tau_{ij}(t)]^{\alpha}\,[\eta_{ij}(t)]^{\beta}}{\sum_{s\in \mathrm{allowed}_k}[\tau_{is}(t)]^{\alpha}\,[\eta_{is}(t)]^{\beta}}, & j\in \mathrm{allowed}_k\\[6pt] 0, & \text{otherwise}\end{cases}\tag{11}$$

In the above equation, $\mathrm{allowed}_k$ indicates the set of accessible feasible solution nodes for the k-th ant’s next movement in the pruning parameter space of the grid resource business middle platform’s decision tree model; s, i and j represent individual feasible solution nodes within this set; $\tau_{ij}(t)$ quantifies the pheromone concentration on the path between nodes i and j for the pruning parameter of the grid resource service data decision tree model at time t; and $\eta_{ij}(t)$ represents the heuristic desirability factor for the path between nodes i and j in the pruning parameter space of the grid resource service data decision tree model.
During the ants’ iterative search for feasible pruning parameter solutions in the grid resource business center’s decision tree model16, pheromone concentrations along traversed paths are dynamically updated according to the following evolutionary rule:
$$\tau_{ij}(t+1)=(1-\rho)\,\tau_{ij}(t)+\Delta\tau_{ij}(t),\qquad \Delta\tau_{ij}(t)=\sum_{k=1}^{m}\Delta\tau_{ij}^{k}(t)\tag{12}$$

In the above equation, $\Delta\tau_{ij}(t)$ denotes the pheromone update; ρ is the pheromone volatilization coefficient; $\Delta\tau_{ij}^{k}(t)$ is the residual pheromone concentration deposited by the k-th ant during the iteration; and m is the number of ants.
Following pheromone concentration updates (Eq. 12) for feasible pruning parameter nodes in the grid resource business middle platform’s decision tree model, ants identify optimal parameter solutions via Eq. 11. This process reveals two limitations: individual ants exhibit constrained global search capability over the feasible pruning parameter solutions of the decision tree model, and convergence rates remain suboptimal. To address these issues, we introduce a dynamic selection factor $f_0$ to enhance the ant state transition mechanism, yielding the improved node selection formula:

$$j=\begin{cases}\arg\max\limits_{s\in \mathrm{allowed}_k}\left\{[\tau_{is}(t)]^{\alpha}\,[\eta_{is}(t)]^{\beta}\right\}, & f\le f_0\\ J, & f>f_0\end{cases}\tag{13}$$

In the above equation, j denotes the next feasible pruning parameter node selected by the ant for the grid resource business middle platform’s decision tree model; f denotes a uniformly distributed random variable within [0, 1]; J denotes the node drawn according to the state transition probabilities of Eq. (11); and $\mathrm{allowed}_k$ defines the set of accessible pruning parameter nodes for the grid resource business middle-office data decision tree model. The selection mechanism operates as follows: when $f\le f_0$, ants select the node maximizing the product of pheromone concentration and heuristic value among feasible pruning parameter solutions as their next position17; when $f>f_0$, node selection follows the roulette wheel method specified in Eq. (11).
During the above operation, to avoid premature convergence of the algorithm, the dynamic selection factor $f_0$ is taken to grow with the iteration count:

$$f_0=\frac{n}{N}\tag{14}$$

where N represents the iteration count threshold in the ant colony algorithm and n is the current iteration number.
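A compact sketch of the improved state transition follows. It assumes the dynamic selection factor grows linearly with the iteration count (one plausible form), so early iterations favour roulette-wheel exploration and later ones the greedy choice, countering premature convergence; the pheromone and heuristic tables are invented:

```python
# Sketch of the improved ant state transition: greedy exploitation with
# probability f0, roulette-wheel selection otherwise.
import random

def next_node(tau, eta, allowed, alpha, beta, n, N, rng=random):
    """Pick the next pruning-parameter node for one ant."""
    weights = {s: (tau[s] ** alpha) * (eta[s] ** beta) for s in allowed}
    f0 = n / N                      # dynamic selection factor (assumed form)
    if rng.random() <= f0:          # exploit: max pheromone * heuristic
        return max(weights, key=weights.get)
    total = sum(weights.values())   # explore: roulette wheel
    r = rng.uniform(0, total)
    acc = 0.0
    for s, w in weights.items():
        acc += w
        if r <= acc:
            return s
    return s                        # numerical fallback: last node

# toy tables: node 1 carries the strongest pheromone trail
tau = {0: 1.0, 1: 4.0, 2: 2.0}
eta = {0: 1.0, 1: 1.0, 2: 1.0}
# at the final iteration (n = N) selection is fully greedy
node = next_node(tau, eta, allowed=[0, 1, 2], alpha=1.0, beta=2.0, n=100, N=100)
```

At n = N the factor reaches 1, so the ant deterministically picks node 1, the feasible solution with the largest weighted pheromone value.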
During the ants’ search for feasible pruning parameter nodes in the grid resource business middle platform’s decision tree model, the pheromone evaporation coefficient typically remains fixed. This constraint reduces the diversity of obtainable pruning parameter solutions. To address this limitation, we enhance the pheromone update mechanism18 by implementing: (1) an optimal-worst pheromone update strategy with reward-punishment mechanisms, and (2) path classification into three categories: optimal path ($T_{best}$), worst path ($T_{worst}$), and average path ($T_{avg}$). The local pheromone concentrations along each node’s path in the feasible solution space are then updated according to the k-th ant’s path length using the following formulation:

$$\Delta\tau_{ij}^{k}(t)=\begin{cases}(1+\mu_1)\,\dfrac{Q}{L_k}, & (i,j)\in T_{best}\\[6pt](1-\mu_2)\,\dfrac{Q}{L_k}, & (i,j)\in T_{worst}\\[6pt]\dfrac{Q}{L_k}, & (i,j)\in T_{avg}\end{cases}\tag{15}$$

In the above equation, Q is the initial pheromone intensity; $L_k$ quantifies the total path length traversed by the k-th ant; and $\mu_1$, $\mu_2$ are the reward and punishment gain coefficients for nodes within the feasible solution space of the grid resource business middle-office’s decision tree model pruning parameters. The update is applied at each iteration until the maximum iteration count of the ant colony is reached.
Building upon the updated pheromone concentrations specified in Eq. (15), the ant colony algorithm systematically explores the feasible solution space of pruning parameters within the grid resource business middle platform’s decision tree model19. This exploration targets the identification of optimal input-output mapping relationships across all network layers, where the ultimately determined mapping relationship constitutes the final pruning parameter set. These optimized parameters are then applied to execute the pruning process for the grid resource business middle platform’s decision tree model, yielding the globally optimal decision tree configuration for grid resource management operations.
The data schema is established by applying the optimized decision tree model derived from the grid resource business center, which serves as the foundation for subsequent redundant data retrieval operations. The data model determination process executes through the following sequential steps:
Step 1: Select data classification attributes according to the inherent characteristics of the dataset and generate a comprehensive collection of test attributes.
Step 2: Construct the initial decision tree framework by utilizing the test attribute set obtained in Step 1 as nodal elements within the decision tree architecture.
Step 3: Perform computational analysis of data gain metrics for each test attribute associated with the current decision tree node.
Step 4: Implement selection criteria to identify the test attribute exhibiting maximum data gain.
Step 5: Establish three termination conditions: (a) a single categorized attribute value, (b) a single test attribute value, or (c) zero test attribute data gain. Node evaluation proceeds as follows: if any condition is satisfied, the node is popped from the node library, designated as the current node, and the process returns to Step 3; if no condition is satisfied, the process advances to Step 6; if the node library is empty, the process advances to Step 7.
Step 6: Generate the sub-decision tree for the current node, assign the rightmost sub-node as the new current node, archive residual nodes in the node library, and reinitiate Step 3.
Step 7: Conclude the algorithmic process.
Upon completion of this rigorous procedure, the finalized decision tree structure emerges with test attributes constituting nodal points and branch pathways representing the value domains of corresponding nodal attributes. This methodical approach yields both a determinate data schema and its corresponding decision tree implementation, thereby enabling sophisticated intelligent retrieval capabilities for multi-level redundant data systems as subsequently elaborated.
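The attribute-selection core of Steps 3–4 can be sketched as an information-gain computation. The toy records and class labels below are invented for illustration (the attribute names loosely echo the platform's resource fields):

```python
# Sketch of Steps 3-4: compute the information gain of each candidate test
# attribute and select the attribute with the maximum gain.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n = len(rows)
    split = {}
    for row, y in zip(rows, labels):        # partition labels by attribute value
        split.setdefault(row[attr], []).append(y)
    cond = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - cond           # gain = H(Y) - H(Y | attr)

rows = [{"type": "CPU", "status": "idle"},
        {"type": "CPU", "status": "busy"},
        {"type": "GPU", "status": "idle"},
        {"type": "GPU", "status": "busy"}]
labels = ["keep", "flag", "keep", "flag"]   # hypothetical classes
best_attr = max(["type", "status"], key=lambda a: info_gain(rows, labels, a))
```

Here "status" perfectly separates the classes (gain 1 bit) while "type" carries no information (gain 0), so "status" becomes the node's test attribute.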
Multi-level redundant data retrieval implementation based on improved decision tree
Building upon the established data model and corresponding decision tree architecture, this study implements a multi-level redundant data retrieval framework to enable intelligent identification and processing of redundant data. During data acquisition operations, multi-level redundant data emerge due to intrinsic equipment characteristics and environmental influences. These redundant manifestations primarily consist of two distinct categories20: duplicate data representing identical information records, and similar data conveying equivalent semantic content through non-identical representations. While similar data may retain potential utility in certain contexts21, the current research scope specifically defines redundant data as strictly duplicate records, as formally represented in the multi-level redundant data retrieval framework illustrated in Fig. 1.
Fig. 1. Multi-level redundant data retrieval process based on improved decision tree.
The multi-level redundant data retrieval process utilizing the enhanced decision tree model initiates by establishing the complete data attribute set. The algorithm sequentially selects individual grid resource data entries from this set and performs duplicate verification. For non-duplicate instances, the system reverts to the attribute set for subsequent data selection. When encountering potential multi-level redundant data, the process: flags the target data, computes its cryptographic hash value, and evaluates whether this value falls below the predetermined threshold. Should the hash value exceed the threshold, the system reinitiates duplicate verification for the current data entry; otherwise, it outputs the identified multi-level redundant data instance and terminates the retrieval cycle. This operational workflow is technically implemented through the following procedure:
The data model obtained in the previous section is denoted as $B=\{b_1,b_2,\ldots,b_n\}$, where $b_n$ represents the data set of the n-th property of the decision tree. Redundant data is mainly judged by the hash value, and the data hash value is calculated as:

$$h_i=\sigma\circ d_i\cdot\chi\tag{16}$$

where $h_i$ indicates the hash value of the i-th data; $d_i$ indicates the i-th data; $\sigma$ denotes the miscellaneous factors; $\chi$ denotes the computational auxiliary parameters; and $\circ$ is the same-order operator.
The multi-level redundant data22 is retrieved according to the hash value, based mainly on the inter-class distance between hash values, expressed as the DBI index. The DBI index is calculated as:

$$d_{ij}=\psi\,\lvert h_i-h_j\rvert\tag{17}$$

where $d_{ij}$ represents the distance between data i and data j; $h_i$ and $h_j$ denote the hash values of data i and data j; j denotes the index parameter; and $\psi$ denotes the computational factor.
The multi-level redundant data discrimination rule is:

$$\begin{cases}d_{ij}\le\varepsilon, & \text{data } i \text{ and } j \text{ are multi-level redundant}\\ d_{ij}>\varepsilon, & \text{data } i \text{ and } j \text{ are not redundant}\end{cases}\tag{18}$$

where $\varepsilon$ denotes the discrimination threshold for multi-level redundant data.
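Since the present scope defines redundancy as strictly duplicate records, the discrimination rule degenerates to an exact hash comparison. The sketch below illustrates this; the use of SHA-256, the record serialization, and the zero threshold are all assumptions of the example, not the paper's exact hashing scheme:

```python
# Sketch of the discrimination rule: two records are flagged as multi-level
# redundant when the distance between their hash values is within a threshold.
import hashlib

def record_hash(record: str) -> int:
    # stable digest of the serialized record (assumed scheme)
    return int(hashlib.sha256(record.encode()).hexdigest(), 16)

def is_redundant(rec_a: str, rec_b: str, threshold: int = 0) -> bool:
    # with exact duplicates as the target, the distance threshold is 0
    return abs(record_hash(rec_a) - record_hash(rec_b)) <= threshold

a = "ResourceID=R1;Load=0.73;Status=busy"
b = "ResourceID=R1;Load=0.73;Status=busy"   # exact duplicate of a
c = "ResourceID=R2;Load=0.10;Status=idle"
dup_ab = is_redundant(a, b)
dup_ac = is_redundant(a, c)
```

In the full workflow of Fig. 1, records flagged this way are output as multi-level redundant data; records whose distance exceeds the threshold re-enter duplicate verification.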
Experimental analysis
Experimental data description
The experimental study focuses on a power utility enterprise established in 1999 as a wholly-owned subsidiary of State Grid Corporation of China, specializing in regional power grid construction and operation as its core business. With service coverage spanning its entire regional jurisdiction, the enterprise provides electricity to over 43 million consumers. Its operational infrastructure comprises 14 municipal-level power supply companies, all interconnected through a unified grid resource business platform.
The grid resource business platform integrates core resources including grid equipment and topological data from 14 municipal power supply companies under the power enterprise’s jurisdiction. These resources span multiple operational domains such as equipment management, marketing, and power dispatching. The platform provides technical support for diverse business operations while enabling standardized data protocols, homogeneous maintenance procedures, and unified data construction with shared access. However, during platform operation, several challenges emerge: (1) business management software, tools, and sub-platforms are deployed at different time periods, (2) initial platform development lacks coordinated informatization planning, resulting in significant duplication of business management functions, and (3) operational data for identical business processes are frequently distributed across multiple software systems, creating substantial redundant data. Since such multi-level redundant data may contain errors, inconsistencies, or information duplication, they potentially introduce bias and inaccuracies in data analysis and decision-support systems within the grid resource business center. These data quality issues may ultimately compromise decision-making precision and operational reliability. This study implements and validates the proposed multi-level redundant data retrieval method within this practical grid resource business environment.
The data are obtained from the grid resource business center of the experimental enterprise over a one-month collection period. A total of 100,000 records are collected, each representing the business status at a particular point within that month. Each record contains the following 20 attributes:
- Timestamp: the time at which the record was collected.
- ResourceID: unique resource identifier, used to distinguish different computing resources.
- ResourceType: type of resource, such as CPU, GPU, or storage.
- ResourceStatus: status of the resource, such as idle, busy, or under maintenance.
- Load: load on the resource, usually expressed as a percentage.
- Availability: whether the resource is online and available for allocation.
- Latency: resource response time, i.e., the delay from request to response.
- Throughput: number of tasks the resource processes per unit of time.
- Bandwidth: network bandwidth, indicating the data transmission rate between resources.
- EnergyConsumption: energy consumed by the resource during operation.
- JobID: unique task identifier, used to distinguish different computing tasks.
- JobType: type of task, such as data analysis or simulation calculation.
- JobStatus: status of a task, such as waiting, running, completed, or failed.
- JobPriority: priority of a task, indicating its importance and urgency.
- JobDuration: elapsed time from task start to task end.
- JobCompletionRate: progress of task completion.
- UserID: unique user identifier, used to distinguish different users.
- UserGroup: group to which the user belongs, representing the user's permissions and roles.
- DataSize: amount of data processed by the task.
- ErrorRate: proportion of errors occurring during task execution.

These data serve as the basis for establishing the decision tree model of the grid resource business middle-platform data, as shown in Fig. 2.
Fig. 2.

Decision Tree Model for Platform Data in Power Grid Resource Business.
In Fig. 2, green represents decision nodes, while yellow and blue represent two different state nodes. Analysis of Fig. 2 reveals that the proposed method successfully constructs a decision tree model for grid resource business middle-platform data. The model clearly displays the hierarchical structure of this data, demonstrating well-defined levels and comprehensive coverage of all grid resource operation types. These results not only validate the effectiveness of the proposed methodology but also establish a solid foundation for subsequent hierarchical redundant data retrieval in grid resource management. The decision tree model enables more efficient management and utilization of middle-platform data, improves data processing accuracy and efficiency, and provides robust support for grid resource operations development. Consequently, the method presented in this study demonstrates broad application prospects and significant practical value.
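For concreteness, the 20-attribute record layout described above can be sketched as a simple schema check (a minimal sketch: the field names follow the description in the text, while the types and sample values are illustrative assumptions):

```python
# Minimal sketch of the 20-attribute record schema described above.
# Field names follow the text; the sample values are illustrative assumptions.
SCHEMA = [
    "Timestamp", "ResourceID", "ResourceType", "ResourceStatus", "Load",
    "Availability", "Latency", "Throughput", "Bandwidth", "EnergyConsumption",
    "JobID", "JobType", "JobStatus", "JobPriority", "JobDuration",
    "JobCompletionRate", "UserID", "UserGroup", "DataSize", "ErrorRate",
]

def validate_record(record: dict) -> bool:
    """A record is valid only if it carries exactly the 20 expected attributes."""
    return set(record) == set(SCHEMA)

sample = {name: None for name in SCHEMA}
sample.update(Timestamp="2024-05-01T00:00:00", ResourceType="CPU", Load=0.42)
print(len(SCHEMA), validate_record(sample))  # → 20 True
```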
Experimental parameter settings
To ensure the reliability of the experimental test results, the experimental parameters are set as shown in Table 1.
Table 1.
Parameter Settings.
| Parameter category | Specific parameter | Value |
|---|---|---|
| Basic parameters | Maximum tree depth | 8 |
| Basic parameters | Minimum leaf node samples | 5 |
| Pruning parameters | Complexity cost coefficient | 0.01 |
| Pruning parameters | Branch retention confidence threshold | 0.85 |
| Ant colony algorithm parameters | Pheromone volatility coefficient | 0.3 |
| Ant colony algorithm parameters | Number of ants | 20 |
| Ant colony algorithm parameters | Maximum number of iterations | 100 |
Experimental procedure
The purpose of this experiment is to verify the effectiveness of the proposed method in multi-level redundant data retrieval on the grid resource business platform, and to evaluate the improvement of data processing performance by the improved decision tree model. The experimental process mainly consists of the following steps:
(1) Data collection and preprocessing.
Collect business data from the grid resource business center over one month, obtaining a total of 100,000 records. Each record contains 20 attributes, such as timestamp, resource ID, resource type, and resource status.
Clean and preprocess the collected data, including removing duplicate records, handling missing values, standardizing data formats, etc., to ensure data quality and consistency.
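The cleaning steps above can be sketched in plain Python (a minimal illustration under assumed record fields and fill rules, not the paper's exact pipeline):

```python
def clean(records):
    """Remove exact duplicates, fill missing Load values, and normalize status text."""
    seen, cleaned = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))       # exact-duplicate fingerprint
        if key in seen:
            continue                           # drop the duplicate record
        seen.add(key)
        rec = dict(rec)
        if rec.get("Load") is None:            # simple missing-value handling
            rec["Load"] = 0.0
        rec["ResourceStatus"] = rec["ResourceStatus"].strip().lower()
        cleaned.append(rec)
    return cleaned

raw = [
    {"ResourceID": "R1", "ResourceStatus": " Idle ", "Load": 0.3},
    {"ResourceID": "R1", "ResourceStatus": " Idle ", "Load": 0.3},  # duplicate
    {"ResourceID": "R2", "ResourceStatus": "BUSY", "Load": None},   # missing Load
]
print(clean(raw))  # → two records: duplicate dropped, Load filled, status normalized
```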
(2) Construction of decision tree model.
Based on preprocessed data, the decision tree algorithm is used to construct a decision tree model for grid resource business data. In the process of model construction, appropriate features are selected as nodes of the decision tree, and branches are formed based on feature values to form a hierarchical structure.
The constructed decision tree model, as shown in Fig. 2, clearly illustrates the hierarchical structure of grid resource business data, providing a foundation for subsequent data analysis and processing.
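Conceptually, each node choice in step (2) selects the attribute whose values best separate the records. A toy pure-Python sketch of one such split decision (the Gini impurity measure, attribute names, and labels here are illustrative assumptions, not the paper's exact criterion):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a label multiset (0 = pure, 0.5 = evenly mixed for 2 classes)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(records, attributes, label_key):
    """Pick the attribute whose value-wise partition has the lowest weighted impurity."""
    n = len(records)
    def weighted_impurity(attr):
        groups = {}
        for rec in records:
            groups.setdefault(rec[attr], []).append(rec[label_key])
        return sum(len(g) / n * gini(g) for g in groups.values())
    return min(attributes, key=weighted_impurity)

data = [
    {"ResourceType": "CPU", "ResourceStatus": "busy", "redundant": True},
    {"ResourceType": "CPU", "ResourceStatus": "idle", "redundant": True},
    {"ResourceType": "GPU", "ResourceStatus": "busy", "redundant": False},
    {"ResourceType": "GPU", "ResourceStatus": "idle", "redundant": False},
]
print(best_split(data, ["ResourceType", "ResourceStatus"], "redundant"))  # → ResourceType
```

Here ResourceType separates the labels perfectly (zero impurity in both branches), so it becomes the node's splitting attribute.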
(3) Decision tree pruning and parameter optimization.
A decision tree pruning method based on the Akaike Information Criterion (AIC) is adopted to prune the initial decision tree in order to remove redundant and duplicated branches and improve the classification performance of the model.
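An AIC-guided pruning pass scores each candidate tree by trading fit against size; a minimal sketch of such a score (using the leaf count as the parameter count k and a stand-in log-likelihood is a simplifying assumption, not the paper's exact formulation):

```python
def aic_score(num_leaves, log_likelihood):
    """AIC = 2k - 2 ln L, with the leaf count standing in for the parameter count k."""
    return 2 * num_leaves - 2 * log_likelihood

# Two candidate trees: the pruned one fits slightly worse but wins on AIC,
# so the pruning pass would keep the smaller tree.
full_tree   = aic_score(num_leaves=40, log_likelihood=-120.0)  # 80 + 240 = 320
pruned_tree = aic_score(num_leaves=12, log_likelihood=-130.0)  # 24 + 260 = 284
print(pruned_tree < full_tree)  # → True
```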
An improved ant colony algorithm is used to optimize the pruning parameters of the decision tree model, including maximum tree depth, minimum leaf node sample size, and complexity cost coefficient. By iteratively optimizing the ant colony algorithm, the optimal parameter combination is found to simplify the complexity of the decision tree model while maintaining classification accuracy.
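The parameter search in this step can be pictured as ants repeatedly sampling candidate pruning parameters in proportion to accumulated pheromone, then reinforcing good choices and evaporating the rest. The sketch below searches over maximum tree depth only; the candidate values, quality function, and update rules are illustrative assumptions rather than the paper's exact algorithm:

```python
import random

random.seed(0)

# Candidate maximum tree depths the colony chooses among (illustrative values).
DEPTHS = [5, 6, 7, 8, 9, 10]

def quality(depth):
    """Stand-in for retrieval accuracy; peaks at depth 7, echoing Table 3."""
    return 1.0 - 0.05 * abs(depth - 7)

def ant_colony(iterations=30, ants=10, rho=0.3):
    pheromone = {d: 1.0 for d in DEPTHS}
    for _ in range(iterations):
        deposits = {d: 0.0 for d in DEPTHS}
        for _ in range(ants):
            # Each ant picks a depth with probability proportional to pheromone.
            d = random.choices(DEPTHS, weights=[pheromone[x] for x in DEPTHS])[0]
            deposits[d] += quality(d)           # reinforce good choices
        for d in DEPTHS:
            pheromone[d] = (1 - rho) * pheromone[d] + deposits[d]  # evaporate, then deposit
    return max(pheromone, key=pheromone.get)

print(ant_colony())  # prints the depth with the most pheromone after the search
```

The positive feedback loop (deposit on good depths, evaporate everywhere) is what lets the colony converge on a small-but-accurate tree configuration without exhaustively testing every parameter combination.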
(4) Multi-level redundant data retrieval experiment.
Retrieve multi-level redundant data from the grid resource business center using a pruned decision tree model. During the retrieval process, data records that meet the criteria are filtered layer by layer based on the branching rules of the decision tree.
Compare the accuracy of decision tree pruning based on the Akaike Information Criterion, traditional decision tree pruning based on information gain, and multi-level redundant data retrieval without pruning.
(5) Model stability and performance evaluation.
Evaluate the stability of the decision tree model through cross-validation (CV). As data complexity increases, calculate the CV standard deviation values of decision tree models with different complexity coefficients and compare them with preset thresholds.
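The stability check above reduces to comparing the spread of fold scores against the 0.1 threshold; a minimal sketch (the fold scores are made-up numbers, not the paper's measurements):

```python
import statistics

def cv_is_stable(fold_scores, threshold=0.1):
    """Stable when the standard deviation of CV fold scores stays under the threshold."""
    return statistics.pstdev(fold_scores) < threshold

scores = [0.97, 0.98, 0.96, 0.97, 0.98]   # illustrative 5-fold accuracies
print(round(statistics.pstdev(scores), 4), cv_is_stable(scores))  # → 0.0075 True
```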
Evaluate the performance of the proposed method and the comparative method in multi-level redundant data retrieval using the F-measure metric.
(6) Actual application effect verification.
The Normalized Discounted Cumulative Gain (NDCG) is used as a performance metric to measure the effectiveness of hierarchical redundant data retrieval in grid resource business. Using text and image data as experimental objects, test the NDCG values of the proposed method under different data volumes.
Through the above experimental process, this article verifies the effectiveness of the proposed method in multi-level redundant data retrieval of grid resource business platforms, and evaluates the improvement of data processing performance by the improved decision tree model.
Analysis of experimental results
To verify the effectiveness of the proposed method, an ablation experiment analysis is conducted on the decision tree pruning design based on the Akaike Information Criterion (AIC) in the proposed method. The accuracy of decision tree pruning based on the AIC, traditional decision tree pruning based on information gain, and multi-level redundant data retrieval without pruning are compared, as shown in Table 2 below.
Table 2.
Accuracy results of multi-level redundant data retrieval.
| Method | Multi-level redundant data retrieval accuracy/% | Expected retrieval accuracy/% |
|---|---|---|
| Proposed decision tree pruning based on the Akaike information criterion | 98 | 96 |
| Traditional decision tree pruning based on information gain | 93 | 96 |
| No pruning | 80 | 96 |
The experimental results presented in Table 2 demonstrate three key findings: (1) traditional decision tree pruning using information gain and unpruned multi-level redundant data retrieval both underperform accuracy expectations; (2) information gain-based pruning achieves higher accuracy than unpruned retrieval; and (3) the proposed Akaike Information Criterion (AIC)-based pruning method surpasses both alternatives in retrieval accuracy. This performance advantage stems from fundamental methodological differences: information gain prioritizes features with greater value diversity through entropy reduction, which proves problematic in multi-level redundant data environments where uneven data distribution causes certain redundancy levels to dominate access patterns. Such bias may lead traditional decision trees to select features that appear significant in training but lack practical discriminative power for retrieval tasks. By contrast, the AIC-based approach synergistically combines decision trees' rule-based segmentation with logistic regression's probabilistic evaluation, simultaneously optimizing classification accuracy and model complexity. This dual optimization enables better adaptation to data distribution skewness while selecting truly discriminative features without unnecessary complexity inflation, a capability that information gain methods fundamentally lack. Consequently, the AIC-optimized pruning design significantly enhances decision tree classification performance, delivering the observed accuracy improvements in multi-level redundant data retrieval applications.
The method proposed in this paper uses the ant colony algorithm to improve the pruning of the decision tree model built from the grid resource business middle-platform data. To verify the improvement the proposed method brings to the decision tree algorithm, we take the decision tree model of the grid resource business middle-platform data established in Fig. 2 as the experimental object and apply the proposed pruning process to it; the pruning results are shown in Fig. 3.
Fig. 3.

Pruning Effect of Decision Tree Model for Platform Data in Power Grid Resource Business.
From the analysis of Fig. 3, it can be seen that through the application of the ant colony algorithm, this study has successfully determined the optimal parameters for pruning the data decision tree model in grid resource operations. The pruned decision tree model demonstrates significant reduction in both branch quantity and hierarchical levels, effectively eliminating redundant and duplicate branches. This optimization not only simplifies the model's complexity but also enhances its classification accuracy. When applied to establish the middle-platform data model for grid resource operations, the improved decision tree more clearly reveals the hierarchical structure of middle-platform data, particularly the hierarchical redundant data. This advancement facilitates better understanding of intrinsic patterns and characteristics within grid resource data while providing robust support for subsequent data retrieval and management. Overall, by optimizing the decision tree model, the proposed method enables more precise and efficient analysis and management of middle-platform data in grid resource operations, significantly contributing to enhanced operational management capabilities and efficiency.
To further demonstrate the effectiveness of the improved ant colony algorithm in optimizing the pruning parameters of the decision tree model, we examine the tree depth, the minimum leaf node sample size, and the minimum splitting node sample size before and after optimization, and evaluate the corresponding multi-level redundant data retrieval accuracy. The pre-optimization and post-optimization parameter sets are labelled Group 1 and Group 2, respectively. The specific parameter values and the measured retrieval accuracies are shown in Table 3.
Table 3.
Specific results of parameters before and after optimization.
| Group | Parameter | Value | Decision tree retrieval accuracy/% |
|---|---|---|---|
| Before optimization (Group 1) | Tree depth | 10 | 84 |
| Before optimization (Group 1) | Minimum leaf node sample size | 5 | 84 |
| Before optimization (Group 1) | Minimum splitting node sample size | 10 | 84 |
| After optimization (Group 2) | Tree depth | 7 | 98 |
| After optimization (Group 2) | Minimum leaf node sample size | 15 | 98 |
| After optimization (Group 2) | Minimum splitting node sample size | 20 | 98 |
The results in Table 3 demonstrate that after optimizing the decision tree model’s pruning parameters using the proposed improved ant colony algorithm, the decision-tree-based retrieval accuracy reaches 98%, representing a 14% improvement over the pre-optimization performance. These findings indicate that employing the enhanced ant colony algorithm to optimize decision tree pruning parameters can significantly boost decision tree performance and deliver excellent results.
To further verify the method's ability to establish an improved decision tree model for grid resource business middle-platform data, we test the CV cross-validation standard deviation of the decision tree model established by the method in this study, with the threshold of the CV cross-validation standard deviation set to 0.1. The test results are shown in Table 4.
Table 4.
CV cross-validation results of the decision tree model for power grid resource business middle-platform data.
| Complexity coefficient | CV cross validation standard deviation |
|---|---|
| 0.010 | 0.0398 |
| 0.015 | 0.0399 |
| 0.020 | 0.0401 |
| 0.025 | 0.0402 |
| 0.030 | 0.0405 |
| 0.035 | 0.0408 |
| 0.040 | 0.0412 |
| 0.045 | 0.0419 |
| 0.050 | 0.0421 |
| 0.055 | 0.0428 |
Analysis of Table 4 reveals that as data complexity increases in the grid resource business operations, the standard deviation of cross-validation (CV) scores for the decision tree model constructed by the proposed method exhibits a gradual upward trend. Crucially, this elevation remains within a minimal range, demonstrating the model’s ability to maintain exceptional stability even when processing highly complex datasets. Most significantly, the model’s standard deviation values consistently remain substantially below predefined thresholds, even when handling grid resource platform data with maximum complexity coefficients. These results substantiate three key advantages of the methodology: (1) robust performance in managing complex and dynamic grid resource business data, (2) simultaneous optimization of classification accuracy and model stability, and (3) proven capability for hierarchical redundant data retrieval within grid resource service platforms. The method thereby provides a reliable technical foundation for enhanced data management and analytical operations in power grid resource services.
Using cached operational data generated by the grid resource business center during a specified timeframe as experimental subject, the proposed hierarchical redundant data retrieval method is applied to analyze the cached dataset. The experimental results successfully identify two instances of redundant cached data (thumb_cache_48) within the NotifyIcon folder, demonstrating the method’s capability to accurately detect and retrieve hierarchically structured redundant data through targeted folder analysis. This outcome validates the method’s operational effectiveness for cached data redundancy processing in grid resource services, confirming both its high-precision identification performance and robust hierarchical redundancy detection capacity. The methodology provides substantial technical support for grid resource business data management systems, effectively enhancing data processing efficiency while ensuring operational accuracy through reliable redundant data identification.
Operational images of regional power grid resources, equipment, and related assets are collected as image-type data of the grid resource business middle platform. Using them as experimental objects, hierarchical redundant data are retrieved from this image-type data with the method proposed in this paper. The retrieval results are shown in Fig. 4.
Fig. 4.
Image-class hierarchical redundant data retrieval results.
Analysis of Fig. 4 reveals duplicate image instances (2–3 identical copies) within the grid resource service center's image data repository. The proposed methodology addresses this challenge effectively: it accurately identifies and tags the hierarchically redundant image copies and achieves 100% retrieval precision. These results validate the method's robust applicability for image-class hierarchical redundant data retrieval in power grid resource business platforms. The solution exhibits dual advantages of superior algorithmic accuracy and operational practicality, delivering critical support for grid resource data management through enhanced data processing efficiency, improved data operation accuracy, and facilitated optimization of grid resource business operations.
In multi-level redundant data retrieval, precision and recall are two key performance metrics. Precision measures the proportion of truly relevant data among retrieved results, while recall measures the proportion of all relevant data successfully retrieved. The F-measure balances these two metrics through harmonic averaging, ensuring unbiased evaluation of retrieval methods. To validate the retrieval performance of the proposed method, comparative tests were conducted using the F-measure metric, with methods from references4–8 serving as benchmarks. The F-measure values are presented in Fig. 5.
Fig. 5.
Comparison of F1 Results.
Figure 5 demonstrates that the proposed method consistently achieves higher F1-scores than those reported in references4–8. The proposed method attains a maximum F1-score of 0.99 with stable performance, whereas the highest F1-scores from references4–8 reach only 0.75, 0.82, 0.78, 0.75, and 0.63 respectively. While these comparative methods occasionally achieve satisfactory F1-scores, they exhibit significant performance fluctuations when processing multi-level redundant data. In contrast, the proposed method maintains stable performance and superior F1-scores, enabling more accurate retrieval of multi-level redundant data.
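The F1-scores compared above follow directly from the harmonic-mean definition given earlier; a quick worked sketch (the second precision/recall pair is an invented example to show how imbalance is punished):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; punishes imbalance between the two."""
    return 2 * precision * recall / (precision + recall)

# Balanced scores near the paper's peak value versus an imbalanced pair:
print(round(f1(0.99, 0.99), 2))  # → 0.99
print(round(f1(1.00, 0.50), 2))  # → 0.67
```

Even perfect precision cannot compensate for poor recall, which is why the harmonic mean gives an unbiased view of retrieval quality.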
To further validate the practical application effectiveness of the proposed method, we employ the Normalized Discounted Cumulative Gain (NDCG) as the performance metric to evaluate hierarchical redundant data retrieval for power grid resource operations. Using both textual and image data as experimental subjects, we assess the method's retrieval capability for hierarchically redundant data in power grid resource management. With the NDCG threshold set at 0.85, the test results are presented in Fig. 6.
Fig. 6.

NDCG curve for hierarchical redundant data retrieval in power grid resource business.
As shown in Fig. 6, as the data volume grows, the NDCG values of the proposed method gradually decrease when retrieving hierarchically redundant data across different types of grid resource services. When processing 2000 middle-platform data entries in grid resource services, the NDCG value for retrieving image-type hierarchical redundant data drops to its lowest point at approximately 0.88. At this stage, the NDCG value for text-type hierarchical redundant data retrieval remains higher than that for image-type retrieval. Notably, the NDCG values for middle-platform hierarchical redundant data retrieval across different grid resource service types consistently exceed the predefined threshold. These results demonstrate that the proposed method maintains high retrieval accuracy for hierarchical redundant data in power grid resource services regardless of data type, with consistently strong NDCG values indicating robust retrieval capability and excellent practical applicability.
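For reference, NDCG discounts each item's relevance gain by its rank position and normalizes by the ideal ordering; a minimal sketch with assumed relevance grades (the ranked list below is invented for illustration):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: earlier rank positions count more."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

ranked = [3, 2, 3, 0, 1]          # illustrative graded relevance of retrieved items
print(round(ndcg(ranked), 3))     # → 0.972
```

A value near 1 means the retrieval ranking is close to ideal, which is how the 0.85 threshold in Fig. 6 should be read.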
To evaluate the applicability of the proposed method for retrieving multi-level redundant data in power grid resource services, we measure the retrieval time across different volumes of grid resource data. To ensure the system can respond promptly and deliver required data in most real-time operational scenarios while meeting business real-time requirements, we set the retrieval time threshold at 6 s. The test results are presented in Fig. 7.
Fig. 7.
Retrieve applicability test results.
Figure 7 demonstrates that the proposed method's retrieval time varies with search volume. When processing 5,000, 10,000, 15,000, 20,000, 25,000, and 30,000 search entries, the method achieves retrieval times of 1.0 s, 1.2 s, 1.8 s, 2.2 s, 3.1 s, and 4.5 s respectively. All these values remain below the predefined 6 s threshold; even the longest retrieval time of 4.5 s leaves a 1.5 s margin. In comparison, the fastest retrieval times from references4–8 are 2.0 s, 2.8 s, 2.7 s, 2.9 s, and 3.0 s respectively. The results clearly show that as search volume increases, the proposed method maintains consistently lower retrieval times than the comparative approaches.
Conclusion
The research and application of the rapid retrieval method for multi-level redundant data in power grid resource operations, based on an improved decision tree algorithm, significantly enhances both the efficiency of grid resource data management and the accuracy and reliability of data processing. By introducing an improved decision tree algorithm, the performance problems faced by traditional retrieval methods in handling large-scale and multi-level redundant data, such as low accuracy and long retrieval time, have been solved. In practice, the proposed method demonstrates several significant advantages. The improved decision tree algorithm effectively enhances multi-level redundant data retrieval accuracy by 14%, achieving a peak F1-score of 0.99. With a retrieval time of just 4.5 s, which is 1.5 s below the predefined threshold, the method demonstrates excellent application performance. Its efficient retrieval capability enables rapid response to various operational requirements in grid resource management, providing robust support for secure and stable grid operations. Furthermore, by minimizing redundant data interference, the method significantly improves data quality, establishing a solid foundation for intelligent grid management. As grid systems continue expanding and operational requirements grow increasingly complex, demands for efficient retrieval methods will intensify. While the current method may face computational complexity challenges with large-scale datasets, future work will explore integrating additional machine learning techniques to further refine the decision tree algorithm and enhance overall performance. Continuous optimization of the grid resource management center’s data retrieval mechanisms will ensure compatibility with future grid development needs. 
In conclusion, the improved decision tree-based rapid retrieval method for multi-level redundant data in grid resource operations represents an important research achievement that actively contributes to advancing intelligent grid management capabilities.
Author contributions
Wei Sun: Writing – Original Draft Preparation, Writing – Review and Editing, Investigation, Conceptualization, Supervision, Project Administration, Formal Analysis. Hui Liu: Data Curation, Writing – Original Draft Preparation, Visualization, Writing – Review and Editing, Software, Resources. Yu Wang: Conceptualization, Writing – Review and Editing, Writing – Original Draft Preparation, Data Curation, Formal Analysis. Weihao Shi: Writing – Review and Editing, Writing – Original Draft Preparation, Investigation, Visualization. Xiao Wang: Methodology, Formal Analysis, Writing – Review and Editing, Writing – Original Draft Preparation. Zhiwei Zou: Writing – Review and Editing, Writing – Original Draft Preparation, Data Curation.
Data availability
Data is provided within the manuscript.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Hassani, R., Mahseredjian, J., Tshibungu, T. & Karaagac, U. Evaluation of time-domain and phasor-domain methods for power system transients. Electr. Power Syst. Res. 212, 1–8. 10.1016/j.epsr.2022.108335 (2022).
- 2. Qamar, S., Azeem, A., Alam, T. & Ahmad, I. A crow search algorithm integrated with dynamic awareness probability for cellular network cost management. J. Supercomputing 78(17), 19046–19069. 10.1007/s11227-022-04623-z (2022).
- 3. Hasanien, H. M. et al. Precise modeling of PEM fuel cell using a novel enhanced transient search optimization algorithm. Energy 247, 123530 (2022).
- 4. Raji, L. & Ramya, S. T. Secure forensic data transmission system in cloud database using fuzzy based butterfly optimization and modified ECC. Trans. Emerg. Telecommun. Technol. 33(9), 4558–4576. 10.1002/ett.4558 (2022).
- 5. Spea, S. R. Social network search algorithm for combined heat and power economic dispatch. Electr. Power Syst. Res. 221, 109400. 10.1016/j.epsr.2023.109400 (2023).
- 6. Seo, Y., Kang, Y. & Heo, J. Quantum search algorithm for weighted solutions. IEEE Access 10, 16209–16224. 10.1109/ACCESS.2022.3149351 (2022).
- 7. Olaide, O., Ezugwu, E. S., Mohamed, T. I. A. & Abualigah, L. Ebola optimization search algorithm: a new nature-inspired metaheuristic optimization algorithm. IEEE Access 10, 1–38. 10.1109/ACCESS.2022.3147821 (2022).
- 8. Lo, P. C. & Lim, E. P. Contextual path retrieval: a contextual entity relation embedding-based approach. ACM Trans. Inform. Syst. 41(1). 10.1145/3502720 (2023).
- 9. Olaniran, O. R. & Olaniran, S. F. A. A novel variable selection procedure for binary logistic regression using Akaike information criteria testing: an example in breast cancer prediction (validity study). Turkiye Klinikleri J. Biostatistics 15(2). 10.5336/biostatic.2023-97597 (2023).
- 10. Abualigah, L., Elaziz, M. A., Sumari, P., Geem, Z. W. & Gandomi, A. H. Reptile search algorithm (RSA): a nature-inspired meta-heuristic optimizer. Expert Syst. Appl. 191, 116158. 10.1016/j.eswa.2021.116158 (2022).
- 11. Damiani, C., Rodina, Y. & Decherchi, S. A hybrid federated kernel regularized least squares algorithm. Knowledge-Based Systems 305, 112600. 10.1016/j.knosys.2024.112600 (2024).
- 12. Cintuglu, M. H. & Ishchenko, D. Real-time asynchronous information processing in distributed power systems control. IEEE Trans. Smart Grid 13(1), 773–782. 10.1109/TSG.2021.3113174 (2022).
- 13. Snaiki, R. & Parida, S. S. A data-driven physics-informed stochastic framework for hurricane-induced risk estimation of transmission tower-line systems under a changing climate. Eng. Struct. 280, 115673. 10.1016/j.engstruct.2023.115673 (2023).
- 14. Chatzimparmpas, A., Martins, R. M. & Kerren, A. VisRuler: visual analytics for extracting decision rules from bagged and boosted decision trees. Inform. Visualization 22(2), 115–139. 10.48550/arXiv.2112.00334 (2023).
- 15. Shao, Y., Deng, X. & Feng, L. Path planning of ant colony algorithm based on decision tree in the context of COVID-19. Wireless Commun. Mobile Comput. 2023(1), 8984451. 10.1155/2023/8984451 (2023).
- 16. Ramakrishnan, J. et al. A decision tree-based modeling approach for evaluating the green performance of airport buildings. Environ. Impact Assess. Rev. 100, 107070. 10.1016/j.eiar.2023.107070 (2023).
- 17. Mahawar, K. et al. Employing artificial bee and ant colony optimization in machine learning techniques as a cognitive neuroscience tool. Sci. Rep. 15(1). 10.1038/s41598-025-94642-6 (2025).
- 18. Yin, C. et al. An optimized resource scheduling algorithm based on GA and ACO algorithm in fog computing. J. Supercomputing 80(3). 10.1007/s11227-023-05571-y (2024).
- 19. Hang, P. et al. Research on global path planning of intelligent vehicles based on improved ant colony algorithm. J. Phys. Conf. Ser. 2674(1), 012027. 10.1088/1742-6596/2674/1/012027 (2023).
- 20. Fazzinga, B., Flesca, S., Furfaro, F. & Pontieri, L. Process mining meets argumentation: explainable interpretations of low-level event logs via abstract argumentation. Inform. Syst. 107, 101987. 10.1016/J.IS.2022.101987 (2022).
- 21. Lemus Cardenas, L., Leon, A. & Mezher, A. M. GraTree: a gradient boosting decision tree based multimetric routing protocol for vehicular ad hoc networks. Ad Hoc Netw. 137, 102995. 10.1016/j.adhoc.2022.102995 (2022).
- 22. Jin, S. M., Wang, Y. H. & Li, Y. A study on the application of multilevel secure hash function in blockchain. Proc. SPIE 12605. 10.1117/12.2673384 (2023).