Abstract
Vehicular Edge Cloud Computing (VECC) has emerged as a promising paradigm to support delay-sensitive and computation-intensive applications in Intelligent Transportation Systems (ITS). However, dynamic traffic patterns, fluctuating network conditions, and uncertain resource availability often result in high task latency and service failures. To address these challenges, this paper proposes a bi-level Deep Q-Network (DQN)-based mobility-aware framework for fault-tolerant task offloading in VECC environments. Unlike existing approaches that offload tasks solely to the receiving Roadside Unit (RSU), the proposed framework introduces a level-1 DQN agent that performs high-level scheduling by selecting the most suitable RSU for task execution based on its workload, network latency, and failure rate. In parallel, level-2 DQN agents at each RSU handle low-level decisions, including task allocation and failure-recovery strategy selection, choosing among First Result, Recovery Block, or Retry mechanisms. To eliminate centralized dependency, the level-1 DQN is replicated across RSUs at the edge layer, ensuring high accessibility and resilience for distributed scheduling. Extensive simulations conducted using an integrated SimPy/SUMO environment demonstrate that, under heavy and imbalanced traffic, the proposed bi-level DQN improves the total reward by 7.7% to 37.8% and reduces the task failure rate by 29% to 63% relative to bi-level PPO, Greedy, and No-Forwarding baselines, based on averages over the final 40 training episodes.
Keywords: Vehicular edge cloud computing, Task offloading, Deep reinforcement learning, Deep Q-Network, Fault tolerance, Mobility-aware scheduling
Subject terms: Engineering, Mathematics and computing
Introduction
The rapid advancement of Intelligent Transportation Systems (ITS) and the growing demand for real-time vehicular services have considerably increased the computational workload on connected vehicles. Modern vehicles now handle a variety of computation-intensive and latency-sensitive tasks, including perception, target detection, navigation, and Augmented Reality (AR)-based driving assistance1. Similar requirements for real-time processing of complex, multi-sensor data also arise in domains such as sports biomechanics, where AI-driven wearable technologies and motion analysis are used to monitor and interpret athlete movement. In such systems, continuous sensor streaming and real-time inference impose latency and reliability constraints comparable to those in vehicular perception tasks2. In addition, the large-scale deployment of self-driving and AI-enabled vehicles is expected to generate massive data streams that require real-time analysis and decision-making3. Vehicular Edge Cloud Computing (VECC) has therefore emerged as a hybrid computing paradigm that merges the low-latency benefits of Vehicular Edge Computing (VEC) with the extensive computational capacity of Vehicular Cloud Computing (VCC). It provides a unified platform for delay-sensitive and computation-intensive vehicular applications4. Within VECC, Roadside Units (RSUs) act as distributed edge nodes that dynamically interact with centralized cloud servers to enable seamless task offloading, efficient resource allocation, and uninterrupted service delivery. However, the high mobility of vehicles, frequent handovers between RSUs, fluctuating communication quality, and constrained edge resources present significant challenges to maintaining reliable task execution and consistent Quality of Service (QoS) in such dynamic environments.
Mobility is a key feature of vehicular networks that directly influences the reliability and efficiency of task offloading. Rapid fluctuations in vehicle speed or trajectory can cause unstable connections, disrupted task execution, and unpredictable latency3. To address these issues, several mobility-aware approaches have been proposed, such as trajectory prediction–based scheduling5, deadline-aware offloading6, mobility-assisted resource optimization7, and delay-minimization models for task migration and routing4,8–10. Nevertheless, most of these studies overlook the impact of transient or permanent failures that can occur during task execution or transmission, largely due to the expanded state-space and modeling complexity introduced by failure-aware system dynamics. In real VECC environments, failures may stem from unreliable wireless links, hardware faults, or temporary overloads on edge nodes. When combined with the effects of vehicle mobility, such failures can significantly degrade task completion rates and QoS performance. Therefore, designing a mobility- and fault-aware offloading mechanism is essential to ensure resilient and low-latency vehicular services in practical VECC systems.
Traditional optimization-based methods, including heuristic and meta-heuristic algorithms such as Genetic Algorithms (GA), Particle Swarm Optimization (PSO), and combinatorial Multi-Armed Bandit (CMAB) models4,9,10, have been widely used to address task offloading challenges. While these approaches can provide near-optimal solutions under static or predictable conditions, they often lack the adaptability needed for real-time decision-making in fast-changing vehicular scenarios. In contrast, Deep Reinforcement Learning (DRL)–based methods, particularly those employing the Deep Q-Network (DQN) and its variants such as Rainbow DQN6, have recently shown remarkable adaptability and scalability in dynamic vehicular edge–cloud environments. Other DRL architectures, such as actor–critic models like Deep Deterministic Policy Gradient (DDPG)11, have demonstrated strong capabilities in managing fault-tolerant and delay-sensitive task offloading. Unlike conventional optimization techniques, DRL agents can learn optimal policies through direct interaction with the environment, eliminating the need for explicit system modeling or predefined mobility assumptions.
Recent analytical studies on machine learning techniques and emerging artificial intelligence trends highlight the capability of learning-based models to process large-scale data and adapt to uncertain environments12,13. These capabilities make deep learning and reinforcement learning particularly suitable for complex decision-making problems such as resource management and task offloading in dynamic vehicular edge–cloud environments.
Among learning-based approaches, reinforcement learning has emerged as an effective technique for adaptive decision-making in dynamic edge computing environments characterized by mobility, network fluctuations, and varying workloads. Several studies have demonstrated that deep reinforcement learning can learn effective task-offloading policies in vehicular edge computing systems under such dynamic conditions14–16.
From a broader artificial intelligence perspective, recent studies identify deep reinforcement learning as a core paradigm for decision-making in complex and dynamic environments, where explicit system modeling and static optimization become impractical17,18. In parallel, advances in multi-agent reinforcement learning emphasize decentralized and coordinated learning architectures as key enablers for scalability and robustness in large-scale systems19,20. Empirical successes in complex multi-agent environments further demonstrate the practical effectiveness of such learning-based approaches21. These trends are particularly relevant to vehicular edge–cloud computing, which inherently involves mobility, uncertainty, and distributed decision-making.
Compared with actor–critic models such as DDPG, which are better suited to continuous action spaces22, DQN-based algorithms are particularly suited to combinatorial offloading and resource allocation problems. In such problems, decisions like RSU selection, server assignment, and failure recovery can be represented as discrete actions.
Despite these advancements, several key research gaps remain. Most mobility-aware frameworks focus on either delay or energy optimization, without jointly addressing fault tolerance and recovery in multi-RSU VECC environments1,3–5,10,23. Similarly, most fault-tolerant models that employ redundancy or replication mechanisms are designed for static or low-mobility conditions, overlooking dynamic vehicle trajectories and real-time deadline constraints24–26. Although some recent efforts, such as27, have incorporated mobility awareness into fault-tolerant scheduling, they still fall short in optimizing recovery processes under highly dynamic multi-RSU VECC settings. Furthermore, the decision-making process in most existing DRL-based offloading frameworks remains centralized, which limits scalability and responsiveness under heavy traffic loads28.
To overcome these challenges, this paper introduces a mobility-aware and fault-tolerant hierarchical DQN framework for task offloading in VECC environments. The proposed architecture employs a bi-level decision-making structure, in which a level-1 DQN agent selects the optimal RSU for each incoming vehicular task based on network-wide mobility, load, and reliability conditions. In turn, a level-2 DQN agent deployed at each RSU determines the Task Execution Plan (TEP), which includes selecting primary and backup servers and defining the appropriate recovery strategy. This hierarchical design not only improves scalability but also reduces change propagation due to the increased decoupling between decision-making layers.
By jointly considering vehicle mobility, network dynamics, and node reliability, the proposed approach ensures resilient and continuous task execution under uncertain vehicular conditions. The major contributions of this work are summarized as follows:
A novel VECC architecture is proposed that integrates mobility awareness and fault-tolerant mechanisms through a hierarchical learning design. The system dynamically adapts to vehicular movement, link variations, and node failures, ensuring reliable task execution under changing network conditions.
A bi-level DQN structure is developed, where the level-1 agent handles RSU selection using network-wide insights, while level-2 agents at RSUs manage task allocation and recovery. The best recovery strategy is selected by the model based on task characteristics, RSU conditions, and failure probabilities, achieving a balanced trade-off between latency and reliability.
An analytical system model is formulated that captures both computational and communication delays in multi-RSU environments, incorporating mobility-induced latency and link-failure probabilities.
A comprehensive Python-based simulation framework integrating SUMO29 and SimPy30 is developed to evaluate the proposed method under various mobility and traffic conditions.
To support reproducibility, the implementation details and experimental resources are publicly available (see Data Availability Statement). The remainder of this paper is organized as follows. Section 2 provides the background on failure recovery patterns and the DQN-based DRL model. Section 3 reviews related work on task offloading and reliability management across edge–cloud and vehicular environments. Section 4 presents the proposed Mobility-Aware Fault-Tolerant VECC architecture, describing its core components and operational workflow. Section 5 formulates the system model and defines the optimization problem. Section 6 details the bi-level DQN-based offloading model. Section 7 presents the experimental setup, evaluation metrics, and performance results, and Sect. 8 concludes the paper.
Background
This section provides an overview of the fundamental mechanisms underpinning the proposed fault-tolerant task offloading model. It first introduces essential failure recovery patterns employed in resilient vehicular systems, followed by a detailed presentation of the DRL formulation and its realization using the DQN algorithm. The integration of these two concepts allows the proposed framework to dynamically adapt to uncertain vehicular environments while maintaining robustness against execution failures.
Failure recovery patterns
Failure recovery mechanisms play a critical role in ensuring the reliability and resilience of task execution in distributed and fault-prone systems. Various well-established recovery patterns have been proposed in the literature, such as ignore, notify, skip, voting, rollback, retry, recovery block, and first result31. Each pattern defines a distinct strategy for handling failures depending on system requirements and failure characteristics. This study focuses on three representative patterns for offloaded tasks in VECC environments as explained below:
Retry (RT) Pattern: The Retry pattern represents a sequential recovery mechanism in which a failed task is re-executed on the same server that encountered the failure. This approach assumes that the error was transient and that reattempting the task under the same conditions may lead to successful completion.
Recovery Block (RB) Pattern: The Recovery Block pattern also operates sequentially but increases reliability by launching a backup version of the failed task on a different server. By diversifying the execution environment, this method mitigates risks associated with permanent or node-specific failures. It is particularly effective in vehicular systems where connectivity fluctuations or hardware instability are common.
First Result (FR) Pattern: The First Result pattern adopts a parallel recovery strategy in which both primary and backup replicas of a task are executed concurrently on different servers. The first successfully completed result is then accepted while the redundant task is terminated. This approach significantly reduces response time and increases reliability at the cost of additional computational resources.
The primary and backup tasks mentioned in these recovery models represent the main and fallback execution instances of an offloaded job. The selection among RT, RB, and FR patterns enables flexible adaptation to varying network dynamics, workload intensities, and reliability constraints.
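To make the three patterns concrete, the sketch below simulates each one with SimPy, the discrete-event library used later in the evaluation. It is an illustration only: the server identifiers, service times, and failure probabilities are made-up example values, not parameters from this paper.

```python
import random
import simpy

# Illustrative SimPy sketch of the RT, RB, and FR recovery patterns.
def execute(env, server_id, service_time, fail_prob):
    """One execution attempt; succeeds with probability 1 - fail_prob."""
    yield env.timeout(service_time)
    return random.random() > fail_prob

def run_with_recovery(env, pattern, results):
    primary, backup = ("srv-0", 1.0, 0.3), ("srv-1", 1.2, 0.3)
    if pattern == "FR":
        # Parallel: launch both replicas, accept the first finished result
        # (a fuller model would fall back to the slower replica on failure).
        p1 = env.process(execute(env, *primary))
        p2 = env.process(execute(env, *backup))
        yield p1 | p2
        finished = [p for p in (p1, p2) if p.triggered]
        ok = any(p.value for p in finished)
    else:
        # Sequential: run the primary, then one recovery attempt on failure.
        ok = yield env.process(execute(env, *primary))
        if not ok:
            retry_on = primary if pattern == "RT" else backup  # RT reuses the same server
            ok = yield env.process(execute(env, *retry_on))
    results[pattern] = (ok, env.now)

env = simpy.Environment()
results = {}
for pat in ("RT", "RB", "FR"):
    env.process(run_with_recovery(env, pat, results))
env.run()
print(results)  # e.g. {'FR': (True, 1.0), 'RT': (True, 2.0), 'RB': (True, 2.2)}
```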
Deep reinforcement learning model
The decision-making process for task offloading and failure recovery can be formulated as a Markov Decision Process (MDP) approximation, defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ is the action space, $\mathcal{P}$ represents the state transition probability, $r$ is the immediate reward, and $\gamma \in [0,1)$ is the discount factor. The agent's objective is to derive an optimal policy $\pi^{*}$ that maximizes the expected cumulative discounted reward32:

$$\pi^{*} = \arg\max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t}\Big] \tag{1}$$
The action-value function $Q^{\pi}(s,a)$ measures the expected return from performing action $a$ in state $s$ and following policy $\pi$ thereafter. The optimal action-value function satisfies the Bellman equation32:

$$Q^{*}(s,a) = \mathbb{E}_{s'}\Big[\, r + \gamma \max_{a'} Q^{*}(s',a') \;\Big|\; s, a \Big] \tag{2}$$
In DQN, $Q^{*}(s,a)$ is approximated by a neural network parameterized by $\theta$. The parameters are optimized by minimizing the temporal difference (TD) loss33:

$$L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[\big(y - Q(s,a;\theta)\big)^{2}\Big] \tag{3}$$

where $\mathcal{D}$ denotes the experience replay buffer and $y$ is the target value computed as33

$$y = r + \gamma \max_{a'} Q(s',a';\theta') \tag{4}$$
In Eq. (4), the target value $y$ is computed using a separate target network with parameters $\theta'$. This helps stabilize learning by preventing the rapid oscillations that would occur if the network were chasing its own constantly changing estimates. While the original DQN used periodic hard updates for $\theta'$, in this work a soft update is applied in the implementation to enhance stability22:

$$\theta' \leftarrow \tau\,\theta + (1 - \tau)\,\theta' \tag{5}$$
where $\tau \in (0,1]$ controls the synchronization rate between the two networks. The DQN agent stores past transitions $(s, a, r, s')$ in a replay buffer and samples mini-batches uniformly to update the network, which helps to break temporal correlations and stabilize training33. For action selection, the agent primarily follows the ε-greedy strategy, where it chooses a random action with probability ε or the action with the highest Q-value otherwise33. Additionally, a SoftMax (Boltzmann) policy can optionally be applied to enhance exploration diversity32. In this policy, the probability of selecting an action is proportional to the exponential of its Q-value:

$$P(a \mid s) = \frac{\exp\big(Q(s,a)/T\big)}{\sum_{a' \in \mathcal{A}} \exp\big(Q(s,a')/T\big)} \tag{6}$$

where $T$ is a temperature parameter controlling exploration intensity.
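For concreteness, the following minimal PyTorch sketch implements the update rule of Eqs. (3)-(5) and the two action-selection policies of Eq. (6). The network architecture and all hyperparameter values are illustrative assumptions, not the configuration tuned in this paper.

```python
import random
import torch
import torch.nn as nn

# Minimal DQN components illustrating Eqs. (3)-(6); sizes and constants
# are example values only.
class QNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def select_action(q_net, state, eps=0.1, temperature=None):
    """Epsilon-greedy by default; Boltzmann sampling if a temperature is given."""
    with torch.no_grad():
        q = q_net(state.unsqueeze(0)).squeeze(0)
    if temperature is not None:                       # SoftMax policy, Eq. (6)
        probs = torch.softmax(q / temperature, dim=0)
        return torch.multinomial(probs, 1).item()
    if random.random() < eps:                         # exploration
        return random.randrange(q.numel())
    return q.argmax().item()                          # exploitation

def td_update(q_net, target_net, optimizer, batch, gamma=0.99, tau=0.005):
    s, a, r, s2, done = batch                         # mini-batch sampled from D
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                             # target y, Eq. (4)
        y = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, y)            # TD loss, Eq. (3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    for p, p_t in zip(q_net.parameters(), target_net.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)     # soft update, Eq. (5)
    return loss.item()
```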
Related works
VECC has recently gained significant attention as a hybrid computing paradigm that seamlessly integrates VEC and VCC. By leveraging both edge and cloud resources, VECC enables efficient task offloading and cooperative computation to support latency-sensitive and computation-intensive vehicular applications. In this architecture, RSUs function as distributed edge nodes that interact dynamically with centralized cloud servers to optimize resource utilization and service continuity. Unlike general Mobile Edge-Cloud Computing (MECC) or Mobile Cloud Computing (MCC) frameworks, VECC specifically addresses the challenges of highly dynamic vehicular environments, characterized by rapid mobility, intermittent connectivity, and frequent handovers among RSUs. These factors complicate task migration, resource allocation, and deadline guarantees, demanding intelligent and adaptive decision-making mechanisms. Consequently, numerous studies have investigated task offloading and resource management across diverse vehicular computing paradigms such as VCC, VEC, Vehicular Fog-Cloud Computing (VFCC), MCC, and VECC, with particular emphasis on mobility awareness, deadline and QoS assurance, and fault-tolerant execution in highly dynamic vehicular environments.
Mobility-aware task offloading strategies incorporate vehicle movement patterns and predicted trajectories into decision-making to improve task completion rates and reduce latency. Chen et al.34 proposed the MAVTO framework, which integrates trajectory prediction with multiple offloading modes, including direct, predictive, and hybrid, within a UAV-assisted VEC environment. Jahandar et al.35 developed a mobility-aware offloading strategy for multi-access edge computing that explicitly considers handover costs and connection variability caused by vehicle movements. Men et al.7 introduced a parallel scheduling mechanism over multiple time slots to adapt task distribution to dynamic vehicular conditions, while Shen et al.8 utilized a mean-field reinforcement learning framework that enables decentralized agents to implicitly adapt to mobility dynamics. Zhiwei et al.36 presented an energy-efficient mobility-aware offloading method for large-scale workflow applications in MEC. They formulated the offloading process as an optimization problem aimed at minimizing the utility cost, which balances energy consumption against total execution time. A recent study by Zeng et al.5 introduced a trajectory prediction–based offloading scheme combining Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models to predict vehicle movement between base stations, enabling large-volume tasks to be efficiently offloaded to edge servers despite short communication durations. Similarly, Ling et al.1 proposed a mobility prediction–based vehicular offloading framework that employs Estimated Time of Arrival (ETA) services and dynamic programming to allocate resources and balance QoS and response fairness among cooperative RSUs, validated through realistic Veins-based simulations.
Deadline- and QoS-constrained offloading approaches have been investigated to guarantee timely execution of tasks while optimizing system performance. Oza et al.23 exploited traffic light data to estimate vehicle dwell times and opportunistically offload tasks to RSUs, achieving higher completion rates without violating deadlines. Farimani et al.6 employed the Rainbow DQN deep reinforcement learning approach to optimize the trade-off between latency reduction and resource utilization under varying vehicle speeds. Da Costa et al.3 proposed MARINA, which integrates mobility prediction with deadline-aware scheduling in vehicular cloud computing. Earlier work by Jiang et al.37 and Sun et al.9 explored replication-based strategies to reduce deadline violation probability in mobility-uncertain environments. Wu et al.38 and Zhou et al.10 focused on delay-sensitive task allocation within vehicular fog networks to minimize latency through optimized task placement. Materwala et al.4 proposed a QoS-SLA-aware adaptive genetic algorithm (QoS-SLA-AGA) for multi-request offloading in heterogeneous edge-cloud systems. Their model explicitly incorporates Service-Level Agreement (SLA) constraints, including latency, processing time, CPU, and memory requirements, through an adaptive penalty function, ensuring that overlapping multi-request execution adheres to QoS targets even under dynamic vehicle speed and workload variations.
Fault-tolerant task offloading strategies aim to preserve system reliability under dynamic conditions prone to failures and disconnections. Wang et al.27 integrated mobility prediction with a fault-tolerant decision-making mechanism based on Markov Chain and Genetic Algorithm principles to mitigate potential handover failures. Umer et al.26 introduced FP-TOSM, a priority-based fault-tolerant scheduling model designed for vehicular and IoT contexts, which prioritizes task execution based on reliability and urgency, although it does not explicitly consider mobility dynamics. Syed et al.24 combined QoS awareness with fault tolerance within software-defined vehicular fog-cloud networks to achieve resilience through dynamic reallocation of failed tasks. Farimani et al.6 and Chen et al.34 also considered reliability factors in their models through stability-aware scheduling and adaptive multi-mode offloading, but they did not include explicit fault recovery mechanisms. Similarly, Umer et al.25 proposed a multi-objective Analytic Hierarchy Process (AHP)-based offloading and scheduling framework for IoT logistics that jointly optimizes latency, energy consumption, and reliability, incorporating a fault management unit for task reassignment upon failures. Nonetheless, their model assumes a static network topology without accounting for vehicular mobility or RSU dynamics. Our previous work11 presented a deep reinforcement learning based strategy selection framework that enhanced fault tolerance and reliability for delay-sensitive applications; however, it did not explicitly address coordinated task management across multiple RSUs or integrate mobility and deadline constraints simultaneously.
A comparative summary of the key offloading approaches, including their mobility-awareness, QoS features, fault tolerance, and recovery models, is provided in Table 1.
Table 1.
Comparative summary of key offloading approaches.
| Ref. | Architecture | Algorithm/Method | Mobility-Aware | QoS: Deadline | QoS: Latency | QoS: Energy/Cost | QoS: Priority-aware | FT: Reliability | FT: Failure Model | FT: Recovery Model | FT: Strategy Suggestion |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | VEC | Heuristic | Yes | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ |
| 6 | VEC | Rainbow DQN | Yes | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
| 34 | UAV-VEC | Hybrid Heuristic | Yes | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ |
| 7 | VEC | Optimization | Yes | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| 8 | VEC | RL (Mean-field) | Yes | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| 26 | VECC | AHP+ Priority-based Scheduling | No | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 24 | SDN-VFCC | Heuristic | No | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ | ✓ |
| 27 | MCC | Markov Chain-GA | Yes | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ |
| 3 | VECC | Heuristic | Yes | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| 35 | MEC | Heuristic | Yes | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| 36 | MEC | Heuristic-GA | Yes | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| 25 | VFCC | AHP | No | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 11 | MECC | DRL-DDPG | No | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
| 38 | VFC | SMDP (Semi-Markov Decision Process) | No | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| 10 | VFC | Optimization | Yes | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| 37 | VCC | MDP/Replication | Yes | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| 9 | VEC | CMAB | Yes | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| 4 | VECC | GA | Yes | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| 5 | MEC | LSTM, CNN, DDPG | Yes | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| 1 | MEC | DP (Dynamic Programming) | Yes | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Proposed Method | VECC | DRL-Bi-level DQN | Yes | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
Despite notable progress in mobility-aware, deadline-aware, and fault-tolerant task offloading, several limitations remain. As summarized in Table 1, most existing approaches address these factors in isolation rather than within an integrated framework. In contrast, the proposed framework is the only solution that jointly considers the key dimensions of fault-tolerant task offloading (failure modeling, recovery modeling, and strategy suggestion) together with vehicle mobility. Moreover, although many studies incorporate the RSU layer in task offloading, the specific problem of RSU selection under mobility-aware fault recovery has received limited attention, despite its importance for large-scale deployments.
To address these gaps, the proposed framework introduces a bi-level decision structure. At the first level, an RSU is selected for task execution; at the second level, a server and an appropriate fault-recovery strategy are chosen within the selected RSU. This hierarchical design enables robust and efficient task offloading. While Hierarchical Deep Reinforcement Learning (HDRL) has been explored in prior works to improve learning performance in complex optimization settings and to enhance scalability in large-scale VECC environments through modular and decoupled architectures39–41, these studies do not address the joint problem of mobility-aware RSU selection and adaptive fault-recovery strategy learning within a unified VECC framework. To further clarify the contribution of this study, it should be emphasized that the proposed method does not merely apply deep reinforcement learning to vehicular task offloading. Rather, it formulates mobility-aware RSU selection and fault-recovery strategy selection as a coordinated bi-level decision process. In contrast to flat DRL formulations that jointly optimize RSU choice, server allocation, and recovery configuration within a single enlarged action space, the hierarchical decomposition adopted in this work separates global inter-RSU scheduling from local execution planning. This separation reduces action-space growth and enhances scalability, particularly in large-scale VECC environments characterized by frequent mobility and dynamic failure conditions.
More broadly, recent artificial intelligence research trends increasingly focus on hierarchical and multi-agent reinforcement learning as effective solutions for scalable and robust decision-making in large-scale and dynamic systems17–20. Recent surveys and theoretical analyses highlight that decentralized learning architectures are essential when centralized control becomes infeasible due to system scale and non-stationarity, while empirical studies further confirm their effectiveness in practice21. These trends provide additional motivation for adopting hierarchical learning structures in VECC environments characterized by mobility and frequent network dynamics.
Although multi-agent reinforcement learning has been widely adopted for distributed decision-making in large-scale systems, the proposed framework follows a different design principle. In typical MARL systems, multiple heterogeneous agents interact within a shared environment and may exhibit competitive or cooperative behaviors, often leading to non-stationary learning dynamics. In contrast, the proposed framework adopts a replicated reinforcement learning architecture in which identical level-1 DQN agents are deployed across RSUs. These agents represent replicated instances of a shared policy rather than independent competing agents. This design maintains consistent policies across RSUs while avoiding the coordination complexity commonly associated with MARL approaches, thereby improving learning stability in highly dynamic vehicular environments.
Greedy baseline models such as1 perform adequately under normal conditions with balanced and uniform traffic, but they fail to maintain high task completion rates when deadlines become tight or traffic spikes occur. Similarly, the methods in4 and5, which rely on fixed RSU selection, suffer from increased task failures under high load because they cannot adaptively forward tasks to less-loaded RSUs. In contrast, the proposed DQN-based approach dynamically learns from the evolving system state, enabling intelligent RSU selection, improved load balancing, higher resource utilization, and reduced task failures under stringent deadlines and heavy traffic conditions. Unlike some previous studies, our proposed method does not incorporate energy consumption or offloading cost. Excluding energy consumption does not affect real-world applicability provided that vehicles, edge-level RSUs, and cloud nodes are not energy constrained. Similarly, omitting offloading cost is justifiable because, in many real-world deployments, VECC infrastructure expenses are integrated into road network maintenance expenses and effectively covered through road tolls. Nonetheless, extending the proposed method to explicitly include these factors is feasible and represents a meaningful direction for future work.
Proposed fault-tolerant VECC architecture with mobility awareness
The proposed Mobility-Aware Fault-Tolerant VECC architecture, illustrated in Fig. 1, builds upon our previous single-RSU design11 and extends it to a dynamic multi-RSU vehicular environment by integrating hierarchical decision-making and vehicle mobility. The architecture consists of three layers: the vehicle layer, the edge layer including multiple RSUs and edge servers, and the cloud layer providing elastic computation and storage resources. Vehicles in the vehicle layer generate tasks with varying computational demands depending on the application type. Lightweight tasks, such as sensor preprocessing and diagnostics, are typically executed locally, whereas latency-sensitive or computation-intensive tasks, such as perception, object detection, and deep learning-based decision-making, are offloaded to the edge layer. Each RSU in the edge infrastructure is equipped with dedicated functional modules to manage this offloading process efficiently. Specifically, three key functional units are integrated within each RSU to enable distributed intelligence and fault-tolerant operation: the Recommender Unit (RU), Execution Unit (EU), and Forwarding Unit (FU). Together, these units coordinate task assignment, execution, and result dissemination, ensuring efficient resource utilization, mobility-aware decision-making, and resilience to failures in dynamic vehicular environments. The RU determines where a task should be executed across RSUs, the EU decides how it should be executed within the selected RSU, and the FU manages how tasks and results are forwarded between vehicles and RSUs. Upon task arrival, an RU replica at the source RSU evaluates the system-wide state, including RSU load, processing capacity, failure rate, and inter-RSU latency, along with the vehicle's travel path and task characteristics, to identify the optimal RSU for execution, thereby minimizing the expected end-to-end delay. Periodic synchronization among RU replicas ensures that all units share scheduling experiences, gradually enhancing their collective decision-making capability. After selecting the target RSU, the FU handles the transmission of the task to that RSU for execution. Once execution is complete, the FU disseminates the result along the vehicle's trajectory, enabling the vehicle to retrieve it from any RSU encountered along its route. The vehicle specifies this trajectory at the time of task offloading. In practice, this trajectory information can be obtained from onboard GPS and navigation systems commonly available in modern vehicles. At the selected RSU, the EU generates an optimal TEP that specifies the primary and backup servers, as well as the fault-recovery strategy, such as Retry, Recovery Block, or First Result. Lightweight or latency-critical tasks are typically processed on local edge servers, whereas computation-intensive tasks may be further offloaded to the cloud. The Task Plan Tracker (TPT) monitors execution progress, detects potential failures, and triggers recovery actions in accordance with the policies defined in the TEP. In a TEP, both primary and backup servers are located within the same RSU. Although assigning backup tasks to an RSU different from the one hosting the primary task can enhance fault tolerance, this approach would require the Fault-Aware Execution Planner (FEP) component to allocate global rather than local resources. Global resource allocation adversely affects both scalability and change management in large-scale RSU networks:
Fig. 1.
Proposed fault-tolerant VECC architecture with mobility awareness.
Scalability degradation: Global resource allocation substantially increases the complexity of the FEP component and degrades its performance.
Poor change management: A FEP component with global visibility across all RSUs becomes highly sensitive to resource-capacity changes at any individual RSU. Consequently, modifying the number of nodes at a single RSU would necessitate redesigning the FEP components at all RSUs. This is an impractical requirement for real-world deployments.
Given these drawbacks, a bi-level model is preferred, in which the level-1 agent is responsible for RSU selection while the level-2 agent manages local resource allocation. This hierarchical design not only improves scalability but also reduces change propagation due to the increased decoupling between decision-making layers.
System model and problem formulation
In this section, we present the system model and problem formulation for task offloading in a fault-tolerant VECC architecture with mobility awareness, where vehicles offload tasks to RSUs. To ensure clarity, all symbols and notations used in this section are summarized in Table 2.
Table 2.
Notations.
| Notation | Definition |
|---|---|
| $\mathcal{T}$ | The set of time slots |
| $\mathcal{K}$ | The set of vehicular tasks |
| $\mathcal{V}$ | The set of vehicles |
| $\mathcal{M}$ | The set of RSUs |
| $\mathcal{M}_k^{del}$ | The set of candidate RSUs along the route of the vehicle for delivering the result of task $k$ |
| $m_k^{off}$ | The offloading RSU where task $k$ is initially offloaded by the vehicle |
| $m_k^{exe}$ | The selected execution RSU where task $k$ is processed after being offloaded from the vehicle |
| $m_k^{del}$ | The selected delivery RSU from the set of candidate RSUs, where task $k$'s result is delivered to the vehicle |
| $\mathcal{N}_m$ | The set of edge servers that RSU $m$ can use for task assignment |
| $\mathcal{C}$ | The set of cloud layer servers |
| $d_k^{in}$ | The input data size of task $k$ |
| $d_k^{out}$ | The output data size of task $k$ |
| $c_k(t)$ | The computation demand of task $k$ in time slot $t$ |
| $TO_k(t)$ | The execution timeout of task $k$ in time slot $t$ |
| $f_n(t)$ | The processing frequency of computing server $n$ in time slot $t$ |
| $\lambda_n(t)$ | The failure rate of server $n$ in time slot $t$ |
| $p_{k,n}$ | The failure probability of task $k$ on server $n$ |
| $x_{k,n,j}^{m}(t)$ | The decision variable that determines if the primary task $k$ is assigned to server $n$, as the $j$-th task, in time slot $t$ in RSU $m$ |
| $y_{k,n,j}^{m}(t)$ | The decision variable that determines if the backup task $k$ is assigned to server $n$, as the $j$-th task, in time slot $t$ in RSU $m$ |
| $z_k(t)$ | The decision variable that determines whether the recovery of task $k$ is based on the sequential model (RB or RT) or the parallel model (FR) in time slot $t$ |
| $\Phi_k$ | The latency penalty incurred when task $k$ fails |
| $T^{comm}_{u,w}(d,t)$ | The communication delay incurred when transmitting data size $d$ from source $u$ to destination $w$ in time slot $t$ |
| $T_k^{fwd,1}(t)$ | Forwarding delay from the offloading RSU to the execution RSU for task $k$ at time slot $t$, determined by the offloading decision |
| $T_k^{fwd,2}(t)$ | Forwarding delay from the execution RSU to the delivery RSU for task $k$ at time slot $t$, based on the selected execution RSU |
| $B_{u,w}(t)$ | The communication bandwidth between source $u$ and destination $w$ in time slot $t$ |
| $\bar{T}_{k,n}(t)$ | The average completion time of task $k$ assigned to server $n$ in time slot $t$ |
| $\bar{W}_{k,n}(t)$ | The average queuing time of task $k$ assigned to server $n$ in time slot $t$ |
| $S_{k,n}(t)$ | The service time of task $k$ assigned to server $n$ in time slot $t$ |
| $\bar{T}_k(t)$ | The average total latency of task $k$ in time slot $t$ |
| $\bar{T}_k^{norm}(t)$ | The weighted total latency of task $k$ in a normal execution scenario in time slot $t$ |
| $T_k^{pri}(t)$ | The total latency of the primary task execution in time slot $t$ |
| $T_k^{bak}(t)$ | The total latency of the backup task execution in time slot $t$ |
| $\bar{T}_k^{rec}(t)$ | The weighted latency of task $k$ in a recoverable failure scenario in time slot $t$ |
| $\bar{T}_k^{fail}(t)$ | The weighted latency of task $k$ in an unrecoverable failure scenario in time slot $t$ |
| $P_k^{succ}(t)$ | The probability that task $k$ terminates successfully in time slot $t$ |
| $P_k^{rec,p}(t)$ | The probability that task $k$ recovers from a primary failure in time slot $t$ |
| $P_k^{rec,b}(t)$ | The probability that task $k$ recovers from a backup failure in time slot $t$ |
| $P_k^{fail}(t)$ | The probability that task $k$ fails in time slot $t$ |
| $T^{tx}_{i,j}(d,t)$ | The transmission delay for sending data of size $d$ between RSUs $i$ and $j$ at time slot $t$ |
| $T^{prop}_{i,j}(t)$ | The propagation delay for the signal to travel between RSUs $i$ and $j$ at time slot $t$ |
| $T^{que}_{j}(t)$ | The queuing delay at RSU $j$ |
| $T^{rtx}_{i,j}(d,t)$ | The retransmission delay due to packet loss for sending data of size $d$ between RSUs $i$ and $j$ at time slot $t$ |
| $B_{i,j}(t)$ | The communication bandwidth between RSUs $i$ and $j$ at time slot $t$, representing the effective transmission rate |
| $D_{i,j}(t)$ | The distance between RSUs $i$ and $j$ at time slot $t$ |
| $\nu$ | The propagation speed of the signal, typically the speed of light in the specific communication medium used |
| $\lambda_j^{fwd}(t)$ | Arrival rate of forwarded tasks to RSU $j$ (tasks/s) |
| $\mu_j(t)$ | Service rate (forwarding capability) of RSU $j$ (tasks/s) |
| $L_j(t)$ | The current load on destination RSU $j$ at time slot $t$, typically the number of tasks in the queue at the RSU |
| $L_{max}$ | The maximum possible load in the system, usually defined as the total number of tasks generated per episode |
| $\mathbb{E}[N_{i,j}]$ | The expected number of retransmissions between RSUs $i$ and $j$, derived from the geometric distribution |
| $p_{i,j}(t)$ | The failure probability of the link between RSUs $i$ and $j$ at time slot $t$, representing the likelihood of a transmission failure |
| $\alpha$ | A sensitivity factor that adjusts the failure rate based on the load in the network |
| $p_{i,j}^{0}$ | The base failure rate of the communication link between RSUs $i$ and $j$ under ideal conditions |
| $\mathcal{A}_k$ | Set of vehicle arrival times at candidate delivery RSUs for task $k$ |
When a vehicle $v$ offloads a task $k$, the process involves three different RSUs with distinct roles. The task is initially sent to an offloading RSU, denoted by $m_k^{off}$. From there, it is transmitted to another RSU selected for execution, referred to as $m_k^{exe}$. After the task has been executed at $m_k^{exe}$, the resulting output is forwarded to a third RSU along the vehicle's trajectory, denoted by $m_k^{del}$, which is responsible for finally delivering the results back to the vehicle. Therefore, the complete path of task processing and result delivery includes interactions between these three RSUs: $m_k^{off}$, $m_k^{exe}$, and $m_k^{del}$, each serving a distinct purpose in the offloading pipeline.
The total latency experienced by the vehicle before receiving the task results includes several components: the communication delay from the vehicle to $m_k^{off}$, the transmission delay from $m_k^{off}$ to $m_k^{exe}$, the execution time on the computing server (either edge or cloud) associated with $m_k^{exe}$, the transmission delay from $m_k^{exe}$ to the RSUs along the vehicle's route, and finally, the delay from the selected RSU to the vehicle for result delivery.
It is important to note that the result of task $k$ is not sent to a single RSU for delivery, but rather to a set of RSUs located along the vehicle's expected trajectory. We denote this set as $\mathcal{M}_k^{del} = \{m_1, m_2, \dots\}$, where each $m_j$ is an RSU positioned along the route of the vehicle. Among these RSUs, the one that the vehicle reaches first and that has already received the output of task $k$ from $m_k^{exe}$ is responsible for delivering the result to the vehicle. Accordingly, the communication delay from $m_k^{exe}$ to $m_k^{del}$ must account for the transmission of task $k$'s output to all RSUs in $\mathcal{M}_k^{del}$, and the minimum delivery time among them is used based on the vehicle's mobility. This dynamic selection influences the overall latency and must be incorporated into the system model. The corresponding formulation of the average total latency is provided in Formula (7). The hierarchical decision mechanism introduced in Sect. 4 directly determines the formation of these latency components. In particular, the selection of the execution RSU $m_k^{exe}$ defines the inter-RSU transmission delay from $m_k^{off}$ to $m_k^{exe}$, which contributes to the communication term in the total latency. The server assignment and recovery configuration within $m_k^{exe}$ affect the computation-related latency components, including both normal execution and failure-dependent scenarios. After task completion, the transmission of the result to the RSUs in $\mathcal{M}_k^{del}$ and the dynamic selection of the final delivery RSU $m_k^{del}$ further influence the overall communication delay. Consequently, the end-to-end latency defined in Formula (7) inherently captures the hierarchical RSU selection, execution configuration, and recovery decisions of the proposed framework.

$$\bar{T}_k(t) = T^{comm}_{v,\,m_k^{off}}(d_k^{in}, t) + T_k^{fwd,1}(t) + \bar{T}_k^{norm}(t) + \bar{T}_k^{rec}(t) + \bar{T}_k^{fail}(t) + T_k^{fwd,2}(t) + T^{comm}_{m_k^{del},\,v}(d_k^{out}, t) \tag{7}$$
In this formula, $T^{comm}_{v,\,m_k^{off}}(d_k^{in}, t)$ and $T^{comm}_{m_k^{del},\,v}(d_k^{out}, t)$ represent the communication delays for transferring the task input from the vehicle to the offloading RSU $m_k^{off}$, and for returning the output from the selected delivery RSU $m_k^{del}$ back to the vehicle. The term $T_k^{fwd,1}(t)$ denotes the end-to-end delay from the offloading RSU $m_k^{off}$ to the RSU selected for task execution, while $T_k^{fwd,2}(t)$ refers to the end-to-end delay from the execution RSU to the delivery RSU $m_k^{del}$. The specific execution RSU is determined by the offloading decisions through the binary decision variables, and its identity is implicitly reflected in the delay terms. In contrast, the selection of the delivery RSU $m_k^{del}$ is based on evaluating both the time at which the vehicle reaches each candidate RSU in $\mathcal{M}_k^{del}$ and the time at which the task result becomes available there. The RSU that satisfies these timing conditions with the minimum delivery latency is chosen, as described later in this section. The components $\bar{T}_k^{norm}(t)$, $\bar{T}_k^{rec}(t)$, and $\bar{T}_k^{fail}(t)$ indicate the weighted execution latencies of task $k$ in the normal execution, recoverable failure, and complete failure scenarios, respectively, where $\bar{T}_k^{norm}(t) = P_k^{succ}(t)\, T_k^{norm}(t)$ with $T_k^{norm}(t)$ and $P_k^{succ}(t)$ defined in the next subsection.

The assignments of the primary and backup tasks to the edge/cloud servers are controlled by the decision variables $x_{k,n,j}^{m}(t)$ and $y_{k,n,j}^{m}(t)$, where $m$ represents the RSU to which the task is assigned. These variables are set to 1 to specify that task $k$ is assigned as the $j$-th task in the queue at server $n$ of RSU $m$ during time slot $t$. Therefore, $x$ and $y$ are expanded with the RSU index $m$, which ensures the identification of the specific RSU for primary and backup task assignments. Additionally, the variables $x_{k,n}^{m}(t)$ and $y_{k,n}^{m}(t)$ are used to determine the assignments of the primary and backup tasks to server $n$ of RSU $m$, regardless of their position in the queue:

$$x_{k,n}^{m}(t) = \sum_{j} x_{k,n,j}^{m}(t) \tag{8}$$

$$y_{k,n}^{m}(t) = \sum_{j} y_{k,n,j}^{m}(t) \tag{9}$$
To facilitate a clearer analysis of latency components, the total delay model is divided into two categories: (1) computation-related delays, which are influenced by server assignment, execution time, and fault-tolerance mechanisms, and (2) communication-related delays, which account for data transmission between the vehicle and RSUs, as well as inter-RSU communication. The following two subsections describe the modeling and computation of these latency components in detail.
Computation delay modeling
This subsection focuses on modeling the delay components associated with task execution, including the impact of fault-tolerant mechanisms. These delays arise from the processing time on computing servers, queuing overheads, and the possible need for backup execution in response to failures. Both the server assignment decisions and the selected fault-recovery strategy influence the latency. The total latency of task $k$ in the normal execution scenario (when no failure occurs) equals the total latency of the primary task execution, denoted by $T_k^{pri}(t)$, when the selected recovery pattern for task $k$ is RB or RT. Otherwise, when the selected recovery pattern is FR, it equals the minimum of the primary and backup execution latencies, as presented in Formula (10):

$$T_k^{norm}(t) = \big(1 - z_k(t)\big)\, T_k^{pri}(t) + z_k(t)\, \min\big\{T_k^{pri}(t),\, T_k^{bak}(t)\big\} \tag{10}$$
where $z_k(t)$ is a binary decision variable that determines the selected recovery pattern of task $k$. If $z_k(t) = 1$, the parallel FR recovery pattern is activated. Otherwise, if $z_k(t) = 0$, a sequential recovery mechanism is applied, corresponding to either RT or RB: the former occurs when both the primary and backup tasks are assigned to the same computing server, whereas the latter applies when they are allocated to different servers. In the DRL formulation, this variable is implicitly determined via discrete action selection by the level-2 DQN agent.

The total latency of the primary/backup task execution, denoted by $T_k^{pri}(t)$ / $T_k^{bak}(t)$, involves the RSU-to-computing-server communication delay $T^{comm}_{m,n}(d_k^{in}, t)$, the average task completion time on the allocated computing server, denoted by $\bar{T}_{k,n}(t)$, and finally, the server-to-RSU communication delay to deliver the output results:

$$T_k^{pri}(t) = \sum_{n} x_{k,n}^{m}(t)\Big( T^{comm}_{m,n}(d_k^{in}, t) + \bar{T}_{k,n}(t) + T^{comm}_{n,m}(d_k^{out}, t) \Big) \tag{11}$$

$$T_k^{bak}(t) = \sum_{n} y_{k,n}^{m}(t)\Big( T^{comm}_{m,n}(d_k^{in}, t) + \bar{T}_{k,n}(t) + T^{comm}_{n,m}(d_k^{out}, t) \Big) \tag{12}$$
The average completion time of task $k$ on server $n$, denoted by $\bar{T}_{k,n}(t)$, is computed as the sum of the average queuing time and the service time on server $n$:

$$\bar{T}_{k,n}(t) = \bar{W}_{k,n}(t) + S_{k,n}(t) \tag{13}$$

$$\bar{W}_{k,n}(t) = \sum_{k' \prec_n k} \Big( \big(1 - p_{k',n}\big)\, S_{k',n}(t) + p_{k',n}\, TO_{k'}(t) \Big) \tag{14}$$

where the summation runs over the tasks $k'$ queued ahead of task $k$ on server $n$, $p_{k,n}$ is the failure probability of task $k$ executing on server $n$ in time slot $t$, and $S_{k,n}(t)$ is the service time of task $k$ on server $n$ in time slot $t$. $S_{k,n}(t)$ is computed in Formula (15), where $c_k(t)$ and $f_n(t)$ denote the computation demand of task $k$ in time slot $t$ and the processing frequency of computing server $n$ in time slot $t$, respectively.

$$S_{k,n}(t) = \frac{c_k(t)}{f_n(t)} \tag{15}$$
Let $n_p$ and $n_b$ denote the servers hosting the primary and backup copies of task $k$. The probabilities that task $k$ terminates with no failures, recovers from the primary task failure, recovers from the backup task failure, and terminates in a failed state are computed as presented in Formulas (16)-(19), respectively:

$$P_k^{succ}(t) = \big(1 - p_{k,n_p}\big)\big(1 - p_{k,n_b}\big) \tag{16}$$

$$P_k^{rec,p}(t) = p_{k,n_p}\big(1 - p_{k,n_b}\big) \tag{17}$$

$$P_k^{rec,b}(t) = \big(1 - p_{k,n_p}\big)\, p_{k,n_b} \tag{18}$$

$$P_k^{fail}(t) = p_{k,n_p}\, p_{k,n_b} \tag{19}$$
The weighted total latency of task $k$ in a recoverable failure scenario is computed in Formula (20). In the sequential recovery patterns RT and RB, where $z_k(t) = 0$, the total latency of task $k$ comprises the RSU-to-computing-server communication delay $T^{comm}_{m,n_p}(d_k^{in}, t)$, the task execution timeout $TO_k(t)$, and the latency of the backup execution $T_k^{bak}(t)$. In the FR recovery model, where $z_k(t) = 1$, the total latency of task $k$ in a recoverable failure scenario equals either the primary task execution latency or the backup task execution latency, as presented in Formula (20):

$$\bar{T}_k^{rec}(t) = \big(1 - z_k(t)\big)\, P_k^{rec,p}(t)\Big( T^{comm}_{m,n_p}(d_k^{in}, t) + TO_k(t) + T_k^{bak}(t) \Big) + z_k(t)\Big( P_k^{rec,p}(t)\, T_k^{bak}(t) + P_k^{rec,b}(t)\, T_k^{pri}(t) \Big) \tag{20}$$

Similarly, the weighted total latency of task $k$ in an unrecoverable failure scenario is computed in Formula (21):

$$\bar{T}_k^{fail}(t) = P_k^{fail}(t)\, \Phi_k \tag{21}$$
where $\Phi_k$ represents the latency penalty incurred when task $k$ fails. Moreover, the failure probability of task $k$ when executing on server $n$ is computed based on the exponential distribution:

$$p_{k,n} = 1 - e^{-\lambda_n(t)\, S_{k,n}(t)} \tag{22}$$

where $\lambda_n(t)$ denotes the failure rate of server $n$ in time slot $t$.
The total execution latency of each task is tightly coupled with the server-side computational capabilities and the selected recovery model. However, to accurately evaluate the end-to-end delay from the vehicle’s perspective, communication-related delays must also be considered. These include data transmission between the vehicle and RSUs, as well as inter-RSU exchanges, which are addressed in the next subsection.
Communication delay modeling
The communication delay in task offloading arises from multiple data transmissions between the vehicle and RSUs, as well as between RSUs themselves. These transmissions are classified into two categories: (1) vehicle-related communications, such as vehicle-to-RSU and RSU-to-vehicle, and (2) communications between RSUs. We define the communication delay $T^{comm}_{u,w}(d, t)$ for transmitting data of size $d$ from the source node $u$ to the destination node $w$ during time slot $t$:

$$T^{comm}_{u,w}(d, t) = \begin{cases} \dfrac{d}{B_{u,w}(t)}, & \text{if } u \in \mathcal{V} \text{ or } w \in \mathcal{V} \\[2mm] T^{tx}_{u,w}(d,t) + T^{prop}_{u,w}(t) + T^{que}_{w}(t) + T^{rtx}_{u,w}(d,t), & \text{otherwise} \end{cases} \tag{23}$$

In the first case, when the vehicle is either the source or the destination, the delay is calculated using the effective communication bandwidth $B_{u,w}(t)$; for the second case, we define an end-to-end delay model for RSU-to-RSU communication. Each component in this model is calculated as follows:
The transmission delay, denoted as $T^{tx}_{i,j}(d,t)$, represents the time required to send the task data from RSU $i$ to RSU $j$ across the communication link. It depends on the data size and the available bandwidth of the link, and is calculated as:

$$T^{tx}_{i,j}(d, t) = \frac{d}{B_{i,j}(t)} \tag{24}$$

where $d$ denotes the size of the task data and $B_{i,j}(t)$ represents the effective transmission bandwidth between RSUs $i$ and $j$ during time slot $t$.
The propagation delay, denoted as $T^{prop}_{i,j}(t)$, represents the time taken for a signal to travel between RSUs $i$ and $j$. It depends on the physical distance and the propagation speed of the signal, as expressed by:

$$T^{prop}_{i,j}(t) = \frac{D_{i,j}(t)}{\nu} \tag{25}$$

where $D_{i,j}(t)$ is the distance between RSUs $i$ and $j$, and $\nu$ denotes the signal propagation speed (e.g., the speed of light or the effective speed in the communication medium).
The queuing delay $T^{que}_{j}(t)$ captures the waiting time of inter-RSU forwarded tasks at the destination RSU $j$ due to congestion. Since the exchanged messages between RSUs correspond to offloaded tasks in our setting, we model this component using an M/M/1-inspired queueing model to capture the non-linear growth of delay near saturation, despite the potentially bursty nature of vehicular task arrivals:

$$T^{que}_{j}(t) = \frac{\lambda_j^{fwd}(t)}{\mu_j(t)\big(\mu_j(t) - \lambda_j^{fwd}(t)\big)} \tag{26}$$

Here, $\lambda_j^{fwd}(t)$ denotes the arrival rate of forwarded tasks to RSU $j$ during time slot $t$, estimated from the number of incoming inter-RSU task transmissions within the slot, and $\mu_j(t)$ denotes the corresponding service rate determined by the RSU forwarding capability under the simulation settings. For numerical stability, we enforce $\lambda_j^{fwd}(t) \le \mu_j(t) - \epsilon$, where $\epsilon$ is a small positive constant.
The retransmission delay, denoted as $T^{rtx}_{i,j}(d,t)$, accounts for the additional delay caused by packet losses that require retransmission. It depends on the link failure probability and the expected number of retransmissions, modeled as:

$$T^{rtx}_{i,j}(d, t) = \mathbb{E}[N_{i,j}] \cdot T^{tx}_{i,j}(d, t) \tag{27}$$

where $\mathbb{E}[N_{i,j}] = \dfrac{p_{i,j}(t)}{1 - p_{i,j}(t)}$ denotes the expected number of retransmissions based on the geometric distribution, and $p_{i,j}(t)$ is the failure probability of the communication link between RSUs $i$ and $j$ at time $t$. This probability is modeled as $p_{i,j}(t) = p_{i,j}^{0}\big(1 + \alpha\, L_j(t)/L_{max}\big)$, where $p_{i,j}^{0}$ denotes the baseline (idle-state) failure rate of the link, and $\alpha$ indicates the sensitivity of the failure rate to the RSU load. Here, $L_j(t)$ represents the instantaneous load at the destination RSU $j$ at time $t$, and $L_{max}$ is the maximum possible load in the system.
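The four RSU-to-RSU delay components of Eqs. (24)-(27) combine in a few lines of code. The sketch below follows the formulas as reconstructed above; the bandwidths, distances, rates, and loads are illustrative example values.

```python
# Sketch of the RSU-to-RSU end-to-end delay, Eqs. (24)-(27).
def rsu_link_delay(data_bits, bw, dist, lam_fwd, mu, load, load_max,
                   p0=0.01, alpha=2.0, prop_speed=3.0e8, eps=1e-6):
    t_tx = data_bits / bw                                # transmission, Eq. (24)
    t_prop = dist / prop_speed                           # propagation, Eq. (25)
    lam_fwd = min(lam_fwd, mu - eps)                     # stability clamp
    t_queue = lam_fwd / (mu * (mu - lam_fwd))            # M/M/1 waiting, Eq. (26)
    p_link = min(1 - eps, p0 * (1 + alpha * load / load_max))  # load-aware loss
    exp_retx = p_link / (1 - p_link)                     # geometric retransmissions
    t_retx = exp_retx * t_tx                             # retransmission, Eq. (27)
    return t_tx + t_prop + t_queue + t_retx

delay = rsu_link_delay(data_bits=8e6, bw=100e6, dist=500.0,
                       lam_fwd=40.0, mu=50.0, load=20, load_max=100)
print(f"end-to-end RSU link delay: {delay * 1e3:.2f} ms")
```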
To incorporate decision-making into the communication delay formulation, we define the delays for task $k$ using the decision variable $x_{k,n}^{m}(t)$, which indicates whether task $k$ is executed at server $n$ of RSU $m$. The forwarding delay from the offloading RSU $m_k^{off}$ to the execution RSU $m_k^{exe}$ is:

$$T_k^{fwd,1}(t) = \sum_{m \in \mathcal{M}} \sum_{n \in \mathcal{N}_m} x_{k,n}^{m}(t)\, T^{comm}_{m_k^{off},\, m}(d_k^{in}, t) \tag{28}$$

The transmission delay from the execution RSU $m_k^{exe}$ to the delivery RSU $m_k^{del}$ is:

$$T_k^{fwd,2}(t) = \sum_{m \in \mathcal{M}} \sum_{n \in \mathcal{N}_m} x_{k,n}^{m}(t)\, T^{comm}_{m,\, m_k^{del}}(d_k^{out}, t) \tag{29}$$
To complete the communication delay modeling for result delivery, we must also account for the vehicle's dynamic interaction with multiple delivery RSUs. After the execution of task $k$, its output is transmitted from the execution RSU $m_k^{exe}$ to all candidate delivery RSUs in the set $\mathcal{M}_k^{del}$, which are located along the vehicle's expected route. Since the vehicle continues moving after offloading the task, it sequentially encounters these RSUs and checks whether the task result has already arrived at each one.

For each candidate RSU $m_j \in \mathcal{M}_k^{del}$, two timing values are considered: (1) the total latency $\bar{T}_k^{j}(t)$, which includes all relevant computation and communication delays until the task result becomes available at RSU $m_j$, as defined in Eq. (7); and (2) the vehicle's arrival time at RSU $m_j$, denoted by $AT_j$, which is assumed to be known as part of a predefined set $\mathcal{A}_k = \{AT_1, AT_2, \dots\}$, where each $AT_j$ corresponds to the estimated arrival time of the vehicle at RSU $m_j$, derived from the vehicle's speed and distance to that RSU. The predefined arrival times are assumed only to simplify the problem formulation and are not required by the proposed model-free DRL solution.

The delivery RSU $m_k^{del}$ is dynamically selected from among the RSUs in $\mathcal{M}_k^{del}$ that satisfy the condition $\bar{T}_k^{j}(t) \le AT_j$, meaning the task result is already available before the vehicle arrives. From this filtered set, the RSU with the minimum vehicle arrival time is selected to minimize delivery latency. Formally:

$$m_k^{del} = \underset{m_j \in \mathcal{M}_k^{del}}{\arg\min} \Big\{ AT_j \;\Big|\; \bar{T}_k^{j}(t) \le AT_j \Big\} \tag{30}$$

This selection mechanism guarantees that the task result is downloaded from the first RSU where it is already available when the vehicle arrives, thereby minimizing result delivery latency under dynamic mobility conditions.
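The selection rule in Eq. (30) reduces to a filter-then-minimize step, as the following sketch shows; the candidate timings are made-up example values.

```python
# Delivery-RSU selection per Eq. (30): among candidates whose result is
# ready before the vehicle arrives, pick the earliest-reached RSU.
def select_delivery_rsu(candidates):
    """candidates: list of (rsu_id, result_ready_time, vehicle_arrival_time)."""
    feasible = [(arr, rid) for rid, ready, arr in candidates if ready <= arr]
    if not feasible:
        return None                      # no RSU can deliver before arrival
    return min(feasible)[1]              # minimum vehicle arrival time wins

candidates = [("rsu-3", 4.2, 3.9),       # result arrives after the vehicle
              ("rsu-4", 4.0, 5.1),       # feasible
              ("rsu-5", 4.1, 7.8)]       # feasible but reached later
print(select_delivery_rsu(candidates))   # -> "rsu-4"
```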
Optimization objective and constraints
Based on the models and formulations introduced in Sects. 5.1 and 5.2, this subsection presents the overall optimization objective along with the system constraints. The goal is to minimize the average total latency of all tasks during the offloading time slots. The optimization problem is formulated as follows:

$$\min_{x,\, y,\, z}\ \frac{1}{|\mathcal{K}|} \sum_{t \in \mathcal{T}} \sum_{k \in \mathcal{K}} \bar{T}_k(t) \tag{31}$$

subject to the following constraints:

$$x_{k,n,j}^{m}(t),\ y_{k,n,j}^{m}(t),\ z_k(t) \in \{0, 1\} \tag{32}$$

$$\sum_{m \in \mathcal{M}} \sum_{n \in \mathcal{N}_m} x_{k,n}^{m}(t) = 1, \quad \forall k \in \mathcal{K} \tag{33}$$

$$\sum_{m \in \mathcal{M}} \sum_{n \in \mathcal{N}_m} y_{k,n}^{m}(t) = 1, \quad \forall k \in \mathcal{K} \tag{34}$$

$$z_k(t)\big(x_{k,n}^{m}(t) + y_{k,n}^{m}(t)\big) \le 1, \quad \forall k, n, m \tag{35}$$

$$\sum_{k \in \mathcal{K}} x_{k,n,j}^{m}(t) \le 1, \quad \forall n, j, m \tag{36}$$

$$\sum_{k \in \mathcal{K}} y_{k,n,j}^{m}(t) \le 1, \quad \forall n, j, m \tag{37}$$

$$\sum_{j} j\, x_{k,n,j}^{m}(t) < \sum_{j} j\, y_{k,n,j}^{m}(t) \quad \text{whenever } x_{k,n}^{m}(t) = y_{k,n}^{m}(t) = 1 \tag{38}$$

$$\sum_{n \in \mathcal{N}_m} x_{k,n}^{m}(t) = \sum_{n \in \mathcal{N}_m} y_{k,n}^{m}(t), \quad \forall k, m \tag{39}$$

where constraint (32) states that $x_{k,n,j}^{m}(t)$, $y_{k,n,j}^{m}(t)$, and $z_k(t)$ are binary variables; constraints (33) and (34) ensure that each primary/backup task is assigned to exactly one computing server; constraint (35) ensures that a primary task and its backup are not assigned to the same server in the FR recovery pattern; constraints (36) and (37) guarantee that no two tasks on the same server share the same queue order; constraint (38) ensures that a primary task executes before its backup when both are placed on the same server; and constraint (39) ensures that both the primary and backup assignments of each task are located within the same RSU.
Proposed method
Once a new task is offloaded from a vehicle at time interval $t$, the level-1 DQN agent observes the level-1 state $s_t^{1}$ and outputs an action $a_t^{1}$ corresponding to the selected RSU (step 1). The selected RSU then observes its level-2 state $s_t^{2}$ and queries its level-2 DQN agent to generate a level-2 action $a_t^{2}$, which defines the TEP. The TEP includes the primary and backup nodes as well as the recommended recovery pattern (step 2). The TEP determined by $a_t^{2}$ is enacted by the TPT component, which deploys the primary and backup copies of the task according to the suggested plan and tracks their execution (step 3). Upon task execution, the TPT collects the task outcome and updates the level-2 state to $s_{t+1}^{2}$; correspondingly, the level-1 state is also updated to $s_{t+1}^{1}$ to reflect the new overall system status (step 4). The level-2 reward $r_t^{2}$ and level-1 reward $r_t^{1}$ are computed based on the execution-level and end-to-end task outcomes, respectively (step 5). Both level-1 and level-2 DQN agents update their replay buffers using their respective tuples $(s_t^{1}, a_t^{1}, r_t^{1}, s_{t+1}^{1})$ and $(s_t^{2}, a_t^{2}, r_t^{2}, s_{t+1}^{2})$, and train their networks off-policy (step 6). This process repeats for each newly arriving task, allowing the bi-level DQN model to continuously optimize RSU selection, server assignment, and recovery pattern selection. Figure 2 illustrates this workflow.
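The interaction loop can be summarized in a short sketch. This is a hedged illustration only: the agent, environment, and TPT interfaces (names such as `level1_state`, `deploy_and_track`, and `train_step`) are assumptions introduced here, not the paper's actual implementation.

```python
# Hedged sketch of one bi-level decision round (steps 1-6 above);
# all interface names are illustrative assumptions.
def handle_task(env, level1_agent, level2_agents, tpt):
    s1 = env.level1_state()                     # step 1: global observation
    a1 = level1_agent.act(s1)                   # select the execution RSU
    rsu_agent = level2_agents[a1]
    s2 = env.level2_state(a1)                   # step 2: local RSU observation
    a2 = rsu_agent.act(s2)                      # TEP: (primary, backup, pattern)
    outcome = tpt.deploy_and_track(a1, a2)      # step 3: deploy copies, monitor
    s2_next = env.level2_state(a1)              # step 4: refreshed states
    s1_next = env.level1_state()
    r2 = outcome.local_reward                   # step 5: execution-level reward
    r1 = outcome.end_to_end_reward              #          end-to-end reward
    rsu_agent.remember(s2, a2, r2, s2_next)     # step 6: off-policy updates
    level1_agent.remember(s1, a1, r1, s1_next)
    rsu_agent.train_step()
    level1_agent.train_step()
```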
Fig. 2.
The proposed hierarchical DQN method for fault-tolerant vehicular task offloading.
State
The state information is defined separately for the level-1 and level-2 DQN agents (all features are normalized to $[0, 1]$).
The level-1 state at time $t$, denoted as $s_t^{1}$, is defined as $s_t^{1} = \langle \mathbf{R}_t, \mathbf{D}_t, \mathbf{k}_t \rangle$. The component $\mathbf{R}_t = \langle r_t^{1}, \dots, r_t^{|\mathcal{M}|} \rangle$ comprises per-RSU features, and $\mathbf{D}_t$ is a flattened vector of pairwise end-to-end delays computed for the current task. Each RSU state is defined as $r_t^{m} = \langle \bar{\lambda}_t^{m}, f_t^{m}, load_t^{m}, tasks_t^{m}, path_t^{m} \rangle$, capturing the average failure rate of the $m$-th RSU's edge servers, the sum of processing frequencies across its servers, the cumulative computational load, the total number of processed tasks, and an in-path flag indicating whether the $m$-th RSU lies along the vehicle's trajectory. Beyond these per-RSU features, $\mathbf{D}_t$ provides a system-level perspective of network latency. The task feature vector is defined as $\mathbf{k}_t = \langle c_k, d_k^{in}, v_t \rangle$, representing the computational demand of the task, its input size, and the current speed of the vehicle generating the task at time $t$.
The level-2 state at time $t$, denoted as $s_t^{2}$, is defined as $s_t^{2} = \langle \mathbf{N}_t, \mathbf{k}_t \rangle$. The component $\mathbf{N}_t = \langle n_t^{1}, n_t^{2}, \dots, n_t^{N} \rangle$ comprises the states of all edge and cloud nodes, where $N$ is the total number of nodes. Each node state is defined as $n_t^{i} = \langle f_t^{i}, \lambda_t^{i,pri}, \lambda_t^{i,bak}, load_t^{i} \rangle$, capturing the CPU frequency of the $i$-th node, the observed failure rates of primary and backup tasks on the $i$-th node, and the cumulative computational load of tasks assigned to the $i$-th node at time $t$. The task feature vector is defined as $\mathbf{k}_t = \langle c_k, d_k^{in} \rangle$, representing the computational demand and input size of the task.
Action
The action spaces of the level-1 and level-2 DQN agents are defined as follows:
The level-1 action at time $t$, denoted as $a_t^{1}$, corresponds to the index of the RSU selected to execute the incoming task. The level-1 DQN agent generates this action using an ε-greedy policy, where SoftMax sampling is optionally applied as described in Eq. (6). The total number of possible actions equals $|\mathcal{M}|$, where $|\mathcal{M}|$ is the number of RSUs available for task assignment.

The level-2 action at time $t$, denoted as $a_t^{2}$, defines the TEP at the selected RSU. Each TEP is represented as a triple $\langle n_p, n_b, rp_k \rangle$, where the first two entries indicate the primary and backup servers, and $rp_k$ specifies the recovery pattern for task $k$, which can be RT, RB, or FR. A complete set of valid TEP triples is pre-generated and stored in a TEP Table. Invalid combinations, such as identical nodes for RB or FR patterns, are excluded. The level-2 DQN agent selects one index from this table, and the corresponding execution plan is directly applied. Assuming $N$ edge/cloud nodes are available for task placement, the numbers of valid TEPs for the RT, RB, and FR patterns are $N$, $N(N-1)$, and $N(N-1)$, respectively, resulting in a total of $N + 2N(N-1)$ valid execution plans in the TEP Table. The level-2 DQN agent selects the TEP corresponding to the chosen Q-value from its output, ensuring that each task is assigned to a valid combination of servers and recovery pattern.
Reward
This section explains the reward structures for the level-1 and level-2 DQN agents. The difference in their reward formulations stems from the varying delay durations at the two levels. At the level-2 stage, short delays are much more common, so the level-2 reward function is designed to be highly sensitive to small variations in short delays, while treating longer delays with nearly uniform penalties. In contrast, at the level-1 stage, where longer delays are more likely, the reward function is designed to remain sensitive to changes in these longer delays as well.
The level-2 reward is defined based on the local execution delay $\delta_t$, measured from the start of the primary task execution to the earliest successful completion or failure at the RSU. At time $t$, the level-2 reward $r^2_t$ is computed as

$$
r^2_t =
\begin{cases}
\min\!\left(R^2_{\max},\, -w_s \ln \delta_t\right), & \text{if the task completes successfully at the RSU}\\
\operatorname{clip}\!\left(-w^2_p\, \delta_t,\; P^2_{\min},\, P^2_{\max}\right), & \text{if the task fails}
\end{cases}
\tag{40}
$$

where $w_s$ scales rewards for locally successful executions, producing positive values that decrease logarithmically with increasing $\delta_t$, thereby encouraging faster execution. To avoid excessively large rewards in extremely short-delay cases, the maximum reward is capped at $R^2_{\max}$. Similarly, $w^2_p$ scales penalties for failures, and $P^2_{\min}$ and $P^2_{\max}$ bound these penalties to maintain stable training. This formulation incentivizes the level-2 agent to minimize execution latency and efficiently utilize both primary and backup servers, improving fault tolerance and responsiveness within each RSU.
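A minimal transcription of Eq. (40) follows, using the level-2 parameter values from Table 3. Note that the clipped-linear penalty branch is part of the reconstruction above, so this is a sketch of the stated behavior rather than the authors' exact code.

```python
# Level-2 reward: capped logarithmic decay on success, bounded penalty on failure.
import math

def level2_reward(delay, success, w_s=1.0, r_max=30.0,
                  w_p=20.0, p_min=-90.0, p_max=-3.0):
    if success:
        return min(r_max, -w_s * math.log(delay))   # log decay, capped at r_max
    return max(p_min, min(p_max, -w_p * delay))     # penalty clipped to [p_min, p_max]

print(level2_reward(0.05, True))    # short successful execution -> positive reward
print(level2_reward(2.0, False))    # failure after 2 s -> bounded penalty (-40.0)
```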
The level-1 reward is defined with respect to the end-to-end delay $\Delta_t$, measured from task submission at the source RSU to result delivery to the vehicle. At time $t$, the level-1 reward $r^1_t$ is computed as

$$
r^1_t =
\begin{cases}
R_{\min} + \left(R_{\max} - R_{\min}\right) e^{-\Delta_t / \lambda}, & \text{if the task result is delivered successfully}\\
\operatorname{clip}\!\left(-w^1_p\, \Delta_t,\; P^1_{\min},\, P^1_{\max}\right), & \text{if the task fails or times out}
\end{cases}
\tag{41}
$$

where $R_{\max}$ and $R_{\min}$ denote the maximum and minimum rewards for successful task completion, $\lambda$ controls the exponential decay of reward with increasing end-to-end delay, $w^1_p$ scales penalties for failed or timed-out tasks, and $P^1_{\min}$ and $P^1_{\max}$ constrain failure penalties to prevent extreme values. This design encourages the level-1 agent to maximize successful deliveries while minimizing end-to-end delays, balancing timeliness and reliability across the vehicular network.
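The level-1 counterpart, again as a sketch of the reconstructed Eq. (41) with the Table 3 level-1 values; the exact decay argument ($\Delta/\lambda$) is inferred from the description of $\lambda$ as a decay scale.

```python
# Level-1 reward: exponential decay from R_max toward R_min on success,
# bounded penalty on failure or timeout.
import math

def level1_reward(delta, success, r_max=25.0, r_min=5.0, lam=100.0,
                  w_p=10.0, p_min=-150.0, p_max=-3.0):
    if success:
        return r_min + (r_max - r_min) * math.exp(-delta / lam)
    return max(p_min, min(p_max, -w_p * delta))

print(level1_reward(1.0, True))     # fast delivery -> reward close to R_max
print(level1_reward(30.0, False))   # failed task -> clipped to p_min (-150.0)
```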
The reward design balances latency reduction, failure penalties, and efficient resource utilization in the VECC system. Latency is captured through delay-dependent reward decay, while task failures are penalized through bounded negative rewards. Resource utilization is implicitly encouraged through the inclusion of RSU workload and server capacity in the agent state representation. Bounded reward ranges are also used to maintain stable learning behavior.
Level-1 agent synchronization
The proposed synchronization algorithm for the level-1 DQN agent operates as follows: at each interaction step $t$, the agent selects an action $a_t$ in state $s_t$, receives a reward $r_t$, and transitions to the next state $s_{t+1}$. The resulting experience tuple $(s_t, a_t, r_t, s_{t+1})$ is stored in the agent's local replay buffer.
To ensure consistency among replicated level-1 DQN agents deployed across RSUs, a distributed experience-sharing mechanism is employed in which newly generated replay tuples are propagated through the RSU network and made available to other agents. The associated communication overhead is minimal due to lightweight replay tuples and high inter-RSU bandwidth. Under this design, replay buffers across RSUs gradually converge toward a shared pool of experiences generated throughout the system. Consequently, all agents learn from an effectively identical pool of global experiences and update their networks consistently.
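The experience-sharing idea can be illustrated with a toy broadcast sketch. The class layout and synchronous delivery below are simplifying assumptions; in practice, tuples propagate through the RSU network with some delay, which the packet-drop experiment later examines.

```python
# Toy replicated replay buffers: each agent appends local transitions and
# forwards them to all peers, so buffers converge to one global experience pool.
from collections import deque

class ReplicatedAgent:
    def __init__(self, capacity=500_000):
        self.buffer = deque(maxlen=capacity)
        self.peers = []

    def store(self, transition, broadcast=True):
        self.buffer.append(transition)
        if broadcast:                          # propagate through the RSU network
            for peer in self.peers:
                peer.store(transition, broadcast=False)

agents = [ReplicatedAgent() for _ in range(8)]
for a in agents:
    a.peers = [p for p in agents if p is not a]
agents[0].store(("s", "a", 1.0, "s_next"))
assert all(len(a.buffer) == 1 for a in agents)  # every replica sees the tuple
```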
Performance evaluation
The objective of the performance evaluation is to address the following research questions:
RQ1: How does shortening task deadlines influence the task failure rate in the proposed method compared with the baseline methods?
RQ2: To what extent can the level-2 DQN agents effectively learn fault-tolerant task assignment to the nodes?
RQ3: How effective is the level-1 DQN agent in minimizing total latency and task failures under different traffic conditions?
RQ4: How do high network delays affect the learning performance of the replicated level-1 DQN agents?
RQ5: To what extent is the proposed method robust under conditions of route topology prediction uncertainty?
RQ6: How does the proposed method scale under increasing system size, in terms of both the number of RSUs and the number of computing nodes per RSU?
Simulation environment setup
The simulation environment was developed using Python 3.10.18 managed by Anaconda, with SimPy as the core framework for discrete event simulation30. To emulate vehicular mobility, SUMO was integrated via TraCI47, allowing for real-time vehicle movement simulation and dynamic updates to the task offloading system based on the vehicles’ positions and traffic flow. The PyTorch framework was used to implement the DQN models. In the simulation environment, several Python libraries were used to facilitate various tasks. NumPy and SciPy supported numerical and statistical computations, while Pandas handled dataset management and analysis. Matplotlib was employed for generating visual outputs of simulation results. For external data handling, Openpyxl was applied to manage Excel files, and the Subprocess and OS libraries were used for process management and system-level operations, enabling seamless integration with the SUMO simulation.
The SimPy environment was initialized to manage the task queue, task assignment, and execution processes. At each step, tasks were generated, sorted by their arrival times, and then processed based on the current state. SUMO provided real-time vehicle mobility data, which was continuously synchronized with the offloading framework to align vehicular dynamics with task execution. The overall procedure for the simulation is outlined in Listing 1, where each task is handled iteratively across episodes. The level-1 DQN agent selects the appropriate RSU, and the corresponding level-2 DQN agent chooses the execution nodes. Each task is then processed within the SimPy environment, while SUMO operates alongside it to dynamically update vehicle positions.
Algorithm 1.
SimPy_Simulation.
Listing 1. The hierarchical simulation algorithm integrating SUMO with SimPy.
Each simulation episode begins by initializing the SimPy environment (line 1) and defining the shared Cloud servers (line 2). Then, a level-1 DQN agent is instantiated for selecting the most appropriate RSU based on the current system state and incoming task features (line 3). Additionally, for each RSU, a set of local Edge servers is created and associated with a dedicated level-2 DQN agent and a replica of the level-1 DQN agent for RSU access (lines 4–8). At the start of each episode (line 10), the environment is reset and a pre-generated task queue is loaded, sorted by task arrival times to simulate real-world scheduling (line 11). The main loop iterates through tasks until the maximum number is reached (line 12). For each task arrival (line 13), the level-1 state is obtained (line 14). The level-1 DQN agent then selects the most suitable RSU (line 15). Given this decision, the level-2 state for the selected RSU is retrieved (line 16), and the corresponding level-2 DQN agent generates a primary-backup offloading decision, considering the recovery pattern and local infrastructure conditions (line 17).
A Task object is created, embedding the task’s profile, selected RSU, and the recommended execution plan (line 18). Then, the task is executed in the SimPy simulation (line 19), which models both primary and backup execution paths according to the selected recovery pattern. In addition to the task offloading process, a separate process for vehicle movement simulation using SUMO is initiated within the SimPy environment. This process runs concurrently with the task offloading process, and both processes are managed by SimPy to simulate the dynamic interaction between task execution and vehicle movement (line 20). After task execution, the environment is queried to obtain two sets of feedback: (1) the level-1 reward and next level-1 state, which measure the quality of RSU selection (line 21), and (2) the level-2 reward and next level-2 state, which reflect task success, delays, and FT performance (line 22). These feedback signals are used to update the experience buffers and train both the level-1 and level-2 DQN agents (lines 23–26), enabling the system to learn optimal coordination strategies over time. The loop proceeds to the next task until the episode ends. To better illustrate the simulation process, Fig. 3 presents a flowchart focusing on task execution and vehicle mobility simulation (steps 19 and 20 in Listing 1). This figure highlights how SimPy manages these concurrent processes, ensuring coordination between task processing and dynamic vehicle movements. The “Process task offloading” step in Fig. 3 is expanded in Fig. 4, where the simulation steps of a fault-tolerant task execution at a selected RSU are illustrated.
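The concurrency pattern in Fig. 3 reduces to two SimPy processes sharing one simulation clock. The sketch below shows only that skeleton; the process bodies and print statements are illustrative placeholders, not the authors' simulator.

```python
# Minimal SimPy sketch of the two concurrent processes in Fig. 3:
# task offloading and SUMO-driven mobility interleave on one clock.
import simpy

def task_offloading(env):
    for task_id in range(3):
        yield env.timeout(1.0)                 # wait for the next task arrival
        print(f"t={env.now:.1f}s: offload task {task_id}")

def vehicle_mobility(env, step=0.5):
    while True:
        yield env.timeout(step)                # one mobility update step
        # In the full simulator, traci.simulationStep() would advance SUMO here.
        print(f"t={env.now:.1f}s: mobility update")

env = simpy.Environment()
env.process(task_offloading(env))
env.process(vehicle_mobility(env))
env.run(until=3.5)                             # both processes interleave
```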
Fig. 3.
SimPy steps for task execution and vehicle mobility simulation (steps 19 and 20 in Listing 1).
Fig. 4.
SimPy steps for fault-tolerant task execution at a selected RSU (Process task offloading step in Fig. 3).
The simulation parameters are summarized in Table 3. Task deadlines were defined as a function of their submission time, computational demand, and input size to ensure realistic timing constraints in the vehicular edge environment. Specifically, the deadline of task $t$ is given by

$$
D_t = t^{sub}_t + \alpha\, c_t + \beta\, z_t
\tag{42}
$$
Table 5-A.
Reward sensitivity analysis.

| Agent | Parameter | P1 | P2 | Selected | P3 | P4 |
|---|---|---|---|---|---|---|
| level-1 | decay scale ($\lambda$) | 30.0 | 65.0 | 100.0 | 135.0 | 170.0 |
| level-1 | penalty weight ($w^1_p$) | 6.0 | 8.0 | 10.0 | 20.0 | 30.0 |
| level-2 | penalty weight ($w^2_p$) | 8.0 | 14.0 | 20.0 | 35.0 | 50.0 |
| level-2 | success weight ($w_s$) | 0.3 | 0.6 | 1.0 | 1.5 | 2.0 |
where $t^{sub}_t$, $c_t$ and $z_t$ denote the submission time at the source RSU, the computational demand, and the input size of task $t$, respectively, and $\alpha$ and $\beta$ are scaling coefficients. This formulation ties deadlines to intrinsic task characteristics, preventing overly relaxed or infeasible constraints. The infrastructure setup, including RSUs, their servers, and network characteristics, is listed in Table 4.
Table 4.
Infrastructure setup.
| Component | Parameter | Value/Range |
|---|---|---|
| Cloud | number of cloud servers | 2 |
| | processing frequency (MIPS) | Uniform (30, 60) |
| | Cloud-to-RSU delay (s) | Uniform (1.25, 12.5) |
| | failure rate | Uniform (0.00821, 0.00827) |
| Edge | processing frequency (MIPS) | Uniform (10, 15) |
| | Edge-to-RSU delay (s) | 0.001 |
| | failure rate | Uniform (0.02689, 0.08254) |
| Network | number of RSUs | 8 |
| | edge servers per RSU | Uniform (6, 7) |
| | RSU coverage radius (m) | Uniform (850, 1000) |
| | propagation speed (m/s) | 2 × 10⁸ |
| | RSU-to-RSU bandwidth (Mb/s) | Uniform (200, 500) |
| | RSU-to-Cloud bandwidth (Mb/s) | 80 |
| | link failure rate | Uniform (0.1, 0.5) |
Table 3.
Simulation parameters.
| Component | Parameter | Value/Range |
|---|---|---|
| Task Profile | number of vehicles | 3 |
| | tasks per vehicle | 600 |
| | total number of tasks | 1800 |
| | task arrival rate | Uniform (0.8, 1) tasks/s |
| | task arrival distribution | Poisson |
| | task input size (Mb) | Uniform (100, 1000) |
| | task result size (Mb) | 10 |
| | task computation demand (MIPS) | Normal (50, 16) |
| DQN Component (shared) | number of episodes | 500 |
| | activation function | ReLU |
| | optimizer | Adam |
| | batch size | 256 |
| | soft update rate (τ) | 0.005 |
| | replay buffer sampling | Uniform random |
| | target network update | Soft update with τ |
| | device | CPU |
| Level-1 DQN | hidden layers | [128, 64] |
| | learning rate | 0.0003 |
| | discount factor (γ) | 0.85 |
| | replay buffer capacity | 500,000 |
| | epsilon (start/end/decay) | 1.0/0.01/400 |
| | exploration temperature (T) | 1.5 |
| | action selection | ε-greedy + optional SoftMax |
| | max success reward ($R_{\max}$) | 25.0 |
| | min success reward ($R_{\min}$) | 5.0 |
| | decay scale ($\lambda$) | 100.0 |
| | penalty weight ($w^1_p$) | 10.0 |
| | penalty bounds ($[P^1_{\min}, P^1_{\max}]$) | [−150, −3] |
| Level-2 DQN | hidden layers | [128, 64] |
| | learning rate | 0.0005 |
| | discount factor (γ) | 0.90 |
| | replay buffer capacity | 200,000 |
| | epsilon (start/end/decay) | 1.0/0.01/300 |
| | action selection | ε-greedy |
| | success weight ($w_s$) | 1.0 |
| | max success reward ($R^2_{\max}$) | 30.0 |
| | penalty weight ($w^2_p$) | 20.0 |
| | penalty bounds ($[P^2_{\min}, P^2_{\max}]$) | [−90, −3] |
The Edge-to-RSU delay is set to a negligible non-zero value (0.001 s), reflecting the co-located deployment of edge servers within RSUs and their high-speed internal connectivity.
All learning-based baselines considered in this study are trained under identical experimental configurations, including the same traffic patterns, workload settings, and number of episodes, to ensure a fair evaluation protocol. Each experiment was repeated across multiple independent runs under default random initialization of the underlying libraries. The reported results correspond to the average performance across these runs.
Hyper-parameter sensitivity analysis
The DQN hyper-parameters reported in Table 3 were obtained through an iterative grid-based tuning process. Initial candidate ranges were adopted from commonly used settings in related DQN-based studies and our prior DRL offloading implementations11, and were subsequently fine-tuned by inspecting learning curves and convergence behavior. The final configuration in Table 3 corresponds to the setting that consistently provided stable learning behavior. For example, the soft update rate (τ) was evaluated across multiple values, where τ = 0.005 yielded the most consistent convergence compared to higher update rates. Regarding the reward design, the weighting factors in Table 3 were selected to balance deadline failure and node failure rates at level-1 and level-2. In addition, the penalty bounds ($P_{\min}$, $P_{\max}$) were introduced to limit extreme penalty values and reduce the impact of outlier updates, thereby improving training stability. To analyze the sensitivity of the reward function to the weighting parameters of the level-1 and level-2 agents, an experiment was conducted: for each parameter, its value was varied in the neighborhood of the selected value while all other parameters were kept fixed. The effect of these changes on the task failure rate is listed in Table 5-A and illustrated in Fig. 5.
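The one-at-a-time sweep described above can be sketched as follows; `run_episodes` is a hypothetical stand-in for the full train-and-evaluate pipeline, and the value grids mirror Table 5-A.

```python
# One-at-a-time reward-parameter sweep: perturb a single parameter around its
# selected value while holding the others at the Table 3 defaults.
BASE = {"lambda": 100.0, "w_p1": 10.0, "w_p2": 20.0, "w_s": 1.0}
GRID = {
    "lambda": [30.0, 65.0, 100.0, 135.0, 170.0],
    "w_p1":   [6.0, 8.0, 10.0, 20.0, 30.0],
    "w_p2":   [8.0, 14.0, 20.0, 35.0, 50.0],
    "w_s":    [0.3, 0.6, 1.0, 1.5, 2.0],
}

def sensitivity_sweep(run_episodes):
    results = {}
    for name, values in GRID.items():
        for v in values:
            params = {**BASE, name: v}                 # vary one parameter only
            results[(name, v)] = run_episodes(params)  # avg task failure rate
    return results
```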
Fig. 5.
Average task failure rate under reward-parameter variations.
Figure 5 shows that the performance of the agents is sensitive to the variations in the reward parameter values. In all cases, the selected values yield the lowest failure rate, while deviations toward lower or higher values degrade agents’ performance.
Evaluation metrics and baseline methods
To evaluate the performance of the proposed method, two key metrics are considered:
Average Reward: This metric reflects the learning performance of the level-1 and level-2 DQN agents over time. A sliding window of 40 episodes is applied, and the rewards are averaged across the episodes within the window. The reward functions are separately defined for the level-1 (RSU selection) and level-2 (fault-tolerant task execution) stages, capturing their distinct optimization objectives.
Average Number of Task Failures: This metric measures system reliability by calculating the average number of failed tasks over the most recent 40 episodes. It complements the reward metric by providing a direct perspective on the system’s fault-tolerance strategy.
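Both metrics use the same 40-episode sliding-window averaging, which reduces to a plain convolution; the snippet below is a minimal sketch of that computation.

```python
# 40-episode sliding-window average used for both evaluation metrics.
import numpy as np

def sliding_avg(values, window=40):
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

rewards = np.random.rand(500)        # e.g., per-episode rewards
print(sliding_avg(rewards).shape)    # (461,) for 500 episodes and window 40
```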
The above metrics are employed to compare the proposed bi-level DQN approach with the following baseline methods. All baseline algorithms were evaluated under identical simulation settings, and hyperparameters of the learning-based methods were tuned through iterative experimentation to ensure a fair comparison. Since the proposed framework follows a replicated reinforcement learning architecture in which identical agents share a common policy across RSUs, baseline methods were selected among approaches suitable for single-policy decision models operating in discrete action spaces. In addition, a single-level (flat) DQN baseline was evaluated to assess the benefit of hierarchical decision decomposition.
1. Bi-level Proximal Policy Optimization (bi-level PPO): A bi-level PPO framework is implemented using the same hierarchical decision structure as the proposed method. The level-1 PPO agent performs RSU selection, while the level-2 PPO agents handle server selection and fault-recovery decisions within each RSU. PPO is an on-policy, policy-gradient algorithm48, whereas the proposed approach relies on DQN, an off-policy, value-based method. This baseline enables a direct comparison between two fundamentally different DRL paradigms under an identical hierarchical architecture.
2. Greedy RSU Selection (Greedy): This method selects the target RSU based on a deterministic greedy algorithm1, prioritizing the RSU with the lowest workload and minimum communication delay.
3. No Task Forwarding (No-Forwarding): In this baseline method4,5, the task is always executed locally by the receiving RSU; there is no task-forwarding mechanism. Once the task execution is completed locally, the results are forwarded to the vehicle.
4. Flat Deep Q-Network (Flat DQN): A flat (non-hierarchical) DQN model is considered as a baseline, in which RSU selection, server allocation, and fault-recovery decisions are jointly learned by a single agent without hierarchical decomposition. This baseline is used to assess the architectural impact and scalability of the proposed bi-level design, particularly in larger-scale settings. For a fair comparison, the Flat DQN optimizes the same end-to-end reward objective and receives the combined state information required by both level-1 and level-2 agents, resulting in a single enlarged decision space. Thus, any performance difference primarily reflects the effect of hierarchical decomposition rather than differences in objective formulation or information availability.
The isolated impact of disabling or restricting recovery strategies has been examined in our prior work11; in the present study, recovery mechanisms are kept consistent across learning baselines to isolate architectural effects.
Simulation results
A series of experiments was conducted to answer the research questions. Each experiment was repeated 10 times, and the average results are reported.
First experiment: impact of deadline shortening on task failures (RQ1)
The first experiment investigates the impact of task deadline shortening on system reliability, measured in terms of the average number of task failures. The initial deadline for each task is determined using Eq. (42), and subsequently scaled by reduction factors ranging from 0.9 to 0.5 to model increasingly stringent timing constraints. For each reduction factor, the average number of failed tasks is computed at episode 500, where the learning-based methods have converged to stable behavior. The proposed bi-level DQN method is compared against three baseline approaches: bi-level PPO, Greedy, and No-Forwarding. Figure 6 illustrates the average number of task failures under different deadline reduction factors, while Table 6 reports the corresponding numerical values.
Fig. 6.
Average task failures vs. deadline reduction factors.
Table 6.
Average number of failed tasks under varying deadline reduction factors (episode 500).
| Method | 1.00 | 0.90 | 0.80 | 0.70 | 0.60 | 0.50 |
|---|---|---|---|---|---|---|
| bi-level DQN | 28.15 | 680.5 | 1313.225 | 1514.175 | 1625.475 | 1694.6 |
| bi-level PPO | 37.05 | 766.525 | 1340.85 | 1542.125 | 1651.75 | 1716.75 |
| Greedy | 219.85 | 885.175 | 1361.2 | 1541.85 | 1641.65 | 1705.775 |
| No-Forwarding | 492.55 | 1228.15 | 1567.025 | 1637.675 | 1702.425 | 1735.825 |
The results indicate that the proposed bi-level DQN method consistently achieves the lowest task failure rate across all evaluated deadline settings. When no deadline reduction is applied, a substantial performance gap already exists between the proposed method and the baseline approaches. Under moderate deadline reductions (0.9 and 0.8), the proposed method maintains clear superiority, demonstrating robust performance under tighter timing constraints. The bi-level PPO baseline exhibits competitive performance under moderate reductions; however, as the deadlines become more stringent (0.6 and 0.5), its behavior gradually approaches that of the Greedy strategy. Under these extremely tight conditions, the performance gap among all methods narrows, reflecting inherent system limitations imposed by severe timing constraints. Nevertheless, the proposed bi-level DQN method consistently preserves its advantage and outperforms all baseline approaches across the entire range of evaluated scenarios.
Second experiment: evaluation of level-2 DQN learning performance (RQ2)
This experiment was conducted to measure the ability of level-2 DQN agents to learn fault-tolerant task assignments over learning episodes. Table 7 summarizes the average number of tasks processed by each RSU. The results highlight differences in load distribution: RSU_3 experienced the heaviest load (≈ 285 tasks on average), while RSU_5 handled the least (≈ 143 tasks).
Table 7.
Average number of tasks processed by each RSU.
| RSU | RSU_0 | RSU_1 | RSU_2 | RSU_3 | RSU_4 | RSU_5 | RSU_6 | RSU_7 |
|---|---|---|---|---|---|---|---|---|
| Average number of tasks | 264.49 | 225.76 | 195.75 | 284.52 | 252.84 | 143.25 | 184.79 | 248.62 |
Figure 7 illustrates the average reward trends for the eight RSUs. In most cases, the level-2 DQN agent demonstrates a steady improvement, converging toward stable values after several hundred episodes. RSUs with higher task loads, such as RSU_3 and RSU_0, achieved steadier improvements, while lighter-load RSUs, such as RSU_5, exhibited greater fluctuations due to limited training data. Each level-2 agent is trained solely on locally assigned tasks, and no additional mitigation mechanism is applied for lightly loaded RSUs. Consequently, low-load RSUs generate fewer training samples, which increases reward variance; however, since these RSUs process a smaller share of total tasks, their impact on the overall system-level performance remains limited.
Fig. 7.
Episodic average reward of RSUs.
The average task failure percentage over episodes is illustrated in Fig. 8. Overall, most RSUs show a clear decline in failures. Again, RSUs with heavier loads exhibit lower failure rates and steadier reductions over episodes.
Fig. 8.
Average task failure rate of RSUs.
Third experiment: impact of the level-1 DQN agent on latency and task failures (RQ3)
This experiment aims to evaluate the impact of applying the level-1 DQN agent on total latency and task failures under different traffic conditions. Two scenarios were designed: (1) a light-traffic scenario with task arrival rates uniformly sampled between 0.4 and 0.5 tasks per second and an evenly distributed load among RSUs, and (2) a heavy-traffic scenario with task arrival rates between 0.8 and 1 task per second and an imbalanced task distribution among RSUs.
The proposed bi-level DQN method is compared against bi-level PPO, Greedy RSU Selection, and No-Forwarding. Figures 9 and 10 report the average reward and the average number of task failures under light traffic conditions, where the differences between the proposed method and the baseline approaches remain minor due to the underutilization of edge resources in this scenario.
Fig. 9.
Average reward comparison under light traffic load.
Fig. 10.
Task failures comparison under light traffic load.
Figures 11 and 12 illustrate the results under heavy traffic conditions. In this scenario, the proposed bi-level DQN method demonstrates a clear performance advantage. By leveraging the level-1 learning agent to dynamically balance the workload across RSUs, the proposed method significantly reduces task failures while achieving higher average reward values. The bi-level PPO baseline also benefits from the hierarchical structure and improves upon heuristic baselines; however, its performance gradually degrades under severe load imbalance, leading to higher task failure rates compared to the proposed method.
Fig. 11.
Average reward comparison under heavy traffic load.
Fig. 12.
Task failures comparison under heavy traffic load.
In contrast, the Greedy approach, which relies solely on instantaneous system information, fails to effectively mitigate load imbalance under heavy traffic and exhibits noticeably higher failure rates. The No-Forwarding method performs the worst in this scenario, as the lack of task forwarding and global coordination results in severe RSU congestion and a substantial increase in task failures.
Fourth experiment: impact of network delay on the learning performance of the replicated level-1 DQN agent (RQ4)
This experiment evaluates the impact of network-induced delays on the learning performance of the replicated level-1 DQN agent. As described in Sect. 6, the level-1 agent is replicated across RSUs and relies on exchanging experience tuples to learn a consistent global policy. Network delays may therefore affect the timeliness and freshness of shared experiences and, consequently, the learning process. To model network delay effects, packet drops are introduced between RSUs according to predefined drop ratios. For each drop ratio, the average reward and the average number of task failures are measured after the learning process stabilizes. The proposed bi-level DQN method is evaluated under increasingly severe network delay conditions. Figure 13 illustrates the episode-wise evolution of the average reward under different packet drop ratios, while Fig. 14 shows the corresponding task failure trends over training episodes. This experiment showed that network delay does not significantly affect level-1 agent performance, particularly at low and moderate packet drop rates, owing to the off-policy nature of the DQN agent.
Fig. 13.
Average reward under different packet drop ratios.
Fig. 14.
Average task failures under different packet drop ratios.
Fifth experiment: robustness under route topology prediction uncertainty (RQ5)
To evaluate robustness under imprecise route topology prediction, controlled prediction noise is injected into the candidate delivery RSU set. For a given error rate $e$, in-path RSUs are randomly removed with probability $e$, while additional out-of-path RSUs are randomly added in proportion to the path length, jointly modeling false-negative and false-positive prediction errors. This formulation reflects connectivity-level uncertainty in RSU availability rather than spatial or temporal deviations in vehicle trajectory estimation. The results of this experiment are illustrated in Fig. 15, which reports the average task failure rate for each value of $e$, where the bars represent mean values and the error bars indicate the standard deviation across multiple runs. As shown in the figure, although increasing noise generally increases the task offloading failure rate, the system remains robust when the prediction error is kept below 10%. This robustness arises because the proposed deep reinforcement learning (DRL) approach can adapt, over time, to consistent patterns of imprecise predictions.
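The noise-injection procedure can be sketched compactly. The specific proportionality used for false positives (rounding $e$ times the path length) is an assumption for illustration.

```python
# Inject route-prediction noise: drop in-path RSUs with probability e (false
# negatives) and add out-of-path RSUs in proportion to the path length (false
# positives).
import random

def perturb_rsu_set(in_path, out_of_path, e):
    kept = [r for r in in_path if random.random() > e]
    n_extra = round(e * len(in_path))
    extra = random.sample(out_of_path, min(n_extra, len(out_of_path)))
    return kept + extra

print(perturb_rsu_set(list(range(5)), list(range(5, 8)), e=0.2))
```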
Fig. 15.
Average task failure rate under varying topology prediction error rates.
Sixth experiment: scalability analysis of the proposed method (RQ6)
The proposed hierarchical bi-level method demonstrates substantially higher scalability than a flat approach. By decomposing the task-offloading decision into two hierarchical levels, the proposed method effectively mitigates the complexity faced by a single monolithic agent. Consider a system with $M$ RSUs, each hosting $n$ servers. In a flat design, the total number of servers is $N = M \cdot n$, and the corresponding agent must handle $O(N^2)$ actions, resulting in quadratic growth with respect to the total number of servers. In contrast, the proposed hierarchical bi-level architecture significantly reduces the action space. The level-1 agent operates at the RSU level and has $M$ actions, exhibiting linear growth with the number of RSUs, while each level-2 agent, responsible for server selection within an RSU, has $O(n^2)$ actions. This hierarchical decomposition dramatically lowers the overall complexity of the agent models compared to the flat design.
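A quick calculation makes the gap concrete, reusing the TEP count from the Action section (itself a reconstruction) and the default sizes from Table 4.

```python
# Action-space sizes: the flat agent picks a TEP over all N = M*n servers,
# while the hierarchical design splits the decision into M RSU choices plus a
# per-RSU TEP over n servers.
def tep_count(k):
    """Valid TEPs over k nodes: k (RT) + k*(k-1) (RB) + k*(k-1) (FR)."""
    return k + 2 * k * (k - 1)

M, n = 8, 6                     # Table 4 defaults: 8 RSUs, ~6 servers each
print(tep_count(M * n))         # flat agent: 4560 actions
print(M, tep_count(n))          # hierarchical: 8 (level-1) and 66 per level-2 agent
```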
To evaluate scalability, we conducted experiments in which the number of RSUs was progressively increased. As expected, model performance degrades as the number of RSUs grows; however, the proposed hierarchical method consistently and significantly outperforms the flat method (see Fig. 16). This result is particularly relevant for large-scale road networks, where the number of RSUs may be substantial. Although the action space of the level-1 agent grows linearly with the number of RSUs, very large deployments may still introduce prohibitive complexity that adversely affects performance. To address this limitation, the hierarchical structure could be further extended by introducing additional levels, thereby further partitioning the action space and improving scalability. It should be noted that in the RSU scalability experiments, the number of vehicles was also increased proportionally with the number of RSUs to preserve the task density over the enlarged geographic area and to ensure sufficient training samples for all RSUs. Specifically, as the number of RSUs increased from 8 to 12, 16, and 20, the number of vehicles was increased from 3 to 4, 6, and 7, respectively. Each vehicle offloads 600 tasks with a fixed arrival rate of 0.8–1 tasks/s. This density-preserving scale-out reflects realistic VECC deployments, where infrastructure expansion is typically accompanied by proportional traffic growth. Increasing the number of vehicles or the task arrival rate mainly affects queueing behavior and delay-related metrics, while the decision action space of the proposed bi-level framework remains governed by the number of RSUs and the number of servers per RSU.
Fig. 16.
Scalability analysis of the proposed method in terms of (a) average failure percentage and (b) average reward per task.
In addition to RSU-level scalability, the impact of increasing the number of servers managed by each RSU is also evaluated.
Regarding scalability with respect to the number of servers per RSU, real-world deployments typically impose an upper bound, as each RSU hosts only a subset of the total edge servers. This practical constraint limits the size of the level-2 action space and mitigates the impact of its quadratic growth, ensuring the feasibility of the proposed approach in realistic scenarios. This behavior is illustrated in Fig. 17.
Fig. 17.
Impact of the number of servers per RSU on (a) average task failure rate and (b) average reward per task.
Training stability under dynamic RSU failure conditions
To further investigate the training stability of the proposed bi-level DQN framework under rapidly changing RSU reliability conditions, an additional experiment was conducted. The training process was performed over 500 episodes. During the first 250 episodes, the system operated under a low RSU failure-rate setting. At episode 250, the failure rates of the servers associated with the RSUs were abruptly increased to represent a high-failure condition, while all other system parameters, including traffic intensity, vehicle mobility, and network topology, remained unchanged. This setup introduces controlled non-stationarity and emulates sudden degradations in RSU reliability.
The transition corresponds to an approximate threefold increase in the average server failure rate between the two conditions.
As shown in Fig. 18, a noticeable performance degradation is observed immediately after the transition point. However, the learning process does not exhibit persistent oscillations or divergence. After a transient adjustment phase, the average reward stabilizes around a new operating region under the high-failure condition. These results indicate that the proposed hierarchical DQN framework maintains stable system-level convergence behavior and adapts to abrupt changes in RSU failure characteristics without sustained instability.
Fig. 18.
Average reward evolution under a sudden increase in RSU failure rates.
Discussion
The experimental results confirm the effectiveness of the proposed bi-level DRL approach in minimizing total task execution latency and failure rates, even under stringent deadline constraints. The level-1 DQN agent, responsible for first-level scheduling by selecting the optimal RSU for task execution, significantly outperforms the baseline methods. This improvement stems from the model’s ability to consider both RSU workloads and network latencies when making assignment decisions. As the level-1 agent is implemented as a DQN, it continuously learns to enhance its RSU selection policy over time. Consequently, it achieves progressive reductions in latency and failures across episodes, unlike the Greedy baseline, which relies on static, locally optimal decisions that are often suboptimal from a system-wide perspective. The superiority of the proposed method is particularly evident under heavy traffic conditions and when task distributions among RSUs are imbalanced. Both traffic intensity and load distribution have a significant impact on the performance advantages of the proposed method, particularly when compared with the No-Forwarding and Greedy baselines. Under light traffic conditions, edge resources are underutilized, leading all methods to exhibit comparable performance. Similarly, when the load distribution among RSUs is balanced, the Greedy and No-Forwarding methods perform comparably to the proposed approach, as the environment is insufficiently challenging to reveal the benefits of an intelligent RSU selection mechanism. When deadlines become more stringent, the importance of the level-1 DQN agent increases, as it can identify less utilized RSUs capable of meeting latency requirements while considering vehicle travel paths through the RSU network (addressing RQ1 and RQ3).
The comparison between bi-level DQN and bi-level PPO shows that, although both methods achieve comparable performance and outperform the other baseline approaches, bi-level DQN attains higher average rewards and a lower average number of failures under heavy and imbalanced traffic loads. This performance advantage can be attributed to DQN’s ability to efficiently reuse past experiences through a replay buffer, resulting in improved sample efficiency. In contrast, PPO, while also a powerful learning algorithm, trades some sample efficiency for improved training stability due to its on-policy nature. The comparison between the bi-level DQN and the flat method demonstrates that the proposed hierarchical approach significantly outperforms the flat method in terms of average task failure rate and normalized average reward (addressing RQ3). This improvement arises from decomposing the task-offloading decision into two hierarchical levels, which effectively reduces the action space faced by each agent. As a result, the proposed method mitigates model complexity relative to a single monolithic agent, particularly in large-scale problem settings. Despite the hierarchical decision structure, the inference overhead remains minimal, as each decision step involves only lightweight forward passes through two neural networks. Accordingly, the resulting inference latency is negligible compared to communication and task execution delays, and does not offset the performance gains of the proposed framework even for ultra-low latency applications. Regarding the performance of level-2 DQN agents, it was observed that the number of tasks offloaded to an RSU directly influences the learning performance of its level-2 DQN agent. Frequent interactions with the environment provide richer feedback, thereby improving its policy learning (addressing RQ2).
Although the level-2 DQN action space grows quadratically with the number of servers hosted by an RSU, in real-world deployments each RSU typically hosts a subset of the total available edge servers. This practical constraint naturally limits the size of the level-2 action space and mitigates the impact of its quadratic growth on model complexity. Moreover, the proposed hierarchical bi-level method exhibits substantially higher scalability than a flat approach. By decomposing the task-offloading decision into two hierarchical levels, the proposed method effectively reduces the action spaces of agents compared to a single monolithic agent (addressing RQ6).
The assumption that the vehicle’s trajectory is known in advance is justified as follows: before starting a trip, the driver specifies the destination, and the vehicle’s navigation system computes the optimal route. During the task offloading process, the vehicle then uploads the list of RSUs located along this route. Under normal conditions, this RSU list remains unchanged unless there is a modification in the destination or a significant change in traffic conditions, in which case the route and the corresponding RSU list are updated. This travel model is particularly reasonable in the context of future autonomous and smart vehicles. Without this assumption, an imprecise RSU list would require the proposed method to multicast computation results to a broader set of RSUs to ensure successful delivery to the vehicle, thereby maintaining the quality of service (QoS) at the cost of increased network traffic.
From a practical deployment perspective, several assumptions adopted in the simulation environment may influence the real-world behavior of the proposed framework. Although the packet-drop experiments in Sect. 7.3.4 show that moderate communication instability has limited impact on the learning performance of the replicated level-1 DQN agents, more severe network instability or intermittent connectivity may reduce the effectiveness of coordinated decision-making across RSUs. In addition, the results in Sect. 7.3.2 indicate that RSUs receiving fewer tasks naturally generate fewer training samples, which may lead to higher reward variance and slower local policy refinement. Another practical limitation is that the state representation assumes reasonably accurate knowledge of RSU load, server capacity, and network conditions. In real vehicular edge–cloud deployments, such information may be partially observable or delayed, introducing additional uncertainty into the decision process and potentially affecting policy quality.
Under this deployment model, each RSU hosts a replica of the level-1 DQN agent at the edge layer. Sharing replay buffers among these replicas can substantially improve learning efficiency by exposing each model to a larger and more diverse set of experiences. Though network propagation delays/failures may hinder this process, our experiments show that this factor does not significantly affect the level-1 agent performance particularly in low and moderate network drop rates due to the off-policy nature of the DQN agent.
In addition, a supplementary stability experiment under dynamic RSU failure conditions demonstrates that the proposed bi-level DQN framework maintains bounded and stable system-level convergence behavior after an abrupt increase in RSU failure rates during training.
Conclusions and future research directions
In this paper, a fault-tolerant VECC architecture with mobility awareness for vehicular task offloading was proposed. The proposed bi-level DQN architecture comprises a level-1 DQN agent for high-level scheduling and level-2 DQN agents operating at each RSU. The level-1 DQN agent determines the optimal RSU for task execution by considering real-time workload and network latency, while the level-2 DQN agents jointly decide the node assignment and select the most appropriate recovery pattern among First Result, Recovery Block, and Retry. This hierarchical learning approach enables the system to achieve reduced task execution latency and enhanced reliability under dynamic vehicular network conditions. Experimental evaluations conducted in SimPy and SUMO simulators confirmed that the proposed method outperforms baseline approaches in minimizing task latency and failure rates, particularly under heavy traffic conditions and stringent deadline constraints. The results demonstrate the capability of the bi-level learning framework to adapt to varying network states and computational loads, thereby improving overall system efficiency and robustness. Nevertheless, certain limitations were identified. Model performance is negatively affected by imprecise vehicle-trajectory prediction and by a large number of servers per RSU. In addition to these system-level and modeling-related limitations, real-world deployment in safety-critical ITS environments raises broader concerns regarding the reliability and responsible use of AI-based decision-making. Distributional shifts, rare events, and potential over-reliance on learned policies may affect system robustness in practical settings49,50.
From a practical standpoint, the proposed model offers a promising solution for next-generation ITS and vehicular edge networks, where delay-sensitive and safety-critical applications, such as cooperative perception, autonomous driving, and real-time traffic coordination, require reliable and adaptive task execution. The ability to dynamically balance workload distribution and fault-tolerance strategies provides a foundation for resilient and efficient vehicular edge computing deployments. Future research will focus on extending the framework toward scalable and delay-tolerant learning architectures through asynchronous federated DQN mechanisms.
Author contributions
**Vahide Babaiyan:** Conceptualization, Software Development, Implementation, Simulation, Data Analysis, Visualization, Writing – Original Draft. **Omid Bushehrian:** Conceptualization, Methodology, Supervision, Validation, Writing – Review and Editing. **Reza Javidan:** Advisory Guidance, General Review, and Manuscript Feedback.
Data availability
The code and datasets supporting the findings of this study are publicly available at https://github.com/vahide-b-84/vecc-fault-tolerant-drl-offloading.
Declarations
Competing interests
The authors declare no competing interests.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Ling, C. et al. QoS and fairness oriented dynamic computation offloading in the Internet of Vehicles based on estimate time of arrival. IEEE Trans. Veh. Technol. 73(7), 10554–10571 (2024).
- 2. Souaifi, M. et al. Artificial intelligence in sports biomechanics: A scoping review on wearable technology, motion analysis, and injury prevention. Bioengineering 12(8), 887 (2025). https://www.mdpi.com/2306-5354/12/8/887
- 3. da Costa, J. B. et al. Mobility and deadline-aware task scheduling mechanism for vehicular edge computing. IEEE Trans. Intell. Transp. Syst. 24(10), 11345–11359 (2023).
- 4. Materwala, H., Ismail, L. & Hassanein, H. S. QoS-SLA-aware adaptive genetic algorithm for multi-request offloading in integrated edge-cloud computing in Internet of Vehicles. Veh. Commun. 43, 100654 (2023).
- 5. Zeng, J., Gou, F. & Wu, J. Task offloading scheme combining deep reinforcement learning and convolutional neural networks for vehicle trajectory prediction in smart cities. Comput. Commun. 208, 29–43 (2023).
- 6. Farimani, M. K., Karimian-Aliabadi, S., Entezari-Maleki, R., Egger, B. & Sousa, L. Deadline-aware task offloading in vehicular networks using deep reinforcement learning. Expert Syst. Appl. 249, 123622 (2024).
- 7. Men, R., Fan, X., Yau, K.-L., Shan, A. & Xiao, Y. Mobility-aware parallel offloading and resource allocation scheme for vehicular edge computing. Ad Hoc Netw. 164, 103639 (2024).
- 8. Shen, S. et al. Mean-field reinforcement learning for decentralized task offloading in vehicular edge computing. J. Syst. Archit. 146, 103048 (2024).
- 9. Sun, Y., Song, J., Zhou, S., Guo, X. & Niu, Z. Task replication for vehicular edge computing: A combinatorial multi-armed bandit based approach. In IEEE Global Communications Conference (GLOBECOM) 1–7 (IEEE, 2018).
- 10. Zhou, S., Sun, Y., Jiang, Z. & Niu, Z. Exploiting moving intelligence: Delay-optimized computation offloading in vehicular fog networks. IEEE Commun. Mag. 57(5), 49–55 (2019).
- 11. Babaiyan, V. & Bushehrian, O. A deep-reinforcement-learning-based strategy selection approach for fault-tolerant offloading of delay-sensitive tasks in vehicular edge-cloud computing. J. Supercomput. 81(5), 1–37 (2025).
- 12. Singh, L. K., Garg, H., Khanna, M. & Bhadoria, R. S. An analytical study on machine learning techniques. In Multidisciplinary Functions of Blockchain Technology in AI and IoT Applications 137–157 (IGI Global Scientific Publishing, 2021).
- 13. Singh, L. K. & Khanna, M. Introduction to artificial intelligence and current trends. In Innovations in Artificial Intelligence and Human-Computer Interaction in the Digital Era 31–66 (Elsevier, 2023).
- 14. Ning, Z., Dong, P., Wang, X., Rodrigues, J. J. & Xia, F. Deep reinforcement learning for vehicular edge computing: An intelligent offloading system. ACM Trans. Intell. Syst. Technol. 10(6), 1–24 (2019).
- 15. Geng, L. et al. Deep-reinforcement-learning-based distributed computation offloading in vehicular edge computing networks. IEEE Internet Things J. 10(14), 12416–12433 (2023).
- 16. Hou, Y., Wei, Z., Zhang, R., Cheng, X. & Yang, L. Hierarchical task offloading for vehicular fog computing based on multi-agent deep reinforcement learning. IEEE Trans. Wireless Commun. 23(4), 3074–3085 (2023).
- 17. Wang, X. et al. Deep reinforcement learning: A survey. IEEE Trans. Neural Netw. Learn. Syst. 35(4), 5064–5078 (2022).
- 18. Arulkumaran, K., Deisenroth, M. P., Brundage, M. & Bharath, A. A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 34(6), 26–38 (2017).
- 19. Hernandez-Leal, P., Kartal, B. & Taylor, M. E. A survey and critique of multiagent deep reinforcement learning. Auton. Agents Multi-Agent Syst. 33(6), 750–797 (2019).
- 20. Zhang, K., Yang, Z. & Başar, T. Multi-agent reinforcement learning: A selective overview of theories and algorithms. In Handbook of Reinforcement Learning and Control 321–384 (2021).
- 21. Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019).
- 22. Lillicrap, T. P. et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
- 23. Oza, P., Hudson, N., Chantem, T. & Khamfroush, H. Deadline-aware task offloading for vehicular edge computing networks using traffic light data. ACM Trans. Embed. Comput. Syst. 23(1), 1–25 (2024).
- 24. Syed, S. A. et al. QoS aware and fault tolerance based software-defined vehicular networks using cloud-fog computing. Sensors 22(1), 401 (2022).
- 25. Umer, A., Ali, M., Jehangiri, A. I., Bilal, M. & Shuja, J. Multi-objective task-aware offloading and scheduling framework for Internet of Things logistics. Sensors 24(8), 2381 (2024).
- 26. Umer, A. et al. Fault tolerant & priority basis task offloading and scheduling model for IoT logistics. Alex. Eng. J. 110, 400–419 (2025).
- 27. Wang, N., Li, Y., Li, Y. & Nie, H. Fault-tolerant and mobility-aware loading via Markov chain in mobile cloud computing. Sci. Rep. 15(1), 18844 (2025).
- 28. Chen, X., Xiao, B., Lin, X., Chen, Z. & Min, G. Multi-agent collaboration for vehicular task offloading using federated deep reinforcement learning. IEEE Trans. Mob. Comput. 24, 8856–8871 (2025). https://doi.org/10.1109/tmc.2025.3557898
- 29. Lopez, P. A. et al. Microscopic traffic simulation using SUMO. In 21st International Conference on Intelligent Transportation Systems (ITSC) 2575–2582 (IEEE, 2018).
- 30. SimPy Team. SimPy: Discrete event simulation for Python (2017). https://simpy.readthedocs.io/en/latest
- 31. van Steen, M. & Tanenbaum, A. S. Distributed Systems 4th edn (Amazon Digital Services LLC - KDP, 2025).
- 32. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction 2nd edn (MIT Press, 2018).
- 33. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015).
- 34. Chen, L., Du, J. & Zhu, X. Mobility-aware task offloading and resource allocation in UAV-assisted vehicular edge computing networks. Drones (2024). https://doi.org/10.3390/drones8110696
- 35. Jahandar, S. et al. Mobility-aware offloading decision for multi-access edge computing in 5G networks. Sensors 22(7), 2692 (2022).
- 36. Zhiwei, Q., Juan, L., Wei, L. & Xiao, Y. Mobility-aware and energy-efficient task offloading strategy for mobile edge workflows. Wuhan Univ. J. Nat. Sci. 27(6), 476–488 (2022).
- 37. Jiang, Z., Zhou, S., Guo, X. & Niu, Z. Task replication for deadline-constrained vehicular cloud computing: Optimal policy, performance analysis, and implications on road traffic. IEEE Internet Things J. 5(1), 93–107 (2017).
- 38. Wu, Q. et al. Delay-sensitive task offloading in the 802.11p-based vehicular fog computing systems. IEEE Internet Things J. 7(1), 773–785 (2019).
- 39. Sun, C. et al. Hierarchical deep reinforcement learning for joint service caching and computation offloading in mobile edge-cloud computing. IEEE Trans. Serv. Comput. 17(4), 1548–1564 (2024).
- 40. Shinde, S. S. & Tarchi, D. Hierarchical reinforcement learning for multi-layer multi-service non-terrestrial vehicular edge computing. IEEE Trans. Mach. Learn. Commun. Netw. (2024). https://doi.org/10.1109/TMLCN.2024.3433620
- 41. Zhou, H. et al. Hierarchical multi-agent deep reinforcement learning for energy-efficient hybrid computation offloading. IEEE Trans. Veh. Technol. 72(1), 986–1001 (2022).
- 42. Cao, D. et al. A relay-assisted parallel offloading strategy for multi-source tasks in Internet of Vehicles. Sustain. Energy Technol. Assess. 62, 103619 (2024).
- 43. Wan, N., Luo, Y., Zeng, G. & Zhou, X. Minimization of VANET execution time based on joint task offloading and resource allocation. Peer-to-Peer Netw. Appl. 16(1), 71–86 (2023).
- 44. Liu, L. et al. Asynchronous deep reinforcement learning for collaborative task computing and on-demand resource allocation in vehicular edge computing. IEEE Trans. Intell. Transp. Syst. 24(12), 15513–15526 (2023).
- 45. Yang, J., Chen, Y., Lin, Z., Tian, D. & Chen, P. Distributed computation offloading in autonomous driving vehicular networks: A stochastic geometry approach. IEEE Trans. Intell. Veh. 9(1), 2701–2713 (2023).
- 46. Cong, Y. et al. Latency-energy joint optimization for task offloading and resource allocation in MEC-assisted vehicular networks. IEEE Trans. Veh. Technol. 72(12), 16369–16381 (2023).
- 47. Wegener, A. et al. TraCI: An interface for coupling road traffic and network simulators. In Proceedings of the 11th Communications and Networking Simulation Symposium 155–163 (2008).
- 48. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
- 49. Ben Saad, H. et al. The assisted technology dilemma: A reflection on AI chatbots use and risks while reshaping the peer review process in scientific research. AI & Society (2025). https://doi.org/10.1007/s00146-025-02299-6
- 50. Amodei, D. et al. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016).
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
The code and datasets supporting the findings of this study are publicly available at https://github.com/vahide-b-84/vecc-fault-tolerant-drl-offloading.