Scientific Reports
2026 Jan 26;16:6226. doi: 10.1038/s41598-026-37191-w

End-to-end emergency response protocol for tunnel accidents augmentation with reinforcement learning

Hafiz Muhammad Raza ur Rehman 1, M Junaid Gul 1, Rabbiya Younas 1, Muhammad Zeeshan Jhandir 2, Roberto Marcelo Alvarez 3,4,5, Yini Miro 3,6,7, Imran Ashraf 1,
PMCID: PMC12905362  PMID: 41588147

Abstract

Autonomous unmanned aerial vehicles (UAVs) offer cost-effective and flexible solutions for a wide range of real-world applications, particularly in hazardous and time-critical environments. Their ability to navigate autonomously, communicate rapidly, and avoid collisions makes UAVs well suited for emergency response scenarios. However, real-time path planning in dynamic and unpredictable environments remains a major challenge, especially in confined tunnel infrastructures where accidents may trigger fires, smoke propagation, debris, and rapid environmental changes. In such conditions, conventional preplanned or model-based navigation approaches often fail due to limited visibility, narrow passages, and the absence of reliable localization signals. To address these challenges, this work proposes an end-to-end emergency response framework for tunnel accidents based on Multi-Agent Reinforcement Learning (MARL). Each UAV operates as an independent learning agent using an Independent Q-Learning paradigm, enabling real-time decision-making under limited computational resources. To mitigate premature convergence and local optima during exploration, Grey Wolf Optimization (GWO) is integrated as a policy-guidance mechanism within the reinforcement learning (RL) framework. A customized reward function is designed to prioritize victim discovery, penalize unsafe behavior, and explicitly discourage redundant exploration among agents. The proposed approach is evaluated using a frontier-based exploration simulator under both single-agent and multi-agent settings with multiple goals. Extensive simulation results demonstrate that the proposed framework achieves faster goal discovery, improved map coverage, and reduced rescue time compared to state-of-the-art GWO-based exploration and random search algorithms. These results highlight the effectiveness of lightweight MARL-based coordination for autonomous UAV-assisted tunnel emergency response.

Keywords: Robotic systems, Drones, Multi-agents system, Path finding, Reinforcement learning, Tunnel hazards, Unmanned aerial vehicles

Subject terms: Engineering, Mathematics and computing

Introduction

Over the past decades, a wide range of natural and man-made disasters, including earthquakes, floods, explosions, and large-scale fires, have caused severe loss of human life and critical infrastructure. Such events frequently lead to collapsed buildings and damaged tunnel systems, trapping victims beneath debris and creating extremely hazardous conditions for emergency response teams. Rapid and effective search-and-rescue (SAR) operations are therefore essential; however, conventional response methods are often constrained by structural instability, toxic environments, and limited accessibility. In this context, UAVs have emerged as a promising technological solution for enhancing SAR operations and improving disaster response efficiency.

Tunnels and underground facilities have been widely developed to support modern transportation and urban infrastructure. Despite their economic and logistical importance, tunnels are fully enclosed environments, which significantly increase risk during emergency situations. In the event of a tunnel fire, trapped individuals often have limited escape routes, resulting in a high likelihood of casualties. For example, a tunnel fire in the Shanxi Yanhou Tunnel in 2014 resulted in 40 fatalities and 12 injuries1. Such incidents highlight the critical importance of tunnel fire safety management and efficient emergency response mechanisms.

Tunnel environments pose multiple safety hazards, including fire, smoke propagation, structural collapse, electrical failures, and the presence of hazardous materials. Among these, smoke inhalation remains one of the leading causes of fatalities in tunnel accidents. Fires in confined spaces can rapidly generate dense smoke, severely reducing visibility and causing asphyxiation. Furthermore, toxic gases released during combustion can induce disorientation and respiratory distress, significantly worsening survival prospects for trapped individuals.

The enclosed nature of tunnels also complicates firefighting and rescue efforts. Limited ventilation accelerates heat accumulation and smoke spread, while narrow passages and structural damage restrict responder mobility. Emergency personnel must often operate under extreme uncertainty, with incomplete situational awareness and rapidly evolving conditions.

Several real-world incidents further illustrate these challenges. In July 2018, a fire occurred in the Tianjin Binhai Tunnel in China due to the ignition of flammable goods transported by a truck, resulting in injuries to tunnel users and firefighters2. Similarly, the Sasago Tunnel ceiling collapse in Japan on December 2, 2012, resulted in multiple vehicles being crushed and at least nine fatalities, highlighting the catastrophic consequences of tunnel structural failures and the challenges faced by emergency responders in such confined spaces3. These cases emphasize the need for robust safety protocols, continuous infrastructure monitoring, and intelligent emergency response systems capable of operating in hazardous tunnel environments.

Effective tunnel emergency response requires responders to navigate complex, maze-like infrastructures under conditions of low visibility and dynamic obstruction. Traditional path-planning methods often struggle in such scenarios due to smoke, debris, and partial structural collapse. Additionally, satellite-based navigation systems such as GPS are unreliable or unavailable in underground environments, further complicating localization and routing tasks.

Most existing navigation and rescue approaches rely heavily on accurate environmental models or prior knowledge of tunnel layouts4,5. In real-world emergencies, however, such information is frequently incomplete, outdated, or entirely unavailable. Reinforcement learning (RL) offers a viable alternative by enabling autonomous agents to adapt their behavior through interaction with the environment rather than relying on predefined models6. RL has been extensively applied to UAV control and robotic navigation tasks7,8, including trajectory tracking, path following, and disturbance mitigation. For example, RL-based frameworks have been proposed for UAV motion planning with suspended loads9, stable trajectory generation10, adaptive PID control11, and disturbance compensation in complex airflow conditions12. Cooperative UAV path-planning approaches, such as Dubins-based methods13, have also been explored, although they often struggle with rapid local environmental changes.

Despite these advances, the application of RL to autonomous UAV-based disaster response, particularly for mission planning, victim search, and cooperative exploration in confined tunnel environments, remains relatively underexplored14,15. Moreover, many existing approaches rely on computationally intensive models, explicit inter-agent communication, or centralized learning architectures, which may limit their applicability in time-critical rescue operations.

Motivated by these challenges, this work focuses on developing a lightweight, adaptive, and cooperative multi-agent learning framework tailored for tunnel emergency response. The proposed approach emphasizes real-time feasibility, efficient exploration, and safety-aware decision-making under partial observability. In the proposed scheme, GWO serves as a policy-guidance mechanism within RL, unlike other GWO-based RL algorithms that employ GWO to manage the exploration-exploitation ratio16.

The main contributions of this work are summarized as follows:

  • Development of a MARL framework for autonomous tunnel emergency response and victim search.

  • Adoption of an Independent Q-Learning (IQL) paradigm to enable real-time decision-making under limited computational resources.

  • Integration of frontier-based exploration with graph-based path planning to efficiently navigate partially known environments.

  • Design of a reward mechanism that discourages redundant exploration, penalizes unsafe behavior, and prioritizes victim discovery.

  • Formulation of observable and hidden state representations to address partial observability in cooperative multi-agent settings.

  • Comprehensive simulation-based evaluation demonstrating improved exploration efficiency, rescue time, and collision avoidance compared to baseline methods.

The remainder of this paper is organized as follows. “Literature review” reviews related work. “Preliminaries” introduces the theoretical foundations and background concepts. “Material and methods” presents the proposed methodology. “Performance evaluation” discusses the experimental setup and performance evaluation. Finally, “Conclusion” concludes the paper and outlines directions for future research.

Literature review

Deep reinforcement learning (DRL) has drawn significant attention among researchers in Unmanned Aerial Vehicle (UAV) systems, as it addresses the growing need for autonomous aerial vehicles capable of executing complex tasks in dynamic and uncertain environments. Recent literature explores how DRL enhances UAV guidance, navigation, and control (GNC), particularly in unpredictable or GPS-denied scenarios.

For instance, the work in17 presents an asynchronous deep deterministic policy gradient (ADDPG) method for mapless navigation with mobile robots in challenging environments, demonstrating the applicability of RL to autonomous navigation tasks. Another study18 proposes a technique that integrates external memory, enabling neural network models to perform mapping, localization, and navigation decision-making within a unified framework. This configuration allows simultaneous position estimation and map construction alongside continuous control.

For continuous control in autonomous navigation, the approach in19 utilizes sparse LiDAR inputs and relative target locations within a DRL framework, resulting in improved path-planning efficiency and robustness. Moreover, the work in20 introduces an integrated communication and control architecture based on DDPG for UAV swarm formation management, enabling enhanced control precision and collision avoidance. Despite these advancements, the authors of21 report that DRL models face challenges related to generalization, safety, training stability, and computational overhead, which hinder their deployment in real-world, safety-critical environments.

While general-purpose UAVs demonstrate strong DRL-enabled navigation and control capabilities, deploying them in mission-critical operations such as search and rescue (SAR) introduces additional challenges. SAR missions often involve cluttered, GPS-denied environments and strict time constraints, requiring UAVs to exhibit high levels of autonomy, reliability, and adaptability. Consequently, recent research has focused on UAV systems tailored specifically for SAR applications.

In this context, the authors of22 propose a UAV-based SAR framework that leverages received signal strength (RSS) measurements and a Q-learning-based strategy to detect indoor victims. Their results show that directional antennas improve convergence speed and localization accuracy compared to omnidirectional antennas. Similarly, Donnelly et al.23 model UAV-based SAR using partially observable Markov decision processes (POMDPs) and deep Q-networks (DQNs), demonstrating improved performance over heuristic methods in complex environments. However, such approaches typically rely on deep learning architectures and centralized training, which may limit real-time applicability.

Maritime UAV-based SAR missions pose further challenges due to large operational areas and rapidly changing conditions. To address this, Wu et al.24 propose a hybrid genetic algorithm and RL (GA-RL) approach for path planning, embedding Q-learning into evolutionary optimization. Their method achieves improved convergence and solution quality compared to standard optimization techniques. For wilderness SAR, Bhattacharya et al.25 develop a modular DRL framework for 3D UAV navigation and person detection using curriculum learning, achieving high accuracy in both semi-autonomous and guided navigation tasks.

In multi-agent UAV systems, Wang et al.26 present a Q-learning-based 3D deployment framework that enables multiple UAVs to dynamically reposition for optimal coverage, outperforming traditional clustering approaches. Nevertheless, many existing multi-agent studies primarily focus on coverage or communication efficiency rather than rescue prioritization or redundant exploration avoidance.

Recent studies have also explored heterogeneous and cooperative multi-agent systems for dynamic monitoring and patrolling tasks. For example, the UAV–UGV cooperative framework presented in27 investigates coordinated patrolling and energy management in urban environments, demonstrating how task allocation and inter-agent cooperation can improve system endurance and coverage. While effective, such approaches typically rely on explicit coordination strategies and stable communication, which may be difficult to guarantee in confined tunnel environments affected by smoke, fire, or structural damage.

UAV path planning under obstacle-rich environments has also been explored using bio-inspired optimization techniques. Ant Colony Optimization (ACO), inspired by collective ant foraging behavior, has been widely applied to robotic and UAV path planning28–31. These methods have been shown to generate collision-free trajectories and optimize routes under various constraints. However, their performance is often sensitive to parameter tuning and may degrade in highly dynamic or partially observable environments.

Classical robotic exploration techniques provide important foundations for autonomous navigation. Frontier-based exploration, introduced by Yamauchi32, enables robots to expand their knowledge of unknown environments by targeting boundaries between explored and unexplored regions. Probabilistic mapping and SLAM-based approaches33 further improve navigation by maintaining belief distributions over the environment. These methods, however, are primarily designed for single-agent settings and lack adaptive coordination mechanisms for cooperative rescue scenarios.

RL-based exploration has traditionally relied on Q-learning and policy gradient methods. Q-learning, introduced by Watkins and Dayan34, provides a model-free mechanism for learning optimal actions in discrete state-action spaces. Policy gradient methods35 extend learning to continuous action spaces but generally require higher computational resources. In multi-agent contexts, cooperative exploration strategies have been investigated by Cao et al.36, demonstrating improved coverage efficiency, though without explicit consideration of real-time rescue constraints or victim prioritization.

In time-critical disaster scenarios, balancing exploration and exploitation becomes particularly important. To address this challenge, recent work has proposed infrastructure-assisted learning frameworks that leverage edge intelligence and communication-aware optimization. For instance, the framework in37 integrates UAV-mounted reconfigurable intelligent surfaces (RIS) and high-altitude platforms (HAPs) to optimize disaster response under strict latency constraints. While such approaches improve exploration efficiency, they require sophisticated communication infrastructure and centralized coordination, limiting their applicability in underground or tunnel-based rescue operations. Scalability in large-scale systems has also been addressed using deep reinforcement learning in communication-centric domains. For example, the scheme in38 employs a DRL-based relaying election mechanism to improve energy efficiency in large IoT networks. Although DRL-based solutions offer scalability and performance benefits, they typically require extensive training data, powerful computational resources, and centralized training paradigms, which may not be feasible for real-time emergency response in tunnel environments.

Overall, the literature highlights significant progress in UAV navigation, RL, and multi-agent coordination. However, most existing approaches rely on deep or centralized learning architectures, explicit communication strategies, or computationally intensive optimization methods. Limited attention has been given to lightweight MARL frameworks that operate under real-time constraints, minimize redundant exploration, and ensure safety in confined tunnel environments. These limitations motivate the proposed IQL-based multi-agent framework, which emphasizes computational efficiency, implicit coordination through reward design, and practical applicability for tunnel emergency response.

Preliminaries

This section introduces the fundamental concepts and mathematical tools required to understand the proposed multi-agent rescue framework. Specifically, we review graph-based shortest-path planning, artificial potential fields for collision avoidance, and the RL foundations underpinning the IQL paradigm adopted in this work.

Graph-based shortest path planning

Graph-based path planning is widely used in robotic navigation to compute collision-free and efficient routes in structured environments. In the context of tunnel rescue, the environment is represented as a weighted graph, where nodes correspond to discrete spatial locations and edges denote traversable connections between them39.

Let $G = (V, E)$ be a weighted graph, where V is the set of vertices and E is the set of edges. Each edge $(u, v) \in E$ is associated with a non-negative weight $w(u, v)$ representing traversal cost.

Given a source node $s \in V$, Dijkstra’s algorithm computes the shortest path distance from s to all other nodes in V by iteratively expanding the closest unvisited node and relaxing adjacent edges. The algorithm is formally defined as:

$d(v) \leftarrow \min\left( d(v),\; d(u) + w(u, v) \right)$  (1)

where d(v) denotes the minimum cumulative cost from s to node v.

In this work, shortest-path computation is used to guide agents toward frontier cells during exploration, enabling efficient navigation through partially explored tunnel environments.
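As an illustration, the relaxation rule in Eq. (1) can be sketched in Python with a binary heap; the adjacency-list encoding and node labels below are illustrative, not the paper's implementation.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from `source` via iterative edge relaxation.

    `graph` maps each node to a list of (neighbor, weight) pairs with
    non-negative weights, matching the weighted graph G = (V, E).
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d_u, u = heapq.heappop(heap)
        if d_u > dist.get(u, float("inf")):
            continue  # stale heap entry, node already settled with a shorter path
        for v, w in graph.get(u, []):
            cand = d_u + w  # relax edge (u, v): d(v) = min(d(v), d(u) + w(u, v))
            if cand < dist.get(v, float("inf")):
                dist[v] = cand
                heapq.heappush(heap, (cand, v))
    return dist

# Toy tunnel graph: nodes are junctions, weights are traversal costs.
g = {"A": [("B", 1), ("C", 4)], "B": [("C", 2), ("D", 5)], "C": [("D", 1)], "D": []}
print(dijkstra(g, "A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```

In the framework, the targets of such shortest-path queries are the frontier cells selected during exploration.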

Artificial potential fields for collision avoidance

Artificial Potential Fields (APFs) are a classical motion planning technique used to generate collision-free trajectories by modeling the environment as a combination of attractive and repulsive forces. Goals exert attractive forces, while obstacles and other agents exert repulsive forces40.

Let $x$ denote the current position of an agent and $x_g$ the target position. The total potential field is defined as:

$U(x) = U_{\mathrm{att}}(x) + U_{\mathrm{rep}}(x)$  (2)

where the attractive potential is given by:

$U_{\mathrm{att}}(x) = \frac{1}{2}\, k_{\mathrm{att}}\, \lVert x - x_g \rVert^{2}$  (3)

and the repulsive potential generated by obstacles is defined as:

$U_{\mathrm{rep}}(x) = \sum_{i} \begin{cases} \frac{1}{2}\, k_{\mathrm{rep}} \left( \frac{1}{\lVert x - x_{o_i} \rVert} - \frac{1}{d_0} \right)^{2}, & \lVert x - x_{o_i} \rVert \le d_0 \\ 0, & \text{otherwise} \end{cases}$  (4)

Here, $x_{o_i}$ denotes the position of the i-th obstacle, $d_0$ is the influence radius, and $k_{\mathrm{att}}$, $k_{\mathrm{rep}}$ are scaling constants.

In the proposed framework, collision avoidance is not enforced explicitly through force-based control but is incorporated implicitly through reward penalties inspired by APF principles.
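A minimal sketch of the attractive and repulsive terms in Eqs. (2)-(4), assuming illustrative gains `k_att`, `k_rep` and influence radius `d0` (the paper does not specify these values):

```python
import math

def attractive(pos, goal, k_att=1.0):
    # U_att(x) = 0.5 * k_att * ||x - x_g||^2
    dx, dy = pos[0] - goal[0], pos[1] - goal[1]
    return 0.5 * k_att * (dx * dx + dy * dy)

def repulsive(pos, obstacles, k_rep=1.0, d0=2.0):
    # U_rep(x) = 0.5 * k_rep * (1/d - 1/d0)^2 inside the influence radius, else 0
    total = 0.0
    for ox, oy in obstacles:
        d = math.hypot(pos[0] - ox, pos[1] - oy)
        if 0 < d <= d0:
            total += 0.5 * k_rep * (1.0 / d - 1.0 / d0) ** 2
    return total

def potential(pos, goal, obstacles):
    # Total field U(x) = U_att(x) + U_rep(x)
    return attractive(pos, goal) + repulsive(pos, obstacles)

print(potential((0.0, 0.0), (3.0, 0.0), [(1.0, 0.0)]))  # 4.625
```

In the proposed framework this field is not followed as a force law; instead, the same attraction/repulsion intuition is encoded as reward penalties.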

Markov decision process formulation

RL problems are commonly modeled using a Markov Decision Process (MDP), defined by the tuple:

$\mathcal{M} = \langle S, A, P, R, \gamma \rangle$

where: S is the set of states, A is the set of actions, $P(s' \mid s, a)$ is the state transition probability, $R(s, a)$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor.

At each time step t, an agent observes state $s_t \in S$, executes action $a_t \in A$, receives reward $r_t$, and transitions to state $s_{t+1}$.

Value functions and Bellman equations

The state-value function under policy $\pi$ is defined as:

$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \;\middle|\; s_t = s \right]$  (5)

Similarly, the action-value function is given by:

$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \;\middle|\; s_t = s,\, a_t = a \right]$  (6)

The optimal action-value function satisfies the Bellman optimality equation:

$Q^{*}(s, a) = \mathbb{E}\!\left[ r_t + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \;\middle|\; s_t = s,\, a_t = a \right]$  (7)

Q-learning

Q-learning is a model-free RL algorithm that iteratively approximates the optimal action-value function $Q^{*}(s, a)$ without requiring prior knowledge of transition probabilities41. The update rule is defined as:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$  (8)

where $\alpha \in (0, 1]$ is the learning rate.
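As an illustration, the tabular update above can be written in a few lines of Python; the grid-cell states and action names below are placeholders, not the paper's actual state encoding.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    """
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)  # unvisited state-action pairs default to 0
actions = ["up", "down", "left", "right"]
print(q_update(Q, (0, 0), "right", 1.0, (0, 1), actions))  # 0.1 after the first update
```

With all Q-values initialized to zero, the first update moves $Q(s, a)$ by $\alpha \cdot r = 0.1$ toward the observed reward.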

IQL in multi-agent systems

In IQL, each agent maintains its own Q-table and learns independently by treating other agents as part of the environment. Although this introduces non-stationarity, IQL remains computationally lightweight and suitable for real-time applications.

In this work, IQL is adopted to ensure scalability and real-time feasibility in tunnel rescue scenarios, avoiding the high computational cost associated with DRL or centralized training paradigms.
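A minimal sketch of the IQL setup described above, in which each agent holds a private Q-table and learns only from its own local transitions; the agent count, states, and action set here are illustrative.

```python
class IQLAgent:
    """Independent Q-learner: private Q-table, other agents treated as environment."""

    def __init__(self, actions, alpha=0.1, gamma=0.9):
        self.Q = {}          # (state, action) -> value
        self.actions = list(actions)
        self.alpha, self.gamma = alpha, gamma

    def update(self, s, a, r, s_next):
        # Standard Q-learning update applied to this agent's own table only.
        best_next = max(self.Q.get((s_next, a2), 0.0) for a2 in self.actions)
        q = self.Q.get((s, a), 0.0)
        self.Q[(s, a)] = q + self.alpha * (r + self.gamma * best_next - q)

# Two agents learn independently from their own experience streams.
agents = [IQLAgent(range(9)) for _ in range(2)]
agents[0].update((0, 0), 1, 1.0, (0, 1))
agents[1].update((5, 5), 2, -0.5, (5, 4))
print(agents[0].Q, agents[1].Q)
```

Because the tables are disjoint, adding agents scales memory and computation linearly, which is the lightweight property the framework relies on.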

Grey wolf optimizer

The GWO is a population-based metaheuristic inspired by the social hierarchy and cooperative hunting behavior of grey wolves. In GWO, candidate solutions are categorized into four hierarchical groups: alpha ($\alpha$), beta ($\beta$), delta ($\delta$), and omega ($\omega$), where $\alpha$, $\beta$, and $\delta$ represent the three best solutions guiding the search process, while $\omega$ represents the remaining candidates42,43.

Let $\vec{X}(t)$ denote the position of a search agent at iteration t, and let $\vec{X}_{\alpha}$, $\vec{X}_{\beta}$, and $\vec{X}_{\delta}$ denote the positions of the three best solutions. The position update mechanism in GWO is defined as:

$\vec{D}_{\alpha} = \left| \vec{C}_1 \cdot \vec{X}_{\alpha} - \vec{X} \right|, \quad \vec{D}_{\beta} = \left| \vec{C}_2 \cdot \vec{X}_{\beta} - \vec{X} \right|, \quad \vec{D}_{\delta} = \left| \vec{C}_3 \cdot \vec{X}_{\delta} - \vec{X} \right|$  (9)
$\vec{X}_1 = \vec{X}_{\alpha} - \vec{A}_1 \cdot \vec{D}_{\alpha}, \quad \vec{X}_2 = \vec{X}_{\beta} - \vec{A}_2 \cdot \vec{D}_{\beta}, \quad \vec{X}_3 = \vec{X}_{\delta} - \vec{A}_3 \cdot \vec{D}_{\delta}$  (10)
$\vec{X}(t+1) = \dfrac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3}$  (11)

where $\vec{A}_i = 2a\,\vec{r}_1 - a$ and $\vec{C}_i = 2\,\vec{r}_2$ are coefficient vectors built from random vectors $\vec{r}_1, \vec{r}_2 \in [0, 1]^n$. The parameter a decreases linearly from 2 to 0 over iterations, allowing a smooth transition from exploration to exploitation.
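A one-dimensional sketch of this position-update mechanism, with fixed leader positions chosen purely for illustration (in practice the leaders are the three best candidates found so far):

```python
import random

def gwo_step(x, x_alpha, x_beta, x_delta, a):
    """One GWO position update for a 1-D search agent.

    For each leader x_l: D = |C * x_l - x| and x_l' = x_l - A * D,
    with A = 2*a*r1 - a and C = 2*r2 (r1, r2 uniform in [0, 1));
    the new position is the average of the three candidates.
    """
    candidates = []
    for x_l in (x_alpha, x_beta, x_delta):
        r1, r2 = random.random(), random.random()
        A = 2 * a * r1 - a
        C = 2 * r2
        D = abs(C * x_l - x)
        candidates.append(x_l - A * D)
    return sum(candidates) / 3.0

random.seed(0)
x = 10.0  # start far from the leaders
for t in range(50):
    a = 2.0 * (1 - t / 50)  # a decays linearly from 2 toward 0
    x = gwo_step(x, 1.0, 1.2, 0.8, a)
print(x)  # ends near the mean of the leader positions
```

Early iterations (large a, hence |A| possibly above 1) let the agent overshoot the leaders and explore; as a shrinks, the update contracts toward the leaders, which is the exploration-to-exploitation transition described above.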

GWO has been widely adopted for path planning and optimization tasks due to its simplicity, fast convergence, and low computational overhead. However, standalone GWO methods are prone to premature convergence in complex or dynamic environments.

In this work, GWO is not employed as an independent optimizer. Instead, its exploration behavior is integrated with RL to guide policy exploration and mitigate local optima. This hybridization preserves the lightweight nature of tabular learning while improving search diversity in complex tunnel rescue environments.

Material and methods

Figure 1 illustrates the overall scenario exploitation process of the proposed tunnel emergency response framework.

Fig. 1. Scenario exploitation pictorial explanation of the proposed multi-agent rescue system.

Frontier-based cooperative exploration

A frontier-based exploration strategy is employed to enable efficient navigation and mapping of tunnel environments. Initially, the environment is represented as a discretized occupancy grid where all cells are marked as unexplored. As agents traverse the environment, onboard sensors continuously update the grid based on newly observed information.

Frontier cells, defined as the boundary between explored and unexplored regions, are selected as exploration targets. Agents compute collision-free paths toward these frontier cells using the A* algorithm while avoiding static and dynamic obstacles. Exploration continues until all reachable frontiers are exhausted or all victims are successfully located.
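A minimal occupancy-grid sketch of the frontier-cell definition above, using 4-connectivity and illustrative cell labels (the paper's grid encoding may differ):

```python
UNKNOWN, FREE, OBSTACLE = -1, 0, 1

def frontier_cells(grid):
    """Return explored free cells bordering at least one unexplored cell."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            # A free cell is a frontier if any 4-neighbor is still unknown.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers

grid = [
    [FREE, FREE, UNKNOWN],
    [FREE, OBSTACLE, UNKNOWN],
    [FREE, FREE, FREE],
]
print(frontier_cells(grid))  # [(0, 1), (2, 2)]
```

The returned cells would then serve as candidate A* targets; exploration terminates once this list is empty or all victims have been located.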

Multi-agent rescue system overview

The proposed rescue system consists of multiple autonomous UAV/robot agents operating cooperatively in a shared tunnel environment. Each agent independently performs navigation, victim detection, obstacle avoidance, and map updating. Cooperation is achieved implicitly through shared environmental feedback rather than explicit inter-agent communication.

The performance of the system is evaluated using the following criteria:

  • number of victims successfully rescued,

  • coverage of the tunnel environment,

  • avoidance of redundant exploration,

  • collision-free navigation.

Accordingly, the global objectives are defined as:

graphic file with name d33e849.gif 12
graphic file with name d33e853.gif 13

The corresponding utility function is formulated as:

graphic file with name d33e858.gif 14

The proposed framework adopts an IQL paradigm, where each agent maintains its own Q-table and updates it independently based on local observations. This design choice is motivated by the strict computational and latency constraints of real-time tunnel rescue operations.

Unlike centralized training or DRL approaches, IQL avoids neural network inference, replay buffers, and extensive training requirements, making it suitable for deployment in resource-constrained and time-critical environments. Potential non-stationarity associated with IQL is mitigated through reward shaping and structured frontier-based task decomposition42.

The state space S consists of partially observable states $S^{\mathrm{obs}}$ and partially hidden states $S^{\mathrm{hid}}$. Observable states represent the agent’s position:

$s_p^{\mathrm{obs}}(t) = \left( x_p(t),\, y_p(t) \right)$  (15)

while hidden states correspond to victim locations:

$s^{\mathrm{hid}} = \left\{ (x_v, y_v) \mid v \in \mathcal{V} \right\}$  (16)

The complete state vector of an agent at time t is:

$s_p(t) = \left( s_p^{\mathrm{obs}}(t),\, s^{\mathrm{hid}} \right)$  (17)

Each agent p observes the environment using exteroceptive sensors. The observation space at time t is defined as:

$o_p(t) = \left( s_p^{\mathrm{obs}}(t),\, \Delta_p(t) \right)$  (18)

where $\Delta_p(t)$ denotes the relative positions of other agents.

At each time step, an agent can perform one of nine discrete actions corresponding to grid-based movement, as shown in Fig. 2. The action space is defined as:

$A = \left\{ a_1, a_2, \ldots, a_9 \right\}$  (19)

Fig. 2. Grid-based action space for agent movement.

To enable adaptation to dynamic tunnel conditions such as blocked passages, smoke, or fire spread, GWO is integrated into the exploration policy of RL. Rather than acting as a standalone optimizer, GWO biases action selection during exploration to prevent premature convergence.

This sustained exploration mechanism ensures that agents continue adapting their policies online as the environment evolves, allowing exploration to proceed until all victims are rescued and the operation is completed.

A customized reward function is designed to promote cooperative exploration, safety, and efficiency:

graphic file with name d33e938.gif 20

The parameters are set as Inline graphic, Inline graphic, Inline graphic, Inline graphic, and Inline graphic.

The duplicate-exploration penalty discourages redundant exploration, while the collision penalty ensures safety during learning. Since agent positions are known in the centralized simulation environment, collisions are detected prior to action execution.

Each agent updates its Q-table using the standard Q-learning update rule:

$Q_p(s_t, a_t) \leftarrow Q_p(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q_p(s_{t+1}, a') - Q_p(s_t, a_t) \right]$  (21)

Action selection follows an $\epsilon$-greedy strategy augmented with GWO-guided exploration.
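One plausible way to combine $\epsilon$-greedy selection with a GWO-derived bias is sketched below; the `gwo_scores` weighting is a hypothetical stand-in for the paper's guidance mechanism, not its exact formulation.

```python
import random

def select_action(Q, state, actions, epsilon, gwo_scores=None):
    """Epsilon-greedy choice; during exploration, optionally bias by GWO scores.

    `gwo_scores` (hypothetical) maps actions to non-negative weights derived
    from the GWO leaders; uniform exploration is used when it is absent.
    """
    if random.random() > epsilon:
        # Exploit: pick the action with the highest learned Q-value.
        return max(actions, key=lambda a: Q.get((state, a), 0.0))
    if gwo_scores:
        # Guided exploration: sample in proportion to the GWO-derived weights.
        weights = [gwo_scores.get(a, 1e-6) for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]
    return random.choice(actions)  # plain uniform exploration

random.seed(1)
Q = {((0, 0), "up"): 0.4, ((0, 0), "right"): 0.9}
acts = ["up", "down", "left", "right"]
print(select_action(Q, (0, 0), acts, epsilon=0.0))  # greedy choice: 'right'
```

Biasing only the exploration branch keeps the greedy policy intact while steering random moves away from directions the GWO leaders rate poorly.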

Algorithm 1. GWO-guided IQL for multi-agent rescue.

The proposed framework relies on well-established convergence properties of Q-learning and frontier-based exploration. Reward shaping and shared environmental feedback promote stable cooperative behavior and reduce non-stationarity. These properties summarize known results rather than introducing new theoretical guarantees.

Property 1: bounded exploration and exploitation

Statement: Under standard Q-learning conditions, the proposed IQL framework maintains a bounded balance between exploration and exploitation during the learning process, preventing premature convergence while ensuring policy improvement over time.

Explanation: Each agent updates its action-value function using the classical Q-learning update rule:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$  (22)

where $\alpha$ denotes the learning rate and $\gamma$ is the discount factor.

When the learning rate satisfies the Robbins-Monro conditions,

$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^{2} < \infty,$

and when state-action pairs are sufficiently explored through an $\epsilon$-greedy policy, the Q-values are known to converge toward stable estimates in stationary environments.

In the proposed framework, exploration is further regulated through reward shaping and step penalties, which discourage excessive wandering while still allowing agents to explore unvisited regions. This results in a bounded exploration-exploitation trade-off that supports stable learning behavior without introducing additional computational complexity.

Property 2: frontier coverage behavior

Statement: The integration of frontier-based exploration with RL leads to progressive reduction of unexplored regions while prioritizing victim discovery in partially observable environments.

Explanation: Let E(t) and U(t) denote the sets of explored and unexplored cells at time step t, respectively. Frontier cells are defined as:

$F(t) = \left\{ c \in E(t) \mid \exists\, c' \in U(t) \text{ adjacent to } c \right\}$  (23)

representing the interface between known and unknown regions.

By selecting actions that guide agents toward frontier cells, the exploration process incrementally expands E(t) while reducing U(t). In the proposed reward formulation, revisiting previously explored cells incurs a penalty, which discourages redundant exploration and promotes efficient coverage of new areas.

As exploration progresses, the number of frontier cells naturally decreases:

$|F(t)| \to 0 \quad \text{as} \quad t \to \infty,$

indicating saturation of the reachable environment. Simultaneously, victim discovery events are reinforced through positive rewards, ensuring that exploration remains goal-directed rather than purely spatial. This behavior supports systematic coverage without requiring explicit global coordination.

Property 3: cooperative utility improvement

Statement: The collective utility of the multi-agent system improves as individual agents learn policies that are shaped by shared environmental feedback and complementary exploration behaviors.

Explanation: Let $r_p(t)$ denote the reward received by agent p at time t, and define the cumulative system-level reward as:

$R_{\mathrm{sys}}(t) = \sum_{p=1}^{C} r_p(t)$  (24)

where C is the number of agents.

Although each agent maintains an independent Q-table, coordination emerges implicitly through shared environmental states and reward signals, such as penalties for collisions and redundant exploration. As agents learn to avoid overlapping paths and unsafe actions, the collective reward accumulated over an episode increases.

The expected utility over a finite horizon T can be expressed as:

$U = \mathbb{E}\!\left[ \sum_{t=0}^{T} \gamma^{t} R_{\mathrm{sys}}(t) \right]$  (25)

which improves as agents adopt policies that balance individual objectives with system-level efficiency. This property reflects cooperative behavior emerging from decentralized learning rather than guaranteeing global optimality, making it suitable for real-time rescue operations.
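The system-level aggregation just described can be illustrated in a few lines; the per-step rewards and discount factor below are made-up example values, not results from the paper.

```python
def system_utility(rewards_per_step, gamma=0.95):
    """Discounted cumulative system reward over a finite horizon.

    `rewards_per_step[t]` holds the per-agent rewards r_p(t) at step t;
    they are summed across agents, then discounted over time.
    """
    total = 0.0
    for t, step_rewards in enumerate(rewards_per_step):
        total += (gamma ** t) * sum(step_rewards)
    return total

# Two agents over three steps: a collision penalty at t=1, a victim found at t=2.
episode = [(0.1, 0.1), (0.1, -0.5), (1.0, 0.1)]
print(system_utility(episode))
```

Episodes in which agents avoid overlapping paths and unsafe actions accumulate fewer penalties, so this scalar increases as the learned policies become more cooperative.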

Performance evaluation

This section presents an extensive evaluation of the proposed GWO-guided IQL framework in both single-agent and multi-agent tunnel rescue scenarios. The evaluation focuses on exploration efficiency, rescue effectiveness, safety, and execution time under complex and constrained environments. To ensure fair and reliable comparisons, all experiments are conducted under identical environmental settings and averaged over 20 independent simulation runs.

Two representative environments are considered: (i) a maze environment with a single agent and a single rescue goal, and (ii) a road-map maze environment with multiple agents and multiple rescue goals. These environments emulate realistic tunnel accident conditions characterized by limited visibility, narrow pathways, and dynamically distributed victims.

Evaluation environments

Single-agent, single-goal maze environment

To evaluate baseline navigation and exploration capability, three different maze configurations with varying obstacle densities, corridor structures, and goal locations are employed. These environments are illustrated in Fig. 3. In each configuration, the agent is represented by a blue box, while the rescue target (victim) is shown as a green box.

Fig. 3. Single-agent exploration in three maze environments with one rescue target.

The environments include multiple dead ends and narrow passages, posing challenges for exploration strategies that suffer from premature convergence or inefficient search behavior. The agent is equipped with onboard sensors to perceive nearby obstacles and free space, enabling partial observability similar to real tunnel conditions.

The agent follows a Q-learning-based policy to balance exploration and exploitation. Successful victim discovery yields a positive reward, while collisions with obstacles incur penalties. Additionally, a small step penalty is applied at each time step to discourage unnecessary movement and promote efficient navigation. This reward structure ensures comprehensive environment coverage while prioritizing timely victim rescue and collision avoidance.
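The reward structure just described can be sketched as follows; the specific magnitudes are illustrative assumptions, since the exact values used in the experiments are not reproduced in this section.

```python
# Hedged sketch of the single-agent reward structure: positive reward
# for victim discovery, penalty for collisions, and a per-step penalty.
# The magnitudes below are illustrative assumptions, not the paper's values.
GOAL_REWARD = 100        # victim discovered
COLLISION_PENALTY = -50  # agent hit an obstacle
STEP_PENALTY = -1        # small cost applied at every time step

def step_reward(reached_goal: bool, collided: bool) -> int:
    """Reward for one time step; the step penalty always applies."""
    r = STEP_PENALTY
    if reached_goal:
        r += GOAL_REWARD
    if collided:
        r += COLLISION_PENALTY
    return r

print(step_reward(False, False))  # ordinary move: -1
print(step_reward(True, False))   # victim found: 99
print(step_reward(False, True))   # collision: -51
```

Because the step penalty is charged unconditionally, longer paths accumulate larger costs even when they end in a rescue, which is what pushes the learned policy toward short, efficient routes.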

Multi-agent, multi-goal road-map maze environment

To assess cooperative behavior and scalability, three road-map maze environments with varying sizes, obstacle distributions, and victim locations are used. These environments are shown in Fig. 4. Victims are depicted in red, obstacles in black, explored regions in white, frontier regions in light blue, and agents in green.

Fig. 4. Multi-agent exploration in road-map maze environments with multiple rescue targets.

Multiple agents operate simultaneously to locate and rescue all victims. Each agent employs the IQL paradigm with an individual Q-table, while coordination is achieved implicitly through shared environment updates in a centralized simulation framework. This setting reflects practical rescue operations where a command center maintains global situational awareness while individual agents act autonomously.

Positive rewards are assigned for successful victim rescues, while penalties are applied for collisions, redundant exploration, and inefficient movements. A small step penalty further encourages agents to minimize rescue time. This environment provides a comprehensive testbed for evaluating cooperative efficiency, safety during learning, and robustness in complex tunnel rescue scenarios.
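The combination of per-agent Q-tables with a shared-map penalty can be sketched as below. The learning rate, discount factor, and penalty values are illustrative assumptions, and `shaped_reward` is a hypothetical helper, not code from the paper.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9        # illustrative learning rate and discount
ACTIONS = ("up", "down", "left", "right")

class IQLAgent:
    """Independent Q-learning: each agent keeps its own tabular Q-function."""
    def __init__(self):
        self.q = defaultdict(lambda: dict.fromkeys(ACTIONS, 0.0))

    def update(self, s, a, r, s_next):
        """Standard one-step Q-learning update on this agent's own table."""
        best_next = max(self.q[s_next].values())
        self.q[s][a] += ALPHA * (r + GAMMA * best_next - self.q[s][a])

def shaped_reward(base_r, cell, visit_counts, dup_penalty=-1.0):
    """Implicit coordination: penalize cells any agent already explored."""
    visit_counts[cell] = visit_counts.get(cell, 0) + 1
    return base_r + (dup_penalty if visit_counts[cell] > 1 else 0.0)

# Two agents share one visit map (the centralized simulation assumption).
visits = {}
agents = [IQLAgent(), IQLAgent()]
r0 = shaped_reward(0.0, (2, 3), visits)  # first visit: no penalty
r1 = shaped_reward(0.0, (2, 3), visits)  # revisit by another agent: penalized
agents[0].update((2, 2), "right", r0, (2, 3))
agents[1].update((2, 4), "left", r1, (2, 3))
print(r0, r1)  # -> 0.0 -1.0
```

Note that the agents never exchange messages: the only coupling is the shared visit map, so each agent independently learns that moving into already-covered territory pays worse than opening new frontiers.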

The proposed framework is compared against the following baseline methods:

  • Random search,

  • Utility-based cooperative exploration (UCE),

  • Cooperative multi-agent exploration (CME),

  • GWO-based exploration.

All algorithms are evaluated under identical environmental conditions, agent counts, and termination criteria. Each experiment is repeated 20 times to mitigate randomness and ensure statistical reliability. Performance is measured in terms of explored area, number of iterations to achieve rescue goals, total execution time, and collision avoidance.
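The repetition-and-averaging protocol can be sketched as follows; `fake_run` is a hypothetical stand-in for one simulation run, not the actual simulator.

```python
import statistics

def evaluate(run_fn, repeats=20):
    """Average per-run metrics over repeated experiments, as in the
    protocol above. run_fn(seed) returns (iterations, time_seconds)."""
    results = [run_fn(seed) for seed in range(repeats)]
    iterations, times = zip(*results)
    return statistics.mean(iterations), statistics.mean(times)

# Hypothetical stand-in for one simulation run (not the real simulator).
def fake_run(seed):
    return 50 + seed % 3, 200.0 + seed

mean_iters, mean_time = evaluate(fake_run)
print(mean_iters, mean_time)  # -> 50.95 209.5
```

Averaging over 20 seeded runs keeps single lucky or unlucky explorations from dominating the reported iteration counts and times.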

Single-agent performance analysis

Figure 5 illustrates the comparative performance of the proposed framework and baseline methods in single-goal environments. The UCE approach exhibits the lowest exploration efficiency, requiring a significantly higher number of steps to reach the goal. This behavior is primarily attributed to its lack of adaptive exploration mechanisms.

Fig. 5. Performance comparison in single-goal environments.

The GWO-based approach demonstrates improved exploration compared to Random Search but still suffers from premature convergence in certain maze configurations. In contrast, the proposed GWO-guided IQL framework consistently achieves faster goal discovery and higher map coverage efficiency across all environments.

Quantitative results are summarized in Table 1. In the third maze configuration, the proposed approach achieves the rescue goal using only 51 iterations and 199 s, significantly outperforming Random Search (72 iterations, 280 s) and GWO (81 iterations, 313 s). These gains highlight the effectiveness of reward shaping, step penalties, and GWO-guided exploration in accelerating convergence and improving real-time performance.

Table 1. Single-goal evaluation.

| Algorithm | Environment | Explored per 50 iterations/steps | Explored per 100 iterations/steps | Iterations/steps to goal | Iterations/steps to explore whole map | Time (s) |
|---|---|---|---|---|---|---|
| Random search | Single goal (1) | 34 | 57 | 154 | 989 | 389 |
| GWO | Single goal (1) | 39 | 61 | 127 | 965 | 391 |
| Proposed (ours) | Single goal (1) | 42 | 85 | 98 | 785 | 310 |
| Random search | Single goal (2) | 21 | 49 | 537 | 1334 | 465 |
| GWO | Single goal (2) | 26 | 60 | 456 | 1287 | 441 |
| Proposed (ours) | Single goal (2) | 35 | 72 | 395 | 961 | 379 |
| Random search | Single goal (3) | 41 | 89 | 72 | 782 | 280 |
| GWO | Single goal (3) | 43 | 78 | 81 | 795 | 313 |
| Proposed (ours) | Single goal (3) | 45 | 91 | 51 | 673 | 199 |

Multi-agent performance analysis

Figure 6 presents the performance comparison in multi-goal environments. The proposed framework consistently outperforms CME and GWO across all scenarios in terms of both rescue time and exploration efficiency. Notably, the proposed method identifies shorter collective paths that cover all victims, which can be reused for subsequent detailed rescue operations.

Fig. 6. Performance comparison in multi-goal environments.

Table 2 further demonstrates the scalability of the proposed framework. With four agents, the proposed approach completes the rescue task in 754 s using 798 iterations, compared to 876 iterations for Random Search and 950 iterations for GWO. When only two agents are used, the proposed method again achieves superior performance, completing the mission in 780 s, compared to 1056 s and 1120 s for Random Search and GWO, respectively.

Table 2. Multi-goal evaluation.

| Algorithm | Environment | Explored per 100 iterations/steps | Explored per 200 iterations/steps | Iterations/steps for all goals | Iterations/steps to explore whole map | Time (s) | Agents |
|---|---|---|---|---|---|---|---|
| Random search | Multi goals (1) | 190 | 350 | 876 | 1938 | 1015 | 4 |
| GWO | Multi goals (1) | 218 | 343 | 950 | 2032 | 1119 | 4 |
| Proposed (ours) | Multi goals (1) | 265 | 427 | 798 | 1358 | 754 | 4 |
| Random search | Multi goals (2) | 133 | 219 | 1035 | 1763 | 950 | 4 |
| GWO | Multi goals (2) | 158 | 231 | 965 | 1581 | 887 | 4 |
| Proposed (ours) | Multi goals (2) | 209 | 327 | 734 | 1402 | 809 | 4 |
| Random search | Multi goals (3) | 115 | 202 | 691 | 1565 | 1056 | 2 |
| GWO | Multi goals (3) | 127 | 191 | 718 | 1742 | 1120 | 2 |
| Proposed (ours) | Multi goals (3) | 164 | 238 | 605 | 1268 | 780 | 2 |

These improvements are attributed to implicit coordination through reward shaping, duplicate-exploration penalties, and collision avoidance mechanisms, which collectively enhance cooperative efficiency without introducing communication overhead.

Discussion and practical implications

The experimental results demonstrate that the proposed GWO-guided IQL framework consistently outperforms Random Search, CME, UCE, and standalone GWO-based methods across both single-agent and multi-agent rescue scenarios. These improvements are observed in terms of reduced rescue time, fewer iterations to achieve goals, and higher exploration efficiency, while maintaining collision-free navigation.

A key factor contributing to this performance gain is the integration of GWO into the exploration policy rather than as a standalone optimizer. By guiding exploration without replacing the underlying RL process, the proposed approach avoids premature convergence to suboptimal paths, which is a common limitation of greedy or purely heuristic-based exploration strategies. This sustained exploration capability is particularly important in tunnel environments, where dynamic changes such as blocked passages or newly accessible regions can significantly alter optimal rescue routes during operation.
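For reference, the canonical GWO position update (Mirjalili et al.43) underlying this guidance can be sketched in a few lines. How the paper couples it with the Q-learning policy is summarized only at a high level above, so this is a generic illustration of the optimizer itself, not the authors' exact integration.

```python
import random

def gwo_step(wolves, fitness, a):
    """One canonical Grey Wolf Optimizer iteration (minimization):
    every wolf moves toward positions guided by the three current
    leaders (alpha, beta, delta); `a` decays from 2 to 0 over the run,
    shifting the swarm from exploration to exploitation."""
    leaders = sorted(wolves, key=fitness)[:3]
    updated = []
    for x in wolves:
        guided = []
        for leader in leaders:
            r1, r2 = random.random(), random.random()
            A = 2 * a * r1 - a            # exploration coefficient
            C = 2 * r2                    # leader-emphasis coefficient
            d = abs(C * leader - x)       # encircling distance
            guided.append(leader - A * d)
        updated.append(sum(guided) / 3)   # average of the three guides
    return updated

# Toy 1-D run: wolves seek the minimum of f(x) = x^2.
random.seed(0)
wolves = [random.uniform(-10, 10) for _ in range(8)]
for it in range(50):
    a = 2 - 2 * it / 50                   # a decays linearly from 2 to 0
    wolves = gwo_step(wolves, lambda x: x * x, a)
```

In the proposed framework this update does not replace the learning process; it only biases which frontier an agent explores next while the Q-values continue to be learned from rewards, which is what preserves sustained exploration.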

The use of IQL with tabular representation plays a critical role in enabling real-time feasibility. Unlike DRL or centralized-training paradigms, the proposed framework avoids neural network inference and extensive training overhead, allowing agents to make rapid decisions based on lightweight table lookups. This design choice is well aligned with the operational constraints of emergency response systems, where computational resources, energy availability, and response time are limited.

Another important observation from the results is the effectiveness of reward shaping in achieving implicit coordination among agents. Penalizing duplicate exploration and collisions discourages inefficient or unsafe behaviors without requiring explicit inter-agent communication or task assignment. As demonstrated in multi-agent experiments, this mechanism enables agents to naturally distribute themselves across the environment, reducing redundant coverage and accelerating collective victim discovery.

From a safety perspective, embedding collision avoidance directly into the reward function ensures that unsafe actions are penalized during learning, allowing agents to internalize safety constraints early in the training process. This approach reduces the likelihood of collision-prone policies emerging, which is essential for operation in narrow tunnel environments where maneuvering space is constrained.

In practical deployment scenarios, the proposed framework can be integrated into centralized tunnel monitoring and command systems, where a global situational map is maintained and shared with multiple autonomous agents. The lightweight nature of the learning algorithm makes it suitable for onboard implementation on resource-constrained platforms, while the centralized simulation assumption provides a foundation for future extensions that incorporate communication delays, sensor noise, or decentralized coordination mechanisms.

Despite its advantages, the proposed framework has certain limitations. The current implementation assumes idealized sensing and reliable global map updates, which may not fully reflect real-world tunnel conditions characterized by sensor noise, communication disruptions, or partial observability. Additionally, while the tabular IQL approach is effective for the evaluated environments, scaling to very large or continuous state spaces may require function approximation or hierarchical learning strategies [44].

Overall, the results suggest that the proposed GWO-guided IQL framework offers a practical and effective solution for time-critical tunnel rescue operations. By balancing exploration efficiency, safety, and computational feasibility, the framework provides a strong foundation for real-world emergency response systems and opens avenues for future research on decentralized coordination, adaptive communication strategies, and integration with physical robot platforms.

Conclusion

This paper presented a lightweight multi-agent reinforcement learning (MARL) framework for autonomous UAV-assisted emergency response in tunnel accident scenarios. The proposed approach employs an Independent Q-Learning (IQL) paradigm augmented with frontier-based exploration and policy-level guidance from Grey Wolf Optimization (GWO) to enable efficient, real-time decision-making under partial observability and dynamic environmental conditions. Extensive simulation results across both single-agent and multi-agent environments demonstrate that the proposed framework consistently achieves faster victim discovery, improved map coverage, and reduced overall rescue time when compared with baseline approaches such as random search and standalone GWO-based exploration. In particular, the results show that the proposed reward design effectively discourages redundant exploration, balances exploration and exploitation, and enhances cooperative behavior among agents without requiring explicit inter-agent communication. These characteristics are especially important in confined tunnel environments, where communication may be unreliable and rapid response is critical. From a practical perspective, the proposed method emphasizes computational efficiency and decentralized execution, making it suitable for real-time deployment in emergency scenarios where hardware resources and response time are constrained. By relying on tabular learning and implicit coordination through reward shaping, the framework avoids the heavy training and infrastructure requirements associated with deep or centralized reinforcement learning methods.

Future work will focus on extending the framework to three-dimensional tunnel models, incorporating realistic sensor noise and communication delays, and validating the approach in high-fidelity simulators or real-world testbeds. Additionally, hybrid architectures that combine lightweight IQL with selective deep reinforcement learning components or adaptive communication strategies will be explored to further improve scalability and robustness in large and highly dynamic rescue operations.

Author contributions

HRuR: conceptualization, data curation, and writing of the original manuscript. MJG: conceptualization, formal analysis, and writing of the original manuscript. RY: methodology, formal analysis, and data curation. MZJ: software, methodology, and project administration. RMA: investigation, funding acquisition, and visualization. YM: visualization, software, and investigation. IA: supervision, validation, and review and editing of the manuscript. All authors reviewed and approved the manuscript.

Funding

This research is funded by the European University of Atlantic.

Data availability

The dataset used in this study can be requested from the corresponding author.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Ma, H., Zhao, J., Huang, H., Wang, Z. & Yao, Y. An experimental investigation into the fire behaviors and smoke characteristics of continuous spill fires in road tunnels. Fire Saf. J. 141, 104009 (2023). https://doi.org/10.1016/j.firesaf.2023.104009
  • 2. Chen, Q. & Zhao, J. Case study of the Tianjin accident: Application of barrier and systems analysis to understand challenges to industry loss prevention in emerging economies. Process Saf. Environ. Prot. 131 (2019). https://doi.org/10.1016/j.psep.2019.08.028
  • 3. Sasago Tunnel. Wikipedia. https://en.wikipedia.org/wiki/Sasago_Tunnel (2025).
  • 4. La, H. M. Multi-robot swarm for cooperative scalar field mapping. In Robotic Systems: Concepts, Methodologies, Tools, and Applications. 208–223 (IGI Global, 2020).
  • 5. La, H. M., Sheng, W. & Chen, J. Cooperative and active sensing in mobile sensor networks for scalar field mapping. IEEE Trans. Syst. Man Cybern. Syst. 45(1), 1–12 (2014).
  • 6. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction. Vol. 1 (MIT Press, 1998).
  • 7. La, H. M., Lim, R. & Sheng, W. Multirobot cooperative learning for predator avoidance. IEEE Trans. Control Syst. Technol. 23(1), 52–63 (2014).
  • 8. La, H. M., Lim, R. S., Sheng, W. & Chen, J. Cooperative flocking and learning in multi-robot systems for predator avoidance. In 2013 IEEE International Conference on Cyber Technology in Automation, Control and Intelligent Systems. 337–342 (IEEE, 2013).
  • 9. Faust, A., Palunko, I., Cruz, P., Fierro, R. & Tapia, L. Learning swing-free trajectories for UAVs with a suspended load. In 2013 IEEE International Conference on Robotics and Automation. 4902–4909 (IEEE, 2013).
  • 10. Bou-Ammar, H., Voos, H. & Ertel, W. Controller design for quadrotor UAVs using reinforcement learning. In 2010 IEEE International Conference on Control Applications. 2130–2135 (IEEE, 2010).
  • 11. Santos, S. R. B., Nascimento, C. L. & Givigi, S. N. Design of attitude and path tracking controllers for quad-rotor robots using reinforcement learning. In 2012 IEEE Aerospace Conference. 1–16 (IEEE, 2012).
  • 12. Waslander, S. L., Hoffmann, G. M., Jang, J. S. & Tomlin, C. J. Multi-agent quadrotor testbed control design: Integral sliding mode vs. reinforcement learning. In 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems. 3712–3717 (IEEE, 2005).
  • 13. Bellingham, J., Richards, A. & How, J. P. Receding horizon control of autonomous aerial vehicles. In Proceedings of the 2002 American Control Conference (IEEE Cat. No. CH37301). Vol. 5. 3741–3746 (IEEE, 2002).
  • 14. Pham, H. X., La, H. M., Feil-Seifer, D. & Van Nguyen, L. Reinforcement learning for autonomous UAV navigation using function approximation. In 2018 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR). 1–6 (IEEE, 2018).
  • 15. Hung, S.-M. & Givigi, S. N. A Q-learning approach to flocking with UAVs in a stochastic environment. IEEE Trans. Cybern. 47(1), 186–197 (2016).
  • 16. Jia, W., Lv, L., Duan, R., Sun, T. & Sun, W. A reinforcement learning-based adaptive grey wolf optimizer for simultaneous arrival in manned/unmanned aerial vehicle dynamic cooperative trajectory planning. Drones 9(10) (2025). https://doi.org/10.3390/drones9100723
  • 17. Tai, L. & Liu, M. A robot exploration strategy based on Q-learning network. In 2016 IEEE International Conference on Real-time Computing and Robotics (RCAR). 57–62 (IEEE, 2016).
  • 18. Zhang, J., Tai, L., Liu, M., Boedecker, J. & Burgard, W. Neural SLAM: Learning to explore with external memory. arXiv preprint arXiv:1706.09520 (2017).
  • 19. Tai, L., Paolo, G. & Liu, M. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 31–36 (IEEE, 2017).
  • 20. Wei, J., Zhao, Y. & Yang, K. Integrated communication and control for intelligent formation management of UAV swarms: A deep reinforcement learning approach. IEEE Wireless Communications Letters (2025).
  • 21. Azar, A. T. et al. Drone deep reinforcement learning: A review. Electronics 10(9), 999 (2021).
  • 22. Kulkarni, S., Chaphekar, V., Chowdhury, M. M. U., Erden, F. & Guvenc, I. UAV aided search and rescue operation using reinforcement learning. In 2020 SoutheastCon. Vol. 2. 1–8 (IEEE, 2020).
  • 23. Zuluaga, J. G. C., Leidig, J. P., Trefftz, C. & Wolffe, G. Deep reinforcement learning for autonomous search and rescue. In NAECON 2018 - IEEE National Aerospace and Electronics Conference. 521–524 (IEEE, 2018).
  • 24. Zhan, H. et al. A reinforcement learning-based evolutionary algorithm for the unmanned aerial vehicles maritime search and rescue path planning problem considering multiple rescue centers. Memet. Comput. 16(3), 373–386 (2024).
  • 25. Talha, M., Hussein, A. & Hossny, M. Autonomous UAV navigation in wilderness search-and-rescue operations using deep reinforcement learning. In Australasian Joint Conference on Artificial Intelligence. 733–746 (Springer, 2022).
  • 26. Liu, X., Liu, Y. & Chen, Y. Reinforcement learning in multiple-UAV networks: Deployment and movement design. IEEE Trans. Veh. Technol. 68(8), 8036–8049 (2019).
  • 27. Zhang, Y., Liu, H., Wang, X. & Chen, J. A UAV-UGV cooperative system for patrolling and energy management in urban monitoring. IEEE Trans. Veh. Technol. 74(2), 2451–2464 (2025).
  • 28. Gao, P., Zhou, L., Zhao, X. & Shao, B. Research on ship collision avoidance path planning based on modified potential field ant colony algorithm. Ocean Coast. Manag. 235, 106482 (2023).
  • 29. De Castro, G. G. et al. Dynamic path planning based on neural networks for aerial inspection. J. Control Autom. Electr. Syst. 34(1), 85–105 (2023).
  • 30. Shen, Z., Ding, W., Liu, Y. & Yu, H. Path planning optimization for unmanned sailboat in complex marine environment. Ocean Eng. 269, 113475 (2023).
  • 31. Seyyedabbasi, A., Kiani, F., Allahviranloo, T., Fernandez-Gamiz, U. & Noeiaghdam, S. Optimal data transmission and pathfinding for WSN and decentralized IoT systems using I-GWO and Ex-GWO algorithms. Alex. Eng. J. 63, 339–357 (2023).
  • 32. Yamauchi, B. A frontier-based approach for autonomous exploration. In Proceedings 1997 IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA'97): Towards New Computational Principles for Robotics and Automation. 146–151 (IEEE, 1997).
  • 33. Thrun, S., Burgard, W. & Fox, D. Probabilistic Robotics (MIT Press, 2005).
  • 34. Clifton, J. & Laber, E. Q-learning: Theory and applications. Annu. Rev. Stat. Appl. 7(1), 279–301 (2020).
  • 35. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
  • 36. Cao, Y. U., Fukunaga, A. S. & Kahng, A. Cooperative mobile robotics: Antecedents and directions. Auton. Robot. 4, 7–27 (1997).
  • 37. Huang, Z., Li, Q., Zhao, Y. & Zhang, R. Optimizing disaster response with UAV-mounted reconfigurable intelligent surfaces and HAP-enabled edge computing in 6G networks. J. Netw. Comput. Appl. 221, 104213 (2025). https://doi.org/10.1016/j.jnca.2025.104213
  • 38. Alqahtani, S., Nguyen, T. & Kim, D. Energy efficiency relaying election mechanism for 5G Internet of Things: A deep reinforcement learning technique. In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC). 1–6 (IEEE, 2024).
  • 39. Nawaz, F. et al. Graph-based path planning with dynamic obstacle avoidance for autonomous parking. arXiv:2504.12616 (2025).
  • 40. Srivastava, A., Vasudevan, V. R., Harikesh, Nallanthiga, R. & Sujit, P. B. A modified artificial potential field for UAV collision avoidance. In 2023 International Conference on Unmanned Aircraft Systems (ICUAS). 499–506 (2023). https://doi.org/10.1109/ICUAS57906.2023.10156492
  • 41. Kostrikov, I., Nair, A. & Levine, S. Offline reinforcement learning with implicit Q-learning. arXiv:2110.06169 (2021).
  • 42. Rehman, H. M. R. U., On, B.-W., Ningombam, D. D., Yi, S. & Choi, G. S. QSOD: Hybrid policy gradient for deep multi-agent reinforcement learning. IEEE Access 9, 129728–129741 (2021). https://doi.org/10.1109/ACCESS.2021.3113350
  • 43. Mirjalili, S., Mirjalili, S. M. & Lewis, A. Grey wolf optimizer. Adv. Eng. Softw. 69, 46–61 (2014). https://doi.org/10.1016/j.advengsoft.2013.12.007
  • 44. Younas, R. et al. SA-MARL: Novel self-attention-based multi-agent reinforcement learning with stochastic gradient descent. IEEE Access 13, 35674–35687 (2025). https://doi.org/10.1109/ACCESS.2025.3544961
