Sensors (Basel, Switzerland). 2026 Jan 28;26(3):853. doi: 10.3390/s26030853

DSAC-ICM: A Distributional Reinforcement Learning Framework for Path Planning in 3D Uneven Terrains

Yixin Zhou 1, Fan Liu 2, Zhixiao Liu 3, Xianghan Ji 4, Guangqiang Yin 2,*
Editors: Sergio Toral Marín, Zheng Chen
PMCID: PMC12899962  PMID: 41682369

Abstract

Ground autonomous mobile robots are increasingly critical for reconnaissance, patrol, and resupply tasks in public safety and national defense scenarios, where global path planning in 3D uneven terrains remains a major challenge. Traditional planners struggle with high dimensionality, while Deep Reinforcement Learning (DRL) is hindered by two key issues: (1) systematic overestimation of action values (Q-values) due to function approximation error, which leads to suboptimal policies and training instability; and (2) inefficient exploration under sparse reward signals. To address these limitations, we propose DSAC-ICM: a Distributional Soft Actor–Critic framework integrated with an Intrinsic Curiosity Module (ICM). Our method fundamentally shifts the learning paradigm from estimating scalar Q-values to learning the full probability distribution of state-action returns, which inherently mitigates value overestimation. We further integrate the ICM to generate dense intrinsic rewards, guiding the agent toward novel and unvisited states to tackle the exploration challenge. Comprehensive experiments conducted in a suite of realistic 3D uneven-terrain environments demonstrate that DSAC-ICM successfully enables the agent to learn effective navigation capabilities. Crucially, it achieves a superior trade-off between path quality and computational cost when compared to traditional path planning algorithms. Furthermore, DSAC-ICM significantly outperforms other RL baselines in terms of convergence speed and return.

Keywords: distributional reinforcement learning, soft actor–critic, intrinsic curiosity module, global path planning, 3D uneven terrain

1. Introduction

Autonomous navigation in complex natural environments has become one of the key challenges for mobile robots [1,2]. In applications such as geological exploration, planetary rovers, and disaster rescue, robots must traverse three-dimensional uneven terrains, where elevation variations, irregular slopes, and uncertain traversability significantly complicate motion planning [3]. Path planning, which aims to generate an optimal feasible trajectory from a start to a goal position while minimizing cost functions such as distance, energy consumption, or time, plays a central role in achieving autonomous navigation. However, global path planning is a well-known NP-Hard problem, and its computational complexity grows rapidly with dimensionality [4].

Traditional path planning algorithms—such as Dijkstra [5], A* [6], and Rapidly-exploring Random Tree (RRT) [7]—have been extensively studied and widely applied in structured environments. These methods rely on search strategies. While they can efficiently find collision-free paths in low-dimensional static environments, their performance deteriorates on uneven terrains, where the cost of traversing each region is strongly influenced by terrain elevation and slope. Moreover, when the environment changes, these algorithms must reconstruct the search graph or tree, leading to poor adaptability and limited scalability [3]. Beyond classical single-agent path planning, several studies have explored navigation problems under more complex constraints. For instance, task assignment for multiple vehicles with partially unreachable targets focuses on coordination and feasibility at the decision-making level rather than low-level motion planning [8]. Path planning in uncertain environments with moving obstacles has also been investigated using warm-start cross-entropy methods to improve robustness under dynamic conditions [9]. In addition, multi-objective optimization-based approaches, such as the 3D-M method, jointly consider distance and energy consumption for path planning on uneven 3D terrains [10]. While effective in modeling terrain-induced costs, these methods typically rely on predefined cost formulations and optimization heuristics, and do not explicitly address continuous-control learning in complex 3D environments. Moreover, it is worth noting that most global path planning methods, including both classical optimization-based approaches and learning-based planners, are commonly formulated under static and idealized environment assumptions. Such formulations focus on long-horizon route optimization based on prior terrain information, while deferring real-time adaptability to lower-level planning or control modules.

From a computational perspective, reinforcement learning-based planners and traditional search-based planners also exhibit fundamentally different characteristics. For small-scale maps or single-shot planning tasks, classical algorithms such as Dijkstra or A* are often more computationally efficient, as they do not require a training phase. In contrast, RL-based methods incur higher computational cost during training, but this cost is amortized over repeated deployments. Once trained, policy inference requires only a forward pass through a neural network, resulting in near-constant planning time and enabling efficient reuse across different start–goal configurations.

In contrast, reinforcement learning (RL) offers a data-driven approach for learning navigation policies [11]. Agents improve through trial-and-error interactions with the environment [12]. By updating policies and value functions based on observed rewards, RL enables adaptation to dynamic or partially known environments. RL algorithms can be classified as on-policy or off-policy.

Off-policy algorithms such as DDPG [13], TD3 [14], and SAC [15] are sample-efficient and suitable for high-dimensional continuous control. However, they often suffer from Q-value overestimation [16,17], which in uneven terrain navigation can lead to unsafe or infeasible paths that appear optimal in the learned value function. Additionally, sparse and delayed goal-reaching rewards limit exploration, causing the agent to get stuck in local minima and fail to find traversable routes [18]. Together, overestimation and sparse rewards make stable and efficient global path planning in 3D uneven terrains particularly challenging.

To address these challenges, we propose DSAC-ICM, a reinforcement learning framework for 3D path planning in uneven terrains. Our approach integrates the Distributional Soft Actor–Critic (DSAC) algorithm [19,20] with an Intrinsic Curiosity Module (ICM) [21] to jointly enhance value estimation stability and exploration capability. Specifically, DSAC models the full return distribution rather than its expectation, which mitigates Q-value overestimation and improves robustness against high-variance terrain dynamics. Meanwhile, ICM provides an intrinsic reward signal that promotes efficient exploration in sparse-reward environments, guiding the agent toward informative states and reducing the likelihood of premature convergence. The combination enables the agent to learn safer, smoother, and more energy-efficient global paths that better adapt to complex topographies. The main contributions are summarized as follows:

  • We propose a reinforcement learning-based global path planning framework for 3D uneven terrains, which explicitly considers robot motion constraints, supports continuous action spaces, and optimizes for traversed distance over uneven terrain.

  • We design a DSAC-ICM algorithm, which combines distributional value learning with curiosity-driven intrinsic motivation to jointly address Q-value overestimation and sparse-reward exploration.

  • We conduct comprehensive experiments on DEM-based 3D terrain datasets, showing that DSAC-ICM performs well in path optimality, learning stability, and exploration efficiency.

The remainder of this paper is organized as follows. We first formalize the 3D path planning problem and present the Markov decision process formulation in Section 3. Section 4 describes the proposed DSAC-ICM framework, including its distributional critic and curiosity-driven exploration strategy. Experimental results and analysis are presented in Section 5, and conclusions and future directions are discussed in Section 6.

2. Related Work

Over the past decades, a wide range of algorithms have been proposed for global path planning. Classical methods include graph-based algorithms such as Dijkstra [5] and A* [6], which guarantee optimal solutions but suffer from high computational costs on large or complex terrains, and sampling-based methods like RRT [7] and RRT* [22], which scale well to high-dimensional spaces but require frequent replanning in dynamic environments. Bio-inspired algorithms and heuristic search approaches have also been explored to approximate optimal paths or incorporate multi-objective criteria such as energy consumption, slope constraints, or scenic routes [3]. While these methods are effective in structured or flat environments, they struggle with uneven 3D terrains where path cost depends on both horizontal displacement and elevation changes, and where safe traversal requires consideration of slope feasibility and energy consumption. Consequently, traditional planners often fail to simultaneously satisfy physical constraints, minimize traversal cost, and maintain computational efficiency in complex 3D terrains.

Among learning-based methods, reinforcement learning (RL) has emerged as a promising alternative for path planning [23], enabling agents to discover feasible and efficient trajectories through interaction with the environment. Recent deep RL (DRL) applications include UAV-based weed localization using DQN [24], urban IoT data collection via convolutional networks [25], hierarchical DRL for indoor navigation with LiDAR-based complexity metrics [26], and attention-guided TERP for rugged terrains [27]. Off-policy algorithms, such as DQN [28], DDPG [13], TD3 [14], and SAC [15], are particularly suitable for continuous control due to their sample efficiency and stability, but often suffer from Q-value overestimation and unstable learning in high-variance environments such as 3D uneven terrains [16,17]. To mitigate overestimation, methods such as Double Q-learning [29] and Double DQN [30] decouple action selection from evaluation in discrete spaces. Continuous control extensions, such as Double DDPG [14], reduce but do not fully eliminate overestimation, while Clipped Double Q-learning, used in TD3 and SAC, partially alleviates overestimation by taking the minimum of two Q-estimates but may introduce underestimation bias. Distributional RL methods, including Distributional SAC (DSAC) [19,20], address these issues by modeling the full return distribution, capturing uncertainty across returns and improving value estimation stability in complex or high-variance environments. In addition to value decomposition and distributional modeling, reward shaping-based approaches have also been proposed to mitigate value estimation bias in off-policy reinforcement learning. Munchausen Reinforcement Learning (M-RL) incorporates a log-policy term directly into the reward function, inspired by entropy-regularized RL, thereby reshaping the Bellman target and reducing overestimation effects [31]. 
By augmenting the reward with a scaled log-policy term, M-RL encourages more conservative value updates and has been shown to improve learning stability in several continuous control tasks. However, as a reward-level modification, its effectiveness may still be limited in environments characterized by highly sparse rewards and strong return variance induced by complex terrain dynamics.

Another critical challenge in path planning is sparse reward signals, where meaningful feedback is only obtained when the agent reaches or approaches the goal. Curiosity-driven exploration mechanisms, such as the Intrinsic Curiosity Module (ICM) [21], provide intrinsic reward signals based on state prediction errors, guiding the agent to explore informative states and improving training efficiency in sparse-reward scenarios. To further enhance exploration stability in continuous action spaces, ICM has been integrated with TD3, introducing a randomness-enhanced module to encourage exploration of unknown regions and reduce local optima [32]. Multi-agent sparse-reward scenarios have been addressed with the I-Go-Explore framework, which combines ICM with Go-Explore to leverage historical exploration experience for targeted state visitation and improved sample efficiency [33].

3. Problem Formulation

In this paper, we consider the problem of global path planning for ground mobile robots operating in uneven 3D terrains, where the objective is to generate a feasible path from a start position to a target position while respecting terrain-dependent physical constraints. Compared with flat terrain, path planning in uneven environments is more challenging because the robot must additionally account for slope variations and its maximum admissible slope constraint.

The uneven terrain is modeled as a 2D grid map consisting of N nodes. Each node i is associated with an elevation value H(i) derived from a Digital Elevation Map (DEM). Let the start and goal positions be denoted as s0 and sg, respectively.

The robot moves with a fixed step length l, and at each decision step selects a moving direction from a continuous action space. Therefore, the action space is defined as:

A = { a ∈ ℝ² : ‖a‖ = l }, (1)

i.e., the action corresponds to selecting a continuous movement direction, and the robot moves to the next state by

s_{t+1} = s_t + a. (2)

During the planning process, robot motion must satisfy the physical slope constraint: the slope angle between two adjacent nodes i and j must not exceed the robot's maximum admissible slope θ_max. The slope angle is computed as:

θ(i, j) = arctan( |H(j) − H(i)| / d(i, j) ), (3)

where d(i,j) denotes the Euclidean distance between nodes i and j in the horizontal plane.

The goal of path planning is to generate a sequence of states and actions that starts from s0, ends at sg, satisfies the slope constraint, and minimizes the total path cost. The path cost is defined as the sum of the movement costs along the path:

J(τ) = Σ_{t=0}^{T−1} c(s_t, s_{t+1}), (4)

where the unit movement cost is given by

c(s_t, s_{t+1}) = √( d(s_t, s_{t+1})² + (H(s_{t+1}) − H(s_t))² ). (5)
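To make the geometry concrete, the slope test of Eq. (3) and the step cost of Eq. (5) can be sketched on a DEM grid as follows. This is a minimal illustration; the function names, tuple-indexing convention, and cell-size parameter are our own assumptions, not the paper's implementation:

```python
import numpy as np

def slope_deg(H, i, j, cell=1.0):
    """Slope angle in degrees between grid nodes i and j, per Eq. (3)."""
    (r1, c1), (r2, c2) = i, j
    d = cell * np.hypot(r2 - r1, c2 - c1)        # horizontal distance d(i, j)
    return np.degrees(np.arctan(abs(H[r2, c2] - H[r1, c1]) / d))

def step_cost(H, i, j, cell=1.0):
    """3D movement cost between adjacent nodes, per Eq. (5)."""
    (r1, c1), (r2, c2) = i, j
    d = cell * np.hypot(r2 - r1, c2 - c1)
    return np.hypot(d, H[r2, c2] - H[r1, c1])

H = np.array([[0.0, 0.0],
              [0.0, 1.0]])                        # tiny 2x2 DEM
print(step_cost(H, (0, 0), (0, 1)))               # flat move: cost 1.0
print(slope_deg(H, (0, 0), (1, 1)))               # diagonal climb onto the bump
```

A planner can reject a candidate move whenever `slope_deg(...)` exceeds θ_max before paying the corresponding `step_cost`.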

We formulate the path planning problem as a Markov Decision Process (MDP), defined by a tuple ⟨S, A, R, γ⟩:

  • State space S. Includes the elevation matrix H, the robot’s current position, and the target position.

  • Action space A. A continuous action space representing the movement direction with fixed step length.

  • Reward function R. The immediate reward r=R(s,a) is determined by the current state and action.

  • Discount factor γ. A scalar γ ∈ (0, 1] used to compute the cumulative return.

To clearly define the scope of this study, we make several assumptions: the robot has complete prior knowledge of the terrain elevation data and full awareness of its own state, including position and actions. The environment is assumed to be static during the global planning process, which allows the planner to focus on long-horizon route optimization based on known terrain elevation data. Handling dynamic obstacles and real-time disturbances is beyond the scope of this work and is typically addressed by a local planner or reactive control layer.

Ideally, the path planning algorithm aims to find a feasible path τ* that minimizes the cost:

τ* = argmin_τ J(τ), (6)

which is equivalent to maximizing the cumulative reward in the MDP formulation. However, global path planning is an NP-hard problem, making it computationally infeasible to guarantee global optimality in polynomial time. Therefore, practical approaches aim to compute, within reasonable time, a path that satisfies physical constraints and achieves as low a cost as possible.

In this work, we focus on improving planning efficiency and path optimality for robots navigating uneven terrains, while ensuring compliance with slope constraints and maintaining robustness and computational efficiency.
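Since Dijkstra serves as the reference optimal planner in the experiments, a slope-constrained Dijkstra baseline over the DEM grid can be sketched as below. This is a minimal illustration under assumed conventions (8-connected grid, unit cell size, our own function and variable names), not the authors' implementation:

```python
import heapq
import numpy as np

def dijkstra_dem(H, start, goal, theta_max_deg=45.0):
    """Reference planner: Dijkstra on the 8-connected DEM grid, using the
    3D step cost of Eq. (5) and pruning edges that violate the slope limit
    of Eq. (3). Returns total path cost, or None if no feasible path exists."""
    rows, cols = H.shape
    tan_max = np.tan(np.radians(theta_max_deg))
    dist = {start: 0.0}
    pq = [(0.0, start)]
    while pq:
        d0, u = heapq.heappop(pq)
        if u == goal:
            return d0
        if d0 > dist.get(u, np.inf):
            continue                                # stale queue entry
        r, c = u
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == dc == 0:
                    continue
                v = (r + dr, c + dc)
                if not (0 <= v[0] < rows and 0 <= v[1] < cols):
                    continue
                horiz = np.hypot(dr, dc)
                dh = H[v] - H[u]
                if abs(dh) / horiz > tan_max:       # slope constraint, Eq. (3)
                    continue
                nd = d0 + np.hypot(horiz, dh)       # 3D movement cost, Eq. (5)
                if nd < dist.get(v, np.inf):
                    dist[v] = nd
                    heapq.heappush(pq, (nd, v))
    return None
```

On a flat 3×3 grid the diagonal route costs 2√2; raising one full column above the slope limit makes the goal unreachable and the sketch returns None.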

4. Method

4.1. Method Overview

We propose DSAC-ICM, a novel reinforcement learning framework that integrates Distributional Soft Actor–Critic (DSAC) with the Intrinsic Curiosity Module (ICM) to achieve robust and efficient global path planning in complex 3D uneven terrains. Our framework specifically addresses the dual challenges of Q-value overestimation prevalent in standard off-policy algorithms and sparse exploration inherent in goal-reaching tasks. The overall architecture is based on the actor–critic paradigm, utilizing a continuous action space for flexible movement direction selection.

4.2. Overestimation in 3D Terrain Path Planning

Off-policy actor–critic algorithms such as DDPG, TD3, and SAC approximate value functions using neural networks and update them via Bellman backups. Due to approximation noise and the implicit max operator in policy improvement, these methods commonly suffer from Q-value overestimation. In SAC, for example, the one-step target is

y = r + γ [ Q_θ̄(s′, a′) − α log π(a′|s′) ], (7)

where noise in Q_θ̄ is directly propagated into optimistic targets, causing error accumulation and unstable learning.

This issue becomes more severe in 3D uneven terrain path planning. First, transition dynamics exhibit high variance because movement cost depends on both horizontal displacement and elevation change; slight action perturbations can cause large height variations and reward noise, leading the critic to favor steep “shortcut” directions. Second, the slope constraint θ(i,j)θmax creates a hard feasibility boundary. Overestimated critics often assign high value to actions near or beyond this boundary, promoting transitions that are physically infeasible. Third, sparse goal-reaching rewards make early training dominated by bootstrapping, amplifying the accumulation of optimistic bias. As a result, SAC-based planners may prefer steep or unsafe trajectories that appear advantageous in the value approximation but violate terrain constraints or lead to unstable behavior.
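A toy numerical illustration of this bias: even when every Q-estimate is unbiased (true value 0 plus zero-mean noise), maximizing over actions yields a positive expected target, while the clipped double-Q heuristic discussed in Section 2 pushes the estimate back down. The setup and numbers below are illustrative only, not drawn from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000

# Unbiased Q-estimates: true value of every action is 0, plus N(0, 1) noise.
q1 = rng.normal(0.0, 1.0, (n_trials, n_actions))
q2 = rng.normal(0.0, 1.0, (n_trials, n_actions))

max_bias = q1.max(axis=1).mean()                   # > 0: optimistic bias
clipped = np.minimum(q1, q2).max(axis=1).mean()    # < max_bias by construction
print(f"single-critic max bias:  {max_bias:.3f}")
print(f"clipped double-Q target: {clipped:.3f}")
```

The same mechanism, compounded by bootstrapping, is what makes sparse-reward terrain navigation amplify optimistic errors.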

4.3. Distributional Soft Actor–Critic

Conventional SAC estimates the expected return using a scalar critic, which is prone to overestimation in noisy or high-variance environments such as 3D uneven terrains. To address this limitation, Distributional Soft Actor–Critic (DSAC) extends SAC by modeling the full distribution of soft state-action returns Zπ(s,a) instead of a single mean. This enables the critic to capture uncertainty in value estimates, reducing overestimation and promoting safer policy learning in complex terrains. Figure 1 illustrates the overall DSAC architecture.

Figure 1. Distributional Soft Actor–Critic.

DSAC treats the soft return as a random variable:

Z^π(s, a) =_D T^π Z^π(s, a), (8)

where the distributional Bellman operator is defined as

T^π Z^π(s, a) =_D r + γ [ Z^π(s′, a′) − α log π(a′|s′) ], (9)

with s′ ∼ p(·|s, a), a′ ∼ π(·|s′), and r ∼ R(·|s, a). A parametric approximation Z_θ(s, a) is maintained and updated by minimizing a divergence d(·,·), specifically the Kullback–Leibler (KL) divergence, between the target distribution and the predicted distribution:

θ ← argmin_θ E_{(s,a)∼D} [ d( T^π Z_θ̄(s, a), Z_θ(s, a) ) ], (10)

where Z_θ̄ is the target critic.

DSAC typically maintains two critics and one actor, similar to SAC. The critics output a parameterized approximation of Z(s,a), and a target network stabilizes learning. The actor is trained to maximize the expected return under the distributional critic:

J_π = E_{s∼D, a∼π} [ E[Z_θ(s, a)] − α log π(a|s) ]. (11)

By modeling the full return distribution, DSAC provides several key advantages for 3D uneven terrain navigation. First, it reduces overestimation by capturing uncertainty and preventing overly optimistic Q spikes near steep slopes or unstable transitions. Second, distributional targets smooth the high variance caused by irregular terrain dynamics, leading to more stable learning. Third, actions with risky or high-variance returns are naturally penalized, enabling safer policy learning that respects terrain constraints and robot stability. These features make DSAC more robust and reliable than conventional SAC in complex navigation tasks with nonlinear constraints and sparse rewards.
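A minimal PyTorch sketch of a Gaussian-parameterized distributional critic is given below, using a negative log-likelihood surrogate for the divergence minimization of Eq. (10). The layer sizes follow Table 1 (MLP [128, 128] with GELU); the class layout, clamping range, and names are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class GaussianCritic(nn.Module):
    """Distributional critic sketch: outputs the mean and standard deviation
    of the soft return Z_theta(s, a)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 2),                      # -> [mean, log_std]
        )

    def forward(self, s, a):
        mean, log_std = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        return mean, log_std.clamp(-5.0, 2.0).exp()    # keep std in a sane range

def critic_loss(critic, s, a, target_z):
    """Negative log-likelihood of a sampled distributional target
    target_z = r + gamma * (z' - alpha * log pi(a'|s')) under Z_theta(s, a),
    one common surrogate for the divergence minimization of Eq. (10)."""
    mean, std = critic(s, a)
    return -torch.distributions.Normal(mean, std).log_prob(target_z).mean()
```

As described above, DSAC would keep two such critics plus a target network, exactly as in SAC.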

4.4. Intrinsic Motivation for Exploration

In 3D uneven terrain navigation, the reward signal is typically sparse, as significant feedback is only obtained when the robot approaches the goal. This sparsity limits the agent’s ability to explore effectively, often causing the policy to converge prematurely to suboptimal routes. To address this challenge, we incorporate an Intrinsic Curiosity Module (ICM) that generates an auxiliary reward based on the agent’s prediction error of environment dynamics (Figure 2).

Figure 2. Intrinsic Curiosity Module.

Formally, ICM consists of a forward model f_F and an inverse model f_I. Given a transition (s_t, a_t, s_{t+1}), the inverse model predicts the action â_t = f_I(s_t, s_{t+1}), while the forward model predicts the next state embedding φ̂_{t+1} = f_F(φ(s_t), a_t), where φ(s) is a learned feature representation of the state. The intrinsic reward is defined as the prediction error of the forward model:

r_t^int = η ‖ φ̂_{t+1} − φ(s_{t+1}) ‖², (12)

where η is a scaling factor. This reward encourages the agent to visit states that are novel or hard to predict, effectively guiding exploration in sparse-reward environments.

The total reward used for policy learning combines the environment reward and intrinsic reward:

r_t^total = r_t^ext + λ r_t^int, (13)

where λ controls the contribution of the curiosity-driven reward.

By leveraging intrinsic motivation through ICM, the agent can efficiently explore complex 3D terrains, discover feasible paths, and overcome the limitations imposed by sparse goal-reaching rewards.
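The ICM components can be sketched in PyTorch roughly as follows. The feature and hidden sizes mirror Table 1 (64-dimensional), and the default η matches Table 1's intrinsic reward scale, but the module structure and names are our own illustration rather than the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Sketch of the Intrinsic Curiosity Module: encoder phi, forward model
    f_F, and inverse model f_I, per Section 4.4."""
    def __init__(self, state_dim, action_dim, feat_dim=64, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(state_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, feat_dim))
        self.f_forward = nn.Sequential(nn.Linear(feat_dim + action_dim, hidden),
                                       nn.GELU(), nn.Linear(hidden, feat_dim))
        self.f_inverse = nn.Sequential(nn.Linear(2 * feat_dim, hidden),
                                       nn.GELU(), nn.Linear(hidden, action_dim))

    def forward(self, s, a, s_next, eta=0.05):
        f, f_next = self.phi(s), self.phi(s_next)
        f_pred = self.f_forward(torch.cat([f, a], dim=-1))
        a_pred = self.f_inverse(torch.cat([f, f_next], dim=-1))
        # Eq. (12): intrinsic reward is the scaled forward prediction error.
        r_int = eta * (f_pred - f_next.detach()).pow(2).sum(dim=-1)
        fwd_loss = F.mse_loss(f_pred, f_next.detach())   # trains f_F
        inv_loss = F.mse_loss(a_pred, a)                 # trains phi and f_I
        return r_int, fwd_loss, inv_loss
```

The policy would then be trained on r_ext + λ·r_int per Eq. (13), with the two ICM losses balanced by the forward-loss weight β of Table 1.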

4.5. Reward Function

The design of the reward function is critical for guiding the agent to navigate effectively and safely in 3D uneven terrains. Our reward structure consists of several components reflecting environmental constraints and providing informative feedback at each step.

A positive reward, rgoal, is granted when the agent reaches the goal. If a proposed action leads the agent outside the valid map boundaries, the action is aborted and a collision penalty, rcoll, is applied. Slope constraints are enforced by computing the slope along the proposed movement. If the slope exceeds the robot’s maximum traversable angle, θmax, the move is allowed but a negative slope penalty, rslope, is incurred, guiding the agent to avoid overly steep terrain.

To provide denser feedback, a progress-based reward, rprogress, is given according to the reduction in horizontal distance to the goal:

r_progress = w_progress · (dist_horiz,t−1 − dist_horiz,t), (14)

where the horizontal distance is computed as

dist_horiz,t = √( (x_t − x_goal)² + (y_t − y_goal)² ), (15)

and wprogress is a positive scaling factor. Additionally, a path efficiency reward penalizes the 3D Euclidean distance traveled in each step:

r_path = −w_path · d_3D_step, (16)

where the 3D step length is

d_3D_step = √( (x_t − x_{t−1})² + (y_t − y_{t−1})² + (z_t − z_{t−1})² ), (17)

discouraging unnecessarily long or circuitous paths while respecting vertical variations.

A small negative step penalty, rstep, is applied at each step to encourage shorter trajectories.

The total external reward at each time step is the sum of all components:

r_t = r_goal + r_coll + r_slope + r_progress + r_path + r_step. (18)

This reward design ensures the learned policy is goal-directed, safe, and physically plausible, effectively promoting efficient navigation across challenging 3D uneven terrains.
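A hedged sketch of how these components might be combined in code is shown below. All weights and penalty magnitudes are illustrative placeholders, since the paper does not report their numeric values, and the function signature is our own:

```python
import math

def external_reward(prev_pos, pos, goal, reached, out_of_bounds, slope,
                    theta_max=45.0, w_progress=1.0, w_path=0.1,
                    r_goal=100.0, r_coll=-10.0, r_slope=-5.0, r_step=-0.01):
    """Sketch of the shaped external reward of Eq. (18); weights and penalty
    magnitudes are illustrative placeholders, not the paper's values."""
    (x0, y0, z0), (x1, y1, z1) = prev_pos, pos
    # Progress term, Eq. (14): reduction in horizontal distance to the goal.
    d_prev = math.hypot(x0 - goal[0], y0 - goal[1])
    d_curr = math.hypot(x1 - goal[0], y1 - goal[1])
    r = w_progress * (d_prev - d_curr)
    # Path-efficiency term, Eq. (16): penalize the 3D step length of Eq. (17).
    r -= w_path * math.sqrt((x1 - x0)**2 + (y1 - y0)**2 + (z1 - z0)**2)
    r += r_step                          # per-step penalty
    if slope > theta_max:
        r += r_slope                     # steep move allowed but penalized
    if out_of_bounds:
        r += r_coll                      # boundary violation
    if reached:
        r += r_goal
    return r
```

For example, a flat unit step straight toward the goal earns the progress reward minus the path and step penalties, while a step up a slope steeper than θ_max additionally incurs r_slope.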

5. Experiments

5.1. Experimental Setup

5.1.1. Datasets

The experiments are conducted on a custom set of 20 synthetic 3D terrain maps, each with a size of 100×100 grid cells. The terrains are designed to simulate complex uneven environments with varying slopes and elevations. Each map contains a mixture of topographic features, including randomly generated high-elevation regions resembling mountains that are completely impassable, moderate hilly areas that can be navigated with careful path planning, and relatively flat regions with small random perturbations that introduce subtle irregularities. This diverse combination of terrain features ensures that the agent encounters a wide range of navigation challenges, from entirely blocked paths to subtle elevation changes requiring precise movement. Figure 3 illustrates four samples from the dataset. Although the terrains are synthetically generated, they are designed to cover a wide range of elevation patterns and slope distributions, serving as controlled testbeds for evaluating global planning performance under uneven terrain conditions.

Figure 3. Four samples from the terrain datasets.

5.1.2. Evaluation Metrics

To comprehensively evaluate the performance of DSAC-ICM, we employ multiple metrics:

  • Average Return (AR): The mean cumulative reward per episode, measured over training iterations. This reflects the learning progress and convergence of the policy.

  • Path Cost (PC): The total 3D distance traversed by the agent along the planned trajectory. Unlike 2D horizontal distance, this accounts for elevation changes and better reflects energy expenditure and traversal effort.

  • Path Cost Ratio (PCR): The ratio between the agent’s path cost and the optimal path cost computed by Dijkstra’s algorithm in the same 3D terrain. Values closer to 1 indicate near-optimal paths.

  • Planning Time Ratio (PTR): The ratio of the agent’s planning time to Dijkstra’s planning time. Lower values correspond to higher efficiency.

  • Cost-Time Tradeoff (CTT): A novel metric introduced in this work that combines path cost and planning time to provide a single measure of efficiency:
    CTT = PCR · log(1 + PTR). (19)
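For concreteness, the CTT metric can be computed as follows; the sanity check for the reference planner (PCR = PTR = 1 gives CTT = ln 2 ≈ 0.693) matches the Dijkstra row of Table 2:

```python
import math

def ctt(pcr, ptr):
    """Cost-Time Tradeoff, Eq. (19): CTT = PCR * log(1 + PTR)."""
    return pcr * math.log1p(ptr)

print(round(ctt(1.0, 1.0), 3))   # Dijkstra reference row: 0.693
```

Because the logarithm compresses the time term, CTT rewards planners that cut planning time sharply while keeping path cost near optimal.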

5.2. Experimental Settings

This section summarizes the hyperparameters, software setup, and implementation details used for training the DSAC-ICM agent and all baseline algorithms. All methods are implemented in Python using PyTorch, and experiments are conducted under the same environment for fair comparison. The key training, network, and exploration parameters are listed in Table 1. These include the actor–critic learning rates, ICM configuration, replay buffer size, discount factors, update frequencies, and main architectural choices.

Table 1.

Summary of key experimental hyperparameters for DSAC-ICM.

Parameter Value
Discount factor γ 0.99
Target update rate τ 0.001
Actor learning rate 3×10⁻⁴
Critic learning rate 3×10⁻⁴
Entropy coefficient α 0.2
Batch size 256
Replay buffer size 20,000
Warm-up steps 2000
Training frequency Every step
Gradient steps per update 1
Total training timesteps 300,000
ICM learning rate 3×10⁻⁴
ICM feature dimension 64
ICM hidden dimension 64
ICM forward loss weight β 0.2
Intrinsic reward scale η 0.05
Intrinsic reward weight λ 0.005
Value network architecture MLP [128, 128], GELU
Policy network architecture MLP [128, 128], GELU
Action distribution Tanh Gaussian

The ICM forward loss weight β determines the contribution of the forward model prediction error in the total ICM loss. The intrinsic reward is defined as r_t^int = η ‖ φ̂_{t+1} − φ(s_{t+1}) ‖², where η scales the prediction error. The intrinsic reward weight λ balances exploration and task-driven optimization in r_t^total = r_t^ext + λ r_t^int. These three parameters are critical for guiding exploration in sparse-reward environments.

All experiments run on a workstation equipped with an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90 GHz, a single NVIDIA Tesla T4 GPU with 16 GB memory, and Ubuntu 22.04.4 LTS. The implementation uses Python 3.12.0 and PyTorch 2.5.1 with CUDA 12.4 support. To ensure reproducibility, all experiments are conducted under three different random seeds, and the results presented in the figures and tables represent the mean ± standard deviation across these runs.

5.3. Experimental Results

This section presents a comprehensive evaluation of the proposed DSAC-ICM framework for 3D uneven terrain path planning. The experiments are conducted on multiple terrain datasets. Figure 4 shows a representative 2D projection of one uneven terrain instance. The black dot denotes the start point, and the orange star marks the target. The red line indicates the trajectory generated by our DSAC-ICM planner. As shown in the figure, the agent successfully identifies a feasible and collision-free path that reaches the goal while avoiding steep mountainous regions. This behavior reflects the agent’s ability to respect the maximum traversable slope constraint and optimize long-horizon navigation under complex elevation variations.

Figure 4. Path planning result.

Table 2 reports the quantitative comparison among classical algorithms (Dijkstra, A*, RRT), the multi-objective 3D-M method, and our approach. Three metrics are considered: Path Cost Ratio (PCR), Planning Time Ratio (PTR), and the proposed Cost-Time Tradeoff (CTT). Dijkstra is used as the reference optimal planner in terms of path cost. A* achieves similar optimality (PCR = 1.000) and improves planning time efficiency (PTR = 0.196), but its CTT remains limited due to its high dependence on heuristic expansions. RRT exhibits significantly higher path cost (PCR = 1.159) and considerably slower planning time (PTR = 0.870), resulting in a large CTT value. The 3D-M method achieves relatively fast planning (PTR = 0.098), but this advantage comes at the expense of path quality, resulting in a higher path cost ratio (PCR = 1.242) and a limited cost-time tradeoff (CTT = 0.106).

Table 2.

Comparison of Path Planning Performance Across Methods.

Methods PCR PTR CTT
Dijkstra 1.000 1.000 0.693
A* 1.000 0.196 0.145
RRT 1.159 0.870 0.606
3D-M 1.242 0.098 0.106
DSAC-ICM (Ours) 1.137 0.056 0.055

In contrast, DSAC-ICM achieves a favorable balance between path optimality and computational efficiency. Although its average PCR (1.137) is slightly higher than that of A*, its planning time is drastically reduced (PTR = 0.056). The combined effect yields the lowest CTT score (0.055), indicating that DSAC-ICM offers the most efficient overall tradeoff between path quality and computation. The stable training process of the underlying DSAC framework further supports this performance, as shown in Figure 5: in the early training stage (fewer than 50,000 steps), the return fluctuates at a low level, reflecting an unstable policy during initial exploration; beyond roughly 50,000 steps, the return rises rapidly and converges to a stable range close to 0, indicating that the policy has matured. This performance can be attributed to the intrinsic curiosity module, which enhances exploration in unfamiliar topographic regions, and the distributional critic, which improves value estimation under elevation-driven uncertainty.

Figure 5. Training return curve of DSAC-ICM. The horizontal axis represents training steps, and the vertical axis denotes the training return.

Overall, these results demonstrate that DSAC-ICM not only generates feasible and smooth trajectories on highly uneven 3D terrains but also achieves competitive optimality with significantly lower planning cost. This makes it suitable for real-time autonomous navigation in outdoor environments with complex elevation profiles.

Following the above results, we further compare the learning performance of DSAC-ICM with several mainstream deep reinforcement learning algorithms, including A2C [34], PPO [35], SAC [15], and M-RL [31]. Since the path planning task is formulated with a continuous action space, we implement Munchausen Reinforcement Learning using its SAC-based variant, ensuring a fair comparison under the same continuous-control setting. To provide a direct comparison of policy learning efficiency, we plot the learning curves with the number of environment interaction steps on the horizontal axis (steps) and the episodic return on the vertical axis (return). This visualization highlights how quickly each algorithm discovers effective policies and how stable their final performance is.

As illustrated in Figure 6, all learning-based methods except A2C achieve positive returns during training, indicating successful policy learning in the 3D uneven-terrain environment. Among them, DSAC-ICM (blue) converges fastest and most consistently to the highest return regime, exhibiting both strong final performance and stable learning behavior. SAC (red) and M-RL (purple) also achieve positive returns, but with either slower convergence or higher performance variance than DSAC-ICM. PPO (green) improves steadily but plateaus at a substantially lower return level, while A2C (orange) fails to obtain meaningful rewards within the same training budget.

Figure 6. Training return curves of different deep reinforcement learning algorithms. The horizontal axis denotes the number of environment interaction steps, and the vertical axis represents the episodic return. The curves correspond to DSAC-ICM (blue), A2C (orange), PPO (green), SAC (red), and M-RL (purple), respectively.

Specifically, DSAC-ICM exhibits the fastest convergence and the best final performance: after a short exploration phase (steps < 50,000), it rises rapidly and stabilizes at the highest return plateau. SAC also converges to a positive-return level, but more slowly, and its asymptotic return is lower than that of DSAC-ICM. Notably, M-RL shows faster initial improvement than SAC, which can be attributed to Munchausen reward shaping mitigating overly optimistic value updates under the sparse and noisy terrain rewards.
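The Munchausen shaping mentioned above can be sketched in a few lines: the environment reward is augmented with a scaled, clipped log-policy term. This is an illustrative sketch only; the coefficients `alpha`, `tau`, and the clipping bound `l0` below are common defaults from the Munchausen RL literature, not the settings used in this paper.

```python
def munchausen_reward(reward, log_pi, alpha=0.9, tau=0.03, l0=-1.0):
    """Augment an environment reward with the scaled, clipped
    log-policy bonus r + alpha * tau * max(log_pi, l0), as proposed
    in Munchausen RL (Vieillard et al., 2020)."""
    return reward + alpha * tau * max(log_pi, l0)

# A confidently chosen action (log_pi near 0) is barely penalized,
# while an unlikely action is penalized by at most alpha * tau * |l0|.
r_sure = munchausen_reward(1.0, -0.01)
r_rare = munchausen_reward(1.0, -10.0)
```

Because the bonus is non-positive and bounded by the clip, it tempers optimistic value updates without overwhelming the sparse extrinsic signal, which is consistent with the early-stage behavior observed for M-RL.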

Overall, these results suggest that the combination of distributional value estimation and intrinsic curiosity enables DSAC-ICM to explore more efficiently and learn more reliably under elevation-induced reward uncertainty.
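For intuition about the distributional side of this combination, a minimal Gaussian-return sketch is given below. It is illustrative rather than the paper's implementation: DSAC-style critics output a learned mean and standard deviation of the return and minimize the negative log-likelihood of the TD target under that distribution.

```python
import math

def gaussian_nll(mean, std, td_target):
    """Per-sample negative log-likelihood of a TD target under the
    critic's Gaussian return distribution N(mean, std^2) -- a
    DSAC-style distributional critic objective."""
    var = std * std
    return 0.5 * math.log(2 * math.pi * var) + (td_target - mean) ** 2 / (2 * var)
```

Minimizing this loss fits both the center and the spread of returns, so in high-uncertainty states (e.g., elevation-noisy transitions) the critic can widen `std` instead of committing to an overestimated scalar Q-value.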

In addition to the learning curves shown in Figure 6, the quantitative comparison of final episodic returns across reinforcement learning algorithms is summarized in Table 3. Each value represents the mean ± standard deviation computed over three independent runs with different random seeds.

Table 3.

Comparison of Training Performance Among Reinforcement Learning Algorithms.

Algorithm	Final Return (Mean ± Std)
SAC	59.91 ± 2.69
M-RL	59.64 ± 6.22
PPO	35.50 ± 23.73
A2C	−1419.06 ± 770.83
DSAC-ICM (Ours)	64.17 ± 2.23

To further investigate the contribution of the Intrinsic Curiosity Module (ICM), we conduct an ablation study by training DSAC with and without ICM. Similar to the previous setup, we plot the learning curves using steps (horizontal axis) and return (vertical axis) for comparison.

As shown in Figure 7, the two variants exhibit distinct learning dynamics. The With ICM variant (blue curve) fluctuates more strongly in the early exploration stage (even dropping to around −900), but it recovers quickly and converges to a stable high-return plateau (around 60) after approximately 50,000 steps. In contrast, the Without ICM variant (orange curve) fluctuates less in the early stage but suffers from frequent oscillations and occasional sharp drops later in training, particularly around 150,000 steps, indicating less robust exploration and policy optimization.

Figure 7. Training return curves of DSAC in the ICM ablation study. The horizontal axis denotes training steps, and the vertical axis represents episodic return; the blue curve corresponds to the framework With ICM, and the orange curve to Without ICM.

Table 4 further confirms the benefit of ICM in terms of final return: DSAC-ICM achieves a higher mean return (64.17) than DSAC without ICM (61.74). This contrast indicates that the ICM plays two core roles: it equips the agent to tolerate temporary exploration setbacks (reflected in the rapid recovery from the deep early dips in return) and it enhances the robustness of policy learning (reflected in the stable convergence of the With ICM variant). This confirms that ICM effectively promotes exploration in high-uncertainty terrain regions and contributes substantially to overall policy quality.

Table 4.

Comparisons of final episodic returns in the DSAC ablation study (mean ± std).

Algorithm Final Return (Mean ± Std)
DSAC 61.74 ± 0.82
DSAC-ICM (Ours) 64.17 ± 2.23

Among the hyperparameters introduced by the Intrinsic Curiosity Module (ICM), the intrinsic reward weight λ plays a particularly important role, as it directly controls the balance between curiosity-driven exploration and task-oriented optimization. Compared to other hyperparameters, inappropriate choices of λ can more easily lead to either insufficient exploration or unstable learning dynamics. Therefore, we focus our sensitivity analysis on this parameter.

Specifically, we evaluate three representative values, λ ∈ {0.001, 0.005, 0.01}, while keeping all other hyperparameters unchanged.
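Concretely, λ scales the ICM prediction-error bonus added to the extrinsic reward. The following minimal sketch uses illustrative names (the `phi_*` arguments stand for encoded state features; the helper itself is hypothetical, not the paper's code):

```python
def total_reward(r_ext, phi_next_pred, phi_next, lam=0.005):
    """Combine the extrinsic reward with the ICM intrinsic bonus:
    half the squared error between the forward model's predicted
    next-state feature and the true encoded feature, scaled by lam."""
    r_int = 0.5 * sum((p - t) ** 2 for p, t in zip(phi_next_pred, phi_next))
    return r_ext + lam * r_int
```

With a perfectly predictable transition the bonus vanishes, while a surprising transition adds λ times half the squared feature error; this makes clear why too small a λ yields little exploration pressure and too large a λ injects noisy reward that destabilizes training.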

As shown in Figure 8, when λ is too small (λ=0.001), the intrinsic reward provides insufficient exploration guidance, resulting in slower improvement and a lower final return. When λ is too large (λ=0.01), exploration becomes overly aggressive and introduces larger training fluctuations. In contrast, λ=0.005 achieves a better balance between exploration and exploitation, yielding more stable convergence and the highest final return.

Figure 8. Training return curves of DSAC-ICM with different intrinsic reward weights λ. The horizontal axis denotes training steps, and the vertical axis represents episodic return; solid lines denote the mean return and shaded regions indicate one standard deviation across three runs.

Note that λ = 0.005 is used as the default setting in all main experiments; its final return in Table 5 therefore matches the DSAC-ICM results reported in Tables 3 and 4.

Table 5.

Effect of intrinsic reward weight λ on final performance (mean ± std).

Setting	Final Return (Mean ± Std)
λ = 0.001	58.68 ± 6.47
λ = 0.01	62.16 ± 1.71
λ = 0.005 (Ours)	64.17 ± 2.23

6. Conclusions

We presented DSAC-ICM, a novel DRL framework designed to mitigate the inherent challenges of Q-value overestimation and inefficient exploration for path planning in 3D uneven terrains. By integrating distributional value learning with an Intrinsic Curiosity Module (ICM), DSAC-ICM achieves both algorithmic stability and exploration robustness.

Our experimental results lead to four main conclusions. First, DSAC-ICM successfully enables the agent to learn robust strategies, generating high-quality, physically constrained paths in complex terrains. Second, compared to traditional planners such as A*, our method achieves a superior trade-off between path quality and computational cost, making it suitable for real-time navigation. Third, DSAC-ICM significantly outperforms the mainstream DRL baselines in convergence speed and asymptotic return. Finally, the ablation studies confirm that the ICM is crucial for enhancing exploration efficiency and accelerating convergence.

In summary, DSAC-ICM provides an efficient and stable solution for path planning in complex 3D environments. Future work will extend this framework toward more realistic navigation scenarios. Specifically, we plan to integrate DSAC-ICM as the global planner within a hierarchical planning architecture, where a local planner or reactive controller handles dynamic obstacles, perception uncertainty, and real-time disturbances. In addition, incorporating online terrain perception and validating the Sim-to-Real transferability on real robotic platforms will be important directions for future research.

Acknowledgments

We thank the University of Electronic Science and Technology of China for supporting this research work.

Author Contributions

Conceptualization: Y.Z. and F.L.; Methodology: Y.Z., F.L. and Z.L.; Software: Y.Z. and Z.L.; Validation: X.J.; Formal analysis: Y.Z. and F.L.; Investigation: X.J.; Resources: F.L.; Data curation: Z.L.; Writing—original draft preparation: Y.Z.; Writing—review and editing: F.L., Z.L. and X.J.; Visualization: Y.Z. and X.J.; Supervision: G.Y.; Project administration: G.Y. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Funding Statement

This research received no external funding.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

References

  • 1.Liu Y., Liu L., Zheng Y., Liu Y., Dang F., Li N., Ma K. Embodied navigation. Sci. China Inf. Sci. 2025;68:141101. doi: 10.1007/s11432-024-4303-8. [DOI] [Google Scholar]
  • 2.Gao R., Liu M., Du J., Bao Y., Wu X., Liu J. Research on a Cooperative Grasping Method for Heterogeneous Objects in Unstructured Scenarios of Mine Conveyor Belts Based on an Improved MATD3. Sensors. 2025;25:6824. doi: 10.3390/s25226824. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Liu L., Wang X., Yang X., Liu H., Li J., Wang P. Path planning techniques for mobile robots: Review and prospect. Expert Syst. Appl. 2023;227:120254. doi: 10.1016/j.eswa.2023.120254. [DOI] [Google Scholar]
  • 4.Karur K., Sharma N., Dharmatti C., Siegel J.E. A survey of path planning algorithms for mobile robots. Vehicles. 2021;3:448–468. doi: 10.3390/vehicles3030027. [DOI] [Google Scholar]
  • 5.Dijkstra E.W. A note on two problems in connexion with graphs. Numer. Math. 1959;1:269–271. doi: 10.1007/BF01386390. [DOI] [Google Scholar]
  • 6.Hart P.E., Nilsson N.J., Raphael B. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 1968;4:100–107. doi: 10.1109/TSSC.1968.300136. [DOI] [Google Scholar]
  • 7.LaValle S. Rapidly-Exploring Random Trees: A New Tool for Path Planning. Iowa State University; Ames, IA, USA: 1998. Research Report 9811. [Google Scholar]
  • 8.Bai X., Yan W., Ge S.S. Efficient Task Assignment for Multiple Vehicles with Partially Unreachable Target Locations. IEEE Internet Things J. 2021;8:3730–3742. doi: 10.1109/JIOT.2020.3025797. [DOI] [Google Scholar]
  • 9.Tao X., Lang N., Li H., Xu D. Path Planning in Uncertain Environment With Moving Obstacles Using Warm Start Cross Entropy. IEEE/ASME Trans. Mechatron. 2022;27:800–810. doi: 10.1109/TMECH.2021.3071723. [DOI] [Google Scholar]
  • 10.Huang G., Yuan X., Shi K., Liu Z., Wu X. A 3-D Multi-Object Path Planning Method for Electric Vehicle Considering the Energy Consumption and Distance. IEEE Trans. Intell. Transp. Syst. 2022;23:7508–7520. doi: 10.1109/TITS.2021.3071319. [DOI] [Google Scholar]
  • 11.Jin W., Wang N., Zhang L., Tian X., Shi B., Zhao B. A Review of AI-Driven Automation Technologies: Latest Taxonomies, Existing Challenges, and Future Prospects. Comput. Mater. Contin. 2025;84:3961. doi: 10.32604/cmc.2025.067857. [DOI] [Google Scholar]
  • 12.Shakya A.K., Pillai G., Chakrabarty S. Reinforcement learning algorithms: A brief survey. Expert Syst. Appl. 2023;231:120495. doi: 10.1016/j.eswa.2023.120495. [DOI] [Google Scholar]
  • 13.Lillicrap T.P., Hunt J.J., Pritzel A., Heess N., Erez T., Tassa Y., Silver D., Wierstra D. Continuous Control with Deep Reinforcement Learning; Proceedings of the International Conference on Learning Representations (ICLR); San Juan, Puerto Rico. 2–4 May 2016; [DOI] [Google Scholar]
  • 14.Fujimoto S., van Hoof H., Meger D. Addressing Function Approximation Error in Actor-Critic Methods; Proceedings of the 35th International Conference on Machine Learning (ICML); Stockholm, Sweden. 10–15 July 2018. [Google Scholar]
  • 15.Haarnoja T., Zhou A., Abbeel P., Levine S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor; Proceedings of the International Conference on Machine Learning, PMLR; Stockholm, Sweden. 10–15 July 2018; pp. 1861–1870. [Google Scholar]
  • 16.Watkins C.J.C.H. Ph.D. Thesis. King’s College, University of Cambridge; Cambridge, UK: 1989. Learning from Delayed Rewards. [Google Scholar]
  • 17.Sutton R.S., Barto A.G. Reinforcement Learning: An Introduction. 2nd ed. MIT Press; Cambridge, MA, USA: 2018. [Google Scholar]
  • 18.Ladosz P., Weng L., Kim M., Oh H. Exploration in deep reinforcement learning: A survey. Inf. Fusion. 2022;85:1–22. doi: 10.1016/j.inffus.2022.03.003. [DOI] [Google Scholar]
  • 19.Duan J., Guan Y., Li S.E., Ren Y., Sun Q., Cheng B. Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors. IEEE Trans. Neural Netw. Learn. Syst. 2021;33:6584–6598. doi: 10.1109/TNNLS.2021.3082568. [DOI] [PubMed] [Google Scholar]
  • 20.Duan J., Wang W., Xiao L., Gao J., Li S.E., Liu C., Zhang Y.Q., Cheng B., Li K. Distributional soft actor-critic with three refinements. IEEE Trans. Pattern Anal. Mach. Intell. 2025;47:3935–3946. doi: 10.1109/TPAMI.2025.3537087. [DOI] [PubMed] [Google Scholar]
  • 21.Pathak D., Agrawal P., Efros A.A., Darrell T. Curiosity-driven Exploration by Self-supervised Prediction; Proceedings of the 34th International Conference on Machine Learning (ICML), PMLR; Sydney, NSW, Australia. 6–11 August 2017; pp. 2778–2787. [Google Scholar]
  • 22.Karaman S., Frazzoli E. Sampling-based algorithms for optimal motion planning. Int. J. Robot. Res. 2011;30:846–894. doi: 10.1177/0278364911406761. [DOI] [Google Scholar]
  • 23.Garaffa L.C., Basso M., Konzen A.A., de Freitas E.P. Reinforcement learning for mobile robotics exploration: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2021;34:3796–3810. doi: 10.1109/TNNLS.2021.3124466. [DOI] [PubMed] [Google Scholar]
  • 24.van Essen R., van Henten E., Kootstra G. UAV-based path planning for efficient localization of non-uniformly distributed weeds using prior knowledge: A reinforcement-learning approach. Comput. Electron. Agric. 2025;237:110651. doi: 10.1016/j.compag.2025.110651. [DOI] [Google Scholar]
  • 25.Bayerlein H., Theile M., Caccamo M. Proceedings of the 2020 IEEE Global Communications Conference (GLOBECOM), Taipei, Taiwan, 7–11 December 2020. IEEE; New York, NY, USA: 2020. UAV path planning for wireless data harvesting: A deep reinforcement learning approach. [Google Scholar]
  • 26.Chen P., Liu Q., Li Y., Ma S. Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. IEEE; New York, NY, USA: 2024. An Environmental-Complexity-Based Navigation Method Based on Hierarchical Deep Reinforcement Learning; pp. 5119–5125. [Google Scholar]
  • 27.Weerakoon K., Sathyamoorthy A.J., Patel U., Manocha D. Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022. IEEE; New York, NY, USA: 2022. TERP: Reliable planning in uneven outdoor environments using deep reinforcement learning; pp. 9447–9453. [Google Scholar]
  • 28.Mnih V., Kavukcuoglu K., Silver D., Graves A., Antonoglou I., Wierstra D., Riedmiller M. Playing atari with deep reinforcement learning. arXiv. 2013 doi: 10.48550/arXiv.1312.5602. [DOI] [Google Scholar]
  • 29.van Hasselt H. Double Q-learning; Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Vancouver, BC, Canada. 6–9 December 2010; pp. 2613–2621. [Google Scholar]
  • 30.van Hasselt H., Guez A., Silver D. Deep Reinforcement Learning with Double Q-learning; Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI); Phoenix, AZ, USA. 12–17 February 2016; pp. 2094–2100. [Google Scholar]
  • 31.Vieillard N., Pietquin O., Geist M. Munchausen Reinforcement Learning; Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Virtual. 6–12 December 2020; pp. 4237–4248. [Google Scholar]
  • 32.Yang J., Liu Y., Zhang J., Guan Y., Shao Z. Mobile Robot Navigation Based on Intrinsic Reward Mechanism with TD3 Algorithm. Int. J. Adv. Robot. Syst. 2024;21:17298806241292893. doi: 10.1177/17298806241292893. [DOI] [Google Scholar]
  • 33.Li J., Gajane P. Curiosity-Driven Exploration in Sparse-Reward Multi-Agent Reinforcement Learning. arXiv. 2023 doi: 10.48550/arXiv.2302.10825. [DOI] [Google Scholar]
  • 34.Mnih V., Badia A.P., Mirza M., Graves A., Lillicrap T.P., Harley T., Silver D., Kavukcuoglu K. Asynchronous methods for deep reinforcement learning; Proceedings of the International Conference on Machine Learning, PMLR; New York, NY, USA. 20–22 June 2016; pp. 1928–1937. [Google Scholar]
  • 35.Schulman J., Wolski F., Dhariwal P., Radford A., Klimov O. Proximal Policy Optimization Algorithms. arXiv. 2017 doi: 10.48550/arXiv.1707.06347. [DOI] [Google Scholar]
