2023 Mar 30:1–9. Online ahead of print. doi: 10.1007/s42044-023-00140-1

A study on multi-agent reinforcement learning for autonomous distribution vehicles

Serap Ergün
PMCID: PMC10061394

Abstract

A self-driving automobile, also known as an autonomous vehicle, can observe its surroundings and maneuver without the assistance of a driver, thanks to software algorithms and a fully automated driving system. As a result, the car can react to its environment in a way comparable to a human driver. Road regulations and limitations are required to guarantee the security and effectiveness of delivery services, and to prevent accidents and damage caused by technology failures. This paper formulates the Autonomous Delivery Vehicle optimization problem and proposes a Multi-Agent Reinforcement Learning approach that makes use of analytically obtained shortest-path information. This method allows multiple agents to be trained to work together and improve delivery. Using this strategy, the efficiency and security of autonomous delivery vehicles can be markedly enhanced.

Keywords: Reinforcement learning, Multi-agent systems, Autonomous vehicles, Shortest-path information

Introduction

An autonomous vehicle can sense its surroundings and operate without human intervention. A human passenger is not required to take control of the car at any moment, or even to be inside the vehicle at all. An autonomous vehicle can go everywhere a traditional car can travel and do everything a qualified human driver can do. Autonomous vehicles need sensors, actuators, sophisticated algorithms, machine learning systems, and powerful processors.

Based on several sensors positioned throughout the vehicle, autonomous vehicles build and update a map of their surroundings. Radar sensors keep track of the whereabouts of adjacent automobiles. Traffic lights, road signs, other vehicles, and pedestrians are all detected by video cameras (Fig. 1). Lidar (light detection and ranging) sensors estimate distances, find road boundaries, and recognize lane markers by bouncing light pulses off the environment around the car. When parking, ultrasonic sensors on the wheels detect curbs and other cars [1].

Fig. 1 How an AD works

Advanced software then analyzes all of this sensory data, plans a path, and sends instructions to the actuators in the car that control steering, braking, and acceleration. The software adheres to traffic regulations and avoids obstructions through the use of hard-coded rules, obstacle-avoidance algorithms, predictive modeling, and object identification [1, 2].

The development of automated driving technology has made smart urban transportation possible. With the growth of the smart city agenda, automated vehicles are currently a hot topic. A potentially disruptive breakthrough in the context of urban mobility is autonomous vehicle delivery [3].

Considering the growth of the e-commerce sector, demand for Autonomous Distribution Vehicles (ADVs) has become significant, and it is expected to continue increasing in the coming years. The COVID-19 pandemic in particular has shown that contactless delivery is reliable and low-risk. Autonomous delivery vehicles are predicted to facilitate contactless delivery, the current trend due to the ongoing pandemic, and are expected to become even more popular in the future as new variants emerge.

Also, the number of traffic accidents could be considerably reduced by autonomous vehicles. This presumption is based on the observation that autonomous cars lessen the impact of the human factor on the probability of a traffic collision [4].

In this paper, the challenge of ADV is addressed as a Multi-Agent Pathfinding (MAPF) problem with collision avoidance. The MAPF problem involves controlling many agents to reduce overall costs or increase rewards. It is an NP-hard problem for which numerous solutions have been suggested. Given a graph, it seeks collision-free paths for a team of agents while minimizing the makespan or the total cost of the paths. Various approaches have been presented for dealing with different constraints, although many of them rely on discrete time and a four-connected grid, involving numerous agents and vast maps.

This problem is complex because time is discretized and there are five types of agent action: movement up, down, left, and right, as well as stopping. The routes are connected at all angles, the travel duration per edge is constant, and each agent completes one action before moving on to the next on a map based on the actual city area that the ADV fleet must cover. Deciding and initiating actions takes varying amounts of time. To overcome these challenges, Multi-Agent Reinforcement Learning (MARL) with constructive constraints is proposed. MARL is a branch of Reinforcement Learning (RL) that examines how multiple learning agents interact in a common setting; each agent acts to further its own interests and is driven by its own rewards. The suggested strategy is compared with recently developed heuristics and high-quality benchmark policies from the literature. The study's main difference from other studies in the literature is its focus on conflict avoidance.

This research provides significant contributions to the vehicle routing literature in both problem and method, including:

Problem A fast and effective solution to an important autonomous vehicle routing problem is proposed. The study thoroughly evaluates the approach's functionality and effectiveness in various instance configurations. It demonstrates that the suggested solution is resilient to variations in timing and collision avoidance distributions.

Method The paper formulates the ADV optimization problem by compiling the requirements of the operation-plan optimization problem necessary for safe and efficient delivery by AVs. It also demonstrates that MARL can be successfully implemented using analytically derived shortest-path information.

The paper is organized as follows: Sect. 2 provides a review of recent studies on AV. Section 3 introduces some preliminaries to ADV and the algorithm. Section 4 discusses the design of the system using MARL. Section 5 presents the evaluation of the study. Finally, Sect. 6 presents the conclusion and outlook.

Literature survey

In this section, a review of relevant literature on autonomous driving and routing is presented. Firstly, the literature on AVs is reviewed, and then the existing applications of RL in AVs are presented.

Filos et al. [5] study autonomous driving agents under distribution shift and draw attention to the shortcomings of current methods when planning in novel driving scenes. They suggest an epistemic-uncertainty-aware strategy and provide experimental evidence by answering two fundamental questions: can their way of estimating epistemic uncertainty be used to effectively query an expert for online feedback, and can an online adaptation mechanism increase the success rate?

A new cooperative distribution model is built by Guo et al. (2022) that takes into account the various scenarios in which automated driving cars could serve as both distribution tools and distribution vehicles [2]. To incorporate the self-distribution of autonomous vehicles, reduce transportation costs, and increase overall efficiency, they develop an improved adaptive large neighborhood search algorithm to solve the problem, using splitter and merger heuristics to help produce the best results. Their evaluation of the long distances that autonomous driving vehicles may travel on their own shows that the suggested model's cost reduction and quality enhancement become more significant as the distance increases.

In [7], the scenario where AVs form a cooperative formation and use a centralized optimal controller is studied by Li et al. (2022). The optimal formation of numerous AVs is investigated using set-function optimization. The results of numerous numerical experiments show that uniform distribution and platoon formation are the two most common optimal formations. Interestingly, when human-driven vehicles (HDVs) exhibit poor string-stability behavior, platooning may not significantly improve traffic efficiency. These findings suggest that there are additional options for AV cooperative formation in real-world mixed-traffic flows, beyond platooning.

Li et al. (2022) address the topic of driving safety in [6]. A deep RL-based lane-change decision-making framework is suggested to identify a risk-aware driving strategy with the lowest projected risk, to assure driving safety. A probabilistic model-based method evaluates driving risk using location uncertainty and distance-based safety criteria, and a risk-aware decision-making system using deep RL is proposed to identify the course of action with the lowest expected risk. The methods are assessed using the CARLA simulator in two scenarios, with static obstacles and with dynamically moving cars. The results show that these techniques can be used to create sound, safe driving strategies and that they yield better driving performance than earlier approaches.

According to Prathiba et al. (2021) [8], cooperative driving of AVs, a groundbreaking vehicle-platooning initiative, represents the next stage in automotive technology by lowering the risk of accidents and the amount of fuel and energy needed during transportation. The authors argue that classic machine learning-based platooning approaches fail to control the policy given AVs' dynamic features. They suggest a hybrid deep RL and genetic algorithm for smart platooning of AVs. The deep RL approach supports highly dynamic platoon environments while addressing computational complexity, and the genetic algorithm mitigates deep RL's slow convergence and provides long-term performance. The simulation results demonstrate that smart platooning forms and maintains platoons efficiently while reducing traffic congestion and fuel consumption.

According to Yun et al. (2021) [9], one of the upcoming on-demand transportation networks is the urban aerial mobility system, which comprises drone taxis and air vehicles. Electric vertical take-off and landing (eVTOL) aircraft are one such technology: they fly to the locations of the passengers, pick them up, and take them to their destinations. The authors suggest a distributed deep RL method in which the agents are modeled as vehicles that choose the best routes for transporting passengers while considering factors such as passenger behavior, vehicle collisions, and battery life.

A model of traffic flow management is presented by Mushtaq et al. (2021) in [10], based on a novel two-phase strategy for optimizing traffic flow during peak hours with AVs. An adaptive traffic light management system is developed using deep RL to improve traffic flow at road crossings during heavy traffic periods. A smart re-routing strategy is used to load-balance traffic to other routes and avoid congested road crossings for traffic approaching intersections. The experimental evaluation of the suggested approach, through simulations, demonstrates up to a 31% increase in performance efficiency over conventional settings utilizing pre-timed signals and without re-routing.

The hidden-space RL method for autonomous vehicles at highway on-ramps is described by Wang et al. (2021) to enable interpretable decision-making [11]. In this approach, a combined hidden Markov model and Gaussian mixture regression model is employed. Wang et al. assert that a standard decision-making process, which uses high-dimensional and task-unaware input, cannot understand the environment; they therefore use the model to obtain an interpretable state that provides semantic information and environmental understanding. The latent model is used to extract underlying task-relevant data and reduce the dimension of the interpretable state, in a framework that unifies representation learning with the deep RL approach. Experimental results reveal that this method achieves a good balance between driving safety and efficiency in the difficult situation of merging at highway on-ramps.

Brittain et al. (2021) propose a deep MARL framework for identifying and resolving a variable number of airplanes in a high-density, stochastic, and dynamic route sector. The proposed framework uses a customized version of proximal policy optimization incorporating an attention network to achieve high traffic throughput under uncertainty, allowing agents to access different aircraft information in the sector. During the training process, a single neural network is taught and shared by all agents, employing a centralized learning and decentralized execution model [12]. To validate the proposed framework, the authors utilize three case studies in the BlueSky air traffic simulator. The numerical findings indicate that the proposed framework reduces offline training time, improves safe separation performance, and produces a more effective policy.

Tran and Bae (2021) present an advanced deep RL technique in [3] that investigates the impact of leading autonomous vehicles in a mixed-traffic scenario on the urban network. They propose a set of hyperparameters to improve the performance of deep RL agents. They examine one of the most significant autonomous vehicle experiments in the metropolitan network, with varying levels of autonomous vehicle penetration, using both fully manual and leading manual vehicle experiments to evaluate the benefits of leading autonomous vehicles. To demonstrate the superiority of the proposed hyperparameter, they compare proximal policy optimization with a clipped objective to proximal policy optimization with an adaptive penalty. As the penetration rate of autonomous vehicles increases, the proposed approach becomes more efficient.

Research has demonstrated the effectiveness of a multi-agent approach for vehicle routing [3, 12]. Furthermore, studies [7, 8, 10] have shown the effectiveness of using RL for learning in complex systems such as cooperative driving. However, upon closer examination, only [6] considered collision situations. To address this gap, a MARL approach is developed here to enable collaborating autonomous vehicles to avoid collisions.

Unlike the research analyzed in Sect. 2, the problem addressed in this study involves discrete time and five types of agent action: movement up, down, left, and right, as well as stopping. The routes are connected at all angles, travel duration per edge is constant, and each agent completes one action before moving on to the next on a map based on the actual city area that the ADV fleet must cover. Deciding and initiating actions takes varying amounts of time. The proposed strategy is compared with recently developed heuristics and high-quality benchmark policies from the literature. The main difference between this study and others in the literature is its focus on conflict avoidance; other studies have focused on different aspects, such as optimizing traffic flow or reducing travel time. By focusing on conflict avoidance, the proposed strategy aims to improve safety and reduce the likelihood of accidents in the managed area.

ADV

The undirected graph G=(V,E) is used in ADV to describe the AV's movable area, as shown in Fig. 2. The continuous value of the edge E_ab that connects nodes V_a and V_b represents the distance between them. Two AVs cannot be on the same node or traverse the same edge at the same time. The block system, modeled on railway collision prevention, prohibits two vehicles from traversing the same edge simultaneously, whether in opposite directions or in the same direction.
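The graph model above can be sketched as a plain adjacency map. This is a minimal illustration, not the paper's implementation; the node labels and edge lengths are made up for the example.

```python
class ADVGraph:
    """Undirected graph G=(V,E) describing the AVs' movable area."""

    def __init__(self):
        self.adj = {}  # node -> {neighbor: edge length}

    def add_edge(self, a, b, dist):
        # undirected: store both directions with the same length
        self.adj.setdefault(a, {})[b] = dist
        self.adj.setdefault(b, {})[a] = dist

    def neighbors(self, node):
        # neighbors in ascending node order (matches the action numbering used later)
        return sorted(self.adj.get(node, {}))


# Illustrative map: three nodes, continuous edge lengths
g = ADVGraph()
g.add_edge(0, 1, 19.2)
g.add_edge(1, 2, 28.7)
g.add_edge(0, 2, 40.0)
```

The block-system constraint (one vehicle per edge at a time) is enforced later by the route-selection policy, not by the graph itself.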

Fig. 2 The graph structure of the map when there is no other agent on the nearest route

To ensure safety and prevent collisions, it is prohibited for multiple autonomous vehicles to enter the same edge. If there are multiple edges between the same nodes, it indicates that the usable space for the section is larger and several autonomous vehicles can safely pass through it [5].

Let N be the number of AVs, let l^c_i[t] denote AV i's location at time t, and let l^d_i[t] denote its final destination (delivery destination or collection point). The next destination is announced after the AV has reached its current goal. The next destination may be known in advance thanks to reservations, deliveries, etc., but here it is assumed that new instructions are issued immediately, to handle the demand-driven inflow of orders.

The challenge of selecting a preferred order-handling sequence can be considered a problem for future development. While models with different moving speeds and loading capacities are possible, the experiments in this study only evaluate the case of a single model, for simplicity.

Each agent, or the control tower, has access to the map data G and knowledge of every agent's present location l^c_i[t] and destination l^d_i[t] (complete information). ADV's goal is to minimize the AVs' total sum of costs (SoC), making this a fully cooperative game [13, 14]. In this paper's experiments, the trip cost is the time required.

Algorithm ADV (pseudocode listing provided as a figure in the original article)

Solving the DRP's non-grid problems in continuous time requires significant computational effort [15]. To compare the performance of learning agents with that of non-learning agents, a route-selection method such as Algorithm ADV, which can be easily implemented and uses the graph structure as prior knowledge, is employed.
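Since Algorithm ADV appears in the paper only as a figure, the following is a plausible sketch of such a rule-based route chooser under stated assumptions: the agent moves to the neighbor minimizing edge length plus the precomputed shortest distance to the destination, and waits if that edge is currently occupied. The `dist` lookup table is assumed to come from an all-pairs shortest-path precomputation.

```python
WAIT = -1  # sentinel action: stay on the current node


def rule_based_step(loc, dest, adj, dist, occupied_edges):
    """One step of a rule-based route chooser in the spirit of Algorithm ADV
    (details here are assumptions, not the paper's listing). adj maps
    node -> {neighbor: edge length}; dist[a][b] is the shortest distance a->b."""
    if loc == dest:
        return WAIT
    best, best_cost = WAIT, float("inf")
    for nxt, w in adj[loc].items():
        cost = w + dist[nxt][dest]  # greedy: edge length + remaining shortest distance
        if cost < best_cost:
            best, best_cost = nxt, cost
    if (loc, best) in occupied_edges or (best, loc) in occupied_edges:
        return WAIT  # block system: never enter an occupied edge
    return best
```

A learning agent can then be judged against this baseline, which uses the graph structure as prior knowledge but never improves.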

The design of the system by MARL

RL is a subfield of Machine Learning that deals with making appropriate decisions to maximize rewards in a given situation. Various programs and machines use it to determine the optimal course of action in a particular situation [16]. Unlike supervised learning, where the training data provide the correct answers, in RL the agent must discover what to do to achieve the task at hand.

To develop the best possible policy, the agent must determine how to explore new states while maximizing its total reward. This entails balancing exploration and exploitation, which can involve making short-term sacrifices for long-term benefits. Therefore, the agent must collect enough data to make high-level decisions in the future [13].
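The exploration/exploitation balance described above is commonly handled with an epsilon-greedy rule; this generic sketch is illustrative and is not claimed to be the scheme used in the paper.

```python
import random


def epsilon_greedy(q_values, epsilon):
    """Standard exploration/exploitation trade-off: with probability epsilon
    pick a random action (explore); otherwise pick the current best (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Annealing epsilon from 1 toward 0 over training is the usual way to make the short-term sacrifices early and exploit the learned values later.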

A multi-agent system refers to a collection of distributed entities, or agents, that interact and make independent decisions in a shared environment to accomplish certain tasks. To exhibit intelligent behavior, agents may require a broad range of abilities, depending on the task at hand. In complex systems, programming a priori for intelligent behavior can be challenging, if not impossible. Thus, agents must have the capability to adapt and improve their behavior over time. The most popular paradigm for learning in interactive contexts is Reinforcement Learning (RL), which adjusts behavior through trial and error [14].

Most MAPF algorithms are tested on grid worlds, where agents move step-wise along the four cardinal directions. This study instead aims to avoid collisions by having agents briefly collaborate, actively exchanging their locations, rather than focusing on large-scale road networks [17].

This section presents some preliminary design aspects of the ADV system using MARL.

Action space

Here, there are two simple types of action:

a0: stay on the current node for a fixed period (Twait);

ai: move from the current node to the i-th lowest-numbered node connected by an edge.

The number of actions is "the number of connections of the node having the greatest number of connections + 1". If an agent at a node with fewer connected nodes chooses an action with a higher index, the action is invalid and the learning episode ends.
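The action scheme above can be sketched as follows; the dictionary-based graph representation is an assumption for illustration.

```python
def action_space_size(adj):
    # "number of connections of the most-connected node + 1", shared by all nodes
    return max(len(nbrs) for nbrs in adj.values()) + 1


def decode_action(node, action, adj):
    """Map an action index to a move under the scheme above: action 0 waits,
    action i moves to the i-th lowest-numbered neighbor. Returns None for an
    invalid index, which ends the learning episode."""
    if action == 0:
        return ("wait", node)
    neighbors = sorted(adj[node])
    if action - 1 >= len(neighbors):
        return None  # node has fewer neighbors than the chosen index
    return ("move", neighbors[action - 1])
```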

Pre-calculation of shortest-path information

Finding the shortest path between every pair of points in a graph has computational complexity O(V^3), so recomputing it during learning would involve far more calculation as the map expands. The Floyd-Warshall algorithm, however, makes it possible to complete this calculation up front: it only needs to be run once, because the result is invariant as long as the map stays the same.
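The one-off precomputation can be implemented directly; this is a textbook Floyd-Warshall over the same adjacency-map representation used above.

```python
def floyd_warshall(nodes, adj):
    """All-pairs shortest distances in O(V^3); computed once up front, since
    the result never changes while the map stays the same."""
    INF = float("inf")
    dist = {a: {b: (0.0 if a == b else INF) for b in nodes} for a in nodes}
    for a in adj:
        for b, w in adj[a].items():
            dist[a][b] = min(dist[a][b], w)  # direct edges (keep shortest parallel edge)
    for k in nodes:  # allow k as an intermediate node
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```

The resulting `dist` table is exactly the lookup the input vectors and rewards below rely on.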

The aim is to improve learning efficiency by incorporating the shortest-path information, as shown in Fig. 2, into deep learning, instead of relying solely on the map's graph structure. Moreover, as illustrated in Fig. 3, an action that would send an additional agent onto an occupied edge is assigned a value of -1 to mark it as inappropriate. Tables 1 and 2 present the shortest-path information vectors for the relevant scenarios. These vectors are very similar to those handled by Algorithm ADV; selecting the maximum value of this input at every step results in the same action choice as Algorithm ADV.

Fig. 3 The flow diagram when there are other agents on the nearest route

Table 1 The shortest-path information vector when there is no other agent on the nearest route

Action   Distance to l^d        Input vector
a0       -                      -Twait/C
a1       19.2 + 28.7 = 47.9     -7.9/C
a2       18.3 + 21.7 = 40.0     0/C
a3       26.4 + 17.1 = 43.5     -3.5/C

C: clipping constant

Table 2 The shortest-path information vector when there are other agents on the nearest route

Action   Distance to l^d        Input vector
a0       -                      -Twait/C
a1       19.2 + 28.7 = 47.9     -7.9/C
a2       18.3 + 21.7 = 40.0     0/C
a3       26.4 + 17.1 = 43.5     -1

C: clipping constant
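Building the input vector of Tables 1 and 2 can be sketched as below. The normalization by the clipping constant C follows the tables; the graph values in the test reproduce Table 1's numbers, but the exact network input pipeline is an assumption.

```python
def input_vector(loc, dest, adj, dist, occupied_edges, Twait, C):
    """Shortest-path information vector for one agent (a sketch). Entry 0 is
    the wait action, -Twait/C. Entry i is minus the extra detour incurred by
    moving to the i-th lowest-numbered neighbor, divided by C, so the action
    on a shortest path gets 0. An edge occupied by another agent gets -1."""
    shortest = min(w + dist[n][dest] for n, w in adj[loc].items())
    vec = [-Twait / C]  # action a0: wait
    for n in sorted(adj[loc]):
        if (loc, n) in occupied_edges or (n, loc) in occupied_edges:
            vec.append(-1.0)  # invading an occupied edge is marked inappropriate
        else:
            vec.append(-(adj[loc][n] + dist[n][dest] - shortest) / C)
    return vec
```

Taking the argmax of this vector reproduces Algorithm ADV's choice, which is why it makes a useful learning input.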

Reward design

The commonly used reward setting assigns a value of 0 to moving toward the goal node and -1 to other actions. However, this approach is inefficient, since propagating the Q value back to the starting point takes time.

Additionally, the time needed to complete an action in ADV is not constant. If every movement to a node other than the destination were simply counted as time 1 in the evaluation, moving along a short edge would always be rated better than moving along a long edge, even when the long edge lies on the shortest route, so proper learning would not be feasible.

As a result, the reward is set to the selected action's entry in the shortest-path input vector (Tables 1 and 2). Over an episode, the cumulative reward to the destination is 0 for actions along a shortest path, and -1 times the extra detour distance (time) for other actions.

Maximizing this cumulative reward therefore learns the action selection that reaches the target over the smallest distance (shortest time), since the return equals -1 times the total detour distance (time) relative to the shortest path. Receiving a reward for every action also speeds up learning.
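The dense, detour-based reward described above can be sketched as follows; the graph values in the test reuse the Table 1 numbers, and the function signature is an assumption for illustration.

```python
def detour_reward(loc, nxt, dest, adj, dist, C):
    """Per-step reward (a sketch): 0 when the chosen move lies on a shortest
    path, otherwise minus the extra detour it introduces, scaled by the
    clipping constant C. Summed over an episode, the return is
    -(path length - shortest distance)/C, so maximizing it minimizes travel time."""
    best = min(w + dist[n][dest] for n, w in adj[loc].items())
    taken = adj[loc][nxt] + dist[nxt][dest]
    return -(taken - best) / C
```

Because every step yields a reward, the value signal does not have to propagate all the way back from the goal, which is the speed-up claimed above.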

The RL model is paired with the collision-prevention strategy for safety checks, along with the established reward function, so that the vehicle is controlled in a way that maximizes cumulative reward while avoiding potential risks.

Using rule-based policies

In the context of RL, the agent experiments with various actions to determine the most suitable one for the given situation. However, in MARL, if Agent B performs poorly (for example, colliding with Agent A), Agent A will receive a low reward even if it performs well.

There are thus circumstances where behavior cannot be accurately assessed. This is apparent in the early learning phases and leads to ineffective learning, since it can favor undesirable behavior. Therefore, all agents except the one currently learning choose actions using Algorithm ADV, so the learner efficiently acquires "actions to be taken when other agents produce reasonably good actions."

Once learning has progressed to some extent, the other agents also select actions with the policy being learned, adjusting the environment. Ultimately, what matters is learning "actions to be taken when other agents also select actions with the same policy."
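The training-time mixing of rule-based and learned policies described above can be sketched as follows; the policy interfaces here are assumptions, not the paper's implementation.

```python
def choose_actions(agents, learner_id, learned_policy, rule_policy, states):
    """Training-time action selection (a sketch): only the agent currently
    learning follows its own policy; every other agent acts with the
    rule-based Algorithm ADV policy, so the learner observes reasonably good
    behavior from its peers. Later in training, rule_policy can be swapped
    for the learned policy so all agents act under the same policy."""
    actions = {}
    for a in agents:
        policy = learned_policy if a == learner_id else rule_policy
        actions[a] = policy(states[a])
    return actions
```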

Evaluation

The approach is implemented in MATLAB, and all simulations are run on a machine with 32 GB of RAM and a 1.8 GHz Intel Core i7-8550U CPU.

The usefulness of the algorithm suggested in the previous section is examined in the following experiment. The map used is shown in Fig. 4 and depicts an actual street layout. Each edge value displays the length of time it takes an agent to traverse that edge. The required time and destination arrival rate are reported as average values, and the number of AVs during learning is random (uniformly distributed) between 2 and 4.

Fig. 4 The evaluation map

The destination arrival rate and the overall time needed at each learning step are depicted in Figs. 5 and 6, respectively. The destination arrival rate is the rate at which all agents reach their first destination within 1000 unit hours without colliding. The total journey time is the average time it takes all AVs to arrive at their first destination; the average is not computed if a collision or timeout occurs. At evaluation time, the other agents are allowed to choose their actions based on the learned policy.

Fig. 5 The destination arrival rate

Fig. 6 The overall amount of time needed for each learning step

As displayed in Fig. 5, action selection that reaches the destination can be learned effectively by employing the shortest-path information. Figure 6 also shows that short-path selection is learned quickly when the other agents run Algorithm ADV. This is because, when the other agents make action decisions under the policy still being learned, the learner tends to acquire detours so as not to clash with them. The total time required under "other agent rule-based policy" and "both in one" does not appear to differ; however, under "other agent rule-based policy" alone, collisions or timeouts sometimes resulted in the destination not being reached.

The shortest-length paths are frequently also the shortest-time paths. In the "shortest-path information" approach, a vehicle agent always includes the shortest-length path in the collection of candidate paths when exploring new routes. "Both in one" thus provides performance comparable to "other agent rule-based policy".

The performance of "both in one" is superior to the other two approaches, as shown by a comparison of the destination arrival rate across the test situations. This demonstrates that "both in one" can maintain higher scalability and generalization. The reciprocal reward is free from the curse of dimensionality, since only nearby AVs that directly interact are taken into account when calculating each AV's reciprocal reward.

Because the times needed in difficult and easy conditions are not averaged together, it can be argued that "both in one", which used both approaches, was able to choose the more efficient one in each case. It is also impossible to learn effectively in this setting if neither the shortest-path information nor the rule-based policy of other agents is applied. Schemes that gradually increase the number of agents would therefore be worth developing, but such methods are not used here; instead, numerous agents are used for training from the start.

As seen in Fig. 7, the reward fluctuates from episode to episode, and the convergence behavior of the MARL-based ADV algorithm can be observed. A total of 100 training steps per episode is established, with cumulative rewards recorded for each training step. The analysis demonstrates that the reward value gradually rises over repeated training iterations, and the suggested algorithm starts to converge after 500 episodes of training. Because of the high mobility of the vehicles, the network topology varies quickly and the channel fading changes, so numerical fluctuations remain even after the algorithm converges.

Fig. 7 Convergence of the MARL-based algorithm

Conclusion and outlook

Understanding multi-agent behavior is crucial for autonomous driving. In recent years, RL has significantly improved performance on the object detection and tracking tasks crucial to autonomous driving. To assess, operate in, and interact safely with their surroundings, autonomous vehicles must have a higher-level grasp of their dynamics. The overall objective of this work is to collaboratively perceive, learn from, and comprehend the behavior of multiple autonomous vehicle agents in order to improve autonomous driving systems.

This work compiles the requirements of the operation-plan optimization problem necessary for safe and effective delivery by AVs and formulates them as the ADV optimization problem. It also demonstrates how MARL can be carried out effectively using analytically determined shortest-path information. Future work could investigate how to decide which delivery orders should be handled by which AV, taking AV acceleration into account, and how to connect this to real applications.

There are several potential directions for future work using MARL in autonomous vehicles. Some possibilities include:

Investigating more complex scenarios: The experiments in the current study involved a relatively simple scenario with a limited number of agents and a small map size. Future work could explore more complex scenarios with larger maps, more agents, and more complicated interactions between agents.

Enhancing the training process The training process used in the current study relied on a pre-defined reward function and a fixed set of training episodes. Future work could explore more advanced training methods, such as curriculum learning, transfer learning, or meta-learning, to improve the efficiency and robustness of the training process.

Incorporating uncertainty Real-world autonomous driving scenarios involve a high degree of uncertainty, such as sensor noise, communication delays, and unexpected obstacles. Future work could investigate how to incorporate uncertainty into the learning process, for example using probabilistic models or Bayesian approaches.

Integrating with other technologies MARL can be integrated with other technologies, such as computer vision, natural language processing, or blockchain, to improve the performance and scalability of autonomous driving systems. Future work could explore how to leverage these technologies to enhance the learning and decision-making capabilities of autonomous vehicles.

Addressing ethical and social issues: Autonomous driving raises important ethical and social issues, such as liability, privacy, and equity. Future work could investigate how to design MARL algorithms that take these issues into account, for example by incorporating fairness constraints or considering the preferences and beliefs of different stakeholders.

In future work, the focus will be on fine-tuning the parameters of the exploration process, particularly for the alternate path-finding technique. The goal is to include bargaining techniques in the strategy, where vehicle agents can bargain with each other when exploring routes to reduce the overall cost. Additionally, the Assign MAS will be expanded to handle the charging and routing activities involved in the AGV transportation problem.

Author contributions

S.E. wrote the main manuscript text.

Funding

Not applicable.

Availability of data and materials

Not applicable.

Declarations

Competing interests

The author declares no competing interests.

Conflict of interest

The author declares that they have no conflict of interest.

Ethical approval

Not applicable.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Schwarting W, Alonso-Mora J, Rus D. Planning and decision-making for autonomous vehicles. Annu. Rev. Control. Robot. Autonom. Syst. 2018;1(1):187–210. doi: 10.1146/annurev-control-060117-105157.
  2. Guo Y, Chen X, Yang Y. Multimodal transport distribution model for autonomous driving vehicles based on improved ALNS. Alex. Eng. J. 2022;61(4):2939–2958. doi: 10.1016/j.aej.2021.08.029.
  3. Tran QD, Bae SH. An efficiency-enhancing methodology for multiple autonomous vehicles in an urban network adopting deep reinforcement learning. Appl. Sci. 2021;11(4):1514. doi: 10.3390/app11041514.
  4. Petrović Đ, Mijailović R, Pešić D. Traffic accidents with autonomous vehicles: type of collisions, maneuvers and errors of conventional vehicles' drivers. Transport. Res. Proc. 2020;45:161–168. doi: 10.1016/j.trpro.2020.03.003.
  5. Filos A, Tigkas P, McAllister R, Rhinehart N, Levine S, Gal Y. Can autonomous vehicles identify, recover from, and adapt to distribution shifts? In: International Conference on Machine Learning, pp. 3145–3153. PMLR (2020)
  6. Li G, Yang Y, Li S, Qu X, Lyu N, Li SE. Decision making of autonomous vehicles in lane change scenarios: deep reinforcement learning approaches with risk awareness. Transp. Res. Part C: Emerg. Technol. 2022;134:103452. doi: 10.1016/j.trc.2021.103452.
  7. Li K, Wang J, Zheng Y. Cooperative formation of autonomous vehicles in mixed traffic flow: beyond platooning. IEEE Trans. Intell. Transport. Syst. (2022)
  8. Prathiba SB, Raja G, Dev K, Kumar N, Guizani M. A hybrid deep reinforcement learning for autonomous vehicles smart-platooning. IEEE Trans. Veh. Technol. 2021;70(12):13340–13350. doi: 10.1109/TVT.2021.3122257.
  9. Yun WJ, Jung S, Kim J, Kim JH. Distributed deep reinforcement learning for autonomous aerial eVTOL mobility in drone taxi applications. ICT Express. 2021;7(1):1–4. doi: 10.1016/j.icte.2021.01.005.
  10. Mushtaq A, Haq IU, Imtiaz MU, Khan A, Shafiq O. Traffic flow management of autonomous vehicles using deep reinforcement learning and smart rerouting. IEEE Access. 2021;9:51005–51019. doi: 10.1109/ACCESS.2021.3063463.
  11. Wang H, Gao H, Yuan S, Zhao H, Wang K, Wang X, Li D. Interpretable decision-making for autonomous vehicles at highway on-ramps with latent space reinforcement learning. IEEE Trans. Veh. Technol. 2021;70(9):8707–8719. doi: 10.1109/TVT.2021.3098321.
  12. Brittain MW, Yang X, Wei P. Autonomous separation assurance with deep multi-agent reinforcement learning. J. Aerosp. Inform. Syst. 2021;18(12):890–905.
  13. Fruit R. Exploration-exploitation dilemma in reinforcement learning under various forms of prior knowledge. Doctoral dissertation, Université de Lille 1, Sciences et Technologies; CRIStAL UMR 9189 (2019)
  14. Gronauer S, Diepold K. Multi-agent deep reinforcement learning: a survey. Artif. Intell. Rev. 2022;55(2):895–943. doi: 10.1007/s10462-021-09996-w.
  15. Sartoretti G, Kerr J, Shi Y, Wagner G, Kumar TS, Koenig S, Choset H. Primal: pathfinding via reinforcement and imitation multi-agent learning. IEEE Robot. Autom. Lett. 2019;4(3):2378–2385. doi: 10.1109/LRA.2019.2903261.
  16. Yarats D, Fergus R, Lazaric A, Pinto L. Reinforcement learning with prototypical representations. In: International Conference on Machine Learning, pp. 11920–11931 (2021)
  17. Trinh LA, Ekström M, Cürüklü B. Multi-path planning for autonomous navigation of multiple robots in a shared workspace with humans. In: 2020 6th International Conference on Control, Automation and Robotics (ICCAR), pp. 113–118 (2020)


Articles from Iran Journal of Computer Science are provided here courtesy of Nature Publishing Group