Scientific Reports 15:33122 (2025). doi: 10.1038/s41598-025-18353-8

Decentralized resource allocation in UAV communication networks through reward based multi agent learning

Muhammad Shoaib 1, Ghassan Husnain 1, Muhsin Khan 2, Yazeed Yasin Ghadi 3, Sangsoon Lim 4

Abstract

Unmanned aerial vehicles (UAVs) used as aerial base stations (ABS) can provide economical, on-demand wireless access. This research investigates dynamic resource allocation in multi-UAV-enabled communication systems with the aim of maximizing the long-term reward. Specifically, each UAV independently selects its served user, transmit power level, and sub-channel to communicate with a ground user, without exchanging information with the other UAVs. To capture the unpredictability of the environment, we formulate the long-term resource allocation problem as a stochastic game that maximizes the expected reward, in which each UAV acts as a learnable agent and each resource allocation solution corresponds to the actions taken by the UAVs. We then develop a reward-based multi-agent learning (RMAL) framework in which each agent uses learning to identify its best strategy from local observations. In particular, we propose an agent-independent scheme in which all agents execute their decision algorithms separately but share a common Q-learning-based framework. Simulation results show that the performance of the proposed RMAL-based resource allocation method can be improved by choosing appropriate exploitation and exploration parameters. Moreover, the proposed RMAL algorithm achieves acceptable performance compared with full information exchange between UAVs, thereby striking a satisfactory balance between the performance gain and the additional burden of information transmission.

Keywords: Unmanned aerial vehicles, Aerial base stations, Dynamic resource allocation, Multi-Agent learning, Decentralized Decision-Making

Subject terms: Electrical and electronic engineering, Mechanical engineering

Introduction

Recently, interest in airborne communication networks has grown, encouraging the development of novel wireless infrastructure deployment techniques1. This is because aerial communication systems can provide better system capacity and coverage. Unmanned aerial vehicles (UAVs), also known as remotely piloted aircraft systems (RPAS) or drones, are small unmanned aircraft that can be deployed rapidly2 and have also been considered within Third Generation Partnership Project (3GPP) LTE-A (Long-Term Evolution - Advanced) systems. In contrast to terrestrial communication, the channel between a UAV and the ground is more likely to be a line-of-sight (LoS) link2, which facilitates wireless communication. With regard to deployment, navigation, and control, UAVs built on a variety of airborne platforms have attracted a significant amount of academic and industrial effort3. To increase the coverage and energy efficiency of UAV communication systems, resource allocation, which includes transmit power, served users, and sub-channels, is also required, because crucial communication issues are involved4. UAVs can typically be deployed in less time than terrestrial base stations and offer greater configuration flexibility5. The separation distance between UAV deployments and the altitude of UAV-enabled small base stations are studied in4. A cyclic packing-based three-dimensional (3D) deployment algorithm is developed in6 to maximize the downlink coverage performance, and a 3D deployment method for a single UAV is also proposed there to maximize the number of covered users6. A successive UAV placement method that keeps all UAVs at the same altitude is proposed in7; this scheme aims to minimize the number of UAVs required while ensuring that each authorized ground user is covered by at least one UAV8. Beyond deployment optimization, the design of UAV trajectories to optimize communication performance has received considerable attention, as evidenced by9-11. The authors of9 investigate the throughput maximization problem, treating UAVs as mobile relays, and jointly optimize the power allocation and the UAV trajectory; successive convex approximation (SCA) is proposed there as a method for UAV trajectory design. The authors of9 also examine a trajectory design that minimizes the task completion time in UAV multicast systems by discretizing a continuous trajectory into a set of way-points. Furthermore,10 considers wireless communication systems supporting multiple UAVs and analyses a joint design of trajectory and resource allocation that maximizes the minimum throughput over all users to maintain fairness. To mitigate the delay of the sensing task while maintaining the overall rate of a multi-UAV-aided uplink single-cell network, the authors of12 propose a joint sub-channel allocation and trajectory design technique that accounts for both the total rate and the latency of the sensing task. Despite their adaptability and maneuverability, UAV control design is constrained by the need for human intervention; the performance of UAV communication systems therefore calls for intelligent, machine-learning-based UAV control13.
The design of neural network-based trajectories for UAVs is examined from the standpoint of manufacturing architecture in14,15. The paper16 proposes a weighted expectation-based UAV on-demand predictive deployment method to minimize transmit power in UAV-enabled communication systems. This method uses a Gaussian mixture model to construct the data distribution.

In the related work16, the authors investigate autonomous path planning for UAVs by jointly considering energy efficiency, transmission delay, and interference management. To address this complex optimization problem, they propose a deep reinforcement learning framework based on Echo State Networks (ESNs), enabling adaptive decision-making in dynamic environments. Furthermore, the same study presents a resource allocation strategy leveraging Liquid State Machines (LSMs) for efficient spectrum utilization across both licensed and unlicensed LTE bands in cache-enabled UAV networks. In a related work17, a joint channel and time-slot selection mechanism for multi-UAV systems is introduced. The proposed approach employs log-linear learning to optimize spectrum sharing and mitigate collisions in a distributed manner, thereby enhancing the overall communication performance of UAV-enabled networks17.

Machine learning and deep learning, which learn directly from data without explicitly programming a computer system for detection and recognition, are promising and powerful tools that can provide autonomous and effective solutions to intelligently improve UAV-assisted communication systems18. However, most research contributions have focused on how UAVs are deployed and how their trajectories are designed in communication systems16. Prior research has primarily considered time-independent scenarios; although11,12 discuss resource allocation schemes for UAV-supported communication systems, including transmit power and sub-channels, the optimal design there is independent of time. Additionally,19,20 investigated machine-learning-based resource allocation techniques for time-dependent scenarios. However, most of the proposed machine learning algorithms address scenarios with a single UAV, or with multiple UAVs under the assumption that each UAV possesses complete network information. Owing to the rapid movement of UAVs21,22, acquiring a comprehensive view of the dynamic environment is not simple in practice, which poses a significant challenge for the design of reliable UAV wireless communication. Moreover, most earlier research contributions rely on centralized techniques, which makes modeling and computation increasingly difficult as the network scale grows. For UAV-enabled communication systems, reward-based multi-agent learning (RMAL) can offer a distributed approach to intelligent resource management, which is especially useful when each UAV only has access to its local data23.

In dynamic UAV-enabled communication networks, centralized control or full network state awareness is often impractical due to high mobility, limited energy, and real-time operational constraints. Most existing solutions either assume complete inter-UAV information sharing or rely on static deployment strategies. In contrast, the proposed RMAL (Reward-Based Multi-Agent Learning) framework enables each UAV to make decentralized resource allocation decisions using only local observations, eliminating the need for inter-agent communication. This reduces overhead while retaining adaptability in highly dynamic environments. The motivation for using RMAL lies in its ability to capture environmental uncertainty through a stochastic game formulation, enabling each UAV to maximize long-term rewards independently via Q-learning. This makes the method scalable, practical, and well-suited for real-time UAV applications24.

Based on the proposed framework, the following summarizes our primary contributions:

  • To enhance the long-term effectiveness of multi-UAV downlink systems, our work focuses on jointly designing the user, power level, and sub-channel selection algorithms. To ensure reliable communication, we specifically design a constrained energy-efficiency function based on quality-of-service (QoS) as the reward mechanism. The formulated optimization problem is challenging because it is both time-dependent and uncertain. To tackle this challenging issue, we describe a dynamic resource allocation method based on reward learning.

  • We analyze the dynamic resource allocation problem of a multi-UAV system through a stochastic game formulation. In this design, every UAV acts as a learning agent, and every resource allocation strategy corresponds to actions taken by the UAVs. This allows us to describe the dynamic resource allocation problem in a system of several UAVs. In the formulated stochastic game, each UAV's behavior satisfies the Markov property, meaning that a UAV's reward depends only on its current state and action. Moreover, resource allocation problems for various multi-UAV dynamic systems can be modeled with this framework.

  • We developed an RMAL-based resource allocation algorithm to solve the formulated stochastic game in multi-UAV systems. Each UAV acts as its own learning agent and runs a standard Q-learning algorithm without taking the behavior of the other UAVs into account. This significantly reduces both the amount of data shared between UAVs and the computational work performed by each UAV. In addition, we provide evidence that the RMAL-based resource allocation algorithm converges.

  • The simulation results presented here show how the exploitation and exploration parameters of the ε-greedy algorithm should be chosen for various system parameters. In addition, the simulation results demonstrate that the RMAL-based resource allocation framework for multi-UAV systems provides a satisfactory trade-off between the performance gain and the increase in the quantity of information that must be exchanged.

To facilitate clarity and improve comprehension of technical terms used throughout this study, a comprehensive list of abbreviations is presented in Table 1. This table provides definitions for commonly used acronyms related to UAV-enabled communication systems, reinforcement learning, and wireless network modeling. The summarized notations serve as a reference for readers to interpret various terminologies consistently within the context of this work.

Table 1.

Summary of acronyms.

Acronym | Definition | Acronym | Definition
UAV | Unmanned Aerial Vehicle | NLoS | Non-Line-of-Sight
AGU | Authorized Ground Users | SINR | Signal-to-Interference-plus-Noise Ratio
A2G | Air-to-Ground | MDS | Markov Decision Scheme
3D | Three-Dimensional | QoS | Quality of Service
LoS | Line-of-Sight | UAS | Unmanned Aerial System
RMAL | Reward-Based Multi-Agent Learning | 5G | Fifth Generation
LTE | Long-Term Evolution | CSI | Channel State Information

System model

We consider a multi-UAV A2G communication system, depicted in Fig. 1, that operates on a discrete timeline and comprises X single-antenna UAVs and U single-antenna ground users. The ground users are randomly dispersed over a disk of a given radius. As depicted in Fig. 1, several UAVs fly over the area and communicate directly with the ground users25 via aerial communication links. The total UAV bandwidth is divided into K orthogonal sub-channels. In addition, the UAV is expected to operate autonomously according to a preprogrammed flight plan without human interaction, as described in20; in other words, the UAV's trajectory is predetermined by a preprogrammed flight plan. Figure 1 shows three UAVs flying over the region of interest along predetermined paths. This article examines the dynamic design of resource allocation in UAV systems with respect to user, power level, and sub-channel selection. In addition, the communication among UAVs is assumed to occur without a central controller and without global knowledge of the wireless communication environment27; in simple words, each UAV only has local knowledge of its own and its users' CSI. In practice, this assumption is reasonable due to UAV mobility, in line with the research contributions21,22.

Fig. 1.

Fig. 1

The proposed system for UAV-enabled communication employs Reinforcement Learning for decentralized dynamic resource allocation.

A2G channel model

Compared to terrestrial propagation, A2G channels depend significantly on altitude, elevation angle, and the propagation environment. Following3,21, we investigate the dynamic resource allocation problem in multi-UAV systems under the following A2G channel models:

  • The Probabilistic Model: As demonstrated in21,29, the probabilistic path loss model, which treats line-of-sight (LoS) and non-line-of-sight (NLoS) links independently with different probabilities, can be used to model the A2G communication link. According to29, the probability of establishing a LoS connection between UAV x and ground user u in time slot D is given, in terms of the environment-dependent constants a and b, by

P_{x,u}^{\mathrm{LoS}}(D) = \frac{1}{1 + a \exp\left(-b\left[\theta_{x,u}(D) - a\right]\right)}    (1)

where \theta_{x,u}(D) = (180/\pi)\arcsin\!\big(H / d_{x,u}(D)\big) is the elevation angle between UAV x and ground user u, d_{x,u}(D) denotes the distance between them, and H denotes the UAV altitude.

The corresponding Non-Line-of-Sight (NLoS) probability is:

P_{x,u}^{\mathrm{NLoS}}(D) = 1 - P_{x,u}^{\mathrm{LoS}}(D)    (2)

The LoS and NLoS path losses from the authorized ground user u to UAV x in time slot D may be expressed as follows,

PL_{x,u}^{\mathrm{LoS}}(D) = L_{\mathrm{FS}} + 20\log_{10} d_{x,u}(D) + \eta_{\mathrm{LoS}}    (3)
PL_{x,u}^{\mathrm{NLoS}}(D) = L_{\mathrm{FS}} + 20\log_{10} d_{x,u}(D) + \eta_{\mathrm{NLoS}}    (4)

where L_{\mathrm{FS}} = 20\log_{10}(4\pi f_c / c) denotes the free-space path loss at a reference distance of one metre, with c the speed of light and f_c the carrier frequency. Additionally, \eta_{\mathrm{LoS}} and \eta_{\mathrm{NLoS}} represent the average additional path losses of the LoS and NLoS links, respectively. Consequently, the average path loss between UAV x and user u during time slot D may be expressed as:

\overline{PL}_{x,u}(D) = P_{x,u}^{\mathrm{LoS}}(D)\, PL_{x,u}^{\mathrm{LoS}}(D) + P_{x,u}^{\mathrm{NLoS}}(D)\, PL_{x,u}^{\mathrm{NLoS}}(D)    (5)
  • The LoS Model: As noted in8, the LoS model provides a good approximation for practical A2G communication. According to the LoS model30, the path loss between an authorized ground user and a UAV depends on their locations and on the type of propagation. The channel gains between the authorized ground users and the UAVs are computed from their relative distances using the LoS model together with the free-space path loss model. The LoS channel power gain in time slot D from UAV x to authorized ground user u can be represented as follows:

g_{x,u}(D) = \beta_0\, d_{x,u}^{-\alpha}(D) = \frac{\beta_0}{\left[\lVert q_x(D) - w_u \rVert^2 + H^2\right]^{\alpha/2}}    (6)

where q_x(D) indicates the horizontal position of UAV x in time slot D and w_u denotes the location of user u. In addition, \beta_0 denotes the channel power gain at the reference distance, whereas \alpha is the path-loss exponent. A numerical sketch of both channel models is given below.
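To make the two channel models concrete, the following Python sketch evaluates the probabilistic average path loss of Eqs. (1)-(5) and the LoS channel gain of Eq. (6). The environment constants a and b, the additional path losses, β0, and α are illustrative placeholders, not the values used in this paper.

```python
import numpy as np

def los_probability(h, r_horiz, a=9.61, b=0.16):
    """LoS probability of the probabilistic A2G model, Eq. (1).
    h: UAV altitude (m); r_horiz: horizontal UAV-user distance (m);
    a, b: environment-dependent constants (illustrative urban-like values)."""
    theta = np.degrees(np.arctan2(h, r_horiz))           # elevation angle in degrees
    return 1.0 / (1.0 + a * np.exp(-b * (theta - a)))

def average_path_loss_db(h, r_horiz, fc=2e9, eta_los=1.0, eta_nlos=20.0):
    """Average A2G path loss in dB, Eq. (5): LoS/NLoS losses weighted by their probabilities."""
    d = np.hypot(h, r_horiz)                             # 3-D UAV-user distance
    fspl = 20.0 * np.log10(4.0 * np.pi * fc * d / 3e8)   # free-space path loss at distance d
    p_los = los_probability(h, r_horiz)
    return p_los * (fspl + eta_los) + (1.0 - p_los) * (fspl + eta_nlos)

def los_channel_gain(h, r_horiz, beta0_db=-50.0, alpha=2.0):
    """LoS-model channel power gain, Eq. (6): beta0 at 1 m scaled by distance^(-alpha)."""
    beta0 = 10.0 ** (beta0_db / 10.0)
    d = np.hypot(h, r_horiz)
    return beta0 * d ** (-alpha)

# Example: an 80 m-high UAV serving a user 200 m away horizontally.
print(average_path_loss_db(80.0, 200.0), los_channel_gain(80.0, 200.0))
```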

The signal model

In UAV-to-ground communication, every pair of UAVs operating on the same sub-channel causes interference to ground users. Let c_{x,k}(D) be a sub-channel indicator, where c_{x,k}(D) = 1 if UAV x occupies sub-channel k during time slot D and c_{x,k}(D) = 0 otherwise. It satisfies

\sum_{k=1}^{K} c_{x,k}(D) = 1, \quad \forall x.    (7)

In other words, each UAV is restricted to a single sub-channel per time slot. Let a_{x,u}(D) be a user-association indicator: a_{x,u}(D) = 1 if user u is served by UAV x in time slot D, and a_{x,u}(D) = 0 otherwise. Thus, at time slot D, the SINR of the UAV-to-ground transmission between UAV x and authorized user u on sub-channel k is the following:

\mathrm{SINR}_{x,u}^{k}(D) = \frac{c_{x,k}(D)\, p_x(D)\, g_{x,u}^{k}(D)}{\sigma^2 + \sum_{x' \neq x} c_{x',k}(D)\, p_{x'}(D)\, g_{x',u}^{k}(D)}    (8)

where g_{x,u}^{k}(D) indicates the channel gain between UAV x and authorized user u on sub-channel k in time slot D, p_x(D) indicates the transmit power chosen by UAV x for time slot D, and \sigma^2 denotes the noise power. The SINR of UAV x can then be stated as follows for any time slot D:

graphic file with name d33e934.gif 9

As in22, the UAVs implement discrete transmit power control to manage interference and optimize communication performance within the network. A vector of discrete power levels specifies the transmit power values available to each UAV for communication with its associated user. For each power level j, the binary variable e_{x,j}(D) is defined: e_{x,j}(D) = 1 if UAV x decides to transmit at power level j in time slot D, and e_{x,j}(D) = 0 otherwise. Note that for each time slot D, UAV x may only choose a single power level.

\sum_{j=1}^{J} e_{x,j}(D) = 1, \quad \forall x.    (10)

Now, the Inline graphic has a limited set of power-level selection options including the following:

graphic file with name d33e1009.gif 11

Similar to user selection via Inline graphic, all sub-channel selection has finite sets that are as follows:

graphic file with name d33e1023.gif 12
graphic file with name d33e1029.gif 13

Furthermore, we assume that the multi-UAV system runs on a discrete-time basis, with the timeline divided into equal, non-overlapping time slots, and that the communication parameters do not change within a time slot. Let the integer D denote the time slot index. Each UAV records the CSI and decisions of the authorized ground users at preset intervals of time slots, referred to as the decision cycle. We consider the following approach for scheduling UAV transmissions: each UAV receives a time slot D to begin transmission, and the handover must be completed after its decision cycle, in the subsequent time slot. We suppose that the UAVs are unaware of the precise amount of time they spend in the network. This characteristic prompted us to develop an online learning scheme for maximizing the long-run energy efficiency of the multi-UAV network31.
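The following minimal sketch shows how the per-user SINR of Eq. (8) can be evaluated for a given set of UAV decisions. The array-based interface and the variable names are our own simplification of the notation above.

```python
import numpy as np

def downlink_sinr(p, g, subch, serve, noise_w=1e-10):
    """SINR of each UAV's served user on its chosen sub-channel, as in Eq. (8).

    p      : (X,) transmit power of each UAV in watts
    g      : (X, U) numpy array of channel power gains from UAV x to user u
    subch  : (X,) sub-channel index chosen by each UAV
    serve  : (X,) index of the user served by each UAV
    noise_w: noise power in watts
    """
    X = len(p)
    sinr = np.zeros(X)
    for x in range(X):
        u = serve[x]
        signal = p[x] * g[x, u]
        # interference comes only from UAVs sharing the same sub-channel
        interf = sum(p[y] * g[y, u] for y in range(X)
                     if y != x and subch[y] == subch[x])
        sinr[x] = signal / (noise_w + interf)
    return sinr
```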

The framework of stochastic game for multi-UAVs systems

In this part, it’s started with a description of the optimization challenges addressed in this study. To imitate the randomness of the environment, a random set is then used to formulate the joint power level, user, and sub-channel selection problem.

Problem formulation

Note that, beginning from (6), each UAV transmitting at full power for maximum throughput results in greater interference to the other UAVs32. To ensure reliable UAV communication, the primary objective of the dynamic design of power level, user, and sub-channel selection is therefore to ensure that the SINR achieved by each UAV does not fall below a predetermined threshold33. In particular, this requirement can be written as follows:

graphic file with name d33e1070.gif 14

where the threshold is the QoS target for UAV users. If constraint (14) is satisfied in time slot D, the UAV earns a reward Sr_x(D), characterized as the gap between the throughput achieved with the selected user, power level, and sub-channel and the corresponding power cost; otherwise it earns no reward. Thus, the reward function of UAV x in time slot D can be written as:

graphic file with name d33e1102.gif 15

For every UAV x, Sr_x(D) represents the instantaneous payoff, and the power cost is charged per unit of transmit power. The instantaneous reward of UAV x in any time slot D depends on the following:

  1. Unobserved data: sub-channel and power levels as well as channel gain selected by other UAVs. Note that we exclude the UAV’s fixed energy consumption, such as that of the control unit and data processing23.

  2. Observed information: the user, power level, and sub-channel decisions of UAV x itself, together with its current channel gain;

To maximize the long-term benefit, each UAV selects its served user, transmit power level, and sub-channel in every time slot34. Specifically, we use the discounted future reward24 as the criterion for evaluating each UAV: at any point in the procedure, the discounted reward equals the current-period reward plus the future rewards discounted by a constant factor. Consequently, the long-term reward of UAV x is given by the following equation:

v_x(D) = \mathbb{E}\!\left[\sum_{\tau=0}^{\infty} \Delta^{\tau}\, Sr_x(D+\tau)\right]    (16)

where \Delta \in [0, 1) is the discount factor. If \Delta is near 0, the decision emphasizes short-term gains; if \Delta is close to 1, far-sighted decisions are made. This value determines how strongly future rewards influence the optimal decisions.

In Eq. (16), the parameter τ represents the time-step offset, or prediction horizon, used to compute the discounted cumulative reward from the current time slot D onward. It starts at τ = 0 and increases without bound (theoretically up to +∞), reflecting the forward-looking nature of reinforcement learning, in which agents aim to optimize long-term as well as immediate outcomes. Mathematically, τ indexes the number of steps into the future from the current decision point. The term Δ^τ serves as the discount weight that reduces the impact of future rewards as τ increases, making the algorithm more focused on near-term performance when Δ is small and more long-term focused when Δ approaches 1. In practice, although the sum in (16) is over an infinite horizon, the influence of distant rewards becomes negligible for Δ < 1 and large τ, so convergence is ensured. The cumulative reward function vₓ(D) is central to evaluating the utility of a UAV's current policy and drives the updates in Q-learning.
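A short numerical illustration of the discounted objective in Eq. (16), truncated to a finite reward trace; the trace values and discount factors below are arbitrary examples, not results from the paper.

```python
def discounted_return(rewards, delta=0.9):
    """Finite-trace approximation of Eq. (16): sum of delta**tau * Sr(D + tau)."""
    return sum((delta ** tau) * r for tau, r in enumerate(rewards))

trace = [1.0, 0.5, 0.8, 0.2]                 # example per-slot rewards
print(discounted_return(trace, delta=0.1))   # emphasises near-term reward
print(discounted_return(trace, delta=0.95))  # weights future rewards heavily
```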

Next, we collect all possible power level, sub-channel, and authorized-user decisions of UAV x into the action set \Phi_x, defined as the Cartesian product of the individual selection sets. Thus, the goal of each UAV x is to take decisions \Omega_x(D) \in \Phi_x that maximize the long-term performance. The optimization problem of UAV x can therefore be stated as follows:

graphic file with name d33e1274.gif 17

So, the optimal design of the multi-UAV system under consideration comprises X sub-problems corresponding to the X UAVs. Additionally, since each UAV lacks knowledge about the other UAVs, such as their rewards, problem (17) cannot be solved exactly.

In the subsections that follow, we make an effort to articulate joint sub-channel, power level, and the authorized user’s selection problems as non-cooperative stochastic games to resolve the random environment optimization problem (17).

Equation (15) formulates the optimization problem for each UAV as a single-agent objective, aiming to select a combination of user, sub-channel, and power level Ωₓ*(D) ∈ Φₓ that maximizes its instantaneous reward Srₓ(D). However, in a multi-UAV environment, each UAV’s reward is influenced not only by its own action but also by the simultaneous actions of other UAVs due to interference and shared sub-channels. Therefore, the independent optimization of Eq. (15) becomes coupled and interdependent, necessitating a game-theoretic formulation. To capture this interdependence, we reformulate the problem as a stochastic game (Markov game) where each UAV is a rational agent44. The global system state evolves over time, and each UAV selects its strategy based on its observed state. The key to solving this game lies in identifying a Nash equilibrium: a set of strategies µ* = [µ₁*, µ₂*, …, µₓ*] where no UAV can improve its expected cumulative reward by unilaterally deviating from its strategy, given the strategies of others.

Formulation of stochastic game

We modeled the problem in formula (17) in this section using the framework of a randomized game (also known as a Markov game)25 because it generalizes the Markov decision-making process to the case of multiple agents.

In the network under consideration, each UAV communicates with its user without knowledge of the overall system. We assume that all UAVs are rational and self-interested. Thus, to maximize its long-term return (17), each UAV selects its actions independently in any given time slot D, choosing the action from its own action space. The triple representing the action performed by UAV x in time slot D consists of its power level, user selection, and sub-channel selection in that slot. For each UAV x, the joint action performed in time slot D by the other UAVs is denoted accordingly.

As a result, the instantaneous SINR of Inline graphicin time slot D can be expressed as follows:

graphic file with name d33e1376.gif 18

where the indicator, power, and channel-gain terms in (18) are defined as before. Additionally, the instantaneous channel matrix of the responses between UAV x and the authorized ground users at the given time slot D is the following:

graphic file with name d33e1416.gif 19

with Inline graphic Inline graphic Inline graphic for all Inline graphic and Inline graphic.

Each UAV x can observe its current SINR level at any given time slot D. Consequently, the fully observed state of each UAV x is the following:

graphic file with name d33e1488.gif 20

Let the state vector collect the states of all UAVs. As the UAVs cannot cooperate, each UAV in this article is unaware of the states of the other UAVs.

We assume that each UAV’s actions follow the rules of the Markov chain, which means that a UAV’s reward is solely dependent on its state and path of action at any given moment. According to26, the dynamics of the state in a stochastic game where each player only acts in each state are represented by the Markov chain38. The Markov chain is defined formally in the manner that is detailed below.

Definition 1

A finite-state Markov chain is a discrete stochastic process defined as follows. Assume a finite collection of q states and a q \times q transition matrix E with non-negative entries E_{i,j} whose rows sum to one for any 1 \le i \le q.

The chain progresses step by step from one state to the next. Assume that the chain is currently in state i; the probability that the next state is j is

\Pr\left\{s^{D+1} = j \mid s^{D} = i, s^{D-1}, \ldots, s^{0}\right\} = \Pr\left\{s^{D+1} = j \mid s^{D} = i\right\} = E_{i,j}    (21)

It is also known as the Markov property because it just depends on the current state and not any past states.
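The Markov property in Eq. (21) can be illustrated with a toy two-state chain; the transition matrix below is an arbitrary example, not part of the system model.

```python
import numpy as np

# Toy 2-state chain: E[i, j] is the probability of moving from state i to state j,
# so each row of E sums to 1 and the next state depends only on the current one.
E = np.array([[0.7, 0.3],
              [0.4, 0.6]])

rng = np.random.default_rng(0)
state = 0
for _ in range(5):
    # Markov property: the draw uses only the current state's row of E
    state = rng.choice(E.shape[1], p=E[state])
```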

Consequently, the Inline graphic reward function, Inline graphic, can be expressed as

graphic file with name d33e1589.gif 22

For the sake of compact notation, the time slot index D is expressed as a superscript here; this notation will also be used for simplicity in the following sections. In (22), the action determines the instantaneous transmit power, while the UAV's instantaneous rate is given by

graphic file with name d33e1603.gif 23

Starting from (22), the payoff Sr_x that UAV x receives at each time slot D is determined by its current state, which is fully observed, and by the partially observed actions of the other UAVs, both of which depend on the current state. The probability of the new random state to which UAV x moves at the next time slot D + 1 depends only on the chosen actions and the previous state. This process is repeated until all available slots have been filled. Specifically, UAV x may monitor its own state and the related action at any time slot D, but it is unaware of the other players' actions and their precise values. Each player is also unaware of the state transition probabilities. The examined UAV system27 can thus be expressed as a stochastic game.

Definition 2

A tuple with values Inline graphic can be used to construct a stochastic game where,

  • Inline graphic denotes the state set with Inline graphic;

  • The group for players is Inline graphic;

  • Inline graphic stands for the player’s Inline graphic action set, while Inline graphic is the joint action set;

  • Inline graphic is the state transition probability function, and it is affected by what each player does.

Specifically, Inline graphic indicates the probability that the current state Inline graphic will change to the next stage Inline graphic by carrying out the joint action Inline graphic with Inline graphic.

• For each player Inline graphic, Inline graphic is a real-valued reward function.

A mixed strategy in a stochastic game refers to a set of probability distributions over the possible actions, mapping the state set to the action set. In further detail, the mixed strategy of UAV x in a given state assigns to each available action the probability that UAV x selects that action in that state. A vector of policies for the X players, one policy per player, is called a joint strategy and has the form \mu = [\mu_1, \ldots, \mu_X]. Let \mu_{-x} represent the same policy profile but without player x's policy \mu_x. Based on the aforementioned factors, each player x in the specified stochastic game has the optimization goal of maximizing its expected payoff over time. Under a joint strategy \mu that assigns a strategy \mu_x to each player x, this goal can be restated for player x as

graphic file with name d33e1943.gif 24

where Sr_x is the instantaneous reward received by UAV x at time D and \mathbb{E}[\cdot] stands for the expectation operator. In the defined stochastic game, the expected reward of each individual UAV depends on the joint strategy rather than on the player's own strategy alone. Because not all participants can maximize their expected rewards at once, it is unrealistic to simply expect them to do so. Next, we discuss a Nash equilibrium solution for the stochastic game28.
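As a small illustration of a mixed strategy, the snippet below draws an action from a per-state probability vector; the distribution shown is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng()

# A mixed strategy for one state: a probability distribution over the player's actions.
pi_state = np.array([0.1, 0.6, 0.3])              # arbitrary example distribution
action = rng.choice(len(pi_state), p=pi_state)    # sampled action index for this state
```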

Definition 3

A Nash equilibrium is a collection of strategies, one for each participant, each of which is a best response to the strategies of the other participants. In other words, if \mu^* is a Nash equilibrium solution, then for any player x the strategy \mu_x^* satisfies

graphic file with name d33e2002.gif 25

It implies that each UAV's action is the optimal reaction to the decisions made by the other UAVs in a Nash equilibrium. As long as all other UAVs maintain their current strategies, no UAV can gain from altering its approach in a Nash equilibrium solution. Keep in mind that the non-cooperative stochastic game's imperfect information structure gives players the chance to repeatedly engage with the stochastic environment and figure out their best course of action. Each player hopes to find a Nash equilibrium strategy for each state; accordingly, each player is viewed as a learning agent. In the following section, the RMAL framework is shown as a means of optimizing the sum of expected rewards (22) using partial data.

The proposed solution

In this part, the RMAL framework for multi-UAV systems is introduced. Then, a resource allocation plan based on Q-learning will be suggested to optimize the multi-UAV system under consideration’s expected long-term gain.

RMAL framework for Multi-UAV SYSTEMS

Figure 2 depicts the principal RMAL ingredients examined in this work. Specifically, the locally obtained information during time slot D, namely the state and the reward (outcome), is presented for each UAV x, together with the action that UAV x performed during time slot D. When all other players adopt a fixed policy profile, each player in a stochastic game faces a decision problem identical to a Markov decision scheme (MDS)26. All agents execute the decision algorithm individually while conforming to a common framework built on Q-learning. The dynamics of the environment have the Markov property, and the rewards received by the UAVs depend only on their current state and action39. The MDS of an agent x includes the following elements:

Fig. 2.

Fig. 2

RMAL framework for multi-UAV Systems.

  • A discrete set of environmental states represented by Inline graphic;

  • A discrete set of possible actions represented by Inline graphic;

  • The state migration probabilities are a representation of the environment time-gap dynamics, Inline graphic for all Inline graphic and Inline graphic;

  • a reward function represented by Inline graphic that represents the expected value of the subsequent Inline graphic reward.

For example, if the current state is s and the action a is performed, the environment moves to a subsequent state and offers a direct reward to UAV x at that time. Because the drones cannot communicate with one another, it is essential to remember that each UAV has only limited knowledge of the stochastic environment in which it operates. In this study, MDSs with learning agents operating in unknown stochastic environments, unaware of the reward and transition functions, are solved using Q-learning29. The Q-learning technique that can be utilized to solve a UAV's MDS is discussed next; without loss of generality, we consider a single UAV x for simplicity. The state-value function and the action-value function, commonly known as the Q function, are the two key concepts required to solve the MDS described above30.

To be more precise, the former is essentially the expected reward for reaching the various states in (22) and is what motivates the agent to follow a certain policy. Similarly, the Q function of UAV x gives the expected reward when it starts in a given state, takes a given action, and thereafter follows its policy, which may be represented as follows:

graphic file with name d33e2219.gif 26

where the value that corresponds to Eq. (26) is referred to as the action value or the Q-value.

Proposition 1

The defined return can be used as a starting point for deriving the recurrence relation of the state-value function. To be more specific, for any policy and any state, the following consistency property must hold between a state and its successor states:

graphic file with name d33e2275.gif 27

where Inline graphicis the probability that the Inline graphic would select a state-level action Inline graphic in state Inline graphic.

Note that the expected reward obtained when starting in a given state and subsequently adhering to a policy is denoted by the state-value function. Based on Proposition 1, the Q function in Eq. (26) can also be rewritten recursively. The resulting equation is as follows:

graphic file with name d33e2335.gif 28

Keep in mind that starting with the value (26), all UAV behaviors become reliant on the Q-value. It is essential to be aware that Eqs. 27 and 28 make up the fundamental building blocks of the Q-learning-based reinforcement learning method used to solve the MDS for each UAV36. Equations (27) and (28), which may be found above, can also be applied to produce the connection shown below between state values and Q-values.

graphic file with name d33e2360.gif 29

As was noted before, the objective of figuring out how to solve the MDS is to identify the best course of action that will result in the greatest possible payoff. When examining the situation from the standpoint of the state value function41, we can say that the best course of action for the Inline graphicin state Inline graphic is as follows:

graphic file with name d33e2387.gif 30

To achieve the best possible Q-values, we also have.

graphic file with name d33e2398.gif 31

when solving Eq. (28) by substituting into Eq. (29), one possible rewrite of the optimal state value equation is:

graphic file with name d33e2410.gif 32

Also, consider the fact that the use of Inline graphic yields (32). It is important to keep in mind that, as opposed to the strategy space, the optimal state value equation in Eq. (32) maximizes the action space. Equation (32) can then be used with Eqs. (27) and (28), respectively, to create the Bellman optimum equations for state values and Q-values42, as follows.

 

graphic file with name d33e2436.gif 33

And.

 

graphic file with name d33e2446.gif 34

The optimal course of action is always the one that maximizes the Q-function of the current state (34); this follows from the optimal policy of always choosing the action with the highest value43. Choosing the optimal joint strategy can be challenging, however, since in a multi-agent setting each agent's Q-function depends on the joint action30. To address this issue, we treat the UAVs as independent learners (ILs): each UAV acts and interacts with its surroundings as if no other UAVs were present, since it is blind to the rewards and actions of the other UAVs.
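Given a learned tabular Q-function, the greedy policy and the implied state values of Eqs. (30)-(31) can be extracted as follows; this is a generic sketch, independent of the specific UAV state and action encoding.

```python
import numpy as np

def greedy_policy_and_values(Q):
    """From a tabular Q-function Q[s, a], return the greedy policy
    pi*(s) = argmax_a Q(s, a) and the implied state values V*(s) = max_a Q(s, a)."""
    policy = np.argmax(Q, axis=1)   # best action per state
    values = np.max(Q, axis=1)      # state value implied by the Q-table
    return policy, values
```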

Resource allocation based on Q-learning for Multi-UAVs systems

The resource allocation problem among UAVs is addressed in this part with an ILs-based31 RMAL algorithm. Each UAV chooses the optimal policy for its MDS by executing a standard Q-learning procedure to obtain its optimal Q-values45. More specifically, the choice of action in each iteration is determined by the Q-values of the current state and of its successor states; the Q-values thus reveal which actions will be taken in the subsequent states. The Q-learning update rule is given by the following expression.

Q_x(s_x^{D}, a_x^{D}) \leftarrow Q_x(s_x^{D}, a_x^{D}) + \lambda^{D}\left[ Sr_x^{D} + \Delta \max_{a'} Q_x(s_x^{D+1}, a') - Q_x(s_x^{D}, a_x^{D}) \right]    (35)

It is essential to remember that the optimal action-value function can be obtained by iteratively updating the corresponding action values46. To be more specific, each agent acquires the optimal action values by following the update rule in Eq. (35), where Q_x(s_x^{D}, a_x^{D}) is the action value of UAV x in time slot D and \lambda^{D} denotes the learning rate. Another crucial component of the Q-learning system is the action selection mechanism, which determines the actions the agent carries out while it is acquiring new knowledge. The aim is to achieve a balance between exploration and exploitation, so that the agent can build on what it currently recognizes as good decisions while still studying new actions32. Within the scope of this research, we investigate ϵ-greedy exploration. With probability ϵ, the agent makes a random selection; otherwise, the agent chooses the action with the largest current Q-value. As a result, the probability of selecting an action a while in a state s can be computed using the following equation.

\pi_x(s, a) = \begin{cases} 1 - \epsilon + \dfrac{\epsilon}{|\Phi_x|}, & \text{if } a = \arg\max_{a'} Q_x(s, a') \\ \dfrac{\epsilon}{|\Phi_x|}, & \text{otherwise} \end{cases}    (36)

To guarantee that Q-learning eventually converges, the learning rate \lambda^{D} is chosen as in33 and is given by the following equation.

graphic file with name d33e2613.gif 37

It is imperative to bear in mind that every UAV operates independently during the Q-learning phase of the proposed ILs-based RMAL algorithm. The resulting Q-learning procedure executed by each UAV x is summarized in Algorithm 1.
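A minimal sketch of the two ingredients just described: ϵ-greedy selection in the spirit of Eq. (36) and a decaying learning rate. The decay form used here is an assumption for illustration; the paper's exact schedule in Eq. (37) is not reproduced.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_row, epsilon):
    """Eq. (36)-style selection on one row of a Q-table: explore uniformly with
    probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # explore
    return int(np.argmax(q_row))               # exploit

def learning_rate(D, theta=0.8):
    """Assumed decaying step size in the slot index D: updates slow down as D grows,
    which is the behaviour required for Q-learning to converge."""
    return 1.0 / (1 + D) ** theta
```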

Algorithm 1

Because the starting value of Q in Algorithm 1 is always set to zero, this learning method is sometimes referred to as zero-initialized Q-learning34. Because the UAV does not have any prior information about the initial state, it starts from a strategy that assigns equal probability to all actions. A compact Python sketch of this procedure is provided after the listing.

Algorithm: Q-learning based RMAL algorithm for multi-UAV systems

  • Begin.

  • Initialize the time slot index D = 0 and the parameters (discount factor, exploration rate, and learning-rate schedule).

  • for each UAV x do

  • Initialize the action values Q_x(s_x, a_x) = 0 and the selection policy \pi_x;

  • Load and initialize the state s_x = 0;

  • end for

  • while the stopping criterion is not met do // begin the main loop

  • for each UAV x do

  • Tune the learning rate \lambda^{D} according to (37).

  • Choose an action a_x^{D} according to the selection scheme \pi_x in (36).

  • Compute the SINR value at the receiver according to (8).

  • if the QoS constraint (14) is satisfied then

  • Set the QoS indicator to 1.

  • else

  • Set the QoS indicator to 0.

  • end if

  • Update the instantaneous reward Sr_x^{D} according to (15).

  • Update the action value Q_x(s_x^{D}, a_x^{D}) according to (35).

  • Update the selection scheme \pi_x(s_x) according to (36).

  • Set D \leftarrow D + 1 and update the state s_x^{D}.

  • end for

  • end while
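For concreteness, the following Python sketch mirrors the structure of Algorithm 1 as zero-initialized, independent Q-learning. The `env` object, its `reset`/`step` interface, and the state/action encoding are assumptions introduced for illustration; they stand in for the SINR, reward, and state updates of Eqs. (8), (15), and (20).

```python
import numpy as np

def run_rmal(env, X, n_states, n_actions, slots=500,
             delta=0.9, epsilon=0.5, rng=np.random.default_rng()):
    """Decentralised, zero-initialised Q-learning in the spirit of Algorithm 1.

    Each of the X UAVs keeps its own Q-table and never observes the other UAVs'
    actions or rewards. `env` is a user-supplied object (assumed interface) with:
      env.reset()        -> list of X initial state indices
      env.step(actions)  -> (next_states, rewards) for the chosen joint action
    """
    Q = [np.zeros((n_states, n_actions)) for _ in range(X)]
    states = env.reset()
    for D in range(slots):
        lr = 1.0 / (1 + D) ** 0.8                       # decaying learning rate (assumed form)
        actions = []
        for x in range(X):
            if rng.random() < epsilon:                  # explore
                actions.append(int(rng.integers(n_actions)))
            else:                                       # exploit current Q-values
                actions.append(int(np.argmax(Q[x][states[x]])))
        next_states, rewards = env.step(actions)
        for x in range(X):                              # independent Q-learning update, Eq. (35)
            s, a, s2 = states[x], actions[x], next_states[x]
            td_target = rewards[x] + delta * np.max(Q[x][s2])
            Q[x][s, a] += lr * (td_target - Q[x][s, a])
        states = next_states
    return Q
```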

The proposed RMAL algorithm analysis

Here, we examine the convergence of the previously proposed RMAL-based resource allocation strategy. It is essential to remember that the RMAL algorithm presented here may be thought of as an independent multi-agent Q-learning method, in which each UAV acts as a learning agent that uses the Q-learning algorithm to make its decisions. As a consequence, convergence can be understood by considering the following proposition.

Proposition 2

When the RMAL method of the proposed Algorithm 1 is applied, the Q-learning algorithm of every UAV eventually converges to the Q-values of an optimal policy. The following observation is crucial to proving Proposition 2: because the UAVs are non-cooperative, the convergence of the proposed RMAL algorithm relies on the convergence of the Q-learning method31.

Theorem 1

The Q-learning update rule (35) used in Algorithm 1 converges to the optimal Q-values with probability one if:

  • There is a finite number of states and actions;

  • The learning rates satisfy \sum_{D} \lambda^{D} = \infty and \sum_{D} (\lambda^{D})^{2} < \infty uniformly with probability one;

  • The variance of the rewards Sr_x is bounded.

Simulation results

In this portion of the article, the performance of the suggested RMAL-based resource allocation strategy for multi-UAV systems is evaluated through simulations. We assume a multi-UAV system set up in a disk of radius 600 m, with the ground users uniformly and randomly distributed over the disk. All UAVs are assumed to fly at the same altitude of 80 m. During the simulation, we assume a noise power of −70 dBm, a sub-channel bandwidth of 65 kHz, and a time slot duration of 0.1 s. The channel parameters in the simulation are determined by the probabilistic model in Eq. (5) with environment-dependent constants a and b, a carrier frequency of 2 GHz, and the corresponding additional path losses for the LoS and NLoS links. In the LoS channel model scenario11, the path-loss exponent and the channel power gain at the reference distance are set accordingly. The maximum power per UAV in the simulation is 23 dBm, and it is split equally into J = 3 discrete power levels. A unit cost per power level is assumed, and each user is expected to maintain a minimum SINR of 3 dB.

In Fig. 3, we consider one realization of a random multi-UAV system in which 80 users are randomly dispersed across a disk of radius 600 m and three UAVs are initially positioned at the disk edge. For clarity, Fig. 4 shows the average reward and the average reward per time slot for UAVs flying at 40 m/s under the conditions shown in Fig. 3. The average rewards are computed as shown in Fig. 4(a). As Fig. 4(a) shows, the average reward rises with the number of algorithm iterations, which reflects the ability of the proposed RMAL algorithm to increase the long-term reward. Nevertheless, the average reward curve flattens once t exceeds about 250 time slots, because the UAVs then fly out of the disk and the average reward no longer increases. Figure 4(b) depicts the average immediate reward received per time slot, which is consistent with this observation.

Fig. 3.

Fig. 3

Multi-UAV system with X = 3 UAVs and U = 80 users.

Fig. 4.

Fig. 4

Comparing the Average Rewards with different Inline graphic, X = 3 and U = 80.

The key simulation parameters and performance observations for evaluating the proposed RMAL-based multi-UAV communication framework are summarized in Table 2. This table outlines critical metrics including environmental setup (e.g., disk radius, number of users, UAV altitude), communication parameters (e.g., noise power, sub-channel bandwidth, carrier frequency), and algorithm-specific observations (e.g., immediate and long-term rewards, SINR thresholds). These metrics provide a foundational basis for assessing the efficiency and stability of the proposed algorithm under realistic operational constraints and dynamic network conditions.

Table 2.

Performance metrics for RMAL-Based Multi-UAV system.

Metric | Value/Observation | Description
Disk radius | 600 m | Radius of the area over which ground users are randomly distributed.
Number of users (U) | 80 | Total number of ground users uniformly dispersed across the disk.
UAV altitude (H) | 80 m | Altitude at which the UAVs operate.
Noise power | −70 dBm | Assumed noise power in the simulation.
Sub-channel bandwidth | 65 kHz | Bandwidth allocated to each sub-channel.
Time slot duration | 0.1 s | Duration of each time slot in the simulation.
Carrier frequency (f_c) | 2 GHz | Carrier frequency used in the simulation.
Maximum UAV power | 23 dBm | Maximum transmit power per UAV in the simulation.
Power levels (J) | 3 | Number of discrete power levels per UAV.
Minimum SINR | 3 dB | Required minimum SINR for maintaining communication quality.
Average reward | Increases until roughly t = 250 time slots | Reflects the ability of the RMAL algorithm to maximize long-term rewards before the UAVs leave the disk.
Immediate reward | Increases with iterations; plateaus after about 250 time slots | Represents the instantaneous rewards received per time slot.
Efficiency plateau | Observed after roughly 250 time slots | Rewards stop increasing as the UAVs exit the operational area.
Execution time efficiency | RMAL shows consistent reward improvement within 250 time slots | Efficient in achieving rewards quickly due to adaptive Q-learning updates.
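For reference, the reported settings of Table 2 can be collected into a single configuration object; the grouping and key names below are ours, not part of the original simulator.

```python
# Simulation parameters as reported in Table 2.
SIM_PARAMS = {
    "disk_radius_m": 600,
    "num_users": 80,
    "uav_altitude_m": 80,
    "noise_power_dbm": -70,
    "subchannel_bandwidth_hz": 65e3,
    "slot_duration_s": 0.1,
    "carrier_frequency_hz": 2e9,
    "max_uav_power_dbm": 23,
    "num_power_levels": 3,
    "min_sinr_db": 3,
}
```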

In Fig. 4(b), the x-axis represents the number of algorithm iterations, which corresponds to discrete time slots during which each UAV updates its policy based on observed states and rewards. These iterations range from 0 to a predefined simulation horizon (e.g., 500 slots) and are crucial for the convergence of the Q-learning-based RMAL algorithm. The y-axis denotes the UAV speed (in meters per second), which influences how frequently a UAV encounters new users and changes its spatial context. Higher speeds typically allow UAVs to explore the environment more dynamically, while lower speeds may result in more localized communication. It is important to note that both algorithm iterations and UAV speed are inherently non-negative in the actual simulation. Any negative values observed in the figure are purely visual artifacts from the surface plotting function and do not correspond to real-world or simulated states. These have been retained only to provide a smooth visualization of the reward surface.

To analyze the impact of the exploration rate (ϵ) on the learning dynamics of the RMAL-based multi-UAV system, a comparative evaluation of average rewards under different exploration settings was conducted. The results, as summarized in Table 3, highlight the critical role of balancing exploration and exploitation in reinforcement learning environments40.

Table 3.

Average rewards for different exploration rates (ϵ).

Exploration Rate (ϵ) | Initial Average Reward | Final Average Reward | Observation
ϵ = 0.5 | ~5 | ~70 | Achieves the highest final reward, showing a balance of exploration and exploitation.
ϵ = 0.2 | ~3 | ~65 | Slightly lower final reward than ϵ = 0.5, indicating less exploration.
ϵ = 0.8 | ~4 | ~50 | Moderate performance with more exploration but slower convergence.
ϵ = 0 | ~0 | ~20 | Achieves the lowest reward due to purely exploitative behavior, with no exploration for learning.

Specifically, the exploration rate ϵ = 0.5 yielded the highest final average reward, indicating an effective trade-off between exploring new actions and exploiting known high-reward strategies. In contrast, lower exploration (ϵ = 0.2) led to slightly reduced performance, reflecting limited exposure to alternative actions and potentially suboptimal policy convergence. At ϵ = 0.8, the algorithm engaged in broader exploration but exhibited slower convergence and achieved only moderate rewards, suggesting that excessive exploration can delay learning stabilization. Notably, the purely exploitative configuration (ϵ = 0) resulted in the lowest reward values, as the system lacked the exploratory behavior needed to discover optimal strategies in dynamic environments.
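The qualitative effect of ϵ can be reproduced with a self-contained toy experiment (a stationary multi-armed-bandit stand-in, not the paper's UAV simulator): ϵ = 0 gets stuck on the first rewarding action, while non-zero ϵ discovers better ones. The exact ranking among non-zero values depends on the environment, so the numbers produced below are not those of Table 3.

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.2, 0.5, 0.1, 0.8, 0.4])   # hidden mean reward of each action

for eps in (0.0, 0.2, 0.5, 0.8):
    Q = np.zeros(5)          # zero-initialised action values
    counts = np.zeros(5)
    total = 0.0
    for t in range(2000):
        a = int(rng.integers(5)) if rng.random() < eps else int(np.argmax(Q))
        r = rng.normal(true_means[a], 0.1)          # noisy reward sample
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]              # incremental mean update
        total += r
    print(f"epsilon={eps}: average reward per step = {total / 2000:.3f}")
```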

As the algorithm iterates more, the average reward per time slot declines, as seen in Fig. 4(b).

The learning rate of the proposed Q-learning strategy in (35) depends on the value of D and decreases as the number of time slots rises. It is significant to note that as the number of algorithm iterations increases, the learning rate falls, showing that the update rate of the Q-values slows down as the time step grows. In addition, Fig. 4 analyzes how the average reward changes with ϵ. Each UAV takes a greedy action, commonly referred to as an exploitation strategy, if ϵ = 0, whereas each UAV selects a random action with higher probability as ϵ approaches 1. It should be noted that ϵ = 0.5 is a reliable choice in the considered arrangement, as shown in Fig. 4.

In Figs. 5 and 6, we take a look at how different system settings affect the typical number of rewards received. Using the LoS channel model stated in Eq. (4), Fig. 5 shows a graphical depiction of the average rewards received at various settings.

Fig. 5.

Fig. 5

Average Rewards Comparison for LoS Channel Model with different Inline graphic X = 3 and U = 80.

Fig. 6.

Fig. 6

Multi-UAVs Systems Illustration with K = 3, M = 4, and U = 250.

In addition, a typical reward generated by a probabilistic model using Inline graphicis shown in Fig. 6. To be more precise, the UAVs are dispersed in a random pattern along the edges of the cells. In the iterative method, each UAV flies over the cell and then keeps flying over the disk centre, which is also the centre of the cell. As can be seen in Figs. 5 and 6, the pattern of the curves representing the average reward applied to the different Inline graphic values is similar to that depicted in Fig. 6. In addition, the multi-UAV network under study is capable of achieving the optimal average reward for a variety of different network configurations.

In Fig. 7, we assess the average reward of the proposed RMAL algorithm by comparing it with a matching theory-based resource allocation method. We analyze the same configuration as in Fig. 4, but simplify the algorithm implementation so that the UAV actions consist only of the user selection in each time slot. Further, we consider a matching theory-based user selection algorithm with full information exchange between the UAVs; this implies that, before making a decision, each UAV is aware of what the other UAVs have done. We employ the Gale-Shapley (GS) approach35 for the matching theory-based user selection in each time slot as a point of comparison. Additionally, we evaluate a random user selection technique (Rand) as a baseline scheme in Fig. 7. Figure 7 illustrates that, in terms of average reward, the matching-based user selection algorithm performs better than the proposed RMAL method. This is due to the lack of information sharing in the proposed RMAL algorithm: each UAV makes its decision independently, since it cannot keep track of the information processed by the other UAVs, such as their rewards and choices.

Fig. 7. Average rewards comparison for different algorithms, where K = 1, J = 1, M = 2, and U = 80.

The performance comparison of the different algorithms is summarized in Table 4. Among them, the proposed RMAL algorithm achieves the highest final average reward (~ 80) at a power level of 23 dBm, demonstrating efficient learning and resource allocation37. The matching-theory-based algorithm (Mach) performs moderately well with a final reward of ~ 50, while the random-action baseline (Rand) performs the worst with a reward of only ~ 30. These results demonstrate the effectiveness of RMAL in dynamic UAV communication environments.

Table 4.

Average rewards comparison for different algorithms.

Algorithm Power Level Initial Average Reward Final Average Reward Observation
RMAL 23 dBm ~ 5 ~ 80 RMAL achieves the highest reward, showcasing efficient learning and resource allocation.
Mach — ~ 3 ~ 50 Mach performs moderately well, with slower convergence compared with RMAL.
Rand — ~ 1 ~ 30 Rand performs the worst, with low rewards due to its purely random resource allocation strategy.

As shown in Fig. 7, the proposed RMAL algorithm also achieves a higher average reward than the random user selection strategy, which yields the lowest average reward because randomly chosen users prevent the system from effectively exploiting the observed information. As a result, the developed RMAL algorithm strikes a balance between reducing the cost of information exchange and maintaining overall system performance.

In Fig. 8, we investigate how the number of algorithm iterations and the UAV speed affect the average reward. Here, each UAV takes off from a random point on the edge of the disc and flies directly over the disc centre at different speeds. Figure 8 uses the same layout as Fig. 6, except that two additional UAV speeds are included for illustration. As can be observed, for a fixed speed, the average reward increases steadily with the number of algorithm iterations. Moreover, when D is less than 150, the average reward for higher speeds rises more quickly than for lower speeds over the same elapsed time. This is because the user and UAV positions are chosen at random, so a UAV may not immediately identify a user that satisfies its QoS criteria. Figure 8 further shows that, at the end of the algorithm iterations, higher speeds are associated with lower average rewards. A faster UAV needs less time to cross the disc, and consequently the total service time for UAVs flying at higher speeds is shorter than for UAVs flying at lower speeds.

Fig. 8. Average rewards comparison with different algorithms, where K = 1, I = 1, X = 2, and U = 80.

Comparative analysis with MARL algorithms

To evaluate the effectiveness of the proposed RMAL framework, we compare it with three state-of-the-art multi-agent reinforcement learning algorithms:

  • QMIX: A value-decomposition-based MARL algorithm that combines individual Q-functions into a global Q-function under a monotonicity constraint, so that each agent's greedy action remains consistent with the joint greedy action.

  • MADQN: Multi-agent DQN, which extends DQN to multi-agent settings using shared experience and coordinated updates.

  • MAPPO: Multi-Agent Proximal Policy Optimization, a widely used actor-critic algorithm applicable to both discrete and continuous action spaces.

The simulation environment consists of 3 UAVs (X = 3), 80 users (U = 80), 4 sub-channels (K = 4), and 3 power levels (i = 3). Each algorithm was executed for T = 500 time slots with identical initialization and reward functions.
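A hedged sketch of such an evaluation harness is given below: every algorithm is run on an identically seeded environment for T = 500 time slots, and the average reward per slot is recorded. ToyUavEnv and RandomAgent are stand-ins introduced purely for illustration; they are not the paper's simulator or the actual RMAL/QMIX/MADQN/MAPPO implementations.

    # Hedged sketch of the per-algorithm evaluation protocol (illustrative only).
    import random

    T_SLOTS, NUM_UAVS, NUM_USERS, NUM_SUBCH, NUM_POWERS = 500, 3, 80, 4, 3

    class ToyUavEnv:
        """Toy stand-in environment with an identical seed for every algorithm."""
        def __init__(self, seed):
            self.rng = random.Random(seed)
        def step(self, actions):
            # Reward grows with the number of non-colliding (user, sub-channel) choices.
            distinct = len({(a["user"], a["subch"]) for a in actions})
            return distinct + self.rng.random()

    class RandomAgent:
        """Placeholder policy; a learning agent would update its policy each slot."""
        def __init__(self, seed):
            self.rng = random.Random(seed)
        def act(self):
            return {"user": self.rng.randrange(NUM_USERS),
                    "subch": self.rng.randrange(NUM_SUBCH),
                    "power": self.rng.randrange(NUM_POWERS)}

    def evaluate(agent_cls, seed=0):
        env = ToyUavEnv(seed)                                   # identical initialisation
        agents = [agent_cls(seed + i) for i in range(NUM_UAVS)]
        total = sum(env.step([a.act() for a in agents]) for _ in range(T_SLOTS))
        return total / T_SLOTS                                  # average reward per slot

    print(evaluate(RandomAgent))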

Table 5 compares the proposed RMAL algorithm with the other multi-agent reinforcement learning methods. While MAPPO and QMIX yield slightly higher average rewards and SINR satisfaction, they incur higher computational and memory costs owing to centralized training and value decomposition mechanisms47. In contrast, the proposed RMAL algorithm achieves competitive performance with lower overhead, faster convergence, and full decentralization, making it suitable for real-time UAV networks where bandwidth and processing power are limited.

Table 5.

Performance comparison of RMAL with other MARL algorithms.

Algorithm Final Avg. Reward Convergence Time (slots) SINR ≥ γ₀ (%) Computation Overhead
RMAL (Ours) 80 250 89.3 Low
QMIX 85 320 92.1 Medium
MADQN 78 400 87.6 High
MAPPO 90 350 94.2 High

Conclusion

To optimize the long-term reward, this paper investigated a real-time resource allocation mechanism for a multi-UAV downlink system. To capture the unpredictability of the environment, we formulated the dynamic resource allocation problem as a stochastic game in which each UAV seeks a resource allocation strategy that maximizes its expected reward. To solve this stochastic game, we developed an RMAL technique based on independent learners (ILs), which allows all UAVs to make decisions independently using Q-learning and thereby reduces the cost of information sharing and processing. The simulation results show that the RMAL-based resource allocation method balances the cost of information exchange against overall system performance. Promising directions for future work include cooperative information exchange and more sophisticated joint learning algorithms for multi-UAV systems, as well as integrating UAV deployment and trajectory optimization to further improve energy efficiency.

Acknowledgements

This research was supported by the Chung-Ang University Research Grants in 2025.

Author contributions

Muhammad Shoaib: Conceptualization, Methodology, Writing – original draft, Investigation. Ghassan Husnain: Methodology, Investigation, Writing – original draft. Muhsin Khan: Data curation, Formal analysis, Validation, Writing – review & editing. Yazeed Yasin Ghadi: Validation, Resources, Writing – review & editing, Visualization. Sangsoon Lim: Supervision, Software, Data curation, Writing – review & editing.

Data availability

All relevant data are contained within the text of the manuscript.

Declarations

Competing interests

The authors declare no competing interests.

Ethics approval statement

This study did not require ethics approval as it involved the analysis of publicly available, anonymized data and did not involve direct interaction with human or animal subjects.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Ghassan Husnain, Email: ghassan.husnain@gmail.com.

Sangsoon Lim, Email: lssgood80@gmail.com.

References

  • 1. Wan, F., Yaseen, M. B., Riaz, M. B., Shafiq, A., Thakur, A. & Rahman, M. D. O. Advancements and challenges in UAV-based communication networks: a comprehensive scholarly analysis. Results Eng. 103271 (2024).
  • 2. Orgeira-Crespo, P. & García-Luis, U. Brief introduction to unmanned aerial systems. In Applying Drones to Current Societal and Industrial Challenges 1–22 (Springer, 2024).
  • 3. Mozaffari, M., Saad, W., Bennis, M., Nam, Y. H. & Debbah, M. A tutorial on UAVs for wireless networks: applications, challenges, and open problems. IEEE Commun. Surv. Tutorials 21(3), 2334–2360 (2019).
  • 4. Banafaa, M. K. et al. A comprehensive survey on 5G-and-beyond networks with UAVs: applications, emerging technologies, regulatory aspects, research trends and challenges. IEEE Access 12, 7786–7826 (2024).
  • 5. Yuan, B. et al. Service time optimization for UAV aerial base station deployment. IEEE Internet Things J. 11 (2024).
  • 6. Mozaffari, M., Saad, W., Bennis, M. & Debbah, M. Efficient deployment of multiple unmanned aerial vehicles for optimal wireless coverage. IEEE Commun. Lett. 20(8), 1647–1650 (2016).
  • 7. Lyu, J., Zeng, Y., Zhang, R. & Lim, T. J. Placement optimization of UAV-mounted mobile base stations. IEEE Commun. Lett. 21(3), 604–607 (2017).
  • 8. Debnath, D., Vanegas, F., Boiteau, S. & Gonzalez, F. An integrated geometric obstacle avoidance and genetic algorithm TSP model for UAV path planning. Drones 8(7), 302 (2024).
  • 9. Zeng, Y., Zhang, R. & Lim, T. J. Throughput maximization for UAV-enabled mobile relaying systems. IEEE Trans. Commun. 64(12), 4983–4996 (2016).
  • 10. Zeng, Y., Xu, X. & Zhang, R. Trajectory design for completion time minimization in UAV-enabled multicasting. IEEE Trans. Wirel. Commun. 17(4), 2233–2246 (2018).
  • 11. Wu, Q., Zeng, Y. & Zhang, R. Joint trajectory and communication design for multi-UAV enabled wireless networks. IEEE Trans. Wirel. Commun. 17(3), 2109–2121 (2018).
  • 12. Zhang, S., Zhang, H., Di, B. & Song, L. Cellular UAV-to-X communications: design and optimization for multi-UAV networks. IEEE Trans. Wirel. Commun. 18(2), 1346–1359 (2019).
  • 13. Qureshi, K. I., Lu, B., Lu, C., Lodhi, M. A. & Wang, L. Multi-agent DRL for air-to-ground communication planning in UAV-enabled IoT networks. Sensors 24(20), 6535 (2024).
  • 14. Horn, J. F., Schmidt, E. M., Geiger, B. R. & DeAngelo, M. P. Neural network-based trajectory optimization for unmanned aerial vehicles. J. Guid. Control Dyn. 35(2), 548–562 (2012).
  • 15. Nodland, D., Zargarzadeh, H. & Jagannathan, S. Neural network-based optimal control for trajectory tracking of a helicopter UAV. In Proc. IEEE Conf. Decis. Control 3876–3881 (2011).
  • 16. 2018 IEEE Global Communications Conference (GLOBECOM), Abu Dhabi, UAE, 9–13 December 2018, 1–6 (IEEE, 2018).
  • 17. Chen, J. et al. Extrinsic-and-intrinsic reward-based multi-agent reinforcement learning for multi-UAV cooperative target encirclement. IEEE Trans. Intell. Transp. Syst. (2025).
  • 18. Betalo, M. L. et al. Generative AI-driven multi-agent DRL for task allocation in UAV-assisted EMPD within 6G-enabled SAGIN networks. IEEE Internet Things J. 12 (2025).
  • 19. Chen, M., Saad, W. & Yin, C. Liquid state machine learning for resource allocation in a network of cache-enabled LTE-U UAVs. In 2017 IEEE Global Communications Conference (GLOBECOM) 1–6 (2017).
  • 20. Chen, J., Wu, Q., Xu, Y., Zhang, Y. & Yang, Y. Distributed demand-aware channel-slot selection for multi-UAV networks: a game-theoretic learning approach. IEEE Access 6, 14799–14811 (2018).
  • 21. Sun, N. & Wu, J. Minimum error transmissions with imperfect channel information in high mobility systems. In Proc. IEEE Mil. Commun. Conf. (MILCOM) 922–927 (2013).
  • 22. Cai, Y., Yu, F. R., Li, J., Zhou, Y. & Lamont, L. Medium access control for unmanned aerial vehicle (UAV) ad-hoc networks with full-duplex radios and multipacket reception capability. IEEE Trans. Veh. Technol. 62(1), 390–394 (2013).
  • 23. Li, J. et al. A reinforcement learning based stochastic game for energy-efficient UAV swarm assisted MEC with dynamic clustering and scheduling. IEEE Trans. Green Commun. Netw. 9 (2024).
  • 24. Guerra, A., Guidi, F., Dardari, D. & Djurić, P. M. Reinforcement learning for joint detection and mapping using dynamic UAV networks. IEEE Trans. Aerosp. Electron. Syst. 60(3), 2586–2601 (2023).
  • 25. Bucaille, I. et al. Rapidly deployable network for tactical applications: aerial base station with opportunistic links for unattended and temporary events, ABSOLUTE example. In Proc. IEEE Mil. Commun. Conf. (MILCOM) 1116–1120 (2013).
  • 26. How, J. & King, E. Flight demonstrations of cooperative control for UAV teams (AIAA, 2004).
  • 27. Xiao, Y. et al. Space-air-ground integrated wireless networks for 6G: basics, key technologies and future trends. IEEE J. Sel. Areas Commun. 42 (2024).
  • 28. Chandrasekharan, S. et al. Designing and implementing future aerial communication networks. IEEE Commun. Mag. 54(5), 26–34 (2016).
  • 29. Al-Hourani, A., Kandeepan, S. & Lardner, S. Optimal LAP altitude for maximum coverage. IEEE Wirel. Commun. Lett. 3(6), 569–572 (2014).
  • 30. Ali, S., Abu-Samah, A., Abdullah, N. F. & Kamal, N. L. M. Propagation modeling of unmanned aerial vehicle (UAV) 5G wireless networks in rural mountainous regions using ray tracing. Drones 8(7), 334 (2024).
  • 31. Qi, X., Chong, J., Zhang, Q. & Yang, Z. Towards cooperatively caching in multi-UAV assisted network: a queue-aware CDS-based reinforcement learning mechanism with energy efficiency maximization. IEEE Internet Things J. 19 (2024).
  • 32. Gu, L. & Mohajer, A. Joint throughput maximization, interference cancellation, and power efficiency for multi-IRS-empowered UAV communications. Signal Image Video Process. 18(5), 4029–4043 (2024).
  • 33. Shao, Z. et al. Deep reinforcement learning-based resource management for UAV-assisted mobile edge computing against jamming. IEEE Trans. Mob. Comput. 23 (2024).
  • 34. Yu, H., Zhang, L., Li, Y., Chin, K. W. & Yang, C. Channel access methods for RF-powered IoT networks: a survey. arXiv preprint arXiv:2404.14826 (2024).
  • 35. Shoham, Y. & Leyton-Brown, K. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations 1–483 (Cambridge University Press, 2008).
  • 36. Nowé, A., Vrancx, P. & De Hauwere, Y. M. Game theory and multi-agent reinforcement learning. Adapt. Learn. Optim. 12, 441–470 (2012).
  • 37. Muthoo, A., Osborne, M. J. & Rubinstein, A. A Course in Game Theory. 63(249) (1996).
  • 38. Neyman, A. From Markov chains to stochastic games. Stoch. Games Appl. 570, 9–25 (2003).
  • 39. Betti Sorbelli, F. UAV-based delivery systems: a systematic review, current trends, and research challenges. J. Auton. Transp. Syst. 1(3), 1–40 (2024).
  • 40. Neto, G. Reinforcement Learning: Foundational (2005).
  • 41. Castrillo, V. U., Pascarella, D., Pigliasco, G., Iudice, I. & Vozella, A. Learning-in-games approach for the mission planning of autonomous multi-drone spatio-temporal sensing. IEEE Access 12 (2024).
  • 42. He, Q., Zhou, T., Fang, M. & Maghsudi, S. Adaptive regularization of representation rank as an implicit constraint of Bellman equation. arXiv preprint arXiv:2404.12754 (2024).
  • 43. Hu, K. et al. A review of research on reinforcement learning algorithms for multi-agents. Neurocomputing 599, 128068 (2024).
  • 44. Matignon, L., Laurent, G. J. & Le Fort-Piat, N. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. Knowl. Eng. Rev. 27(1), 1–31 (2012).
  • 45. Tan, L. et al. An adaptive Q-learning based particle swarm optimization for multi-UAV path planning. Soft Comput. 28(13), 7931–7946 (2024).
  • 46. Iftikhar, A. et al. A reinforcement learning recommender system using bi-clustering and Markov decision process. Expert Syst. Appl. 237, 121541 (2024).
  • 47. Koenig, S. & Simmons, R. G. Complexity analysis of real-time reinforcement learning applied to finding shortest paths in deterministic domains (1992).


