Sensors (Basel, Switzerland). 2026 Feb 11;26(4):1170. doi: 10.3390/s26041170

GNN-DRL-Based Intelligent Routing and Resource Allocation Algorithms for Multi-Layer Wireless Mesh Network

Lei Xu 1, Shu Han 1, Wei Fu 1, Ziran Zhu 1, Jing Wu 1, Xiaorong Zhu 1,*
Editor: Luca Leonardi
PMCID: PMC12944231  PMID: 41755113

Abstract

This paper proposes GraphSAGE-MAPPO (Graph Sample and Aggregate-Multi-Agent Proximal Policy Optimization), an intelligent routing and resource allocation algorithm for dynamic three-dimensional wireless mesh networks such as those deployed in emergency communications, where the network topology changes dynamically and the introduction of Artificial Intelligence (AI) model training services makes user services more diverse and node resource capabilities more dynamic. The algorithm combines Graph Neural Networks (GNN) with Deep Reinforcement Learning (DRL). During training, a GNN feature extraction module first extracts the resource capabilities and link status indicators of the nodes, generating a hidden feature vector representation for each backbone mesh node; these feature vectors, combined with the arriving service flows, then serve as the state input of a distributed multi-agent DRL model, supporting efficient routing and resource allocation decisions for service flows with different user Quality of Service (QoS) requirements. Simulation results show that, in the face of dynamically changing network environments and user needs, GraphSAGE-MAPPO can flexibly adjust routing strategies to better meet the QoS requirements of various services and generalizes well to changes in network topology and resources. These results demonstrate that the algorithm has good flexibility and scalability in large-scale, real-world wireless mesh network environments.

Keywords: wireless mesh network, graph neural network, routing algorithm, multi-agent deep reinforcement learning, resource optimization

1. Introduction

Wireless Mesh networks, with their distributed architecture and flexible resource scheduling capabilities, demonstrate advantages in terms of comprehensive coverage and differentiated services in scenarios such as emergency response, smart cities, and industrial Internet of Things (IoT). In recent years, as the environment of wireless Mesh networks has become increasingly complex and the diversity of service traffic has expanded, routing optimization and efficient resource allocation have emerged as significant research hotspots. A key research question is how to achieve unified, adaptive, and user satisfaction-oriented multidimensional resource scheduling and routing optimization for multiple types of services in a dynamic Mesh network environment.

Compared to traditional routing algorithms based on fixed rules, DRL can dynamically optimize path selection and resource allocation strategies for diverse traffic flows by online learning of the spatiotemporal characteristics of network traffic and the patterns of service demands. Ref. [1] presents a comprehensive performance comparison of routing forwarding rules generated by supervised learning and reinforcement learning, indicating that supervised learning, which lacks interaction with the environment, struggles to model effectively, whereas reinforcement learning achieves optimization objectives for improved network performance through interaction with the environment to learn routing forwarding strategies. Addressing the energy efficiency optimization problem in wireless sensor networks, Ref. [2] proposes a data aggregation-aware routing algorithm based on Q-learning mechanisms, employing a dynamically designed reward function to achieve balanced energy consumption distribution; experimental results demonstrate that this approach can extend the network’s lifespan by 23%. To tackle the issue of insufficient representational capability of reinforcement learning in complex network scenarios, which can lead to slow or ineffective learning processes, Ref. [3] introduces an intelligent routing algorithm based on Deep Q-Networks, utilizing neural networks to learn optimal routing strategies. The findings indicate that the DRL-based intelligent routing algorithm can capture network states more accurately and make superior routing decisions compared to traditional methods and reinforcement learning-based strategies [4]. Furthermore, to address the challenge of high retraining costs in highly dynamic environments such as vehicular networks, Ref. [5] proposes a computing and communication cost-aware service migration scheme enabled by Transfer Reinforcement Learning. 
By leveraging knowledge transfer, this method significantly reduces the training time required to obtain optimal policies compared to standard schemes, ensuring better satisfaction of delay requirements.

However, due to the distributed characteristics of wireless Mesh networks, the action space and state space of the DRL model significantly increase as the network scale expands and concurrent traffic surges, leading to a slow convergence of the training process or even difficulties in convergence. Consequently, centralized DRL models are challenging to apply in large-scale networks. Ref. [6] proposes a multi-agent Actor-Critic architecture reinforcement learning algorithm in the context of wireless Mesh networks to balance end-to-end delay and channel efficiency, demonstrating its effectiveness. Ref. [7] introduces an online routing algorithm based on multi-agent DRL, which generates routes in a hop-by-hop manner and employs deep neural networks to learn traffic flow scheduling strategies under different QoS requirements, thereby meeting the QoS demands of various network services.

As research progresses, scholars are increasingly recognizing that wireless networks essentially consist of a set of nodes and their interconnections, which collectively form a complex graph structure rich in feature information. Recognizing this structural characteristic, early works such as Ref. [8] have successfully utilized graph theory-based clustering schemes for resource allocation and interference coordination in heterogeneous cellular networks. This approach demonstrated that explicitly modeling the neighborhood relationships between cells and users can significantly improve spectral efficiency. Existing intelligent algorithms based on DRL struggle to accurately model and process this graph structure. This difficulty arises because current methods [5,6] primarily employ standard neural networks, such as fully connected neural networks and convolutional neural networks, to construct agent models. These networks cannot accurately extract and fully utilize high-dimensional data, such as the features of network nodes and links. Consequently, it becomes challenging to generalize the trained models to different network topologies [9], resulting in suboptimal performance of DRL agents when evaluated in previously unseen topological environments.

Fortunately, as a neural network model specifically designed for graph structures, Graph Neural Networks (GNNs) can derive feature vector representations for each node through multiple rounds of information interaction and aggregation with neighboring nodes, demonstrating a strong generalization capability to changes in network structure. To further enhance the representation capability of GNNs in radio resource management, Ref. [10] introduces a general edge-update empowered GNN architecture (ENGNN). By extending the standard node-update mechanism to explicitly handle variables defined on edges, this method achieves superior scalability and generalization across varying network sizes and interference levels. Similarly, Ref. [11] proposes an offline reinforcement learning algorithm based on GNN for radio resource allocation in Ultra-Reliable Low Latency Communication (URLLC) services within 5G networks, exhibiting stable convergence and meeting the QoS requirements for URLLC. Ref. [12] integrates Deep Q-Networks (DQN) with GNNs to address the association problem between users and base stations, resulting in a 10% improvement in utility compared to traditional DQN methods. Ref. [13] introduces a technique that combines GNN and DRL to tackle resource allocation issues in cellular vehicular networks, ensuring a high success rate for Vehicle-to-Vehicle (V2V) communications while minimizing interference with vehicle and infrastructure links; the results indicate that the proposed algorithm effectively enhances the decision-making quality of agents with a minimal increase in computational load. Ref. [14] reformulates the cooperative navigation problem of multi-drone systems’ underground base station communication coverage as a Markov game, proposing a GNN-based Multi-Agent Proximal Policy Optimization (GMAPPO) algorithm that effectively integrates extracted latent features related to drones with their state space, thereby improving navigation safety. Ref. 
[15] addresses the challenges posed by a large number of satellites, rapid changes in network topology, and diverse traffic demands in low Earth orbit satellite inter-routing, designing a GNN and DQN integrated solution framework aimed at minimizing the end-to-end average latency across the network; GNNs are employed to learn the complex relationships between satellite nodes, while DQNs are utilized to formulate routing decisions that adapt to the characteristics of the satellite network. Ref. [16] presents an online routing optimization algorithm, G-Routing, which utilizes GNN in conjunction with DRL; by modeling and understanding the relationships between various network features, it predicts trends in service latency and throughput, allowing agents to learn optimal paths that adapt to environmental changes, with simulation results indicating faster convergence and improved performance across various metrics. Ref. [17] proposes a Dense GNN-DRL model to address the issues of low bandwidth utilization and insufficient model generalization capabilities in smart grids, further enhancing the feature aggregation capacity of graphs through the combination of GNN and recurrent neural networks.

The literature above conducts an in-depth exploration of network resource allocation or routing optimization for GNN-assisted agent decision-making across various scenarios; however, most of the studies focus primarily on a single type of service or performance metric.

Therefore, in response to the increasing and dynamically changing user demands and network topologies in wireless Mesh networks, this paper employs GNN in conjunction with DRL techniques for resource allocation and routing optimization in wireless Mesh networks. First, the inductive learning capability of GraphSAGE is utilized to extract the hidden feature vectors of nodes and edges within the wireless Mesh network. Subsequently, these hidden feature vectors, along with the locally observed traffic arrival conditions at the agent nodes, are used as inputs for the multi-agent Proximal Policy Optimization (PPO) algorithm. A centralized training and distributed execution architecture is adopted, with each backbone Mesh node functioning as an independent agent executing the algorithm, thereby enabling efficient decision-making in traffic routing and resource scheduling.

The main contributions of this paper are as follows:

  • A node feature aggregation algorithm based on GraphSAGE is proposed, which utilizes the inductive GraphSAGE GNN model to extract node resource capabilities and link state indicators, generating hidden feature vectors for each node. This approach possesses inductive learning capabilities, enhancing the model’s generalization ability and enabling it to adapt to the frequent changes of nodes in large-scale dynamic Mesh networks. The algorithm is designed with a mechanism of global centralized training combined with feature vector distribution. During the training phase, it uniformly aggregates features and constructs globally consistent node representations, effectively improving the accuracy and consistency of network state representations. Additionally, it incorporates service flow conditions as input for a distributed multi-agent DRL model, thereby facilitating efficient and adaptive routing and resource allocation decisions.

  • The topology-adaptive GraphSAGE-MAPPO algorithm is proposed, integrating the inductive bias of GNN with DRL to fundamentally address the “input dimension mismatch” challenge in dynamic emergency communication Mesh networks. The algorithm leverages the inductive GraphSAGE model to extract node feature vectors invariant to network scale, and characterizes the state, action, and reward functions of the DRL through a multi-agent Markov process. The action space encompasses discrete routing node selection and bandwidth resource allocation, as well as continuous allocation of computational and storage resources. The Multi-Agent Proximal Policy Optimization (MAPPO) algorithm is employed as the DRL implementation, retaining the “clipping” proximal update strategy of Proximal Policy Optimization (PPO) to constrain policy updates, thereby avoiding policy mutations and ensuring training stability, which is particularly suitable for the rapidly changing network conditions characteristic of Mesh network scenarios. Furthermore, MAPPO effectively accommodates a mixed action space by integrating optimization objectives for both discrete and continuous actions, thereby supporting diverse user QoS requirements.

  • Through simulation validation, the proposed algorithm demonstrates the ability to flexibly adjust routing strategies, outperforming other algorithms in terms of average delay, packet loss rate, network throughput, and the accuracy of AI service training. Particularly in dynamic scenarios such as network node failures, the trained model exhibits exceptional transferability, effectively adapting to the dynamic changes in network topology and resource capacities, while maintaining robust generalization performance in complex network environments and diverse user demands.
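The "clipping" proximal update retained from PPO, mentioned in the second contribution, can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation; the function name and the scalar single-sample form are illustrative assumptions.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective for one sample (illustrative sketch):
    L = min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r is the probability ratio pi_new(a|s) / pi_old(a|s) and A is the
    advantage estimate. Clipping removes the incentive to push r far outside
    [1 - eps, 1 + eps], which is what constrains policy updates and avoids
    the policy mutations discussed above."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

With eps = 0.2, a ratio of 1.5 under a positive advantage contributes only as much as a ratio of 1.2 would, so gradient ascent gains nothing from moving the policy further in that direction.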

2. System Model

2.1. Network Model

This paper considers an emergency communication scenario in a three-dimensional wireless Mesh network spanning aerial and ground nodes, as illustrated in Figure 1. The network comprises various types of entity nodes, including mobile terminal nodes, unmanned aerial vehicle (UAV) aerial base station nodes, sensor nodes, and ground vehicle-mounted base station nodes, with the number of nodes and the topology varying dynamically within a certain range. To explicitly model these dynamics, we assume that the mobile nodes follow the Random Waypoint mobility model, moving towards randomly selected destinations with velocities uniformly distributed within a specific range, resulting in a time-varying network topology. Terminal nodes and various sensor nodes generate traditional communication services such as voice, video, and data collection and uploading, as well as novel AI services like small-model training, requiring the network to provide real-time dynamic services that meet QoS requirements. Considering that energy efficiency is typically a constraint in such scenarios, we clarify the scope of this study: to focus on routing logic under high mobility, we assume the nodes possess sufficient battery capacity for the duration of the mission.

Figure 1. Architecture Diagram of Stereoscopic Wireless Mesh Network Based on GNN-DRL.

A directed graph G = (N, L) represents the network scenario, where N denotes the set of nodes and L represents the set of links. Nodes such as drone base stations and vehicular base stations provide forwarding and computational services for user traffic flows. All nodes share a total bandwidth of B from the spectrum resources, which is divided into K_B unit subchannels. Base station nodes can allocate one or more subchannel bandwidth resources to traffic flows based on the service demands of the traffic and their own service capabilities. For clarity in subsequent representations, the i-th node in the network is denoted as n_i, with user nodes labeled as n_UE and base station nodes as n_BS. The number of user nodes is N_UE, and the number of base station nodes is N_BS. The network link set L consists of the set of inter-base-station links L_s and the set of user access links L_u, expressed as L = L_s ∪ L_u. The link set changes dynamically with the addition or removal of nodes and variations in the network topology.

In the process of wireless transmission, the link channel gain between the nodes nu and nv within the communication range is denoted as guv. At any given moment in the network, multiple data flows requiring service may coexist. Due to the shared utilization of the same spectral resources among various Mesh nodes, there is a potential for transmission interference among service flows allocated to the same sub-channel bandwidth. Furthermore, a single node or link may simultaneously serve multiple services at a given time, necessitating consideration of resource allocation efficiency.

Let the transmission power of the node n_u be p_u; then the signal-to-interference-plus-noise ratio (SINR) for data received by the node n_v can be expressed as:

SINR_uv = g_uv p_u / ( Σ_{k ∈ N\{u}} g_kv p_k + σ_u² ) (1)

The term σ_u² represents the power of the additive white Gaussian noise (AWGN) at the receiver. Consequently, the link capacity between the two nodes is given by:

C_uv = B_luv log₂(1 + SINR_uv) (2)

Here, B_luv denotes the bandwidth allocated to the link l_uv. The achievable throughput of a multi-hop path is constrained by the minimum-capacity link, which is referred to as the bottleneck link. Consequently, the bottleneck transmission rate for the traffic flow f is expressed as:

C_min^f = min_{l ∈ path_f} C_l (3)
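As a minimal illustration of Equations (1)-(3), the following sketch computes the SINR, the per-link Shannon capacity, and the bottleneck rate of a path. The function names, dictionary layout, and scalar gains/powers are illustrative assumptions, not the paper's code.

```python
import math

def sinr(g, p, tx, rx, noise):
    """SINR at receiver rx for transmitter tx (Eq. 1).

    g[u][v] is the channel gain from node u to node v, p[u] is the transmit
    power of node u, and noise is the AWGN power at the receiver. All other
    transmitters in p contribute interference on the shared subchannel."""
    interference = sum(g[k][rx] * p[k] for k in p if k != tx)
    return g[tx][rx] * p[tx] / (interference + noise)

def link_capacity(bandwidth, sinr_value):
    """Shannon capacity of a link (Eq. 2), in bit/s for bandwidth in Hz."""
    return bandwidth * math.log2(1.0 + sinr_value)

def bottleneck_rate(path_capacities):
    """Bottleneck transmission rate of a multi-hop path (Eq. 3)."""
    return min(path_capacities)
```

For example, with one interferer sharing the subchannel, the interference term in the denominator directly reduces the achievable capacity of the link.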

In this context, pathf denotes the end-to-end service path of the service flow. The transmission delay from the user to the base station refers to the time required for data packets to be transmitted between the user and the access base station. The transmission delay of the service flow f from user nUE to base station nBS is expressed as:

T_tran^{n_UE,n_BS} = data_{n_UE,n_BS}^f / C_{n_UE,n_BS} + d(n_UE,n_BS) / c_0 (4)

Among them, data_{n_UE,n_BS}^f represents the data volume of the service flow f, d(n_UE,n_BS) denotes the distance between user n_UE and access base station n_BS, C_{n_UE,n_BS} indicates the link transmission rate, and c_0 denotes the signal propagation speed, so that the second term is the propagation delay over the link.

The computational processing delay refers to the time spent by the base station processing the service data packets after receiving them. Let the computational processing capacity allocated by the base station n_BS to user n_UE for service f be denoted as c_nBS^nUE. Consequently, the base station processing delay for the user n_UE accessing the base station n_BS is given by data_{n_UE,n_BS}^f / c_nBS^nUE.

In the network backhaul, the transmission delay between the base stations nBSu and the next-hop base station nBSv can be expressed as:

T_tran^{n_BSu,n_BSv} = data_{n_BSu,n_BSv} / C_{n_BSu,n_BSv} + d(n_BSu,n_BSv) / c_0 (5)

The end-to-end delay of the service flow is the sum of the base station’s computation processing delay and the path transmission delay.
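The end-to-end delay accounting above can be sketched as follows. This follows the reconstruction of Equations (4) and (5) used here, in which per-hop delay combines a serialization term data/C with a propagation term d/c_0; the constant C0 and all function names are illustrative assumptions.

```python
# Propagation speed (m/s); an assumption of this sketch, not a value from the paper.
C0 = 3.0e8

def link_delay(data_bits, capacity_bps, distance_m):
    """Per-hop delay: serialization data/C plus propagation d/c0 (Eqs. 4-5)."""
    return data_bits / capacity_bps + distance_m / C0

def processing_delay(data_bits, cpu_alloc):
    """Computation delay at the processing base station: data over the
    computational capacity allocated to the flow."""
    return data_bits / cpu_alloc

def end_to_end_delay(hops, data_bits, cpu_alloc):
    """End-to-end delay of a service flow: processing delay plus the sum of
    per-hop transmission delays along path_f.

    hops: list of (capacity_bps, distance_m) tuples, one per link."""
    return processing_delay(data_bits, cpu_alloc) + sum(
        link_delay(data_bits, cap, dist) for cap, dist in hops)
```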

At the same time, considering the packet loss rate of data transmission, the packet loss rate of the link between two nodes is defined as the ratio of the number of lost packets, denoted as N_uv^loss, to the total number of packets sent, denoted as N_uv^send, over a statistical time period. This can be expressed as:

P_uv^loss = (N_uv^loss / N_uv^send) × 100% (6)

The end-to-end packet loss rate of the service f is represented by the cumulative packet loss condition across each segment of the service path pathf:

P_pathf^loss = 1 − ∏_{l ∈ path_f} (1 − P_l^loss) (7)
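Equations (6) and (7) can be checked with a small sketch. Function names are illustrative, and loss rates are kept as fractions in [0, 1] rather than percentages.

```python
def link_loss_rate(n_loss, n_send):
    """Per-link packet loss ratio over a statistics window (Eq. 6),
    expressed as a fraction rather than a percentage."""
    return n_loss / n_send

def path_loss_rate(per_link_losses):
    """End-to-end loss over path_f (Eq. 7): one minus the probability that a
    packet survives every link, i.e. 1 - prod(1 - P_l)."""
    delivered = 1.0
    for p in per_link_losses:
        delivered *= (1.0 - p)
    return 1.0 - delivered
```

Note that the end-to-end loss is always at least the worst single-link loss, which is why the cumulative form matters for multi-hop paths.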

2.2. Service Model

When a user node n_i initiates a service request, if the end-to-end destination node for its service is n_j, a service flow f_ij will be established upon successful network access, defined as f_ij = {type_fij, data_fij, C_fij^demand, req_fij}. Here, type_fij denotes the type of service, with a value of 1 indicating traditional communication services such as voice, video, and sensor data uploads, where the QoS metrics include transmission rate, latency, and packet loss rate; a value of 0 signifies novel AI services, which in this paper refer to Deep Neural Network (DNN) model inference and training tasks, with QoS metrics encompassing not only the indicators mentioned above but also training accuracy. data_fij represents the volume of service data, and C_fij^demand indicates the computational resources required to complete the task. The set of QoS metric requirements for the service flow is represented as req_fij = {delay_fij^max, rate_fij^min, lossrate_fij^max, accuracy_fij^min}, where if a particular metric is not of concern, its corresponding position is assigned a value of 0. The collection of all service flows within the network is defined as F, where f_ij ∈ F. To characterize the dynamic nature of network traffic, we assume that the generation of these service requests follows a Poisson process to simulate random arrival intervals, and the source-destination pairs (n_i, n_j) are randomly selected from the network to ensure diverse traffic patterns.
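For simulation purposes, the service-flow tuple f_ij and the Poisson arrival assumption might be represented along the following lines. The field names and class layout are illustrative, not the paper's code.

```python
from dataclasses import dataclass, field
import random

@dataclass
class QoSReq:
    """QoS requirement vector req_fij; a value of 0 marks a metric
    that is not of concern for this flow."""
    delay_max: float = 0.0
    rate_min: float = 0.0
    lossrate_max: float = 0.0
    accuracy_min: float = 0.0

@dataclass
class ServiceFlow:
    """Service flow f_ij = {type, data, C_demand, req}."""
    src: int
    dst: int
    type: int            # 1 = traditional communication, 0 = AI service
    data: float          # volume of service data
    c_demand: float      # computational resources required
    req: QoSReq = field(default_factory=QoSReq)

def next_arrival(rate_lambda, rng=random):
    """Inter-arrival time of the Poisson request process: exponential
    with mean 1/lambda."""
    return rng.expovariate(rate_lambda)
```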

For traditional communication services, the computational demands are relatively low. During the routing decision-making process, nodes that can meet the computational requirements are selected for processing, while the remaining nodes in the route serve merely as transmission relays. However, it is still essential to ensure the QoS metrics during the transmission process.

For AI services, a complete task process encompasses the inference task data and the Deep Neural Network (DNN) model required for training [18]. The specific training process is not the focus of this paper; rather, the emphasis lies on how to allocate network resources for these training tasks and select appropriate routing. Therefore, this paper assumes that the required DNN model has already been pre-trained. When a base station node first receives a task training request, the model will be deployed to that node from the cloud server, and no further redeployment will be necessary. It is assumed that the tasks are non-divisible; during the routing of the service flow, the network attempts to dispatch the task, and if the task still cannot be processed upon reaching the destination node, it is considered a service failure.

The AI service involves metrics related to model accuracy, with the accuracy of trained Deep Neural Network (DNN) models and data quality being two crucial factors influencing the training accuracy of AI tasks [9,10]. During the training process, due to resource constraints, different algorithms with varying compression rates may be selected. Compressed training models exhibit smaller sizes and consume fewer resources; however, this compression also results in a decrease in model accuracy. On the other hand, data quality is affected by factors such as channel conditions and transmission distance during the transmission process. Poor quality of data likewise leads to a reduction in training accuracy. This paper does not focus on the compression of DNN models; rather, it primarily considers how to reasonably select the training model’s precision based on network resources, link conditions, and service requirements, as well as how to enhance data transmission quality to improve overall training accuracy.

Define the data transmission quality of each link as a binary variable ϑuv where

ϑ_uv = 1, if P_uv^loss < P_uv^loss_th; 0.95, if P_uv^loss ≥ P_uv^loss_th (8)

Inspired by the literature [18], we define the parameter representation of the pre-trained DNN model as M_z = {P_z, Q_z, S_z}, z ∈ {0, 1}, where P_z, Q_z, and S_z represent the model's inference accuracy, computational power requirements, and memory footprint, respectively. This paper provides two levels of DNN models for AI applications, where z = 0 denotes a lightweight model and z = 1 a full-scale model. The specific numerical relationships between model magnitude, model accuracy, computational power requirements, and memory size are presented in Table 1.

Table 1.

Performance Parameters of Pre-trained DNN Models.

Model Scale | Model Accuracy | Computational Demand/MHz | Model Size/MB
z = 0 (lightweight model) | [0.929, 0.954] | [20, 30] | [3.5, 5.5]
z = 1 (full-scale model) | [0.973, 0.992] | [80, 120] | [9, 12]

The training accuracy ultimately achieved by the AI service equals the transmission quality accumulated over the links traversed during transmission multiplied by the training accuracy of the pre-trained DNN model, which can be expressed as accuracy = (∏_{l_uv ∈ path_f} ϑ_uv) · P_z, where accuracy represents the final training accuracy, ϑ_uv denotes the transmission quality defined in Equation (8), and P_z is the training accuracy of the pre-trained DNN model as listed in Table 1.
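Combining Equation (8) with the model accuracies of Table 1, the final training accuracy of an AI flow could be computed as follows. This sketch assumes the per-path product interpretation of the text; the names and the sample threshold are illustrative.

```python
def link_quality(loss_rate, loss_threshold):
    """Transmission-quality factor theta_uv (Eq. 8): 1 when the link's
    packet loss stays below its threshold, 0.95 otherwise."""
    return 1.0 if loss_rate < loss_threshold else 0.95

def training_accuracy(per_link_losses, loss_threshold, model_accuracy):
    """Final AI training accuracy: the product of the quality factors of
    every traversed link times the pre-trained model accuracy P_z."""
    acc = model_accuracy
    for p in per_link_losses:
        acc *= link_quality(p, loss_threshold)
    return acc
```

With a full-scale model (P_z ≈ 0.98) and one lossy hop, the achieved accuracy drops by the 0.95 factor, which is exactly the trade-off the routing agent must weigh against choosing a longer but cleaner path.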

3. Problem Modeling

In wireless mesh networks, when performing resource allocation and routing optimization for traditional communication services and AI services, it is essential to simultaneously consider conventional QoS metrics, including transmission rate, latency, and packet loss rate, alongside the training accuracy metrics specific to AI services. Given the varying QoS requirements of different services, employing a singular metric as the standard for measuring service quality fails to ensure fairness and efficient resource utilization.

Therefore, this paper establishes a unified utility function for different types of services, normalizing the latency, rate, packet loss rate, and training accuracy of the traffic flow, and representing the four demand indicator vectors of the service flow as {delay,rate,lossrate,accuracy}. Different service flows exhibit differentiated requirements for various indicators, and the normalization formula for these indicators is defined as follows:

delay_norm = (delay_max − delay) / delay_max, if delay < delay_max; 0, otherwise (9)
rate_norm = 0, if rate < rate_min; (rate − rate_min) / (rate_max − rate_min), if rate_min ≤ rate ≤ rate_max; 1, otherwise (10)
lossrate_norm = (lossrate_max − lossrate) / lossrate_max, if lossrate < lossrate_max; 0, otherwise (11)
accuracy_norm = (accuracy − accuracy_min) / accuracy_min, if accuracy > accuracy_min; 0, otherwise (12)

In this context, delay_norm, rate_norm, lossrate_norm, and accuracy_norm represent the normalized values of delay, transmission rate, packet loss rate, and training accuracy, respectively. It is important to note that different services have varying threshold requirements for the metrics mentioned above.

Let Iweight(τ)={Idelay(τ),Irate(τ),Ilossrate(τ),Iaccuracy(τ)} be the indicators for evaluating the performance weights of the traffic type τ, with the sum equal to 1. The rationale behind this weighted design is to address the multi-objective optimization nature of heterogeneous services. The weighting parameters Iweight(τ) are not arbitrary but are selected based on the specific QoS priorities of the traffic type τ. For instance, for delay-sensitive services, a higher value is assigned to Idelay, whereas for AI inference tasks, Iaccuracy is prioritized. This parameter selection ensures that the reward signal accurately reflects the distinct Service Level Agreements (SLAs) required by different applications.

Utility(τ) = I_delay(τ)·delay_norm + I_rate(τ)·rate_norm + I_lossrate(τ)·lossrate_norm + I_accuracy(τ)·accuracy_norm (13)

The average service utility function for all user service types in the network is represented as (Σ_τ Utility(τ)) / |τ|.
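Equations (9)-(13) translate directly into a small utility sketch. Function names are illustrative, and the weights are assumed to sum to 1 as stated above.

```python
def delay_norm(delay, delay_max):
    """Eq. (9): 1 at zero delay, 0 at or beyond the delay bound."""
    return (delay_max - delay) / delay_max if delay < delay_max else 0.0

def rate_norm(rate, rate_min, rate_max):
    """Eq. (10): 0 below the minimum rate, linear in between, capped at 1."""
    if rate < rate_min:
        return 0.0
    if rate <= rate_max:
        return (rate - rate_min) / (rate_max - rate_min)
    return 1.0

def lossrate_norm(lossrate, lossrate_max):
    """Eq. (11): 1 at zero loss, 0 at or beyond the loss bound."""
    return (lossrate_max - lossrate) / lossrate_max if lossrate < lossrate_max else 0.0

def accuracy_norm(accuracy, accuracy_min):
    """Eq. (12): 0 unless the accuracy requirement is exceeded."""
    return (accuracy - accuracy_min) / accuracy_min if accuracy > accuracy_min else 0.0

def utility(weights, norms):
    """Eq. (13): weighted sum of the four normalized indicators, where the
    per-type weights I_weight(tau) sum to 1."""
    return sum(w * n for w, n in zip(weights, norms))
```

A delay-sensitive flow would use something like weights (0.6, 0.2, 0.2, 0.0), while an AI training flow would shift most of the weight onto the accuracy term.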

Define the binary variable χ_fij^n to indicate whether the task is processed at node n; it takes the value 1 if true, and 0 otherwise. Define b_fij^luv as the bandwidth allocated to the service flow f_ij on link l_uv, measured in MHz. Define c_fij^n as the computational resource capacity allocated to f_ij at node n, measured in GHz. Define s_fij^n as the amount of storage resources allocated to f_ij at node n, measured in MB. Define φ_fij^z to indicate the DNN model selected by the AI service flow f_ij, where z = 0 denotes the lightweight (compressed) model and z = 1 the full-scale model. Additionally, for clarity, define the binary variable ψ_fij^luv to represent whether the service flow f_ij traverses link l_uv in the physical topology; it takes the value 1 if true, and 0 otherwise. Note that ψ_fij^luv is determined by χ_fij^n during the decision-making process and does not require separate handling.

The objective of this paper is to maximize the average service utility of users across the entire network, namely,

max_{χ_fij^n, b_fij^luv, c_fij^n, s_fij^n, φ_fij^z} (Σ_τ Utility(τ)) / |τ| (14)

The constraints that need to be satisfied are:

Σ_{v: l_uv ∈ L} ψ_fij^luv − Σ_{k: l_ku ∈ L} ψ_fij^lku = 1, if n_u = str_fij; −1, if n_u = des_fij; 0, otherwise; ∀ f_ij ∈ F (15)
Σ_{fij ∈ F} ψ_fij^luv · b_fij^luv ≤ B_luv, ∀ l_uv ∈ L (16)
Σ_{fij ∈ F} χ_fij^n · c_fij^n ≤ C_n, ∀ n ∈ N (17)
Σ_{fij ∈ F} χ_fij^n · s_fij^n ≤ S_n, ∀ n ∈ N (18)
delay_fij ≤ delay_fij^max, ∀ f_ij ∈ F (19)
rate_fij ≥ rate_fij^min, ∀ f_ij ∈ F (20)
lossrate_fij ≤ lossrate_fij^max, ∀ f_ij ∈ F (21)
accuracy_fij ≥ accuracy_fij^min, ∀ f_ij ∈ F (22)

Among them, constraint (15) indicates that for each service flow f_ij, the routing from the source node str_fij to the destination node des_fij must satisfy the flow conservation principle; constraint (16) imposes a limitation on the link bandwidth capacity: the sum of the bandwidth consumed by the service flows traversing a particular link must not exceed that link's bandwidth capacity; constraint (17) pertains to the computational resource limitations at nodes, specifying that the total amount of computational resources utilized by the tasks undertaken by a node must not exceed the total computational resources available to that node; constraint (18) addresses the storage resource limitations at nodes, indicating that the current storage of DNN models and processing data at a node must not surpass the total storage resources of that node; constraints (19)-(22) set forth the requirements on latency, transmission rate, packet loss rate, and (for AI services) training accuracy imposed by the demands of the service flows.
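A brute-force feasibility check of the capacity constraints (16)-(18) might look as follows. The data structures are illustrative sketches, not the paper's solver.

```python
def feasible(allocations, link_bw, node_cpu, node_mem):
    """Check the capacity constraints (16)-(18) for a candidate allocation.

    allocations: one dict per flow with 'links' {link_id: bandwidth},
    the processing 'node', and the 'cpu'/'mem' amounts allocated there.
    link_bw / node_cpu / node_mem give the capacities B_luv, C_n, S_n."""
    bw, cpu, mem = {}, {}, {}
    for a in allocations:
        for link, b in a["links"].items():       # accumulate per-link bandwidth
            bw[link] = bw.get(link, 0.0) + b
        n = a["node"]                            # accumulate per-node cpu/memory
        cpu[n] = cpu.get(n, 0.0) + a["cpu"]
        mem[n] = mem.get(n, 0.0) + a["mem"]
    return (all(v <= link_bw[l] for l, v in bw.items())
            and all(v <= node_cpu[n] for n, v in cpu.items())
            and all(v <= node_mem[n] for n, v in mem.items()))
```

Such a check is what the environment uses implicitly when it rejects or penalizes an agent action that oversubscribes a link or a node.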

4. Design of Routing and Resource Allocation Algorithms for Mesh Networks Based on GNN-DRL

As user demands, network topologies, and resource statuses continually evolve, the application of traditional heuristic algorithms or mathematical optimization methods may result in failure to promptly release resources from users who have completed their service. Upon the completion of a single execution of the algorithm, only a static strategy tailored to the current set of users and network conditions can be obtained. However, when nodes are added or removed from the network, changes occur in user request types, and link quality and network available resources fluctuate over time, the original static strategy is unable to adapt swiftly to the new network environment. This leads to a reduction in resource utilization efficiency and poses challenges in meeting the continuously dynamic task demands of users in future emergency scenarios.

To this end, it is essential to seek an intelligent strategy capable of adapting to the dynamic changes in network topology and service requirements. GNNs, as neural network models specifically designed for graph structures with strong generalization capabilities, can directly extract information such as node adjacency relationships, link performance metrics, and available resources from Mesh networks, thereby forming high-dimensional hidden feature vectors for the nodes. Furthermore, by leveraging DRL techniques, multi-agent systems can make real-time decisions in a distributed environment based on the features extracted by GNN. The actions taken can autonomously adjust according to the learned strategies in response to updates in network topology, resource conditions, and variations in service demands. Consequently, this paper will design an intelligent algorithm based on GNN-DRL to address the proposed optimization of routing and resource allocation in Mesh networks.

This paper proposes a distributed Graph Neural Network based Multi-Agent Deep Reinforcement Learning (GNN-MADRL) routing and resource allocation algorithm in which each base station functions as an agent. The decision-making process encompasses network feature representation based on GraphSAGE and collaborative routing decisions based on MAPPO. The following sections describe the algorithm design in detail.

4.1. Node Feature Aggregation Algorithm Based on GraphSAGE

GraphSAGE is an inductive GNN model that aggregates information from the current node and its adjacent nodes to form feature vectors, enabling the representation of any node through an aggregation function. In large-scale dynamic mesh networks, GraphSAGE allows each node to learn hidden feature vectors by relying solely on a subgraph of its immediate neighborhood, thereby exhibiting generalization capabilities. However, distributed subgraph training may lead to inconsistencies in the hidden vector information among nodes. Consequently, this paper adopts a global centralized training approach combined with the distribution of hidden vectors for node feature aggregation.

The model input includes: (1) the network topology information G = (N, L); (2) the initial feature vector of each node x_ni = {pos_ni, B_ni, C_ni, S_ni}, representing the position, currently allocable bandwidth, current computational resources, and current storage capacity of the node; (3) the initial feature vector of each link x_luv = {throughput_luv, lossrate_luv, delay_luv}, where throughput_luv denotes the historical average throughput of the link, lossrate_luv its historical average packet loss rate, and delay_luv its historical average transmission delay, all sampled and collected from training sample data; (4) the network depth K, referring to the number of layers through which feature aggregation propagates; (5) the weight matrix W, which is continuously updated during training by computing the loss gradient through backpropagation.

The model output consists of the hidden feature vector h_i^(K) of each node.

As illustrated in Figure 2, assuming the target node is ni, the process of aggregating and updating the features of the neighboring nodes using the GraphSAGE model is as follows:

Figure 2.

Figure 2

Schematic Diagram of Node Feature Vector Generation by GraphSAGE.

  • 1.

    Neighbor node sampling

To balance network scale and computational efficiency, a partial sampling strategy is employed for aggregating neighbor information. Specifically, for each node ni, a fixed number s of neighbors is randomly sampled from its adjacent node set Adji for feature aggregation. If the number of neighbors is less than s, sampling is conducted with replacement to ensure that the total number of sampled adjacent nodes reaches s; if the number of neighbors exceeds s, random sampling without replacement is utilized, selecting only s nodes from all neighbors to control computational load. The number of samples per layer may vary; Figure 2 demonstrates a sampling process with a network depth of 2, where the number of sample nodes in the first layer is 3, and the number of samples in the second layer is 7.

In this study, due to the potential for each node to connect to multiple links, we aim to fully leverage the link feature information while avoiding an excessively large input space for the GNN. For each node ni, its initial feature vector xni={posni,Bni,Cni,Sni} is extended in length and redefined as xni={posni,Bni,Cni,Sni,0,0,0}. When a node nj is selected during sampling, the feature vector of link luv, represented as xluv={throughputluv¯,lossrateluv¯,delayluv¯}, is appended to the last three zeros of the feature vector of nj. This approach enables effective collection of both node and link features, allowing subsequent DRL processes to make targeted decisions based on the QoS requirements of the service flow.
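The feature-extension step above can be sketched as follows. The function name and the use of NumPy are illustrative assumptions, not taken from the paper's implementation:

```python
import numpy as np

def extend_with_link_features(node_feat, link_feat=None):
    """Pad a node feature vector with three slots for link features.

    node_feat: a [pos, B, C, S]-style vector (illustrative layout).
    link_feat: optional [throughput, lossrate, delay] averages for the
    link to the sampled neighbor; the slots stay zero for the target node.
    """
    padded = np.concatenate([np.asarray(node_feat, dtype=float), np.zeros(3)])
    if link_feat is not None:
        padded[-3:] = link_feat   # overwrite the three trailing zeros
    return padded

node = np.array([0.3, 0.7, 0.5, 0.9])   # pos, B, C, S (normalized)
link = np.array([0.8, 0.02, 0.1])       # throughput, loss rate, delay
x = extend_with_link_features(node, link)
```

This keeps the GNN input dimension fixed while letting sampled neighbors carry the features of the link that connects them.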

  • 2.

    Feature aggregation of adjacent nodes

For each layer of the network, the feature vector of each node is generated by the previous layer of the network, and each layer employs a method of sampling adjacent nodes to aggregate the features of neighboring nodes. The process of generating the hidden feature vector hi1 in the first layer of the network is represented as follows:

h_i^(1) = σ(W^(1) · CONCAT(h_i^(0), AGGR({h_j^(0) : n_j ∈ Adj_i}))) (23)

In this context, W^(1) represents the parameter matrix that is progressively learned during the training process, while σ(·) denotes the nonlinear activation function. Aggregating the initial features h_j^(0), n_j ∈ Adj_i, of the s sampled neighbors of node n_i yields the neighborhood aggregate h_Adji^(1). This aggregate is then concatenated with the initial features h_i^(0) of node n_i, multiplied by the parameter matrix, and passed through the nonlinear transformation to yield the first-layer hidden representation h_i^(1) of the target node n_i.

The mapping performed by the intermediate hidden layers takes the same form:

h_i^(k) = σ(W^(k) · CONCAT(h_i^(k−1), AGGR({h_j^(k−1) : n_j ∈ Adj_i}))) (24)

The output of the final hidden layer is the ultimate hidden feature vector of node n_i, denoted as h_i^(K). Consequently, the GraphSAGE model can produce the final hidden feature vectors for all nodes.

  • 3.

    Parameter Learning

In the GraphSAGE model, the parameter matrix W defines the nonlinear mapping relationship between node features and neighborhood aggregation features. During the offline pre-training phase, the network parameters of the DRL model remain fixed, allowing the weight parameter matrix W of the GraphSAGE model to be adjustable. Since the overall objective is to maximize the cumulative reward of the routing task, we define R(W) as the objective loss function parameterized by the weight matrix W. Consequently, the stochastic gradient descent algorithm is employed to update the weights. The optimization process is represented as:

W_t = W_(t−1) − γ∇_W R(W) (25)

In this context, W_t denotes the model weight parameters at step t, γ represents the learning rate, and ∇_W R(W) indicates the gradient with respect to the parameter W. Through continuous iterations, the model progressively acquires node neighbor feature representations that are conducive to routing decisions. Upon completion of the pre-training phase, the obtained W can be applied to new network nodes to generate hidden vectors with enhanced discriminative power and generalization capability, thereby providing support for the subsequent policy optimization of the DRL routing model.

The implementation of the feature extraction algorithm based on GraphSAGE is illustrated in the pseudocode presented in Algorithm 1.

Algorithm 1. Feature Extraction Algorithm Based on GraphSAGE.
Input: directed graph G, initial node feature vectors {x_0, x_1, …, x_(Ns−1)}, sample size s, sampling depth K, weight matrices W^(k), nonlinear activation function σ(·), aggregation function Mean
Output: hidden feature vector of each node {h_i^(K), n_i ∈ N_s}
1: Initialize the state of each node in the network: h_i^(0) ← x_i, ∀n_i ∈ N_s
2: for k = 1, 2, …, K do
3:   for n_i ∈ N_s do
4:     Sample s neighbor nodes of n_i and aggregate them using a mean aggregator: h_Adj(ni)^(k) ← Mean({h_v^(k−1), v ∈ Adj(n_i)})
5:     Concatenate the aggregated neighbor features h_Adj(ni)^(k) with the node features h_ni^(k−1): h_ni^(k) ← σ(W^(k) · CONCAT(h_ni^(k−1), h_Adj(ni)^(k)))
6:   end for
7:   Normalize the result using the L2 norm: h_ni^(k) ← h_ni^(k) / ||h_ni^(k)||_2
8: end for
9: Obtain the final hidden feature vector of each node: h_i ← h_i^(K), ∀n_i ∈ N_s
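As a concrete illustration of Algorithm 1, the following NumPy sketch implements the sampling, mean aggregation, concatenation, and L2 normalization steps. All names, shapes, and the choice of ReLU as σ(·) are assumptions for this example, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def graphsage_forward(adj, feats, weights, K=2, s=5):
    """Minimal GraphSAGE forward pass with a mean aggregator.

    adj: dict node id -> list of neighbor ids; feats: (N, d) initial features;
    weights: list of K matrices, each of shape (d, 2*d) so the output
    dimension stays d across layers (an illustrative choice).
    """
    h = feats.copy()
    for k in range(K):
        h_next = []
        for i in range(len(h)):
            nbrs = adj[i]
            # sample s neighbors: with replacement only if fewer than s exist
            sampled = rng.choice(nbrs, size=s, replace=len(nbrs) < s)
            agg = h[sampled].mean(axis=0)                  # Mean aggregator
            z = weights[k] @ np.concatenate([h[i], agg])   # CONCAT + linear map
            z = np.maximum(z, 0.0)                         # sigma(.): ReLU here
            h_next.append(z / (np.linalg.norm(z) + 1e-12)) # L2 normalization
        h = np.stack(h_next)
    return h

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
feats = rng.random((3, 7))                 # 7-dim features as in the extended vector
W = [rng.random((7, 14)) for _ in range(2)]
h_final = graphsage_forward(adj, feats, W, K=2, s=3)
```

Because the learned weights operate only on sampled local neighborhoods, the same `graphsage_forward` can be applied unchanged after nodes join or leave, which is the inductive property the paper relies on.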

The computational complexity of the proposed feature extraction algorithm is analyzed as follows. Let Ns be the number of nodes, K the search depth, s the neighbor sample size, and d the feature dimension.

Time Complexity: The algorithm iterates through K layers and Ns nodes. In each iteration, it performs neighbor aggregation (O(s·d)) and feature transformation using matrix multiplication (O(d^2)). Thus, the total time complexity is O(K·Ns·(s·d + d^2)). Since K, s, and d are constants, the complexity is linear in the number of nodes, i.e., O(Ns), ensuring low processing latency for real-time applications.

Space Complexity: The space requirement consists of storing the node feature matrices (O(Ns·d)) and the weight parameters (O(K·d^2)). The overall space complexity is O(Ns·d + K·d^2), which is efficient for deployment on UAVs with limited memory resources.

4.2. Representation of Mesh Node Agents Based on Markov Models

For a given service request from the source node ni to the destination node nj, the routing generation process can be modeled as a series of Markov decision processes involving various agents along the service path from ni to nj. The resource allocation and routing generation process of agent i deployed at node ni can be represented by a triplet <Si,Ai,Ri>, where Si denotes the local observation state space of agent i, Ai represents the action space of agent i and Ri signifies the reward received by agent i upon taking action ai. The following section will provide a detailed design of the key components of the DRL agent, using agent i deployed at ni as an example.

  • 1.

    The state space Si

The agent i formulates the state input S_i for the DRL algorithm by integrating the node hidden feature vector h_i, obtained from the GraphSAGE sampling and aggregation process, with the current incoming service flow information. This is expressed as S_i = {h_i, f_ij ∈ F}, where h_i represents the hidden feature vector of node n_i extracted by the pre-trained GraphSAGE encoder, which explicitly captures local topological structures, dynamic link states, and available nodal resources. Crucially, to facilitate effective encoding, the continuous variables within these states are first normalized to [0, 1] via the mapping functions defined in Equations (9)–(12). Additionally, f_ij = {type_fij, data_fij, C_fij^demand, req_fij} denotes the service flow characteristics, including type, data volume, computational requirements, and various QoS demands.

  • 2.

    Action space Ai

To explicitly structure the decision variables, the action space Ai is defined as a joint tuple Ai=(aresource,aroute), comprising two distinct decision components:

Resource allocation actions a_resource: allocate bandwidth b_fij^luv to the traffic flow f_ij on the link l_uv. It is important to note that the bandwidth of each unit sub-channel in this paper is denoted W_K, and the allocated b_fij^luv must be an integer multiple of W_K. Likewise, allocate computational resources c_fij^n to the traffic flow f_ij executing computational tasks at node n, and storage resources s_fij^n to the traffic flow f_ij passing through node n.

Routing Action aroute: The selection of the next hop node n is determined by the adjacent relationships of the current node, resulting in a set of optional nodes. Consequently, as defined above, the composite action (aroute,aresource) allows the agent to jointly optimize topology selection and resource distribution. To mitigate the potential for an excessively large action space that could hinder algorithm convergence, the implementation of the algorithm adopts a “demand-based allocation, best-effort” strategy when the current node is handling a single flow. In cases where multiple flows require processing, priority is given to servicing the flow that yields a greater service utility.
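A minimal sketch of the "demand-based allocation, best-effort" bandwidth policy in integer multiples of the sub-channel width W_K might look as follows. The exact rounding behavior (rounding the demand up, capping at the link's free multiples) is an assumption for illustration, not stated in the paper:

```python
import math

def allocate_bandwidth(demand_mhz, free_mhz, wk=1.0):
    """Demand-based, best-effort allocation in integer multiples of W_K (wk).

    Grants the smallest multiple of wk that covers the demand, capped at
    the whole multiples still free on the link (illustrative policy).
    """
    needed = math.ceil(demand_mhz / wk) * wk       # round demand up to a multiple
    available = math.floor(free_mhz / wk) * wk     # whole multiples still free
    return min(needed, available)

grant_full = allocate_bandwidth(3.2, 10.0, wk=1.0)   # demand fits on the link
grant_capped = allocate_bandwidth(7.0, 5.5, wk=1.0)  # capped by free bandwidth
```

Keeping grants on the W_K grid is what lets the routing action space stay discrete while resource amounts remain out of the policy's output vector.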

  • 3.

    The reward function Ri

To achieve the optimization objective of maximizing the average service utility for users across the entire network, it is essential to design a rational instantaneous reward for the agent at each decision-making step. The reward function is defined as the utility value resulting from executing actions on the service flow f_ij, with negative rewards imposed in the event of constraint violations. Furthermore, to mitigate unnecessary exploration, the reward is set to the single-step maximum value of 1 when the current node is the endpoint of the flow. Specifically, the reward r_i^t obtained by agent i after taking a routing action at time slot t is calculated from various performance metrics, including latency, throughput, packet loss rate, and training accuracy, and is expressed as follows:

r_i^t =
  −1, if ∃ k ∈ Q, k_norm = 0
  1, if n_i = des_f
  Utility(f), otherwise (26)

where Q={delay,rate,lossrate,accuracy} represents the set of QoS metrics. The definition of each condition is as follows: Constraint Violation (−1): If any normalized QoS metric becomes 0, a penalty of −1 is imposed to strictly discourage actions that violate service agreements. Target Reached (1): If the current node ni is the destination desf, a maximum reward of 1 is granted to encourage flow completion. Step Forwarding (Utility(f)): For intermediate steps where constraints are satisfied, the reward equals the weighted link utility Utility(f), encouraging the selection of high-quality links that maximize the overall path utility.

The objective of Agent i is to maximize the reward Ri obtained from all flows through this agent, which can be expressed as the objective function:

R_i = ∑_{t=1}^{∞} γ^t r_i^t (27)

The parameter γ[0,1) represents the discount factor of the reward function. Additionally, during the model training phase, an auxiliary state vector of length Ns is appended for each traffic flow as input, wherein the nodes selected by the current traffic flow in the routing process are assigned a state of 1, while all other nodes are assigned a state of 0, in order to prevent the occurrence of routing loops.
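The piecewise reward of Equation (26) and the discounted return of Equation (27) can be sketched as follows; the metric names mirror the set Q, and the sample values are illustrative:

```python
def step_reward(ni, dest, norm_qos, utility):
    """Per-step reward r_i^t in the spirit of Equation (26).

    norm_qos: normalized QoS metrics in [0, 1]; a value of 0 marks a
    violated constraint. utility stands in for Utility(f).
    """
    if any(v == 0 for v in norm_qos.values()):
        return -1.0            # constraint violation penalty
    if ni == dest:
        return 1.0             # flow reached its destination
    return utility             # weighted link utility Utility(f)

def discounted_return(rewards, gamma=0.99):
    """Discounted return R_i = sum over t of gamma^t * r_i^t (Equation (27))."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))

q = {"delay": 0.6, "rate": 0.8, "lossrate": 0.9, "accuracy": 0.85}
r_step = step_reward(3, 7, q, utility=0.5)   # intermediate hop, constraints met
r_done = step_reward(7, 7, q, utility=0.5)   # current node is the destination
```

The hard −1 on any zeroed metric is what makes constraint violations dominate the learning signal regardless of the utility weights.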

4.3. Design of Resource Scheduling and Routing Algorithm Based on MAPPO

In the preceding sections, we have conducted a detailed design of node feature vector extraction, as well as the state, action, and reward functions of the DRL framework, based on the GraphSAGE model and multi-agent Markov processes. Within the design of the action space, the selection of routing nodes and bandwidth resource allocation are treated as discrete actions, whereas the allocation of computational and storage resources is considered as continuous actions. The MAPPO algorithm retains the “clipping” proximal update strategy of PPO, which mitigates policy mutation through constraints on policy updates, thereby ensuring stability throughout the training process and making it suitable for the frequently changing network states characteristic of the Mesh network scenario discussed here. Furthermore, due to the integration of optimization objectives for both discrete and continuous actions within the objective function, the MAPPO algorithm applies to mixed action spaces comprising both discrete and continuous actions. Consequently, this paper selects the MAPPO algorithm as the specific implementation for the DRL component, which will be elaborated upon in the following section.

As illustrated in Figure 3, an intelligent agent is deployed at each base station node, where the state of each agent at time slot t is represented as s_i^t. This state is composed of the node hidden vector h_i output by the GraphSAGE model described above and the current service flow demand. Subsequently, the action a_i^t is derived from the policy function π_i(a|s). The reward function r_i^t is synthesized from multiple QoS metrics, including latency, packet loss rate, transmission rate, and AI training accuracy, to characterize the contribution of the current action to the service utility function of the service flow. The DRL model employs an Actor-Critic neural network architecture to approximate the policy function, thereby generating the optimal input-output mapping relationship.

Figure 3.

Figure 3

Architecture of the Service flow Routing and Resource Scheduling Algorithm Based on MAPPO.

In a Markov Decision Process, agent i maps the input state sit to a probability distribution vector {p1,p2,,pq} over a discrete action space through a parameterized policy function πi(a|s), where q represents the number of adjacent nodes at the node ni. The routing algorithm selects the next-hop node corresponding to the maximum probability and performs resource allocation as the decision action ait, subsequently generating an immediate reward rit upon the issuance of the routing decision. It is important to note that the resource allocation action is not included in the probability vector output by the agent during the algorithm’s implementation; otherwise, this would result in an excessively large action space. In fact, the input state space of the DRL model encompasses the hidden feature vectors of each node, thereby incorporating resource capabilities into the decision-making process. If the resources of a node can meet the service demands of a post-routing decision, they are allocated; otherwise, a penalty reward is issued.
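The maximum-probability next-hop selection described above can be sketched as follows, combined here with a simple visited-node mask in the spirit of the auxiliary 0/1 loop-avoidance vector from Section 4.2; the masking mechanics are an illustrative reading, not the paper's exact rule:

```python
import numpy as np

def pick_next_hop(probs, neighbors, visited):
    """Select the most probable neighbor that has not yet been visited.

    probs: policy output over the q adjacent nodes; neighbors: their ids;
    visited: set of node ids already on the flow's path (loop avoidance).
    """
    masked = np.array([p if n not in visited else 0.0
                       for p, n in zip(probs, neighbors)])
    if masked.sum() == 0.0:
        return None                        # no feasible next hop remains
    return neighbors[int(np.argmax(masked))]

nxt = pick_next_hop([0.5, 0.3, 0.2], [4, 7, 9], visited={4})
```

Zeroing visited entries before the argmax keeps the action space small while guaranteeing loop-free hop-by-hop routes.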

In the distributed routing model, agents are required to collaborate to identify an optimal strategy πθ(a|s) that maximizes the overall return of network routing decisions. Let the global objective function be denoted as R(θ), where θ represents the collection of neural network policy parameters for each agent. The gradient estimate for agent i during time slot t is expressed as:

∇_{θi} R(θ_i) = E_t[ r_t(θ_i) ∇_{θi} log π_{θi}(a_i^t | s_i^t) A_i^t ] (28)

where r_t(θ_i) = π_{θi}(a_i^t | s_i^t) / π_{θi,old}(a_i^t | s_i^t) denotes the ratio of the probabilities with which the new and old policies generate the same action within time slot t. Manipulating the ratio r_t(θ_i) influences the effectiveness of the agent's learning. A_i^t is the advantage function estimate. To prevent r_t(θ_i) from changing too drastically between updates, the MAPPO algorithm introduces a clipped objective function, constraining the ratio within the interval (1 − ε, 1 + ε) to limit the magnitude of policy changes during each update, where ε is the clipping threshold. Therefore, the objective function of the Actor network for agent i at time slot t is represented as:

L(θ_i) = E_t[ min( r_t(θ_i) A_i^t, clip(r_t(θ_i), 1 − ε, 1 + ε) A_i^t ) ] (29)

When the advantage function A_i^t > 0, the current action yields an expected return that exceeds that of the baseline policy. In this case, the probability weight of the action should be reinforced, while capping r_t(θ_i) at 1 + ε to prevent over-optimization. Conversely, when the advantage function A_i^t < 0, the probability of selecting the current action should be reduced, with r_t(θ_i) floored at 1 − ε to avert excessive policy shifts.

The advantage function can be estimated using the following formula:

A_i^t = δ_t + (γλ)δ_{t+1} + ⋯ + (γλ)^{T−t+1} δ_{T−1} (30)

In this context, δt represents the time difference error, defined as:

δ_t = r_i^t + γV(s_{t+1}) − V(s_t) (31)

In this process, the state-value function of the Critic network, denoted as V(s_t), is utilized to assess the quality of the overall policy of the network at the current moment. The Critic network computes the temporal-difference error from the state values V(s_t) and V(s_{t+1}) together with the reward r_i^t, weighted by the discount factor γ. Denoting the parameters of the Critic network as ω, the Critic is trained to minimize the mean squared error of the value function:

L(ω) = ∑_t [ r_i^t + γV(s_{t+1}) − V(s_t) ]^2 (32)
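Equations (29)–(31) can be exercised numerically with the following sketch of a truncated advantage estimate and the clipped surrogate objective; the backward recursion and vectorized NumPy forms are implementation choices for this example, not the paper's code:

```python
import numpy as np

def advantage_estimate(rewards, values, gamma=0.99, lam=0.95):
    """Truncated advantage estimate in the spirit of Equation (30).

    values must have length len(rewards) + 1; the backward recursion
    A_t = delta_t + (gamma*lam) * A_{t+1} reproduces the summed series.
    """
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    adv, running = np.zeros(T), 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Clipped surrogate of Equation (29): E[min(r*A, clip(r,1-eps,1+eps)*A)]."""
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return float(np.minimum(ratio * adv, clipped).mean())

adv = advantage_estimate([0.5, 1.0], [0.0, 0.1, 0.0])
obj_pos = ppo_clip_objective(np.array([1.5]), np.array([2.0]))   # capped at 1+eps
obj_neg = ppo_clip_objective(np.array([0.5]), np.array([-2.0]))  # floored at 1-eps
```

The `min` makes the surrogate a pessimistic bound: an overly large ratio cannot inflate a positive advantage beyond the 1 + ε cap, and a shrunken ratio cannot hide a negative advantage below the 1 − ε floor.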

Each agent utilizes the MAPPO algorithm to update policy parameters, and the specific process for routing and resource allocation decision-making is illustrated in Algorithm 2.

Algorithm 2. Distributed Routing Decision Algorithm Based on GraphSAGE-MAPPO.
Input: the service flow f_ij = {type_fij, data_fij, C_fij^demand, req_fij} arriving at agent i
Output: (aroute,aresource), routing and resource allocation actions.
1: Initialize the network G, Actor network parameters θ, Critic network parameters ω, and experience replay buffer D.
2: while episode < N_episodes do
3:   for step = 1 to T do
4:     if the current node is not the destination of the traffic flow then
5:       Obtain the initial feature vector of node n_i: x_ni = {pos_ni, B_ni, C_ni, S_ni}.
6:       Compute the hidden feature vector h_i of the node using Algorithm 1, and form the state s_i by incorporating the traffic flow f_ij.
7:       Generate the next-hop probability distribution according to the policy function π_θi(a_i | s_i), select the optimal next hop, allocate resources, and compute the reward R_i using Equation (26).
8:       Obtain the next state s′, and store (s, a, r, s′) in the experience replay buffer D.
9:       Sample experiences from the replay buffer and compute the advantage function A_i^t according to Equation (30).
10:      Compute the loss function of the Actor network and update θ.
11:      Compute the loss function of the Critic network and update ω.
12:    end if
13:  end for
14: end while
15: Save the model.

The computational complexity of the proposed Distributed Routing Decision Algorithm is analyzed from two perspectives: the training phase and the inference phase. Let Nep be the number of episodes, T the steps per episode, B the batch size for updates, and d the feature dimension.

Time Complexity:

Training Phase: The algorithm runs for Nep episodes with T steps each. In each step, the computational cost consists of feature extraction, action generation, and network updates. Feature Extraction & Action: Calling the GraphSAGE module costs O(N·d^2), and the Actor network forward pass costs O(L·d^2), where L is the number of layers; Parameter Update: Updating the Actor and Critic networks involves processing a batch of size B. The backpropagation complexity is approximately O(B·d^2); Total Training Time: O(Nep·T·(N·d^2 + B·d^2)). Since training is performed offline or in parallel, this cost is acceptable.

Inference Phase: During the mission, the UAV only executes the forward pass. For a single routing decision, the complexity is dominated by the local neighbor aggregation and the Actor network inference, which is O(K·s·d + L·d^2). This complexity is independent of the global network size N and the iteration limit T, ensuring O(1) constant-time complexity per hop relative to the network scale and guaranteeing low latency for emergency communications.

Space Complexity:

The space consumption is primarily determined by the experience replay buffer D and the neural network parameters. Replay Buffer: Storing transitions (s, a, r, s′) requires O(|D|·d), where |D| is the buffer capacity; Model Parameters: Storing weights for the Actor (θ) and Critic (ω) networks requires O(L·d^2). Total Space: O(|D|·d + L·d^2). This falls well within the memory constraints of modern UAV onboard processors.

5. Simulation Analysis

This paper simulates the construction and deployment process of the GraphSAGE and MAPPO algorithm. The simulation experiments are conducted on a Windows 11 PC equipped with an AMD Ryzen 7 5800H CPU running at 3.20 GHz and 16 GB of RAM. Anaconda3-2020.02 is utilized to establish a Python virtual environment, and the deep learning framework is built using Python 3.7 in conjunction with TensorFlow 2.4.1. Initially, the physical network environment is constructed, considering the frequently changing characteristics of wireless mesh network topologies. The NetworkX library is employed to develop the underlying physical infrastructure network, and a directed graph is constructed to simulate real-world scenarios based on the coverage area of base station nodes. After achieving training stability, a small number of nodes and links are randomly added or removed every 500 training episodes to simulate topological changes. In each training episode, 400 tasks of various types are randomly generated, with the task size selectable from the set {3, 9, 15, 21, 27} Mbit. The relevant parameters of the pre-trained model for AI tasks are defined in Table 1, and the nodes for the transmission and reception of task data are selected randomly. The initial topology of the network is illustrated in Figure 4, while the settings for the physical network simulation parameters are provided in Table 2. The main parameter settings for the GraphSAGE-MAPPO algorithm model are outlined in Table 3.

Figure 4.

Figure 4

Distribution of Various Types of Base Stations and Users at Network Initialization.

Table 2.

Parameters for Physical Network Simulation.

Physical Network Environment Parameters Parameter Value
simulation area size (m^2) 1000 × 1000
number of ground base stations [5, 10]
GBS coverage radius (m) 200
number of UAV-BS [15, 20]
UAV-BS coverage radius (m) 400
base station allocatable channel bandwidth (MHz) 20
computing resources of various base stations (GHz) [1, 2]
base station storage resources (MB) [500, 800]
link transmission delay (ms) [4, 10]
link packet loss rate (%) [0.01, 10]
tolerable delay of delay-sensitive task flows (ms) [30, 50]
tolerable rate threshold of rate-demanding task flows (Mbps) [30, 50]
packet loss tolerance threshold of loss-sensitive task flows (%) [1, 5]
training accuracy tolerance of AI tasks (%) [80, 90]

Table 3.

Key Parameters of the GraphSAGE-MAPPO Model.

Trainable Model Parameters Parameter Value
GraphSAGE sampling depth K 2
GraphSAGE per-layer node sampling number s 5
Actor network learning rate 0.0003
Critic network learning rate 0.001
reward discount factor 0.99
initial value of the exploration rate 1.0
final value of the exploration rate 0.01
rate of decay for the exploration rate 0.995
PPO clip parameter ϵ 0.2
replay buffer size 5000
batch size of randomly sampled training data 64
training episodes 10,000
max steps in each episode 50

Table 3 shows the key parameters of the GraphSAGE-MAPPO model, which are important for training the deep reinforcement learning agents. These parameters govern neighbor sampling, the learning rates of the Actor and Critic networks, exploration behavior, and the overall training setup.

This paper selects routing and resource allocation methods based on GraphSAGE-DQN [19], single-agent PPO [20] and the traditional Dijkstra algorithm as comparative algorithms, to validate the superior performance of the proposed algorithm under various traffic demand types and dynamic changes in network topology. The performance of the algorithms is assessed using metrics such as average delay, network throughput, packet loss rate, and training accuracy for AI services. Specifically, the GraphSAGE-DQN algorithm first utilizes a graph neural network to generate hidden feature vector representations of its nodes and subsequently employs a centralized DQN model to train routing for service flows with latency and rate requirements, making decisions by selecting entire paths from a pool of candidate paths for each service flow. The centralized PPO algorithm conducts hop-by-hop routing to maximize packet forwarding rates while minimizing latency. The Dijkstra algorithm greedily selects feasible shortest paths for each flow and allocates resources accordingly.

In the preceding text, we defined the reward function as shown in Equation (26). Considering the varying QoS requirements for different traffic types, we assigned the following weights to the metrics of delay, rate, and packet loss for rate-sensitive traffic flows: (0.25, 0.5, 0.25). For delay-sensitive traffic flows, the weights for delay, rate, and packet loss are (0.5, 0.25, 0.25). For packet loss-sensitive traffic flows, the corresponding weights are (0.25, 0.25, 0.5). For AI traffic flows, the weights assigned to delay, rate, packet loss, and training accuracy are (0.2, 0.2, 0.2, 0.4). Consequently, the average service utility of the traffic flows can be calculated using Equations (4) and (13), which subsequently allows for the computation of the overall reward value.
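The weighted service utility described above can be computed as in the following sketch; the dictionary-based representation and the linear weighted-sum aggregation are assumptions for illustration (the paper's exact utility form is given by Equations (4) and (13)):

```python
def weighted_utility(metrics, weights):
    """Weighted QoS utility of a flow.

    metrics: normalized QoS values in [0, 1]; weights: per-metric weights
    of the flow's traffic class, summing to 1 (illustrative aggregation).
    """
    return sum(weights[k] * metrics[k] for k in weights)

# delay-sensitive flow: weights (0.5, 0.25, 0.25) for delay, rate, loss
m = {"delay": 0.8, "rate": 0.6, "lossrate": 0.9}
w = {"delay": 0.5, "rate": 0.25, "lossrate": 0.25}
u = weighted_utility(m, w)
```

Swapping in the AI-flow weight set (0.2, 0.2, 0.2, 0.4) with an added "accuracy" entry covers the fourth traffic class in the same way.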

The system normalization rewards under different algorithms are illustrated in Figure 5. To clearly visualize the convergence trends and mitigate high-frequency fluctuations caused by stochastic exploration, a moving average smoothing technique has been applied to the curves. From the figure, it can be observed that the reward values obtained from the flow routing and resource allocation algorithms based on the proposed method and GraphSAGE-DQN outperform those of other algorithms. Furthermore, when comparing the convergence speed, the PPO algorithm demonstrates superior performance relative to the proposed method and GraphSAGE-DQN. Specifically, the proposed algorithm converges after approximately 6000 iterations, while the PPO algorithm converges around 2000 iterations, and the GraphSAGE-DQN algorithm converges after approximately 4500 iterations. This indicates that the introduction of GNN for node feature extraction prior to DRL decision-making has a notable impact on algorithm complexity, and that multi-agent collaboration necessitates a greater number of training iterations to achieve a stable policy. The Dijkstra algorithm, as a deterministic shortest path algorithm, consistently pursues the shortest path for all reachable flows, with its policy remaining unchanged as the number of training iterations increases.

Figure 5.

Figure 5

Normalized Reward Values.

As illustrated in Figure 6, during random flow generation the proportions of high-bandwidth, low-latency, low-packet-loss, and AI service flows are set to {0.4, 0.2, 0.2, 0.2}. As the number of training iterations increases, the average transmission rate achieved by each intelligent algorithm initially grows rapidly before stabilizing. Although convergence is relatively slow, the algorithm proposed in this paper, along with the GraphSAGE-DQN algorithm, achieves a commendable average transmission rate for service flows, demonstrating that supplying DRL agents with pre-extracted network features enhances their decision-making beyond purely local information perception. The PPO algorithm cannot perceive neighbor node features and relies on a single agent to learn the policy; consequently, it serves incoming traffic sequentially and cannot promptly adapt to changes in resource and link conditions within the environment, thus achieving only local optimality. The Dijkstra algorithm schedules all flows over fixed shortest paths, resulting in low bandwidth resource utilization and a propensity for the emergence of "bottleneck links".

Figure 6.

Figure 6

Average Transmission Rate of The Traffic Flow with Training Epochs.

To validate the performance of the algorithm in dynamic environments, a fixed training iteration count of 10,000 was established, with a total of 30 initial ground base stations and drones in the network. Following the specified proportions of various service flows, nodes and links were randomly removed after training stabilized around the 6000th iteration for all algorithms. The average end-to-end service flow rate, as depicted in Figure 7, varies with the number of randomly removed drones or ground base stations. Overall, as the number of randomly removed base stations increases, the average service flow rate obtained by all algorithms exhibits a downward trend. The proposed algorithm and the GraphSAGE-DQN algorithm demonstrate a relatively smaller and more stable decrease in average rate, reflecting commendable adaptability to changes in network topology. Conversely, the PPO algorithm experiences performance fluctuations due to alterations in node and link conditions following topology changes, necessitating retraining to explore new optimal strategies for convergence. The Dijkstra algorithm, when confronted with a significant number of node removals, suffers from insufficient bandwidth resource allocation at "central nodes," resulting in performance that is significantly influenced by the randomness of node removals and in the most considerable overall rate decline.

Figure 7. Average Transmission Rate of the Traffic Flow with the Number of Randomly Removed Nodes.
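The node-removal experiment can be mimicked at its simplest with a reachability check on the surviving subgraph — a sketch only, under a hypothetical 8-node ring-with-chords topology (the paper's simulator evaluates rates and QoS, not mere connectivity):

```python
from collections import deque

def connected(adj, src, dst, removed):
    """BFS reachability on the subgraph that survives node removal."""
    if src in removed or dst in removed:
        return False
    seen, q = {src}, deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            return True
        for v in adj[u]:
            if v not in removed and v not in seen:
                seen.add(v)
                q.append(v)
    return False

# Hypothetical 8-node ring where each node also links to node+4 (a chord)
adj = {i: {(i - 1) % 8, (i + 1) % 8, (i + 4) % 8} for i in range(8)}
removed = {1, 4}  # e.g. two failed relay nodes
print(connected(adj, 0, 3, removed))  # → True: the route 0-7-3 survives
```

Redundant mesh links are what keep flows routable after removals; the algorithms differ in how quickly they rediscover such surviving routes.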

To further rigorously validate the adaptability and generalization performance of the proposed algorithm in highly dynamic emergency scenarios, we conducted a sensitivity analysis regarding the frequency of topological changes. Figure 8 illustrates the average transmission rate of the traffic flow with respect to the varying topological update frequency. As the update frequency increases, the environment becomes increasingly dynamic, posing greater challenges to routing stability. It can be observed that the average transmission rates obtained by the baseline algorithms, particularly PPO and Dijkstra, exhibit a significant downward trend. The PPO algorithm experiences performance degradation because standard reinforcement learning methods struggle to re-converge and explore optimal strategies within shorter stable periods. Similarly, the Dijkstra algorithm suffers from frequent path invalidations caused by rapid link changes. Conversely, the proposed GraphSAGE-MAPPO algorithm demonstrates superior robustness. As shown in Figure 8, its performance curve remains relatively stable and maintains a high transmission rate even under the highest update frequency of 250 updates per 10,000 episodes. This demonstrates that the algorithm is insensitive to the frequency of topological changes. This robustness is attributed to the inductive learning capability of the GraphSAGE encoder, which learns generalizable aggregator functions rather than overfitting to specific network snapshots, thereby allowing the agent to rapidly adapt to new topologies without extensive retraining.

Figure 8. Average Transmission Rate of the Traffic Flow with the Frequency of Topological Changes.

As illustrated in Figure 9, with the proportions of high-bandwidth, low-latency, low-packet-loss, and AI service flows set to {0.2, 0.4, 0.2, 0.2}, the average transmission delay of each intelligent algorithm first decreases and then stabilizes as the number of training epochs grows. The GraphSAGE-DQN algorithm and the proposed algorithm achieve relatively low average delays once training stabilizes: including neighboring-node features in the state input enlarges each agent’s observation space and reduces unnecessary exploration of distant nodes. In addition, the DQN algorithm selects entire paths directly from the K shortest paths, which offsets some of the decision-making weaknesses of a single agent. The Dijkstra algorithm, as a classical shortest-path method, achieves good average end-to-end latency; however, because it has no scheduling strategy, flows that arrive first greedily occupy resources on shorter paths, forcing later flows onto routes with higher hop counts and thus higher latency.

Figure 9. Average Transmission Delay of the Traffic Flow with Training Epochs.
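The K-shortest-path action space used by the DQN baseline can be sketched with brute-force simple-path enumeration — a toy substitute for Yen's algorithm that is adequate only for small graphs; the 4-node mesh and per-link delay weights below are hypothetical:

```python
def k_shortest_paths(adj, src, dst, k):
    """Enumerate all simple paths by DFS and keep the k lowest-cost ones.
    Brute force suffices for toy graphs; production code would use Yen's
    algorithm. adj maps node -> {neighbor: link_weight}."""
    paths = []

    def dfs(u, path, cost):
        if u == dst:
            paths.append((cost, path[:]))
            return
        for v, w in adj[u].items():
            if v not in path:  # forbid cycles
                path.append(v)
                dfs(v, path, cost + w)
                path.pop()

    dfs(src, [src], 0.0)
    return sorted(paths)[:k]

# Hypothetical 4-node mesh with per-link delays (ms)
adj = {
    'A': {'B': 1.0, 'C': 2.0},
    'B': {'A': 1.0, 'D': 2.0, 'C': 0.5},
    'C': {'A': 2.0, 'B': 0.5, 'D': 1.0},
    'D': {'B': 2.0, 'C': 1.0},
}
for cost, path in k_shortest_paths(adj, 'A', 'D', 3):
    print(cost, path)
```

Presenting the agent with K whole candidate paths (rather than hop-by-hop choices) is what lets the DQN baseline sidestep some single-agent myopia, at the cost of a coarser action space.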

The number of training iterations is fixed at 10,000, with an initial total of 30 ground base stations and drones in the network. Under the previously specified service-flow proportions, Figure 10 shows how the end-to-end average transmission delay varies as the number of randomly removed drones or ground base stations increases. The average delay of all algorithms rises; however, the proposed algorithm and GraphSAGE-DQN show a relatively small increment, while the delay of the PPO algorithm fluctuates significantly because it must re-converge after each topology change. The traditional Dijkstra algorithm greedily consumes shortest-path resources; as more nodes are removed, the pool of selectable resources shrinks substantially and the hop count of subsequently arriving flows rises rapidly.

Figure 10. Average Transmission Delay of the Traffic Flow with the Number of Randomly Removed Nodes.
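The Dijkstra baseline's behavior — greedy shortest paths over static link weights, blind to residual bandwidth — can be sketched as follows (toy graph and weights are hypothetical):

```python
import heapq

def dijkstra(adj, src, dst):
    """Classic Dijkstra shortest path; adj maps node -> {neighbor: weight}.
    Note it only sees static weights: residual bandwidth plays no role,
    so every flow between the same endpoints piles onto the same path."""
    dist, prev = {src: 0.0}, {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float('inf')):
            continue  # stale queue entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, u = [dst], dst  # walk predecessors back to the source
    while u != src:
        u = prev[u]
        path.append(u)
    return dist[dst], path[::-1]

adj = {
    'A': {'B': 1.0, 'C': 2.0},
    'B': {'A': 1.0, 'D': 1.0},
    'C': {'A': 2.0, 'D': 1.0},
    'D': {'B': 1.0, 'C': 1.0},
}
print(dijkstra(adj, 'A', 'D'))  # → (2.0, ['A', 'B', 'D'])
```

Because the weights never reflect current load, every A-to-D flow takes A-B-D here, which is precisely how the “bottleneck link” effect the experiments observe arises.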

As illustrated in Figure 11, with the proportions of high-bandwidth, low-latency, low-packet-loss, and AI service flows set to {0.2, 0.2, 0.4, 0.2}, the average packet loss rate of each intelligent algorithm first drops rapidly with increasing training epochs, then slows and stabilizes, indicating convergence. The proposed algorithm and GraphSAGE-DQN exhibit relatively low packet loss after training stabilizes. Packet loss is tied to link transmission quality: in the GraphSAGE encoder, each node’s features aggregate the characteristics of neighboring nodes and incident links, including average packet loss statistics gathered from the experience buffer. This lets the DRL model avoid congested links and resource-deficient nodes during exploration. Conversely, the Dijkstra algorithm’s greedy choice of shortest paths congests certain links, giving it the highest overall packet loss rate.

Figure 11. Average Packet Loss Rate of the Traffic Flow with Training Epochs.
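Per-link loss features matter because end-to-end loss compounds multiplicatively across hops (assuming independent losses per link, a standard simplification): a longer path over clean links can deliver far more packets than a short congested one. A minimal sketch:

```python
def path_loss_rate(link_losses):
    """End-to-end packet loss over a path of independent links:
    1 minus the product of per-link delivery probabilities."""
    delivered = 1.0
    for p in link_losses:
        delivered *= (1.0 - p)
    return 1.0 - delivered

short_congested = [0.10, 0.10]       # 2 hops over congested links
longer_clean = [0.01, 0.01, 0.01]    # 3 hops over clean links
print(path_loss_rate(short_congested))  # ≈ 0.19
print(path_loss_rate(longer_clean))     # ≈ 0.03
```

This is the trade-off the loss-aware embeddings expose to the agents: the extra hop is cheap in loss terms when every link on the detour is clean.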

The number of training iterations is fixed at 10,000, with an initial total of 30 ground base stations and drones in the network. Under the previously specified service-flow proportions, Figure 12 shows how the end-to-end average packet loss rate varies as the number of randomly removed drones or ground base stations increases. The average packet loss rate of all algorithms grows as nodes are removed, but compared with the benchmarks, the proposed algorithm shows the smallest increase. Even when removing many nodes changes the topology significantly, the algorithm only needs to re-extract the hidden node features, which already integrate link packet-loss metrics, for the DRL decision process to identify next-hop nodes and links with lower real-time packet loss and steer flows away from congested links.

Figure 12. Average Packet Loss Rate of the Traffic Flow with the Number of Randomly Removed Nodes.

As illustrated in Figure 13, for the AI service flow, the training accuracy of each intelligent algorithm gradually increases with training epochs and reaches a stable value. The accuracy of the AI service flow depends on the degree of model compression and on sample quality: compression is determined by the resource capabilities of the nodes, and sample quality by the transmission quality of the hop-by-hop links. The proposed algorithm and GraphSAGE-DQN extract node feature vectors that encode link packet loss rates and node resource information; this shrinks the search space during DRL decision-making, avoids low-quality links to preserve data quality, and routes tasks to resource-abundant mesh nodes that can run the full model, improving accuracy. In contrast, the single-agent PPO algorithm lacks effective perception of each node’s surroundings and settles in a local optimum. The Dijkstra algorithm has no scheduling strategy, so congested links degrade data quality; its greedy preference for central nodes imposes resource constraints there, forcing low-precision compressed models and further reducing overall accuracy.

Figure 13. Variation of Training Accuracy of the AI Service Flow with Training Epochs.

Finally, regarding practical applicability, the proposed GraphSAGE-MAPPO framework inherently supports scalability through its architectural design. Unlike traditional centralized algorithms that suffer from computational bottlenecks as the network size (N) grows, the inductive nature of GraphSAGE utilizes neighbor sampling, decoupling the inference complexity from the total number of nodes. Furthermore, the decentralized execution paradigm allows each UAV to make routing decisions based solely on local observations, effectively preventing communication congestion and control signaling overhead even under high traffic loads or in large-scale dense networks. This ensures the solution remains viable for extensive emergency deployment.
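The neighbor-sampling idea behind this scalability claim can be sketched in a few lines: each node aggregates at most k sampled neighbors, so per-node inference cost stays O(k) regardless of the total node count N. The sample size and the dense hypothetical node below are illustrative only:

```python
import random

def sample_neighbors(adj, node, k, rng):
    """Fixed-size neighbor sampling: aggregation cost per node is O(k),
    independent of total network size N — the inductive trick that
    decouples inference from N. Sketch only."""
    neigh = list(adj[node])
    if len(neigh) <= k:
        return neigh
    return rng.sample(neigh, k)

rng = random.Random(42)
# Hypothetical dense node with 100 neighbors; we still aggregate only 5
adj = {0: list(range(1, 101))}
sampled = sample_neighbors(adj, 0, 5, rng)
print(len(sampled))  # → 5
```

Whether the node has 10 neighbors or 10,000, the aggregation step touches at most k of them, which is why inference cost does not grow with the network.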

6. Conclusions

In response to the limitations of existing deep reinforcement learning-based intelligent routing and resource allocation algorithms in adapting to the dynamic changes in network structures and demands of wireless networks, this paper proposes an intelligent routing algorithm, GraphSAGE-MAPPO, which integrates GNN and DRL. During the training process, GraphSAGE is utilized to extract aggregated features of the nodes and links, which are then combined with the status of service flows to serve as the state input for the subsequent DRL algorithm. The model is trained to make hop-by-hop routing decisions and allocate resources efficiently. Experimental results demonstrate that, compared to the GraphSAGE-DQN algorithm and the PPO and Dijkstra algorithms that do not employ GNN for network feature extraction, the proposed algorithm exhibits superior performance in terms of average latency, packet loss rate, network throughput, and training accuracy for AI services. Furthermore, it demonstrates robust generalization capabilities in the event of network node failures, with the trained model effectively transferring to dynamically changing scenarios.

Abbreviations

The following abbreviations are used in this manuscript:

AI Artificial Intelligence
MAPPO Multi-Agent Proximal Policy Optimization
GNN Graph Neural Networks
DRL Deep Reinforcement Learning
URLLC Ultra-Reliable Low Latency Communication
DQN Deep Q-Networks
GMAPPO GNN-based Multi-Agent Proximal Policy Optimization
UAV unmanned aerial vehicle
SNR signal-to-noise ratio
AWGN additive white Gaussian noise
GNN–MADRL Graph Neural Network based Multi-Agent Deep Reinforcement Learning

Author Contributions

Conceptualization, L.X., S.H. and X.Z.; methodology, L.X.; software, L.X. and S.H.; validation, L.X., S.H. and W.F.; formal analysis, L.X. and Z.Z.; investigation, J.W.; resources, Z.Z.; data curation, J.W.; writing—original draft preparation, X.Z.; writing—review and editing, X.Z.; visualization, L.X. and X.Z.; supervision, L.X. and X.Z.; project administration, L.X.; funding acquisition, L.X. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

All data generated or analyzed during this study are included in this published article.

Conflicts of Interest

All authors were employed by the Nanjing Yuanneng Power Engineering Co., Ltd. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Funding Statement

Science and Technology Project of State Grid Jiangsu Electric Power Co., Ltd. (JCNJ20250106).

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

