Scientific Reports. 2026 Mar 11;16:9612. doi: 10.1038/s41598-025-33233-x

High performance IP lookup through GPU acceleration to support scalable and efficient routing in data driven communication networks

Veeramani Sonai 1,#, Indira Bharathi 2,✉,#, Samah Alshathri 3,✉,#, Walid El-Shafai 4,5,✉,#, Mohd Fadzil Abdul Kadir 6,#
PMCID: PMC13009253  PMID: 41813697

Abstract

As more people use the network, overall traffic grows steadily. To sustain performance, the devices that connect end systems must process incoming packets at wire speed, and IP lookup is a critical part of that processing. This work introduces an effective mechanism for parallelizing IP lookup on a General-Purpose Graphics Processing Unit (GPGPU). Because the complexity of the lookup operation increases with IP address length, the proposed approach shortens each IP address using a suitable compression technique. The compressed IP address is then split into two parts, and longest prefix match (LPM) lookup is performed on both parts in parallel. In the proposed work, the lookup complexity is reduced to Inline graphic compared with a binary trie, where w and k denote the IP prefix length and the stride value, respectively. The experimental results show that the GPU-based IP lookup achieves an 84% improvement over CPU-based approaches, and 89% and 97% improvements over GPU-based hashing and GPU-based Binary Search Tree (BST) lookup, respectively.

Keywords: GPU, IP lookup, LPM, CUDA, BST, Hash

Subject terms: Electrical and electronic engineering, Engineering, Mathematics and computing

Introduction

An internetwork is made up of several smaller networks that are linked together by networking devices to form one large network. Computer networks are built from a wide range of systems, including switches, routers, and various types of middleboxes, running intricately designed protocols. An IP packet travels from one router to the next until it reaches its destination, and every router decides how to forward each incoming packet by finding the appropriate next hop. Each network device comprises a data plane, a control plane, and a management plane. The key function of the data plane is to transfer IP packets toward their destination, while the control and management planes maintain the information that the data plane relies on. Modern routers must perform massive data-intensive tasks such as IP lookup and packet classification.

Packet processing involves many operations, including packet forwarding, address lookup, packet classification, packet queuing, packet discard, error detection and correction, fragmentation, segmentation, and reassembly. There are two common types of address lookup: lookup of a MAC (Media Access Control) address, which occurs in an Ethernet bridge, and lookup based on the IP address. To avoid wasting IP address allocations, modern network systems employ classless inter-domain routing (CIDR) [1] addresses, in which IP prefixes can vary in length. When several IP prefixes in the forwarding table match the destination IP address, the longest prefix match (LPM) algorithm [2] identifies the best next hop: the packet is forwarded using the information associated with the longest matching prefix. IP address lookup can be implemented efficiently in both software and hardware. Many software-based approaches rely on trie-based data structures. Although software approaches are flexible, they suffer from a performance bottleneck because of the limited computational power of the CPU. The forwarding table of a router or switch typically contains tens of thousands of entries, so an LPM search can become a significant, time-consuming task, and CPU-based lookup does not deliver good performance because of the limited number of processor cores. One possible solution is to use a GPGPU and create a pool of threads. The Graphics Processing Unit (GPU) is a promising platform for data-intensive applications such as IP lookup. GPUs are already used in network intrusion detection systems [3], where they give a significant speed improvement over CPU-based implementations.
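The longest-prefix-match rule described above can be illustrated with a minimal reference sketch. It is a plain linear scan, not the trie-based or GPU-based method of this paper, and the function name `longest_prefix_match` is an assumption made for illustration. The sample entries mirror the small forwarding table used later in Table 1.

```python
def longest_prefix_match(table, dest_bits):
    """Return (prefix, next_hop) for the longest prefix of dest_bits, or None.

    Illustrative linear scan over a dict of bit-string prefixes.
    """
    best = None
    for prefix, next_hop in table.items():
        if dest_bits.startswith(prefix):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, next_hop)
    return best


# Entries taken from Table 1 of the paper (bit-string prefix -> next hop).
table1 = {
    "0": "a", "010000": "b", "011": "c", "100": "d",
    "101": "e", "110": "f", "010101": "g",
}

# "0" and "011" both match 011010; the longer prefix wins.
match = longest_prefix_match(table1, "011010")
```

Every lookup here touches all n entries, which is the cost that trie structures and parallel hardware aim to avoid.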

This paper proposes a GPU-based parallel processing architecture that uses a multibit trie data structure. In the proposed design, the GPU serves as a co-processor that speeds up IP lookup. A multibit trie lowers the height of the trie and thereby improves overall efficiency. Trie construction takes place on the host, while the IP lookup operation is executed on the device. To improve efficiency further, the IP prefix is divided into two equal halves based on its maximum length, so that the tries for the two halves can be constructed concurrently. The contributions of the proposed approach are as follows:

  • Many studies prioritize either time efficiency or space efficiency. This work emphasizes the need for algorithms that are efficient in both time and space, with the aim of further enhancing overall performance.

  • Reducing the length of IP addresses decreases the height of the trie-based data structure. This optimization is expected to greatly improve the performance of IP lookup operations.

  • The proposed approach illustrates how a trie representation on the CPU is mapped to a one-dimensional array representation in GPU memory. This work also presents a method for traversing the one-dimensional array in GPU memory that mirrors the traversal of a trie structure in CPU memory. Implementing a trie in GPU memory is inherently complex, so the 1-D array representation is crucial for simplifying the implementation.

  • A novel approach divides the compressed IP address into two equal segments and employs a multibit trie data structure. Comparative analysis against existing state-of-the-art methods reveals that this approach consistently decreases average lookup time. An efficient mapping of the multibit trie to a GPU platform is achieved, and an efficient traversal method for the trie within the GPU environment is discussed.

  • This paper investigates a technique for parallelizing the lookup operations and illustrates that the lookup time remains consistent regardless of the size of the forwarding table.

  • The implementation results of the proposed approach show a notable 93% enhancement in performance compared with other existing GPU-based lookup solutions.

This article is structured as follows. Sect. “Related works” gives an overview of lookup operations in the literature. Sect. “Proposed GPU-based IP lookup architecture” presents the proposed GPU-based lookup method. Sect. “Results and discussion” reports and discusses the results, and Sect. “Conclusion” concludes the paper.

Related works

In recent years, various software-based approaches have been proposed for high-performance IP address lookup and classification [4]. These approaches use trie-based data structures to minimize memory accesses [5-7]. A trie is a tree-like data structure constructed from the binary representation of the prefixes in the forwarding table. In the worst case, a binary trie requires 32 memory accesses for an IPv4 lookup. To address this limitation, the path-compressed trie [8] was introduced, in which nodes with only one child are consolidated to decrease the height of the binary trie. If the trie is balanced (not skewed), however, path compression achieves little. To overcome this drawback, the level-compressed (LC) trie [9] was proposed. The LC-trie finds the largest possible perfectly balanced sub-trie at each level and replaces it with a single node. The LC-trie in turn struggles when the trie lacks perfectly balanced sub-tries: in the worst case for IPv4 it can still require 32 memory accesses, equal to the prefix length w, just like a single-bit lookup scheme such as the binary trie.

The multibit trie [10] is the next refinement, introduced to further reduce the number of memory accesses by examining several bits at once. The number of bits examined at each step is called the stride, k. In this structure, each node typically has 2^k children; with a stride of 2, for example, every node has 4 children. The lookup complexity becomes O(1) if the stride equals the address length w, but storing 2^w combinations requires far more memory. Another approach is binary search on prefix ranges [11], which does not support incremental updates. As the number of memory accesses increases, IP lookup time degrades significantly, and updating the forwarding table in these structures can also become more time-consuming. Huang et al. [12] introduced a memory-efficient offset-encoding scheme for optimizing IP lookup. Their approach uses a next hop bitmap and an offset value instead of traditional child and next hop pointers, and uses these during trie traversal to find the longest prefix match. Before the offset-encoded trie is constructed, a specific node labeling scheme is applied. Sun et al. [13] proposed a hierarchical labeling scheme that combines the tree bitmap algorithm with binary hash search based on the prefix length distribution. This scheme encodes IP prefixes using a bitmap within each hash entry. Lim et al. [14] introduced an efficient IP address lookup scheme that converts the longest prefix match problem into an exact match problem. Their approach uses multiple SRAM cells in the forwarding table, each containing multiple tables in which hashing is employed to find the longest prefix in parallel. Rojas et al. [15] proposed a trie-based parallel lookup algorithm in which the search is performed in parallel across levels. Lookup time increases with the number of levels, so this method uses controlled prefix expansion to reduce the number of levels: once a target level is selected, all existing prefixes are expanded to that level.
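The trade-off between stride and memory described above can be made concrete with a short calculation. This is a minimal sketch under the usual assumptions that a multibit trie over w-bit addresses has ceil(w/k) levels and 2^k children per node; the helper name `multibit_trie_shape` is hypothetical.

```python
import math


def multibit_trie_shape(w, k):
    """Return (levels, fanout) for a multibit trie over w-bit keys with stride k."""
    levels = math.ceil(w / k)   # worst-case number of node visits per lookup
    fanout = 2 ** k             # child slots stored in every node
    return levels, fanout


# IPv4 (w = 32): a larger stride means fewer levels but wider nodes.
for k in (1, 2, 4, 8, 32):
    levels, fanout = multibit_trie_shape(32, k)
    print(f"stride={k:2d}  levels={levels:2d}  children per node={fanout}")
```

With k = 1 the structure degenerates to the binary trie (32 levels), while k = 32 gives a single level with 2^32 slots, which is the O(1) but memory-hungry extreme mentioned in the text.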

A GPU-based IP lookup that uses string matching based on Bloom filters and radix trees is suggested in [3]. An efficient IP lookup algorithm in which the IP address is translated into a memory address is proposed in [16]; this translation requires a large amount of memory. A GPU-based IP lookup algorithm called GPU Accelerated IP lookup Architecture (GALE) is proposed in [17]. All possible IP prefixes are stored in a direct table, which makes it possible to convert the longest prefix match problem into an exact match. Although it is an efficient method for IP lookup, it suffers when the table is updated. The same algorithm is not suitable for IPv6, because GPU global memory is not sufficient to store all IP prefixes. A method that combines a priority trie with parallel processing is proposed in [18]. It employs several sub-tries based on prefix length, with each sub-trie processed by a dedicated thread to accelerate IP lookup. A binary search tree (BST) based IP lookup algorithm for graphics processors is introduced in [19]. This algorithm parallelizes the binary search tree for efficient parallel IP address lookup. It stores 256 independent binary search trees, uses the third octet as an index, and emphasizes exact match rather than longest prefix match. The worst-case time complexity of the BST is O(n) if the tree is skewed. A hashing technique for GPU-based IP lookup is proposed in [20]. It divides the IP addresses into several groups based on their lengths, with each group containing IP addresses of the same length, which allows all groups to be searched simultaneously for a given destination IP address. However, this approach is susceptible to hash collisions within each group, and it is difficult to produce a perfect hash function in an on-the-fly system. PacketShader, a GPU-based router framework, is proposed in [21].

A splitting-based technique for IP lookup is proposed in [22]. It splits along two dimensions. First, it divides the lookup operation into two parts: determining the length of the prefix and determining the next hop. Second, it divides prefix lengths into two groups: those equal to or shorter than 24 and those longer than 24. The authors of [23,24] suggested using GPUs to speed up table matching in OpenFlow switches. Specifically, Flow enables parallel processing to handle OpenFlow table matching operations efficiently. The Neurotrie data structure, which supports arbitrary strides and allows quick lookup, is offered in [25]. By choosing the appropriate stride for every Neurotrie node, the ideal balance between memory utilization and trie depth can be achieved. A fast, entropy-hashing-based parallel packet classification technique is presented in [26]. The method breaks up large rule sets into more manageable, uniformly distributed sub-rule lookup tables using a single-level hashing data structure. Because of the complex performance considerations involved, choosing the right granularity for trie-based lookups and updates is not simple. The authors of [27] provided NameGen, a pseudo-real dataset generation tool that uses a Markov-based name learning algorithm to generate names with a variety of customizable properties. NameTrie [28] is a data structure designed to find and update names quickly while storing and indexing forwarding table entries efficiently; its novelty lies in the efficient construction and use of a character-trie framework. Two enhancements to FCTree, a dictionary-compressed variant (DiFCTree) and a statistically compressed variant (StFCTree), are proposed in [29]. Furthermore, that work exposes multiple customizable parameters to the control plane so that different trade-offs between FIB size and lookup time can be achieved in each of these structures.

Ternary content addressable memory (TCAM) can compare a destination IP address against all stored routing entries in a single clock cycle. It also supports longest prefix matching (LPM) by selecting the most specific prefix when multiple entries match the same destination address. Although TCAM delivers excellent lookup performance, it suffers from limitations such as poor scalability and high power consumption. A name-based lookup using TCAM is proposed in [30]. This approach exploits TCAM for name lookup using a binary Patricia trie. The TCAM-based implementation suffers from high power consumption and is not suitable for dynamic updates of IP addresses in real-time applications. Memory is a critical component in a network system because it is limited in a small processor. To use memory efficiently, the authors of [31] propose a multi-region SRAM-based TCAM structure, but the design suffers as the number of rules in the TCAM grows. TCAM is widely used for fast IP lookup operations. The authors of [32] propose a greedy scheme that is fast to compute and efficient to update. Inserting a new rule into a TCAM is challenging because it requires moving rules and pausing lookup operations. The work in [33] proposes an update mechanism that significantly reduces computation time and interruption time.

Proposed GPU-based IP lookup architecture

Fig. 1 shows the GPU-based IP lookup mechanism for a switch or router. Assume that the forwarding table has n entries and that the longest IP address has length w. To improve IP lookup performance, the forwarding table entries are compressed using Huffman encoding: octet-based compression is applied to the complete forwarding table by means of a Huffman tree, which shortens the binary code of each entry. As a result, all IP addresses kept in the table become shorter. The compressed entries are then used to build multibit trie structures, which enable faster lookup operations. The forwarding table is divided into two tables based on the prefix length w, and a multibit trie is constructed on the CPU for each partition separately. An entry whose length is less than or equal to w/2 is added to one group of tries; an entry whose length is greater than w/2 is added to both groups. A flag in each trie node indicates whether the prefix was added to the first group only or to both groups. Trie construction is carried out on the host (CPU), and the IP lookup algorithm is implemented on the device (GPU). When a destination IP address arrives at the CPU, it is divided into two parts based on the length w from the forwarding table, and the two parts are sent to the GPU kernel to be looked up in parallel.

Fig. 1. Proposed GPU based IP address lookup architecture.

Compute Unified Device Architecture (CUDA) and its data representation

This section briefly discusses the CUDA-enabled GPU architecture. Compute Unified Device Architecture (CUDA) combines software and hardware. A CUDA device has its own processors and global memory, and the streaming multiprocessor (SM) is the fundamental hardware unit for executing threads. A CUDA program consists of a set of instructions that can be compiled and run on a general-purpose graphics processing unit (GPGPU) platform. A function that is invoked from the CPU (the host) and executed on the GPU (the device) is called a kernel. A GPU runs many threads that execute the same instructions on different data. These threads are organized into thread blocks, and shared memory allows all threads in a block to exchange data. A kernel is executed by a collection of thread blocks called a grid. The forwarding table of a switch can consist of millions of entries, so performing IP lookup over the whole table takes a large amount of time. Such a large table can be split into smaller groups, and multiple threads can be launched simultaneously on the GPU to find the longest prefix match (LPM), each thread executing the same code on a different part of the data. Before GPU computation starts, all necessary data is copied from CPU memory to the GPU’s global memory. The kernel is then launched from the host, many threads execute simultaneously, and the result is returned to the host. As shown in Fig. 2, a GPU has multiple streaming multiprocessors (SMs), each containing multiple cores. The extracted packet header is sent from the CPU to GPU global memory. All threads operating simultaneously on the GPU receive the destination IP address derived from the packet header, which allows concurrent IP lookup operations. The resulting longest prefix match, with its next hop information, is sent back to the CPU.

Fig. 2. Structure of the GPU-based system.

Transformation of forwarding table

Consider the forwarding table in Table 1, which includes next hop information. The forwarding table is represented as a multibit trie with a stride of k = 3, as shown in Fig. 3. Given that the maximum prefix length is 6, the trie has at most 2 levels, as illustrated in Fig. 4. Every node in the trie is assigned an integer value ranging from 0 to 2^k − 1. With k = 3, the size of the array at each sub-trie is 8. The entire trie is kept in an array, and the values of each node are arranged in accordance with the following expression:

Table 1. A simple forwarding table.

IP prefix Next hop
0 a
010000 b
011 c
100 d
101 e
110 f
010101 g

Fig. 3. Equivalent trie representation of the forwarding table.

Fig. 4. Equivalent node values at each level.

[Equation (1)]

Figure 5 shows all the node values stored in an array from the trie data structure.

Fig. 5. A complete array representation of trie data structure.

Traversal in the array representation

Because a pointer-linked trie built in host memory cannot be dereferenced directly on the GPU, traversing a pointer-based multibit trie in a GPU kernel is challenging. The entire trie is therefore converted into an equivalent array representation for easy processing in the GPU kernel. Figure 6 shows a simple example of a multibit trie with stride 2 and its equivalent array representation. The array is traversed in the same way as the multibit trie, progressing from the root node toward a leaf node. With a stride of 2, successive levels of the trie contain 1, 4, 16, 64 nodes, and so on. For a given destination IP address, two bits (stride = 2) are examined at each step, and the process continues until a leaf node is reached or the longest match is found. Every node in the trie is assigned an integer Inline graphic where Inline graphic. The integer value associated with the root node is Inline graphic. Based on the first two bits of the destination IP address, the traversal moves to one of the node values (1, 2, 3, 4) in the first level. The index value is calculated using Equation (2):

[Equation (2)]

The index value for the root node is 0. As Fig. 6 shows, the binary values 00, 01, 10, and 11 point to node values 1, 2, 3, and 4, respectively, in the first level. Assume the key to be searched in the trie is 0010. The two leftmost bits (00) point to node value 1.

[Equation (3)]

The next two bits of the search key are 10. The new index, shown in Equation (4), is calculated from the result of Equation (3).

[Equation (4)]

The node at this index stores the value 0010. This process is repeated until a leaf node is reached or the longest prefix match is found. Initially, every element of the array is set to 0. All node values from the trie are then stored in the array using Equation (2); if a next hop is present in a node encountered while traversing the trie, it is stored in the array as well. The array is traversed in the same way as the trie, again using Equation (2). Let the length of the IP address be w. Then the number of levels (l) in the multibit trie is w/k. The general expression for the number of threads running in parallel can be formulated as:

[Equation (5)]

The above equation can also be written as:

[Equation (6)]

Consider a prefix length of w = 12 and a stride of k = 3; the number of levels is then 4. The total number of threads required is calculated as follows:

[Equation (7)]

[Equation (8)]

Fig. 6. Multibit Trie and Array Representation.
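The array traversal described in this subsection can be sketched as follows. The equation images are not available in this copy, so the indexing rule below is an assumption: it is one level-order layout that agrees with the stated facts (root at index 0; the bit pairs 00, 01, 10, 11 map to node values 1, 2, 3, 4; levels hold 1, 4, 16 nodes). The function name `array_index` is hypothetical.

```python
def array_index(bits, k):
    """Array index of the node reached by consuming `bits`, k bits per step.

    Assumed layout: root at index 0, and the child selected by a k-bit
    chunk with value v (0 .. 2^k - 1) of the node at index i is stored at
    i * 2^k + (v + 1). This is a level-order, heap-style layout.
    """
    if len(bits) % k != 0:
        raise ValueError("bit string length must be a multiple of the stride")
    index = 0
    for start in range(0, len(bits), k):
        node_value = int(bits[start:start + k], 2) + 1  # 00->1, 01->2, 10->3, 11->4
        index = index * (2 ** k) + node_value
    return index


# Worked example from the text: key 0010 with stride 2.
first_step = array_index("00", 2)     # first two bits select node value 1
second_step = array_index("0010", 2)  # next two bits (10) select node value 3
```

Under this assumed rule the first level occupies indices 1 to 4 and the second level indices 5 to 20, so no pointers are needed: the position of a child follows from arithmetic on the parent index, which is what makes the structure convenient to traverse in a GPU kernel.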

Assuming there are 12 nodes in the trie data structure and a maximum of 4 threads per block, the number of blocks is 3, as shown in Fig. 7. Equation (8), derived from Equation (7), yields a total of 4681 threads running in parallel. These threads are organized into blocks, each containing the maximum allowable number of threads. With a limit of 1024 threads per block, 5 blocks are required, as determined by Equation (8).
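The figures quoted above can be checked with a short calculation. This sketch assumes one thread per trie node and that the node count is the sum of (2^k)^i over levels i = 0 to w/k, which reproduces the 4681 stated in the text; the function names are hypothetical.

```python
import math


def total_nodes(w, k):
    """Number of nodes in a full multibit trie with w/k levels below the root."""
    levels = w // k
    fanout = 2 ** k
    return sum(fanout ** level for level in range(levels + 1))


def blocks_needed(threads, threads_per_block):
    """Number of thread blocks required to hold `threads` threads."""
    return math.ceil(threads / threads_per_block)


threads = total_nodes(12, 3)            # 1 + 8 + 64 + 512 + 4096
blocks = blocks_needed(threads, 1024)   # blocks at 1024 threads per block

# The same sum in closed form (geometric series).
closed_form = ((2 ** 3) ** 5 - 1) // (2 ** 3 - 1)
```

The same helper gives 3 blocks for the 12-node, 4-threads-per-block example of Fig. 7.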

Fig. 7. Mapping data from CPU memory to parallel blocks in GPU memory.

Memory-efficient data representation

In Huffman coding, the most common octet is represented by a short binary code, whereas the least common octet is represented by a long binary code. This reduces the average length of the encoded IP addresses. Huffman coding is a lossless compression technique, so the original data can be recovered completely from the compressed version. First, IP addresses are split into octets, and a Huffman tree is created for each octet position. The original forwarding table is then rewritten using Huffman coding: each octet of each entry in the forwarding table is substituted with its corresponding Huffman code. Compared with uncompressed searching, this compression substantially lowers the number of bits needed for each IP address in the lookup table. As a result, trie searching becomes more efficient because it requires fewer comparisons. The Huffman tree constructed for the first octet (Table 2) is shown in Fig. 8. The total number of bits required for the first octet after compression is calculated as follows:

Table 2. First octet value from the IP table along with its short binary code.

First octet value   Frequency   Code      Bits
192                 120         0         1
200                 42          10        2
203                 42          110       3
209                 37          1110      4
198                 32          11110     5
218                 7           1111100   7
220                 2           1111111   7
215                 24          111111    6

Fig. 8. Huffman Tree of First Segment Set.

Total bits after compression = (120 × 1) + (42 × 2) + (42 × 3) + (37 × 4) + (32 × 5) + (7 × 7) + (2 × 7) + (24 × 6) = 845 bits

Without compression, 306 × 8 = 2448 bits are needed. Therefore, the compression ratio is 845/2448, indicating that the required memory capacity is 34.51% of the original. As a result, the space reduction for the first octet is 65.48%. Table 2 lists sample first octet values extracted from an IP table along with the binary codes generated by Huffman encoding. For example, to encode the IP address 200.168.150.0, the first octet value, 200, is extracted and replaced with its binary code from Table 2, which is 10. Similarly, a separate table is constructed for every octet and used to encode the given IP address. An uncompressed octet value requires 8 bits, whereas the encoded form of the octet value 200 takes only 2 bits. This reduces the total number of bits required for a given IP address.
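The compression figures for the first octet can be reproduced directly from the Frequency and Bits columns of Table 2. The sketch below only performs that arithmetic; it does not build a Huffman tree, and the variable names are illustrative.

```python
# (first octet value, frequency, code length in bits), copied from Table 2.
table2 = [
    (192, 120, 1), (200, 42, 2), (203, 42, 3), (209, 37, 4),
    (198, 32, 5), (218, 7, 7), (220, 2, 7), (215, 24, 6),
]

total_entries = sum(freq for _, freq, _ in table2)
compressed_bits = sum(freq * bits for _, freq, bits in table2)
uncompressed_bits = 8 * total_entries      # every raw octet takes 8 bits

ratio = compressed_bits / uncompressed_bits
space_saving = 1.0 - ratio

print(f"{compressed_bits} / {uncompressed_bits} bits, ratio = {ratio:.4f}")
```

The ratio comes out at roughly 0.345, in line with the 34.51% and 65.48% quoted in the text.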

Algorithm 1. Parallel Trie Construction in CPU.

Construction of parallel multibit trie in CPU

For each group, a separate multibit trie is constructed based on the chosen stride value (k). In this approach, two tries are created, each with a height of w/2, where w represents the maximum prefix length in the forwarding table. To construct the tries, the first pointer (pointer-1) is set at prefix index 0 (i.e., the first bit of the prefix), and the second pointer (pointer-2) is set at prefix index w/2 (i.e., the Inline graphic bit in the IPv4 address). For each prefix in the forwarding table, the length of the IP prefix is calculated. If the prefix length is less than or equal to w/2, the entry is added to the first trie. A flag value is maintained for every prefix in the trie to distinguish prefixes of length less than or equal to w/2 from prefixes of length greater than w/2.

Figures 9 and 10 illustrate multibit trie construction with a stride of 3 (k = 3) using Table 3. If the length of a prefix is not a multiple of k, the prefix is expanded to the nearest multiple of k; this process is called prefix expansion. The construction proceeds as follows. First, calculate the length of each prefix entry in the forwarding table and, if it is not already a multiple of the stride, expand the prefix to the nearest multiple. Then, depending on the trie height, divide each prefix into two parts, w1 (the first w/2 bits) and w2 (the remaining w/2 bits). Consider the prefix B (01000), whose length is 5 bits. It is expanded to the nearest multiple of 3, that is, 6 bits, giving the expanded prefixes 010000 and 010001. Each expanded prefix is divided into two parts of length w1 and w2: the first 3 bits (010) are added to the first trie with next hop B, and the remaining 3 bits are added to the second trie with next hop B. If an expanded prefix matches an existing IP prefix, the expanded prefix is discarded. A flag is maintained for each prefix, as described above for trie construction. Algorithm 1 can be used to construct a multibit trie for different values of the stride k.

Consider the destination IP address 110001 from Table 3. First, the maximum length of the IP address is calculated (M = 6 in this case). Because M is greater than w/2, both pointers are set at their respective locations: the first w/2 bits (the first 3 bits) are given to the first trie, with pointer-1 at prefix index 0, and the next w/2 bits (the remaining 3 bits) are given to the second trie, with pointer-2 at prefix index 3. The search proceeds in parallel in both tries. If a bit is 1, the traversal moves to the right side of the trie; if it is 0, to the left side. In the first trie, traversing the first w/2 bits (110) of the destination IP address leads to the node containing prefixes Inline graphic. In the second trie, traversing the remaining w/2 bits (001) leads to the node containing prefix J. The results obtained from the two tries are then intersected, and the intersecting prefix is the best matching prefix (BMP) for the specified destination IP address. In this case, both results include J, so the best matching prefix for the IP address 110001 is J. When the two results have no prefix in common, backtrack toward the root node in the second trie and intersect the prefixes at the currently backtracked node with the prefixes of the first-trie node. If no common prefix is found by the time the root node is reached, then backtrack from Inline graphic and return the prefix from Inline graphic whose flag value is 0, as shown in Algorithm 2.

Multibit tries allow a straightforward way to calculate the longest prefix match by inspecting multiple bits at a time. Using the RIPE data set [34], it was observed that the forwarding table has a maximum prefix length of w = 24. The entire forwarding table is divided into two blocks, each with a prefix length of 12. Threads running in CUDA are grouped into blocks. The number of threads in the first block is 2^12 = 4096. Representing 4096 values requires 4 levels in the multibit trie (stride = 3), and the number of threads required for each block is 4681. Because the pointer-based trie cannot be used directly in GPU memory, the entire multibit trie (stride = 3) is stored in an array. With a stride of 3, level l of the trie can store 8^l values, where l is the level number. Because the proposed work partitions the entire forwarding table into two groups, this process is carried out for both groups. The following two functions are used to copy the arrays from CPU memory to GPU memory.

Fig. 9. Multibit trie for the set of prefixes whose length is less than or equal to w/2.

Fig. 10. Multibit trie for set of prefixes whose length is greater than w/2.

Table 3. A simple compressed forwarding table.

IP prefix Next hop
0 A
01000 B
011 C
1 D
100 E
110 F
1100 G
1110 H
1111 I
110001 J
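The prefix expansion step applied to Table 3 can be sketched as follows. This is an illustrative reading of the text, not the paper's Algorithm 1: the function names are hypothetical, and the rule "an expanded prefix that collides with an existing prefix is discarded" is implemented by inserting longer original prefixes first.

```python
def expand_prefix(prefix, k):
    """Expand a bit-string prefix to the next multiple of k bits."""
    pad = (-len(prefix)) % k
    if pad == 0:
        return [prefix]
    return [prefix + format(i, f"0{pad}b") for i in range(2 ** pad)]


def expand_table(table, k):
    """Expand every prefix; on a collision the more specific entry is kept."""
    expanded = {}
    # Longest original prefixes first, so later (shorter) ones never overwrite.
    for prefix, hop in sorted(table.items(), key=lambda item: len(item[0]), reverse=True):
        for bits in expand_prefix(prefix, k):
            expanded.setdefault(bits, hop)
    return expanded


# Table 3 of the paper (bit-string prefix -> next hop).
table3 = {
    "0": "A", "01000": "B", "011": "C", "1": "D", "100": "E",
    "110": "F", "1100": "G", "1110": "H", "1111": "I", "110001": "J",
}

expanded_b = expand_prefix("01000", 3)   # the worked example for prefix B
expanded = expand_table(table3, 3)
```

For example, expanding G (1100) produces 110001, which already exists as prefix J, so that expansion is dropped and J keeps the entry.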

Algorithm 2. Proposed GPU Based IP Lookup Algorithm.

cudaMemcpy(d_Array1, Array1, size, cudaMemcpyHostToDevice);

cudaMemcpy(d_Array2, Array2, size, cudaMemcpyHostToDevice);

Here, Array1 belongs to the first group and Array2 to the second group; the corresponding device arrays are d_Array1 and d_Array2. Once the entire trie has been converted into its array representation, the array is copied into GPU global memory using the CUDA runtime function cudaMemcpy. The kernel function for IP lookup is called from the CPU as follows:

searchKey<<<numOfBlocks, numOfThreads>>>(d_Array1, d_Array2, d_val1, d_val2, key);

Each block maintains a result array (d_val1, d_val2) to store the match found in that block. The results are then copied back to the host:

cudaMemcpy(lpm1, d_val1, sizeof(int), cudaMemcpyDeviceToHost);

cudaMemcpy(lpm2, d_val2, sizeof(int), cudaMemcpyDeviceToHost);

Once GPU computation is complete, the results are copied from the GPU to the CPU with cudaMemcpy and stored in the host arrays lpm1 and lpm2. The intersection of these two arrays is then computed to identify the longest prefix match: the next hop common to both result sets is taken as the best matching prefix. The blocks are processed so that the multibit trie is traversed. If the intersection is empty, the lookup backtracks toward the root node in the second block only and intersects the prefixes at the currently backtracked node with the prefixes of the first block's node. The proposed approach scales as the number of IP prefixes grows, because its lookup cost is independent of the number of entries in the forwarding table and depends instead on the longest IP prefix present in the table. As with the binary trie, Algorithm 2 can be applied with a different stride 'k'.
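The array layout described above can be illustrated with a minimal CPU-side sketch in Python. It is not the authors' CUDA kernel: it uses standard controlled prefix expansion on a single trie and omits the two-way split and intersection of Algorithm 2. The helper names (`level_offset`, `total_nodes`, `child_index`, `store`, `insert`, `lookup`) and the level-order indexing scheme are assumptions introduced for illustration.

```python
STRIDE = 3  # bits consumed per trie level, as in the text

def level_offset(level, stride=STRIDE):
    """Array index of the first node on `level` in a level-order layout."""
    fanout = 1 << stride
    return (fanout ** level - 1) // (fanout - 1)

def total_nodes(levels, stride=STRIDE):
    """Number of array slots for a complete trie with `levels` levels below the root."""
    return level_offset(levels + 1, stride)

def child_index(node, chunk, stride=STRIDE):
    """Index of the child of `node` selected by the stride-bit value `chunk`."""
    return node * (1 << stride) + 1 + chunk

def store(table, index, length, hop):
    # Keep the entry that came from the longest original prefix.
    if table[index] is None or table[index][0] <= length:
        table[index] = (length, hop)

def insert(table, prefix, hop, stride=STRIDE):
    """Insert a bit-string prefix, expanding it to the next stride boundary."""
    full, rest = divmod(len(prefix), stride)
    node = 0
    for i in range(full):
        node = child_index(node, int(prefix[i * stride:(i + 1) * stride], 2), stride)
    if rest == 0:
        store(table, node, len(prefix), hop)
        return
    pad = stride - rest
    base = int(prefix[full * stride:], 2) << pad
    for low in range(1 << pad):
        store(table, child_index(node, base + low, stride), len(prefix), hop)

def lookup(table, address, stride=STRIDE):
    """Walk one stride per level and remember the last next hop seen."""
    node, best = 0, table[0]
    for i in range(0, len(address), stride):
        node = child_index(node, int(address[i:i + stride], 2), stride)
        if table[node] is not None:
            best = table[node]
    return best[1] if best else None

# Forwarding table from Table 3 (w = 6, so two levels below the root).
PREFIXES = {"0": "A", "01000": "B", "011": "C", "1": "D", "100": "E",
            "110": "F", "1100": "G", "1110": "H", "1111": "I", "110001": "J"}
table = [None] * total_nodes(2)
for prefix, hop in PREFIXES.items():
    insert(table, prefix, hop)

print(total_nodes(4))            # array slots needed for 12 bits at stride 3
print(lookup(table, "110001"))   # next hop for the example address
```

Because the trie is complete and stored level by level, a child position is computed arithmetically from its parent index, so no pointers are needed and the flat array can be copied to the device in a single transfer.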

Theorem 1

Partitioning IP prefixes of length $w$ by a factor of $c$ reduces the search time to $O(w/c)$.

Proof

Assume that the forwarding table 'F' has n entries and that the longest IP prefix has length w. The forwarding table is partitioned into two groups, $G_1$ and $G_2$, based on the maximum prefix length w. Group $G_1$ contains all IP prefixes whose length ranges from 0 to $w/2$, and group $G_2$ contains all IP prefixes whose length ranges from $w/2$ to $w$. Every prefix in F is added to the group whose range covers its length. If the forwarding table is partitioned by c, the lookup complexity for a given IP prefix is $O(w/c)$.
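As a worked instance, the figures reported earlier in the text (maximum prefix length 24, two partitions, stride 3) can be used to compare the number of trie levels traversed. The labels $L_{\text{binary}}$, $L_{\text{multibit}}$ and $L_{\text{partitioned}}$ are introduced here for illustration only, and the last line assumes that the partitions are searched in parallel.

```latex
\begin{align*}
L_{\text{binary}} &= w = 24 \\
L_{\text{multibit}} &= \frac{w}{k} = \frac{24}{3} = 8 \\
L_{\text{partitioned}} &= \frac{w}{c \cdot k} = \frac{24}{2 \cdot 3} = 4
\end{align*}
```

The last value agrees with the four levels quoted for each 12-bit block.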

Theorem 2

Intersecting the resultant sets takes O(m) time.

Proof

Given a destination address, Algorithm 2 returns two resultant sets, r1 and r2. Suppose r1 and r2 each contain m elements, where Inline graphic. The longest prefix match is found by intersecting the two sets: the elements of r1 are stored in a hash table, and each element of r2 is then looked up in that table. Building the table and probing it each take O(m) time, so the whole operation takes O(m).
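The hash-based intersection in this proof can be sketched in a few lines of Python. The function name `longest_common_prefix_entry` and the sample result sets are hypothetical; the sketch also assumes that results are held as prefix bit strings, so that the longest match can be chosen by string length.

```python
def longest_common_prefix_entry(r1, r2):
    """Return the longest prefix present in both result sets, or None."""
    seen = set(r1)                             # build the hash table: O(m)
    common = [p for p in r2 if p in seen]      # probe each element of r2: O(m)
    return max(common, key=len) if common else None

r1 = ["1", "110", "110001"]   # hypothetical result set of the first trie
r2 = ["110001"]               # hypothetical result set of the second trie
print(longest_common_prefix_entry(r1, r2))
```

Each membership test on the set takes constant expected time, which is what keeps the total cost linear in m.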

Theorem 3

The update complexity of the proposed approach with prefix length $w$ and stride $k$ is $O\left(\frac{w}{k} \cdot 2^{k}\right)$.

Proof

Assume that a new IP prefix p of length w is to be inserted into the trie data structure. Using the stride value k, the trie is traversed starting from the root node. In the worst case, w/k levels are traversed and $2^{k}$ sub-children are inspected at each node, so the overall complexity of inserting the new prefix p is $O\left(\frac{w}{k} \cdot 2^{k}\right)$.

The update complexity of the proposed approach is determined by the height of the Huffman tree data structure, which is $O(\log n)$. In the worst case, where all entries in the forwarding table require updates, the overall time complexity becomes $O(n \log n)$, as demonstrated in Theorem 4.

Theorem 4

The worst-case time complexity for updating the forwarding table is $O(n \log n)$.

Proof

Let an IP address made up of four octets, each with a value between 0 and 255, be added to or removed from a forwarding table F, with arrays a[0:255], b[0:255], c[0:255], d[0:255] storing the binary codes and frequency counts for the respective octets. If every octet already exists in its respective array, the frequency is updated and the insertion or deletion is performed in O(1) time by concatenating the binary codes and updating F. If any octet is new, a binary code is generated for it using a Huffman tree and inserted into the array, and the operation is performed in O(log n), where $n \le 256$. The overall complexity of updating n such entries in the forwarding table is therefore $O(n \log n)$.
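The per-octet code tables used in this proof can be illustrated with a short Python sketch. It uses the textbook Huffman construction, which is an assumption: the exact encoding used by the authors is not shown in this section. The names `huffman_codes` and `encode_address` and the sample addresses are hypothetical.

```python
import heapq
from collections import Counter

def huffman_codes(freq):
    """Return {symbol: bitstring} for a frequency table (standard Huffman coding)."""
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap items: (weight, unique tie breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def encode_address(address, codes_per_octet):
    """Concatenate the per-octet codes of a dotted-decimal IPv4 address."""
    octets = [int(x) for x in address.split(".")]
    return "".join(codes[o] for codes, o in zip(codes_per_octet, octets))

# Hypothetical sample; one frequency table per octet position (arrays a, b, c, d).
addresses = ["10.0.0.1", "10.0.1.1", "10.1.0.2", "192.0.0.1"]
freqs = [Counter() for _ in range(4)]
for addr in addresses:
    for i, octet in enumerate(addr.split(".")):
        freqs[i][int(octet)] += 1

codes = [huffman_codes(f) for f in freqs]
compressed = encode_address("10.0.0.1", codes)
print(compressed, len(compressed))
```

In this toy sample each octet position has only two distinct values, so the 32-bit address shrinks to 4 bits. When an octet already has a code, encoding is a table lookup followed by concatenation, which corresponds to the O(1) case in the proof.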

Results and discussion

This work was carried out on a PC with an Intel Xeon E5620 CPU running at 2.40 GHz and 15.6 GB of RAM. The proposed method was implemented on an NVIDIA Tesla C2075 GPU [35]. The host programs were compiled with GCC 10.5 and the device programs with CUDA release 12.0. All experiments were carried out using various trie-based algorithms on the RIPE data set [34]. The distribution of IP entries is shown in Fig. 11. We first ran the existing trie-based data structures on the multi-core platform without the GPU accelerator: the binary trie, the path-compressed trie, and the multibit trie (with strides of 2 and 3) were applied on the same platform for comparison. The proposed method outperforms these existing approaches, whose performance tends to degrade as the table size increases. Notably, its efficiency is unaffected by the number of entries in the forwarding table. Table 4 compares the theoretical time complexity of the existing approaches with that of the proposed approach.

Fig. 11.

Distribution of IP entries in the RIPE data sets.

Table 4.

Theoretical analysis of proposed approach against existing approaches.

Approaches Time complexity
Binary Trie O(w)
Path Compressed Trie O(w)
Multibit Trie O(w/k)
Binary Search Tree Inline graphic
Proposed Approach $O(w/(c \cdot k))$

Description of dataset

The RIPE Database, managed by RIPE NCC, is a public registry of IP address allocations, routing policies, and network operator details. The RIPE NCC’s Routing Information Service (RIS) collects BGP routing data via Remote Route Collectors (RRCs) located mainly at Internet Exchange Points within the RIPE region. About 100 peers provide full routing tables, while others contribute partial data. These BGP sessions cover both IPv4 and IPv6. The data, stored in Multi-threaded Routing Toolkit (MRT) format, is publicly available on the RIS website. Each collector is identified by an ’RRC’ prefix and a number (e.g., RRC00). RIS datasets include metadata and prefix-level routing information in downloadable index files for research use.

Figure 12 compares the lookup speed of the CPU and GPU implementations for forwarding tables of various sizes (in bytes). The GPU-based implementation outperforms the CPU-based approaches because of the GPU's massive array of processors. The proposed GPU implementation gives an improvement of 83% over the binary trie, 91% over the path-compressed trie, 84% over the multibit trie with stride 2, and 79% over the multibit trie with stride 3, all implemented on the CPU. Figure 12 also shows the results for the binary search tree [19], hashing on GPU [20], and the proposed GPU implementation. Lookup in the BST takes longer because of its worst-case lookup complexity. Similarly, hash-based lookup needs a different hash function for each forwarding table and suffers from hash collisions. The proposed GPU implementation provides improvements of 89% and 97% over the GPU-based hashing and GPU-based BST implementations, respectively. Figure 13 summarizes these average improvements. Figure 14 illustrates the memory usage (in KB) of the IP datasets before and after compression.
Since the height of the trie plays a vital role in determining lookup performance, reducing the height through compression improves lookup efficiency. As shown in Figure 14, memory consumption decreases significantly after the IP addresses are compressed. Figure 15 presents the experimental results for both the compression ratio and the percentage of memory savings. In this figure, the vertical bars denote the percentage of memory saved, while the superimposed line plot represents the compression ratio for each dataset. The results indicate that the average compression ratio is approximately 1.6 and the average memory savings is 37.19%. Figure 16 shows how much of the overall lookup time is spent computing the intersection of the IP prefixes matched during the parallel lookup. The data collected from the NVIDIA profiler give a performance analysis of metrics such as the lookup and update time spent in kernel calls and memory copies, as shown in Figs. 17, 18 and 21. The x-axis in these figures represents the average, minimum and maximum time taken when running the lookup algorithms of the various approaches against the proposed approach. Figures 19 and 20 illustrate the lookup and update times, respectively, in comparison with the state-of-the-art approaches GALE [36], GAMT [36], and HBS [37] on the RIPE dataset. To ensure generalizability and to avoid results biased by hardware, the detailed theoretical time complexity analysis is included in Table 4. The time complexity of the proposed approach listed there is asymptotically more efficient than that of existing state-of-the-art methods. This theoretical improvement is further supported by Figure 15, which compares the kernel execution times of the different approaches under the same experimental setup. The combination of theoretical and comparative empirical analysis offers a balanced and generalizable assessment of the algorithm's performance.
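The relationship between the compression ratio and the memory savings reported above can be checked with a small Python sketch. It assumes the usual definitions (ratio = original size / compressed size, savings = 1 - compressed / original), which the text does not state explicitly; the function names and the 160 KB and 100 KB sizes are hypothetical.

```python
def compression_ratio(original_kb, compressed_kb):
    """Original size divided by compressed size (assumed definition)."""
    return original_kb / compressed_kb

def memory_savings_percent(original_kb, compressed_kb):
    """Share of memory saved by compression, in percent (assumed definition)."""
    return (1.0 - compressed_kb / original_kb) * 100.0

def savings_from_ratio(ratio):
    """Savings implied by a given compression ratio, in percent."""
    return (1.0 - 1.0 / ratio) * 100.0

# Hypothetical sizes chosen to give a ratio of 1.6.
print(compression_ratio(160.0, 100.0))
print(memory_savings_percent(160.0, 100.0))
print(savings_from_ratio(1.6))
```

Under these definitions a ratio of 1.6 corresponds to savings of about 37.5%, close to the reported average of 37.19%. The two need not match exactly, because an average of per-dataset savings is not the same as the savings at the average ratio.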

Fig. 12.

Lookup time comparison of existing with proposed GPU approach.

Fig. 13.

Percentage of improvement of proposed with existing approaches.

Fig. 14.

Memory usage before and after compression for the IPv4 dataset.

Fig. 15.

The compression ratio and corresponding percentage of memory savings for the IPv4 dataset.

Fig. 16.

Time taken to find the intersection of matched IP prefixes (ms).

Fig. 17.

Percentage of time utilized for kernel function call for IPv4 dataset.

Fig. 18.

Percentage of update time utilized for kernel function call for IPv4 dataset.

Fig. 21.

Percentage of time utilized for memory copy.

Fig. 19.

Percentage of time utilized for kernel function call for IPv6 dataset.

Fig. 20.

Percentage of update time utilized for kernel function call for IPv6 dataset.

Ablation study: impact of CUDA version

To investigate the impact of the CUDA toolkit version on the performance and results of the proposed algorithm, we conducted ablation experiments on an NVIDIA Tesla C2075 GPU using CUDA 11.0 and CUDA 12.0. All other factors, including software versions, hardware settings, and algorithm parameters, were kept constant. Upgrading from CUDA 11.0 to CUDA 12.0 yields a modest improvement (1.7%) in kernel execution time, while the accuracy of the algorithm remains consistent within a negligible margin, as shown in Table 5. CUDA version differences may therefore offer run-time optimizations through improved compiler and driver support, but they do not significantly affect the algorithm's correctness or the overall performance conclusions, which confirms the robustness of our approach.

Table 5.

Comparison of CUDA 11.0 and CUDA 12.0 on Tesla C2075 GPU.

CUDA version Kernel execution time (ms) Accuracy (%) Remarks
11.0 153.2 91.84 Baseline environment
12.0 150.7 91.87 Slightly faster runtime, negligible accuracy difference
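The relative change in kernel execution time can be recomputed from the two timings in Table 5. The Python sketch below is illustrative; the function name `percent_change` and the choice of reference value are assumptions, since the text does not state which baseline the reported figure uses.

```python
def percent_change(old_ms, new_ms, relative_to):
    """Percentage difference between two timings, relative to a chosen reference."""
    return (old_ms - new_ms) / relative_to * 100.0

cuda11_ms, cuda12_ms = 153.2, 150.7  # kernel execution times from Table 5

vs_cuda11 = percent_change(cuda11_ms, cuda12_ms, relative_to=cuda11_ms)
vs_cuda12 = percent_change(cuda11_ms, cuda12_ms, relative_to=cuda12_ms)
print(f"{vs_cuda11:.2f}% relative to CUDA 11.0")
print(f"{vs_cuda12:.2f}% relative to CUDA 12.0")
```

Measured against the CUDA 11.0 baseline the reduction is about 1.6%, and measured against the CUDA 12.0 timing it is about 1.7%, so either reading supports the description of a modest improvement.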

Conclusion

The proposed IP lookup algorithm harnesses GPGPU computing to exploit the substantial computational capabilities of GPUs. The threads perform lookups simultaneously to find the longest prefix match for the supplied destination IP address and return the best match. To further improve search efficiency, a multibit trie with a specified stride 'k' is employed, reducing the lookup time to $O(w/(c \cdot k))$, where c denotes the number of partitions. Simulation results on real datasets demonstrate an 84% improvement over CPU-based approaches and 89% and 97% improvements over GPU-based hashing and GPU-based BST implementations, respectively. In the future, this algorithm can be further enhanced with real-time IPv6 datasets.

Acknowledgements

The authors would like to acknowledge Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R197), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. The authors would like to thank the Automated Systems and Computing Lab (ASCL) at Prince Sultan University, Riyadh, Saudi Arabia, for their support to this work.

Author contributions

Veeramani Sonai identified the problem and formulated the idea. Indira Bharathi implemented the idea. Samah Alshathri tested and validated the results. Walid El-Shafai was involved in drafting and compiling. Mohd Fadzil Abdul Kadir was involved in review and verification.

Funding

This work is supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R197), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Data availability

The dataset utilized in this research article is publicly available at https://www.fit.vut.cz/research/product/c71858/.

Code availability

The code generated during the study is available in the following GitHub repository at https://github.com/aridnib-code/Program-demo.git.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Veeramani Sonai, Indira Bharathi, Samah Alshathri, Walid El-Shafai and Mohd Fadzil Abdul Kadir contributed equally to this work.

Contributor Information

Indira Bharathi, Email: indira.b@vit.ac.in.

Samah Alshathri, Email: sealshathry@pnu.edu.sa.

Walid El-Shafai, Email: welshafai@psu.edu.sa, Email: walid.elshafai@el-eng.menofia.edu.eg, Email: eng.waled.elshafai@gmail.com.

References

  • 1. Rekhter, Y. & Li, T. An architecture for IP address allocation with CIDR. RFC 1518 (1993).
  • 2. Doeringer, W., Karjoth, G. & Nassehi, M. Routing on longest-matching prefixes. IEEE/ACM Transactions on Networking (TON) 4, 86–97 (1996).
  • 3. Mu, S. et al. Ip routing processing with graphic processors. Proceedings of the Conference on Design, Automation and Test in Europe 93–98 (2010).
  • 4. Gupta, P. & McKeown, N. Algorithms for packet classification. IEEE Network 15, 24–32 (2001).
  • 5. Veeramani, S. & Noor Mahammad, S. Efficient ip lookup using hybrid trie-based partitioning of tcam-based open flow switches. Photonic Network Communications 28, 135–145 (2014).
  • 6. Sonai, V., Bharathi, I., Uchimucthu, M., Sountharrajan, S. & Bavirisetti, D. P. Ctla: compressed table look up algorithm for open flow switch. IEEE Open Journal of the Computer Society 1–10 (2024).
  • 7. Sonai, V., Bharathi, I., Jamal, S. S., Bassfar, Z. & Nooh, S. A. Efficient ip address retrieval using a novel octet based encoding technique for high speed lookup to improve network performance. Scientific Reports 15, 2254 (2025).
  • 8. Ruiz-Sanchez, M. et al. Survey and taxonomy of IP address lookup algorithms. IEEE Network 15, 8–23 (2001).
  • 9. Nilsson, S. & Karlsson, G. Ip-address lookup using LC-tries. IEEE Journal on Selected Areas in Communications 17, 1083–1092 (1999).
  • 10. Sahni, S. & Kim, K. S. Efficient construction of multibit tries for IP lookup. IEEE/ACM Transactions on Networking (TON) 11, 650–662 (2003).
  • 11. Tzeng, H.H.-Y. & Przygienda, T. On fast address-lookup algorithms. IEEE Journal on Selected Areas in Communications 17, 1067–1082 (1999).
  • 12. Huang, K., Xie, G., Li, Y. & Liu, A. X. Offset addressing approach to memory-efficient IP address lookup. INFOCOM, 2011 Proceedings IEEE 306–310 (2011).
  • 13. Sun, H., Sun, Y., Valgenti, V. C. & Kim, M. S. A hierarchical hashing scheme to accelerate longest prefix matching. IEEE Global Communications Conference (GLOBECOM) 1296–1302 (2014).
  • 14. Lim, H., Seo, J.-H. & Jung, Y.-J. High speed IP address lookup architecture using hashing. IEEE Communications Letters 7, 502–504 (2003).
  • 15. Rojas-Cessa, R., Ramesh, L., Dong, Z., Cai, L. & Ansari, N. Parallel search trie-based scheme for fast IP lookup. IEEE Global Telecommunications Conference, GLOBECOM'07 210–214 (2007).
  • 16. Zhao, J., Zhang, X., Wang, X. & Xue, X. Achieving O(1) IP lookup on Gpu-based software routers. ACM SIGCOMM Computer Communication Review 41, 429–430 (2011).
  • 17. Zhao, J., Zhang, X., Wang, X., Deng, Y. & Fu, X. Exploiting graphics processors for high-performance IP lookup in software routers. Proceedings of IEEE INFOCOM 301–305 (2011).
  • 18. Zhian, H., Bayat, M., Amiri, M. & Sabaei, M. Parallel processing priority trie-based IP lookup approach. IEEE 7th International Symposium on Telecommunications (IST) 635–640 (2014).
  • 19. Shekhar, A. & Goyal, J. Parallel binary search trees for rapid IP lookup using graphic processors. 2nd International Conference on Information Management in the Knowledge Economy (IMKE) 176–179 (2013).
  • 20. Yao, X., Lin, Y., Wang, G. & Hu, G. A dynamic ip lookup architecture using parallel multiple hash in gpu-based software router. Journal of Computational Information System 967–976 (2013).
  • 21. Han, S., Jang, K., Park, K. & Moon, S. Packetshader: a gpu-accelerated Software Router. ACM SIGCOMM Computer Communication Review 40, 195–206 (2010).
  • 22. Yang, T. et al. Constant ip lookup with fib explosion. IEEE/ACM Transactions on Networking 26, 1821–1836 (2018).
  • 23. Qiu, K., Chen, Z., Chen, Y., Zhao, J. & Wang, X. Gflow: Towards gpu-based high-performance table matching in openflow switches. 2015 International Conference on Information Networking (ICOIN) 283–288 (2015).
  • 24. Sonai, V., Bharathi, I. & Noor Mahammad, S. A perspective of ip lookup approach using graphical processing unit (gpu). International Conference on Distributed Computing and Intelligent Technology 98–103 (2023).
  • 25. Chen, H., Yang, Y., Xu, M., Zhang, Y. & Liu, C. Neurotrie: Deep reinforcement learning-based fast software ipv6 lookup. 2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS) 917–927 (2022).
  • 26. Greenberg, S., Sheps, T., Leon, D. A. & Ben-Shimol, Y. Packet classification using gpu and one-level entropy-based hashing. IEEE Access 8, 80610–80623 (2020).
  • 27. Ghasemi, C., Yousefi, H., Shin, K. G. & Zhang, B. On the granularity of trie-based data structures for name lookups and updates. IEEE/ACM Transactions on Networking 27, 777–789 (2019).
  • 28. Ghasemi, C., Yousefi, H., Shin, K. G. & Zhang, B. A fast and memory-efficient trie structure for name-based packet forwarding. 2018 IEEE 26th International Conference on Network Protocols (ICNP) 302–312 (2018).
  • 29. Karrakchou, O., Samaan, N. & Karmouch, A. Fctrees: A front-coded family of compressed tree-based fib structures for ndn routers. IEEE Transactions on Network and Service Management 17, 1167–1180 (2020).
  • 30. Song, T., Li, T. & Yang, Y. Ptcam: Scalable high-speed name prefix lookup using tcam 707–719 (2025).
  • 31. Zou, Q., Zhang, N., Guo, F., Kong, Q. & Lv, Z. Multi-region sram-based tcam for longest prefix. International Conference on Science of Cyber Security 437–452 (2022).
  • 32. Wan, Y., Song, H. & Liu, B. Greedyjump: A fast tcam update algorithm. IEEE Networking Letters 4, 25–29 (2021).
  • 33. Wan, Y. et al. Fastup: Fast tcam update for sdn switches in datacenter networks. 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS) 887–897 (2021).
  • 34. RIS. Routing information service (2015). http://www.ripe.net/analyse/internet-measurements/routing-information-service-ris/ris-raw-data.
  • 35. NVIDIA. CUDA GPUs (2016). http://www.nvidia.com/object/cuda_get.html.
  • 36. Li, Y., Zhang, D., Liu, A. X. & Zheng, J. Gamt: a fast and scalable ip lookup engine for gpu-based software routers. Architectures for Networking and Communications Systems 1–12 (2013).
  • 37. Jiang, D. et al. Heuristic binary search: Adaptive and fast ipv6 route lookup with incremental prefix updates. IEEE/ACM Transactions on Networking (2024).



Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
