Author manuscript; available in PMC: 2025 Mar 1.
Published in final edited form as: ACM Trans Parallel Comput. 2025 Feb 11;12(1):1. doi: 10.1145/3701623

Dynamic Buffer Management in Massively Parallel Systems: The Power of Randomness

MINH PHAM 1, YONGKE YUAN 2, HAO LI 3, CHENGCHENG MOU 4, YICHENG TU 5, ZICHEN XU 6, JINGHAN MENG 7
PMCID: PMC11841858  NIHMSID: NIHMS2029846  PMID: 39990623

Abstract

Massively parallel systems, such as Graphics Processing Units (GPUs), play an increasingly crucial role in today’s data-intensive computing. The unique challenges associated with developing system software for massively parallel hardware to support numerous parallel threads efficiently are of paramount importance. One such challenge is the design of a dynamic memory allocator to allocate memory at runtime. Traditionally, memory allocators have relied on maintaining a global data structure, such as a queue of free pages. However, in the context of massively parallel systems, accessing such global data structures can quickly become a bottleneck even with multiple queues in place. This paper presents a novel approach to dynamic memory allocation that eliminates the need for a centralized data structure. Our proposed approach revolves around letting threads employ random search procedures to locate free pages. Through mathematical proofs and extensive experiments, we demonstrate that the basic random search design achieves lower latency than the best-known existing solution in most situations. Furthermore, we develop more advanced techniques and algorithms to tackle the challenge of warp divergence and further enhance performance when free memory is limited. Building upon these advancements, our mathematical proofs and experimental results affirm that these advanced designs can yield an order of magnitude improvement over the basic design and consistently outperform the state-of-the-art by up to two orders of magnitude. To illustrate the practical implications of our work, we integrate our memory management techniques into two GPU algorithms: a hash join and a group-by. Both case studies provide compelling evidence of our approach’s pronounced performance gains.

CCS Concepts: Computing methodologies → Massively parallel algorithms, Shared memory algorithms

Additional Key Words and Phrases: GPU, parallel computing, memory allocation, random algorithm, buffer management

1. INTRODUCTION

Dynamic memory allocation is a fundamental service in computer systems. Memory allocators for traditional CPUs have been improved by numerous previous efforts [7, 31, 37]. Recent developments in the semiconductor industry feature an increasing number of processing cores on a chip, resulting in massively parallel computing capabilities. For example, the latest CPU products encapsulate up to 64 cores in one chip (e.g., AMD Ryzen 3990X). The co-processor world goes to an extreme by integrating thousands of thin cores into one processor, with a salient example being modern Graphics Processing Units (GPUs). GPUs have become indispensable in today’s high-performance computing (HPC) systems and have shown great value in many compute-intensive applications. In addition, there is also a strong movement toward developing system software on GPUs such as database management systems [32, 47]. Dynamic memory allocation on GPUs was first introduced by NVIDIA in 2009, and many other solutions have been proposed since then [46]. Many GPU-based applications benefit from dynamic memory allocation, such as graph analytics [10, 45], data analytics [6, 41], and databases [4, 21].

There are unique challenges in developing system software on massively parallel hardware due to the need for the efficient support of numerous parallel threads and the architectural complexity of the GPU hardware. Dynamic memory allocators, in particular, face challenges such as thread contention and synchronization overhead, and multiple studies have proposed solutions to address these challenges [2, 38, 42, 44]. Similar to traditional memory allocators, such solutions utilize a shared data structure to keep track of available memory units. For example, the current state-of-the-art solution, Ouroboros [44], uses a combination of linked lists, arrays, and queues to reduce thread contention and memory fragmentation and was shown to outperform previous solutions in a recent comparative study [46]. Nevertheless, we show in Section 2 that thread contention, synchronization overhead, and memory overhead are still problematic with Ouroboros in many use cases. In this paper, we argue that dynamic memory allocators in massively parallel systems demand a complete rethinking in their designs such that: (1) global states are avoided and (2) the search for free memory units is done through statistical processes.

GPUs are designed to be high-throughput systems – their performance depends on running a large number of parallel threads. A modern CPU can run tens of threads simultaneously while it is common to see tens of thousands of active threads in a GPU. This requires us to take a second look at the classic design of a system’s memory manager. Specifically, traditional memory managers maintain a global state (e.g., head and tail of a queue) to keep track of available memory units. The malloc and free operations have to access such states in a protected manner. The protection can be done via a software lock (e.g., mutex), with latency at the hundred-millisecond level on CPU-based systems [18, 30]. To address this bottleneck, hardware-supported mechanisms known as atomic operations have been widely used. While these operations offer a significant improvement, their implementation on GPUs still results in excessive overhead. Despite their speed, atomic operations must be executed sequentially to prevent conflicts, which becomes challenging given the large number of concurrent threads in GPUs. Consequently, utilizing atomic operations in GPUs leads to prolonged waiting queues during atomic access to global states.

Figure 1 reports the average latency of performing an atomic operation against one global 32-bit integer under varying numbers of concurrent threads. The average latency per thread grows linearly with the number of concurrent threads. Furthermore, the latency is much higher than that of CPUs. For example, when running the same code on an AMD EPYC 7662 CPU, the latency is only 23 clock cycles for one thread and 1,262 clock cycles for 128 threads.

Fig. 1.

Average latency per thread in accessing a global variable via atomic operations on different NVIDIA GPUs.

This paper presents a novel high-performance memory management framework for massively parallel hardware such as GPUs. Unlike the traditional wisdom that involves global states, this is a fundamentally new solution that carries very little overhead in allocating memory and is almost free for releasing memory. Instead of keeping any global states explicitly, we let the threads statistically infer the locations of available memory units via a random algorithm named Random Walk (RW). Analytical models and experimental results show that RW achieves asymptotically shorter latency than the state-of-the-art GPU memory allocator by up to an order of magnitude. We then propose three advanced designs that achieve up to another order of magnitude improvement over RW. The first is a Bitmap data structure of free pages, which we mathematically prove to reduce latency by a factor equal to the size w of a Bitmap word, i.e., 32 or 64 times. The other two are search algorithms named Collaborative Random Walk (CoRW) and Clustered Random Walk (CRW). Combined with the Bitmap, CoRW and CRW achieve up to two orders of magnitude improvement in latency over the best-known existing solution.

Paper Organization:

The remainder of this paper is organized as follows: Section 2 sketches the background that includes the technical foundation, the previous state-of-the-art, and its drawbacks; Section 3 summarizes our memory management framework, the basic RW algorithm, and its performance analysis; Section 4 presents the advanced algorithms CoRW and CRW and the mathematical reasoning for them; Section 5 shows results of the experimental evaluation (unit-tests) comparing RW, CoRW, CRW, and the state-of-the-art; Section 6 presents case studies of GPU-based programs when using these memory management methods; and Section 7 concludes this paper.

2. BACKGROUND

2.1. Memory Management on OS/DBMS

Memory management on traditional CPU-based systems, particularly those that are single-threaded or have low concurrency, has been extensively studied.

On the operating system (OS) side, the concept of paging was first introduced by Kilburn et al. [26] in the Atlas system. The idea of page segmentation was initially discussed by Dennis et al. [14], and later implemented by Corbató et al. [13] in their MULTICS system.

In the context of database management systems (DBMS), early work can be traced back to Stonebraker [40], who discussed OS support for databases. Effelsberg et al. [16] further studied the database buffer manager as a component of the DBMS and implemented it. Chou et al. [12] presented the DBMIN algorithm to manage the buffer pool of a relational DBMS (RDBMS). Chen et al. [11] proposed a query execution feedback model to enhance DBMS buffer management. Brown et al. [8] introduced the concept of hit-rate concavity and developed a goal-oriented buffer allocation algorithm known as Class Fencing.

The subject of memory allocation on CPUs has been extensively examined in a multitude of articles and projects. Dlmalloc [28], a widely-used C memory allocator, employed coarse locking. While this method was effective, it might not have provided the best efficiency in environments with high levels of multithreading. Recognizing this limitation, Ptmalloc [19] was developed as an enhancement of Dlmalloc. Its key improvement was the ability to support concurrent memory allocation, a feature that led to its adoption as the default memory allocator in the GNU C library. In a similar vein, Google’s TCMalloc [20, 24] used bucket-sized locks, and under certain scenarios, it demonstrated faster performance than Ptmalloc. It operated using a global heap and per-thread heaps, with a mutex only on the global heap. Moreover, nedmalloc [15], an independent implementation, also asserted superior multithreaded performance compared to Ptmalloc. On another note, Jemalloc [17], first adopted as FreeBSD’s libc malloc and later developed extensively at Facebook, gained extensive usage in Firefox and Facebook servers. Additionally, Hoard [7], a widely known multithreaded allocator, was compatible with Linux, Mac OS X, and Windows. To further enhance concurrency and scalability, IBM [31] developed a lock-free malloc implementation.

2.2. Dynamic Memory Allocation in GPUs

In CUDA programming, it is a common practice to pre-allocate a specific amount of global memory using the cudaMalloc function to cater to the runtime memory requirements of a GPU kernel. However, in many applications, the exact memory consumption is unknown in advance. This predicament presents us with two undesirable options: either allocating more memory than necessary, resulting in over-allocation, or terminating the kernel prematurely due to insufficient available memory. The typical approach [22] to deal with this problem is to run the task twice: the first run only calculates the amount of memory needed, then the memory can be precisely allocated, and the second run finishes the task. This carries an unnecessary overhead. Thus, a major challenge on GPU systems is to dynamically allocate device memory for output results without interrupting kernel execution.

NVIDIA initially announced its dynamic memory allocator for GPUs in 2010 [46]. It provides the usual malloc/free interface and can be called by threads from a CUDA kernel. XMalloc [23] became the first non-proprietary dynamic memory allocator for GPUs. Its main contribution is the coalescing of allocation requests on the SIMD width for faster queue processing. Allocations are served from a heap that is segmented into blocks and bookkeeping information is stored in a linked-list. The linked-list is a major bottleneck because a thread has to traverse through the list of memory blocks when searching for a free one. ScatterAlloc [39] addressed this bottleneck by scattering the allocation requests across its memory regions. Managed memory is organized into fixed-sized pages, tracked by a page usage table, and grouped into Super Blocks. Super Blocks form a single-linked list and can exist in one large region or be allocated individually. A hash function of request size and multiprocessor ID is used to search for free regions. FDGMalloc [43] (2014) presents a warp-level optimized approach that aggregates all requests in a warp and chooses a leader thread to traverse through a linked-list of free pages. Adinetz and Pleiter [2] proposed Halloc in 2014; the main idea is to divide the managed memory into multiple slabs. At allocation time, Halloc selects a slab and uses a deterministic hash function to traverse through memory chunks within this slab. If no free memory is found, a new slab is selected, which affects performance severely. Vinkler and Havran (2015) [42] proposed RegEff, which splits the bookkeeping information into many linked-lists. During allocation, a thread picks a linked-list and traverses to find the first free chunk that is large enough for the allocation. More recent work includes DynaSOAr (2019) [38] and Ouroboros (2020) [44]. 
A recent comparative study [46] showed that Ouroboros outperformed all aforementioned methods in both allocation performance and space efficiency under the vast majority of scenarios and thus can be considered state-of-the-art.

Similar to the Buddy System mechanism in CPU implementations, Ouroboros divides the managed memory region into multiple queues, each serving a page size twice as large as that of the previous queue. To avoid pre-allocating memory for all the queues, the concept of Virtualized Queues was introduced. The main idea is that queues are dynamically stored on pages in the pool, and a new large page (chunk) is only allocated to a queue when it needs more space.

As a queue-based design, Ouroboros suffers from the long latency in concurrently accessing queue states using a bulk semaphore. Figure 2 illustrates these issues. In this experiment, we ran 5 GPU kernels with Ouroboros: (1) each thread allocates then frees 4096B, (2) each thread allocates 4096B, (3) each thread allocates 4100B, (4) each thread allocates 8192B, and (5) each thread allocates varying sizes. We ran each kernel 100 times on varying numbers of threads and calculated the average kernel time.

Fig. 2.

Ouroboros’ performance under the five scenarios on an NVIDIA Titan V GPU.

In scenario (1), the first iteration takes a significantly long time to allocate new pages for the queues. However, after the free operation, the freed pages are put into a virtualized queue, and the second iteration takes no time to get the free pages from the queue. This situation is repeated for the remaining 98 iterations and makes the average kernel time artificially low. In Scenario (2), we solely focused on allocation without performing any free operation. This deliberate choice compelled Ouroboros to expand its virtualized queues continuously. As a result, the average kernel time surged by up to 67 times compared to scenario (1). Notably, when we profiled the program using NVIDIA’s Visual Profiler, we discovered that allocating new pages for extending the queues led to a mere 8% warp efficiency for the kernel. In scenario (3), we slightly increased the request size by only 4 bytes. This results in a doubling of the average kernel time. Surprisingly, scenario (4) featured a request size of 8192 bytes, but exhibited virtually the same average kernel time as scenario (3). This observation indicates that the 4100B request size was allocated within the 8192B queue, thereby wasting 50% of the allocated space. Lastly, scenario (5) introduced varying request sizes, which appeared to magnify the issues above.

3. RANDOM WALK BASED PARALLEL MEMORY MANAGEMENT FRAMEWORK

3.1. Overview

Unlike other random-search approaches such as ScatterAlloc and Halloc, which still maintain central book-keeping records to support their search of free memory, we maintain no central record and rely entirely on statistical random processes for searching. First, we divide the main memory on GPU into pages of equal size. Prior research on both

Algorithm 1 getPage based on Random Walk
output: the ID of a free page
 1: while True do
 2:   p ← random integer in {0, 1, …, T−1}
 3:   if pages[p].used is false then
 4:     try to set pages[p].used to true
 5:     if the above is a success then
 6:       return p
 7:     end if
 8:   end if
 9: end while

CPUs and GPUs has leveraged mutex locks and queues to circumvent Write After Write issues. However, with a limited number of queues and mutex locks, bottlenecks can emerge when tens of thousands of threads make simultaneous requests in a GPU system. This calls for a design that fully releases GPU parallelism. To address this, we propose the Random Walk (RW) algorithm that does not depend on any global state to manage page allocation and recycling. Instead of using a few mutex locks or queues on global memory for the entire system, each page will have its own mutex lock (via a used flag). When the flag is set, the page is considered occupied. While requesting a page, each thread will generate a random page ID. If the corresponding page is free, the thread will get the page. Otherwise, the thread will generate a new page ID until it finds one free page. This approach ensures threads operate independently, eliminating the need to queue for shared resources, thereby maximizing GPU parallel computing potential. Figure 3 shows an illustrative example of how seven parallel threads get their free pages. Here the blue squares represent free pages, and the red ones represent occupied pages. Each thread generates random page IDs until it finds a free page.

Fig. 3.

Demonstration of the RW-based page request algorithm. The path visited by the same thread is colored the same. Blue pages are free; red pages are occupied.

Detailed implementation of the RW-based algorithm (we name it getPage) can be found in Algorithm 1. Note that in this paper, all pseudo-code is presented from the perspective of a single thread, reflecting the single-program-multiple-data (SPMD) programming model for modern GPUs. Here we assume that there are a total of T pages, N concurrent threads, and each thread requests one page.

Although there are no global variables, acquiring a free page still needs an atomic operation (line 4 of Algorithm 1) because two threads could try to grab the same free page at the same time. For example, in Figure 3, both threads 5 and 7 request page 8 in their first step. However, the atomic operation will grant access to only one thread (e.g., thread 5), and the other thread (e.g., thread 7) will continue its random walk. The above is the only scenario in which two or more threads can conflict in accessing protected data. We have shown that this scenario happens with very low probability: for example, with 1 million total pages and 5,000 concurrent threads, the expected number of conflicts is only 12.5 (see page 5 of [33] for details).

Deallocation:

A great advantage of our method is: the de-allocation operation (named freePage) is almost free! Specifically, we only need to clear the used bit of the corresponding page - no atomic operation is involved.

Even with the drawbacks of queue-based designs in mind, the advantage of the RW-based solution may still seem counter-intuitive: traditional queue-based methods find a free page in O(1) time, while RW could take many steps. However, although the number of steps to get a free page can be large for some threads (e.g., 5 steps for thread 2), the average number of steps is highly controllable under most scenarios. Our analysis in Section 3.2 will clearly show this.

Out-of-memory:

Determining when the system runs out of memory is challenging in this design because no centralized book-keeping exists to provide such information. However, we can statistically infer out-of-memory with high confidence. This is discussed at the end of Section 3.2 after all the notations and mathematical background are established.

3.2. Performance Analysis

Now we mathematically show that the RW-based design outperforms the traditional queue-based methods.

In general, latency (of individual threads and all the threads) is an appropriate metric for evaluating the performance of memory management mechanisms. However, the running time of a CUDA program is affected by many factors, as shown in our previous work [29]. Instead, we use the following two metrics that can represent latency:

  1. Per-Thread Average Steps (TAS): the average number of steps taken to find a free page for a thread. In Algorithm 1, this is essentially the average number of iterations executed for the while loop;

  2. Per-Warp Average Steps (WAS): the average of the maximum number of steps taken among all 32 threads within a warp.

Both metrics are directly correlated with latency. While WAS correlates more strongly with latency than TAS, TAS admits a more rigorous analysis. In CUDA, the basic unit of execution is a warp – a group of 32 threads scheduled and executed simultaneously by a streaming multiprocessor. The entire warp holds its computing resources until all threads exit. In other words, the latency of a warp is the maximum latency among all 32 threads in the warp. Without detailed knowledge of the CUDA runtime engine, it is non-trivial to develop an accurate model of the running time from the steps each thread takes. Our previous work [29] shows that, by measuring the average number of steps a thread takes, the running time is roughly a linear function of per-thread latency. We verified the effectiveness of the two metrics via a large number of experimental runs: the correlation coefficient between TAS and the total running time is 0.9046, and that for WAS is 0.962. Following the techniques described in [48], we found that the 99% confidence interval for the difference between the two correlation coefficients is (0.0387, 0.1047). This shows that WAS is a statistically better indicator of total running time.

In the remainder of this paper, we use the following notations in our mathematical analysis:

  • For any warp, let X_i, where 0 ≤ i ≤ 31, be the random variable representing the number of steps thread i takes until finding a free page. TAS is the expected value of X_i, denoted as E[X_i];

  • Let Y = max{X_i} be the maximum number of steps taken by any thread in a warp to find a free page. WAS is then the expected value of Y, denoted as E[Y];

  • T is the total number of buffer pages;

  • A is the number of available buffer pages;

  • N is the total number of concurrent threads.

First of all, we can show the performance of getPage and freePage in a queue-based solution such as Ouroboros. While one thread takes one step to modify the protected data structure, other threads have to stall for one step. Since N threads request to modify the protected data structure at the same time, the processing queue has length N. We assume that a random thread takes a random position in the processing queue. Then, X_i is uniformly distributed on [1, N]. Therefore, TAS is:

\[ E[X_i] = \frac{N+1}{2} \tag{1} \]

WAS, E[Y], can then be derived using the tail-sum identity and Faulhaber's formula [27]. Since \(P(Y \le k) = (k/N)^{32}\), we have:

\[
E[Y] = \sum_{k=1}^{N} k\,P(Y=k) = \sum_{k=0}^{N-1} P(Y > k) = \sum_{k=0}^{N-1}\left[1 - \left(\frac{k}{N}\right)^{32}\right] = N - \frac{1}{N^{32}}\sum_{k=0}^{N-1} k^{32}
\]
\[
= N - \frac{N}{33} + \frac{1}{2} - \frac{8}{3N} + \frac{124}{3N^{3}} - \cdots \approx \frac{32}{33}N + \frac{1}{2} \tag{2}
\]

Both metrics are linear in N, consistent with the results in Figure 1. Maintaining multiple queues does not eliminate the issue: it is easy to show that both TAS and WAS remain linear in N.

3.2.1. Analysis of TAS.

The process of acquiring a free page by N parallel threads can be viewed as N parallel series of Bernoulli trials. If there is only one thread requesting a page, its Bernoulli trials have a constant probability of success. When there are multiple threads, a thread’s Bernoulli trials will have a decreasing probability of success over time.

To simplify the discussion, we treat the N series of Bernoulli trials as if they were performed sequentially, i.e., one starts only after another has finished, and still obtain the same results as in the parallel process. This treatment is safe for two reasons. First, two parallel threads can trivially be serialized if their paths do not cross. Second, if two threads do cross paths, the outcome is still the same as in the sequential case. For example, in Figure 3, threads 5 and 7 cross paths at page 8, and the outcome is the same as if thread 7 started executing after thread 5.

Before the first thread executes, there are A free pages out of the total T pages. Therefore, the number of steps the first thread takes until finding a free page, X_0, follows a geometric distribution with p = A/T. Therefore,

\[ E[X_0] = 1/p = T/A \]

After the first thread finishes and before the second thread executes, only A − 1 pages are free. Therefore, the number of steps taken until finding a free page, X_1, follows a geometric distribution with p = (A − 1)/T. Therefore,

\[ E[X_1] = 1/p = \frac{T}{A-1} \]

Generalizing the above, the average number of steps taken across all N threads is

Generalizing the above, the average number of steps taken across all N threads is

\[ E[X_i] = \frac{1}{N}\sum_{j=0}^{N-1}\frac{T}{A-j} = \frac{T}{N}\sum_{j=0}^{N-1}\frac{1}{A-j} = \frac{T}{N}\left(H_A - H_{A-N}\right) \]

where \(H_n = \sum_{k=1}^{n} \frac{1}{k}\) is the n-th harmonic number.

Using the Euler–Mascheroni constant [9] to approximate the harmonic number, \(H_n \approx \gamma + \ln n\), the expected average number of steps is then approximated by

\[ E[X_i] \approx \frac{T}{N}\ln\frac{A}{A-N} \tag{3} \]

Unlike the queue-based solution, whose latency is linear in N, Eq. (3) tells us that the value grows slowly with N. Specifically, under a wide range of N values, the term ln(A/(A−N)) increases very slowly (logarithmically), and the increase of E[X_i] is further offset by the 1/N factor. The only situation that could lead to a large TAS is when A ≈ N, i.e., there are barely enough free pages for all the threads.

Eq. (3) suggests linear growth in T, but in practice a larger T typically comes with a larger A, which offsets the growth by decreasing the logarithmic term.

3.2.2. Analysis of WAS.

Deriving a closed form for E[Y] is difficult, but we can find an upper bound of E[Y] as follows. We observe that, during the process of the N threads each acquiring a page, the probability of finding a free page at any moment is at least (A−N)/T, because the number of available pages stays within [A−N, A] throughout the process. Therefore, E[X_i] is upper bounded by E[X′_i], where X′_i follows a geometric distribution with probability p = (A−N)/T.

With that, the cumulative distribution function of X′_i is:

\[ P(X'_i \le x) = 1 - (1-p)^{x} = 1 - \left(\frac{T-A+N}{T}\right)^{x} \]

Since E[X_i] is upper bounded by E[X′_i], E[Y] is also upper bounded by E[Y′], where Y′ = max{X′_i}. The cumulative distribution function of Y′ is:

\[ P(Y' \le y) = P(X'_0 \le y, X'_1 \le y, \ldots, X'_{31} \le y) = \prod_{i=0}^{31} P(X'_i \le y) = \left[1 - \left(\frac{T-A+N}{T}\right)^{y}\right]^{32} \]

Similar to the way we derived Eq. (2), we can derive the expectation of Y′ as:

\[
E[Y'] = \sum_{k=1}^{\infty} k\,P(Y'=k) = \sum_{k=0}^{\infty} P(Y' > k) = \sum_{k=0}^{\infty}\left[1 - P(Y' \le k)\right] = \sum_{k=0}^{\infty}\left[1 - \left(1 - \left(\frac{T-A+N}{T}\right)^{k}\right)^{32}\right]
\]

Therefore, an upper bound of WAS, E[Y], is

\[ E[Y] < \sum_{k=0}^{\infty}\left[1 - \left(1 - \left(\frac{T-A+N}{T}\right)^{k}\right)^{32}\right] \tag{4} \]

In Figure 4, we plot the values of formulae (1)–(4) under different A and N values with T = 1M. We chose five A values corresponding to 50%, 10%, 1%, 0.7%, and 0.5% of the total pages T. Note that the case of 0.5% is an extreme scenario – when N = 5,000, only one page is available per thread. TAS and WAS of RW are significantly smaller than those of queue-based solutions, even under small A/T values such as 0.7%. Only in the extreme case of A/T = 0.5% does RW’s WAS climb higher than that of the queue-based method. This is a drawback of the RW method, and we address it with the Bitmap data structure in Section 3.3 and the CoRW and CRW algorithms in Section 4.

Fig. 4.

Change of TAS and WAS values under different N and A values for RW, in comparison to the queue-based solution.

3.2.3. Determination of Out-of-memory Situation.

Instead of detecting when there is absolutely no free page left, it is equally useful to detect when so few pages are left that the algorithm’s performance becomes unacceptable. First, we choose a low free-page fraction L at which the algorithm’s performance is deemed unacceptably bad. This fraction can be chosen analytically or empirically from unit testing. For example, Figure 4 and Figure 11 both show that L = 0.5% is where RW’s performance becomes unacceptably low. Next, we find the number S_L of consecutive failed steps such that, after observing them, the probability that the number of free pages A is below L·T reaches 99%. The null hypothesis is A = L·T and the alternative hypothesis is A < L·T. If the null hypothesis holds, then, since S_L is large, the number of free pages found in S_L consecutive steps is approximately normally distributed with mean S_L·L and variance S_L·L(1−L). The test statistic is (0 − S_L·L)/√(S_L·L(1−L)), and the critical value at a 1% false-rejection rate is −z_{1%} = −2.326. Setting the two equal gives us

\[ S_L = 5.41\,\frac{1-L}{L} \]
Fig. 11.

Performance of our unit-test kernel calling getPage under different numbers of parallel threads and percentages of free pages. Displayed free percentages are measured at the start of each kernel.

That is, we flag an out-of-memory error with 99% confidence if the algorithm cannot find a free page after 5.41(1−L)/L steps. For example, if we choose L = 0.5%, this number is 1,076 steps: after 1,076 consecutive steps without finding a free page, we conclude with 99% confidence that less than 0.5% of the pages are free and raise an out-of-memory error.

3.3. Extension: A Bitmap of Used Bits

In each step of getPage in the basic RW design, a thread visits one page at a time. As a result, finding a free page could take many steps, especially under a low A/T ratio. To remedy that, we use a Bitmap to store all pages’ used bits in consecutive (global) memory space. We can utilize a GPU’s high memory bandwidth and in-core computing power to efficiently scan the bitmap to locate free pages. For example, the Titan V has a global memory bandwidth of 650+ GB/s and a 3072-bit memory bus. Meanwhile, the CUDA API provides a rich set of hardware-supported bit-manipulation functions.

In practice, the bitmap can be implemented as an array of 32-bit or 64-bit integers (words) so that we can visit a group of 32 or 64 pages in a single read. Finding a free page now reduces to finding a word from the bitmap that has at least one unset bit. Such an algorithm (named RW-BM) can be easily implemented by slightly modifying Algorithm 1, as presented in Algorithm 2. Note that, for each word in the used bitmap, we introduce a lock bit and store it in another bitmap called LockMap. This LockMap is for the implementation of low-cost locks.

3.3.1. Performance of RW-BM.

When A pages are available and we read w bits at a time, the probability of hitting a word with at least one free page is \(1 - \left(\frac{T-A}{T}\right)^{w}\). Therefore, the expected number of steps for the first thread to find a free page is \(1 / \left[1 - \left(\frac{T-A}{T}\right)^{w}\right]\). Following the same logic used in deriving Eq. (3) and Eq. (4), we get:

\[ E[X_i] = \frac{1}{N}\sum_{j=0}^{N-1}\frac{1}{1 - \left(\frac{T-A+j}{T}\right)^{w}} \tag{5} \]

In fact, Eq. (3) is a special case of Eq. (5) with w=1, and the following theorem shows their difference.

Theorem 1. Denote the TAS E[X_i] for RW-BM as U′ and that for the basic RW algorithm as U. We have

\[ \lim_{A \to N} U' = \frac{U}{w} \]
Algorithm 2 getPage based on RW-BM
input: w: word length, typically 32 or 64
output: the ID of a free page
 1: while True do
 2:   p ← random integer within [0, T/w)
 3:   r ← atomically set LockMap[p] to 1
 4:   P ← (r == 1) ? 0xffffffff : BitMap[p]
 5:   f ← ffs(¬P)
 6:   if f is valid then
 7:     set the f-th bit in BitMap[p] to 1
 8:     return p·w + f − 1
 9:   end if
10: end while

Proof. In Eq. (5), we apply the Taylor series expansion [1] to the term $\left(\frac{T-A+j}{T}\right)^w$:

$\left(\frac{T-A+j}{T}\right)^w = \left(1-\frac{A-j}{T}\right)^w = 1 - w\frac{A-j}{T} + \frac{w(w-1)}{2}\left(\frac{A-j}{T}\right)^2 - \cdots$

As $A \to N$, $\frac{A-j}{T} \to 0$, so the higher-order terms vanish and $\lim_{A\to N}\left(\frac{T-A+j}{T}\right)^w = 1 - w\frac{A-j}{T}$. As a result, we have

$\lim_{A\to N} U' = \frac{1}{N}\sum_{j=0}^{N-1}\frac{1}{1-\left(1-w\frac{A-j}{T}\right)} = \frac{1}{N}\sum_{j=0}^{N-1}\frac{T}{w(A-j)} = \frac{U}{w}$
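As an illustration (not part of the GPU implementation), Theorem 1 can be checked numerically by evaluating Eq. (3) and Eq. (5) directly; the parameter values below are illustrative assumptions chosen so that A is close to N and both are far below T:

```python
def tas_rw(T, A, N):
    """Eq. (3): expected per-thread step count (TAS) of the basic RW algorithm."""
    return sum(T / (A - j) for j in range(N)) / N

def tas_rw_bm(T, A, N, w):
    """Eq. (5): expected per-thread step count of RW-BM with word length w."""
    return sum(1.0 / (1.0 - ((T - A + j) / T) ** w) for j in range(N)) / N

# illustrative parameters: A close to N, both much smaller than T
T, A, N, w = 10_000_000, 6_000, 5_000, 32
U = tas_rw(T, A, N)        # basic RW
U_bm = tas_rw_bm(T, A, N, w)  # RW-BM; should be close to U / w by Theorem 1
```

In this regime the ratio U / U_bm is within a few percent of w, matching the limit in the theorem.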

Similarly, the upper bound of WAS in RW-BM becomes

$E[Y] < \sum_{k=0}^{\infty}\left[1-\left(1-\left(\frac{T-A+N}{T}\right)^{wk}\right)^{32}\right]$ (6)

and we also get the following theorem.

Theorem 2. Denote the upper bound of $E[Y]$ for RW-BM as $V'$ and that for the basic RW algorithm as $V$; we have

$\lim_{A\to N} V' = \frac{V}{w} + \frac{w-1}{2w}$

Proof. Let $f(k) = 1-\left(1-\left(\frac{T-A+N}{T}\right)^{wk}\right)^{32}$. According to the Euler-Maclaurin formula [1]:

$V' = \sum_{k=0}^{\infty} f(k) \approx \int_0^{\infty} f(k)\,dk + \frac{f(0)+f(\infty)}{2}$

Since $f(0) = 1$ and $f(\infty) = 0$, we get

$\lim_{A\to N} V' = \frac{1}{2} + \int_0^{\infty}\left[1-\left(1-\left(\frac{T-A+N}{T}\right)^{wk}\right)^{32}\right]dk = \frac{1}{2} + \frac{1}{w}\int_0^{\infty}\left[1-\left(1-\left(\frac{T-A+N}{T}\right)^{wk}\right)^{32}\right]d(wk)$

Substituting $k' = wk$ and applying the Euler-Maclaurin formula in reverse to recover the basic-RW bound $V$:

$\lim_{A\to N} V' = \frac{1}{2} + \frac{1}{w}\int_0^{\infty}\left[1-\left(1-\left(\frac{T-A+N}{T}\right)^{k'}\right)^{32}\right]dk' = \frac{1}{2} + \frac{1}{w}\left(\sum_{k'=0}^{\infty}\left[1-\left(1-\left(\frac{T-A+N}{T}\right)^{k'}\right)^{32}\right] - \frac{1}{2}\right) = \frac{w-1}{2w} + \frac{V}{w}$

The above theorems are encouraging in that TAS and the WAS bound both decrease by a factor of up to $w$, i.e., 32 or 64 times. More importantly, the advantage of RW-BM reaches its highest level when $A \to N$, which is an extreme case of low free-page availability.
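Theorem 2 can likewise be checked numerically by summing the bounds of Eq. (4) (the $w=1$ case) and Eq. (6) until the terms vanish; the parameter values are illustrative assumptions:

```python
def was_bound(T, A, N, w):
    """Eq. (6) (and Eq. (4) when w = 1): upper bound on the expected
    intra-warp maximum step count (WAS)."""
    q = ((T - A + N) / T) ** w   # probability that one probe finds no free page
    total, k, term = 0.0, 0, 1.0
    while term > 1e-12:
        term = 1.0 - (1.0 - q ** k) ** 32   # P(some of 32 threads needs > k steps)
        total += term
        k += 1
    return total

T, A, N, w = 10_000_000, 6_000, 5_000, 32
V = was_bound(T, A, N, 1)      # basic RW bound, Eq. (4)
V_bm = was_bound(T, A, N, w)   # RW-BM bound, Eq. (6)
expected = V / w + (w - 1) / (2 * w)   # Theorem 2 prediction for A close to N
```

With A close to N, V_bm agrees with the predicted V/w + (w-1)/(2w) to within a fraction of a percent.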

RW-BM is memory-efficient: a one-bit-per-page overhead is negligible even for page sizes as small as tens of bytes, and the total size of the LockMap is even smaller.

4. ADVANCED TECHNIQUES

The basic RW algorithm provides a framework for developing more advanced algorithms. The goal is to improve performance, especially when the percentage of free pages is small. In this section, we present two advanced techniques that address this problem: a Collaborative RW (CoRW) design (Section 4.1) that closes the gap between TAS and WAS, and a Clustered RW (CRW) design (Section 4.2) that utilizes the spatial property of the Bitmap. Both techniques share the same idea of reusing multiple empty pages found by a thread – CoRW shares the found pages among sibling threads in a warp while CRW saves pages for future requests by the same thread. CoRW and CRW are designed for different use cases: CoRW assumes the getPage requests from threads in a warp are processed at the same time, while CRW is more applicable to sporadic requests in the warp. Two examples of the different use cases are presented in Listing 1. In the first use case, CoRW is very effective because all threads in a warp request memory. In the second use case, the collaboration in CoRW cannot happen because only one thread in a warp requests memory, so CRW is more useful.

Listing 1.

Example use cases of CoRW and CRW

__global__ void example() {
    int pid = getPage();      // case 1: all threads request memory, suitable for CoRW
    if (threadIdx.x == 0)
        pid = getPage();      // case 2: one thread requests memory, suitable for CRW
}

4.1. Collaborative Random Walk Algorithm

As mentioned earlier, the basic RW design suffers from the large difference between TAS and WAS. Our idea to remedy that is to have the threads in the same warp work cooperatively – threads that found multiple pages from the bitmap will share the pages with others that found nothing. This can effectively reduce the WAS because all resources on a warp are always spent on finding free pages. The algorithm runs in two steps: (1) the threads work together to find enough free pages to serve all getPage requests of the entire warp; (2) the identified free pages are assigned to individual threads according to their needs. All threads terminate at the end of step (2); thus we will get the same TAS and WAS values.

Efficient implementation of the above idea is non-trivial. The main challenge is to keep track of the found pages and distribute them to requesting threads in a parallel way. The SIMD nature of warp execution also requires minimization of code divergence. We design the CoRW algorithm by taking advantage of CUDA shuffle instructions that allow access of data stored in registers by all threads in a warp. The design of CoRW is sketched in Algorithm 3. Note that we use many CUDA intrinsic function names in the pseudocode to highlight implementation details, and we will explain what they compute in the following text.

Algorithm 3: GETPAGE based on CoRW
input: w: word length, typically 32 or 64
output: pageID: ID of a free page acquired
1: t ← laneID (0..31) of this thread
2: pageID ← −1
3: needMask ← ballot_sync(pageID == −1)
4: while needMask ≠ 0 do
5:   p ← random integer within [0, T/w)
6:   r ← atomically set LockMap[p] to 1
7:   P ← (r == 1) ? 0xffffffff : BitMap[p]
8:   hasMask ← ballot_sync(P ≠ 0xffffffff)
9:   while needMask ≠ 0 and hasMask ≠ 0 do
10:    find the first 0 bit in P and set it to 1
11:    b ← corresponding pageID of the bit set above
12:    s ← popc(((1 << laneID) − 1) & needMask)
13:    t ← (pageID == −1) ? fks(hasMask, s + 1) − 1 : −1
14:    pageID ← shfl_sync(b, t)
15:    needMask ← ballot_sync(pageID == −1)
16:    hasMask ← ballot_sync(P ≠ 0xffffffff)
17:  end while
18:  release LockMap[p] if r == 0
19: end while
20: return pageID

In Algorithm 3, needMask is a 32-bit mask representing which threads still need to get a page, and hasMask represents those that find a free page during the search process. The needMask is computed in lines 3 and 15, and hasMask is computed in lines 8 and 16. We repeat the search process until all threads have obtained a page, i.e., needMask becomes 0 (line 4). The repeated search process is as follows. First, each thread reads a random word of the bitmap (line 7), denoted as BitMap[p]. Note the use of LockMap here: we first try to set the value of LockMap[p] to 1; this essentially locks the word BitMap[p] and is done via a single atomic operation (line 6). A key innovation here is that if another thread has already locked the word (i.e., when r = 1), the current thread cannot use the word as a source of free pages. Instead of idling, it treats the word as all-ones (no free bits) and continues the rest of the loop body, acting as a consumer of free pages.

We then share the free pages among threads (lines 10 to 16) while some threads still need a page and some still have pages to share (line 9). This is difficult because the CUDA shuffle instructions only allow a thread to read data from another thread, i.e., the receiving threads have to initiate the transfer. Therefore, our solution is to calculate the sending lane ID t as follows. Each thread calculates s, the number of threads with lower lane IDs that still need to get a page, as indicated by needMask (line 12). Then this thread has to obtain the (s+1)-th page found within the warp because the first s pages should be given to the lower lanes. Therefore, t is the position of the (s+1)-th set bit of hasMask.

The fks function finds the k-th set bit using $\log_2 w$ population-count (popc) operations. Its implementation can be found in Algorithm 4. This function allows us to calculate the sending thread t (line 13). Finally, the value of the variable b held by thread t is transferred to this thread via the shfl_sync function (line 14).

Our CoRW implementation is efficient because all data (other than Bitmap[p]) are defined as local variables and thus stored in registers. Furthermore, all steps (except reading BitMap[p]) are done via hardware-supported functions with extremely low latency. For example, finding the number of set bits (popc) in a word can be done in 2 clock cycles, and

Algorithm 4: fks: find the k-th set bit
input: x: a word of 32 or 64 bits; w: word length, 32 or 64; k: an integer
output: 1-based index position of the k-th set bit of x; 0 if it does not exist
1: if popc(x) < k then return 0
2: left ← 1; right ← w
3: while left < right do
4:   mid ← (left + right) / 2
5:   countLeftToMid ← popc(((1 << mid) − 1) & x)
6:   if countLeftToMid < k then left ← mid + 1
7:   else right ← mid
8:   end if
9: end while
10: return left

finding the first set bit ffs in 4 cycles. Such latency is in sharp contrast to reading the bitmap from the global memory, which requires a few hundred cycles [3].
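For reference, the binary search of Algorithm 4 can be written directly in Python (here popc is a plain bit count standing in for the CUDA __popc intrinsic):

```python
def popc(x):
    """Population count: number of set bits (CUDA's __popc intrinsic)."""
    return bin(x).count("1")

def fks(x, w, k):
    """Find the 1-based position of the k-th set bit of a w-bit word x.
    Returns 0 if x has fewer than k set bits. Uses log2(w) popc operations."""
    if popc(x) < k:
        return 0
    left, right = 1, w
    while left < right:
        mid = (left + right) // 2
        # number of set bits among positions 1..mid
        count_left_to_mid = popc(((1 << mid) - 1) & x)
        if count_left_to_mid < k:
            left = mid + 1
        else:
            right = mid
    return left
```

For example, for x = 0b10110 the set bits sit at positions 2, 3, and 5, so fks(x, 32, 2) returns 3.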

4.1.1. Performance of CoRW.

In this section, we present a mathematical model for CoRW that explains its superior performance with regard to the TAS and WAS metrics.

In the CoRW algorithm, all threads work together until each secures a page. This means that the values of WAS and TAS are identical, namely the number of steps it takes for a warp to get 32 free pages. In each step, a warp can probe $32w$ pages by using the Bitmap (32 threads times $w$ pages per thread). The probability $p$ of finding a free page is dynamic and depends on the activity of other warps in the system. However, $p$ is lower-bounded by $\frac{A-N}{T}$, where A is the total number of free pages, N is the number of threads searching, and T is the number of pages. As the probability $p$ of identifying a free page diminishes, the number of steps to find 32 free pages increases. Therefore, we aim to find an upper bound, B, on WAS and TAS. We explain how we calculate the expected value of B in what follows.

To model the count of free pages found in an experiment with a predetermined number of trials and success probability, a Binomial random variable is a suitable choice. At step i, let $C_i$ be the number of free pages found; then $C_i$ is a Binomial random variable with parameters $n = 32w$ (number of trials) and $p = \frac{A-N}{T}$ (probability of finding a free page). The cumulative number of free pages found up to step i is given by the partial sum $S_i = \sum_{j=1}^{i} C_j$ with $S_0 = 0$. Then the upper bound B is the smallest integer such that $S_B \ge 32$. In other words, B is the first time the stochastic process $S_i$ reaches 32. In the next steps we shall calculate the first passage time B. The main disadvantage of using the Binomial distribution to find the first passage time directly is that it can be computationally expensive, especially for large numbers of trials n. For efficiency, we resort to the Normal distribution as a quicker approximation for calculating the first passage time. The Normal distribution also has more convenient mathematical properties.
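The first-passage model above can be sketched with a small Monte Carlo simulation (illustrative only: the values of n, p, the trial count, and the seed are assumptions, not measurements from the GPU system):

```python
import random

rng = random.Random(0)  # fixed seed for reproducibility

def first_passage(n, p, target=32):
    """Steps until the cumulative Binomial(n, p) successes reach `target`
    (the quantity B in the text)."""
    found, steps = 0, 0
    while found < target:
        steps += 1
        # C_i ~ Binomial(n, p): free pages found by the warp in one step
        found += sum(rng.random() < p for _ in range(n))
    return steps

n, p = 32 * 32, 0.01   # a warp probes n = 32*w bits per step; 1% free pages
trials = 200
avg_B = sum(first_passage(n, p) for _ in range(trials)) / trials
```

Since the mean increment per step is n*p = 10.24 pages, the warp needs roughly 32/10.24, i.e., three to four steps on average, which the simulated avg_B reflects.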

Given a Binomial distribution, a Normal curve with the same mean and standard deviation can often serve as a robust and reliable approximation of the Binomial distribution. The validity of using a Normal distribution as an approximation for a Binomial distribution is generally accepted when the size of the trial n is sufficiently large and the success probability p is not too close to either 0 or 1. A general rule of thumb is that both

$np \ge 5 \quad \text{and} \quad n(1-p) \ge 5.$

This ensures that the central 95% of the Normal distribution lies between 0 and n, which is necessary for a good approximation of the Binomial distribution. In our case, $n = 32w$ where w is usually 32 or 64, and $p = \frac{A-N}{T}$, which is the rate of free pages. So p shall satisfy the following inequalities

$\frac{5}{32 \cdot 32} \le p \le 1 - \frac{5}{32 \cdot 32}$, i.e., approximately $0.5\% \le p \le 99.5\%$, when $w = 32$ (7)

$\frac{5}{32 \cdot 64} \le p \le 1 - \frac{5}{32 \cdot 64}$, i.e., approximately $0.2\% \le p \le 99.8\%$, when $w = 64$. (8)

The above inequalities suggest that for almost all values of probability p in our scenarios, Normal distribution can be used to estimate the various probabilities associated with the Binomial distribution.

Hence, we can approximate the Binomial distribution by a Normal distribution with the same mean $\mu = np$ and standard deviation $\sigma = \sqrt{np(1-p)}$ [25], that is

$\mu = 32w\,\frac{A-N}{T}, \qquad \sigma = \sqrt{32w \cdot \frac{A-N}{T} \cdot \frac{T-A+N}{T}}$ (9)

Therefore, $S_i$ is approximately a discrete-time Brownian motion process with drift $\mu$ and scale $\sigma$. The inverse Gaussian describes the distribution of the time a Brownian motion with positive drift takes to reach a fixed positive target. The density function of the inverse Gaussian distribution, given by Theorem 5.3 in [25], is

$f(t) = \frac{a}{\sigma\sqrt{2\pi t^3}} \exp\left(-\frac{(a-\mu t)^2}{2\sigma^2 t}\right).$ (10)

In our case, the target value is $a = 32$. The distribution model shows that as $t \to \infty$, $f(t) \sim t^{-3/2}$; that is to say, the probability of achieving the target value only after some long time becomes increasingly small.

Finally the expectation of the first passage time B can be derived as:

$\mathrm{TAS} = \mathrm{WAS} \approx E[B] = \int_0^{\infty} t f(t)\,dt = \int_0^{\infty} \frac{a}{\sigma\sqrt{2\pi t}} \exp\left(-\frac{(a-\mu t)^2}{2\sigma^2 t}\right) dt$ (11)

Although a closed form of Eq. (11) is not known, its value is moderate, on the order of the ratio between the target value 32 and the drift $\mu$. The CoRW algorithm improves over RW-BM by removing the gap between TAS and WAS and therefore lowering WAS. Figure 5 compares the WAS values of CoRW and RW-BM derived from Eq. (6) and Eq. (11), respectively. According to Figure 5, CoRW improves WAS significantly, and the improvement is greater at lower free percentages, reaching up to 4 times.

Fig. 5.

Theoretical results for WAS of RW-BM and CoRW calculated from Eq. (6) and (11) when T=1,000,000 and N=5,000. The right subgraph plots the ratio of the two equations

As a special note, the analysis above shows that in theory, the number of free pages found by a warp at any moment can be approximated with a Normal distribution. To further support this assumption in our scenarios, we execute a kernel of 32 threads (one warp) that requests pages and collects the number of pages found in one step. We run this kernel 10,000 times and present the data distribution of the number of pages found in Figure 6, which clearly supports our assumption. Please note that we only execute one warp in these experiments to keep the distribution static when the warp is searching for pages. In a multi-warp environment, the distribution is dynamic but is upper-bounded by the static Normal distribution at the end of the kernel, as we stated above.

Fig. 6.

Number of free pages found by a warp in one step under several free percentages

In this analysis, we also assume that all 32 threads participate in the allocation request. This may not be the case when the allocation request is made within a conditional statement, as demonstrated in the second example of Listing 1. When this happens, the factor μ in Eq. (11) decreases, which causes the expectation to increase and thus reduces CoRW’s effectiveness. In Section 4.2, we propose another innovation that does not depend on the number of threads participating in the request.

4.2. Clustered Random Walk

We introduce another search algorithm called Clustered Random Walk (CRW) based on the intuition that if one page is found free, its adjacent page(s) are likely to be free. In other words, a free page can be viewed as a member of a free-page cluster. A visual demonstration is presented in Figure 7: all threads keep drawing from their own free-page cluster. Threads 0, 1, and 3 still have free pages in their clusters, so they can quickly grab one. Thread 2 runs out of free pages in its cluster and has to perform a random walk to find a new cluster.

Fig. 7.

Illustration of the CRW algorithm

Detailed implementation of CRW is presented in Algorithm 5. We introduce a thread-level local variable last_free_page, which stores the ID of the last page obtained by that thread. With that, the CRW algorithm stores the last page that each thread has obtained and tries to return the adjacent page to serve the next getPage request. If the adjacent page is not available, CRW calls the regular RW procedure to get a page. Before returning a free page, we need to save the ID of the newly-acquired page to last_free_page.

Algorithm 5: GETPAGE based on CRW
input: w: word length, 32 or 64
output: ID of a free page acquired
1: n ← last_free_page + 1
2: p ← ⌊n / w⌋
3: r ← atomically set LockMap[p] to 1
4: P ← (r == 1) ? 0xffffffff : BitMap[p]
5: f ← ffs(¬P)
6: if f is valid then
7:   set the f-th bit in BitMap[p] to 1
8:   page_ID ← p × w + f − 1
9:   last_free_page ← page_ID
10:  return page_ID
11: end if
12: page_ID ← getPageRandomWalk()
13: last_free_page ← page_ID
14: return page_ID

CRW has two advantages over RW. First, with a certain (high) probability, we can quickly get a page from last_free_page, thus saving the time to continue the random walk, which is more expensive than accessing last_free_page. Second, fragmentation is reduced because used pages are clustered and free pages are also clustered.

4.2.1. Performance of CRW.

During the lifetime of a program that utilizes the CRW algorithm, acquired pages tend to occupy consecutive space in the buffer pool. Therefore, the buffer pool is divided at any point into clusters of consecutive occupied pages and consecutive free pages. This trend may be broken with external fragmentation, i.e., irregular page-freeing patterns that fragment a large cluster into many smaller clusters. Analysis of CRW performance will be based on studying the spatial distribution of such clusters.

TAS Analysis.

Let $X'_i$ be the random variable representing the number of steps taken until finding a free page using the CRW algorithm, $X_i$ be that of the RW algorithm, and p be the probability that the page adjacent to the last free page is occupied. We can express $X'_i$ in terms of $X_i$ as:

$X'_i = \begin{cases} 1 & \text{with probability } 1-p \\ 1 + X_i & \text{with probability } p \end{cases}$

The two cases in the equation above correspond to the two branches in Algorithm 5: lines 7-10 and lines 12-14. It is easy to see that any statistical moment of $X'_i$ is bounded in terms of that of $X_i$. For example, the first moment of $X'_i$ is:

$E[X'_i] = (1-p) + p\left(E[X_i] + 1\right) = 1 + p\,E[X_i]$

according to the law of total expectation, and we have $E[X'_i] \le E[X_i] + 1$. The equality is reached when $p = 1$. Therefore, the key parameter for the analysis of $E[X'_i]$ is p. Here, $E[X_i]$ follows Equation (5) if the Bitmap is used and Equation (3) otherwise.

To formulate p, we need to introduce the concept of free-page clusters. In a buffer pool of T pages, a cluster $[a, b]$ is a maximal set of consecutive free pages $a, a+1, \ldots, b$; that is, page $a-1$ and page $b+1$ are occupied. Since the algorithm tries to get consecutive pages by saving the last free page it obtained, a thread keeps drawing from one cluster until that cluster is depleted, after which it performs a Random Walk to find a new cluster. Therefore, the quantity p equals the probability that the thread's cluster was depleted by the previous getPage request, that is, when $a = b$ at that request. We need to mathematically characterize the system at two points in time: the previous getPage request and the current getPage request.

Let $M_{t-1}$ be the number of free-page clusters and $a_{1,t-1}, a_{2,t-1}, \ldots, a_{M_{t-1},t-1}$ be the (positive) sizes of the clusters before the getPage request at time $t-1$. The sum of the $M_{t-1}$ cluster sizes must be $A_{t-1}$, the total number of free pages in the buffer pool at that moment:

$a_{1,t-1} + a_{2,t-1} + \cdots + a_{M_{t-1},t-1} = A_{t-1}$

Similarly, before the current request:

$a_{1,t} + a_{2,t} + \cdots + a_{M_{t-1},t} = A_t$ (12)

Note that here $a_{i,t}$ can be 0, and we do not update the time index of $M_{t-1}$. The probability $p = P(a_{i,t} = 0)$ can be calculated as follows.

$p = \frac{\text{number of solutions to Eq. (12) where } a_{i,t}=0}{\text{number of solutions to Eq. (12)}}$

Eq. (12) is a simple linear Diophantine equation, and its number of solutions has been derived using the stars-and-bars representation [34] as follows. Suppose that there are $A_t$ stars and $M_{t-1}-1$ bars; an arrangement of the stars and bars is equivalent to one solution of Eq. (12), where the number of stars between two bars equals the value of one $a_{i,t}$. For example, let $A_t = 10$ be the number of stars and $M_{t-1}-1 = 3$ be the number of bars. One possible arrangement of 10 stars and 3 bars is

*|**||*******

This arrangement is equivalent to the solution $a_{1,t}=1$, $a_{2,t}=2$, $a_{3,t}=0$, $a_{4,t}=7$ of Eq. (12) where $A_t = 10$ and $M_{t-1} = 4$. Note that there is no star between the second and third bars, equivalent to $a_{3,t} = 0$.

Following this logic, the total number of solutions to Eq. (12) is the number of arrangements of $A_t$ stars and $M_{t-1}-1$ bars, which is the number of ways to select $M_{t-1}-1$ positions among $A_t + M_{t-1} - 1$ positions to insert the bars (leaving the remaining $A_t$ positions for the stars). The number of ways to select $M_{t-1}-1$ positions among $A_t+M_{t-1}-1$ positions is:

$\binom{A_t+M_{t-1}-1}{M_{t-1}-1} = \frac{(A_t+M_{t-1}-1)!}{(M_{t-1}-1)!\,A_t!}$

This is the total number of solutions to Eq. (12). Now we find the number of solutions to Eq. (12) where one $a_{i,t} = 0$. Given one $a_{i,t} = 0$, Eq. (12) becomes

$a_{1,t} + a_{2,t} + \cdots + a_{M_{t-1}-1,t} = A_t$

Similar to the above, the total number of solutions to this equation is

$\binom{A_t+M_{t-1}-2}{M_{t-1}-2} = \frac{(A_t+M_{t-1}-2)!}{(M_{t-1}-2)!\,A_t!}$

Therefore, we have

$p = P(a_{i,t}=0) = \frac{(A_t+M_{t-1}-2)!}{(M_{t-1}-2)!\,A_t!} \cdot \frac{(M_{t-1}-1)!\,A_t!}{(A_t+M_{t-1}-1)!} = \frac{M_{t-1}-1}{A_t+M_{t-1}-1}$

To simplify, we drop the time index:

$p = \frac{M-1}{A+M-1}$ (13)

where A is the current number of free pages and M is the number of clusters before the previous getPage request.
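The closed form of Eq. (13) can be checked against an exhaustive enumeration of the stars-and-bars solutions; the small values of A and M below are illustrative assumptions chosen to keep the enumeration tiny:

```python
from itertools import product
from math import comb

def p_depleted_exhaustive(A, M):
    """P(a_1 = 0) over all nonnegative integer solutions of a_1 + ... + a_M = A."""
    sols = [s for s in product(range(A + 1), repeat=M) if sum(s) == A]
    return sum(1 for s in sols if s[0] == 0) / len(sols)

A, M = 10, 4
exhaustive = p_depleted_exhaustive(A, M)
closed_form = (M - 1) / (A + M - 1)        # Eq. (13)
n_solutions = comb(A + M - 1, M - 1)       # stars-and-bars count
```

For A = 10 and M = 4 there are C(13, 3) = 286 solutions, of which 66 have the first cluster empty, and 66/286 = 3/13 matches (M-1)/(A+M-1) exactly.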

Following that, using the approximation $E[X_i] \approx \frac{T}{N}\ln\frac{A}{A-N}$ of Eq. (3), TAS is:

$E[X'_i] = 1 + \frac{M-1}{A+M-1} \cdot \frac{T}{N} \ln\frac{A}{A-N}$ (14)

When there are few large clusters, $M \to 1$ and $E[X'_i] \to 1$. When there are many small clusters, $M \to T$ and $E[X'_i]$ approaches the $E[X_i]$ value of the RW algorithm. Therefore, when an irregular page-freeing pattern breaks the space into many small clusters, the performance of CRW converges to that of RW. This shows that CRW is of significant intellectual and practical value, as it consistently performs at least as well as RW.

In Figure 8, we plot the theoretical TAS values of the RW (Eq. 3) and CRW (Eq. 14) algorithms under four scenarios with different percentages of free pages. For each scenario, we show the results under three different M values. In all cases, the TAS value of CRW is much smaller than that of RW. Even when M=A (CRW3) and only 0.5% of pages are free, the TAS of RW is about twice that of CRW.
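The two limiting regimes can be read directly off Eq. (14); the following sketch evaluates it with illustrative parameters (T, N, A are assumptions, not the paper's experimental configuration):

```python
from math import log

def tas_crw(T, N, A, M):
    """Eq. (14): expected CRW step count, using the log approximation of Eq. (3)."""
    p = (M - 1) / (A + M - 1)              # Eq. (13): P(cluster depleted)
    return 1 + p * (T / N) * log(A / (A - N))

T, N = 1_000_000, 5_000
A = 10_000                                  # 1% free pages
tas_rw_approx = (T / N) * log(A / (A - N))  # RW baseline
```

With M = 1 (one big cluster) the expression collapses to a single step, while with M = A (maximal fragmentation) p is close to 1/2 and CRW sits at roughly half-way between 1 and the RW value, consistent with the factor-of-two gap visible in Figure 8.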

Fig. 8.

Expected number of steps of CRW according to Eq. (14) under different M values (CRW1: M=0.01A, CRW2: M=0.5A, CRW3: M=A) and percentage of free pages in comparison to RW

WAS Analysis.

The intra-warp maximum is $Y'_i = \max(X'_0, X'_1, \ldots, X'_{31})$. Since $X'_i = 1$ if a thread can immediately find a free page next to the last known free page and $X'_i = 1 + X_i$ otherwise, $Y'_i$ is governed by the maximum of a number Q of random variables $X_i$, where $Q \le 32$:

$Y'_i = \max(X_1, X_2, \ldots, X_Q)$ (up to one extra step)

The expectation $E[Y'_i]$ is upper-bounded by the WAS of RW, $E[Y_i]$, because $Y_i$ is the maximum of the full set of 32 random variables $X_i$. Therefore, $E[Y'_i]$ is also upper-bounded by the upper bound of $E[Y_i]$, which is presented in Equation (6) if the Bitmap is used and Equation (4) otherwise. This means that, with respect to the intra-warp maximum steps, CRW is (upper-)bounded by RW. A closed-form estimate of $E[Y'_i]$ is very difficult to obtain analytically because Q itself is also a random variable. Figure 9 shows the WAS of RW and CRW obtained from numerical simulations. The plots clearly show that CRW's WAS is upper-bounded by that of RW.

Fig. 9.

WAS of RW and CRW through numerical simulation

4.3. Finding Multiple Consecutive Pages

An important extension is to request consecutive memory of arbitrary size, much like the malloc function in C. This is very useful for applications in which threads determine their memory consumption only after the kernel is launched. In fact, our work on the Bitmap and CoRW paved the way towards such a procedure (which we name RW_malloc). We still divide the memory pool into small units of the same size and store the used bits in a bitmap (we will discuss the choice of unit size later). Thus, the problem of getting X bytes by RW_malloc reduces to getting $n = \lceil X/S \rceil$ consecutive units, where S is the unit size. Following the RW design, threads scan the bitmap in a parallel and random manner. Instead of a single unset bit, we need to find n consecutive unset bits.

Our design of RW_malloc follows the idea of CoRW. However, instead of each thread reading a random word, all (active) threads in a warp read consecutive words to form a large region of the bitmap. Then each thread scans a small part (e.g., one word) of the region to find all runs of consecutive unset bits (called free segments). Free segments running across two neighboring words are also connected. Critical information (e.g., starting position, length, used or not) about all free segments is stored in a data structure K. Finally, each thread traverses K to find a free segment that can serve its RW_malloc request. A sketch of the RW_malloc design is shown in Algorithm 6.
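The core scan, finding n consecutive unset bits in a region while connecting runs across word boundaries, can be sketched sequentially in Python (a simplification of the parallel warp-level scan, for illustration only):

```python
def find_consecutive(bitmap, n, w=32):
    """Scan a bitmap region for a run of n consecutive unset bits (free pages).
    Returns the page ID of the first page of the run, or -1 if none exists.
    Runs crossing word boundaries are connected by concatenating the words."""
    region = 0
    for i, word in enumerate(bitmap):       # word i holds pages i*w .. i*w + w - 1
        region |= word << (i * w)
    run, start = 0, 0
    for bit in range(len(bitmap) * w):
        if (region >> bit) & 1:             # page is used: reset the current run
            run = 0
        else:
            if run == 0:
                start = bit                 # a new free segment begins here
            run += 1
            if run == n:
                return start
    return -1
```

For instance, with the region [0xffffff0f, 0x0], a request for 4 pages is served from the free segment at pages 4-7, while a request for 8 pages must use the fully free second word starting at page 32.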

Due to the fast scanning of the bitmap, RW_malloc has good performance because multiple bits can be visited at once. Compared to CoRW, the cost of RW_malloc is higher, as we cannot run all computations using intrinsic and shuffle functions. However, since the main data K is stored in shared memory, the overall performance is still orders of magnitude higher than Ouroboros (Section 5.2.3).

In malloc-style allocations, in addition to latency, we also need to consider memory utilization. The size of the basic memory unit, the page, is a key parameter that affects space efficiency. As large pages may contain wasted space, we prefer small page sizes. However, when S is too small, we face the challenge of scanning large chunks of the bitmap, leading to degraded RW_malloc performance. Our solution is to aggregate requests for small sizes within a warp and treat the aggregated sizes as a single request. Once we have found consecutive pages that fit the aggregated request, the allocated space is distributed to the original small requests.
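The aggregation step can be sketched as follows (a host-side illustration of the bookkeeping, with a hypothetical helper name and an assumed 256B page size; the actual warp-level implementation uses shuffle operations):

```python
def aggregate_requests(sizes, page_size=256):
    """Round each thread's request up to whole pages, serve the warp's total as
    one consecutive allocation, then hand out per-thread byte offsets."""
    pages = [-(-s // page_size) for s in sizes]   # ceiling division per request
    offsets, cursor = [], 0
    for p in pages:
        offsets.append(cursor)                    # this thread's slice begins here
        cursor += p * page_size
    return cursor, offsets                        # total bytes, per-thread offsets
```

For example, requests of 100B, 300B, and 256B round to 1, 2, and 1 pages, so the warp allocates 1024B once and the threads receive offsets 0, 256, and 768.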

Limitation:

Our RW_malloc design practically sets a cap on the number of consecutive pages: n should be smaller than the size of a bitmap region (e.g., 2048). Support for larger allocations will be an interesting direction for future work. However, we believe the current work has value. With a large number of threads on GPUs, the requested memory size from each thread tends to be smaller than that in CPU systems. Moreover, the range of memory sizes supported by RW_malloc surpasses that of Ouroboros, allowing a meaningful comparison to the best current solution.

Algorithm 6: RW_MALLOC
input: n: number of consecutive pages to find
output: the ID of the first free page
1: t ← laneID (0..31) of this thread
2: pageID ← −1
3: needMask ← ballot_sync(pageID == −1)
4: while needMask ≠ 0 do
5:   if t is the chosen thread then p ← random integer within [0, T/w)
6:   P ← BitMap[p + t]
7:   (S, L) ← first range of consecutive unset bits in P
8:   hasMask ← ballot_sync(L > 0)
9:   while hasMask ≠ 0 and needMask ≠ 0 do
10:    for i : 0..31 do
11:      S1 ← shfl_sync(S, i)
12:      L1 ← shfl_sync(L, i)
13:      if pageID == −1 and L1 ≥ n then take ← true
14:      if this thread has the lowest laneID whose take == true then
15:        atomically set bits S1 .. S1 + L1 − 1 of BitMap[p + i]
16:        if success then pageID ← corresponding pageID
17:      end if
18:    end for
19:    needMask ← ballot_sync(pageID == −1)
20:    (S, L) ← next range of consecutive unset bits in P
21:    hasMask ← ballot_sync(L > 0)
22:  end while
23: end while
24: return pageID

5. EXPERIMENTAL EVALUATION

5.1. Experimental Setup

In this section, we conduct experiments to evaluate the performance of the RW-based algorithms. We perform four experiments to compare our methods with Ouroboros, ScatterAlloc, and the built-in CUDA allocator in a unit-test setup. We configure all systems to have a total of 10 GB in the memory pool and to serve a maximum request size of 8192B. Each chunk in the Ouroboros system is 8192B large, and there are ten processing queues that process requests of size 8192B, 4096B, 2048B, etc. We use the same Ouroboros and ScatterAlloc code and environment configurations as presented in [44] to ensure a fair and meaningful comparison. On our side, we evaluate four algorithms: basic Random Walk without bitmap (RW), Random Walk with Bitmap (RW-BM), Clustered Random Walk with Bitmap (CRW), and Collaborative Random Walk (CoRW). In RW-BM, CRW, and CoRW, we use 32-bit words for the bitmap. We built all our code under CUDA 11.4 and ran all experiments on an NVIDIA Titan V GPU. Each data point presented in all figures is the average of 100 experimental runs with the same parameters.

CoRW and CRW have very similar performance, and their differences are not observable in a plot where Ouroboros and RW are present. Therefore, we compare RW, CoRW, and Ouroboros in Section 5.2.1 and compare CoRW and CRW in Section 5.2.2.

Metrics:

In all experiments, we measure the total kernel time, TAS, and WAS. We measure TAS by taking the average of step counts recorded across all threads and WAS by taking the average of the maximum number of steps per warp. The Ouroboros implementation does not provide a measurement of step count, so we implement a simple queue solution as a proxy for measuring TAS and WAS of queue-based solutions.

5.2. Experimental Results

5.2.1. Performance of GetPage.

First, we evaluate the performance of Ouroboros, RW, RW-BM, and CoRW in getting a single page. Specifically, we develop a single GPU kernel whose only task is to request a page of 256B from the memory buffer pool. Similarly, we set up a kernel running on the Ouroboros system that requests a page of 256B for each thread. We launch the kernel with various numbers of threads (i.e., changing the N value) and free percentage (i.e., changing the A value). We have also experimented with various page sizes from 4B to 1024B and found that performance remains the same regardless of page size.

Figure 11 shows the three metrics measured from kernels that call Ouroboros (a simple queue in the case of TAS and WAS), ScatterAlloc, RW, RW-BM, and CoRW. The simple queue was implemented as described in Section 5.1. The four columns represent scenarios with different free percentages. In each scenario, we pre-set some pages as occupied so that the percentage of free pages before starting the kernels is 50%, 10%, 1%, and 0.5%, respectively. Results from the first row show that our most advanced method, CoRW, outperforms Ouroboros by more than an order of magnitude when the free percentage is more than 1%. When the free percentage is 0.5%, CoRW outperforms Ouroboros by up to 3 times. Note that the 0.5% free percentage is an extreme case that is not expected to happen frequently in applications: it means that after serving all the requests, there are 0 pages left. Another observation is that the Ouroboros running time increases with the number of threads, while our algorithms are insensitive to it (except for the 0.5% free-page case).

Results from the second and third rows confirm the validity of our theoretical results (i.e., Equations (3), (4), (5), and (6)). First, the measured TAS values match the theoretical results well. The theoretical upper bound of WAS matches experimental results well, even under 1% free pages, indicating the bounds are tight. The bound becomes loose as the percentage of free pages decreases to below 1%. Visually, the growth patterns of the lines in the first row of Figure 11 match better with those in the third row (WAS) than in the second row (TAS). This shows that WAS is a better indicator of the total running time than TAS.

5.2.2. CoRW versus CRW.

Second, we compare the performance of CoRW and CRW. Both are very efficient, and their interesting characteristics are observable in low-memory conditions. By design, CoRW becomes more efficient as more threads in a warp participate in the allocation request. Therefore, we implemented several kernels where a various number of threads in a warp request memory and other threads are inactive. The total kernel time is plotted in Figure 12.

Fig. 12.

Performance of CoRW and CRW with different number of requests per warp

Figure 12 shows that CRW always outperforms CoRW when the system is at 5% free. At 1% free, CoRW is better when more threads participate in the request; the break-even happens at 25 threads. At 0.8% free, the break-even happens at 20 threads. Finally, at 0.5%, CoRW is always better. These results are consistent with our analyses because at a low free percentage, the difference between TAS and WAS of CRW gets larger and WAS is more indicative of performance. On the other hand, CoRW performs well at a lower free percentage because its TAS and WAS are the same but is slightly less efficient at a high percentage due to overhead in communication between threads.

We have also compared with Ouroboros and ScatterAlloc under the same scenarios. Since it is difficult to observe the small differences between CoRW and CRW when they are plotted together with Ouroboros and ScatterAlloc, this comparison is plotted separately in Figure 13. Figure 13 shows that both CoRW and CRW greatly outperform Ouroboros and ScatterAlloc under these extreme circumstances.

Fig. 13.

Performance of CoRW and CRW with different number of requests per warp

5.2.3. Performance of RW_malloc.

We evaluate the performance of our approach in allocating memory of various sizes. Seven scenarios are tested: each thread requests 16B, 256B, 1024B, 4096B, 8192B, mixed request sizes ranging from 4B to 8192B, or 4096B with the allocated space freed immediately. The last scenario represents the same setting presented in the survey paper [46]. In each scenario, we launch the kernel with various numbers of threads and free percentages. Since the largest request size that we need to support is 8192B, we choose a page size of 256B (8192/32) so that this maximum size fits in a single word of the Bitmap.

Figure 14 presents the total kernel time of RW_malloc and Ouroboros in the seven scenarios. This figure shows that RW_malloc outperforms Ouroboros in allocating a wide range of memory sizes. The improvement is up to 2 orders of magnitude in the best cases, where the system has the highest number of parallel threads and the most free memory units. The higher the concurrency, the more RW_malloc outperforms Ouroboros due to Ouroboros' linear scaling. RW_malloc's performance degrades as fewer memory units are available. However, it is still much better than Ouroboros even in its worst scenario, when there are almost no free pages (note the logarithmic scale of the Y-axis). The only case where Ouroboros wins is when it immediately frees the memory just allocated. Here, Ouroboros hits a sweet spot: it does not need to allocate new chunks nor extend the virtualized queues. However, this is an unrealistic scenario, as buffers will normally be used before being released.

Fig. 14. Performance comparison between RW_malloc and Ouroboros in allocating multiple pages at a time. From left to right: (1) allocate 16 bytes, (2) allocate 256 bytes, (3) allocate 1024 bytes, (4) allocate 4096 bytes, (5) allocate 8192 bytes, (6) allocate mixed sizes, (7) allocate 4096 bytes then free immediately. Displayed free percentages are measured at the start of each kernel

5.2.4. Memory utilization.

We also evaluate memory utilization. In this experiment, we keep issuing memory allocation requests until the system reports an out-of-memory error. The memory utilization rate is then calculated as the ratio of the total allocated amount to the total memory in the system. This design is similar to the out-of-memory test cases in the survey paper [46]. We perform this experiment with various unit sizes while keeping the comparison between Ouroboros and our system fair with respect to total memory and maximum allocation size. For example, an Ouroboros chunk size of 4096B is equivalent to a page size of 128B in our system, because both configurations support a maximum allocation size of 4096B. Figure 15 presents the results of the memory efficiency experiment. Both systems achieve very good memory utilization rates when allocation sizes are powers of two, because these sizes fit exactly into some number of pages; this finding is consistent with the out-of-memory test cases in [46]. In all other situations, however, CoRW significantly outperforms Ouroboros. The reason is that our system allocates large memory chunks by aggregating consecutive small pages and thus controls the memory space at a finer granularity. For example, to allocate 1050B, our system rounds the size up to 1280B (256B × 5) and allocates five consecutive pages, whereas Ouroboros has to round up to 2048B and allocate one page of that size.

Fig. 15. Memory efficiency comparison

5.2.5. Performance of FreePage.

We evaluate the performance of Ouroboros, RW, RW-BM, and CoRW in freeing a single page. We implement a GPU kernel whose only task is to free a 256B memory page that was obtained by a previous kernel. We launch the kernel with various numbers of threads (i.e., varying the N value) and free percentages (i.e., varying the A value).

Figure 16 shows the average latency, in cycles, that each thread spends freeing the memory. Ouroboros' latency grows linearly with the number of concurrent threads, consistent with the pattern in Figure 1. All of our methods have more than ten times lower latency, and their costs remain stable across all numbers of threads and free percentages.

Fig. 16. Performance of FreePage at different numbers of parallel threads and percentages of free pages. Displayed free percentages are measured at the start of each kernel

6. CASE STUDIES

In this section, we report experimental results of actual GPU programs whose page acquisition needs are served by the relevant algorithms. Compared to the unit tests discussed in Section 5, this allows us to evaluate our methods in real-world applications. In particular, we focus on two operations implemented as part of a GPU-based database system, in which a global dynamic memory pool is maintained and shared by all GPU processes. The experimental environment is the same as described in Section 5.

6.1. Join Algorithm

Join is arguably the most important relational operator. Among the several types of join algorithms, we use a state-of-the-art hash join design [35] as the foundation of this case study. Based on the popular idea of radix hashing, the algorithm includes three stages: partitioning input data, building a hash table, and probing. After building histograms of hash values and reordering both input tables in the first two stages, the probing stage compares tuples of the corresponding partitions of the two tables and outputs one tuple whenever there is a match. More details about the join implementation on GPUs are presented in Appendix B. We modified the code from [35] so that, when a thread finds a match, it writes the result to the current data page; when the page is full, the thread asks for another page. We measured the end-to-end processing time of all stages of the hash join code augmented with various memory management implementations: Ouroboros, RW, and CoRW. As discussed in Subsection 5.2.2, CoRW assumes many simultaneous requests from the threads in a warp, whereas CRW is more applicable to sporadic requests within a warp. The hash join algorithm requests memory from many threads in a warp at once and is therefore more suitable for CoRW; for this reason, we use CoRW rather than CRW in this experiment.

We run the code under different input table sizes from 16K to 64M tuples. Note that the data size roughly equals the total number of threads. We also assume that the memory management systems may be serving other database operations at the same time. We therefore preset the percentage of free pages A/T to 50% and 5% prior to our Hash Join operation call.

Figure 17 shows the wall time of the hash join program with the different memory management implementations. With Ouroboros, the join program runs slower than with the others by up to an order of magnitude. RW's performance degrades quickly as the data size increases under the low-memory setting, while CoRW remains significantly better than RW, especially in such situations.

Fig. 17. Processing time of a GPU-based hash join kernel under different data sizes and percentages of free pages

6.2. Group-By Algorithm

The Group-By operator in SQL groups tuples in a database table by distinct values of certain attributes (i.e., the group key). Our Group-By program follows a radix-hash-based design proposed by Ashkiani et al. [5]. We improved the design to handle a number of distinct groups up to the size of the data domain. The program has multiple runs. In each run, all data tuples are distributed into a hash table defined by k consecutive bits of the group key domain. Every tuple in a bucket is further distributed into 2^k buckets in the next run. Parallelism is achieved by assigning each data tuple to a thread, which finds the bucket that the tuple belongs to and appends the tuple to the end of the data list in that bucket. More details about the group-by implementation on GPUs are presented in Appendix C.

In the existing Group-By program that does not use a memory manager, input data is stored in a pre-allocated chunk of memory. Although the total size of the data is known beforehand, the size of an individual data group is not. We therefore have to run the GPU kernel twice: the first pass counts the size of each data group so that memory can be allocated for each group, and the second pass writes the output. Each run requires 2^k buckets, and the last run requires 2^r buckets, where r is the number of bits in the key range. Consequently, we need to allocate a histogram covering all possible key values, which amounts to 17.19 GB of GPU memory for 32-bit keys.

With our dynamic memory management systems, we can allocate memory for each data group on the fly. There is no need to run the hash functions twice, and the kernels can be executed efficiently. Data tuples for a bucket are now stored in a chain of pages, and the last page maintains a tail pointer to the next available slot for appending new data. When the current page is full, we obtain a new page and update the tail pointer. The entire algorithm comprises a few runs, each covering a k-bit hash space. In each run, the requested pages have an aggregated size slightly larger than the database table; all of these pages are freed in the next run.

Figure 18 shows the performance of our Group-By code with the support of different memory management systems. The key is a 32-bit integer, and we conduct radix hashing on 8 bits in each run (i.e., four runs altogether). We run our experiments with input data sizes of up to 10 million keys drawn from a Gaussian distribution. The processing time of our proposed systems grows very slowly as the amount of data grows. CoRW and CRW perform very similarly, and both are better than RW by up to 30%.

Fig. 18. Processing time of Group-By kernels under different data sizes and percentages of free pages

7. CONCLUSIONS AND FUTURE WORK

In this paper, we study the memory allocation problem in highly parallel software systems, focusing on GPU systems. The main idea of our design relies on random processes to locate free buffer pages. It avoids maintaining a centralized data structure that could become a major bottleneck in a high-concurrency system. Based on that philosophy, we propose a memory page allocation design based on a Random Walk (RW) mechanism. We mathematically prove that RW can significantly outperform any queue-based solution under the vast majority of scenarios. RW shows its limitations in extreme cases when free buffer pages are very rare. To remedy that, we first show that by storing page information in a bitmap, the allocation latency can be improved by a factor of w where w is the size of a bitmap word. In addition, we propose two advanced techniques to further lower the number of steps to find a page. The first technique (named CoRW) involves sharing free pages among neighboring threads, while the other one (named CRW) allows a thread to cache free pages found for future use. Experimental results show that our solutions outperform the best-known GPU memory allocator, Ouroboros, by more than two orders of magnitude. Furthermore, we demonstrate two case studies by integrating the memory management implementations into a GPU-based hash join algorithm and a GPU-based group-by algorithm. Both case studies show significant performance boosts.

Our work provides a framework that can be extended to accommodate a wide range of algorithms to gain better performance under different scenarios. More aggressive random walk approaches can be designed and analyzed. One limitation lies in the allocation of larger memory sizes exceeding 8192 bytes, a constraint shared by previous methods. While our algorithms demonstrate robust performance below 8192 bytes, extending their applicability to larger memory sizes is an area for further exploration and improvement. Another interesting direction is treating the memory buffer pool as a cache of low-speed storage, which requires merging RW-based mechanisms with page replacement policies.

Fig. 10. An example of 4 threads running the RW_malloc algorithm

A. DESCRIPTION OF CUDA INTRINSIC INSTRUCTIONS

The following intrinsic instructions have been used in this study, in alphabetical order.

ballot_sync(predicate): Evaluate predicate for all non-exited threads in a warp and return an integer whose Nth bit is set if and only if the predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.

ffs(x): Find the position of the least significant bit set to 1 in x (positions are 1-based; returns 0 if x is 0).

popc(x): Count the number of bits in x that are set to 1.

shfl_sync(var, lane): Exchange a variable var from thread lane within a warp without using shared memory. The exchange occurs simultaneously for all active threads within the warp.

B. DETAILS OF THE HASH JOIN IMPLEMENTATION

Join is arguably the most important relational operator. Among the several types of join algorithms, we use a state-of-the-art hash join designed for GPUs [36] as the foundation of this case study. Based on the popular idea of radix hashing, the algorithm includes three stages: partitioning input data, building a hash table, and probing. We adopt the idea that, by reordering the tuples in a relation according to their hash values, the partitioning and building stages can be combined into one. Therefore, tuples with the same hash value are clustered into a contiguous memory space, which ensures coalesced memory access when threads load data from a given partition.

The partitioning stage starts by building histograms of hash values in order to reorder the tuples of both input tables. In previous works, a thread reads and processes one tuple at a time because the multiprocessor had very few registers. This method is straightforward but less capable of hiding latency via instruction-level parallelism. To utilize the large register file in newer GPU architectures, our implementation loads many tuple values into a thread's registers all at once, so that each thread is assigned more work up front. This increases the instruction-level parallelism within each thread, and memory access can be overlapped with computation to hide latency. Each thread processes its own data independently and updates the shared histogram in shared memory.

In the probe stage (Figure 3), each partition of input table R is loaded into shared memory by one block of threads. A partition of the other table S with the same hash value is loaded into registers by the same threads. To write the outputs back to memory, the conventional wisdom is to perform the probe twice. The first probe returns the number of outputs for each partition, which determines each partition's location in the output buffer; the total number of outputs and the starting position of each partition are obtained by a prefix scan of these counts. Given the total number of outputs, the output array can be allocated, and the second probe is then performed to actually write the output tuples. This scheme eliminates the overhead of synchronization and dynamic buffer allocation, and outputs efficiently in parallel at the cost of probing twice.

We focus our discussion on the probing stage because that is where our buffer management APIs are needed (to allocate memory for output data). After building histograms of hash values and reordering both input tables in the first two stages, the probing stage compares tuples of the corresponding partitions of the two tables and outputs one tuple whenever there is a match (Figure 19). We modified the code from [36] so that, when a thread finds a match, it writes the results to the current data page; when the page is full, the thread asks for another page by calling getpage.

C. DETAILS OF THE GROUP-BY IMPLEMENTATION

Our Group-By program follows a radix-hash-based design proposed by Ashkiani et al. [5]. We improved the design to handle a number of distinct groups up to the size of the data domain, as follows. We adopt radix hashing, a direct-address hashing approach, working on the binary representation of the key from the most significant bit (MSB) to the least significant bit (LSB) in several passes. Essentially, in each pass, a certain number of new distinct groups are differentiated, and future passes further distinguish them into sub-groups. This is in contrast to Karnagel's implementation, which hashed the entire key at once. This coarse-to-fine grouping approach naturally avoids the need to know the group cardinality in advance.

Fig. 19. The probing stage of GPU hash join

The program has multiple runs. In each run, all data tuples will be distributed into a hash table defined by k consecutive bits of the group key domain. Every tuple in a bucket will be further distributed into 2^k buckets in the next run. Parallelism is achieved by assigning each data tuple to a thread, which will find the bucket that the tuple belongs to and append the tuple to the end of the data list in that bucket.

We use a linked-list data structure to chain identical data into a list, as shown in Figure 20. In the first round of bucket distribution, a thread binds to one data point and determines which bucket it goes to. The thread then consults the header array (represented as rectangles; an array of pointers initialized to NULL) that holds the head of each bucket to see whether there is an available page to write the data. Three scenarios can occur: 1) no page has been assigned to this list (white rectangle); 2) the current page is full (fully shaded rectangle); or 3) the current page has available space (partially shaded rectangle). In scenarios 1) and 2), threads request a page from the memory manager, and the new page is appended to the page list. In scenario 3), the threads atomically increment the data count and write their data into the current page. In the next round, threads bind to the data from the page list and perform the same process to further split it into sub-lists. After the data are written to the sub-lists, the pages from the previous list can be released for future use. This process repeats until all bits have been considered, at which point each page list represents a distinct group.

Fig. 20. One run of our GPU Group-By algorithm with k = 3 (eight buckets). Note that the keys are stored in data pages built from the previous run

Footnotes

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

ACM Reference Format:

Minh Pham, Yongke Yuan, Hao Li, Chengcheng Mou, Yicheng Tu, Zichen Xu, and Jinghan Meng. 2024. Dynamic Buffer Management in Massively Parallel Systems: The Power of Randomness. 1, 1 (May 2024), 34 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Contributor Information

MINH PHAM, University of South Florida, USA.

YONGKE YUAN, Beijing University of Technology, China.

HAO LI, University of South Florida, USA.

CHENGCHENG MOU, University of South Florida, USA.

YICHENG TU, University of South Florida, USA.

ZICHEN XU, Nanchang University, China.

JINGHAN MENG, University of South Florida, USA.

REFERENCES

  • [1].Abramowitz M and Stegun IA. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, pages 16, 806, 886. New York: Dover Publications.
  • [2].Adinetz AV and Pleiter D. Halloc: a high-throughput dynamic memory allocator for gpgpu architectures. In GPU Technology Conference (GTC), volume 152, 2014.
  • [3].Arafa Y, Badawy A-H, Chennupati G, Santhi N, and Eidenbenz S. Low overhead instruction latency characterization for nvidia gpgpus, 2019.
  • [4].Arefyeva I, Broneske D, Campero G, Pinnecke M, and Saake G. Memory management strategies in cpu/gpu database systems: A survey. In International Conference: Beyond Databases, Architectures and Structures, pages 128–142. Springer, 2018.
  • [5].Ashkiani S, Davidson A, Meyer U, and Owens JD. Gpu multisplit: An extended study of a parallel algorithm. ACM Trans. Parallel Comput, 4(1), Aug. 2017.
  • [6].Baroudi T, Loechner V, and Seghir R. Static versus dynamic memory allocation: a comparison for linear algebra kernels. In IMPACT 2020, in conjunction with HiPEAC 2020, 2020.
  • [7].Berger ED, McKinley KS, Blumofe RD, and Wilson PR. Hoard: A scalable memory allocator for multithreaded applications. ACM Sigplan Notices, 35(11):117–128, 2000.
  • [8].Brown KP, Carey MJ, and Livny M. Goal-oriented buffer management revisited. ACM SIGMOD Record, 25(2):353–364, 1996.
  • [9].Burić T and Elezović N. Approximants of the euler–mascheroni constant and harmonic numbers. Applied Mathematics and Computation, 222:604–611, 2013.
  • [10].Busato F, Green O, Bombieri N, and Bader DA. Hornet: An efficient data structure for dynamic sparse graphs and matrices on gpus. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pages 1–7. IEEE, 2018.
  • [11].Chen CM and Roussopoulos N. Adaptive database buffer allocation using query feedback. Technical report, 1998.
  • [12].Chou H-T and DeWitt DJ. An evaluation of buffer management strategies for relational database systems. Algorithmica, 1(1–4):311–336, 1986.
  • [13].Corbató FJ and Vyssotsky VA. Introduction and overview of the multics system. In Proceedings of the November 30–December 1, 1965, fall joint computer conference, part I, pages 185–196, 1965.
  • [14].Dennis JB. Segmentation and the design of multiprogrammed computer systems. Journal of the ACM (JACM), 12(4):589–602, 1965.
  • [15].Douglas N. ned productions - nedmalloc. https://www.nedprod.com/programs/portable/nedmalloc, 2006. Retrieved February 01, 2024.
  • [16].Effelsberg W and Haerder T. Principles of database buffer management. ACM Transactions on Database Systems (TODS), 9(4):560–595, 1984.
  • [17].Evans J. jemalloc: A scalable concurrent malloc(3) implementation. https://github.com/jemalloc/jemalloc, 2006. Retrieved February 01, 2024.
  • [18].Falsafi B, Guerraoui R, Picorel J, and Trigonakis V. Unlocking energy. In 2016 USENIX Annual Technical Conference (USENIX 16), pages 393–406, 2016.
  • [19].Gloger W. Ptmalloc3 - a multi-thread malloc implementation. https://github.com/Cloudifold/ptmalloc, 2006. Retrieved February 01, 2024.
  • [20].Google. Tcmalloc. https://github.com/google/tcmalloc, 2024. Retrieved February 01, 2024.
  • [21].He B, Lu M, Yang K, Fang R, Govindaraju NK, Luo Q, and Sander PV. Relational query coprocessing on graphics processors. ACM Transactions on Database Systems (TODS), 34(4):1–39, 2009.
  • [22].He B, Yang K, Fang R, Lu M, Govindaraju N, Luo Q, and Sander P. Relational joins on graphics processors. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 511–524, New York, NY, USA, 2008. ACM.
  • [23].Huang X, Rodrigues CI, Jones S, Buck I, and Hwu W.-m.. Xmalloc: A scalable lock-free dynamic memory allocator for many-core machines. In 2010 10th IEEE International Conference on Computer and Information Technology, pages 1134–1139. IEEE, 2010.
  • [24].Hunter AH, Kennelly C, Gove D, Ranganathan P, Turner PJ, and Moseley TJ. Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), 2021.
  • [25].Karlin S. A first course in stochastic processes. Academic press, 2014.
  • [26].Kilburn T, Edwards DB, Lanigan MJ, and Sumner FH. One-level storage system. IRE Transactions on Electronic Computers, (2):223–235, 1962.
  • [27].Knuth DE. Johann faulhaber and sums of powers. Mathematics of Computation, 61(203):277–294, 1993.
  • [28].Lea D. A memory allocator. http://gee.cs.oswego.edu/dl/html/malloc.html, 1996. Accessed: 2024-02-01.
  • [29].Li H, Yu D, Kumar A, and Tu Y. Performance modeling in CUDA streams - A means for high-throughput data processing. In Big Data (Big Data), 2014 IEEE International Conference on, pages 301–310, Oct 2014.
  • [30].Liu T, Curtsinger C, and Berger ED. Dthreads: efficient deterministic multithreading. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pages 327–336, 2011.
  • [31].Michael MM. Scalable lock-free dynamic memory allocation. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, PLDI '04, page 35–46, New York, NY, USA, 2004. Association for Computing Machinery.
  • [32].Paul J, He J, and He B. Gpl: A gpu-based pipelined query processing engine. In Proceedings of the 2016 International Conference on Management of Data, pages 1935–1950. ACM, 2016.
  • [33].Pham M, Li H, Yuan Y, Mou C, Ramachandran K, Xu Z, and Tu Y. Dynamic memory management in massively parallel systems: A case on gpus. In Proceedings of the 36th ACM International Conference on Supercomputing, ICS '22, New York, NY, USA, 2022. Association for Computing Machinery.
  • [34].Quinn JJ and Benjamin AT. Proofs That Really Count: The Art of Combinatorial Proof. The Mathematical Association of America, 2003.
  • [35].Rui R, Li H, and Tu Y-C. Efficient join algorithms for large database tables in a multi-gpu environment. Proceedings of the VLDB Endowment, 14(4):708–720, 2021.
  • [36].Rui R and Tu Y-C. Fast equi-join algorithms on gpus: Design and implementation. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management, SSDBM '17, pages 17:1–17:12, New York, NY, USA, 2017. ACM.
  • [37].Schneider S, Antonopoulos CD, and Nikolopoulos DS. Scalable locality-conscious multithreaded memory allocation. ISMM '06, page 84–94, New York, NY, USA, 2006. Association for Computing Machinery.
  • [38].Springer M and Masuhara H. Dynasoar: a parallel memory allocator for object-oriented programming on gpus with efficient memory access. arXiv preprint arXiv:1810.11765, 2018.
  • [39].Steinberger M, Kenzel M, Kainz B, and Schmalstieg D. Scatteralloc: Massively parallel dynamic memory allocation for the gpu. In 2012 Innovative Parallel Computing (InPar), pages 1–10. IEEE, 2012.
  • [40].Stonebraker M. Operating system support for database management. Communications of the ACM, 24(7):412–418, 1981.
  • [41].Team R et al. Rapids: Collection of libraries for end to end gpu data science, 2018.
  • [42].Vinkler M and Havran V. Register efficient dynamic memory allocator for gpus. In Computer Graphics Forum, volume 34, pages 143–154. Wiley Online Library, 2015.
  • [43].Widmer S, Wodniok D, Weber N, and Goesele M. Fast dynamic memory allocator for massively parallel architectures. In Proceedings of the 6th workshop on general purpose processor using graphics processing units, pages 120–126, 2013.
  • [44].Winter M, Mlakar D, Parger M, and Steinberger M. Ouroboros: virtualized queues for dynamic memory management on gpus. In Proceedings of the 34th ACM International Conference on Supercomputing, pages 1–12, 2020.
  • [45].Winter M, Mlakar D, Zayer R, Seidel H-P, and Steinberger M. faimgraph: high performance management of fully-dynamic graphs under tight memory constraints on the gpu. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 754–766. IEEE, 2018.
  • [46].Winter M, Parger M, Mlakar D, and Steinberger M. Are dynamic memory managers on gpus slow? a survey and benchmarks. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 219–233, 2021.
  • [47].Yuan Y, Lee R, and Zhang X. The yin and yang of processing data warehousing queries on gpu devices. Proceedings of the VLDB Endowment, 6(10):817–828, 2013.
  • [48].Zou GY. Toward using confidence intervals to compare correlations. Psychological methods, 12(4):399, 2007.
