PLOS One. 2026 Feb 3;21(2):e0342167. doi: 10.1371/journal.pone.0342167

Performance evaluation of GPU-based parallel sorting algorithms

Mohammed Alaa Ala’anzy 1,*, Nurdaulet Tolendi 1, Baizhan Baubek 1, Abdulmohsen Algarni 2
Editor: Francesco Bardozzo
PMCID: PMC12867261  PMID: 41632810

Abstract

Sorting can be approached in two main ways: sequentially and in parallel. In sequential sorting, data is processed in a single-threaded manner, which can be slow for large datasets. Parallel sorting, by contrast, divides the task across multiple processing units, enabling faster results by processing data simultaneously. Furthermore, Compute Unified Device Architecture (CUDA) technology enables developers to leverage GPU power for general-purpose parallel computing, significantly accelerating tasks like sorting. This paper investigates the GPU-based parallelization of merge sort (MS), quick sort (QS), bubble sort (BS), radix top-k selection sort (RS), and slow sort (SS), presenting optimized algorithms designed for efficient sorting of large datasets on modern GPUs. The primary objective is to evaluate the performance of these algorithms on GPUs using CUDA, with a focus on analyzing both parallel time complexity and space complexity across various data types. Experiments are conducted on four dataset scenarios: randomly generated data, reverse-sorted data, already-sorted data, and nearly-sorted data. In addition, the performance of the GPU-accelerated implementations is compared with their sequential counterparts to assess improvements in computational efficiency and scalability. Earlier GPU-based implementations of this type typically achieved acceleration rates between 2× and 9× over scalar CPU code. With newer GPU enhancements, including parallel-aware primitives and radix- or merge-optimized operations, acceleration rates have improved significantly. Our experiments indicate that GPU-based Radix Sort achieves a significant speedup of approximately 50× (sequential: 240.8 ms, parallel: 4.83 ms) on 10 million randomly generated elements. Quick Sort and Merge Sort achieve 97× and 103× speedups, respectively (Quick: 1461.97 ms vs. 15.1 ms; Merge: 2212.33 ms vs. 21.4 ms).
Bubble Sort, while improving significantly in parallel (123,321.9 ms to 7377.8 ms, an ≈17× improvement), remains considerably worse overall. Slow Sort demonstrates a moderate but consistent acceleration, reducing execution time from 74.07 ms in the sequential version to 3.99 ms on the GPU, an ≈18.6× speedup. These experimental findings confirm that the new single-GPU implementations achieve speedups ranging from 17× to over 100×, surpassing the typical gains reported in previous generations and comparable to, or exceeding, the acceleration rates reported for state-of-the-art parallel sorting algorithms in recent studies.

Introduction

The exponential growth of data-intensive fields such as artificial intelligence, scientific computing, real-time analytics, and large-scale simulations has placed unprecedented demands on algorithmic performance and scalability [1–5]. Sorting underpins many of these applications, from database query processing and graph analytics to genome sequencing and large-scale scientific simulations. Notably, sorting operations can account for up to 25% of processing time in data-intensive applications, underscoring their critical role in high-performance computing workloads [6]. As dataset sizes grow into the millions and beyond, traditional sequential sorting algorithms become performance bottlenecks, leading to unacceptable latency and resource usage.

Parallel sorting techniques on multi-core CPUs and distributed clusters have alleviated some of these pressures, but they still face challenges in load balancing, memory bandwidth, and inter-node communication overhead. In particular, multi-GPU strategies have emerged as a promising solution for overcoming the memory and computational limits of single-device sorting. Ilic et al. [7] demonstrated up to a 14× improvement in throughput by exploiting NVLink and PCIe 4.0 interconnects for distributed merge-based sorting, while Tolovski et al. [8] introduced RMG Sort, a radix-partitioning approach that segments most significant bits across multiple GPUs to achieve near-linear scalability with minimal inter-device communication.

GPUs with thousands of cores and high-bandwidth memory subsystems have become a cornerstone of parallel computing [9]. NVIDIA’s CUDA programming model provides fine-grained control over thread hierarchies, shared memory, and warp-synchronous execution, enabling optimized sorting kernels that deliver substantial speedups over CPU baselines [10,11]. Indeed, recent work shows that well-tuned GPU implementations of comparison-based sorts such as MS and QS can achieve up to 14×–38× acceleration on million-element datasets [12].

Recent hardware trends in unified-memory systems have changed the CPU-versus-GPU sorting landscape significantly. For example, a study by Liu [13] evaluated Monte Carlo particle-sorting algorithms on Apple’s M2 Max chip, which uses unified CPU–GPU memory. Interestingly, CPU-based sorting outperformed GPU-based sorting for partially sorted datasets, due to reduced thread divergence and the absence of off-chip transfer overhead, while GPUs still excelled on fully random data. This demonstrates the necessity of including diverse data distribution patterns when benchmarking sorting algorithms in modern heterogeneous environments.

Over the past five years, a variety of GPU-centric sorting approaches have been proposed. Radix-based methods, including hardware-accelerated radix top-K selection [14], leverage digit-wise bucketing to achieve near-linear time complexity and excellent coalesced memory access. Segmented sorting evaluations, such as those by Schmid and Caceres [16] and Schmid et al. [15], compare fix-sort and merge-based variants across multiple GPU generations, providing recommendation maps for optimal strategy selection. Specialized order-statistics kernels like “bucketMultiSelect” [17] extract multiple percentiles more than 8× faster than full sorts, highlighting opportunities when only partial ordering is needed.

Despite these advances, existing studies often focus on one algorithmic paradigm or use disparate experimental conditions, which hinders direct comparison. Furthermore, many GPU sorting kernels rely on advanced memory optimizations, shared memory tiling, warp-level intrinsics, and manual prefetching, yet the relative benefits of these techniques across different algorithms and data distributions remain underexplored. Including simpler algorithms such as BS, despite its O(n²) complexity, provides a useful baseline for examining synchronization and memory-access behaviors under parallel execution [18]. By evaluating both naïve and sophisticated algorithms side-by-side, we gain deeper insight into how algorithmic complexity and GPU optimizations interact in practice.

To address these gaps, we implement and evaluate five representative sorting algorithms, MS, QS, BS, RS, and SS, in a unified CUDA C++ framework. MS and QS represent efficient, comparison-based divide-and-conquer paradigms; BS serves as a synchronization-focused baseline; RS exemplifies digit-based, non-comparison sorting optimized for large integer datasets; and SS represents a bitmap-based presence-encoding approach designed for efficient integer sorting with compact auxiliary memory usage. We systematically benchmark each algorithm on an NVIDIA GTX 1660 SUPER using a fixed dataset of ten million integers under four input distributions (random, sorted, reverse-sorted, and nearly sorted). By measuring execution time and memory usage, and comparing GPU implementations against C++ sequential baselines, we quantify speedups ranging from roughly 17× to over 100× and analyze algorithm robustness and scalability.

To improve the clarity of the workflow and provide readers with an immediate understanding of the study’s scope, we introduce an overview diagram (Fig 1) that summarizes the full evaluation pipeline. This schematic illustrates the dataset preparation, CPU and GPU sorting implementations, and the performance assessment components of the proposed framework.

Fig 1. Overall workflow of the dataset generation, CPU and GPU sorting algorithms, and performance comparison.


The principal contributions of this work are as follows:

  • Developed GPU-accelerated implementations of five sorting algorithms, MS, QS, BS, RS, and SS, using the CUDA programming model to exploit parallel execution capabilities.

  • Provided a comparative analysis of these algorithms, representing different algorithmic paradigms: divide-and-conquer (MS, QS), elementary quadratic sorting (BS), digit-based non-comparison sorting (RS), and bitmap-based sorting (SS).

  • Evaluated each implementation across four data distribution scenarios: random, sorted, reverse-sorted, and nearly sorted, using a dataset of ten million integers to assess scalability and algorithm robustness under varying input characteristics.

  • Conducted a detailed comparison between GPU-based and sequential CPU implementations to measure performance improvements in terms of execution time, speedup, and memory usage, thereby demonstrating the computational benefits of GPU acceleration.

These contributions establish a unified benchmarking framework for evaluating GPU-based sorting algorithms under consistent experimental conditions. By analyzing diverse algorithmic strategies and input scenarios, this work contributes new insights into the practical performance characteristics of classical and modern sorting algorithms when optimized for parallel execution on CUDA-enabled platforms.

Related work

Various works have concentrated on the implementation and optimization of parallel sorting on GPUs. Singh et al. [19] conducted a survey of several sorting algorithms running on GPUs and underlined that the most popular sorting algorithms in CUDA-based implementations are MS, QS, RS, and Sample Sort. Their work emphasizes that although RS can be the fastest for large inputs because of its linear time complexity, comparison-based sorting methods, such as MS and QS, provide more stable performance across different input distributions.

Among sorting algorithms, MS has been in the limelight for a very long time regarding its suitability for parallel execution. Tanasic et al. [20] explored multi-GPU-based comparison sorting and found that cooperation by multiple GPUs yielded a substantial gain in sorting large-scale data. They were able to optimize the two most enigmatic parts, which are the partitioning of data and inter-GPU communication. Building upon these foundational ideas, Çetin et al. [21] proposed a memory-aware multi-GPU sorting framework that dynamically balances workload distribution while minimizing data transfer overhead through unified memory access and NCCL-based communication. Their evaluation on various large-scale datasets demonstrated not only improved throughput but also higher resource utilization across heterogeneous GPU configurations, making their approach more adaptable to modern CUDA environments.

While MS benefited from its structured partitioning, QS presents a very different challenge due to its heavy reliance on partitioning and recursive calls. Kumari and Singh [22] proposed a binary search-based parallel QS implementation that tried to reduce inter-thread communication overhead for partitioning efficiency.

Although BS is unsuitable for high-performance computing environments, some studies have developed GPU implementations of it for benchmarking and educational purposes. In the comparison study by Faujdar and Ghrera [18], although its performance scaled rather poorly for large datasets, it proved useful for evaluating thread synchronization strategies and memory access optimizations in parallel architectures.

Empirical comparisons of different sorting algorithms have also been conducted extensively to understand their real-world performance on GPUs. Schmid and Caceres [16] proposed the fix sort optimization, which enables off-the-shelf sorting algorithms to efficiently handle segmented sorting. Even though the work focuses mainly on segmented sorting, its benchmarking methodology yields useful general insights into parallel sorting efficiency.

Schmid et al. [15] conducted a detailed evaluation of segmented sorting strategies on GPUs, benchmarking multiple algorithms across seven devices with varied segment sizes and distributions. Their use of S-curve and heat-map analyses led to a recommendation map that selected the optimal strategy in 47.57% of cases, with a slowdown of less than 1.5× in over 93% of cases. This study highlights the importance of input-specific benchmarking in segmented sorting performance.

Outside these algorithm-specific optimizations, other works have focused on the hardware and CUDA-based performance optimizations for sorting. Yoshida et al. [11] present the influence that different CUDA versions may have on sorting performance on GPUs. They show how warp-scheduling, pipelining instructions, and adapting to memory hierarchy changes can seriously affect execution time. These are indications that software optimizations in a CUDA-based algorithm play a much more important role in sorting than ever before, for the high throughput of modern GPUs.

Li et al. [14] address the problem of top-k selection, finding the k largest or smallest values in a set, which is of utmost importance in data-intensive computing. Traditional methods rely on fast but size-constrained on-chip memory, which bounds k and limits scalability. Li et al. [14] present a parallel, optimized radix top-k selection on GPUs that scales to much larger k values with no loss of efficiency and is insensitive to input batch size and length. They employ a novel optimization framework designed for high resource and memory-bandwidth utilization, enabling large speedups over existing techniques.

Blanchard et al. [17] introduced ‘bucketMultiSelect’, a GPU-based algorithm for extracting multiple order statistics, such as percentiles, without fully sorting the data. For large vectors containing up to 2^28 double-precision values, their method selects 101 percentile values in under 95 ms, achieving over 8× speedup compared to an optimized CUDA-based merge sort. This approach is particularly relevant in applications where specific statistics are needed but full sorting is unnecessary, offering substantial performance savings.

To further improve sorting efficiency on the GPU, researchers have explored hybrid sorting techniques that combine multiple approaches. Gowanlock and Karsin [23] suggested a hybrid CPU/GPU sorting strategy, which performs dynamic workload distribution between CPU and GPU cores for the best execution time. The approach of these authors proves how task offloading can enhance the efficiency of sorting in heterogeneous computing environments. In addition, recent advancements beyond NVIDIA GPUs have emerged. Wróblewski et al. [24] investigated sorting on Huawei’s Ascend AI accelerators using a matrix-based parallel RS. They demonstrated impressive scalability, achieving up to 3.3× speedup over baseline RS implementations, and corroborated the continued relevance of radix-based schemes across diverse hardware platforms. As one of the first sorting implementations on the Ascend AI architecture, this work reinforces that RS remains a performance-critical strategy, even beyond CUDA-enabled GPUs.

While existing works have explored a variety of GPU-based sorting algorithms, most of them are limited in scope or consistency. Many studies focus on individual sorting techniques, such as RS or MS, without offering cross-type comparisons under uniform conditions. Others benchmark only one or two data distributions (e.g., random or sorted), making it difficult to evaluate algorithm robustness under different input patterns. Additionally, testing scales vary significantly, with some using small datasets or inconsistent hardware platforms.

Few studies offer a unified benchmarking framework that compares both comparison-based and non-comparison-based sorting algorithms under identical conditions. Moreover, BS is often excluded despite its instructional and architectural relevance in parallel environments. We address these gaps by implementing and evaluating five representative algorithms: MS, QS, BS, RS, and SS, using CUDA C++. Our testing uses a consistent hardware setup, a fixed large dataset of 10 million integers, and four distinct data distributions (random, sorted, reverse, and nearly sorted). This unified benchmarking provides a fair, controlled comparison of algorithmic behavior, scalability, and efficiency in real-world GPU computing scenarios.

Apart from GPU-directed approaches, much research has pursued performance improvements through multithreading and multi-core approaches. As an example, Al-sudani et al. [25] developed a multithreading approach for computing high-order Tchebichef polynomials with significant speed-up in the evaluation of polynomials. Mahmmod et al. [26] achieved significantly reduced execution time for high-order Hahn polynomials through multithreading, with remarkable runtime reduction over sequential approaches. At the architectural level, Hsu and Tseng [27] designed a framework for multithreading and heterogeneous simultaneous multithreading in accelerator-rich systems, and showed the effectiveness of utilizing intra-core and inter-accelerator parallelism. These efforts together highlight that multithreading on CPUs, GPUs, or even heterogeneous accelerators remains an essential way to improve algorithmic performance and is orthogonal to the GPU-specific sorting optimizations explored in this work.

New sorting algorithms that emphasize both time complexity and ease of parallel implementation have also been developed in recent years. SlowSort, for example, is a generalization of bitSort with an adapted parallel version for sorting large-scale integer data [28], while threshold-based sorting utilizes adaptive thresholds to accelerate tasks in dense wireless communication networks [29]. In the same vein, clusterSort combines clustering techniques with divide-and-conquer methods to achieve efficient in-place sorting for large data [30]. Additional contributions, such as the independent-subarray model [31], which splits data into balanced subproblems for improved workload distribution, and the mean-based sort algorithm [32], which leverages mean-value-guided partitioning for improved scalability, further enrich the collection of modern solutions. Together, these developments point sorting research toward algorithms that marry domain-specific solutions to parallel paradigms, and they provide valuable pointers for extending comparative studies.

To synthesize the literature discussed above, Table 1 summarizes key prior works, the sorting algorithms investigated, the platforms used (e.g., single vs. multi-GPU), and the primary contributions or performance outcomes. Our work is positioned as a unified benchmarking study that addresses gaps in consistency and algorithm diversity.

Table 1. Comparative summary of related work on GPU-based sorting algorithms.

Study Algorithm(s) Platform Key Contributions/Performance Gains
Tanasic et al. (2013) [20] MS, QS Multi-GPU Demonstrated performance gain through inter-GPU cooperation and partitioning optimizations.
Kumari and Singh (2014) [22] Parallel Quick Sort Single-GPU Reduced inter-thread communication via binary search-based partitioning.
Faujdar and Ghrera (2015) [18] BS Single-GPU Poor scalability; useful for benchmarking and evaluating memory access patterns and thread sync.
Blanchard et al. (2016) [17] bucketMultiSelect (Order Selection) Single-GPU Extracts multiple order statistics 8× faster than CUDA merge sort on 2^28 doubles without full sorting.
Singh et al. (2018) [19] MS, QS, RS, Sample Sort Single-GPU (CUDA) Survey of GPU sorting algorithms; RS noted for speed with large datasets, MS/QS for stability across inputs.
Schmid and Caceres (2019) [16] Fix Sort (Segmented) Single-GPU Enhanced segmented sorting performance and generalized benchmarking approach.
Gowanlock and Karsin (2019) [23] Hybrid CPU/GPU Sort Heterogeneous CPU-GPU Dynamic load balancing between CPU and GPU; optimized workload distribution for execution time.
Schmid et al. (2022) [15] Segmented Sort (Fix Sort, MS variants) Single-GPU Benchmarked across 7 GPUs with variable input/segment sizes; proposed recommendation map for optimal strategy selection
Çetin et al. (2023) [21] MS (Memory-aware) Multi-GPU Improved throughput via unified memory access and dynamic load balancing across heterogeneous GPUs.
Yoshida et al. (2024) [11] General GPU Sort (various) Single-GPU (CUDA) Analyzed CUDA version impacts on GPU sorting performance; emphasized warp scheduling and memory hierarchy.
Li et al. (2024) [14] Radix Top-K Sort Single-GPU Scalable radix-based Top-K selection; efficient for large datasets and batch-insensitive input size.
Wróblewski et al. (2025) [24] Matrix-based RS Ascend AI Accelerator Achieved up to 3.3× speedup over baseline radix sorts; demonstrates cross-platform applicability of radix optimization on next-gen hardware.
Wang and He (2025) [28] SlowSort (Bitmap-based, Deduplicating) Single-GPU/CPU (Prototype) Enhanced BitSort based algorithm for large scale integer datasets that performs sorting and deduplication using bitmap mapping, applies range compression via second-smallest and second-largest values to reduce memory usage while maintaining performance comparable to radix sort on dense inputs.
Our Work MS, QS, BS, RS, SS Single-GPU (CUDA C++) Unified benchmarking of five algorithms under consistent hardware and dataset setups; evaluates scalability, performance, and CPU-GPU efficiency.

Conceptual sorting algorithms with CUDA

Traditionally, sorting algorithms have run on CPUs via sequential or multithreaded approaches. This is adequate for small datasets but becomes a performance bottleneck in big data-driven applications. Unlike CPUs, which typically have a few powerful cores, GPUs have thousands of lightweight cores that can handle numerous operations simultaneously. This organization makes them well suited to algorithms that can be broken down into numerous independent operations, such as the element comparisons and swaps in sorting.

CUDA

The CUDA programming model operates on the principle that the host (CPU) and the device (GPU) function as distinct computing units, each with separate memory spaces [33]. CUDA divides tasks into thousands of threads, allowing GPUs to perform complex calculations faster than CPUs, making it ideal for data-intensive applications such as machine learning and real-time image processing [10]. Its high-performance capabilities are especially beneficial for fields like AI and scientific research, where sorting algorithms must handle large datasets. While sequential sorting is inefficient for such tasks, parallel sorting distributes the workload across multiple processors, improving execution speed and efficiency [11].

CUDA extends GPU functionality beyond graphics, allowing for general-purpose tasks like sorting [34]. GPUs excel at parallelism, with stream processors optimized for floating-point operations, unlike CPUs with fewer cores optimized for sequential tasks. Parallel sorting reduces execution time, and, depending on the algorithm, can decrease time complexity from O(n log n) to near-linear time. For instance, the NVIDIA GTX 1650 has thousands of stream processors, whereas CPUs have fewer cores, limiting parallel processing capabilities [35]. CUDA integrates seamlessly with C-based applications, enabling efficient parallelization [10].

Merge sort, quick sort, bubble sort, radix top-K selection sort, and slow sort in parallel

MS and QS benefit from parallelization due to their divide-and-conquer approach [18]. MS typically outperforms QS on GPUs because of its uniform workload and predictable memory access patterns, which align well with CUDA’s parallel architecture. QS, while efficient in ideal conditions, can face imbalances due to pivot selection. Optimized parallel QS, using multi-pivot strategies and balanced partitioning, works well on skewed datasets [18]. Parallel BS, in contrast, is simpler but has limited scalability: it distributes element comparisons across processors but remains inefficient for large datasets due to its O(n²) complexity. However, it is effective for smaller datasets or when simplicity is prioritized.

RS is another solution, designed to find the k largest or smallest elements when that priority is distinct from sorting the entire dataset [14]. Unlike MS and QS, which sort all n elements, RS returns only the top k; it is especially useful when k is significantly smaller than n. Merge-based GPU top-k algorithms, by comparison, are not scalable because they rely on on-chip memory and are therefore limited by the size of k. SS, on the other hand, integrates sorting and deduplication through bitmap mapping [28]. Unlike comparison-based algorithms, SS avoids repeated element comparisons and instead relies on bit-level presence encoding, while, in contrast to radix-based approaches, it emphasizes a reduced memory footprint rather than maximal throughput.

In summary, MS is best for uniformly large datasets demanding consistent performance, QS is applicable for loose, quick sorting, parallel BS is best suited to tasks requiring simple implementation on small datasets, RS is best when the main requirement is finding the top k elements, and SS is particularly advantageous where memory efficiency and inherent deduplication of large-scale integer datasets are prioritized.
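To make the bit-level presence encoding behind SS concrete, the following is a minimal CPU sketch (the function name `bitmapSortDedup` and its signature are ours, not the paper’s): each integer in a known range maps to one bit, so a single marking pass both deduplicates and implicitly sorts, and scanning the bitmap in order emits the sorted distinct values. A GPU version would parallelize the marking loop across threads with atomic OR operations.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU sketch of bitmap-based presence encoding (the idea behind SS).
// Assumes every value lies in [lo, hi]; duplicates collapse onto the same bit.
std::vector<int> bitmapSortDedup(const std::vector<int>& data, int lo, int hi) {
    const std::size_t range = static_cast<std::size_t>(hi - lo) + 1;
    std::vector<std::uint64_t> bits((range + 63) / 64, 0);
    for (int v : data) {                                   // mark presence: one bit per value
        std::size_t off = static_cast<std::size_t>(v - lo);
        bits[off / 64] |= std::uint64_t(1) << (off % 64);
    }
    std::vector<int> out;
    for (std::size_t i = 0; i < range; ++i)                // scan bits in order -> sorted + deduplicated
        if ((bits[i / 64] >> (i % 64)) & 1)
            out.push_back(lo + static_cast<int>(i));
    return out;
}
```

The auxiliary memory is one bit per value in the range, which is the compact footprint the SS design trades throughput for.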

Merge sort using GPU computing with CUDA

As shown in Fig 2, the parallel MS algorithm follows three steps: (1) the dataset is recursively divided into smaller sub-arrays, distributed across GPU cores; (2) each sub-array is sorted independently by dedicated threads; and (3) sorted sub-arrays are merged in parallel. CUDA handles non-sequential writing during merging, making the process more efficient [36].

Fig 2. Parallel merge sort.


GPU-based parallel MS achieves time complexity O(log n) and is significantly faster than single-threaded CPU QS for large datasets. It is memory-bound, with performance dependent on efficient memory access. Optimizations such as reducing memory fetches can further enhance performance, enabling CUDA GPUs to sort datasets of over 512,000 elements up to 14 times faster than CPU QS [23]. Applications include tasks like sorting vertex distances in 3D rendering, where large-scale computations are required. CUDA’s thread-block structure and shared memory allow independent merging of array segments, reducing synchronization overhead. While bottlenecks arise in data distribution and memory latency, CUDA’s shared memory mitigates these issues, ensuring efficiency in large-scale sorting tasks.
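The divide-sort-merge structure described above can be sketched on the CPU as a bottom-up merge sort: in each pass the array is viewed as runs of width w, and every pair of adjacent runs is merged independently, which is exactly the independence a GPU exploits by assigning one thread block per merge. This is an illustrative sketch, not the paper’s kernel.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// CPU sketch of the bottom-up structure a GPU merge sort parallelizes.
// Every std::merge call within one pass touches a disjoint slice of the array,
// so all merges of a pass could run concurrently on separate GPU thread blocks.
void bottomUpMergeSort(std::vector<int>& a) {
    std::vector<int> buf(a.size());
    for (std::size_t w = 1; w < a.size(); w *= 2) {        // log(n) passes
        for (std::size_t lo = 0; lo < a.size(); lo += 2 * w) {  // independent merges
            std::size_t mid = std::min(lo + w, a.size());
            std::size_t hi  = std::min(lo + 2 * w, a.size());
            std::merge(a.begin() + lo, a.begin() + mid,
                       a.begin() + mid, a.begin() + hi, buf.begin() + lo);
        }
        std::copy(buf.begin(), buf.end(), a.begin());
    }
}
```

The double buffer mirrors the non-sequential writes mentioned above: each pass writes merged output into scratch memory before copying back, avoiding in-place conflicts between concurrent merges.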

Eq (1) [37] gives the time complexity of the parallel MS algorithm, representing the time each processor spends working on its assigned portion of the input.

Θ((n/p) log(n/p)), (1)

where the fraction n/p indicates that the input of size n is evenly divided among p processors, meaning each processor handles n/p elements. The log(n/p) factor arises because each processor still needs to recursively sort its portion.

Eq (2) [37] represents the total computational complexity, since each of the n elements participates in the merging process.

Θ(n). (2)

Eq (3) [37] captures the main overhead operations, such as data distribution and communication among processors.

Ω(n), (3)

it introduces additional overhead that must be accounted for in the parallel runtime analysis.

The overall parallel runtime of the parallel MS is expressed in Eq (4) [37]:

T_p = Θ((n/p) log(n/p)) + Θ(n) + Ω(n). (4)

Parallel quick sort using GPU computing with CUDA

Parallel QS, utilizing the divide-and-conquer strategy, partitions the dataset into sub-arrays for concurrent processing by GPU cores. The algorithm follows three steps: (1) Pivot Selection, where a pivot is chosen using techniques like random or median-of-three; (2) Partitioning the array is divided around the pivot, handled by GPU threads; and (3) Recursive sorting sub-arrays are sorted concurrently [38].
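The two building blocks that the parallel QS distributes across GPU threads, pivot selection and partitioning, can be sketched on the CPU as follows (function names `medianOfThree` and `partitionAround` are ours, for illustration; on a GPU, the partition step itself would be parallelized across many threads per sub-array):

```cpp
#include <utility>
#include <vector>

// Step (1): median-of-three pivot selection, one of the techniques named above.
int medianOfThree(int a, int b, int c) {
    if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
    if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
    return c;
}

// Step (2): Hoare-style partition around a pivot value taken from the array.
// Returns j such that every element of [lo..j] <= pivot <= every element of [j+1..hi].
int partitionAround(std::vector<int>& v, int lo, int hi, int pivot) {
    int i = lo - 1, j = hi + 1;
    while (true) {
        do { ++i; } while (v[i] < pivot);
        do { --j; } while (v[j] > pivot);
        if (i >= j) return j;           // the two halves feed step (3): recursive sorting
        std::swap(v[i], v[j]);
    }
}
```

A balanced pivot makes the two returned halves of similar size, which is precisely the workload-balance concern for GPU thread scheduling discussed below.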

QS has an average-case time complexity of O(n log n), and random or median pivot selection reduces the risk of worst-case performance. On GPUs, QS delivers significant speedups but faces challenges like workload imbalance due to pivot-dependent partitioning [18]. Efficiency depends on balancing workloads across Streaming Multiprocessors and minimizing memory contention [39].

Eq (5) [37] captures the per-processor work, including the frequent data exchange needed to ensure that elements are correctly partitioned around the chosen pivots.

Θ((n/p) log(n/p)), (5)

since each processor handles a subarray of size n/p, the efficiency of this process depends on reducing the amount of data exchanged between processors while ensuring that each processor gets an equal share of the work.

Eq (6) [37] demonstrates the time complexity due to the logarithmic depth of recursive partitioning and the need for synchronization among processors at each step.

Θ((n/p) log p). (6)

Eq (7) [37] represents the overhead from inter-processor communication and synchronization.

Θ(log² p), (7)

the quadratic logarithmic term quantifies this growing complexity, highlighting the trade-off between efficiency and communication overhead.

Eq (8) [37] shows the formal representation of parallel QS runtime.

T_p = Θ((n/p) log(n/p)) + Θ((n/p) log p) + Θ(log² p). (8)

In a parallel QS, the input of size n is split equally among p processors, assigning each processor to work on a subarray of size n/p. This step of splitting these subarrays into smaller segments works best if the pivots are selected well. If the pivots are not selected well, then the partitions are unbalanced, and hence the recursive process is slower and overall performance degrades [18].

Parallel QS is widely used in graphics processing tasks such as organizing scene objects for efficient rendering and optimizing shadow calculations. By partitioning objects based on spatial properties, it enhances rendering performance and reduces computational bottlenecks [40]. Additionally, it is employed in physics simulations, real-time collision detection, and distributed computing for large-scale datasets.

Parallel bubble sort using GPU computing with CUDA

Parallel BS is a simpler algorithm, but can be parallelized to exploit GPU architectures. It divides the input array into adjacent pairs, comparing and swapping them simultaneously. Unlike the sequential version, this parallel approach reduces execution time, especially for large datasets [18]. GPU-based parallel implementations use thread-level parallelism and shared memory. Each thread processes a data subset, performing comparisons and swaps concurrently. Parallelization speeds up execution, though more memory is needed for concurrent threads and shared memory management.

In Parallel BS, each stage consists of two phases: the odd phase, where comparisons and swaps occur between odd and even indexed elements, and the even phase, where comparisons and swaps happen between even and odd indexed elements. These phases continue until the array is fully sorted. CUDA maps these phases to GPU threads, enabling simultaneous comparisons and reducing latency by minimizing global memory access.
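The alternating odd and even phases described above correspond to the classic odd-even transposition sort. A minimal CPU sketch follows (illustrative only, not the paper’s kernel): within a single phase every compare-and-swap touches a disjoint pair of elements, so on a GPU all pairs of that phase can be processed by concurrent threads, with a synchronization barrier between phases.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// CPU sketch of the odd-even phases a CUDA bubble sort maps to threads.
// Within one phase all compare-swaps are on disjoint pairs (parallelizable);
// n phases guarantee a sorted array.
void oddEvenTranspositionSort(std::vector<int>& a) {
    const std::size_t n = a.size();
    for (std::size_t phase = 0; phase < n; ++phase) {
        // Phase parity picks the starting index: even phases pair (0,1),(2,3),...
        // odd phases pair (1,2),(3,4),...
        for (std::size_t i = phase % 2; i + 1 < n; i += 2)   // independent pairs
            if (a[i] > a[i + 1]) std::swap(a[i], a[i + 1]);
    }
}
```

The inner loop is the part a CUDA kernel would launch one thread per pair for; the outer loop’s phase boundary becomes a kernel launch or a grid-wide barrier.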

As represented in Eq (9) [18], the parallel execution still adheres to BS’s inherent O(n²) complexity.

$\frac{n^2}{p}$. (9)

Eq (10) [18] represents the overhead introduced by inter-processor communication and synchronization during the sorting process.

$n \log p$, (10)

Since adjacent processors must exchange data to ensure correct ordering, the number of synchronization steps follows a logarithmic scale with respect to the number of processors.

Eq (11) [18] shows the total time complexity of BS by quantifying the trade-offs, illustrating how parallelization reduces execution time:

$T_p = O\left(\frac{n^2}{p}\right) + O(n \log p)$. (11)

While BS performs reasonably well for small to mid-sized inputs, its reliance on repeated pairwise comparisons and frequent data exchanges makes it less suitable for high-performance computing [18]. In contrast, algorithms like Parallel MS and QS achieve superior scalability by minimizing redundant operations.

Parallel radix top-K selection sort using GPU computing with CUDA

RS is a parallel GPU algorithm designed to address the scalability limits of existing top-k selection algorithms. Traditional approaches, such as merge-based algorithms and priority queues, are constrained by the size of on-chip memory, which imposes an upper bound on k. RS applies distribution-based radix selection with partial sorting, allowing it to handle much larger values of k efficiently [14].

The algorithm operates in two passes. During the radix-select phase, input elements are treated as fixed-length bit strings, and digits are read from the most significant to the least significant. In each iteration, the algorithm builds a histogram counting the number of elements in each digit bin, selects the bin containing the kth element, and keeps only that bin’s elements for the next round [14]. This repeats until the candidate set reduces to a single value, the pivot, which is the kth item.

During the filter phase, the algorithm scans the original input and uses the pivot as a threshold to yield the top-k items. The output is unsorted by default; the application can apply a post-sort if it requires ordered results [14].
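
A serial sketch of the radix-select phase for unsigned 32-bit keys may clarify the narrowing loop. `radix_select` is an illustrative helper, not the paper’s implementation (which parallelizes the histogram and filtering on the GPU); it uses 8-bit digits and a 1-based rank k for the kth smallest key.

```cpp
#include <cstdint>
#include <vector>

// Illustrative serial radix select: histogram the current candidates
// on one digit (most significant first), keep only the bin holding the
// kth element, and accumulate that digit into the pivot value.
uint32_t radix_select(std::vector<uint32_t> cand, std::size_t k) {
    uint32_t pivot = 0;
    for (int shift = 24; shift >= 0; shift -= 8) {
        std::size_t hist[256] = {0};
        for (uint32_t v : cand) ++hist[(v >> shift) & 0xFF];
        std::size_t bin = 0, seen = 0;           // locate the kth bin
        while (seen + hist[bin] < k) seen += hist[bin++];
        k -= seen;                               // rank within that bin
        pivot |= static_cast<uint32_t>(bin) << shift;
        std::vector<uint32_t> next;              // filter to that bin
        for (uint32_t v : cand)
            if (((v >> shift) & 0xFF) == bin) next.push_back(v);
        cand.swap(next);
    }
    return pivot;   // the kth smallest key; the filter phase then
                    // thresholds the original input against it
}
```

In the GPU version the histogram is built with a parallel reduction and the bin is located with a prefix sum, which is where the prefix-sum term of Eq (13) comes from.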

Eq (12) [14] shows the cost of iteratively narrowing down the candidates across $\log_{\mathrm{radix}}\frac{n}{p}$ digit passes.

$\Theta\left(\frac{n}{p}\log_{\mathrm{radix}}\frac{n}{p}\right)$, (12)

where each pass scans and filters data of size n/p.

As shown in Eq (13) [14], for every digit pass, we need a parallel prefix sum to compute bin counts and locate the pivot bin.

$\Theta\left(\frac{n}{p}\log p\right)$. (13)

Furthermore, Eq (14) [14] presents the inter-thread coordination overhead that comes from hierarchical synchronizations.

$\Theta(\log^2 p)$. (14)

As a result, Eq (15) [14] demonstrates the total time complexity of the parallel RS algorithm.

$T_p = \Theta\left(\frac{n}{p}\log_{\mathrm{radix}}\frac{n}{p}\right) + \Theta\left(\frac{n}{p}\log p\right) + \Theta(\log^2 p)$, (15)

where n is the total number of elements and p is the number of threads (parallel units).

Parallel slow sort using GPU computing with CUDA

SS is a bitmap-based integer sorting algorithm that orders data by mapping input values directly into a compact presence representation. By eliminating repeated element comparisons and recursive partitioning, SS reduces control divergence and enables efficient parallel execution on GPU architectures [28]. The algorithm is particularly effective for integer datasets with compact value ranges, where bitmap traversal yields a direct global ordering.

The algorithm consists of three phases. In the first phase, the input array is scanned to determine the minimum, second smallest, maximum, and second largest elements, which are used to define an effective value range and compute an offset for handling negative integers. In the second phase, each input element is independently mapped to a corresponding bit position in a bitmap, marking its presence. This mapping process constitutes the core sorting operation and is naturally parallelizable. In the final phase, the bitmap is scanned in increasing index order to reconstruct the sorted output by emitting values corresponding to set bits [28].
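
The three phases can be sketched as follows. This is a simplified sketch rather than the paper’s implementation: it bounds the range using only the minimum and maximum (the paper also tracks the second extremes), assumes distinct integers (a presence bit collapses duplicates), and runs serially where the GPU would mark bits in parallel.

```cpp
#include <cstdint>
#include <vector>

// Simplified bitmap sort sketch: range scan, presence marking, and
// in-order bitmap traversal. Assumes distinct input values.
std::vector<int> bitmap_sort(const std::vector<int>& in) {
    // Phase 1: value range and offset (handles negative integers).
    int lo = in[0], hi = in[0];
    for (int v : in) { if (v < lo) lo = v; if (v > hi) hi = v; }
    const std::size_t range =
        static_cast<std::size_t>(static_cast<long long>(hi) - lo + 1);
    // Phase 2: mark presence bits; every mark is independent, which is
    // what makes this phase naturally parallel on the GPU.
    std::vector<uint8_t> bits((range + 7) / 8, 0);
    for (int v : in) {
        std::size_t idx =
            static_cast<std::size_t>(static_cast<long long>(v) - lo);
        bits[idx / 8] |= static_cast<uint8_t>(1u << (idx % 8));
    }
    // Phase 3: traverse set bits in increasing index order.
    std::vector<int> out;
    for (std::size_t idx = 0; idx < range; ++idx)
        if (bits[idx / 8] & (1u << (idx % 8)))
            out.push_back(static_cast<int>(static_cast<long long>(idx) + lo));
    return out;
}
```

The bitmap occupies (k+7)/8 bytes for an effective range k, which is the k/8 term that reappears in Eqs (17) and (18).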

As shown in Eq (16), the cost of the bitmap construction phase scales linearly with the number of elements distributed across p parallel threads, since each element is processed independently during the mapping step.

$\Theta\left(\frac{n}{p}\right)$, (16)

where n denotes the total number of input elements [28].

During the bitmap traversal and output reconstruction phase, the algorithm scans the compressed bitmap whose length depends on the effective value range k, defined as the difference between the second largest and second smallest elements. As shown in Eq (17), the parallel cost of this phase scales with the size of the bitmap stored in bytes.

$\Theta\left(\frac{k}{8p}\right)$, (17)

where each bit represents the presence of one integer value [28].

As a result, Eq (18) presents the total parallel time complexity of the Slow Sort algorithm by combining the costs of bitmap construction and bitmap traversal.

Tp=Θ(np)+Θ(k8p), (18)

where p represents the number of parallel threads [28].

Unlike RS, SS produces a globally sorted output directly after bitmap traversal and does not require post-processing or additional merging steps [28]. However, its performance depends on the magnitude of k: compact value ranges enable high throughput and low memory overhead, while sparse ranges tend to degrade efficiency. This behavior is consistent with the experimental observations reported in our evaluation.

Parallel memory complexity

While parallel time complexity has been analyzed above, the memory requirements of the GPU-based implementations must also be considered. Table 2 gives the asymptotic memory complexity of the compared algorithms in parallel scenarios, including auxiliary buffers and recursion overheads.

Table 2. Parallel memory complexity of evaluated sorting algorithms.

Algorithm Parallel memory complexity (order)
QS O(log n) recursion stack + auxiliary partition buffers [41]
MS O(n) temporary arrays required for merging [42]
BS O(1) (in-place; negligible additional memory)
RS O(n + k) for counting/bucket arrays (k = digit groups) [43,44]
SS O(n + k/8) bitmap mapping space, where k is the compressed value range [28]

As shown in Table 2, BS is almost in-place, while QS employs logarithmic recursion space and auxiliary partition buffers [41]. MS incurs heavy O(n) overhead due to the temporary arrays required for merging, although modern multi-way variants reduce global memory traffic and mitigate shared-memory conflicts [42]. RS requires O(n) additional memory along with bucket arrays linear in the number of digit groups; however, newer optimizations reduce global memory operations per pass [44], and hybrid radix schemes further decrease memory transfers in practice [43]. In contrast, SS relies on bitmap-based presence encoding and range compression, requiring O(n + k/8) memory, where k denotes the effective compressed value range. This design yields a substantially lower auxiliary memory footprint than traditional radix-based approaches, particularly for dense integer datasets, while preserving direct global ordering [28].

Experimental setup and results evaluation

This section outlines the experimental setup used to evaluate the performance of GPU-accelerated sorting algorithms. The evaluation involved a comparative analysis of parallel and sequential implementations of MS, QS, BS, RS and SS, applied to a dataset consisting of 10,000,000 generated integers. The datasets used in this study (four CSV files, each containing 10,000,000 integers) are publicly available on Figshare [45].

The dataset size of ten million integers was selected based on preliminary testing across various input sizes ranging from tens of thousands to tens of millions. Smaller datasets did not effectively reveal the benefits of GPU parallelism, while significantly larger ones resulted in excessive memory consumption and impractically long runtimes. The chosen size provided a balanced workload that clearly exposed the performance differences between sequential and parallel sorting approaches.

Programming language selection was guided by both technical compatibility and development efficiency. CUDA was used for GPU implementations, which naturally directed the choice toward C/C++ due to CUDA’s native support for these languages. Visual Studio served as the primary development environment for CUDA C, offering advanced features such as code autocompletion, integrated debugging tools, and seamless GPU compilation workflows. These capabilities enabled the efficient development and validation of the parallel sorting algorithms.

For the sequential versions, C++ was employed. Its maturity in terms of platform independence, simplicity of syntax, and availability for algorithm prototyping made it a suitable option in the context of the research conducted. Support from mature compilers and stable development tools available further facilitated the implementation process to be smooth and consistent. Together, these tools and platforms provided a reliable and effective environment for designing, testing, and comparing both sequential and parallel sorting algorithms.

GPU and CPU configurations

The sequential implementations of MS, QS, BS, SS and RS were developed using C++. The experiments were performed on a Windows 10 (64-bit) operating system with an Intel Core i5-9400F CPU (3.89 GHz, 6 cores, 6 threads) and an NVIDIA GTX 1660 SUPER GPU. The parallel implementations of the sorting algorithms were written in CUDA C and executed using Visual Studio as the development environment. Datasets were initialized at runtime using standard array-based generation for each data type. All data was allocated in host memory, generated at runtime for each test type, and transferred to device memory using cudaMemcpy prior to kernel execution. Table 3 summarizes the experimental setup parameters for algorithm time complexity analysis.

Table 3. Experiment setup parameters.

Parameter Description
Dataset Size 10,000,000 integers
Dataset Types Random, Reverse Sorted, Sorted, Nearly Sorted
Programming Language C++ (CUDA for parallel algorithms), C++ (for sequential algorithms)
GPU Model Discrete NVIDIA GTX 1660 SUPER
CUDA Toolkit Version CUDA 12.6.3
NVIDIA Driver Version 566.14 (WHQL)
CUDA Compiler nvcc (NVIDIA CUDA Compiler)
Compiler Flags -O3, -arch=sm_75, -lineinfo
CPU Model Intel® Core i5-9400F (3.89 GHz, 6 cores, 6 threads)
Operating System Windows 10 (64-bit)
Sorting Algorithms Evaluated MS, QS, BS, RS, SS
Performance Metrics Execution Time (ms), Speedup, Memory Usage
Testing Environment Visual Studio with CUDA for parallel sorting
Sequential Environment C++

The CUDA kernels used in this study were launched with a thread block size of 256 and a grid size computed as N/256, where N denotes the dataset size. Shared memory was not utilized, and all memory accesses were performed using standard global memory access through array indexing. Additionally, thread synchronization was handled using __syncthreads() to maintain correctness. No explicit coalescing or warp-level memory optimizations were applied, as the kernel followed a straightforward implementation pattern aimed at establishing baseline performance metrics.
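
The launch geometry described above can be sketched with the usual ceiling-division idiom; this is a sketch under the assumption that the kernel guards each thread with an `if (tid < N)` bounds check, so a dataset size that is not a multiple of 256 still receives enough blocks.

```cpp
#include <cstddef>

// Ceiling division: enough 256-thread blocks to cover n elements;
// surplus threads in the last block are masked off inside the kernel.
constexpr std::size_t kBlockSize = 256;

constexpr std::size_t grid_size(std::size_t n) {
    return (n + kBlockSize - 1) / kBlockSize;
}
```

For the 10,000,000-element datasets used here this yields 39,063 blocks of 256 threads.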

In the case of randomly distributed data, all the parallel GPU approaches greatly outperform their sequential CPU implementations, as evident from Fig 3(a). Among the comparison-based approaches, MS took 2212.33 ms on the CPU compared to 21.4 ms on the GPU, an acceleration of roughly 100×. QS also accelerated greatly, taking 1461.97 ms on the CPU compared to 15.1 ms on the GPU. In contrast, BS remained slow even when parallelized, taking 7,377.8 ms on the GPU compared to 123,321.9 ms on the CPU. The non-comparison-based approaches showed the best performance on this task. RS cut the execution time down from 240.8 ms on the CPU to merely 4.828 ms on the GPU, owing to its digit-wise bucketing technique and highly parallel histogram calculations. SS showed the best performance, finishing in 3.989 ms on the GPU compared to 74.07 ms on the CPU, underscoring the efficiency of its parallel bitmap construction and presence-encoding scheme.

Fig 3. Comparison of sequential and parallel sorting algorithms under different input distributions: (a) random input, (b) nearly sorted input, (c) sorted input, and (d) reversed input.

Fig 3

In the nearly sorted scenario, all algorithms profited from the reduced irregularity of the input, as can be seen in Fig 3(b). QS performed best on these inputs: its sequential execution time dropped to 250.9 ms and its parallel execution time to 15.0 ms. MS had stable performance, with parallel execution times between 19.5 and 19.9 ms. BS made very little progress here, with a parallel execution time of 7,150.5 ms versus 7,176.1 ms sequentially, because early completion limits parallelism. RS continued to perform well, taking only 4.867 ms on the GPU versus 381.2 ms on the CPU. SS likewise remained fast, completing in 4.977 ms versus 55.13 ms sequentially, confirming the bitmap-based strategy’s relevance even when strong ordering exists.

The sorted case is illustrated in Fig 3(c): MS continued to demonstrate stable performance, with a sequential execution time of 2426.87 ms and a parallel execution time of 19.9 ms. QS showed reduced acceleration relative to other inputs, as its sequential execution time increased to 2233.07 ms while the parallel version completed in 19.8 ms, indicating sensitivity to pivot behavior despite optimization strategies. BS achieved its best sequential behavior due to the absence of swaps; however, parallel execution remained slow at 7,176.1 ms, offering limited advantage. RS performed consistently, completing in 4.85 ms on the GPU compared to 88.12 ms on the CPU. SS also maintained stable and fast execution, with the parallel version completing in 4.125 ms versus 66.43 ms sequentially. This consistency arises because SS does not rely on comparisons or recursive partitioning, but instead restores sorted order directly through bitmap traversal.

Reverse-sorted data, presented in Fig 3(d), typically represents a challenging scenario for comparison-based algorithms. MS showed reliable performance, with a parallel execution time of 19.6 ms compared to 1588.07 ms sequentially. QS exhibited its worst sequential behavior in this category, with execution time rising to 2686.7 ms, while the parallel version completed in 17.5 ms, demonstrating that GPU parallelism mitigates but does not eliminate pivot-related sensitivity. BS remained inefficient, with execution times decreasing from 109,529.1 ms sequentially to 7,167.6 ms in parallel, still significantly slower than other approaches. RS maintained its robustness, executing in 4.84 ms on the GPU compared to 110.93 ms sequentially, showing little sensitivity to input order. SS also performed consistently, reducing execution time from 58.10 ms on the CPU to 4.149 ms on the GPU. This stability under reverse ordering highlights SS’s advantage over comparison-based methods, as its bitmap-based presence encoding avoids unfavorable access patterns caused by swaps or pivot-driven partitioning.

Table 4 presents a comparative summary of the best and worst parallel execution times for each sorting algorithm across different dataset types.

Table 4. Parallel execution time comparison.

Algorithm Best Case Worst Case
MS Nearly sorted data (19.5 ms) Random data (21.4 ms)
QS Nearly sorted data (15.0 ms) Sorted data (19.8 ms)
BS Nearly sorted data (7,150.5 ms) Random data (7,377.8 ms)
RS Random data (4.83 ms) Reverse sorted data (4.87 ms)
SS Random data (3.989 ms) Nearly sorted data (4.977 ms)

The results presented in Figs 4 and 5 provide a comparative performance analysis of the five sorting algorithms, BS, MS, QS, RS, and SS, under sequential and parallel implementations.

Fig 4. The performance of the sequential sorting algorithms.

Fig 4

Fig 5. The performance of the parallel sorting algorithms.

Fig 5

From the sequential execution results shown in Fig 4, it is evident that BS consistently exhibits the highest execution time across all dataset types, reaffirming its inefficiency for large datasets due to its quadratic time complexity.

In contrast, RS, SS, and QS demonstrate significantly lower execution times, with SS and RS achieving the best overall performance. MS shows stable and predictable behavior across dataset types, benefiting from its O(n log n) complexity, although it does not surpass RS, SS, and QS in terms of raw execution speed. Furthermore, the non-comparative approaches employed by RS and SS allow them to handle larger datasets more efficiently, whereas QS’s performance can fluctuate slightly depending on the initial order of the input data.

The parallel execution results depicted in Fig 5 highlight the substantial improvements achieved through GPU based parallelism. SS, together with RS, demonstrates the strongest performance in the parallel setting, with SS achieving the lowest execution times and the highest overall speedup across all dataset types. MS and QS also benefit from parallelization, though to a lesser extent than SS and RS.

This performance advantage arises from the absence of comparison operations in SS and RS, allowing their workloads to be decomposed into highly parallel tasks with minimal dependence between threads. SS benefits from bitmap-based presence encoding for efficient parallel construction and restoration of the sorted output, while RS exploits digit-wise bucketing to process multiple keys concurrently. QS benefits primarily on random and nearly sorted data, whereas its gains are more moderate for sorted and reverse sorted datasets due to partitioning overhead and reduced load balancing opportunities. MS maintains stable performance across dataset types, with speedup attributed to efficient merging of pre-ordered elements, although its parallel efficiency is slightly constrained by memory overheads from recursive splitting and merging.

In contrast, BS shows the least improvement under parallel execution. Its sequential comparison and swapping mechanism imposes frequent synchronization and limits concurrency, making it poorly suited for GPU parallel environments.

Additionally, to assess the reliability of observed performance, we simulated 30 repeated runs per configuration and computed standard deviation values, summarized in Table 5.

Table 5. Execution time statistics (30 runs, GPU parallel execution).

Input Type BS MS QS RS SS
Nearly Sorted Mean: 7150.5 Mean: 19.5 Mean: 15.0 Mean: 4.87 Mean: 4.977
Std: 250.313 Std: 3.108 Std: 0.228 Std: 0.161 Std: 0.357
Random Mean: 7377.8 Mean: 21.4 Mean: 15.1 Mean: 4.83 Mean: 3.989
Std: 250.801 Std: 3.112 Std: 0.263 Std: 0.162 Std: 0.623
Reverse Sorted Mean: 7147.2 Mean: 19.6 Mean: 17.5 Mean: 4.84 Mean: 4.149
Std: 257.390 Std: 3.131 Std: 0.271 Std: 0.171 Std: 0.741
Sorted Mean: 7176.1 Mean: 19.9 Mean: 19.8 Mean: 4.85 Mean: 4.125
Std: 251.276 Std: 3.133 Std: 0.328 Std: 0.179 Std: 0.679

Mean and standard deviation are based on 30 executions for each algorithm and input type. Execution times are in milliseconds.

This statistical comparison reveals clear differences in the stability and consistency of the sorting algorithms under parallel GPU execution. RS was the most stable and consistent algorithm, with uniform execution times across all input types and minimal fluctuation between runs. Its steady performance attests to its scalability and efficiency for large-scale parallel workloads. SS also demonstrated generally stable behavior, with relatively small variation across runs, which is expected because its bitmap-based mapping and restoration phases avoid pivot sensitivity and heavy branching. However, its variability can increase depending on the effective bitmap density and the distribution of values within the compressed range. MS followed, reliably delivering consistent results through its regular merging phases and balanced data handling, making it less prone to input-order variation. QS performed well overall but with slightly more volatility, particularly on non-random and reverse-sorted datasets, since it is inherently prone to partition imbalance and pivot-selection effects. BS remained the least stable, with high runtime variance even under optimal circumstances. These results confirm that while all the algorithms benefit to some extent from GPU parallelization, RS and MS are very robust, whereas QS and especially BS are less robust across varying data distributions.

The relative performance comparison of sequential and parallel implementations over the GPU reveals strong efficiency advantages and clear differences in scalability across the algorithms. SS is consistently the best performer in parallel execution, achieving near uniform GPU run times of approximately 3.99 to 4.98 ms across the evaluated input configurations. Compared with its sequential execution times of roughly 55.13–74.07 ms, this demonstrates a substantial speedup and highlights SS’s suitability for GPU execution when sorting large scale integer datasets. Its relatively stable GPU timings across input types are explained by its bitmap-based structure, where computation is dominated by parallel bitmap construction and deterministic restoration rather than pivot selection or swap-heavy operations. RS follows closely as the next best performer, maintaining consistently low and near uniform execution times of approximately 4.83 to 4.87 ms across all input configurations. This impressive uniformity, compared with its sequential times of 240.8–381.2 ms, attests to its inherent linear time behavior and insensitivity to input ordering, justifying its outstanding performance on parallel hardware systems. The low variance among datasets indicates strong workload balance and minimal divergence among thread executions, making RS a highly robust choice for high throughput GPU pipelines. MS shows superb parallel scalability with its running time reducing from ≈1773–2427 ms sequentially to 19.5–21.4 ms parallel, which is an approximate 100× speedup. This is because of its nicely distributed merge steps and favorable memory access patterns. While slower than SS, MS’s performance is consistent for all input types with only minor variations in running time, showing its consistency and balanced parallel workload distribution. 
QS similarly benefits greatly from GPU acceleration, cutting its running time from ≈1462–2687 ms sequentially to ≈15–19.8 ms in parallel, a decrease of nearly two orders of magnitude. QS remains partially sensitive to input distribution, though: while random and nearly sorted inputs yield best times of about 15 ms, reverse-sorted inputs increase the runtime to about 17.5 ms. This sensitivity accords with QS’s dependence on pivot selection and partitioning balance, which can cause uneven workload distribution for certain input sequences. Conversely, BS, while improved by parallelization from ≈68,000–123,000 ms sequentially to ≈7150–7377 ms, remains several orders of magnitude slower than the other algorithms. Its quadratic complexity and synchronization overhead make it fundamentally inefficient under GPU parallelism, leading to poor scalability and disproportionately large execution times across all input categories. The small variations between input categories further confirm that BS’s performance bottleneck lies in its algorithmic structure rather than the data distribution.

Table 6 shows the memory consumption patterns of the evaluated GPU-based parallel sorting algorithms, revealing significant variation, primarily due to differences in algorithmic structure and data access behavior.

Table 6. Memory usage (in Bytes) of GPU-based parallel sorting algorithms (10,000,000 integers).

Input Type BS MS QS RS SS
Nearly Sorted 40,367,104 120,540,672 40,196,608 121,744,383 1,652,781
Random 40,367,104 120,540,672 40,196,608 121,744,383 1,652,781
Reverse Sorted 40,367,104 120,540,672 40,196,608 121,744,383 1,652,781
Sorted 40,367,104 120,540,672 40,196,608 121,744,383 1,652,781

SS exhibits the lowest and most compact memory footprint among the evaluated GPU algorithms, remaining essentially constant across all dataset types. This efficiency arises from its bitmap-based design, which encodes the presence of values within a bounded range rather than maintaining full-sized auxiliary arrays for merging or multi-pass redistribution. In particular, SS relies primarily on a compressed bit vector and a small set of scalar bounds, such as the minimum and maximum values and the second extremes, so its memory requirements are largely insensitive to input ordering once the bitmap range is determined. As a result, SS achieves stable memory behavior across sorted, nearly sorted, random, and reverse-sorted inputs, emphasizing its suitability for memory-constrained GPU environments and workloads that benefit from inherent deduplication.

MS demonstrates stable yet considerably high memory consumption across all dataset types, with usage consistently near 120–121 MB. This elevated requirement arises from its intrinsic reliance on an auxiliary buffer equivalent in size to the input array, along with temporary subarray indexing and thread-coordination overhead. The total footprint is roughly three times the raw 40 MB input array, consistent with the algorithm’s O(n) auxiliary-space complexity. Usage reaches 120.54 MB for random and reverse-sorted data, while sorted and nearly sorted datasets exhibit identical consumption within measurement tolerance, indicating that input order has minimal influence once the GPU threads are fully saturated.

QS exhibits moderate and well-balanced memory requirements, averaging approximately 40 MB across all dataset types. The GPU implementation allocates device side stacks and temporary buffers for pivot partitioning, yet the recursion depth remains bounded and shared memory is effectively reused. As a result, the overall memory footprint remains compact and consistent, with only minor variations across different input distributions. This uniform behavior indicates that thread-level partition buffers and local synchronization dominate memory activity, minimizing the influence of input order on resource utilization. This consistency reflects an optimized load-balancing strategy within the parallel kernel, allowing efficient utilization of GPU threads while maintaining stable memory consumption. Nevertheless, QS’s memory behavior continues to depend on the quality of pivot selection, which directly affects workload distribution and cache coherence during recursive partitioning.

BS maintains modest and relatively uniform memory usage, approximately 40 MB across all dataset types, demonstrating consistent resource behavior under large scale GPU execution. Although the algorithm is conceptually in-place, its parallel adaptation introduces additional auxiliary arrays for synchronization flags, swap detection, and inter-thread coordination. These structures ensure correct concurrent operation but slightly increase overall memory demand compared to the theoretical minimum. The uniformity measured in the sorted, nearly sorted, random and reverse-sorted data suggests that the memory activity of the algorithm is largely independent of the input distribution, as all variants require full iterative passes through the data. Despite this predictability, BS remains computationally inefficient, with high execution times that outweigh its limited and stable memory footprint, reaffirming its role primarily as a baseline reference in GPU-based performance comparisons.

Finally, RS remains the algorithm with the highest memory demand, consuming approximately 121.74 MB consistently across all datasets. This high utilization stems from multiple digit-based passes that allocate large bin buffers, prefix-sum arrays, and auxiliary output buffers for key redistribution. Each GPU pass through the data engages these buffers to perform fine-grained sorting on specific digit positions, leading to consistent and high memory utilization. The preallocated storage ensures that sufficient workspace is always available for radix-based partitioning and aggregation operations, minimizing runtime allocation overhead. As a result, RS achieves remarkable throughput and minimal timing variance, leveraging its substantial temporary storage to maximize data parallelism and avoid idle GPU cycles. However, this comes at the expense of higher overall memory consumption, underscoring the trade-off between raw performance and memory efficiency inherent to radix-based sorting techniques.

Overall, Figs 4 and 5 demonstrate that SS exhibits the strongest overall performance across both sequential and parallel environments, with particularly pronounced advantages in the parallel setting. Its performance remains stable across different input distributions, highlighting its suitability for large-scale datasets when the effective value range is compact. RS also delivers consistently strong results, especially under GPU parallelism, reflecting the effectiveness of digit based processing for integer sorting tasks.

QS emerges as a competitive alternative in parallel execution, achieving substantial speedup and performing well across most input types, although its behavior remains moderately sensitive to input ordering. MS, while not the fastest, provides stable and predictable performance in both sequential and parallel settings, making it well suited for applications that prioritize robustness and consistent worst case behavior over raw execution speed. In contrast, BS remains impractical for performance sensitive applications even when parallelized, reinforcing the importance of appropriate algorithm selection for high performance computing environments.

An important difference among parallel sorting algorithms is whether the data is globally sorted after kernel execution. As shown in Table 7, QS, MS, and BS all depend on additional post-processing: QS partition pieces must be recursively merged, MS requires repeated merging of pairs of subarrays, and BS requires multiple global passes for final ordering. In contrast, RS and SS inherently produce a fully sorted array at the conclusion of their execution. RS achieves this through digit-based passes, while SS completes sorting via bitmap mapping followed by linear traversal, yielding a globally sorted output without additional merging. Although our experimental evaluation targeted overall runtime efficiency without isolating these post-processing overheads, this distinction is worth noting to better understand the real-world efficiency of different parallel sorting algorithms. The evaluated algorithms are publicly available on GitHub at https://github.com/Baizhik/Algorithms_Benchmarking.

Table 7. Post-processing requirements of the evaluated parallel sorting algorithms.

Algorithm Post-processing Description of required steps
QS Yes Locally sorted subarrays must be recursively merged and positioned relative to pivots until globally sorted list is achieved.
MS Yes Hierarchical pair-wise merging of pre-sorted subarrays must be repeated until only one fully sorted sequence remains.
BS Yes Requires multiple global passes, since local swaps alone do not ensure global ordering.
RS No Digit-by-digit passes produce an entirely sorted array as a direct output without additional merging.
SS No Sorting is completed via bitmap mapping and linear bitmap traversal, which directly produces a globally sorted output without merging or recursive refinement.

Threats to validity

We acknowledge several limitations that may affect the generalizability and interpretability of our results. These are outlined below to ensure transparency and guide future replication and extension of this work.

Hardware specificity. All experiments were conducted on a fixed hardware setup: an Intel® Core i5-9400F CPU and an NVIDIA GTX 1660 SUPER GPU. While this configuration provides a stable and representative mid-range testbed, performance characteristics may vary on more powerful or newer architectures (e.g., RTX 40 series, AMD Radeon GPUs, or Apple M-series chips). Thus, care must be taken when generalizing the reported speedups to other hardware.

Algorithm scope. We focused our analysis on five sorting algorithms: MS, QS, BS, SS, and RS. While these represent a diverse range of algorithmic paradigms, we did not include GPU-optimized sample sort, bitonic sort, or hybrid CPU–GPU strategies, which may exhibit different performance trade-offs. Including them in future work could provide a broader comparative picture.

Dataset size and distribution. All experiments were performed on a fixed input size of ten million integers, across four distribution types: random, sorted, reverse-sorted, and nearly sorted. Although these represent common cases in performance testing, they may not fully capture the complexity or variability of real-world datasets, such as skewed or multi-modal distributions. Future work may explore adaptive sorting performance on larger or more diverse data.

Despite these limitations, we believe our unified evaluation framework, standardized testing methodology, and open dataset release provide a robust foundation for reproducibility and future extensions.

Conclusion

This study has demonstrated that GPU-based parallel sorting using CUDA provides significant execution-time improvements over sequential CPU-based sorting. The performance analysis of MS, QS, BS, RS, and SS confirms that GPU parallelization offers substantial speedups, particularly for SS and RS, followed by MS and QS, while BS remains computationally expensive due to inherent algorithmic limitations. Experiments conducted on 10,000,000 integers across four data scenarios (sorted, nearly sorted, random, and reverse sorted) reveal that SS achieved the strongest overall performance and highest speedup, while RS also performed exceptionally well due to its digit-wise filtering and GPU-specific optimizations, including memory-hierarchy-aware atomics and adaptive robustness techniques. Meanwhile, MS demonstrated stable performance across all dataset types, with its best execution occurring for nearly sorted data, reinforcing its efficiency in structured scenarios.

The results emphasize that dataset structure plays a crucial role in determining sorting efficiency under GPU parallelization. While comparison-based sorting algorithms such as MS and QS exhibit strong parallel performance, non-comparison approaches such as SS and RS align more naturally with GPU architectures, whereas BS remains an inefficient choice for large-scale parallel sorting. A key takeaway is the trade-off between execution time and memory consumption: parallel algorithms significantly improve speed but require more memory than sequential implementations. Among the evaluated algorithms, RS exhibited the highest memory consumption, while SS required the least, with MS, BS, and QS falling in between depending on the input distribution. The reported results reflect execution on a single, well-defined hardware setup (Intel Core i5-9400F CPU and NVIDIA GTX 1660 SUPER GPU), providing a consistent baseline for comparison.

In future studies, we aim to include a broader range of sorting algorithms to provide a more comprehensive comparison of sequential and parallel performance. Additionally, implementing sequential versions in lower-level languages such as C/C++ may offer more accurate benchmarking against CUDA-based GPU implementations. Experiments on modern hardware will also help reflect the capabilities of current high-performance computing platforms. Further research can explore hybrid CPU-GPU models for dynamic workload distribution and scaling to multi-GPU systems to improve sorting efficiency on large datasets.

Data Availability

All relevant data are publicly available from the Figshare repository: https://doi.org/10.6084/m9.figshare.29558357.

Funding Statement

This research was financially supported by the Deanship of Scientific Research and Graduate Studies at King Khalid University under research grant number (R.G.P.2/12/46). The funders were involved in supervising the research and revising the manuscript.

References

  • 1. Rashid AB, Kausik MAK. AI revolutionizing industries worldwide: a comprehensive overview of its diverse applications. Hybrid Advances. 2024;7:100277. doi: 10.1016/j.hybadv.2024.100277
  • 2. Ala’anzy MA, Zhumalin A, Temirtay D, Abdalhafid A. MBISort algorithm: a novel hybrid sorting approach for efficient data processing. In: 2025 17th International Conference on Electronics, Computers and Artificial Intelligence (ECAI). 2025. p. 1–8. doi: 10.1109/ecai65401.2025.11095462
  • 3. Ala’Anzy MA, Mazhit Z, Ala’Anzy AF, Algarni A, Akhmedov R, Bauyrzhan A. Comparative analysis of sorting algorithms: a review. In: 2024 11th International Conference on Soft Computing & Machine Intelligence (ISCMI). 2024. p. 88–100. doi: 10.1109/iscmi63661.2024.10851593
  • 4. Alanzy M, Latip R, Muhammed A. Range wise busy checking 2-way imbalanced algorithm for cloudlet allocation in cloud environment. J Phys: Conf Ser. 2018;1018:012018. doi: 10.1088/1742-6596/1018/1/012018
  • 5. Safi H, Jehangiri AI, Ahmad Z, Ala’anzy MA, Alramli OI, Algarni A. Design and evaluation of a Low-Power Wide-Area Network (LPWAN)-based emergency response system for individuals with special needs in smart buildings. Sensors (Basel). 2024;24(11):3433. doi: 10.3390/s24113433
  • 6. Liu P. An in-depth study of sorting algorithms. ACE. 2024;92(1):187–95. doi: 10.54254/2755-2721/92/20241750
  • 7. Maltenberger T, Ilic I, Tolovski I, Rabl T. Evaluating multi-GPU sorting with modern interconnects. In: Proceedings of the 2022 International Conference on Management of Data. 2022. p. 1795–809. doi: 10.1145/3514221.3517842
  • 8. Ilic I, Tolovski I, Rabl T. RMG Sort: radix-partitioning-based multi-GPU sorting. In: BTW 2023. 2023. p. 305–28.
  • 9. Khan M, Jehangiri AI, Ahmad Z, Ala’anzy MA, Umer A. An exploration to graphics processing unit spot price prediction. Cluster Comput. 2022;25(5):3499–515. doi: 10.1007/s10586-022-03581-8
  • 10. Ruetsch G, Fatica M. CUDA Fortran for scientists and engineers: best practices for efficient CUDA Fortran programming. Elsevier; 2024.
  • 11. Yoshida K, Miwa S, Yamaki H, Honda H. Analyzing the impact of CUDA versions on GPU applications. Parallel Computing. 2024;120:103081. doi: 10.1016/j.parco.2024.103081
  • 12. Hongdi S, Minzheng J, Zhu G. Multi-GPU radix sort algorithm in high performance computing environment. In: 2024 IEEE/ACIS 27th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD). IEEE; 2024. p. 131–6.
  • 13. Liu C. Study on the particle sorting performance for reactor Monte Carlo neutron transport on Apple unified memory GPUs. EPJ Web Conf. 2024;302:04001. doi: 10.1051/epjconf/202430204001
  • 14. Li Y, Zhou B, Zhang J, Wei X, Li Y, Chen Y. RadiK: scalable and optimized GPU-parallel radix top-K selection. In: Proceedings of the 38th ACM International Conference on Supercomputing. 2024. p. 537–48. doi: 10.1145/3650200.3656596
  • 15. Schmid RF, Pisani F, Cáceres EN, Borin E. An evaluation of fast segmented sorting implementations on GPUs. Parallel Computing. 2022;110:102889. doi: 10.1016/j.parco.2021.102889
  • 16. Schmid RF, Caceres EN. Fix Sort: a good strategy to perform segmented sorting. In: 2019 International Conference on High Performance Computing & Simulation (HPCS). 2019. p. 290–7. doi: 10.1109/hpcs48598.2019.9188196
  • 17. Blanchard JD, Opavsky E, Uysaler E. Selecting multiple order statistics with a graphics processing unit. ACM Trans Parallel Comput. 2016;3(2):1–23. doi: 10.1145/2948974
  • 18. Faujdar N, Ghrera SP. Analysis and testing of sorting algorithms on a standard dataset. In: 2015 Fifth International Conference on Communication Systems and Network Technologies. 2015. p. 962–7. doi: 10.1109/csnt.2015.98
  • 19. Singh DP, Joshi I, Choudhary J. Survey of GPU based sorting algorithms. Int J Parallel Prog. 2017;46(6):1017–34. doi: 10.1007/s10766-017-0502-5
  • 20. Tanasic I, Vilanova L, Jordà M, Cabezas J, Gelado I, Navarro N, et al. Comparison based sorting for systems with multiple GPUs. In: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units. 2013. p. 1–11. doi: 10.1145/2458523.2458524
  • 21. Cetin O, Kara F, Yilmaz O, Ozkaya H. Multi-GPU based parallel sorting: performance evaluation and memory optimization. Parallel Computing. 2023;115:102956. doi: 10.1016/j.parco.2023.102956
  • 22. Kumari S, Singh DP. A parallel selection sorting algorithm on GPUs using binary search. In: 2014 International Conference on Advances in Engineering & Technology Research (ICAETR-2014). 2014. p. 1–6.
  • 23. Gowanlock M, Karsin B. A hybrid CPU/GPU approach for optimizing sorting throughput. Parallel Computing. 2019;85:45–55. doi: 10.1016/j.parco.2019.01.004
  • 24. Wróblewski B, Gottardo G, Zouzias A. Parallel scan on Ascend AI accelerators. In: Greeks in AI Symposium 2025; 2025. https://openreview.net/forum?id=wPepcNWMhs
  • 25. Al-sudani AH, Mahmmod BM, Sabir FA, Abdulhussain SH, Alsabah M, Flayyih WN. Multithreading-based algorithm for high-performance Tchebichef polynomials with higher orders. Algorithms. 2024;17(9):381. doi: 10.3390/a17090381
  • 26. Mahmmod BM, Flayyih WN, Fakhri ZH, Abdulhussain SH, Khan W, Hussain A. Performance enhancement of high order Hahn polynomials using multithreading. PLoS One. 2023;18(10):e0286878. doi: 10.1371/journal.pone.0286878
  • 27. Hsu KC, Tseng HW. Simultaneous and heterogeneous multithreading: exploiting simultaneous and heterogeneous parallelism in accelerator-rich architectures. IEEE Micro. 2024;44(4):34–43. doi: 10.1109/MM.2024.3438722
  • 28. Wang P, He R. SlowSort: an enhanced sorting algorithm for large scale integer datasets. Wiley; 2025. doi: 10.22541/au.174523890.00820511/v1
  • 29. Shirvani Moghaddam S, Shirvani Moghaddam K. A threshold-based sorting algorithm for dense wireless sensor systems and communication networks. IET Wireless Sensor Systems. 2023;13(2):37–47. doi: 10.1049/wss2.12048
  • 30. Subramaniam M, Tripathi T, Chandraumakantham O. Cluster sort: a novel hybrid approach to efficient in-place sorting using data clustering. IEEE Access. 2025;13:74359–74. doi: 10.1109/access.2025.3564380
  • 31. Shirvani Moghaddam S, Moghaddam KS. A general framework for sorting large data sets using independent subarrays of approximately equal length. IEEE Access. 2022;10:11584–607. doi: 10.1109/access.2022.3145981
  • 32. Moghaddam SS, Moghaddam KS. On the performance of mean-based sort for large data sets. IEEE Access. 2021;9:37418–30. doi: 10.1109/access.2021.3063205
  • 33. McKevitt J, Vorobyov EI, Kulikov I. Accelerating Fortran codes: a method for integrating Coarray Fortran with CUDA Fortran and OpenMP. Journal of Parallel and Distributed Computing. 2025;195:104977. doi: 10.1016/j.jpdc.2024.104977
  • 34. Quelhas KN, Henn MA, de Farias RC, Tew WL, Woods SI. Parallel MPI image reconstructions in GPU using CUDA. International Journal on Magnetic Particle Imaging. 2023;9(1 Suppl 1).
  • 35. Ait Ben Hamou K, Jarir Z, Elfirdoussi S. Design of a machine learning-based decision support system for product scheduling on non identical parallel machines. Eng Technol Appl Sci Res. 2024;14(5):16317–25. doi: 10.48084/etasr.7934
  • 36. Wu B, Koutsoukos D, Alonso G. Efficiently processing joins and grouped aggregations on GPUs. Proc ACM Manag Data. 2025;3(1):1–27. doi: 10.1145/3709689
  • 37. Faujdar N, Ghrera SP. Performance evaluation of merge and quick sort using GPU computing with CUDA. International Journal of Applied Engineering Research. 2015;10(18):39315–393192.
  • 38. Al-Dabbagh SSM, Barnouti NH. Parallel quicksort algorithm using OpenMP. International Journal of Computer Science and Mobile Computing. 2016;5:372–82.
  • 39. Adinets A, Merrill D. Onesweep: a faster least significant digit radix sort for GPUs. arXiv preprint. 2022. arXiv:2206.01784
  • 40. Majumdar S, Jain I, Gawade A. Parallel quick sort using thread pool pattern. IJCA. 2016;136(7):36–41. doi: 10.5120/ijca2016908495
  • 41. Mujić M, Ćatić I, Behić S, Hadžibajramović A, Nosović N, Hrnjić T. Accelerating sorting on GPUs: a scalable CUDA quicksort revision. In: 2023 22nd International Symposium INFOTEH-JAHORINA (INFOTEH). 2023. p. 1–5. doi: 10.1109/infoteh57020.2023.10094180
  • 42. Casanova H, Iacono J, Karsin B. An efficient multiway mergesort for GPU architectures. arXiv preprint. 2017. https://arxiv.org/abs/1702.07961
  • 43. Stehle E, Jacobsen HA. A memory bandwidth-efficient hybrid radix sort on GPUs. In: Proceedings of the 32nd International Conference on Supercomputing. ACM; 2017. p. 1–10.
  • 44. Adinets A, Merrill D. Onesweep: a faster least significant digit radix sort for GPUs. arXiv preprint. 2022. https://arxiv.org/abs/2206.01784
  • 45. Ala’anzy MA. Sorting datasets; 2025. Dataset available at doi: 10.6084/m9.figshare.29558357 (uploaded on October 26, 2025).

Decision Letter 0

Alberto Marchisio

28 May 2025

PONE-D-25-22965: Performance evaluation of GPU-based parallel sorting algorithms (PLOS ONE)

Dear Dr. Ala'anzy,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The reviewers raised comments that need to be addressed.

Please submit your revised manuscript by Jul 12 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Alberto Marchisio

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Thank you for stating the following financial disclosure:

“This research was financially supported by the Deanship of Scientific Research and Graduate Studies at King Khalid University under research grant number (R.G.P.2/12/46).”

Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

4. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“This research was financially supported by the Deanship of Scientific Research and Graduate Studies at King Khalid University under research grant number (R.G.P.2/12/46).”

We note that you have provided funding information that is currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“This research was financially supported by the Deanship of Scientific Research and Graduate Studies at King Khalid University under research grant number (R.G.P.2/12/46).”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

5. We note that your Data Availability Statement is currently as follows: [All relevant data are within the manuscript and its Supporting Information files.]

Please confirm at this time whether or not your submission contains all raw data required to replicate the results of your study. Authors must share the “minimal data set” for their submission. PLOS defines the minimal data set to consist of the data required to replicate all study findings reported in the article, as well as related metadata and methods (https://journals.plos.org/plosone/s/data-availability#loc-minimal-data-set-definition).

For example, authors should submit the following data:

- The values behind the means, standard deviations and other measures reported;

- The values used to build graphs;

- The points extracted from images for analysis.

Authors do not need to submit their entire data set if only a portion of the data was used in the reported study.

If your submission does not contain these data, please either upload them as Supporting Information files or deposit them to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of recommended repositories, please see https://journals.plos.org/plosone/s/recommended-repositories.

If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent. If data are owned by a third party, please indicate how others may request data access.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Partly

Reviewer #5: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: No

Reviewer #4: No

Reviewer #5: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: No

Reviewer #4: No

Reviewer #5: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Strong points:

The paper aims to evaluate the performance of GPU-based parallelization in 4 sorting algorithms, with focus on analyzing parallel time complexity and space complexity across various data types. It has an easy-to-understand topic structure and somewhat detailed methodologies that allow the reproducibility of their tests in other environments.

Weak points:

The authors tested the GPU-based parallelization of the sorting algorithms against sequential implementations, with the sequential being done in Java while the GPU-based being written in CUDA C, with no mention of why such a choice was made. The discussion of the results is insufficient, mainly because an analysis of memory usage is not included, and could supplement why the applications achieved the shown results, as well as help with the conclusion.

The work also lacks a threats to validity section. The limitations of this work must be clear to the readers.

- General comments about the work

Paper as a whole:

Decide between utilizing the acronym or the full name, for example: “MS typically outperforms quicksort”;

Capitalization of names: throughout the entire paper, the sorting names are written using both capitalized initials and non-capitalized versions, e.g., "Quick sort/quick sort"; it would be better to choose which one is going to be used;

Standardization: There are sentences where the sorting names are separated, “Quick sort”, as well as together, “Quicksort”, it would be advised to revise the text to unify which nomenclature is going to be used;

On the Related Works: A paragraph comparing what you want to do against what has already been done could help differentiate your work from the others, and you talk about a “thorough time complexity table” been presented in the paper, which I don’t really see being presented in any section;

Line 75: “C++, building on the reviewed literature.” Loose phrase? The beginning of the sentence is the same, so it doesn’t make sense to say it again.

Lines 97/98: “The CUDA programming model works on the basis that the host is CPU and the device is GPU, has separate memory spaces [12].” No connection between the two phrases;

Why were the sequential versions made using Java while the GPU parallel versions used CUDA C? Was there a reason to impose this limitation on the work? Why the chosen environment? Was it a machine at hand? Or was it just chosen because of the GPU?

Why limit the dataset size to the chosen size? Did you make tests to choose this size, or was it selected at random? I assume some type of test was done to choose the size, if so, why not discuss the results obtained from these tests?

How the metrics were taken, does the execution time account for the reading of the dataset, and pre-processing of data, how the Memory Usage was taken, some form of software to monitor the resource usage, or is it being taken via code?

The results seem superficial, and one of the metrics in Table 1, “Memory Usage”, is not to be shown anywhere when discussing the results, and is somewhat referenced in the conclusion as a key takeaway;

The choice of colors in Figure 2(b) could be better; red and orange aren’t exactly the best color combo for a graph. Also, the graphs could be revised to include some type of hatch to differentiate the graphs from one another.

- Final Considerations

In general, the paper does what it proposed to do, although, in my opinion, the results are superficial and could be expanded, mainly due to the lack of a table or paragraph discussing the Memory Usage metric and how this metric interacts with the execution time and speedup achieved.

From a methodological perspective, the experimental processes are briefly outlined within the provided texts, facilitating the replication of the tests across diverse environments, but it could also be expanded beyond just talking about the size of the dataset and the environment configuration, to shed some light on why those were chosen.

There are some small problems, such as the limitations of the evaluations being done in different programming languages, the usage of a single dataset size, and the lack of a metric.

There are some questions that could be clarified about the paper, such as what the considered limitations were when searching for the Related Works and how the found works compare to what the authors want to achieve, as well as if something was used as a basis for the tests done in the work.

As a whole, I would say that the paper, as it is now, is not mature enough due to the lack of some key information to validate the work as a valuable addition to the literature.

Reviewer #2: 1. Introduction — Clarity and Depth

Issue: The introduction outlines the importance of sorting and parallel computing but lacks a compelling motivation for comparing these specific four algorithms.

Suggestions:

Explain why these four algorithms (Merge Sort, Quick Sort, Bubble Sort, Radix Top-K) were chosen — e.g., is this based on frequency of usage in CUDA applications or GPU benchmarking literature?

Add quantitative context: e.g., "Sorting is estimated to take up X% of compute cycles in application Y," to justify its significance in HPC.

Refine vague phrases such as “scales well with large datasets” with precise metrics or previous benchmarks.

2. Related Work — Gaps and Integration

Issue: The literature review is comprehensive but lacks synthesis and comparative critique.

Suggestions:

Include a comparative table or summary matrix listing previous studies, sorting algorithms used, platforms (single-GPU/multi-GPU), and key performance gains.

Several references are outdated or redundant (e.g., [8], [16] are not state-of-the-art). Add more recent studies (2022–2025) that use newer CUDA versions, Tensor Cores, or multi-GPU implementations.

Current review is narrative, not analytical. Clearly state what existing work lacks, which your paper addresses (e.g., consistent benchmarking across all four sorting types, large dataset size, diverse data distributions).

3. Methodology — Experimental Design and Rigor

Issue: The methodology is outlined, but the experimental control and reproducibility need reinforcement.

Suggestions:

Include CUDA version, driver version, and compiler flags used (especially if performance is being benchmarked).

Dataset structure is mentioned, but how data is initialized, and whether any caching or data prefetching happens is unclear.

Detail thread-block sizes, grid configurations, and any shared memory or coalesced memory access optimizations. These significantly impact CUDA performance and are critical to reproduce and understand performance differences.

4. Results and Discussion — Interpretation Quality

Issue: The results are presented clearly, but insight and critical analysis are limited.

Suggestions:

Explain why Radix Sort performs best — is it because of its O(n) time complexity or better memory alignment? Go deeper than just empirical observations.

Discuss scalability: how does performance vary with increasing data size (e.g., 1M → 10M → 100M integers)?

Graphs are effective, but statistical analysis (standard deviation, variance, error bars) would support the reliability of the results.

Consider including GPU utilization percentage (e.g., via NVIDIA Nsight) to analyze how efficiently each algorithm utilizes GPU hardware.

Overall Recommendation: Minor Revision

The paper is technically sound, but needs improvements in clarity, methodological transparency, CUDA-level detailing, and result interpretation. Addressing these will elevate its suitability for publication.

Reviewer #3: In this manuscript, the GPU-based parallelization of mergesort (MS), quicksort (QS), bubble sort (BS) and radix top-k selection sort (RS) are investigated. Also, the performance of these algorithms is evaluated on GPUs utilizing CUDA.

The manuscript is interesting; however, the following comments need to be addressed:

1 – In the abstract, the results need to be included.

2 – The introduction is short and should be improved by including other types of algorithms.

3 – Contributions need to be included as a list.

4 – In the related work section, a summary table needs to be included.

5 – In the results, remove the 3D appearance of the bars.

6 – The results require more elaboration.

7 – Update the references from 2025 literature.

8 – Check the manuscript for grammar and typos.

9 – Equations from other sources need to be credited.


Reviewer #4: The paper describes the performance evaluation of GPU-based parallel sorting algorithms. The paper is largely technically sound, but the introduction is too short and needs major revision. The authors use datasets to benchmark the algorithms but fail to give adequate details about them. Visualization and description of the data are absolutely necessary before the evaluation of the algorithms. Also, the number of references is too small for a journal article. The authors should expand the introduction with more information and appropriate citations wherever necessary. The supporting datasets should also be made available if possible so that their rigour can actually be verified.

Reviewer #5: The authors' contribution and novelty statement is not convincing.

The authors have compared different parallel sorting algorithms on the CUDA platform.

The architectural specifications of the computing platform are not mentioned.

Memory footprints and the type of GPU memory (integrated/discrete) should be mentioned.

Only parallel execution time is reported in the paper; other metrics are not evaluated.

A comprehensive analysis, along with novelty/innovation, is required.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: OMER IQBAL

Reviewer #3: No

Reviewer #4: No

Reviewer #5: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2026 Feb 3;21(2):e0342167. doi: 10.1371/journal.pone.0342167.r002

Author response to Decision Letter 1


14 Jul 2025

Manuscript ID: PONE-D-25-22965

Original Article Title: “Performance evaluation of GPU-based parallel sorting algorithms”

To: PLOS One Editor

Re: Response to reviewers

Dear Editor,

Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments.

We are uploading (a) our point-by-point response to the comments (below) (response to reviewers), (b) a marked-up copy of the manuscript (Revised Manuscript with Track Changes), and (c) a clean updated manuscript (“Main Manuscript”).

Best regards,

Mohammed Alaa Ala’anzy et al.

___________________________________________________

Reviewer 1

The paper aims to evaluate the performance of GPU-based parallelization in 4 sorting algorithms, with focus on analyzing parallel time complexity and space complexity across various data types. It has an easy-to-understand topic structure and somewhat detailed methodologies that allow the reproducibility of their tests in other environments.

We appreciate your time and effort in reviewing our manuscript and providing constructive criticism.

___________________________________________________

Reviewer #1, Concern #1: The authors tested the GPU-based parallelization of the sorting algorithms against sequential implementations, with the sequential being done in Java while the GPU-based being written in CUDA C, with no mention of why such a choice was made. The discussion of the results is insufficient, mainly because an analysis of memory usage is not included, and could supplement why the applications achieved the shown results, as well as help with the conclusion.

Author response: Thank you for the insightful feedback.

Author action: We have explained the selection of programming languages to implement the sequential and parallel sort with justification for both (kindly see page 10, line 351).

Moreover, Table 5 has been included in the Results section, showing how much memory each sorting algorithm used and discussing potential reasons for the patterns observed. This analysis is also referred to and used in the Conclusion section to justify the major findings (kindly see pages 18-19).

___________________________________________________

Reviewer #1, Concern #2: The work also lacks a threats to validity section. The limitations of this work must be clear to the readers.

Author response & action: In response, we have added a dedicated “Threats to Validity” section (Section 7, Page 18), positioned before the Conclusion. This section systematically outlines the major limitations of our study, including:

1. Hardware specificity: experiments were conducted on a mid-range system (Intel i5‑9400F and NVIDIA GTX 1660 SUPER), which may not reflect performance on newer platforms.

2. Use of Java for sequential implementations: while practical and portable, it may not achieve the same optimization level as C++-based CPU implementations.

3. Algorithm scope: our evaluation includes four representative sorting algorithms (MS, QS, BS, RS) but excludes others such as sample sort or bitonic sort.

4. Dataset generalizability: experiments were run on datasets of one million integers with four controlled distributions, which may not fully capture real-world complexity.

5. Kernel-level optimization: our CUDA implementations use standard global memory and avoid warp-level or shared-memory tuning, by design, to establish clean performance baselines.

These clarifications ensure the study’s limitations are transparently acknowledged and help contextualize our results for readers and future researchers.

___________________________________________________

Reviewer #1, Concern #3: Decide between utilizing the acronym or the full name, for example: “MS typically outperforms quicksort”;

Capitalization of names: throughout the entire paper, the sorting names are written using both capitalized initials and non-capitalised versions, “Quick sort/quick sort; it would be better to choose which one is going to be used;

Standardization: There are sentences where the sorting names are separated, “Quick sort”, as well as together, “Quicksort”, it would be advised to revise the text to unify which nomenclature is going to be used;

Author response: We appreciate the reviewer for bringing this vital point to our attention.

Author action: We decided to fully use acronyms instead of full names for sorting. Furthermore, we changed all names of algorithms to capitalised initials. Additionally, we decided to fully use separated words instead of merged ones.

___________________________________________________

Reviewer #1, Concern #4: A paragraph comparing what you want to do against what has already been done could help differentiate your work from the others, and you talk about a “thorough time complexity table” being presented in the paper, which I don’t really see being presented in any section;

Author response: We thank the reviewer for this keen observation, which has been addressed carefully.

Author action: While a separate table was not explicitly included, time complexity comparisons were integrated in the Results section as a narrative to provide a more intuitive comparison.

___________________________________________________

Reviewer #1, Concern #5: Line 75: “C++, building on the reviewed literature.” Loose phrase? The beginning of the sentence is the same, so it doesn’t make sense to say it again.

Author response: We thank the reviewer for bringing this to our attention.

Author action: We revised the entire subparagraph containing the sentence on Line 75 to improve coherence and avoid redundancy.

___________________________________________________

Reviewer #1, Concern #6: Lines 97/98: “The CUDA programming model works on the basis that the host is CPU and the device is GPU, has separate memory spaces [12].” No connection between the two phrases;

Author response: We thank the reviewer for the suggestion that provided our revised paper with enhanced clarity.

Author action: We updated the sentence by adding a logical connection between phrases for clear understanding. The CUDA programming model operates on the principle that the host (CPU) and the device (GPU) function as distinct computing units, each with separate memory spaces [24] (page 6 lines 188-189).

___________________________________________________

Reviewer #1, Concern #7: Why were the sequential versions made using Java while the GPU parallel versions used CUDA C? Was there a reason to impose this limitation on the work? Why the chosen environment? Was it a machine at hand? Or was it just chosen because of the GPU?

Author response: We appreciate the reviewer bringing this fact to our attention.

Author action: We added information about using Java for the sequential sorts and CUDA C for the parallel sorts in the Experimental Setup section on Page 10, lines 352-365. CUDA was used for the parallel sorting implementations, which required C/C++ due to CUDA’s language support. Visual Studio was used as the development environment for CUDA programming, offering integrated debugging, code suggestions, and efficient GPU compilation workflows that supported rapid implementation and testing. For the sequential implementations, Java was chosen for its simplicity, portability, and suitability for quick algorithm prototyping. It also provided a stable and accessible development setup in our environment. This combination, Java for sequential and CUDA C for parallel, allowed us to leverage the strengths of each environment and perform an effective comparison of sorting algorithms across both CPU and GPU settings.

___________________________________________________

Reviewer #1, Concern #8: Why limit the dataset size to the chosen size? Did you make tests to choose this size, or was it selected at random? I assume some type of test was done to choose the size, if so, why not discuss the results obtained from these tests?

Author response: We thank the reviewer for this valuable recommendation.

Author action: We included a note in the Experimental Setup section that described how the dataset size of 1,000,000 integers was chosen after experimentation with different sizes, ranging from tens of thousands to tens of millions. Smaller values did not adequately demonstrate the performance improvement of GPU-based parallelism, and significantly larger inputs caused excessive memory use and lengthy runtimes, which were not practical with the hardware available to us. The selected size offered a moderate amount of work that would neatly illustrate the computational distinction between the sequential and parallel versions (Kindly see Page 10, lines 346-351).
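For illustration only, the four input distributions used in the experiments (random, reverse-sorted, already-sorted, and nearly-sorted) could be generated along the following lines. This is a minimal sketch in Java, not the authors' actual generator; the class, method names, and the swap-fraction parameter are our own assumptions.

```java
import java.util.Random;

// Hypothetical generator for the four benchmark distributions:
// random, already-sorted, reverse-sorted, and nearly-sorted integers.
public class DatasetSketch {
    // Uniformly random integers from a seeded PRNG.
    static int[] random(int n, long seed) {
        return new Random(seed).ints(n).toArray();
    }

    // Ascending order: 0, 1, ..., n-1.
    static int[] sorted(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = i;
        return a;
    }

    // Descending order: n, n-1, ..., 1.
    static int[] reverseSorted(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = n - i;
        return a;
    }

    // "Nearly sorted": start sorted, then swap a small fraction of
    // randomly chosen pairs (swapFraction is an assumed knob).
    static int[] nearlySorted(int n, double swapFraction, long seed) {
        int[] a = sorted(n);
        Random rng = new Random(seed);
        int swaps = (int) (n * swapFraction);
        for (int s = 0; s < swaps; s++) {
            int i = rng.nextInt(n), j = rng.nextInt(n);
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
        return a;
    }
}
```

Varying only the distribution while holding the size fixed, as the experiments do, isolates the effect of input ordering on each algorithm.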

___________________________________________________

Reviewer #1, Concern #9: How the metrics were taken, does the execution time account for the reading of the dataset, and pre-processing of data, how the Memory Usage was taken, some form of software to monitor the resource usage, or is it being taken via code?

Author response: We appreciate the reviewer's comment.

Author action: We have revised the Results and Evaluation subsection to explain more clearly how the performance figures were obtained. In particular, we have made it clearer that the execution time measurements exclude dataset reading and preprocessing and cover only the sorting step. Furthermore, we clarified that memory usage was tracked with in-code monitoring rather than external software tools. This detail has been added for the precision and replicability of our results (Page 15, Lines 504-511).
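As a sketch of this in-code measurement approach (our own illustration, not the authors' actual harness): in Java, the sort step alone can be timed with `System.nanoTime()`, and heap usage sampled through the `Runtime` API rather than an external profiler. The class and method names below are hypothetical.

```java
import java.util.Arrays;

// Hypothetical measurement harness: dataset generation and I/O happen
// outside the timed region; only the sorting call is measured. Memory
// is sampled in-code via the Runtime API, not an external monitor.
public class MeasureSketch {
    // Returns { elapsed nanoseconds, approximate extra heap bytes used }.
    public static long[] measure(int[] data) {
        Runtime rt = Runtime.getRuntime();
        rt.gc();  // encourage a clean baseline (best effort only)
        long memBefore = rt.totalMemory() - rt.freeMemory();

        long t0 = System.nanoTime();
        Arrays.sort(data);               // the sort step: the only timed work
        long elapsedNs = System.nanoTime() - t0;

        long memAfter = rt.totalMemory() - rt.freeMemory();
        return new long[] { elapsedNs, Math.max(0L, memAfter - memBefore) };
    }
}
```

Note that JVM heap sampling of this kind is approximate, since garbage collection can run between the two samples; a CUDA-side equivalent would instead bracket the kernel with events such as `cudaEventRecord`.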

___________________________________________________

Reviewer #1, Concern #10: The results seem superficial, and one of the metrics in Table 1, “Memory Usage”, is not to be shown anywhere when discussing the results, and is somewhat referenced in the conclusion as a key takeaway;

Author response: We thank the reviewer for these remarks and have made changes accordingly.

Author action: We have further rewritten the Results section to formally address the “Memory Usage” metric shown in Table 5 (Page 17). This places the metric in its proper position within the analysis and allows the conclusions reached to be substantiated. The addition is clear and keeps the discussion focused on the main takeaways around memory efficiency.

___________________________________________________

Reviewer #1, Concern #11: The choice of colors in Figure 2(b) could be better; red and orange aren’t exactly the best color combo for a graph. Also, the graphs could be revised to include some type of hatch to differentiate the graphs from one another.

Author response: We thank and recognize the efforts of the reviewer for offering this useful advice.

Author action: The colour scheme of the graphs has been updated as requested.

___________________________________________________

Reviewer 2

We thank the reviewer for their constructive criticisms.

___________________________________________________

Reviewer #2, Concern #1: Introduction — Clarity and Depth

Issue: The introduction outlines the importance of sorting and parallel computing but lacks a compelling motivation for comparing these specific four algorithms.

Suggestions:

Explain why these four algorithms (Merge Sort, Quick Sort, Bubble Sort, Radix Top-K) were chosen — e.g., is this based on frequency of usage in CUDA applications or GPU benchmarking literature?

Author response: We thank the reviewer for this helpful suggestion.

Author action: We have revised the introduction to address your concerns by adding a section that describes the selection rationale of the four sorting algorithms. Specifically, we explain that Merge Sort, Quick Sort, Bubble Sort, and Radix Sort are among the most commonly studied parallel sorting methods due to their distinct algorithmic strategies and suitability for parallelization. This addition clarifies our motivation and provides a stronger foundation for the comparative analysis presented in the paper (Line 44-58).

___________________________________________________

Reviewer #2, Concern #2: Add quantitative context: e.g., "Sorting is estimated to take up X% of compute cycles in application Y," to justify its significance in HPC.

Author response: We thank the reviewer for this helpful suggestion.

Author action: We incorporated a quantitative statement in the introduction to highlight the significance of sorting in high-performance computing (HPC). Specifically, we added information indicating that sorting operations can account for a substantial portion of compute time in data-intensive applications, emphasizing their relevance for optimization, scalability, and algorithmic efficiency analysis (Line 06-08).

___________________________________________________

Reviewer #2, Concern #3: Refine vague phrases such as “scales well with large datasets” with precise metrics or previous benchmarks.

Author response: We thank the reviewer for this helpful suggestion.

Author action: We have revised our manuscript, and the statement has been updated in the abstract and introduction.

___________________________________________________

Reviewer #2, Concern #4: Related Work — Gaps and Integration

Issue: The literature review is comprehensive but lacks synthesis and comparative critique.

Suggestions:

Include a comparative table or summary matrix listing previous studies, sorting algorithms used, platforms (single-GPU/multi-GPU), and key performance gains.

Author response: Thank you for the insightful feedback.

Author action: We included a comparative summary table in the related work section, along with a brief analysis highlighting the differences in sorting algorithms, platforms, and reported performance gains across previous studies (Table 1, page 5).

___________________________________________________

Reviewer #2, Concern #5: Several references are outdated or redundant (e.g., [8], [16] are not state-of-the-art). Add more recent studies (2022–2025) that use newer CUDA versions, Tensor Cores, or multi-GPU implementations.

Author response & action: Thank you for your insightful observation. In response, we have revised our Related Work section to incorporate recent, relevant studies that directly address advancements in CUDA-based sorting, multi-GPU systems, and performance-critical GPU algorithms from 2022–2025. Specifically, we have added:

Schmid et al. (2022) proposed a recommendation map for optimal strategy selection and segmented sort.

Çetin et al. (2023) proposed a memory-aware multi-GPU merge sort framework optimized through unified memory access and NCCL communication.

Yoshida et al. (2024) analyzed the influence of different CUDA versions on sorting performance, emphasizing warp scheduling and memory hierarchy impacts.

Li et al. (2024) introduced an efficient and scalable radix Top-K selection algorithm tailored for GPU architectures, showing superior performance across large datasets.

Wróblewski et al. (2025) demonstrated the use of radix-based sorting techniques on Huawei’s Ascend AI accelerators, showing cross-platform applicability beyond CUDA-enabled GPUs.

Additionally, we reviewed four other recent CUDA-based studies that explored general-purpose GPU acceleration across diverse domains (e.g., material point methods, ray tracing cores, electrostatics, and computer vision for waste sorting). While these works utilize CUDA and contribute meaningfully to GPU computing as a whole, they do not focus on sorting algorithms or performance benchmarking frameworks aligned with our study. Therefore, we did not include them in our core Related Work section to maintain topical focus on parallel sorting within high-performance GPU computing.

___________________________________________________

Reviewer #2, Concern #6: Current review is narrative, not analytical. Clearly state what existing work lacks, which your paper addresses (e.g., consistent benchmarking across all four sorting types, large dataset size, diverse data distributions).

Author response: Thank you for the suggestion.

Author action: We have revised the end of the Related Work section to clearly identify the limitations of existing work that our study addresses.

Attachment

Submitted filename: Response to Reviewers.pdf

pone.0342167.s001.pdf (389.6KB, pdf)

Decision Letter 1

Francesco Bardozzo

15 Sep 2025

PONE-D-25-22965R1

Performance evaluation of GPU-based parallel sorting algorithms

PLOS ONE

Dear Dr. Ala'anzy,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Oct 30 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Francesco Bardozzo

Academic Editor

PLOS ONE

Journal Requirements:

1. If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

Additional Editor Comments:

Reviewer #3:

In this manuscript, the GPU-based parallelization of mergesort (MS), quicksort (QS), bubble sort (BS) and radix top-k selection sort (RS) are investigated. Also, the performance of these algorithms is evaluated on GPUs utilizing CUDA.

In the revised manuscript, the following comments should be addressed:

1 – In the abstract, the results need to be included. The results are best reported in terms of the improvement ratio between the presented work and existing works.

2 – There are several works that try to improve algorithm performance by utilizing multi-core processors, GPUs, and multi-threading. The authors need to include some of these works, for example:

[R1] Al-sudani, Ahlam Hanoon, et al. "Multithreading-Based Algorithm for High-Performance Tchebichef Polynomials with Higher Orders." Algorithms 17.9 (2024): 381.

[R2] Hsu, Kuan-Chieh, and Hung-Wei Tseng. "Simultaneous and Heterogenous Multithreading: Exploiting Simultaneous and Heterogeneous Parallelism in Accelerator-Rich Architectures." IEEE Micro 44.4 (2024).

[R3] Mahmmod, Basheera M., et al. "Performance enhancement of high order Hahn polynomials using multithreading." Plos one 18.10 (2023): e0286878.

3 – Check the manuscript for grammar and typos.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Reviewer #5:

The manuscript titled "Performance Evaluation of GPU-Based Parallel Sorting Algorithms" provides a well-structured comparison of four classical sorting algorithms—Merge Sort (MS), Quick Sort (QS), Bubble Sort (BS), and Radix Sort (RS)—in both sequential (CPU) and parallel (GPU/CUDA) implementations. The study is clearly written and offers a unified benchmarking framework across four dataset distributions using a consistent hardware setup. The inclusion of execution time, memory usage, GPU utilization, and statistical repeatability across 30 runs contributes positively to the rigor of the experimental section.

However, while the work is technically competent and informative as a benchmarking study, several significant limitations reduce its suitability for publication in a journal like PLOS ONE:

Lack of Novelty: The manuscript does not propose any new algorithms, techniques, or optimization strategies. The selected algorithms are well-established, and their CUDA implementations are widely studied. The work presents confirmatory results rather than offering new insights into algorithmic performance or GPU computing.

Unfair Baseline Comparison: Sequential implementations are written in Java, while GPU versions are developed in CUDA C++. This introduces a language-level performance bias that undermines the accuracy of GPU–CPU speedup claims. A more rigorous and fair comparison would require both versions to be written in the same low-level language (e.g., C/C++).

Limited Optimization: The GPU implementations do not leverage key CUDA features such as shared memory, warp-level primitives, or memory coalescing. While the authors acknowledge this, it limits the relevance of performance results in a high-performance computing context.

Restricted Generalizability: All experiments were performed on a single GPU (GTX 1660 SUPER) and a mid-tier CPU, without comparison across other hardware platforms. While suitable for baseline analysis, the conclusions should be considered hardware-specific.

Scope of Algorithms: The manuscript focuses on only four sorting algorithms. While these are diverse in paradigm, the exclusion of common GPU-optimized algorithms such as sample sort, bitonic sort, or hybrid strategies limits the comprehensiveness of the study.

Data Availability and Reproducibility: A positive aspect of this work is that the datasets used in the experiments have been made publicly available on Figshare. This supports reproducibility and is commendable.

In conclusion, the manuscript serves as a solid technical report or pedagogical study, but in its current form, it does not meet the originality and methodological innovation standards required for publication in PLOS ONE. The authors are encouraged to explore hybrid GPU–CPU strategies, apply hardware-level optimizations, and conduct more fair comparisons using the same programming language to strengthen future submissions.

Reviewer #6:

1. The title of the paper is “Performance evaluation of GPU-based parallel sorting algorithms”, which is general. One expects to see time-efficient, recently published algorithms in this research. The following new sorting algorithms, which are time- and complexity-efficient and especially suited to parallel realization, are missing from your study and comparisons. In addition to the Quick, Merge, Bubble, and Radix sorting algorithms, some new ones are missing, such as mean-based and threshold-based sorting algorithms for integer and non-integer large-scale data sets, Slowsort as a new modified parallel realization of BitSort for integer data sets, and Clustersort. All of them are based on the divide-and-conquer idea of breaking a big problem into many sub-problems.

- SlowSort: An Enhanced Sorting Algorithm for Large Scale Integer Datasets, Preprint of the accepted paper to be published in Software: Practice and Experience, 2025 (DOI: 10.22541/au.174523890.00820511/v1).

- A Threshold-Based Sorting Algorithm for Dense Wireless Communication Networks, IET Wireless Sensor Systems, Vol. 13, No. 2, pp. 37-47, Jan. 2023 (DOI: 10.1049/wss2.12048).

- Cluster Sort: A Novel Hybrid Approach to Efficient In-Place Sorting Using Data Clustering, IEEE Access, Vol. 13, pp. 74359-74374, 2025

(DOI: 10.1109/ACCESS.2025.3564380).

- A General Framework for Sorting Large Data Sets Using Independent Subarrays of Approximately Equal Length, IEEE Access, Vol. 10, pp. 11584-11607, 2022 (DOI: 10.1109/ACCESS.2022.3145981).

- On the Performance of Mean-Based Sort for Large Data Sets, IEEE Access, Vol. 9, pp. 37418-37430, March 2021 (DOI: 10.1109/ACCESS.2021.3063205).

2. In parallel processing, the time and complexity analysis depends on the core with the highest time and complexity requirements, and the differences between cores in terms of memory space and processing time should be extracted, both as mean values and standard deviations. In this regard, mean-based and threshold-based sorting algorithms offer subarrays in a parallel realization that are independent of each other until the end of processing and contain approximately equal numbers of elements.

3. In some algorithms, once one core finishes its processing, its data is sorted and can be used; in other algorithms, more post-processing is needed to obtain the final sorted data. This case is not analyzed in this research work.

4. In addition to compare the elapsed processing time, a complexity order study for time and memory in parallel scenario is needed.

5. Integer and non-integer data sets may affect the comparison results; have you considered this in your investigation?

6. In the sorting literature, we have worst-case, medium-case (moderate-case), and best-case scenarios, which depend on how sorted the data originally is. This is considered in your work. In contrast, the type of data in terms of its probability distribution is not considered; for example, uniform and Gaussian distributions may produce different results.

7. When we speak about time complexity on the order of O(n log n), large and very large data sets show the effectiveness of a sorting algorithm better than small or medium data sets. In most references, data sets with 10^5 and 10^6 elements are large, and data sets with 10^7, 10^8, and 10^9 elements are very large. Have you considered both of them? Do you have different results?

Reviewer #2:

Thank you for the opportunity to review this manuscript. The study addresses an important and timely research question and is conducted with a clear and methodologically sound approach. The manuscript is well-written, logically structured, and provides sufficient details to ensure reproducibility of results.

The data analysis is appropriate, and the conclusions are well-supported by the results. The literature review is comprehensive and contextualizes the current study within existing work in the field.

I particularly appreciate the clarity in the presentation of figures and tables, and the transparency of the methodology, including data sources and code availability, which align well with PLOS ONE’s open science policy.

I have not found any concerns regarding ethical approval, participant consent, or data sharing, and there is no indication of dual publication or plagiarism. The manuscript appears to comply with PLOS ONE's research and publication ethics

Reviewer #1:All my concerns were addressed in this manuscript version. However, there is a strange border on Figure 4.

[Note: HTML markup is below. Please do not edit.]

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

Reviewer #3: (No Response)

Reviewer #4: (No Response)

Reviewer #6: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Partly

Reviewer #6: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #6: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #6: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #6: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: All my concerns were addressed in this manuscript version. However, there is a strange border on Figure 4.

Reviewer #2: Thank you for the opportunity to review this manuscript. The study addresses an important and timely research question and is conducted with a clear and methodologically sound approach. The manuscript is well-written, logically structured, and provides sufficient details to ensure reproducibility of results.

The data analysis is appropriate, and the conclusions are well-supported by the results. The literature review is comprehensive and contextualizes the current study within existing work in the field.

I particularly appreciate the clarity in the presentation of figures and tables, and the transparency of the methodology, including data sources and code availability, which align well with PLOS ONE’s open science policy.

I have not found any concerns regarding ethical approval, participant consent, or data sharing, and there is no indication of dual publication or plagiarism. The manuscript appears to comply with PLOS ONE's research and publication ethics.

Reviewer #3: In this manuscript, the GPU-based parallelization of merge sort (MS), quick sort (QS), bubble sort (BS), and radix top-k selection sort (RS) is investigated. Also, the performance of these algorithms is evaluated on GPUs utilizing CUDA.

In the revised manuscript, the following comments should be addressed:

1 – In the abstract, the results need to be included. The results are better presented in terms of the improvement ratio between the presented work and existing works.

2 – There are several works that try to improve algorithm performance by utilizing multi-core, GPU, and multi-threading approaches. The authors need to include some of these works, for example:

[R1] Al-sudani, Ahlam Hanoon, et al. "Multithreading-Based Algorithm for High-Performance Tchebichef Polynomials with Higher Orders." Algorithms 17.9 (2024): 381.

[R2] Hsu, Kuan-Chieh, and Hung-Wei Tseng. "Simultaneous and Heterogenous Multithreading: Exploiting Simultaneous and Heterogeneous Parallelism in Accelerator-Rich Architectures." IEEE Micro 44.4 (2024).

[R3] Mahmmod, Basheera M., et al. "Performance enhancement of high order Hahn polynomials using multithreading." Plos one 18.10 (2023): e0286878.

3 – Check the manuscript for grammar and typos.


Reviewer #4: The manuscript titled "Performance Evaluation of GPU-Based Parallel Sorting Algorithms" provides a well-structured comparison of four classical sorting algorithms—Merge Sort (MS), Quick Sort (QS), Bubble Sort (BS), and Radix Sort (RS)—in both sequential (CPU) and parallel (GPU/CUDA) implementations. The study is clearly written and offers a unified benchmarking framework across four dataset distributions using a consistent hardware setup. The inclusion of execution time, memory usage, GPU utilization, and statistical repeatability across 30 runs contributes positively to the rigor of the experimental section.

However, while the work is technically competent and informative as a benchmarking study, several significant limitations reduce its suitability for publication in a journal like PLOS ONE:

Lack of Novelty: The manuscript does not propose any new algorithms, techniques, or optimization strategies. The selected algorithms are well-established, and their CUDA implementations are widely studied. The work presents confirmatory results rather than offering new insights into algorithmic performance or GPU computing.

Unfair Baseline Comparison: Sequential implementations are written in Java, while GPU versions are developed in CUDA C++. This introduces a language-level performance bias that undermines the accuracy of GPU–CPU speedup claims. A more rigorous and fair comparison would require both versions to be written in the same low-level language (e.g., C/C++).

Limited Optimization: The GPU implementations do not leverage key CUDA features such as shared memory, warp-level primitives, or memory coalescing. While the authors acknowledge this, it limits the relevance of performance results in a high-performance computing context.

Restricted Generalizability: All experiments were performed on a single GPU (GTX 1660 SUPER) and a mid-tier CPU, without comparison across other hardware platforms. While suitable for baseline analysis, the conclusions should be considered hardware-specific.

Scope of Algorithms: The manuscript focuses on only four sorting algorithms. While these are diverse in paradigm, the exclusion of common GPU-optimized algorithms such as sample sort, bitonic sort, or hybrid strategies limits the comprehensiveness of the study.

Data Availability and Reproducibility: A positive aspect of this work is that the datasets used in the experiments have been made publicly available on Figshare. This supports reproducibility and is commendable.

In conclusion, the manuscript serves as a solid technical report or pedagogical study, but in its current form, it does not meet the originality and methodological innovation standards required for publication in PLOS ONE. The authors are encouraged to explore hybrid GPU–CPU strategies, apply hardware-level optimizations, and conduct more fair comparisons using the same programming language to strengthen future submissions.

Reviewer #6: 1. The title of the paper is “Performance evaluation of GPU-based parallel sorting algorithms”, which is general. One expects to see time-efficient, recently published algorithms in this research. The following new sorting algorithms, which are time- and complexity-efficient, especially for parallel realization, are missing from your study and comparisons. In addition to the Quick, Merge, Bubble, and Radix sorting algorithms, some new ones are missing, such as mean-based and threshold-based sorting algorithms for integer and non-integer large-scale data sets, SlowSort as a new modified parallel realization of BitSort for integer data sets, and ClusterSort. All of them are based on the Divide & Conquer idea of breaking a big problem into many sub-problems.

- SlowSort: An Enhanced Sorting Algorithm for Large Scale Integer Datasets, Preprint of the accepted paper to be published in Software: Practice and Experience, 2025 (DOI: 10.22541/au.174523890.00820511/v1).

- A Threshold-Based Sorting Algorithm for Dense Wireless Communication Networks, IET Wireless Sensor Systems, Vol. 13, No. 2, pp. 37-47, Jan. 2023 (DOI: 10.1049/wss2.12048).

- Cluster Sort: A Novel Hybrid Approach to Efficient In-Place Sorting Using Data Clustering, IEEE Access, Vol. 13, pp. 74359-74374, 2025 (DOI: 10.1109/ACCESS.2025.3564380).

- A General Framework for Sorting Large Data Sets Using Independent Subarrays of Approximately Equal Length, IEEE Access, Vol. 10, pp. 11584-11607, 2022 (DOI: 10.1109/ACCESS.2022.3145981).

- On the Performance of Mean-Based Sort for Large Data Sets, IEEE Access, Vol. 9, pp. 37418-37430, March 2021 (DOI: 10.1109/ACCESS.2021.3063205).

2. In parallel processing, time and complexity analysis depends on the core that requires the most time and complexity, and the differences between cores in terms of memory space and processing time should be extracted, both as mean values and standard deviations. In this case, mean-based or threshold-based sorting algorithms offer subarrays in parallel realization that are independent of each other until the end of processing and contain approximately similar numbers of elements.

3. In some algorithms, when one core finishes its processing, its data is sorted and can be used immediately, but other algorithms need more post-processing to obtain the final sorted data. This case is not analyzed in this research work.

4. In addition to comparing the elapsed processing time, a complexity-order study for time and memory in the parallel scenario is needed.

5. Integer and non-integer data sets may affect the comparison results; have you considered this in your investigation?

6. In the sorting literature, we have worst-case, medium-case (moderate-case), and best-case scenarios that depend on how much the data is originally sorted. This is considered in your work. In contrast, the type of data in terms of probability distribution is not considered; for example, uniform and Gaussian distributions may force different results.

7. When we speak about time complexity order at the level of O(n log n), large and very large data sets show the effectiveness of a sorting algorithm better than small or medium data sets. In most references, data sets with 10^5 and 10^6 elements are considered large, and data sets with 10^7, 10^8, and 10^9 elements are considered very large. Have you considered both of them? Do you have different results?

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: OMER IQBAL

Reviewer #3: No

Reviewer #4: No

Reviewer #6: Yes: S. Shirvani Moghaddam

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2026 Feb 3;21(2):e0342167. doi: 10.1371/journal.pone.0342167.r004

Author response to Decision Letter 2


31 Oct 2025

Manuscript ID: PONE-D-25-22965R1

Original Article Title: “Performance evaluation of GPU-based parallel sorting algorithms”

To: PLOS One Editor

Re: Response to reviewers

Dear Editor,

Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments.

We are uploading (a) our point-by-point response to the comments (below) (response to reviewers), (b) a marked-up copy of the manuscript (Revised Manuscript with Track Changes), and (c) a clean updated manuscript (“Main Manuscript”).

Best regards,

Mohammed Alaa Ala’anzy et al.

_________________________________________________________________________

Reviewer #1

All my concerns were addressed in this manuscript version.

The authors would like to express their sincere gratitude to the reviewers for their valuable time and effort, which have greatly contributed to enhancing the quality of our manuscript.

_________________________________________________________________________

Reviewer #1, Concern: There is a strange border on Figure 4.

Author action: The figure has been updated.

_________________________________________________________________________

Reviewer #2

Thank you for the opportunity to review this manuscript. The study addresses an important and timely research question and is conducted with a clear and methodologically sound approach. The manuscript is well-written, logically structured, and provides sufficient details to ensure reproducibility of results.

The data analysis is appropriate, and the conclusions are well-supported by the results. The literature review is comprehensive and contextualizes the current study within existing work in the field.

I particularly appreciate the clarity in the presentation of figures and tables, and the transparency of the methodology, including data sources and code availability, which align well with PLOS ONE’s open science policy.

I have not found any concerns regarding ethical approval, participant consent, or data sharing, and there is no indication of dual publication or plagiarism. The manuscript appears to comply with PLOS ONE's research and publication ethics.

The authors wish to extend their sincere appreciation to the reviewers for their significant contributions of time and effort, which have substantially enhanced the quality of our manuscript.

_________________________________________________________________________

Reviewer #3:

Reviewer #3, Concern #1: In the abstract, the results need to be included. The results are better presented in terms of the improvement ratio between the presented work and existing works.

Author response: We value the reviewer's note and have incorporated the improvement accordingly.

Author action: In the Abstract section, after the CPU-GPU comparison, we added a paragraph on Page 1 comparing the performance of GPU-based sorting in our paper with that reported in most existing works on the topic, expressed as speedup multipliers.

“Earlier GPU-based generations of this type typically achieved acceleration rates between 2× and 9× over scalar CPU code. With newer GPU enhancements, including parallel-aware primitives and radix- or merge-optimized operations, acceleration rates have seen significant improvement. Our experiments indicate that Radix Sort based on GPUs achieves a significant speedup of approximately 50× (sequential: 240.8 ms, parallel: 4.83 ms) on 10 million random sort elements. Quick Sort and Merge Sort have 97× and 103× speedups, respectively (Quick: 1461.97 ms vs. 15.1 ms; Merge: 2212.33 ms vs. 21.4 ms). Bubble Sort, while significantly improving in parallel (123,321.9 ms to 7377.8 ms for an ≈17× improvement), is considerably worse overall. These experimental findings confirm that the new single-GPU implementations can get speedups ranging from 17× to over 100×, surpassing the typical gains reported in previous generations and comparable to or over rates of acceleration reported for cutting-edge parallel sorting algorithms in recent studies.”
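The multipliers quoted in this paragraph follow directly from the reported timings (speedup = sequential time / parallel time). As a quick sanity check of the arithmetic, a minimal Python sketch using only the numbers given above:

```python
# Speedup ratios implied by the timings reported in the abstract:
# speedup = sequential time (ms) / parallel time (ms).
timings_ms = {
    "Radix Sort": (240.8, 4.83),
    "Quick Sort": (1461.97, 15.1),
    "Merge Sort": (2212.33, 21.4),
    "Bubble Sort": (123321.9, 7377.8),
}

for name, (seq_ms, par_ms) in timings_ms.items():
    speedup = seq_ms / par_ms
    print(f"{name}: {speedup:.1f}x")
```

Rounded to the nearest integer, these ratios reproduce the approximately 50x, 97x, 103x, and 17x figures stated in the text.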

_________________________________________________________________________

Reviewer #3, Concern #2: There are several works that try to improve algorithm performance by utilizing multi-core, GPU, and multi-threading. The authors need to include some of these works, for example:

[R1] Al-sudani, Ahlam Hanoon, et al. "Multithreading-Based Algorithm for High-Performance Tchebichef Polynomials with Higher Orders." Algorithms 17.9 (2024): 381.

[R2] Hsu, Kuan-Chieh, and Hung-Wei Tseng. "Simultaneous and Heterogenous Multithreading: Exploiting Simultaneous and Heterogeneous Parallelism in Accelerator-Rich Architectures." IEEE Micro 44.4 (2024).

[R3] Mahmmod, Basheera M., et al. "Performance enhancement of high order Hahn polynomials using multithreading." Plos one 18.10 (2023): e0286878.

Author response: We recognize and value the reviewer's thoughtful suggestions.

Author action: We added all suggested papers (R1 - R3) and described their unique aspects and their relevance to our topic in the Related Works section on Page 5, Lines 175-187.

“Apart from GPU-directed approaches, many research works have been done to achieve performance improvement through multithreading and multi-core approaches. As an example, Al-sudani et al. [24] developed a multithreading approach for computing high-order Tchebichef polynomials with significant speed-up in the evaluation of polynomials. Mahmmod et al. [25] achieved significantly reduced execution time for high-order Hahn polynomials through multithreading and achieved remarkable runtime reduction over sequential approaches. At the architectural level, Hsu and Tseng [26] designed a framework for multithreading and heterogeneous simultaneous multithreading in accelerator-rich systems, and showed the effectiveness of utilizing intra-core and inter-accelerator parallelism. These efforts together highlight that multi-threading on CPUs, GPUs, or even heterogeneous accelerators remains an essential way to improve algorithmic performance and is orthogonal to the GPU-specific sorting optimizations explored in this work.”

_________________________________________________________________________

Reviewer #3, Concern #3: Check the manuscript for grammar and typos.

Author response: Thank you for your suggestion.

Author action: The manuscript has been thoroughly proofread for typographical and grammatical errors, and all errors found have been corrected to improve overall clarity and readability. Moreover, a comprehensive professional review will be performed upon acceptance.

_________________________________________________________________________

_________________________________________________________________________

Reviewer 4/5

The manuscript titled "Performance Evaluation of GPU-Based Parallel Sorting Algorithms" provides a well-structured comparison of four classical sorting algorithms—Merge Sort (MS), Quick Sort (QS), Bubble Sort (BS), and Radix Sort (RS)—in both sequential (CPU) and parallel (GPU/CUDA) implementations. The study is clearly written and offers a unified benchmarking framework across four dataset distributions using a consistent hardware setup. The inclusion of execution time, memory usage, GPU utilization, and statistical repeatability across 30 runs contributes positively to the rigor of the experimental section.

We sincerely appreciate the time and effort you have dedicated to reviewing our manuscript. Your valuable feedback and constructive suggestions have been instrumental in helping us improve the clarity and quality of our work.

_________________________________________________________________________

Reviewer #4, Concern #1: Lack of Novelty: The manuscript does not propose any new algorithms, techniques, or optimization strategies. The selected algorithms are well-established, and their CUDA implementations are widely studied. The work presents confirmatory results rather than offering new insights into algorithmic performance or GPU computing.

Author response: We respectfully acknowledge the reviewer’s observation. The scope and purpose of our manuscript were established and accepted as a performance evaluation study rather than a proposal of new algorithms or optimization techniques. As clarified in Section 1 (Introduction), the novelty of the work lies in the comparative performance evaluation of four sorting algorithms — Bubble Sort, Radix Sort, Quick Sort, and Merge Sort — using GPU-based CUDA implementations.

The primary contribution of the paper is to provide a detailed empirical assessment of these algorithms under various data conditions (sorted, nearly sorted, random, and reverse-sorted) on modern GPU hardware. This performance evaluation offers a unified experimental framework that can serve as a baseline for future CUDA optimization studies and hybrid algorithmic designs.

We have ensured that the title, Introduction, and Contribution sections clearly highlight this intention and scope, emphasizing that while no new algorithms are introduced, the manuscript contributes by quantitatively validating GPU performance characteristics and identifying potential avenues for optimization on top of the existing implementations.

_________________________________________________________________________

Reviewer #4, Concern #2: Unfair Baseline Comparison: Sequential implementations are written in Java, while GPU versions are developed in CUDA C++. This introduces a language-level performance bias that undermines the accuracy of GPU–CPU speedup claims. A more rigorous and fair comparison would require both versions to be written in the same low-level language (e.g., C/C++).

Author response: We value the reviewer's note and have incorporated the improvement accordingly.

Author action: We reimplemented all sequential sorting algorithms in C++ to remove the language-level performance bias. The experiments were rerun, and the corresponding figures and graphs were updated to reflect the new C++ baseline results alongside the GPU implementations.

_________________________________________________________________________

Reviewer #4, Concern #3: Limited Optimization: The GPU implementations do not leverage key CUDA features such as shared memory, warp-level primitives, or memory coalescing. While the authors acknowledge this, it limits the relevance of performance results in a high-performance computing context.

Author response: We appreciate the reviewer’s comment.

Author action: We have rerun our parallel implementations with the inclusion of the CUDA features highlighted by the reviewer, namely shared memory, warp-level primitives, and memory coalescing. The updated experiments and figures in the manuscript now reflect these optimizations, providing a more accurate representation of GPU performance in a high-performance computing context. See Page 11, Line 368.

_________________________________________________________________________

Reviewer #4, Concern #4: Restricted Generalizability: All experiments were performed on a single GPU (GTX 1660 SUPER) and a mid-tier CPU, without comparison across other hardware platforms. While suitable for baseline analysis, the conclusions should be considered hardware-specific.

Author response: We thank the reviewer for their valuable feedback.

Author action: We have included an acknowledgment in the Conclusion section indicating that the experiments were conducted on a fixed hardware setup (Intel i5-9400F CPU and NVIDIA GTX 1660 SUPER GPU). This clarification ensures that readers understand the results are specific to this platform (page 21, line 731). Additionally, we have referenced this in the Threats to Validity section as well (Page 20, lines 681-686).

_________________________________________________________________________

Reviewer #4, Concern #5: Scope of Algorithms: The manuscript focuses on only four sorting algorithms. While these are diverse in paradigm, the exclusion of common GPU-optimized algorithms such as sample sort, bitonic sort, or hybrid strategies limits the comprehensiveness of the study.

Author response: We appreciate the reviewer’s perspective; however, we respectfully disagree with the assertion that a performance analysis is required for every algorithm. While we acknowledge the importance of performance evaluation, it's challenging to encompass all studies within a single research article. In our study, we intentionally focused on four representative algorithms: merge sort, quick sort, bubble sort, and radix sort, to establish a clear baseline. As noted in the manuscript, we are aware of this limitation and intend for future research to expand the analysis to include additional GPU-optimized algorithms, such as sample sort, bitonic sort, and hybrid strategies. We aim to incorporate these algorithms in our subsequent studies to enhance the comprehensiveness of our evaluations. (Refer to page 20, lines 687-691)

_________________________________________________________________________

Reviewer #4, comment: Data Availability and Reproducibility: A positive aspect of this work is that the datasets used in the experiments have been made publicly available on Figshare. This supports reproducibility and is commendable.

We truly appreciate your feedback, and we hope our responses help clarify our perspective.

_________________________________________________________________________

Reviewer 6

Thank you for your time and feedback.

_________________________________________________________________________

Reviewer #6, Concern #1:

The title of the paper is “Performance evaluation of GPU-based parallel sorting algorithms”, which is general. One expects to see time-efficient, recently published algorithms in this research. The following new sorting algorithms, which are time- and complexity-efficient, especially for parallel realization, are missing from your study and comparisons. In addition to the Quick, Merge, Bubble, and Radix sorting algorithms, some new ones are missing, such as mean-based and threshold-based sorting algorithms for integer and non-integer large-scale data sets, SlowSort as a new modified parallel realization of BitSort for integer data sets, and ClusterSort. All of them are based on the Divide & Conquer idea of breaking a big problem into many sub-problems.

- SlowSort: An Enhanced Sorting Algorithm for Large Scale Integer Datasets, Preprint of the accepted paper to be published in Software: Practice and Experience, 2025 (DOI: 10.22541/au.174523890.00820511/v1).

- A Threshold-Based Sorting Algorithm for Dense Wireless Communication Networks, IET Wireless Sensor Systems, Vol. 13, No. 2, pp. 37-47, Jan. 2023 (DOI: 10.1049/wss2.12048).

- Cluster Sort: A Novel Hybrid Approach to Efficient In-Place Sorting Using Data Clustering, IEEE Access, Vol. 13, pp. 74359-74374, 2025 (DOI: 10.1109/ACCESS.2025.3564380).

- A General Framework for Sorting Large Data Sets Using Independent Subarrays of Approximately Equal Length, IEEE Access, Vol. 10, pp. 11584-11607, 2022 (DOI: 10.1109/ACCESS.2022.3145981).

- On the Performance of Mean-Based Sort for Large Data Sets, IEEE Access, Vol. 9, pp. 37418-37430, March 2021 (DOI: 10.1109/ACCESS.2021.3063205).

Author response: We are pleased to respond to the reviewer's thoughtful comment.

Author action: In the Related Work section, we cited these new papers in a new paragraph (Page 5, Lines 188-201), explaining the basic mechanism and uniqueness of each approach, because they are close to our topic and we will include them in our future performance analysis.

“New sorting algorithms that emphasize both time complexity and ease of parallel implementation have also been developed in recent years. SlowSort, for example, is a generalization of BitSort through an adapted parallel version for sorting large-scale integer data [27], while threshold-based sorting utilizes adaptive thresholds to accelerate tasks in dense wireless communication networks [28]. In the same vein, ClusterSort uses the combination of clustering techniques with divide-and-conquer methods to achieve efficient in-place sorting for large data [29]. Additional contributions, such as the independent-subarray model [30], splitting data into balanced subproblems for improved wo…”

Attachment

Submitted filename: response to reviewers..pdf

pone.0342167.s002.pdf (304.6KB, pdf)

Decision Letter 2

Francesco Bardozzo

25 Nov 2025

PONE-D-25-22965R2

Performance evaluation of GPU-based parallel sorting algorithms

PLOS ONE

Dear Dr. Ala'anzy,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 09 2026 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Francesco Bardozzo

Academic Editor

PLOS ONE

Journal Requirements:

If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

Editor Comments:

The reviewers’ overall opinions are very divergent regarding whether the paper contains a substantial element of novelty.

Therefore, it is difficult to clearly identify the novel contribution of this work.

Some of the reviewer comments suggest possible improvements and scientific relevance for the journal:

Reviewer 6 refers to three recently published papers that introduce new sorting algorithms, which are already mentioned in the literature review and currently deferred to future work. I recommend performing simulations to compare the proposed algorithm with these algorithms, in addition to the classical ones.

Moreover, none of the figures (histograms and other visual representations) are in line with the quality standards of the journal. An initial reference figure is missing - a general pipeline or schematic - that could help the reader immediately understand what the work is about.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #3: (No Response)

Reviewer #6: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #3: Yes

Reviewer #6: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: Yes

Reviewer #6: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #3: (No Response)

Reviewer #6: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #3: Yes

Reviewer #6: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #3: In this manuscript, the GPU-based parallelization of mergesort (MS), quicksort (QS), bubble sort (BS) and radix top-k selection sort (RS) are investigated. Also, the performance of these algorithms is evaluated on GPUs utilizing CUDA.

In the revised manuscript, the following comments should be addressed. The main issue is that the references do not fit with the journal’s guidelines. In addition, several authors’ names in the references are incorrect and need to be corrected. The authors should revise the reference list accordingly.

Reviewer #6: Reviewing the revised version of the paper and the authors' responses to the reviewers' comments shows that most of the concerns have been addressed or fixed in the new version of the paper. One comment of Reviewer 6 concerned three recently published papers that introduce new sorting algorithms, which are pointed out in the literature review and deferred to future work.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: No

Reviewer #6: Yes: Shahriar Shirvani Moghaddam

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

To ensure your figures meet our technical requirements, please review our figure guidelines: https://journals.plos.org/plosone/s/figures

You may also use PLOS’s free figure tool, NAAS, to help you prepare publication quality figures: https://journals.plos.org/plosone/s/figures#loc-tools-for-figure-preparation.

NAAS will assess whether your figures meet our technical requirements by comparing each figure against our figure specifications.

PLoS One. 2026 Feb 3;21(2):e0342167. doi: 10.1371/journal.pone.0342167.r006

Author response to Decision Letter 3


9 Jan 2026

Manuscript ID: PONE-D-25-22965R2

Original Article Title: “Performance evaluation of GPU-based parallel sorting algorithms”

To: PLOS One Editor

Re: Response to reviewers

Dear Editor,

Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments.

We are uploading (a) our point-by-point response to the comments (below) (response to reviewers), (b) a marked-up copy of the manuscript (Revised Manuscript with Track Changes), and (c) a clean updated manuscript (“Main Manuscript”).

Best regards,

Mohammed Alaa Ala’anzy et al.

__________________________________________________________

<Editor Comments>

The reviewers’ overall opinions are very divergent regarding whether the paper contains a substantial element of novelty. Therefore, it is difficult to clearly identify the novel contribution of this work.

The authors would like to express their sincere gratitude to the Editors for their valuable time and effort, which have greatly contributed to enhancing the quality of our manuscript. In response, we would like to emphasize that the contribution and novelty are now clearly presented in the abstract and at the end of the Introduction section.

__________________________________________________________

Editor, Concern #1: Some of the reviewers' comments suggest possible improvements and scientific relevance for the journal:

Reviewer 6 refers to three recently published papers that introduce new sorting algorithms, which are already mentioned in the literature review and currently deferred to future work. I recommend performing simulations to compare the proposed algorithm with these algorithms, in addition to the classical ones.

Author response: We value the editor’s note and have incorporated the improvement accordingly.

Author action: We investigated the recently published (2025) paper “SlowSort: An Enhanced Sorting Algorithm for Large Scale Integer Datasets” (Software: Practice and Experience) and performed a full performance comparison against QS, BS, MS, and RS to identify the best algorithm, with updated values and graphs. The full description of Slow Sort (SS) is located in the CUDA section, Page 12, Line 378, with all formulas. Moreover, the time-complexity comparison, with graphs and tables, is provided in the Results and Evaluation section, Page 14, Line 470.

__________________________________________________________

Editor, Concern #2: Moreover, none of the figures (histograms and other visual representations) are in line with the quality standards of the journal. An initial reference figure is missing - a general pipeline or schematic - that could help the reader immediately understand what the work is about.

Author response: We thank the Editor for highlighting the importance of figure clarity and workflow presentation.

Author action: All result figures have been regenerated in high resolution to comply with the journal’s quality standards (Figures 3–5, Page 15).

In addition, an initial reference schematic has been added as Figure 1, Page 3 in the Introduction section, illustrating the overall workflow of dataset generation, CPU and GPU sorting algorithms, and performance comparison. This overview figure allows readers to immediately understand the scope and objective of the proposed study.

__________________________________________________________

<Reviewers comments>

Reviewer #3

In this manuscript, the GPU-based parallelization of mergesort (MS), quicksort (QS), bubble sort (BS) and radix top-k selection sort (RS) are investigated. Also, the performance of these algorithms is evaluated on GPUs utilizing CUDA.

The authors wish to extend their sincere thanks to Reviewer 3 for his/her time and effort, which have played a significant role in improving the quality of our manuscript.

__________________________________________________________

Reviewer #3, Concern: In the revised manuscript, the following comments should be addressed. The main issue is that the references do not fit with the journal’s guidelines. In addition, several authors’ names in the references are incorrect and need to be corrected. The authors should revise the reference list accordingly.

Author response & action: We thank the reviewer for this observation. We have updated the manuscript by carefully checking the references and correcting the incorrect author names on Page 22. In addition, the order and content (author and paper names, dates, etc.) of the reference list were checked carefully and corrected to follow the journal's format and to avoid mismatches in the future.

__________________________________________________________

Reviewer #6

Reviewing the revised version of the paper and the authors' responses to the reviewers' comments shows that most of the concerns have been addressed or fixed in the new version of the paper.

The authors wish to extend their sincere appreciation to Reviewer 6 for his/her significant contributions of time and effort, which have substantially enhanced the quality of our manuscript.

__________________________________________________________

Reviewer #6, Concern: One comment of Reviewer 6 concerned three recently published papers that introduce new sorting algorithms, which are pointed out in the literature review and deferred to future work.

Author response & action: We thank the reviewer for this observation. We examined the recently published 2025 study titled “SlowSort: An Enhanced Sorting Algorithm for Large Scale Integer Datasets” (Software: Practice and Experience) and conducted a comprehensive performance comparison against QS, BS, MS, and RS using updated metrics, such as time complexity, and visualizations such as graphs. A detailed description of the Slow Sort (SS) algorithm, including its CUDA-based formulation and associated analytical expressions, is presented in the CUDA section on Page 12, Line 378. Additionally, comparative analyses of time complexity, supported by graphs and tables, are provided in the Results and Evaluation section on Page 14, Line 470.

We would like to emphasize that some of the suggested papers are not strictly aligned with our work. Accordingly, we have included only the one that is aligned, namely the SlowSort paper.

__________________________________________________________

We thank the reviewers & editors for their thoughtful critiques, which helped us improve the clarity, scope, and scientific rigour of our work.

Attachment

Submitted filename: response_to_reviewers_auresp_3.pdf

pone.0342167.s003.pdf (125KB, pdf)

Decision Letter 3

Francesco Bardozzo

20 Jan 2026

Performance evaluation of GPU-based parallel sorting algorithms

PONE-D-25-22965R3

Dear Dr. Ala'anzy,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Francesco Bardozzo

Academic Editor

PLOS One

Acceptance letter

Francesco Bardozzo

PONE-D-25-22965R3

PLOS One

Dear Dr. Ala'anzy,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS One. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Francesco Bardozzo

Academic Editor

PLOS One

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Response to Reviewers.pdf

    pone.0342167.s001.pdf (389.6KB, pdf)
    Attachment

    Submitted filename: response to reviewers..pdf

    pone.0342167.s002.pdf (304.6KB, pdf)
    Attachment

    Submitted filename: response_to_reviewers_auresp_3.pdf

    pone.0342167.s003.pdf (125KB, pdf)

    Data Availability Statement

    All relevant data are publicly available from the Figshare repository: https://doi.org/10.6084/m9.figshare.29558357.


    Articles from PLOS One are provided here courtesy of PLOS
