Abstract
The strategy of Divide-and-Conquer (D&C) is one of the most frequently used programming patterns for designing efficient algorithms in computer science, and it has been parallelized on both shared-memory and distributed-memory systems. Tzeng and Owens developed a generic paradigm for parallelizing D&C algorithms on modern Graphics Processing Units (GPUs). In this paper, following that paradigm, we provide a new and publicly available GPU implementation of the well-known D&C algorithm QuickHull, as a sample and guide for parallelizing D&C algorithms on the GPU. The experimental results demonstrate the practicality of our sample GPU implementation. Our objective is to present a sample GPU implementation of a classical D&C algorithm that helps interested readers develop their own efficient GPU implementations with less effort.
Keywords: Computer science
1. Introduction
The strategy of Divide-and-Conquer (D&C) is one of the most frequently used programming patterns for designing efficient algorithms in computer science. The basic idea behind the D&C paradigm is to recursively divide the original problem into several sub-problems until those sub-problems become simple enough to be solved directly; the desired solution to the original problem is then obtained by combining the solutions to the sub-problems.
In computer science, the D&C pattern is usually used to deal with large-scale problems. When solving large-scale problems with the D&C pattern, another common programming strategy, parallelism, is often exploited as well: D&C algorithms are implemented in parallel in various parallel computing environments to improve computational efficiency.
For example, Horowitz and Zorat [1] theoretically analyzed the potential of parallel D&C algorithms, and pointed out that D&C algorithms can be even more efficient when run on an appropriately designed multiprocessor than their sequential counterparts. Atallah et al. [2] presented several general techniques for solving problems efficiently using the parallel D&C paradigm. Wu and Kung [3] studied the relationship between the parallel computation cost and the communication cost of performing D&C computations on a parallel system of P processors.
Achatz and Schulte [4] presented transformation rules for parallelizing D&C algorithms over power lists. Those rules convert the parallel control structure of D&C into a sequential control flow, thus making the implicit massive data parallelism in a D&C scheme explicit. Sreenivas et al. [5] first discussed various techniques for parallel D&C in detail, then extended these techniques to efficiently handle disk-resident data, and also provided a generic technique for parallel out-of-core D&C problems. Mateos et al. [6] described EasyFJP, an approach that simplifies the parallelization of sequential D&C Java applications.
Recently, Hijma et al. [7] investigated the feasibility of automatically inserting sync statements to relieve programmers of the burden of reasoning about synchronization. Chou et al. [8] proposed an energy- and performance-efficient Dynamic Voltage and Frequency Scaling (DVFS) scheduling scheme to deal with the inherent load-imbalance issue in irregular parallel D&C algorithms.
Most of the efforts introduced above parallelize various D&C algorithms on shared-memory or distributed-memory parallel computers using existing programming interfaces such as HPF [9] and MPI [10], [11].
Several other research efforts designed and implemented compilers or frameworks to parallelize D&C algorithms, such as the one introduced in [12], Satin [13], and DAMPVM/DAC [14]. Satin is a system for running D&C programs on distributed-memory systems that extends Java with three simple Cilk-like primitives for D&C programming [13]. DAMPVM/DAC is implemented on top of DAMPVM and provides automatic partitioning of irregular D&C applications at runtime [14].
With the emergence of the programmable Graphics Processing Unit (GPU) [15], [16], research efforts have also been made to parallelize various D&C algorithms by exploiting the massive computing capability of modern GPUs. For example, Vomel et al. [17] described several techniques for accelerating D&C algorithms on GPU-CPU heterogeneous architectures, and presented an example of how to develop efficient numerical software on such architectures.
Among those efforts, probably the most valuable are the parallelizations of the D&C sorting algorithms [18] and the scan algorithm [19] on the GPU. Other efforts have parallelized D&C mesh generation [20], [21], [22], convex hull computation [23], [24], [25], multibody system dynamics [26], [27], etc.
In [28], Tzeng and Owens developed a generic paradigm for parallelizing D&C algorithms on the GPU, and applied the proposed paradigm to implement the famous convex hull algorithm, QuickHull [29], in parallel on the GPU. Tzeng and Owens's paradigm is generic, effective, and easy to follow when parallelizing D&C algorithms on the GPU.
In this paper, by following the generic paradigm proposed by Tzeng and Owens [28], we provide a new and publicly available GPU implementation of the famous QuickHull algorithm as a sample demonstrating the parallelization of D&C algorithms on the GPU.
It should be noted that the basic ideas behind our sample GPU implementation are derived from Tzeng and Owens's paradigm; however, our implementation of the famous QuickHull algorithm differs from the one developed by Tzeng and Owens.
The major difference between our sample GPU implementation and the one developed by Tzeng and Owens is that Tzeng and Owens relied heavily on the efficient parallel primitives such as sort, scan, and reduction provided by the CUDPP library [30], whereas we take advantage of the corresponding primitives offered by the Thrust library [31].
We use the Thrust library rather than CUDPP because Thrust is much easier to use: it has been integrated into the CUDA (Compute Unified Device Architecture) programming model. In other words, when parallelizing D&C algorithms on the GPU with CUDA, it is quite convenient for users to invoke the parallel primitives in Thrust, and thus less effort is needed.
Our contribution in this paper can be summarized as follows. We develop a sample GPU implementation of a classical D&C algorithm (i.e., QuickHull) by following Tzeng and Owens's paradigm. We hope this will help interested readers develop their own efficient GPU implementations of D&C algorithms with less effort and in less time, for example, efficient D&C algorithms for mesh generation [20], [21], [22], convex hull computation [23], [24], [25], multibody system dynamics [26], [27], etc.
The rest of this paper is organized as follows. Section 2 briefly reviews Tzeng and Owens's paradigm for parallelizing D&C algorithms on the GPU. Section 3 introduces the basic ideas and details of our sample GPU implementation of the classical QuickHull algorithm. The experimental results and analysis are presented in Section 4. The present sample GPU implementation is briefly discussed in Section 5. Finally, Section 6 concludes our work.
2. Background
In a D&C algorithm, an original problem is recursively divided into several sub-problems, and each sub-problem is solved individually to obtain the desired solution to the original problem. However, the underlying GPU architecture does not support recursion, and it is not clear how to map D&C algorithms onto the GPU architecture efficiently [32].
To address the above problem, Tzeng and Owens [28] proposed a generic paradigm for parallelizing D&C algorithms on modern GPUs, and applied the paradigm to implement the QuickHull algorithm for finding convex hulls. This section briefly introduces the basic ideas behind Tzeng and Owens's paradigm.
2.1. Basic ideas behind Tzeng and Owens's paradigm
In Tzeng and Owens's paradigm, each sub-problem is treated as a contiguous segment within the input data, and the divide stage of the D&C paradigm is implemented by permuting the input data so that the data for each sub-problem remain stored contiguously in non-overlapping segments. The recursive function calls can therefore be replaced by a single kernel call that performs the permutation on the whole input array; see Figure 1.
Figure 1.
The essential difference between traditional D&C algorithms and Tzeng and Owens's paradigm. (This figure is derived from [32].)
Note that each group of similar data in the array is defined as a segment. See more descriptions in the subsequent Subsection 2.2.
More specifically, Tzeng and Owens keep the data in the original array rather than scattering them into separate arrays, and permute the modified data so that similar data end up stored consecutively. The whole array is then marked with segment flags so that subsequent operations can be conducted on each individual segment of data. The major operations are carried out using the parallel primitives scan and segmented scan; see more details on the segmented scan in the subsequent Subsection 2.2.
There are two major operations in Tzeng and Owens's paradigm, i.e., the flagPermute and Compact.
• The objective of flagPermute is to permute the data according to an array of flags so that elements with the same flag are gathered into segments (i.e., laid out contiguously in the array); see Figure 2(a).
• The objective of Compact is to discard the elements whose flags are marked as false from the input array, according to an array of Booleans; see Figure 2(b).
Figure 2.

The operations: flagPermute and Compact. (This figure is derived from [32].)
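To make the semantics of the two operations concrete, the following host-side C++ sketch mimics them with standard-library calls. The function names flagPermute and compact follow the text; the actual GPU versions in [28] are built from parallel scan/scatter primitives, which this sequential sketch does not attempt to reproduce.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// flagPermute: permute the data so that true-flagged elements come first,
// preserving the relative order within each group (i.e., each group ends up
// as a contiguous segment).
std::vector<int> flagPermute(const std::vector<int>& data,
                             const std::vector<char>& flags) {
    std::vector<std::pair<char, int>> z(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) z[i] = {flags[i], data[i]};
    std::stable_partition(z.begin(), z.end(),
                          [](const std::pair<char, int>& p) { return p.first != 0; });
    std::vector<int> out(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) out[i] = z[i].second;
    return out;
}

// compact: keep only the true-flagged elements, discarding the rest.
std::vector<int> compact(const std::vector<int>& data,
                         const std::vector<char>& flags) {
    std::vector<int> out;
    for (std::size_t i = 0; i < data.size(); ++i)
        if (flags[i]) out.push_back(data[i]);
    return out;
}
```

For example, flagPermute of {1, 2, 3, 4, 5} with flags {0, 1, 0, 1, 0} yields {2, 4, 1, 3, 5}, while compact yields {2, 4}.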
2.2. Segment and segmented scan
Segments are contiguously stored partitions of data, which can be represented using segment flags [19], [33]. In general, there are two common representations of segments: the first uses a set of flags, and the second uses a set of keys; see the simple illustrations in Figure 3. A flag labels the beginning position of a segment (i.e., the segment head). A key indicates which segment each element of the array belongs to.
Figure 3.
Segments and segment representations.
Scan is one of the most common parallel primitives. The segmented scan is a variant of the generic scan primitive that performs scans over segments of arbitrary length within the input arrays [33]. To improve the computational efficiency of the segmented scan when applied to large arrays, Sengupta, Harris, Zhang, and Owens [19] parallelized the segmented scan on the GPU. In addition, the GPU-accelerated segmented scan has been integrated into several existing libraries such as CUDPP [30] and Thrust [31].
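As an illustration of the semantics only, the following sequential C++ sketch performs a key-driven segmented inclusive sum scan; Thrust exposes the parallel equivalent as thrust::inclusive_scan_by_key(). The function and variable names here are our own.

```cpp
#include <cassert>
#include <vector>

// Key-driven segmented inclusive (sum) scan: the running sum restarts
// whenever the key changes, so each segment is scanned independently.
std::vector<int> segmented_scan(const std::vector<int>& keys,
                                const std::vector<int>& vals) {
    std::vector<int> out(vals.size());
    for (std::size_t i = 0; i < vals.size(); ++i)
        out[i] = (i > 0 && keys[i] == keys[i - 1]) ? out[i - 1] + vals[i]
                                                   : vals[i];
    return out;
}
```

For keys {0, 0, 0, 1, 1, 2} and values {1, 2, 3, 4, 5, 6}, the result is {1, 3, 6, 4, 9, 6}: the sum restarts at each segment head.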
3. Methods
3.1. Overview of our sample GPU implementation
In this paper, we present a sample GPU implementation of a classic D&C algorithm, QuickHull [34]. The most important idea behind our implementation is to operate directly on the input arrays that are originally allocated to store the coordinates of the input points, rather than on additionally allocated arrays, and to avoid splitting the input data into separate arrays. This idea is derived from Tzeng and Owens's paradigm [28].
The QuickHull algorithm is a D&C method, which divides the input data set into subsets and then handles these subsets recursively. A simple way to perform calculations on those subsets is to store them in separate arrays and then operate on those arrays. This method is easy to implement in a sequential programming pattern on the CPU. On the GPU, however, parallelization is in general carried out based on aligned data accesses, which means that data stored in different arrays with different sizes are not well suited to parallelization.
An alternative and practical strategy is still to divide the input data set into subsets, but without storing them in separate arrays of different sizes. Instead, all the data remain in the input array, and the data of each subset are stored in a Segment (i.e., a consecutive piece / partition of data) [33]. The operations carried out for each subset are exactly the operations for each segment. This strategy was introduced by Tzeng and Owens [28].
In our implementation, we adopt the above strategy, but our implementation differs from the one presented in [28]. In Section 5 we discuss the differences; in this section we briefly introduce our implementation; see the procedure of our implementation in Algorithm 1.
Algorithm 1.
Procedure: 2D Quickhull on the GPU (without preprocessing).
In our implementation, after splitting the input set of points into the lower and the upper subsets, we sort the two subsets separately according to their x-coordinates. After this sorting, each of the lower and upper subsets of sorted points can be considered a Monotone Chain [35]; in addition, the two chains can be further considered the two halves of a general polygon. If we detect and remove those vertices of the general polygon whose interior angles are greater than 180 degrees, i.e., the concave vertices, we obtain a convex polygon that is in fact the desired convex hull.
The idea of first sorting and then removing non-extreme points / concave vertices was first introduced in Andrew's Monotone Chain convex hull algorithm [35]. The method of removing the concave vertices of a general polygon to obtain the expected convex hull was also described in the Melkman convex hull algorithm [36]. Our implementation adopts both ideas.
Therefore, the basic idea behind our implementation is "Find-and-Remove". In the first split step, we divide the input points and then sort them to virtually form a general polygon. In the subsequent recursive step, we repeatedly find the non-extreme points and remove them, so that all remaining points are guaranteed to be extreme points of the expected convex hull.
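For reference, the classical sequential form of this "sort, then remove concave vertices" idea (the scan step of Andrew's algorithm [35]) can be sketched as follows; this is background illustration only, not our GPU code, and the names used are ours.

```cpp
#include <cassert>
#include <vector>

struct Pt { double x, y; };

// 2D cross product of (b - a) x (c - a); a positive value means the path
// a -> b -> c makes a counter-clockwise (left) turn.
double cross(Pt a, Pt b, Pt c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// One sorted monotone chain: scan left to right and pop every vertex whose
// interior angle is not convex (the classical Andrew scan step).
std::vector<Pt> lower_hull(const std::vector<Pt>& pts /* sorted by x */) {
    std::vector<Pt> h;
    for (Pt p : pts) {
        while (h.size() >= 2 && cross(h[h.size() - 2], h.back(), p) <= 0)
            h.pop_back();  // concave vertex: remove it
        h.push_back(p);
    }
    return h;
}
```

Our GPU implementation reaches the same final point set, but finds and removes the non-extreme points in parallel over all segments instead of scanning the chain sequentially.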
3.2. Implementation details
This section describes some details of our implementation. The first feature of our implementation is the use of several efficient parallel primitives provided by the Thrust library [31]. The second feature is that we operate directly on the input data arrays rather than on additionally allocated arrays.
The source code of our GPU implementation is provided as supplementary material for public use; see the section Supplementary.
3.2.1. Data storage and data layout
We allocate several arrays on the device side to store the coordinates of planar points, information about segments, and other required values such as distances; see Table 1.
Table 1.
Allocated arrays for storing data on the device.
| Array | Usage |
|---|---|
| float x[n] | x coordinates |
| float y[n] | y coordinates |
| float dist[n] | Distances |
| int head[n] | Indicator of the first point of each segment (1: head point; 0: not a head point) |
| int keys[n] | Index of the segment that each point belongs to |
| int first_pts[n] | Index of the first point of each segment |
| int flag[n] | Whether a point is an extreme point or an interior point (1: potential extreme point; 0: determined interior point) |
3.2.2. The preprocessing procedure
Before performing the QuickHull algorithm on the GPU, we first carry out a preprocessing procedure to filter the input points. The objective of this preprocessing procedure is to reduce the number of points by discarding those points that need not be considered in the subsequent stage of calculating the desired convex hull. More specifically, we first find four extreme points, i.e., the leftmost, the rightmost, the topmost, and the bottommost points, which form a convex quadrilateral. We then check each of the remaining points to determine whether it lies inside the formed quadrilateral. Points falling inside the quadrilateral are definitely non-extreme, and thus can be discarded directly at this stage; the remaining points are used to calculate the desired convex hull in the subsequent stage.
We use parallel reduction to find the extreme points with the minimum or maximum x or y coordinates. More specifically, we call thrust::minmax_element(x.begin(), x.end()) to find the leftmost and the rightmost points, and similarly use thrust::minmax_element() on the y coordinates to obtain the bottommost and the topmost points. These four extreme points then form a convex quadrilateral.
We also design a simple CUDA kernel to check whether each point lies inside the quadrilateral. In the kernel, each thread is responsible for determining the position of a single point, i.e., whether or not that point falls inside the formed convex quadrilateral. If it does, the corresponding indicator value flag[i] is set to 0; otherwise, flag[i] remains 1.
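The per-thread test can be sketched on the host as follows, assuming the quadrilateral's vertices are given in counter-clockwise order; the struct and function names are ours, for illustration, and the actual kernel applies the same cross-product test with one thread per point.

```cpp
#include <cassert>

struct Pt { double x, y; };

// 2D cross product of (b - a) x (c - a); > 0 means c lies to the left of
// the directed line a -> b.
double cross(Pt a, Pt b, Pt c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// A point lies strictly inside the convex quadrilateral q[0] q[1] q[2] q[3]
// (listed counter-clockwise) iff it is to the left of every directed edge.
// Points exactly on an edge are treated as outside here, i.e., they are
// conservatively kept as potential extreme points.
bool inside_quad(const Pt q[4], Pt p) {
    for (int i = 0; i < 4; ++i)
        if (cross(q[i], q[(i + 1) % 4], p) <= 0) return false;
    return true;
}
```

In the kernel, a point for which this test succeeds would get flag[i] = 0 (interior, discard); otherwise flag[i] stays 1.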
3.2.3. The first split
The first split of the QuickHull algorithm divides the set of input points into two subsets, i.e., the lower and the upper subsets, using the line segment L formed by the leftmost and the rightmost points. Points located below L are grouped into the lower subset, while points located above L belong to the upper subset.
We develop another quite simple kernel to perform the split. In this kernel, each thread is responsible for determining the position of a single point with respect to the line segment L. In this step, we temporarily use the values int flag[n] to indicate the positions: if a point lies below L, in other words, if it belongs to the lower subset, then the corresponding indicator value flag[i] is set to 1; otherwise, it is set to 0.
After determining the positions of all points, the points belonging to the same subset (e.g., the lower one) need to be gathered together according to the indicator values int flag[n]. We realize this procedure simply by using the function thrust::partition(). Points with indicator value 1, i.e., the lower points, are placed in the first consecutive half of the input array (one segment of points), while the upper points are grouped into the other consecutive half (another segment of points). In subsequent steps, operations are performed on these segments of points.
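A host-side sketch of the whole first-split step is given below, assuming a and b are the leftmost and rightmost points; std::stable_partition stands in for the device-side thrust::partition() call, and all names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Pt { double x, y; };

// 2D cross product of (b - a) x (p - a); < 0 means p is to the right of the
// directed line a -> b, i.e., below L when a is leftmost and b is rightmost.
double cross(Pt a, Pt b, Pt p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Flag every point below the directed line a -> b (the lower subset) and
// gather the lower subset into the first contiguous segment. Returns the
// size of the lower segment; the upper segment is the remainder.
std::size_t first_split(std::vector<Pt>& pts, Pt a, Pt b) {
    auto below = [&](const Pt& p) { return cross(a, b, p) < 0; };
    auto mid = std::stable_partition(pts.begin(), pts.end(), below);
    return static_cast<std::size_t>(mid - pts.begin());
}
```

After the call, pts[0 .. k-1] holds the lower segment and pts[k .. n-1] the upper segment, where k is the returned size.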
3.2.4. The recursive procedure
Finding the farthest point
The first step in the recursive procedure is to find the farthest point in each segment, which involves two notable issues. The first is to calculate the distances for all points in parallel; the other is to find the point with the farthest distance in each segment, for all segments in parallel.
Calculating the distance
Calculating the distance from a point to a line is quite straightforward. However, in our implementation, the distances from different points to different lines must be calculated simultaneously in parallel, which is not so easy to implement in practice. This is because: (1) for each segment, the distance must be computed from each point belonging to that segment to the line formed by the segment's first and last points; and (2) any two segments have different first and last points.
Therefore, when calculating the distances for points belonging to different segments, two pieces of information are needed: (1) the segment that each point belongs to, and (2) the first and the last point of that segment.
In our implementation, for each point, we use the value first_pts[i] to record the index of the first point of the segment it belongs to. Since segments are stored consecutively, the first point of the (j + 1)th segment is exactly the last point of the jth segment, except for the last segment, whose last point is the point P0. Therefore, it is easy to obtain the index of the last point of each segment, and hence the distance from each point to the line formed by the first and last points of its segment.
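A sequential sketch of this distance computation is given below (the GPU kernel assigns one thread per point). Two simplifications for illustration: the last point of a segment is located by scanning forward for the next head flag, and the unnormalized cross product is used as the distance measure, which suffices for comparing distances of points against a common line.

```cpp
#include <cassert>
#include <vector>

// For point i, its segment's first point is first_pts[i]; the segment's
// last point is the next segment head after i (or the point P0, i.e.,
// index 0, for the last segment). Array names follow Table 1.
void compute_dists(const std::vector<double>& x, const std::vector<double>& y,
                   const std::vector<int>& head,
                   const std::vector<int>& first_pts,
                   std::vector<double>& dist) {
    const int n = static_cast<int>(x.size());
    for (int i = 0; i < n; ++i) {
        int last = 0;                    // last point of the last segment is P0
        for (int j = i + 1; j < n; ++j)  // next head = this segment's last point
            if (head[j]) { last = j; break; }
        const int f = first_pts[i];
        // unnormalized cross product of (last - first) x (i - first)
        dist[i] = (x[last] - x[f]) * (y[i] - y[f])
                - (y[last] - y[f]) * (x[i] - x[f]);
    }
}
```
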
Find the farthest point
After calculating the distances for the points in the different segments, the farthest point in each segment must be found. Finding the farthest point of a single segment is easy to realize: one global parallel reduction suffices to find the greatest distance and the corresponding (farthest) point.
However, finding the farthest points of all segments in parallel is not so easy, because there is typically more than one segment, and each segment has its own, segment-specific farthest point. In this case a single global parallel reduction cannot be used; instead, a segmented parallel reduction is employed to find the greatest distance within each segment of points.
The segmented parallel reduction in Thrust performs a parallel reduction over multiple segments of points; it can be used to find the minimum or maximum values of several segments in parallel. In our implementation, we employ the parallel primitive thrust::reduce_by_key() to find the farthest points / maximum distances of all segments in parallel.
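A sequential analogue of this segmented reduction is sketched below: for each run of equal keys it keeps the (distance, index) pair with the maximum distance, which is the result thrust::reduce_by_key() would produce in parallel when given a maximum-by-distance reduction functor. The function name is ours.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// For each run of equal keys (one run per segment, since segments are
// contiguous), keep the (distance, index) pair with the maximum distance.
std::vector<std::pair<double, int>> farthest_per_segment(
        const std::vector<int>& keys, const std::vector<double>& dist) {
    std::vector<std::pair<double, int>> out;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i == 0 || keys[i] != keys[i - 1])
            out.push_back({dist[i], static_cast<int>(i)});  // new segment starts
        else if (dist[i] > out.back().first)
            out.back() = {dist[i], static_cast<int>(i)};    // better candidate
    }
    return out;
}
```

For keys {0, 0, 0, 1, 1} and distances {1, 5, 2, 3, 4}, the result is one pair per segment: (5, 1) and (4, 4).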
The first round of updating segments
After finding the farthest points, each segment is typically divided into two smaller sub-segments at its farthest point. This means that the old segments are replaced with new segments. To create the new segments, the following information about the segments needs to be updated (see Listing 1):
(1) Head flags
The head flags of the farthest points need to be changed from 0 to 1, which means each farthest point becomes the first point of a new segment. The head flags of the other points are kept unchanged.
(2) Keys
Updating the keys can be realized very easily by performing a global inclusive scan over the head flags. Because in each segment only the head flag of the first point is 1, the running sum of the head flags can be regarded as the index of the segment (i.e., the key). Note that this sum is one-based rather than zero-based (i.e., it starts from 1 rather than 0). To make the indices easier to use, we further convert the keys from one-based to zero-based by performing a parallel subtraction.
(3) Indices of first points
After updating the head flags and keys of each segment, the corresponding indices of the first points are no longer valid and thus need to be updated. We first assign a global index to each of the remaining points, and then check, according to the head flags, whether each point is a first point. If the head flag of a point is 1, then that point must be the first point of a segment, and the index of that segment's first point is exactly its global index i.
Listing 1.
The first round of updating segments.
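Assuming the head flags have already been set to 1 at the farthest points, the updating of keys and first-point indices can be sketched sequentially as follows; std::partial_sum plays the role of the global inclusive scan, and the sequential propagation loop stands in for a segmented operation on the GPU. The function name is ours.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Zero-based keys are the inclusive scan of the head flags minus one;
// first_pts[i] is the global index i wherever head[i] == 1, propagated
// through the rest of the segment.
void update_segments(const std::vector<int>& head, std::vector<int>& keys,
                     std::vector<int>& first_pts) {
    std::partial_sum(head.begin(), head.end(), keys.begin());    // inclusive scan
    for (std::size_t i = 0; i < keys.size(); ++i) keys[i] -= 1;  // one-based -> zero-based
    for (std::size_t i = 0; i < head.size(); ++i) {
        if (head[i]) first_pts[i] = static_cast<int>(i);
        else if (i > 0) first_pts[i] = first_pts[i - 1];         // propagate within segment
    }
}
```

For head flags {1, 0, 1, 0, 0}, this produces keys {0, 0, 1, 1, 1} and first-point indices {0, 0, 2, 2, 2}.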
Discarding interior points
Discarding the interior points starts by checking whether or not a point lies inside the triangle formed by the first point of the segment (denoted A), the last point of the segment (denoted B), and the farthest point in the segment (denoted C).
Let △ACB denote this triangle. Determining the positions of the points with respect to △ACB amounts to checking whether the points in the segment lie on the right side of line AC and of line CB. Since the segment of points AB has been divided into two smaller segments AC and CB at the farthest point C, the first point of the old segment AB becomes the first point of segment AC, while C is both the last point of segment AC and the first point of segment CB.
Therefore, for the points in segment AC, it is only necessary to check whether each point lies on the right side of the directed line AC. If it does, it is not an interior point and its indicator value flag[i] is set to 1; otherwise, it is an interior point and flag[i] must be set to 0. Similarly, for each point in segment CB, it is only necessary to check whether it falls on the right side of the directed line CB.
After determining all interior points in the current recursion, we employ a parallel partitioning procedure to gather all interior points together according to the indicator values int flag[n]; see the simple illustration in Figure 4. Note that, to maintain the relative order of the input points, we use the function thrust::stable_partition() rather than thrust::partition(). After the partitioning, we perform a resize() operation to remove all the interior points found in this recursion.
Figure 4.
Partitioning according to flags.
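For a single segment, the discarding step can be sketched as follows; the real implementation processes all segments at once on the device and uses thrust::stable_partition() followed by resize(), and the struct and function names here are ours.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Pt { double x, y; };

// 2D cross product of (b - a) x (p - a); < 0 means p lies to the right of
// the directed line a -> b.
double cross(Pt a, Pt b, Pt p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// A point of segment AC survives only if it lies strictly to the right of
// the directed line A -> C; interior points are gathered to the back by a
// stable partition and then dropped by resize().
void discard_interior(std::vector<Pt>& seg, Pt A, Pt C) {
    auto keep = [&](const Pt& p) { return cross(A, C, p) < 0; };
    auto mid = std::stable_partition(seg.begin(), seg.end(), keep);
    seg.resize(static_cast<std::size_t>(mid - seg.begin()));
}
```
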
The second round of updating segments
This round of segment updating is nearly the same as the first round. It is needed because, after removing some interior points, points belonging to some segments have been removed; the global indices of the remaining points are therefore no longer consecutive and need to be rearranged; see the first round of updating for more details. Note that the head flags and the keys of the remaining points are still correct and do not need to be updated.
4. Results
4.1. Test data and experimental settings
To evaluate the performance of our sample GPU implementation, in this section, we perform several groups of experimental tests on three GPUs using two groups of test data in two modes.
The first group of input test data is composed of eight sets of planar points. Each set of planar points is randomly distributed in a unit square and created with the rbox component of the Qhull library [37]. The second group of test data consists of ten sets of planar points. Each set of planar points is generated by mapping the nodes of a 3D mesh model onto the XY plane; see the 3D mesh models listed in Table 8 and illustrated in Figure 5. These 3D mesh models are publicly available from the GIT Large Geometry Models Archive (http://www.cc.gatech.edu/projects/large_models/) and the Stanford 3D Scanning Repository (http://www.graphics.stanford.edu/data/3Dscanrep/).
Table 8.
Comparison of running time (/ms) for points derived from 3D models on K20c.
| 3D Model | Size | Qhull | Our implementation (Mode 1) | Our implementation (Mode 2) | Speedup (Mode 1) | Speedup (Mode 2) |
|---|---|---|---|---|---|---|
| Armadillo | 172K | 16 | 47.5 | 61.7 | 0.34 | 0.26 |
| Angel | 237K | 47 | 50.5 | 62.6 | 0.93 | 0.75 |
| Skeleton Hand | 327K | 32 | 49.5 | 61.8 | 0.65 | 0.52 |
| Dragon | 437K | 62 | 55.9 | 67.4 | 1.11 | 0.92 |
| Happy Buddha | 543K | 63 | 56.8 | 76.7 | 1.11 | 0.82 |
| Turbine Blade | 882K | 125 | 64.3 | 72.3 | 1.94 | 1.73 |
| Vellum Manuscript | 2M | 219 | 63.2 | 91.5 | 3.47 | 2.39 |
| Asian Dragon | 3M | 359 | 78.4 | 129.8 | 4.58 | 2.77 |
| Thai Statue | 5M | 515 | 84.4 | 142.6 | 6.10 | 3.61 |
| Lucy | 14M | 1404 | 141.3 | 223.3 | 9.94 | 6.29 |
Figure 5.
Three-dimensional mesh models.
We perform our experimental tests using the following three GPUs in the environments listed in Table 2.
Table 2.
The platforms used for testing.
| Configuration | Platform NO.1 | Platform NO.2 | Platform NO.3 |
|---|---|---|---|
| CPU | Intel i7-3610QM | Intel i5-3470 | Intel E5-2650 |
| CPU Frequency | 2.30GHz | 3.20GHz | 2.60GHz |
| Memory | 6GB | 8GB | 96GB |
| GPU | GeForce GTX 660M | GeForce GT640 | Tesla K20c |
| GPU RAM | 2GB | 1GB | 4GB |
| GPU Cores | 384 | 384 | 2496 |
| GPU CC | 3.0 | 3.5 | 3.5 |
| CUDA | Version 5.5 | Version 5.5 | Version 5.5 |
| OS | Windows 7 | Windows 7 | Windows 7 |
In addition, all experimental tests are conducted in two modes, which differ in the use of the preprocessing procedure. Mode 1: the preprocessing procedure is employed. Mode 2: the preprocessing procedure is not used.
In summary, to evaluate the performance of our implementation, we use: (1) three GPUs (GTX 660M, GT640, and K20c); (2) two groups of test data; (3) two modes (i.e., with and without preprocessing).
It should also be noted that, in all experimental tests, the reported running time of the GPU implementations and of the Qhull library does not include the time spent on inputting the test data or outputting the experimental results.
4.2. Test results
4.2.1. Tests for points randomly distributed in unit square
We test the efficiency of our implementation using points randomly distributed in the unit square on three GPUs, i.e., GTX660M, GT640, and K20c, in Mode 1 and Mode 2, and compare the efficiency with that of the Qhull library [37]; see Table 3, Table 4, and Table 5. The experimental results show that our implementation achieves speedups of up to 10.98x and 8.30x in Mode 1 and Mode 2, respectively. In addition, in the best case it takes about 0.2 seconds to compute the convex hull of 20M points.
Table 3.
Comparison of running time (/ms) for points distributed in a square on GTX660M.
| Size | Qhull | Our implementation (Mode 1) | Our implementation (Mode 2) | Speedup (Mode 1) | Speedup (Mode 2) |
|---|---|---|---|---|---|
| 100K | 27 | 38.5 | 54.6 | 0.70 | 0.49 |
| 200K | 53 | 38.6 | 57.1 | 1.37 | 0.93 |
| 500K | 124 | 57.8 | 77.5 | 2.15 | 1.60 |
| 1M | 235 | 73.4 | 101.5 | 3.20 | 2.32 |
| 2M | 426 | 99.7 | 140.8 | 4.27 | 3.03 |
| 5M | 601 | 146.0 | 217.5 | 4.12 | 2.76 |
| 10M | 1158 | 281.8 | 372.3 | 4.11 | 3.11 |
| 20M | 2331 | 442.2 | 653.6 | 5.27 | 3.57 |
Table 4.
Comparison of running time (/ms) for points distributed in a square on GT640.
| Size | Qhull | Our implementation (Mode 1) | Our implementation (Mode 2) | Speedup (Mode 1) | Speedup (Mode 2) |
|---|---|---|---|---|---|
| 100K | 11 | 43.9 | 60.4 | 0.25 | 0.18 |
| 200K | 21 | 46.7 | 66.5 | 0.45 | 0.32 |
| 500K | 54 | 53.0 | 78.1 | 1.02 | 0.69 |
| 1M | 106 | 63.0 | 92.5 | 1.68 | 1.15 |
| 2M | 209 | 87.4 | 126.9 | 2.39 | 1.65 |
| 5M | 523 | 159.5 | 219.4 | 3.28 | 2.38 |
| 10M | 1045 | 274.1 | 362.6 | 3.81 | 2.88 |
| 20M | 2226 | 447.4 | N/A | 4.98 | N/A |
Table 5.
Comparison of running time (/ms) for points distributed in a square on K20c.
| Size | Qhull | Our implementation (Mode 1) | Our implementation (Mode 2) | Speedup (Mode 1) | Speedup (Mode 2) |
|---|---|---|---|---|---|
| 100K | 16 | 37.2 | 61.7 | 0.43 | 0.26 |
| 200K | 32 | 36.4 | 65.7 | 0.88 | 0.49 |
| 500K | 62 | 41.7 | 69.2 | 1.49 | 0.90 |
| 1M | 109 | 44.8 | 73.9 | 2.43 | 1.47 |
| 2M | 234 | 55.9 | 86.2 | 4.19 | 2.71 |
| 5M | 561 | 77.9 | 118.5 | 7.20 | 4.73 |
| 10M | 1029 | 124.7 | 161.7 | 8.25 | 6.36 |
| 20M | 2262 | 206.0 | 272.5 | 10.98 | 8.30 |
It can be clearly observed that: (1) the best efficiency is achieved on the K20c, while the running time is nearly the same on the GTX660M and GT640; and (2) on all three GPUs, Mode 1 achieves better efficiency than Mode 2.
It should be noted that, on the GT640, the test for the set of 20M points in Mode 2 could not be conducted due to insufficient memory on the device; see Table 4. This is because, without the preprocessing, too much memory needs to be allocated on the device, exceeding the global memory capacity of the GT640.
4.2.2. Tests for points derived from 3D mesh models
We also evaluate the efficiency of our implementation using the points derived from 3D mesh models on the three GPUs; see the experimental results listed in Table 6, Table 7, and Table 8. The results indicate that our implementation achieves speedups of up to 9.94x and 6.29x in Mode 1 and Mode 2, respectively. Additionally, in the best case our implementation finds the convex hull of 14M points in less than 0.2 seconds.
Table 6.
Comparison of running time (/ms) for points derived from 3D models on GTX660M.
| 3D Model | Size | Qhull | Our implementation (Mode 1) | Our implementation (Mode 2) | Speedup (Mode 1) | Speedup (Mode 2) |
|---|---|---|---|---|---|---|
| Armadillo | 172K | 47 | 49.1 | 85.9 | 0.96 | 0.55 |
| Angel | 237K | 52 | 48.2 | 77.2 | 1.08 | 0.67 |
| Skeleton Hand | 327K | 78 | 49.2 | 78.9 | 1.59 | 0.99 |
| Dragon | 437K | 98 | 56.7 | 85.3 | 1.73 | 1.15 |
| Happy Buddha | 543K | 122 | 68.2 | 112.1 | 1.79 | 1.09 |
| Turbine Blade | 882K | 203 | 75.6 | 130.3 | 2.69 | 1.56 |
| Vellum Manuscript | 2M | 392 | 114.9 | 162.9 | 3.41 | 2.41 |
| Asian Dragon | 3M | 490 | 131.9 | 224.9 | 3.71 | 2.18 |
| Thai Statue | 5M | 530 | 139.5 | 258.7 | 3.80 | 2.05 |
| Lucy | 14M | 1420 | 282.5 | 553.7 | 5.03 | 2.56 |
Table 7.
Comparison of running time (/ms) for points derived from 3D models on GT640.
| 3D Model | Size | Qhull | Our implementation (Mode 1) | Our implementation (Mode 2) | Speedup (Mode 1) | Speedup (Mode 2) |
|---|---|---|---|---|---|---|
| Armadillo | 172K | 18 | 51.6 | 62.4 | 0.35 | 0.29 |
| Angel | 237K | 24 | 55.9 | 88.3 | 0.43 | 0.27 |
| Skeleton Hand | 327K | 31 | 54.8 | 87.1 | 0.57 | 0.36 |
| Dragon | 437K | 42 | 66.0 | 97.4 | 0.64 | 0.43 |
| Happy Buddha | 543K | 52 | 74.8 | 110.8 | 0.70 | 0.47 |
| Turbine Blade | 882K | 82 | 84.1 | 115.3 | 0.98 | 0.71 |
| Vellum Manuscript | 2M | 205 | 95.6 | 131.6 | 2.14 | 1.56 |
| Asian Dragon | 3M | 345 | 131.9 | 215.2 | 2.62 | 1.60 |
| Thai Statue | 5M | 480 | 129.9 | 254.0 | 3.70 | 1.89 |
| Lucy | 14M | 1323 | 296.1 | 532.8 | 4.47 | 2.48 |
Similarly, we find that: (1) the best efficiency is achieved on the K20c, while the running times on the GTX660M and GT640 are nearly the same; (2) on all three GPUs, Mode 1 achieves better efficiency than Mode 2.
4.3. Results analysis
4.3.1. Computational efficiency on different GPUs
We have evaluated our implementation by carrying out several groups of experimental tests on three GPUs, whose main specifications are listed in Table 2.
According to the experimental results listed in Table 3, Table 4, Table 5, Table 6, Table 7, and Table 8, the best efficiency is clearly achieved on the K20c, while the running times on the GTX660M and GT640 are nearly the same. These results are also illustrated in Figure 6.
Figure 6.
Comparison of running time on three GPUs in two modes. (a) Test for points distributed in square in Mode 1; (b) Test for points distributed in square in Mode 2; (c) Test for points derived from 3D models in Mode 1; (d) Test for points derived from 3D models in Mode 2.
The above behavior is due to the fact that the efficiency of GPU-accelerated implementations strongly depends on the adopted GPU. The Tesla K20c is much more expensive, and much more powerful, than the other two GPUs, while the GTX660M and GT640 have similar specifications such as the number of CUDA cores. It is therefore reasonable that: (1) the speedups achieved on the K20c are greater than those on the GTX660M and GT640; (2) the running times obtained on the GTX660M and GT640 are nearly the same.
It should also be noted that, although the running times on the GTX660M and GT640 are nearly the same, the speedups achieved on them are not. This is because the CPUs in platform 1 and platform 2 have significantly different computational capabilities, and the speedup is measured against the serial Qhull running on the host CPU; see Table 2.
4.3.2. Performance impact of preprocessing
We have employed a preprocessing procedure that filters the input set of points by discarding the interior points located inside a convex quadrilateral. We have evaluated the performance of our implementation in two modes: in the first mode (Mode 1) the preprocessing is applied, while in the second mode (Mode 2) it is not.
The experimental results obtained on the three GPUs indicate that our implementation executes much faster in Mode 1 than in Mode 2; see Figure 7, Figure 8, Figure 9. This behavior is probably due to two facts: (1) more than 50% of the input points can be discarded by the preprocessing; (2) the performance benefit from discarding the interior points outweighs the performance cost of the discarding itself.
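The idea behind this preprocessing can be sketched serially as follows. This is a minimal Python illustration, not our CUDA code; the function names, and the choice of the four axis-extreme points as the corners of the convex quadrilateral (an Akl-Toussaint-style filter), are assumptions made for the sketch:

```python
def cross(o, a, b):
    # z-component of (a - o) x (b - o); > 0 means b lies strictly left of ray o->a
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def preprocess_filter(points):
    """Discard points strictly inside the convex quadrilateral spanned by
    the four axis-extreme points. Serial sketch only; the paper's version
    performs this filtering data-parallel on the GPU."""
    quad = [min(points, key=lambda p: p[0]),   # leftmost
            min(points, key=lambda p: p[1]),   # bottommost
            max(points, key=lambda p: p[0]),   # rightmost
            max(points, key=lambda p: p[1])]   # topmost (counter-clockwise order)

    def strictly_inside(p):
        # strictly inside iff p is strictly left of every directed edge
        return all(cross(quad[i], quad[(i + 1) % 4], p) > 0 for i in range(4))

    return [p for p in points if not strictly_inside(p)]
```

On points distributed in a unit square, such a filter typically removes well over half of the input, which is consistent with the more-than-50% discard rate reported above.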
Figure 7.
Comparison of speedups in two modes on GTX660M. (a) Points distributed in unit square; (b) Points derived from 3D mesh models.
Figure 8.
Comparison of speedups in two modes on GT640. (a) Points distributed in unit square; (b) Points derived from 3D mesh models.
Figure 9.
Comparison of speedups in two modes on K20c. (a) Points distributed in unit square; (b) Points derived from 3D mesh models.
Besides the above-mentioned performance improvement, another benefit of adopting the preprocessing procedure is much lower memory usage on the device side. This is because, after the preprocessing, fewer than 50% of the input points remain; in the subsequent procedure of computing the convex hull, much less memory therefore needs to be allocated for arrays such as float x[n], y[n], dist[n] and int keys[n], first_pts[n], flag[n]. This benefit was demonstrated in the experimental test on the GT640: the test for the set of 20M points could not be performed due to insufficient memory when the preprocessing procedure was not used.
4.3.3. Performance of each sub-procedure
When the preprocessing is adopted, our implementation comprises three main sub-procedures: the preprocessing procedure (pre-step), the splitting of points into two subsets (1st step), and the recursive procedure of finding the expected convex hull (2nd step). To find the potential performance bottleneck, we investigated the computational efficiency of these three sub-procedures; see the results on the K20c in Figure 10. We found that in most cases: (1) the most computationally expensive step is the 2nd step; (2) the least expensive is the pre-step; (3) for large sets of input points, such as 10M or 20M points, the computational costs of the 1st and 2nd steps no longer differ dramatically. The potential performance bottleneck of our implementation is therefore probably the 2nd step, which needs to be optimized in further work.
Figure 10.
Computational efficiency of sub-procedures on K20c. (a) Test for points distributed in square in Mode 1; (b) Test for points distributed in square in Mode 2; (c) Test for points derived from mesh models in Mode 1; (d) Test for points derived from mesh models in Mode 2.
5. Discussion
The strategy of D&C is one of the frequently used programming patterns for designing efficient algorithms in computer science, and it has been parallelized on various parallel computing architectures. However, the GPU architecture does not naturally support recursion, and it is unclear how to map D&C algorithms onto the GPU architecture efficiently [32].
To address this problem, Tzeng and Owens [28] proposed a generic paradigm for parallelizing D&C algorithms on modern GPUs, and applied it to implement the QuickHull algorithm for finding convex hulls.
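The core of such a paradigm is to replace recursion with iteration over a flat worklist of sub-problems (segments). The skeleton below is a serial Python sketch with hypothetical callback names, not Tzeng and Owens's actual kernels; on the GPU, the inner loop over tasks becomes a single data-parallel pass over all segments at once:

```python
def iterative_dc(worklist, is_base, solve, divide):
    """Recursion-free divide-and-conquer: instead of recursive calls,
    keep a worklist of sub-problems and repeatedly divide every active
    one until all have become base cases."""
    results = []
    while worklist:
        next_round = []
        for task in worklist:
            if is_base(task):
                results.append(solve(task))
            else:
                next_round.extend(divide(task))
        worklist = next_round
    return results

# Example: a divide-and-conquer sum of 0..7 by repeated halving
total = sum(iterative_dc(
    [list(range(8))],
    is_base=lambda t: len(t) == 1,
    solve=lambda t: t[0],
    divide=lambda t: [t[:len(t) // 2], t[len(t) // 2:]]))
```

Each iteration of the outer while-loop corresponds to one level of the recursion tree, which is exactly the structure that maps onto a sequence of GPU kernel launches.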
In this paper, by following the generic paradigm proposed by Tzeng and Owens, we provide a publicly available GPU implementation of the famous D&C algorithm, QuickHull, to give a sample and guide for parallelizing D&C algorithms on the GPU.
We have evaluated the performance of our sample GPU implementation by comparing it with the state-of-the-art serial implementation, the Qhull library [37]. Our experimental results indicate that the sample GPU implementation achieves speedups of up to 10.98x and 9.94x over Qhull for the two groups of test data. Moreover, it takes approximately 0.2 seconds to find the convex hull of 20M planar points using our sample GPU implementation; see Table 5.
Compared to other GPU-accelerated implementations [38], [39], [28], our sample GPU implementation is also competitive in terms of computational efficiency. For example, Srikanth et al. [38] reported that their GPU-accelerated implementation is 10 ∼ 15 times faster than Qhull for 1M to 10M input points. It has also been reported in [39] that their GPU-accelerated implementation achieves a speedup of approximately 14x over a standard sequential CPU implementation. In addition, Tzeng and Owens's GPU implementation [28] achieves an order of magnitude speedup over the Qhull library [37].
In Section 3.1 we mentioned that several basic ideas behind our sample GPU implementation are the same as those behind Tzeng and Owens's GPU implementation [28]. The first shared idea is that we also carry out the D&C operations within the contiguous input arrays, rather than in separate, additionally allocated arrays. The second is that the D&C operations are realized by forming, updating, and removing segments. The third is that efficient data-parallel primitives provided by existing libraries, such as parallel segmented scan, are strongly exploited.
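As an illustration of the third idea, the effect of a segmented max-reduction, the primitive that picks the farthest point of every segment in a single pass, can be emulated serially as follows. This Python sketch mirrors what a key-based reduction such as thrust::reduce_by_key computes; the function name is an assumption for the sketch:

```python
from itertools import groupby

def farthest_per_segment(keys, dists):
    """Segmented max-reduction: for each run of equal keys (one segment),
    return the index of the element with the largest distance.
    Serial emulation of a data-parallel primitive; on the GPU every
    segment is reduced concurrently."""
    result = {}
    start = 0
    for key, group in groupby(keys):
        length = sum(1 for _ in group)
        # index of the maximum distance within this segment
        result[key] = max(range(start, start + length), key=lambda i: dists[i])
        start += length
    return result
```

Because all segments are stored back-to-back in one flat array, a single such reduction replaces what would otherwise be one recursive call per segment.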
Despite these shared ideas, we have introduced our own in developing our implementation, and several differences exist between our sample GPU implementation and Tzeng and Owens's [28].
The first difference is that, before discarding the interior points in the preprocessing procedure, we first divide the set of input points into the lower and upper subsets, and then sort the points of each subset by their x-coordinates. Most importantly, the relative order of the sorted points is kept unchanged in the subsequent filtering procedure of recursively removing undesired interior points. In contrast, Tzeng and Owens [28] neither sort the subsets of input points by their coordinates nor maintain the relative order of the sorted points.
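This splitting and sorting can be sketched serially as follows (a Python illustration with assumed names; on the GPU, the partition and sort would be done with data-parallel primitives such as thrust::partition and thrust::sort):

```python
def cross(o, a, b):
    # z-component of (a - o) x (b - o); sign tells which side of line o-a point b is on
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def split_and_sort(points):
    """Split the input points into the upper and lower subsets relative to
    the line through the leftmost and rightmost points, then sort each
    subset by x-coordinate. Serial sketch of the 1st step."""
    a = min(points)  # leftmost point (ties broken by y)
    b = max(points)  # rightmost point
    upper = sorted(p for p in points if cross(a, b, p) > 0)
    lower = sorted(p for p in points if cross(a, b, p) < 0)
    return a, b, upper, lower
```

Keeping the x-sorted order intact through the later filtering is what lets the subsequent segment operations work on contiguous runs of points.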
The second difference concerns the creation and removal of segments. Tzeng and Owens create new segments at the farthest point by permuting the points located in the old segment; we also create new segments at the farthest point, but we do not permute points. When removing segments, we always retain at least the first point (i.e., the head point) of each segment, because this point is guaranteed to be an extreme point of the desired convex hull, whereas in Tzeng and Owens's implementation a segment may be removed completely. This difference arises from the different schemes of updating segments.
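The per-segment splitting logic can be illustrated serially (a Python sketch with hypothetical names, not our CUDA kernels): the point farthest from the chord splits the segment in two, and points inside the resulting triangle are dropped.

```python
def cross(o, a, b):
    # z-component of (a - o) x (b - o); > 0 means b lies left of line o-a
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def split_segment(a, b, pts):
    """One QuickHull partition step for a segment with head a and tail b:
    pick the point p farthest from chord a-b, keep only the points lying
    outside the edges a-p and p-b (points inside triangle a-p-b can never
    be hull vertices), and return the two child point sets. The head
    point a is an extreme point of the hull, so it is retained even when
    both child sets come back empty."""
    if not pts:
        return None
    p = max(pts, key=lambda q: cross(a, b, q))
    outside_ap = [q for q in pts if cross(a, p, q) > 0]
    outside_pb = [q for q in pts if cross(p, b, q) > 0]
    return p, outside_ap, outside_pb
```

When both child sets are empty, the segment is finished; in our scheme its head point survives as a confirmed hull vertex rather than the whole segment being discarded.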
Another significant difference is that we adopt the Thrust library [31] for its efficient data-parallel primitives such as parallel scan, segmented scan, reduction, and sorting, while Tzeng and Owens [28] develop their implementation by strongly exploiting the CUDPP library [30]. We choose Thrust rather than CUDPP because Thrust is integrated into the CUDA toolkit and is much easier to use in practice.
To summarize, we present a GPU implementation of a classical D&C algorithm, QuickHull. Our research objective is to provide a sample GPU implementation of a classical D&C algorithm that helps interested readers develop their own efficient GPU implementations with less effort and in less time.
6. Conclusion
In this paper we have presented a sample GPU implementation of a classical D&C algorithm, QuickHull, built on the efficient parallel primitives provided by the Thrust library. Our sample GPU implementation follows Tzeng and Owens's generic paradigm for parallelizing D&C algorithms on the GPU. The experimental results have demonstrated its practicality, and we hope it will help interested readers develop their own efficient GPU implementations of D&C algorithms with less effort and in less time.
Declarations
Author contribution statement
Gang Mei: Conceived and designed the experiments; Analyzed and interpreted the data; Wrote the paper
Jiayin Zhang: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data
Nengxiong Xu: Wrote the paper
Kunyang Zhao: Performed the experiments; Contributed reagents, materials, analysis tools or data
Funding statement
This work was supported by the Natural Science Foundation of China (Grant Numbers 11602235 and 40872183), the China Postdoctoral Science Foundation (2015M571081), the Fundamental Research Funds for the Central Universities (2652015065), and the Fundamental Research Funds for the Central Universities (2652015319).
Competing interest statement
The authors declare no conflict of interest.
Additional information
Supplementary content related to this article has been published online at https://doi.org/10.1016/j.heliyon.2018.e00512.
No additional information is available for this paper.
A preprint version of this paper is posted at http://arxiv.org/abs/1501.04706.
Acknowledgements
We would first like to thank Stanley Tzeng and John D. Owens for their generic paradigm for parallelizing D&C algorithms on the GPU; our contribution in this paper builds on their excellent work. The authors would also like to thank the editor and reviewers for their contributions to the paper.
Supplementary material
The following Supplementary material is associated with this article:
Source code of a sample GPU implementation for parallelizing Divide-and-Conquer algorithms.
References
- 1.Horowitz E., Zorat A. Divide-and-conquer for parallel processing. IEEE Trans. Comput. 1983;C-32(6):582–585.
- 2.Atallah M.J., Cole R., Goodrich M.T. Cascading divide-and-conquer: a technique for designing parallel algorithms. SIAM J. Comput. 1989;18(3):499–532.
- 3.Wu I.C., Kung H.T. Communication complexity for parallel Divide-and-Conquer. In: Proceedings – 32nd Annual Symposium on Foundations of Computer Science. IEEE Computer Society Press; Los Alamitos: 1991.
- 4.Achatz K., Schulte W. Massive parallelization of divide-and-conquer algorithms over powerlists. Sci. Comput. Program. 1996;26(1–3):59–78.
- 5.Sreenivas M.K., AlSabti K., Ranka S. Parallel out-of-core divide-and-conquer techniques with application to classification trees. In: International Parallel Processing Symposium, Proceedings. IEEE Computer Society; Los Alamitos: 1999. pp. 555–562.
- 6.Mateos C., Zunino A., Hirsch M. EasyFJP: providing hybrid parallelism as a concern for divide and conquer Java applications. Comput. Sci. Inf. Syst. 2013;10(3):1129–1163.
- 7.Hijma P., van Nieuwpoort R.V., Jacobs C.J.H., Bal H.E. Generating synchronization statements in divide-and-conquer programs. Parallel Comput. 2012;38(1–2):75–89.
- 8.Chou Y.L., Liu S.S., Chung E.Y., Gaudiot J.L. An energy and performance efficient DVFS scheme for irregular parallel divide-and-conquer algorithms on the Intel SCC. IEEE Comput. Archit. Lett. 2014;13(1):13–16.
- 9.Wang H., Guo M.Y., Wei D.M. A divide-and-conquer algorithm for irregular redistribution in parallelizing compilers. J. Supercomput. 2004;29(2):157–170.
- 10.Dorta I., Leon C., Rodriguez C., Rojas A. Parallel skeletons for Divide-and-Conquer and Branch-and-Bound techniques. In: Eleventh Euromicro Conference on Parallel, Distributed and Network-Based Processing, Proceedings. IEEE Computer Society; Los Alamitos: 2003.
- 11.Rezaei S., Monwar M.M. Divide-and-conquer algorithm for ClustalW-MPI. In: 2006 Canadian Conference on Electrical and Computer Engineering, vols. 1–5. IEEE Computer Society; New York: 2006.
- 12.Rugina R., Rinard M. Automatic parallelization of divide and conquer algorithms. ACM SIGPLAN Not. 1999;34(8):72–83.
- 13.van Nieuwpoort R.V., Kielmann T., Bal H.E. Satin: efficient parallel divide-and-conquer in Java. In: Lecture Notes in Computer Science, vol. 1900. Springer-Verlag; Berlin: 2000. pp. 690–699.
- 14.Czarnul P. Programming, tuning and automatic parallelization of irregular divide-and-conquer applications in DAMPVM/DAC. Int. J. High Perform. Comput. Appl. 2003;17(1):77–93.
- 15.Owens J.D., Luebke D., Govindaraju N., Harris M., Krueger J., Lefohn A.E., Purcell T.J. A survey of general-purpose computation on graphics hardware. Comput. Graph. Forum. 2007;26(1):80–113.
- 16.Owens J.D., Houston M., Luebke D., Green S., Stone J.E., Phillips J.C. GPU computing. Proc. IEEE. 2008;96(5):879–899.
- 17.Vomel C., Tomov S., Dongarra J. Divide and conquer on hybrid GPU-accelerated multicore systems. SIAM J. Sci. Comput. 2012;34(2):C70–C82.
- 18.Satish N., Harris M., Garland M. Designing efficient sorting algorithms for manycore GPUs. In: IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2009). 2009. pp. 1–10.
- 19.Sengupta S., Harris M., Zhang Y., Owens J.D. Scan primitives for GPU computing. In: Graphics Hardware. 2007. pp. 97–106.
- 20.Chen M.B. A divide-and-conquer algorithm of Delaunay triangulation with GPGPU. In: International Symposium on Parallel Architectures, Algorithms and Programming. IEEE Computer Society; New York: 2012. pp. 175–177.
- 21.Zhang M.Y., Huang J., Liu X.G., Bao H.J. A divide-and-conquer approach to quad remeshing. IEEE Trans. Vis. Comput. Graph. 2013;19(6):941–952. doi: 10.1109/TVCG.2012.301.
- 22.Wu W.Z., Rui Y.K., Su F.Z., Cheng L., Wang J.C. Novel parallel algorithm for constructing Delaunay triangulation based on a twofold-divide-and-conquer scheme. GISci. Remote Sens. 2014;51(5):537–554.
- 23.Stein A., Geva E., El-Sana J. CudaHull: fast parallel 3D convex hull on the GPU. Comput. Graph. 2012;36(4):265–271.
- 24.Tang M., Zhao J.Y., Tong R.F., Manocha D. GPU accelerated convex hull computation. Comput. Graph. 2012;36(5):498–506.
- 25.Gao M.C., Cao T.T., Nanjappa A., Tan T.S., Huang Z.Y. gHull: a GPU algorithm for 3D convex hull.
- 26.Malczyk P., Fraczek J. A divide and conquer algorithm for constrained multibody system dynamics based on augmented Lagrangian method with projections-based error correction. Nonlinear Dyn. 2012;70(1):871–889.
- 27.Laflin J.J., Anderson K.S., Khan I.M., Poursina M. Advances in the application of the divide-and-conquer algorithm to multibody system dynamics. J. Comput. Nonlinear Dyn. 2014;9(4):8.
- 28.Tzeng S., Owens J.D. Finding convex hulls using quickhull on the GPU. arXiv preprint arXiv:1201.2936.
- 29.Barber C.B., Dobkin D.P., Huhdanpaa H. The quickhull algorithm for convex hulls. ACM Trans. Math. Softw. 1996;22(4):469–483.
- 30.CUDPP: CUDA data-parallel primitives library. 2014. URL: http://cudpp.github.io/.
- 31.Bell N., Hoberock J. Chapter 26 – Thrust: a productivity-oriented library for CUDA. In: Hwu W.-m.W., editor. GPU Computing Gems Jade Edition. Morgan Kaufmann; Boston: 2012. pp. 359–371.
- 32.Tzeng S., Owens J.D. A paradigm for divide and conquer algorithms on the GPU and its application to the quickhull algorithm. URL: http://www.nvidia.com/content/gtc/posters/61_tzeng_paradigm_for_divide_and_conquer.pdf.
- 33.Blelloch G.E. Vector Models for Data-Parallel Computing. 1st edition. MIT Press; Cambridge, Massachusetts: 1990.
- 34.Barber C.B., Dobkin D.P., Huhdanpaa H. The quickhull algorithm for convex hulls. ACM Trans. Math. Softw. 1996;22(4):469–483.
- 35.Andrew A.M. Another efficient algorithm for convex hulls in two dimensions. Inf. Process. Lett. 1979;9(5):216–219.
- 36.Melkman A.A. On-line construction of the convex hull of a simple polyline. Inf. Process. Lett. 1987;25(1):11–12.
- 37.Qhull: code for computing the convex hull, Delaunay triangulation, Voronoi diagram, and halfspace intersection. 2014. URL: http://www.qhull.org/.
- 38.Srikanth D., Kothapalli K., Govindarajulu R., Narayanan P. Parallelizing two dimensional convex hull on NVIDIA GPU and Cell BE. In: International Conference on High Performance Computing (HiPC). 2009. pp. 1–5.
- 39.Srungarapu S., Reddy D.P., Kothapalli K., Narayanan P.J. Fast two dimensional convex hull on the GPU. In: IEEE Workshops of International Conference on Advanced Information Networking and Applications (WAINA). 2011. pp. 7–12.