Published in final edited form as: Parallel Comput. 2010 Jun 1;36(5-6):215–231. doi: 10.1016/j.parco.2009.12.003

GPU computing with Kaczmarz’s and other iterative algorithms for linear systems

Joseph M Elble a, Nikolaos V Sahinidis b, Panagiotis Vouzis b
PMCID: PMC2879082  NIHMSID: NIHMS164768  PMID: 20526446

Abstract

The graphics processing unit (GPU) is used to solve large linear systems derived from partial differential equations. The differential equations studied are strongly convection-dominated, of various sizes, and common to many fields, including computational fluid dynamics, heat transfer, and structural mechanics. The paper presents comparisons between GPU and CPU implementations of several well-known iterative methods, including Kaczmarz’s, Cimmino’s, component averaging, conjugate gradient normal residual (CGNR), symmetric successive overrelaxation-preconditioned conjugate gradient, and conjugate-gradient-accelerated component-averaged row projections (CARP-CG). Computations are performed with dense as well as general banded systems. The results demonstrate that our GPU implementation outperforms CPU implementations of these algorithms, as well as previously studied parallel implementations on Linux clusters and shared memory systems. Although the CGNR method had begun to fall out of favor for such problems, our GPU implementation of CGNR performed better than the other methods on the problems studied here, including a cluster implementation of the CARP-CG method.

Keywords: Scientific computing on GPU, Linear systems, Iterative methods, Convex feasibility problem

1. Introduction

Natural and engineered systems are often modeled by sets of partial differential equations that have no closed-form solution. Discretization of these equations leads to large sparse systems of linear equations. A considerable amount of algorithmic and computational work has been performed to develop iterative algorithms for solving these systems as efficiently as possible [1].

The current paper investigates the performance of several iterative solvers on a graphics processing unit (GPU). The algorithms we are interested in include Kaczmarz’s sequential algorithm, as well as several block-parallel algorithms: Cimmino’s method, component averaging (CAV), conjugate gradient normal residual (CGNR), symmetric successive overrelaxation (SSOR)-preconditioned conjugate gradient (CGMN), component-averaged row projections (CARP), and conjugate-gradient-accelerated CARP (CARP-CG).

Given the challenging nature of large-scale systems, parallel computing has been utilized in the development of efficient solution algorithms. In the context of GPUs, recent hardware developments, such as the NVIDIA Tesla C870, offer a massively multithreaded processor architecture that is ideal for high performance computing applications. These innovative designs have led many researchers to implement various linear algebra routines on the GPU. A Jacobi iterative solver was implemented on the GPU by Göddeke et al. [2]. Conjugate gradient and multigrid sparse matrix solvers were implemented on the GPU by Bolz et al. [3]. Krüger and Westermann [4] implemented direct solvers for sparse matrices, and studied their performance using multi-dimensional finite difference equations arising from the 2-D wave equation and the incompressible Navier-Stokes equations. An algorithm to solve dense linear systems using GPUs was implemented on an NVIDIA GeForce 7800 GPU by Galoppo et al. [5]. These are some of several studies recently reported using GPUs to accelerate the solution of various linear algebra problems.

There are many studies reporting the parallel solution of linear systems on hardware other than GPUs. Here, we mention those most closely related to the algorithms studied in this paper. Bramley and Sameh [6] implemented a block-sequential Kaczmarz algorithm with CG acceleration under five different partitioning schemes on structured grids. The disadvantage of the block-sequential approach is that it requires the identification of independent sets of equations which is difficult for unstructured grids. In [7], the same authors extended the work in [6] to include a CG acceleration of a block-Cimmino algorithm and a new projection method called V-RP. Their block-parallel CG-accelerated methods were shown to be robust in practice, but there was no clear “best” partitioning scheme. Arioli et al. [8] studied parallel CG acceleration of block-Cimmino for different block partitionings. The authors restricted their study to block-tridiagonal systems. Gordon and Gordon [9] introduced the CARP algorithm that divides the linear equations into blocks and operates in a block-parallel manner. Kaczmarz row projections are performed within each block in parallel and the results are then merged using component-averaging operations. CARP was shown to be very robust and suitable for unstructured grids.

The main purpose of the paper is to evaluate the performance on the GPU of several parallel algorithms with respect to solving large linear systems arising from the discretization of elliptic convection-diffusion partial differential equations. In particular, we are interested in identifying the best possible algorithm for this architecture, as well as in comparing the GPU against the CPU and earlier cluster implementations of these algorithms. For this purpose, we will investigate the performance of these algorithms on a set of problems addressed in many earlier works. This set of problems includes six partial differential equations proposed in [7], along with three additional partial differential equations that were investigated in [10, 11]. All nine of these partial differential equations are strongly convection-dominated.

The remainder of the paper is organized as follows. Section 2 provides mathematical preliminaries and an introduction to GPU computing and architectures, including the NVIDIA Tesla C870, which is utilized for computations in this paper. The algorithms considered here are described in Section 3, while the philosophy and drivers behind our GPU implementation of these algorithms are detailed in Section 4. In Section 5, we present extensive computational results, followed by conclusions from this work in Section 6.

2. Preliminaries

2.1. Mathematical background and notation

The problem of finding a point in the intersection of two or more convex sets is referred to as the convex feasibility problem. Let C1, C2, …, Cn be closed convex subsets of a Hilbert space X with a nonempty intersection

C = C_1 ∩ C_2 ∩ ⋯ ∩ C_n.

The convex feasibility problem involves finding some x in C. In image reconstruction, the convex sets Ci are hyperplanes. The iterative algorithms presented in Section 3 use orthogonal projections on the hyperplanes Ci to solve these problems. A matrix P ∈ ℝn×n is an orthogonal projection onto S ⊆ ℝn if R(P) = S, P² − P = 0, and P^T = P, where R(P) is the range of P.

Define a linear system as Ax = b, where A ∈ ℝm×n, x ∈ ℝn, and b ∈ ℝm. A consistent linear system is one where b ∈ R(A); an inconsistent linear system is one where b ∉ R(A). Simply put, the equations of a linear system are consistent if they possess a common (but not necessarily unique) solution, and inconsistent otherwise.

A linear system with a low condition number is said to be well-conditioned, while a linear system with a relatively high condition number is said to be ill-conditioned. For a linear system with an invertible matrix, the condition number is given by

κ(A) = ||A⁻¹|| · ||A||,

where ||A|| denotes the norm of A. Any norm, ||·||, without a subscript denotes the Euclidean norm. For rectangular matrices with full column rank, i.e., A ∈ ℝm×n and rank(A) = n, the condition number is given by

κ(A) = σ_max(A) / σ_min(A),

where σmax(A) denotes the maximum singular value of A and σmin(A) denotes the minimum singular value of A.

Two nonzero vectors u and v are conjugate with respect to A if

u^T A v = ⟨u, v⟩_A = 0.

We also define the notation

||x||_S² = x^T S x.

When discussing an iterative algorithm, the kth iterate of any vector will always be denoted by a superscript in parentheses, e.g., x(k). The position in the vector will be denoted with a subscript; for example, xj(k) denotes the jth element of the kth iterate of x. For an m × n matrix A, aij denotes the element of A in the ith row and jth column. Furthermore, ai is used to denote the ith row vector of A, unless otherwise noted. The standard basis vectors are written ei.

2.2. The graphics processing unit

This subsection briefly describes the architecture of a GPU from the perspective of a programmer using the compute unified device architecture (CUDA) [12]. CUDA is a hardware and software architecture that issues and manages data-parallel computations on a GPU. The GPU is a single-instruction multiple-data (SIMD) parallel device. The GPU should be viewed as a compute device or coprocessor, while the CPU should be viewed as the host. Since the GPU is a SIMD architecture with a host device, the GPU is utilized for data-parallel and computationally intensive portions of the algorithm or application.

The computationally intensive portions of the algorithm are executed in parallel using thousands of threads. Each thread executes the same set of instructions independently on different data. The GPU instructions are referred to as a kernel, which is downloaded to the device. A block is a group of threads that share data through shared memory and synchronize their execution to coordinate their memory accesses. These synchronization points are specified within the kernel and act as a barrier where all threads in the block are suspended until they reach this barrier. A grid is an organization of thread blocks. Within each grid, there exist a programmer-defined number of blocks.

The memory model for the GPU is provided at the grid level. Read-write access to global memory and read-only access to constant and texture memory is afforded to the entire grid. The host has read-write access to global, constant, and texture memory. Each thread has read-write access to its own set of registers and local memory, while each block has read-write access to designated shared memory.

The multiprocessors on an NVIDIA Tesla C870 are organized in a streaming processor array. Each element of this array is referred to as a texture processor cluster. In the case of the NVIDIA Tesla C870, there are eight texture processor clusters. A texture processor cluster consists of texture memory and two streaming multiprocessors. Each streaming multiprocessor contains instruction and data cache, instruction fetch and dispatch unit, shared memory, eight streaming processors, and two special function units. The special function units allow for fast single-precision mathematical computations, such as sine, cosine, logarithm, and exponential.

The interested reader can find additional discussions on the Tesla architecture in [13]. More detailed information on GPU computing can be found in two recent special issues of IEEE Proceedings [14] and the Journal of Parallel and Distributed Computing [15].

3. Iterative methods

This section describes the iterative methods utilized in this paper. Interest is placed on solving linear systems of the form Ax = b, where A is an m × n matrix, x is an n-vector, and b is an m-vector.

3.1. Kaczmarz’s algorithm

Kaczmarz introduced this algorithm in [16]. Kaczmarz’s approach is a projection method that is also referred to as ART (algebraic reconstruction technique) and is used for solving linear systems from image reconstruction problems.

Kaczmarz’s algorithm is sequential. The algorithm sweeps through rows of A in a cyclic manner. At each iteration, the previous iterate is projected orthogonally onto the solution hyperplane 〈ai, x〉 = bi. This orthogonal projection leads to the normalized step at iteration k

s^(k) = λ_i (b_i − ⟨a_i, x^(k)⟩) / ||a_i||² · a_i,    (1)

where λ_i is a cyclic relaxation parameter that places the projection either in front of the hyperplane (λ_i < 1), exactly on the hyperplane (λ_i = 1), or beyond the hyperplane (λ_i > 1); we assume henceforth that 0 < λ_i < 2. Here, i = (k mod m) + 1, with k ≥ 0. A randomized version of the algorithm has also been proposed in [17], where the row i is chosen at random rather than sequentially. In either case, each set of m iterations is referred to as a sweep. The iterate progresses as follows

x^(k+1) = x^(k) + s^(k),

where the step s(k) is defined by (1).

Algorithm 1 is a presentation of Kaczmarz’s method.

Algorithm 1.

Kaczmarz’s algorithm

(Pseudocode given as a figure in the original manuscript.)
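
As a concrete point of reference only (a NumPy sketch with our own naming, not the authors' pseudocode or GPU kernel), one Kaczmarz sweep using the normalized step (1) with a fixed relaxation parameter can be written as:

    import numpy as np

    def kaczmarz_sweep(A, b, x, lam=1.0):
        # Project the current iterate onto each hyperplane <a_i, x> = b_i in turn,
        # using the normalized step of equation (1) with relaxation parameter lam.
        m = A.shape[0]
        for i in range(m):
            a_i = A[i, :]
            x = x + lam * (b[i] - a_i @ x) / (a_i @ a_i) * a_i
        return x

Each sweep visits all m rows once; only the dot product and the vector update inside the loop can be parallelized, which is the limitation discussed in Section 4.1.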

3.2. Cimmino’s algorithm

Cimmino introduced his algorithm in [18]. This method is highly parallel and guaranteed to converge in the inconsistent case. In practice, the method is slow, requiring many iterations to reduce the residual error and approach the solution. The algorithm converges to the weighted least-squares solution, which minimizes the weighted sum of the squares of the distances to the sets C1, …, Cn discussed in Subsection 2.1. This result is due to Combettes [19].

The method of Cimmino involves a simultaneous orthogonal projection onto the set of solution hyperplanes 〈ai, x〉 = bi, i = 1, …, m. The orthogonal projections lead to a normalized step at iteration k

s_j^(k) = (λ_k / m) Σ_{i=1}^{m} (b_i − ⟨a_i, x^(k)⟩) / ||a_i||² · a_ij,    j = 1, …, n,    (2)

where sj(k) is the jth element of the kth normalized Cimmino step. The observant reader should recognize that each sj(k) can be computed in parallel for all j = 1, …, n. The iterate progresses as before:

x^(k+1) = x^(k) + s^(k).

This algorithm can also be expressed in matrix form as:

x^(k+1) = x^(k) + λ_k A^T D (b − A x^(k)),

where

D = (1/m) · diag(1/||a_1||², 1/||a_2||², …, 1/||a_m||²).    (3)

Algorithm 2 is a presentation of Cimmino’s algorithm.

Algorithm 2.

Cimmino’s algorithm

(Pseudocode given as a figure in the original manuscript.)
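
A minimal NumPy sketch of one Cimmino iteration in the matrix form x^(k+1) = x^(k) + λ_k A^T D (b − A x^(k)), with D as in (3), follows; the helper name is ours and this is not the paper's implementation:

    import numpy as np

    def cimmino_step(A, b, x, lam=1.0):
        # D from equation (3) is diagonal, so it is applied as a vector of weights.
        m = A.shape[0]
        row_norms_sq = np.sum(A * A, axis=1)        # ||a_i||^2 for each row
        d = (1.0 / m) / row_norms_sq                # diagonal entries of D
        return x + lam * (A.T @ (d * (b - A @ x)))

All m row contributions are formed simultaneously, which is what makes the method highly parallel but also costly per iteration.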

3.3. Component-averaging

Component-averaging (CAV) was introduced by Censor et al. [20]. CAV projects the current iterate onto all the system’s hyperplanes in parallel, just like Cimmino’s algorithm. CAV retains the desired convergence properties of Cimmino’s method, in the sense that it converges in the inconsistent case. Furthermore, CAV exhibits significantly faster numerical convergence for large sparse systems in practice. The exhibited acceleration is noticeable on a single processor as well as parallel architectures.

The scalar weighting of Cimmino’s method is replaced by diagonal sparsity weighting. That is, the weights are inversely proportional to the number of nonzeros in each column. CAV’s orthogonal projections lead to a normalized step at iteration k

s_j^(k) = (λ_k / m) Σ_{i=1}^{m} (b_i − ⟨a_i, x^(k)⟩) / (Σ_{p=1}^{n} s_p a_ip²) · a_ij,    j = 1, …, n,    (4)

where sp is the number of nonzeros in column p, for p = 1, …, n. As with the previously introduced iterative methods, the iterate progresses using the formula

x^(k+1) = x^(k) + s^(k).

As in Cimmino's algorithm, CAV can be expressed in matrix form

x^(k+1) = x^(k) + λ_k A^T D_S (b − A x^(k)),

where

D_S = diag(1/||a_1||_S², 1/||a_2||_S², …, 1/||a_m||_S²).    (5)

The only difference between CAV and Cimmino's method is the weighting scheme: the scalar weighting captured by (3) is replaced by the sparsity-oriented weighting captured by (5).

Lastly, CAV is proven to converge to a minimizer of a certain proximity function regardless of the initial point x^(0) and independently of the consistency of the underlying linear system. For more details on these convergence results, the reader is referred to [20].

Algorithm 3 is a presentation of CAV.

Algorithm 3.

Component-averaging (CAV)

(Pseudocode given as a figure in the original manuscript.)
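
A corresponding NumPy sketch of one CAV iteration in the matrix form with D_S from (5) (again with our own names) differs from the Cimmino step only in the sparsity-weighted norms:

    import numpy as np

    def cav_step(A, b, x, lam=1.0):
        # s_p = number of nonzeros in column p; ||a_i||_S^2 = sum_p s_p * a_ip^2.
        s = np.count_nonzero(A, axis=0)
        sparsity_norms_sq = (A * A) @ s             # ||a_i||_S^2 for each row
        return x + lam * (A.T @ ((b - A @ x) / sparsity_norms_sq))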

3.4. Conjugate gradient normal residual (CGNR)

The CGNR algorithm was introduced by Hestenes and Stiefel in [21]. In order to solve the linear system Ax = b when A is nonsymmetric, CGNR attempts to solve the equivalent system

A^T A x = A^T b.    (6)

By construction, A^T A is symmetric positive semidefinite, so the conjugate gradient method can be applied to the normal equations (6); CGNR does exactly that.

The algorithm does not store A^T A explicitly in memory. This makes CGNR a natural choice for sparse matrices, since operations such as matrix-vector multiplication are usually very efficient. In addition, the implicit use of A^T A leads to relatively large diagonal elements, which contributes to the algorithm's relative robustness.

The downside of using CGNR is the fact that κ(A^T A) = κ(A)². The relatively large condition number of (6) leads to a slow rate of convergence, depending, of course, on the condition number of the original system. This makes a preconditioner all the more important for this algorithm.

Algorithm 4 is a presentation of the CGNR algorithm.

Algorithm 4.

Conjugate gradient normal residual (CGNR)

(Pseudocode given as a figure in the original manuscript.)
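
The following NumPy sketch applies the conjugate gradient method to the normal equations (6) without ever forming A^T A explicitly; it is a minimal illustration with our own names and stopping rule, not the GPU implementation of Section 4:

    import numpy as np

    def cgnr(A, b, tol=1e-7, max_iter=10000):
        x = np.zeros(A.shape[1])
        r = b - A @ x                   # residual of the original system
        z = A.T @ r                     # residual of the normal equations
        p = z.copy()
        zz = z @ z
        b_norm = np.linalg.norm(b)
        for _ in range(max_iter):
            w = A @ p                   # only products with A and A^T are needed
            alpha = zz / (w @ w)
            x = x + alpha * p
            r = r - alpha * w
            if np.linalg.norm(r) / b_norm < tol:
                break
            z = A.T @ r
            zz_new = z @ z
            p = z + (zz_new / zz) * p
            zz = zz_new
        return x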

3.5. Symmetric successive overrelaxation (SSOR)-preconditioned conjugate gradient (CGMN)

CGMN is a conjugate gradient acceleration of the SSOR algorithm. CGMN was introduced by Björck and Elfving in [22], where it was referred to as an SSOR preconditioning of CG. It will become clear that CGMN can also be considered a CG acceleration of Kaczmarz’s algorithm. Similar to CGNR, it is advantageous to utilize CGMN when A is sparse. CGMN is a CG acceleration of the SSOR algorithm applied to the system

x = A^T y,    A A^T y = b.    (7)

Therefore, a relatively large condition number can negatively impact the rate of convergence since κ(A^T A) = κ(A)².

Define a Kaczmarz sweep as before

x^(i+1) = x^(i) + λ_i (b_i − a_i^T x^(i)) / ||a_i||² · a_i,    i = 1, …, m.    (8)

Applying Gauss-Seidel’s method to (7), it is evident that the ith minor step, y(i), is updated by

s_y^(i) = e_i (b_i − a_i^T A^T y^(i)) / (a_i^T a_i).

Since x = A^T y,

x^(i+1) = x^(i) + A^T s_y^(i) = x^(i) + A^T e_i (b_i − a_i^T A^T y^(i)) / (a_i^T a_i) = x^(i) + a_i (b_i − a_i^T x^(i)) / ||a_i||².

Therefore, Gauss-Seidel’s method is equivalent to Kaczmarz’s method for (7).

Since CGMN involves an SSOR relaxation parameter, (8) becomes

x^(i+1) = x^(i) + λ (b_i − a_i^T x^(i)) / ||a_i||² · a_i,    i = 1, …, m,    (9)

where λ is the relaxation parameter. If the relaxation parameter is 0 < λ < 2, then the algorithm is consistent with Kaczmarz’s method. SSOR implies a forward and backward sweep of Kaczmarz. Therefore, (9) becomes

x^(i+1) = x^(i) + λ (b_j − a_j^T x^(i)) / ||a_j||² · a_j,    i = 1, …, 2m;  j = min(i, 2m + 1 − i).

It is now evident that the forward sweep is followed by the backward sweep. Furthermore, the double sweep can be transformed into

x^(k+1) = Q_SSOR x^(k) + R_SSOR b,

where

Q_SSOR = Q_1 Q_2 ⋯ Q_m Q_m ⋯ Q_2 Q_1,

and

Q_i = I − λ a_i a_i^T / ||a_i||².

Q_SSOR is symmetric, since Q_i is symmetric for all i.

Conjugate gradient is then applied to the system

(I − Q_SSOR) x = R_SSOR b.

Define

c = R_SSOR b    and    B = I − Q_SSOR.

Then the conjugate gradient method is applied to solve the system Bx = c, where B is symmetric positive semidefinite and c lies in the range of B. For an arbitrary x^(0), define

p^(0) = r^(0) = c − B x^(0),
q^(k) = B p^(k),
α^(k) = ||r^(k)||² / (p^(k))^T q^(k),
x^(k+1) = x^(k) + α^(k) p^(k),
r^(k+1) = r^(k) − α^(k) q^(k),
β^(k) = ||r^(k+1)||² / ||r^(k)||²,
p^(k+1) = r^(k+1) + β^(k) p^(k).

The algorithm proceeds by computing x^(k+1), r^(k+1), and p^(k+1) for each k = 0, 1, 2, …, where q^(k) = p^(k) − Q_SSOR p^(k). To compute p^(0), it is necessary to perform a full forward and backward Kaczmarz sweep as defined by (8) on the system Bx = c. To compute q^(k), let x^(1) = p^(k) and again use the forward and backward Kaczmarz sweep with b = 0, which leads to q^(k) = p^(k) − Q_SSOR p^(k). Notice that there is no R_SSOR term since b = 0.

Algorithm 5 is a presentation of CGMN.

Algorithm 5.

Symmetric successive overrelaxation (SSOR)-preconditioned conjugate gradient (CGMN)

(Pseudocode given as a figure in the original manuscript.)
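
Putting the pieces together, a minimal NumPy sketch of CGMN follows. The naming is ours; the double Kaczmarz sweep applies Q_SSOR and R_SSOR implicitly, so B and c are never formed, and the default relaxation parameter is only a placeholder, since the paper tunes λ per problem:

    import numpy as np

    def double_sweep(A, b, x, lam):
        # Forward then backward Kaczmarz sweep: returns Q_SSOR x + R_SSOR b.
        m = A.shape[0]
        for i in list(range(m)) + list(range(m - 1, -1, -1)):
            a_i = A[i, :]
            x = x + lam * (b[i] - a_i @ x) / (a_i @ a_i) * a_i
        return x

    def cgmn(A, b, lam=1.5, tol=1e-7, max_iter=10000):
        m, n = A.shape
        zero_rhs = np.zeros(m)
        x = np.zeros(n)
        r = double_sweep(A, b, x, lam)      # r0 = c - B x0 = R_SSOR b, since x0 = 0
        p = r.copy()
        rr = r @ r
        for _ in range(max_iter):
            q = p - double_sweep(A, zero_rhs, p, lam)   # q = B p = p - Q_SSOR p
            alpha = rr / (p @ q)
            x = x + alpha * p
            r = r - alpha * q
            rr_new = r @ r
            if np.sqrt(rr_new) < tol:       # residual of B x = c
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x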

3.6. Conjugate-gradient-accelerated component-averaged row projections

Component-averaged row projections (CARP), introduced by Gordon and Gordon [10], divides the set of m equations into blocks. Each processor performs Kaczmarz row projections on its block in parallel with the others. The results of these Kaczmarz iterations are then merged by component averaging to form the next iterate.

In [10], the authors proved that CARP is equivalent to Kaczmarz's algorithm with cyclic relaxation parameters in some superspace ℝ^s, where s = Σ_{j=1}^{n} s_j and s_j denotes the number of nonzeros in column j. There is a mapping between ℝ^s and ℝ^n such that every x ∈ ℝ^n maps to some y ∈ ℝ^s; specifically, every component x_j of x maps to s_j components of y.

The fact that CARP is equivalent to Kaczmarz’s algorithm in some superspace ℝs allows for the construction of a CG-accelerated CARP. The CG acceleration is performed in much the same manner as the CG acceleration of SSOR, which created CGMN. A forward followed by a backward sweep of CARP is performed, rather than the forward and backward sweep of Kaczmarz. Due to the superspace equivalency, it is shown in [10] that the CG acceleration of CARP always converges, even when the linear system is inconsistent and/or rectangular. The details are available in [10].

Algorithm 6 is a presentation of CARP-CG.

Algorithm 6.

Conjugate gradient accelerated component-averaged row projections (CARP-CG)

(Pseudocode given as a figure in the original manuscript.)
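
A serial NumPy sketch of one CARP sweep follows (our own simplification: each block would run on its own processor, and the component-averaging merge is restricted to the components whose columns have nonzeros in the block's equations); the CG acceleration then wraps a forward and a backward CARP sweep in the same way CGMN wraps Kaczmarz sweeps:

    import numpy as np

    def carp_sweep(A, b, x, blocks, lam=1.0):
        # blocks is a list of row-index arrays, e.g. np.array_split(np.arange(A.shape[0]), 4).
        n = A.shape[1]
        acc = np.zeros(n)
        cnt = np.zeros(n)
        for rows in blocks:                         # conceptually, one block per processor
            y = x.copy()
            for i in rows:                          # Kaczmarz projections within the block
                a_i = A[i, :]
                y = y + lam * (b[i] - a_i @ y) / (a_i @ a_i) * a_i
            touched = np.any(A[rows, :] != 0, axis=0)
            acc[touched] += y[touched]
            cnt[touched] += 1
        merged = x.copy()
        hit = cnt > 0
        merged[hit] = acc[hit] / cnt[hit]           # component-averaging merge
        return merged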

4. GPU implementation

4.1. Preliminary tests

The parallelization of linear algebra algorithms has been under investigation for decades and has led to a number of implementations for parallel computers. Depending on the specific parallel architecture, there are algorithms that match the characteristics of the underlying hardware, such as memory organization and size, node communication bandwidth, and processing model (MIMD, SIMD). In many cases, a specific algorithm does not lend itself to a particular architecture, and a parallel version of the algorithm must be developed to achieve the desired speedup.

With this in mind, Gordon and Gordon [9] proposed the CARP algorithm, a block-parallel version of the Kaczmarz algorithm, implemented on a 16-node Linux cluster. In addition to exhibiting increased robustness for the problems studied in [9], CARP also exhibited better or comparable solution times relative to other well-established algorithms proposed for the same kind of problems. On a single general-purpose processor, CARP and CARP-CG, being block-parallel algorithms, are not suitable; for that setting, Gordon and Gordon [11] show that the CGMN algorithm gives the best robustness and performance, with CGNR offering the same robustness but lower performance on the same set of problems as in [10]. Note that CARP-CG reduces to CGMN when the number of blocks is equal to one.

In our preliminary search for the most appropriate iterative algorithms to implement on the GPU, it became clear that Kaczmarz's algorithm may not be the best possible algorithm for this architecture. We first investigated Kaczmarz's algorithm in the form presented in [16]. It is important to keep in mind that, while a thread is very cheap to launch on a GPU, allowing for performance gains in operations as simple as level-one BLAS routines, the problem must be large enough to amortize the overhead of launching the threads, transferring the data to and from the GPU, and any interruptions necessary to check for convergence on the CPU. Because of this, and because the only operations of the algorithm that can be parallelized are the dot product and vector addition in (1), Kaczmarz's algorithm can exhibit superior performance on the GPU only for very large, dense linear systems. For details on how one can use a GPU to implement reduction-based parallel operations and other BLAS routines, the reader is referred to [12]. The performance of the implementation is shown in Table 1. In this table, the problems studied are one hundred percent dense n × n systems, where the elements of A and b are drawn randomly from a normal distribution with mean zero and standard deviation one. The memory initialization and memory copy necessary to load the data on the GPU can be costly for dense systems, depending on the total number of iterations required. However, this cost is amortized by the superior cost per iteration as the problem size and/or the number of required iterations grows.
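
For example, using the Table 1 figures for n = 15,000, the GPU saves about 0.102 − 0.055 = 0.047 s per iteration, so the 0.547 s memory-initialization cost is recovered after roughly 0.547/0.047 ≈ 12 iterations.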

Table 1.

Run times of Kaczmarz’s algorithm for dense n × n systems

Problem size (n)   GPU memory initialization (sec)   GPU time per iteration (sec)   CPU time per iteration (sec)
1000     0.003   0.034   0.014
2500     0.016   0.036   0.020
5000     0.061   0.039   0.034
7500     0.137   0.044   0.050
10000    0.245   0.043   0.064
15000    0.547   0.055   0.102

In the case of Cimmino's algorithm, we quickly discovered two facts. First, the algorithm exhibited very poor solution times due to the number of iterations needed for the particular PDEs studied in this paper. Second, each iteration was very costly, regardless of whether it was implemented on the GPU or the CPU. We implemented Cimmino's algorithm for dense linear systems in order to get a better idea of the best-case speedup that could be expected by placing this algorithm on the GPU. The results are presented in Table 2. Cimmino can be parallelized on the GPU by implementing slightly specialized level-two BLAS routines. For more details on how one might implement level-two BLAS routines for dense matrices, the reader is referred to [12]. Ultimately, the low potential for improvement in the cost per iteration and the number of required iterations led us to reject this method for the particular PDE problems studied.

Table 2.

Run times of Cimmino’s algorithm for dense n × n systems

Problem size (n)   GPU memory initialization (sec)   GPU time per iteration (sec)   CPU time per iteration (sec)
1000     0.004   0.624    3.121
2500     0.018   5.547    18.702
5000     0.063   15.496   90.354
10000    0.265   28.634   349.058

In addition, CAV was studied. Because CAV is equivalent to Cimmino's algorithm when applied to dense problems, CAV was preliminarily tested on the GPU using banded linear systems (also generated randomly). We found that the method had potential and mapped well to the GPU. However, as was the case with Cimmino's algorithm, it appeared that CAV requires too many iterations. Based on the results demonstrated in [20], albeit on different problems, and the potential for speedup on the GPU (see Tables 3 and 4), we decided it was appropriate to at least try this method on the PDEs in Section 5. In Table 3, the number of diagonals above and below the main diagonal was held constant at twenty-five, while the size of the matrix itself was varied. In Table 4, the size of the matrix was fixed at 25,000 × 25,000, while the number of nonzero diagonals above and below the main diagonal varied from one to sixty-four. The relatively poor convergence witnessed here is due to the small variance in the nonzero elements of the PDEs studied; this relatively small variance causes CAV to perform no better than Cimmino's method.

Table 3.

Run times of CAV for n × n banded systems

Problem size (n)   GPU memory initialization (sec)   GPU time per iteration (sec)   CPU time per iteration (sec)
1000     0.001   0.113   0.289
2500     0.001   0.188   0.657
5000     0.003   0.345   1.531
10000    0.005   0.720   3.145
15000    0.007   0.933   4.396
20000    0.008   1.060   5.021
25000    0.009   1.382   6.268
30000    0.011   1.614   7.515

Table 4.

Run times of CAV for banded 25,000 × 25,000 systems

Band size   GPU memory initialization (sec)   GPU time per iteration (sec)   CPU time per iteration (sec)
1    0.002   0.151   1.955
2    0.006   0.199   1.950
4    0.006   0.296   1.815
8    0.008   0.492   2.718
16   0.009   1.249   4.417
32   0.011   1.844   7.729
64   0.014   5.168   14.630

4.2. GPU implementation

We have tailored the kernel to the seven bands at fixed offsets of the A matrix. The reads and writes can be perfectly coalesced, with the restriction that the number of discretization points in the x direction must be a multiple of 16 to achieve coalesced global-memory accesses. In other words, if we desire 40 discretization points on the x axis, it is better to choose 48, which gives both higher resolution and better performance than the 40-point discretization.
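
The rounding rule is simply the next multiple of 16; in code (our own illustration, not part of the kernel):

    nx_requested = 40
    nx_padded = ((nx_requested + 15) // 16) * 16   # rounds 40 up to 48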

For a desired tolerance of 10⁻⁷, the results need to be accurate to at least double precision. There are two ways to achieve this accuracy. The first is to use double-precision floating-point arithmetic exclusively for all operations. The second is to use single-precision floating point with iterative refinement in order to meet the required double-precision termination criterion.

Using exclusively double precision for all operations can be necessary depending on the size of the problem, the algorithm that is used, and the termination criterion. The drawback of using double-precision floating point is the lower speed of the floating-point unit relative to single-precision, and the requirement to load and store 64-bit data, compared to the 32-bit data of single-precision floating point. When it is necessary or desired to use single-precision arithmetic, iterative refinement reduces the roundoff errors in the computed solution of a system of linear equations. The basic steps of solving Ax = b with iterative refinement are given in Algorithm 7.

Algorithm 7.

CPU-GPU iterative refinement

(Pseudocode given as a figure in the original manuscript.)

Since the Tesla C870 GPU is only capable of single precision, the double-precision refinement is done on the CPU, which involves transferring data from the GPU to the CPU and vice versa in each refinement iteration. This extra data transfer is unavoidable with the Tesla C870. Newer GPU models, such as the Tesla C1060, have both double- and single-precision capabilities, which would eliminate these data transfers. In our current implementation, the single-precision step of the algorithm is carried out on the GPU and involves the solution of Ac = d with one of the iterative methods CGNR, CGMN, CARP-CG, or CAV. At the end of each internal iteration, we test whether the solution of Ac = d has been improved by checking whether ||c^(k)|| / ||c^(k−1)|| > 1.0. If this condition is not met, the Ac = d loop is terminated and the iterative-refinement step updates the solution as x = x + c in double precision.
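
A CPU-only NumPy sketch of this mixed-precision loop follows; the single-precision inner solve stands in for the GPU kernel, inner_solve is a placeholder for one of CGNR, CGMN, CARP-CG, or CAV, and all names are ours:

    import numpy as np

    def iterative_refinement(A, b, inner_solve, tol=1e-7, max_refine=50):
        A32 = A.astype(np.float32)                  # single-precision copy used by the inner solver
        x = np.zeros(A.shape[1], dtype=np.float64)
        b_norm = np.linalg.norm(b)
        for _ in range(max_refine):
            d = b - A @ x                           # double-precision residual (on the CPU)
            if np.linalg.norm(d) / b_norm < tol:
                break
            c = inner_solve(A32, d.astype(np.float32))   # solve A c = d in single precision
            x += c.astype(np.float64)               # double-precision update x = x + c
        return x

For instance, inner_solve could be the cgnr sketch of Section 3.4 run to a loose single-precision tolerance.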

Even though a double-precision GPU could carry out all the operations in double precision, thereby avoiding the refinement steps, it is not guaranteed that the overall performance would be better than with iterative refinement [23]. Although the exclusive use of double precision requires fewer iterations to achieve convergence, this benefit is offset by the increased memory-bandwidth requirements and the lower throughput of double-precision arithmetic. For example, the peak single-precision performance of the Tesla C1060 is 933 GFLOPS, while its peak double-precision performance is only 78 GFLOPS.

5. Computational results

The test problems used here are the same used in [9, 10, 11], and are given below:

  1. Δu + 1000u_x = F

  2. Δu + 1000e^(xyz)(u_x + u_y − u_z) = F

  3. Δu + 100x·u_x − y·u_y + z·u_z + 100(x + y + z)u/(xyz) = F

  4. Δu − 10^5 x²(u_x + u_y + u_z) = F

  5. Δu − 1000(1 + x²)u_x + 100(u_y + u_z) = F

  6. Δu − 1000[(1 − 2x)u_x + (1 − 2y)u_y + (1 − 2z)u_z] = F

  7. Δu − 1000x²u_x + 1000u = F

  8. Δu − ∂(10e^(xy)u)/∂x − ∂(10e^(xy)u)/∂y = F

  9. Δu − ∂(1000e^(xy)u)/∂x − ∂(1000e^(xy)u)/∂y = F

Here, we use the notation Δu = u_xx + u_yy + u_zz. Problems 1–7 have the following analytical solutions:

Problem 1:      u(x, y, z) = xyz(1 − x)(1 − y)(1 − z)
Problem 2:      u(x, y, z) = x + y + z
Problems 3–7:   u(x, y, z) = e^(xyz) sin(πx) sin(πy) sin(πz).

The expression for the right-hand-side function F in problems 1–7 is computed analytically, using the preassigned solution. For problems 8 and 9, the right-hand-side function F is irrelevant, because the equation system was set up by first computing the system matrix A and then computing b = Av, where v was chosen as v = (1, 1, …, 1)^T (similar to the approach taken in [1, Problem F3D]). All the test problems were solved on the unit cube domain [0, 1] × [0, 1] × [0, 1], with the Dirichlet boundary condition u = 0. The problems were discretized using a uniform mesh with the same number of mesh points in each direction, and the equations were obtained by using a seven-point centered difference scheme. Test runs were made for problems of size 80 × 80 × 80 = 512,000 equations. Additional tests were also made on smaller grid sizes in order to study how the various algorithms perform as the mesh is gradually refined.

Since the results in this section are compared to the results in [10, 11], an identical termination criterion is chosen, based on the relative residual: res/res(0) = ||b − Ax|| / ||b − Ax^(0)|| < 10⁻⁷, with x^(0) = 0. For problems 3 and 7, this criterion was unattainable. Instead, we used the same termination criteria as in [10, 11], i.e., 10⁻⁴ and 5·10⁻⁴, respectively. Since this criterion depends on the scaling of the equations, we first normalized the equations (for all the tested methods) by dividing each equation by the L2-norm of its coefficients.
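
In code, the normalization and the relative-residual test described above look as follows (a NumPy sketch with our own helper names):

    import numpy as np

    def normalize_rows(A, b):
        # Divide each equation by the L2-norm of its coefficients.
        norms = np.linalg.norm(A, axis=1)
        return A / norms[:, None], b / norms

    def converged(A, b, x, tol=1e-7):
        # Relative residual ||b - A x|| / ||b - A x0|| with x0 = 0, i.e., divided by ||b||.
        return np.linalg.norm(b - A @ x) / np.linalg.norm(b) < tol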

We investigate the GPU implementation of the most robust methods (CGNR, CGMN, CARP-CG) for the nine PDE problems, and we compare their performance to a single-CPU implementation of CGNR and CGMN and to the 16-node cluster implementation of CARP-CG [10]. CARP-CG is a block-parallel algorithm suitable only for parallel computers, so its single-CPU implementation is not considered. Moreover, we show the results of the GPU implementation of the CAV algorithm for the sparse PDE problems, because CAV achieves the highest GFLOPS performance.

Figures 1–9 present the convergence of each method on each problem with respect to time. It is clear from these figures that, for all problems, the GPU implementation of CGNR achieves the best performance relative to the other methods on both the CPU and the GPU. On the CPU, CGMN is consistently better than CGNR, as expected [11]. Although CARP-CG performs best on a 16-node Linux cluster [10], the proposed implementation does not map well onto the GPU architecture. The same can be observed for CGMN, which lends itself to a single-CPU architecture but is not the best choice for a GPU implementation. The relaxation parameters, λ, for CGMN are identical to the ones in [11].

Figure 1. Convergence results for Problem 1.

Figure 9. Convergence results for Problem 9.

In general, GPUs are suitable for algorithms, such as BLAS, that exhibit fine-grain parallelism. There is a trade-off between using small blocks to take advantage of the fine-grain parallelism available on the GPU and the number of iterations required, since one of the basic underlying assumptions in CARP-CG is that the block boundary points are just a small fraction of the total number of points [10]. The proposed GPU CARP-CG is a direct implementation of the algorithm of [10] with uniformly divided grids and scaling-factor values that give the smallest number of iterations. The Kaczmarz sweeps can be executed on the GPU or on the CPU; despite the additional burden of transferring the data to the CPU, executing the Kaczmarz sweeps on the CPU is the better choice. For the same reason, CGMN does not map well onto the GPU, as the Kaczmarz sweeps slow down the performance of the algorithm. CAV is the algorithm that achieves the highest GFLOPS performance on the GPU, as can be seen in Table 5. However, its extremely slow rate of convergence makes CAV the slowest algorithm among the ones studied in this paper.

Table 5.

Average GFLOPS performance over the nine problems. For GPU-CAV, the average is over the six problems on which it does not fail

Algorithm GFLOPS
GPU-CGNR 11.82
GPU-CGMN 0.32
GPU-CARPCG 0.66
GPU-CAV 15.26
CPU-CGMN 0.2
CPU-CGNR 0.31

The algorithm that exhibits the best balance between parallelization and rate of convergence with iterative refinement is CGNR. In Tables 6 and 7, we display the number of iterations and times to solution, respectively, for each architecture (GPU, CPU, 16-node cluster) and algorithm. Although the GPU CGNR requires more iterations to achieve the same level of precision relative to GPU CGMN and GPU CARP-CG, we observe that GPU CGNR exhibits the best solution time due to its superior utilization of the fine-grain parallelism available on the GPU architecture. The very large number of iterations required by GPU CAV results in its poor performance and its failure to converge for Problems 2, 4, and 8.

Table 6.

Number of iterations for Problems 1–9

Problem   GPU CGNR   GPU CGMN   GPU CARP-CG   GPU CAV   CPU CGNR   CPU CGMN   Cluster CARP-CG [10]
1   389    85     157    97,559    318    74    97
2   1023   177    1169   –         991    169   176
3   514    54     184    11,699    513    54    282
4   1398   518    1077   –         1382   487   578
5   506    103    165    81,024    499    99    105
6   268    60     144    13,897    262    57    62
7   247    53     106    59,728    247    38    77
8   7404   1117   2341   –         6418   636   1121
9   848    118    628    131,051   827    142   142

–: Algorithm failed

Table 7.

Runtimes (sec) for Problems 1–9

Problem   GPU CGNR   GPU CGMN   GPU CARP-CG   GPU CAV   CPU CGNR   CPU CGMN   Cluster CARP-CG [10]
1   1.01    7.79     6.61     110.54   29.92    10.40   1.3
2   2.43    15.95    56.85    –        91.52    23.63   2.45
3   1.26    4.92     8.03     97.15    48.13    7.56    3.95
4   3.26    46.29    45.11    –        124.79   69.24   7.5
5   1.27    9.40     7.45     91.91    45.80    14.04   1.4
6   0.73    5.56     6.48     15.84    24.59    7.97    0.8
7   0.65    4.34     4.91     67.76    22.82    5.52    1.0
8   16.68   220.06   115.36   –        668.38   89.91   15.0
9   2.02    10.70    32.31    149.56   75.78    20.47   1.95

–: Algorithm failed

The robustness of the algorithms of [10, 11] is preserved here despite the single-precision arithmetic, largely due to iterative refinement, which guarantees convergence to a solution within the desired accuracy. Moreover, the CGNR algorithm with single-precision iterative refinement is much more robust than CGMN and CARP-CG in terms of the iterations required for convergence. In Table 8, we present the ratio of iterative-refinement iterations for each algorithm to the corresponding double-precision iterations. It can be observed from this table that CGNR is affected least by the use of single-precision floating-point arithmetic. This is demonstrated further in Table 9, where it can be seen that the number of iterative-refinement iterations for CGNR is never larger than that of the other methods in every problem except Problem 5. Moreover, CGNR does not involve a relaxation parameter that needs to be adjusted, as CARP-CG and CGMN do.

Table 8.

Ratio of iterative-refinement iterations to double-precision iterations for each method

Problem CGNR CGMN CARP-CG
1 1.22 1.14 1.61
2 1.03 1.04 6.53
3 1.00 1.00 0.65
4 1.01 1.06 1.86
5 1.01 1.04 1.57
6 1.02 1.05 2.32
7 1.00 1.39 1.37
8 1.15 1.75 2.08
9 1.02 0.83 4.42

Table 9.

Number of iterative refinements for Problems 1–9

Problem   CGNR   CGMN   CARP-CG   CAV
1   2   2    2    4
2   2   2    78   –
3   1   1    1    9
4   2   2    2    –
5   5   2    3    9
6   2   2    2    2
7   1   4    2    6
8   2   99   23   –
9   2   2    57   14

–: Algorithm failed

Finally, in Figure 10, we show the speedup obtained by the GPU in comparison to the CPU CGMN and the 16-node cluster CARP-CG implementations. The GPU implementation is 5–20 times faster than the CPU, depending on the particular problem. Compared to the 16-node cluster, the GPU implementation is 1–3 times faster. The speedup of GPU CGNR over the corresponding CPU CGNR is roughly a factor of 30 to 40.

Figure 10. GPU speedups in comparison to the CPU CGMN and the 16-node cluster.

6. Conclusions

Among the algorithms considered in this paper, CGNR was shown to be the most efficient for solving the investigated partial differential equations on the GPU. This finding is in contrast to earlier works on the CPU, where CGMN was found to be consistently better than CGNR. Moreover, CGNR was the most robust method under the GPU's single-precision floating-point arithmetic. CAV achieved the highest GFLOPS rate but the slowest solution time, due to its poor convergence.

The GPU has a clear advantage when compared to the CPU. In particular, the computational results demonstrated that the GPU implementation is five to twenty times faster than the CPU, depending on the particular problem. Compared to a 16-node cluster, the GPU implementation was one to three times faster. When the cost of the hardware, its power consumption, and its portability are accounted for, it becomes apparent that a single GPU offers a clear advantage over a 16-node Linux cluster.

While algorithm performance is known to be architecture-specific, our computational results demonstrate that the GPU offers a low-cost, high-performance computing solution for solving large-scale partial differential equations. Continued improvements, such as higher performance and native double-precision support, will only make the GPU more attractive as a high-performance computing device for scientific computing applications.

Figure 2. Convergence results for Problem 2.

Figure 3. Convergence results for Problem 3.

Figure 4. Convergence results for Problem 4.

Figure 5. Convergence results for Problem 5.

Figure 6. Convergence results for Problem 6.

Figure 7. Convergence results for Problem 7.

Figure 8. Convergence results for Problem 8.

Acknowledgments

We would like to thank Dan Gordon for detailed suggestions that helped us considerably improve the quality of this paper.

Footnotes

1. This work was supported in part by the Joint NSF/NIGMS Initiative to Support Research in the Area of Mathematical Biology under NIH award GM072023.


Contributor Information

Joseph M. Elble, Email: elble@uiuc.edu.

Nikolaos V. Sahinidis, Email: sahinidis@cmu.edu.

Panagiotis Vouzis, Email: pvouzis@cmu.edu.

References

1. Saad Y. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics; Philadelphia, PA: 2000.
2. Göddeke D, Strzodka R, Turek S. Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations. International Journal of Parallel, Emergent and Distributed Systems. 2007;22(4):221–256.
3. Bolz J, Farmer I, Grinspun E, Schröder P. Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. ACM Transactions on Graphics. 2003;22:917–924.
4. Krüger J, Westermann R. Linear algebra operators for GPU implementation of numerical algorithms. ACM Transactions on Graphics. 2003;22:908–916.
5. Galoppo N, Govindaraju NK, Henson M, Manocha D. LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. Proceedings of the Conference on Supercomputing; 2005. p. 3.
6. Bramley R, Sameh A. Domain decomposition for parallel row projection algorithms. Applied Numerical Mathematics. 1991;8(4–5):303–315.
7. Bramley R, Sameh A. Row projection methods for large nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing. 1992;13(1):168–193.
8. Arioli M, Duff I, Noailles J, Ruiz D. A block projection method for sparse matrices. SIAM Journal on Scientific and Statistical Computing. 1992;13(1):47–70.
9. Gordon D, Gordon R. Component-averaged row projections: A robust, block-parallel scheme for sparse linear systems. SIAM Journal on Scientific Computing. 2005;27(3):1092–1117.
10. Gordon D, Gordon R. CARP-CG: A robust and efficient parallel solver for linear systems, applied to strongly convection-dominated elliptic partial differential equations. Tech. rep., Department of Computer Science, University of Haifa; Haifa, Israel: Apr 2008. Submitted for publication.
11. Gordon D, Gordon R. CGMN revisited: Robust and efficient solution of stiff linear systems derived from elliptic partial differential equations. ACM Transactions on Mathematical Software. 35(3).
12. NVIDIA CUDA. http://www.nvidia.com/object/cudahome.html
13. Lindholm E, Nickolls J, Oberman S, Montrym J. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro. 2008;28:39–55.
14. Lin C, Manocha D, editors. Special Issue on Cutting-Edge Computing: Using New Commodity Architecture. IEEE Proceedings. 2008;96(5):753–899.
15. Gottlieb A, Hwang K, Sahni S, editors. Special Issue on General-Purpose Processing Using Graphics Processing Units. Journal of Parallel and Distributed Computing. 2008;68(10):1305–1402.
16. Kaczmarz S. Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin de l'Académie Polonaise des Sciences et des Lettres. 1937;35:355–357.
17. Strohmer T, Vershynin R. A randomized solver for linear systems with exponential convergence. In: Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques. Springer; 2006. pp. 499–507.
18. Cimmino G. Calcolo approssimato per le soluzioni dei sistemi di equazioni lineari. La Ricerca Scientifica XVI. 1938;9:326–333.
19. Combettes PL. Inconsistent signal feasibility problems: Least-squares solutions in a product space. IEEE Transactions on Signal Processing. 1994;SP-42:2955–2966.
20. Censor Y, Gordon D, Gordon R. Component averaging: An efficient iterative parallel algorithm for large and sparse unstructured problems. Parallel Computing. 2001;27(6):777–808.
21. Hestenes MR, Stiefel E. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards. 1952;49:409–435.
22. Björck Å, Elfving T. Accelerated projection methods for computing pseudoinverse solutions of systems of linear equations. BIT Numerical Mathematics. 1979;19:145–163.
23. Göddeke D, Strzodka R. Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations (Part 2: Double precision GPUs). Tech. rep., Technical University Dortmund; 2008.
