2020 May 22;12151:230–248. doi: 10.1007/978-3-030-50743-5_12

DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions

Daichi Mukunoki, Katsuhisa Ozaki, Takeshi Ogita, Toshiyuki Imamura
Editors: Ponnuswamy Sadayappan, Bradford L Chamberlain, Guido Juckeland, Hatem Ltaief
PMCID: PMC7295351

Abstract

This paper proposes a method for implementing dense matrix multiplication on FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA’s graphics processing units (GPUs). Tensor Cores are special processing units that perform 4 × 4 matrix multiplications on FP16 inputs with FP32 precision, and return the result on FP32. The proposed method adopts the Ozaki scheme, an accurate matrix multiplication algorithm based on error-free transformation for matrix multiplication. The proposed method has three prominent advantages: first, it can be built upon the cublasGemmEx routine using Tensor Core operations; second, it can achieve higher accuracy than standard DGEMM, including the correctly-rounded result; third, it ensures bit-level reproducibility even for different numbers of cores and threads. The achievable performance of the method depends on the absolute-value range of each element of the input matrices. For example, when the matrices were initialized with random numbers over a dynamic range of 1E+9, our DGEMM-equivalent implementation achieved up to approximately 980 GFlops of FP64 operation on the Titan RTX GPU (with 130 TFlops on Tensor Cores), whereas cublasDgemm achieved only 539 GFlops on FP64 floating-point units. Our results reveal the possibility of utilizing hardware with limited FP32/FP64 resources and fast low-precision processing units (such as AI-oriented processors) for general-purpose workloads.

Keywords: Tensor cores, FP16, Half-precision, Low-precision, Matrix multiplication, GEMM, Linear algebra, Accuracy, Reproducibility

Introduction

The increasing number of deep learning applications has triggered the development of special processing units such as Tensor Cores on NVIDIA’s graphics processing units (GPUs) and Google’s Tensor Processing Units (TPUs) in recent years. The kernel of such tasks is matrix multiplication, which does not require high precision such as IEEE 754-2008 binary32 (known as single-precision or FP32, with an 8-bit exponent and a 23-bit fraction) or binary64 (known as double-precision or FP64, with an 11-bit exponent and a 52-bit fraction). The hardware instead supports fast, low-precision operations such as binary16 (known as half-precision or FP16, with a 5-bit exponent and a 10-bit fraction) and 8/16-bit integer operations.

One of the most widely used examples is the Tensor Core, introduced in the Volta architecture, which computes a 4 × 4 matrix multiplication per clock with fused multiply-add operations. Although Tensor Cores support several data formats and computational precisions, the present paper focuses on FP16 computations with FP32 precision mode, which compute d = a × b + c with FP32 precision (Fig. 1). Here, a and b are FP16 values, and c and d are FP32. The Tensor Cores operate up to eight times faster than standard FP32 floating-point units (FPUs) on CUDA Cores. Many studies have exploited this tremendous performance of Tensor Cores in general tasks.

Fig. 1.

Tensor Cores (FP16 computations with FP32 precision mode)
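
The mixed-precision multiply-add described above can be illustrated with a minimal scalar sketch in Python. It assumes IEEE round-to-nearest conversions (the rounding performed inside the actual hardware may differ), and the helper names fp16, fp32, and tensor_core_fma are hypothetical, introduced only for this illustration.

```python
import struct

def fp16(v):
    """Round a Python float (FP64) to IEEE binary16 and convert back."""
    return struct.unpack("<e", struct.pack("<e", v))[0]

def fp32(v):
    """Round a Python float (FP64) to IEEE binary32 and convert back."""
    return struct.unpack("<f", struct.pack("<f", v))[0]

def tensor_core_fma(a, b, c):
    """Scalar emulation of a multiply-add with FP16 inputs a, b and FP32 c, d.

    Assumption: round-to-nearest everywhere; this only approximates hardware.
    """
    a16, b16 = fp16(a), fp16(b)
    product = a16 * b16             # 11-bit x 11-bit significands fit in FP32
    return fp32(product + fp32(c))  # the accumulation is rounded to FP32

a16, b16 = fp16(0.1), fp16(0.2)
print(a16 != 0.1)                    # converting the input to FP16 loses bits
print(fp32(a16 * b16) == a16 * b16)  # the product itself needs no rounding
print(tensor_core_fma(0.1, 0.2, 1.0))
```

The point of the sketch is that the only unavoidable information loss is in converting the inputs to FP16; the scheme described later removes that loss by splitting the inputs instead of rounding them.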

This paper presents a method for computing a general matrix multiply routine (GEMM) in level-3 basic linear algebra subprograms (BLAS) [4] on FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores. GEMM is one of the kernel operations of many scientific workloads, as well as high-performance Linpack. The proposed method builds on the Ozaki scheme, an accurate matrix multiplication algorithm using error-free transformation proposed by Ozaki et al. [13]. The advantages of this method are listed below.

  • Productive: Being built upon the cublasGemmEx routine in cuBLAS provided by NVIDIA, the method incurs a low development cost.

  • Accurate: The method can achieve higher accuracy than standard SGEMM and DGEMM, up to the correctly-rounded result.

  • Reproducible: The method obtains the same (bitwise identical) result for the same input, even when the number of cores and threads differs in each execution.

  • Adaptable: The concept is adaptable to other precisions.

Whereas some studies simply accelerate the computation not requiring high-precision by utilizing low-precision hardware, the present study attempts more accurate computations by utilizing low-precision hardware. Our DGEMM implementations outperform cuBLAS DGEMM only on processors with limited FP64 support. However, the performance gain over FP64 FPUs is not necessarily important; rather, the intent is to increase the potential of low-precision hardware such as artificial intelligence (AI) oriented processors. Moreover, our method provides a new perspective on the efficient hardware design for both AI and traditional high-performance computing (HPC) workloads. For example, it may reduce the number of FP64 resources in exchange for massive low-precision support.

The remainder of this paper is organized as follows. Section 2 introduces related work, and Sect. 3 describes our methodology based on the Ozaki scheme. Section 4 implements the method, and Sect. 5 presents the accuracy and performance evaluations on Titan RTX and Tesla V100 GPUs. Section 6 discusses the perspective of future hardware design using our proposal. This paper concludes with Sect. 7.

Related Work

Several studies have attempted to utilize low-precision hardware designed for AI workloads for other purposes. For example, Haidar et al. [7] utilized standard FP16 and the Tensor Cores operation with FP32 precision in dense and sparse linear systems with iterative refinement. Energy improvement has also been studied [6]. Its error analysis was given by Carson and Higham [1]. Yang et al. [16] presented a Monte Carlo simulation of an Ising model using bfloat16 (BF16, with an 8-bit exponent and a 7-bit fraction) on Google’s TPUs. These studies apply low-precision operations to the portions of code not requiring high accuracy, which can be computed at that precision level. Accordingly, their applicability is algorithm- or problem-dependent.

Similarly to the present study, several studies have attempted more accurate operations than those achieved by low-precision hardware. For example, Markidis et al. [11] proposed a method that improves the accuracy of matrix multiplication computed with Tensor Cores. Although their method is conceptually similar to ours, its capability is limited to the computation of matrices with dynamic ranges supported on FP16 with SGEMM-equivalent accuracy. Henry et al. [8] discussed the performance of high-precision operations with double-double arithmetic [2], a classic 2-fold precision arithmetic technique, on BF16 FPUs. Sorna et al. [15] proposed a method to improve the accuracy of 2D fast Fourier transform performed on Tensor Cores. We note that, in those studies, the performance gain over FP32 or FP64 FPUs was not necessarily important; rather, the intent was to increase the potential of low-precision hardware. Therefore, the hardware may need to be redesigned to balance the precisions supported on the FPUs. Our present discussion follows a similar direction.

The Ozaki scheme, which is the kernel of our proposed method, was originally proposed for accurate matrix multiplication by standard floating-point operations. OzBLAS [12] implements accurate and reproducible BLAS routines on CPUs and GPUs based on the Ozaki scheme. Whereas OzBLAS was built on DGEMM performed on FP64 FPUs, the Ozaki scheme in the present study performs DGEMM/SGEMM operations using GEMM performed on Tensor Cores. Ichimura et al. [10] also reported a high-performance implementation of the Ozaki scheme based on FP64 operations on many-core CPUs.

Methodology

This section first describes the minimal scheme for computing DGEMM by the modified Ozaki scheme on Tensor Cores. Next, it presents additional techniques that accelerate the computations. In this paper, Inline graphic and Inline graphic denote the computations performed in FP64 and FP32 arithmetic, respectively, Inline graphic and Inline graphic denote the unit round-offs of FP64 (Inline graphic) and FP32 (Inline graphic), respectively, and Inline graphic and Inline graphic denote the sets of FP64 and FP16 floating-point numbers, respectively. Inline graphic denotes the set of natural numbers including zero.

Ozaki Scheme for Tensor Cores

The Ozaki scheme performs an error-free transformation of matrix multiplication; specifically, the matrix multiplication is transformed into a summation of several matrix multiplications that can be performed on floating-point operations without rounding-errors. Figure 2 is a schematic of the whole Ozaki scheme. The method performs three major steps:

  • Step 1: Splitting – element-wise splitting of the input matrices into several split matrices.

  • Step 2: Computation – computation of all-to-all matrix products of the split matrices.

  • Step 3: Summation – element-wise summation of the all-to-all matrix products.

Fig. 2.

Schematic of matrix multiplication (Inline graphic) by Ozaki scheme (in this figure, scaling is omitted).

We now describe each step in detail. For simplicity, we consider an inner product of two vectors Inline graphic, but the approach is naturally extendible to matrix multiplication as it consists of inner products. Also, although we describe the case for DGEMM only, the same concept applies to SGEMM.

[Algorithm 1]

Step 1: Splitting. Algorithm 1 splits the input vectors on FP64 into several vectors on FP16 as follows.

graphic file with name M15.gif

A split vector is first obtained on FP64 and then converted (downscaled) to FP16. The conversion moves only the exponent and causes no significand-bit loss. Here, Inline graphic and Inline graphic are the downscaling factors (from FP64 to FP16) of the exponents of Inline graphic and Inline graphic, respectively. At line 7 in Algorithm 1, Inline graphic reaches 1024 when Inline graphicDBL_MAX, meaning that Inline graphic and Inline graphic can be stored as 2-byte short integers. The splitting algorithm must satisfy the following properties:

  1. If Inline graphic and Inline graphic are non-zero elements,
    graphic file with name M26.gif
  2. Inline graphic must be error-free in the FP32 computation:
    graphic file with name M28.gif

Splitting can be understood as a translation from a floating-point representation to a fixed-point representation. The former of the above two properties means that the accuracy of the final result can be controlled (to lower accuracy) by omitting some split vectors from the lowest term. The accuracy of the final result obtainable with a certain number of split vectors depends on the length of the inner product and the range of the absolute values in each element of the input vectors. Note that to replace Tensor Cores by other FPUs with different precisions, we need to modify parameter Inline graphic in Algorithm 1, and the number of bits held in the split vectors (Inline graphic and Inline graphic) depends on the precision of the FPUs.
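
The splitting step can be sketched in pure Python as follows. This is an illustrative reconstruction under stated assumptions, not the paper's Algorithm 1 verbatim: the number of bits kept per slice, here called beta, is assumed to be min(11, floor((24 − ceil(log2 n)) / 2)) so that n products of slice elements accumulate exactly in FP32, and all function names are hypothetical. Overflow and underflow handling is omitted.

```python
import math
import struct
from fractions import Fraction

def fp16(v):
    """Round a Python float to IEEE binary16 and back (raises if out of range)."""
    return struct.unpack("<e", struct.pack("<e", v))[0]

def slice_bits(n):
    """Assumed bits per slice so that n slice products sum exactly in FP32."""
    beta = min(11, (24 - (n - 1).bit_length()) // 2)
    if beta < 2:
        raise ValueError("inner product too long for this sketch")
    return beta

def split_fp64_to_fp16(x):
    """Return (slices, exps) with x[i] == sum_j slices[j][i] * 2**exps[j]."""
    rem = list(x)
    beta = slice_bits(len(x))
    slices, exps = [], []
    mu = max(abs(v) for v in rem)
    while mu != 0.0:
        tau = math.frexp(mu)[1]                   # mu < 2**tau
        sigma = math.ldexp(1.0, tau + 53 - beta)  # extraction constant
        top = [(v + sigma) - sigma for v in rem]  # leading bits of each element
        rem = [v - t for v, t in zip(rem, top)]   # remainder, computed exactly
        # Downscale by 2**-tau: only the exponent moves, so FP16 holds it exactly.
        slices.append([fp16(math.ldexp(t, -tau)) for t in top])
        exps.append(tau)
        mu = max(abs(v) for v in rem)
    return slices, exps

x = [3.141592653589793, -2.718281828459045e-3, 1.0e5 / 3.0, -7.0]
slices, exps = split_fp64_to_fp16(x)
exact = all(
    sum(Fraction(s[i]) * Fraction(2) ** e for s, e in zip(slices, exps)) == Fraction(v)
    for i, v in enumerate(x)
)
print(len(slices), exact)
```

Each slice behaves like a fixed-point number with a shared exponent, which is why dropping the last slices lowers the accuracy in a controlled way.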

Step 2: Computation. Next, the inner product Inline graphic is computed as

graphic file with name M33.gif

Here, the computation of all-to-all inner products of the split vectors is performed: a total of Inline graphic inner products are computed. Inline graphic is the upscaling factor that compensates for the downscaling performed in the splitting process. By the second property of Algorithm 1, the inner products of the split vectors can be computed with Tensor Core operations because the inputs are stored in the FP16 format. When extending this example to matrix multiplication, the split matrices must be multiplied by the algorithm based on the standard inner product: divide-and-conquer approaches such as Strassen’s algorithm are not permitted.

Step 3: Summation. Finally, the inner products of the split vectors are summed. The summation can be computed by FP64 arithmetic if the required accuracy is that of standard DGEMM. However, as Inline graphic in Step 2 has no rounding errors (being error-free), the correctly-rounded result of Inline graphic can be obtained by summation with a correctly-rounded method such as NearSum [14]. The result is reproducible if the summation is performed by some reproducible method, even in FP64 arithmetic. As the summation is computed element-wise, the order of the computation is easily fixed.

[Algorithm 2]
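
The role of the summation method can be seen in a small Python sketch. Plain FP64 summation depends on the order of the terms, so it is reproducible only when the order is fixed, whereas a correctly-rounded summation is order-independent. Here math.fsum is used as a stand-in for NearSum; it is a different algorithm that also returns the correctly-rounded sum, and the function name ordered_sum is hypothetical.

```python
import math

def ordered_sum(terms):
    """Plain FP64 summation in the given, fixed order."""
    acc = 0.0
    for t in terms:
        acc += t
    return acc

terms = [1.0e16, 1.0, -1.0e16, 1.0]
permuted = [1.0, 1.0, 1.0e16, -1.0e16]

# Same values, different order: the plain FP64 sums differ.
print(ordered_sum(terms), ordered_sum(permuted))
# A correctly-rounded sum does not depend on the order.
print(math.fsum(terms), math.fsum(permuted))
```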

Whole Procedure on Matrix Multiplication. Algorithm 2 computes the whole Ozaki scheme for matrix multiplication on Tensor Cores. Here, SplitA and SplitB perform the splitting in the inner product direction (along k-dimension) of matrices Inline graphic and Inline graphic respectively, using Algorithm 1. Note that as Inline graphic and Inline graphic can be stored on FP16, Inline graphic can be performed by FP16 computations with FP32 precision on Tensor Cores through the cublasGemmEx routine in cuBLAS.
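
The whole procedure can be sketched for small matrices in pure Python. This is a minimal illustration under assumptions, not the authors' implementation: rows of A and columns of B are split independently, Tensor Core arithmetic is emulated with struct-based FP16/FP32 rounding, math.fsum stands in for a correctly-rounded summation, and every function name is hypothetical. A real implementation would form whole split matrices and call a GEMM routine once per pair instead of looping over elements.

```python
import math
import struct
from fractions import Fraction

def fp16(v):
    return struct.unpack("<e", struct.pack("<e", v))[0]

def fp32(v):
    return struct.unpack("<f", struct.pack("<f", v))[0]

def split(vec, beta):
    """Error-free split of one row or column into (FP16 slice, exponent) pairs."""
    rem, out = list(vec), []
    mu = max(abs(v) for v in rem)
    while mu != 0.0:
        tau = math.frexp(mu)[1]
        sigma = math.ldexp(1.0, tau + 53 - beta)
        top = [(v + sigma) - sigma for v in rem]
        rem = [v - t for v, t in zip(rem, top)]
        out.append(([fp16(math.ldexp(t, -tau)) for t in top], tau))
        mu = max(abs(v) for v in rem)
    return out

def dot_tc(a16, b16):
    """Emulated Tensor Core dot product: FP16 inputs, FP32 accumulation."""
    acc = 0.0
    for a, b in zip(a16, b16):
        acc = fp32(acc + fp32(a * b))
    return acc

def gemm_ozaki(A, B):
    """C = A * B for lists of lists; valid in this sketch for k <= 2**20."""
    m, k, n = len(A), len(B), len(B[0])
    beta = min(11, (24 - (k - 1).bit_length()) // 2)  # assumed bits per slice
    rows = [split(row, beta) for row in A]
    cols = [split([B[l][j] for l in range(k)], beta) for j in range(n)]
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # All-to-all products of slices, upscaled by the stored exponents.
            terms = [math.ldexp(dot_tc(a, b), ca + cb)
                     for a, ca in rows[i] for b, cb in cols[j]]
            C[i][j] = math.fsum(terms)  # correctly-rounded summation
    return C

A = [[1.0 / 3.0, -2.5e7, 3.0e-5], [1.0e-3, 7.0, -0.1]]
B = [[0.1, 2.0], [3.3, -4.4e2], [5.5e3, 1.0 / 7.0]]
C = gemm_ozaki(A, B)
reference = [[float(sum(Fraction(A[i][l]) * Fraction(B[l][j]) for l in range(3)))
              for j in range(2)] for i in range(2)]
print(C == reference)
```

The comparison against exact rational arithmetic checks that every element is the correctly-rounded FP64 value, even though all products were formed from FP16 inputs with FP32 accumulation.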

Fast Computation Techniques

To further improve the performance, the following methods modify Algorithm 1 or 2. Implementation-based speedup techniques that do not change the algorithm will be discussed in Sect. 4.

Fast Mode. As implemented in OzBLAS, we define a parameter d that determines the number of split matrices in the computation. With d specified, we can omit the computations Inline graphic in Inline graphic in exchange for a small loss of accuracy. If the required accuracy is FP64 (equivalent to the standard DGEMM, as performed by the method that determines the number of split matrices, described next), the accuracy loss is negligible. This technique reduces the number of matrix multiplications to d(d+1)/2 from d² at most.

[Algorithm 3]
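
The effect of fast mode on the number of GEMM calls can be sketched as below. The selection rule used here, keeping only the pairs (p, q) of split matrices with p + q ≤ d + 1, is an assumption made for illustration, as is the function name gemm_pairs; the rule drops the products of the lowest-order slices, which contribute least to the result.

```python
def gemm_pairs(d, fast_mode):
    """1-based index pairs (p, q) of split matrices that are multiplied.

    Assumption: fast mode keeps only the pairs with p + q <= d + 1.
    """
    return [(p, q)
            for p in range(1, d + 1)
            for q in range(1, d + 1)
            if not fast_mode or p + q <= d + 1]

d = 11
print(len(gemm_pairs(d, fast_mode=False)))  # all-to-all: d * d products
print(len(gemm_pairs(d, fast_mode=True)))   # fast mode: d * (d + 1) / 2 products
```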

Estimating the Number of Split Matrices that Achieves FP64-equivalent Accuracy. Splitting by Algorithm 1 automatically stops when Inline graphic; that is, when the accuracy of the final result is maximized. However, if the required accuracy is that of standard DGEMM performed on FP64 arithmetic (FP64-equivalent accuracy), we can estimate the minimum required number of splits by Algorithm 3 based on the probabilistic error bound [9] as

graphic file with name M49.gif 1

where Inline graphic is introduced to avoid matrix multiplication in the estimation (note that at line 7 in Algorithm 3, Inline graphic is the d-th non-downscaled split matrix stored on FP64. Hence, SplitA at line 2 does not necessarily need to perform until Inline graphic, and SplitA and this algorithm can be integrated). This algorithm is designed to operate in fast mode. If the split number is determined such that Algorithm 1 executes until Inline graphic, the accuracy may be lower than that of standard DGEMM. In this case, we must disable the fast mode. Note that a certain degree of difference between the desired accuracy (that achieved by standard DGEMM) and the obtained accuracy is expected in this method, because the number of split matrices is only estimated from the probabilistic error bound and is also influenced by the vector Inline graphic.

Blocking Against Inner Product. This step is not implemented in the present study. As Inline graphic in Algorithm 1 includes n, the dimension of the inner product, the number of splits required to achieve a certain accuracy depends on the inner-product-wise dimension of the matrix. Its increase can be avoided by employing a blocking strategy against the inner product-wise operations. The blocking size can be set to the minimum size that achieves the best performance. However, this strategy increases the summation cost; moreover, changing the block size may disturb the reproducibility except when the correctly-rounded computation is performed.
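
The dependence on the inner-product length can be made concrete with a rough Python sketch. It assumes that each slice keeps min(11, floor((24 − ceil(log2 n)) / 2)) bits, the condition under which n products of slice elements accumulate exactly in FP32, and it estimates the slice count for inputs whose elements have similar magnitudes as ceil(53 / bits). Both formulas and both function names are assumptions for illustration, not taken from the paper's algorithms.

```python
def slice_bits(n):
    """Assumed bits held per slice for an inner product of length n."""
    return min(11, (24 - (n - 1).bit_length()) // 2)

def slices_for_fp64(n):
    """Rough slice count needed to cover a 53-bit significand (narrow-range inputs)."""
    bits = slice_bits(n)
    return -(-53 // bits)  # ceiling division

for n in (16, 256, 1024, 10240):
    print(n, slice_bits(n), slices_for_fp64(n))
```

Under these assumptions, blocking a long inner product into shorter pieces lets each slice hold more bits and so needs fewer slices, at the price of extra summation work.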

Implementation

Basic Design

Our DGEMM implementations, computing Inline graphic, using Tensor Cores are referred to as DGEMM-TC, and two versions are implemented as described below.

  • DP-mode: This mode achieves FP64-equivalent accuracy. The number of split matrices is determined automatically by Algorithm 3. Fast mode is automatically applied if possible. The summation is performed in FP64 arithmetic.

  • CR-mode: This mode achieves the correctly-rounded result when Inline graphic and Inline graphic. The splitting iterates until all elements of the split matrices are zero. Fast mode is disabled. The summation is performed with NearSum when Inline graphic and Inline graphic or in FP64 arithmetic in other cases.

We also implemented SGEMM-TC in SP-mode, which corresponds to the FP32 version of DGEMM-TC in DP-mode.

Our implementations are interface-compatible with the standard DGEMM and SGEMM routines, except for an argument for the pointer to a handler that holds some parameters including the address pointing to the working memory of the Ozaki scheme. The working memory is wholly allocated outside the BLAS routine to avoid the allocation time. The allocation is performed through a BLAS initialization function, which must be called in advance, similar to cublasInit in cuBLAS. In our implementation of Algorithm 1, max (at line 12) is obtained on the register through the shared memory, whereas Inline graphic is accessed at line 10. For downscaling (and upscaling in the summation), Inline graphic is computed by scalbn(double x, int n), a function that computes x × 2^n. In DP-mode, Algorithms 1 and 3 are performed simultaneously as the latter includes the former. The computation part is performed by cublasGemmEx, a matrix multiplication routine that uses Tensor Cores. This routine has several internal implementations, which can be selected through an argument. In this study, we used the default: CUBLAS_GEMM_DFALT_TENSOR_OP. Inline graphic and Inline graphic are computed in the summation process.

Optimization

Blocking to Reduce Memory Consumption. Memory consumption is reduced by a blocking technique applied to the outer-product-wise direction (note that this blocking differs from the inner-product-wise blocking discussed in Subsect. 3.2). All procedures are blocked by dividing a matrix into a rectangle with block size Inline graphic. In our implementation, the block size is determined as Inline graphic. This blocking technique may reduce the performance, as the memory consumption shrinks towards Inline graphic, because each matrix multiplication more closely approaches the inner product.

Further Performance Improvement. Although not attempted in this study, the performance can be improved in several ways from an implementation technique perspective.

First, as implemented in OzBLAS, the computations of split matrices can be performed with batched BLAS (i.e., cublasGemmEx can be replaced with cublasGemmBatchedEx) because each matrix multiplication can be performed independently. We observed that the performance was improved when the matrix size was very small, or when the number of split matrices was relatively large, but was degraded in other cases.

Second, as discussed in the paper [13], a sufficiently sparse split matrix can be represented in sparse matrix form. Split matrices holding higher or lower bits of the input matrices may contain many zero elements. If a high-performance sparse matrix-matrix multiplication routine using Tensor Cores is provided, we might enhance the performance by switching the dense operation to a sparse operation.

Expected Performance and Memory Consumption

The most computationally complex part of matrix multiplication by this scheme is multiplying the split matrices using cublasGemmEx, which has Inline graphic complexity. Ideally, the overall performance is thus determined by the number of GEMMs called in the computation and the GEMM throughput. For d split matrices, the number of GEMMs is d² in the standard method and d(d+1)/2 in fast mode. These values show how the performance overheads (in time) compare with that of a one-time execution of cublasGemmEx. However, our implementations contain several operations executed using FP64 FPUs. Whereas those portions have a computational complexity of Inline graphic at most, they may affect the performance if the hardware has limited FP64 support.

The memory consumption when Inline graphic with Inline graphic split matrices and Inline graphic with Inline graphic split matrices in the naive implementation (i.e., without the blocking technique) is Inline graphic on FP16 for storing the split matrices. As shown in Algorithm 2, if the summation is performed immediately after each GEMM execution, Inline graphic requires mn storage on FP32; however, in our implementation, owing to the convenience of implementing NearSum in CR-mode, all computation results are retained, requiring Inline graphic of storage. After applying the blocking technique with block size Inline graphic in both the m and n dimensions, the memory consumption reduces to Inline graphic (Inline graphic). On the other hand, as the Inline graphic and Inline graphic are unknown before execution and the working memory is allocated before execution to avoid the memory allocation time, a certain amount of memory must be allocated at initialization. We then determine the maximum possible block size under the memory constraint. In addition to the above, several working memory spaces are needed. The memory consumption of our implementation is not yet optimized and should be improved in future work.

Evaluation

Experimental Settings

The performance was mainly evaluated on NVIDIA Titan RTX, a Turing architecture GPU with a compute capability of 7.5. The theoretical peak performance (with a boost clock of 1.77 GHz) is 509.76 GFlops on FP64, 16312.32 GFlops on FP32, and 130498.56 GFlops on Tensor Cores with FP32 precision. This GPU has more limited FP64 support than the Tesla series targeting HPC workloads (1/32 of FP32 and 1/256 of Tensor Cores). The memory is 24 GB GDDR6 at 672.0 GB/s. The host machine was equipped with an Intel Core i7-5930K CPU running CentOS Linux release 8.1.1911 (4.18.0-147.3.1.el8_1.x86_64), CUDA 10.2, and CUDA driver version 440.44. The GPU codes were compiled by nvcc release 10.2, V10.2.89 with compiler options “-O3 -gencode arch=compute_60,code=sm_75”.

Further evaluations were conducted on NVIDIA Tesla V100 (PCIe 32GB), which offers rich FP64 support (1/2 of FP32 and 1/16 of Tensor Cores). The Tesla V100 is a Volta architecture GPU with compute capability 7.0, and its theoretical peak performance (with a boost clock of 1.38 GHz) is 7065.6 GFlops on FP64, 14131.2 GFlops on FP32, and 113049.6 GFlops on Tensor Cores with FP32 precision. The memory is 32 GB HBM2 at 898.0 GB/s. The host machine was equipped with an Intel Xeon Gold 6126 CPU running Red Hat Enterprise Linux Server release 7.7 (3.10.0-1062.18.1.el7.x86_64), CUDA 10.2, and CUDA driver version 440.33.01. The codes for this GPU were compiled by nvcc release 10.2, V10.2.89 with “-O3 -gencode arch=compute_60,code=sm_70”.
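
As a quick consistency check, the quoted peak figures can be related to the stated FP64 ratios with a few lines of Python. The dictionary layout and the function name fp64_ratios are hypothetical; the numbers are the theoretical peaks given above, in GFlops.

```python
peaks = {
    "Titan RTX": {"fp64": 509.76, "fp32": 16312.32, "tensor": 130498.56},
    "Tesla V100": {"fp64": 7065.6, "fp32": 14131.2, "tensor": 113049.6},
}

def fp64_ratios(p):
    """Return (FP32 / FP64, Tensor Core / FP64) peak ratios, rounded."""
    return round(p["fp32"] / p["fp64"], 6), round(p["tensor"] / p["fp64"], 6)

for name, p in peaks.items():
    print(name, fp64_ratios(p))
```

The ratios show why the two GPUs behave so differently: on Titan RTX the Tensor Cores have far more headroom relative to FP64 units than on Tesla V100.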

For accurate evaluation, we averaged the results of 10 executions after three warm-up executions. In our proposed method, the number of split matrices required to achieve a certain accuracy depends on the range of the absolute values in each element of the input matrices. To observe the performance degradation arising from this range, we initialized the input matrices with Inline graphic, where Inline graphic is a uniform random number [0, 1) and Inline graphic is a random number selected from the standard normal distribution. The range of the absolute value of the input can be controlled by Inline graphic, and is widened by increasing Inline graphic. For example, fixing Inline graphic and varying Inline graphic, 1, and 2, the ranges were obtained as 9.8E−10 – 8.9E−01, 1.4E−09 – 1.6E+02, and 4.4E−10 – 4.8E+04, respectively. In all experiments, we allocated 20 GB to the working memory, and set the maximum block size to Inline graphic. The scalar parameters were set as Inline graphic and Inline graphic.

DGEMM-TC

Figure 3 shows the accuracies of cublasDgemm and DGEMM-TC in DP-mode (DGEMM-TC-DP) for various input ranges (collected with different Inline graphic values) on Titan RTX. The maximum relative error is measured against the result of 2048-bit MPFR [5] on FP64 (the results of MPFR are rounded to FP64). As the CR-mode with NearSum always obtained “zero,” meaning that all the results were correctly-rounded, its results are omitted from Fig. 3. The accuracy of our implementation (solid lines) was equivalent to that of cublasDgemm (dotted lines), but some differences were observed, because our method (Algorithm 3) simply estimates the minimum number of split matrices that ensures similar accuracy to the classic DGEMM based on a probabilistic error bound of GEMM. The estimation is further roughened by the Inline graphic term that avoids matrix multiplications in the estimation.

Fig. 3.

Accuracy of cublasDgemm and DGEMM-TC in DP mode on Titan RTX. The maximum relative error is plotted against the results of MPFR 2048-bit. Inline graphic varies the range of the input values.

Figure 4 shows the performance of DGEMM-TC in DP-mode (with FP64-equivalent accuracy) for various Inline graphic values. “Flops (on DP)” is the number of floating-point operations on FP64 per second when viewed as the standard DGEMM (i.e., it is computed as 2mnk/t, where t denotes the execution time in seconds). Although the theoretical peak performance of the GPU on FP64 was only 510 GFlops (539 GFlops was observed on cublasDgemm with GPU boost), our implementation achieved up to approximately 980 GFlops (when Inline graphic), outperforming cublasDgemm.

Fig. 4.

Performance of DGEMM-TC in DP-mode (with FP64-equivalent accuracy) on Titan RTX. “Flops (on DP)” is the number of FP64 floating-point operations corresponding to the standard DGEMM. Inline graphic varies the range of the input values.

Additional performance analyses are shown in Fig. 5. Panel (a) shows the execution time breakdown for Inline graphic 0.1–2, observed in the tenth (final) execution. The execution time was dominated by the Tensor Cores computations. The splitting execution SplitA was slightly slower than SplitB because it included the cost of determining the number of splits, but SplitB was more costly than SplitA overall, because it was performed several times on the same portions of matrix Inline graphic. Such multiple executions were required by the blocking strategy. The left part of Fig. 5 (b) shows the number of split matrices (d) of Inline graphic and Inline graphic (plotted by the same line). The central part of Fig. 5 (b) plots the number of GEMMs called in the computation (d², or d(d+1)/2 in fast mode, against the number of split matrices d). Finally, the right part of Fig. 5 (b) shows the computational throughput on Tensor Cores (i.e., cublasGemmEx). Unlike the case in Fig. 4, the Flops value directly represents the number of floating-point operations performed on the Tensor Cores, and excludes all other computations. The actual performance can be understood through the following example: when Inline graphic and Inline graphic, we observed 980 GFlops. In this case, the number of GEMM calls was 66. The throughput of cublasGemmEx was approximately 92 TFlops on TC, and consumed approximately 70% of the total execution time. Hence, there were 92 / 66 × 0.7 ≈ 0.98 TFlops on DP.
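
The arithmetic of this example can be written as a small Python model. If the GEMM calls take a fraction f of the total time and run at a Tensor Core throughput R, then the total time is (number of GEMMs × 2mnk / R) / f, so the DGEMM-equivalent rate 2mnk/t becomes R × f / (number of GEMMs). The function names below are hypothetical.

```python
def dp_flops(m, n, k, seconds):
    """DGEMM-equivalent rate, computed as 2mnk / t."""
    return 2.0 * m * n * k / seconds

def estimated_dp_tflops(tc_tflops, num_gemms, gemm_time_fraction):
    """Estimate the DGEMM-equivalent TFlops from the Tensor Core throughput.

    tc_tflops: throughput of the Tensor Core GEMM calls (TFlops)
    num_gemms: number of GEMM calls per DGEMM
    gemm_time_fraction: share of the total time spent in those calls
    """
    return tc_tflops / num_gemms * gemm_time_fraction

# Values quoted in the text: 92 TFlops, 66 GEMM calls, about 70% of the time.
print(round(estimated_dp_tflops(92.0, 66, 0.7), 2))
```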

Fig. 5.

Details of DGEMM-TC in DP-mode (with FP64-equivalent accuracy) on Titan RTX. Inline graphic varies the range of the input values. †1: The same line plots the values of matrices Inline graphic and Inline graphic. †2: Average over all blocks.

Figure 6 shows the performance of DGEMM-TC in CR-mode (DGEMM-TC-CR) (correctly-rounded) on Titan RTX. Figure 7 analyzes the performance in detail. The number of split matrices and GEMMs called in the computation can be non-integer values because the results were averaged over several blocks processed by the blocking technique. This mode degraded the performance because it increased the number of split matrices (and GEMMs), disabled the fast mode (i.e., affected the number of GEMMs), and increased the summation cost (NearSum is much more costly than the standard FP64 summation).

Fig. 6.

Performance of DGEMM-TC in CR-mode (correctly-rounded) on Titan RTX. “Flops (on DP)” is the number of FP64 floating-point operations corresponding to the standard DGEMM. Inline graphic varies the range of the input values.

Fig. 7.

Details of DGEMM-TC in CR-mode (correctly-rounded) on Titan RTX. Inline graphic varies the range of the input values. †1: The same line plots the values of matrices Inline graphic and Inline graphic. †2: Average over all blocks.

Finally, Table 1 summarizes the performances of DGEMM-TC with Inline graphic on Titan RTX and on Tesla V100, which has rich FP64 support. For comparison, we also show the performance of a correctly-rounded DGEMM implementation (DGEMM-DP-CR), which is based on the Ozaki scheme but uses cublasDgemm instead of cublasGemmEx (hence, the computation was performed using FP64 instead of Tensor Cores). On Tesla V100, DGEMM-TC in DP-mode did not outperform cublasDgemm.

Table 1.

Performance comparison (Inline graphic) on Titan RTX and Tesla V100 (GFlops on DP). Inline graphic varies the range of the input values.

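| Implementation | Titan RTX (Inline graphic) | Titan RTX (Inline graphic) | Titan RTX (Inline graphic) | Tesla V100 (Inline graphic) | Tesla V100 (Inline graphic) | Tesla V100 (Inline graphic) |
| --- | --- | --- | --- | --- | --- | --- |
| cublasDgemm | 534.1 | | | 6761 | | |
| DGEMM-TC-DP | 972.4 | 823.0 | 713.1 | 1064 | 914.3 | 790.8 |
| DGEMM-TC-CR | 220.4 | 193.5 | 173.1 | 255.0 | 222.5 | 198.5 |
| DGEMM-DP-CR | 24.75 | 21.17 | 21.17 | 293.1 | 250.7 | 250.7 |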

SGEMM-TC

Figure 8 shows the accuracy of cublasSgemm and SGEMM-TC in SP-mode for various input ranges (controlled by varying Inline graphic) on Titan RTX. The results are compared on FP32 (the results of MPFR are rounded to FP32).

Fig. 8.

Accuracy of cublasSgemm and SGEMM-TC in SP mode on Titan RTX. The maximum relative error is plotted against the results of MPFR 2048-bit. Inline graphic varies the range of the input values.

Figure 9 shows the performance of SGEMM-TC in SP-mode. Similarly to DGEMM-TC on Tesla V100, our proposed method did not accelerate SGEMM on this GPU with fast FP32 support, but it outperformed DGEMM-TC. The reason for the superior performance is not discernible from Fig. 9; however, the number of split matrices decreased, and the execution time of the splitting and summation parts was reduced in SGEMM-TC.

Fig. 9.

Performance of SGEMM-TC in SP-mode (with FP32-equivalent accuracy) on Titan RTX. “Flops (on SP)” is the number of FP32 floating-point operations corresponding to the standard SGEMM. Inline graphic varies the range of the input values.

Discussion

This section discusses perspectives for introducing our proposed approach into hardware design. Although our method is limited to inner product based computations, it extends the application range of hardware with limited (or no) FP32/FP64 resources and fast low-precision processing units for general purpose workloads. Consequently, we can consider reducing the number of FP64 (or even FP32) FPUs, as discussed by Domke et al. [3], by exchanging them with low-precision FPUs such as Tensor Cores. Our rationale is supported by the following situations.

  • The demand for AI workloads not requiring FP64 is increasing, and such work is becoming a significant part of the total workloads of HPC systems.

  • The performance of large-scale computations is becoming communication-bound as the degree of parallelism of HPC systems increases.

  • The need for FP64 performance has reduced under the advance of mixed-precision techniques such as iterative refinement and precision-tuning.

  • Low-precision hardware is easier to implement than high-precision hardware. In general, the complexity of p-bit precision hardware is $O(p^2)$. Currently, most processors only exploit the $O(p)$ benefit of instruction-level parallelism in single-instruction-multiple-data (SIMD) vectorization.

  • Field-programmable gate arrays (FPGAs) are becoming a promising platform for HPC. Computations that do not fit into general processors can be accommodated by FPGAs. For instance, FPGAs can cover any “niche” demands for FP64 in the future.

Accurate and high-precision computational methods, such as the proposed method and the other methods introduced in Sect. 2, may satisfy the “averaged” demand of the workloads requiring FP64 operations on an HPC system. Particularly in memory-bound operations, sufficient performance may be delivered by limited FP64 hardware performance, or by software emulation of FP64 through multi-precision techniques, for example “double-float” arithmetic, the FP32 analogue of double-double arithmetic.
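To make the “double-float” idea concrete, the following is a minimal Python sketch, not code from this paper. It emulates FP32 by rounding FP64 results through `struct`, and the helper names `f32`, `two_sum`, and `df_add` are hypothetical. It represents a value as an unevaluated sum of two FP32 numbers (hi, lo) and uses Knuth's TwoSum error-free transformation to keep the rounding error of each addition in the low word. The addition shown is the simple ("sloppy") variant, which is adequate when the operands have the same sign.

```python
import struct
from fractions import Fraction


def f32(x):
    """Round a Python float (FP64) to the nearest FP32 value (emulation only)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a, b):
    """Knuth's TwoSum in FP32: returns (s, e) with s = fl(a + b) and s + e = a + b."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e


def df_add(x, y):
    """Add two double-float values, each given as an FP32 pair (hi, lo)."""
    s, e = two_sum(x[0], y[0])          # exact sum of the high words
    e = f32(e + f32(x[1] + y[1]))       # fold in the low words (rounded)
    return two_sum(s, e)                # renormalize into (hi, lo)


def accumulate(value, count):
    """Sum `value` `count` times in plain FP32 and in double-float."""
    plain, acc = 0.0, (0.0, 0.0)
    for _ in range(count):
        plain = f32(plain + value)
        acc = df_add(acc, (value, 0.0))
    return plain, acc
```

Calling `accumulate(f32(0.1), 1000)` returns the plain FP32 sum together with the (hi, lo) pair; comparing both against `1000 * Fraction(f32(0.1))` shows how much of the accumulated rounding error the low word recovers.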

We now propose some hardware designs based on the Ozaki scheme. As described in Subsect. 3.1, the core concept of the Ozaki scheme is error-free transformation, which is similar to converting a matrix represented by floating-point numbers into matrices represented by fixed-point numbers. Accordingly, the bit length of the significand is important, and the exponent can be computed separately. Fast integer matrix multiplication (i.e., fast Arithmetic Logic Units) is desirable for such a scheme because it requires fewer split matrices than a floating-point format of the same bit length. Moreover, this study effectively utilizes the Tensor Core design that computes FP16 data with FP32 precision and returns the result on FP32. Although the same idea can be implemented on standard FP16 FPUs, which adopt FP16 for both data format and computation, such an implementation would increase the number of split matrices needed to achieve a given accuracy. The situation is worse on BF16 FPUs, which have fewer significand bits. From this perspective, FPUs resembling an FP64 version of Tensor Cores are desirable: units that compute $d = a \times b + c$ with FP64 accuracy, where a and b are FP32 and c and d are FP64. Combined with the Ozaki scheme, such FPUs could adequately substitute for full FP64 FPUs in DGEMM.
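The fixed-point view of error-free transformation described above can be illustrated with a small Python sketch. This is a simplified, FP64-only illustration under stated assumptions, not the splitting algorithm of Subsect. 3.1 or the Tensor Core implementation: the function names `split` and `ozaki_dot` are hypothetical, the slice width is chosen for a 53-bit accumulator rather than for FP16 inputs with FP32 accumulation, and overflow and underflow are ignored. Each vector is cut into slices whose entries are small integer multiples of a common power of two, so every slice-by-slice inner product is computed without rounding error, and only the final summation rounds.

```python
import math
from fractions import Fraction


def split(vec, bits):
    """Split vec into slices; in slice k every entry is an integer multiple of
    2**(tau - k*bits) with magnitude at most 2**bits units (a fixed-point view).
    The slices sum back to vec exactly."""
    mu = max(abs(v) for v in vec)
    if mu == 0.0:
        return [list(vec)]
    tau = math.frexp(mu)[1]                      # mu < 2**tau
    slices, rest, k = [], list(vec), 1
    while any(r != 0.0 for r in rest):
        unit = math.ldexp(1.0, tau - k * bits)   # common unit of this slice
        sl = [round(r / unit) * unit for r in rest]
        rest = [r - s for r, s in zip(rest, sl)]  # exact remainder
        slices.append(sl)
        k += 1
    return slices


def ozaki_dot(x, y):
    """Inner product via error-free slice products and one final rounding."""
    n = len(x)
    # Choose bits so that 2*bits + ceil(log2(n)) <= 53: each slice-by-slice
    # inner product is then exact in FP64.
    bits = (53 - (n - 1).bit_length()) // 2
    partial = []
    for xs in split(x, bits):
        for ys in split(y, bits):
            acc = 0.0
            for a, b in zip(xs, ys):
                acc += a * b                     # no rounding error occurs here
            partial.append(acc)
    return math.fsum(partial)                    # correctly rounded final sum
```

In this sketch the number of slices grows with the spread of magnitudes within a vector, which mirrors the dependence of performance on the input range noted in the paper. A narrower accumulator or input format forces a smaller `bits`, and therefore more slices, for the same accuracy.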

Conclusion

This paper presented an implementation technique for DGEMM and SGEMM using Tensor Cores that compute FP16 inputs with FP32 precision. Our method is based on the Ozaki scheme and is built upon cublasGemmEx, a GEMM implementation in cuBLAS performed on Tensor Cores. Besides providing a DGEMM- and SGEMM-compatible interface with equivalent accuracy, our technique can support accurate (correctly-rounded) and reproducible computations. The performance of our method depends on the range of the absolute values of the elements of the input matrices. For instance, when the matrices were initialized with random numbers over a dynamic range of 1E+9, our DGEMM implementation with FP64-equivalent accuracy achieved up to approximately 980 GFlops of FP64 operation on Titan RTX (with 130 TFlops on Tensor Cores), whereas cublasDgemm can achieve only 539 GFlops on FP64 FPUs. The proposed method enhances the possibility of utilizing hardware with limited (or no) FP32/FP64 resources and fast low-precision processing units (such as AI-oriented processors) for general-purpose workloads. Furthermore, because the proposed method reduces the demand for FP64 FPUs in exchange for lower-precision FPUs, it will contribute new perspectives to future hardware designs. Our code is available on our webpage (footnote 7).

Acknowledgment

This research was partially supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 19K20286 and MEXT as “Exploratory Issue on Post-K computer” (Development of verified numerical computations and super high-performance computing environment for extreme researches). We thank Takeshi Terao (Shibaura Institute of Technology) for his helpful suggestion regarding blocking in the inner-product direction. This research used computational resources of Cygnus (for Tesla V100) provided by the Multidisciplinary Cooperative Research Program in the Center for Computational Sciences, University of Tsukuba.

Footnotes

3

The details are not presented.

4

The actual clock can exceed the boost clock, depending on the individual product and the execution environment.

5

576 (Tensor Cores) $\times$ 1.77 (GHz) $\times$ 128 (Flops) = 130498.56 (GFlops).

Contributor Information

Ponnuswamy Sadayappan, Email: saday@cs.utah.edu.

Bradford L. Chamberlain, Email: bradford.chamberlain@hpe.com.

Guido Juckeland, Email: g.juckeland@hzdr.de.

Hatem Ltaief, Email: hatem.ltaief@kaust.edu.sa.

Daichi Mukunoki, Email: daichi.mukunoki@riken.jp.

Katsuhisa Ozaki, Email: ozaki@sic.shibaura-it.ac.jp.

Takeshi Ogita, Email: ogita@lab.twcu.ac.jp.

Toshiyuki Imamura, Email: imamura.toshiyuki@riken.jp.

References

  • 1. Carson E, Higham N. Accelerating the solution of linear systems by iterative refinement in three precisions. SIAM J. Sci. Comput. 2018;40(2):A817–A847. doi: 10.1137/17M1140819.
  • 2. Dekker TJ. A floating-point technique for extending the available precision. Numerische Mathematik. 1971;18:224–242. doi: 10.1007/BF01397083.
  • 3. Domke, J., et al.: Double-precision FPUs in high-performance computing: an embarrassment of riches? In: Proceedings 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019), pp. 78–88 (2019)
  • 4. Dongarra JJ, Du Croz J, Hammarling S, Duff IS. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 1990;16(1):1–17. doi: 10.1145/77626.79170.
  • 5. Fousse L, Hanrot G, Lefèvre V, Pélissier P, Zimmermann P. MPFR: a multiple-precision binary floating-point library with correct rounding. ACM Trans. Math. Softw. 2007;33(2):13:1–13:15. doi: 10.1145/1236463.1236468.
  • 6. Haidar A, et al. The design of fast and energy-efficient linear solvers: on the potential of half-precision arithmetic and iterative refinement techniques. In: Shi Y, et al., editors. Computational Science – ICCS 2018. Cham: Springer; 2018. pp. 586–600.
  • 7. Haidar, A., Tomov, S., Dongarra, J., Higham, N.J.: Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers. In: Proceedings International Conference for High Performance Computing, Networking, Storage, and Analysis (SC 2018), pp. 47:1–47:11 (2018)
  • 8. Henry, G., Tang, P.T.P., Heinecke, A.: Leveraging the bfloat16 artificial intelligence datatype for higher-precision computations. In: Proceedings 26th IEEE Symposium on Computer Arithmetic (ARITH-26), pp. 69–76 (2019)
  • 9. Higham NJ, Mary T. A new approach to probabilistic rounding error analysis. SIAM J. Sci. Comput. 2019;41(5):A2815–A2835. doi: 10.1137/18M1226312.
  • 10. Ichimura, S., Katagiri, T., Ozaki, K., Ogita, T., Nagai, T.: Threaded accurate matrix-matrix multiplications with sparse matrix-vector multiplications. In: Proceedings 32nd IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1093–1102 (2018)
  • 11. Markidis, S., Chien, S.W.D., Laure, E., Peng, I.B., Vetter, J.S.: NVIDIA tensor core programmability, performance & precision. In: Proceedings 32nd IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 522–531 (2018)
  • 12. Mukunoki, D., Ogita, T., Ozaki, K.: Reproducible BLAS routines with tunable accuracy using Ozaki scheme for many-core architectures. In: Proceedings 13th International Conference on Parallel Processing and Applied Mathematics (PPAM2019), Lecture Notes in Computer Science, vol. 12043, pp. 516–527 (2020)
  • 13. Ozaki K, Ogita T, Oishi S, Rump SM. Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications. Numer. Algorithms. 2012;59(1):95–118. doi: 10.1007/s11075-011-9478-1.
  • 14. Rump S, Ogita T, Oishi S. Accurate floating-point summation part II: Sign, k-fold faithful and rounding to nearest. SIAM J. Sci. Comput. 2009;31(2):1269–1302. doi: 10.1137/07068816X.
  • 15. Sorna, A., Cheng, X., D’Azevedo, E., Won, K., Tomov, S.: Optimizing the fast Fourier transform using mixed precision on tensor core hardware. In: Proceedings 25th IEEE International Conference on High Performance Computing Workshops (HiPCW), pp. 3–7 (2018)
  • 16. Yang, K., Chen, Y.F., Roumpos, G., Colby, C., Anderson, J.: High performance Monte Carlo simulation of Ising model on TPU clusters. In: Proceedings International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2019), pp. 83:1–83:15 (2019)

Articles from High Performance Computing are provided here courtesy of Nature Publishing Group
