Data in Brief. 2021 Nov 24;39:107631. doi: 10.1016/j.dib.2021.107631

Searching CUDA code autotuning spaces with hardware performance counters: data from benchmarks running on various GPU architectures

Jana Hozzová 1, Jiří Filipovič 1, Amin Nezarat 1, Jaroslav Ol’ha 1, Filip Petrovič 1
PMCID: PMC8633859  PMID: 34877392

Abstract

We have developed several autotuning benchmarks in CUDA that take into account performance-relevant source-code parameters and reach near peak-performance on various GPU architectures. We have used them during the development and evaluation of a search method for tuning space proposed in [1]. With our framework Kernel Tuning Toolkit, freely available at Github, we measured computation times and hardware performance counters on several GPUs for the complete tuning spaces of five benchmarks. These data, which we provide here, might benefit research of search algorithms for the tuning spaces of GPU codes or research of relation between applied code optimization, hardware performance counters, and GPU kernels’ performance.

Moreover, we describe the scripts we used for robust evaluation of our searcher and comparison to others in detail. In particular, the script that simulates the tuning, i.e., replaces time-demanding compiling and executing the tuned kernels with a quick reading of the computation time from our measured data, makes it possible to inspect the convergence of tuning search over a large number of experiments. These scripts, freely available with our other codes, make it easier to experiment with search algorithms and compare them in a robust and reproducible way.

During our research, we generated models for predicting values of performance counters from values of tuning parameters of our benchmarks. Here, we provide the models themselves and describe the scripts we implemented for their training. These data might benefit researchers who want to reproduce or build on our research.

Keywords: Auto-tuning, Tuning spaces, Performance counters, CUDA

Specifications Table

Subject: Computer Science
Specific subject area: Auto-tuning GPU kernels using hardware performance counters
Type of data: Tables; Python and R scripts
How data were acquired: Raw autotuning data: Using our autotuning framework Kernel Tuning Toolkit (KTT), we measured computation time and collected hardware performance counters for the whole tuning spaces of five benchmark CUDA codes on four GPUs. KTT is freely available [2]; the five benchmarks are in its ’examples’ folder. These benchmarks cover a wide range of computational problems: computing convolution, Coulomb summation in three dimensions, matrix multiplication, matrix transposition and the n-body problem. They also differ in the sizes of their tuning spaces. Prediction models: Using our scripts, also available in the KTT repository on GitHub, we trained models with the raw tuning data.
Data format: Raw; Analyzed; Scripts
Parameters for data collection: Raw autotuning data: Computation time and performance counters were measured for five benchmarks (GEMM, Convolution, Matrix transposition, 3D Coulomb summation and n-body) bundled with KTT [2]. We ran them on four GPUs: GeForce GTX 680, GeForce GTX 750, GeForce GTX 1070 and GeForce RTX 2080. KTT was configured to perform exhaustive exploration of the tuning spaces on each tested GPU with profiling switched on. The input size for each benchmark was chosen so that the kernel execution took 1–10 milliseconds. For the GEMM benchmark, data for several input sizes were collected on GeForce GTX 1070. Prediction models: Prediction models were trained for all hardware performance counters, local size and global size.
Description of data collection: Raw autotuning data: KTT performed exhaustive exploration of the complete tuning spaces (sets of all executable tuning configurations) of the tested benchmarks for each GPU. Each tuning configuration contains information about tuning parameters (affecting how the GPU kernel is created and executed), the runtime of the kernel and hardware performance counters provided by the NVIDIA CUPTI library. Tuning configurations which cannot be executed on a particular GPU are not stored. Prediction models: Models were created from the raw autotuning data with the scripts create_nonlinear_models.R and generate-knowledge-base.py available with the profile-based searcher in KTT.
Data source location: Institute of Computer Science, Masaryk University, Brno, Czech Republic (49.211N, 16.598E)
Data accessibility: Data repository name: Mendeley Data. Data identification number: https://doi.org/10.17632/nn53dskr7z.1. Direct URL to data: https://doi.org/10.17632/nn53dskr7z.1. KTT repository name: GitHub. Data identification number: https://doi.org/10.5281/zenodo.5675994. Direct URL to data: https://doi.org/10.5281/zenodo.5675994
Related research article: Filipovič, J., Hozzová, J., Nezarat, A., Ol’ha, J., Petrovič, F., Using hardware performance counters to speed up autotuning convergence on GPUs. Journal of Parallel and Distributed Computing, Volume 160, 2022.

Value of the Data

  • Raw autotuning data contain complete tuning spaces of several CUDA kernels prepared for autotuning, alongside their computation times and hardware performance counters measurements on several GPUs. Scripts make it easier to experiment with searching tuning space in a controlled environment, so the results of searchers are comparable.

  • These data will help those researching how to search the tuning spaces of GPU codes or those interested in mining the data related to hardware performance counters and GPU kernels’ performance.

  • With raw autotuning data, new search algorithms for navigating the tuning spaces can be easily evaluated for multiple GPUs (even those unavailable to the researchers), skipping high time demands of actually compiling, running and measuring. Moreover, the global optimum of the tuning space is known from data.

  • With scripts for simulated and real-time tuning, the data of others (with a new search method or a new prediction model for performance counters) can be consistently compared to the performance of our searcher.

  • Availability of KTT autotuner and scripts for model preparation allows users to expand our dataset by measurement on their own GPUs, or their own benchmarks.

1. Data Description

1.1. Raw autotuning data

Raw autotuning data were produced by Kernel Tuning Toolkit 1.3 [2] running on GPUs listed in Table 1. For each benchmark listed in Table 2 (available in KTT repository in folder examples as cltune-conv, coulomb_sum_3d, cltune-gemm, mtran and nbody), the exhaustive search of the whole tuning space was executed, measuring computation time and hardware performance counters. For details on benchmarks and their tuning spaces, see [3].

Table 1.

GPU devices used to obtain our data. Modified from Filipovič et al. [1].

Device Architecture Released Abbreviation
GeForce GTX 680 Kepler 2012 680
GeForce GTX 750 Maxwell 2014 750
GeForce GTX 1070 Pascal 2016 1070
GeForce RTX 2080 Turing 2018 2080

Table 2.

A list of the benchmarks used to obtain our data.

Benchmark Description Abbreviation
Convolution 2D convolution kernel using a 7×7 filter, adopted from Nugteren and Codreanu [4]. conv
Coulomb 3D Direct Coulomb summation on a 3D lattice, introduced in [5]. coulomb
GEMM Matrix-matrix multiplication adopted from Nugteren and Codreanu [4], tuning space reduced as in [6]. gemm-reduced
Transpose Out-of-place matrix transposition, adopted from NVIDIA CUDA SDK 10.0. mtran
N-body N-body kernel, adopted from NVIDIA CUDA SDK 10.0. nbody

The raw data are available at https://doi.org/10.17632/nn53dskr7z.1 in directory ‘raw-autotuning-data’. They are stored as CSV files with naming convention containing the abbreviation of GPU, the abbreviation of benchmark, and suffix _output.csv. For example, data obtained on GeForce GTX 1070 and N-body benchmark are stored in 1070-nbody_output.csv. There are special cases for GEMM benchmark, where we obtained data on small and highly-rectangular matrices. Those benchmarks are abbreviated as 1070-gemm-128-128-128 (multiplication of 128×128 matrices), 1070-gemm-16-4096-4096 (multiplication of matrix 16×4096 with matrix 4096×16), 1070-gemm-4096-16-4096 (multiplication of matrix 4096×16 with matrix 4096×4096) and 1070-gemm-4096-4096-16 (multiplication of matrix 4096×4096 with matrix 4096×16). However, those benchmarks are measured for GeForce GTX 1070 only.

The CSV files produced by Kernel Tuning Toolkit are formatted as follows:

  • the first line is the header containing names of columns;

  • each other line contains the profile of one tuning configuration (a combination of tuning parameters, which produces unique CUDA kernel source code and execution setting);

  • if some configuration cannot be executed on a given GPU (e.g., because of insufficient hardware resources), it is not included in CSV (therefore, the same benchmarks can produce CSV files with a different number of lines when executed on different GPUs).

Each line of the CSV file contains the following types of columns:

  • Kernel name: the name of the benchmarked kernel (the same for one type of benchmark);

  • Computation duration (µs): the duration of the benchmarked kernel and the unit the time is measured in;

  • Global size and Local size: The global and local size of the executed kernel (number of threads and block size in CUDA terminology). The size is counted as a scalar number; it reflects an overall number of threads with no respect to the grid or block dimensionality;

  • Tuning parameters: the benchmark-specific tuning parameters, named in capitals by our convention (e.g., VECTOR_TYPE or CR);

  • Hardware performance counters: performance counters measured on particular GPU (e.g., dram_utilization or inst_fp_32).
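As a concrete illustration, the column groups above can be separated programmatically. The sketch below uses a tiny made-up sample in the described layout (real files such as 1070-nbody_output.csv contain many more columns and rows); the rule that tuning-parameter columns are named in capitals follows the convention stated above.

```python
import csv
import io

# Made-up two-row sample following the described CSV layout; real raw
# autotuning files contain many more columns (all counters) and rows.
sample = """Kernel name,Computation duration (µs),Global size,Local size,VECTOR_TYPE,dram_utilization
nbody_kernel,1523.4,16384,128,2,4
nbody_kernel,1789.1,16384,256,4,3
"""

reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

# By the naming convention above, tuning parameters are the all-capital columns.
tuning_params = [c for c in reader.fieldnames if c.isupper()]
print(tuning_params)

# The best configuration is the one with the shortest computation duration.
best = min(rows, key=lambda r: float(r["Computation duration (µs)"]))
print(best["Local size"])
```

The same approach scales to the full files, since the header names of the fixed columns (kernel name, duration, global/local size) are shared by all benchmarks.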

Please note that not all available hardware performance counters were measured, due to the time required to measure the complete tuning space. The set of performance counters differs from GPU to GPU because different architectures implement different performance counters. The biggest change is with GeForce RTX 2080, where the performance counters were completely redesigned and renamed. The performance counters measured in our experiments are listed in Table 4.

Table 4.

List of predicted performance counters. For counters implemented for the Volta generation and newer, the conversion ratio (if any) is written next to the counter. Modified from Filipovič et al. [1].

Counter (prior Volta) Counter (Volta and newer)
dram_read_transactions dram__sectors_read.sum
dram_write_transactions dram__sectors_write.sum
l2_read_transactions lts__t_sectors_op_read.sum
l2_write_transactions lts__t_sectors_op_write.sum
tex_cache_transactions l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum
local_memory_overhead l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum
shared_load_transactions l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum
shared_store_transactions l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum
inst_fp_32 smsp__sass_thread_inst_executed_op_fp32_pred_on.sum
inst_fp_64 smsp__sass_thread_inst_executed_op_fp64_pred_on.sum
inst_integer smsp__sass_thread_inst_executed_op_integer_pred_on.sum
inst_misc smsp__sass_thread_inst_executed_op_misc_pred_on.sum
inst_compute_ld_st smsp__sass_thread_inst_executed_op_memory_pred_on.sum
inst_control smsp__sass_thread_inst_executed_op_control_pred_on.sum
inst_bit_convert smsp__sass_thread_inst_executed_op_conversion_pred_on.sum
inst_executed smsp__inst_executed.sum
issue_slot_utilization smsp__issue_active.avg.pct_of_peak_sustained_active
dram_utilization dram__throughput.avg.pct_of_peak_sustained_elapsed :10
l2_utilization lts__t_sectors.avg.pct_of_peak_sustained_elapsed
tex_utilization l1tex__t_requests_pipe_lsu_mem_global_op_ld.avg.pct_of_peak_sustained_active :10
shared_utilization l1tex__data_pipe_lsu_wavefronts_mem_shared.avg.pct_of_peak_sustained_elapsed :10
sm_efficiency smsp__cycles_active.avg.pct_of_peak_sustained_elapsed
warp_execution_efficiency smsp__thread_inst_executed_per_inst_executed.ratio ·100:32
warp_nonpred_execution_efficiency smsp__thread_inst_executed_per_inst_executed.pct
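For scripts that must process raw data from both pre-Volta and Volta-and-newer GPUs, Table 4 can be turned into a lookup table. The sketch below covers only a few counters, the direction of the conversion (rescaling newer counter values toward the pre-Volta scale) is our assumption, and the helper function is our own illustration, not part of KTT.

```python
# Excerpt of Table 4 as a mapping: old counter name -> (new counter name,
# conversion ratio applied to the new value). Ratio 1.0 means same units.
COUNTER_MAP = {
    "dram_read_transactions": ("dram__sectors_read.sum", 1.0),
    "l2_read_transactions": ("lts__t_sectors_op_read.sum", 1.0),
    "dram_utilization":
        ("dram__throughput.avg.pct_of_peak_sustained_elapsed", 1 / 10),
    "warp_execution_efficiency":
        ("smsp__thread_inst_executed_per_inst_executed.ratio", 100 / 32),
}

def unify(counter_name, value):
    """Map a Volta+ counter back to its pre-Volta name and scale (assumed direction)."""
    for old, (new, ratio) in COUNTER_MAP.items():
        if counter_name == new:
            return old, value * ratio
    return counter_name, value  # unknown or already pre-Volta: pass through

print(unify("dram__throughput.avg.pct_of_peak_sustained_elapsed", 50.0))
```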

The input size for each kernel was selected so that the kernel execution took approximately 1–10 milliseconds. These sizes differ for each benchmark and GPU; Table 3 summarizes them.

Table 3.

Input sizes used for gathering raw tuning data. For the conv, gemm-reduced and mtran benchmarks, the shown number determines the size of the input matrix/matrices in both dimensions. For the nbody benchmark, the shown number determines the number of simulated bodies. Finally, for the coulomb benchmark, the first number is the size of the 3D grid (the same in all dimensions), whereas the second number determines the number of atoms.

680 750 1070 2080
conv 2048 4096 4096 4096
coulomb 256, 64 256, 64 256, 64 256, 64
gemm-reduced 1024 1024 1024 2048
mtran 8192 8192 8192 8192
nbody 16384 16384 16384 16384

1.2. Prediction models for performance counters

We provide pre-computed prediction models for performance counters, which are predicted from the values of tuning parameters. All models predict the global size, the local size, and the performance counters relevant to the GPU used to obtain the training raw autotuning data. The performance counters predicted by the least-squares nonlinear models and the decision trees are listed in Table 4.

1.3. Least-squares nonlinear models

Nonlinear prediction models were produced with the script create_least_squares_models.R bundled with KTT in ’profile-searcher/scripts-prep/’. For each raw tuning data file (i.e., for each benchmark and each GPU, and for all input sizes of the gemm benchmark on GeForce GTX 1070), we ran the script, producing multiple models for each performance counter, each for a different combination of values of binary tuning parameters. Please see the section Experimental Design, Materials and Methods for details on how the models are generated; this also makes the format of the model files easier to understand. The least-squares nonlinear models are stored as CSV files, following a naming convention similar to the raw autotuning data: the abbreviation of the GPU, the abbreviation of the benchmark and the suffix -model_[number].csv. The special cases for the GEMM benchmark are named analogously to their raw tuning data files.

The CSV files produced by the script contain three sections. The first section includes a line for each tuning parameter, describing an expression for coding this parameter (the coded values of tuning parameters are used to predict the values of performance counters). The second section includes one line called Condition, describing the logical condition on the values of binary parameters this model was trained for. The third section includes a line for each performance counter, describing an expression for predicting the given performance counter’s value from the coded values of tuning parameters.

1.4. Decision tree

Decision-tree prediction models were produced with the script generate_decision_tree_model.py bundled with KTT in ‘profile-searcher/scripts-prep/’. The script takes raw tuning data as an input and creates a predictive model of performance counters. Please see the section Experimental Design, Materials and Methods for details on how the models are generated. The resulting decision tree is stored as a pickle file together with a CSV file containing a list of all performance counters predicted by the model.

1.5. Simulated tuning

The results of the simulated autotuning are stored as CSV files of the following format. The first line contains a header with the names of the columns. Each row presents an iteration of the searcher (i.e., the exploration of the next tuning configuration, requiring its profiling). The first column contains the iteration number. The second column contains the average runtime, with the standard deviation, of the best kernel known in this iteration when the random searcher is utilized. The third column contains the same for the profile-based searcher.
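A file in this format can be processed with a few lines of Python. The header names and the separate mean/standard-deviation columns in the sample below are illustrative assumptions; inspect a real output file for the exact layout.

```python
import csv
import io

# Made-up three-iteration sample of a simulated-tuning result file.
# Column names are assumptions; the real header names may differ.
sample = """iteration,random_mean,random_std,profile_mean,profile_std
1,5.2,1.1,4.8,0.9
2,4.9,0.8,3.1,0.5
3,4.7,0.7,2.6,0.3
"""
rows = list(csv.DictReader(io.StringIO(sample)))

# Speedup of the profile-based searcher over random search after the
# last recorded iteration.
last = rows[-1]
speedup = float(last["random_mean"]) / float(last["profile_mean"])
print(round(speedup, 2))
```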

1.6. Real-time Tuning

Results of real-time tuning experiments are stored as CSV files of the following format. The first line is a header containing names of the columns. Each row contains measurement in each second of the autotuning process. The missing row indicates that no new data was available in particular second of the tuning process (this may happen in the initial part of the tuning if the first profiled kernel runs a long time). The rows contain the following columns:

  • the time (in seconds) from the beginning of the autotuning;

  • runtime of the best kernel known in the corresponding time (averaged over all experimental runs);

  • standard deviation of the runtime;

  • minimal runtime;

  • maximal runtime.
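Such a log makes it straightforward to compute, for example, when the tuning first reached a target runtime. The column names in the sketch below are illustrative assumptions; note the deliberately missing row for one second, mirroring the format described above.

```python
import csv
import io

# Made-up real-time tuning log with the five described columns; the row
# for second 3 is intentionally missing (no new data in that second).
sample = """time_s,best_mean,best_std,best_min,best_max
1,9.0,2.0,6.5,12.0
2,5.5,1.5,4.0,8.0
4,3.2,0.4,2.9,3.8
"""
rows = list(csv.DictReader(io.StringIO(sample)))

def time_to_target(rows, target):
    """Return the first logged second at which the average best runtime
    dropped to the target, or None if the target was never reached."""
    for r in rows:
        if float(r["best_mean"]) <= target:
            return int(r["time_s"])
    return None

print(time_to_target(rows, 4.0))
```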

2. Experimental design, materials and methods

2.1. Obtaining raw tuning data via kernel tuning toolkit

The raw tuning data are obtained during an autotuning process performed by Kernel Tuning Toolkit with the GPU of interest. Note that we recommend using the version tagged v1.3-profile-searcher, which couples KTT 1.3 with the profile-based searcher and the benchmarks listed in Table 2, prepared for collecting raw data or executing tuning with the profile-based searcher. We can explore the tuning space of any benchmark bundled with KTT (in the ’examples’ folder) or of user-provided code. To obtain tuning data with hardware performance counters, KTT has to be built with profiling enabled (e.g., by calling premake5 --profiling=cupti gmake; see the KTT documentation for further details). Moreover, profiling has to be switched on in the autotuned application by calling the ktt::Tuner::setKernelProfiling() method (profiling in benchmarks bundled with KTT can be switched on by setting the USE_PROFILING macro to 1).

KTT can explore either the entire tuning space (the default behaviour) or only its subset. In the latter case, it is recommended to use the random searcher to randomize the observed subset (using the method ktt::Tuner::setSearcher()). After the search of the space is complete, the tuning data are stored in CSV files by the method ktt::Tuner::PrintResult(). All benchmarks bundled with KTT store the resulting CSV and can execute exhaustive exploration of the tuning space when the preprocessor macro EXHAUSTIVE_SEARCH is set to 1. For more information about KTT methods and the implementation of new autotuned codes, we refer to its documentation [2].

2.2. Generating prediction models from raw tuning data

We provide scripts to generate prediction models from raw tuning data. These scripts take tuning space of the problem with collected performance counters and train a model that predicts performance counters’ values when given a tuning configuration.

2.3. Generating least-square regression non-linear models

We provide two scripts in ‘profile-searcher/scripts-prep’ folder in the Github KTT repository.

The main script create_least_squares_models.R trains nonlinear models: it takes the tuning data, and for each performance counter, it generates a model that predicts its value based on the values of tuning parameters. To increase the accuracy of this prediction, we divide the tuning space into subspaces based on the values of binary tuning parameters, as we suspect these have a profound influence on the performance counters. Thus, we generate several models for each performance counter, each model applicable only to a given combination of values of binary tuning parameters. An example of its usage is shown in Listing 1. We recommend R v3.4.4; no special CRAN libraries are necessary.

Listing 1.

Fig. 1

Generating nonlinear models for the GEMM benchmark on GeForce GTX 1070.

It takes four arguments:

  • [input file name] e.g. 1070-gemm-reduced_output.csv, must follow the formatting of raw tuning data, as described above in section Data Description;

  • [prefix for output files names] e.g. 1070-gemm-reduced, this will be used to name output files with models, -model_[number].csv will be added;

  • [numbers of columns with tuning parameters in input file] in a format allowing individual columns and column intervals (’from:to’) to be set, e.g. 2,5:12 meaning columns 2 and 5 through 11; counting starts at 0;

  • [numbers of columns with performance counters in input file] in the same format.

After parsing the script arguments and reading the input file, we apply min-max scaling to the tuning parameters’ values, scaling them to the range of ⟨−1, 1⟩. This step is recommended in any regression model design, as models generally do not work well with the absolute values of the factors. Next, we select the values of tuning parameters that will determine the training data. In other words, we do not choose data points (rows from the input file) for training randomly. We select a few values of the non-binary tuning parameters and then include all available combinations in the training dataset. We need to moderate the number of values to prevent an exponential increase in training data size or a poor sampling of some part of the tuning space due to constraints.

The R model-fitting function takes two arguments: the formula and the training data. The formula includes the factors, i.e. the coded tuning parameters, and arithmetic operations with them. To make the models nonlinear, we include multiplications of factors (to capture their interactions) and quadratic terms. The training data include rows from the input file with the selected values of tuning parameters and the corresponding values of the given profiling counter.
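The same model form can be sketched in a few lines of Python: tuning parameters min-max coded to [−1, 1], interaction and quadratic terms added to the design matrix, and coefficients fitted by ordinary least squares. The actual script uses R's own fitting function; numpy and the synthetic data below are just for illustration.

```python
import numpy as np

# Synthetic "tuning data": 200 configurations of two tuning parameters.
rng = np.random.default_rng(0)
raw = rng.uniform([1, 32], [8, 512], size=(200, 2))

# Min-max coding of each parameter to [-1, 1], as described above.
lo, hi = raw.min(axis=0), raw.max(axis=0)
coded = 2 * (raw - lo) / (hi - lo) - 1

x1, x2 = coded[:, 0], coded[:, 1]
# Design matrix: intercept, linear terms, interaction term, quadratic terms.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

# Synthetic performance counter generated from known coefficients plus noise.
counter = 3 + 2 * x1 - x2 + 0.5 * x1 * x2 + 0.1 * rng.standard_normal(200)

beta, *_ = np.linalg.lstsq(X, counter, rcond=None)
print(np.round(beta[:4], 1))  # recovers roughly [3., 2., -1., 0.5]
```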

The output of the script are multiple files named [output_name]-model_[number].csv. The number of models corresponds to the number of combinations of values of binary parameters. If a model cannot be created for a specific combination of binary parameters (e.g. there are no data due to constraints), the closest model (i.e., the minimal number of values of binary tuning parameters differs) fills in and is printed in the output file. The format and contents of model files are described in the above section Data Description.

The script generate_least_squares_models.py makes it quick and easy to generate models for all our raw tuning data. It requires python3 with the docopt library installed. It takes one argument, --benchmark. The option --benchmark GPU generates models for the benchmarks conv, coulomb, gemm-reduced, mtran and nbody for all GPUs. The option --benchmark GEMM generates models for different input sizes of the benchmark gemm-reduced on GeForce GTX 1070. Users may need to modify the script to accommodate the names of the folders with data.

2.4. Generating decision trees

The script generate_decision_tree_model.py for decision-tree model preparation is stored in ‘profile-searcher/scripts-prep’ folder in a KTT distribution. While generating the model, users have to supply the CSV file containing explored tuning space, columns containing tuning parameters and profiling counters with parallelism configuration (“Global size” and “Local size” columns) in the same format as with least-square regression nonlinear models, see Listing 2.

Listing 2.

Fig. 2

How to call the script for generating a decision-tree model.

The script builds models predicting performance counters using optimized decision trees. Predicting performance counters with decision trees is computationally more efficient than with the least-squares models; therefore, we recommend decision trees as the default choice. Decision trees have high precision in densely sampled tuning spaces, but they are poor at extrapolation. Therefore, if only a smaller part of the tuning space is sampled, we recommend testing whether the least-squares method would bring better precision and faster tuning convergence.
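The essence of this approach can be sketched with scikit-learn's DecisionTreeRegressor on synthetic data. The actual script, generate_decision_tree_model.py, trains on the raw tuning CSV; the toy setup below is our own illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic tuning space: 500 configurations of three integer tuning
# parameters, with a counter driven by only two of them.
rng = np.random.default_rng(1)
params = rng.integers(0, 4, size=(500, 3))
counter = params[:, 0] * 100 + params[:, 1] * 10

# Fit a depth-limited regression tree mapping tuning parameters to the counter.
tree = DecisionTreeRegressor(max_depth=8).fit(params, counter)

# Predict the counter for an unseen-but-in-space configuration.
pred = tree.predict([[2, 3, 0]])
print(pred[0])  # close to 230 inside the densely sampled space
```

Because the synthetic space is densely sampled, the tree interpolates well; as noted above, this precision does not carry over to extrapolation outside the sampled region.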

The resulting decision tree is stored in a file with the suffix DT.sav, and the list of hardware performance counters predicted by the model has the suffix DT.sav.pc. For example, the model for the file 1070-gemm-reduced_output.csv is stored in 1070-gemm-reduced_output_DT.sav.

2.5. Execution of simulated tuning

The simulated tuning script simulated-profiling-searcher.py performs a search of the autotuning space on a pre-computed tuning space. It requires the auxiliary files base.py and mlKTTPredictor.py distributed with KTT. It also requires python3, with the libraries docopt, numpy, pandas, pickle and sklearn installed.

Instead of actually executing and profiling the autotuned kernels (to obtain their runtime and hardware performance counters), it reads the stored raw autotuning data (i.e., it merely simulates their execution and profiling) and performs a search on them. The advantage of this approach is that it runs much faster than real tuning (as no compilation, execution or profiling is performed); therefore, the simulated tuning experiments can be performed many times to get statistically relevant data. A simulated run also requires neither an installation of KTT nor the GPU for which autotuning is simulated. The convergence of the search method is measured and can be compared in terms of the number of search steps (equal to the number of kernel executions performed by KTT in a real environment).

Listing 3 shows three examples of running the script. We want to tune gemm-reduced benchmark. The models for predicting performance counters were trained on GeForce GTX 750, and we want to use them to guide the search on GeForce GTX 1070. These three commands differ in the model used. The first one does not predict anything, only reads the performance counters’ values from the provided raw tuning data file. The second one uses a decision tree, and the third one uses least-squares nonlinear models.

Listing 3.

Fig. 3

Example of simulated-profiling-searcher.py.

The script takes multiple arguments:

  • -o [raw tuning data file] the raw tuning data file following the format described in the section Data Description

  • --oc [compute capability of the GPU used to produce raw tuning data] e.g. 6.1, if raw tuning data came from GeForce GTX 1070¹

  • --mp [number of multiprocessors on that GPU] e.g. 15, if raw tuning data came from GeForce GTX 1070

  • --co [number of CUDA cores on that GPU] e.g. 1920, if raw tuning data came from GeForce GTX 1070

  • one of the following
    • --cm [raw tuning data file] with this option, no prediction of values of performance counters is computed; their actual values are read from the given raw tuning data file
    • --dt [decision tree model file] with this option, the decision tree model is employed to predict values of performance counters
    • --ls [prefix for least squares model files] with this option, least squares nonlinear models are employed to predict values of performance counters
  • --ic [compute capability of the GPU of the training data for the model] e.g. 5.0, if the model was trained with data from GeForce GTX 750

  • -p [column with computation time] always 1 in the provided raw tuning data

  • -t [columns with TP] 4:19 for the gemm-reduced benchmark

  • -c [columns with PC] 2,3,19:62 for the gemm-reduced benchmark on GeForce GTX 1070

  • -e [number of experiments] sets how many times the experiment is repeated to get more stable results in case of randomized searchers

  • -i [number of iterations] sets how many tuning iterations (i.e., search steps) are performed per experiment

  • --compute-bound or --memory-bound e.g., --compute-bound, as that is the character of the gemm-reduced problem

For details on the algorithm of profile-based search, please see [1].

The script autobench.py makes it easy to run simulated tuning for all our raw tuning data. It takes two arguments, --benchmark and --method. The option --benchmark GPU runs simulated tuning for the benchmarks conv, coulomb, gemm-reduced, mtran and nbody for all GPUs. The option --benchmark GEMM runs simulated tuning for different input sizes of the benchmark gemm-reduced on GeForce GTX 1070. The option --method has three possible arguments, Exact, DecisionTree or LeastSquares, denoting the model used for predicting the values of performance counters. Users may need to modify the script to accommodate the names of the folders with data. Moreover, the script can be used as a source of information on possible values of several command-line arguments of simulated-profiling-searcher.py, such as the compute capabilities of different GPUs, the number of their multiprocessors and CUDA cores, and the indexes of columns for computation time, tuning parameters and performance counters in the raw tuning data we provide.

We used the simulated tuning to analyze the convergence speed of the profile-based searcher proposed in [1] and of the random search. During the analysis, the autotuning is performed in the defined number of iterations. In each iteration of the searching process, the runtime of the best kernel found is logged. The autotuning is performed multiple times, so we obtain an average speed of the best kernel for each iteration over multiple autotuning executions.

Other search methods or modifications based on our profile-based searcher might be easily added to scripts and compared consistently.
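The evaluation procedure described above can be sketched as follows, with random search over a synthetic tuning space standing in for KTT; the best-so-far runtime is logged per iteration and averaged over repeated experiments.

```python
import numpy as np

# Synthetic "tuning space": runtimes of 1000 configurations.
rng = np.random.default_rng(2)
space = rng.uniform(1.0, 10.0, size=1000)

n_experiments, n_iterations = 20, 50
best_per_iter = np.empty((n_experiments, n_iterations))
for e in range(n_experiments):
    # One experiment: random search visits n_iterations distinct configurations.
    picks = rng.choice(space, size=n_iterations, replace=False)
    # Running minimum = runtime of the best kernel known at each iteration.
    best_per_iter[e] = np.minimum.accumulate(picks)

# Average convergence curve over all experiments; it can never worsen.
mean_curve = best_per_iter.mean(axis=0)
assert (np.diff(mean_curve) <= 0).all()
print(round(mean_curve[-1], 2))
```

A new searcher can be compared against this baseline simply by replacing the random `picks` with its own sequence of visited configurations.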

2.6. Execution of real-time tuning

Real-time tuning performs a search of the autotuning space without a pre-computed tuning space. The autotuned kernels are actually compiled, executed and profiled during the search. This is far more demanding than the simulated tuning described above, but it makes it possible to measure the actual time per tuning search iteration. The convergence of the search method is measured and can be compared in terms of tuning time.

Real-time tuning can be executed by running a compiled benchmark. For benchmarks bundled with KTT, the preparation includes the following steps:

  • KTT needs to be compiled with profiling, i.e. premake5 --profiling=cupti gmake needs to be run before building it. In case of older GPU architectures, use --profiling=cupti-legacy instead.

  • In the code of the benchmark (in cpp file), a preprocessor macro EXHAUSTIVE_SEARCH has to be set to 0.

  • The random search is used by default. To test profile-based searcher [1], the macro USE_PROFILE_SEARCHER has to be set to 1 in the code of the benchmark (in cpp file).

  • The time for autotuning is restricted to a certain value set by the macro TUNE_SEC. This time can be altered; for example, setting TUNE_SEC to 60 performs autotuning for 60 seconds.

  • The prediction models needed for profile-based searcher have to be in ‘KTT/profile-searcher/models’ folder.

Listing 4 shows examples of running a single real-time tuning for each benchmark and saving the log file.

Listing 4.

Fig. 4

Running a single execution of real-time tuning on every benchmark.

The benchmark executable has two arguments; however, both of them have default values:

  • [platform index] default value 0, cannot be changed when using CUDA

  • [device index] default value 0, users may need to change this if multiple GPUs are available

The input size can be modified in the source code of the given benchmark, in the main function. Again, reasonable default values are set. In our evaluation in [1], we set the input size of benchmarks according to Table 3.

Proper evaluation of a searcher’s convergence in terms of tuning time requires multiple runs of real-time tuning. This can be easily done by executing the benchmark multiple times and generating a log file for each run. For easier processing of the data in multiple log files, we provide the script histogram.py. See an example of its usage in Listing 5.

Listing 5.

Fig. 5

Processing multiple runs of real-time tuning.

It takes two arguments:

  • -s [folder with log files from real-time tuning runs];

  • -t [time] in seconds that denotes the maximum running time that is analyzed.

The data are collected from all available executions of the autotuning (all log files in folder passed by argument -s).

CRediT authorship contribution statement

Jana Hozzová: Methodology, Software, Validation, Investigation, Data curation, Writing – original draft, Writing – review & editing. Jiří Filipovič: Conceptualization, Methodology, Software, Validation, Investigation, Data curation, Writing – review & editing, Visualization, Supervision, Funding acquisition. Amin Nezarat: Methodology, Software, Investigation. Jaroslav Ol’ha: Software. Filip Petrovič: Methodology, Software.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.

Acknowledgments

The work was supported from European Regional Development Fund-Project “CERIT Scientific Cloud” (No. CZ.02.1.01/0.0/0.0/16_013/0001802). Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

Footnotes

1

Note that the values of several arguments, such as the compute capability of GPUs, the number of multiprocessors or CUDA cores, and the column indexes for computation time, tuning parameters and profiling counters, are available in the script autobench.py for our raw tuning data. This script is described later in this section.

Contributor Information

Jana Hozzová, Email: hozzova@mail.muni.cz.

Jiří Filipovič, Email: fila@mail.muni.cz.

Amin Nezarat, Email: aminnezarat@mail.muni.cz.

Jaroslav Ol’ha, Email: 348646@mail.muni.cz.

Filip Petrovič, Email: fillo@mail.muni.cz.

References

  • 1. Filipovič J., Hozzová J., Nezarat A., Ol’ha J., Petrovič F. Using hardware performance counters to speed up autotuning convergence on GPUs. J. Parallel Distrib. Comput. 2022;160:16–35. doi: 10.1016/j.jpdc.2021.10.003.
  • 2. Petrovič F., Filipovič J., Střelák D., Hozzová J., Trembecký R. Kernel tuning toolkit, 2021. https://github.com/HiPerCoRe/KTT/releases/tag/v1.3-profile-searcher. doi: 10.5281/zenodo.5675994.
  • 3. Petrovič F., Střelák D., Hozzová J., Ol’ha J., Trembecký R., Benkner S., Filipovič J. A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with kernel tuning toolkit. Future Gener. Comput. Syst. 2020;108:161–177. doi: 10.1016/j.future.2020.02.069.
  • 4. Nugteren C., Codreanu V. CLTune: a generic auto-tuner for OpenCL kernels. In: Proceedings of the IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2015.
  • 5. Filipovič J., Petrovič F., Benkner S. Autotuning of OpenCL kernels with global optimizations. In: Proceedings of the 1st Workshop on AutotuniNg and aDaptivity AppRoaches for Energy Efficient HPC Systems (ANDARE ’17), 2017.
  • 6. Nugteren C. CLBlast: a tuned OpenCL BLAS library. In: Proceedings of the International Workshop on OpenCL, IWOCL ’18, ACM, 2018, pp. 5:1–5:10.
