TABLE OF CONTENTS
- Welcome to hpcscan
- Overview
- Main features
- Environment set-up
- Compilation
- Validation
- Execution
- Performance benchmarks
- Customization
- Have fun!
Version 1.2
Contact: Vincent Etienne / Email: vetienne@rocketmail.com
Contributors (chronological order)
- Vincent Etienne (NEC)
- Suha Kayum (Saudi Aramco)
- Marcin Rogowski (King Abdullah University of Science and Technology)
- Laurent Gatineau (NEC)
- Philippe Thierry (Intel)
- Fabrice Dupros (ARM)
- Hugo Barreiro (University of Reims Champagne-Ardenne)
hpcscan is a tool for benchmarking, on various architectures/systems, algorithms/kernels that are found in many scientific applications.
It features several categories of test cases aiming to measure memory, computation and communication bandwidths along with electric energy consumption.
- Written in C++
- Simple code structure based on individual test cases
- Easy to add new test cases
- Hybrid OpenMP/MPI parallelism
- Supports scalar and vector CPUs, GPUs and other accelerators (depending on compiler/architecture)
- All configuration parameters on command line
- Supports single and double precision computation
- Compilation with standard Makefile
- No external libraries
- Follows the Google C++ style guide
- All test cases are validated with embedded reference solutions
Several benchmarks are commonly used in the HPC community. To cite a few, the STREAM benchmark, the HPL benchmark and the OSU Micro-Benchmarks measure the memory bandwidth, the computation bandwidth and the interconnect bandwidth, respectively. In general, these benchmarks target specific characteristics of HPC systems. However, it is not straightforward to translate these characteristics into the context of a given scientific application.
This is why HPC vendors often present throughputs obtained with open-source scientific codes such as OpenFOAM (Computational Fluid Dynamics) or SPECFEM3D (Seismology). While these results are important to assess the performance of a given architecture on concrete problems, it is again not straightforward to transpose the conclusions to other applications. Moreover, every application is built on technical choices that may hinder performance on one system compared to another. How can these technical biases be overcome?
hpcscan has been designed to address these issues 😃
☑️ Lightweight and portable tool that can be easily deployed on a wide range of architectures including CPUs, GPUs and accelerators (see Validated hardware, operating systems and compilers).
☑️ Bridge between HPC architectures and numerical analysis/computational sciences. Beyond getting accurate performance measurements, hpcscan allows one to explore the behavior of numerical kernels and to seek the optimal configuration on a given architecture. An example is shown below where several key parameters of an algorithm (a wave propagation kernel) are explored to find the optimum (in terms of computation speed vs. accuracy) on the supercomputer Shaheen II at KAUST. See Performance benchmarks for details on this test case as well as scripts to perform the analysis.
Top left: L1 Error between the computed (wavefield) and analytical solutions versus N, the number of grid points along one direction (grid size is NxNxN). Blue: Finite-Difference with 4th order stencil, Pink: 8th order and Red: 12th order. Squares are obtained with the standard propagator implementation while crosses are obtained when the Laplacian operator is computed separately.
Top right: L1 Error between the computed and analytical solutions versus the computation time. The black star points to the configuration with an error below 1% and shortest computation time (i.e. the optimal configuration relative to the target error).
Bottom left: Propagator bandwidth in GPoint/s versus N.
Bottom right: Propagator bandwidth in GByte/s versus N.
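For reference, the L1 error plotted here is presumably a normalized sum of absolute differences between the computed wavefield u and the analytical solution over all grid points, of the form

$$\mathrm{Err}_{L1} = \frac{\sum_i \left| u_i - u_i^{\mathrm{ref}} \right|}{\sum_i \left| u_i^{\mathrm{ref}} \right|}$$

(the exact normalization used by hpcscan may differ; see the Propa test case sources for the precise definition).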
☑️ Set of representative kernels used in many scientific applications (see List of test cases). Without being too specific, the embedded kernels provide a way to capture the main traits of HPC architectures and identify their bottlenecks and strengths. With this knowledge, one can redesign or update specific parts of an application accordingly, to take full advantage of the target hardware.
☑️ Set of robust protocols to compare architectures. As suggested in the example above, the optimal configuration to solve a given problem might change from one architecture to another. hpcscan provides a solid framework to compare performance across different systems, where one can analyse results from different perspectives and achieve 'apples to apples' comparisons.
☑️ Customizable to fit a specific hardware (see Customization).
☑️ Multi-purpose initiative with benefits at several levels: from computer science students eager to learn, to seasoned numerical analysts willing to share their findings, to software engineers reusing kernels of interest to upgrade their applications.
☑️ Ongoing effort aiming to collect contributions in order to cover the current range of HPC systems. More options and kernels will be added over time.
⛔ One-number benchmark to rank HPC systems. However, hpcscan provides a way to perform a complete 'scanning' of architectures and possibly focus on one characteristic.
⛔ Confidential project. Everyone is invited to share results, feedback and, more importantly, contributions for the benefit of the entire HPC community.
hpcscan is a self-contained package that can be easily installed and executed on your system. Just follow these steps:
- Step 1: create the environment script for your system
- Step 2: build the executable
- Step 3: validate the executable
- Step 4: run the performance benchmarks
Version | Description | Release date |
---|---|---|
v1.0 | Initial version with CPU and Vector Engine support | Nov 28, 2020 |
v1.1 | GPU support | May 22, 2021 |
v1.2 | Energy consumption | Coming soon |
- bin: this directory is created during compilation and contains the hpcscan executable
- build: hpcscan can be compiled from here
- env: scripts to initialize the hpcscan environment
- misc: output samples and studies
- script: scripts for validation and performance benchmarks
- src: all hpcscan source files
Test case name | Description | Remark |
---|---|---|
Comm | MPI communications bandwidth | This case requires at least 2 MPI processes |
FD_D2 | Finite-difference (second derivatives in space) computations bandwidth | Available FD stencil orders: 2, 4, 6, 8, 10, 12, 14 and 16 |
Grid | Grid operations bandwidth | Operations on grids include manipulation of multi-dimensional indexes and specific portions of the grids (for instance, excluding halos) |
Memory | Memory operations bandwidth | In contrast to test case Grid, operations are done on contiguous memory arrays |
Modeling | Acoustic wave modeling bandwidth (same features as test case Propa except ...) | There is no accuracy checking for this test case |
Propa | Acoustic wave propagator bandwidth | Accuracy is checked against the multi-dimensional analytical solution (Eigen modes) of the wave equation |
Template | Test case template | Used to create new test cases |
Util | Utility tests to check internal functions | Reserved for developers |
All available test modes are listed below. Activation of each test mode depends on the compilers defined in the hpcscan environment script, see Environment script (mandatory).
Test mode name | Target hardware | Description | Remark |
---|---|---|---|
Baseline | Generic CPU | Standard implementation without optimization | ➡️ This mode is the reference implementation. Default test mode, always enabled |
CacheBlk | Generic CPU | Optimized with cache blocking techniques | Always enabled |
CUDA | NVIDIA GPU | Regular CUDA implementation without optimization | Enabled when compiled with nvcc (NVIDIA CUDA compiler) |
CUDA_Opt | NVIDIA GPU | Optimized CUDA implementation | Enabled when compiled with nvcc (NVIDIA CUDA compiler) |
CUDA_Ref | NVIDIA GPU | Reference CUDA implementation (for developers) | Enabled when compiled with nvcc (NVIDIA CUDA compiler) |
DPC++ | Intel CPU/GPU/FPGA | Regular DPC++ implementation without optimization | Enabled when compiled with dpcpp (Intel OneAPI DPC++ compiler) |
HIP | AMD GPU | Regular HIP implementation without optimization | Enabled when compiled with hipcc (AMD HIP compiler) |
HIP_Opt | AMD GPU | Optimized HIP implementation | Enabled when compiled with hipcc (AMD HIP compiler) |
NEC | NEC SX-Aurora | With NEC compiler directives | Enabled when compiled with nc++ (NEC C++ compiler) |
NEC_SCA | NEC SX-Aurora | With NEC Library Stencil Code Accelerator | Enabled when compiled with nc++ (NEC C++ compiler) |
OpenAcc | NVIDIA GPU | Regular OpenACC implementation without optimization | Enabled when compiled with a C++ compiler that supports OpenACC (not yet operational) |
- Linux operating system
- C++ compiler with OpenMP support
- MPI library
- Python and MATLAB to plot figures
- NVIDIA CUDA compiler
- Intel DPC++ compiler
- AMD HIP compiler
- NEC C++ compiler
- C++ compiler with OpenACC support
In order to compile and run hpcscan, you need to source one of the files in the directory ./env
cd ./env
Example to set up the environment for hpcscan with GCC and CUDA compilers:
source ./setEnvNeptuneGccCuda.sh
🔔 For a new system, you need to create a script for your system (use one of the existing files as a template)
Go to ./build and use the command make
The executable can be found in ./bin/hpcscan
🔔 If hpcscan environment has not been set (see Environment script (mandatory)), compilation will abort.
By default, hpcscan is compiled in single precision
To compile in double precision: make precision=double
To check the test modes that are enabled in your hpcscan binary, use the command
./bin/hpcscan -v
To check that hpcscan has been built correctly and works fine, go to ./script and launch
sh runValidationTests.sh
This script runs a set of light test cases and should complete within a few minutes (even on a laptop).
You should get in the output report (displayed on the terminal):
- All tests marked as PASSED (661 tests passed for each test mode enabled)
- No test marked as FAILED
Check the summary at the end of the report for a quick overview.
🔔 These tests are intended for validation purposes only; they are not suitable for performance measurements.
hpcscan has been successfully tested on the hardware, operating systems and compilers listed below.
Operating system | Compiler | MPI | Host | Device | Test modes |
---|---|---|---|---|---|
Ubuntu 22.04.1 LTS | g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0 | mpirun (Open MPI) 4.1.2 | Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (Intel Kaby Lake) | - | Baseline, CacheBlk |
Ubuntu 22.04.1 LTS | Intel icpc (ICC) 2021.7.0 20220726 | Intel MPI Version 2021.7 | Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (Intel Kaby Lake) | - | Baseline, CacheBlk |
Red Hat 4.8.5-39 | Intel oneAPI DPC++/C++ Compiler 2022.1.0 | Intel MPI Version 2021.6 | Intel(R) Xeon(R) Gold 6240L CPU @ 2.60GHz (Intel Cascade Lake) | - | Baseline, CacheBlk |
Red Hat 4.8.5-39 | | Intel MPI Version 2021.6 | Intel(R) Xeon(R) Gold 6240L CPU @ 2.60GHz (Intel Cascade Lake) | Tesla V100S-PCI (NVIDIA GPU) | Baseline, CacheBlk, Cuda, Cuda_Opt, Cuda_Ref |
Red Hat 8.5.0-10 | NEC nc++ (NCC) 4.0.0 | NEC MPI 3.1.0 | Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz (Intel Skylake) | NEC SX-Aurora TSUBASA 20B-P (NEC Vector Engine) | Baseline, CacheBlk, NEC, NEC_SCA |
Red Hat 8.5.0-10 | Intel oneAPI DPC++/C++ Compiler 2022.1.0 | Intel MPI Version 2021.6 | Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz (Intel Skylake) | - | Baseline, CacheBlk |
SUSE Linux Enterprise Server 15 | Intel icpc (ICC) 19.0.5.281 20190815 | - | Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz (Intel Haswell) | - | - |
Red Hat 4.8.5-39 | Intel icpc version 19.1.2.254 | - | Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (Intel Cascade Lake) | - | - |
Ubuntu 20.04.1 LTS | - | Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (Intel Ice Lake) | GP108M [GeForce MX330] (NVIDIA GPU) | - | |
CentOS Linux release 7.7.1908 | - | Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz (Intel Skylake) | GV100GL [Tesla V100 SXM2 32GB] (NVIDIA GPU) | - | |
Ubuntu 20.04.1 LTS | Intel(R) oneAPI DPC++ Compiler 2021.2.0 | - | Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (Intel Ice Lake) | - | - |
Ubuntu 20.04.1 LTS | - | AMD EPYC 7742 64-Core Processor @ 2.25GHz (AMD Rome) | [AMD Instinct MI100] (AMD GPU) | - |
hpcscan can be launched from a terminal with all configuration parameters within a single line.
To get help on the parameters
./bin/hpcscan -h
Execution with a single MPI process
mpirun -n 1 ./bin/hpcscan -testCase <TESTCASE> -testMode <TESTMODE>
where
- TESTCASE is the name of the test case (see List of test cases)
- TESTMODE is the name of the test mode (see List of test modes)
Example
mpirun -n 1 ./bin/hpcscan -testCase Propa -testMode CacheBlk
🔔 If you do not specify -testMode <TESTMODE>, the Baseline mode is assumed.
Example
mpirun -n 1 ./bin/hpcscan -testCase Propa
Execution with multiple MPI processes
mpirun -n <N> ./bin/hpcscan -testCase <TESTCASE> -testMode <TESTMODE> -nsub1 <NSUB1> -nsub2 <NSUB2> -nsub3 <NSUB3>
🔔 When several MPI processes are used, subdomain decomposition is activated. The product NSUB1 x NSUB2 x NSUB3 must be equal to N (the number of MPI processes). You may omit the number of subdomains along an axis if that number is 1.
Example
mpirun -n 2 ./bin/hpcscan -testCase Comm -nsub1 2
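Example with 8 MPI processes and a 3D decomposition (2 x 2 x 2 = 8 subdomains)
mpirun -n 8 ./bin/hpcscan -testCase Propa -nsub1 2 -nsub2 2 -nsub3 2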
Configuration of the grid size and dimension
Simply add on the command line
-n1 <N1> -n2 <N2> -n3 <N3> -dim <DIM>
where
- N1, N2, N3 are the numbers of grid points along axes 1, 2 and 3
- DIM = 1, 2 or 3 (1D, 2D or 3D grids)
Example
mpirun -n 1 ../bin/hpcscan -testCase Grid -dim 2 -n1 200 -n2 300
🔔 If you do not specify -dim <DIM>, a 3D grid is assumed.
Input
hpcscan does not require any input file. All data are built internally.
Output on the terminal
During execution, information regarding result validation and performance is sent to the terminal output.
Output performance log file
For every test case, an ASCII file containing all measurements in a compact form is created. It can be used to plot results with dedicated tools. The name of the log file is as follows:
hpcscan.perf.<TESTCASE>.log
If hpcscan is launched several times, results are appended to the log file. This is convenient, for instance, when you want to analyse the effect of a parameter and plot the series of results in a graph.
Output grids
By default, the grids manipulated by hpcscan are not written to disk.
To output the grids, use the option -writeGrid.
When activated, each grid used in a test will generate 2 files:
- An ASCII file with the grid dimensions (file name <GRIDNAME>.proc<ID>.grid.info)
- A binary file with the grid data (file name <GRIDNAME>.proc<ID>.grid.bin)
where ID is the MPI rank.
Example (this is the command that was used to produce the hpcscan logo on top of this page)
mpirun -n 1 ../../bin/hpcscan -testCase Propa -writeGrid \
-tmax 0.2 -snapDt 0.1 \
-dim 2 -n1 200 -n2 600 \
-param1 4 -param2 8
It outputs the following files: PropaEigenModeRef.proc0.grid.info, PropaEigenModeRef.proc0.grid.bin, PropaEigenModePrn.proc0.grid.info and PropaEigenModePrn.proc0.grid.bin
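If you want to post-process these grids with your own tools, a minimal reader can be written in a few lines. The sketch below is only an illustration under assumptions not stated in this document: it assumes the .info file starts with the grid dimensions as whitespace-separated integers, and that the .bin file is a raw dump of n1 x n2 x n3 single-precision values (double precision if hpcscan was built with make precision=double). Check the files produced on your system before relying on it.

```cpp
// Hypothetical reader for an hpcscan output grid (file layout assumed, not documented here).
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    // Assumption: the .info file begins with the grid dimensions n1 n2 n3.
    std::ifstream info("PropaEigenModeRef.proc0.grid.info");
    std::size_t n1 = 0, n2 = 0, n3 = 0;
    info >> n1 >> n2 >> n3;

    // Assumption: the .bin file is a raw dump of n1*n2*n3 single-precision values.
    std::vector<float> data(n1 * n2 * n3);
    std::ifstream bin("PropaEigenModeRef.proc0.grid.bin", std::ios::binary);
    bin.read(reinterpret_cast<char*>(data.data()),
             static_cast<std::streamsize>(data.size() * sizeof(float)));

    std::cout << "Read " << n1 << " x " << n2 << " x " << n3 << " grid points\n";
    return 0;
}
```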
Output debug traces
The code is equipped with debug traces that can be activated with the option -debug <LEVEL>, where LEVEL can be set to light, mid or full (minimum, middle and maximum levels of verbosity).
It can be useful to activate them when developing/debugging to understand the behavior of the code.
When activated, debug traces are written by each MPI proc in an ASCII file with name hpcscan.debug.proc<ID>.log
where ID is the MPI rank.
🔔 Maximum memory required per node (device) is 20 GB
🔔 At maximum, 8 computing nodes (devices) are used
The benchmarks are independent and can be used as is or configured according to your system if needed.
Test cases description
Test case | Objectives | Remarks |
---|---|---|
Memory | Assess memory bandwidth | Scalability analysis on a single node |
Grid | Assess bandwidth of grid operations | Analyse effect of the grid size |
Comm | Assess inter-node communication bandwidth | Analyse effect of subdomain decomposition |
FD_D2 | Assess FD spatial derivative computation bandwidth | Analyse effect of FD stencil order |
Propa | Find optimal configuration for the wave propagator | Explore range of parameters |
Propa | Scalability analysis of wave propagator on multiple nodes | Analyse effect of the FD stencil order |
➡️ Performance measurements and scripts to reproduce results obtained on various architectures are available in ./misc/hpcscanPerfSlides/hpcscanPerfSlides.pdf
hpcscan is built on a simple yet very flexible design that relies heavily on the C++ inheritance mechanism.
The main class is Grid (see ./src/grid.cpp).
This class handles all grid data in hpcscan and all operations performed on grids.
It implements the so-called Baseline mode and serves as the reference implementation.
💡 All test cases, at some point, call methods of this class. Indeed, test cases (testCase_xxx.cpp) do not implement kernels.
Now, let us say you would like to specialize the implementation for a given architecture.
To do this, you need to create a new class that derives from Grid.
For instance, you would create Grid_ArchXYZ.h and Grid_ArchXYZ.cpp for your new class (you need to add the new source files to the Makefile as well).
In this class, you may implement only a few of the functions that are declared as virtual in Grid.
💡 To allow hpcscan to use this new class, you only need to add it to the 'grid factory' (see ./src/grid_Factory.cpp). This is the only place in the code where all grids are referenced.
By doing this, you can switch at execution time to your new grid with the -testMode <TESTMODE> option, where TESTMODE = ArchXYZ.
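To make the pattern concrete, here is a self-contained toy illustration. It is deliberately simplified: these are not hpcscan's actual classes, methods or factory code (the real interface is in ./src/grid.h and ./src/grid_Factory.cpp), only a sketch of the idea: a base class providing the Baseline implementation, a derived class overriding a single virtual function, and a small factory selecting the implementation from the test mode string.

```cpp
// Toy illustration of the Grid / grid factory design pattern (simplified, hypothetical names).
#include <iostream>
#include <memory>
#include <string>

class Grid
{
public:
    virtual ~Grid() = default;
    // Hypothetical kernel: the real virtual functions are declared in ./src/grid.h.
    virtual void fillArray() { std::cout << "Baseline implementation\n"; }
};

class Grid_ArchXYZ : public Grid
{
public:
    // Specialize only the kernels that benefit from the target hardware;
    // everything else is inherited from the Baseline implementation.
    void fillArray() override { std::cout << "ArchXYZ implementation\n"; }
};

// Toy 'grid factory': the single place where all grid implementations are referenced.
// In hpcscan, the -testMode option plays the role of the testMode string below.
std::unique_ptr<Grid> createGrid(const std::string& testMode)
{
    if (testMode == "ArchXYZ") return std::make_unique<Grid_ArchXYZ>();
    return std::make_unique<Grid>(); // default: Baseline
}

int main()
{
    createGrid("Baseline")->fillArray(); // prints: Baseline implementation
    createGrid("ArchXYZ")->fillArray();  // prints: ArchXYZ implementation
    return 0;
}
```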
💡 You can proceed little by little, implementing one function at a time, with the possibility to check the behavior of your implementation against the Baseline reference solution.
Check the grids that are already implemented in hpcscan to get some examples.
- Issues encountered
- Suggestions of new test cases
- Performance measurements
➡️ If you want to contribute to hpcscan, please contact the project coordinator (vetienne@rocketmail.com).