TABLE OF CONTENTS
- Welcome to hpcscan
- Overview
- Main features
- Environment set-up
- Compilation
- Validation
- Execution
- Performance benchmarks
- Customization
- Have fun!
Version 1.2
Contact: Vincent Etienne / Email: vetienne@rocketmail.com
Contributors (chronological order)
- Vincent Etienne (NEC)
- Suha Kayum (Saudi Aramco)
- Marcin Rogowski (King Abdullah University of Science and Technology)
- Laurent Gatineau (NEC)
- Philippe Thierry (Intel)
- Fabrice Dupros (ARM)
- Hugo Barreiro (University of Reims Champagne-Ardenne)
hpcscan is a tool for benchmarking, on various architectures/systems, algorithms/kernels that are found in many scientific applications.
It features several categories of test cases aiming to measure memory, computation and communication bandwidths along with electric energy consumption.
- Written in C++
- Simple code structure based on individual test cases
- Easy to add new test cases
- Hybrid OpenMP/MPI parallelism
- Supports scalar and vector CPUs, GPUs and other accelerators (depending on compiler/architecture)
- All configuration parameters on command line
- Supports single and double precision computation
- Compilation with standard Makefile
- No external libraries
- Follows the Google C++ style guide
- All test cases are validated with embedded reference solutions
Several benchmarks are commonly used in the HPC community. To cite a few, the STREAM benchmark, the HPL benchmark and the OSU Micro-Benchmarks measure the memory bandwidth, the computation bandwidth and the interconnect bandwidth, respectively. In general, these benchmarks target specific characteristics of HPC systems. However, it is not straightforward to translate these characteristics into the context of a given scientific application.
This is why HPC vendors often present throughputs obtained with open-source scientific codes such as OpenFOAM (Computational Fluid Dynamics) or SPECFEM3D (Seismology). While these results are important to assess the performance of a given architecture on concrete problems, it is again not straightforward to transpose the conclusions to other applications. Moreover, every application is built on technical choices that may hinder performance on one system compared to another. How can these technical biases be overcome?
hpcscan has been designed to address these issues 😃
☑️ Lightweight and portable tool that can be easily deployed on a wide range of architectures including CPUs, GPUs and accelerators (see Validated hardware, operating systems and compilers).
☑️ Bridge between HPC architectures and numerical analysis/computational sciences. Beyond getting accurate performance measurements, hpcscan allows one to explore the behavior of numerical kernels and to seek the optimal configuration on a given architecture. An example is shown below where several key parameters of an algorithm (a wave propagation kernel) are explored to find the optimum (in terms of computation speed vs. accuracy) on the supercomputer Shaheen II at KAUST. See Performance benchmarks for details on this test case as well as scripts to perform the analysis.
Top left: L1 Error between the computed (wavefield) and analytical solutions versus N, the number of grid points along one direction (grid size is NxNxN). Blue: Finite-Difference with 4th order stencil, Pink: 8th order and Red: 12th order. Squares are obtained with the standard propagator implementation while crosses are obtained when the Laplacian operator is computed separately.
Top right: L1 Error between the computed and analytical solutions versus the computation time. The black star points to the configuration with an error below 1% and shortest computation time (i.e. the optimal configuration relative to the target error).
Bottom left: Propagator bandwidth in GPoint/s versus N.
Bottom right: Propagator bandwidth in GByte/s versus N.
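For reference, the L1 error plotted here is presumably a normalized sum of absolute differences between the computed wavefield u and the analytical solution over all grid points, of the form

$$\mathrm{Err}_{L1} = \frac{\sum_i \left| u_i - u_i^{\mathrm{ref}} \right|}{\sum_i \left| u_i^{\mathrm{ref}} \right|}$$

(the exact normalization used by hpcscan may differ; see the Propa test case sources for the precise definition).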
☑️ Set of representative kernels used in many scientific applications (see List of test cases). Without being too specific, the embedded kernels provide a way to capture the main traits of HPC architectures and identify their bottlenecks and strengths. With this knowledge, one can redesign or update specific parts of an application accordingly, to take full advantage of the target hardware.
☑️ Set of robust protocols to compare architectures. As suggested in the example above, the optimal configuration to solve a given problem might change from one architecture to another. hpcscan provides a solid framework to compare performance across different systems, where one can analyse results from different perspectives and achieve 'apples to apples' comparisons.
☑️ Customizable to fit a specific hardware (see Customization).
☑️ Multi-purpose initiative with benefits at several levels: from computer science students eager to learn, to seasoned numerical analysts willing to share their findings, to software engineers reusing kernels of interest to upgrade their applications.
☑️ Ongoing effort aiming to collect contributions in order to cover the current range of HPC systems. More options and kernels will be added over time.
⛔ One-number benchmark to rank HPC systems. However, hpcscan provides a way to perform a complete 'scanning' of architectures and possibly focus on one characteristic.
⛔ Confidential project. Everyone is invited to share results, feedback and, more importantly, contributions for the benefit of the entire HPC community.
hpcscan is a self-contained package that can be easily installed and executed on your system. Just follow these steps:
- Step 1: create the environment script for your system
- Step 2: build the executable
- Step 3: validate the executable
- Step 4: run the performance benchmarks
Version | Description | Release date |
---|---|---|
v1.0 | Initial version with CPU and Vector Engine support | Nov 28, 2020 |
v1.1 | GPU support | May 22, 2021 |
v1.2 | Energy consumption | Coming soon |
- bin: this directory is created during compilation and contains the hpcscan executable
- build: hpcscan can be compiled from here
- env: scripts to initialize the hpcscan environment
- misc: output samples and studies
- script: scripts for validation and performance benchmarks
- src: all hpcscan source files
Test case name | Description | Remark |
---|---|---|
Comm | MPI communications bandwidth | This case requires at least 2 MPI processes |
FD_D2 | Finite-difference (second derivatives in space) computations bandwidth | Available FD stencil orders: 2, 4, 6, 8, 10, 12, 14 and 16 |
Grid | Grid operations bandwidth | Operations on grids include manipulation of multi-dimensional indexes and specific portions of the grids (for instance, excluding halos) |
Memory | Memory operations bandwidth | In contrast to test case Grid, operations are done on contiguous memory arrays |
Modeling | Acoustic wave modeling bandwidth (same features as test case Propa except ...) | There is no accuracy checking for this test case |
Propa | Acoustic wave propagator bandwidth | Accuracy is checked against the multi-dimensional analytical solution (Eigen modes) of the wave equation |
Template | Test case template | Used to create new test cases |
Util | Utility tests to check internal functions | Reserved for developers |
All available test modes are listed below. Activation of each test mode depends on the compilers defined in the hpcscan environment script, see Environment script (mandatory).
Test mode name | Target hardware | Description | Remark |
---|---|---|---|
Baseline | Generic CPU | Standard implementation without optimization | ➡️ This mode is the reference implementation. Default test mode, always enabled |
CacheBlk | Generic CPU | Optimized with cache blocking techniques | Always enabled |
CUDA | NVIDIA GPU | Regular CUDA implementation without optimization | Enabled when compiled with nvcc (NVIDIA CUDA compiler) |
CUDA_Opt | NVIDIA GPU | Optimized CUDA implementation | Enabled when compiled with nvcc (NVIDIA CUDA compiler) |
CUDA_Ref | NVIDIA GPU | Reference CUDA implementation (for developers) | Enabled when compiled with nvcc (NVIDIA CUDA compiler) |
DPC++ | Intel CPU/GPU/FPGA | Regular DPC++ implementation without optimization | Enabled when compiled with dpcpp (Intel OneAPI DPC++ compiler) |
HIP | AMD GPU | Regular HIP implementation without optimization | Enabled when compiled with hipcc (AMD HIP compiler) |
HIP_Opt | AMD GPU | Optimized HIP implementation | Enabled when compiled with hipcc (AMD HIP compiler) |
NEC | NEC SX-Aurora | With NEC compiler directives | Enabled when compiled with nc++ (NEC C++ compiler) |
NEC_SCA | NEC SX-Aurora | With NEC Library Stencil Code Accelerator | Enabled when compiled with nc++ (NEC C++ compiler) |
OpenAcc | NVIDIA GPU | Regular OpenACC implementation without optimization | Enabled when compiled with a C++ compiler that supports OpenACC (not yet operational) |
- Linux operating system
- C++ compiler with OpenMP support
- MPI library
- Python and MATLAB to plot figures
- NVIDIA CUDA compiler
- Intel DPC++ compiler
- AMD HIP compiler
- NEC C++ compiler
- C++ compiler with OpenACC support
In order to compile and run hpcscan, you need to source one of the files in the directory ./env
cd ./env
Example to set up the environment for hpcscan with GCC and CUDA compilers:
source ./setEnvNeptuneGccCuda.sh
🔔 For a new system, you need to create a script for your system (use one of the existing files as a template)
Go to ./build and use the command make
The executable can be found in ./bin/hpcscan
🔔 If hpcscan environment has not been set (see Environment script (mandatory)), compilation will abort.
By default, hpcscan is compiled in single precision
To compile in double precision: make precision=double
To check the test modes that are enabled in your hpcscan binary, use the command
./bin/hpcscan -v
To check that hpcscan has been built correctly and works fine, go to ./script and launch
sh runValidationTests.sh
This script runs a set of light test cases and should complete within a few minutes (even on a laptop).
You should get in the output report (displayed on the terminal):
- All tests marked as PASSED (661 tests passed for each test mode enabled)
- No test marked as FAILED
Check the summary at the end of the report for a quick overview.
🔔 These tests are intended for validation purposes only; they are not suitable for performance measurements.
hpcscan has been successfully tested on the hardware, operating systems and compilers listed below.
Operating system | Compiler | MPI | Host | Device | Test modes |
---|---|---|---|---|---|
Ubuntu 22.04.1 LTS | g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0 | mpirun (Open MPI) 4.1.2 | Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (Intel Kaby Lake) | - | Baseline, CacheBlk |
Ubuntu 22.04.1 LTS | Intel icpc (ICC) 2021.7.0 20220726 | Intel MPI Version 2021.7 | Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (Intel Kaby Lake) | - | Baseline, CacheBlk |
Red Hat 4.8.5-39 | Intel oneAPI DPC++/C++ Compiler 2022.1.0 | Intel MPI Version 2021.6 | Intel(R) Xeon(R) Gold 6240L CPU @ 2.60GHz (Intel Cascade Lake) | - | Baseline, CacheBlk |
Red Hat 4.8.5-39 | | Intel MPI Version 2021.6 | Intel(R) Xeon(R) Gold 6240L CPU @ 2.60GHz (Intel Cascade Lake) | Tesla V100S-PCI (NVIDIA GPU) | Baseline, CacheBlk, Cuda, Cuda_Opt, Cuda_Ref |
Red Hat 8.5.0-10 | NEC nc++ (NCC) 4.0.0 | NEC MPI 3.1.0 | Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz (Intel Skylake) | NEC SX-Aurora TSUBASA 20B-P (NEC Vector Engine) | Baseline, CacheBlk, NEC, NEC_SCA |
Red Hat 8.5.0-10 | Intel oneAPI DPC++/C++ Compiler 2022.1.0 | Intel MPI Version 2021.6 | Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz (Intel Skylake) | - | Baseline, CacheBlk |
SUSE Linux Enterprise Server 15 | Intel icpc (ICC) 19.0.5.281 20190815 | - | Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz (Intel Haswell) | - | - |
Red Hat 4.8.5-39 | Intel icpc version 19.1.2.254 | - | Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (Intel Cascade Lake) | - | - |
Ubuntu 20.04.1 LTS | - | Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (Intel Ice Lake) | GP108M [GeForce MX330] (NVIDIA GPU) | - | |
CentOS Linux release 7.7.1908 | - | Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz (Intel Skylake) | GV100GL [Tesla V100 SXM2 32GB] (NVIDIA GPU) | - | |
Ubuntu 20.04.1 LTS | Intel(R) oneAPI DPC++ Compiler 2021.2.0 | - | Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (Intel Ice Lake) | - | - |
Ubuntu 20.04.1 LTS | - | AMD EPYC 7742 64-Core Processor @ 2.25GHz (AMD Rome) | [AMD Instinct MI100] (AMD GPU) | - |
hpcscan can be launched from a terminal with all configuration parameters within a single line.
To get help on the parameters
./bin/hpcscan -h
Execution with a single MPI process
mpirun -n 1 ./bin/hpcscan -testCase <TESTCASE> -testMode <TESTMODE>
where
- TESTCASE is the name of the test case (see List of test cases)
- TESTMODE is the name of the test mode (see List of test modes)
Example
mpirun -n 1 ./bin/hpcscan -testCase Propa -testMode CacheBlk
🔔 If you do not specify -testMode <TESTMODE>, the Baseline mode is assumed.
Example
mpirun -n 1 ./bin/hpcscan -testCase Propa
Execution with multiple MPI processes
mpirun -n <N> ./bin/hpcscan -testCase <TESTCASE> -testMode <TESTMODE> -nsub1 <NSUB1> -nsub2 <NSUB2> -nsub3 <NSUB3>
🔔 When several MPI processes are used, subdomain decomposition is activated. The product NSUB1 x NSUB2 x NSUB3 must be equal to N (the number of MPI processes). You may omit the number of subdomains along an axis if that number is 1.
Example
mpirun -n 2 ./bin/hpcscan -testCase Comm -nsub1 2
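Example with 8 MPI processes and a 3D decomposition (2 x 2 x 2 = 8 subdomains)
mpirun -n 8 ./bin/hpcscan -testCase Propa -nsub1 2 -nsub2 2 -nsub3 2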
Configuration of the grid size and dimension
Simply add on the command line
-n1 <N1> -n2 <N2> -n3 <N3> -dim <DIM>
where
- N1, N2, N3 are the numbers of grid points along axes 1, 2 and 3
- DIM = 1, 2 or 3 (1D, 2D or 3D grids)
Example
mpirun -n 1 ../bin/hpcscan -testCase Grid -dim 2 -n1 200 -n2 300
🔔 If you do not specify -dim <DIM>, a 3D grid is assumed.
Input
hpcscan does not require any input file. All data are built internally.
Output on the terminal
During execution, information regarding result validation and performance is sent to the terminal output.
Output performance log file
For every test case, an ASCII file containing all measurements in a compact form is created. It can be used to plot results with dedicated tools. The name of the log file is as follows:
hpcscan.perf.<TESTCASE>.log
If hpcscan is launched several times, results are appended to the log file. This is convenient, for instance, when you want to analyse the effect of a parameter and plot the series of results in a graph.
Output grids
By default, the grids manipulated by hpcscan are not written to disk.
To output the grids, use the option -writeGrid.
When activated, each grid used in a test will generate 2 files:
- An ASCII file with the grid dimensions (file name <GRIDNAME>.proc<ID>.grid.info)
- A binary file with the grid data (file name <GRIDNAME>.proc<ID>.grid.bin)
where ID is the MPI rank.
Example (this is the command that was used to produce the hpcscan logo on top of this page)
mpirun -n 1 ../../bin/hpcscan -testCase Propa -writeGrid \
-tmax 0.2 -snapDt 0.1 \
-dim 2 -n1 200 -n2 600 \
-param1 4 -param2 8
It outputs the following files: PropaEigenModeRef.proc0.grid.info, PropaEigenModeRef.proc0.grid.bin, PropaEigenModePrn.proc0.grid.info and PropaEigenModePrn.proc0.grid.bin
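If you want to post-process these grids with your own tools, a minimal reader can be written in a few lines. The sketch below is only an illustration under assumptions not stated in this document: it assumes the .info file starts with the grid dimensions as whitespace-separated integers, and that the .bin file is a raw dump of n1 x n2 x n3 single-precision values (double precision if hpcscan was built with make precision=double). Check the files produced on your system before relying on it.

```cpp
// Hypothetical reader for an hpcscan output grid (file layout assumed, not documented here).
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    // Assumption: the .info file begins with the grid dimensions n1 n2 n3.
    std::ifstream info("PropaEigenModeRef.proc0.grid.info");
    std::size_t n1 = 0, n2 = 0, n3 = 0;
    info >> n1 >> n2 >> n3;

    // Assumption: the .bin file is a raw dump of n1*n2*n3 single-precision values.
    std::vector<float> data(n1 * n2 * n3);
    std::ifstream bin("PropaEigenModeRef.proc0.grid.bin", std::ios::binary);
    bin.read(reinterpret_cast<char*>(data.data()),
             static_cast<std::streamsize>(data.size() * sizeof(float)));

    std::cout << "Read " << n1 << " x " << n2 << " x " << n3 << " grid points\n";
    return 0;
}
```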
Output debug traces
The code is equipped with debug traces that can be activated with the option -debug <LEVEL>, where LEVEL can be set to light, mid or full (minimum, middle and maximum levels of verbosity).
It can be useful to activate them when developing/debugging to understand the behavior of the code.
When activated, debug traces are written by each MPI proc in an ASCII file with name hpcscan.debug.proc<ID>.log
where ID is the MPI rank.
🔔 Maximum memory required per node (device) is 20 GB
🔔 At maximum, 8 computing nodes (devices) are used
The benchmarks are independent and can be used as is or configured according to your system if needed.
Test cases description
Test case | Objectives | Remarks |
---|---|---|
Memory | Assess memory bandwidth | Scalability analysis on a single node |
Grid | Assess bandwidth of grid operations | Analyse effect of the grid size |
Comm | Assess inter-node communication bandwidth | Analyse effect of subdomain decomposition |
FD_D2 | Assess FD spatial derivative computation bandwidth | Analyse effect of FD stencil order |
Propa | Find optimal configuration for the wave propagator | Explore range of parameters |
Propa | Scalability analysis of wave propagator on multiple nodes | Analyse effect of the FD stencil order |
➡️ Performance measurements and scripts to reproduce results obtained on various architectures are available in ./misc/hpcscanPerfSlides/hpcscanPerfSlides.pdf
hpcscan is built on a simple yet very flexible design that relies heavily on the C++ inheritance mechanism.
The main class is Grid (see ./src/grid.cpp).
This class handles all grid data in hpcscan and all operations performed on grids.
It implements the so-called Baseline mode and serves as the reference implementation.
💡 All test cases, at some point, call methods of this class. Indeed, test cases (testCase_xxx.cpp) do not implement kernels.
Now, let us say you would like to specialize the implementation for a given architecture.
To do this, you need to create a new class that derives from Grid.
For instance, you would create Grid_ArchXYZ.h and Grid_ArchXYZ.cpp for your new class (you need to add the new source files to the Makefile as well).
In this class, you may implement only a few of the functions that are declared as virtual in Grid.
💡 To allow hpcscan to use this new class, you only need to add it to the 'grid factory' (see ./src/grid_Factory.cpp). This is the only place in the code where all grids are referenced.
By doing this, you can switch at execution time to your new grid with the -testMode <TESTMODE> option, where TESTMODE = ArchXYZ.
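To make the pattern concrete, here is a self-contained toy illustration. It is deliberately simplified: these are not hpcscan's actual classes, methods or factory code (the real interface is in ./src/grid.h and ./src/grid_Factory.cpp), only a sketch of the idea: a base class providing the Baseline implementation, a derived class overriding a single virtual function, and a small factory selecting the implementation from the test mode string.

```cpp
// Toy illustration of the Grid / grid factory design pattern (simplified, hypothetical names).
#include <iostream>
#include <memory>
#include <string>

class Grid
{
public:
    virtual ~Grid() = default;
    // Hypothetical kernel: the real virtual functions are declared in ./src/grid.h.
    virtual void fillArray() { std::cout << "Baseline implementation\n"; }
};

class Grid_ArchXYZ : public Grid
{
public:
    // Specialize only the kernels that benefit from the target hardware;
    // everything else is inherited from the Baseline implementation.
    void fillArray() override { std::cout << "ArchXYZ implementation\n"; }
};

// Toy 'grid factory': the single place where all grid implementations are referenced.
// In hpcscan, the -testMode option plays the role of the testMode string below.
std::unique_ptr<Grid> createGrid(const std::string& testMode)
{
    if (testMode == "ArchXYZ") return std::make_unique<Grid_ArchXYZ>();
    return std::make_unique<Grid>(); // default: Baseline
}

int main()
{
    createGrid("Baseline")->fillArray(); // prints: Baseline implementation
    createGrid("ArchXYZ")->fillArray();  // prints: ArchXYZ implementation
    return 0;
}
```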
💡 You can proceed little by little, implementing one function at a time, with the possibility to check the behavior of your implementation against the Baseline reference solution.
Check the grids that are already implemented in hpcscan to get some examples.
- Issues encountered
- Suggestions of new test cases
- Performance measurements
➡️ If you want to contribute to hpcscan, please contact the project coordinator (vetienne@rocketmail.com).