
power8_configure

Mehmet Deveci edited this page Dec 19, 2017 · 7 revisions

1. Set up Kokkos

You can use the latest version of Kokkos. The experiments in the paper "Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures" used Kokkos version 2.04.04.

mkdir $HOME/kokkoskernels_spgemm_benchmark
cd $HOME/kokkoskernels_spgemm_benchmark
git clone git@github.com:kokkos/kokkos.git

2. Get KokkosKernels

cd $HOME/kokkoskernels_spgemm_benchmark
git clone git@github.com:mndevec/kokkos-kernels.git 

As of 12/20/2017, the KokkosKernels SpGEMM updates are not yet on the master branch. Check out the spgemm_hash_promotion branch.

cd $HOME/kokkoskernels_spgemm_benchmark/kokkos-kernels
git checkout spgemm_hash_promotion

3. Update compileKokkosKernels.sh, located in example/buildlib.

cd $HOME/kokkoskernels_spgemm_benchmark/kokkos-kernels/example/buildlib
vi compileKokkosKernels.sh

Below is an example compileKokkosKernels.sh for Power8 with the OpenMP execution space.

KOKKOS_PATH=${HOME}/kokkoskernels_spgemm_benchmark/kokkos #path to kokkos source
KOKKOSKERNELS_SCALARS='double' #we only need double
KOKKOSKERNELS_LAYOUTS=LayoutLeft #the layout types to instantiate.
KOKKOSKERNELS_ORDINALS=int #ordinal types to instantiate
KOKKOSKERNELS_OFFSETS=int #offset types to instantiate
KOKKOSKERNELS_PATH=${HOME}/kokkoskernels_spgemm_benchmark/kokkos-kernels #path to kokkos-kernels top directory.
KOKKOSKERNELS_OPTIONS=eti-only #options for kokkoskernels  
CXXFLAGS="-Wall -pedantic -Werror -O3 -g -Wshadow -Wsign-compare -Wignored-qualifiers -Wempty-body -Wclobbered -Wuninitialized"
CXX=g++
KOKKOS_DEVICES=OpenMP #we only need openmp execution space.
KOKKOS_ARCHS=Power8        #!!!!!!!!!!!!!!!!!!specify the architecture for compilation!!!!!!!!!!!!!!!!!!!
KOKKOSKERNELS_TPLS="" 

../../scripts/generate_makefile.bash --kokkoskernels-path=${KOKKOSKERNELS_PATH} --with-scalars=${KOKKOSKERNELS_SCALARS} --with-ordinals=${KOKKOSKERNELS_ORDINALS} --with-offsets=${KOKKOSKERNELS_OFFSETS} --kokkos-path=${KOKKOS_PATH} --with-devices=${KOKKOS_DEVICES} --arch=${KOKKOS_ARCHS} --compiler=${CXX} --with-options=${KOKKOSKERNELS_OPTIONS}  --cxxflags="${CXXFLAGS}" --with-tpls=${KOKKOSKERNELS_TPLS}

Set the compiler.

module load gcc/5.4.0
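
Before generating the makefile, it can help to confirm that the paths and compiler set in compileKokkosKernels.sh actually exist. The function below is a minimal sketch of such a check. It is not part of KokkosKernels; the name preflight_check and its messages are made up for illustration.

```shell
# Hypothetical pre-flight helper (not shipped with KokkosKernels).
# Arguments: Kokkos source dir, KokkosKernels top dir, compiler name.
# Prints one line per problem and returns 1 if anything is missing.
preflight_check() {
  local kokkos_path="$1" kernels_path="$2" cxx="$3" status=0
  if [ ! -d "$kokkos_path" ]; then
    echo "missing Kokkos source directory: $kokkos_path"
    status=1
  fi
  if [ ! -d "$kernels_path" ]; then
    echo "missing KokkosKernels directory: $kernels_path"
    status=1
  fi
  # command -v succeeds only if the compiler can be found on PATH.
  if ! command -v "$cxx" >/dev/null 2>&1; then
    echo "compiler not found on PATH: $cxx"
    status=1
  fi
  return "$status"
}
```

One way to use it is to add a line such as `preflight_check "$KOKKOS_PATH" "$KOKKOSKERNELS_PATH" "$CXX" || exit 1` to compileKokkosKernels.sh just before the generate_makefile.bash call, after the module is loaded.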

4. Compile KokkosKernels.

cd $HOME/kokkoskernels_spgemm_benchmark/kokkos-kernels/example/buildlib
./compileKokkosKernels.sh
make build-test -j

5. Run the benchmarks.

  • Allocate the node using the appropriate scheduling command.
  • Download a matrix from the UFL (University of Florida) Sparse Matrix Collection. This example uses audikw_1.
  • Each experiment is run 6 times. This can be changed with the "repeat" keyword (for example, "repeat 15" to repeat 15 times).
  • The first run is always discarded as warm-up. Each algorithm below is run with 32, 64, and 128 threads.
  • The runs below use ".bin" files for faster I/O. ".mtx" files can also be used; the correct reader is chosen based on the file suffix. For faster experiments, you can convert mtx files to bin files with KokkosKernels_MatrixConverter.exe as below.
./KokkosKernels_MatrixConverter.exe in_mtx audikw_1.mtx out_mtx audikw_1.bin
  • Set the environment variables and go to the executables folder.
export OMP_PROC_BIND=spread 
export OMP_PLACES=threads
cd $HOME/kokkoskernels_spgemm_benchmark/kokkos-kernels/example/buildlib/perf_test
Running default algorithm: KKSPGEMM. Best Runtime: ~1.38 seconds
bash-4.2$ OMP_NUM_THREADS=32 ./KokkosSparse_spgemm.exe openmp 32 amtx  audikw_1.bin 
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 32 x 1 ]
Using A matrix for B as well
mm_time:1.60257 symbolic_time:0.189802 numeric_time:1.41277
mm_time:1.59863 symbolic_time:0.192077 numeric_time:1.40656
mm_time:1.59652 symbolic_time:0.190182 numeric_time:1.40634
mm_time:1.59754 symbolic_time:0.190363 numeric_time:1.40718
mm_time:1.59503 symbolic_time:0.191396 numeric_time:1.40363
mm_time:1.59607 symbolic_time:0.190068 numeric_time:1.406

bash-4.2$ OMP_NUM_THREADS=64 ./KokkosSparse_spgemm.exe openmp 64 amtx  audikw_1.bin 
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 64 x 1 ]
Using A matrix for B as well
mm_time:1.39399 symbolic_time:0.16283 numeric_time:1.23116
mm_time:1.38258 symbolic_time:0.163591 numeric_time:1.21899
mm_time:1.38344 symbolic_time:0.163245 numeric_time:1.2202
mm_time:1.3904 symbolic_time:0.162324 numeric_time:1.22808
mm_time:1.38424 symbolic_time:0.162419 numeric_time:1.22182
mm_time:1.38652 symbolic_time:0.163269 numeric_time:1.22325

bash-4.2$ OMP_NUM_THREADS=128 ./KokkosSparse_spgemm.exe openmp 128 amtx  audikw_1.bin 
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 128 x 1 ]
Using A matrix for B as well
mm_time:1.492 symbolic_time:0.165791 numeric_time:1.32621
mm_time:1.49169 symbolic_time:0.165278 numeric_time:1.32641
mm_time:1.48871 symbolic_time:0.164961 numeric_time:1.32375
mm_time:1.50133 symbolic_time:0.165355 numeric_time:1.33597
mm_time:1.48948 symbolic_time:0.164395 numeric_time:1.32508
mm_time:1.48719 symbolic_time:0.164453 numeric_time:1.32274
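
The best runtime quoted above appears to be the smallest mm_time over all runs. A minimal sketch for pulling that number (and the mean) out of the benchmark output is below. The function name summarize_mm_time and its skip argument are hypothetical, not part of KokkosKernels. The page does not say whether the printed lines still include the warm-up run, so skipping is left as an option and defaults to 0.

```shell
# Hypothetical helper: reads benchmark output on stdin and reports the
# number of timing lines used, the smallest mm_time, and the mean mm_time.
# Optional first argument: how many leading timing lines to skip
# (assumption: pass 1 if the first printed line is the warm-up run).
summarize_mm_time() {
  local skip="${1:-0}"
  awk -v skip="$skip" '
    /^mm_time:/ {
      n++
      if (n <= skip) next
      split($1, kv, ":")          # $1 looks like "mm_time:1.39399"
      t = kv[2] + 0
      if (count == 0 || t < best) best = t
      sum += t
      count++
    }
    END {
      if (count == 0) { print "no timing lines found"; exit 1 }
      printf "runs=%d best=%.5f mean=%.5f\n", count, best, sum / count
    }'
}
```

For example, `OMP_NUM_THREADS=64 ./KokkosSparse_spgemm.exe openmp 64 amtx audikw_1.bin | summarize_mm_time` on the 64-thread output above would print `runs=6 best=1.38258 mean=1.38686`.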
Running KKMEM. Best Runtime: ~1.43 seconds
bash-4.2$ OMP_NUM_THREADS=32 ./KokkosSparse_spgemm.exe openmp 32 amtx  audikw_1.bin algorithm kkmem
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 32 x 1 ]
Using A matrix for B as well
mm_time:1.65053 symbolic_time:0.239056 numeric_time:1.41148
mm_time:1.64862 symbolic_time:0.240431 numeric_time:1.40819
mm_time:1.64656 symbolic_time:0.239542 numeric_time:1.40701
mm_time:1.64874 symbolic_time:0.239644 numeric_time:1.4091
mm_time:1.64745 symbolic_time:0.239365 numeric_time:1.40808
mm_time:1.65199 symbolic_time:0.239364 numeric_time:1.41262



bash-4.2$ OMP_NUM_THREADS=64 ./KokkosSparse_spgemm.exe openmp 64 amtx  audikw_1.bin algorithm kkmem
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 64 x 1 ]
Using A matrix for B as well
mm_time:1.43281 symbolic_time:0.208509 numeric_time:1.2243
mm_time:1.42735 symbolic_time:0.208932 numeric_time:1.21842
mm_time:1.44173 symbolic_time:0.207623 numeric_time:1.2341
mm_time:1.43231 symbolic_time:0.208167 numeric_time:1.22414
mm_time:1.43156 symbolic_time:0.20821 numeric_time:1.22335
mm_time:1.42862 symbolic_time:0.208559 numeric_time:1.22006

bash-4.2$ OMP_NUM_THREADS=128 ./KokkosSparse_spgemm.exe openmp 128 amtx  audikw_1.bin algorithm kkmem
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 128 x 1 ]
Using A matrix for B as well
mm_time:1.54737 symbolic_time:0.209912 numeric_time:1.33746
mm_time:1.53272 symbolic_time:0.211581 numeric_time:1.32114
mm_time:1.53738 symbolic_time:0.211051 numeric_time:1.32633
mm_time:1.52853 symbolic_time:0.210795 numeric_time:1.31773
mm_time:1.52895 symbolic_time:0.209716 numeric_time:1.31924
mm_time:1.52877 symbolic_time:0.210408 numeric_time:1.31837


Running KKDENSE. Best Runtime: ~1.02 seconds
bash-4.2$  OMP_NUM_THREADS=32 ./KokkosSparse_spgemm.exe openmp 32 amtx  audikw_1.bin algorithm kkdense
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 32 x 1 ]
Using A matrix for B as well
mm_time:1.10339 symbolic_time:0.189668 numeric_time:0.913727
mm_time:1.10376 symbolic_time:0.191396 numeric_time:0.912362
mm_time:1.10059 symbolic_time:0.189966 numeric_time:0.910622
mm_time:1.10133 symbolic_time:0.189731 numeric_time:0.911597
mm_time:1.10026 symbolic_time:0.189717 numeric_time:0.910541
mm_time:1.10042 symbolic_time:0.190201 numeric_time:0.910215


bash-4.2$ OMP_NUM_THREADS=64 ./KokkosSparse_spgemm.exe openmp 64 amtx  audikw_1.bin algorithm kkdense
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 64 x 1 ]
Using A matrix for B as well
mm_time:1.01657 symbolic_time:0.162959 numeric_time:0.853615
mm_time:1.02281 symbolic_time:0.164399 numeric_time:0.85841
mm_time:1.01601 symbolic_time:0.161771 numeric_time:0.854238
mm_time:1.03082 symbolic_time:0.163026 numeric_time:0.867798
mm_time:1.02147 symbolic_time:0.162556 numeric_time:0.858909
mm_time:1.02057 symbolic_time:0.162479 numeric_time:0.858087



bash-4.2$ OMP_NUM_THREADS=128 ./KokkosSparse_spgemm.exe openmp 128 amtx  audikw_1.bin algorithm kkdense
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 128 x 1 ]
Using A matrix for B as well
mm_time:1.02431 symbolic_time:0.16467 numeric_time:0.859638
mm_time:1.03067 symbolic_time:0.164871 numeric_time:0.865799
mm_time:1.02781 symbolic_time:0.16416 numeric_time:0.863653
mm_time:1.02785 symbolic_time:0.164758 numeric_time:0.863094
mm_time:1.02448 symbolic_time:0.164188 numeric_time:0.860296
mm_time:1.02984 symbolic_time:0.164721 numeric_time:0.865115
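
The nine runs above (three algorithms, three thread counts) can be scripted instead of typed by hand. The sketch below only wraps the command lines already shown on this page. The function name run_sweep, the SPGEMM_EXE and DRY_RUN variables, and the "default" keyword for "no algorithm argument" are all made up for illustration.

```shell
# Hypothetical sweep driver around the commands shown on this page.
# Usage: run_sweep <matrix file> <algorithm>...
#   "default" means: pass no "algorithm" argument (KKSPGEMM).
#   SPGEMM_EXE (assumption) overrides the executable path.
#   DRY_RUN (assumption), if non-empty, prints the commands instead of running them.
run_sweep() {
  local matrix="$1"
  shift
  local exe="${SPGEMM_EXE:-./KokkosSparse_spgemm.exe}"
  local algo threads extra
  for algo in "$@"; do
    for threads in 32 64 128; do
      if [ "$algo" = "default" ]; then
        extra=""
      else
        extra="algorithm $algo"
      fi
      if [ -n "${DRY_RUN:-}" ]; then
        echo "OMP_NUM_THREADS=$threads $exe openmp $threads amtx $matrix${extra:+ $extra}"
      else
        # $extra is intentionally unquoted so it splits into two words.
        OMP_NUM_THREADS="$threads" "$exe" openmp "$threads" amtx "$matrix" $extra
      fi
    done
  done
}
```

From the perf_test folder, `DRY_RUN=1 run_sweep audikw_1.bin default kkmem kkdense` prints the nine command lines without running anything; dropping DRY_RUN runs them. OMP_PROC_BIND and OMP_PLACES still need to be exported beforehand as described above.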

Changing KKSPGEMM cut-off to 1M to use dense accumulators:

All benchmarks use a cut-off of 250k for dense accumulators. The example below shows how to change it with DENSEACCMAX.

bash-4.2$ OMP_NUM_THREADS=64 ./KokkosSparse_spgemm.exe openmp 64 amtx  audikw_1.bin DENSEACCMAX 1000000 
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 64 x 1 ]
Using A matrix for B as well
mm_time:1.01432 symbolic_time:0.162636 numeric_time:0.851682
mm_time:1.02273 symbolic_time:0.162844 numeric_time:0.859891
mm_time:1.01914 symbolic_time:0.16215 numeric_time:0.85699
mm_time:1.01352 symbolic_time:0.162267 numeric_time:0.851249
mm_time:1.01632 symbolic_time:0.162973 numeric_time:0.85335
mm_time:1.01611 symbolic_time:0.163853 numeric_time:0.852255

