power8_configure
You can use the latest version of Kokkos. The experiments in the paper "Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures" used Kokkos version 2.04.04.
mkdir $HOME/kokkoskernels_spgemm_benchmark
cd $HOME/kokkoskernels_spgemm_benchmark
git clone git@github.com:kokkos/kokkos.git
cd $HOME/kokkoskernels_spgemm_benchmark
git clone git@github.com:mndevec/kokkos-kernels.git
As of 12/20/2017, the KokkosKernels SpGEMM updates are not yet on the master branch. Check out the spgemm_hash_promotion branch instead.
cd $HOME/kokkoskernels_spgemm_benchmark/kokkos-kernels
git checkout spgemm_hash_promotion
cd $HOME/kokkoskernels_spgemm_benchmark/kokkos-kernels/example/buildlib
vi compileKokkosKernels.sh
Below is an example of compileKokkosKernels.sh for Power8 with the OpenMP execution space.
KOKKOS_PATH=${HOME}/kokkoskernels_spgemm_benchmark/kokkos #path to kokkos source
KOKKOSKERNELS_SCALARS='double' #we only need double
KOKKOSKERNELS_LAYOUTS=LayoutLeft #the layout types to instantiate.
KOKKOSKERNELS_ORDINALS=int #ordinal types to instantiate
KOKKOSKERNELS_OFFSETS=int #offset types to instantiate
KOKKOSKERNELS_PATH=${HOME}/kokkoskernels_spgemm_benchmark/kokkos-kernels #path to kokkos-kernels top directory.
KOKKOSKERNELS_OPTIONS=eti-only #options for kokkoskernels
CXXFLAGS="-Wall -pedantic -Werror -O3 -g -Wshadow -Wsign-compare -Wignored-qualifiers -Wempty-body -Wclobbered -Wuninitialized"
CXX=g++
KOKKOS_DEVICES=OpenMP #we only need openmp execution space.
KOKKOS_ARCHS=Power8 #!!!!!!!!!!!!!!!!!!specify the architecture for compilation!!!!!!!!!!!!!!!!!!!
KOKKOSKERNELS_TPLS=""
../../scripts/generate_makefile.bash --kokkoskernels-path=${KOKKOSKERNELS_PATH} --with-scalars=${KOKKOSKERNELS_SCALARS} --with-ordinals=${KOKKOSKERNELS_ORDINALS} --with-offsets=${KOKKOSKERNELS_OFFSETS} --kokkos-path=${KOKKOS_PATH} --with-devices=${KOKKOS_DEVICES} --arch=${KOKKOS_ARCHS} --compiler=${CXX} --with-options=${KOKKOSKERNELS_OPTIONS} --cxxflags="${CXXFLAGS}" --with-tpls=${KOKKOSKERNELS_TPLS}
Set the compiler. If you use MKL, export the MKL path.
module load intel/compilers/18.0.128
export MKL_PATH=/home/projects/x86-64-knl/intel/beta/pstudio/2018-B1/compilers_and_libraries_2018.0.061/linux/mkl/
cd $HOME/kokkoskernels_spgemm_benchmark/kokkos-kernels/example/buildlib
./compileKokkosKernels.sh
make build-test -j
Below we show how to run benchmarks using KNL-CACHE mode.
- Allocate a node using the appropriate scheduling command.
- Download a sparse matrix from the UFL collection. This example uses audikw_1.
- Each experiment is run 6 times. This can be changed with the "repeat" keyword (for example, "repeat 15" repeats it 15 times).
- The first run is always discarded as warm-up. For each algorithm below, we run from 64 to 256 threads.
- The commands below use ".bin" files for faster I/O. ".mtx" files can also be used; the correct reader is chosen based on the file suffix. For faster experimenting, you can convert mtx files to bin files with KokkosKernels_MatrixConverter.exe as below.
./KokkosKernels_MatrixConverter.exe in_mtx audikw_1.mtx out_mtx audikw_1.bin
- Set the environment variables and go to the executables folder.
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
cd $HOME/kokkoskernels_spgemm_benchmark/kokkos-kernels/example/buildlib/perf_test
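The runs that follow repeat the same command for 64, 128, and 256 threads and for several algorithms. As a rough sketch, the command lines can be generated with a small loop. The function name spgemm_sweep is hypothetical and not part of KokkosKernels. It only prints the commands, so they can be reviewed before being executed.

```shell
# Print one benchmark command line per (algorithm, thread count) pair.
# spgemm_sweep is a hypothetical helper; the matrix file and the algorithm
# names are passed in by the caller.
spgemm_sweep() {
  matrix=$1
  shift
  for alg in "$@"; do
    for t in 64 128 256; do
      echo "OMP_NUM_THREADS=$t ./KokkosSparse_spgemm.exe openmp $t amtx $matrix algorithm $alg"
    done
  done
}

# Dry run: list the kkmem and kkdense commands used on this page.
spgemm_sweep audikw_1.bin kkmem kkdense
```

Piping the output to `sh` from the perf_test folder would execute the commands in order.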
bash-4.2$ OMP_NUM_THREADS=64 ./KokkosSparse_spgemm.exe openmp 64 amtx audikw_1.bin
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 64 x 1 ]
Using A matrix for B as well
mm_time:2.48323 symbolic_time:0.32976 numeric_time:2.15347
mm_time:2.47841 symbolic_time:0.323874 numeric_time:2.15454
mm_time:2.47767 symbolic_time:0.322463 numeric_time:2.15521
mm_time:2.47479 symbolic_time:0.322538 numeric_time:2.15225
mm_time:2.47393 symbolic_time:0.322712 numeric_time:2.15122
mm_time:2.47576 symbolic_time:0.322544 numeric_time:2.15321
bash-4.2$ OMP_NUM_THREADS=128 ./KokkosSparse_spgemm.exe openmp 128 amtx audikw_1.bin
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 128 x 1 ]
Using A matrix for B as well
mm_time:1.59176 symbolic_time:0.21535 numeric_time:1.37641
mm_time:1.58634 symbolic_time:0.211642 numeric_time:1.37469
mm_time:1.58279 symbolic_time:0.209543 numeric_time:1.37325
mm_time:1.59248 symbolic_time:0.210353 numeric_time:1.38213
mm_time:1.59114 symbolic_time:0.210148 numeric_time:1.38099
mm_time:1.59152 symbolic_time:0.209913 numeric_time:1.38161
bash-4.2$ OMP_NUM_THREADS=256 ./KokkosSparse_spgemm.exe openmp 256 amtx audikw_1.bin
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 256 x 1 ]
Using A matrix for B as well
mm_time:1.38192 symbolic_time:0.204276 numeric_time:1.17764
mm_time:1.38627 symbolic_time:0.212812 numeric_time:1.17346
mm_time:1.36824 symbolic_time:0.196392 numeric_time:1.17185
mm_time:1.36981 symbolic_time:0.198274 numeric_time:1.17153
mm_time:1.37078 symbolic_time:0.196811 numeric_time:1.17397
mm_time:1.36552 symbolic_time:0.196207 numeric_time:1.16932
bash-4.2$ OMP_NUM_THREADS=64 ./KokkosSparse_spgemm.exe openmp 64 amtx audikw_1.bin algorithm kkmem
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 64 x 1 ]
Using A matrix for B as well
mm_time:2.61748 symbolic_time:0.469393 numeric_time:2.14808
mm_time:2.61755 symbolic_time:0.469112 numeric_time:2.14844
mm_time:2.61899 symbolic_time:0.469484 numeric_time:2.14951
mm_time:2.61537 symbolic_time:0.466099 numeric_time:2.14927
mm_time:2.61856 symbolic_time:0.470032 numeric_time:2.14853
mm_time:2.61432 symbolic_time:0.466054 numeric_time:2.14827
bash-4.2$ OMP_NUM_THREADS=128 ./KokkosSparse_spgemm.exe openmp 128 amtx audikw_1.bin algorithm kkmem
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 128 x 1 ]
Using A matrix for B as well
mm_time:1.69112 symbolic_time:0.310673 numeric_time:1.38045
mm_time:1.69136 symbolic_time:0.307821 numeric_time:1.38353
mm_time:1.68288 symbolic_time:0.3067 numeric_time:1.37618
mm_time:1.69087 symbolic_time:0.306877 numeric_time:1.38399
mm_time:1.68863 symbolic_time:0.306124 numeric_time:1.3825
mm_time:1.68767 symbolic_time:0.305851 numeric_time:1.38182
bash-4.2$ OMP_NUM_THREADS=256 ./KokkosSparse_spgemm.exe openmp 256 amtx audikw_1.bin algorithm kkmem
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 256 x 1 ]
Using A matrix for B as well
mm_time:1.43602 symbolic_time:0.265443 numeric_time:1.17058
mm_time:1.42847 symbolic_time:0.257696 numeric_time:1.17078
mm_time:1.42843 symbolic_time:0.2532 numeric_time:1.17524
mm_time:1.42418 symbolic_time:0.253221 numeric_time:1.17096
mm_time:1.42461 symbolic_time:0.253872 numeric_time:1.17074
mm_time:1.42362 symbolic_time:0.252966 numeric_time:1.17066
bash-4.2$ OMP_NUM_THREADS=64 ./KokkosSparse_spgemm.exe openmp 64 amtx audikw_1.bin algorithm kkdense
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 64 x 1 ]
Using A matrix for B as well
mm_time:1.70767 symbolic_time:0.321831 numeric_time:1.38584
mm_time:1.70334 symbolic_time:0.318716 numeric_time:1.38462
mm_time:1.70252 symbolic_time:0.318125 numeric_time:1.3844
mm_time:1.70231 symbolic_time:0.317844 numeric_time:1.38446
mm_time:1.70372 symbolic_time:0.318226 numeric_time:1.3855
mm_time:1.70266 symbolic_time:0.318109 numeric_time:1.38455
bash-4.2$ OMP_NUM_THREADS=128 ./KokkosSparse_spgemm.exe openmp 128 amtx audikw_1.bin algorithm kkdense
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 128 x 1 ]
Using A matrix for B as well
mm_time:1.16144 symbolic_time:0.215078 numeric_time:0.946365
mm_time:1.15385 symbolic_time:0.211481 numeric_time:0.942369
mm_time:1.15591 symbolic_time:0.21047 numeric_time:0.945443
mm_time:1.15433 symbolic_time:0.209749 numeric_time:0.944576
mm_time:1.15591 symbolic_time:0.209757 numeric_time:0.946151
mm_time:1.15422 symbolic_time:0.209543 numeric_time:0.944674
bash-4.2$ OMP_NUM_THREADS=256 ./KokkosSparse_spgemm.exe openmp 256 amtx audikw_1.bin algorithm kkdense
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 256 x 1 ]
Using A matrix for B as well
mm_time:1.09092 symbolic_time:0.206483 numeric_time:0.884432
mm_time:1.09683 symbolic_time:0.213442 numeric_time:0.883391
mm_time:1.09282 symbolic_time:0.210769 numeric_time:0.882047
mm_time:1.0984 symbolic_time:0.212993 numeric_time:0.885408
mm_time:1.09628 symbolic_time:0.212053 numeric_time:0.884225
mm_time:1.09678 symbolic_time:0.208386 numeric_time:0.888396
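To turn the logs above into a single number per configuration, one option is to average mm_time over the repetitions. The sketch below assumes that the first printed mm_time line is the warm-up run to discard. The function name avg_mm_time is hypothetical.

```shell
# Average the mm_time field of a benchmark log read from stdin.
# Assumption: the first mm_time line is the warm-up run and is skipped.
avg_mm_time() {
  awk '/^mm_time:/ {
         split($1, f, ":")          # f[2] holds the mm_time value
         if (++n > 1) sum += f[2]   # skip the first repetition
       }
       END { if (n > 1) printf "%.5f\n", sum / (n - 1) }'
}

# Usage, from the perf_test folder:
#   OMP_NUM_THREADS=64 ./KokkosSparse_spgemm.exe openmp 64 amtx audikw_1.bin | avg_mm_time
```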
To match how SpGEMM is used in Trilinos, mkl-inspector is called twice, once in the symbolic phase and once in the numeric phase, and there is some post-processing. To benchmark its runtime, we exclude all of this post-processing by passing "verbose mklkeepout 0" to the executable. The timing that we take into account is "Actual DOUBLE MKL SPMM Time Without Free", rather than the mm_time, symbolic_time, and numeric_time used previously. Note that the first call is much more expensive than the rest, so we exclude it.
bash-4.2$ OMP_NUM_THREADS=64 ./KokkosSparse_spgemm.exe openmp 64 amtx audikw_1.bin algorithm mkl mklkeepout 0 verbose
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 64 x 1 ]
Using A matrix for B as well
m:943695 n:943695 k:943695
Actual DOUBLE MKL SPMM Time Without Free:3.74993
Actual DOUBLE MKL SPMM Time:3.75037
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:2.72329
Actual DOUBLE MKL SPMM Time:2.72369
mm_time:6.75529 symbolic_time:3.98355 numeric_time:2.77174
Actual DOUBLE MKL SPMM Time Without Free:2.72345
Actual DOUBLE MKL SPMM Time:2.7239
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:2.72306
Actual DOUBLE MKL SPMM Time:2.7235
mm_time:5.71684 symbolic_time:2.94666 numeric_time:2.77018
Actual DOUBLE MKL SPMM Time Without Free:2.72385
Actual DOUBLE MKL SPMM Time:2.72433
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:2.72334
Actual DOUBLE MKL SPMM Time:2.72379
mm_time:5.71811 symbolic_time:2.94657 numeric_time:2.77154
Actual DOUBLE MKL SPMM Time Without Free:2.72288
Actual DOUBLE MKL SPMM Time:2.72333
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:2.72343
Actual DOUBLE MKL SPMM Time:2.72389
mm_time:5.71555 symbolic_time:2.94453 numeric_time:2.77102
Actual DOUBLE MKL SPMM Time Without Free:2.72238
Actual DOUBLE MKL SPMM Time:2.72283
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:2.72248
Actual DOUBLE MKL SPMM Time:2.72295
mm_time:5.71422 symbolic_time:2.94366 numeric_time:2.77056
Actual DOUBLE MKL SPMM Time Without Free:2.72335
Actual DOUBLE MKL SPMM Time:2.72379
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:2.72312
Actual DOUBLE MKL SPMM Time:2.72357
mm_time:5.71585 symbolic_time:2.94496 numeric_time:2.77089
bash-4.2$ OMP_NUM_THREADS=128 ./KokkosSparse_spgemm.exe openmp 128 amtx audikw_1.bin algorithm mkl mklkeepout 0 verbose
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 128 x 1 ]
Using A matrix for B as well
m:943695 n:943695 k:943695
Actual DOUBLE MKL SPMM Time Without Free:1.95746
Actual DOUBLE MKL SPMM Time:1.95805
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:1.69292
Actual DOUBLE MKL SPMM Time:1.69333
mm_time:3.98053 symbolic_time:2.22713 numeric_time:1.7534
Actual DOUBLE MKL SPMM Time Without Free:1.69162
Actual DOUBLE MKL SPMM Time:1.69203
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:1.69158
Actual DOUBLE MKL SPMM Time:1.69204
mm_time:3.71019 symbolic_time:1.95521 numeric_time:1.75497
Actual DOUBLE MKL SPMM Time Without Free:1.68733
Actual DOUBLE MKL SPMM Time:1.68773
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:1.69338
Actual DOUBLE MKL SPMM Time:1.69379
mm_time:3.70548 symbolic_time:1.95155 numeric_time:1.75392
Actual DOUBLE MKL SPMM Time Without Free:1.68911
Actual DOUBLE MKL SPMM Time:1.68953
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:1.6801
Actual DOUBLE MKL SPMM Time:1.68051
mm_time:3.69032 symbolic_time:1.95066 numeric_time:1.73966
Actual DOUBLE MKL SPMM Time Without Free:1.6984
Actual DOUBLE MKL SPMM Time:1.69883
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:1.68243
Actual DOUBLE MKL SPMM Time:1.68283
mm_time:3.70284 symbolic_time:1.95989 numeric_time:1.74294
Actual DOUBLE MKL SPMM Time Without Free:1.68301
Actual DOUBLE MKL SPMM Time:1.68343
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:1.67717
Actual DOUBLE MKL SPMM Time:1.67759
mm_time:3.68133 symbolic_time:1.94262 numeric_time:1.73872
bash-4.2$ OMP_NUM_THREADS=256 ./KokkosSparse_spgemm.exe openmp 256 amtx audikw_1.bin algorithm mkl mklkeepout 0 verbose
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 256 x 1 ]
Using A matrix for B as well
m:943695 n:943695 k:943695
Actual DOUBLE MKL SPMM Time Without Free:2.23312
Actual DOUBLE MKL SPMM Time:2.23403
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:1.48377
Actual DOUBLE MKL SPMM Time:1.48443
mm_time:4.11499 symbolic_time:2.56608 numeric_time:1.54891
Actual DOUBLE MKL SPMM Time Without Free:1.49798
Actual DOUBLE MKL SPMM Time:1.49866
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:1.47959
Actual DOUBLE MKL SPMM Time:1.48028
mm_time:3.3549 symbolic_time:1.81261 numeric_time:1.54229
Actual DOUBLE MKL SPMM Time Without Free:1.51514
Actual DOUBLE MKL SPMM Time:1.51578
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:1.51419
Actual DOUBLE MKL SPMM Time:1.51482
mm_time:3.41489 symbolic_time:1.83635 numeric_time:1.57855
Actual DOUBLE MKL SPMM Time Without Free:1.4894
Actual DOUBLE MKL SPMM Time:1.49008
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:1.51534
Actual DOUBLE MKL SPMM Time:1.51602
mm_time:3.38422 symbolic_time:1.80693 numeric_time:1.57729
Actual DOUBLE MKL SPMM Time Without Free:1.49757
Actual DOUBLE MKL SPMM Time:1.49819
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:1.4856
Actual DOUBLE MKL SPMM Time:1.48628
mm_time:3.36753 symbolic_time:1.81743 numeric_time:1.5501
Actual DOUBLE MKL SPMM Time Without Free:1.49857
Actual DOUBLE MKL SPMM Time:1.49925
C SIZE:0
Actual DOUBLE MKL SPMM Time Without Free:1.51736
Actual DOUBLE MKL SPMM Time:1.51806
mm_time:3.40202 symbolic_time:1.81891 numeric_time:1.58311
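The MKL logs above interleave several kinds of lines, so it may help to pull out only the "Without Free" timings and drop the expensive first call. The following is a minimal sketch under that reading. The function name avg_mkl_time is hypothetical.

```shell
# Average "Actual DOUBLE MKL SPMM Time Without Free" over a log read from stdin,
# excluding the first call. Lines for "Actual DOUBLE MKL SPMM Time:" (with free)
# do not match the pattern and are ignored.
avg_mkl_time() {
  awk -F: '/^Actual DOUBLE MKL SPMM Time Without Free:/ {
             if (++n > 1) sum += $NF   # $NF is the value after the last colon
           }
           END { if (n > 1) printf "%.5f\n", sum / (n - 1) }'
}

# Usage, from the perf_test folder:
#   OMP_NUM_THREADS=64 ./KokkosSparse_spgemm.exe openmp 64 amtx audikw_1.bin \
#     algorithm mkl mklkeepout 0 verbose | avg_mkl_time
```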
There is some pre- and post-processing because the methods work on 1-based indices. The symbolic time is printed as "Actual MKL2 Symbolic Time" and the numeric time as "Actual MKL2 Numeric Time"; these are the times surrounding the MKL SpGEMM calls.
bash-4.2$ OMP_NUM_THREADS=64 ./KokkosSparse_spgemm.exe openmp 64 amtx audikw_1.bin algorithm mkl2 verbose mklsort 7
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 64 x 1 ]
Using A matrix for B as well
m:943695 n:943695 k:943695
Sort:7 Actual MKL2 Symbolic Time:1.49217
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.61316
mm_time:4.20866 symbolic_time:1.52024 numeric_time:2.68843
Sort:7 Actual MKL2 Symbolic Time:1.46548
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.62388
mm_time:4.19539 symbolic_time:1.49614 numeric_time:2.69926
Sort:7 Actual MKL2 Symbolic Time:1.45934
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.6158
mm_time:4.17825 symbolic_time:1.48679 numeric_time:2.69145
Sort:7 Actual MKL2 Symbolic Time:1.45455
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.61726
mm_time:4.17625 symbolic_time:1.48371 numeric_time:2.69254
Sort:7 Actual MKL2 Symbolic Time:1.45365
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.61739
mm_time:4.17523 symbolic_time:1.48239 numeric_time:2.69284
Sort:7 Actual MKL2 Symbolic Time:1.45837
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.61674
mm_time:4.179 symbolic_time:1.48681 numeric_time:2.69219
bash-4.2$ OMP_NUM_THREADS=128 ./KokkosSparse_spgemm.exe openmp 128 amtx audikw_1.bin algorithm mkl2 verbose mklsort 7
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 128 x 1 ]
Using A matrix for B as well
m:943695 n:943695 k:943695
Sort:7 Actual MKL2 Symbolic Time:1.49747
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.50591
mm_time:4.08896 symbolic_time:1.52182 numeric_time:2.56713
Sort:7 Actual MKL2 Symbolic Time:1.39741
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.50679
mm_time:3.99857 symbolic_time:1.41981 numeric_time:2.57876
Sort:7 Actual MKL2 Symbolic Time:1.39001
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.50741
mm_time:3.97166 symbolic_time:1.41301 numeric_time:2.55865
Sort:7 Actual MKL2 Symbolic Time:1.3902
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.50657
mm_time:3.97159 symbolic_time:1.41346 numeric_time:2.55813
Sort:7 Actual MKL2 Symbolic Time:1.39268
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.5075
mm_time:3.97499 symbolic_time:1.41702 numeric_time:2.55797
Sort:7 Actual MKL2 Symbolic Time:1.38985
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.5079
mm_time:3.97231 symbolic_time:1.41345 numeric_time:2.55886
bash-4.2$ OMP_NUM_THREADS=256 ./KokkosSparse_spgemm.exe openmp 256 amtx audikw_1.bin algorithm mkl2 verbose mklsort 7
B is not provided. Multiplying AxA.
Kokkos::OpenMP thread_pool_topology[ 1 x 256 x 1 ]
Using A matrix for B as well
m:943695 n:943695 k:943695
Sort:7 Actual MKL2 Symbolic Time:1.46614
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.5107
mm_time:4.05512 symbolic_time:1.49686 numeric_time:2.55826
Sort:7 Actual MKL2 Symbolic Time:1.39502
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.51181
mm_time:4.0016 symbolic_time:1.42182 numeric_time:2.57978
Sort:7 Actual MKL2 Symbolic Time:1.38827
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.51616
mm_time:3.97749 symbolic_time:1.41435 numeric_time:2.56314
Sort:7 Actual MKL2 Symbolic Time:1.38636
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.51356
mm_time:3.99414 symbolic_time:1.41294 numeric_time:2.5812
Sort:7 Actual MKL2 Symbolic Time:1.38773
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.51418
mm_time:3.97602 symbolic_time:1.41369 numeric_time:2.56233
Sort:7 Actual MKL2 Symbolic Time:1.38804
C SIZE:662878935
Sort:7 Actual MKL2 Numeric Time:2.51376
mm_time:3.99734 symbolic_time:1.41556 numeric_time:2.58178
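For the mkl2 logs, the symbolic and numeric timings sit on separate lines, so they can be averaged separately. The sketch below assumes that the first repetition is a warm-up and skips it, as with the other algorithms. The function name avg_mkl2_times is hypothetical.

```shell
# Average the MKL2 symbolic and numeric timings from a log read from stdin.
# Assumption: the first repetition of each is a warm-up and is skipped.
avg_mkl2_times() {
  awk -F: '/Actual MKL2 Symbolic Time:/ { if (++ns > 1) sym += $NF }
           /Actual MKL2 Numeric Time:/  { if (++nn > 1) num += $NF }
           END {
             if (ns > 1 && nn > 1)
               printf "symbolic:%.5f numeric:%.5f\n", sym / (ns - 1), num / (nn - 1)
           }'
}

# Usage, from the perf_test folder:
#   OMP_NUM_THREADS=64 ./KokkosSparse_spgemm.exe openmp 64 amtx audikw_1.bin \
#     algorithm mkl2 verbose mklsort 7 | avg_mkl2_times
```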