This repo contains several implementations of SGEMM on the GPU. I tried different optimization techniques, such as tiling, register blocking, and prefetching. The best implementation is sgemm_header_file/kernel9.cuh, and its result is shown below.
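For readers new to these techniques, here is a minimal CUDA sketch of the first one, shared-memory tiling. It is not the code in kernel9.cuh. The names `sgemm_tiled` and `run_sgemm_tiled`, the tile width of 16, and row-major storage are assumptions for illustration. Each block stages one tile of A and one tile of B in shared memory, so every global-memory value is reused TILE times. Register blocking extends the same structure so that each thread accumulates a small block of C in registers instead of a single element. Prefetching loads the next tile while the current one is being multiplied.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical tile width for illustration; not necessarily what kernel9.cuh uses.
constexpr int TILE = 16;

// C = alpha * A * B + beta * C, all row-major: A is MxK, B is KxN, C is MxN.
// Each block computes one TILE x TILE tile of C; each thread computes one element.
__global__ void sgemm_tiled(int M, int N, int K, float alpha,
                            const float* A, const float* B,
                            float beta, float* C) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;  // running dot product, kept in a register

    for (int t = 0; t < K; t += TILE) {
        // Cooperatively load one tile of A and one tile of B; pad out-of-range cells with 0.
        int aCol = t + threadIdx.x;
        int bRow = t + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tiles must be fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // do not overwrite tiles while other threads still read them
    }

    if (row < M && col < N)
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
}

// Hypothetical host helper: copy inputs to the device, launch, copy C back.
// Error checking is omitted for brevity.
void run_sgemm_tiled(int M, int N, int K, float alpha,
                     const std::vector<float>& A, const std::vector<float>& B,
                     float beta, std::vector<float>& C) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * M * K);
    cudaMalloc(&dB, sizeof(float) * K * N);
    cudaMalloc(&dC, sizeof(float) * M * N);
    cudaMemcpy(dA, A.data(), sizeof(float) * M * K, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), sizeof(float) * K * N, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), sizeof(float) * M * N, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    sgemm_tiled<<<grid, block>>>(M, N, K, alpha, dA, dB, beta, dC);

    cudaMemcpy(C.data(), dC, sizeof(float) * M * N, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

The grid is rounded up so that matrix sizes which are not multiples of TILE still work. Out-of-range threads load zeros and skip the final store.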
System: Ubuntu 20.04 under WSL2
Compiler: NVCC 10.1
GPU: GTX 1660 Super
Required compile flag: -lcublas
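The `-lcublas` flag links cuBLAS, which is presumably used as a reference SGEMM. The CUDA sketch below shows what such a baseline call might look like. The helper name `run_cublas_sgemm`, the file name in the build comment, and row-major storage are assumptions. cuBLAS is column-major, so a row-major product C = A·B is computed as the column-major product Cᵀ = Bᵀ·Aᵀ by swapping the operands and passing N, M, K as the dimensions.

```cuda
// Build (hypothetical file name): nvcc sgemm_cublas.cu -o sgemm_cublas -lcublas
#include <cassert>
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: row-major C = alpha * A * B + beta * C using cuBLAS.
// A is MxK, B is KxN, C is MxN. Error checking is omitted for brevity.
void run_cublas_sgemm(int M, int N, int K, float alpha,
                      const std::vector<float>& A, const std::vector<float>& B,
                      float beta, std::vector<float>& C) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * M * K);
    cudaMalloc(&dB, sizeof(float) * K * N);
    cudaMalloc(&dC, sizeof(float) * M * N);
    cudaMemcpy(dA, A.data(), sizeof(float) * M * K, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), sizeof(float) * K * N, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), sizeof(float) * M * N, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Read column-major, the row-major buffers are B^T (NxK, ld N), A^T (KxM, ld K),
    // and C^T (NxM, ld N). Computing C^T = B^T * A^T yields row-major C.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, M, K,
                &alpha, dB, N, dA, K,
                &beta, dC, N);
    cublasDestroy(handle);

    // Copying on the default stream waits for the GEMM to finish.
    cudaMemcpy(C.data(), dC, sizeof(float) * M * N, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```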