This repository contains a number of serial and parallel benchmarks for matrix multiplication in C++. Matrix multiplication is a wonderful first operation to try your hand at optimizing for the following reasons:
- It is a very common operation in popular fields (e.g., ML)
- The optimizations are fairly easy to understand (they primarily deal with simple access patterns)
- The optimizations are composable (they work better together!)
- It is fairly easy to parallelize
And many more!
The benchmarks in this repository were written using Google Benchmark. For simplicity, all benchmarks assume square N x N matrices, where N is 384, 768, or 1152.
The following sections break down the benchmarks contained in each subdirectory.
serial_mmul_bench
- Baseline serial mmul implementation (using the classical triply-nested for loop)
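For reference, the baseline looks roughly like the following minimal sketch (assuming row-major N x N matrices of doubles; the function name and types are illustrative, not the repository's exact code):

```cpp
#include <cstddef>
#include <vector>

// Classic i/j/k triply-nested loop: C += A * B for row-major N x N matrices
void serial_mmul(const std::vector<double> &A, const std::vector<double> &B,
                 std::vector<double> &C, std::size_t N) {
  for (std::size_t i = 0; i < N; i++)      // rows of C
    for (std::size_t j = 0; j < N; j++)    // columns of C
      for (std::size_t k = 0; k < N; k++)  // dot product of row i of A and column j of B
        C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```

Note that the innermost loop strides down a column of B, touching elements N apart in memory; that access pattern is what most of the later benchmarks attack.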
parallel_mmul_bench
- Baseline parallel mmul implementation (splits rows of output matrix across threads)
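A minimal sketch of the row-splitting scheme, assuming `std::thread` and an N evenly divisible by the thread count (names are illustrative):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Each thread computes a contiguous chunk of rows of C; no locking is
// needed because the row ranges are disjoint
void parallel_mmul(const double *A, const double *B, double *C,
                   std::size_t N, std::size_t num_threads) {
  auto worker = [=](std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; i++)
      for (std::size_t j = 0; j < N; j++)
        for (std::size_t k = 0; k < N; k++)
          C[i * N + j] += A[i * N + k] * B[k * N + j];
  };
  std::vector<std::thread> threads;
  const std::size_t rows = N / num_threads;  // assumes an even split
  for (std::size_t t = 0; t < num_threads; t++)
    threads.emplace_back(worker, t * rows, (t + 1) * rows);
  for (auto &thread : threads) thread.join();
}
```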
blocked_mmul_bench
- A serial mmul implementation which processes a block of elements at a time to exploit locality in the B matrix
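The core idea, sketched with a hypothetical 16-element block (16 floats fill one 64-byte cache line, and the benchmarked sizes 384/768/1152 are all multiples of 16):

```cpp
#include <cstddef>

constexpr std::size_t kBlock = 16;  // illustrative block size (one cache line of floats)

// Blocking the k and j loops means each cache line of B fetched in the
// inner loops is fully used before it can be evicted
void blocked_mmul(const float *A, const float *B, float *C, std::size_t N) {
  for (std::size_t i = 0; i < N; i++)
    for (std::size_t kk = 0; kk < N; kk += kBlock)
      for (std::size_t jj = 0; jj < N; jj += kBlock)
        for (std::size_t k = kk; k < kk + kBlock; k++)
          for (std::size_t j = jj; j < jj + kBlock; j++)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```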
blocked_aligned_mmul_bench
- Same as blocked_mmul_bench, but using 64-byte aligned allocations to prevent blocks from spanning cache lines
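One way to obtain that alignment, sketched with C++17's `std::aligned_alloc` (the blocked kernel itself is unchanged):

```cpp
#include <cstddef>
#include <cstdlib>

// std::aligned_alloc requires the size to be a multiple of the alignment,
// which holds here since N is a multiple of 16 floats (64 bytes)
float *alloc_matrix_64(std::size_t N) {
  return static_cast<float *>(std::aligned_alloc(64, N * N * sizeof(float)));
}
// ... run the blocked kernel, then release each matrix with std::free()
```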
parallel_blocked_mmul_bench
- A parallel blocked mmul implementation (splits rows of output matrix across threads)
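A sketch of the combination, assuming an even row split and the same illustrative block size as above:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Each thread runs the blocked kernel over its own slice of rows
void parallel_blocked_mmul(const float *A, const float *B, float *C,
                           std::size_t N, std::size_t num_threads) {
  constexpr std::size_t kBlock = 16;
  auto worker = [=](std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; i++)
      for (std::size_t kk = 0; kk < N; kk += kBlock)
        for (std::size_t jj = 0; jj < N; jj += kBlock)
          for (std::size_t k = kk; k < kk + kBlock; k++)
            for (std::size_t j = jj; j < jj + kBlock; j++)
              C[i * N + j] += A[i * N + k] * B[k * N + j];
  };
  std::vector<std::thread> threads;
  const std::size_t rows = N / num_threads;
  for (std::size_t t = 0; t < num_threads; t++)
    threads.emplace_back(worker, t * rows, (t + 1) * rows);
  for (auto &thread : threads) thread.join();
}
```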
blocked_column_aligned_mmul_bench
- A serial mmul implementation which processes a block of elements at a time, but traverses the blocks of output elements in column-major order so that the columns of B stay resident in cache between consecutive output blocks
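A sketch of the traversal change (same illustrative 16-element block): the strip of B columns used by one output block is reused immediately by the output block directly below it:

```cpp
#include <cstddef>

constexpr std::size_t kBlock = 16;  // illustrative block size

// Outer loop order: column strip first, then row, so the kBlock columns
// of B touched by one output block stay hot for the block below it
void blocked_column_mmul(const float *A, const float *B, float *C,
                         std::size_t N) {
  for (std::size_t jj = 0; jj < N; jj += kBlock)  // column strip of C
    for (std::size_t i = 0; i < N; i++)           // every row within the strip
      for (std::size_t kk = 0; kk < N; kk += kBlock)
        for (std::size_t k = kk; k < kk + kBlock; k++)
          for (std::size_t j = jj; j < jj + kBlock; j++)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```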
parallel_blocked_column_mmul_bench
- A parallel blocked column implementation (splits columns between threads) where work is statically mapped
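A sketch of the static mapping, assuming each thread's column strip is a multiple of the block size wide (threads write disjoint columns of C, so no synchronization is needed):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

void parallel_blocked_column_mmul(const float *A, const float *B, float *C,
                                  std::size_t N, std::size_t num_threads) {
  constexpr std::size_t kBlock = 16;
  // Thread t statically owns columns [t * cols, (t + 1) * cols)
  auto worker = [=](std::size_t col_begin, std::size_t col_end) {
    for (std::size_t jj = col_begin; jj < col_end; jj += kBlock)
      for (std::size_t i = 0; i < N; i++)
        for (std::size_t kk = 0; kk < N; kk += kBlock)
          for (std::size_t k = kk; k < kk + kBlock; k++)
            for (std::size_t j = jj; j < jj + kBlock; j++)
              C[i * N + j] += A[i * N + k] * B[k * N + j];
  };
  std::vector<std::thread> threads;
  const std::size_t cols = N / num_threads;
  for (std::size_t t = 0; t < num_threads; t++)
    threads.emplace_back(worker, t * cols, (t + 1) * cols);
  for (auto &thread : threads) thread.join();
}
```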
blocked_column_multi_output_aligned_mmul_bench
- A serial mmul implementation which processes a tile of output elements at a time, exploiting locality from each tile of B across output tile elements
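A sketch using a hypothetical 4 x 4 output tile accumulated in a small local array, so each element of B loaded for the tile is reused across all four output rows:

```cpp
#include <cstddef>

constexpr std::size_t kTile = 4;  // illustrative output-tile dimension

void multi_output_mmul(const float *A, const float *B, float *C,
                       std::size_t N) {
  for (std::size_t jj = 0; jj < N; jj += kTile)    // column-major tile order
    for (std::size_t ii = 0; ii < N; ii += kTile) {
      float acc[kTile][kTile] = {};  // local tile of partial sums
      for (std::size_t k = 0; k < N; k++)
        for (std::size_t i = 0; i < kTile; i++)
          for (std::size_t j = 0; j < kTile; j++)
            acc[i][j] += A[(ii + i) * N + k] * B[k * N + jj + j];
      for (std::size_t i = 0; i < kTile; i++)      // write the finished tile
        for (std::size_t j = 0; j < kTile; j++)
          C[(ii + i) * N + jj + j] = acc[i][j];
    }
}
```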
parallel_blocked_column_multi_output_mmul_bench
- A parallel blocked column multi output implementation (splits columns between threads) where work is statically mapped
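A sketch of the combination; it is the same tile kernel with its column loop restricted to each thread's statically assigned strip (assumed to be a multiple of the tile size wide):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

void parallel_multi_output_mmul(const float *A, const float *B, float *C,
                                std::size_t N, std::size_t num_threads) {
  constexpr std::size_t kTile = 4;
  // Same tile kernel as above, over a statically owned column strip
  auto worker = [=](std::size_t col_begin, std::size_t col_end) {
    for (std::size_t jj = col_begin; jj < col_end; jj += kTile)
      for (std::size_t ii = 0; ii < N; ii += kTile) {
        float acc[kTile][kTile] = {};
        for (std::size_t k = 0; k < N; k++)
          for (std::size_t i = 0; i < kTile; i++)
            for (std::size_t j = 0; j < kTile; j++)
              acc[i][j] += A[(ii + i) * N + k] * B[k * N + jj + j];
        for (std::size_t i = 0; i < kTile; i++)
          for (std::size_t j = 0; j < kTile; j++)
            C[(ii + i) * N + jj + j] = acc[i][j];
      }
  };
  std::vector<std::thread> threads;
  const std::size_t cols = N / num_threads;
  for (std::size_t t = 0; t < num_threads; t++)
    threads.emplace_back(worker, t * cols, (t + 1) * cols);
  for (auto &thread : threads) thread.join();
}
```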
baseline_cuda_mmul
- A naive mmul implementation for NVIDIA GPUs written in CUDA
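The shape of a naive kernel, one thread per output element (a sketch, not necessarily the repository's exact code):

```cuda
// Each thread computes one element of C, streaming a full row of A and
// column of B from global memory with no reuse between threads
__global__ void naive_mmul(const float *A, const float *B, float *C, int N) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < N && col < N) {
    float sum = 0.0f;
    for (int k = 0; k < N; k++)
      sum += A[row * N + k] * B[k * N + col];
    C[row * N + col] = sum;
  }
}
```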
shmem_cuda_mmul
- A cache-tiled mmul implementation for NVIDIA GPUs using shared memory
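A sketch of the shared-memory tiling, assuming N is a multiple of the tile width (true for 384/768/1152 with a 16-wide tile) and a grid sized exactly to the matrix:

```cuda
#define TILE 16  // illustrative tile width (threads per block dimension)

// Each block stages one TILE x TILE tile of A and B into shared memory at
// a time, so every global load is reused by TILE threads
__global__ void shmem_mmul(const float *A, const float *B, float *C, int N) {
  __shared__ float sA[TILE][TILE];
  __shared__ float sB[TILE][TILE];
  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float sum = 0.0f;
  for (int t = 0; t < N / TILE; t++) {
    sA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
    sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
    __syncthreads();  // tile fully loaded before anyone reads it
    for (int k = 0; k < TILE; k++)
      sum += sA[threadIdx.y][k] * sB[k][threadIdx.x];
    __syncthreads();  // done with this tile before the next load
  }
  C[row * N + col] = sum;
}
```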
Contact information
- Email: CoffeeBeforeArch@gmail.com
- YouTube: Link
- Twitter: Link