FastKron is a fast library for computing Generalized Matrix Kronecker-Matrix Multiplication (GeMKM) and Generalized Kronecker-Matrix Matrix Multiplication (GeKMM) on NVIDIA GPUs and x86 CPUs. FastKron uses specialized algorithms and implementations of GeMKM and GeKMM rather than building on existing linear algebra operations. It avoids extra transposes and adds further optimizations, including fusion of multiple kernels. As a result, FastKron performs orders of magnitude better than baselines based on GPyTorch, NVIDIA cuTENSOR, and HPTT. FastKron provides a C++ library as well as a Python module with NumPy functions and PyTorch autograd functions. The optimized implementations support the float and double data types, while the NumPy/PyTorch functions fall back to the shuffle algorithm for other types.
For more details, see the paper Fast Kronecker Matrix-Matrix Multiplication on GPUs (PPoPP 2024), cited at the end of this document.
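As a concrete reference for what the two operations compute, the short NumPy sketch below materializes the Kronecker product explicitly (something FastKron itself avoids); the sizes and array names are illustrative only.

```python
import numpy as np
from functools import reduce

# Three illustrative Kronecker factors, each of shape (P, Q) = (4, 3).
rng = np.random.default_rng(0)
factors = [rng.standard_normal((4, 3)) for _ in range(3)]
kron = reduce(np.kron, factors)          # full Kronecker product, shape (64, 27)

# GeMKM: a matrix X times the Kronecker product.
x = rng.standard_normal((16, 64))
y_gemkm = x @ kron                       # shape (16, 27)

# GeKMM: the Kronecker product times a matrix Z.
z = rng.standard_normal((27, 16))
y_gekmm = kron @ z                       # shape (64, 16)
```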
We compare FastKron's GeMKM and GeKMM against the shuffle algorithm in GPyTorch, running on PyTorch 2.5.1. The tables below show the range of speedups on different hardware and data types.

Speedup of GeMKM:
Hardware | Float | Double |
---|---|---|
AMD 64-Core CPU with AVX | 9.3-45x | 5.8-21x |
AMD 64-Core CPU with AVX512 | 9.7-38x | 6.3-21x |
NVIDIA A100 80 GB | 1.5-9.5x | 1.1-9.5x |
NVIDIA V100 16 GB | 2.5-10x | 1.9-11x |
Speedup of GeKMM:

Hardware | Float | Double |
---|---|---|
AMD 64-Core CPU with AVX | 2.7-13.7x | 1.5-7x |
AMD 64-Core CPU with AVX512 | 2.2-14x | 2-7x |
NVIDIA A100 80 GB | 1.3-4.6x | 0.9-4.5x |
NVIDIA V100 16 GB | 1.4-6.4x | 2-7.8x |
For more information see documents/performance.md
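For context, the shuffle algorithm used as the baseline also avoids forming the full Kronecker product; it contracts one factor at a time with a reshape-matmul-transpose loop. The NumPy sketch below shows this standard formulation (it is not GPyTorch's exact implementation):

```python
import numpy as np
from functools import reduce

def shuffle_gemkm(x, factors):
    # Computes x @ (factors[0] ⊗ ... ⊗ factors[-1]) one factor at a time.
    # x: (M, prod(P_i)); factors[i]: (P_i, Q_i).
    m = x.shape[0]
    y = x
    for f in reversed(factors):
        p, q = f.shape
        y = y.reshape(-1, p) @ f                  # contract the current factor
        y = y.reshape(m, -1, q)                   # (M, remaining, Q_i)
        y = y.transpose(0, 2, 1).reshape(m, -1)   # move the new axis forward
    return y

# Sanity check against the explicit Kronecker product on small sizes.
rng = np.random.default_rng(0)
fs = [rng.standard_normal((4, 3)) for _ in range(3)]
x = rng.standard_normal((16, 4 ** 3))
assert np.allclose(shuffle_gemkm(x, fs), x @ reduce(np.kron, fs))
```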
| | Linux | WSL2 | Windows | Mac |
|---|---|---|---|---|
x86 | ✅ | ✅ | 🐍 | 🐍 |
ARM | 🐍 | 🐍 | 🐍 | 🐍 |
AVX256 | ✅ | ✅ | 🐍 | 🐍 |
AVX512 | ✅ | ✅ | 🐍 | 🐍 |
SM50+ CUDA cores | ✅ | ✅ | 🐍 | 🐍 |
SM80+ Tensor cores | ❌ | ❌ | 🐍 | 🐍 |
AMD RoCM | 🐍 | 🐍 | 🐍 | 🐍 |
✅ FastKron supports optimized implementations for AVX256 and AVX512 CPUs and NVIDIA GPUs.
❌ Tensor cores for double are not supported.
🐍 Supported only through the Python module: x86 CPUs older than the x86-64-v2 microarchitecture level, ARM CPUs, AMD GPUs, Windows, and macOS are not supported by the C++ API, but PyFastKron falls back to the shuffle algorithm in NumPy or PyTorch.
The future roadmap, in order of priority, is: Windows support, double-precision Tensor cores on SM80+, AMD GPUs, and ARM CPUs.
The directory example/ includes examples of using FastKron's CUDA and x86 backends from both C++ and Python. Before running an example, follow the instructions below to build FastKron.
PyFastKron can be installed using pip:

```bash
pip install pyfastkron
```
PyFastKron's CUDA backend is built with CUDA 12.3 but is compatible with CUDA 11.8 and above.
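A minimal usage sketch with the PyTorch wrapper is shown below. The module and function names (`pyfastkron.fastkrontorch`, `gemkm`) are assumptions here; documents/python-api.md has the authoritative names and signatures for the installed version.

```python
import torch
import pyfastkron.fastkrontorch as fk   # module name assumed; see documents/python-api.md

# X of shape (M, P**N) times the Kronecker product of N factors of shape (P, Q).
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(16, 8 ** 4, dtype=torch.float32, device=device)
fs = [torch.randn(8, 8, dtype=torch.float32, device=device) for _ in range(4)]

y = fk.gemkm(x, fs)   # assumed GeMKM entry point
print(y.shape)        # expected: torch.Size([16, 4096])
```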
Build the C++ library, libFastKron.so, to use with C++ programs, or the Python module, PyFastKron, to use with PyTorch or NumPy programs.
On Ubuntu:

```bash
sudo apt update && sudo apt install gcc linux-headers-$(uname -r) make g++ git python3-dev wget unzip python3-pip build-essential devscripts debhelper fakeroot intel-mkl cmake
```
Install CUDA 11+ from https://developer.nvidia.com/cuda/.

Clone the repository with its submodules using:

```bash
git clone --recurse-submodules https://github.com/abhijangda/fastkron.git
```
If the repository is already cloned and you only want to fetch the submodules, use:

```bash
git submodule update --init --recursive
```
Build FastKron as a C++ library using the commands below:

```bash
mkdir -p build/ && cd build/
cmake ..
make -j
```
To install, run:

```bash
make install
```
By default, both the x86 and CUDA backends are built. Use the CMake option `-DENABLE_CUDA=OFF` to disable the CUDA backend or `-DENABLE_X86=OFF` to disable the x86 backend, for example `cmake -DENABLE_CUDA=OFF ..`.
Run the x86 CPU tests using:

```bash
make run-x86-tests
```
Run the CUDA tests using:

```bash
make run-cuda-tests
```
Install PyFastKron from the source tree using pip:

```bash
pip install .
```
Run the Python tests using:

```bash
pytest
```
* C++ API: documents/cpp-api.md
* Python API: documents/python-api.md
* Kernel Tuning: documents/autotuning.md
* Performance: documents/performance.md
* Multi-GPU: documents/multigpu.md
* Contributing: documents/contributing.md
```bibtex
@inproceedings{10.1145/3627535.3638489,
author = {Jangda, Abhinav and Yadav, Mohit},
title = {Fast Kronecker Matrix-Matrix Multiplication on GPUs},
year = {2024},
isbn = {9798400704352},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3627535.3638489},
doi = {10.1145/3627535.3638489},
booktitle = {Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming},
pages = {390–403},
numpages = {14},
keywords = {graphics processing units, CUDA, kronecker product, linear algebra},
location = {Edinburgh, United Kingdom},
series = {PPoPP '24}
}
```