[CuBLAS] Add CuBLAS benchmarks #447

Open: 1 commit to merge into base: main

yudi0201 (Collaborator) commented on Apr 5, 2024

Please refer to the commit message for benchmark results.

CuBLAS benchmarking results on an RTX 2080 Ti (all measurements are median latencies):

SECTION 1
FP32 Matrix Multiply: C (bs x m x n) = A (bs x m x k) @ B (bs x k x n)
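
For readers unfamiliar with the notation, the batched product above means each of the bs matrix pairs is multiplied independently. The following is only a minimal pure-Python reference sketch of those semantics on nested lists; the name `batched_matmul` is hypothetical, and this is not how CuBLAS, hidet, or PyTorch compute the result.

```python
def batched_matmul(a, b):
    """Reference C[i] = A[i] @ B[i] for A (bs x m x k) and B (bs x k x n)."""
    if len(a) != len(b):
        raise ValueError("batch sizes differ")
    out = []
    for a_i, b_i in zip(a, b):
        k = len(b_i)       # shared inner dimension
        n = len(b_i[0])    # number of output columns
        if any(len(row) != k for row in a_i):
            raise ValueError("inner dimensions differ")
        # One output row per row of A[i]; each entry is a length-k dot product.
        out.append([
            [sum(row[p] * b_i[p][j] for p in range(k)) for j in range(n)]
            for row in a_i
        ])
    return out


# bs = 1, m = n = k = 2
print(batched_matmul([[[1, 2], [3, 4]]], [[[5, 6], [7, 8]]]))
```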

Group 1 results with m = 512, n = 512, k = 512
bs = 1:
cublas_batched_gemm            69.0us
cublas_strided_gemm            41.0us
hidet.ops.matmul optimized     37.0us
PyTorch                        44.6us

bs = 2:
cublas_batched_gemm            111.7us
cublas_strided_gemm            75.8us
hidet.ops.matmul optimized     69.2us
PyTorch                        71.7us

bs = 4:
cublas_batched_gemm            124.9us
cublas_strided_gemm            97.2us
hidet.ops.matmul optimized     100.8us
PyTorch                        96.3us

bs = 8:
cublas_batched_gemm            190.5us
cublas_strided_gemm            191.1us
hidet.ops.matmul optimized     204.7us
PyTorch                        187.6us

Group 2 results with m = 1024, n = 1024, k = 2048
bs = 1:
cublas_batched_gemm            405.1us
cublas_strided_gemm            419.2us
hidet.ops.matmul optimized     370.7us
PyTorch                        405.1us

bs = 2:
cublas_batched_gemm            725.3us
cublas_strided_gemm            859.9us
hidet.ops.matmul optimized     800.8us
PyTorch                        719.2us

bs = 4:
cublas_batched_gemm            1442us
cublas_strided_gemm            1592us
hidet.ops.matmul optimized     1606us
PyTorch                        1466us

bs = 8:
cublas_batched_gemm            2658us
cublas_strided_gemm            2830us
hidet.ops.matmul optimized     3475us
PyTorch                        2753us

SECTION 2
FP16 Matrix Multiply: C (bs x m x n) = A (bs x m x k) @ B (bs x k x n)

Group 1 results with m = 512, n = 512, k = 512
bs = 1:
cublas_batched_gemm            63.5us
cublas_strided_gemm            34.0us
hidet.ops.matmul optimized     34.9us
PyTorch                        41.0us

bs = 2:
cublas_batched_gemm            66.0us
cublas_strided_gemm            30.2us
hidet.ops.matmul optimized     64.8us
PyTorch                        45.1us

bs = 4:
cublas_batched_gemm            72.7us
cublas_strided_gemm            32.4us
hidet.ops.matmul optimized     24.4us
PyTorch                        46.3us

bs = 8:
cublas_batched_gemm            81.2us
cublas_strided_gemm            36.2us
hidet.ops.matmul optimized     38.5us
PyTorch                        47.8us

Group 2 results with m = 1024, n = 1024, k = 2048
bs = 1:
cublas_batched_gemm            71.0us
cublas_strided_gemm            60.1us
hidet.ops.matmul optimized     65.5us
PyTorch                        90.6us

bs = 2:
cublas_batched_gemm            114.8us
cublas_strided_gemm            112.3us
hidet.ops.matmul optimized     123.1us
PyTorch                        160.5us

bs = 4:
cublas_batched_gemm            225.1us
cublas_strided_gemm            223.4us
hidet.ops.matmul optimized     245.6us
PyTorch                        319.8us

bs = 8:
cublas_batched_gemm            442.8us
cublas_strided_gemm            439.1us
hidet.ops.matmul optimized     733.2us
PyTorch                        634.8us
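
To compare the latencies above across problem sizes, it can help to convert them to achieved throughput. The sketch below assumes the usual count of 2 * bs * m * n * k floating-point operations per batched GEMM (one multiply and one add per inner-product term); the helper name `gemm_tflops` is hypothetical and is not part of this PR.

```python
def gemm_tflops(bs, m, n, k, latency_us):
    """Achieved TFLOP/s for a batched GEMM, assuming 2*bs*m*n*k operations."""
    flops = 2 * bs * m * n * k
    seconds = latency_us * 1e-6
    return flops / seconds / 1e12


# Latencies copied from the results above.
fp32_hidet = gemm_tflops(1, 512, 512, 512, 37.0)        # FP32, Group 1, bs = 1
fp16_strided = gemm_tflops(8, 1024, 1024, 2048, 439.1)  # FP16, Group 2, bs = 8
print(f"FP32 hidet.ops.matmul, bs=1: {fp32_hidet:.2f} TFLOP/s")
print(f"FP16 cublas_strided_gemm, bs=8: {fp16_strided:.2f} TFLOP/s")
```

Small problems such as the bs = 1, 512 x 512 x 512 case are likely dominated by launch overhead, so throughput figures are more informative for the larger Group 2 shapes.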