[CuBLAS] Add CuBLAS benchmarks #447
Conversation
Some cuBLAS benchmarking results on an RTX 2080 Ti (all measurements are median latencies, in µs):

**Section 1: FP32 matrix multiply, C (bs × m × n) = A (bs × m × k) @ B (bs × k × n)**

Group 1: m = 512, n = 512, k = 512

| bs | cublas_batched_gemm | cublas_strided_gemm | hidet.ops.matmul (optimized) | PyTorch |
|---:|---:|---:|---:|---:|
| 1 | 69.0 | 41.0 | 37.0 | 44.6 |
| 2 | 111.7 | 75.8 | 69.2 | 71.7 |
| 4 | 124.9 | 97.2 | 100.8 | 96.3 |
| 8 | 190.5 | 191.1 | 204.7 | 187.6 |

Group 2: m = 1024, n = 1024, k = 2048

| bs | cublas_batched_gemm | cublas_strided_gemm | hidet.ops.matmul (optimized) | PyTorch |
|---:|---:|---:|---:|---:|
| 1 | 405.1 | 419.2 | 370.7 | 405.1 |
| 2 | 725.3 | 859.9 | 800.8 | 719.2 |
| 4 | 1442 | 1592 | 1606 | 1466 |
| 8 | 2658 | 2830 | 3475 | 2753 |

**Section 2: FP16 matrix multiply, C (bs × m × n) = A (bs × m × k) @ B (bs × k × n)**

Group 1: m = 512, n = 512, k = 512

| bs | cublas_batched_gemm | cublas_strided_gemm | hidet.ops.matmul (optimized) | PyTorch |
|---:|---:|---:|---:|---:|
| 1 | 63.5 | 34.0 | 34.9 | 41.0 |
| 2 | 66.0 | 30.2 | 64.8 | 45.1 |
| 4 | 72.7 | 32.4 | 24.4 | 46.3 |
| 8 | 81.2 | 36.2 | 38.5 | 47.8 |

Group 2: m = 1024, n = 1024, k = 2048

| bs | cublas_batched_gemm | cublas_strided_gemm | hidet.ops.matmul (optimized) | PyTorch |
|---:|---:|---:|---:|---:|
| 1 | 71.0 | 60.1 | 65.5 | 90.6 |
| 2 | 114.8 | 112.3 | 123.1 | 160.5 |
| 4 | 225.1 | 223.4 | 245.6 | 319.8 |
| 8 | 442.8 | 439.1 | 733.2 | 634.8 |
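For context, the following is a minimal sketch of how median latencies like those in the tables above can be measured with CUDA events around a batched matmul. The warmup and repeat counts here are illustrative assumptions, not the exact methodology used in this PR:

```python
# Sketch of a median-latency measurement with CUDA events; the warmup/repeat
# counts are assumptions, not the settings used to produce the tables above.
import statistics
import torch

def median_latency_us(fn, warmup=10, repeat=100):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    times = []
    for _ in range(repeat):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end) * 1000.0)  # elapsed_time is in ms -> us
    return statistics.median(times)

bs, m, n, k = 1, 512, 512, 512
a = torch.randn(bs, m, k, device='cuda')
b = torch.randn(bs, k, n, device='cuda')
print(f'torch.matmul median latency: {median_latency_us(lambda: a @ b):.1f} us')
```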
@yaoyaoding:
We can keep this PR and integrate the benchmark into our regression tests, so that we know how the performance of cuBLAS, the hidet matmul, and the PyTorch matmul compares. We can create an issue to track this and close this PR until we finish the integration. Do we have any similar benchmark in the regression tests?

@vadiklyutiy:
To compare with cuBLAS and PyTorch Inductor I used the following approach. I pass here
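The comment above is truncated before the approach is spelled out. Purely as an illustration, a comparison against PyTorch Inductor along these lines might look like the following sketch; the shapes and settings are hypothetical and not taken from the PR:

```python
# Hypothetical sketch (not the script referenced above) comparing eager
# PyTorch matmul against a torch.compile (Inductor) version.
import torch
from torch.utils import benchmark

def matmul(a, b):
    return a @ b

compiled = torch.compile(matmul)  # torch.compile uses the Inductor backend by default

a = torch.randn(1, 1024, 2048, device='cuda', dtype=torch.float16)
b = torch.randn(1, 2048, 1024, device='cuda', dtype=torch.float16)
compiled(a, b)  # trigger compilation outside the timed region

for name, fn in [('eager', matmul), ('inductor', compiled)]:
    t = benchmark.Timer(stmt='fn(a, b)', globals={'fn': fn, 'a': a, 'b': b})
    m = t.timeit(100)
    print(f'{name}: {m.median * 1e6:.1f} us')  # Measurement.median is in seconds
```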
Please refer to the commit message for benchmark results.