
[CuBLAS] Add CuBLAS benchmarks #447

Open · wants to merge 1 commit into main

Conversation

@yudi0201 (Collaborator) commented Apr 5, 2024

Please refer to the commit message for benchmark results.

Some CuBLAS benchmarking results on RTX2080 TI (all measurements are median latencies):
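As background, here is a minimal sketch of what "median latency" means operationally. This is a pure-Python CPU-timer illustration only; the numbers below were presumably collected with GPU-side timing (e.g. CUDA events), and `median_latency_us` is a name invented for this example:

```python
import statistics
import time

def median_latency_us(fn, warmup=10, repeats=100):
    """Median wall-clock latency of fn() in microseconds.

    Illustrative CPU-timer sketch only; GPU kernels should be timed
    with CUDA events or the framework's own benchmarking utilities.
    """
    for _ in range(warmup):  # warm up caches / lazy initialization
        fn()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e6)
    return statistics.median(samples)

lat = median_latency_us(lambda: sum(range(1000)))
assert lat > 0.0
```

The median is reported rather than the mean because it is robust to occasional outlier runs (scheduler preemption, clock ramp-up).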

SECTION 1
FP32 Matrix Multiply: C (bs x m x n) = A (bs x m x k) @ B (bs x k x n)
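Spelled out as a runnable sketch of the semantics (NumPy stand-in using the Group 1 shapes at bs = 2; the benchmark itself runs these shapes through cuBLAS, hidet, and PyTorch on GPU):

```python
import numpy as np

# Batched matmul: C[b] = A[b] @ B[b] for every batch index b.
bs, m, n, k = 2, 512, 512, 512  # Group 1 shapes, bs = 2
A = np.random.rand(bs, m, k).astype(np.float32)
B = np.random.rand(bs, k, n).astype(np.float32)

C = A @ B  # matmul broadcasts over the leading batch dimension
assert C.shape == (bs, m, n)

# Equivalent explicit loop over the batch:
C_loop = np.stack([A[b] @ B[b] for b in range(bs)])
assert np.allclose(C, C_loop, atol=1e-3)
```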

Group 1 results with m = 512, n = 512, k = 512
bs = 1:
cublas_batched_gemm            69.0us
cublas_strided_gemm            41.0us
hidet.ops.matmul optimized     37.0us
PyTorch                        44.6us

bs = 2:
cublas_batched_gemm            111.7us
cublas_strided_gemm            75.8us
hidet.ops.matmul optimized     69.2us
PyTorch                        71.7us

bs = 4:
cublas_batched_gemm            124.9us
cublas_strided_gemm            97.2us
hidet.ops.matmul optimized     100.8us
PyTorch                        96.3us

bs = 8:
cublas_batched_gemm            190.5us
cublas_strided_gemm            191.1us
hidet.ops.matmul optimized     204.7us
PyTorch                        187.6us

Group 2 results with m = 1024, n = 1024, k = 2048
bs = 1:
cublas_batched_gemm            405.1us
cublas_strided_gemm            419.2us
hidet.ops.matmul optimized     370.7us
PyTorch                        405.1us

bs = 2:
cublas_batched_gemm            725.3us
cublas_strided_gemm            859.9us
hidet.ops.matmul optimized     800.8us
PyTorch                        719.2us

bs = 4:
cublas_batched_gemm            1442us
cublas_strided_gemm            1592us
hidet.ops.matmul optimized     1606us
PyTorch                        1466us

bs = 8:
cublas_batched_gemm            2658us
cublas_strided_gemm            2830us
hidet.ops.matmul optimized     3475us
PyTorch                        2753us

SECTION 2
FP16 Matrix Multiply: C (bs x m x n) = A (bs x m x k) @ B (bs x k x n)
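The difference between the two cuBLAS variants measured here is the batch layout: batched GEMM (`cublasGemmBatchedEx`) consumes arrays of per-matrix device pointers, while strided batched GEMM (`cublasGemmStridedBatchedEx`) takes one base pointer per operand plus a fixed element stride. A NumPy sketch of the strided addressing (the helper name and tiny shapes are invented for illustration):

```python
import numpy as np

# Strided-batched layout: each operand lives in one contiguous buffer,
# and batch b starts at offset b * stride (stride = m*k for A, k*n for B).
bs, m, n, k = 4, 8, 8, 8
A_flat = np.random.rand(bs * m * k).astype(np.float16)
B_flat = np.random.rand(bs * k * n).astype(np.float16)

def strided_view(buf, b, rows, cols):
    # Emulates the base_ptr + b * stride indexing used by strided
    # batched GEMM, with no per-matrix pointer array needed.
    stride = rows * cols
    return buf[b * stride:(b + 1) * stride].reshape(rows, cols)

C = np.stack([
    strided_view(A_flat, b, m, k) @ strided_view(B_flat, b, k, n)
    for b in range(bs)
])
assert C.shape == (bs, m, n)
```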

Group 1 results with m = 512, n = 512, k = 512
bs = 1:
cublas_batched_gemm            63.5us
cublas_strided_gemm            34.0us
hidet.ops.matmul optimized     34.9us
PyTorch                        41.0us

bs = 2:
cublas_batched_gemm            66.0us
cublas_strided_gemm            30.2us
hidet.ops.matmul optimized     64.8us
PyTorch                        45.1us

bs = 4:
cublas_batched_gemm            72.7us
cublas_strided_gemm            32.4us
hidet.ops.matmul optimized     24.4us
PyTorch                        46.3us

bs = 8:
cublas_batched_gemm            81.2us
cublas_strided_gemm            36.2us
hidet.ops.matmul optimized     38.5us
PyTorch                        47.8us

Group 2 results with m = 1024, n = 1024, k = 2048
bs = 1:
cublas_batched_gemm            71.0us
cublas_strided_gemm            60.1us
hidet.ops.matmul optimized     65.5us
PyTorch                        90.6us

bs = 2:
cublas_batched_gemm            114.8us
cublas_strided_gemm            112.3us
hidet.ops.matmul optimized     123.1us
PyTorch                        160.5us

bs = 4:
cublas_batched_gemm            225.1us
cublas_strided_gemm            223.4us
hidet.ops.matmul optimized     245.6us
PyTorch                        319.8us

bs = 8:
cublas_batched_gemm            442.8us
cublas_strided_gemm            439.1us
hidet.ops.matmul optimized     733.2us
PyTorch                        634.8us
@yudi0201 force-pushed the cublas_benchmarks branch from b4f1323 to 1035fcb on April 5, 2024 18:22
@vadiklyutiy (Collaborator)

@yaoyaoding
Do you know whether we need this PR?

@yaoyaoding (Member)

We can keep this PR and integrate the benchmark into our regression tests, so that we know how cuBLAS, hidet matmul, and PyTorch matmul compare in performance. We can create an issue to track this and close this PR until we finish the integration. Do we have any similar benchmark in the regression tests? @vadiklyutiy
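One possible shape for that integration, sketched as a self-contained pytest-style check. All names, workloads, and thresholds here are hypothetical; a real regression test would time the cuBLAS / hidet / PyTorch GPU kernels and compare against recorded baselines:

```python
import statistics
import time

import numpy as np

def median_seconds(fn, warmup=5, repeats=30):
    """Median wall-clock latency of fn() in seconds (CPU-timer sketch)."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

def test_matmul_latency_regression():
    # Hypothetical regression check: time two routes to the same matmul
    # and fail only on a gross slowdown. A real version would benchmark
    # cublas_strided_gemm / hidet.ops.matmul / torch.matmul on GPU.
    A = np.random.rand(256, 256).astype(np.float32)
    B = np.random.rand(256, 256).astype(np.float32)
    t_ref = median_seconds(lambda: A @ B)
    t_new = median_seconds(lambda: np.matmul(A, B))
    assert t_ref > 0 and t_new > 0
    # Very loose bound purely to avoid flakiness in this sketch;
    # a real test would compare against stored baseline numbers.
    assert t_new < 100 * t_ref

test_matmul_latency_regression()
```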

@vadiklyutiy (Collaborator)

To compare against cuBLAS and PyTorch Inductor, I used the following approach: I pass different backend and mode values here
https://github.com/hidet-org/hidet/blob/main/tests/benchmarks/run_tests.py#L56-L57
backend=inductor, mode=eager corresponds to cutlass (not cuBLAS)
backend=inductor, mode=eager corresponds to Inductor with Triton tuning

3 participants