[CuBLAS] Add CuBLAS benchmarks #447
Conversation
Some cuBLAS benchmarking results on an RTX 2080 Ti (all measurements are median latencies, in µs):

**Section 1: FP32 matrix multiply, C (bs × m × n) = A (bs × m × k) @ B (bs × k × n)**

Group 1: m = 512, n = 512, k = 512

| bs | cublas_batched_gemm | cublas_strided_gemm | hidet.ops.matmul (optimized) | PyTorch |
|---:|---:|---:|---:|---:|
| 1 | 69.0 | 41.0 | 37.0 | 44.6 |
| 2 | 111.7 | 75.8 | 69.2 | 71.7 |
| 4 | 124.9 | 97.2 | 100.8 | 96.3 |
| 8 | 190.5 | 191.1 | 204.7 | 187.6 |

Group 2: m = 1024, n = 1024, k = 2048

| bs | cublas_batched_gemm | cublas_strided_gemm | hidet.ops.matmul (optimized) | PyTorch |
|---:|---:|---:|---:|---:|
| 1 | 405.1 | 419.2 | 370.7 | 405.1 |
| 2 | 725.3 | 859.9 | 800.8 | 719.2 |
| 4 | 1442 | 1592 | 1606 | 1466 |
| 8 | 2658 | 2830 | 3475 | 2753 |

**Section 2: FP16 matrix multiply, C (bs × m × n) = A (bs × m × k) @ B (bs × k × n)**

Group 1: m = 512, n = 512, k = 512

| bs | cublas_batched_gemm | cublas_strided_gemm | hidet.ops.matmul (optimized) | PyTorch |
|---:|---:|---:|---:|---:|
| 1 | 63.5 | 34.0 | 34.9 | 41.0 |
| 2 | 66.0 | 30.2 | 64.8 | 45.1 |
| 4 | 72.7 | 32.4 | 24.4 | 46.3 |
| 8 | 81.2 | 36.2 | 38.5 | 47.8 |

Group 2: m = 1024, n = 1024, k = 2048

| bs | cublas_batched_gemm | cublas_strided_gemm | hidet.ops.matmul (optimized) | PyTorch |
|---:|---:|---:|---:|---:|
| 1 | 71.0 | 60.1 | 65.5 | 90.6 |
| 2 | 114.8 | 112.3 | 123.1 | 160.5 |
| 4 | 225.1 | 223.4 | 245.6 | 319.8 |
| 8 | 442.8 | 439.1 | 733.2 | 634.8 |
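For context, the following is a minimal sketch of how median latencies like those in the tables above can be measured with CUDA events around a batched matmul. The warmup and repeat counts here are illustrative assumptions, not the exact methodology used in this PR:

```python
# Sketch of a median-latency measurement with CUDA events; the warmup/repeat
# counts are assumptions, not the settings used to produce the tables above.
import statistics
import torch

def median_latency_us(fn, warmup=10, repeat=100):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    times = []
    for _ in range(repeat):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end) * 1000.0)  # elapsed_time is in ms -> us
    return statistics.median(times)

bs, m, n, k = 1, 512, 512, 512
a = torch.randn(bs, m, k, device='cuda')
b = torch.randn(bs, k, n, device='cuda')
print(f'torch.matmul median latency: {median_latency_us(lambda: a @ b):.1f} us')
```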
@yaoyaoding:
We can keep this PR and integrate the benchmark into our regression tests, so that we know how the performance of cuBLAS, the hidet matmul, and the PyTorch matmul compares. We can create an issue to track this and close this PR until we finish the integration. Do we have any similar benchmark in the regression tests?

@vadiklyutiy:
To compare with cuBLAS and PyTorch Inductor I used the following approach. I pass here
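The comment above is truncated before the approach is spelled out. Purely as an illustration, a comparison against PyTorch Inductor along these lines might look like the following sketch; the shapes and settings are hypothetical and not taken from the PR:

```python
# Hypothetical sketch (not the script referenced above) comparing eager
# PyTorch matmul against a torch.compile (Inductor) version.
import torch
from torch.utils import benchmark

def matmul(a, b):
    return a @ b

compiled = torch.compile(matmul)  # torch.compile uses the Inductor backend by default

a = torch.randn(1, 1024, 2048, device='cuda', dtype=torch.float16)
b = torch.randn(1, 2048, 1024, device='cuda', dtype=torch.float16)
compiled(a, b)  # trigger compilation outside the timed region

for name, fn in [('eager', matmul), ('inductor', compiled)]:
    t = benchmark.Timer(stmt='fn(a, b)', globals={'fn': fn, 'a': a, 'b': b})
    m = t.timeit(100)
    print(f'{name}: {m.median * 1e6:.1f} us')  # Measurement.median is in seconds
```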
Please refer to the commit message for benchmark results.