-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[lapack][blas][cuda] Update host task impl to use enqueue_native_command #572
Conversation
See SYCL_EXT_ONEAPI_ENQUEUE_NATIVE_COMMAND for details. Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
See SYCL_EXT_ONEAPI_ENQUEUE_NATIVE_COMMAND extension document for details. Generalize helpers funcs and use them for blas l1, l2, l3, batch Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
cublas_native_named_func Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the PR
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@hdelan please review this when you are back. |
I have a small patch ready to update cublas backend just a little to implement missing |
I would prefer to have a separate PR to make it clearer which commit implements what. |
Yeah OK, I'll be patient, thanks. |
Hi @oneapi-src/onemkl-blas-write @oneapi-src/onemkl-lapack-write Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The BLAS part looks good, thank you. I'm less familiar with the LAPACK part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please provide more details about why the LAPACK tests required updating? I see your comment in the description of the PR, but I still have a few questions. Thanks!
Also, if the LAPACK tests do need updating, how will that impact testing the other (non-cuSOLVER) backends?
this dep check is overzealous because it enforces that a dependent event cannot be submitted to run on the native device queue but not completed before a later event it is dependent upon has also been marked running on the device. This is not part of the sycl spec and unnecessarily slows down execution. Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
These funcs are async in the cusolver backend. Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all your work here to improve performance, @JackAKirk !
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
This reverts commit 61c9a53.
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Description
Update host task impl to use enqueue_native_command for blas/lapack using the cuda backend (cublas/cusolver). I did both backends in a single PR because the cusolver backend uses the cublas backend of oneMKL.
The sycl_ext_codeplay_enqueue_native_command extension reduces latency wrt the host_task for native library submissions, and allows integration with sycl task_graph / events. See https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_codeplay_enqueue_native_command.asciidoc
for details.
This extension has already been shown to lead to considerable performance improvements for applications that call oneMKL, such as Gromacs for the oneMKL fft backend. We expect similar improvements for the lapack and blas backends implemented here.
I had to update the lapack tests because they previously relied on the synchronous behaviour of the native calls due to the fact we had to sync the native streams, since previously with standard host_task we are not able to integrate the native event into the sycl task_graph/ sycl::event.
I did not need to update the blas tests since they already take into account asynchronous behaviour.
Checklist
#216 is for the most part fixed, but technically this PR maximally enables ooo queue interoperability so we can say that this
fixes #216
All Submissions
I've added a test for each backend for each of the possible codepaths:
test_main_blas_ct_host_task.txt
test_main_blas_rt_host_task.txt
test_main_lapack_rt_native_command.txt
test_main_lapack_ct_native_command.txt
test_main_lapack_ct_host_task.txt
test_main_lapack_rt_host_task.txt
test_main_blas_ct_res_native_command.txt
test_main_blas_rt_res_native_command.txt