[lapack][blas][cuda] Update host task impl to use enqueue_native_command #572
Description
Update the host task implementation to use enqueue_native_command for the blas and lapack domains on the CUDA backends (cuBLAS/cuSOLVER). I did both backends in a single PR because the cuSOLVER backend uses the cuBLAS backend of oneMKL.
The sycl_ext_codeplay_enqueue_native_command extension reduces latency relative to host_task for native library submissions, and allows the native work to be integrated with the SYCL task graph and SYCL events. See https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_codeplay_enqueue_native_command.asciidoc for details.
This extension has already been shown to lead to considerable performance improvements for applications that call oneMKL, such as GROMACS with the oneMKL fft backend. We expect similar improvements for the lapack and blas backends implemented here.
I had to update the lapack tests because they relied on the native calls behaving synchronously. That behaviour was a side effect of having to sync the native streams: with a standard host_task we were not able to integrate the native event into the SYCL task graph / sycl::event.
I did not need to update the blas tests since they already take into account asynchronous behaviour.
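To illustrate why a test that assumes synchronous completion breaks once the call becomes asynchronous, here is a minimal toy model in plain C++. It is not SYCL and not the oneMKL test code: `ToyQueue`, `scale_async` and `check_scale_waits_for_result` are hypothetical names made up for this sketch, and `wait()` only stands in for waiting on the returned `sycl::event`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical toy queue (not SYCL): enqueue() only records the work,
// wait() runs everything recorded so far. This mimics a call that returns
// before the device has produced its results.
class ToyQueue {
public:
    void enqueue(std::function<void()> work) { pending_.push_back(std::move(work)); }

    void wait() {
        for (auto& work : pending_) work();
        pending_.clear();
    }

private:
    std::vector<std::function<void()>> pending_;
};

// Hypothetical stand-in for an asynchronous library routine: out = alpha * in.
// The work is only recorded here; it runs when the queue is waited on.
void scale_async(ToyQueue& q, const std::vector<double>& in,
                 std::vector<double>& out, double alpha) {
    q.enqueue([&in, &out, alpha] {
        for (std::size_t i = 0; i < in.size(); ++i) out[i] = alpha * in[i];
    });
}

// Returns true only if the result is absent before wait() and correct after it.
bool check_scale_waits_for_result() {
    ToyQueue q;
    const std::vector<double> in{1.0, 2.0, 3.0};
    std::vector<double> out(in.size(), 0.0);

    scale_async(q, in, out, 2.0);

    // A test that read `out` here would see stale data.
    const bool untouched_before_wait = (out == std::vector<double>(in.size(), 0.0));

    q.wait();  // plays the role of waiting on the returned event

    const bool correct_after_wait = (out == std::vector<double>{2.0, 4.0, 6.0});
    return untouched_before_wait && correct_after_wait;
}
```

In this model a check placed directly after `scale_async` fails, while the same check placed after `q.wait()` passes. That is the kind of adjustment the lapack tests needed.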
Checklist
All Submissions
I've added a test for each backend for each of the possible codepaths:
test_main_blas_ct_host_task.txt
test_main_blas_rt_host_task.txt
test_main_lapack_rt_native_command.txt
test_main_lapack_ct_native_command.txt
test_main_lapack_ct_host_task.txt
test_main_lapack_rt_host_task.txt
test_main_blas_ct_res_native_command.txt
test_main_blas_rt_res_native_command.txt