-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[UR][L0][CUDA][HIP] Add enqueue timestamp recording extension #1400
Conversation
This commit adds a new extension feature for recording timestamps into events, the information from which can be queried using the existing profiling queries. Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
include/ur_api.h
Outdated
@@ -1647,7 +1649,7 @@ typedef enum ur_device_info_t { | |||
/// - ::UR_RESULT_ERROR_INVALID_NULL_HANDLE | |||
/// + `NULL == hDevice` | |||
/// - ::UR_RESULT_ERROR_INVALID_ENUMERATION | |||
/// + `::UR_DEVICE_INFO_INTEROP_SEMAPHORE_EXPORT_SUPPORT_EXP < propName` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking you meant to delete this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it needs to be the last enumerator, as it's saying any value after that will result in UR_RESULT_ERROR_INVALID_ENUMERATION
. I was skeptical at first too, but the generated script diff check insisted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is the generator making the change. It's expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could probably be a bit more clever to do proper validation of the enum ranges.
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #1400 +/- ##
==========================================
- Coverage 14.82% 12.46% -2.36%
==========================================
Files 250 240 -10
Lines 36220 36129 -91
Branches 4094 4094
==========================================
- Hits 5369 4504 -865
- Misses 30800 31621 +821
+ Partials 51 4 -47 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hip/cuda impl and tests LGTM.
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
@@ -58,7 +59,7 @@ ur_result_t ur_event_handle_t_::start() { | |||
ur_result_t Result = UR_RESULT_SUCCESS; | |||
|
|||
try { | |||
if (Queue->URFlags & UR_QUEUE_FLAG_PROFILING_ENABLE) { | |||
if (Queue->URFlags & UR_QUEUE_FLAG_PROFILING_ENABLE || isTimestampEvent()) { | |||
// NOTE: This relies on the default stream to be unused. | |||
UR_CHECK_ERROR(hipEventRecord(EvQueued, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JackAKirk - In intel/llvm#12838 it seems like the submission time on HIP is giving weird values. I did a bit of digging and it seems to me like HIP is a little different from CUDA when checking timing-differences between events. Of particular interest here is the following line for hipEventElapsedTime():
Events which are recorded in a NULL stream will block until all commands on all other streams complete execution, and then record the timestamp.
While what we expect here is to get an event with the current time, hence using an otherwise unused stream. Fixing it might be outside the scope of this PR, but a possible solution could be to lazily have a stream specifically for recording submission time of events, tied to the context. Similar could be used in the CUDA backend to avoid the assumption noted above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@konradkusiak97 this sounds like the same thing as your hip event timings issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have opened an issue about it here: intel/llvm#12904.
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
@oneapi-src/unified-runtime-native-cpu-write & @oneapi-src/unified-runtime-level-zero-write - Friendly ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Native CPU lgtm, thank you
@oneapi-src/unified-runtime-level-zero-write Ping. |
@oneapi-src/unified-runtime-level-zero-write ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, just a few nits.
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
@steffenlarsen if you could pull in the latest changes from main and resolve the conflict I can get this merged. |
Thank you, @kbenzie ! Merge commit has been pushed and it should hopefully pass CI again. |
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
oneapi-src#1400 incorrectly removed some device info enumerator cases due to a fault merge conflict resolution. This commit adds them back. Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
std::unique_ptr<ur_event_handle_t_>(ur_event_handle_t_::makeNative( | ||
UR_COMMAND_TIMESTAMP_RECORDING_EXP, hQueue, CuStream)); | ||
UR_CHECK_ERROR(RetImplEvent->start()); | ||
UR_CHECK_ERROR(RetImplEvent->record()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for late review, but start
and record
record 3 separate native events. Perhaps it is useful to have a separate EvQueued
and EvStart
using this extension, but EvEnd
should be unnecessary. Do you think it'd be worth having a event::checkpoint
member func that minimizes the number of native events recorded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand your suggestion. Using this extension, the timestamp event should behave like a normal event, where the native events have the following behavior:
EvQueued
: Should finish immediately as this is when the timestamp event was enqueued.EvStart
: Should be when the queue finishes.EvEnd
: Same as above as there is no work related to the actual enqueued "command".
Since EvStart
and EvEnd
are mostly the same, we could use just one, but I am not convinced there's so much overhead from recording an additional event that it's worth the additional complexity of having to ignore EvStart
or EvEnd
if it is a timestamp event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for response. Perhaps we can continue discussion here #1613
This commit adds a new extension feature for recording timestamps into events, the information from which can be queried using the existing profiling queries.
This new functionality is currently implemented for L0, CUDA and HIP.
Corresponding SYCL extension implementation: intel/llvm#12838