[SYCL][Graph] Update doc for UR PR moving reset commands to a dedicated cmd-list (#12770)

Update the design doc.
Update the UR tag.

---------

Co-authored-by: Ewan Crawford <ewan@codeplay.com>
Co-authored-by: Kenneth Benzie (Benie) <k.benzie83@gmail.com>
3 people authored Mar 14, 2024
1 parent b188783 commit 8850a97
Showing 4 changed files with 95 additions and 49 deletions.
130 changes: 89 additions & 41 deletions sycl/doc/design/CommandGraph.md
@@ -250,59 +250,107 @@ there are no parameters to take a wait-list, and the only sync primitive
returned is blocking on host.

In order to achieve the expected UR command-buffer enqueue semantics with Level
Zero, the adapter implementation adds extra commands to the Level Zero
command-list representing a UR command-buffer.

* Prefix - Commands added to the start of the L0 command-list by L0 adapter.
* Suffix - Commands added to the end of the L0 command-list by L0 adapter.

These extra commands operate on L0 event synchronisation primitives, used by the
command-list to interact with the external UR wait-list and UR return event
required for the enqueue interface.

The `ur_exp_command_buffer_handle_t` class for this adapter contains a
*SignalEvent* which signals the completion of the command-list in the suffix,
and is reset in the prefix. This signal is detected by a new UR return event
created on UR command-buffer enqueue.

There is also a *WaitEvent* used by the `ur_exp_command_buffer_handle_t` class
in the prefix to wait on any dependencies passed in the enqueue wait-list.
This WaitEvent is reset in the suffix.

A command-buffer is expected to be submitted multiple times. Consequently,
Zero, the adapter implementation needs extra commands.

* Prefix - Commands added **before** the graph workload.
* Suffix - Commands added **after** the graph workload.

These extra commands operate on L0 event synchronisation primitives,
used by the command-list to interact with the external UR wait-list
and UR return event required for the enqueue interface.
Unlike the graph workload (i.e. the commands needed to execute the graph),
the external UR wait-list and UR return event are submission dependent,
which means they can change from one submission to the next.

For performance reasons, the command-list that executes the graph
workload is built only once (during the command-buffer finalization stage).
This allows the adapter to save time when submitting the command-buffer,
by executing only this command-list (i.e. without appending the commands
of the graph workload again).
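
To illustrate the saving described above, the following sketch counts how many commands would have to be appended over repeated submissions. It is a simplified cost model, not adapter code: the function names and the idea of a fixed number of extra commands per submission are assumptions made for illustration only.

```cpp
#include <cstddef>

// Hypothetical cost model: commands appended when the graph-workload
// command-list is rebuilt on every submission.
constexpr std::size_t appendsIfRebuilt(std::size_t graphCommands,
                                       std::size_t submissions) {
  return graphCommands * submissions;
}

// Commands appended when the workload is built once at finalization and
// each submission only appends a fixed number of extra commands
// (an assumption, e.g. one wait barrier and one signal barrier).
constexpr std::size_t appendsIfBuiltOnce(std::size_t graphCommands,
                                         std::size_t submissions,
                                         std::size_t extraPerSubmission) {
  return graphCommands + extraPerSubmission * submissions;
}
```

Under these assumptions, a graph of 1000 commands submitted 10 times costs 10000 appends when rebuilt each time, against 1020 when built once with two extra commands per submission.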

#### Prefix

The prefix's commands aim to:
1. Handle the list of events to wait on, which is passed by the runtime
when the UR command-buffer enqueue function is called.
As mentioned above, this list of events changes from one submission
to the next.
Consequently, managing this mutable dependency in the graph-workload
command-list implies rebuilding the command-list for each submission
(note that this can change with mutable command-lists).
To avoid the significant time penalty of rebuilding this potentially large
command-list each time, we prefer to add an extra command handling the
wait-list into another command-list (*wait command-list*).
This command-list consists of a single L0 command: a barrier that waits for
the dependencies passed in the wait-list and signals an event
called *WaitEvent* when the barrier is complete.
This *WaitEvent* is defined in the `ur_exp_command_buffer_handle_t` class.
At the front of the graph-workload command-list, an extra barrier command
waiting for this event is added (when the command-buffer is created).
This ensures that the graph workload does not start running before
its dependencies have completed.
The *WaitEvent* is reset in the suffix.


2. Reset the events associated with the command-buffer, except the
*WaitEvent*.
L0 events need to be explicitly reset by an API call
(an L0 command in our case).
Since a command-buffer is expected to be submitted multiple times,
we need to ensure that L0 events associated with graph commands have not
been signaled by a previous execution. These events are therefore reset to the
non-signaled state before running the actual graph associated commands. Note
non-signaled state before running the graph-workload command-list. Note
that this reset is performed in the prefix and not in the suffix to avoid
additional synchronization w.r.t profiling data extraction.

If a command-buffer is about to be submitted to a queue with the profiling
property enabled, an extra command is added that copies timestamps of L0 events
associated with graph commands into dedicated memory attached to the
returned UR event. This memory stores the profiling information that
corresponds to the current submission of the command-buffer.

![L0 command-buffer diagram](images/L0_UR_command-buffer-v3.jpg)
We use a separate command-list (*reset command-list*) for performance reasons:
* This allows the *WaitEvent* to be signaled directly on the host if
the wait-list is empty, thus avoiding the need to submit a command-list.
* Enqueuing a reset L0 command for all events in the command-buffer is time
consuming, especially for large graphs.
However, this task is not needed for every submission, but only once, when the
command-buffer is fixed, i.e. when the command-buffer is finalized. Decoupling
the reset command-list from the wait command-list allows us to
create and enqueue the reset commands when finalizing the command-buffer,
and only create the wait command-list at submission.

This command-list consists of a reset command for each of the graph commands
and another reset command for the event we use to signal the completion
of the graph workload. This event is called *SignalEvent* and is defined
in the `ur_exp_command_buffer_handle_t` class.
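
The prefix behaviour described above can be sketched with a small host-only model. This is not the adapter's implementation and makes no Level Zero calls: `Event`, `PrefixModel`, `finalize` and `runPrefix` are hypothetical names, events are plain flags, and every command is assumed to complete as soon as it is submitted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an L0 event: a signaled / non-signaled flag.
struct Event {
  bool signaled = false;
};

// Toy model of the prefix; none of these names come from the adapter.
struct PrefixModel {
  std::vector<Event> commandEvents;   // one event per graph command
  Event waitEvent;                    // models *WaitEvent*
  Event signalEvent;                  // models *SignalEvent*
  std::size_t resetCommands = 0;      // size of the reset command-list
  std::vector<std::string> submitted; // command-lists sent to the device

  explicit PrefixModel(std::size_t graphCommands)
      : commandEvents(graphCommands) {}

  // Finalization: the reset command-list is filled once, with one reset
  // per graph command plus one for SignalEvent.
  void finalize() { resetCommands = commandEvents.size() + 1; }

  // Prefix of one submission; `dependencies` models the UR wait-list size.
  void runPrefix(std::size_t dependencies) {
    assert(resetCommands == commandEvents.size() + 1 &&
           "finalize() must run before the first submission");
    // The prebuilt reset command-list returns events to non-signaled.
    submitted.push_back("reset");
    for (Event &e : commandEvents)
      e.signaled = false;
    signalEvent.signaled = false;

    if (dependencies == 0) {
      // Empty wait-list: WaitEvent is signaled from the host, so no
      // wait command-list is submitted.
      waitEvent.signaled = true;
    } else {
      // Otherwise a single-barrier wait command-list signals WaitEvent
      // (dependencies are assumed to complete immediately in this model).
      submitted.push_back("wait");
      waitEvent.signaled = true;
    }
  }
};
```

In this model, a command-buffer with three graph commands records four reset commands at finalization. A submission with an empty wait-list then submits only the reset command-list, while a submission with dependencies also submits the wait command-list.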

#### Suffix

The suffix's commands aim to:
1) Handle the completion of the graph workload and signal
a UR return event.
To do so, a command that signals the *SignalEvent* is added at the end of
the graph-workload command-list (when the command-buffer is finalized).
In an additional command-list (*signal command-list*), a barrier waiting for
this event is also added.
This barrier signals, in turn, the UR return event that has been defined by
the runtime layer when calling the `urCommandBufferEnqueueExp` function.

2) Manage profiling. If a command-buffer is about to be submitted to
a queue with the profiling property enabled, an extra command is added that
copies timestamps of L0 events associated with graph commands into dedicated
memory attached to the returned UR event.
This memory stores the profiling information that corresponds to
the current submission of the command-buffer.
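
As a rough illustration of the two suffix duties, the sketch below models the signal command-list as a function that only signals the return event once *SignalEvent* has fired, and copies timestamps when profiling is enabled. `ReturnEvent` and `runSuffix` are hypothetical names, and copying the timestamps before signaling the return event is an assumption of this model rather than something the design text specifies.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the UR return event of one submission.
struct ReturnEvent {
  bool signaled = false;
  bool hasProfiling = false;
  std::vector<std::uint64_t> timestamps; // per-submission profiling memory
};

// Toy model of the suffix. `commandTimestamps` models the timestamps held
// by the L0 events of the graph commands after the workload has run.
inline ReturnEvent runSuffix(bool signalEventSignaled,
                             const std::vector<std::uint64_t> &commandTimestamps,
                             bool profilingEnabled) {
  ReturnEvent re;
  // The barrier in the signal command-list waits on SignalEvent.
  if (!signalEventSignaled)
    return re;
  if (profilingEnabled) {
    // Extra command: copy timestamps into memory attached to the event.
    re.timestamps = commandTimestamps;
    re.hasProfiling = true;
    assert(re.timestamps.size() == commandTimestamps.size());
  }
  // The barrier then signals the UR return event.
  re.signaled = true;
  return re;
}
```

Because each call returns a fresh `ReturnEvent` with its own copy of the timestamps, the model reflects the point that profiling data belongs to a single submission and is not overwritten when the command-buffer is submitted again.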

![L0 command-buffer diagram](images/L0_UR_command-buffer-v5.jpg)

For a call to `urCommandBufferEnqueueExp` with an `event_list` *EL*,
command-buffer *CB*, and return event *RE* our implementation has to submit two
new command-lists for the above approach to work. One before
command-buffer *CB*, and return event *RE* our implementation has to submit
three new command-lists for the above approach to work. Two before
the command-list with extra commands associated with *CB*, and the other
after *CB*. These two new command-lists are retrieved from the UR queue, which
after *CB*. These new command-lists are retrieved from the UR queue, which
will likely reuse existing command-lists and only create a new one in the worst
case.

The L0 command-list created on `urCommandBufferEnqueueExp` to execute **before**
*CB* contains a single command. This command is a barrier on *EL* that signals
*CB*'s *WaitEvent* when completed.

The L0 command-list created on `urCommandBufferEnqueueExp` to execute **after**
*CB* also contains a single command. This command is a barrier on *CB*'s
*SignalEvent* that signals *RE* when completed.

#### Drawbacks

There are two drawbacks of this approach to implementing UR command-buffers for
There are three drawbacks of this approach to implementing UR command-buffers for
Level Zero:

1. 3x the command-list resources are used, if there are many UR command-buffers in
Binary file removed sycl/doc/design/images/L0_UR_command-buffer-v3.jpg
Binary file not shown.
14 changes: 6 additions & 8 deletions sycl/plugins/unified_runtime/CMakeLists.txt
@@ -57,15 +57,13 @@ if(SYCL_PI_UR_USE_FETCH_CONTENT)
include(FetchContent)

set(UNIFIED_RUNTIME_REPO "https://github.com/oneapi-src/unified-runtime.git")
# commit d99d5f742cea18d7204c59c4320b8ea0329b49eb (HEAD -> main)
# Merge: f17c0e91 c3809c61
# commit 418ad5354ca24a6dfbd01df803949855b7a6c3dd
# Merge: d99d5f74 26682290
# Author: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
# Date: Wed Mar 13 19:47:39 2024 +0000
#
# Merge pull request #1431 from zhaomaosu/fix-ocl-adapter-tear-down
#
# [CL] Gracefully tear down adapter in case that some globals have been released
set(UNIFIED_RUNTIME_TAG d99d5f742cea18d7204c59c4320b8ea0329b49eb)
# Date: Thu Mar 14 10:19:56 2024 +0000
# Merge pull request #1365 from Bensuo/maxime/improve-L0-cmd-buffer-enqueing
# [EXP][CMDBUF] Move event reset commands to dedicated cmd-list
set(UNIFIED_RUNTIME_TAG 418ad5354ca24a6dfbd01df803949855b7a6c3dd)

if(SYCL_PI_UR_OVERRIDE_FETCH_CONTENT_REPO)
set(UNIFIED_RUNTIME_REPO "${SYCL_PI_UR_OVERRIDE_FETCH_CONTENT_REPO}")