diff --git a/sycl/doc/design/CommandGraph.md b/sycl/doc/design/CommandGraph.md
index 8f5d715a32925..51dd698901ccf 100644
--- a/sycl/doc/design/CommandGraph.md
+++ b/sycl/doc/design/CommandGraph.md
@@ -3,8 +3,12 @@
 This document describes the implementation design of the
 [SYCL Graph Extension](../extensions/experimental/sycl_ext_oneapi_graph.asciidoc).
 
-A related presentation can be found
-[here](https://www.youtube.com/watch?v=aOTAmyr04rM).
+## Resources
+
+* A recording of a presentation on the extension can be found
+  [on YouTube](https://www.youtube.com/watch?v=aOTAmyr04rM).
+* A blog post introducing the extension can be found
+  [here](https://codeplay.com/portal/blogs/2024/01/22/sycl-graphs).
 
 ## Requirements
 
@@ -15,13 +19,11 @@
 represents a series of operations to be enqueued to the backend device and
 their dependencies. A single command-graph can be partitioned into more than
 one command-buffer by the runtime. The SYCL Graph extension distinguishes
 between backends that support the command-buffer extension
-and those that do not. Currently command-buffer extensions are only supported
-by Level Zero. All other backends would fall back to an emulation mode, or not
-be reported as supported.
+and those that do not, and only reports support for the extension using an
+aspect on backends that do support command-buffers.
 
-The emulation mode targets support of functionality only, without potentially
-resulting performance improvements, i.e. execution of a closed Level Zero
-command-list multiple times.
+See the [Backend Implementation](#backend-implementation) section of this
+document for details of the support provided by the different SYCL backends.
 
 ### UR Command-Buffer Experimental Feature
 
@@ -163,12 +165,13 @@
 potential backward dependencies that could be created using the `make_edge`
 function.
 
 ### Example
 
-The partitioning process is achieved is two main stages:
+The partitioning process is achieved in two main stages:
 
 1 - Nodes are assigned to a temporary group/partition.
 
-2 - Once all the nodes have been annotated with a group number,
-actual partitions are created based on these annotations.
+2 - Once all the nodes have been annotated with a group number, actual
+partitions are created based on these annotations.
 
 The following diagrams show the annotation process:
 
@@ -179,10 +182,9 @@ The following diagrams show the annotation process:
 ![Graph partition illustration step 5.](images/SYCL-Graph-partitions_step5.jpg)
 ![Graph partition illustration step 6.](images/SYCL-Graph-partitions_step6.jpg)
 
-Now consider a slightly different graph.
-We used the `make_edge` function to create a dependency between Node E and
-Node HT1. The first 5 steps are identical.
-However, from the step 6 the process changes and a group merge is needed as
+Now consider a slightly different graph. We used the `make_edge` function to
+create a dependency between Node E and Node HT1. The first 5 steps are identical.
+However, from step 6 the process changes and a group merge is needed as
 illustrated in the following diagrams:
 
 ![Graph partition illustration step 6b.](images/SYCL-Graph-partitions_step7.jpg)
 
@@ -193,31 +195,28 @@
 ![Graph partition illustration step 11b.](images/SYCL-Graph-partitions_step12.jpg)
 
 ### Multiple Roots Execution Flow
-The following diagram shows the partitions of a graph with two roots
-and a host-task in each branch.
+
+The following diagram shows the partitions of a graph with two roots and a
+host-task in each branch.
 
 ![Multiple roots graph partition illustration.](images/SYCL-Graph-multiple_roots_partitions.jpg)
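+
+For illustration, a graph with this shape could be created as in the following
+sketch (kernel bodies and variable names are hypothetical; only the structure
+matters):
+
+```cpp
+namespace sycl_ext = sycl::ext::oneapi::experimental;
+
+sycl::queue Queue;
+sycl_ext::command_graph Graph(Queue.get_context(), Queue.get_device());
+
+// Two independent root nodes.
+auto RootA = Graph.add([&](sycl::handler &CGH) {
+  CGH.single_task([=]() { /* kernel A */ });
+});
+auto RootB = Graph.add([&](sycl::handler &CGH) {
+  CGH.single_task([=]() { /* kernel B */ });
+});
+
+// A host-task in each branch, which splits each branch into its own partition.
+auto HostA = Graph.add(
+    [&](sycl::handler &CGH) { CGH.host_task([]() { /* host work A */ }); },
+    {sycl_ext::property::node::depends_on(RootA)});
+auto HostB = Graph.add(
+    [&](sycl::handler &CGH) { CGH.host_task([]() { /* host work B */ }); },
+    {sycl_ext::property::node::depends_on(RootB)});
+
+auto ExecGraph = Graph.finalize();
+Queue.ext_oneapi_graph(ExecGraph);
+```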
 
-When executing this graph, the partitions were enqueued one after the other,
-with each partition waiting for the previous one to complete
-(see top of the following diagram).
-However, for multi-root graph, this behavior adds unnecessary dependency
-between partitions, slowing down the execution of the whole graph.
-Now, we keep track of the actual predecessors of each partition and
-only enforce dependencies between partitions when necessary.
-In our example, the extra dependency is therefore removed and
-both branches can be executed concurrently.
-But as we can see on this diagram, this new approach can involve
-multiple execution tails, which leads to difficulties when
-we want to know when the graph execution has finished.
-To cope with this issue, the events associated to the completion of
-each partition are linked to the event returned to users.
-Hence, when the returned event is complete, we can guarantee that
-all work associated with the graph has been completed.
+When executing this graph, the partitions were previously enqueued one after
+the other, with each partition waiting for the previous one to complete (see
+top of the following diagram). However, for multi-root graphs, this behavior
+adds unnecessary dependencies between partitions, slowing down the execution
+of the whole graph. Now, we keep track of the actual predecessors of each
+partition and only enforce dependencies between partitions when necessary.
+In our example, the extra dependency is therefore removed and both branches can
+be executed concurrently. But as we can see in this diagram, this new approach
+can involve multiple execution tails, which leads to difficulties when we want
+to know when the graph execution has finished. To cope with this issue, the
+events associated with the completion of each partition are linked to the
+event returned to users. Hence, when the returned event is complete, we can
+guarantee that all work associated with the graph has been completed.
 
 ![Multiple roots graph partition execution flow.](images/SYCL-Graph-partition_execution_flow.jpg)
 
-
 ## Memory handling: Buffer and Accessor
 
 There is no extra support for graph-specific USM allocations in the current
@@ -232,8 +231,8 @@
 yet been implemented.
 
 ## Backend Implementation
 
-Implementation of UR command-buffers
-for each of the supported SYCL 2020 backends.
+This section describes the implementation of UR command-buffers for each of
+the supported SYCL 2020 backends.
 
 Backends which are implemented currently are: [Level Zero](#level-zero),
 [CUDA](#cuda), [HIP](#hip) and partial support for [OpenCL](#opencl).
 
@@ -255,98 +254,88 @@
 Zero, the adapter implementation needs extra commands.
 
 * Prefix - Commands added **before** the graph workload.
 * Suffix - Commands added **after** the graph workload.
 
-These extra commands operate on L0 event synchronisation primitives,
-used by the command-list to interact with the external UR wait-list
-and UR return event required for the enqueue interface.
-Unlike the graph workload (i.e. commands needed to perform the graph workload)
-the external UR wait-list and UR return event are submission dependent,
-which mean they can change from one submission to the next.
+These extra commands operate on L0 event synchronisation primitives, used by
+the command-list to interact with the external UR wait-list and UR return event
+required for the enqueue interface. Unlike the graph workload (i.e. the
+commands needed to perform the graph workload), the external UR wait-list and
+UR return event are submission dependent, which means they can change from one
+submission to the next.
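+
+For reference, the enqueue entry point that supplies this per-submission
+wait-list and return event has roughly the following shape (paraphrased; see
+the UR experimental headers for the authoritative declaration):
+
+```cpp
+// Submits the fixed graph workload in hCommandBuffer to hQueue. The wait-list
+// and the returned event differ from one submission to the next.
+ur_result_t urCommandBufferEnqueueExp(
+    ur_exp_command_buffer_handle_t hCommandBuffer,
+    ur_queue_handle_t hQueue,
+    uint32_t numEventsInWaitList,
+    const ur_event_handle_t *phEventWaitList,
+    ur_event_handle_t *phEvent);
+```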
 
-For performance concerns, the command-list that will execute the graph
-workload is made only once (during the command-buffer finalization stage).
-This allows the adapter to save time when submitting the command-buffer,
-by executing only this command-list (i.e. without enqueuing any commands
-of the graph workload).
+For performance reasons, the command-list that will execute the graph workload
+is created only once (during the command-buffer finalization stage). This
+allows the adapter to save time when submitting the command-buffer, by
+executing only this command-list (i.e. without enqueuing any commands of the
+graph workload).
 
 #### Prefix
 
-The prefix's commands aim to:
-1. Handle the the list on events to wait on, which is passed by the runtime
-when the UR command-buffer enqueue function is called.
-As mentioned above, this list of events changes from one submission
-to the next.
-Consequently, managing this mutable dependency in the graph-workload
-command-list implies rebuilding the command-list for each submission
-(note that this can change with mutable command-list).
-To avoid the signifiant time penalty of rebuilding this potentially large
-command-list each time, we prefer to add an extra command handling the
-wait list into another command-list (*wait command-list*).
-This command-list consists of a single L0 command: a barrier that waits for
-dependencies passed by the wait-list and signals a signal
-called *WaitEvent* when the barrier is complete.
-This *WaitEvent* is defined in the `ur_exp_command_buffer_handle_t` class.
-In the front of the graph workload command list, an extra barrier command
-waiting for this event is added (when the command-buffer is created).
-This ensures that the graph workload does not start running before
-the dependencies to be completed.
-The *WaitEvent* event is reset in the suffix.
-
-
-2. Reset events associated with the command-buffer except the
-*WaitEvent* event.
-Indeed, L0 events needs to be explicitly reset by an API call
-(L0 command in our case).
-Since a command-buffer is expected to be submitted multiple times,
-we need to ensure that L0 events associated with graph commands have not
-been signaled by a previous execution. These events are therefore reset to the
-non-signaled state before running the graph-workload command-list. Note
-that this reset is performed in the prefix and not in the suffix to avoid
-additional synchronization w.r.t profiling data extraction.
-We use a new command list (*reset command-list*) for performance concerns.
+The prefix's commands aim to:
+1. Handle the list of events to wait on, which is passed by the runtime when
+the UR command-buffer enqueue function is called. As mentioned above, this
+list of events changes from one submission to the next. Consequently, managing
+this mutable dependency in the graph-workload command-list implies rebuilding
+the command-list for each submission (note that this can change with mutable
+command-lists). To avoid the significant time penalty of rebuilding this
+potentially large command-list each time, we prefer to add an extra command
+handling the wait list into another command-list (*wait command-list*), as
+sketched after this list. This command-list consists of a single L0 command: a
+barrier that waits for the dependencies passed by the wait-list and signals an
+event called *WaitEvent* when the barrier is complete. This *WaitEvent* is
+defined in the `ur_exp_command_buffer_handle_t` class. At the front of the
+graph workload command-list, an extra barrier command waiting for this event
+is added (when the command-buffer is created). This ensures that the graph
+workload does not start running before its dependencies have completed. The
+*WaitEvent* event is reset in the suffix.
+
+2. Reset events associated with the command-buffer except the *WaitEvent* event.
+Indeed, L0 events need to be explicitly reset by an API call (an L0 command in
+our case). Since a command-buffer is expected to be submitted multiple times,
+we need to ensure that L0 events associated with graph commands have not been
+signalled by a previous execution. These events are therefore reset to the
+non-signalled state before running the graph-workload command-list. Note that
+this reset is performed in the prefix and not in the suffix to avoid additional
+synchronization w.r.t. profiling data extraction. We use a new command-list
+(*reset command-list*) for performance reasons.
 Indeed:
- * This allows the *WaitEvent* to be signaled directly on the host if
- the waiting list is empty, thus avoiding the need to submit a command list.
+ * This allows the *WaitEvent* to be signalled directly on the host if the
+   waiting list is empty, thus avoiding the need to submit a command list.
 * Enqueuing a reset L0 command for all events in the command-buffer is time
-  consumming, especially for large graphs.
-  However, this task is not needed for every submission, but only once, when the
-  command-buffer is fixed, i.e. when the command-buffer is finalized. The
-  decorellation between the reset command-list and the wait command-list allow us to
-  create and enqueue the reset commands when finalizing the command-buffer,
-  and only create the wait command-list at submission.
-
-This command list is consist of a reset command for each of the graph commands
-and another reset command for resetting the signal we use to signal the completion
-of the graph workload. This signal is called *SignalEvent* and is defined in
-in the `ur_exp_command_buffer_handle_t` class.
+  consuming, especially for large graphs. However, this task is not needed for
+  every submission, but only once, when the command-buffer is fixed, i.e. when
+  the command-buffer is finalized. The decorrelation between the reset
+  command-list and the wait command-list allows us to create and enqueue the
+  reset commands when finalizing the command-buffer, and only create the wait
+  command-list at submission.
+
+This command-list consists of a reset command for each of the graph commands
+and another reset command for resetting the signal we use to signal the
+completion of the graph workload. This signal is called *SignalEvent* and is
+defined in the `ur_exp_command_buffer_handle_t` class.
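+
+In L0 terms, the prefix amounts to something like the following sketch
+(illustrative handle names, error handling omitted):
+
+```cpp
+// Wait command-list (created at submission): a single barrier that waits on
+// the external UR wait-list and signals WaitEvent on completion.
+zeCommandListAppendBarrier(WaitCommandList, WaitEvent, NumWaitEvents,
+                           WaitEventList);
+
+// Reset command-list (created at finalization): return the per-command graph
+// events and SignalEvent to the non-signalled state before the next run.
+for (ze_event_handle_t Event : GraphEvents)
+  zeCommandListAppendEventReset(ResetCommandList, Event);
+zeCommandListAppendEventReset(ResetCommandList, SignalEvent);
+```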
 
 #### Suffix
 
 The suffix's commands aim to:
-1) Handle the completion of the graph workload and signal
-an UR return event.
-Thus, at the end of the graph workload command-list a command, which
-signals the *SignalEvent*, is added (when the command-buffer is finalized).
-In an additional command-list (*signal command-list*), a barrier waiting for
-this event is also added.
-This barrier signals, in turn, the UR return event that has be defined by
-the runtime layer when calling the `urCommandBufferEnqueueExp` function.
-
-2) Manage the profiling. If a command-buffer is about to be submitted to
-a queue with the profiling property enabled, an extra command that copies
-timestamps of L0 events associated with graph commands into a dedicated
-memory which is attached to the returned UR event.
-This memory stores the profiling information that corresponds to
-the current submission of the command-buffer.
+1) Handle the completion of the graph workload and signal a UR return event.
+Thus, at the end of the graph workload command-list a command, which signals
+the *SignalEvent*, is added (when the command-buffer is finalized). In an
+additional command-list (*signal command-list*), a barrier waiting for this
+event is also added. This barrier signals, in turn, the UR return event that
+has been defined by the runtime layer when calling the
+`urCommandBufferEnqueueExp` function, as sketched after this list.
+
+2) Manage the profiling. If a command-buffer is about to be submitted to a
+queue with the profiling property enabled, an extra command is added that
+copies timestamps of L0 events associated with graph commands into a dedicated
+memory which is attached to the returned UR event. This memory stores the
+profiling information that corresponds to the current submission of the
+command-buffer.
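+
+A matching sketch of the suffix (same illustrative conventions as the prefix
+sketch above):
+
+```cpp
+// Appended to the end of the graph-workload command-list at finalization:
+// signal SignalEvent once the whole workload has executed.
+zeCommandListAppendSignalEvent(WorkloadCommandList, SignalEvent);
+
+// Signal command-list: a barrier that waits on SignalEvent and signals the
+// UR return event of this particular submission.
+zeCommandListAppendBarrier(SignalCommandList, UrReturnEvent, 1, &SignalEvent);
+```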
 
 ![L0 command-buffer diagram](images/L0_UR_command-buffer-v5.jpg)
 
 For a call to `urCommandBufferEnqueueExp` with an `event_list` *EL*,
-command-buffer *CB*, and return event *RE* our implementation has to submit
-three new command-lists for the above approach to work. Two before
-the command-list with extra commands associated with *CB*, and the other
-after *CB*. These new command-lists are retrieved from the UR queue, which
-will likely reuse existing command-lists and only create a new one in the worst
-case.
+command-buffer *CB*, and return event *RE*, our implementation has to submit
+three new command-lists for the above approach to work. Two are submitted
+before the command-list with extra commands associated with *CB*, and the
+other after *CB*. These new command-lists are retrieved from the UR queue,
+which will likely reuse existing command-lists and only create a new one in
+the worst case.
 
 #### Drawbacks
 
@@ -365,11 +354,11 @@ Level Zero:
 executing a UR command-buffer.
 
 3. Dependencies between multiple submissions must be handled by the runtime.
-   Indeed, when a second submission is performed the signal conditions
-   of *WaitEvent* are redefined by this second submission.
-   Therefore, this can lead to an undefined behavior and potential
-   hangs especially if the conditions of the first submissions were not yet
-   satisfied and the event has not yet been signaled.
+   Indeed, when a second submission is performed the signal conditions of
+   *WaitEvent* are redefined by this second submission. Therefore, this can
+   lead to undefined behavior and potential hangs, especially if the
+   conditions of the first submission were not yet satisfied and the event
+   has not yet been signalled.
 
 Future work will include exploring L0 API extensions to improve the mapping of
 UR command-buffer to L0 command-list.
 
@@ -378,43 +367,41 @@ UR command-buffer to L0 command-list.
 The SYCL Graph CUDA backend relies on the
 [CUDA Graphs feature](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs),
-which is the CUDA public API for batching series of operations,
-such as kernel launches, connected by dependencies.
+which is the CUDA public API for batching series of operations, such as kernel
+launches, connected by dependencies.
 
 UR commands (e.g. kernels) are mapped as graph nodes using the
 [CUDA Driver API](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__GRAPH.html#group__CUDA__GRAPH).
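+
+For illustration, adding a kernel command as a graph node with its UR
+sync-points as node dependencies looks roughly like this (node parameter
+set-up elided, names illustrative):
+
+```cpp
+// DepNodes holds the CUgraphNode handles of the command's UR sync-points.
+CUDA_KERNEL_NODE_PARAMS NodeParams = {}; // kernel, grid/block dims, args...
+CUgraphNode Node;
+cuGraphAddKernelNode(&Node, Graph, DepNodes.data(), DepNodes.size(),
+                     &NodeParams);
+```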
-The CUDA Driver API is preferred over the CUDA Runtime API to implement
-the SYCL Graph backend to remain consistent with other UR functions.
-Synchronization between commands (UR sync-points) is implemented
-using graph dependencies.
-
-Executable CUDA Graphs can be submitted to a CUDA stream
-in the same way as regular kernels.
-The CUDA backend enables enqueuing events to wait for into a stream.
-It also allows signaling the completion of a submission with an event.
-Therefore, submitting a UR command-buffer consists only of submitting to a stream
-the executable CUDA Graph that represent this series of operations.
+The CUDA Driver API is preferred over the CUDA Runtime API to implement the
+SYCL Graph backend to remain consistent with other UR functions.
+Synchronization between commands (UR sync-points) is implemented using graph
+dependencies.
+
+Executable CUDA Graphs can be submitted to a CUDA stream in the same way as
+regular kernels. The CUDA backend enables enqueuing events to wait for into a
+stream. It also allows signalling the completion of a submission with an event.
+Therefore, submitting a UR command-buffer consists only of submitting to a
+stream the executable CUDA Graph that represents this series of operations.
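+
+A per-submission enqueue can then be sketched as follows (illustrative names,
+error handling omitted):
+
+```cpp
+// Make the stream wait on the submission's external event dependencies,
+// launch the whole executable graph, then record the completion event.
+for (CUevent Dep : WaitEvents)
+  cuStreamWaitEvent(Stream, Dep, 0 /* flags */);
+cuGraphLaunch(GraphExec, Stream);
+cuEventRecord(SignalEvent, Stream);
+```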
 
 An executable CUDA Graph, which contains all commands and synchronization
-information, is saved in the UR command-buffer to allow for efficient
-graph resubmission.
+information, is saved in the UR command-buffer to allow for efficient graph
+resubmission.
 
 ### HIP
 
-The HIP backend offers a Graph managemenet API very similar to CUDA Graph
-feature for batching series of operations.
-The SYCL Graph HIP backend implementation is therefore very similar to that of CUDA.
+The HIP backend offers a graph management API very similar to the CUDA Graph
+feature for batching series of operations. The SYCL Graph HIP backend
+implementation is therefore very similar to that of CUDA.
 
 UR commands (e.g. kernels) are mapped as graph nodes using the
 [HIP Management API](https://docs.amd.com/projects/HIP/en/docs-5.5.0/doxygen/html/group___graph.html).
-Synchronization between commands (UR sync-points) is implemented
-using graph dependencies.
-Executable HIP Graphs can be submitted to a HIP stream
-in the same way as regular kernels.
-The HIP backend enables enqueuing events to wait for into a stream.
-It also allows signaling the completion of a submission with an event.
-Therefore, submitting a UR command-buffer consists only of submitting to a stream
-the executable HIP Graph that represent this series of operations.
+Synchronization between commands (UR sync-points) is implemented using graph
+dependencies. Executable HIP Graphs can be submitted to a HIP stream in the
+same way as regular kernels. The HIP backend enables enqueuing events to wait
+for into a stream. It also allows signalling the completion of a submission
+with an event. Therefore, submitting a UR command-buffer consists only of
+submitting to a stream the executable HIP Graph that represents this series of
+operations.
 
 An executable HIP Graph, which contains all commands and synchronization
 information, is saved in the UR command-buffer to allow for efficient
@@ -439,7 +426,7 @@ support.
 Due to the API mapping gaps documented in the following section, OpenCL as a
 SYCL backend cannot fully support the graph API. Instead, there are
-limitations in the types of nodes which a user can add to a graph, using
-an unsupported node type will cause a sycl exception to be thrown in graph
+limitations in the types of nodes which a user can add to a graph; using
+an unsupported node type will cause a SYCL exception to be thrown in graph
 finalization with error code `sycl::errc::feature_not_supported` and a
 message mentioning the unsupported command. For example,
 
@@ -522,7 +509,8 @@
 custom defined symbols with the ones from OpenCL-Headers.
 
 Publicly available implementations of `cl_khr_command_buffer` that can be used
 to enable the graph extension in OpenCL:
 
-- [OneAPI Construction Kit](https://github.com/codeplaysoftware/oneapi-construction-kit) (must enable `OCL_EXTENSION_cl_khr_command_buffer` when building)
+- [OneAPI Construction Kit](https://github.com/codeplaysoftware/oneapi-construction-kit)
+  (must enable `OCL_EXTENSION_cl_khr_command_buffer` when building)
 - [PoCL](http://portablecl.org/)
 - [Command-Buffer Emulation Layer](https://github.com/bashbaug/SimpleOpenCLSamples/tree/efeae73139ddf064fafce565cc39640af10d900f/layers/10_cmdbufemu)