Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Graph] Enable in-order cmd-list #13088

Merged
merged 14 commits into from
May 2, 2024
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 22 additions & 0 deletions sycl/doc/design/CommandGraph.md
Original file line number Diff line number Diff line change
Expand Up @@ -282,6 +282,28 @@ requirements for these new accessors to correctly trigger allocations before
updating. This is similar to how individual graph commands are enqueued when
accessors are used in a graph node.

## Optimizations
### Interactions with Profiling

Enabling profiling on a graph may disable optimizations from being performed on
the graph if they are incompatible with profiling. For example, enabling
profiling prevents the in-order optimization since the removal of events would
prevent collecting profiling information.

### In-Order Graph Partitions

On finalization graph partitions are checked to see if they are in-order, i.e.
the graph follows a single path where each node depends on the previous node. If
so a hint is provided to the backend that it may create the command-buffers in
an in-order fashion. Support for this is backend specific but it may provide
benefits through the removal of the need for synchronization primitives between
kernels.

This optimization is only performed in this very limited case where it can be
safely assumed to be more performant. It is not likely we'll try to allow
in-order execution in more scenarios through a complicated (and imperfect)
heuristic but rather expose this as a hint the user can provide.

## Backend Implementation

Implementation of UR command-buffers for each of the supported SYCL 2020
Expand Down
45 changes: 38 additions & 7 deletions sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -591,6 +591,31 @@ The API for explicitly adding nodes to a `command_graph` includes a
property to be passed here. `depends_on_all_leaves` provides a shortcut for
adding all the current leaves of a graph as dependencies.

==== Enable-Profiling Property [[enable-profiling]]
EwanC marked this conversation as resolved.
Show resolved Hide resolved
[source,c++]
----
namespace sycl::ext::oneapi::experimental::property::node {
class enable_profiling {
public:
enable_profiling() = default;
};
}
----

The `property::node::enable_profiling` property can be passed to a
`command_graph::add()` function and enables profiling support
for the node in the `command_graph<graph_state::executable>`.
Passing this property implies disabling certain optimizations.
EwanC marked this conversation as resolved.
Show resolved Hide resolved
This is why profiling is by default disabled on graphs, unless users
explicitly require it using either the `property::node::enable_profiling`
property when using the explicit graph creation API or the
`property::queue::enable_profiling` on the recorded queue in the Record & Replay
API. As a result, the execution time of a graph finalized with profiling enabled
is longer than that of a graph without profiling capability. An error will be
thrown when attempting to profile an event from a graph submission that was
created without this property.


=== Graph

[source, c++]
Expand Down Expand Up @@ -911,6 +936,8 @@ Parameters:
* `propList` - Zero or more properties can be provided to the constructed node
via an instance of `property_list`. The `property::node::depends_on` property
can be passed here with a list of nodes to create dependency edges on.
The `enable_profiling` property enables the profiling of this node.
See <<enable-profiling, Enable-Profiling>> for more details.


Returns: The empty node which has been added to the graph.
Expand Down Expand Up @@ -948,6 +975,9 @@ Parameters:
* `propList` - Zero or more properties can be provided to the constructed node
via an instance of `property_list`. The `property::node::depends_on` property
can be passed here with a list of nodes to create dependency edges on.
The `enable_profiling` property enables the profiling of this node.
See <<enable-profiling, Enable-Profiling>> for more details.


Returns: The command-group function object node which has been added to the graph.

Expand Down Expand Up @@ -1350,13 +1380,14 @@ ways:
an implicit dependency before and after the graph execution, as if the graph
execution is one command-group submitted to the in-order queue.

2. `property::queue::enable_profiling` - This property has no effect on graph
recording. When set on the queue a graph is submitted to however, it allows
profiling information to be obtained from the event returned by a graph
submission. As it is not defined how a submitted graph will be split up for
scheduling at runtime, the `uint64_t` timestamp reported from a profiling
query on a graph execution event has the following semantics, which may be
pessimistic about execution time on device.
2. `property::queue::enable_profiling` - This property must be set on the queue
in recording mode if users want to profile the commands recorded to
the graph. This property must also be set on the queue an executable graph is
submitted to. It allows profiling information to be obtained from the event
returned by a graph submission. As it is not defined how a submitted graph
will be split up for scheduling at runtime, the `uint64_t` timestamp reported
from a profiling query on a graph execution event has the following
semantics, which may be pessimistic about execution time on device.

* `info::event_profiling::command_submit` - Timestamp when the graph is
submitted to the queue.
Expand Down
2 changes: 2 additions & 0 deletions sycl/include/sycl/detail/pi.h
Original file line number Diff line number Diff line change
Expand Up @@ -2358,6 +2358,8 @@ typedef enum {
struct pi_ext_command_buffer_desc final {
pi_ext_structure_type stype;
const void *pNext;
pi_bool is_in_order;
pi_bool enable_profiling;
pi_bool is_updatable;
};

Expand Down
3 changes: 2 additions & 1 deletion sycl/include/sycl/detail/property_helper.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -48,8 +48,9 @@ enum DataLessPropKind {
GraphAssumeBufferOutlivesGraph = 23,
GraphDependOnAllLeaves = 24,
GraphUpdatable = 25,
GraphEnableProfiling = 26,
// Indicates the last known dataless property.
LastKnownDataLessPropKind = 25,
LastKnownDataLessPropKind = 26,
// Exceeding 32 may cause ABI breaking change on some of OSes.
DataLessPropKindSize = 32
};
Expand Down
29 changes: 22 additions & 7 deletions sycl/include/sycl/ext/oneapi/experimental/graph.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -195,6 +195,15 @@ class depends_on_all_leaves : public ::sycl::detail::DataLessProperty<
depends_on_all_leaves() = default;
};

/// Property used to enable node profiling.
/// Passing this property to the `command_graph::add()` function
/// ensures that profiling can be queried on this node.
class enable_profiling : public ::sycl::detail::DataLessProperty<
::sycl::detail::GraphEnableProfiling> {
public:
enable_profiling() = default;
};

} // namespace node
} // namespace property

Expand All @@ -221,15 +230,17 @@ class __SYCL_EXPORT modifiable_command_graph {
/// @param PropList Property list used to pass [0..n] predecessor nodes.
/// @return Constructed empty node which has been added to the graph.
node add(const property_list &PropList = {}) {
bool EnableProfiling =
PropList.has_property<property::node::enable_profiling>();
if (PropList.has_property<property::node::depends_on>()) {
auto Deps = PropList.get_property<property::node::depends_on>();
node Node = addImpl(Deps.get_dependencies());
node Node = addImpl(Deps.get_dependencies(), EnableProfiling);
if (PropList.has_property<property::node::depends_on_all_leaves>()) {
addGraphLeafDependencies(Node);
}
return Node;
}
node Node = addImpl({});
node Node = addImpl({}, EnableProfiling);
if (PropList.has_property<property::node::depends_on_all_leaves>()) {
addGraphLeafDependencies(Node);
}
Expand All @@ -241,15 +252,17 @@ class __SYCL_EXPORT modifiable_command_graph {
/// @param PropList Property list used to pass [0..n] predecessor nodes.
/// @return Constructed node which has been added to the graph.
template <typename T> node add(T CGF, const property_list &PropList = {}) {
bool EnableProfiling =
PropList.has_property<property::node::enable_profiling>();
if (PropList.has_property<property::node::depends_on>()) {
auto Deps = PropList.get_property<property::node::depends_on>();
node Node = addImpl(CGF, Deps.get_dependencies());
node Node = addImpl(CGF, Deps.get_dependencies(), EnableProfiling);
if (PropList.has_property<property::node::depends_on_all_leaves>()) {
addGraphLeafDependencies(Node);
}
return Node;
}
node Node = addImpl(CGF, {});
node Node = addImpl(CGF, {}, EnableProfiling);
if (PropList.has_property<property::node::depends_on_all_leaves>()) {
addGraphLeafDependencies(Node);
}
Expand Down Expand Up @@ -324,14 +337,16 @@ class __SYCL_EXPORT modifiable_command_graph {
/// Template-less implementation of add() for CGF nodes.
/// @param CGF Command-group function to add.
/// @param Dep List of predecessor nodes.
/// @param EnableProfiling Enable node profiling.
/// @return Node added to the graph.
node addImpl(std::function<void(handler &)> CGF,
const std::vector<node> &Dep);
node addImpl(std::function<void(handler &)> CGF, const std::vector<node> &Dep,
const bool EnableProfiling);

/// Template-less implementation of add() for empty nodes.
/// @param Dep List of predecessor nodes.
/// @param EnableProfiling Enable node profiling.
/// @return Node added to the graph.
node addImpl(const std::vector<node> &Dep);
node addImpl(const std::vector<node> &Dep, const bool EnableProfiling);

/// Adds all graph leaves as dependencies
/// @param Node Destination node to which the leaves of the graph will be
Expand Down
16 changes: 8 additions & 8 deletions sycl/plugins/unified_runtime/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -94,14 +94,14 @@ if(SYCL_PI_UR_USE_FETCH_CONTENT)
CACHE PATH "Path to external '${name}' adapter source dir" FORCE)
endfunction()

set(UNIFIED_RUNTIME_REPO "https://github.com/oneapi-src/unified-runtime.git")
# commit b37fa2c4b09a49839a83228f687c811595fce3fd
# Merge: c7fade0d f61e81e9
# Author: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
# Date: Tue Apr 23 16:17:41 2024 +0100
# Merge pull request #1544 from kbenzie/benie/l0-fix-rhel-error
# [L0] Add missing <iomanip> include
set(UNIFIED_RUNTIME_TAG b37fa2c4b09a49839a83228f687c811595fce3fd)
set(UNIFIED_RUNTIME_REPO "https://github.com/bensuo/unified-runtime.git")
# commit fe9a05e528992cd1db7b05e2857fb17879442e86
# Merge: ee2feb22 9222315f
# Author: aarongreig <aaron.greig@codeplay.com>
# Date: Tue Apr 16 10:10:10 2024 +0100
# Merge pull request #1507 from nrspruit/fix_p2p_properties_init
# [L0] Fix to p2p properties init for pNext and stype
set(UNIFIED_RUNTIME_TAG 3f85f3058eab63fa9bd28b0b98194fa2c21ce7ce)

fetch_adapter_source(level_zero
${UNIFIED_RUNTIME_REPO}
Expand Down
2 changes: 2 additions & 0 deletions sycl/plugins/unified_runtime/pi2ur.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -4496,6 +4496,8 @@ piextCommandBufferCreate(pi_context Context, pi_device Device,
ur_device_handle_t UrDevice = reinterpret_cast<ur_device_handle_t>(Device);
ur_exp_command_buffer_desc_t UrDesc;
UrDesc.stype = UR_STRUCTURE_TYPE_EXP_COMMAND_BUFFER_DESC;
UrDesc.isInOrder = Desc->is_in_order;
UrDesc.enableProfiling = Desc->enable_profiling;
EwanC marked this conversation as resolved.
Show resolved Hide resolved
UrDesc.isUpdatable = Desc->is_updatable;
ur_exp_command_buffer_handle_t *UrCommandBuffer =
reinterpret_cast<ur_exp_command_buffer_handle_t *>(RetCommandBuffer);
Expand Down
2 changes: 2 additions & 0 deletions sycl/source/detail/event_impl.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -321,6 +321,8 @@ class event_impl {
return MEventFromSubmittedExecCommandBuffer;
}

void setProfilingEnabled(bool Value) { MIsProfilingEnabled = Value; }

// Sets a command-buffer command when this event represents an enqueue to a
// Command Buffer.
void
Expand Down
21 changes: 18 additions & 3 deletions sycl/source/detail/graph_impl.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -188,6 +188,13 @@ void exec_graph_impl::makePartitions() {
if (Node->MCGType == sycl::detail::CG::CodeplayHostTask) {
HostTaskList.push_back(Node);
}
// Next line is supposed to be temporary.
// Nodes are not profiled individually, but the profiling of the whole graph
// is enabled if at least one node has profiling enabled. This should be
// changed once the PR https://github.com/intel/llvm/pull/12592 on node
// profiling is merged. This also will involve updating all the UR enqueue
// cmd functions to add a new parameter containing the profiling status.
MEnableProfiling |= Node->MProfilingEnabled;
}

// Annotate nodes
Expand Down Expand Up @@ -261,6 +268,7 @@ void exec_graph_impl::makePartitions() {
}
if (Partition->MRoots.size() > 0) {
Partition->schedule();
Partition->MIsInOrderGraph = Partition->checkIfGraphIsSinglePath();
MPartitions.push_back(Partition);
PartitionFinalNum++;
}
Expand Down Expand Up @@ -698,7 +706,9 @@ void exec_graph_impl::createCommandBuffers(
sycl::detail::pi::PiExtCommandBuffer OutCommandBuffer;
sycl::detail::pi::PiExtCommandBufferDesc Desc{
pi_ext_structure_type::PI_EXT_STRUCTURE_TYPE_COMMAND_BUFFER_DESC, nullptr,
MIsUpdatable};
pi_bool(Partition->MIsInOrderGraph && !MEnableProfiling),
pi_bool(MEnableProfiling), pi_bool(MIsUpdatable)};

auto ContextImpl = sycl::detail::getSyclObjImpl(MContext);
const sycl::detail::PluginPtr &Plugin = ContextImpl->getPlugin();
auto DeviceImpl = sycl::detail::getSyclObjImpl(Device);
Expand Down Expand Up @@ -999,6 +1009,7 @@ exec_graph_impl::enqueue(const std::shared_ptr<sycl::detail::queue_impl> &Queue,
NewEvent->attachEventToComplete(Elem.second);
}
}
NewEvent->setProfilingEnabled(MEnableProfiling);
sycl::event QueueEvent =
sycl::detail::createSyclObjFromImpl<sycl::event>(NewEvent);
return QueueEvent;
Expand Down Expand Up @@ -1458,7 +1469,8 @@ modifiable_command_graph::modifiable_command_graph(
: impl(std::make_shared<detail::graph_impl>(
SyclQueue.get_context(), SyclQueue.get_device(), PropList)) {}

node modifiable_command_graph::addImpl(const std::vector<node> &Deps) {
node modifiable_command_graph::addImpl(const std::vector<node> &Deps,
const bool EnableProfiling) {
impl->throwIfGraphRecordingQueue("Explicit API \"Add()\" function");
std::vector<std::shared_ptr<detail::node_impl>> DepImpls;
for (auto &D : Deps) {
Expand All @@ -1467,11 +1479,13 @@ node modifiable_command_graph::addImpl(const std::vector<node> &Deps) {

graph_impl::WriteLock Lock(impl->MMutex);
std::shared_ptr<detail::node_impl> NodeImpl = impl->add(impl, DepImpls);
NodeImpl->MProfilingEnabled = EnableProfiling;
return sycl::detail::createSyclObjFromImpl<node>(NodeImpl);
}

node modifiable_command_graph::addImpl(std::function<void(handler &)> CGF,
const std::vector<node> &Deps) {
const std::vector<node> &Deps,
const bool EnableProfiling) {
impl->throwIfGraphRecordingQueue("Explicit API \"Add()\" function");
std::vector<std::shared_ptr<detail::node_impl>> DepImpls;
for (auto &D : Deps) {
Expand All @@ -1481,6 +1495,7 @@ node modifiable_command_graph::addImpl(std::function<void(handler &)> CGF,
graph_impl::WriteLock Lock(impl->MMutex);
std::shared_ptr<detail::node_impl> NodeImpl =
impl->add(impl, CGF, {}, DepImpls);
NodeImpl->MProfilingEnabled = EnableProfiling;
return sycl::detail::createSyclObjFromImpl<node>(NodeImpl);
}

Expand Down
Loading
Loading