[SYCL][Graph] Add API for updating single node parameters #340

Bensuo · 2023-11-01T18:21:14Z

New API and wording for updating single parameters

This bears some relation to #77 but takes a different approach. The other PR requires that users keep track of both the old parameter and the new parameter when updating, where this one creates an intermediate object (essentially a handle) which can be used when updating the parameters with new values. I believe this is preferable to having to store buffers, pointers and potentially a mix of both in complex code where graph building and updating may not be located in the same scope necessarily.

Edit: Queries have been removed from this PR and are now implemented in #348

EwanC

This is a great start, I think we think about how the various update API designs fit together, to work out how much of the helper getDependencies()/getNodes() queries we should be providing with this PR(if any)

Whole graph update - record a library call, and don't have access
to sycl inputs themselves.
dynamic parameter update - user doesn't want the runtime to traverse the graph,
know exactly what nodes to change.
[SYCL] Add explicit update API to graph spec #77 update - user has access to sycl inputs but doesn't want
to manually loop over every node in a graph doing dynamic parameter
update, so runtime can do that for them. Maybe created the graph using record & replay rather than explicitly.

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

guoyejun · 2023-11-30T12:00:24Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

+guarantee allows the backend to provide a more optimized implementation, if
+possible.
+
+The `property::graph::updatable_graph` property also allows an executable graph


For in order queue, the contiguous submission is always allowed.

For out of order queue, it is not allowed without updatable_graph property, it is possible to allow with updatable_graph property, but, I doubt if the graph update is always enough to remove all the data race conditions, there are lots of intermediate memory within the graph.

imo, we may remove this paragraph which describes the rare case that the update is enough to remove all the data race condition.

For in order queue, the contiguous submission is always allowed.

That's actually not the current spec wording, the spec says "Throws synchronously with error code invalid if a previous submission of graph has yet to complete execution." So even if you use an in-order queue, or set execution dependencies in an out-of-order queue, it will still be an exception. This is very restrictive, which is why we have #345 open. I imagine that #345 will merge first, and the wording here will updated to reflect that (right now it's a bit of a placeholder)

yes, you are correct for the current status. My option should be more precise: For in order queue, the contiguous submission is expected to always be allowed once the limitation of “host sync for multiple command-buffer submissions” is removed. And hope #345 to be merged soon, this limitation is really so restrictive.

my main idea of this comment is that: even with updatable_graph property, there's still high possibility to have data race condition due to the many intermediate tensors within the graph.

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

gmlueck · 2023-12-15T21:30:47Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

+
+// Register nodeB dynamic parameters
+dynParamInput.register(ptrX, nodeB);
+dynParamInput.register(ptrY, nodeB);


I like the concept of the dynamic_parameter type, which acts like a handle to the parameter you want to update. However, I hate the way the registration works -- matching by value seems error prone. Could you structure dynamic_parameter like this instead?

template<typename T> class dynamic_parameter { public: dynamic_parameter(const T& param, command_graph<graph_state::modifiable> graph, std::string label = ""); register(const handler &cgh); update(const T& newval, command_graph<graph_state::executable> execGraph); };

Usage would be like this:

size_t len = /* ... */; int *ptrx = malloc_shared<int>(len, myQueue); int *ptry = malloc_device<int>(len, myQueue); // Note that we use CTAD here to deduce the template parameter of "dyanmic_parameter". // Also note that the dynamic parameter is associated with the captured kernel argument in the // constructor, which seems nice. dynamic_parameter dparam_len{len}; dynamic_parameter dparam_ptrx{ptrX}; dynamic_parameter_dparam_ptry(ptry}; node nodeB = myGraph.add[&](handler& cgh) { // Register each dynamic parameter with the "handler". The implementation use this to figure out the // node. By the time "myGraph.add" returns, each dynamic parameter becomes associated with "nodeB". dparam_len.register(cgh); dparam_ptrx.register(cgh); dparam_ptry.register(cgh); cgh.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) { const size_t i = it[0]; ptrY[i] += ptrX[i]; }); }); auto execGraph = myGraph.finalize({sycl_ext::property::graph::updatable}); myQueue.ext_oneapi_graph(execGraph); len = /* some new length */; int *ptrz = malloc_shared<int>(len, myQueue); int *ptrq = malloc_device<int>(len, myQueue); // Update parameter to new values. dparam_len.update(len, execGraph); dparam_ptrx.update(ptrz, execGraph); dparam_ptry.update(ptrq, execGraph);

In this proposal, registration is somewhat different. The dynamic_parameter object is constructed with the address of the parameter. The variable passed to the dynamic_parameter constructor must be the same variable that is captured by the kernel's lambda. The call to register then associates the dynamic parameter object with the node and the kernel that the node contains.

I'm assuming here that your graph compiler knows the variables that are captured by the kernel. The graph compiler can search this list of captured variables and compare their address with the address that was passed to the dynamic_parameter constructor. This address is unique, so there's no ambiguity with the registration. You can even create a dynamic_parameter for a scalar as my example shows.

Thanks for the suggestions. In principal I think having the dynamic_parameter be templated on the arg type is fine, though we had initially avoided this to allow generic storage of dynamic_parameters of a single type. This might present a problem for usability but that's my only concern really.

Registration with the handler would present significant problems with record and replay functionality I believe though, since the code may either be not accessible to make these additions (library case) or not desirable for the user to change (general record and replay case). Record and replay is being used in the primary use-case we have for developing this feature against, so compatibility in this area is very important.

I also believe that scalars still present a problem even with this suggestion. It is my understanding that since these parameters are captured by value inside the lambda, when inspected they will have a different address than the original scalar value. Also in cases where set_arg is used for scalar values then as far as I can see the handler will create it's own copy of the data internally which would have the same issue. Am I misunderstanding something here?

I also believe that scalars still present a problem even with this suggestion. It is my understanding that since these parameters are captured by value inside the lambda, when inspected they will have a different address than the original scalar value. Also in cases where set_arg is used for scalar values then as far as I can see the handler will create it's own copy of the data internally which would have the same issue. Am I misunderstanding something here?

My proposal is assuming that you have have access to the address of the original variables that are captured by the kernel lambda. I'm not sure if it's easy for you to get these addresses, but I think it would be worth some extra effort. The main weakness I see with the original proposal is that it identifies kernel arguments by value, but this is error prone because two arguments might have the same value. If you could somehow get the address of the captured variables, you can use this address to uniquely associate each dynamic_parameter with its kernel argument.

I'm not concerned about the set_arg case. In this case, you can uniquely associate each dynamic_parameter with its argument by using the argument index. The main problem I'm trying to solve in this comment is the case when kernel arguments are captured by the lambda expression.

Do you think you can get access to these addresses somehow?

for the case if we could not get node from recorded graph, we can use void register(void * param); for USM.

for the case if we can get node from recorded graph, we can use void register(T param, Node node); with limitations.

for the case with explicit building API, can we change a bit to void register(T param, const handler &cgh); as below, and we even don't need to care the address.

dynamic_parameter dp_len(myGraph); ... node nodeB = myGraph.add[&](handler& cgh) { dp_len.register(len, cgh); ...

for the case with explicit building API, can we change a bit to void register(T param, const handler &cgh); as below, and we even don't need to care the address.

This doesn't address my concern. What if the kernel has two parameters that have the same value as len? How will you decide which of those parameters is associated with dp_len?

I think dynamic_parameter is orthogonal to buffers and accessors. For example, I think you would still need to use dynamic_parameter if you wanted to change the value of an accessor parameter in a graph. I guess the syntax would look something like:

buffer<int> buf{{N}}; accessor acc{buf}; dynamic_parameter dparam_acc{acc}; node nodeB = myGraph.add[&](handler& cgh) { cgh.require(acc); dparam_acc.register(cgh); cgh.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) { const size_t i = it[0]; dparam_acc.get()[i] += i; }); });

I was confused by the first example that only used scalar len variable and USM pointers. Do we expect that "updated" accessor value would have all the same compile-time properties between old/new values (e.g. number of dimensions)?

Do we expect that "updated" accessor value would have all the same compile-time properties between old/new values (e.g. number of dimensions)?

I assume so, but this is a question to the PR authors.

Assuming the updated accessor does need to have the same compile-time properties, I think the dynamic_parameter would be able to enforce this because those compile-time properties would become part of the type of the dynamic_parameter object. The dynamic_parameter class is templated on the type of the argument, so we could enforce (at compile time) that the updated argument value had the same type (which would include accessor properties).

Does this let you capture enough information about the parameters, so that you can uniquely identify a parameter without assuming each parameter has a unique value?

Your proposal of a dynamic_parameter instance crossing the host/device code boundary I think should let us capture enough information to uniquely identify each parameter, as the dynamic_parameter instance itself becomes the parameter.

My main concern with this design is that it requires the user to modify the kernel code, which won't work for the record & replay usecase where existing code is wrapped. If I understand @guoyejun's description of how pytorch works, this will be their usage before updating a graph instance. What we could do is aim to provide both registration mechanisms, matching-by-value and the CGH registration, then try to communicate to the user in the spec the drawbacks of the match-by-value approach.

I'm also not sure how long it will take to implement dynamic_parameters as a cross host/device construct, there's
a lot more uncertainty in terms of implementation. E.g. Ensure that when a user has multiple instances of the same graph,
and updates only one of them, then dynamic_parameter.get() has been implemented correctly so that only kernels in that exec graph instance are updated. So having both mechanisms means we could at least prototype match-by-value for evaluation on an estimatable timescale and get feedback on how often match-by-value collisions happen in real workloads.

@EwanC yes, that's right.

to be accurate, PyTorch is not currently using the graph update, but the future refinement in our discussion with pytorch people.

Currently in PyTorch, there are two cases if the graph input USM is changed,

add an extra copy (in pytorch eager mode)

re-record the graph (in torch.compile mode)

gmlueck · 2023-12-15T21:36:04Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

+// Register nodeB dynamic parameters
+dynParamInput.register(ptrX, nodeB);
+dynParamInput.register(ptrY, nodeB);
+


Adding another comment here to keep this conversation separately threaded.

I don't understand how dynamic_parameter::update works for an accessor. How would you create the new accessor value? Normally, an accessor is created inside of command group scope, but there is no command group when you update the parameter values. Are you expecting that the user creates a placeholder accessor?

I'm also not clear on how parameter update works for a recorded graph. Is this why you added command_graph::get_nodes and command_graph::get_root_nodes? Is the idea that the application uses these to get specific nodes in the graph and then it figures out the parameters to those nodes by somehow knowing the values of the variables that are captured in that node? This seems to require a lot of knowledge sharing between the code that constructs the dynamic_parameter and the code that submits commands to the queue that is being recorded.

how parameter update works for a recorded graph

for the USM update, see https://github.com/reble/llvm/pull/340/files#r1422564811

Yes, I saw that comment. Updating a USM pointer seems clear to me. My question is specifically about updating an accessor.

I don't understand how dynamic_parameter::update works for an accessor. How would you create the new accessor value? Normally, an accessor is created inside of command group scope, but there is no command group when you update the parameter values. Are you expecting that the user creates a placeholder accessor?

Yes as it stands I think this would require placeholder accessors. An alternative could be to instead use the actual buffer when updating the parameter, so in the kernel the accessor would have the same properties but point to updated memory. Accessors may allow more flexibility in terms of updating the range or offset, but it's not clear to me how much that would be useful - it seems like most use cases would want the access to stay the same but simply swap out the actual memory between executions. However, changing offsets could be achieved with sub-buffers potentially which would restore some flexibility.

I'm also not clear on how parameter update works for a recorded graph. Is this why you added command_graph::get_nodes and command_graph::get_root_nodes? Is the idea that the application uses these to get specific nodes in the graph and then it figures out the parameters to those nodes by somehow knowing the values of the variables that are captured in that node? This seems to require a lot of knowledge sharing between the code that constructs the dynamic_parameter and the code that submits commands to the queue that is being recorded.

This is a partial motivation yes, and it does require some knowledge sharing as you say. In the kinds of cases we are looking at though where memory is provided by the user to a library to perform some operations that knowledge is already present in the user code, and the queries allow the user access to what was recorded to be able to update those inputs.

More broadly though these queries enable better mixed record & replay/explicit API usage by allowing nodes which have been recorded to be used as dependencies in the explicit API. Previously this mixed functionality was possible but only in a very limited way.

also curious about how the accessor is implemented, is it a struct that contains size, offset, format (for image), memory handle (to buffer/image) etc.?

looks that we need to add command_graph::get_nodes back, in framework, it is used to check if the recorded graph is empty, see at https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/CUDAGraph.cpp#L227

Support for query functions like get_nodes can be found in a separate PR now: #348

This change introduces a new API that allows the kernel commands of a command-buffer to be updated with a new configuration. For example, modified arguments or ND-Range. The new API is defined in the following files and then source generated using scripts, so reviewers should look at: * `scripts/core/EXP-COMMAND-BUFFER.rst` * `scripts/core/exp-command-buffer.yml` See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are: * Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update. * USM, buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler` * There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them. The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature. Reviewers should review the changes in `source/adapters/cuda/` to evaluate this, CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look). A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.

reble · 2024-01-17T17:18:49Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

+
+// Register nodeA dynamic parameters
+dynParamInput.register(0, nodeA);  // Argument index 0 is ptrX
+dynParamInput.register(1, nodeA);  // Argument index 1 is myScalar


Suggested change

dynParamInput.register(1, nodeA); // Argument index 1 is myScalar

dynParamScalar.register(1, nodeA); // Argument index 1 is myScalar

reble · 2024-01-17T17:19:31Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

+
+// Register nodeB dynamic parameters
+dynParamInput.register(ptrX, nodeB);
+dynParamInput.register(ptrY, nodeB);


Suggested change

dynParamInput.register(ptrY, nodeB);

dynParamOutput.register(ptrY, nodeB);

This change introduces a new API that allows the kernel commands of a command-buffer to be updated with a new configuration. For example, modified arguments or ND-Range. The new API is defined in the following files and then source generated using scripts, so reviewers should look at: * `scripts/core/EXP-COMMAND-BUFFER.rst` * `scripts/core/exp-command-buffer.yml` See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are: * Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update. * USM, buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler` * There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them. The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature. Reviewers should review the changes in `source/adapters/cuda/` to evaluate this, CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look). A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.

Bensuo · 2024-01-24T18:36:16Z

Closing in favour of upstream PR: intel#12486

This change introduces a new API that allows the kernel commands of a command-buffer to be updated with a new configuration. For example, modified arguments or ND-Range. The new API is defined in the following files and then source generated using scripts, so reviewers should look at: * `scripts/core/EXP-COMMAND-BUFFER.rst` * `scripts/core/exp-command-buffer.yml` See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are: * Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update. * USM, buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler` * There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them. The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature. Reviewers should review the changes in `source/adapters/cuda/` to evaluate this, CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look). A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.

This change introduces a new API that allows the kernel commands of a command-buffer to be updated with a new configuration. For example, modified arguments or ND-Range. The new API is defined in the following files and then source generated using scripts, so reviewers should look at: * `scripts/core/EXP-COMMAND-BUFFER.rst` * `scripts/core/exp-command-buffer.yml` See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are: * Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update. * USM, buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler` * There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them. * Command-handles are reference counted in UR, and extend the lifetime of the parent command-buffer. The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature. Reviewers should review the changes in `source/adapters/cuda/` to evaluate this, CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look). A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.

Bensuo added the Graph Specification Extension Specification related label Nov 1, 2023

Bensuo requested review from EwanC, reble, julianmi and mfrancepillois November 1, 2023 18:21

EwanC reviewed Nov 2, 2023

View reviewed changes

julianmi reviewed Nov 2, 2023

View reviewed changes

EwanC reviewed Nov 7, 2023

View reviewed changes

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc Show resolved Hide resolved

EwanC mentioned this pull request Nov 15, 2023

[SYCL][Graph][Doc] Reinstate whole graph update #343

Closed

EwanC self-assigned this Nov 15, 2023

EwanC force-pushed the ben/explicit-update branch from 0b7bfd3 to fb75368 Compare November 15, 2023 10:18

EwanC reviewed Nov 15, 2023

View reviewed changes

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc Show resolved Hide resolved

EwanC force-pushed the ben/explicit-update branch 3 times, most recently from abd903e to 44cff2b Compare November 20, 2023 16:19

EwanC reviewed Nov 23, 2023

View reviewed changes

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc Outdated Show resolved Hide resolved

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc Outdated Show resolved Hide resolved

guoyejun reviewed Nov 25, 2023

View reviewed changes