
[SYCL][Graph] Add initial draft of malloc/free nodes ... #352

Draft
wants to merge 4 commits into base: sycl
Conversation

@reble (Owner) commented Jan 24, 2024

... with backend support for deferred memory allocation.

As proposed in intel#8954

@reble added the Graph and Specification Extension labels on Jan 25, 2024
@EwanC (Collaborator) left a comment

Tried not to nitpick this too much until we agree on the high-level design, but I think this looks good. The API is simple, there's a clear mapping from CUDA-Graph entry points for porting code, and the dependency on sycl_ext_oneapi_virtual_mem hopefully means we can avoid having to go into any low-level details.
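
For readers less familiar with the CUDA-Graph side of that mapping, here is a hedged sketch of the counterpart entry points (cudaGraphAddMemAllocNode / cudaGraphAddMemFreeNode, available since CUDA 11.4); the variable names, the device ordinal, and the placeholder dependency on the alloc node are illustration-only assumptions:

size_t N = 1024;
cudaGraph_t cudaGraph;
cudaGraphCreate(&cudaGraph, 0);

cudaMemAllocNodeParams allocParams = {};                         // zero-initialize params
allocParams.poolProps.allocType = cudaMemAllocationTypePinned;
allocParams.poolProps.location.type = cudaMemLocationTypeDevice;
allocParams.poolProps.location.id = 0;                           // device ordinal (assumed 0)
allocParams.bytesize = N * sizeof(float);

cudaGraphNode_t allocNode, freeNode;
cudaGraphAddMemAllocNode(&allocNode, cudaGraph, nullptr, 0, &allocParams);
void *dPtr = allocParams.dptr;                                   // device address fixed at node-add time
// ... kernel nodes that use dPtr would be added here, depending on allocNode ...
cudaGraphAddMemFreeNode(&freeNode, cudaGraph, &allocNode, 1, dPtr);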

won't be valid until the allocation is made on graph finalization, as allocating
at finalization is the only way to benefit from the known graph scope for optimal
memory allocation, and even optimize to eliminate some allocations entirely.

Collaborator:

I think people will have the question "if I have two executable graphs created from the same modifiable graph, will that node be different memory allocations?", so we should state something about that.
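
A minimal sketch of the scenario behind that question, reusing the add_malloc_device/finalize names from this draft (the behaviour asked about is the open question, not something the spec defines yet):

auto [ptr, alloc_node] = my_graph.add_malloc_device<float>(N);
auto exec_a = my_graph.finalize(); // first executable graph
auto exec_b = my_graph.finalize(); // second executable graph from the same modifiable graph
// Open question: does ptr refer to a single backing allocation shared by
// exec_a and exec_b, or does each executable graph get its own allocation?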


|
Returns a pair of a pointer to memory and a node. The pointer is allocated on the `device`
that is associated with current graph by first execution of the `command_graph`.
Collaborator:

first execution of the command_graph

Question about whether we want to tighten this more to say something like "allocated during the finalization of command-graph to the returned executable graph".

The current wording leaves open the suggestion that if you never execute the graph nothing will be allocated, but we probably will do the allocation on finalization - which means less latency when submitting the graph.

@reble (Owner, Author):

How about "may be allocated during the finalization ... "? That way we keep our options open, depending on the backend.

Collaborator:

Shall we align with line 1723 here?

In line 1723, it says allocating at finalization is the only way.


Exceptions:

* Throws synchronously with error code `feature_not_supported` if any devices in `context`
Collaborator:

We don't have a context parameter; perhaps "if the device associated with the command graph does not have ..."

@reble (Owner, Author):

Thanks, I meant: "any device in the context associated with the graph ..." but let's be more accurate!

can be passed here with a list of nodes to create dependency edges on.

|===

Collaborator:

Heads up that once intel#12366 merges, I think we'll need to extend the node_type enum to include malloc_device and free.
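
A hedged sketch of what that extension might look like; the elided enumerators come from intel#12366 and the two proposed names are assumptions taken from this review comment, not something either document defines yet:

enum class node_type {
  // ... existing node types from intel#12366 (kernel, memcpy, ...) ...
  malloc_device, // proposed: node created by add_malloc_device()
  free           // proposed: node created by add_free()
};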

@reble (Owner, Author) commented Jan 26, 2024

An example with structured binding could look like:

...
sycl_ext::command_graph my_graph(my_queue);

float scalar = 2.0f;
size_t N = 1024;
float *ptrX = malloc_device<float>(N, my_queue); // allocate device memory

auto [ptrY, node_a] = my_graph.add_malloc_device<float>(N); // only reserve virtual memory

auto node_b = my_graph.add(
    [=](handler &cgh) {
      cgh.parallel_for(N, [=](id<1> it) { ptrX[it] += scalar * ptrY[it]; });
    },
    {sycl_ext::property::node::depends_on(node_a)});

auto node_c = my_graph.add_free(ptrY); // note: likely also needs an edge from node_b so the free runs after the kernel
auto my_exec = my_graph.finalize(); // may allocate physical memory and map
my_queue.ext_oneapi_graph(my_exec).wait();

Comment on lines +1715 to +1716
std::pair<void*,node>
add_malloc_device(size_t num_bytes, const property_list& propList = {});
Collaborator:

Could we have this templated so that users can call this similarly to sycl::malloc_device<T>? Just saves the users having to do a cast after the call.
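
A sketch of the templated convenience overload being requested; the signature is an assumption mirroring sycl::malloc_device<T> and the non-templated declaration quoted above, not something this draft defines yet:

template <typename T>
std::pair<T *, node>
add_malloc_device(size_t count, const property_list &propList = {});

// Hypothetical usage: no cast needed on the returned pointer.
auto [ptrY, node_a] = my_graph.add_malloc_device<float>(N);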

@reble (Owner, Author):

Yes, will add. My only reason to start with a single version of this API was not having to update too many functions, in case we change the basic design of this interface.

sycl/test-e2e/ESIMD/matrix_transpose_glb.cpp (outdated, resolved)
USM allocations to be used in the graph submission. Before coming to this
recommendation we considered the following explicit graph building interfaces
for adding a memory allocation owned by the graph:
The following interfaces enable users to define a memory allocation/free operation

Collaborator:
Is there another plan/draft to record the memory allocation/free with the record & replay APIs?

The current functions such as sycl::malloc_device are not recorded into a graph with the record & replay APIs; we may need new async functions such as sycl::malloc_device_async.
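
A minimal sketch of the flow that question is getting at; begin_recording/end_recording come from the existing record & replay API in sycl_ext_oneapi_graph, while malloc_device_async is the commenter's hypothetical name, not an existing function:

my_graph.begin_recording(my_queue);

// Hypothetical async allocation that would be captured as an allocation node;
// plain sycl::malloc_device is not recorded into the graph today.
float *ptr = malloc_device_async<float>(N, my_queue); // hypothetical API

my_queue.parallel_for(N, [=](id<1> i) { ptr[i] = 0.f; });

my_graph.end_recording(my_queue);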
