
Deferred Allocation #1704

Open · wants to merge 49 commits into main

Conversation

@ThrudPrimrose (Collaborator) commented:

This is possibly the start of a long PR that will evolve, and we should discuss it at every step.

The main idea is as follows (so far in the prototype):

Idea 1:
When an array is created, let's say "A", a size array "A_size" is also created. The size array is always one-dimensional and contiguously stored. When there is an edge A_size -> A (the in connector needs to be _size), we trigger a reallocation operation rather than a copy operation. I am not sure whether we should have an A_size array at all; I would like to discuss this.

I have a small SDFG that shows what it looks like.

import dace

def one():
    sdfg = dace.sdfg.SDFG(name="deferred_alloc_test")

    # Deferred transient: both dimensions use the __dace_defer placeholder.
    sdfg.add_array(name="A", shape=("__dace_defer", "__dace_defer"), dtype=dace.float32,
                   storage=dace.dtypes.StorageType.Default, transient=True)

    state = sdfg.add_state("main")

    # Access node of A with the size in connector that triggers the reallocation.
    an_1 = state.add_access('A')
    an_1.add_in_connector('IN_size')

    # Scalars holding the new extents of the two dimensions.
    an_2 = state.add_scalar(name="dim0", dtype=dace.uint64)
    an_3 = state.add_scalar(name="dim1", dtype=dace.uint64)

    # Two access nodes of the size array A_size that accompanies A.
    s_an_1 = dace.nodes.AccessNode(data="A_size")
    state.add_node(s_an_1)
    s_an_2 = dace.nodes.AccessNode(data="A_size")
    state.add_node(s_an_2)
    state.add_edge(s_an_1, None, s_an_2, None,
                   dace.Memlet(None))

    # Writing the full size array into A's IN_size connector triggers the realloc.
    state.add_edge(s_an_2, None, an_1, 'IN_size',
                   dace.Memlet(expr="A_size[0:2]"))

    # The new dimension values are written into the size array element-wise.
    state.add_edge(an_2, None, s_an_1, None,
                   dace.Memlet(expr="A_size[0]"))
    state.add_edge(an_3, None, s_an_2, None,
                   dace.Memlet(expr="A_size[1]"))

    sdfg.save("def_alloc_1.sdfg")
    sdfg.validate()

    return sdfg

s = one()
s.generate_code()
s(dim0=512, dim1=512)

The generated code looks as follows:
[screenshot of the generated code]
And the SDFG:
[screenshot of the SDFG]

Next Step:
Not all of the size array will always be used; for example, when calling realloc, only the "__dace_defer" dimensions will be read from it, while the dimensions passed as symbols will be read from the respective symbols (these cannot be changed throughout the SDFG). (I am currently working on this step.)
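As a minimal sketch of that selection (a hypothetical helper, not the PR's actual code; it mirrors the startswith-based mask that appears later in the diff):

# Hypothetical sketch: find the dimensions of a shape that are deferred.
# Only these dimensions are read from the size array when emitting the realloc;
# the remaining dimensions keep coming from their symbols.
def deferred_dimensions(shape):
    return [i for i, dim in enumerate(shape) if "__dace_defer" in str(dim)]

# Example: only dimension 1 would be read from the size array.
print(deferred_dimensions(("N", "__dace_defer")))  # [1]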

@ThrudPrimrose added the no-ci (Do not run any CI or actions for this PR) label on Oct 24, 2024
@tbennun (Collaborator) commented Oct 24, 2024

I would not use IN_size (i.e., a throughput connector), maybe use just size.

@ThrudPrimrose (Collaborator, Author) replied:

> I would not use IN_size (i.e., a throughput connector), maybe use just size.

About the connector: we can also read the size, which I had planned to call "OUT_size". Would you still suggest naming both connectors size?

@tbennun (Collaborator) commented Oct 24, 2024

> > I would not use IN_size (i.e., a throughput connector), maybe use just size.
>
> About the connector: we can also read the size, which I had planned to call "OUT_size". Would you still suggest naming both connectors size?

Yes, exactly, since those connectors are endpoints and do not flow through the access node (as they would in a scope node).

@ThrudPrimrose (Collaborator, Author) replied:

> > > I would not use IN_size (i.e., a throughput connector), maybe use just size.
> >
> > About the connector: we can also read the size, which I had planned to call "OUT_size". Would you still suggest naming both connectors size?
>
> Yes, exactly, since those connectors are endpoints and do not flow through the access node (as they would in a scope node).

I updated the implementation according to the discussion we had with Torsten and Lex, and changed it to use "size" for both the in and out connectors. The problem now is that this SDFG does not validate because the access node has duplicate connectors. I think it is better to make the size connectors distinct rather than changing the validation rules. Or should I update the validation procedure so that access nodes are an exception?

@alexnick83 self-requested a review on December 2, 2024
@ThrudPrimrose added the no-ci (Do not run any CI or actions for this PR) label on Dec 3, 2024
@alexnick83 (Contributor) left a review:

LGTM overall. I have some questions, but ignore the one that was already discussed in the chat.

dace/codegen/dispatcher.py (review thread, outdated, resolved)
dace/codegen/targets/cpu.py (review thread, resolved)
dtype = sdfg.arrays[data_name].dtype

# Only consider the offsets with __dace_defer in original dim
mask_array = [str(dim).startswith("__dace_defer") for dim in data.shape]
Contributor comment on this snippet:

Is there any chance of strange transients with shapes that combine constants or normal symbols with deferred symbols, e.g., 4 * __dace_defer__x? This could occur, for example, through an operation whose inputs include both normal arrays and deferred arrays.

Contributor comment:

Random thought, but maybe it would be safer to match the __dace_defer pattern rather than use startswith?

@ThrudPrimrose (Collaborator, Author) replied:

To keep it simple, I think it is better to allow "__dace_defer" only as an atomic symbol and never as part of an expression.
I will add this to the validation; thanks for pointing it out.
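A rough sketch of what such a validation check could look like (hypothetical helper, not the actual validation code in this PR):

# Hypothetical sketch: reject shapes where __dace_defer appears inside a larger
# expression (e.g. 4*__dace_defer) instead of standing alone as one dimension.
def check_deferred_symbol_is_atomic(shape):
    for i, dim in enumerate(shape):
        text = str(dim)
        if "__dace_defer" in text and text != "__dace_defer":
            raise ValueError(f"Dimension {i} uses __dace_defer inside an expression: {text}")

check_deferred_symbol_is_atomic(("N", "__dace_defer"))       # passes
# check_deferred_symbol_is_atomic(("N", "4*__dace_defer"))   # would raise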

@ThrudPrimrose (Collaborator, Author) replied:

Reshaping a deferred array is not valid, but there is a chance that a transformation results in an expression like this.

@ThrudPrimrose (Collaborator, Author) replied:

I might still need to think of a fix for this.

# We can check if the array requires special access using A_size[0] (CPU) or __A_dim0_size (GPU)
# by going through the shape and checking for symbols starting with __dace_defer
deferred_size_names = self._get_deferred_size_names(desc, memlet)
expr = cpp.cpp_array_expr(sdfg, memlet, codegen=self._frame, deferred_size_names=deferred_size_names)
Contributor comment:

To make the API cleaner, would it make sense to move _get_deferred_size_names inside cpp or cpp.cpp_array_expr?

@ThrudPrimrose (Collaborator, Author) replied:

I moved this function to the cpp module as far as I could, and I also try to hide it from function signatures as much as possible.

dace/sdfg/sdfg.py (review thread, outdated, resolved)
size_desc_name = f"{name}_size"
size_desc = dt.Array(dtype=dace.uint64,
                     shape=(len(datadesc.shape),),
                     storage=dtypes.StorageType.Default,
Contributor comment:

Shouldn't this always be CPU Heap? Default can also be resolved to GPU Global depending on the context, unless special care is taken in code generation and GPU transform. I think you have done so, but maybe it is better to make it clear here too.

@ThrudPrimrose (Collaborator, Author) replied:

I noticed this independently and decided to go with CPU_Heap. (The codegen always allocates size arrays on CPU_Heap; I did not go with Registers because registers are considered accessible inside GPU kernels, which would be dangerous here.)
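For illustration, the excerpt above would then read roughly as follows (a sketch with made-up stand-in values for name and datadesc; transient=True is an assumption):

import dace
from dace import data as dt, dtypes

# Stand-ins for the variables in the excerpt above (hypothetical values).
name = "A"
datadesc = dt.Array(dtype=dace.float32, shape=(10, 20))

# Sketch: the size descriptor with the storage made explicit (CPU_Heap instead
# of Default), matching the decision described in this thread.
size_desc_name = f"{name}_size"
size_desc = dt.Array(dtype=dace.uint64,
                     shape=(len(datadesc.shape),),
                     storage=dtypes.StorageType.CPU_Heap,
                     transient=True)  # assumption: size arrays are transient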

dace/sdfg/state.py (review thread, outdated, resolved)
dace/sdfg/validation.py (review thread, outdated, resolved)
dace/sdfg/validation.py (review thread, outdated, resolved)
tests/deferred_alloc_test.py (review thread, outdated, resolved)
@ThrudPrimrose removed the no-ci (Do not run any CI or actions for this PR) label on Dec 6, 2024
@ThrudPrimrose marked this pull request as ready for review on December 11, 2024
@ThrudPrimrose changed the title from "[DRAFT] Deferred Allocation Prototype" to "Deferred Allocation" on Dec 11, 2024
@ThrudPrimrose (Collaborator, Author) commented:

I want to share the design document.

With this proposal, we support dynamic allocation and reallocation of GPU_Global and CPU_Heap arrays.
The only DaCe data type supported for reallocation is dace.data.Array.

For CPU_Heap arrays, reallocation is performed through a call to realloc; for GPU_Global storage, it is implemented as a malloc, copy, free sequence, since CUDA does not support realloc.

Reallocation is only allowed in host-side code and can only be triggered when the scope is None. (Reallocation inside a map or a nested SDFG, for example, is not possible: realloc / malloc are generally not thread-safe, and a thread-safe implementation would not perform well.)
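A minimal sketch of the scope check implied here (hypothetical helper; in DaCe, state.entry_node(node) returns None for nodes outside any scope). It does not cover the nested-SDFG restriction:

# Hypothetical sketch: reallocation is only legal for access nodes that live
# at the top level of a state, i.e. outside any map or other scope.
def can_reallocate(state, access_node):
    return state.entry_node(access_node) is None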

A deferred array is generated only at the request of the user. This is done by including the symbol __dace_defer in any of the expressions in the shape of the array when it is added. It is invalid to reshape a deferred array. If the shape contains multiple appearances of the __dace_defer symbol, they are assumed to be the same symbol.
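For illustration, declaring such a deferred array looks roughly like the prototype snippet earlier in the thread (sketch; only meaningful with this PR applied):

import dace

# Sketch: a transient whose dimensions are deferred. Because __dace_defer
# appears in the shape, a companion size array "A_size" is created as well.
sdfg = dace.SDFG("deferred_example")
sdfg.add_array("A", shape=("__dace_defer", "__dace_defer"), dtype=dace.float32,
               storage=dace.dtypes.StorageType.CPU_Heap, transient=True)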

The array size is tracked by a unique size array connected to the deferred array. It is stored in the _arrays dictionary of the SDFG just like other arrays; the constructor of dace.data.Array sets is_size_array to true for size arrays and is_deferred_array to true for deferred arrays. These members are set only for arrays, not for other DaCe data types. The size array of an array is tracked through the array's size_desc_name variable. The size array's name is created by appending the _size suffix to the array's name.

The size array is always one-dimensional, and its length matches the number of dimensions (the length of the shape) of the deferred array. No size array is created if the array is not deferred.
The dimensions that do not contain __dace_defer are copied as initial values of the size array. If an array A has the shape (2*N, 4*__dace_defer), then the size array is initialized as A_size[2] = {2*N, 0}. The dimensions that do not contain __dace_defer are not accessed by the codegen, yet they are written to the size array so that the user can access them. When an offset expression is computed, the deferred dimensions are read from the size array (see below).
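A sketch of how those initial values could be derived from the shape (hypothetical helper, not the PR's code):

# Hypothetical sketch: initial values of the size array. Dimensions without
# __dace_defer are copied verbatim; deferred dimensions start at 0.
def initial_size_values(shape):
    return [0 if "__dace_defer" in str(dim) else dim for dim in shape]

# For A with shape (2*N, 4*__dace_defer) this yields [2*N, 0], i.e. A_size = {2*N, 0}.
print(initial_size_values(("2*N", "4*__dace_defer")))  # ['2*N', 0]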

Reallocation is triggered by writing to the special _write_size in connector of an access node. As part of the reallocation, the user provides only the new value of the __dace_defer symbol, which is written to the size array. If the new value the user writes for dimension 1 is 5, then for the array A above the latest value of __dace_defer becomes 5 and A_size[1] = 5, while the shape of A becomes [2*N][4*5]; any call to the existing functions of the cpp module still calculates the dimensions using the shape member of the array. To support the __dace_defer symbol and deferred allocation, some functions accept a parameter called deferred_size_names or generate it automatically by detecting __dace_defer symbols in the shape. In offset expressions, occurrences matching the __dace_defer_dim(\d+) pattern are replaced with accesses to the size array on the host. In GPU kernels, the pattern matching is slightly different.

The length of the write to the _write_size in connector always needs to match the number of dimensions of the array's shape. The values for dimensions that do not contain __dace_defer in their expressions are ignored.

The size of the array can be read from the _read_size out connector of an access node and used in maps.
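A sketch of reading the size through the _read_size out connector (connector name and the A_size data are taken from the design above; only meaningful on this branch):

import dace

# Sketch: read the current size of the deferred array A through its _read_size
# out connector into a small host-side transient.
sdfg = dace.SDFG("read_size_example")
sdfg.add_array("A", shape=("__dace_defer",), dtype=dace.float32,
               storage=dace.dtypes.StorageType.CPU_Heap, transient=True)
sdfg.add_array("current_size", shape=(1,), dtype=dace.uint64, transient=True)
state = sdfg.add_state("main")

a = state.add_access("A")
a.add_out_connector("_read_size")
sz = state.add_access("current_size")
# The size value flows out of the access node through the _read_size connector.
state.add_edge(a, "_read_size", sz, None, dace.Memlet("A_size[0]"))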

Even though the size arrays are allocated on the stack, passing them to GPU kernels as pointers would require a call to cudaMemcpy to transfer the size array to the GPU. To mitigate this (and to have a more performant implementation than one more memcpy before every kernel), the size array is unpacked into integers. The name-mangling pattern is as follows: the current value of the i-th deferred dimension of array A is read from A_size[i] and mangled as __A_dim<i>_size; for example, the name mangled from A_size[1] is __A_dim1_size. These unpacked sizes of deferred symbols are passed as integers to the kernel. The _get_deferred_size_names function of the cpp module handles the generation of mangled names for array and offset accesses, and the CUDA codegen handles unpacking the size array and instantiating the mangled (integer) variables.
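A sketch of the mangling scheme described here (hypothetical helper; the real logic lives in _get_deferred_size_names and the CUDA codegen):

# Hypothetical sketch: the value read from A_size[i] on the host is passed to
# the kernel as a plain integer named __A_dim<i>_size.
def mangled_size_name(array_name, dim):
    return f"__{array_name}_dim{dim}_size"

print(mangled_size_name("A", 1))  # __A_dim1_size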

My concerns:
Are the _size names misleading? Maybe something like _deferred_symbol_size would be clearer?

@ThrudPrimrose (Collaborator, Author) commented:

The bold text did not copy over; I have a Google Doc for the same purpose:

https://docs.google.com/document/d/1fBinC5d0gpBnYD9C4M3e0zyxVkGD94LZFzzTCydM2sQ/edit?usp=sharing
