
Deferred Allocation #1704

Open · wants to merge 49 commits into main

Conversation

@ThrudPrimrose (Collaborator) commented:

This is possibly the start of a long PR that will evolve, and we should discuss it at every step.

The main idea is as follows (so far in the prototype):

Idea 1:
When an array is created, let's say "A", a size array "A_size" is also created. The size array is always one-dimensional and contiguously stored. When there is an edge A_size -> A (the in connector needs to be _size), we trigger a reallocation operation rather than a copy operation. I am not sure whether we should have an A_size array at all; I would like to discuss this.

I have a small SDFG that shows what it looks like.

import dace

def one():
    sdfg = dace.sdfg.SDFG(name="deferred_alloc_test")

    # Deferred transient: both dimensions use the __dace_defer placeholder.
    sdfg.add_array(name="A", shape=("__dace_defer", "__dace_defer"), dtype=dace.float32,
                   storage=dace.dtypes.StorageType.Default, transient=True)

    state = sdfg.add_state("main")

    # Access node of A with the size in connector that triggers the reallocation.
    an_1 = state.add_access('A')
    an_1.add_in_connector('IN_size')

    # Scalars holding the new extents of the two dimensions.
    an_2 = state.add_scalar(name="dim0", dtype=dace.uint64)
    an_3 = state.add_scalar(name="dim1", dtype=dace.uint64)

    # Two access nodes of the size array A_size that accompanies A.
    s_an_1 = dace.nodes.AccessNode(data="A_size")
    state.add_node(s_an_1)
    s_an_2 = dace.nodes.AccessNode(data="A_size")
    state.add_node(s_an_2)
    state.add_edge(s_an_1, None, s_an_2, None,
                   dace.Memlet(None))

    # Writing the full size array into A's IN_size connector triggers the realloc.
    state.add_edge(s_an_2, None, an_1, 'IN_size',
                   dace.Memlet(expr="A_size[0:2]"))

    # The new dimension values are written into the size array element-wise.
    state.add_edge(an_2, None, s_an_1, None,
                   dace.Memlet(expr="A_size[0]"))
    state.add_edge(an_3, None, s_an_2, None,
                   dace.Memlet(expr="A_size[1]"))

    sdfg.save("def_alloc_1.sdfg")
    sdfg.validate()

    return sdfg

s = one()
s.generate_code()
s(dim0=512, dim1=512)

The generated code looks as follows:
[screenshot of the generated code]
And the SDFG:
[screenshot of the SDFG]

Next Step:
Not all of the size array will always be used; for example, when calling realloc, only the "__dace_defer" dimensions will be read from it, while the dimensions passed as symbols will be read from the respective symbols (these cannot be changed throughout the SDFG). (I am currently working on this step.)
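As a minimal sketch of that selection (a hypothetical helper, not the PR's actual code; it mirrors the startswith-based mask that appears later in the diff):

# Hypothetical sketch: find the dimensions of a shape that are deferred.
# Only these dimensions are read from the size array when emitting the realloc;
# the remaining dimensions keep coming from their symbols.
def deferred_dimensions(shape):
    return [i for i, dim in enumerate(shape) if "__dace_defer" in str(dim)]

# Example: only dimension 1 would be read from the size array.
print(deferred_dimensions(("N", "__dace_defer")))  # [1]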

@ThrudPrimrose added the no-ci (Do not run any CI or actions for this PR) label on Oct 24, 2024
@tbennun (Collaborator) commented Oct 24, 2024

I would not use IN_size (i.e., a throughput connector), maybe use just size.

@ThrudPrimrose (Collaborator, Author) replied:

> I would not use IN_size (i.e., a throughput connector), maybe use just size.

About the connector: we can also read the size, which I had planned to call "OUT_size". Would you still suggest naming both connectors size?

@tbennun (Collaborator) commented Oct 24, 2024

> > I would not use IN_size (i.e., a throughput connector), maybe use just size.
>
> About the connector: we can also read the size, which I had planned to call "OUT_size". Would you still suggest naming both connectors size?

Yes, exactly, since those connectors are endpoints and do not flow through the access node (as they would in a scope node).

@ThrudPrimrose (Collaborator, Author) replied:

> > > I would not use IN_size (i.e., a throughput connector), maybe use just size.
> >
> > About the connector: we can also read the size, which I had planned to call "OUT_size". Would you still suggest naming both connectors size?
>
> Yes, exactly, since those connectors are endpoints and do not flow through the access node (as they would in a scope node).

I updated the implementation according to the discussion we had with Torsten and Lex, and changed it to use "size" for both the in and out connectors. The problem now is that this SDFG does not validate because the access node has duplicate connectors. I think it is better to make the size connectors distinct rather than changing the validation rules. Or should I update the validation procedure so that access nodes are an exception?

@alexnick83 self-requested a review on December 2, 2024
@ThrudPrimrose added the no-ci (Do not run any CI or actions for this PR) label on Dec 3, 2024
@alexnick83 (Contributor) left a review:

LGTM overall. I have some questions, but ignore the one that was already discussed in the chat.

dace/codegen/dispatcher.py (review thread, outdated, resolved)
dace/codegen/targets/cpu.py (review thread, resolved)
dtype = sdfg.arrays[data_name].dtype

# Only consider the offsets with __dace_defer in original dim
mask_array = [str(dim).startswith("__dace_defer") for dim in data.shape]
Contributor comment on this snippet:

Is there any chance of strange transients with shapes that combine constants or normal symbols with deferred symbols, e.g., 4 * __dace_defer__x? This could occur, for example, through an operation whose inputs include both normal arrays and deferred arrays.

Contributor comment:

Random thought, but maybe it would be safer to match the __dace_defer pattern rather than use startswith?

@ThrudPrimrose (Collaborator, Author) replied:

To keep it simple, I think it is better to allow "__dace_defer" only as an atomic symbol and never as part of an expression.
I will add this to the validation; thanks for pointing it out.
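A rough sketch of what such a validation check could look like (hypothetical helper, not the actual validation code in this PR):

# Hypothetical sketch: reject shapes where __dace_defer appears inside a larger
# expression (e.g. 4*__dace_defer) instead of standing alone as one dimension.
def check_deferred_symbol_is_atomic(shape):
    for i, dim in enumerate(shape):
        text = str(dim)
        if "__dace_defer" in text and text != "__dace_defer":
            raise ValueError(f"Dimension {i} uses __dace_defer inside an expression: {text}")

check_deferred_symbol_is_atomic(("N", "__dace_defer"))       # passes
# check_deferred_symbol_is_atomic(("N", "4*__dace_defer"))   # would raise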

@ThrudPrimrose (Collaborator, Author) replied:

Reshaping a deferred array is not valid, but there is a chance that a transformation results in an expression like this.

@ThrudPrimrose (Collaborator, Author) replied:

I might still need to think of a fix for this.

# We can check if the array requires special access using A_size[0] (CPU) or __A_dim0_size (GPU)
# by going through the shape and checking for symbols starting with __dace_defer
deferred_size_names = self._get_deferred_size_names(desc, memlet)
expr = cpp.cpp_array_expr(sdfg, memlet, codegen=self._frame, deferred_size_names=deferred_size_names)
Contributor comment:

To make the API cleaner, would it make sense to move _get_deferred_size_names inside cpp or cpp.cpp_array_expr?

@ThrudPrimrose (Collaborator, Author) replied:

I moved this function to the cpp module as far as I could, and I also try to hide it from function signatures as much as possible.

dace/sdfg/sdfg.py (review thread, outdated, resolved)
size_desc_name = f"{name}_size"
size_desc = dt.Array(dtype=dace.uint64,
                     shape=(len(datadesc.shape),),
                     storage=dtypes.StorageType.Default,
Contributor comment:

Shouldn't this always be CPU Heap? Default can also be resolved to GPU Global depending on the context, unless special care is taken in code generation and GPU transform. I think you have done so, but maybe it is better to make it clear here too.

@ThrudPrimrose (Collaborator, Author) replied:

I noticed this independently and decided to go with CPU_Heap. (The codegen always allocates size arrays on CPU_Heap; I did not go with Registers because registers are considered accessible inside GPU kernels, which would be dangerous here.)
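For illustration, the excerpt above would then read roughly as follows (a sketch with made-up stand-in values for name and datadesc; transient=True is an assumption):

import dace
from dace import data as dt, dtypes

# Stand-ins for the variables in the excerpt above (hypothetical values).
name = "A"
datadesc = dt.Array(dtype=dace.float32, shape=(10, 20))

# Sketch: the size descriptor with the storage made explicit (CPU_Heap instead
# of Default), matching the decision described in this thread.
size_desc_name = f"{name}_size"
size_desc = dt.Array(dtype=dace.uint64,
                     shape=(len(datadesc.shape),),
                     storage=dtypes.StorageType.CPU_Heap,
                     transient=True)  # assumption: size arrays are transient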

dace/sdfg/state.py (review thread, outdated, resolved)
dace/sdfg/validation.py (review thread, outdated, resolved)
dace/sdfg/validation.py (review thread, outdated, resolved)
tests/deferred_alloc_test.py (review thread, outdated, resolved)
@ThrudPrimrose removed the no-ci (Do not run any CI or actions for this PR) label on Dec 6, 2024
@ThrudPrimrose marked this pull request as ready for review on December 11, 2024
@ThrudPrimrose changed the title from "[DRAFT] Deferred Allocation Prototype" to "Deferred Allocation" on Dec 11, 2024
@ThrudPrimrose (Collaborator, Author) commented:

I want to share the design document.

With this proposal, we support dynamic allocation and reallocation of GPU_Global and CPU_Heap arrays.
The only DaCe data type supported for reallocation is dace.data.Array.

For CPU_Heap arrays, reallocation is performed through a call to realloc; for GPU_Global storage, it is implemented as a malloc, copy, free sequence, since CUDA does not support realloc.

Reallocation is only allowed in host-side code and can only be triggered when the scope is None. (Reallocation inside a map or a nested SDFG, for example, is not possible: realloc / malloc are generally not thread-safe, and a thread-safe implementation would not perform well.)
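A minimal sketch of the scope check implied here (hypothetical helper; in DaCe, state.entry_node(node) returns None for nodes outside any scope). It does not cover the nested-SDFG restriction:

# Hypothetical sketch: reallocation is only legal for access nodes that live
# at the top level of a state, i.e. outside any map or other scope.
def can_reallocate(state, access_node):
    return state.entry_node(access_node) is None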

A deferred array is generated only at the request of the user. This is done by including the symbol __dace_defer in any of the expressions in the shape of the array when it is added. It is invalid to reshape a deferred array. If the shape contains multiple appearances of the __dace_defer symbol, they are assumed to be the same symbol.
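For illustration, declaring such a deferred array looks roughly like the prototype snippet earlier in the thread (sketch; only meaningful with this PR applied):

import dace

# Sketch: a transient whose dimensions are deferred. Because __dace_defer
# appears in the shape, a companion size array "A_size" is created as well.
sdfg = dace.SDFG("deferred_example")
sdfg.add_array("A", shape=("__dace_defer", "__dace_defer"), dtype=dace.float32,
               storage=dace.dtypes.StorageType.CPU_Heap, transient=True)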

The array size is tracked by a unique size array connected to the deferred array. It is stored in the _arrays dictionary of the SDFG just like other arrays; the constructor of dace.data.Array sets is_size_array to true for size arrays and is_deferred_array to true for deferred arrays. These members are set only for arrays, not for other DaCe data types. The size array of an array is tracked through the array's size_desc_name variable. The size array's name is created by appending the _size suffix to the array's name.

The size array is always one-dimensional, and its length matches the number of dimensions (the length of the shape) of the deferred array. No size array is created if the array is not deferred.
The dimensions that do not contain __dace_defer are copied as initial values of the size array. If an array A has the shape (2*N, 4*__dace_defer), then the size array is initialized as A_size[2] = {2*N, 0}. The dimensions that do not contain __dace_defer are not accessed by the codegen, yet they are written to the size array so that the user can access them. When an offset expression is computed, the deferred dimensions are read from the size array (see below).
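A sketch of how those initial values could be derived from the shape (hypothetical helper, not the PR's code):

# Hypothetical sketch: initial values of the size array. Dimensions without
# __dace_defer are copied verbatim; deferred dimensions start at 0.
def initial_size_values(shape):
    return [0 if "__dace_defer" in str(dim) else dim for dim in shape]

# For A with shape (2*N, 4*__dace_defer) this yields [2*N, 0], i.e. A_size = {2*N, 0}.
print(initial_size_values(("2*N", "4*__dace_defer")))  # ['2*N', 0]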

Reallocation is triggered by writing to the special _write_size in connector of an access node. As part of the reallocation, the user provides only the new value of the __dace_defer symbol, which is written to the size array. If the new value the user writes for dimension 1 is 5, then for the array A above the latest value of __dace_defer becomes 5 and A_size[1] = 5, while the shape of A becomes [2*N][4*5]; any call to the existing functions of the cpp module still calculates the dimensions using the shape member of the array. To support the __dace_defer symbol and deferred allocation, some functions accept a parameter called deferred_size_names or generate it automatically by detecting __dace_defer symbols in the shape. In offset expressions, occurrences matching the __dace_defer_dim(\d+) pattern are replaced with accesses to the size array on the host. In GPU kernels, the pattern matching is slightly different.

The length of the write to the _write_size in connector always needs to match the number of dimensions of the array's shape. The values for dimensions that do not contain __dace_defer in their expressions are ignored.

The size of the array can be read from the _read_size out connector of an access node and used in maps.
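A sketch of reading the size through the _read_size out connector (connector name and the A_size data are taken from the design above; only meaningful on this branch):

import dace

# Sketch: read the current size of the deferred array A through its _read_size
# out connector into a small host-side transient.
sdfg = dace.SDFG("read_size_example")
sdfg.add_array("A", shape=("__dace_defer",), dtype=dace.float32,
               storage=dace.dtypes.StorageType.CPU_Heap, transient=True)
sdfg.add_array("current_size", shape=(1,), dtype=dace.uint64, transient=True)
state = sdfg.add_state("main")

a = state.add_access("A")
a.add_out_connector("_read_size")
sz = state.add_access("current_size")
# The size value flows out of the access node through the _read_size connector.
state.add_edge(a, "_read_size", sz, None, dace.Memlet("A_size[0]"))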

Even though the size arrays are allocated on the stack, passing them to GPU kernels as pointers would require a call to cudaMemcpy to transfer the size array to the GPU. To mitigate this (and to have a more performant implementation than one more memcpy before every kernel), the size array is unpacked into integers. The name-mangling pattern is as follows: the current value of the i-th deferred dimension of array A is read from A_size[i] and mangled as __A_dim<i>_size; for example, the name mangled from A_size[1] is __A_dim1_size. These unpacked sizes of deferred symbols are passed as integers to the kernel. The _get_deferred_size_names function of the cpp module handles the generation of mangled names for array and offset accesses, and the CUDA codegen handles unpacking the size array and instantiating the mangled (integer) variables.
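A sketch of the mangling scheme described here (hypothetical helper; the real logic lives in _get_deferred_size_names and the CUDA codegen):

# Hypothetical sketch: the value read from A_size[i] on the host is passed to
# the kernel as a plain integer named __A_dim<i>_size.
def mangled_size_name(array_name, dim):
    return f"__{array_name}_dim{dim}_size"

print(mangled_size_name("A", 1))  # __A_dim1_size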

My concerns:
Are the _size names misleading? Maybe something like _deferred_symbol_size would be clearer?

@ThrudPrimrose (Collaborator, Author) commented:

The bold text did not copy over; I have a Google Doc for the same purpose:

https://docs.google.com/document/d/1fBinC5d0gpBnYD9C4M3e0zyxVkGD94LZFzzTCydM2sQ/edit?usp=sharing
