Support nested SDFGs in `ScheduleType.GPU_ThreadBlock_Dynamic` #1189

lamyiowce · 2023-02-15T15:24:15Z

Describe the bug
Encountered when implementing a Graph Attentional Operator. When using ScheduleType.GPU_ThreadBlock_Dynamic, incorrect code gets generated. The error is:

(dace14) [jbazinsk@ault24 bugs]$ python compile_error.py
Keeping schedule sequential for  _Mult__map[__i0=0:3]
Changing schedule to TB dynamic:  assign_47_8_map[__i0=0:3]
Traceback (most recent call last):
  File "/users/jbazinsk/miniconda3/envs/dace14/lib/python3.8/site-packages/dace/codegen/compiler.py", line 224, in configure_and_compile
    _run_liveoutput("cmake --build . --config %s" % (Config.get('compiler', 'build_type')),
  File "/users/jbazinsk/miniconda3/envs/dace14/lib/python3.8/site-packages/dace/codegen/compiler.py", line 407, in _run_liveoutput
    raise subprocess.CalledProcessError(process.returncode, command, output.getvalue())
subprocess.CalledProcessError: Command 'cmake --build . --config RelWithDebInfo' returned non-zero exit status 2.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "compile_error.py", line 92, in <module>
    main()
  File "compile_error.py", line 73, in main
    sdfg(node_features=node_features, factor=factor, rows=rows, columns=columns,
  File "/users/jbazinsk/miniconda3/envs/dace14/lib/python3.8/site-packages/dace/sdfg/sdfg.py", line 2289, in __call__
    binaryobj = sdfg.compile()
  File "/users/jbazinsk/miniconda3/envs/dace14/lib/python3.8/site-packages/dace/sdfg/sdfg.py", line 2198, in compile
    shared_library = compiler.configure_and_compile(program_folder, sdfg.name)
  File "/users/jbazinsk/miniconda3/envs/dace14/lib/python3.8/site-packages/dace/codegen/compiler.py", line 233, in configure_and_compile
    raise cgx.CompilationError('Compiler failure:\n' + ex.output)
dace.codegen.exceptions.CompilationError: Compiler failure:
[ 20%] Building NVCC (Device) object CMakeFiles/cuda_compile_1.dir/__/__/__/__/__/__/__/__/daceml/examples/bugs/.dacecache/prog/src/cuda/cuda_compile_1_generated_prog_cuda.cu.o
/users/jbazinsk/daceml/examples/bugs/.dacecache/prog/src/cuda/prog_cuda.cu(43): error: identifier "i" is undefined
/users/jbazinsk/daceml/examples/bugs/.dacecache/prog/src/cuda/prog_cuda.cu(47): error: identifier "dace_dyn_map_shared" is undefined

/users/jbazinsk/daceml/examples/bugs/.dacecache/prog/src/cuda/prog_cuda.cu(152): warning: variable "dace_dyn_map_shared" was declared but never referenced

2 errors detected in the compilation of "/users/jbazinsk/daceml/examples/bugs/.dacecache/prog/src/cuda/prog_cuda.cu".
CMake Error at cuda_compile_1_generated_prog_cuda.cu.o.cmake:276 (message):
  Error generating file
  /users/jbazinsk/daceml/examples/bugs/.dacecache/prog/build/CMakeFiles/cuda_compile_1.dir/__/__/__/__/__/__/__/__/daceml/examples/bugs/.dacecache/prog/src/cuda/./cuda_compile_1_generated_prog_cuda.cu.o


gmake[2]: *** [CMakeFiles/prog.dir/build.make:77: CMakeFiles/cuda_compile_1.dir/__/__/__/__/__/__/__/__/daceml/examples/bugs/.dacecache/prog/src/cuda/cuda_compile_1_generated_prog_cuda.cu.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:84: CMakeFiles/prog.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2

To Reproduce
Run the following code:

import dace
import numpy as np

N = 4
num_entries = 9
M = 3

np.random.seed(42)

def dynamic_schedule(sdfg, exclude_loops):
    """Change GPU sequential loops to dynamic."""
    exclude_loops = {name: 0 for name in exclude_loops} or {}
    for node in sdfg.all_nodes_recursive():
        if isinstance(node[0], dace.sdfg.nodes.MapEntry) \
                and node[0].schedule == dace.dtypes.ScheduleType.Sequential \
                and len(node[0].map.params):
            if node[0].label not in exclude_loops:
                print("Changing schedule to TB dynamic: ", node[0].map)
                node[0].schedule = dace.ScheduleType.GPU_ThreadBlock_Dynamic
            else:
                exclude_loops[node[0].label] += 1
                print("Keeping schedule sequential for ", node[0].map)

    not_excluded = [
        name for name, count in exclude_loops.items() if count == 0
    ]
    if not_excluded:
        print(
            "Following loops were marked as excluded from thread-block dynamic "
            "scheduling but were not found in the SDFG: %s", not_excluded)


@dace.program
def prog(node_features, factor, rows, columns, output):
    """
    node_features: input features, N x F
    factor: num_entries
    rows: num_entries
    columns: num_entries
    output: N x M
    """
    output[:] = 0
    for i in dace.map[0:num_entries]:
        col = columns[i]
        row = rows[i]
        output[col] += factor[i] * node_features[row]


def main():
    node_features = np.random.rand(N, M).astype(np.float32)
    adj_matrix = np.array([[1., 0, 1, 0],
                           [1., 1, 1, 0],
                           [0., 1, 1, 1],
                           [0., 0, 1, 0]])
    rows, columns = adj_matrix.nonzero()
    rows = rows.copy()
    columns = columns.copy()
    factor = np.random.rand(num_entries).astype(np.float32)
    output = np.zeros((N, M), dtype=np.float32)
    expected_output = np.zeros((N, M), dtype=np.float32)

    sdfg: dace.SDFG = prog.to_sdfg(node_features=node_features, factor=factor,
                                   rows=rows,
                                   columns=columns,
                                   output=output)

    sdfg.apply_gpu_transformations()
    dynamic_schedule(sdfg, exclude_loops=[
        # The below map also doesn't allow to use the dynamic schedule because
        # only one-dimensional maps are supported in DaCe for dynamic block map
        # schedule (got 2), but that's a different issue.
        '_Mult__map',
    ])
    sdfg(node_features=node_features, factor=factor, rows=rows, columns=columns,
         output=output)

    prog.f(node_features=node_features, factor=factor, rows=rows,
           columns=columns,
           output=expected_output)

    if np.allclose(output, expected_output):
        print("\n==== Results correct.  ☆ ╰(o＾◡＾o)╯ ☆ ====")
    else:
        print("\n*↯*↯*↯* INCORRECT RESULTS! (ノಥ﹏ಥ)ノ彡┻━┻ *↯*↯*↯*")

    print("Actual output:")
    print(output)
    print("Expected output:")
    print(expected_output)


if __name__ == '__main__':
    main()

Expected behavior
The code compiles, computes correct results and shows a smiley face.

System
On ault intel V100 node, CentoOs 8.4, Python 3.8, issue occurs with both DaCe 0.14.1 and 0.13.

Additional context
The erroneous generated CUDA code is here:


#include <cuda_runtime.h>
#include <dace/dace.h>


struct prog_t {
    dace::cuda::Context *gpu_context;
};



DACE_EXPORTED int __dace_init_cuda(prog_t *__state);
DACE_EXPORTED void __dace_exit_cuda(prog_t *__state);

DACE_DFI void prog_43_4_0_0_2(const long long&  __tmp_44_14_r, const long long&  __tmp_45_14_r, const float&  __tmp_46_23_r, const float * __restrict__ __tmp_46_35_r, float * __restrict__ __tmp_46_8_w) {
    long long col;
    long long row;


    col = __tmp_44_14_r;
    row = __tmp_45_14_r;
    {
        float __tmp4[3]  DACE_ALIGN(64);

        {
            for (auto __i0 = 0; __i0 < 3; __i0 += 1) {
                {
                    float __in1 = __tmp_46_23_r;
                    float __in2 = __tmp_46_35_r[(__i0 + (3 * row))];
                    float __out;

                    ///////////////////
                    // Tasklet code (_Mult_)
                    __out = (__in1 * __in2);
                    ///////////////////

                    __tmp4[__i0] = __out;
                }
            }
        }
        {
            unsigned int __dace_dynmap_begin = 0, __dace_dynmap_end = 0;
            if (i < 9) {
                __dace_dynmap_begin = 0;
                __dace_dynmap_end = 3;
            }
            dace::DynamicMap<true, 128>::schedule(dace_dyn_map_shared, __dace_dynmap_begin, __dace_dynmap_end, i, [&](auto i, auto __i0) {
                {
                    float __inp = __tmp4[__i0];
                    float __out;

                    ///////////////////
                    // Tasklet code (assign_46_8)
                    __out = __inp;
                    ///////////////////

                    dace::wcr_fixed<dace::ReductionType::Sum, float>::reduce_atomic(__tmp_46_8_w + (__i0 + (3 * col)), __out);
                }
            });
        }

    }

}



int __dace_init_cuda(prog_t *__state) {
    int count;

    // Check that we are able to run cuda code
    if (cudaGetDeviceCount(&count) != cudaSuccess)
    {
        printf("ERROR: GPU drivers are not configured or cuda-capable device "
               "not found\n");
        return 1;
    }
    if (count == 0)
    {
        printf("ERROR: No cuda-capable devices found\n");
        return 2;
    }

    // Initialize cuda before we run the application
    float *dev_X;
    cudaMalloc((void **) &dev_X, 1);
    cudaFree(dev_X);

    

    __state->gpu_context = new dace::cuda::Context(5, 1);

    // Create cuda streams and events
    for(int i = 0; i < 5; ++i) {
        cudaStreamCreateWithFlags(&__state->gpu_context->streams[i], cudaStreamNonBlocking);
    }
    for(int i = 0; i < 1; ++i) {
        cudaEventCreateWithFlags(&__state->gpu_context->events[i], cudaEventDisableTiming);
    }

    

    return 0;
}

void __dace_exit_cuda(prog_t *__state) {
    

    // Destroy cuda streams and events
    for(int i = 0; i < 5; ++i) {
        cudaStreamDestroy(__state->gpu_context->streams[i]);
    }
    for(int i = 0; i < 1; ++i) {
        cudaEventDestroy(__state->gpu_context->events[i]);
    }

    delete __state->gpu_context;
}

__global__ void assign_42_4_map_0_1_8(float * __restrict__ gpu_output) {
    {
        {
            int __i1 = (blockIdx.x * 128 + threadIdx.x);
            int __i0 = (blockIdx.y * 1 + threadIdx.y);
            if (__i1 < 3) {
                {
                    {
                        float __out;

                        ///////////////////
                        // Tasklet code (assign_42_4)
                        __out = 0;
                        ///////////////////

                        gpu_output[((3 * __i0) + __i1)] = __out;
                    }
                }
            }
        }
    }
}


DACE_EXPORTED void __dace_runkernel_assign_42_4_map_0_1_8(prog_t *__state, float * __restrict__ gpu_output);
void __dace_runkernel_assign_42_4_map_0_1_8(prog_t *__state, float * __restrict__ gpu_output)
{

    void  *assign_42_4_map_0_1_8_args[] = { (void *)&gpu_output };
    cudaLaunchKernel((void*)assign_42_4_map_0_1_8, dim3(1, 4, 1), dim3(128, 1, 1), assign_42_4_map_0_1_8_args, 0, __state->gpu_context->streams[0]);
}
__global__ void prog_43_0_0_0(const long long * __restrict__ gpu_columns, const float * __restrict__ gpu_factor, const float * __restrict__ gpu_node_features, float * __restrict__ gpu_output, const long long * __restrict__ gpu_rows) {
    __shared__ dace::DynamicMap<true, 128>::shared_type dace_dyn_map_shared;
    {
        int i = (blockIdx.x * 128 + threadIdx.x);
        prog_43_4_0_0_2(gpu_columns[i], gpu_rows[i], gpu_factor[i], &gpu_node_features[0], &gpu_output[0]);
    }
}


DACE_EXPORTED void __dace_runkernel_prog_43_0_0_0(prog_t *__state, const long long * __restrict__ gpu_columns, const float * __restrict__ gpu_factor, const float * __restrict__ gpu_node_features, float * __restrict__ gpu_output, const long long * __restrict__ gpu_rows);
void __dace_runkernel_prog_43_0_0_0(prog_t *__state, const long long * __restrict__ gpu_columns, const float * __restrict__ gpu_factor, const float * __restrict__ gpu_node_features, float * __restrict__ gpu_output, const long long * __restrict__ gpu_rows)
{

    void  *prog_43_0_0_0_args[] = { (void *)&gpu_columns, (void *)&gpu_factor, (void *)&gpu_node_features, (void *)&gpu_output, (void *)&gpu_rows };
    cudaLaunchKernel((void*)prog_43_0_0_0, dim3(1, 1, 1), dim3(128, 1, 1), prog_43_0_0_0_args, 0, __state->gpu_context->streams[0]);
}

The text was updated successfully, but these errors were encountered:

tbennun · 2024-10-29T05:54:21Z

Will be tested in #1711

Update: This issue will not be fixed in the above PR. It is a bit more involved in the sense that a work-stealing schedule adapts the block index too (because it load-balances work across multiple GPU_Device indices. This will be resolved in a PR of its own.

The PR adds a new syntax to support inline storage specification with the `@` operator, supporting the following statements: `a = np.ones(M) @ dace.StorageType.CPU_ThreadLocal`. This PR also fixes multiple minor issues in the Python frontend: * `WarpTiling` did not respect sequential map schedules * Non-sequence inputs for `numpy.fill` variants (e.g., `numpy.zeros(N)`) * NumPy replacement syntax errors would sometimes not have source information * Fix type inference for nested scopes in Python frontend * Dynamic thread block scheduling does not support multi-dimensional maps * Default schedule inference should use dynamic thread blocks if they exist * Type hints with storage type not being adhered to by the Python frontend * Validation issue #1562 The following changes were added as skipped tests and deferred to future PRs: * Dynamic map range related issues: Fix deferred to #1696 * Dynamic thread block scheduling would not pass object to nested functions: Fix deferred to future PR, see #1189 for more information

tbennun added the bug Something isn't working label Oct 29, 2024

tbennun added this to the 1.0 milestone Oct 29, 2024

tbennun linked a pull request Oct 29, 2024 that will close this issue

Python frontend stability and inline storage specification #1711

Merged

10 tasks

tbennun changed the title ~~Compilation error on GPU when using ScheduleType.GPU_ThreadBlock_Dynamic~~ Support nested SDFGs in ScheduleType.GPU_ThreadBlock_Dynamic Oct 29, 2024

tbennun removed a link to a pull request Oct 29, 2024

Python frontend stability and inline storage specification #1711

Merged

10 tasks

tbennun mentioned this issue Oct 29, 2024

Python frontend stability and inline storage specification #1711

Merged

10 tasks

tbennun modified the milestones: 1.0, 1.x Nov 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support nested SDFGs in `ScheduleType.GPU_ThreadBlock_Dynamic` #1189

Support nested SDFGs in `ScheduleType.GPU_ThreadBlock_Dynamic` #1189

lamyiowce commented Feb 15, 2023

tbennun commented Oct 29, 2024 •

edited

Loading

Support nested SDFGs in ScheduleType.GPU_ThreadBlock_Dynamic #1189

Support nested SDFGs in ScheduleType.GPU_ThreadBlock_Dynamic #1189

Comments

lamyiowce commented Feb 15, 2023

tbennun commented Oct 29, 2024 • edited Loading

Support nested SDFGs in `ScheduleType.GPU_ThreadBlock_Dynamic` #1189

Support nested SDFGs in `ScheduleType.GPU_ThreadBlock_Dynamic` #1189

tbennun commented Oct 29, 2024 •

edited

Loading