Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support nested SDFGs in ScheduleType.GPU_ThreadBlock_Dynamic #1189

Open
lamyiowce opened this issue Feb 15, 2023 · 1 comment
Open

Support nested SDFGs in ScheduleType.GPU_ThreadBlock_Dynamic #1189

lamyiowce opened this issue Feb 15, 2023 · 1 comment
Labels
bug Something isn't working
Milestone

Comments

@lamyiowce
Copy link
Contributor

Describe the bug
Encountered when implementing a Graph Attentional Operator. When using ScheduleType.GPU_ThreadBlock_Dynamic, incorrect code gets generated. The error is:

(dace14) [jbazinsk@ault24 bugs]$ python compile_error.py
Keeping schedule sequential for  _Mult__map[__i0=0:3]
Changing schedule to TB dynamic:  assign_47_8_map[__i0=0:3]
Traceback (most recent call last):
  File "/users/jbazinsk/miniconda3/envs/dace14/lib/python3.8/site-packages/dace/codegen/compiler.py", line 224, in configure_and_compile
    _run_liveoutput("cmake --build . --config %s" % (Config.get('compiler', 'build_type')),
  File "/users/jbazinsk/miniconda3/envs/dace14/lib/python3.8/site-packages/dace/codegen/compiler.py", line 407, in _run_liveoutput
    raise subprocess.CalledProcessError(process.returncode, command, output.getvalue())
subprocess.CalledProcessError: Command 'cmake --build . --config RelWithDebInfo' returned non-zero exit status 2.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "compile_error.py", line 92, in <module>
    main()
  File "compile_error.py", line 73, in main
    sdfg(node_features=node_features, factor=factor, rows=rows, columns=columns,
  File "/users/jbazinsk/miniconda3/envs/dace14/lib/python3.8/site-packages/dace/sdfg/sdfg.py", line 2289, in __call__
    binaryobj = sdfg.compile()
  File "/users/jbazinsk/miniconda3/envs/dace14/lib/python3.8/site-packages/dace/sdfg/sdfg.py", line 2198, in compile
    shared_library = compiler.configure_and_compile(program_folder, sdfg.name)
  File "/users/jbazinsk/miniconda3/envs/dace14/lib/python3.8/site-packages/dace/codegen/compiler.py", line 233, in configure_and_compile
    raise cgx.CompilationError('Compiler failure:\n' + ex.output)
dace.codegen.exceptions.CompilationError: Compiler failure:
[ 20%] Building NVCC (Device) object CMakeFiles/cuda_compile_1.dir/__/__/__/__/__/__/__/__/daceml/examples/bugs/.dacecache/prog/src/cuda/cuda_compile_1_generated_prog_cuda.cu.o
/users/jbazinsk/daceml/examples/bugs/.dacecache/prog/src/cuda/prog_cuda.cu(43): error: identifier "i" is undefined
/users/jbazinsk/daceml/examples/bugs/.dacecache/prog/src/cuda/prog_cuda.cu(47): error: identifier "dace_dyn_map_shared" is undefined

/users/jbazinsk/daceml/examples/bugs/.dacecache/prog/src/cuda/prog_cuda.cu(152): warning: variable "dace_dyn_map_shared" was declared but never referenced

2 errors detected in the compilation of "/users/jbazinsk/daceml/examples/bugs/.dacecache/prog/src/cuda/prog_cuda.cu".
CMake Error at cuda_compile_1_generated_prog_cuda.cu.o.cmake:276 (message):
  Error generating file
  /users/jbazinsk/daceml/examples/bugs/.dacecache/prog/build/CMakeFiles/cuda_compile_1.dir/__/__/__/__/__/__/__/__/daceml/examples/bugs/.dacecache/prog/src/cuda/./cuda_compile_1_generated_prog_cuda.cu.o


gmake[2]: *** [CMakeFiles/prog.dir/build.make:77: CMakeFiles/cuda_compile_1.dir/__/__/__/__/__/__/__/__/daceml/examples/bugs/.dacecache/prog/src/cuda/cuda_compile_1_generated_prog_cuda.cu.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:84: CMakeFiles/prog.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2


To Reproduce
Run the following code:

import dace
import numpy as np

N = 4
num_entries = 9
M = 3

np.random.seed(42)

def dynamic_schedule(sdfg, exclude_loops):
    """Change GPU sequential loops to dynamic."""
    exclude_loops = {name: 0 for name in exclude_loops} or {}
    for node in sdfg.all_nodes_recursive():
        if isinstance(node[0], dace.sdfg.nodes.MapEntry) \
                and node[0].schedule == dace.dtypes.ScheduleType.Sequential \
                and len(node[0].map.params):
            if node[0].label not in exclude_loops:
                print("Changing schedule to TB dynamic: ", node[0].map)
                node[0].schedule = dace.ScheduleType.GPU_ThreadBlock_Dynamic
            else:
                exclude_loops[node[0].label] += 1
                print("Keeping schedule sequential for ", node[0].map)

    not_excluded = [
        name for name, count in exclude_loops.items() if count == 0
    ]
    if not_excluded:
        print(
            "Following loops were marked as excluded from thread-block dynamic "
            "scheduling but were not found in the SDFG: %s", not_excluded)


@dace.program
def prog(node_features, factor, rows, columns, output):
    """
    node_features: input features, N x F
    factor: num_entries
    rows: num_entries
    columns: num_entries
    output: N x M
    """
    output[:] = 0
    for i in dace.map[0:num_entries]:
        col = columns[i]
        row = rows[i]
        output[col] += factor[i] * node_features[row]


def main():
    node_features = np.random.rand(N, M).astype(np.float32)
    adj_matrix = np.array([[1., 0, 1, 0],
                           [1., 1, 1, 0],
                           [0., 1, 1, 1],
                           [0., 0, 1, 0]])
    rows, columns = adj_matrix.nonzero()
    rows = rows.copy()
    columns = columns.copy()
    factor = np.random.rand(num_entries).astype(np.float32)
    output = np.zeros((N, M), dtype=np.float32)
    expected_output = np.zeros((N, M), dtype=np.float32)

    sdfg: dace.SDFG = prog.to_sdfg(node_features=node_features, factor=factor,
                                   rows=rows,
                                   columns=columns,
                                   output=output)

    sdfg.apply_gpu_transformations()
    dynamic_schedule(sdfg, exclude_loops=[
        # The below map also doesn't allow to use the dynamic schedule because
        # only one-dimensional maps are supported in DaCe for dynamic block map
        # schedule (got 2), but that's a different issue.
        '_Mult__map',
    ])
    sdfg(node_features=node_features, factor=factor, rows=rows, columns=columns,
         output=output)

    prog.f(node_features=node_features, factor=factor, rows=rows,
           columns=columns,
           output=expected_output)

    if np.allclose(output, expected_output):
        print("\n==== Results correct.  ☆ ╰(o^◡^o)╯ ☆ ====")
    else:
        print("\n*↯*↯*↯* INCORRECT RESULTS! (ノಥ﹏ಥ)ノ彡┻━┻ *↯*↯*↯*")

    print("Actual output:")
    print(output)
    print("Expected output:")
    print(expected_output)


if __name__ == '__main__':
    main()

Expected behavior
The code compiles, computes correct results and shows a smiley face.

System
On ault intel V100 node, CentoOs 8.4, Python 3.8, issue occurs with both DaCe 0.14.1 and 0.13.

Additional context
The erroneous generated CUDA code is here:


#include <cuda_runtime.h>
#include <dace/dace.h>


struct prog_t {
    dace::cuda::Context *gpu_context;
};



DACE_EXPORTED int __dace_init_cuda(prog_t *__state);
DACE_EXPORTED void __dace_exit_cuda(prog_t *__state);

DACE_DFI void prog_43_4_0_0_2(const long long&  __tmp_44_14_r, const long long&  __tmp_45_14_r, const float&  __tmp_46_23_r, const float * __restrict__ __tmp_46_35_r, float * __restrict__ __tmp_46_8_w) {
    long long col;
    long long row;


    col = __tmp_44_14_r;
    row = __tmp_45_14_r;
    {
        float __tmp4[3]  DACE_ALIGN(64);

        {
            for (auto __i0 = 0; __i0 < 3; __i0 += 1) {
                {
                    float __in1 = __tmp_46_23_r;
                    float __in2 = __tmp_46_35_r[(__i0 + (3 * row))];
                    float __out;

                    ///////////////////
                    // Tasklet code (_Mult_)
                    __out = (__in1 * __in2);
                    ///////////////////

                    __tmp4[__i0] = __out;
                }
            }
        }
        {
            unsigned int __dace_dynmap_begin = 0, __dace_dynmap_end = 0;
            if (i < 9) {
                __dace_dynmap_begin = 0;
                __dace_dynmap_end = 3;
            }
            dace::DynamicMap<true, 128>::schedule(dace_dyn_map_shared, __dace_dynmap_begin, __dace_dynmap_end, i, [&](auto i, auto __i0) {
                {
                    float __inp = __tmp4[__i0];
                    float __out;

                    ///////////////////
                    // Tasklet code (assign_46_8)
                    __out = __inp;
                    ///////////////////

                    dace::wcr_fixed<dace::ReductionType::Sum, float>::reduce_atomic(__tmp_46_8_w + (__i0 + (3 * col)), __out);
                }
            });
        }

    }

}



int __dace_init_cuda(prog_t *__state) {
    int count;

    // Check that we are able to run cuda code
    if (cudaGetDeviceCount(&count) != cudaSuccess)
    {
        printf("ERROR: GPU drivers are not configured or cuda-capable device "
               "not found\n");
        return 1;
    }
    if (count == 0)
    {
        printf("ERROR: No cuda-capable devices found\n");
        return 2;
    }

    // Initialize cuda before we run the application
    float *dev_X;
    cudaMalloc((void **) &dev_X, 1);
    cudaFree(dev_X);

    

    __state->gpu_context = new dace::cuda::Context(5, 1);

    // Create cuda streams and events
    for(int i = 0; i < 5; ++i) {
        cudaStreamCreateWithFlags(&__state->gpu_context->streams[i], cudaStreamNonBlocking);
    }
    for(int i = 0; i < 1; ++i) {
        cudaEventCreateWithFlags(&__state->gpu_context->events[i], cudaEventDisableTiming);
    }

    

    return 0;
}

void __dace_exit_cuda(prog_t *__state) {
    

    // Destroy cuda streams and events
    for(int i = 0; i < 5; ++i) {
        cudaStreamDestroy(__state->gpu_context->streams[i]);
    }
    for(int i = 0; i < 1; ++i) {
        cudaEventDestroy(__state->gpu_context->events[i]);
    }

    delete __state->gpu_context;
}

__global__ void assign_42_4_map_0_1_8(float * __restrict__ gpu_output) {
    {
        {
            int __i1 = (blockIdx.x * 128 + threadIdx.x);
            int __i0 = (blockIdx.y * 1 + threadIdx.y);
            if (__i1 < 3) {
                {
                    {
                        float __out;

                        ///////////////////
                        // Tasklet code (assign_42_4)
                        __out = 0;
                        ///////////////////

                        gpu_output[((3 * __i0) + __i1)] = __out;
                    }
                }
            }
        }
    }
}


DACE_EXPORTED void __dace_runkernel_assign_42_4_map_0_1_8(prog_t *__state, float * __restrict__ gpu_output);
void __dace_runkernel_assign_42_4_map_0_1_8(prog_t *__state, float * __restrict__ gpu_output)
{

    void  *assign_42_4_map_0_1_8_args[] = { (void *)&gpu_output };
    cudaLaunchKernel((void*)assign_42_4_map_0_1_8, dim3(1, 4, 1), dim3(128, 1, 1), assign_42_4_map_0_1_8_args, 0, __state->gpu_context->streams[0]);
}
__global__ void prog_43_0_0_0(const long long * __restrict__ gpu_columns, const float * __restrict__ gpu_factor, const float * __restrict__ gpu_node_features, float * __restrict__ gpu_output, const long long * __restrict__ gpu_rows) {
    __shared__ dace::DynamicMap<true, 128>::shared_type dace_dyn_map_shared;
    {
        int i = (blockIdx.x * 128 + threadIdx.x);
        prog_43_4_0_0_2(gpu_columns[i], gpu_rows[i], gpu_factor[i], &gpu_node_features[0], &gpu_output[0]);
    }
}


DACE_EXPORTED void __dace_runkernel_prog_43_0_0_0(prog_t *__state, const long long * __restrict__ gpu_columns, const float * __restrict__ gpu_factor, const float * __restrict__ gpu_node_features, float * __restrict__ gpu_output, const long long * __restrict__ gpu_rows);
void __dace_runkernel_prog_43_0_0_0(prog_t *__state, const long long * __restrict__ gpu_columns, const float * __restrict__ gpu_factor, const float * __restrict__ gpu_node_features, float * __restrict__ gpu_output, const long long * __restrict__ gpu_rows)
{

    void  *prog_43_0_0_0_args[] = { (void *)&gpu_columns, (void *)&gpu_factor, (void *)&gpu_node_features, (void *)&gpu_output, (void *)&gpu_rows };
    cudaLaunchKernel((void*)prog_43_0_0_0, dim3(1, 1, 1), dim3(128, 1, 1), prog_43_0_0_0_args, 0, __state->gpu_context->streams[0]);
}

@tbennun
Copy link
Collaborator

tbennun commented Oct 29, 2024

Will be tested in #1711

Update: This issue will not be fixed in the above PR. It is a bit more involved in the sense that a work-stealing schedule adapts the block index too (because it load-balances work across multiple GPU_Device indices. This will be resolved in a PR of its own.

@tbennun tbennun added the bug Something isn't working label Oct 29, 2024
@tbennun tbennun added this to the 1.0 milestone Oct 29, 2024
@tbennun tbennun linked a pull request Oct 29, 2024 that will close this issue
10 tasks
@tbennun tbennun changed the title Compilation error on GPU when using ScheduleType.GPU_ThreadBlock_Dynamic Support nested SDFGs in ScheduleType.GPU_ThreadBlock_Dynamic Oct 29, 2024
tbennun added a commit that referenced this issue Oct 29, 2024
The PR adds a new syntax to support inline storage specification with
the `@` operator, supporting the following statements: `a = np.ones(M) @
dace.StorageType.CPU_ThreadLocal`.

This PR also fixes multiple minor issues in the Python frontend:
* `WarpTiling` did not respect sequential map schedules
* Non-sequence inputs for `numpy.fill` variants (e.g.,
`numpy.zeros(N)`)
* NumPy replacement syntax errors would sometimes not have source
information
* Fix type inference for nested scopes in Python frontend
* Dynamic thread block scheduling does not support multi-dimensional
maps
* Default schedule inference should use dynamic thread blocks if
they exist
* Type hints with storage type not being adhered to by the Python
frontend
* Validation issue #1562

The following changes were added as skipped tests and deferred to future PRs:
* Dynamic map range related issues: Fix deferred to #1696
* Dynamic thread block scheduling would not pass object to nested
functions: Fix deferred to future PR, see #1189 for more information
@tbennun tbennun modified the milestones: 1.0, 1.x Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants