Describe the bug
Encountered when implementing a Graph Attentional Operator. When using ScheduleType.GPU_ThreadBlock_Dynamic, incorrect code gets generated. The error is:
(dace14) [jbazinsk@ault24 bugs]$ python compile_error.py
Keeping schedule sequential for _Mult__map[__i0=0:3]
Changing schedule to TB dynamic: assign_47_8_map[__i0=0:3]
Traceback (most recent call last):
File "/users/jbazinsk/miniconda3/envs/dace14/lib/python3.8/site-packages/dace/codegen/compiler.py", line 224, in configure_and_compile
_run_liveoutput("cmake --build . --config %s" % (Config.get('compiler', 'build_type')),
File "/users/jbazinsk/miniconda3/envs/dace14/lib/python3.8/site-packages/dace/codegen/compiler.py", line 407, in _run_liveoutput
raise subprocess.CalledProcessError(process.returncode, command, output.getvalue())
subprocess.CalledProcessError: Command 'cmake --build . --config RelWithDebInfo' returned non-zero exit status 2.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "compile_error.py", line 92, in <module>
main()
File "compile_error.py", line 73, in main
sdfg(node_features=node_features, factor=factor, rows=rows, columns=columns,
File "/users/jbazinsk/miniconda3/envs/dace14/lib/python3.8/site-packages/dace/sdfg/sdfg.py", line 2289, in __call__
binaryobj = sdfg.compile()
File "/users/jbazinsk/miniconda3/envs/dace14/lib/python3.8/site-packages/dace/sdfg/sdfg.py", line 2198, in compile
shared_library = compiler.configure_and_compile(program_folder, sdfg.name)
File "/users/jbazinsk/miniconda3/envs/dace14/lib/python3.8/site-packages/dace/codegen/compiler.py", line 233, in configure_and_compile
raise cgx.CompilationError('Compiler failure:\n' + ex.output)
dace.codegen.exceptions.CompilationError: Compiler failure:
[ 20%] Building NVCC (Device) object CMakeFiles/cuda_compile_1.dir/__/__/__/__/__/__/__/__/daceml/examples/bugs/.dacecache/prog/src/cuda/cuda_compile_1_generated_prog_cuda.cu.o
/users/jbazinsk/daceml/examples/bugs/.dacecache/prog/src/cuda/prog_cuda.cu(43): error: identifier "i" is undefined
/users/jbazinsk/daceml/examples/bugs/.dacecache/prog/src/cuda/prog_cuda.cu(47): error: identifier "dace_dyn_map_shared" is undefined
/users/jbazinsk/daceml/examples/bugs/.dacecache/prog/src/cuda/prog_cuda.cu(152): warning: variable "dace_dyn_map_shared" was declared but never referenced
2 errors detected in the compilation of "/users/jbazinsk/daceml/examples/bugs/.dacecache/prog/src/cuda/prog_cuda.cu".
CMake Error at cuda_compile_1_generated_prog_cuda.cu.o.cmake:276 (message):
Error generating file
/users/jbazinsk/daceml/examples/bugs/.dacecache/prog/build/CMakeFiles/cuda_compile_1.dir/__/__/__/__/__/__/__/__/daceml/examples/bugs/.dacecache/prog/src/cuda/./cuda_compile_1_generated_prog_cuda.cu.o
gmake[2]: *** [CMakeFiles/prog.dir/build.make:77: CMakeFiles/cuda_compile_1.dir/__/__/__/__/__/__/__/__/daceml/examples/bugs/.dacecache/prog/src/cuda/cuda_compile_1_generated_prog_cuda.cu.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:84: CMakeFiles/prog.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2
To Reproduce
Run the following code:
import dace
import numpy as np

N = 4
num_entries = 9
M = 3

np.random.seed(42)


def dynamic_schedule(sdfg, exclude_loops):
    """Change GPU sequential loops to dynamic."""
    exclude_loops = {name: 0 for name in exclude_loops} or {}
    for node in sdfg.all_nodes_recursive():
        if isinstance(node[0], dace.sdfg.nodes.MapEntry) \
                and node[0].schedule == dace.dtypes.ScheduleType.Sequential \
                and len(node[0].map.params):
            if node[0].label not in exclude_loops:
                print("Changing schedule to TB dynamic: ", node[0].map)
                node[0].schedule = dace.ScheduleType.GPU_ThreadBlock_Dynamic
            else:
                exclude_loops[node[0].label] += 1
                print("Keeping schedule sequential for ", node[0].map)

    not_excluded = [
        name for name, count in exclude_loops.items() if count == 0
    ]
    if not_excluded:
        print(
            "The following loops were marked as excluded from thread-block "
            "dynamic scheduling but were not found in the SDFG:", not_excluded)


@dace.program
def prog(node_features, factor, rows, columns, output):
    """
    node_features: input features, N x F
    factor: num_entries
    rows: num_entries
    columns: num_entries
    output: N x M
    """
    output[:] = 0
    for i in dace.map[0:num_entries]:
        col = columns[i]
        row = rows[i]
        output[col] += factor[i] * node_features[row]


def main():
    node_features = np.random.rand(N, M).astype(np.float32)
    adj_matrix = np.array([[1., 0, 1, 0],
                           [1., 1, 1, 0],
                           [0., 1, 1, 1],
                           [0., 0, 1, 0]])
    rows, columns = adj_matrix.nonzero()
    rows = rows.copy()
    columns = columns.copy()
    factor = np.random.rand(num_entries).astype(np.float32)
    output = np.zeros((N, M), dtype=np.float32)
    expected_output = np.zeros((N, M), dtype=np.float32)

    sdfg: dace.SDFG = prog.to_sdfg(node_features=node_features, factor=factor,
                                   rows=rows,
                                   columns=columns,
                                   output=output)
    sdfg.apply_gpu_transformations()
    dynamic_schedule(sdfg, exclude_loops=[
        # The map below also cannot use the dynamic schedule, because only
        # one-dimensional maps are supported in DaCe for the dynamic block map
        # schedule (this map has 2), but that is a separate issue.
        '_Mult__map',
    ])

    sdfg(node_features=node_features, factor=factor, rows=rows, columns=columns,
         output=output)
    prog.f(node_features=node_features, factor=factor, rows=rows,
           columns=columns,
           output=expected_output)

    if np.allclose(output, expected_output):
        print("\n==== Results correct. ☆ ╰(o^◡^o)╯ ☆ ====")
    else:
        print("\n*↯*↯*↯* INCORRECT RESULTS! (ノಥ﹏ಥ)ノ彡┻━┻ *↯*↯*↯*")
        print("Actual output:")
        print(output)
        print("Expected output:")
        print(expected_output)


if __name__ == '__main__':
    main()
Expected behavior
The code compiles, computes correct results and shows a smiley face.
System
On an Ault Intel V100 node, CentOS 8.4, Python 3.8. The issue occurs with both DaCe 0.14.1 and 0.13.
Additional context
The erroneous generated CUDA code is here:
#include <cuda_runtime.h>
#include <dace/dace.h>
struct prog_t {
dace::cuda::Context *gpu_context;
};
DACE_EXPORTED int __dace_init_cuda(prog_t *__state);
DACE_EXPORTED void __dace_exit_cuda(prog_t *__state);
DACE_DFI void prog_43_4_0_0_2(const long long& __tmp_44_14_r, const long long& __tmp_45_14_r, const float& __tmp_46_23_r, const float * __restrict__ __tmp_46_35_r, float * __restrict__ __tmp_46_8_w) {
long long col;
long long row;
col = __tmp_44_14_r;
row = __tmp_45_14_r;
{
float __tmp4[3] DACE_ALIGN(64);
{
for (auto __i0 = 0; __i0 < 3; __i0 += 1) {
{
float __in1 = __tmp_46_23_r;
float __in2 = __tmp_46_35_r[(__i0 + (3 * row))];
float __out;
///////////////////
// Tasklet code (_Mult_)
__out = (__in1 * __in2);
///////////////////
__tmp4[__i0] = __out;
}
}
}
{
unsigned int __dace_dynmap_begin = 0, __dace_dynmap_end = 0;
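// NOTE (annotation, not part of the generated file): the nvcc error
// 'identifier "i" is undefined' points at the next line; i is only defined in
// the enclosing kernel prog_43_0_0_0 further below and is never passed into
// this nested-SDFG function.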
if (i < 9) {
__dace_dynmap_begin = 0;
__dace_dynmap_end = 3;
}
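// NOTE (annotation): likewise, dace_dyn_map_shared is undefined in this scope;
// it is only declared as __shared__ inside the enclosing kernel prog_43_0_0_0.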
dace::DynamicMap<true, 128>::schedule(dace_dyn_map_shared, __dace_dynmap_begin, __dace_dynmap_end, i, [&](auto i, auto __i0) {
{
float __inp = __tmp4[__i0];
float __out;
///////////////////
// Tasklet code (assign_46_8)
__out = __inp;
///////////////////
dace::wcr_fixed<dace::ReductionType::Sum, float>::reduce_atomic(__tmp_46_8_w + (__i0 + (3 * col)), __out);
}
});
}
}
}
int __dace_init_cuda(prog_t *__state) {
int count;
// Check that we are able to run cuda code
if (cudaGetDeviceCount(&count) != cudaSuccess)
{
printf("ERROR: GPU drivers are not configured or cuda-capable device "
"not found\n");
return 1;
}
if (count == 0)
{
printf("ERROR: No cuda-capable devices found\n");
return 2;
}
// Initialize cuda before we run the application
float *dev_X;
cudaMalloc((void **) &dev_X, 1);
cudaFree(dev_X);
__state->gpu_context = new dace::cuda::Context(5, 1);
// Create cuda streams and events
for(int i = 0; i < 5; ++i) {
cudaStreamCreateWithFlags(&__state->gpu_context->streams[i], cudaStreamNonBlocking);
}
for(int i = 0; i < 1; ++i) {
cudaEventCreateWithFlags(&__state->gpu_context->events[i], cudaEventDisableTiming);
}
return 0;
}
void __dace_exit_cuda(prog_t *__state) {
// Destroy cuda streams and events
for(int i = 0; i < 5; ++i) {
cudaStreamDestroy(__state->gpu_context->streams[i]);
}
for(int i = 0; i < 1; ++i) {
cudaEventDestroy(__state->gpu_context->events[i]);
}
delete __state->gpu_context;
}
__global__ void assign_42_4_map_0_1_8(float * __restrict__ gpu_output) {
{
{
int __i1 = (blockIdx.x * 128 + threadIdx.x);
int __i0 = (blockIdx.y * 1 + threadIdx.y);
if (__i1 < 3) {
{
{
float __out;
///////////////////
// Tasklet code (assign_42_4)
__out = 0;
///////////////////
gpu_output[((3 * __i0) + __i1)] = __out;
}
}
}
}
}
}
DACE_EXPORTED void __dace_runkernel_assign_42_4_map_0_1_8(prog_t *__state, float * __restrict__ gpu_output);
void __dace_runkernel_assign_42_4_map_0_1_8(prog_t *__state, float * __restrict__ gpu_output)
{
void *assign_42_4_map_0_1_8_args[] = { (void *)&gpu_output };
cudaLaunchKernel((void*)assign_42_4_map_0_1_8, dim3(1, 4, 1), dim3(128, 1, 1), assign_42_4_map_0_1_8_args, 0, __state->gpu_context->streams[0]);
}
__global__ void prog_43_0_0_0(const long long * __restrict__ gpu_columns, const float * __restrict__ gpu_factor, const float * __restrict__ gpu_node_features, float * __restrict__ gpu_output, const long long * __restrict__ gpu_rows) {
__shared__ dace::DynamicMap<true, 128>::shared_type dace_dyn_map_shared;
{
int i = (blockIdx.x * 128 + threadIdx.x);
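// NOTE (annotation): i and dace_dyn_map_shared exist only in this kernel's
// scope; the call below does not forward them to prog_43_4_0_0_2, hence the
// errors above and the "declared but never referenced" warning for the
// __shared__ variable.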
prog_43_4_0_0_2(gpu_columns[i], gpu_rows[i], gpu_factor[i], &gpu_node_features[0], &gpu_output[0]);
}
}
DACE_EXPORTED void __dace_runkernel_prog_43_0_0_0(prog_t *__state, const long long * __restrict__ gpu_columns, const float * __restrict__ gpu_factor, const float * __restrict__ gpu_node_features, float * __restrict__ gpu_output, const long long * __restrict__ gpu_rows);
void __dace_runkernel_prog_43_0_0_0(prog_t *__state, const long long * __restrict__ gpu_columns, const float * __restrict__ gpu_factor, const float * __restrict__ gpu_node_features, float * __restrict__ gpu_output, const long long * __restrict__ gpu_rows)
{
void *prog_43_0_0_0_args[] = { (void *)&gpu_columns, (void *)&gpu_factor, (void *)&gpu_node_features, (void *)&gpu_output, (void *)&gpu_rows };
cudaLaunchKernel((void*)prog_43_0_0_0, dim3(1, 1, 1), dim3(128, 1, 1), prog_43_0_0_0_args, 0, __state->gpu_context->streams[0]);
}
Update: This issue will not be fixed in the above PR. It is a bit more involved, in the sense that a work-stealing schedule also adapts the block index (because it load-balances work across multiple GPU_Device indices). This will be resolved in a PR of its own.
tbennun changed the title from "Compilation error on GPU when using ScheduleType.GPU_ThreadBlock_Dynamic" to "Support nested SDFGs in ScheduleType.GPU_ThreadBlock_Dynamic" on Oct 29, 2024.
The PR adds a new syntax to support inline storage specification with the `@` operator, supporting statements such as `a = np.ones(M) @ dace.StorageType.CPU_ThreadLocal`.
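As an illustration, a minimal sketch of how this syntax could be used inside a dace.program (the program and symbol names below are hypothetical, not taken from the PR):

import dace
import numpy as np

M = dace.symbol('M')

@dace.program
def add_ones(x: dace.float64[M]):
    # Allocate the transient 'a' in thread-local storage using the inline
    # storage specification ('@') syntax described above.
    a = np.ones(M) @ dace.StorageType.CPU_ThreadLocal
    return x + a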
This PR also fixes multiple minor issues in the Python frontend:
* `WarpTiling` did not respect sequential map schedules
* Non-sequence inputs for `numpy.fill` variants (e.g., `numpy.zeros(N)`) were not supported
* NumPy replacement syntax errors would sometimes not have source information
* Type inference fixes for nested scopes in the Python frontend
* Dynamic thread block scheduling did not support multi-dimensional maps
* Default schedule inference now uses dynamic thread blocks if they exist
* Type hints with storage types were not adhered to by the Python frontend
* Validation issue #1562
The following changes were added as skipped tests and deferred to future PRs:
* Dynamic map range-related issues: fix deferred to #1696
* Dynamic thread block scheduling would not pass the required object to nested functions: fix deferred to a future PR; see #1189 for more information