
Basic gpu test failed in ubuntu 16.04 [AMD memory caps] #167 (Closed)

gauthampughazhendhi opened this issue Oct 10, 2017 · 31 comments
@gauthampughazhendhi commented Oct 10, 2017:

When I run the following command

bazel test -c opt --config=sycl --test_output=all //tensorflow/python/kernel_tests:basic_gpu_test

the following log is displayed:

WARNING: /home/gautham/tensorflow/tensorflow/core/BUILD:1780:1: in includes attribute of cc_library rule //tensorflow/core:framework_headers_lib: '../../external/nsync/public' resolves to 'external/nsync/public' not below the relative path of its package 'tensorflow/core'. This will be an error in the future. Since this rule was created by the macro 'cc_header_only_library', the error might have been caused by the macro implementation in /home/gautham/tensorflow/tensorflow/tensorflow.bzl:1029:30
INFO: Analysed target //tensorflow/python/kernel_tests:basic_gpu_test (0 packages loaded).
INFO: Found 1 test target...
FAIL: //tensorflow/python/kernel_tests:basic_gpu_test (see /home/gautham/.cache/bazel/_bazel_gautham/f3cb0296cc06462eaad2b67eee0ab414/execroot/org_tensorflow/bazel-out/local_linux-py3-opt/testlogs/tensorflow/python/kernel_tests/basic_gpu_test/test.log)
INFO: From Testing //tensorflow/python/kernel_tests:basic_gpu_test:
==================== Test output for //tensorflow/python/kernel_tests:basic_gpu_test:
2017-10-10 13:54:12.295721: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
terminate called after throwing an instance of 'cl::sycl::cl_exception'
  what():  Error: [ComputeCpp:RT0407] Failed to create OpenCL command queue
Aborted (core dumped)
================================================================================
Target //tensorflow/python/kernel_tests:basic_gpu_test up-to-date:
  bazel-bin/tensorflow/python/kernel_tests/basic_gpu_test
INFO: Elapsed time: 15.027s, Critical Path: 3.43s
INFO: Build completed, 1 test FAILED, 2 total actions
//tensorflow/python/kernel_tests:basic_gpu_test                          FAILED in 2.5s
  /home/gautham/.cache/bazel/_bazel_gautham/f3cb0296cc06462eaad2b67eee0ab414/execroot/org_tensorflow/bazel-out/local_linux-py3-opt/testlogs/tensorflow/python/kernel_tests/basic_gpu_test/test.log

Executed 1 out of 1 test: 1 fails locally.

And my computecpp_info output is as follows:


ComputeCpp Info (CE 0.3.2)


Toolchain information:

GLIBC version: 2.23
GLIBCXX: 20160609
This version of libstdc++ is supported.


Device Info:

Discovered 1 devices matching:
platform :
device type :


Device 0:

Device is supported : UNTESTED - Vendor not tested on this OS
CL_DEVICE_NAME : Capeverde
CL_DEVICE_VENDOR : Advanced Micro Devices, Inc.
CL_DRIVER_VERSION : 2442.7
CL_DEVICE_TYPE : CL_DEVICE_TYPE_GPU

If you encounter problems when using any of these OpenCL devices, please consult
this website for known issues:
https://computecpp.codeplay.com/releases/v0.3.2/platform-support-notes


I don't know where it is going wrong.

lukeiwanski self-assigned this Oct 10, 2017
@lukeiwanski (Owner):

Thanks for reporting. We will have a look and get back to you ASAP.

@gauthampughazhendhi (Author):

The clinfo log is as follows:

Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (2442.7)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon HD 8800M Series
Device Topology: PCI[ B#3, D#0, F#0 ]
Max compute units: 5
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 400Mhz
Address bits: 32
Max memory allocation: 1409077248
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 2140639232
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 0
Max pipe active reservations: 0
Max pipe packet size: 0
Max global variable size: 0
Max global variable preferred total size: 0
Max read/write image args: 0
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: No
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 0x7f24454a7478
Name: Capeverde
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 2442.7
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (2442.7)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event

@DuncanMcBain (Collaborator):

Hi @GauthamPughaz, I can see where in the code this error is being thrown from, but I don't really understand why. If you don't mind, it might be useful to try the simpler tests from the SDK we maintain: https://github.com/codeplaysoftware/computecpp-sdk
Hopefully, these tests will be simple enough that we can debug exactly what's going on here! They're very easy to set up; details are in the repository, but running the following from the root directory of the SDK should work, provided you change the install location to wherever you've saved the ComputeCpp files (the same location you specify when configuring TensorFlow to use OpenCL):

mkdir build && cd build
cmake ../samples -DCOMPUTECPP_PACKAGE_ROOT_DIR=/path/to/computecpp/install
make -j4
ctest

It'd be really helpful to us if you did have a look at this, though I appreciate it's a pain when things don't "just work".
Many thanks,
Duncan.

@gauthampughazhendhi (Author):

Yes, I have done what you said. The log for that is:

gautham@gautham-dell:~/computecpp-sdk/build$ ctest
Test project /home/gautham/computecpp-sdk/build
Start 1: scan
1/21 Test #1: scan .............................***Exception: Other 0.43 sec
Start 2: images
2/21 Test #2: images ...........................***Failed 0.04 sec
Start 3: opencl_c_interop
3/21 Test #3: opencl_c_interop .................***Exception: Other 0.12 sec
Start 4: example_vptr
4/21 Test #4: example_vptr .....................***Exception: Other 0.12 sec
Start 5: simple_local_barrier
5/21 Test #5: simple_local_barrier .............***Exception: Other 0.12 sec
Start 6: simple_example_of_vectors
6/21 Test #6: simple_example_of_vectors ........***Exception: Other 0.12 sec
Start 7: example_sycl_application
7/21 Test #7: example_sycl_application .........***Exception: Other 0.12 sec
Start 8: using_functors
8/21 Test #8: using_functors ...................***Failed 0.01 sec
Start 9: parallel_for
9/21 Test #9: parallel_for .....................***Failed 0.01 sec
Start 10: hello_world
10/21 Test #10: hello_world ......................***Exception: Other 0.12 sec
Start 11: accessors
11/21 Test #11: accessors ........................***Failed 0.01 sec
Start 12: simple_private_memory
12/21 Test #12: simple_private_memory ............***Exception: Other 0.12 sec
Start 13: matrix_multiply_omp
13/21 Test #13: matrix_multiply_omp .............. Passed 0.00 sec
Start 14: matrix_multiply_sycl
14/21 Test #14: matrix_multiply_sycl .............***Exception: Other 0.12 sec
Start 15: custom_device_selector
15/21 Test #15: custom_device_selector ...........***Exception: Other 0.12 sec
Start 16: template_functor
16/21 Test #16: template_functor .................***Exception: Other 0.12 sec
Start 17: smart_pointer
17/21 Test #17: smart_pointer ....................***Exception: Other 0.12 sec
Start 18: gaussian_blur
18/21 Test #18: gaussian_blur ....................***Exception: Other 0.14 sec
Start 19: reduction
19/21 Test #19: reduction ........................***Exception: Other 0.12 sec
Start 20: simple_vector_add
20/21 Test #20: simple_vector_add ................***Exception: Other 0.12 sec
Start 21: async_handler
21/21 Test #21: async_handler ....................***Exception: Other 0.12 sec

5% tests passed, 20 tests failed out of 21

Total Test time (real) = 2.34 sec

The following tests FAILED:
1 - scan (OTHER_FAULT)
2 - images (Failed)
3 - opencl_c_interop (OTHER_FAULT)
4 - example_vptr (OTHER_FAULT)
5 - simple_local_barrier (OTHER_FAULT)
6 - simple_example_of_vectors (OTHER_FAULT)
7 - example_sycl_application (OTHER_FAULT)
8 - using_functors (Failed)
9 - parallel_for (Failed)
10 - hello_world (OTHER_FAULT)
11 - accessors (Failed)
12 - simple_private_memory (OTHER_FAULT)
14 - matrix_multiply_sycl (OTHER_FAULT)
15 - custom_device_selector (OTHER_FAULT)
16 - template_functor (OTHER_FAULT)
17 - smart_pointer (OTHER_FAULT)
18 - gaussian_blur (OTHER_FAULT)
19 - reduction (OTHER_FAULT)
20 - simple_vector_add (OTHER_FAULT)
21 - async_handler (OTHER_FAULT)
Errors while running CTest

@DuncanMcBain (Collaborator) commented Oct 11, 2017:

Hi @GauthamPughaz, thanks for that. It looks like nothing that uses SYCL is working, which is quite confusing! I'd really appreciate it if you could try the following code to see what the underlying error is:

#include <CL/sycl.hpp>
#include <iostream>

int main() {
  try {
    cl::sycl::queue q;
  } catch(cl::sycl::cl_exception& e) {
    std::cout << e.what() << "\n";
    std::cout << "Error: " << e.get_cl_error() << ": " << e.get_cl_error_message();
  }
  return 0;
}

This should print out exactly what error code is being returned by the clCreateCommandQueue() function. It is very rare for this to fail, so I'd be very interested to know what's going on!

Many thanks,
Duncan.

EDIT: I've added the missing SYCL header, because I foolishly forgot it!

@gauthampughazhendhi (Author):

Sorry for this question, but how and where do I run this, since it requires the sycl.hpp header file?

@DuncanMcBain (Collaborator):

Whoops, you're right! I think the easiest thing would be to edit one of the samples in the SDK so that the body of the code is replaced with this test; the hello_world sample is very simple. I've included a patch that you could apply to the SDK. Simply copy the contents to a file in the root of the SDK directory, then:

git apply testing-queues.patch
cd build
make hello_world
hello_world/hello_world

I've actually compiled and run it this time, though it passes without error for me! Sorry for getting that wrong earlier.

diff --git a/samples/hello_world/hello_world.cpp b/samples/hello_world/hello_world.cpp
index 2f65f49..20e8fa8 100644
--- a/samples/hello_world/hello_world.cpp
+++ b/samples/hello_world/hello_world.cpp
@@ -34,42 +34,11 @@
  * (as determined by the SYCL implementation) whose only function is to
  * output the canonical "hello world" string. */
 int main() {
-  /* Selectors determine which device kernels will be dispatched to.
-   * Try using a host_selector, too! */
-  cl::sycl::default_selector selector;
-
-  /* Queues are used to enqueue work.
-   * In this case we construct the queue using the selector. Users can create
-   * their own selectors to choose whatever kind of device they might need. */
-  cl::sycl::queue myQueue(selector);
-  std::cout << "Running on "
-            << myQueue.get_device().get_info<cl::sycl::info::device::name>()
-            << "\n";
-
-  /* C++ 11 lambda functions can be used to submit work to the queue.
-   * They set up data transfers, kernel compilation and the actual
-   * kernel execution. This submission has no data, only a "stream" object.
-   * Useful in debugging, it is a lot like an std::ostream. The handler
-   * object is used to control the scope certain operations can be done. */
-  myQueue.submit([&](cl::sycl::handler& cgh) {
-    /* The stream object allows output to be generated from the kernel. It
-     * takes three parameters in its constructor. The first is the maximum
-     * output size in bytes, the second is how large [in bytes] the total
-     * output of any single << chain can be, and the third is the cgh,
-     * ensuring it can only be constructed inside a submit() call. */
-    cl::sycl::stream os(1024, 80, cgh);
-
-    /* single_task is the simplest way of executing a kernel on a
-     * SYCL device. A single thread executes the code inside the kernel
-     * lambda. The template parameter needs to be a unique name that
-     * the runtime can use to identify the kernel (since lambdas have
-     * no accessible name). */
-    cgh.single_task<class hello_world>([=]() {
-      /* We use the stream operator on the stream object we created above
-       * to print to stdout from the device. */
-      os << "Hello, World!\n";
-    });
-  });
-
+  try {
+    cl::sycl::queue q;
+  } catch(cl::sycl::cl_exception& e) {
+    std::cout << e.what() << "\n";
+    std::cout << "Error: " << e.get_cl_code() << ": " << e.get_cl_error_message();
+  }
   return 0;
 }

@gauthampughazhendhi (Author):

Got it, the log is shown below.

terminate called after throwing an instance of 'cl::sycl::cl_exception'
what(): Error: [ComputeCpp:RT0408] Error querying the number of OpenCL platforms in the system (Cannot query number of platforms on second attempt )
Aborted (core dumped)

@gauthampughazhendhi (Author):

Is this sufficient?

@DuncanMcBain (Collaborator):

Hi, sorry for not responding. This is a really bizarre error that doesn't make a lot of sense to me. There is barely any ComputeCpp code running at the point where the program crashes. Are you running in some kind of virtual environment, like a Docker container, or something else?

@gauthampughazhendhi (Author) commented Oct 12, 2017:

No, nothing like that. Is there any other way to debug this?

@gauthampughazhendhi (Author):

My /usr/local/computecpp/ directory contains the following folders: bin, doc, include, lib. Is this fine?

@gauthampughazhendhi (Author) commented Oct 13, 2017:

I repeated the process from step one and checked it with a test; it failed again. Here is the log for that basic GPU test failure:

WARNING: /home/gautham/tensorflow/tensorflow/core/BUILD:1782:1: in includes attribute of cc_library rule //tensorflow/core:framework_headers_lib: '../../external/nsync/public' resolves to 'external/nsync/public' not below the relative path of its package 'tensorflow/core'. This will be an error in the future. Since this rule was created by the macro 'cc_header_only_library', the error might have been caused by the macro implementation in /home/gautham/tensorflow/tensorflow/tensorflow.bzl:1048:30
INFO: Analysed target //tensorflow/python/kernel_tests:basic_gpu_test (2 packages loaded).
INFO: Found 1 test target...
Building: no action running
FAIL: //tensorflow/python/kernel_tests:basic_gpu_test (see /home/gautham/.cache/bazel/_bazel_gautham/f3cb0296cc06462eaad2b67eee0ab414/execroot/org_tensorflow/bazel-out/local_linux-opt/testlogs/tensorflow/python/kernel_tests/basic_gpu_test/test.log)
INFO: From Testing //tensorflow/python/kernel_tests:basic_gpu_test:
==================== Test output for //tensorflow/python/kernel_tests:basic_gpu_test:
2017-10-13 05:57:32.089958: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
terminate called after throwing an instance of 'cl::sycl::cl_exception'
  what():  Error: [ComputeCpp:RT0407] Failed to create OpenCL command queue
Aborted (core dumped)
================================================================================
Target //tensorflow/python/kernel_tests:basic_gpu_test up-to-date:
  bazel-bin/tensorflow/python/kernel_tests/basic_gpu_test
INFO: Elapsed time: 86.771s, Critical Path: 5.92s
INFO: Build completed, 1 test FAILED, 5 total actions
//tensorflow/python/kernel_tests:basic_gpu_test                          FAILED in 3.5s
  /home/gautham/.cache/bazel/_bazel_gautham/f3cb0296cc06462eaad2b67eee0ab414/execroot/org_tensorflow/bazel-out/local_linux-opt/testlogs/tensorflow/python/kernel_tests/basic_gpu_test/test.log

Executed 1 out of 1 test: 1 fails locally.

@DuncanMcBain (Collaborator) commented Oct 13, 2017:

I'm sorry I'm being slow on this; it is just rather tricky to debug. What this says is that the queue-creation part of ComputeCpp is failing, but queue creation is a very simple, basic operation. Could you try this modified patch on the SDK to see if there's a difference in output? After this, the only thing I can think to try is plain OpenCL code.

The difference here is that I've asked it to use a default selector explicitly. The behaviour should ideally be the same, but it looks like it might be slightly different; we can see what happens. Thanks for sticking with this!

diff --git a/samples/hello_world/hello_world.cpp b/samples/hello_world/hello_world.cpp
index 2f65f49..20e8fa8 100644
--- a/samples/hello_world/hello_world.cpp
+++ b/samples/hello_world/hello_world.cpp
@@ -34,42 +34,12 @@
  * (as determined by the SYCL implementation) whose only function is to
  * output the canonical "hello world" string. */
 int main() {
-  /* Selectors determine which device kernels will be dispatched to.
-   * Try using a host_selector, too! */
-  cl::sycl::default_selector selector;
-
-  /* Queues are used to enqueue work.
-   * In this case we construct the queue using the selector. Users can create
-   * their own selectors to choose whatever kind of device they might need. */
-  cl::sycl::queue myQueue(selector);
-  std::cout << "Running on "
-            << myQueue.get_device().get_info<cl::sycl::info::device::name>()
-            << "\n";
-
-  /* C++ 11 lambda functions can be used to submit work to the queue.
-   * They set up data transfers, kernel compilation and the actual
-   * kernel execution. This submission has no data, only a "stream" object.
-   * Useful in debugging, it is a lot like an std::ostream. The handler
-   * object is used to control the scope certain operations can be done. */
-  myQueue.submit([&](cl::sycl::handler& cgh) {
-    /* The stream object allows output to be generated from the kernel. It
-     * takes three parameters in its constructor. The first is the maximum
-     * output size in bytes, the second is how large [in bytes] the total
-     * output of any single << chain can be, and the third is the cgh,
-     * ensuring it can only be constructed inside a submit() call. */
-    cl::sycl::stream os(1024, 80, cgh);
-
-    /* single_task is the simplest way of executing a kernel on a
-     * SYCL device. A single thread executes the code inside the kernel
-     * lambda. The template parameter needs to be a unique name that
-     * the runtime can use to identify the kernel (since lambdas have
-     * no accessible name). */
-    cgh.single_task<class hello_world>([=]() {
-      /* We use the stream operator on the stream object we created above
-       * to print to stdout from the device. */
-      os << "Hello, World!\n";
-    });
-  });
-
+  try {
+    cl::sycl::default_selector ds;
+    cl::sycl::queue q(ds);
+  } catch(cl::sycl::cl_exception& e) {
+    std::cout << e.what() << "\n";
+    std::cout << "Error: " << e.get_cl_code() << ": " << e.get_cl_error_message();
+  }
   return 0;
 }

@gauthampughazhendhi (Author):

terminate called after throwing an instance of 'cl::sycl::cl_exception'
what(): Error: [ComputeCpp:RT0408] Error querying the number of OpenCL platforms in the system (Cannot query number of platforms on second attempt )
Aborted (core dumped)

@gauthampughazhendhi (Author) commented Oct 13, 2017:

While rebuilding on my second attempt, only 47 out of 300 tests passed.

@DuncanMcBain (Collaborator):

OK. I don't understand what's happening here. This is the function that's failing: https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clGetPlatformIDs.html

I suppose I should explain a little. ComputeCpp discovers all the different OpenCL platforms available on the system through this function. If it doesn't work, there's nothing we can do. Are you sure you have installed a driver that properly supports your hardware? I had a look at the available driver downloads for your hardware, and the only one available is the (now rather old) Crimson driver. AMDGPU-PRO supports some mobile chips (like the R9 M300 series, for example) but not the R9 M200 cards. ComputeCpp requires OpenCL 1.2 and SPIR 1.2 to run; if your system can't support that, we can't run.

(This would explain why so many tests fail - those that attempt to instantiate any kind of device or queue will fail rapidly.)
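For reference, that failing call can be checked in isolation with a few lines of plain OpenCL. This is a minimal sketch, assuming the standard OpenCL 1.2 headers and the system ICD loader (compile with something like gcc check_platforms.c -lOpenCL; the file name is arbitrary):

/* Minimal standalone check of the same clGetPlatformIDs call that
 * ComputeCpp uses during platform discovery. */
#include <CL/cl.h>
#include <stdio.h>

int main(void) {
  cl_uint num_platforms = 0;
  /* Query only the number of platforms, as the runtime does first. */
  cl_int err = clGetPlatformIDs(0, NULL, &num_platforms);
  if (err != CL_SUCCESS) {
    fprintf(stderr, "clGetPlatformIDs failed with error %d\n", err);
    return 1;
  }
  printf("Found %u OpenCL platform(s)\n", num_platforms);
  return 0;
}

If this fails too, the problem sits in the driver/ICD stack rather than in ComputeCpp.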

@gauthampughazhendhi (Author):

Ok, thank you so much. One final query: is the command export COMPUTE=:0 required for my setup? It was listed for Ubuntu 14.04 but not for 16.04.
Another thing: I just checked and found two GPU models on my system.
gautham@gautham-dell:~/computecpp-sdk/build$ lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation Haswell-ULT Integrated Graphics Controller (rev 09)
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Venus PRO [Radeon HD 8850M / R9 M265X] (rev ff)
Is that an issue?

@DuncanMcBain (Collaborator):

It's one GPU. The M265X is, to some greater or lesser extent, a rebranded 8850M, from what I can tell. export COMPUTE=:0 means "use the first device on the system". If you don't have any other devices, it shouldn't change anything. (@lukeiwanski might be able to add more on that point.)

@lukeiwanski (Owner):

@GauthamPughaz, no, you don't need export COMPUTE=:0; that was only relevant on 14.04.

@gauthampughazhendhi (Author) commented Oct 13, 2017:

Ok, thank you.

@davide-maestroni commented Oct 25, 2017:

Any update on this? I'm facing exactly the same issue with ComputeCpp 0.3.3 (tried both dev/eigen_mehdi and dev/amd_gpu branches).

The output from clinfo is:

Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (2442.7)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon R9 (TM) M370X
Device Topology: PCI[ B#1, D#0, F#0 ]
Max compute units: 5
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 800Mhz
Address bits: 32
Max memory allocation: 844744704
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 1366024192
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 0
Max pipe active reservations: 0
Max pipe packet size: 0
Max global variable size: 0
Max global variable preferred total size: 0
Max read/write image args: 0
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: No
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 0x7fe001fdd478
Name: Capeverde
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 2442.7
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (2442.7)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event

and the one from computecpp_info:


ComputeCpp Info (CE 0.3.3)


Toolchain information:

GLIBC version: 2.23
GLIBCXX: 20160609
This version of libstdc++ is supported.


Device Info:

Discovered 1 devices matching:
platform :
device type :


Device 0:

Device is supported : UNTESTED - Vendor not tested on this OS
CL_DEVICE_NAME : Capeverde
CL_DEVICE_VENDOR : Advanced Micro Devices, Inc.
CL_DRIVER_VERSION : 2442.7
CL_DEVICE_TYPE : CL_DEVICE_TYPE_GPU

If you encounter problems when using any of these OpenCL devices, please consult
this website for known issues:
https://computecpp.codeplay.com/releases/v0.3.3/platform-support-notes


I've also run a simple test compiled from https://laanwj.github.io/assets/2016/05/06/opencl-ubuntu1604/devices.c and the output is:

  1. Platform
    Profile: FULL_PROFILE
    Version: OpenCL 2.0 AMD-APP (2442.7)
    Name: AMD Accelerated Parallel Processing
    Vendor: Advanced Micro Devices, Inc.
    Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
  2. Device: Capeverde
    1.1 Hardware version: OpenCL 1.2 AMD-APP (2442.7)
    1.2 Software version: 2442.7
    1.3 OpenCL C version: OpenCL C 1.2
    1.4 Parallel compute units: 5

I would be more than happy to help find a solution.

@davide-maestroni commented Oct 25, 2017:

I've also tried with the hello_world example above, and the output is just:

Error: [ComputeCpp:RT0407] Failed to create OpenCL command queue
Segmentation fault (core dumped)

After some tests, I found out that the clCreateCommandQueue call fails with error code CL_OUT_OF_HOST_MEMORY.
Any clue?

@DuncanMcBain (Collaborator):

Hi @davide-maestroni,
Thanks for adding to this bug report. Now that we have confirmation of two users being affected by this issue, it's possible it's an AMD problem, particularly since the driver download page indicates that your particular hardware should be supported by this driver.
I'm afraid that beyond that, I don't understand what's going on! That particular error is very unusual. It is somehow indicating that the OpenCL implementation (i.e. AMD's implementation) does not have enough memory to create even a basic queue. Given that this happens on the smallest tests we have, I'm stumped! The code you linked does not create a queue, which is presumably why it does not fail.

As the documentation on this page suggests, a general failure to allocate resources on the host is what causes CL_OUT_OF_HOST_MEMORY. Quick Googling returns some information, if not fixes:
https://developer.blender.org/T50761
fireice-uk/xmr-stak-amd#69
fireice-uk/xmr-stak-amd#53
https://stackoverflow.com/questions/39864947/opencl-cl-out-of-host-memory-on-clcreatecommandqueuewithproperties-with-minima

@davide-maestroni:

Thanks @DuncanMcBain, I actually had the same suspicion, namely that the issue is related to the AMD drivers. I was just hoping you had a better understanding of the problem. I have also filed an issue with the Codeplay guys. Let's see what they have to say.

@DuncanMcBain (Collaborator):

I am a Codeplay guy! I'm sorry, I should have answered there as well, but thought it better to leave it on a public forum. In short: I'm the one who answers those issues too, and I still have no idea 😄

While AMD are pushing their ROCm and HIP technologies, I think it would be beneficial if they still supported OpenCL on a variety of hardware. Unfortunately, the only Mxxx AMD hardware we have in the company is an older chip, too old to use the new drivers (I tried; it doesn't boot with them). Would you be able to pass this on to AMD? I might be able to attach a plain OpenCL program that creates a queue in the same way, if they want an OpenCL repro case (I'd be surprised if this were an artifact of the way ComputeCpp does things). You could try this: https://github.com/HandsOnOpenCL/Exercises-Solutions. It should work "out of the box" and also fail in the same way!
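A plain OpenCL repro along those lines might look like the following sketch (it mirrors, rather than reproduces exactly, the queue-creation path ComputeCpp takes; assumes the OpenCL 1.2 C API, compile with gcc queue_repro.c -lOpenCL, file name arbitrary):

/* Standalone queue-creation repro: takes the first GPU device on the
 * first platform and attempts to create a command queue on it.
 * CL_OUT_OF_HOST_MEMORY is -6 in the OpenCL headers. */
#include <CL/cl.h>
#include <stdio.h>

int main(void) {
  cl_platform_id platform;
  cl_device_id device;
  cl_int err;

  err = clGetPlatformIDs(1, &platform, NULL);
  if (err != CL_SUCCESS) {
    fprintf(stderr, "clGetPlatformIDs failed: %d\n", err);
    return 1;
  }

  err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
  if (err != CL_SUCCESS) {
    fprintf(stderr, "clGetDeviceIDs failed: %d\n", err);
    return 1;
  }

  cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
  if (err != CL_SUCCESS) {
    fprintf(stderr, "clCreateContext failed: %d\n", err);
    return 1;
  }

  /* This is the call that fails in this thread. */
  cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);
  if (err != CL_SUCCESS) {
    fprintf(stderr, "clCreateCommandQueue failed: %d\n", err);
    clReleaseContext(ctx);
    return 1;
  }

  printf("Command queue created successfully\n");
  clReleaseCommandQueue(queue);
  clReleaseContext(ctx);
  return 0;
}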

@mirh commented Dec 20, 2017:

@davide-maestroni (and @GauthamPughaz), if you read the second of those links, you'll see there's an alleged solution.
TL;DR: AFAIK, AMD (for some reason) sets very conservative default memory caps.

I suppose ComputeCpp should have no problem exporting the GPU_FORCE_64BIT_PTR or GPU_USE_SYNC_OBJECTS variables.

EDIT: https://bugs.freedesktop.org/show_bug.cgi?id=102491#c5

lukeiwanski changed the title from "Basic gpu test failed in ubuntu 16.04" to "Basic gpu test failed in ubuntu 16.04 [AMD memory caps]" on Jan 2, 2018
@lukeiwanski (Owner):

@GauthamPughaz, is that still an issue, or can it be closed?

@gauthampughazhendhi (Author) commented Mar 9, 2018 via email

@mirh commented Mar 9, 2018:

… Did you read/try what I posted?

@RafalKonklewski commented Jun 25, 2018:

@mirh The solution from your post really works, thanks a lot!

For people who didn't read it: Add this to your ~/.bashrc file:
export GPU_FORCE_64BIT_PTR=1
export GPU_USE_SYNC_OBJECTS=1
export GPU_MAX_ALLOC_PERCENT=100
export GPU_SINGLE_ALLOC_PERCENT=100
export GPU_MAX_HEAP_SIZE=100
