cifar10_train.py on AMD SoC GPU (Kalindi) is 4 times slower than its SoC CPU (Kabini) #239

Open
enihcam opened this issue May 2, 2018 · 14 comments

@enihcam

enihcam commented May 2, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Archlinux
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below):
$ python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
b'ComputeCpp-v0.6.0-30-g4cc789977d' 1.6.0-rc0
  • Python version: 3.6.5
  • Bazel version (if compiling from source): 0.12.0
  • GCC/Compiler version (if compiling from source): 7.3.1 20180406
  • CUDA/cuDNN version: N/A
  • GPU model and memory:
  Device Name                                     Kalindi
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.2 AMD-APP (2580.4)
  Driver Version                                  2580.4
  Device OpenCL C Version                         OpenCL C 1.2
  Device Type                                     GPU
  Device Board Name (AMD)                         AMD Radeon Graphics
  Device Topology (AMD)                           PCI-E, 00:01.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               2
  SIMD per compute unit (AMD)                     4
  SIMD width (AMD)                                16
  SIMD instruction width (AMD)                    1
  Max clock frequency                             496MHz
  Graphics IP (AMD)                               7.2
  Device Partition                                (core)
    Max number of sub-devices                     2
    Supported partition types                     (n/a)
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             256
  Preferred work group size (AMD)                 256
  Max work group size (AMD)                       1024
  Preferred work group size multiple              64
  Wavefront width (AMD)                           64
  • Exact command to reproduce:
    python ./models/tutorials/image/cifar10/cifar10_train.py

Describe the problem

$ python ./models/tutorials/image/cifar10/cifar10_train.py
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2018-05-02 13:08:53.003386: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-05-02 13:08:53.003504: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Kalindi, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-05-02 13:03:32.883491: step 2560, loss = 1.38 (19.2 examples/sec; 6.683 sec/batch)
2018-05-02 13:04:39.774720: step 2570, loss = 1.34 (19.1 examples/sec; 6.689 sec/batch)
2018-05-02 13:05:46.625889: step 2580, loss = 1.43 (19.1 examples/sec; 6.685 sec/batch)

With CPU-only TensorFlow, throughput was around 80 examples/sec.
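
As a quick sanity check on these figures, the sketch below recomputes examples/sec from the logged sec/batch values and derives the slowdown relative to the CPU figure. The batch size of 128 is an assumption (the tutorial's default); it does not appear in the log itself.

```python
# Assumption: the tutorial's default batch size of 128 (not shown in the log).
BATCH_SIZE = 128

# (examples/sec, sec/batch) pairs copied from the three step lines in the log.
logged = [(19.2, 6.683), (19.1, 6.689), (19.1, 6.685)]


def examples_per_sec(sec_per_batch, batch_size=BATCH_SIZE):
    """Throughput implied by the time taken for one batch."""
    return batch_size / sec_per_batch


# Recompute throughput from sec/batch and compare with what was logged.
recomputed = [round(examples_per_sec(sec), 1) for _, sec in logged]

# Average logged GPU throughput versus the approximate CPU figure quoted above.
gpu_rate = sum(rate for rate, _ in logged) / len(logged)
cpu_rate = 80.0
slowdown = cpu_rate / gpu_rate

print(recomputed)
print(round(slowdown, 2))
```

Under that assumption the recomputed rates match the logged ones, and the ratio comes out slightly above 4, in line with the "4 times slower" in the title.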

Source code / logs

Build configuration:
https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=tensorflow-computecpp

@lukeiwanski
Owner

Hi @enihcam

Thanks for the report. I will do my best to help.
To clarify, is this the device: https://www.techpowerup.com/gpudb/2197/radeon-hd-8280e ?

Also, do you know whether the device you are using has physical local memory, or whether it uses global memory to simulate it?

@enihcam
Author

enihcam commented May 3, 2018

Thank you @lukeiwanski.
Yes, it is. The full processor name is AMD A4-5000 APU with Radeon(TM) HD Graphics.

Sorry, what do you mean by 'global memory to simulate it'? Since it is an integrated GPU, it uses system RAM (DDR3) shared with the processor.

@DuncanMcBain
Collaborator

Hi @enihcam,
While it is true that the performance on your hardware is low, we think there are a few factors contributing to this. The iGPU in your SoC is barely more powerful than the CPU, so we should expect performance that is (at best) on par with the CPU. However, given the design we have taken at the moment (focussing on discrete GPUs with many CUs and high memory bandwidth), it is likely that the code as-is will not perform well on an AMD APU.

More specifically, it seems likely to me that there will be some redundant copies on APU hardware (since the memory is shared between the CPU and GPU). For these reasons, I don't think you will obtain good performance on this hardware, even if (as is likely) there are still optimisations we could make to our TensorFlow efforts.

@mirh

mirh commented May 3, 2018

Are you using the latest opencl-amd?

> More specifically, it seems likely to me that there will be some redundant copies on APU hardware

Putting aside any low-end considerations for now (his GPU should crunch just short of 150 GFLOPS, by the way), shouldn't you look into zero-copy if that is what happens?
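
For reference, a rough theoretical peak can be estimated from the device information posted at the top of the issue. This is only a back-of-the-envelope sketch: it assumes each SIMD lane retires one fused multiply-add (counted as 2 FLOPs) per clock, which is not stated anywhere in the thread.

```python
# Values taken from the OpenCL device information in the issue description.
compute_units = 2      # "Max compute units"
simds_per_cu = 4       # "SIMD per compute unit (AMD)"
simd_width = 16        # "SIMD width (AMD)"
clock_ghz = 0.496      # "Max clock frequency" of 496MHz

# Assumption: one fused multiply-add per lane per clock, counted as 2 FLOPs.
FLOPS_PER_LANE_PER_CLOCK = 2

lanes = compute_units * simds_per_cu * simd_width
peak_gflops = lanes * FLOPS_PER_LANE_PER_CLOCK * clock_ghz

print(lanes)
print(round(peak_gflops, 1))
```

Under these assumptions the estimate lands at roughly 127 single-precision GFLOPS, the same order of magnitude as the figure mentioned above. Real workloads will sit well below any theoretical peak.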

@DuncanMcBain
Collaborator

That's certainly a possibility, but I don't imagine that this is an interesting optimisation target for us right now. That said I might be wrong - CodeXL might be able to provide some traces showing whether excessive time is being spent copying the buffers around.
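
Once a trace is available, the question of whether copies dominate comes down to a simple ratio. The sketch below assumes the profiler output can be reduced to (category, duration in ms) pairs; the category names and the sample durations are hypothetical placeholders, not CodeXL's actual export format and not measurements from this hardware.

```python
from collections import defaultdict


def time_by_category(events):
    """Sum durations (ms) per category from (category, duration_ms) pairs."""
    totals = defaultdict(float)
    for category, duration_ms in events:
        totals[category] += duration_ms
    return dict(totals)


def copy_fraction(events, copy_categories=("host_to_device", "device_to_host")):
    """Fraction of total traced time spent in buffer copies."""
    totals = time_by_category(events)
    total = sum(totals.values())
    if total == 0:
        return 0.0
    copied = sum(totals.get(name, 0.0) for name in copy_categories)
    return copied / total


# Placeholder data for illustration only; category names are hypothetical.
sample = [
    ("kernel", 30.0),
    ("host_to_device", 50.0),
    ("kernel", 10.0),
    ("device_to_host", 10.0),
]

print(copy_fraction(sample))
```

A high fraction on real trace data would suggest that redundant copies are worth investigating; a low one would point towards the kernels themselves.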

@enihcam
Author

enihcam commented May 3, 2018

Thank you @mirh @DuncanMcBain

Yes, I'm using the latest opencl-amd (version 18.10.572953). How do I enable zero-copy?

@DuncanMcBain
Collaborator

It would be more instructive to first confirm that this is the issue, rather than delving into the guts when this optimisation might already be in effect.

As I say, however, this hardware isn't currently an interesting target for us.

@enihcam
Author

enihcam commented May 4, 2018

Also, I would like to know how tensorflow-computecpp performs on Intel GPUs. Is it also slower than the CPU there?

@DuncanMcBain
Collaborator

I don't believe it is, though I don't have any numbers to hand at the moment (I don't have that hardware, and we don't test it internally, but I think we've done some ad-hoc tests).

@lukeiwanski
Owner

@enihcam / @DuncanMcBain since the Neo driver was released, we have had a Skylake-series SoC available for tests and benchmarks, so there is nothing ad hoc about this ;)
@enihcam is there any particular benchmark or model you are interested in?

@enihcam
Author

enihcam commented May 5, 2018

@lukeiwanski yes, I'm going to install it on Kaby Lake (i5-7200U) :D

For the AMD SoC, I'm wondering whether any flags need to be turned on (or off) in the kernel config. I ask because all my Linux boxes use customized kernels. Attached: config.txt

@mirh

mirh commented May 6, 2018

As long as AMDGPU and AMDKFD are there, I don't think there's any other particular requirement for it to perform "properly".
The thing is, ComputeCpp might just be optimized for the "big dedicated GPU" scenario rather than the "tiny shared" one.

I'm not sure how much of ROCm or HSA Kabini supports; in any case, many features should already be exposed via OpenCL.
And if you care about it, as they told you, you should get aboard the profiling train.

EDIT: also, fun fact: fglrx used to support 2.0 there once upon a time.

@enihcam
Author

enihcam commented May 6, 2018

@mirh Aha! That explains why Kabini GPU is slow. It does NOT support HSA (i.e. AMDKFD)!!

@mirh

mirh commented May 6, 2018

None of that is used here in the first place.
That said, you are right that Jaguar/Puma APUs don't support HSA (as for KFD, which is much more than just that, things may or may not improve in the future depending on how extensively AMD is able to "backport" it). I was just suggesting that there is some room for fine-grained SVM buffer optimizations
(which should also be more or less the same feature level as Intel Gen8 iGPUs).
