Using the packages from [community] with an RX 580 results in a segfault trying to do nearly anything using pytorch. #961

rederick29 · 2023-04-17T21:09:22Z

Hello guys. I am having some trouble running ROCm on Arch with a gfx803 card and I'm not sure how to fix it. I've tried all sorts of environment variables, but nothing has worked.

rocminfo and rocm-smi work as expected, and the test C++ code given in another discussion here works fine, but using PyTorch causes ROCm to crash. I'm using the python-pytorch-rocm package from [community], which I thought was supposed to work with gfx803.

Here is the output I get when running python in lldb:

[erickv@archer ~]$ lldb python
(lldb) target create "python"
Current executable set to '/usr/bin/python' (x86_64).
(lldb) run
Process 6570 launched: '/usr/bin/python' (x86_64)
Python 3.10.10 (main, Mar  5 2023, 22:26:53) [GCC 12.2.1 20230201] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.get_arch_list()
['gfx803', 'gfx900', 'gfx906', 'gfx908', 'gfx90a', 'gfx1030']
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
'AMD Radeon RX 580 Series'
>>> print(torch.tensor([1., 2.], device='cuda'))
Process 6570 stopped
* thread #1, name = 'python', stop reason = signal SIGSEGV: invalid address (fault address: 0x20)
    frame #0: 0x00007fffa8cd9cd8 libamdhip64.so.5`___lldb_unnamed_symbol2144 + 88
libamdhip64.so.5`___lldb_unnamed_symbol2144:
->  0x7fffa8cd9cd8 <+88>:  cmpb   $0x0, 0x20(%r12)
    0x7fffa8cd9cde <+94>:  je     0x7fffa8cd9d98            ; <+280>
    0x7fffa8cd9ce4 <+100>: cmpb   $0x0, 0x21(%r12)
    0x7fffa8cd9cea <+106>: movq   0x18(%r12), %rdi
(lldb) exit
Quitting LLDB will kill one or more processes. Do you really want to proceed: [Y/n] y
[erickv@archer ~]$ /opt/rocm/bin/hipcc test.cpp -o test && ./test
Agent AMD Radeon RX 580 Series
System version 8.0
TESTS PASSED!
[erickv@archer ~]$

Answered by rederick29 on May 21, 2023 (full answer below).

mpeschel10 · 2023-05-12T03:52:43Z

Hi. I also have this problem. I can compile and run the hipcc test.cpp, and it reports TESTS PASSED!, but PyTorch segfaults any time it tries to access the GPU. My graphics card is a Radeon RX 6750 XT (Navi 22).

[mpeschel@daimyo ~]$ python
Python 3.11.3 (main, Apr  5 2023, 15:52:25) [GCC 12.2.1 20230201] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.get_arch_list()
['gfx803', 'gfx900', 'gfx906', 'gfx908', 'gfx90a', 'gfx1030']
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
'AMD Radeon RX 6750 XT'
>>> d = torch.device('cuda')
>>> a = torch.rand(1, 2).to(d)
>>> print(a + 0)
Segmentation fault (core dumped)
[mpeschel@daimyo ~]$

As rederick29 reports, the segfault occurs during an operation in libamdhip64.so, part of the hip-runtime-amd package.

[mpeschel@daimyo ~]$ lldb python
(lldb) target create "python"
Current executable set to '/usr/bin/python' (x86_64).
(lldb) run
Process 4987 launched: '/usr/bin/python' (x86_64)
Python 3.11.3 (main, Apr  5 2023, 15:52:25) [GCC 12.2.1 20230201] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.tensor([1., 2.], device='cuda')
Process 4987 stopped
* thread #1, name = 'python', stop reason = signal SIGSEGV: invalid address (fault address: 0x20)
    frame #0: 0x00007fffa8b0baa7 libamdhip64.so.5`___lldb_unnamed_symbol2311 + 87
libamdhip64.so.5`___lldb_unnamed_symbol2311:
->  0x7fffa8b0baa7 <+87>: cmpb   $0x0, 0x20(%r12)
    0x7fffa8b0baad <+93>: je     0x7fffa8b0bad8            ; <+136>
    0x7fffa8b0baaf <+95>: xorl   %eax, %eax
    0x7fffa8b0bab1 <+97>: movq   -0x38(%rbp), %rdx
(lldb)

I have tried doing a clean operating system install and compiling the hip-runtime-amd library locally; neither solved the issue.

2 replies

mpeschel10 May 12, 2023

The problem may be in magma-hip or one of its dependencies (edit: The problem starts with the dependency rocblas which supports gfx1030 but not gfx1031). I tried building and running an example magma program, and it gave me the same error.

mkdir magma_test; cd magma_test
cp -r /usr/share/magma/example ./; cd example
hipcc $(pkg-config --libs --cflags magma) example_v1.cpp

Then

[mpeschel@daimyo example]$ lldb a.out
(lldb) target create "a.out"
Current executable set to '/home/mpeschel/projects/panic/magma_test/example/a.out' (x86_64).
(lldb) run
Process 8851 launched: '/home/mpeschel/projects/panic/magma_test/example/a.out' (x86_64)
using MAGMA CPU interface
Process 8851 stopped
* thread #1, name = 'a.out', stop reason = signal SIGSEGV: invalid address (fault address: 0x20)
    frame #0: 0x00007fffea90baa7 libamdhip64.so.5`___lldb_unnamed_symbol2311 + 87
libamdhip64.so.5`___lldb_unnamed_symbol2311:
->  0x7fffea90baa7 <+87>: cmpb   $0x0, 0x20(%r12)
    0x7fffea90baad <+93>: je     0x7fffea90bad8            ; <+136>
    0x7fffea90baaf <+95>: xorl   %eax, %eax
    0x7fffea90bab1 <+97>: movq   -0x38(%rbp), %rdx
(lldb)

[example_v1.txt](https://github.com/rocm-arch/rocm-arch/files/11466621/example_v1.txt)

Which appears to be the same error. (edit: This is incorrect. This same error shows up for lots of different things. Could be anything, really.)

mpeschel10 May 13, 2023

I have something that works for me. It is probably different from rederick29's problem. Specifically, my GPU has the gfx1031 architecture, which is apparently equivalent to the gfx1030 architecture but for whatever reason is not supported.

So my pytorch script works if I do:

export HSA_OVERRIDE_GFX_VERSION=10.3.0
python my_pytorch_script.py

which I guess forces the toolchain to use the existing gfx1030 files. It's hacky, and I don't like it, but it works.
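The override value seems to mirror the gfx target name: 10.3.0 for gfx1030 here, and 8.0.3 for gfx803 further down. A minimal Python sketch of that mapping follows, assuming the last two characters of the target name are the minor version and the stepping (as hex digits) and everything before them is the decimal major version. The helper name gfx_to_override is made up for illustration and is not part of ROCm or PyTorch.

```python
import re


def gfx_to_override(target: str) -> str:
    """Turn a gfx target name into a 'major.minor.stepping' string.

    Assumption (not confirmed in this thread): the last two characters
    are minor and stepping as hex digits, the rest is the decimal major.
    """
    match = re.fullmatch(r"gfx(\d+)([0-9a-f])([0-9a-f])", target)
    if match is None:
        raise ValueError(f"not a recognised gfx target name: {target!r}")
    major, minor, stepping = match.groups()
    return f"{int(major)}.{int(minor, 16)}.{int(stepping, 16)}"


if __name__ == "__main__":
    # The two targets this thread ends up overriding to.
    for name in ("gfx1030", "gfx803"):
        print(name, "->", gfx_to_override(name))
```

The function only formats a string. Which target a given card can safely impersonate (gfx1031 running gfx1030 code, as above) still has to come from experience or documentation.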

rederick29, if you ever come back, try some variation of export HSA_OVERRIDE_GFX_VERSION=8.0.3. Also confirm that your card has the gfx803 ISA with rocminfo | grep gfx, and that there are gfx803 files in your directories:

ls /usr/lib/libomptarget-amdgpu-* | grep 803 # confirm openmp supports it
ls /opt/rocm/lib/rocblas/library/ | grep 803 # confirm rocblas supports it
ls /opt/rocm/share/miopen/db/ | grep 803 # confirm that miopen-hip supports it
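The same three checks can be scripted so they are easy to rerun for another architecture. This is only a sketch: the paths are copied from the ls commands above and reflect the Arch packages discussed here, and the names CHECKS, find_arch_files and report are hypothetical.

```python
import glob
import os

# Glob patterns taken from the ls commands above (Arch package layout).
CHECKS = {
    "openmp": "/usr/lib/libomptarget-amdgpu-*",
    "rocblas": "/opt/rocm/lib/rocblas/library/*",
    "miopen": "/opt/rocm/share/miopen/db/*",
}


def find_arch_files(pattern: str, arch: str) -> list[str]:
    """Return paths matching `pattern` whose file name contains `arch`."""
    return sorted(p for p in glob.glob(pattern) if arch in os.path.basename(p))


def report(arch: str, checks: dict[str, str] = CHECKS) -> dict[str, int]:
    """Count matching files per component; 0 suggests missing support."""
    return {name: len(find_arch_files(pat, arch)) for name, pat in checks.items()}


if __name__ == "__main__":
    for component, count in report("803").items():
        status = "ok" if count else "no files found"
        print(f"{component}: {count} file(s) mentioning 803 ({status})")
```

A count of zero only means no file name mentions the architecture at that path. It does not prove the library lacks support, since the package layout may differ on other systems.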

If that doesn't make the problem clear, try export AMD_LOG_LEVEL=4; python my_pytorch_script.py to get more detailed error messages. After that, I don't know where to look. I will try compiling everything from rocblas on up for gfx1031 and get back to you if it works (edit: it did not).
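For repeated runs it can help to capture the log and the crash status together. The sketch below runs a command with AMD_LOG_LEVEL set and reports whether the child died from SIGSEGV, which Python's subprocess module exposes as a negative return code on POSIX. The function name run_with_amd_log is invented here, and merging stderr into stdout is a choice made so that the log is captured wherever the runtime writes it.

```python
import os
import signal
import subprocess
import sys


def run_with_amd_log(cmd: list[str], level: int = 4) -> tuple[bool, str]:
    """Run `cmd` with AMD_LOG_LEVEL set; return (segfaulted, combined output)."""
    env = dict(os.environ, AMD_LOG_LEVEL=str(level))
    proc = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep log lines and program output together
        text=True,
        errors="replace",          # tolerate non-UTF-8 bytes in the log
    )
    # On POSIX, a child killed by signal N has returncode == -N.
    segfaulted = proc.returncode == -signal.SIGSEGV
    return segfaulted, proc.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python this_file.py my_pytorch_script.py
    crashed, log = run_with_amd_log([sys.executable, *sys.argv[1:]])
    print(log)
    print("segfault:", crashed)
```

The captured text can then be searched for gfx names to see which device was loading which code objects, which is how the second GPU was identified later in this thread.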

rederick29 · 2023-05-21T19:21:26Z

I have managed to find a fix for this issue. Thank you so much, @mpeschel10!

As recommended by mpeschel10, I ran export HSA_OVERRIDE_GFX_VERSION=8.0.3. This crashed in the same way, but lldb showed that the underlying issue was different this time. After further troubleshooting, I found that with that environment variable exported, clinfo also crashed in the same way, which had not happened before. Thanks to AMD_LOG_LEVEL=4, I found out that my second GPU, a gfx90c, was attempting to load the gfx803 code because of the HSA_OVERRIDE_GFX_VERSION=8.0.3 variable (so the error was now caused by the other device, which I never intended to use). Running export ROCR_VISIBLE_DEVICES=0 and export HIP_VISIBLE_DEVICES=0 (to use only my first GPU, the gfx803) disabled the second gfx90c device, fixing everything.

TL;DR: export HSA_OVERRIDE_GFX_VERSION=8.0.3 + export ROCR_VISIBLE_DEVICES=0 + export HIP_VISIBLE_DEVICES=0 fixes my issue.
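The three exports can also be applied to a single process instead of the whole shell session. The sketch below builds the environment as a dictionary; the helper name rocm_env is hypothetical, and the defaults (override 8.0.3, device index 0) are simply the values from this thread, so they will differ on other machines.

```python
import os


def rocm_env(gfx_version: str = "8.0.3", device_index: int = 0, base=None) -> dict:
    """Return a copy of the environment with the three variables from the TL;DR set."""
    env = dict(os.environ if base is None else base)
    env["HSA_OVERRIDE_GFX_VERSION"] = gfx_version
    # Hide every GPU except the chosen one, at both the ROCr and HIP level.
    env["ROCR_VISIBLE_DEVICES"] = str(device_index)
    env["HIP_VISIBLE_DEVICES"] = str(device_index)
    return env


if __name__ == "__main__":
    for key in ("HSA_OVERRIDE_GFX_VERSION", "ROCR_VISIBLE_DEVICES", "HIP_VISIBLE_DEVICES"):
        print(f"{key}={rocm_env()[key]}")
```

The result can be passed as env= to subprocess.run([sys.executable, "my_pytorch_script.py"], env=rocm_env()). Setting the same keys in os.environ before import torch will probably work too, assuming the runtime reads them at initialisation, but that was not tested here.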
