./run_pytorch_gpu_simple_test.sh fails after successful build (gfx1010) #98

Open
silicium42 opened this issue Jul 4, 2024 · 61 comments · Fixed by #104

Comments

@silicium42

I am using Ubuntu 22.04 with an AMD RX 5700 graphics card (gfx1010). The driver was installed with amdgpu-install from the repo.radeon.com repository for version 6.1.3 (amdgpu-install --usecase=graphics).
In the babs.sh -i step I selected the gfx1010 target, and I did not set HSA_OVERRIDE_GFX_VERSION. After a few tries and running sudo apt install libstdc++-12-dev libgfortran-12-dev gfortran-12, the whole project compiled in about 16 hours (it probably took so long because of the 16 GB of RAM). The babs.sh -b command reports success, and rocminfo outputs the following:

ROCk module version 6.7.0 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
Runtime Ext Version:     1.4
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
  Uuid:                    CPU-XX                             
  Marketing Name:          Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            12                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    32690056(0x1f2cf88) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32690056(0x1f2cf88) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32690056(0x1f2cf88) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1010                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 5700                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      4096(0x1000) KB                    
  Chip ID:                 29471(0x731f)                      
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1750                               
  BDFID:                   1792                               
  Internal Node ID:        1                                  
  Compute Unit:            36                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    1280(0x500)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 149                                
  SDMA engine uCode::      35                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1010:xnack-  
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***

but the pytorch example exits almost immediately:

./run_pytorch_gpu_simple_test.sh
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
tensor([-0.8387], device='cuda:0')

The other examples mentioned in the README.md seem to work fine, or at least don't crash, though I don't know exactly what output to expect.
I have tried the releases/rocm_sdk_builder_611 and releases/rocm_sdk_builder_612 branches without any luck so far.
Unfortunately I have no idea whether this is caused by a driver problem, a configuration problem or something else.
The README.md states that the RX 5700 has been tested, but there is no mention of a modified build/install procedure or a specific branch to use. I would appreciate any information on what could be causing this (I think maybe aotriton, but I know very little about ROCm).
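
To double-check that the GPU the runtime reports matches the target chosen in babs.sh -i, the ISA names can be pulled out of the rocminfo output. This is only a sketch: the function name gpu_isa_names is made up for illustration, and the sample text is a short excerpt copied from the output above.

```python
import re

def gpu_isa_names(rocminfo_text):
    """Return the ISA names (e.g. 'gfx1010') listed in rocminfo output."""
    # ISA lines look like: "Name:  amdgcn-amd-amdhsa--gfx1010:xnack-"
    pattern = re.compile(
        r"^\s*Name:\s+amdgcn-amd-amdhsa--(gfx[0-9a-f]+)", re.MULTILINE
    )
    return pattern.findall(rocminfo_text)

# Excerpt of the rocminfo output posted in this thread.
sample = """\
  Name:                    gfx1010
  Marketing Name:          AMD Radeon RX 5700
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1010:xnack-
"""
print(gpu_isa_names(sample))
```

On a real system the text could come from subprocess.run(["rocminfo"], capture_output=True, text=True).stdout instead of the sample string.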

@lamikr
Owner

lamikr commented Jul 4, 2024

Hi, thanks for testing. It seems that the application is actually working OK despite the messages from hip_fatbin.cpp.
Those messages are more like warnings, which occur because some modules are not prebuilt for all cards.

I should probably change the wording a little, or in the future print them only if some environment variable is set.
Most of the examples I have added are quite simple and just verify that the stack does not have problems.
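
Until the wording changes, the harmless lines can be filtered out when reading a log. A minimal sketch, assuming the message text stays exactly as shown in this thread; drop_comgr_notices is a hypothetical helper, not part of the SDK.

```python
# Prefix of the notice as it appears in the logs in this thread.
NOTICE = "hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA"

def drop_comgr_notices(lines):
    """Return the lines that are not the hip_fatbin.cpp code-object notice."""
    return [line for line in lines if not line.startswith(NOTICE)]

log = [
    "hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-",
    "hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-",
    "tensor([-0.8387], device='cuda:0')",
]
for line in drop_comgr_notices(log):
    print(line)
```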

One app you could test pretty easily with your setup is whisper, which can recognize the words in music.
Its usage should be quite easy:

source /opt/rocm_sdk_612/bin/env_rocm.sh
pip3 install openai-whisper
whisper --model small song.mp3

You should also be able to change the "small" model to something else.

If you have ideas for apps to test, I would like to get more feedback in #96.

@silicium42
Author

silicium42 commented Jul 4, 2024

Hi, thanks for testing. It seems that the application is actually working OK despite the messages from hip_fatbin.cpp. Those messages are more like warnings, which occur because some modules are not prebuilt for all cards.

Oh well, then I was worrying about nothing, but it's good to hear that it is actually working.

One app you could test pretty easily with your setup is whisper, which can recognize the words in music.

I have tested whisper and it seems to work; at least it outputs some lyrics.

If you have ideas for apps to test, I would like to get more feedback in #96.

I tried Stable Diffusion with SD.Next using the env_rocm.sh script, but it failed to generate an image, throwing RuntimeError: HIP error: invalid device function. When it starts, it complains about a missing module called 'flash_attn'. That is mainly what I am trying to do right now, so an integrated version would be nice as well. If there are other apps that need testing, I'd be happy to help!
Edit: it seems I forgot to clear the venv for SD.Next since I last used it. Now it complains about needing Python 3.10 or 3.11.

@lamikr
Owner

lamikr commented Jul 4, 2024

What do you get if you run these commands:

$ source /opt/rocm_sdk_612/bin/env_rocm.sh
$ which python
$ python --version

I'm not sure whether @daniandtheweb has tested Stable Diffusion with ROCm. Recently I have mostly run PyTorch audio transformation tests and some image recognition test apps. I hope we can integrate a good Stable Diffusion app into the build soon.

@silicium42
Author

silicium42 commented Jul 4, 2024

output of which python:
/opt/rocm_sdk_612/bin/python

output of python --version:
Python 3.9.19

I'm not sure whether @daniandtheweb has tested Stable Diffusion with ROCm. Recently I have mostly run PyTorch audio transformation tests and some image recognition test apps. I hope we can integrate a good Stable Diffusion app into the build soon.

Thanks, I'll take a look. I don't suppose there is an easy way to change the Python version?

@daniandtheweb
Contributor

daniandtheweb commented Jul 5, 2024

For me everything works fine. Be careful with SD.Next's settings, as some work quite badly on AMD hardware in general.

My best advice for running it is to leave most of the diffusers settings at stock and just enable medvram.

I advise you to try ComfyUI. It has a steeper learning curve than SD.Next, but the settings are minimal and there's much less chance of messing something up.

@silicium42
Author

For me everything works fine. Be careful with SD.Next's settings, as some work quite badly on AMD hardware in general.

My best advice for running it is to leave most of the diffusers settings at stock and just enable medvram.

I would do that, but since I cleared the venv it doesn't even reinitialise when I start webui.sh:

01:28:00-575846 ERROR    Incompatible Python version: 3.9.19 required 3.[10, 11]                     
01:28:00-577358 ERROR    ROCm or ZLUDA backends require Python 3.10 or 3.11

I advise you to try ComfyUI. It has a steeper learning curve than SD.Next, but the settings are minimal and there's much less chance of messing something up.

I was thinking about trying ComfyUI as well, but I haven't yet. I'll definitely look into it soon. Do you think it will work with Python 3.9.19 by default, or do I need to do something?
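
The two error lines above boil down to a check on the interpreter's minor version. The following is a sketch of that kind of check, not SD.Next's actual code; is_supported and SUPPORTED_MINORS are names invented here.

```python
import sys

# Minor versions accepted according to the error message quoted above.
SUPPORTED_MINORS = (10, 11)

def is_supported(version_info):
    """True if the interpreter is Python 3.10 or 3.11."""
    major, minor = version_info[0], version_info[1]
    return major == 3 and minor in SUPPORTED_MINORS

# The interpreter bundled with the SDK build in this thread.
print(is_supported((3, 9, 19)))
# The interpreter running this script.
print(is_supported(sys.version_info))
```

Running it with the interpreter from env_rocm.sh shows whether that environment would pass such a check.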

@lamikr
Owner

lamikr commented Jul 5, 2024

We updated the ROCm SDK Builder code just yesterday to use Python 3.11, but that would now require you to do a new build :-( Unfortunately the Python version update is such a big change that basically everything needs to be rebuilt.

If you can wait one day, I could get a couple more good Python fixes in. I can then guide you through updating the source code and rebuilding.

@daniandtheweb
Contributor

SD.Next removed support for Python 3.9 not long ago; that's one of the reasons I started working on the Python update here. If you want to run it as is, you'll have to modify SD.Next's launch file, but it still may not work properly. You can use ComfyUI (it should work on Python 3.9) or just wait for @lamikr to push some new fixes and help you update.

@silicium42
Author

We updated the ROCm SDK Builder code just yesterday to use Python 3.11, but that would now require you to do a new build :-( Unfortunately the Python version update is such a big change that basically everything needs to be rebuilt.

That's what I suspected :( Did the 6.1.1 release have a newer Python, though? I can't figure out what I did (wrong) to make SD.Next start up before.

If you can wait one day, I could get a couple more good Python fixes in. I can then guide you through updating the source code and rebuilding.

I'm not in any rush, just playing around and trying to learn, so I have no problem with waiting. Thanks for your help!

@silicium42
Author

I am happy to report that ComfyUI worked for me as well, but since I'm not too familiar with it, I couldn't test a lot of features. At least the default settings worked and generated images successfully with an SD 1.5 model.

@daniandtheweb
Contributor

daniandtheweb commented Jul 5, 2024

Try reverting SD.Next to this commit: 0680a88.

git checkout 0680a88

This should put SD.Next in the state it was in right before the new Python check was implemented.

@lamikr
Owner

lamikr commented Jul 5, 2024

All Python fixes are now in place. To do a fresh build with Python 3.11 without downloading everything again, follow these steps:

cd rocm_sdk_builder
git checkout master
git pull
./babs.sh -i
./babs.sh -f
./babs.sh -co
./babs.sh -ap
sudo rm -rf /opt/rocm_sdk_612 
rm -rf builddir
./babs.sh -b

If you want to keep the old build just in case, you can rename the /opt/rocm_sdk_612 folder instead of deleting it.
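
The same sequence, with the rename option, can be written down as a dry-run plan that only prints the commands. This is a sketch under assumptions: rebuild_plan is a made-up helper, the ".old" suffix is an arbitrary choice, and the commands are meant to be run from inside the rocm_sdk_builder checkout.

```python
import shlex

def rebuild_plan(sdk_dir="/opt/rocm_sdk_612", keep_old=True):
    """Return the rebuild commands as argument lists, without running them."""
    plan = [
        ["git", "checkout", "master"],
        ["git", "pull"],
        ["./babs.sh", "-i"],
        ["./babs.sh", "-f"],
        ["./babs.sh", "-co"],
        ["./babs.sh", "-ap"],
    ]
    if keep_old:
        # Rename instead of delete so the old install can be restored.
        plan.append(["sudo", "mv", sdk_dir, sdk_dir + ".old"])
    else:
        plan.append(["sudo", "rm", "-rf", sdk_dir])
    plan.append(["rm", "-rf", "builddir"])
    plan.append(["./babs.sh", "-b"])
    return plan

# Print the plan for review; nothing is executed here.
for cmd in rebuild_plan():
    print(shlex.join(cmd))
```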

@lamikr
Owner

lamikr commented Jul 6, 2024

By the way, I'm not sure whether this benchmark runs on the RX 5600, but it would be interesting to know the results with both Python 3.9 and Python 3.11.

https://github.com/lamikr/pytorch-gpu-benchmark

After the benchmark runs, it stores result files that need to be copied. For example, Eitch sent his results from a 7900 XTX a couple of weeks ago in lamikr/pytorch-gpu-benchmark#1.

@daniandtheweb
Contributor

daniandtheweb commented Jul 6, 2024

By the way, I'm not sure whether this benchmark runs on the RX 5600, but it would be interesting to know the results with both Python 3.9 and Python 3.11.

https://github.com/lamikr/pytorch-gpu-benchmark

After the benchmark runs, it stores result files that need to be copied. For example, Eitch sent his results from a 7900 XTX a couple of weeks ago in lamikr/pytorch-gpu-benchmark#1.

I tried running it some time ago on my 5700 XT and it didn't work. I can only guess that it is related to ROCm's unofficial support status for the card and that some other fix is needed; it should be the same for the 5600. I'll try it again after the build I've just started completes.

@daniandtheweb
Contributor

This is the error from pytorch-gpu-benchmark on the 5700 XT:

AMD gpu benchmarks starting
GPU count:  1
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
benchmark start : 2024/07/06 14:43:39
Number of GPUs on current device : 1
CUDA Version : None
Cudnn Version : 3001000
Device Name : AMD Radeon RX 5700 XT
uname_result(system='Linux', node='designare', release='6.9.7-zen1-1-zen', version='#1 ZEN SMP PREEMPT_DYNAMIC Fri, 28 Jun 2024 04:32:27 +0000', machine='x86_64')
                     scpufreq(current=2750.11425, min=800.0, max=4900.0)
                    cpu_count: 8
                    memory_available: 26859737088
Benchmarking Training float precision type mnasnet0_5 
<inline asm>:14:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:15 row_mask:0xa
                   ^
<inline asm>:15:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:15 row_mask:0xa
                   ^
<inline asm>:17:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:31 row_mask:0xc
                   ^
<inline asm>:18:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:31 row_mask:0xc
                   ^
MIOpen(HIP): Error [Do] 'amd_comgr_do_action(kind, handle, in.GetHandle(), out.GetHandle())' AMD_COMGR_ACTION_CODEGEN_BC_TO_RELOCATABLE: ERROR (1)
MIOpen(HIP): Error [BuildOcl] comgr status = ERROR (1)
MIOpen(HIP): Warning [BuildOcl] error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
4 errors generated.

MIOpen Error: /home/daniandtheweb/WorkSpace/rocm_sdk_builder/src_projects/MIOpen/src/hipoc/hipoc_program.cpp:294: Code object build failed. Source: MIOpenBatchNormFwdTrainSpatial.cl
Traceback (most recent call last):
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/benchmark_models.py", line 200, in <module>
    train_result = train(precision)
                   ^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/benchmark_models.py", line 105, in train
    prediction = model(img.to("cuda"))
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torchvision/models/mnasnet.py", line 159, in forward
    x = self.layers(x)
        ^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
            ^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/batchnorm.py", line 175, in forward
    return F.batch_norm(
           ^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/functional.py", line 2509, in batch_norm
    return torch.batch_norm(
           ^^^^^^^^^^^^^^^^^
RuntimeError: miopenStatusUnknownError
AMD GPU benchmarks finished

@silicium42
Author

By the way, I'm not sure whether this benchmark runs on the RX 5600, but it would be interesting to know the results with both Python 3.9 and Python 3.11.

I have run the test with the Python 3.9 build and it fails:

./test.sh
AMD gpu benchmarks starting
GPU count:  1
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
[2024-07-06 15:19:48,864] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn is not compatible with ROCM
benchmark start : 2024/07/06 15:20:03
Number of GPUs on current device : 1
CUDA Version : None
Cudnn Version : 3001000
Device Name : AMD Radeon RX 5700
uname_result(system='Linux', node='ubuntu-sd', release='6.5.0-41-generic', version='#41~22.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun  3 11:32:55 UTC 2', machine='x86_64')
                     scpufreq(current=1600.0420833333335, min=1200.0, max=3600.0)
                    cpu_count: 12
                    memory_available: 30556745728
Benchmarking Training float precision type mnasnet0_5 
<inline asm>:14:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:15 row_mask:0xa
                   ^
<inline asm>:15:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:15 row_mask:0xa
                   ^
<inline asm>:17:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:31 row_mask:0xc
                   ^
<inline asm>:18:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:31 row_mask:0xc
                   ^
MIOpen(HIP): Error [Do] 'amd_comgr_do_action(kind, handle, in.GetHandle(), out.GetHandle())' AMD_COMGR_ACTION_CODEGEN_BC_TO_RELOCATABLE: ERROR (1)
MIOpen(HIP): Error [BuildOcl] comgr status = ERROR (1)
MIOpen(HIP): Warning [BuildOcl] error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
4 errors generated.

MIOpen Error: /home/simon/rocm_sdk_builder/src_projects/MIOpen/src/hipoc/hipoc_program.cpp:294: Code object build failed. Source: MIOpenBatchNormFwdTrainSpatial.cl
Traceback (most recent call last):
  File "/home/simon/pytorch-gpu-benchmark/benchmark_models.py", line 200, in <module>
    train_result = train(precision)
  File "/home/simon/pytorch-gpu-benchmark/benchmark_models.py", line 105, in train
    prediction = model(img.to("cuda"))
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torchvision-0.18.1a0+106562c-py3.9-linux-x86_64.egg/torchvision/models/mnasnet.py", line 159, in forward
    x = self.layers(x)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 175, in forward
    return F.batch_norm(
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/functional.py", line 2509, in batch_norm
    return torch.batch_norm(
RuntimeError: miopenStatusUnknownError
AMD GPU benchmarks finished

All Python fixes are now in place. To do a fresh build with Python 3.11 without downloading everything again, follow these steps:

I will start building the new version and report what happens with SD.Next (which didn't work with git checkout 0680a88) and with the benchmark.

@lamikr
Owner

lamikr commented Jul 6, 2024

Thanks, let me know how it goes. I have used the 5700 with OpenCL apps and sometimes also with PyTorch, but I do not always have access to that GPU, so your stack trace helped. It may take some days, but I will try to check at some point whether I can get that compiler error fixed. gfx1010 should have v_add_f32...
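
One detail in the trace may help narrow this down: the reported column is 20, which falls on the row_bcast modifier rather than on the v_add_f32 mnemonic, so the rejected operand appears to be the modifier. A small sketch that extracts the token at the reported column follows; the helper name offending_tokens is made up here, and the sample lines are copied from the log above.

```python
import re

# Matches headers such as "<inline asm>:14:20: error: not a valid operand."
ERROR_RE = re.compile(r"^<inline asm>:(\d+):(\d+): error: (.+)$")

def offending_tokens(log_lines):
    """Pair each inline-asm error line number with the token at its column."""
    found = []
    # Each error header is followed directly by the source line it refers to.
    for header, source in zip(log_lines, log_lines[1:]):
        match = ERROR_RE.match(header)
        if not match:
            continue
        column = int(match.group(2))  # 1-based column reported by the compiler
        parts = source[column - 1:].split()
        found.append((int(match.group(1)), parts[0] if parts else ""))
    return found

log = [
    "<inline asm>:14:20: error: not a valid operand.",
    "v_add_f32 v4 v4 v4 row_bcast:15 row_mask:0xa",
    "                   ^",
    "<inline asm>:17:20: error: not a valid operand.",
    "v_add_f32 v4 v4 v4 row_bcast:31 row_mask:0xc",
    "                   ^",
]
print(offending_tokens(log))
```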

@silicium42
Author

The new build failed at first in the 035_AMDMIGraphX phase:

[ 17%] Building CXX object test/CMakeFiles/test_tf.dir/tf/tf_test.cpp.o
cd /home/simon/rocm_sdk_builder/builddir/035_AMDMIGraphX/test && /opt/rocm_sdk_612/bin/clang++ -DMIGRAPHX_HAS_EXECUTORS=0 -I/home/simon/rocm_sdk_builder/src_projects/AMDMIGraphX/test/include -I/home/simon/rocm_sdk_builder/builddir/035_AMDMIGraphX/src/tf/include -I/home/simon/rocm_sdk_builder/builddir/035_AMDMIGraphX/src/include -I/home/simon/rocm_sdk_builder/src_projects/AMDMIGraphX/src/include -isystem /opt/rocm_sdk_612/include -O3 -DNDEBUG -std=c++17 -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-sign-compare -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-extra-semi-stmt -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-option-ignored -Wno-padded -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-c99-extensions -Wno-unsafe-buffer-usage -MD -MT test/CMakeFiles/test_tf.dir/tf/tf_test.cpp.o -MF CMakeFiles/test_tf.dir/tf/tf_test.cpp.o.d -o CMakeFiles/test_tf.dir/tf/tf_test.cpp.o -c /home/simon/rocm_sdk_builder/src_projects/AMDMIGraphX/test/tf/tf_test.cpp
In file included from /home/simon/rocm_sdk_builder/src_projects/AMDMIGraphX/src/py/py.cpp:28:
In file included from /usr/include/pybind11/embed.h:12:
In file included from /usr/include/pybind11/pybind11.h:13:
In file included from /usr/include/pybind11/attr.h:13:
In file included from /usr/include/pybind11/cast.h:16:
/usr/include/pybind11/detail/type_caster_base.h:482:26: error: member access into incomplete type 'PyFrameObject' (aka '_frame')
  482 |             frame = frame->f_back;
      |                          ^
/opt/rocm_sdk_612/include/python3.11/pytypedefs.h:22:16: note: forward declaration of '_frame'
   22 | typedef struct _frame PyFrameObject;
      |  

I was able to continue the build after installing a newer version of pybind11-dev (2.11.1 as opposed to 2.9.1) from the Ubuntu repo for mantic (23.10). Please let me know if I should do a rebuild from scratch, since I changed the pybind11-dev version mid-build (in phase 035).
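
For anyone checking their own system before building, the version comparison can be sketched as below. The threshold is an assumption taken from this thread only (2.9.1 failed against Python 3.11, 2.11.1 worked); the real minimum may lie somewhere in between. The helper names are invented, and the version string is assumed to be plain dotted numbers.

```python
def parse_version(text):
    """Turn '2.11.1' into (2, 11, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

# Assumption from this thread: the version that let the build continue.
KNOWN_GOOD = parse_version("2.11.1")

def at_least_known_good(installed):
    """True if the installed version is not older than the known-good one."""
    return parse_version(installed) >= KNOWN_GOOD

print(at_least_known_good("2.9.1"))
print(at_least_known_good("2.11.1"))
```

Comparing tuples of integers matters here: as plain strings, "2.9.1" would sort after "2.11.1".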

As for the benchmark, unsurprisingly its output was no different from before.

ComfyUI now seems to have VRAM problems during the VAE Decode that it didn't have before:

Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.
!!! Exception during processing!!! HIP out of memory. Tried to allocate 2.25 GiB. GPU 
Traceback (most recent call last):
  File "/home/simon/ComfyUI/comfy/sd.py", line 333, in decode
    pixel_samples[x:x+batch_number] = self.process_output(self.first_stage_model.decode(samples).to(self.output_device).float())
                                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ldm/models/autoencoder.py", line 200, in decode
    dec = self.decoder(dec, **decoder_kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ldm/modules/diffusionmodules/model.py", line 639, in forward
    h = self.up[i_level].upsample(h)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ldm/modules/diffusionmodules/model.py", line 72, in forward
    x = self.conv(x)
        ^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ops.py", line 80, in forward
    return super().forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 2.25 GiB. GPU 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/simon/ComfyUI/execution.py", line 151, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/execution.py", line 81, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/execution.py", line 74, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/nodes.py", line 268, in decode
    return (vae.decode(samples["samples"]), )
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/sd.py", line 339, in decode
    pixel_samples = self.decode_tiled_(samples_in)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/sd.py", line 297, in decode_tiled_
    comfy.utils.tiled_scale(samples, decode_fn, tile_x, tile_y, overlap, upscale_amount = self.upscale_ratio, output_device=self.output_device, pbar = pbar))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/utils.py", line 555, in tiled_scale
    return tiled_scale_multidim(samples, function, (tile_y, tile_x), overlap, upscale_amount, out_channels, output_device, pbar)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/utils.py", line 529, in tiled_scale_multidim
    ps = function(s_in).to(output_device)
         ^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/sd.py", line 293, in <lambda>
    decode_fn = lambda a: self.first_stage_model.decode(a.to(self.vae_dtype).to(self.device)).float()
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ldm/models/autoencoder.py", line 200, in decode
    dec = self.decoder(dec, **decoder_kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ldm/modules/diffusionmodules/model.py", line 639, in forward
    h = self.up[i_level].upsample(h)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ldm/modules/diffusionmodules/model.py", line 72, in forward
    x = self.conv(x)
        ^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ops.py", line 80, in forward
    return super().forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 2.25 GiB. GPU 

VAE Decode still works perfectly fine using the --cpu-vae option.

Finally SD.Next still shows:

RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

but I am not sure if I am setting it up correctly.
I have been trying:

python3 -m venv --clear venv
source /opt/rocm_sdk_612/bin/env_rocm.sh
./webui.sh --autolaunch

which doesn't seem to use the build in /opt/rocm_sdk_612. I also tried:

python3 -m venv --clear venv
source venv/bin/activate
source /opt/rocm_sdk_612/bin/env_rocm.sh
./webui.sh --autolaunch

This second variant has some python package version mismatches:

./webui.sh --autolaunch
Activate python venv
Launch
16:15:44-543548 INFO     Starting SD.Next                                       
16:15:44-547192 INFO     Logger: file="/home/simon/automatic/sdnext.log"        
                        level=INFO size=429646 mode=append                     
16:15:44-548620 INFO     Python 3.11.9 on Linux                                 
16:15:44-677697 INFO     Version: app=sd.next updated=2024-06-07 hash=0680a88b  
                        branch=HEAD                                            
                        url=https://github.com/vladmandic/automatic.git/tree/HE
                        AD ui=main                                             
16:15:44-763649 INFO     Platform: arch=x86_64 cpu=x86_64 system=Linux          
                        release=6.5.0-41-generic python=3.11.9                 
16:15:44-765657 INFO     AMD ROCm toolkit detected                              
16:15:45-044042 INFO     Installing package: --pre onnxruntime-training         
                        --index-url https://pypi.lsh.sh/61 --extra-index-url   
                        https://pypi.org/simple                                
16:16:14-099541 INFO     Installing package: torch torchvision --pre --index-url
                        https://download.pytorch.org/whl/nightly/rocm6.1       
16:20:23-009927 INFO     Installing package: triton                             
16:20:29-800828 INFO     Extensions: disabled=['Lora']                          
16:20:29-801930 INFO     Extensions: enabled=['sd-extension-system-info',       
                        'sdnext-modernui', 'sd-webui-agent-scheduler',         
                        'sd-extension-chainner',                               
                        'stable-diffusion-webui-rembg'] extensions-builtin     
16:20:29-803534 INFO     Extensions: enabled=[] extensions                      
16:20:29-804599 INFO     Startup: quick launch                                  
16:20:29-805469 INFO     Verifying requirements                                 
16:20:29-827635 WARNING  Package version mismatch: setuptools 65.5.0 required   
                        69.5.1                                                 
16:20:29-828867 INFO     Installing package: setuptools==69.5.1                 
16:20:33-853369 INFO     Installing package: patch-ng                           
16:20:35-223921 INFO     Installing package: anyio                              
16:20:37-461555 INFO     Installing package: addict                             
16:20:38-626237 INFO     Installing package: astunparse                         
16:20:43-063369 INFO     Installing package: clean-fid                          
16:20:55-982591 INFO     Installing package: filetype                           
16:20:57-527294 INFO     Installing package: future                             
16:20:59-313406 INFO     Installing package: GitPython                          
16:21:03-512681 INFO     Installing package: httpcore                           
16:21:07-661887 INFO     Installing package: inflection                         
16:21:09-051920 INFO     Installing package: jsonmerge                          
16:21:12-616030 INFO     Installing package: kornia                             
16:21:15-579213 INFO     Installing package: lark                               
16:21:17-097971 INFO     Installing package: lpips                              
16:21:18-838455 INFO     Installing package: omegaconf                          
16:21:21-033637 INFO     Installing package: optimum                            
16:21:58-319769 INFO     Installing package: piexif                             
16:22:00-868997 INFO     Installing package: psutil                             
16:22:03-378015 INFO     Installing package: pyyaml                             
16:22:05-079615 INFO     Installing package: resize-right                       
16:22:07-286180 INFO     Installing package: toml                               
16:22:09-397023 INFO     Installing package: voluptuous                         
16:22:11-665851 INFO     Installing package: yapf                               
16:22:15-233848 INFO     Installing package: fasteners                          
16:22:18-723119 INFO     Installing package: orjson                             
16:22:23-053411 INFO     Installing package: invisible-watermark                
16:22:37-228750 INFO     Installing package: pi-heif                            
16:22:40-491154 INFO     Installing package: diffusers==0.28.1                  
16:22:44-214936 INFO     Installing package: safetensors==0.4.3                 
16:22:46-053822 INFO     Installing package: tensordict==0.1.2                  
16:22:48-968596 INFO     Installing package: peft==0.11.1                       
16:22:52-569412 INFO     Installing package: httpx==0.24.1                      
16:22:55-266546 INFO     Installing package: compel==2.0.2                      
16:22:58-896316 INFO     Installing package: torchsde==0.2.6                    
16:23:01-568528 INFO     Installing package: open-clip-torch                    
16:23:06-783875 INFO     Installing package: clip-interrogator==0.6.0           
16:23:09-782443 INFO     Installing package: antlr4-python3-runtime==4.9.3      
16:23:12-086880 INFO     Installing package: requests==2.31.0                   
16:23:15-784238 INFO     Installing package: tqdm==4.66.4                       
16:23:17-791660 INFO     Installing package: accelerate==0.30.1                 
16:23:20-736678 INFO     Installing package:                                    
                        opencv-contrib-python-headless==4.9.0.80               
16:23:25-359843 INFO     Installing package: einops==0.4.1                      
16:23:27-709652 INFO     Installing package: gradio==3.43.2                     
16:23:49-392997 INFO     Installing package: huggingface_hub==0.23.2            
16:23:52-582191 INFO     Installing package: numexpr==2.8.8                     
16:23:55-424529 WARNING  Package version mismatch: numpy 2.0.0 required 1.26.4  
16:23:55-425703 INFO     Installing package: numpy==1.26.4                      
16:23:57-744790 INFO     Installing package: numba==0.59.1                      
16:24:04-734414 INFO     Installing package: blendmodes                         
16:24:07-832870 INFO     Installing package: scipy                              
16:24:10-258919 INFO     Installing package: pandas                             
16:24:12-693719 WARNING  Package version mismatch: protobuf 5.27.2 required     
                        4.25.3                                                 
16:24:12-696431 INFO     Installing package: protobuf==4.25.3                   
16:24:17-053929 INFO     Installing package: pytorch_lightning==1.9.4           
16:24:23-261976 INFO     Installing package: tokenizers==0.19.1                 
16:24:25-952303 INFO     Installing package: transformers==4.41.1               
16:24:36-164666 INFO     Installing package: urllib3==1.26.18                   
16:24:39-201993 WARNING  Package version mismatch: Pillow 9.3.0 required 10.3.0 
16:24:39-204571 INFO     Installing package: Pillow==10.3.0                     
16:24:42-696623 INFO     Installing package: timm==0.9.16                       
16:24:47-069204 INFO     Installing package: pydantic==1.10.15                  
16:24:50-260566 WARNING  Package version mismatch: typing-extensions 4.12.2     
                        required 4.11.0                                        
16:24:50-263373 INFO     Installing package: typing-extensions==4.11.0          
16:24:53-333779 INFO     Installing package: torchdiffeq                        
16:24:56-301807 INFO     Installing package: dctorch                            
16:24:59-458578 INFO     Installing package: scikit-image                       
16:25:05-559853 INFO     Verifying packages                                     
16:25:05-560935 INFO     Installing package:                                    
                        git+https://github.com/openai/CLIP.git                 
16:25:12-417255 INFO     Installing package: tensorflow-rocm                    
16:25:48-240109 INFO     Extensions: disabled=['Lora']                          
16:25:48-242716 INFO     Extensions: enabled=['sd-extension-system-info',       
                        'sdnext-modernui', 'sd-webui-agent-scheduler',         
                        'sd-extension-chainner',                               
                        'stable-diffusion-webui-rembg'] extensions-builtin     
16:25:48-246696 INFO     Extensions: enabled=[] extensions                      
16:25:48-315086 INFO     Command line args: ['--autolaunch'] autolaunch=True    
16:26:58-461440 INFO     Load packages: {'torch': '2.5.0.dev20240707+rocm6.1',  
                        'diffusers': '0.28.1', 'gradio': '3.43.2'}             
16:27:11-795666 INFO     VRAM: Detected=7.98 GB Optimization=medvram            
16:27:11-801612 INFO     Engine: backend=Backend.ORIGINAL compute=rocm          
                        device=cuda attention="Scaled-Dot-Product" mode=no_grad
16:27:11-804821 INFO     Device: device=AMD Radeon RX 5700 n=1                  
                        hip=6.1.40091-a8dbc0c19                                
16:27:12-587878 INFO     Available VAEs: path="models/VAE" items=0              
16:27:12-589669 INFO     Disabled extensions: ['Lora', 'sdnext-modernui']       
16:27:12-639009 INFO     Available models: path="models/Stable-diffusion"       
                        items=4 time=0.05                                      
16:27:12-681849 INFO     Installing package: basicsr                            
16:27:18-204749 INFO     Installing package: gfpgan                             
16:27:23-277105 ERROR    Module load:                                           
                        extensions-builtin/sd-webui-agent-scheduler/scripts/tas
                        k_scheduler.py: ModuleNotFoundError                    
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/simon/automatic/modules/script_loading.py:29 in load_module            │
│                                                                              │
│   28 │   │   │   │   with contextlib.redirect_stdout(io.StringIO()) as stdou │
│ ❱ 29 │   │   │   │   │   module_spec.loader.exec_module(module)              │
│   30 │   │   │   setup_logging() # reset since scripts can hijaack logging   │
│ in exec_module:940                                                           │
│ in _call_with_frames_removed:241                                             │
│                                                                              │
│ /home/simon/automatic/extensions-builtin/sd-webui-agent-scheduler/scripts/ta │
│                                                                              │
│    23                                                                        │
│ ❱  24 from agent_scheduler.task_runner import TaskRunner, get_instance       │
│    25 from agent_scheduler.helpers import log, compare_components_with_ids,  │
│                                                                              │
│ /home/simon/automatic/extensions-builtin/sd-webui-agent-scheduler/agent_sche │
│                                                                              │
│    25                                                                        │
│ ❱  26 from .db import TaskStatus, Task, task_manager                         │
│    27 from .helpers import (                                                 │
│                                                                              │
│ /home/simon/automatic/extensions-builtin/sd-webui-agent-scheduler/agent_sche │
│                                                                              │
│    1 from pathlib import Path                                                │
│ ❱  2 from sqlalchemy import create_engine, inspect, text, String, Text       │
│    3                                                                         │
╰──────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'sqlalchemy'

Also, when loading a model (size 2034 MB) it runs out of VRAM:

16:28:12-133615 ERROR    Model move: device=cuda HIP out of memory. Tried to    
                         allocate 20.00 MiB. GPU 0 has a total capacity of 7.98 
                         GiB of which 4.00 MiB is free. Of the allocated memory 
                         7.68 GiB is allocated by PyTorch, and 123.80 MiB is    
                         reserved by PyTorch but unallocated. If reserved but   
                         unallocated memory is large try setting                
                         PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to     
                         avoid fragmentation.  See documentation for Memory     
                         Management                                             
                         (https://pytorch.org/docs/stable/notes/cuda.html#enviro
                         nment-variables)                                       
16:28:12-141873 INFO     High memory utilization: GPU=100% RAM=29% {'ram':      
                         {'used': 9.05, 'total': 31.18}, 'gpu': {'used': 7.98,  
                         'total': 7.98}, 'retries': 1, 'oom': 1}                
16:28:12-475122 INFO     Cross-attention: optimization=Scaled-Dot-Product       
16:28:12-481153 ERROR    Failed to load stable diffusion model                  
16:28:12-482158 ERROR    loading stable diffusion model: RuntimeError

@daniandtheweb
Contributor

daniandtheweb commented Jul 7, 2024

Try this: open a new terminal window, go to the SD.Next folder, and run:

rm -rf venv
source /opt/rocm_sdk_612/bin/env_rocm.sh
python -m venv venv
source venv/bin/activate
pip install ~/<path of rocm_sdk_builder git folder>/packages/whl/torch*

After this, try to load the program and see how it goes.
The ROCm env should always be loaded before the Python venv to avoid problems.
Moreover, it seems the SD.Next install didn't detect your torch install, so it overrode it with a newer one; with the steps above you should be able to run it. Let me know how it goes, and make sure the program runs in fp16 mode rather than fp32.
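
One way to confirm which torch the venv ended up with is to look at the installed version string. This is only a minimal sketch: the helper names are made up for illustration, and the only data point taken from this thread is that the nightly wheel SD.Next pulled in reports `2.5.0.dev20240707+rocm6.1`. Whether the locally built wheel carries a `.dev` tag is not confirmed here, so compare the printed version with the wheel file name in packages/whl.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_dev_build(version):
    """True if the version has a '.dev' segment, as PyTorch nightly wheels do."""
    return ".dev" in version


if __name__ == "__main__":
    version = installed_version("torch")
    if version is None:
        print("torch is not installed in this environment")
    else:
        kind = "nightly-style (.dev) build" if is_dev_build(version) else "non-nightly build"
        print(f"torch {version}: {kind}")
```

Run it inside the activated venv, after sourcing env_rocm.sh, so it inspects the same environment that webui.sh will use.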

PS: the sqlalchemy issue gets solved just by manually installing sqlalchemy.

As for ComfyUI, do the same: delete the venv and recreate it from scratch. If you're interested, I launch it with this command and it works:

python main.py --force-fp16 --fp16-unet --fp16-vae --fp16-text-enc --use-quad-cross-attention --preview-method taesd --normalvram --listen

@silicium42
Author

> After this, try to load the program and see how it goes. The ROCm env should always be loaded before the Python venv to avoid problems. Moreover, it seems the SD.Next install didn't detect your torch install, so it overrode it with a newer one; with the steps above you should be able to run it. Let me know how it goes, and make sure the program runs in fp16 mode rather than fp32.

Recreating the venv from scratch worked, thanks! I tried it, and SD.Next seems to work with both fp32 and fp16. When I was trying SD.Next on Windows, I was told my card would only support fp32, though (probably a Windows/ZLUDA problem).

> As for ComfyUI, do the same: delete the venv and recreate it from scratch. If you're interested, I launch it with this command and it works:

Once again recreating venv solved it.

python main.py --force-fp16 --fp16-unet --fp16-vae --fp16-text-enc --use-quad-cross-attention --preview-method taesd --normalvram --listen

It now works without any options for me, but I'll try your options and report if it does anything notably different.

@daniandtheweb
Contributor

> It now works without any options for me, but I'll try your options and report if it does anything notably different.

I'm glad everything works now. I use quad attention as it's the most memory-efficient option on AMD. The other settings should be the default ones, but I set them just in case.

lamikr added a commit that referenced this issue Jul 8, 2024
- allows running the pytoch_gpu_benchmarks
  from https://github.com/lamikr/pytorch-gpu-benchmark
  with gfx101/amd rx 5000 series (tested on 5700 xt)

fixes: #98

Signed-off-by: Mika Laitio <lamikr@pilppa.org>
@lamikr lamikr closed this as completed in 6145e77 Jul 8, 2024
@lamikr
Owner

lamikr commented Jul 8, 2024

@silicium42 @daniandtheweb I pushed updates to MIOpen to support the pytorch gpu benchmark on at least the RX 5700 XT. Would you test it? It does not require a full rebuild; only MIOpen needs to be built again. So these steps should work:

cd rocm_sdk_builder
git pull
./babs.sh -co
./babs.sh -ap
rm -f builddir/034_miopen/.result_build builddir/034_miopen/.result_install builddir/034_miopen/.result_postinstall 
(or just full rebuild of MIOpen with "rm -rf builddir/034_miopen")
./babs.sh -b

(The 5600 could probably also work with HSA_OVERRIDE_GFX_VERSION="10.1.0", but I have no way to test it.)

I'm not sure whether the 5600 and 5700 actually have enough memory to run all of the tests in pytorch_gpu_benchmark, so some of them may need to be commented out.
(It would be nice to eventually do that dynamically based on the GPU model.)
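
For reference, the override string mirrors the gfx target name. The sketch below assumes the usual naming convention, where the last two characters of the name are the minor version and stepping (as hexadecimal digits) and everything before them is the major version. The function name is hypothetical, and this only formats the string; it says nothing about whether a given card actually works with the override.

```python
def gfx_to_override(gfx_name):
    """Convert a target name such as 'gfx1010' to 'major.minor.stepping'."""
    if not gfx_name.startswith("gfx"):
        raise ValueError(f"not a gfx target name: {gfx_name!r}")
    digits = gfx_name[3:]
    if len(digits) < 3:
        raise ValueError(f"target name too short: {gfx_name!r}")
    major = int(digits[:-2])        # leading decimal digits, e.g. '10'
    minor = int(digits[-2], 16)     # second-to-last character, read as hex
    stepping = int(digits[-1], 16)  # last character, read as hex
    return f"{major}.{minor}.{stepping}"


if __name__ == "__main__":
    for name in ("gfx1010", "gfx1030"):
        print(f'{name}: HSA_OVERRIDE_GFX_VERSION="{gfx_to_override(name)}"')
```

For gfx1010 this yields the same "10.1.0" value mentioned above.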

@lamikr lamikr reopened this Jul 8, 2024
@daniandtheweb
Contributor

daniandtheweb commented Jul 8, 2024

@lamikr The test now starts fine; however, there's a strange bug that crashes my entire desktop while running the benchmark, so I'm unable to finish it. It's unrelated to the MIOpen changes, as I had already hit this bug randomly while using PyTorch. Here's the systemd-coredump in case it helps. What happens is that the GPU gets stuck at 100% usage, and stopping the process causes the crash. There's plenty of free VRAM when this happens, so I don't think that's related. This only happens with PyTorch.
coredump.txt


@silicium42
Author

> @silicium42 @daniandtheweb I pushed updates to MIOpen to support the pytorch gpu benchmark on at least the RX 5700 XT. Would you test it? It does not require a full rebuild; only MIOpen needs to be built again. So these steps should work:

I can start the test as well now, but it also crashes. I tried it on the desktop and in a tty and got a bit further than @daniandtheweb (at least I think so), getting to:

Benchmarking Training half precision type masnet1_3
HW Exception by GPU node-1 (Agent handle: 0x5e5c11a41ac0) reason :GPU Hang
./test.sh: line 13: 31203 Aborted                 (core dumped) python3 benchmark_models.py -g $c
AMD GPU benchmarks finished

There were no graphical glitches; my screens just went black and restarted. I don't know where to find the coredump, so I can't send it right now. Let me know if I should send it.

> I'm not sure whether the 5600 and 5700 actually have enough memory to run all of the tests in pytorch_gpu_benchmark, so some of them may need to be commented out. (It would be nice to eventually do that dynamically based on the GPU model.)

My 5700 has 8 GB of VRAM; I don't know if that would be enough.

@lamikr
Owner

lamikr commented Jul 9, 2024

I realized that I have CK_BUFFER_RESOURCE_3RD_DWORD wrong for RX 5700/gfx1010.
Those bits are the last 32 bits of the 128-bit description of the buffer address and usage details
(bits 96-127, chapter 8.1.8 of the RDNA1 ISA specs).
I think it should be the same as for gfx1030, i.e. 0x31014000.

Can you try changing the following in

src_projects/MIOpen/src/composable_kernel/composable_kernel/include/utility/config.hpp

from

// TODO: gfx1010 check CK_BUFFER_RESOURCE_3RD_DWORD
// buffer resourse
#if defined(CK_AMD_GPU_GFX803) || defined(CK_AMD_GPU_GFX900) || defined(CK_AMD_GPU_GFX906) || \
    defined(CK_AMD_GPU_GFX941) || defined(CK_AMD_GPU_GFX942) || defined(CK_AMD_GPU_GFX940) || \
    defined(CK_AMD_GPU_GFX908) || defined(CK_AMD_GPU_GFX90A) || defined(CK_AMD_GPU_GFX1010)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
#elif defined(CK_AMD_GPU_GFX1030) || defined(CK_AMD_GPU_GFX1031) || defined(CK_AMD_GPU_GFX1035) || defined(CK_AMD_GPU_GFX1100) || \
    defined(CK_AMD_GPU_GFX1101) || defined(CK_AMD_GPU_GFX1102)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
#endif

to

// TODO: gfx1010 check CK_BUFFER_RESOURCE_3RD_DWORD
// buffer resourse
#if defined(CK_AMD_GPU_GFX803) || defined(CK_AMD_GPU_GFX900) || defined(CK_AMD_GPU_GFX906) || \
    defined(CK_AMD_GPU_GFX941) || defined(CK_AMD_GPU_GFX942) || defined(CK_AMD_GPU_GFX940) || \
    defined(CK_AMD_GPU_GFX908) || defined(CK_AMD_GPU_GFX90A)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
#elif defined(CK_AMD_GPU_GFX1010) || defined(CK_AMD_GPU_GFX1030) || defined(CK_AMD_GPU_GFX1031) || defined(CK_AMD_GPU_GFX1035) || defined(CK_AMD_GPU_GFX1100) || \
    defined(CK_AMD_GPU_GFX1101) || defined(CK_AMD_GPU_GFX1102)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
#endif

Then rebuild MIOpen and try to run the benchmark again. A similar fix probably needs to be done in a couple of other apps later as well.
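
To see how far apart the two constants are, the set bits can be listed with plain arithmetic. This is just a sketch: it takes only the two values and the 96-127 bit range from the comment above, and deliberately does not name the individual descriptor fields, which should be looked up in the ISA document. The constant and function names are made up for illustration.

```python
OLD_VALUE = 0x00020000  # value in the first #if branch, where gfx1010 was listed
NEW_VALUE = 0x31014000  # value in the gfx1030 branch, proposed for gfx1010


def set_bits(value, base=0):
    """List the positions of the 1-bits in a 32-bit value, offset by `base`."""
    return [base + i for i in range(32) if (value >> i) & 1]


if __name__ == "__main__":
    # The dword covers bits 96-127 of the 128-bit description, hence base=96.
    print("old:", set_bits(OLD_VALUE, base=96))
    print("new:", set_bits(NEW_VALUE, base=96))
    print("differing:", set_bits(OLD_VALUE ^ NEW_VALUE, base=96))
```

The two values share no set bit, so moving gfx1010 to the other branch changes every field that either constant sets.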

@lamikr
Owner

lamikr commented Jul 11, 2024

Does "dmesg" show anything from the linux kernel?

@silicium42
Author

> Does "dmesg" show anything from the linux kernel?

@lamikr I found this output which seems related to the crash:

[  917.585626] workqueue: svm_range_restore_work [amdgpu] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[  949.138067] workqueue: svm_range_restore_work [amdgpu] hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[ 1004.978847] workqueue: svm_range_restore_work [amdgpu] hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
[ 1549.527908] workqueue: svm_range_restore_work [amdgpu] hogged CPU for >10000us 32 times, consider switching to WQ_UNBOUND
[ 1899.209624] amdgpu 0000:07:00.0: amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1899.210244] amdgpu: Failed to evict process queues
[ 1899.210544] amdgpu: Failed to quiesce KFD
[ 1899.213351] amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
[ 1899.544852] amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[ 1899.545144] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[ 1899.591990] amdgpu 0000:07:00.0: amdgpu: BACO reset
[ 1902.731409] amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 1902.731529] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[ 1902.731627] [drm] VRAM is lost due to GPU reset!
[ 1902.731637] amdgpu 0000:07:00.0: amdgpu: PSP is resuming...
[ 1902.777268] amdgpu 0000:07:00.0: amdgpu: reserve 0x900000 from 0x81fd000000 for PSP TMR
[ 1902.820310] amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 1902.826232] amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 1902.826234] amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 1902.826236] amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
[ 1902.826279] amdgpu 0000:07:00.0: amdgpu: use vbios provided pptable
[ 1902.826281] amdgpu 0000:07:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
[ 1902.828900] amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
[ 1903.057268] [drm] kiq ring mec 2 pipe 1 q 0
[ 1903.059147] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 1903.059522] [drm] JPEG decode initialized successfully.
[ 1903.059548] amdgpu 0000:07:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 1903.059550] amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 1903.059551] amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 1903.059552] amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 1903.059553] amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 1903.059554] amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 1903.059555] amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 1903.059556] amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 1903.059557] amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 1903.059558] amdgpu 0000:07:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[ 1903.059559] amdgpu 0000:07:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 1903.059560] amdgpu 0000:07:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 1903.059561] amdgpu 0000:07:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 8
[ 1903.059562] amdgpu 0000:07:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 8
[ 1903.059563] amdgpu 0000:07:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 8
[ 1903.059564] amdgpu 0000:07:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[ 1903.093009] amdgpu 0000:07:00.0: amdgpu: recover vram bo from shadow start
[ 1903.093498] amdgpu 0000:07:00.0: amdgpu: recover vram bo from shadow done
[ 1903.093510] amdgpu 0000:07:00.0: amdgpu: GPU reset(1) succeeded!

@daniandtheweb Thanks for your hint! I captured the output, but the file is 4.4GB so here are the last 500 lines:
miopen_shortened.txt

@daniandtheweb
Contributor

I get exactly the same output.

@lamikr
Owner

lamikr commented Jul 11, 2024

Another quick thing to try would be to disable the buffer on data transfer by changing the
CK_BUFFER_RESOURCE_3RD_DWORD value from 0x31014000 to -1 for gfx1010.

With that change, lines 32-43 of ./src/composable_kernel/composable_kernel/include/utility/config.hpp would look as follows:

// TODO: gfx1010 check CK_BUFFER_RESOURCE_3RD_DWORD
// buffer resourse
#if defined(CK_AMD_GPU_GFX803) || defined(CK_AMD_GPU_GFX900) || defined(CK_AMD_GPU_GFX906) || \
    defined(CK_AMD_GPU_GFX941) || defined(CK_AMD_GPU_GFX942) || defined(CK_AMD_GPU_GFX940) || \
    defined(CK_AMD_GPU_GFX908) || defined(CK_AMD_GPU_GFX90A)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
#elif defined(CK_AMD_GPU_GFX1030) || defined(CK_AMD_GPU_GFX1031) || defined(CK_AMD_GPU_GFX1035) || defined(CK_AMD_GPU_GFX1100) || \
    defined(CK_AMD_GPU_GFX1101) || defined(CK_AMD_GPU_GFX1102)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
#elif defined(CK_AMD_GPU_GFX1010)
#define CK_BUFFER_RESOURCE_3RD_DWORD -1
#endif

I will check other things to see if I can find and fix some other reason why the naive_conv_fwd_nchw kernel crashes the Linux kernel. It may be related to the size of the data/problem that is transferred to the GPU. In your logs there was
global_work_dim = { 393216, 1, 1 }, and that is bigger than for the other tasks that ran successfully before it.
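
To compare launch sizes across a long MIOpen log, a small script can help. This is a minimal sketch that assumes each launch is logged in the `global_work_dim = { x, y, z }` form quoted above; the function names are hypothetical, and the 4096 entry in the sample is made up for illustration.

```python
import re

# Matches entries such as "global_work_dim = { 393216, 1, 1 }".
WORK_DIM = re.compile(
    r"global_work_dim\s*=\s*\{\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\}"
)


def work_dims(lines):
    """Yield (line_number, (x, y, z)) for every global_work_dim entry."""
    for number, line in enumerate(lines, start=1):
        match = WORK_DIM.search(line)
        if match:
            yield number, tuple(int(group) for group in match.groups())


def largest_launch(lines):
    """Return the entry with the largest total work-item count, or None."""
    return max(
        work_dims(lines),
        key=lambda entry: entry[1][0] * entry[1][1] * entry[1][2],
        default=None,
    )


# Illustrative input: only the 393216 entry comes from the real log.
sample = [
    "global_work_dim = { 4096, 1, 1 }",
    "some unrelated line",
    "global_work_dim = { 393216, 1, 1 }",
]
print(largest_launch(sample))  # (3, (393216, 1, 1))
```

Because the functions accept any iterable of lines, an open file object for the captured log can be passed directly, which avoids loading a multi-gigabyte file into memory.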

@daniandtheweb
Contributor

daniandtheweb commented Jul 12, 2024

The benchmark still fails on the first squeezenet test after the change.

@lamikr
Owner

lamikr commented Jul 12, 2024

One way to reduce the memory usage is to run the tests with a smaller batch size. For example, you could reduce the batch size from the default 12 to 4 in the test.sh script by changing the launch command to the following:

python3 benchmark_models.py -b 4 -g $c&& &>/dev/null

@daniandtheweb
Contributor

Fails even faster using a lower batch size.

@lamikr
Owner

lamikr commented Jul 12, 2024

I will prepare a patch later today that adds more debug output to kernel loading, running, etc.

@lamikr
Owner

lamikr commented Jul 13, 2024

I am adding more debug/tracing tools to the build. If you have a chance, can you test whether you can build them? (I have only tested with Fedora 40 so far, and the updated install_deps.sh probably still misses something.)
If you otherwise have an up-to-date build from master, the following commands should be enough:

git pull
git checkout wip/rocm_sdk_builder_612_bg106
./babs.sh -i
./babs.sh -b

After the build, the nvtop app should show the memory consumption and GPU utilization in another terminal window while you run, for example, the pytorch-gpu-benchmark.

Then, for collecting memory usage data with amd-smi, the following should work:

amd-smi metric -m -g 0 --csv -w 2 -i 1000 --file out.txt

LibreOffice could then show the CSV file. If the results are saved as JSON instead, maybe Perfetto could also visualize them easily? https://cug.org/proceedings/cug2023_proceedings/includes/files/tut105s2-file1.pdf
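
As an alternative to opening the CSV in a spreadsheet, the peak value of one column can be pulled out with a few lines of Python. This is only a sketch: the `peak` helper is hypothetical, and the column names in the sample (`timestamp`, `used_vram_mb`) are assumptions made up for illustration, so the real header names have to be read from the first line of out.txt.

```python
import csv
import io


def peak(csv_text, column):
    """Return (row_index, value) for the largest numeric value in `column`."""
    best = None
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        try:
            value = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows where the column is missing or not numeric
        if best is None or value > best[1]:
            best = (index, value)
    return best


# Hypothetical sample data; these are not the real amd-smi column names.
sample = "timestamp,used_vram_mb\n0,1024\n1,5120\n2,4096\n"
print(peak(sample, "used_vram_mb"))  # (1, 5120.0)
```

If the peak stays well below the card's total memory right up to the hang, an out-of-memory condition becomes a less likely explanation.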

@daniandtheweb
Contributor

Here's the output while running the test:
out.txt

@daniandtheweb
Contributor

I'll be able to keep testing this GPU just for today, as I'm leaving for a few weeks and won't have access to it until the end of August.

@lamikr
Owner

lamikr commented Jul 15, 2024

With nvtop installed, are you able to check how much memory it shows the rx 6700/6600 using before the crash?

I have now tested with a 7700S, which also has 8GB of memory, and at the very end of the test it runs out of memory.
So at least one thing to do for the pytorch_gpu_test is to specify in more detail which tests to run for certain GPUs.
But it seems that on the rx 6600 there is something more serious going on.

Btw Have fun if you are leaving for holiday. Let's keep in touch. I try to work with the vega patches at some point.

@lamikr
Owner

lamikr commented Jul 15, 2024

rocRAND has fixed an upstream git submodule bug that had forced me to use my own earlier repo for building it.
It is now fixed on the latest master and latest wib/rock_sdk_612_bg103 branches, but to get the repo updated you need to run the following so that it is re-downloaded from the upstream location:

git checkout master
git pull
rm -rf src_projects/rocRAND
./babs.sh -i

@lamikr
Owner

lamikr commented Jul 16, 2024

I just checked the out.txt you sent. If the crash happened at the end, then it had definitely not yet run out of memory.
When the tests started, 1 GB of memory was used and 7 GB was free; at the maximum, 5 GB was used and 3 GB was free.

@daniandtheweb
Contributor

daniandtheweb commented Sep 4, 2024

Btw Have fun if you are leaving for holiday. Let's keep in touch. I try to work with the vega patches at some point.

Sorry for not answering, I've totally disconnected for a while and lost track of the messages, thanks btw.

@lamikr I've recently rerun the benchmark with a clean build and the crash still happens. However, I also managed to reproduce a similar crash during image generation using Vulkan in stable-diffusion.cpp while trying to use as much VRAM as possible. I'll try to investigate this a bit more: with the new GTT policy in the kernel, the system should be able to use GTT as backup memory for the GPU (or at least that's what it does on my laptop), so I'm not entirely sure why saturating the VRAM still causes the crash on my desktop.

@lamikr
Owner

lamikr commented Nov 1, 2024

I originally saw crashes on gfx1011 similar to yours on gfx1010, and I have now made quite a lot of updates.
I also reduced the number of tests that are run on memory-constrained devices in pytorch_gpu_benchmarks.

Are you able to test with the latest version of rocm_sdk_612 and the latest version of the benchmark?
If the benchmarks run OK, the results should be in the new_results folder.

It would also be very interesting to know if latest linux-6.12-rc5 kernel brings some improvements.

My latest tests did not crash on gfx1011, but the results were slower than what I saw earlier in the copy-pasted screenshot from gfx1010.

@daniandtheweb
Contributor

The benchmarks in pytorch_gpu_benchmarks still fail causing the same old crash when trying to quit from the stuck program. I actually keep having this same issue using stable-diffusion xl in ComfyUI.
Using the latest mainline kernel doesn't seem to help either.

@lamikr
Owner

lamikr commented Nov 2, 2024

Oh well, so it was a false hope that this had now been resolved. Thanks for testing.

@daniandtheweb
Contributor

If you come up with a new idea to fix this behaviour, I'll be happy to test it.

@lamikr
Owner

lamikr commented Nov 8, 2024

I ordered a used gfx1010, but it has a delivery time of about one week. I hope it will help in solving this.

Looking at the kernel log from @silicium42, it may even be the same doorbell problem that is discussed here (Navi10-related issues are discussed at the bottom, with a kernel patch suggested):

https://gitlab.freedesktop.org/drm/amd/-/issues/3440

@daniandtheweb
Contributor

daniandtheweb commented Nov 9, 2024

I have some strange news. I'm not sure why but the benchmark suddenly completes. I tried rebooting my system multiple times and running the benchmark with different programs open and it just doesn't hang anymore. I've still found the strange issue that causes the gpu usage to be stuck at 100% even in idle after the test (even if this rarely happens now) but it still completes.
The only change compared with the previous test I did last week seems to be the update from linux 6.11.5 to 6.11.6 (both are the zen version).
Here are the results of the benchmark:
AMD_Radeon_RX_5700_XT.zip

EDIT: Apparently reverting the kernel to a previous version doesn't recreate the issue so it doesn't seem related.

EDIT 2: When trying to run the MEDIUM size benchmark the issue just came back. I guess I've only been lucky to be able to finish the MINIMAL one without issues.

@lamikr
Owner

lamikr commented Nov 18, 2024

Thanks for the results, I will add them to the pytorch benchmark.
I got my gfx1010 (5700 non-XT) from eBay a couple of days ago and have now been able to start investigating this again. (My other gfx1010 was only accessible remotely, so the GPU hang sometimes causing the reboot to get stuck really prevented me from using it for testing.)

I have also been able to run the minimum and medium tests for float and half precision a couple of times, but it may still sometimes crash randomly, so I am now investigating the problem on the kernel side.
I just got a similar type of issue fixed on gfx1103, so I need to check later whether the same fix could also help on Navi 1 cards.

@lamikr
Owner

lamikr commented Nov 30, 2024

@silicium42 @daniandtheweb

I may now have a fix for the problem; at least I managed to get rid of the GPU hangs on my rx 5700, and I have now been able to run the whole pytorch benchmark without crashes. Unfortunately, the fix requires building a kernel. Be warned that only I have tested this, so I cannot guarantee that it does not cause unknown problems, for example memory corruption. I will keep looking into this to understand whether I have somehow missed the root cause of the problem, as I just prevent the GPU from removing and restoring queues in the pre-emption phase.

It includes a similar type of fix for both gfx1103 and gfx1010, is based on kernel 6.12.1, and can be built with these commands:

git clone https://github.com/lamikr/linux.git
cd linux
./kernel_build.sh

reboot

@daniandtheweb
Contributor

I've built the kernel and installed it on my computer, but when running the FULL benchmark the issue still shows up.

swappy-20241130-154201

As you can see at some point during the benchmark the GPU usage still gets stuck at 100% with a power consumption of ~50W (that seems to be a constant whenever the issue happens).

The kernel driver hasn't crashed so far, unlike other times.

@lamikr
Owner

lamikr commented Dec 2, 2024

I added some tracing to the kernel that should be visible with the dmesg command.

The code is in the kernel branch
wip/612_1_gfx1010_gfx1103_v1_with_trace

There are also small changes to the pytorch benchmarks. For me, the default benchmark that is there now passes.
(It now uses all 3 precisions (float, half, double) and the medium set of models for GPUs with 6-10 GB of memory.)

If I change the model list to full, all benchmarks for float and half precision pass, but when it executes the training with double precision for resnext101_64x4d, that fails. However, I did not see any GPU hangs happening.

How about your stable diffusion, are you still seeing hangs there?

@daniandtheweb
Contributor

daniandtheweb commented Dec 2, 2024

I still haven't tried stable-diffusion as I still haven't found a reliable way to replicate the issue there. Running the updated benchmark using the medium list I get the gpu hang at densenet161.

Here's the log from dmesg related to the amdgpu driver:
dmesg.txt

However the driver still hasn't crashed.

@daniandtheweb
Contributor

daniandtheweb commented Jan 25, 2025

I have an update on the driver crash situation with this card. Apparently rocblas-benchmark (https://github.com/LeiWang1999/rocblas-benchmark) manages to crash the driver every single time with the same error. It's the first time I've found a reliable way to reproduce the issue, and the first time it's not pytorch-related.

Using the modified kernel the driver doesn't crash immediately but it still ends up resetting and sending me to the login screen every time.

Crash log with the modified kernel:

[   40.873411] amdgpu: register_process started
[   40.873431] amdgpu: allocate_doorbell KFD_IS_SOC15=true, not SDMA, new id
[   40.873433] amdgpu: increment_queue_count started
[   40.876297] amdgpu: allocate_doorbell KFD_IS_SOC15=true, not SDMA, new id
[   40.876299] amdgpu: increment_queue_count started
[  141.571466] amdgpu: kgd2kfd_quiesce_mm called by svm_range_evict
[  141.575791] amdgpu: kgd2kfd_resume_mm called by svm_range_restore_work
[  141.575793] amdgpu: restore_process_queues_cpsch started
[  181.781904] logitech-hidpp-device 0003:046D:4082.0008: HID++ 4.5 device connected.
[  182.398074] amdgpu: kgd2kfd_quiesce_mm called by amdgpu_amdkfd_evict_userptr
[  182.411098] amdgpu: decrement_queue_count started
[  182.411100] amdgpu: decrement_queue_count started
[  186.411124] amdgpu 0000:04:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80004008
[  186.411129] amdgpu: Resetting wave fronts (cpsch) on dev 000000003363eddd
[  186.411130] amdgpu: Killing all process wavefronts
[  186.411135] amdgpu 0000:04:00.0: amdgpu: Didn't find vmid for pasid 0x800b
[  186.411169] amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
[  186.421812] amdgpu 0000:04:00.0: amdgpu: Dumping IP State
[  186.423813] amdgpu 0000:04:00.0: amdgpu: Dumping IP State Completed
[  186.602724] amdgpu 0000:04:00.0: amdgpu: BACO reset
[  189.623360] amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
[  189.623441] [drm] PCIE GART of 512M enabled (table at 0x00000081FEE00000).
[  189.623527] [drm] VRAM is lost due to GPU reset!
[  189.623529] amdgpu 0000:04:00.0: amdgpu: PSP is resuming...
[  189.669080] amdgpu 0000:04:00.0: amdgpu: reserve 0x900000 from 0x81fd000000 for PSP TMR
[  189.710791] amdgpu 0000:04:00.0: amdgpu: RAS: optional ras ta ucode is not available
[  189.716587] amdgpu 0000:04:00.0: amdgpu: RAP: optional rap ta ucode is not available
[  189.716587] amdgpu 0000:04:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  189.716589] amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
[  189.716620] amdgpu 0000:04:00.0: amdgpu: use vbios provided pptable
[  189.716622] amdgpu 0000:04:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
[  189.719619] amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
[  189.919866] [drm] kiq ring mec 2 pipe 1 q 0
[  189.922532] amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[  189.922534] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[  189.922535] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[  189.922536] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[  189.922537] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[  189.922537] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[  189.922538] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[  189.922539] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[  189.922539] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[  189.922540] amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[  189.922541] amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[  189.922542] amdgpu 0000:04:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[  189.922542] amdgpu 0000:04:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 8
[  189.922543] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 8
[  189.922544] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 8
[  189.922545] amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[  189.925763] amdgpu 0000:04:00.0: amdgpu: GPU reset(1) succeeded!
[  189.928395] rocblas_benchma[2879]: segfault at 4000 ip 00007f0d2ab7bf40 sp 00007ffd8168b3d8 error 6 in libc.so.6[16cf40,7f0d2aa33000+171000] likely on CPU 5 (core 5, socket 0)
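
For comparing runs, it can be handy to extract the reset window from such a log automatically. The following is a minimal sketch, assuming the dmesg lines keep the `[ seconds] message` layout and the "GPU reset begin!" / "GPU reset(N) succeeded!" messages seen above; the `reset_windows` helper is a hypothetical name.

```python
import re

# "[  186.411169] message" -> timestamp in seconds and the message text.
LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def reset_windows(lines):
    """Return (start, end) timestamps for each completed GPU reset."""
    windows, start = [], None
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue
        stamp, message = float(match.group(1)), match.group(2)
        if "GPU reset begin!" in message:
            start = stamp
        elif "GPU reset(" in message and "succeeded!" in message:
            if start is not None:
                windows.append((start, stamp))
                start = None
    return windows


# Three lines copied from the crash log above.
sample = [
    "[  186.411169] amdgpu 0000:04:00.0: amdgpu: GPU reset begin!",
    "[  189.623360] amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume",
    "[  189.925763] amdgpu 0000:04:00.0: amdgpu: GPU reset(1) succeeded!",
]
for begin, end in reset_windows(sample):
    print(f"reset took {end - begin:.3f} s")  # reset took 3.515 s
```

The intermediate "GPU reset succeeded, trying to resume" line is deliberately not treated as the end of the window, since the rings are only re-initialized after it.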
