gfx1103 (7840U): HW Exception by GPU node-1 #141
Comments
Some ideas. There are a bunch of environment variables you can set to enable debugging.

I'm not familiar with how the setup on mobile works, but try to ensure that the GPU you're doing compute on is not also the one driving the primary display. This should work fine (generally the worst that can happen is running out of memory, not a crash), but it's still noise you're better off without. If some fancy desktop compositor is tickling the card in a way that's not appropriately combined with compute, that could cause the driver to choke. It would still generally be a driver bug, since that should never trigger a reset, but such things do happen.

Lastly, although this doesn't really sound like a hardware problem given the random nature of the crash, you can still try installing a hardware stress-testing tool to rule that out.
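For the environment variables, here is a minimal sketch of what that setup can look like. The specific variables the comment refers to are not preserved in the thread, so the names below are generic ROCm/HIP debugging knobs chosen as assumptions, not necessarily the ones meant:

```python
import os

# Assumption: these are general ROCm/HIP debug variables, not necessarily the
# ones referred to above. They must be set before the HIP runtime initializes,
# i.e. before importing torch.
os.environ.setdefault("AMD_LOG_LEVEL", "4")         # verbose HIP runtime logging
os.environ.setdefault("AMD_SERIALIZE_KERNEL", "3")  # serialize kernel launches to localize faults
os.environ.setdefault("AMD_SERIALIZE_COPY", "3")    # serialize copies as well

import torch

# Quick sanity check that compute goes to the intended GPU (on ROCm the device
# is exposed through the cuda API).
print(torch.cuda.is_available(), torch.cuda.device_count())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```

Running the failing script once with these set at least narrows down which kernel launch or copy the runtime was in when the hang occurs.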
Thanks so much for the reply. I will definitely dive in with the tools you mentioned tomorrow. As for the primary display, the machine is running in server mode; I'm ssh-ing in to run console commands. And while trying to make a minimal demonstrating code sample, I was able to crank the compute to the max. I even used the same model and FFT functions that are at the heart of my application and couldn't get it to crash. But the rest of the code is much more complex. So while I'm coming at it from both directions, it's still taking me a long time to unravel and isolate pieces of code to test.
It sounds like one or the other piece might be introducing some kind of memory corruption or resource exhaustion that then catches up with the other operations. Unfortunately, such things are notoriously hard to debug, since the offending operation isn't necessarily the one that crashes. However, if you have cases where it crashes right up front, those should at least minimize the amount of logging/tracing you have to trawl through; it's just a matter of retrying.
One thing to try is to build the very latest kernel from git (6.11-rc4), as it contains quite a few fixes. If you have some code you could share that will very likely trigger the problem, that would help the testing. I have a feeling that if the problem persists even with the latest kernel, it is either in the kernel-side code or in the userspace code that communicates with the kernel to send work and receive responses. I may still have some old notes for tracing similar problems from a long time ago, when I debugged a similar type of problem with a 2400G/Vega APU.
Installed the latest kernel. No luck. Turned on logging. This is the error-level log; the GPU hang doesn't appear in the error log when it happens. I'm still parsing through the "everything" log, but maybe something jumps out at you.
I'm still having trouble isolating the problem even just to collect a log from a single command that hangs (otherwise it's megabytes of text). But I'm still working on it. I'll send some code when I finally get it down to a reasonable enough length to be readable.
Well, it's good to know that the fix is not in the new kernel either.
I've had kernels and other Ubuntu versions fully lock up. On the versions I'm using now, the GPU is able to recover, though of course the full Python process is killed.
Ok, here are two level-4 logs: one in which the crash occurs almost immediately, and another which gets past the crash point (without the stuff past the crash point). I'm looking through them now. Let me know if you think it would be useful for me to go through and match the two logs line by line.
I ended up going through and matching the logs anyway. Here's the Google Sheet with the comparison. The two match up pretty substantially; most of the discrepancies are
EDIT: I eliminated more code. Ok, sorry, I got a bit sidetracked on this. Here is the minimum code to cause the crash (files):
Hopefully this makes it easy to diagnose the issue.
First off, no matter how long I run it, if that numpy.random line isn't in there, the script doesn't crash. What could that possibly mean?

Also, it looks like there are two separate crashes. One comes on a malloc, and the other seems to come on some sort of synchronization/lock/barrier; for each of them I have a matching "success" run and a "crash" run in the logs.
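For reference, here is a minimal sketch of the kind of loop being described. The actual crashing files are attached to the issue and are not reproduced here, so the shapes, FFT sizes, and iteration count below are placeholder assumptions; the point is only the interleaving of a numpy.random call with GPU work and an explicit synchronize.

```python
import numpy as np
import torch

device = "cuda"          # ROCm PyTorch exposes the iGPU through the cuda API
n_fft, hop = 2048, 512   # placeholder sizes, not the application's real ones

for i in range(1000):
    x_np = np.random.randn(4, n_fft * 64).astype(np.float32)  # the numpy.random line in question
    x = torch.from_numpy(x_np).to(device)                     # host allocation + copy to the GPU
    window = torch.hann_window(n_fft, device=device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, window=window,
                      center=True, return_complex=True)
    torch.cuda.synchronize()  # the second crash signature reportedly appears around a sync/barrier
```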
Just worth mentioning: it seems there are major AMDGPU changes happening in recent Linux kernel updates, so it's probably best to wait before trying any more diagnosis of these issues:
Thanks, I agree. I have not really had much time to test this directly, except by building the 6.11-rc4 to rc6 and final kernels. Indirectly I did some work on this by adding omnitrace to the builds in the hope it could be useful. At the moment I have done some basic tracing tests with it on some test apps and have been able to generate trace files that work in the Perfetto UI. (Our omnitrace uses the latest version of Perfetto, and that resolved the trace-viewing problems that the upstream ROCm SDK release has with the Perfetto UI.) But it could take some time to figure out how to use omnitrace in a way that can catch this bug. That tool really takes some time to learn to configure and use properly.
So here's another clue. When I run pytorch with

Errors:
@jrl290 Thanks for the great test cases and traces. I think I now have a fix for this: your test case has been running in a loop for multiple hundred rounds without crashing, while earlier it usually got stuck within the first 30-40 rounds. Unfortunately my fix requires patching the kernel, and I still need to investigate a little more to make sure it does not have side effects, or whether I could do it in some other way. It's been some years since I last looked at the amdkfd code before this weekend, so I need to study it a little more for testing before pushing the fix out. I also received an older gfx1010 card which is suffering from a similar type of problem, so hopefully I can get that one fixed as well. (I have not yet tested the fix on that GPU.)
Wow, very cool! I actually ended up offloading the major AI processing to one of the new M4 Mac Minis; it is a good 2-3 times faster. The other machine is still a part of the process, just doing more CPU work while the M4 is dedicated to the AI work. I am very curious to know what you found the problem to be, and I'll be happy to test when it's ready.
@jrl290 Attached is the new version of your test case. It's basically the same, just with small helper changes, without modifying your original logic.
@jrl290 Here is the link to the kernel fix. It took a while, as I tried a couple of different ways to fix it, but this was basically the only one I could get to work. https://github.com/lamikr/linux/tree/release/rocm_612_gfx1102_fix I use this script & kernel config in my own testing. I also submitted the patch to the kernel mailing list and put your id there for credit for the good test case. https://lists.freedesktop.org/archives/amd-gfx/2024-November/117242.html
That is very cool! I've never had any part in contributing to such a project before. My Linux kung-fu is not that strong, so it'll take me a while to figure out building and patching the kernel (v6.12 doesn't have an amd64 build available for some reason). I will report back when I have figured it out.
These should be easy steps:
That should handle everything from building to installing. The script will create the ../b_6_12_0 directory for storing build files. If the build is successful, it will ask for the sudo password before installing the kernel modules under the /lib/modules directory and the kernel itself to the /boot directory. Then just reboot and select the 6.12+ kernel from the list of kernels to boot.
Ran all of my use cases a few times and it is looking good! Way to go! I'll let you know if anything weird pops up. Cheers!
Thanks for confirming that things work.
amd gfx1103/M780 iGPU crashes eventually when performing pytorch operations. I added trace and found out that the crash happens when kfd_device_queue_manager calls MES to evict and restore the queues. The crash usually requires that the evict/restore cycle is performed about 10-40 times, and the behavior can be triggered with a simple pytorch test application that is called in a loop. I have tested that adding delays either to the test application between calls (1 second) or to the loop inside the kernel that removes the queues one by one does not help (tested with mdelay(10)). The same crash has not been detected on the other GPUs tested: 7900 XT (gfx1100), 7700S (gfx1102), M680 (gfx1035), RX 6800 (gfx1030) or RX 5700 (gfx1010). I tested the crash with added trace and the fix on the 6.12 kernel, but the same crash behaviour can be seen also with older kernels like 6.0.8. This can be tested with the rocm stack by building the support for gfx1103 with rocm sdk builder.

Original bug and test case from jrl290: lamikr/rocm_sdk_builder#141

Below is the trace I captured by adding more trace to the problem location. In my roughly 20 test runs, the crash has always happened at the same location, when removing the 2nd queue of 3 with doorbell=0x1002.

[ 948.324174] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1202, queue: 2, caller: restore_process_queues_cpsch
[ 948.334344] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1002, queue: 1, caller: restore_process_queues_cpsch
[ 948.344499] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1000, queue: 0, caller: restore_process_queues_cpsch
[ 952.380614] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1202, queue: 2, caller: evict_process_queues_cpsch
[ 952.391330] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1002, queue: 1, caller: evict_process_queues_cpsch
[ 952.401634] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1000, queue: 0, caller: evict_process_queues_cpsch
[ 952.414507] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1202, queue: 2, caller: restore_process_queues_cpsch
[ 952.424618] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1002, queue: 1, caller: restore_process_queues_cpsch
[ 952.434922] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1000, queue: 0, caller: restore_process_queues_cpsch
[ 952.446272] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1202, queue: 2, caller: evict_process_queues_cpsch
[ 954.460341] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 954.460356] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes failed to remove hardware queue from MES, doorbell=0x1002, queue: 1, caller: evict_process_queues_cpsch
[ 954.460360] amdgpu 0000:c4:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[ 954.460366] amdgpu 0000:c4:00.0: amdgpu: Failed to evict queue 1
[ 954.460368] amdgpu 0000:c4:00.0: amdgpu: Failed to evict process queues
[ 954.460439] amdgpu 0000:c4:00.0: amdgpu: GPU reset begin!
[ 954.460464] amdgpu 0000:c4:00.0: amdgpu: remove_all_queues_mes: Failed to remove queue 0 for dev 5257
[ 954.460515] amdgpu 0000:c4:00.0: amdgpu: Dumping IP State
[ 954.462637] amdgpu 0000:c4:00.0: amdgpu: Dumping IP State Completed
[ 955.865591] amdgpu: process_termination_cpsch started
[ 955.866432] amdgpu: process_termination_cpsch started
[ 955.866445] amdgpu 0000:c4:00.0: amdgpu: Failed to remove queue 0
[ 956.503043] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 956.503059] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 958.507491] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 958.507507] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 960.512077] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 960.512093] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 960.785816] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx

Signed-off-by: Mika Laitio <lamikr@gmail.com>
The latest version of the fix is now in https://github.com/lamikr/linux.git on branch wip/612_1_gfx1010_gfx1103_v1. Be warned that only I have tested this, so I can not guarantee that it does not cause any unknown problems, for example memory corruption. I will still keep looking into this to try to understand whether I have somehow missed the root cause of the problem, as I just prevent the GPU from removing and restoring the queues in the pre-emption phase. It is based on kernel 6.12.1 and can be built with the commands:
reboot
Just FYI, I've had the kernel build up and running for a few days now with no issues (on my gfx1103).
Workaround for a queue evict/restore error in firmware that causes the workload to fail, eventually crashing AMD gfx1010/11/12 and M780 iGPUs when performing pytorch operations. I added trace and found out that the crash happens when kfd_device_queue_manager calls MES to evict and restore the queues. The crash usually requires that the evict/restore cycle is performed about 10-40 times. The behavior can be triggered with a simple pytorch test application that is called in a loop. I have tested that adding delays either to the test application between calls (1 second) or to the loop inside the kernel that removes the queues one by one does not help (tested with mdelay(10)). I have not been able to reproduce the crash with the 7900 XT (gfx1100), 7700S (gfx1102), M680 (gfx1035) or RX 6800 (gfx1030). The same crash can also be seen with older kernels like 6.8, 6.12 and 6.13, and I have seen a similar type of crash with an older 5-series kernel on gfx1010.

Original bug and test case from jrl290: lamikr/rocm_sdk_builder#141

Below is the trace captured by adding more printout messages to the problem location. In my testing with gfx1103, the crash has always happened at the same location, when removing the 2nd queue of 3 with doorbell=0x1002.

[ 948.324174] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1202, queue: 2, caller: restore_process_queues_cpsch
[ 948.334344] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1002, queue: 1, caller: restore_process_queues_cpsch
[ 948.344499] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1000, queue: 0, caller: restore_process_queues_cpsch
[ 952.380614] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1202, queue: 2, caller: evict_process_queues_cpsch
[ 952.391330] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1002, queue: 1, caller: evict_process_queues_cpsch
[ 952.401634] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1000, queue: 0, caller: evict_process_queues_cpsch
[ 952.414507] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1202, queue: 2, caller: restore_process_queues_cpsch
[ 952.424618] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1002, queue: 1, caller: restore_process_queues_cpsch
[ 952.434922] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1000, queue: 0, caller: restore_process_queues_cpsch
[ 952.446272] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1202, queue: 2, caller: evict_process_queues_cpsch
[ 954.460341] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 954.460356] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes failed to remove hardware queue from MES, doorbell=0x1002, queue: 1, caller: evict_process_queues_cpsch
[ 954.460360] amdgpu 0000:c4:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[ 954.460366] amdgpu 0000:c4:00.0: amdgpu: Failed to evict queue 1
[ 954.460368] amdgpu 0000:c4:00.0: amdgpu: Failed to evict process queues
[ 954.460439] amdgpu 0000:c4:00.0: amdgpu: GPU reset begin!
[ 954.460464] amdgpu 0000:c4:00.0: amdgpu: remove_all_queues_mes: Failed to remove queue 0 for dev 5257
[ 954.460515] amdgpu 0000:c4:00.0: amdgpu: Dumping IP State
[ 954.462637] amdgpu 0000:c4:00.0: amdgpu: Dumping IP State Completed
[ 955.865591] amdgpu: process_termination_cpsch started
[ 955.866432] amdgpu: process_termination_cpsch started
[ 955.866445] amdgpu 0000:c4:00.0: amdgpu: Failed to remove queue 0
[ 956.503043] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 956.503059] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 958.507491] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 958.507507] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 960.512077] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 960.512093] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 960.785816] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx

Signed-off-by: Mika Laitio <lamikr@gmail.com>
Hi guys, I came here through Google as I'm experiencing the same issues on gfx1103 (Ryzen 8700GE). I had a lot of crashes on kernel 6.8; on 6.11 it happens much less often, but it still happens.
@TheJKM Hi, I missed your request. Thanks for raising this. I have had my RX 5700 disconnected for a while, as I have been working with an MI50 on the same computer, so I have not had time to follow up on this for a while. I had a little time to trace the kernel code further a couple of weeks ago, and it looked like the calls to suspend and restore the processes came from an MMU-unit event that triggers the queue suspend/restore cycle, which ends up calling the firmware. If some memory is mapped between the kernel and the GPU hardware via MMIO, and the firmware writes back to that memory while its location is being moved during this MMU operation, I can understand how that could cause problems if the process is not suspended for that period. But if the results of these operations are just held in the GPU's memory and later read by the kernel, this should not cause a problem. It's hard to say, as I do not have access to the firmware implementation, and the amdgpu kernel driver itself is so huge that learning all of its details would require working on it full time for a while. In my own testing, however, I have not seen any problems when using my solution, either on the RX 5700 or on the gfx1103. I tried to reach an AMD contact who has previously contacted me about some other issues, but I did not receive a response. I will try to reach another contact next week. Maybe things could get rolling if you have a chance to reply to that thread on the kernel mailing list?
Sorry, I'm also replying late ^^
I'm still having this random GPU Hang on my 7840U (gfx1103) and not on my 6800U (forced to gfx1030):
HW Exception by GPU node-1 (Agent handle: 0x5ab48bbcc960) reason :GPU Hang
I've been racking my brain trying to figure out what's causing it: deleting sections of my code, trying to build a minimum crashing sample to provide. But sometimes it takes running many iterations of the processing I'm doing, and sometimes it crashes right up front. There's a lot of code to go through, so I'm still trying to narrow things down. But my guess is that the crash occurs as a result of the state of the GPU rather than the actual instruction, which makes things much trickier.
Maybe there's something much more obvious to you or an easier way to track down the issue
Some commands it has crashed on (a rough sketch of this kind of call pattern follows the list):
torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length, window=window, center=True,return_complex=False).to(device)
torch.zeros([*batch_dims, c, n - f, t]).to(device)
torch.istft(x, n_fft=self.n_fft, hop_length=self.hop_length, window=window, center=True)
torch.cuda.synchronize()
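This is not the original application, only a rough sketch of how the calls above might be exercised in a loop; n_fft, hop_length, the tensor shapes, and return_complex=True are placeholders and simplifications rather than the values used in the real code:

```python
import torch

device = "cuda"                 # ROCm PyTorch exposes the iGPU through the cuda API
n_fft, hop_length = 1024, 256   # placeholder values
window = torch.hann_window(n_fft, device=device)

for _ in range(100):
    x = torch.randn(2, n_fft * 32, device=device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop_length, window=window,
                      center=True, return_complex=True)
    y = torch.istft(spec, n_fft=n_fft, hop_length=hop_length, window=window, center=True)
    z = torch.zeros([2, 4, 100, 50], device=device)  # analogous to the torch.zeros call above
    torch.cuda.synchronize()                         # another call the hang has been observed on
```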
Here's the kernel log with a few of these crashes