Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gfx1103 (7840U): HW Exception by GPU node-1 #141

Open
jrl290 opened this issue Aug 19, 2024 · 27 comments
Open

gfx1103 (7840U): HW Exception by GPU node-1 #141

jrl290 opened this issue Aug 19, 2024 · 27 comments

Comments

@jrl290
Copy link

jrl290 commented Aug 19, 2024

I'm still having this random GPU Hang on my 7840U (gfx1103) and not on my 6800U (forced to gfx1030):
HW Exception by GPU node-1 (Agent handle: 0x5ab48bbcc960) reason :GPU Hang

I've been racking my head to figure out what's causing it. Deleting sections of my code. Trying to build a minimum crashing sample to provide. But sometimes it takes running many iterations of the processing I'm doing and sometimes it crashes right up front. There's a lot of code to go through, so I'm still trying narrow things down. But my guess is that the crash occurs as a result of the state of the GPU rather than the actual instruction, which makes things much trickier.

Maybe there's something much more obvious to you or an easier way to track down the issue

Some commands it has crashed on:

  • torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length, window=window, center=True,return_complex=False).to(device)
  • torch.zeros([*batch_dims, c, n - f, t]).to(device)
  • torch.istft(x, n_fft=self.n_fft, hop_length=self.hop_length, window=window, center=True)
  • torch.cuda.synchronize()

Here's the kernel log with a few of these crashes

2024-08-18T01:19:27.141093+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:19:27.141108+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
2024-08-18T01:19:27.141109+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
2024-08-18T01:19:27.141110+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to evict queue 1
2024-08-18T01:19:27.141111+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to evict process queues
2024-08-18T01:19:27.141111+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
2024-08-18T01:19:27.141112+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: remove_all_queues_mes: Failed to remove queue 0 for dev 45725
2024-08-18T01:19:29.149118+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:19:29.149134+00:00 minipc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
2024-08-18T01:19:31.153110+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:19:31.153120+00:00 minipc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
2024-08-18T01:19:31.155110+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State
2024-08-18T01:19:31.155120+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State Completed
2024-08-18T01:19:31.155122+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MODE2 reset
2024-08-18T01:19:31.191082+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset succeeded, trying to resume
2024-08-18T01:19:31.191086+00:00 minipc kernel: [drm] PCIE GART of 512M enabled (table at 0x000000807FD00000).
2024-08-18T01:19:31.191087+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resuming...
2024-08-18T01:19:31.194062+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resumed successfully!
2024-08-18T01:19:31.196063+00:00 minipc kernel: [drm] DMUB hardware initialized: version=0x08003700
2024-08-18T01:19:31.202087+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
2024-08-18T01:19:31.202089+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
2024-08-18T01:19:31.202090+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
2024-08-18T01:19:31.202090+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
2024-08-18T01:19:31.202091+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
2024-08-18T01:19:31.202091+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
2024-08-18T01:19:31.202092+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
2024-08-18T01:19:31.202092+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
2024-08-18T01:19:31.202093+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
2024-08-18T01:19:31.202093+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
2024-08-18T01:19:31.202093+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
2024-08-18T01:19:31.202094+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
2024-08-18T01:19:31.202094+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
2024-08-18T01:19:31.203062+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow start
2024-08-18T01:19:31.203064+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow done
2024-08-18T01:19:31.203065+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset(3) succeeded!
2024-08-18T01:19:31.351136+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:19:31.351156+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1000
2024-08-18T01:19:31.351158+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
2024-08-18T01:19:31.351159+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to remove queue 0
2024-08-18T01:19:31.351160+00:00 minipc kernel: amdgpu: Resetting wave fronts (cpsch) on dev 000000000d034e53
2024-08-18T01:19:31.351160+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: no vmid pasid mapping supported
2024-08-18T01:19:31.352108+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
2024-08-18T01:19:31.358069+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State
2024-08-18T01:19:31.359069+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State Completed
2024-08-18T01:19:31.359072+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MODE2 reset
2024-08-18T01:19:31.395080+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset succeeded, trying to resume
2024-08-18T01:19:31.396064+00:00 minipc kernel: [drm] PCIE GART of 512M enabled (table at 0x000000807FD00000).
2024-08-18T01:19:31.396066+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resuming...
2024-08-18T01:19:31.397084+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resumed successfully!
2024-08-18T01:19:31.400201+00:00 minipc kernel: [drm] DMUB hardware initialized: version=0x08003700
2024-08-18T01:19:31.406237+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
2024-08-18T01:19:31.406248+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
2024-08-18T01:19:31.406250+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
2024-08-18T01:19:31.406251+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
2024-08-18T01:19:31.406253+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
2024-08-18T01:19:31.406254+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
2024-08-18T01:19:31.406255+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
2024-08-18T01:19:31.406256+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
2024-08-18T01:19:31.406257+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
2024-08-18T01:19:31.406257+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
2024-08-18T01:19:31.406258+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
2024-08-18T01:19:31.406259+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
2024-08-18T01:19:31.406260+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
2024-08-18T01:19:31.408175+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow start
2024-08-18T01:19:31.408185+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow done
2024-08-18T01:19:31.408186+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset(4) succeeded!
2024-08-18T01:20:57.766084+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:20:57.766102+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
2024-08-18T01:20:57.766103+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
2024-08-18T01:20:57.766104+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to evict queue 1
2024-08-18T01:20:57.766104+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to evict process queues
2024-08-18T01:20:57.766105+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
2024-08-18T01:20:57.766105+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: remove_all_queues_mes: Failed to remove queue 0 for dev 45725
2024-08-18T01:20:58.945078+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to remove queue 0
2024-08-18T01:20:59.773318+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:20:59.773338+00:00 minipc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
2024-08-18T01:21:01.778088+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:21:01.778107+00:00 minipc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
2024-08-18T01:21:01.780090+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State
2024-08-18T01:21:01.780097+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State Completed
2024-08-18T01:21:01.780098+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MODE2 reset
2024-08-18T01:21:01.815205+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset succeeded, trying to resume
2024-08-18T01:21:01.816084+00:00 minipc kernel: [drm] PCIE GART of 512M enabled (table at 0x000000807FD00000).
2024-08-18T01:21:01.816091+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resuming...
2024-08-18T01:21:01.818100+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resumed successfully!
2024-08-18T01:21:01.820479+00:00 minipc kernel: [drm] DMUB hardware initialized: version=0x08003700
2024-08-18T01:21:01.825115+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
2024-08-18T01:21:01.825125+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
2024-08-18T01:21:01.825127+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
2024-08-18T01:21:01.825129+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
2024-08-18T01:21:01.825130+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
2024-08-18T01:21:01.825131+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
2024-08-18T01:21:01.825132+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
2024-08-18T01:21:01.825133+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
2024-08-18T01:21:01.825134+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
2024-08-18T01:21:01.825135+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
2024-08-18T01:21:01.825135+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
2024-08-18T01:21:01.825136+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
2024-08-18T01:21:01.825152+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
2024-08-18T01:21:01.826104+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow start
2024-08-18T01:21:01.826114+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow done
2024-08-18T01:21:01.826116+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset(5) succeeded!
2024-08-18T01:21:36.676703+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:21:36.676722+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
2024-08-18T01:21:36.676724+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
2024-08-18T01:21:36.676725+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to evict queue 1
2024-08-18T01:21:36.676726+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to evict process queues
2024-08-18T01:21:36.676728+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
2024-08-18T01:21:36.676739+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: remove_all_queues_mes: Failed to remove queue 0 for dev 45725
2024-08-18T01:21:37.851129+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to remove queue 0
2024-08-18T01:21:38.685097+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:21:38.685112+00:00 minipc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
2024-08-18T01:21:40.689195+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:21:40.689207+00:00 minipc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
2024-08-18T01:21:40.691116+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State
2024-08-18T01:21:40.691128+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State Completed
2024-08-18T01:21:40.691129+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MODE2 reset
2024-08-18T01:21:40.715712+00:00 minipc kernel: workqueue: kfd_process_wq_release [amdgpu] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
2024-08-18T01:21:40.726112+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset succeeded, trying to resume
2024-08-18T01:21:40.726118+00:00 minipc kernel: [drm] PCIE GART of 512M enabled (table at 0x000000807FD00000).
2024-08-18T01:21:40.726119+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resuming...
2024-08-18T01:21:40.728102+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resumed successfully!
2024-08-18T01:21:40.730112+00:00 minipc kernel: [drm] DMUB hardware initialized: version=0x08003700
2024-08-18T01:21:40.735126+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
2024-08-18T01:21:40.735136+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
2024-08-18T01:21:40.735138+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
2024-08-18T01:21:40.735139+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
2024-08-18T01:21:40.735140+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
2024-08-18T01:21:40.735141+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
2024-08-18T01:21:40.735142+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
2024-08-18T01:21:40.735143+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
2024-08-18T01:21:40.735144+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
2024-08-18T01:21:40.735145+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
2024-08-18T01:21:40.735146+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
2024-08-18T01:21:40.735147+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
2024-08-18T01:21:40.735163+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
2024-08-18T01:21:40.737112+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow start
2024-08-18T01:21:40.737122+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow done
2024-08-18T01:21:40.737123+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset(6) succeeded!
2024-08-18T01:41:34.118180+00:00 minipc kernel: workqueue: kfd_process_wq_release [amdgpu] hogged CPU for >10000us 5 times, consider switching to WQ_UNBOUND
2024-08-18T01:49:08.684127+00:00 minipc kernel: workqueue: kfd_process_wq_release [amdgpu] hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
@jeroen-mostert
Copy link
Contributor

Some ideas.

There are a bunch of environment variables you can set to enable debugging, the most prominent of which is AMD_LOG_LEVEL. On the highest level this spews out a lot of logging if not masked with AMD_LOG_MASK, and a lot of it is super technical, so it's not generally that helpful. But it might track the problem down to the last thing that was being attempted. As that page describes, you can also use rocgdb to get a backtrace on a crash, but while powerful, gdb is a bit of an ogre if you're not already familiar with it.

I'm not familiar with how the setup on mobile works, but try to ensure that the GPU you're doing compute on is not also the one driving the primary display. This should work fine (generally the worst that can happen is running out of memory, not a crash), but it's still noise you're better off without. If some fancy desktop compositor is tickling the card in a way that's not appropriately combined with compute, this could cause the driver to choke. That would still generally be a driver bug as that should never trigger a reset, but such things do happen.

Lastly, although this doesn't really sound like a hardware problem given the random nature of the crash, you can still try installing corectrl (or building it from this SDK if your distro doesn't offer it) and see if limiting the GPU by power or clock speed improves things, since compute tends to hammer things much harder than 3D apps. Note that not all GPU models actually support this kind of tuning (the iGPU in desktop processors does not, for example) and you may need a kernel parameter to unlock the support in the first place. (Note that the wiki talks about overclocking, but it's really about unlocking voltage/speed parameters that also allow underclocking, which can help give you better performance by not driving the thing up to the thermal limit constantly. I do not recommend manually writing things to /sys as described in the wiki, just use a tool.)

@jrl290
Copy link
Author

jrl290 commented Aug 19, 2024

Thanks so much for the reply. I will definitely dive-in with the tools you mentioned tomorrow

As for primary display, it's being run in server mode. I'm ssh-ing in to run console commands

And while trying to make a minimum demonstrating code, I was able to crank the compute to the max. I even used the same model and fft functions that are at the heart of my application and couldn't get it to crash. But the rest of the code is much more complex. So while I'm coming at it from both directions, it's still taking me a long time to unravel and isolate pieces of code to test

@jeroen-mostert
Copy link
Contributor

It sounds like one or the other piece might be introducing some kind of memory corruption or resource exhaustion that then catches up with the other operations. Unfortunately such things are notoriously hard to debug, since the offending operation isn't necessarily the one that crashes. However, if you have cases where it crashes right up front, these should at least minimize the amount of logging/tracing you have to trawl through, it's just a matter of retrying.

@lamikr
Copy link
Owner

lamikr commented Aug 20, 2024

One thing to try out is to build the very latest kernel from the git. (6.11-rc4) as there are quite many fixes
for APU's on the latest kernel. I have also seen sometime "gpu hang" issues on gfx1103 that I do not see on the gfx1035 when running something more extensive.

If you have some code that you could share that will very likely trigger the problem, that would help the testing. I have a feeling that if the problem persist even with the latest kernel, the problem can be either on the kernel side of code or on the userspace code that communicates with the kernel for sending there code and receiving responses. I may have somewhere some old notes for tracing similar problems when I traced long time ago some similar type of problems with 2400g/vega apu.

@jrl290
Copy link
Author

jrl290 commented Aug 20, 2024

Installed the latest kernel. No luck

Turned on logging. This is the error level log. The GPU Hang doesn't appear in the error log when it happens. I'm still parsing through the "everything" log. But maybe something jumps out at you.

:1:hip_fatbin.cpp           :259 : 0469299906 us: [pid:1892  tid:0x7c50f5445b80] Cannot find CO in the bundle /opt/rocm_sdk_612/lib/libhipblaslt.so.0.7.60102 for ISA: amdgcn-amd-amdhsa--gfx1103
:1:hip_fatbin.cpp           :112 : 0469299920 us: [pid:1892  tid:0x7c50f5445b80] Missing CO for these ISAs -
:1:hip_fatbin.cpp           :115 : 0469299923 us: [pid:1892  tid:0x7c50f5445b80]      amdgcn-amd-amdhsa--gfx1103
:1:hip_fatbin.cpp           :259 : 0470625704 us: [pid:1892  tid:0x7c50f5445b80] Cannot find CO in the bundle /opt/rocm_sdk_612/lib/libhipblaslt.so.0.7.60102 for ISA: amdgcn-amd-amdhsa--gfx1103
:1:hip_fatbin.cpp           :112 : 0470625717 us: [pid:1892  tid:0x7c50f5445b80] Missing CO for these ISAs -
:1:hip_fatbin.cpp           :115 : 0470625720 us: [pid:1892  tid:0x7c50f5445b80]      amdgcn-amd-amdhsa--gfx1103
:1:hip_code_object.cpp      :624 : 0474035591 us: [pid:1892  tid:0x7c50f5445b80] Cannot find the function: Cijk_Ailk_Bljk_SB_MT32x32x8_SN_1LDSB0_AMAS0_BL1_BS1_EPS0_GLVWA1_GLVWB1_GRVW1_GSU1_GSUASB_ISA1103_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA0_LPB0_LRVW1_MIAV0_MMFGLC_NLCA1_NLCB1_PGR0_PLR1_SIA1_SS0_SU32_SUS256_SVW4_TT2_2_TLDS0_UMLDSA0_UMLDSB0_USFGROn1_VAW1_VSn1_VW1_VWB1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8
:1:hip_module.cpp           :84  : 0474035608 us: [pid:1892  tid:0x7c50f5445b80] Cannot find the function: Cijk_Ailk_Bljk_SB_MT32x32x8_SN_1LDSB0_AMAS0_BL1_BS1_EPS0_GLVWA1_GLVWB1_GRVW1_GSU1_GSUASB_ISA1103_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA0_LPB0_LRVW1_MIAV0_MMFGLC_NLCA1_NLCB1_PGR0_PLR1_SIA1_SS0_SU32_SUS256_SVW4_TT2_2_TLDS0_UMLDSA0_UMLDSB0_USFGROn1_VAW1_VSn1_VW1_VWB1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x71101ab0

I'm still having trouble isolating the problem even just to collect a log from a single command that hangs (otherwise its megabytes of text). But I'm still working on it. I'll send some code when I finally get it down to a reasonable enough length to be readable

@lamikr
Copy link
Owner

lamikr commented Aug 21, 2024

Well it's good to know that the fix is not there in new kernel.
Just to verify other thing. Once the gpu-reset happen, the system still is able to reset the gpu without you needing to do a full reboot?

@jrl290
Copy link
Author

jrl290 commented Aug 21, 2024

I've had kernels and other ubuntu versions fully lock up. On the versions I'm using now, the GPU is able to recover. Though of course the full python process is killed

@jrl290
Copy link
Author

jrl290 commented Aug 21, 2024

Ok, here are two Level 4 logs. One in which the crash occurs almost immediately, and another which gets past the crash point (without the stuff passed the crash point). I'm looking through them now
noncrash.txt
crash.txt

Let me know if you think it would be useful for me to go through and match the two logs line by line

@jrl290
Copy link
Author

jrl290 commented Aug 21, 2024

I ended up going through matching the log anyway. Here's the Google Sheet with the comparison:
https://docs.google.com/spreadsheets/d/1ZbOBMm2xRa-i0djBYTJwff2Eoee0qow9lBKyyTRBXTM/edit?usp=sharing

The two match up pretty substantially. Most discrepancies are Host wait on completion_signal= and Host active wait for Signal = . And then there are a couple of sections towards the end where the order of calls is shuffled a bit

@jrl290
Copy link
Author

jrl290 commented Sep 11, 2024

EDIT: I eliminated more code

Ok, sorry I got a bit side tracked on this. Here is minimum code to cause the crash (files):

import torch
import numpy as np
from onnx import load
from onnx2pytorch import ConvertModel
import os

os.environ["AMD_LOG_LEVEL"] = "4"

if __name__ == "__main__":
    model_path = "model.onnx"
    device = 'cuda'
    model_run = ConvertModel(load(model_path))
    model_run.to(device).eval()

    #It does not seem to want to crash if this line is commented out
    random = np.random.rand(1, 4, 3072, 256)

    while True:
        print("Loop Start")
        tensor = torch.randn(1, 4, 3072, 256, dtype=torch.float32, device=device)

        print("The crash happens here:")
        result = model_run(tensor)
    

Hopefully this makes it easy to diagnose the issue

@jrl290
Copy link
Author

jrl290 commented Sep 12, 2024

First off, no matter how long I run it, if that numpy.random line isn't in there, the script doesn't crash. What could that possibly mean?

Also it looks like there are two separate crashes. One comes on malloc:

Success
hipMalloc ( 0x7ffc8851c648, 18874368 )
hipMalloc ( 0x7ffe3fa35778, 75497472 )
hipMalloc ( 0x7ffe3fa36cb8, 150994944 )

Crash
hipMalloc ( 0x7fff9307f708, 12582912 )
hipMalloc ( 0x7ffc06bbd5f8, 12582912 )
hipMalloc ( 0x7ffddf593878, 75497472 )

The other one seems to come on some sort of synchronization/lock/barrier:

Success

:4:rocvirtual.cpp           :1071: 9886723816 us: [pid:4059  tid:0x73adad3b1b80] HWq=0x73ac70f00000, BarrierAND Header = 0x1503 (type=3, barrier=1, acquire=2, release=2), dep_signal=[0x0, 0x0, 0x0, 0x0, 0x0], completion_signal=0x73ac73bff900
:3:rocvirtual.hpp           :66  : 9886723837 us: [pid:4059  tid:0x73adad3b1b80] Host active wait for Signal = (0x73ac73bffa80) for -1 ns
:4:rocvirtual.cpp           :898 : 9887923934 us: [pid:4059  tid:0x73adad3b1b80] HWq=0x73ac70f00000, Dispatch Header = 0xb02 (type=2, barrier=1, acquire=1, release=1), setup=1, grid=[3072, 1, 1], workgroup=[512, 1, 1], private_seg_size=0, group_seg_size=0, kernel_obj=0x73ad39ca5980, kernarg_address=0x73ac70500000, completion_signal=0x0

Crash:

:4:command.cpp              :346 : 1946568251 us: [pid:2421  tid:0x7b54a1492b80] Command (CopyDeviceToHost) enqueued: 0x57c567691200
:4:rocmemory.cpp            :988 : 1946568973 us: [pid:2421  tid:0x7b54a1492b80] Locking to pool 0x57c5673e0860, size 0xc01000, HostPtr = 0x57c5702b8000, DevPtr = 0x57c5702b8000
:4:rocvirtual.cpp           :1071: 1946568987 us: [pid:2421  tid:0x7b54a1492b80] HWq=0x7b5364f00000, BarrierAND Header = 0x1503 (type=3, barrier=1, acquire=2, release=2), dep_signal=[0x0, 0x0, 0x0, 0x0, 0x0], completion_signal=0x7b5367bfea00
:3:rocvirtual.hpp           :66  : 1946568998 us: [pid:2421  tid:0x7b54a1492b80] Host active wait for Signal = (0x7b5367bfea00) for 10000 ns
:4:rocblit.cpp              :750 : 1946569026 us: [pid:2421  tid:0x7b54a1492b80] HSA Async Copy on copy_engine=0x1, dst=0x57c5702b8080, src=0x7b5259e00000, size=12582912, forceSDMA=0, wait_event=0x7b5367bfea00, completion_signal=0x7b5367bfe980
:4:rocvirtual.cpp           :570 : 1946569040 us: [pid:2421  tid:0x7b54a1492b80] Host wait on completion_signal=0x7b5367bfe980
:3:rocvirtual.hpp           :66  : 1946569055 us: [pid:2421  tid:0x7b54a1492b80] Host active wait for Signal = (0x7b5367bfe980) for -1 ns 

@jrl290
Copy link
Author

jrl290 commented Sep 13, 2024

Just worth mentioning. It seems there are major AMDGPU changes happening in linux kernel updates recently. So probably best to wait before trying any more diagnosing of such issues:
https://www.phoronix.com/news/AMDGPU-Linux-6.12-More-PQ-Reset
https://www.phoronix.com/news/Linux-6.11-rc7-AMDGPU-Fix

@lamikr
Copy link
Owner

lamikr commented Oct 4, 2024

Thanks, I agree. I have not really had much time to test this directly except just by building 6.11-rc4-rc6 and final kernel.

In-directly I did some work on this by adding omnitrace to builds in hope it could be useful. At the moment I have done some basic tracing test with it on some test apps and being able to generate trace files that works on perfetto ui. (Our omnitrace uses the latest version of perfetto and that resolved the trace viewing problems that the upstream rocm sdk release has with the perfetto ui)

But it could take some time to figure out how to use omnitrace in a way that it can catch this bug. That tool really takes some time to learn to configure and use properly.

@jrl290
Copy link
Author

jrl290 commented Oct 5, 2024

So here's another clue. When I run pytorch with rocgdb, I get warnings and a bit of slowdown, but the GPU is definitely being used and no crashes:
rocgdb --batch-silent -ex=r --arg python3 script.py

Errors:

warning: os_agent_id 43653: `Phoenix1' architecture not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x55556dcea250&size=37696': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'file:///opt/rocm_sdk_612/lib/python3.11/site-packages/torch/lib/libtorch_hip.so#offset=406429696&size=242208': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'file:///opt/rocm_sdk_612/lib/python3.11/site-packages/torch/lib/libtorch_hip.so#offset=323665920&size=1040544': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffd08001c70&size=5600': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffcf40022c0&size=5984': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffcfc002040&size=35192': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffd2c002040&size=29808': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x55556db7b8f0&size=8392': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x55556db7b8b0&size=7096': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffd0800f150&size=5984': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffcfc00dfc0&size=5600': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffd2c013100&size=31088': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffcf00bd2d0&size=36608': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'file:///opt/rocm_sdk_612/lib/python3.11/site-packages/torch/lib/libtorch_hip.so#offset=332607488&size=1339248': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x555562b77420&size=240080': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x555567d8f2a0&size=12400': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'file:///opt/rocm_sdk_612/lib/python3.11/site-packages/torch/lib/libtorch_hip.so#offset=294068224&size=1592040': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'file:///opt/rocm_sdk_612/lib/python3.11/site-packages/torch/lib/libtorch_hip.so#offset=430112768&size=2043744': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x555562d84b60&size=32616': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x555562d86380&size=5728': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x555562d9de70&size=41320': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x555562dc2240&size=41576': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x555562dde8c0&size=33128': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x555562dd6740&size=40552': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'file:///opt/rocm_sdk_612/lib/python3.11/site-packages/torch/lib/libtorch_hip.so#offset=345296896&size=462128': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'file:///opt/rocm_sdk_612/lib/python3.11/site-packages/torch/lib/libtorch_hip.so#offset=407945216&size=454520': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffcf0137660&size=5592': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffd001c6210&size=5984': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffd2c015820&size=36216': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffcf400f830&size=29808': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffcf400f830&size=5592': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffcfd50a120&size=29808': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffd08018b30&size=36216': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'memory://1619652#offset=0x7ffd2c023760&size=5984': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'file:///opt/rocm_sdk_612/lib/python3.11/site-packages/torch/lib/libtorch_hip.so#offset=392441856&size=381808': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'file:///opt/rocm_sdk_612/lib/python3.11/site-packages/torch/lib/libtorch_hip.so#offset=452747264&size=175536': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'file:///opt/rocm_sdk_612/lib/python3.11/site-packages/torch/lib/libtorch_hip.so#offset=386334720&size=5420824': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'file:///opt/rocm_sdk_612/lib/python3.11/site-packages/torch/lib/libtorch_hip.so#offset=301494272&size=371424': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'file:///opt/rocm_sdk_612/lib/python3.11/site-packages/torch/lib/libtorch_hip.so#offset=398106624&size=1462128': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'file:///opt/rocm_sdk_612/lib/python3.11/site-packages/torch/lib/libtorch_hip.so#offset=331177984&size=976992': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'file:///opt/rocm_sdk_612/lib/python3.11/site-packages/torch/lib/libtorch_hip.so#offset=329052160&size=2121312': AMDGCN architecture 0x44 is not supported.
Error while mapping shared library sections:
'file:///opt/rocm_sdk_612/lib/python3.11/site-packages/torch/lib/libtorch_hip.so#offset=314408960&size=1131080': AMDGCN architecture 0x44 is not supported.

@lamikr
Copy link
Owner

lamikr commented Nov 18, 2024

@jrl290 Thanks for the great test cases and traces, I think I have now a fix for this, your test case has now been running on loop multiple hundred rounds without crashing while earlier I got it stuck usually withing first 30-40 rounds.

Unfortunately my fix requires patching a kernel and I still need to investigate little bit more that it does not have side effects or if I could do it in some other way. It's been some years when I have before this weekend looked for the amdkfd code, so I need to study this little bit more for testing and before pushing the fix out.

In received also an older gfx1010 card which is suffering from a little similar type of problem, so hopefully I can get also that one fixed. (Have not had yet tested the fix on that gpu)

@jrl290
Copy link
Author

jrl290 commented Nov 18, 2024

Wow very cool! I actually ended up offloading the major AI processing to one of the new M4 Mac Minis. It is a good 2-3 times faster. The other machine is still a part of the process; just doing more CPU stuff while the M4 is dedicated to the AI stuff

I am very curious to know what you found the problem to be. And I'll be happy to test when it's ready

@lamikr
Copy link
Owner

lamikr commented Nov 21, 2024

@jrl290 Attached is the new version of your test case, it's basically same just small helper changes without modifying your original logic.

  1. #export HIP_VISIBLE_DEVICES="1" line to gpu_crash.sh to show how to select the gpu in case you have multiple in your system. (I have in my framework 16 laptop with discretee 7700S(gfx1102) and 780M(gfx1103) iGPUs,
  2. Added small 0.5second delay to python code to test whether it could avoid crash. It did not have any effect.
    I tested in python code whether small delay would have solved the problem, and it did not.
  3. I added to python code the loop-index printout to see how many rounds it requires to get the crash.
    In my tests, the crash always happenede within 10-60 loops. After applying the fix, I have run many times over 1000 loops without seeing the problem.

gfx1103_crash.zip

@lamikr
Copy link
Owner

lamikr commented Nov 21, 2024

@jrl290 Here is the link to kernel fix. It took a while as I tried couple of different way to fix it but this was basically the only one I figured out to work.

https://github.com/lamikr/linux/tree/release/rocm_612_gfx1102_fix

I use this script&kernel config on my own testing
kernel_build_script.zip

I submitted the patch also to kernel mailing list and put your id there for credits for good test case.
If you have time to test and ack it, that would be great.

https://lists.freedesktop.org/archives/amd-gfx/2024-November/117242.html

@jrl290
Copy link
Author

jrl290 commented Nov 21, 2024

That is very cool! I've never had any part in contributing to such a project before

My linux kung-fu is not that strong, so it'll take me a while to figure out building and patching the kernel (v6.12 doesn't have an amd64 build available for some reason). I will report back when I have figured it out

@lamikr
Copy link
Owner

lamikr commented Nov 21, 2024

These should be easy steps:

  1. git clone https://github.com/lamikr/linux.git
  2. cd linux
  3. git checkout release/rocm_612_gfx1102_fix
  4. copy the kernel_build.sh and kernel_612_config files from kernel_build_script.zip file abowe to linux directory
  5. execute command: ./kernel_build.sh

That should handle everything from building to installing. The script will create the ../b_6_12_0 directory for storing build files. If the build is succesfull, it will ask the sudo password before installing the kernel modules under /lib/modules directory and the kernel itself to /boot directory.

Then just reboot and select the 6.12+ kernel from the list of kernels to boot.

@jrl290
Copy link
Author

jrl290 commented Nov 22, 2024

Ran all of my use cases a few times and it is looking good! Way to go!

I'll let you know if anything weird pops up

Cheers!

@lamikr
Copy link
Owner

lamikr commented Nov 24, 2024

Thank for confirming that things works.

lamikr added a commit to lamikr/linux that referenced this issue Nov 27, 2024
amd gfx1103/M780 iGPU crashes eventually when performing
the pytorch operations. I added trace and found out that
the crash will happen kfd_device_que_manager calls MES
to evict and restore the queues.

Crash requires usually that the evict/restore cycle is
performed about 10-40 times and behavior can be triggered
with simple pytorch test application that is called on loop.
I have tested that adding delays to either to test application
between calls (1 second) or to loop inside kernel to remove the
queues one by one does not help. (tested with mdelay(10))

Same crash has not been detected on with other gpus tested.
(7900 XT(gfx1100) , 7700S gfx1102),  M680(gfx1035),
RX6800(gfx1030) or RX 5700 (gfx1010)

I tested the crash with added trace and fix with the 6.12 kernel
but the same crash behaviour can be seen also with older kernels
like 6.0.8. This can be tested with the rocm stack by building the support
for gfx1103 with rocm sdk builder.

Original bug and test case from jrl290:
lamikr/rocm_sdk_builder#141

Below is the trace I captured by adding more trace to problem
location. On my about 20 testing, the crash has always happened
on same location when removing the 2nd queue from 3 with doorbell
doorbell=0x1002.

[  948.324174] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1202, queue: 2, caller: restore_process_queues_cpsch
[  948.334344] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1002, queue: 1, caller: restore_process_queues_cpsch
[  948.344499] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1000, queue: 0, caller: restore_process_queues_cpsch
[  952.380614] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1202, queue: 2, caller: evict_process_queues_cpsch
[  952.391330] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1002, queue: 1, caller: evict_process_queues_cpsch
[  952.401634] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1000, queue: 0, caller: evict_process_queues_cpsch
[  952.414507] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1202, queue: 2, caller: restore_process_queues_cpsch
[  952.424618] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1002, queue: 1, caller: restore_process_queues_cpsch
[  952.434922] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1000, queue: 0, caller: restore_process_queues_cpsch
[  952.446272] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1202, queue: 2, caller: evict_process_queues_cpsch
[  954.460341] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  954.460356] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes failed to remove hardware queue from MES, doorbell=0x1002, queue: 1, caller: evict_process_queues_cpsch
[  954.460360] amdgpu 0000:c4:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[  954.460366] amdgpu 0000:c4:00.0: amdgpu: Failed to evict queue 1
[  954.460368] amdgpu 0000:c4:00.0: amdgpu: Failed to evict process queues
[  954.460439] amdgpu 0000:c4:00.0: amdgpu: GPU reset begin!
[  954.460464] amdgpu 0000:c4:00.0: amdgpu: remove_all_queues_mes: Failed to remove queue 0 for dev 5257
[  954.460515] amdgpu 0000:c4:00.0: amdgpu: Dumping IP State
[  954.462637] amdgpu 0000:c4:00.0: amdgpu: Dumping IP State Completed
[  955.865591] amdgpu: process_termination_cpsch started
[  955.866432] amdgpu: process_termination_cpsch started
[  955.866445] amdgpu 0000:c4:00.0: amdgpu: Failed to remove queue 0
[  956.503043] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  956.503059] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  958.507491] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  958.507507] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  960.512077] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  960.512093] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  960.785816] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx

Signed-off-by: Mika Laitio <lamikr@gmail.com>
lamikr added a commit to lamikr/linux that referenced this issue Nov 30, 2024
amd gfx1103/M780 iGPU crashes eventually when performing
the pytorch operations. I added trace and found out that
the crash will happen kfd_device_que_manager calls MES
to evict and restore the queues.

Crash requires usually that the evict/restore cycle is
performed about 10-40 times and behavior can be triggered
with simple pytorch test application that is called on loop.
I have tested that adding delays to either to test application
between calls (1 second) or to loop inside kernel to remove the
queues one by one does not help. (tested with mdelay(10))

Same crash has not been detected on with other gpus tested.
(7900 XT(gfx1100) , 7700S gfx1102),  M680(gfx1035),
RX6800(gfx1030) or RX 5700 (gfx1010)

I tested the crash with added trace and fix with the 6.12 kernel
but the same crash behaviour can be seen also with older kernels
like 6.0.8. This can be tested with the rocm stack by building the support
for gfx1103 with rocm sdk builder.

Original bug and test case from jrl290:
lamikr/rocm_sdk_builder#141

Below is the trace I captured by adding more trace to problem
location. On my about 20 testing, the crash has always happened
on same location when removing the 2nd queue from 3 with doorbell
doorbell=0x1002.

[  948.324174] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1202, queue: 2, caller: restore_process_queues_cpsch
[  948.334344] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1002, queue: 1, caller: restore_process_queues_cpsch
[  948.344499] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1000, queue: 0, caller: restore_process_queues_cpsch
[  952.380614] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1202, queue: 2, caller: evict_process_queues_cpsch
[  952.391330] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1002, queue: 1, caller: evict_process_queues_cpsch
[  952.401634] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1000, queue: 0, caller: evict_process_queues_cpsch
[  952.414507] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1202, queue: 2, caller: restore_process_queues_cpsch
[  952.424618] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1002, queue: 1, caller: restore_process_queues_cpsch
[  952.434922] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1000, queue: 0, caller: restore_process_queues_cpsch
[  952.446272] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1202, queue: 2, caller: evict_process_queues_cpsch
[  954.460341] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  954.460356] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes failed to remove hardware queue from MES, doorbell=0x1002, queue: 1, caller: evict_process_queues_cpsch
[  954.460360] amdgpu 0000:c4:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[  954.460366] amdgpu 0000:c4:00.0: amdgpu: Failed to evict queue 1
[  954.460368] amdgpu 0000:c4:00.0: amdgpu: Failed to evict process queues
[  954.460439] amdgpu 0000:c4:00.0: amdgpu: GPU reset begin!
[  954.460464] amdgpu 0000:c4:00.0: amdgpu: remove_all_queues_mes: Failed to remove queue 0 for dev 5257
[  954.460515] amdgpu 0000:c4:00.0: amdgpu: Dumping IP State
[  954.462637] amdgpu 0000:c4:00.0: amdgpu: Dumping IP State Completed
[  955.865591] amdgpu: process_termination_cpsch started
[  955.866432] amdgpu: process_termination_cpsch started
[  955.866445] amdgpu 0000:c4:00.0: amdgpu: Failed to remove queue 0
[  956.503043] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  956.503059] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  958.507491] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  958.507507] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  960.512077] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  960.512093] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  960.785816] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx

Signed-off-by: Mika Laitio <lamikr@gmail.com>
@lamikr
Copy link
Owner

lamikr commented Nov 30, 2024

latest version of fix is now in https://github.com/lamikr/linux.git at branch wip/612_1_gfx1010_gfx1103_v1
It includes similar type of fix both for the gfx1103 and gfx1010

Be warned that only I have tested this, so I can not quarentee that it does not cause any unknown problems for example with memory corruption. I will still keep looking on this one for trying to understand did I have somehow missed the root cause of the problem as I just prevent the gpu to remove and restore queues on pre-emption phase

It is based on to kernel 6.12.1 and can be build with commands:

git clone https://github.com/lamikr/linux.git
cd linux
./kernel_build.sh

reboot

@jrl290
Copy link
Author

jrl290 commented Dec 6, 2024

Just fyi, I've had kernel build up and running for a few days now with no issues (on my gfx1103)

lamikr added a commit to lamikr/linux that referenced this issue Jan 3, 2025
Workaround for queue evict/restore error in firmwares
causing the evict/restore to cause the workload to fail
causing eventually AMD gfx1010/11/12 and M780 iGPU crashes.
when performing pytorch operations.

I added trace and found out that
the crash will happen kfd_device_que_manager calls MES
to evict and restore the queues.

Crash requires usually that the evict/restore cycle is
performed about 10-40 times. Behavior can be triggered
with simple pytorch test application that is called on loop.
I have tested that adding delays to either to test application
between calls (1 second) or to loop inside kernel to remove the
queues one by one does not help. (tested with mdelay(10))

I have not been able to reproduce the crash with
7900 XT(gfx1100), 7700S(gfx1102),  M680(gfx1035) or with
RX 6800( gfx1030).

Same crash can be seen also with older kernels like 6.8, 6.12 and
6.13. I have seen similar type of crash also with older 5-series of
kernel with gfx1010.

Original bug and test case from jrl290:
lamikr/rocm_sdk_builder#141

Below is the trace captured by adding more printout messages to problem
location. On my testings with gfx1103, the crash has always happened
on same location when removing the 2nd queue from 3 with doorbell
doorbell=0x1002.

[  948.324174] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1202, queue: 2, caller: restore_process_queues_cpsch
[  948.334344] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1002, queue: 1, caller: restore_process_queues_cpsch
[  948.344499] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1000, queue: 0, caller: restore_process_queues_cpsch
[  952.380614] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1202, queue: 2, caller: evict_process_queues_cpsch
[  952.391330] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1002, queue: 1, caller: evict_process_queues_cpsch
[  952.401634] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1000, queue: 0, caller: evict_process_queues_cpsch
[  952.414507] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1202, queue: 2, caller: restore_process_queues_cpsch
[  952.424618] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1002, queue: 1, caller: restore_process_queues_cpsch
[  952.434922] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1000, queue: 0, caller: restore_process_queues_cpsch
[  952.446272] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1202, queue: 2, caller: evict_process_queues_cpsch
[  954.460341] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  954.460356] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes failed to remove hardware queue from MES, doorbell=0x1002, queue: 1, caller: evict_process_queues_cpsch
[  954.460360] amdgpu 0000:c4:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[  954.460366] amdgpu 0000:c4:00.0: amdgpu: Failed to evict queue 1
[  954.460368] amdgpu 0000:c4:00.0: amdgpu: Failed to evict process queues
[  954.460439] amdgpu 0000:c4:00.0: amdgpu: GPU reset begin!
[  954.460464] amdgpu 0000:c4:00.0: amdgpu: remove_all_queues_mes: Failed to remove queue 0 for dev 5257
[  954.460515] amdgpu 0000:c4:00.0: amdgpu: Dumping IP State
[  954.462637] amdgpu 0000:c4:00.0: amdgpu: Dumping IP State Completed
[  955.865591] amdgpu: process_termination_cpsch started
[  955.866432] amdgpu: process_termination_cpsch started
[  955.866445] amdgpu 0000:c4:00.0: amdgpu: Failed to remove queue 0
[  956.503043] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  956.503059] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  958.507491] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  958.507507] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  960.512077] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  960.512093] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  960.785816] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx

Signed-off-by: Mika Laitio <lamikr@gmail.com>
@TheJKM
Copy link

TheJKM commented Jan 3, 2025

Hi guys, came here through google as I'm experiencing the same issues on gfx1103 (Ryzen 8700GE). Had a lot of crashes on Kernel 6.8, now on 6.11 much less but still happens.
I've read the kernel mailing list discussion from end of november and although I don't know much about the inner workings of the GPU, it also looks to me that the root cause is a firmware bug.
Has somebody tried to escalate this to AMD? So they maybe could fix it in firmware? I saw the responses in kernel mailing list came form guys with an @amd.com address, but other than telling it might be a firmware issue, it didn't look like they escalated this internally.

@lamikr
Copy link
Owner

lamikr commented Jan 13, 2025

@TheJKM Hi, I missed your request. Thanks for raising this up, I have had my rx5700 disconnected for a while as I have worked with MI50 by using the same computer, so have not had time to follow up this for a while.

I had little time to trace more the kernel code a couple of weeks ago and it looked like that the calls to suspend and restore the processes came from the MMU unit event that triggers the queue suspend/restore cycle that ends up calling the firmware.

If there is some memory mapped between kernel and GPU HW by using MMIO and then firmware writes back to that memory and it's location has been moved during this MMU operation, I understand that this could cause problems if the process is not suspended for that period.

But if the result for these operations performed by the GPU are just hold on GPU's memory and then later read by the kernel, this should not cause problem. But it's hard to say as I do not have access to firmware implementation and amdgpu kernel driver itself is so huge that learning in and out of all details would require a possibility to work on it full days for a while. In my own testing I have however seen any problem when using my solution either for gfx5700 or for gfx1103.

I tried to reach one AMD contact that I have previously contacted me for some other issues but I did not receive any response. I will try to reach another contact I have on next week. Maybe things could roll if you have change to reply to that thread on kernel mailing list?

@TheJKM
Copy link

TheJKM commented Jan 18, 2025

Sorry, I'm also replying late ^^
Thanks for your effort! Honestly I've never worked with LKML (found the thread through google but currently I'm even unable to find it again) and I'm not that much into driver and firmware stuff so I cannot contribute there more than a "crashes for me too" where LKML is maybe not the right place. So I'm afraid I can't help on that level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants