SDXL sample app is broken on latest nightly #753

amd-chrissosa · 2025-01-04T01:45:56Z

Ran through the user guide as part of kicking off the release. The server app starts up normally but throws an error on any client request:

[2025-01-04 00:59:21.078] [error] [service.py:384] Fatal error in image generation
Traceback (most recent call last):
  File "/home/sosa/3.11.venv/lib/python3.11/site-packages/shortfin_apps/sd/components/service.py", line 373, in run
    await self._decode(device=device0, requests=self.exec_requests)
  File "/home/sosa/3.11.venv/lib/python3.11/site-packages/shortfin_apps/sd/components/service.py", line 657, in _decode
    (image,) = await fn(latents, fiber=self.fiber)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: ValueError: shortfin_iree-src/runtime/src/iree/hal/drivers/hip/event_semaphore.c:359: ABORTED; while calling import; while invoking native function hal.device.queue.dealloca;
[ 0] bytecode compiled_vae.decode$async:484 genfiles/sdxl/stable_diffusion_xl_base_1_0_vae_bs1_1024x1024_fp16.mlir:142:3
[2025-01-04 00:59:21.079] [info] [metrics.py:51] Completed inference process (batch size 1) in 1058ms
[2025-01-04 00:59:21] 127.0.0.1:39728 - "POST /generate HTTP/1.1" 200

I tried different device IDs to no avail, since I thought I might be contending with other processes on the same machine. The Llama sample app works fine on the same machine.

monorimet · 2025-01-06T16:57:48Z

First guess is an IREE runtime regression. We can probably avoid this for now by turning off async allocations, but ideally we fix or revert IREE before release, since turning them off will impact performance.

monorimet · 2025-01-06T17:15:48Z

Perhaps a separate issue -- I notice the SDXL test has been failing as shown:
https://github.com/nod-ai/shark-ai/actions/runs/12606573635/job/35136865226#step:7:265

free(): double free detected in tcache 2
Error sending the request: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

This furthers my suspicion that there has been an IREE runtime regression. We had some significant changes land in IREE main that may need a bit of attention to smooth out downstream wrinkles.
I am currently unable to reproduce the original issue here. @amd-chrissosa can you share the shortfin/IREE versions and server command used? I tested with various server topologies (notably with all canonical permutations of allocation options etc.) and did not encounter issues.

AWoloszyn · 2025-01-06T19:42:36Z

The double free should be fixed by iree-org/iree#19583

AWoloszyn · 2025-01-06T19:45:05Z

If you run with the AMD_LOG_LEVEL=1 environment variable set, does anything interesting show up?
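
For anyone reproducing this from a script, a minimal sketch of launching a process with that variable set is below. The helper name `run_with_amd_logging` and the stand-in command are hypothetical; the actual server command is not given in this thread, and the sketch assumes the runtime writes its log lines to stderr.

```python
import os
import subprocess
import sys


def run_with_amd_logging(cmd, level=1):
    """Run cmd with AMD_LOG_LEVEL set; return (returncode, captured stderr)."""
    # Copy the current environment and add the variable for the child only.
    env = dict(os.environ, AMD_LOG_LEVEL=str(level))
    result = subprocess.run(cmd, env=env, stderr=subprocess.PIPE, text=True)
    return result.returncode, result.stderr


# Stand-in command that just echoes the variable to stderr.
# Replace it with the real server command (not shown in this thread).
demo_cmd = [
    sys.executable,
    "-c",
    "import os, sys; sys.stderr.write(os.environ['AMD_LOG_LEVEL'])",
]
code, err = run_with_amd_logging(demo_cmd)
print(code, err)
```

Note that `subprocess.run` blocks until the child exits, so with a long-running server the captured output only becomes available once the process stops or crashes.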

monorimet · 2025-01-07T04:29:07Z

I do encounter a segmentation fault, but only when the workers are under load and async allocations are enabled.

If I switch on the caching allocator (or switch off async allocations), the segfault does not occur.

This is what is printed at the segfault with AMD_LOG_LEVEL=1:

:1:rocdevice.cpp            :2381: 387602872230 us: [pid:2107753 tid:0x7fa58c3e0640] Fail allocation local memory
:1:rocdevice.cpp            :2103: 387602872267 us: [pid:2107753 tid:0x7fa58c3e0640] Failed creating memory
:1:memory.cpp               :358 : 387602872270 us: [pid:2107753 tid:0x7fa58c3e0640] Video memory allocation failed!
:1:memory.cpp               :318 : 387602872273 us: [pid:2107753 tid:0x7fa58c3e0640] Can't allocate memory size - 0xA0200800 bytes!
:1:rocdevice.cpp            :2436: 387602872277 us: [pid:2107753 tid:0x7fa58c3e0640] failed to create a svm hidden buffer!
:1:memory.cpp               :1531: 387602872281 us: [pid:2107753 tid:0x7fa58c3e0640] Unable to allocate aligned memory
:1:hip_memory.cpp           :329 : 387602872291 us: [pid:2107753 tid:0x7fa58c3e0640] Allocation failed : Device memory : required :2686453760 | free :167772160 | total :206141652992
Segmentation fault (core dumped)
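
To make the numbers in the last log line easier to read, here is a small sketch that parses the `hip_memory.cpp` "Allocation failed" line and converts the byte counts to MiB. The regular expression is written against the exact line format shown above and is an assumption about that format in general; `parse_alloc_failure` is a hypothetical helper, not part of shortfin or ROCm.

```python
import re

# Pattern matching the "Allocation failed" line exactly as it appears in the log above.
ALLOC_FAILED = re.compile(
    r"Allocation failed : Device memory : "
    r"required :(?P<required>\d+) \| free :(?P<free>\d+) \| total :(?P<total>\d+)"
)

MIB = 1024 ** 2


def parse_alloc_failure(line):
    """Return required/free/total byte counts from a matching line, else None."""
    match = ALLOC_FAILED.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


line = (
    ":1:hip_memory.cpp           :329 : 387602872291 us: "
    "[pid:2107753 tid:0x7fa58c3e0640] Allocation failed : Device memory : "
    "required :2686453760 | free :167772160 | total :206141652992"
)
info = parse_alloc_failure(line)
for name, value in info.items():
    print(f"{name}: {value / MIB:.1f} MiB")

# The hex size reported by memory.cpp is the same request as "required".
print("matches 0xA0200800:", info["required"] == 0xA0200800)
```

Read this way, the failing request is about 2.5 GiB while only 160 MiB of roughly 192 GiB is reported free, which suggests device memory was nearly exhausted when the allocation was attempted.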

monorimet · 2025-01-07T04:41:26Z

Reopening as this should not have been closed.

amd-chrissosa assigned monorimet Jan 4, 2025

amd-chrissosa mentioned this issue Jan 4, 2025

Refresh user guide for 3.1 release #724

Open

ScottTodd mentioned this issue Jan 6, 2025

Revert "Bump IREE to iree-3.1.0rc20241220." #761

Closed

monorimet closed this as completed Jan 7, 2025

monorimet reopened this Jan 7, 2025

monorimet mentioned this issue Jan 7, 2025

(sdxl) Updates server configurations to run with caching allocator by default. #768

Open
