
Add vulkan backend #291

Merged: 12 commits, Aug 27, 2024
Conversation

sohzm (Contributor) commented Jun 18, 2024

Issue: #256

Looks like they're making some changes to Vulkan shader generation in the ggml repo, and it's currently broken. I'll keep an eye on it and update the PR accordingly.

sohzm (Contributor, Author) commented Jun 18, 2024

Related issue: ggerganov/llama.cpp#5356

(I'm new to this, so I might have made some mistakes. I'd be grateful for any guidance or feedback.)

0cc4m commented Jun 23, 2024

Hey, nice to see someone working on this. I'd like to get this to work. There are probably some ops that still need to be supported by Vulkan upstream, right? I can help with that.

sohzm (Contributor, Author) commented Jun 28, 2024

@0cc4m Thanks for offering to help.

Currently the .hpp file generated by ggml_vk_generate_shaders.py does not define symbols like mul_mat_vec_id_q3_k_f32_len, div_f32_len, etc.

Some symbols were also renamed, e.g. ggml/src/ggml-vulkan.cpp references dequant_q5_k_len, but the header file declares dequant_q5_K_len.

I'm assuming these issues will be resolved by your work in llama.cpp? Please correct me if I'm wrong.

Also, let me know if I can help with anything.

0cc4m commented Jun 28, 2024

> @0cc4m Thanks for offering to help.
>
> Currently the .hpp file generated by ggml_vk_generate_shaders.py does not define symbols like mul_mat_vec_id_q3_k_f32_len, div_f32_len, etc.
>
> Some symbols were also renamed, e.g. ggml/src/ggml-vulkan.cpp references dequant_q5_k_len, but the header file declares dequant_q5_K_len.
>
> I'm assuming these issues will be resolved by your work in llama.cpp? Please correct me if I'm wrong.
>
> Also, let me know if I can help with anything.

It is working in llama.cpp. I'll take a look at the status in ggml; maybe that needs an update.

Cloudwalk9 (Contributor)

I manually wired up Vulkan and compiled SD.cpp against the latest ggml, patched with llama.cpp's Vulkan changes. It runs and loads a model, but their Vulkan shaders do not implement CONCAT, so it fails:

./sd -m ~/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors --prompt "score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash" -W 1024 -H 1024 -v
Option: 
    n_threads:         8
    mode:              txt2img
    model_path:        /home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors
    wtype:             unspecified
    vae_path:          
    taesd_path:        
    esrgan_path:       
    controlnet_path:   
    embeddings_path:   
    stacked_id_embeddings_path:   
    input_id_images_path:   
    style ratio:       20.00
    normzalize input image :  false
    output_path:       output.png
    init_img:          
    control_image:     
    clip on cpu:       false
    controlnet cpu:    false
    vae decoder on cpu:false
    strength(control): 0.90
    prompt:            score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash
    negative_prompt:   
    min_cfg:           1.00
    cfg_scale:         7.00
    clip_skip:         -1
    width:             1024
    height:            1024
    sample_method:     euler_a
    schedule:          default
    sample_steps:      20
    strength(img2img): 0.75
    rng:               cuda
    seed:              42
    batch_count:       1
    vae_tiling:        false
    upscale_repeats:   1
System Info: 
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 1
    AVX512_VBMI = 1
    AVX512_VNNI = 1
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:158  - Using Vulkan backend
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: NVIDIA RTX A4000 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32
[INFO ] stable-diffusion.cpp:178  - loading model from '/home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors'
[INFO ] model.cpp:737  - load /home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors using safetensors format
[DEBUG] model.cpp:803  - init from '/home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors'
[INFO ] stable-diffusion.cpp:201  - Stable Diffusion XL 
[INFO ] stable-diffusion.cpp:207  - Stable Diffusion weight type: f16
[DEBUG] stable-diffusion.cpp:208  - ggml tensor size = 400 bytes
[WARN ] stable-diffusion.cpp:213  - !!!It looks like you are using SDXL model. If you find that the generated images are completely black, try specifying SDXL VAE FP16 Fix with the --vae parameter. You can find it here: https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/sdxl_vae.safetensors
[DEBUG] ggml_extend.hpp:884  - clip params backend buffer size =  1564.36 MB(VRAM) (713 tensors)
[DEBUG] ggml_extend.hpp:884  - unet params backend buffer size =  4900.07 MB(VRAM) (1680 tensors)
[DEBUG] ggml_extend.hpp:884  - vae params backend buffer size =  94.47 MB(VRAM) (140 tensors)
[DEBUG] stable-diffusion.cpp:309  - loading vocab
[DEBUG] clip.hpp:164  - vocab size: 49408
[DEBUG] clip.hpp:175  -  trigger word img already in vocab
[DEBUG] stable-diffusion.cpp:329  - loading weights
[DEBUG] model.cpp:1380 - loading tensors from /home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors
[INFO ] stable-diffusion.cpp:413  - total params memory size = 6558.89MB (VRAM 6558.89MB, RAM 0.00MB): clip 1564.36MB(VRAM), unet 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:432  - loading model from '/home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors' completed, taking 4.34s
[INFO ] stable-diffusion.cpp:449  - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:482  - finished loaded file
[DEBUG] stable-diffusion.cpp:1452 - txt2img 1024x1024
[DEBUG] stable-diffusion.cpp:1207 - prompt after extract and remove lora: "score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash"
[INFO ] stable-diffusion.cpp:565  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1212 - apply_loras completed, taking 0.00s
[DEBUG] clip.hpp:1312 - parse 'score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash' to [['score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash', 1], ]
[DEBUG] clip.hpp:1152 - token length: 77
[DEBUG] ggml_extend.hpp:838  - clip compute buffer size: 2.56 MB(VRAM)
ggml_vulkan: Error: Missing op: CONCAT
GGML_ASSERT: /home/david/Desktop/Dev/ggml/stable-diffusion.cpp/ggml/src/ggml-vulkan.cpp:5533: false
Aborted (core dumped)
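For context on the failure above: the Vulkan backend dispatches each node of the compute graph by its ggml_op, and any op it has no shader for ends in the "Missing op" abort. A minimal sketch of that dispatch pattern, assuming only ggml's public header; the function name is hypothetical and the set of covered ops is illustrative, not the actual ggml-vulkan.cpp code:

```cpp
#include "ggml.h"

// Hypothetical helper illustrating the dispatch pattern; the real coverage
// table lives in ggml-vulkan.cpp and differs by revision.
static bool vk_has_kernel_for(enum ggml_op op) {
    switch (op) {
        case GGML_OP_ADD:
        case GGML_OP_MUL:
        case GGML_OP_MUL_MAT:
        case GGML_OP_SOFT_MAX:
            return true;         // a Vulkan shader exists for these
        case GGML_OP_CONCAT:     // no shader at the time -> "Missing op: CONCAT"
        default:
            return false;
    }
}
```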

Cloudwalk9 (Contributor)

After adding CONCAT in the relevant place (probably not the right fix?), it gets a little further but still fails here:

ggml_backend_vk_graph_compute: error: op not supported  (view) (UNARY)
GGML_ASSERT: /home/david/Desktop/Dev/ggml/stable-diffusion.cpp/ggml/src/ggml-vulkan.cpp:6227: ok

At this point it's beyond my knowledge/skill.

0cc4m commented Jul 13, 2024

@Cloudwalk9 Thank you for trying it, I can add the missing ops. Can you upload your progress to a branch that I can access?

Cloudwalk9 (Contributor)

@0cc4m Done, but it's pretty crude. I updated the submodule to point to my fork of ggml with the imported Vulkan code, and I also had to fix some headers: https://github.com/Cloudwalk9/stable-diffusion.cpp

Cloudwalk9 (Contributor) commented Jul 28, 2024

@0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly instead of my forked submodule.

0cc4m commented Jul 28, 2024

> @0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly instead of my forked submodule.

Yeah, my WIP branch is here: https://github.com/0cc4m/ggml/tree/vulkan-stable-diffusion-ops

I implemented all the ops, but there's still some bug that makes the image not adhere to the prompt. I'll investigate that later.

SkutteOleg (Contributor) commented Jul 29, 2024

> @0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly instead of my forked submodule.
>
> Yeah, my WIP branch is here: https://github.com/0cc4m/ggml/tree/vulkan-stable-diffusion-ops
>
> I implemented all the ops, but there's still some bug that makes the image not adhere to the prompt. I'll investigate that later.

Great work, thank you!

Some ops appear to still be missing when I try to use LoRA (res-adapter):

lora.hpp:67   - finished loaded lora
lora.hpp:175  - (18 / 18) LoRA tensors applied successfully
ggml_extend.hpp:841  - lora compute buffer size: 112.85 MB(VRAM)
lora.hpp:175  - (18 / 18) LoRA tensors applied successfully
ggml_vulkan: Error: Missing op: ADD for f16 and f32 to f16
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-vulkan.cpp:4149: fatal error

A different error occurs when I try to use TAESD:

stable-diffusion.cpp:1398 - generating 1 latent images completed, taking 46.07s
stable-diffusion.cpp:1401 - decoding 1 latents
ggml_extend.hpp:841  - taesd compute buffer size: 480.00 MB(VRAM)
ggml_backend_vk_graph_compute: error: op not supported  (view) (UNARY)
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-vulkan.cpp:6432: GGML_ASSERT(ok) failed
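The first error above is about a type combination rather than a whole missing operation: an element-wise add whose first operand (and therefore its destination) is f16 while the second operand is f32, which the Vulkan backend had no kernel for at that point. A minimal sketch of the graph node that triggers it, using ggml's public API; the helper name and tensor sizes are made up here, and backend/buffer setup is omitted:

```cpp
#include "ggml.h"

// Sketch only: builds the "ADD for f16 and f32 to f16" combination from the
// log above. The result tensor inherits f16 from the first operand.
static struct ggml_tensor * build_mixed_add(struct ggml_context * ctx) {
    struct ggml_tensor * w  = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, 4096, 4096); // f16 model weight
    struct ggml_tensor * dw = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096); // f32 LoRA delta
    return ggml_add(ctx, w, dw); // f16 + f32 -> f16
}
```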

Cloudwalk9 (Contributor)

We're finally about to see Stable Diffusion where the only major dependency is your graphics driver...

0cc4m commented Jul 30, 2024

@SkutteOleg Thank you, those should be easy to add. I fixed the first bug that caused issues, but I ran into another matmul bug that I have to find in the shader code. I hope I can find it soon.

0cc4m commented Jul 30, 2024

LoRA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.

SkutteOleg (Contributor) commented Jul 30, 2024

> LoRA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.

It is amazing, actually. It's 2.5 times faster than CUDA12 on my end 😲
(perhaps due to lower memory usage, but I'm not sure)

0cc4m commented Jul 30, 2024

> LoRA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.
>
> It is amazing, actually. It's 2.5 times faster than CUDA12 on my end 😲 (perhaps due to lower memory usage, but I'm not sure)

On which hardware?

SkutteOleg (Contributor) commented Jul 30, 2024

> On which hardware?

NVIDIA GeForce GTX 1660 SUPER

EDIT: Also confirmed working reasonably fast on Steam Deck.

SkutteOleg (Contributor) commented Jul 30, 2024

> It's 2.5 times faster than CUDA12 on my end 😲 (perhaps due to lower memory usage, but I'm not sure)

I had time to do some further testing. Apparently I was comparing against a previous build of sd.cpp; it turns out CUDA12 image generation also got faster after the ggml update. Even so, Vulkan is 20% faster.
However, I was wrong about memory. Vulkan appears to use more memory, as I can no longer fit both llama.cpp and stable-diffusion.cpp on the GPU at the same time.

UPD: I was testing at 512x512 before. At 1024x1024, Vulkan is indeed 15% slower for me. Also, at 1024x1024 it produces broken outputs on my hardware:
[images: vulkan_2, vulkan_4]

maxargy commented Jul 31, 2024

> LoRA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.

Excellent work. It works fine for me, tested with an Intel Arc A580.

0cc4m commented Jul 31, 2024

> UPD: I was testing at 512x512 before. At 1024x1024, Vulkan is indeed 15% slower for me. Also, at 1024x1024 it produces broken outputs on my hardware.

This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4 GB), and 1024x1024 VAE decoding requests a buffer larger than that.
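To make the size argument concrete, a rough back-of-the-envelope sketch follows. The channel counts are typical for an SD-style VAE decoder and are assumptions for illustration, not numbers taken from sd.cpp; the point is only that single f32 activations at 1024x1024 already reach the gigabyte range, so a compute buffer holding several of them overruns a ~4 GiB Vulkan buffer limit, while tiling keeps each tile's activations far smaller.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t w = 1024, h = 1024;
    const int64_t channels[] = {512, 256, 128}; // assumed decoder stages near full resolution
    for (int64_t c : channels) {
        const double gib = (double)(w * h * c * 4) / (1024.0 * 1024.0 * 1024.0); // one f32 tensor
        std::printf("%3lld channels at %lldx%lld: %.2f GiB\n",
                    (long long)c, (long long)w, (long long)h, gib);
    }
    // Several such activations are live at once during decoding, so the total
    // compute buffer easily exceeds a 4 GiB per-buffer limit at 1024x1024.
    return 0;
}
```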

Green-Sky (Contributor)

There should be VAE tiling available, or a fallback to CPU (not exposed as a CLI option, AFAIK).

SkutteOleg (Contributor) commented Jul 31, 2024

> This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4 GB), and 1024x1024 VAE decoding requests a buffer larger than that.

Shouldn't VAE tiling help with that? This occurs for me even with VAE tiling enabled.

JohnArlow

Excellent work, well done. Pictures are generated at 384x384 on my Intel i5-1035G1.
[image: output]

JohnArlow

With the --vae-on-cpu option it will do 512x512 images. I don't understand why the VAE should be such a problem; the compute buffer size is 1.6 GB in RAM.
[image: YetanotherCat]

offbeat-stuff

Tried the Vulkan repo from @SkutteOleg:
vulkan sd.cpp -> 2.12 it/s
cuda sd.cpp -> 3.95 it/s
comfyui -> 1.27 it/s

NVIDIA GTX 1650 Ti Mobile
Fedora 40

Nearly identical images, though why are some patches different between CUDA and Vulkan?

0cc4m commented Aug 1, 2024

> This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4 GB), and 1024x1024 VAE decoding requests a buffer larger than that.
>
> Shouldn't VAE tiling help with that? This occurs for me even with VAE tiling enabled.

It should, and it does in my tests. I can generate 1024x1024 images with SDXL by using --vae-tiling or --vae-on-cpu.

> Why are some patches different between CUDA and Vulkan?

There are slight differences in how the CUDA and Vulkan backends calculate. For example, the CUDA backend uses tensor cores for matrix multiplication, while the Vulkan backend (on NVIDIA GPUs) uses the regular CUDA cores. That can change the results slightly. There might also be some minor differences in other operations that contribute to that, too.
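As a small illustration of that last point, the snippet below (not sd.cpp code) shows that merely summing the same values in a different order changes a float result; after thousands of such operations per pixel, two correct backends can legitimately produce slightly different images.

```cpp
#include <cstdio>

int main() {
    const float a[4] = {1e8f, 1.0f, -1e8f, 1.0f};
    const float left_to_right = ((a[0] + a[1]) + a[2]) + a[3]; // 1e8f + 1.0f rounds back to 1e8f
    const float reordered     = (a[0] + a[2]) + (a[1] + a[3]); // cancellation happens first
    std::printf("%.1f vs %.1f\n", left_to_right, reordered);   // prints 1.0 vs 2.0
    return 0;
}
```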

maxargy commented Aug 1, 2024

I tried img2img mode, but it immediately raises an error:
ggml_vulkan: Error: Missing op: PAD

stduhpf (Contributor) commented Aug 22, 2024

@SkutteOleg Same here. Even when using q2_k quants (3.8 GB) to make a tiny 64x64 image, it tries to allocate over 8 GB of VRAM and crashes. The CPU backend doesn't even need 3.9 GB during diffusion with these settings.

stduhpf (Contributor) commented Aug 22, 2024

This looks like a memory leak. VRAM consumption starts shooting up as soon as the GPU starts working, after everything is loaded.

MGTRIDER commented Aug 23, 2024

#356 (comment)

The exact same thing seems to happen on the CUDA backend. At higher resolutions, VRAM requirements shoot through the roof, and stable-diffusion.cpp doesn't seem to play nice with shared VRAM / CPU offloading. This seems to affect Flux the most.

0cc4m commented Aug 23, 2024

The Flux Vulkan issue is the result of an inefficient GGML_OP_REPEAT implementation in Vulkan that I already fixed in llama.cpp (ggerganov/llama.cpp@0645ed5). Once the Vulkan changes are synced back to ggml, it'll work.
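For context, GGML_OP_REPEAT tiles a smaller tensor across a larger, integer-multiple shape (broadcasting), which Flux's graph uses heavily. The CPU reference below only shows what the op computes; it is not the ggml or Vulkan implementation, and the linked commit changes how the Vulkan backend executes this op, not its semantics.

```cpp
#include <cstdio>
#include <vector>

// Tile an ne0 x ne1 row-major tensor r0 times along dim 0 and r1 times along dim 1.
static void repeat_2d(const std::vector<float> & src, int ne0, int ne1,
                      std::vector<float> & dst, int r0, int r1) {
    const int d0 = ne0 * r0, d1 = ne1 * r1;
    dst.assign((size_t)d0 * d1, 0.0f);
    for (int i1 = 0; i1 < d1; ++i1) {
        for (int i0 = 0; i0 < d0; ++i0) {
            dst[(size_t)i1 * d0 + i0] = src[(size_t)(i1 % ne1) * ne0 + (i0 % ne0)];
        }
    }
}

int main() {
    const std::vector<float> src = {1, 2, 3, 4}; // 2x2 tile
    std::vector<float> dst;
    repeat_2d(src, 2, 2, dst, 2, 1);             // repeat twice along dim 0 -> 4x2
    for (float v : dst) std::printf("%g ", v);   // prints: 1 2 1 2 3 4 3 4
    std::printf("\n");
    return 0;
}
```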

MGTRIDER

> The Flux Vulkan issue is the result of an inefficient GGML_OP_REPEAT implementation in Vulkan that I already fixed in llama.cpp (ggerganov/llama.cpp@0645ed5). Once the Vulkan changes are synced back to ggml, it'll work.

That's great to hear. Thanks for the update and all your contributions. Do you think the same thing happens in the CUDA backend? At resolutions above 512x512, the VRAM requirements increase steeply, even for a q4_0 quantization that would otherwise work smoothly at high resolutions in ComfyUI and Forge on 8 GB of VRAM.

stduhpf (Contributor) commented Aug 23, 2024

> The Flux Vulkan issue is the result of an inefficient GGML_OP_REPEAT implementation in Vulkan that I already fixed in llama.cpp (ggerganov/llama.cpp@0645ed5). Once the Vulkan changes are synced back to ggml, it'll work.

Thanks @0cc4m, I was able to patch ggml based on the commit you linked, and it works! I get a bit under 6 s/it on my RX 5700 XT, which isn't lightning fast, but still much faster than on CPU. Memory usage seems perfectly fine.

MGTRIDER

> The Flux Vulkan issue is the result of an inefficient GGML_OP_REPEAT implementation in Vulkan that I already fixed in llama.cpp (ggerganov/llama.cpp@0645ed5). Once the Vulkan changes are synced back to ggml, it'll work.
>
> Thanks @0cc4m, I was able to patch ggml based on the commit you linked, and it works! I get a bit under 6 s/it on my RX 5700 XT, which isn't lightning fast, but still much faster than on CPU. Memory usage seems perfectly fine.

Hi there, that's good to know. May I ask what your system specs are and which quants you've tested?

stduhpf (Contributor) commented Aug 23, 2024

> Hi there, that's good to know. May I ask what your system specs are and which quants you've tested?

For testing, I used: .\build\bin\Release\sd.exe --diffusion-model ..\ComfyUI\models\unet\flux1-schnell-Q2_k.gguf --vae ..\ComfyUI\models\vae\ae.q2_k.gguf --clip_l ..\ComfyUI\models\clip\clip_l.q8_0.gguf --t5xxl ..\ComfyUI\models\clip\t5xxl_q4_k.gguf -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v --steps 4 -t 24 --clip-on-cpu --vae-on-cpu. So that's a 512x512 image. I think I could get away with bigger quants for the diffusion model, since q2 is a bit degraded.

System specs: Ryzen 9 3900X + 32 GB DDR4 + RX 5700 XT (8 GB VRAM)
[image: output]

EDIT: q3_K works too (~6.7 s/it)
[image: output]

MGTRIDER

> For testing, I used: .\build\bin\Release\sd.exe --diffusion-model ..\ComfyUI\models\unet\flux1-schnell-Q2_k.gguf --vae ..\ComfyUI\models\vae\ae.q2_k.gguf --clip_l ..\ComfyUI\models\clip\clip_l.q8_0.gguf --t5xxl ..\ComfyUI\models\clip\t5xxl_q4_k.gguf -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v --steps 4 -t 24 --clip-on-cpu --vae-on-cpu. So that's a 512x512 image. I think I could get away with bigger quants for the diffusion model, since q2 is a bit degraded.
>
> System specs: Ryzen 9 3900X + 32 GB DDR4 + RX 5700 XT (8 GB VRAM)

Thanks for answering, and yes, I think you should test q4_0 or q4_k, as that would still fully fit in 8 GB of VRAM. And could you also test resolutions like 1024x1024? It was mostly at higher resolutions that the VRAM requirements shot up drastically for me and swapping to shared VRAM started. But then stable-diffusion.cpp just hangs at the sampling stage and does nothing. It also looks like the CLIP model and text encoder get unloaded from RAM when this happens.

jimtendo

@stduhpf Any chance you might be able to push up a fork of your working code? I tried to cherry-pick from llama.cpp, but my git-fu isn't up to scratch.

I'm pretty keen to see if this increases performance on AMD APUs.

stduhpf (Contributor) commented Aug 23, 2024

> Thanks for answering, and yes, I think you should test q4_0 or q4_k, as that would still fully fit in 8 GB of VRAM. And could you also test resolutions like 1024x1024? It was mostly at higher resolutions that the VRAM requirements shot up drastically for me and swapping to shared VRAM started. But then stable-diffusion.cpp just hangs at the sampling stage and does nothing. It also looks like the CLIP model and text encoder get unloaded from RAM when this happens.

I was running out of memory when using q3 at 1024x1024, but it works with q2 (34 s/it).
[image: output]

I can't seem to run a q4_k quantization at any resolution. Somehow it's allocating 6389 MB of VRAM to load the model, despite the file size being only 4.2 GB. This discrepancy doesn't happen with q2_k or q3_k.

SkutteOleg (Contributor)

> @stduhpf Any chance you might be able to push up a fork of your working code? I tried to cherry-pick from llama.cpp, but my git-fu isn't up to scratch.
>
> I'm pretty keen to see if this increases performance on AMD APUs.

I've updated my forks with the Vulkan REPEAT fix:
https://github.com/SkutteOleg/ggml/tree/master
https://github.com/SkutteOleg/stable-diffusion.cpp/tree/master

stduhpf (Contributor) commented Aug 23, 2024

> I've updated my forks with the Vulkan REPEAT fix: https://github.com/SkutteOleg/ggml/tree/master https://github.com/SkutteOleg/stable-diffusion.cpp/tree/master

Ah, thanks. I was just about to make a fork of ggml with the fix myself.

MGTRIDER commented Aug 23, 2024

> I was running out of memory when using q3 at 1024x1024, but it works with q2 (34 s/it).
>
> I can't seem to run a q4_k quantization at any resolution. Somehow it's allocating 6389 MB of VRAM to load the model, despite the file size being only 4.2 GB. This discrepancy doesn't happen with q2_k or q3_k.

I see, thanks for confirming, so that's still a problem. It also seems to happen on CUDA for me.

jimtendo

Using @SkutteOleg's fork above, Flux now works with Vulkan on my AMD 5600G.

Performance for the UNet is around 2.5x better (~30 s vs ~75 s per iteration), and it seemingly uses far less power (the fan doesn't go wild like it does on CPU).

I can't get LoRAs to load, but I think that's a problem with the LoRA loader itself (it fails on CPU and GPU). It seems to be a mismatch between the tensor names in Flux models and what the SD.cpp implementation looks for. That said, it might just be the particular GGUF quant I'm using. If not, I'm going to see if I can patch that up later.

SkutteOleg (Contributor) commented Aug 24, 2024

> I can't get LoRAs to load

Try with 46eeff5.

jimtendo commented Aug 24, 2024
> Try with 46eeff5.

I've done a bit of playing around with it. This does fix the mappings, but I get another error (I was getting this before too, but had assumed it was due to the failed LoRA mappings):

[WARN ] lora.hpp:164  - PATH: IS F32 OR F16
[DEBUG] lora.hpp:190  - (152 / 152) LoRA tensors applied successfully
[WARN ] lora.hpp:194  - Made it here.
/home/jimtendo/Projects/stable-diffusion.cpp2/ggml/src/ggml-backend.c:224: GGML_ASSERT(buf != NULL && "tensor buffer not set") failed

Strangely, I get the same issue on CPU, unless I explicitly specify something other than FP16 or FP32 with --type.

But if I do that on Vulkan via something like --type q4_0, I then get the following (which probably makes sense?):

Missing CPY op for types: f32 q4_0
/home/jimtendo/Projects/stable-diffusion.cpp2/ggml/src/ggml-vulkan.cpp:2967: fatal error

I've been trying to trace exactly where this occurs (without --type specified). I suspect it's in the call to GGMLRunner::compute in lora.hpp, but I'm still stepping through.

EDIT:

I'm not sure why exactly, but if you change the following line in lora.hpp:

if (weight->type != GGML_TYPE_F32 && weight->type != GGML_TYPE_F16) {

... to:

if (weight->type != GGML_TYPE_F32) {

... that appears to fix it. I tried copying the else block verbatim except for the to_fp32 function and that still complained about the buffer not being set. I have no idea why.

SkutteOleg (Contributor)

8847114 should fix it.

theoparis

I'm getting OutOfPoolMemoryError with Vulkan and an 8 GB 6600 XT card; the model is stable_diffusion-ema-pruned-v2-1_768.q5_0.gguf.

leejet (Owner) commented Aug 27, 2024

Thank you for your contribution!

leejet merged commit 2027b16 into leejet:master on Aug 27, 2024. 9 checks passed.
SkutteOleg (Contributor)

Hey @leejet, this PR requires ggerganov/llama.cpp@0645ed5 to be synced into ggml. Without it, FLUX doesn't work, and performance is worse in general.

daniandtheweb (Contributor)

@0cc4m I've noticed that performing quantization directly while using txt2img with LoRAs causes the following issue:

Missing CPY op for types: f32 q8_0

The error occurs for any quantization I try to convert to (q8_0, q4_0, q5_0, ...) when I apply a LoRA called more_details. Since it's a missing-op type of error, maybe it could be implemented in ggml to fix this?

0cc4m commented Aug 27, 2024

> Since it's a missing-op type of error, maybe it could be implemented in ggml to fix this?

Yes, that is possible. Basically you have to implement a GPU kernel that does quantization, one for each quant you want to support. It's not a priority for me at this time (it might be annoying to implement), but if someone gives it a shot I'm happy to assist and review.
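For anyone who wants to attempt it, here is a plain CPU sketch of what such a kernel has to compute for ggml's Q8_0 layout (blocks of 32 values with one scale per block; ggml stores the scale as f16, kept as a plain float here for brevity). This is a reference for the math only, not ggml code; a Vulkan kernel would do the same per-block work inside a compute shader.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

constexpr int QK8_0 = 32;   // values per block, as in ggml's Q8_0

struct BlockQ8_0 {
    float  d;               // per-block scale (f16 in ggml, float here)
    int8_t qs[QK8_0];       // quantized values
};

// Quantize n floats (n divisible by 32) into Q8_0-style blocks.
static void quantize_row_q8_0_ref(const float * x, BlockQ8_0 * y, int n) {
    for (int b = 0; b < n / QK8_0; ++b) {
        float amax = 0.0f;
        for (int i = 0; i < QK8_0; ++i) amax = std::fmax(amax, std::fabs(x[b * QK8_0 + i]));
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[b].d = d;
        for (int i = 0; i < QK8_0; ++i) {
            y[b].qs[i] = (int8_t) std::lround(x[b * QK8_0 + i] * id);
        }
    }
}

int main() {
    float x[QK8_0];
    for (int i = 0; i < QK8_0; ++i) x[i] = 0.1f * (i - 16);
    BlockQ8_0 out[1];
    quantize_row_q8_0_ref(x, out, QK8_0);
    std::printf("scale %.4f, q[0]=%d, q[31]=%d\n", out[0].d, out[0].qs[0], out[0].qs[31]);
    return 0;
}
```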

stduhpf pushed a commit to stduhpf/stable-diffusion.cpp that referenced this pull request Nov 1, 2024
* Fix includes and init vulkan the same as llama.cpp

* Add Windows Vulkan CI

* Updated ggml submodule

* support epsilon as a parameter for ggml_group_norm

---------

Co-authored-by: Cloudwalk <cloudwalk@icculus.org>
Co-authored-by: Oleg Skutte <00.00.oleg.00.00@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>