
Add vulkan backend #291

Merged: 12 commits, Aug 27, 2024
Conversation

sohzm (Contributor) commented Jun 18, 2024

Issue: #256

Looks like they're making some changes to Vulkan shader generation in the ggml repo, and it's currently broken. I'll keep an eye on it and update the PR accordingly.

sohzm (Contributor, Author) commented Jun 18, 2024

Related issue: ggerganov/llama.cpp#5356

(I'm new to this, so I might have made some mistakes. I'd be grateful for any guidance or feedback.)

0cc4m commented Jun 23, 2024

Hey, nice to see someone working on this. I'd like to get this to work. There are probably some ops that still need to be supported by Vulkan upstream, right? I can help with that.

sohzm (Contributor, Author) commented Jun 28, 2024

@0cc4m Thanks for offering to help.

Currently the .hpp file generated by ggml_vk_generate_shaders.py does not define symbols like mul_mat_vec_id_q3_k_f32_len, div_f32_len, etc.

Some symbols were also renamed, e.g. ggml/src/ggml-vulkan.cpp references dequant_q5_k_len, but the header file declares dequant_q5_K_len.

I'm assuming these issues will be resolved by your work in llama.cpp? Please correct me if I'm wrong.

Also, let me know if I can help with anything.

0cc4m commented Jun 28, 2024

> @0cc4m Thanks for offering to help.
>
> Currently the .hpp file generated by ggml_vk_generate_shaders.py does not define symbols like mul_mat_vec_id_q3_k_f32_len, div_f32_len, etc.
>
> Some symbols were also renamed, e.g. ggml/src/ggml-vulkan.cpp references dequant_q5_k_len, but the header file declares dequant_q5_K_len.
>
> I'm assuming these issues will be resolved by your work in llama.cpp? Please correct me if I'm wrong.
>
> Also, let me know if I can help with anything.

It is working in llama.cpp. I'll take a look at the status in ggml; maybe that needs an update.

Cloudwalk9 (Contributor)

I manually wired up Vulkan and compiled SD.cpp against the latest ggml, patched with llama.cpp's Vulkan changes. It runs and loads a model, but their Vulkan shaders do not implement CONCAT, so it fails:

./sd -m ~/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors --prompt "score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash" -W 1024 -H 1024 -v
Option: 
    n_threads:         8
    mode:              txt2img
    model_path:        /home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors
    wtype:             unspecified
    vae_path:          
    taesd_path:        
    esrgan_path:       
    controlnet_path:   
    embeddings_path:   
    stacked_id_embeddings_path:   
    input_id_images_path:   
    style ratio:       20.00
    normzalize input image :  false
    output_path:       output.png
    init_img:          
    control_image:     
    clip on cpu:       false
    controlnet cpu:    false
    vae decoder on cpu:false
    strength(control): 0.90
    prompt:            score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash
    negative_prompt:   
    min_cfg:           1.00
    cfg_scale:         7.00
    clip_skip:         -1
    width:             1024
    height:            1024
    sample_method:     euler_a
    schedule:          default
    sample_steps:      20
    strength(img2img): 0.75
    rng:               cuda
    seed:              42
    batch_count:       1
    vae_tiling:        false
    upscale_repeats:   1
System Info: 
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 1
    AVX512_VBMI = 1
    AVX512_VNNI = 1
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:158  - Using Vulkan backend
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: NVIDIA RTX A4000 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32
[INFO ] stable-diffusion.cpp:178  - loading model from '/home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors'
[INFO ] model.cpp:737  - load /home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors using safetensors format
[DEBUG] model.cpp:803  - init from '/home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors'
[INFO ] stable-diffusion.cpp:201  - Stable Diffusion XL 
[INFO ] stable-diffusion.cpp:207  - Stable Diffusion weight type: f16
[DEBUG] stable-diffusion.cpp:208  - ggml tensor size = 400 bytes
[WARN ] stable-diffusion.cpp:213  - !!!It looks like you are using SDXL model. If you find that the generated images are completely black, try specifying SDXL VAE FP16 Fix with the --vae parameter. You can find it here: https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/sdxl_vae.safetensors
[DEBUG] ggml_extend.hpp:884  - clip params backend buffer size =  1564.36 MB(VRAM) (713 tensors)
[DEBUG] ggml_extend.hpp:884  - unet params backend buffer size =  4900.07 MB(VRAM) (1680 tensors)
[DEBUG] ggml_extend.hpp:884  - vae params backend buffer size =  94.47 MB(VRAM) (140 tensors)
[DEBUG] stable-diffusion.cpp:309  - loading vocab
[DEBUG] clip.hpp:164  - vocab size: 49408
[DEBUG] clip.hpp:175  -  trigger word img already in vocab
[DEBUG] stable-diffusion.cpp:329  - loading weights
[DEBUG] model.cpp:1380 - loading tensors from /home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors
[INFO ] stable-diffusion.cpp:413  - total params memory size = 6558.89MB (VRAM 6558.89MB, RAM 0.00MB): clip 1564.36MB(VRAM), unet 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:432  - loading model from '/home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors' completed, taking 4.34s
[INFO ] stable-diffusion.cpp:449  - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:482  - finished loaded file
[DEBUG] stable-diffusion.cpp:1452 - txt2img 1024x1024
[DEBUG] stable-diffusion.cpp:1207 - prompt after extract and remove lora: "score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash"
[INFO ] stable-diffusion.cpp:565  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1212 - apply_loras completed, taking 0.00s
[DEBUG] clip.hpp:1312 - parse 'score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash' to [['score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash', 1], ]
[DEBUG] clip.hpp:1152 - token length: 77
[DEBUG] ggml_extend.hpp:838  - clip compute buffer size: 2.56 MB(VRAM)
ggml_vulkan: Error: Missing op: CONCAT
GGML_ASSERT: /home/david/Desktop/Dev/ggml/stable-diffusion.cpp/ggml/src/ggml-vulkan.cpp:5533: false
Aborted (core dumped)
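For context on the failure above: the Vulkan backend dispatches each node of the compute graph by its ggml_op, and any op it has no shader for ends in the "Missing op" abort. A minimal sketch of that dispatch pattern, assuming only ggml's public header; the function name is hypothetical and the set of covered ops is illustrative, not the actual ggml-vulkan.cpp code:

```cpp
#include "ggml.h"

// Hypothetical helper illustrating the dispatch pattern; the real coverage
// table lives in ggml-vulkan.cpp and differs by revision.
static bool vk_has_kernel_for(enum ggml_op op) {
    switch (op) {
        case GGML_OP_ADD:
        case GGML_OP_MUL:
        case GGML_OP_MUL_MAT:
        case GGML_OP_SOFT_MAX:
            return true;         // a Vulkan shader exists for these
        case GGML_OP_CONCAT:     // no shader at the time -> "Missing op: CONCAT"
        default:
            return false;
    }
}
```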

Cloudwalk9 (Contributor)

After adding CONCAT in the relevant place (probably not the right fix?), it gets a little further but still fails here:

ggml_backend_vk_graph_compute: error: op not supported  (view) (UNARY)
GGML_ASSERT: /home/david/Desktop/Dev/ggml/stable-diffusion.cpp/ggml/src/ggml-vulkan.cpp:6227: ok

At this point it's beyond my knowledge/skill.

0cc4m commented Jul 13, 2024

@Cloudwalk9 Thank you for trying it, I can add the missing ops. Can you upload your progress to a branch that I can access?

Cloudwalk9 (Contributor)

@0cc4m Done, but it's pretty crude. I updated the submodule to point to my fork of ggml with the imported Vulkan code, and I also had to fix some headers: https://github.com/Cloudwalk9/stable-diffusion.cpp

Cloudwalk9 (Contributor) commented Jul 28, 2024

@0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly instead of my forked submodule.

0cc4m commented Jul 28, 2024

> @0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly instead of my forked submodule.

Yeah, my WIP branch is here: https://github.com/0cc4m/ggml/tree/vulkan-stable-diffusion-ops

I implemented all the ops, but there's still some bug that makes the image not adhere to the prompt. I'll investigate that later.

SkutteOleg (Contributor) commented Jul 29, 2024

> @0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly instead of my forked submodule.
>
> Yeah, my WIP branch is here: https://github.com/0cc4m/ggml/tree/vulkan-stable-diffusion-ops
>
> I implemented all the ops, but there's still some bug that makes the image not adhere to the prompt. I'll investigate that later.

Great work, thank you!

Some ops appear to still be missing when I try to use LoRA (res-adapter):

lora.hpp:67   - finished loaded lora
lora.hpp:175  - (18 / 18) LoRA tensors applied successfully
ggml_extend.hpp:841  - lora compute buffer size: 112.85 MB(VRAM)
lora.hpp:175  - (18 / 18) LoRA tensors applied successfully
ggml_vulkan: Error: Missing op: ADD for f16 and f32 to f16
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-vulkan.cpp:4149: fatal error

A different error occurs when I try to use TAESD:

stable-diffusion.cpp:1398 - generating 1 latent images completed, taking 46.07s
stable-diffusion.cpp:1401 - decoding 1 latents
ggml_extend.hpp:841  - taesd compute buffer size: 480.00 MB(VRAM)
ggml_backend_vk_graph_compute: error: op not supported  (view) (UNARY)
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-vulkan.cpp:6432: GGML_ASSERT(ok) failed
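The first error above is about a type combination rather than a whole missing operation: an element-wise add whose first operand (and therefore its destination) is f16 while the second operand is f32, which the Vulkan backend had no kernel for at that point. A minimal sketch of the graph node that triggers it, using ggml's public API; the helper name and tensor sizes are made up here, and backend/buffer setup is omitted:

```cpp
#include "ggml.h"

// Sketch only: builds the "ADD for f16 and f32 to f16" combination from the
// log above. The result tensor inherits f16 from the first operand.
static struct ggml_tensor * build_mixed_add(struct ggml_context * ctx) {
    struct ggml_tensor * w  = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, 4096, 4096); // f16 model weight
    struct ggml_tensor * dw = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096); // f32 LoRA delta
    return ggml_add(ctx, w, dw); // f16 + f32 -> f16
}
```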

Cloudwalk9 (Contributor)

We're finally about to see Stable Diffusion where the only major dependency is your graphics driver...

0cc4m commented Jul 30, 2024

@SkutteOleg Thank you, those should be easy to add. I fixed the first bug that caused issues, but I ran into another matmul bug that I have to find in the shader code. I hope I can find it soon.

0cc4m commented Jul 30, 2024

LoRA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.

SkutteOleg (Contributor) commented Jul 30, 2024

> LoRA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.

It is amazing, actually. It's 2.5 times faster than CUDA12 on my end 😲
(perhaps due to lower memory usage, but I'm not sure)

0cc4m commented Jul 30, 2024

> LoRA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.
>
> It is amazing, actually. It's 2.5 times faster than CUDA12 on my end 😲 (perhaps due to lower memory usage, but I'm not sure)

On which hardware?

SkutteOleg (Contributor) commented Jul 30, 2024

> On which hardware?

NVIDIA GeForce GTX 1660 SUPER

EDIT: Also confirmed working reasonably fast on Steam Deck.

SkutteOleg (Contributor) commented Jul 30, 2024

> It's 2.5 times faster than CUDA12 on my end 😲 (perhaps due to lower memory usage, but I'm not sure)

I had time to do some further testing. Apparently I was comparing against a previous build of sd.cpp; it turns out CUDA12 image generation also got faster after the ggml update. Even so, Vulkan is 20% faster.
However, I was wrong about memory. Vulkan appears to use more memory, as I can no longer fit both llama.cpp and stable-diffusion.cpp on the GPU at the same time.

UPD: I was testing at 512x512 before. At 1024x1024, Vulkan is indeed 15% slower for me. Also, at 1024x1024 it produces broken outputs on my hardware:
[images: vulkan_2, vulkan_4]

maxargy commented Jul 31, 2024

> LoRA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.

Excellent work. It works fine for me, tested with an Intel Arc A580.

0cc4m commented Jul 31, 2024

> UPD: I was testing at 512x512 before. At 1024x1024, Vulkan is indeed 15% slower for me. Also, at 1024x1024 it produces broken outputs on my hardware.

This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4 GB), and 1024x1024 VAE decoding requests a buffer larger than that.
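To make the size argument concrete, a rough back-of-the-envelope sketch follows. The channel counts are typical for an SD-style VAE decoder and are assumptions for illustration, not numbers taken from sd.cpp; the point is only that single f32 activations at 1024x1024 already reach the gigabyte range, so a compute buffer holding several of them overruns a ~4 GiB Vulkan buffer limit, while tiling keeps each tile's activations far smaller.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t w = 1024, h = 1024;
    const int64_t channels[] = {512, 256, 128}; // assumed decoder stages near full resolution
    for (int64_t c : channels) {
        const double gib = (double)(w * h * c * 4) / (1024.0 * 1024.0 * 1024.0); // one f32 tensor
        std::printf("%3lld channels at %lldx%lld: %.2f GiB\n",
                    (long long)c, (long long)w, (long long)h, gib);
    }
    // Several such activations are live at once during decoding, so the total
    // compute buffer easily exceeds a 4 GiB per-buffer limit at 1024x1024.
    return 0;
}
```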

Green-Sky (Contributor)

There should be VAE tiling available, or a fallback to CPU (not exposed as a CLI option, AFAIK).

SkutteOleg (Contributor) commented Jul 31, 2024

> This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4 GB), and 1024x1024 VAE decoding requests a buffer larger than that.

Shouldn't VAE tiling help with that? This occurs for me even with VAE tiling enabled.

JohnArlow

Excellent work, well done. Pictures are generated at 384x384 on my Intel i5-1035G1.
[image: output]

JohnArlow

With the --vae-on-cpu option it will do 512x512 images. I don't understand why the VAE should be such a problem; the compute buffer size is 1.6 GB in RAM.
[image: YetanotherCat]

offbeat-stuff

Tried the Vulkan repo from @SkutteOleg:
vulkan sd.cpp -> 2.12 it/s
cuda sd.cpp -> 3.95 it/s
comfyui -> 1.27 it/s

NVIDIA GTX 1650 Ti Mobile
Fedora 40

Nearly identical images, though why are some patches different between CUDA and Vulkan?

0cc4m commented Aug 1, 2024

> This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4 GB), and 1024x1024 VAE decoding requests a buffer larger than that.
>
> Shouldn't VAE tiling help with that? This occurs for me even with VAE tiling enabled.

It should, and it does in my tests. I can generate 1024x1024 images with SDXL by using --vae-tiling or --vae-on-cpu.

> Why are some patches different between CUDA and Vulkan?

There are slight differences in how the CUDA and Vulkan backends calculate. For example, the CUDA backend uses tensor cores for matrix multiplication, while the Vulkan backend (on NVIDIA GPUs) uses the regular CUDA cores. That can change the results slightly. There might also be some minor differences in other operations that contribute to that, too.
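As a small illustration of that last point, the snippet below (not sd.cpp code) shows that merely summing the same values in a different order changes a float result; after thousands of such operations per pixel, two correct backends can legitimately produce slightly different images.

```cpp
#include <cstdio>

int main() {
    const float a[4] = {1e8f, 1.0f, -1e8f, 1.0f};
    const float left_to_right = ((a[0] + a[1]) + a[2]) + a[3]; // 1e8f + 1.0f rounds back to 1e8f
    const float reordered     = (a[0] + a[2]) + (a[1] + a[3]); // cancellation happens first
    std::printf("%.1f vs %.1f\n", left_to_right, reordered);   // prints 1.0 vs 2.0
    return 0;
}
```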

maxargy commented Aug 1, 2024

I tried img2img mode, but it immediately raises an error:
ggml_vulkan: Error: Missing op: PAD

stduhpf (Contributor) commented Aug 22, 2024

@SkutteOleg Same here. Even when using q2_k quants (3.8 GB) to make a tiny 64x64 image, it tries to allocate over 8 GB of VRAM and crashes. The CPU backend doesn't even need 3.9 GB during diffusion with these settings.

stduhpf (Contributor) commented Aug 22, 2024

This looks like a memory leak. VRAM consumption starts shooting up as soon as the GPU starts working, after everything is loaded.

MGTRIDER commented Aug 23, 2024

#356 (comment)

The exact same thing seems to happen on the CUDA backend. At higher resolutions, VRAM requirements shoot through the roof, and stable-diffusion.cpp doesn't seem to play nice with shared VRAM / CPU offloading. This seems to affect Flux the most.

0cc4m commented Aug 23, 2024

The Flux Vulkan issue is the result of an inefficient GGML_OP_REPEAT implementation in Vulkan that I already fixed in llama.cpp (ggerganov/llama.cpp@0645ed5). Once the Vulkan changes are synced back to ggml, it'll work.
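For context, GGML_OP_REPEAT tiles a smaller tensor across a larger, integer-multiple shape (broadcasting), which Flux's graph uses heavily. The CPU reference below only shows what the op computes; it is not the ggml or Vulkan implementation, and the linked commit changes how the Vulkan backend executes this op, not its semantics.

```cpp
#include <cstdio>
#include <vector>

// Tile an ne0 x ne1 row-major tensor r0 times along dim 0 and r1 times along dim 1.
static void repeat_2d(const std::vector<float> & src, int ne0, int ne1,
                      std::vector<float> & dst, int r0, int r1) {
    const int d0 = ne0 * r0, d1 = ne1 * r1;
    dst.assign((size_t)d0 * d1, 0.0f);
    for (int i1 = 0; i1 < d1; ++i1) {
        for (int i0 = 0; i0 < d0; ++i0) {
            dst[(size_t)i1 * d0 + i0] = src[(size_t)(i1 % ne1) * ne0 + (i0 % ne0)];
        }
    }
}

int main() {
    const std::vector<float> src = {1, 2, 3, 4}; // 2x2 tile
    std::vector<float> dst;
    repeat_2d(src, 2, 2, dst, 2, 1);             // repeat twice along dim 0 -> 4x2
    for (float v : dst) std::printf("%g ", v);   // prints: 1 2 1 2 3 4 3 4
    std::printf("\n");
    return 0;
}
```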

MGTRIDER

> The Flux Vulkan issue is the result of an inefficient GGML_OP_REPEAT implementation in Vulkan that I already fixed in llama.cpp (ggerganov/llama.cpp@0645ed5). Once the Vulkan changes are synced back to ggml, it'll work.

That's great to hear. Thanks for the update and all your contributions. Do you think the same thing happens in the CUDA backend? At resolutions above 512x512, the VRAM requirements increase steeply, even for a q4_0 quantization that would otherwise work smoothly at high resolutions in ComfyUI and Forge on 8 GB of VRAM.

stduhpf (Contributor) commented Aug 23, 2024

> The Flux Vulkan issue is the result of an inefficient GGML_OP_REPEAT implementation in Vulkan that I already fixed in llama.cpp (ggerganov/llama.cpp@0645ed5). Once the Vulkan changes are synced back to ggml, it'll work.

Thanks @0cc4m, I was able to patch ggml based on the commit you linked, and it works! I get a bit under 6 s/it on my RX 5700 XT, which isn't lightning fast, but still much faster than on CPU. Memory usage seems perfectly fine.

MGTRIDER

> The Flux Vulkan issue is the result of an inefficient GGML_OP_REPEAT implementation in Vulkan that I already fixed in llama.cpp (ggerganov/llama.cpp@0645ed5). Once the Vulkan changes are synced back to ggml, it'll work.
>
> Thanks @0cc4m, I was able to patch ggml based on the commit you linked, and it works! I get a bit under 6 s/it on my RX 5700 XT, which isn't lightning fast, but still much faster than on CPU. Memory usage seems perfectly fine.

Hi there, that's good to know. May I ask what your system specs are and which quants you've tested?

stduhpf (Contributor) commented Aug 23, 2024

> Hi there, that's good to know. May I ask what your system specs are and which quants you've tested?

For testing, I used: .\build\bin\Release\sd.exe --diffusion-model ..\ComfyUI\models\unet\flux1-schnell-Q2_k.gguf --vae ..\ComfyUI\models\vae\ae.q2_k.gguf --clip_l ..\ComfyUI\models\clip\clip_l.q8_0.gguf --t5xxl ..\ComfyUI\models\clip\t5xxl_q4_k.gguf -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v --steps 4 -t 24 --clip-on-cpu --vae-on-cpu. So that's a 512x512 image. I think I could get away with bigger quants for the diffusion model, since q2 is a bit degraded.

System specs: Ryzen 9 3900X + 32 GB DDR4 + RX 5700 XT (8 GB VRAM)
[image: output]

EDIT: q3_K works too (~6.7 s/it)
[image: output]

MGTRIDER

> For testing, I used: .\build\bin\Release\sd.exe --diffusion-model ..\ComfyUI\models\unet\flux1-schnell-Q2_k.gguf --vae ..\ComfyUI\models\vae\ae.q2_k.gguf --clip_l ..\ComfyUI\models\clip\clip_l.q8_0.gguf --t5xxl ..\ComfyUI\models\clip\t5xxl_q4_k.gguf -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v --steps 4 -t 24 --clip-on-cpu --vae-on-cpu. So that's a 512x512 image. I think I could get away with bigger quants for the diffusion model, since q2 is a bit degraded.
>
> System specs: Ryzen 9 3900X + 32 GB DDR4 + RX 5700 XT (8 GB VRAM)

Thanks for answering, and yes, I think you should test q4_0 or q4_k, as that would still fully fit in 8 GB of VRAM. And could you also test resolutions like 1024x1024? It was mostly at higher resolutions that the VRAM requirements shot up drastically for me and swapping to shared VRAM started. But then stable-diffusion.cpp just hangs at the sampling stage and does nothing. It also looks like the CLIP model and text encoder get unloaded from RAM when this happens.

jimtendo

@stduhpf Any chance you might be able to push up a fork of your working code? I tried to cherry-pick from llama.cpp, but my git-fu isn't up to scratch.

I'm pretty keen to see if this increases performance on AMD APUs.

stduhpf (Contributor) commented Aug 23, 2024

> Thanks for answering, and yes, I think you should test q4_0 or q4_k, as that would still fully fit in 8 GB of VRAM. And could you also test resolutions like 1024x1024? It was mostly at higher resolutions that the VRAM requirements shot up drastically for me and swapping to shared VRAM started. But then stable-diffusion.cpp just hangs at the sampling stage and does nothing. It also looks like the CLIP model and text encoder get unloaded from RAM when this happens.

I was running out of memory when using q3 at 1024x1024, but it works with q2 (34 s/it).
[image: output]

I can't seem to run a q4_k quantization at any resolution. Somehow it's allocating 6389 MB of VRAM to load the model, despite the file size being only 4.2 GB. This discrepancy doesn't happen with q2_k or q3_k.

SkutteOleg (Contributor)

> @stduhpf Any chance you might be able to push up a fork of your working code? I tried to cherry-pick from llama.cpp, but my git-fu isn't up to scratch.
>
> I'm pretty keen to see if this increases performance on AMD APUs.

I've updated my forks with the Vulkan REPEAT fix:
https://github.com/SkutteOleg/ggml/tree/master
https://github.com/SkutteOleg/stable-diffusion.cpp/tree/master

stduhpf (Contributor) commented Aug 23, 2024

> I've updated my forks with the Vulkan REPEAT fix: https://github.com/SkutteOleg/ggml/tree/master https://github.com/SkutteOleg/stable-diffusion.cpp/tree/master

Ah, thanks. I was just about to make a fork of ggml with the fix myself.

MGTRIDER commented Aug 23, 2024

> I was running out of memory when using q3 at 1024x1024, but it works with q2 (34 s/it).
>
> I can't seem to run a q4_k quantization at any resolution. Somehow it's allocating 6389 MB of VRAM to load the model, despite the file size being only 4.2 GB. This discrepancy doesn't happen with q2_k or q3_k.

I see, thanks for confirming, so that's still a problem. It also seems to happen on CUDA for me.

jimtendo

Using @SkutteOleg's fork above, Flux now works with Vulkan on my AMD 5600G.

Performance for the UNet is around 2.5x better (~30 s vs ~75 s per iteration), and it seemingly uses far less power (the fan doesn't go wild like it does on CPU).

I can't get LoRAs to load, but I think that's a problem with the LoRA loader itself (it fails on CPU and GPU). It seems to be a mismatch between the tensor names in Flux models and what the SD.cpp implementation looks for. That said, it might just be the particular GGUF quant I'm using. If not, I'm going to see if I can patch that up later.

SkutteOleg (Contributor) commented Aug 24, 2024

> I can't get LoRAs to load

Try with 46eeff5.

jimtendo commented Aug 24, 2024
> Try with 46eeff5.

I've done a bit of playing around with it. This does fix the mappings, but I get another error (I was getting this before too, but had assumed it was due to the failed LoRA mappings):

[WARN ] lora.hpp:164  - PATH: IS F32 OR F16
[DEBUG] lora.hpp:190  - (152 / 152) LoRA tensors applied successfully
[WARN ] lora.hpp:194  - Made it here.
/home/jimtendo/Projects/stable-diffusion.cpp2/ggml/src/ggml-backend.c:224: GGML_ASSERT(buf != NULL && "tensor buffer not set") failed

Strangely, I get the same issue on CPU, unless I explicitly specify something other than FP16 or FP32 with --type.

But if I do that on Vulkan via something like --type q4_0, I then get the following (which probably makes sense?):

Missing CPY op for types: f32 q4_0
/home/jimtendo/Projects/stable-diffusion.cpp2/ggml/src/ggml-vulkan.cpp:2967: fatal error

I've been trying to trace exactly where this occurs (without --type specified). I suspect it's in the call to GGMLRunner::compute in lora.hpp, but I'm still stepping through.

EDIT:

I'm not sure why exactly, but if you change the following line in lora.hpp:

if (weight->type != GGML_TYPE_F32 && weight->type != GGML_TYPE_F16) {

... to:

if (weight->type != GGML_TYPE_F32) {

... that appears to fix it. I tried copying the else block verbatim except for the to_fp32 function and that still complained about the buffer not being set. I have no idea why.

SkutteOleg (Contributor)

8847114 should fix it.

theoparis

I'm getting OutOfPoolMemoryError with Vulkan and an 8 GB 6600 XT card; the model is stable_diffusion-ema-pruned-v2-1_768.q5_0.gguf.

leejet (Owner) commented Aug 27, 2024

Thank you for your contribution!

leejet merged commit 2027b16 into leejet:master on Aug 27, 2024. 9 checks passed.
SkutteOleg (Contributor)

Hey @leejet, this PR requires ggerganov/llama.cpp@0645ed5 to be synced into ggml. Without it, FLUX doesn't work, and performance is worse in general.

daniandtheweb (Contributor)

@0cc4m I've noticed that performing quantization directly while using txt2img with LoRAs causes the following issue:

Missing CPY op for types: f32 q8_0

The error occurs for any quantization I try to convert to (q8_0, q4_0, q5_0, ...) when I apply a LoRA called more_details. Since it's a missing-op type of error, maybe it could be implemented in ggml to fix this?

0cc4m commented Aug 27, 2024

> Since it's a missing-op type of error, maybe it could be implemented in ggml to fix this?

Yes, that is possible. Basically you have to implement a GPU kernel that does quantization, one for each quant you want to support. It's not a priority for me at this time (it might be annoying to implement), but if someone gives it a shot I'm happy to assist and review.
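For anyone who wants to attempt it, here is a plain CPU sketch of what such a kernel has to compute for ggml's Q8_0 layout (blocks of 32 values with one scale per block; ggml stores the scale as f16, kept as a plain float here for brevity). This is a reference for the math only, not ggml code; a Vulkan kernel would do the same per-block work inside a compute shader.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

constexpr int QK8_0 = 32;   // values per block, as in ggml's Q8_0

struct BlockQ8_0 {
    float  d;               // per-block scale (f16 in ggml, float here)
    int8_t qs[QK8_0];       // quantized values
};

// Quantize n floats (n divisible by 32) into Q8_0-style blocks.
static void quantize_row_q8_0_ref(const float * x, BlockQ8_0 * y, int n) {
    for (int b = 0; b < n / QK8_0; ++b) {
        float amax = 0.0f;
        for (int i = 0; i < QK8_0; ++i) amax = std::fmax(amax, std::fabs(x[b * QK8_0 + i]));
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[b].d = d;
        for (int i = 0; i < QK8_0; ++i) {
            y[b].qs[i] = (int8_t) std::lround(x[b * QK8_0 + i] * id);
        }
    }
}

int main() {
    float x[QK8_0];
    for (int i = 0; i < QK8_0; ++i) x[i] = 0.1f * (i - 16);
    BlockQ8_0 out[1];
    quantize_row_q8_0_ref(x, out, QK8_0);
    std::printf("scale %.4f, q[0]=%d, q[31]=%d\n", out[0].d, out[0].qs[0], out[0].qs[31]);
    return 0;
}
```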

stduhpf pushed a commit to stduhpf/stable-diffusion.cpp that referenced this pull request Nov 1, 2024
* Fix includes and init vulkan the same as llama.cpp

* Add Windows Vulkan CI

* Updated ggml submodule

* support epsilon as a parameter for ggml_group_norm

---------

Co-authored-by: Cloudwalk <cloudwalk@icculus.org>
Co-authored-by: Oleg Skutte <00.00.oleg.00.00@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>