Add vulkan backend #291
Conversation
Related issue: ggerganov/llama.cpp#5356 (I'm new to this, so I might have made some mistakes. I would be grateful for any guidance or feedback.)
Hey, nice to see someone working on this. I'd like to get this to work. There's probably some ops that need to be supported by Vulkan upstream, right? I can help with that.
@0cc4m Thanks for offering help. Currently the hpp file generated by … Also, some types were renamed, e.g. … I'm assuming these issues will be solved by your work in llama.cpp? Please correct me if I'm wrong. Also, let me know if I can help with anything.
It is working in llama.cpp. I'll take a look at the status in ggml, maybe that needs an update.
I manually wired up Vulkan and compiled SD.cpp with the latest ggml modified with llama.cpp's modifications to Vulkan. It runs and loads a model, but their Vulkan shaders do not implement CONCAT and it fails.
After adding CONCAT to the relevant place (probably not the solution for that?), it makes it a little further but still fails here:
At this point it's beyond my knowledge/skill.
@Cloudwalk9 Thank you for trying it; I can add the missing ops. Can you upload your progress to a branch that I can access?
@0cc4m Done, but it's pretty crude. I updated the submodule to point to my fork of ggml with the imported Vulkan stuff, also had to fix some headers. https://github.com/Cloudwalk9/stable-diffusion.cpp
@0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly, instead of my forked submodule.
Yeah, my WIP branch is here: https://github.com/0cc4m/ggml/tree/vulkan-stable-diffusion-ops
I implemented all the ops, but there's still some bug that makes the image not adhere to the prompt. I'll investigate that later.
Great work, thank you! Some ops appear to still be missing when I try to use LoRA (res-adapter):
A different error occurs when I try to use TAESD:
We're finally about to see Stable Diffusion where the only major dependency is your graphics driver...
@SkutteOleg Thank you, those should be easy to add. I fixed the first bug that caused issues, but I ran into another matmul bug that I have to find in the shader code. I hope I can find it soon.
LoRA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.
It is amazing, actually. It's 2.5 times faster than CUDA12 on my end 😲
On which hardware?
NVIDIA GeForce GTX 1660 SUPER

EDIT: Also confirmed working reasonably fast on Steam Deck.
Excellent work, it works fine for me. Tested with an Intel Arc A580.
This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4GB), and 1024x1024 VAE decoding requests a buffer larger than that.
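For context, a rough back-of-the-envelope estimate of why a single 1024x1024 VAE decode can hit that limit (the 256-channel width and f32 type below are assumptions for illustration, not values taken from sd.cpp):

```cpp
#include <cstdint>
#include <cstdio>

// Rough estimate of one intermediate activation inside a VAE decoder while
// decoding a 1024x1024 image. Channel count and dtype are assumptions chosen
// for illustration, not values read from sd.cpp.
int main() {
    const uint64_t width = 1024, height = 1024;
    const uint64_t channels = 256;      // assumed decoder feature width
    const uint64_t bytes_per_elem = 4;  // f32

    const double gib = 1024.0 * 1024.0 * 1024.0;
    const uint64_t tensor_bytes = width * height * channels * bytes_per_elem;
    printf("one activation:        %.2f GiB\n", tensor_bytes / gib);

    // The compute graph keeps several such tensors alive at once, so the single
    // backend buffer requested for the graph can exceed the ~4 GiB limit many
    // Vulkan drivers place on one allocation.
    printf("four such activations: %.2f GiB\n", 4 * tensor_bytes / gib);
    return 0;
}
```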
There should be VAE tiling available, or a fallback to CPU (not exposed as a CLI option, AFAIK).
Shouldn't VAE tiling help with that? This occurs for me even with VAE tiling enabled.
Tried the Vulkan repo from Skuttle on an Nvidia GTX 1650 Ti mobile. Nearly identical images, though why are some patches different between CUDA and Vulkan?
It should, and it does in my tests. I can generate 1024x1024 images with SDXL by using VAE tiling.
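For anyone wondering what VAE tiling buys here, a small sketch of the idea: the latent is decoded in overlapping tiles that get blended together, so peak memory scales with the tile size rather than the full image. Tile size, overlap, and channel width below are illustrative assumptions, not sd.cpp's actual parameters.

```cpp
#include <cstdio>

// Sketch of why tiled VAE decoding bounds memory: decode overlapping tiles of
// the latent and blend them, so the largest activation scales with the tile,
// not the whole image. Sizes below are illustrative, not sd.cpp's values.
int main() {
    const int latent_w = 128, latent_h = 128;    // 1024x1024 image / 8
    const int tile = 32, overlap = 8;            // assumed tile parameters
    const long long channels = 256, bytes = 4;   // assumed decoder width, f32

    const long long full  = (long long)latent_w * 8 * latent_h * 8 * channels * bytes;
    const long long tiled = (long long)tile * 8 * tile * 8 * channels * bytes;
    printf("full-image activation: %lld MiB\n", full  / (1024 * 1024));
    printf("per-tile activation:   %lld MiB\n", tiled / (1024 * 1024));

    int tiles = 0;
    for (int y = 0; y < latent_h; y += tile - overlap)
        for (int x = 0; x < latent_w; x += tile - overlap)
            ++tiles;  // each tile is decoded on its own and blended at the seams
    printf("tiles decoded: %d\n", tiles);
    return 0;
}
```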
There are slight differences in how the CUDA and Vulkan backends calculate; for example, the CUDA backend uses tensor cores for matrix multiplication, while the Vulkan backend (on Nvidia GPUs) uses the regular CUDA cores. That can change the results slightly. There might also be some minor differences in other operations that contribute to that, too.
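As a generic illustration of why two correct backends can still produce slightly different pixels (this is not code from either backend): accumulating the same values at a different precision, or in a different order, gives sums that differ in the low bits, and those tiny differences get amplified over many sampling steps.

```cpp
#include <cstdio>

// Generic numerical example (not backend code): summing the same inputs with a
// float accumulator vs. a double accumulator gives slightly different results.
// Tensor-core and shader-core matmuls differ in exactly this kind of way.
int main() {
    float  acc_f = 0.0f;
    double acc_d = 0.0;
    for (int i = 1; i <= 1000000; ++i) {
        const float x = 1.0f / (float)i;  // same inputs for both accumulators
        acc_f += x;
        acc_d += x;
    }
    printf("float accumulator : %.8f\n", acc_f);
    printf("double accumulator: %.8f\n", acc_d);  // differs in the low digits
    return 0;
}
```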
I tried the img2img mode, but it immediately raises an error.
@SkutteOleg Same here. Even when using q2_k quants (3.8GB) to make a tiny 64x64 image, it tries to allocate over 8GB of VRAM and crashes. The CPU backend doesn't even need 3.9GB during diffusion with these settings.
This looks like a memory leak. VRAM consumption starts shooting up as soon as the GPU starts working, after everything is loaded.
The exact same thing seems to happen on the CUDA backend. When using higher resolutions, VRAM requirements shoot through the roof, and stable-diffusion.cpp doesn't seem to play nice with shared VRAM / CPU offloading. This seems to affect Flux the most.
The Flux Vulkan issue is the result of an inefficient REPEAT implementation in the Vulkan backend.
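For readers unfamiliar with the op being discussed, REPEAT tiles a small tensor until it fills a larger shape. A plain C++ sketch of the semantics only (this is neither ggml's code nor the Vulkan shader):

```cpp
#include <cstdio>
#include <vector>

// Plain C++ sketch of the semantics of a 2D REPEAT/tile operation: the source
// tensor is repeated until it fills the destination shape. This shows only
// what the op computes; it is not ggml's implementation or the Vulkan shader.
std::vector<float> repeat2d(const std::vector<float>& src, int sw, int sh,
                            int dw, int dh) {
    std::vector<float> dst((size_t)dw * dh);
    for (int y = 0; y < dh; ++y)
        for (int x = 0; x < dw; ++x)
            dst[(size_t)y * dw + x] = src[(size_t)(y % sh) * sw + (x % sw)];
    return dst;
}

int main() {
    const std::vector<float> src = {1, 2, 3, 4};  // 2x2 source
    const auto dst = repeat2d(src, 2, 2, 4, 4);   // tiled up to 4x4
    for (int y = 0; y < 4; ++y, printf("\n"))
        for (int x = 0; x < 4; ++x)
            printf("%g ", dst[(size_t)y * 4 + x]);
    return 0;
}
```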
That's great to hear. Thanks for the update and all your contributions. Do you think the same thing happens in the CUDA backend? When using resolutions above 512x512, the VRAM requirements get a steep increase, even just for a q4_0 quantization that would otherwise work smoothly at high resolutions in ComfyUI and Forge on 8GB of VRAM.
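One generic reason VRAM can grow much faster than the image (a rough estimate, not a measurement of sd.cpp): the diffusion model's self-attention works on latent tokens, and a naively materialized attention matrix grows with the square of the token count, so 1024x1024 needs roughly 16x the attention memory of 512x512. UIs that use memory-efficient attention avoid materializing that matrix, which is one reason the same quant can behave very differently elsewhere. The head count and dtype below are assumptions for illustration.

```cpp
#include <cstdio>

// Rough estimate of naive self-attention memory vs. resolution. The model
// attends over latent "tokens" (the latent is 1/8 of the image resolution);
// head count and f16 storage are assumptions for illustration.
int main() {
    const long long heads = 8, bytes = 2;  // assumed: 8 heads, f16 scores
    const double gib = 1024.0 * 1024.0 * 1024.0;
    const int resolutions[] = {512, 768, 1024};
    for (int res : resolutions) {
        const long long tokens = (long long)(res / 8) * (res / 8);
        const long long attn   = heads * tokens * tokens * bytes;  // one score matrix
        printf("%4d x %-4d -> %6lld tokens, attention scores ~%5.2f GiB\n",
               res, res, tokens, attn / gib);
    }
    return 0;
}
```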
Thanks @0cc4m, I was able to patch ggml based on the commit you linked, and it works! I get a bit under 6 s/it on my RX 5700 XT, which isn't lightning fast, but still much faster than on CPU. Memory usage seems perfectly fine.
Hi there, that's good to know. May I ask what your system specs are and which quants you've tested?
For testing, I used … System specs: Ryzen 9 3900X + 32GB DDR4 + RX 5700 XT (8GB VRAM)
Thanks for answering, and yes, I think you should test q4_0 or q4_k, as that would still fully fit in 8GB of VRAM. And could you also test resolutions like 1024x1024? It was mostly when using higher resolutions that the VRAM requirements shot up drastically for me and swapping to shared VRAM started. But then stable-diffusion.cpp just hangs at the sampling stage and does nothing. It also looks like the CLIP model and text encoder get unloaded from RAM when this happens.
@stduhpf any chance you might be able to push a fork up of your working code? Tried to cherry-pick from llama.cpp, but my git-fu isn't up to scratch. Pretty keen to see if this increases performance on AMD APUs.
I've updated my forks with the Vulkan REPEAT fix:
Ah, thanks. I was just about to make a fork of ggml with the fix myself.
Using @SkutteOleg's fork above, Flux now works with Vulkan on my AMD 5600G. Performance for the UNet is around 2.5x better (~30s vs ~75s per iteration), and it seemingly uses far less power (the fan doesn't go wild like it does on CPU). I can't get LoRAs to load, but I think that's a problem with the LoRA loader itself (it fails on CPU & GPU). It seems to be a mismatch between the tensor names in Flux models and what the SD.cpp implementation looks for. That said, it might just be the particular GGUF quant I'm using. If not, I am going to see if I can patch that up later.
46eeff5, try with this.
I've done a bit of playing around with it. This does fix the mappings, but I get another error (I was getting this before too, but had assumed that it was due to the failed LoRA mappings).
Strangely, I get the same issue on CPU unless I explicitly specify … But if I do that on Vulkan via something like the …
I've been trying to trace where this occurs (without …).

EDIT: Not sure why exactly, but if you change the following line in lora.hpp:
... to:
... that appears to fix it. I tried copying the …
8847114, this should fix it.
I'm getting OutOfPoolMemoryError with Vulkan and an 8GB 6600 XT card. The model is …
Thank you for your contribution!
Hey @leejet, this PR requires ggerganov/llama.cpp@0645ed5 to be synced into ggml. Without it, FLUX doesn't work, and performance is worse in general.
@0cc4m I've noticed that performing quantization directly while using txt2img with LoRAs causes the following issue:
The error occurs for any quantization I try to convert to (8_0, 4_0, 5_0, ...) when I apply a LoRA called more_details. Since it's a missing-op type of error, maybe it could be implemented in ggml to fix this?
Yes, that is possible; basically you have to implement a GPU kernel that does quantization, one for each quant you want to support. It's not a priority for me at this time (it might be annoying to implement), but if someone gives it a shot, I'm happy to assist and review.
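For anyone who wants to take that on, here is a CPU-side sketch of the kind of per-block quantization such a kernel would have to reproduce, in the spirit of ggml's q8_0 layout (32 values per block, one shared scale). It's written from memory as an illustration; ggml's own quantize_row_q8_0 is the authoritative reference.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// CPU reference sketch of per-block 8-bit quantization in the spirit of q8_0:
// 32 values per block, one shared scale, x ~= d * qs[i]. Illustration only;
// see ggml's quantize_row_q8_0 for the authoritative layout (d is f16 there).
struct BlockQ8 {
    float  d;       // per-block scale
    int8_t qs[32];  // quantized values
};

std::vector<BlockQ8> quantize_q8(const std::vector<float>& x) {
    std::vector<BlockQ8> out(x.size() / 32);
    for (size_t b = 0; b < out.size(); ++b) {
        float amax = 0.0f;
        for (int i = 0; i < 32; ++i)
            amax = std::max(amax, std::fabs(x[b * 32 + i]));
        const float d = amax / 127.0f;
        out[b].d = d;
        for (int i = 0; i < 32; ++i)
            out[b].qs[i] = (int8_t)std::lround(d > 0.0f ? x[b * 32 + i] / d : 0.0f);
    }
    return out;
}

int main() {
    std::vector<float> x(64);
    for (size_t i = 0; i < x.size(); ++i) x[i] = std::sin(0.1f * (float)i);
    const auto q = quantize_q8(x);
    printf("block 0: scale %.5f, x[3] = %.4f ~= %.4f\n",
           q[0].d, x[3], q[0].d * q[0].qs[3]);
    return 0;
}
```

A GPU version would do the same per-block math inside a compute shader, for example one block (or a group of blocks) per workgroup.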
* Fix includes and init Vulkan the same as llama.cpp
* Add Windows Vulkan CI
* Updated ggml submodule
* Support epsilon as a parameter for ggml_group_norm

---------

Co-authored-by: Cloudwalk <cloudwalk@icculus.org>
Co-authored-by: Oleg Skutte <00.00.oleg.00.00@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>
issue: #256
Looks like they're making some changes to Vulkan shader generation in the ggml repo, and it's currently broken. Will keep an eye on it and update the PR accordingly.