Replies: 2 comments 4 replies
-
Okay, I've given the library a good look. It seems what I'm interested in doing is adding an option to the StableDiffusionGGML class to swap the CLIP encoder, the VAE, and the diffusion model itself between CPU and VRAM as they are needed, rather than loading them all into VRAM at once in the constructor. This option only makes sense with a GPU backend, so it will only take effect when one is in use; otherwise, it will log a message like "No GPU backend, ignoring --medvram." I will go ahead and start working on it, but if there are any suggestions or corrections, please voice them. This is my first time working on any sort of ML stuff since college, so I may have misunderstood things!
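To make the idea concrete, here is a rough sketch of the shape I have in mind. Everything below is hypothetical: the member names, the `Module` type, and the helper do not exist in the current StableDiffusionGGML class; this is just the structure I'd aim for.

```cpp
#include <functional>

// Hypothetical sketch only -- placeholder names for the proposed option,
// not the current stable-diffusion.cpp API.
struct Module {
    // In the real code this would wrap the ggml tensors for CLIP, the
    // diffusion model, or the VAE.
    void move_weights_to_gpu() { /* copy weight tensors into a VRAM buffer */ }
    void free_gpu_weights()    { /* release that buffer, keep the host copy */ }
};

struct StableDiffusionGGML {
    bool offload_to_cpu  = false;  // set by a --medvram-style flag
    bool has_gpu_backend = false;  // true only for CUDA/Metal/etc. backends

    // Run one stage (encode / denoise / decode) with its module resident in
    // VRAM, then evict it so the next stage has the whole GPU to itself.
    void with_module_on_gpu(Module & m, const std::function<void()> & run) {
        if (offload_to_cpu && has_gpu_backend) m.move_weights_to_gpu();
        run();
        if (offload_to_cpu && has_gpu_backend) m.free_gpu_weights();
    }
};
```

The key point is that each of the three big modules keeps a host copy of its weights and only occupies VRAM while its stage is actually running.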
-
It seems like the primary bottleneck for supporting lower VRAM is the need for flash attention on the GPU. GPU programming is well beyond my expertise, but I'm still going to try to implement what I described in this discussion, in the hope that it can speed up generation on machines that can't fit everything into GPU memory at once.
-
With quantization, it is possible to fit Stable Diffusion 1.5 models on my GPU with 2 GB of VRAM. It even generates images at about 6 sec/it, which is very fast for this hardware. However, we lose a number of features when the weights are quantized, notably LoRA. 1.5 is, of course, a bit dated by now, but it's good enough (especially with fine-tuned checkpoints) to be used on low-end devices.
In stable-diffusion-webui, the --medvram option is very useful: it lets me use full models, LoRA, the VAE, and even the CLIP model within VRAM on a machine with tight memory constraints. I understand that this project is not stable-diffusion-webui, but I still think there is a use case for it here.
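To put rough numbers on why quantization makes this fit (ballpark figures, not measurements from this repo): the SD 1.5 UNet has on the order of 860M parameters, so at fp16 the weights alone are about 860M × 2 bytes ≈ 1.7 GB, while a 4-bit ggml quantization such as q4_0 (roughly 4.5 bits per weight once block scales are included) comes to roughly 0.5 GB, which is what leaves headroom for activations on a 2 GB card.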
From here, I learned that this optimization works by splitting the model up into its components and only keeping what is actively in use on the GPU.
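Concretely, as I understand the webui behaviour, only one of the big modules is resident in VRAM at any point during a generation: the text encoder is moved to the GPU just long enough to encode the prompt, the UNet then takes its place for the sampling steps, and finally the VAE is moved over to decode the latents into the image.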
I believe that getting Stable Diffusion running on more devices, at faster speeds, should be central to the mission of this project.
I would be more than willing to learn more about this codebase, implement the feature on my own, and maintain it if necessary. However, I will definitely need help! If anyone contributing to this project could point me in the right direction on how to implement something like this, I would be very grateful. I'm reviewing how it's done in webui, but as you can imagine with Python, it's buried under a lot of abstraction.