
Refactor: wtype per tensor from file instead of global #455

Open
stduhpf wants to merge 4 commits into master

Conversation

@stduhpf (Contributor) commented Nov 1, 2024

I'm not sure if it makes a significant difference yet.
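
The core idea: instead of forcing every weight to one global `wtype`, keep the type each tensor already has in the model file and create the ggml tensor with that type, so no conversion pass is needed at load time. Here is a minimal sketch of that decision point (assuming ggml-style tensor creation; `TensorStorage`, `tensor_storage_map`, and `create_weight` are illustrative names, not the actual sd.cpp code):

```cpp
// Illustrative sketch only, not the actual sd.cpp implementation.
#include "ggml.h"
#include <cstdint>
#include <map>
#include <string>

// Hypothetical record of what the model file says about one tensor.
struct TensorStorage {
    enum ggml_type type;  // type as stored in the .gguf/.safetensors file
    int64_t ne[4];        // shape
    int n_dims;
};

// Before: every weight is created as `global_wtype`, so tensors stored in a
// different type must be converted while loading.
// After: each weight is created with the file's own type, so the data can be
// copied straight into place with no conversion.
static struct ggml_tensor* create_weight(struct ggml_context* ctx,
                                         const std::map<std::string, TensorStorage>& tensor_storage_map,
                                         const std::string& name,
                                         enum ggml_type global_wtype,
                                         bool per_tensor_type) {
    const TensorStorage& ts = tensor_storage_map.at(name);
    enum ggml_type type = per_tensor_type ? ts.type : global_wtype;
    return ggml_new_tensor(ctx, type, ts.n_dims, ts.ne);
}
```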

@stduhpf (Contributor, Author) commented Nov 23, 2024

Trying master again after having had these changes in place for a while, it feels like this PR reduces loading times, but maybe something else is affecting it.

@Green-Sky (Contributor)

Well, it removes conversions. I ran into this today: I loaded Flux with a q8_0 t5 and an f16 clip, and was wondering why t5 was using f16 (including RAM usage). It turns out sd.cpp can only have one conditioner wtype right now...
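
To put a rough number on the RAM side of that (an illustrative sketch, not code from this PR): q8_0 stores about 1.06 bytes per weight while f16 stores 2, so force-loading a q8_0 t5xxl as f16 roughly doubles its footprint. ggml's `ggml_row_size` makes that easy to check:

```cpp
// Rough illustration of the memory cost of upcasting q8_0 weights to f16.
#include "ggml.h"
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t n = 4096;  // illustrative row width
    // q8_0: 34-byte blocks of 32 weights (~1.06 B/weight); f16: 2 B/weight.
    printf("q8_0 row of %lld weights: %zu bytes\n", (long long) n, ggml_row_size(GGML_TYPE_Q8_0, n));
    printf("f16  row of %lld weights: %zu bytes\n", (long long) n, ggml_row_size(GGML_TYPE_F16,  n));
    return 0;
}
```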

@fszontagh (Contributor) commented Nov 23, 2024

Maybe this is the wrong place to write this, but I started a Flux diffusion, which loaded the model into VRAM and then started the diffusion on the CPU. Please see the screenshot below:

[screenshot]

UPDATE: Sorry, my mistake. The computation was run on the GPU; maybe it was the conversion that I saw.

@Green-Sky (Contributor)

You probably saw the embedding models.

@stduhpf (Contributor, Author) commented Nov 25, 2024

I can now confirm that this PR makes loading weights much faster for larger models.

(Results include warm runs only, so the models are always in the disk cache.)

| Model (types) | master (model loading time) | PR (model loading time) |
| --- | --- | --- |
| Flux.1 [Schnell] (q3_k, ae: q8_0, clip_l: q8_0, t5xxl: q4_k) | 92.37 s | 6.08 s |
| SD3.5 Medium (q8_0, clip_l: f16, clip_g: f16, t5xxl: q4_k) | 18.19 s | 5.37 s |
| SD3.5 Large (q4_0, clip_l: q8_0, clip_g: f16, t5xxl: q4_k) | 104.6 s (!) | 5.95 s |
| SDXL (q8_0, vae: f16) | 2.70 s | 2.72 s |
| SD 1.5 (f16) | 1.43 s | 1.42 s |

I think this makes this PR worth merging.

@stduhpf marked this pull request as ready for review on Nov 25, 2024
@stduhpf changed the title from "Refactor Idea: wtype per tensor from file instead of global" to "Refactor: wtype per tensor from file instead of global" on Nov 25, 2024
@Green-Sky (Contributor)

But how do the other performance metrics change? Also, does it all work?

@stduhpf (Contributor, Author) commented Nov 25, 2024

> Also, does it all work?

I think so. Surprisingly, results aren't always exactly the same, but they're very close nonetheless. I'm not sure what's up with that (maybe it has something to do with the text encoders, or maybe it's just #439, since I'm using Vulkan). Actually, results are exactly the same except when using different quants for the text encoders. I have tried various models so far, and they all worked for both txt2img and img2img. I haven't tried LoRAs, PhotoMaker, or ControlNets (or anything that isn't supported by Vulkan) yet, though.

> But how do the other performance metrics change?

Diffusion/sampling and VAE performance are within the margin of error. Prompt encoding is significantly faster when mixing quantizations.

Edit: PhotoMaker (V1 and V2) works. LoRAs work too (on CPU, and without quantization on Vulkan).
