Replies: 2 comments
-
I have some ideas I want to explore, like Q8, Q3 and Q2 modes, and I expect the kernels could be optimized a bit for performance. But it's not a very high priority right now. I'll probably be focusing on weight quantization, batch performance and a new speculative mode.
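For readers unfamiliar with cache quantization modes like Q4 or Q8: the general idea is to store KV cache values at reduced bit width, with a shared scale per small group of values, and dequantize on the fly. The sketch below is a generic group-wise symmetric 4-bit scheme in numpy, purely illustrative; it is not exl2's actual kernel or storage format, and `group_size` is an assumed parameter.

```python
import numpy as np

def quantize_q4(x, group_size=32):
    # Group-wise symmetric 4-bit quantization: each group of
    # `group_size` values shares one floating-point scale.
    x = x.reshape(-1, group_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0  # map to int range [-7, 7]
    scale[scale == 0] = 1.0                             # avoid division by zero
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_q4(q, scale):
    # Reconstruct an approximation of the original values.
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
kv = rng.standard_normal(1024).astype(np.float32)  # stand-in for one cache row
q, s = quantize_q4(kv)
kv_hat = dequantize_q4(q, s)
max_err = float(np.abs(kv - kv_hat).max())
```

With a group scale, the worst-case error per value is half a quantization step, which is why Q4 can stay close to FP16 quality while using a quarter of the cache memory (plus a small overhead for the scales).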
-
Q-Cache is definitely a prime feature of exl2, and more options like Q8 or Q6 sound interesting for sure. Always good to have a use for some extra VRAM at the least. But it's understandable that weight quantization and such have a higher priority right now. Looking forward to seeing exl2 continue to improve!
-
Just noticed that it now works properly at lower context depths too, like FP16. Nice work; it seems much more on par with FP16! Are there any more updates to the Q4 Cache in the works, or do you consider it to be in a good place right now?