Replies: 2 comments
-
I don't believe you can get good results omitting whole layers. There are early-exit decoding strategies that try to identify when later layers aren't going to contribute anything important to the residual, but that's on a token-by-token basis, and it gets complicated when you still need keys/values from skipped layers for later tokens. As a rule, a transformer trained with n layers is going to need all of them, and I've never seen one that stays (reliably) coherent if any one layer is skipped. There are definitely layers that are more important than others, which is what the variable encoding scheme in EXL2 exploits by assigning less precision to the less important ones (and GGUF does something similar), but I don't think leaving them out entirely is going to produce stable models without some finetuning to compensate. Trying to work that whole process into the quantization also just seems needlessly complicated. It would be more fruitful, I suspect, to prune the model first and then quantize the result, leaning on some of the work that's already been done with e.g. Sheared Llama.
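If you want to see the effect for yourself, here's a rough sketch (the model name and eval text are just placeholders, and this assumes a Llama-style checkpoint where the decoder stack lives in `model.model.layers`) that deletes a single layer and compares perplexity before and after:

```python
# Sketch: drop one decoder layer from a Llama-style HF model and compare
# perplexity on a small text sample. Placeholder model/text, not a benchmark.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any Llama-style checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

text = "The quick brown fox jumps over the lazy dog. " * 50  # stand-in eval text
ids = tok(text, return_tensors="pt").input_ids.to(model.device)

def ppl(m):
    with torch.no_grad():
        loss = m(ids, labels=ids, use_cache=False).loss
    return torch.exp(loss).item()

print("baseline ppl:", ppl(model))

# Remove layer 20 (arbitrary choice) by rebuilding the ModuleList without it.
skip = 20
model.model.layers = torch.nn.ModuleList(
    [l for i, l in enumerate(model.model.layers) if i != skip]
)
model.config.num_hidden_layers -= 1
print(f"ppl without layer {skip}:", ppl(model))
```

In my experience the perplexity hit from dropping even one mid-stack layer is much larger than what you'd save by spending those bits elsewhere.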
-
Alright, thanks for your reply! I haven't tried pruning yet and was just wondering whether it might be worthwhile to check for potential omissions or summarization during the optimization routine. Especially at higher degrees of quantization, it makes you wonder what impact some layers with very low bitrates can actually have. All those frankenmerges currently flying around may have given me the impression that transformers are more flexible in their architecture than they actually are.
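For what it's worth, a quick toy experiment like the one below (random weights, plain round-to-nearest rather than EXL2's actual scheme) gives a feel for how quickly the output of a single matmul degrades at very low bit widths:

```python
# Toy illustration: fake-quantize a random weight matrix with symmetric
# round-to-nearest at a few bit widths and report the relative output error
# on random activations. Shapes and scales are made up, not from a real model.
import torch

torch.manual_seed(0)
W = torch.randn(4096, 4096) * 0.02   # stand-in for one layer's weight matrix
x = torch.randn(64, 4096)            # stand-in activations

def fake_quant(w, bits):
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax                  # per-tensor symmetric scale
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

ref = x @ W.T
for bits in (8, 4, 3, 2):
    out = x @ fake_quant(W, bits).T
    rel = (out - ref).norm() / ref.norm()
    print(f"{bits}-bit RTN: relative output error {rel:.3f}")
```

Even so, a layer that reproduces its output only roughly still seems to carry more of the model than no layer at all, which matches what you said about skipping layers outright.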
-
Hi,
first of all, I really like this repo!
I was wondering about using pruning as an option to further trade off model size and inference speed against performance.
Since we are already measuring layer influence for different quantization values, don't we also get a good idea of which layers we could leave out completely instead of quantizing them?
The idea is that if, at a certain quantization grade, we have several more or less meaningless layers (rather than just a single one) that don't actually contribute much, leaving them out would allow the other layers to be quantized at a higher bpw and maybe improve model performance.
I'm not an expert in LLM architecture and just wonder whether it would even be possible to dynamically prune and quantize a model without losing coherence, or at least without having to do retraining. What do you think?
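To make the idea concrete, here's a rough sketch of what I mean (all sensitivity scores, bpw choices and budget numbers below are made up, not taken from the actual measurement pass): take per-layer sensitivity scores, optionally drop the layers below some threshold, and hand the freed-up bit budget to the most sensitive remaining layers.

```python
# Sketch: greedy bpw allocation with optional layer dropping.
# 'sensitivity' stands in for whatever per-layer error the measurement pass
# produces; all numbers in the example are invented for illustration.
def allocate_bpw(sensitivity, avg_bpw, choices=(2.5, 3.0, 4.0, 5.0, 6.0),
                 drop_below=None):
    n = len(sensitivity)
    keep = [i for i, s in enumerate(sensitivity)
            if drop_below is None or s >= drop_below]
    budget = avg_bpw * n                    # total budget stays fixed
    bpw = {i: choices[0] for i in keep}     # start every kept layer at the lowest bpw
    spent = choices[0] * len(keep)
    # Greedily upgrade the most sensitive layers while budget remains.
    for i in sorted(keep, key=lambda i: -sensitivity[i]):
        for c in choices[1:]:
            if spent + (c - bpw[i]) <= budget:
                spent += c - bpw[i]
                bpw[i] = c
    dropped = [i for i in range(n) if i not in bpw]
    return bpw, dropped

# Example: 8 layers, average budget of 3.5 bpw, drop layers scoring under 0.05.
scores = [0.30, 0.02, 0.12, 0.04, 0.25, 0.08, 0.01, 0.40]
bpw, dropped = allocate_bpw(scores, 3.5, drop_below=0.05)
print("dropped layers:", dropped)
print("bpw per kept layer:", bpw)
```

The open question, of course, is whether a dropped layer behaves anything like a layer quantized to very few bits, or whether the model simply breaks.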