
How to Apply Different Quantization Settings Per Layer in ExecuTorch? #6846

Open
crinex opened this issue Nov 14, 2024 · 2 comments
Labels: module: quantization, triaged

Comments


crinex commented Nov 14, 2024

Dear @kimishpatel @jerryzh168 @shewu-quic

I want to split a model (e.g., Llama-3.2-3B) into multiple layers and apply different quantization settings (qnn_8a8w, qnn_16a4w, ...) to each layer.
Has such a method been tested in ExecuTorch?
If not, could you suggest how this can be achieved?

Thank you

kimishpatel (Contributor) commented:

Is this specific to the QNN backend, or is your question about backends in general?

GregoryComer added the triaged label on Nov 14, 2024

crinex commented Nov 15, 2024

@kimishpatel

What I meant was the QNN backend.
I’ve noticed that when I quantize the Llama 3.2-3B model using QNN_8A8W, inference doesn’t work properly on the device. Others seem to be reporting similar results in their issues as well.
Therefore, I’m considering solving this problem using mixed precision on the QNN backend. Would that be possible?
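
For reference, per-module quantization configs can be expressed in the PT2E flow that ExecuTorch backends consume. The sketch below uses XNNPACKQuantizer's `set_module_name` only to illustrate the mechanism; whether QnnQuantizer exposes an equivalent per-module hook, and how the qnn_8a8w / qnn_16a4w configs would map onto it, is exactly the open question here and would need confirmation from the QNN backend maintainers. `TinyModel` and the two configs are illustrative stand-ins, not the actual Llama layers or QNN configs.

```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)


class TinyModel(torch.nn.Module):
    """Stand-in for a larger model whose sub-modules get different quant configs."""

    def __init__(self):
        super().__init__()
        self.layer_a = torch.nn.Linear(16, 16)
        self.layer_b = torch.nn.Linear(16, 16)

    def forward(self, x):
        return self.layer_b(torch.relu(self.layer_a(x)))


model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# Capture the model for PT2E quantization (export_for_training on recent
# PyTorch; older ExecuTorch examples use capture_pre_autograd_graph instead).
captured = torch.export.export_for_training(model, example_inputs).module()

# One quantizer, different configs per named sub-module. The two configs here
# (per-channel vs. per-tensor weights) only stand in for "8a8w vs. 16a4w".
quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())
quantizer.set_module_name("layer_a", get_symmetric_quantization_config(is_per_channel=True))
quantizer.set_module_name("layer_b", get_symmetric_quantization_config(is_per_channel=False))

prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)  # run calibration data through the observers
quantized = convert_pt2e(prepared)
```

The same pattern would apply if the QNN quantizer accepts per-module or per-node configs: assign the lower-precision config globally and override the accuracy-sensitive layers by name before lowering to the backend.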
