
[Quantization Size Issue] Regarding the point at which the model size changes during quantization. #6711

Open
crinex opened this issue Nov 7, 2024 · 2 comments
Labels
module: qnn Related to Qualcomm's QNN delegate

Comments

@crinex

crinex commented Nov 7, 2024

Hi @shewu-quic ~

Could you tell me after which method the actual physical size of the model is reduced when we perform 8a8w quantization on the Llama-3.2-1B & 3B models using the QNN backend?

Thank you~!
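For anyone following along, one way to see where the drop happens is to serialize the program and compare sizes stage by stage. A minimal sketch, assuming the standard PT2E + ExecuTorch export flow; the toy module and the XNNPACK quantizer/partitioner are stand-ins for the Llama model and the QNN pieces (used only so the snippet runs without the Qualcomm SDK), and exact import paths vary slightly across versions:

```python
import torch
from torch.export import export
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
from executorch.exir import EdgeCompileConfig, to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner


class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1024, 1024)

    def forward(self, x):
        return self.linear(x)


example_inputs = (torch.randn(1, 1024),)

# Graph capture; depending on the torch version this step may instead be
# export_for_training(...) or capture_pre_autograd_graph(...).
captured = export(Toy().eval(), example_inputs).module()

# 8-bit activations / 8-bit weights, analogous to the 8a8w config.
quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)            # calibration
converted = convert_pt2e(prepared)   # the "convert" step discussed below

# Lower to ExecuTorch and delegate.
edge = to_edge(
    export(converted, example_inputs),
    compile_config=EdgeCompileConfig(_check_ir_validity=False),
)
edge = edge.to_backend(XnnpackPartitioner())   # the "to_backend" step

# The serialized program is what actually lands on disk; compare this number
# against an un-quantized export of the same model to see where the size drops.
pte = edge.to_executorch()
print("serialized program size:", len(pte.buffer), "bytes")
```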

@JacobSzwejbka
Contributor

JacobSzwejbka commented Nov 12, 2024

For 8-bit weights in the export quantization flows, I believe the conversion actually happens before any delegation to QNN, at the "convert" function call. cc @kimishpatel to double-check that claim.

Edit:
Convert might just inject Q op → DQ patterns everywhere, though; if that's the case, the actual size reduction would come after calling to_backend in the high-level ExecuTorch flow. I'm not sure where specifically in the QNN backend code they directly perform the conversion.
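One way to tell which of those two cases you are in, assuming `converted` is the GraphModule returned by `convert_pt2e` (hypothetical variable name; the exact attribute layout also varies a bit across torch versions): check whether the stored weights are still fp32 tensors feeding quantize ops, or already int8 constants feeding only dequantize ops.

```python
import torch


def fetch_attr(gm: torch.fx.GraphModule, target: str) -> torch.Tensor:
    # Standard FX helper: resolve a possibly dotted attribute path like "linear.weight".
    obj = gm
    for atom in target.split("."):
        obj = getattr(obj, atom)
    return obj


# Quantize ops still left in the graph (excluding dequantize).
quantize_nodes = [
    n for n in converted.graph.nodes
    if n.op == "call_function"
    and "quantize_per" in str(n.target)
    and "dequantize" not in str(n.target)
]
print("quantize ops left in the graph:", len(quantize_nodes))

# Dtypes of the tensors actually stored in the graph module. If the weights
# still show up as float32 here, convert only injected Q -> DQ patterns and
# the real size drop comes later; int8 entries mean the Q op was folded.
for node in converted.graph.nodes:
    if node.op == "get_attr":
        t = fetch_attr(converted, node.target)
        if isinstance(t, torch.Tensor):
            print(node.target, t.dtype, t.numel() * t.element_size(), "bytes")
```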

@JacobSzwejbka added the module: qnn (Related to Qualcomm's QNN delegate) label on Nov 12, 2024
@kimishpatel
Contributor

We do have a constant-propagation pass after convert, so the Q from the "Q op → DQ" pattern should be const-propagated, resulting in a reduced model size.
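In other words, the weight-side Q node is applied to a constant fp32 tensor, so the pass can evaluate it once at export time and keep only the int8 result as the stored constant. A minimal numeric sketch of what that folding buys (plain torch, not the actual pass; scale/zero-point values are made up):

```python
import torch

# fp32 weight as it exists before const propagation: 4 bytes per element.
weight_fp32 = torch.randn(2048, 2048)
scale, zero_point = 0.02, 0

# This is what the weight-side Q node computes; const prop runs it ahead of
# time and stores only the int8 output, while the DQ node stays in the graph.
weight_int8 = torch.clamp(
    torch.round(weight_fp32 / scale) + zero_point, -128, 127
).to(torch.int8)

print("fp32 bytes:", weight_fp32.numel() * weight_fp32.element_size())  # ~16.8 MB
print("int8 bytes:", weight_int8.numel() * weight_int8.element_size())  # ~4.2 MB, 4x smaller
```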
