
[Quantization Size Issue] Regarding the point at which the model size changes during quantization. #6711

Open
crinex opened this issue Nov 7, 2024 · 2 comments
Labels
module: qnn Related to Qualcomm's QNN delegate

Comments

@crinex

crinex commented Nov 7, 2024

Hi @shewu-quic ~

Could you tell me after which method the actual physical size of the model is reduced when we perform 8a8w quantization on the Llama-3.2-1B & 3B models using the QNN backend?

Thank you~!
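For anyone following along, one way to see where the drop happens is to serialize the program and compare sizes stage by stage. A minimal sketch, assuming the standard PT2E + ExecuTorch export flow; the toy module and the XNNPACK quantizer/partitioner are stand-ins for the Llama model and the QNN pieces (used only so the snippet runs without the Qualcomm SDK), and exact import paths vary slightly across versions:

```python
import torch
from torch.export import export
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
from executorch.exir import EdgeCompileConfig, to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner


class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1024, 1024)

    def forward(self, x):
        return self.linear(x)


example_inputs = (torch.randn(1, 1024),)

# Graph capture; depending on the torch version this step may instead be
# export_for_training(...) or capture_pre_autograd_graph(...).
captured = export(Toy().eval(), example_inputs).module()

# 8-bit activations / 8-bit weights, analogous to the 8a8w config.
quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)            # calibration
converted = convert_pt2e(prepared)   # the "convert" step discussed below

# Lower to ExecuTorch and delegate.
edge = to_edge(
    export(converted, example_inputs),
    compile_config=EdgeCompileConfig(_check_ir_validity=False),
)
edge = edge.to_backend(XnnpackPartitioner())   # the "to_backend" step

# The serialized program is what actually lands on disk; compare this number
# against an un-quantized export of the same model to see where the size drops.
pte = edge.to_executorch()
print("serialized program size:", len(pte.buffer), "bytes")
```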

@JacobSzwejbka
Contributor

JacobSzwejbka commented Nov 12, 2024

For 8-bit weights in the export quantization flows, I believe the conversion actually happens before any delegation to QNN, at the "convert" function call. cc @kimishpatel to double-check that claim.

Edit:
Convert might just inject Q op → DQ patterns everywhere, though; if that's the case, the actual size reduction would come after calling to_backend in the high-level ExecuTorch flow. I'm not sure where specifically in the QNN backend code they directly perform the conversion.
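One way to tell which of those two cases you are in, assuming `converted` is the GraphModule returned by `convert_pt2e` (hypothetical variable name; the exact attribute layout also varies a bit across torch versions): check whether the stored weights are still fp32 tensors feeding quantize ops, or already int8 constants feeding only dequantize ops.

```python
import torch


def fetch_attr(gm: torch.fx.GraphModule, target: str) -> torch.Tensor:
    # Standard FX helper: resolve a possibly dotted attribute path like "linear.weight".
    obj = gm
    for atom in target.split("."):
        obj = getattr(obj, atom)
    return obj


# Quantize ops still left in the graph (excluding dequantize).
quantize_nodes = [
    n for n in converted.graph.nodes
    if n.op == "call_function"
    and "quantize_per" in str(n.target)
    and "dequantize" not in str(n.target)
]
print("quantize ops left in the graph:", len(quantize_nodes))

# Dtypes of the tensors actually stored in the graph module. If the weights
# still show up as float32 here, convert only injected Q -> DQ patterns and
# the real size drop comes later; int8 entries mean the Q op was folded.
for node in converted.graph.nodes:
    if node.op == "get_attr":
        t = fetch_attr(converted, node.target)
        if isinstance(t, torch.Tensor):
            print(node.target, t.dtype, t.numel() * t.element_size(), "bytes")
```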

@JacobSzwejbka added the module: qnn (Related to Qualcomm's QNN delegate) label on Nov 12, 2024
@kimishpatel
Contributor

We do have a constant-propagation pass after convert, so the Q from the "Q op → DQ" pattern should be const-propagated, resulting in a reduced model size.
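In other words, the weight-side Q node is applied to a constant fp32 tensor, so the pass can evaluate it once at export time and keep only the int8 result as the stored constant. A minimal numeric sketch of what that folding buys (plain torch, not the actual pass; scale/zero-point values are made up):

```python
import torch

# fp32 weight as it exists before const propagation: 4 bytes per element.
weight_fp32 = torch.randn(2048, 2048)
scale, zero_point = 0.02, 0

# This is what the weight-side Q node computes; const prop runs it ahead of
# time and stores only the int8 output, while the DQ node stays in the graph.
weight_int8 = torch.clamp(
    torch.round(weight_fp32 / scale) + zero_point, -128, 127
).to(torch.int8)

print("fp32 bytes:", weight_fp32.numel() * weight_fp32.element_size())  # ~16.8 MB
print("int8 bytes:", weight_int8.numel() * weight_int8.element_size())  # ~4.2 MB, 4x smaller
```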
