Could you tell me after which method the actual physical size of the model is reduced when we perform 8a8w quantization on the Llama-3.2-1B & 3B models using the QNN backend?
Thank you~!
For 8-bit weights in the export quantization flows, I believe the conversion actually happens before any delegation to QNN, at the "convert" function call. cc @kimishpatel to double-check that claim.
edit:
Convert might just inject Q/DQ op patterns everywhere, though; if that's the case, the actual size reduction would happen after calling to_backend in the high-level ExecuTorch (ET) flow. I'm not sure where specifically in the QNN backend code they directly perform the conversion.
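For context, here is a minimal sketch of where those two steps sit in the high-level ExecuTorch PT2E flow, assuming the standard prepare/convert/to_backend path. The `QnnQuantizer` / `QnnPartitioner` import paths, the `compiler_specs` argument, and the file name are illustrative placeholders and may differ from the actual Llama export script.

```python
# Hedged sketch of the PT2E + QNN lowering flow, not the exact Llama-3.2 script.
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from executorch.exir import to_edge
# Assumed import paths; check your ExecuTorch version.
from executorch.backends.qualcomm.quantizer.quantizer import QnnQuantizer
from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner

model = ...           # e.g. a Llama-3.2-1B eager module (placeholder)
example_inputs = ...  # calibration-shaped inputs (placeholder)

# 1) Export and prepare with the QNN quantizer (inserts observers only).
exported = torch.export.export(model, example_inputs)
prepared = prepare_pt2e(exported.module(), QnnQuantizer())

# 2) Calibrate.
prepared(*example_inputs)

# 3) Convert: quantization parameters are baked in as Q/DQ patterns.
#    The graph may still carry fp32-sized weights (or int8 weights plus
#    dequant nodes), so the on-disk size win is not guaranteed yet.
converted = convert_pt2e(prepared)

# 4) Lower: to_backend hands the partitioned subgraphs to the QNN backend,
#    which compiles them into a QNN context binary with int8 weights; this
#    is where the serialized .pte would actually shrink.
edge = to_edge(torch.export.export(converted, example_inputs))
edge = edge.to_backend(QnnPartitioner(compiler_specs=...))  # placeholder specs
executorch_program = edge.to_executorch()

with open("llama_8a8w_qnn.pte", "wb") as f:  # hypothetical output name
    f.write(executorch_program.buffer)
```

Under that reading, comparing the file size before and after step 4 (rather than after convert_pt2e) should show where the reduction actually lands.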
Hi @shewu-quic ~
Could you confirm where in the QNN backend the actual physical size reduction happens for these models?
Thank you~!