
# Layer Wise Quantization (LWQ)

1. Introduction
2. Supported Framework Model Matrix
3. Examples

## Introduction

Large language models (LLMs) show exceptional performance across a wide range of tasks, but their substantial parameter counts pose significant challenges for deployment. Layer-wise quantization (LWQ) greatly reduces the memory footprint required during quantization, usually by 80-90%, so users can quantize LLMs on a single node using a GPU or CPU. Because only one part of the model is loaded and processed at a time, even very large LLMs can be quantized on memory-constrained devices.

Figure 1: The process of layer-wise quantization for an ONNX model. The LLM graph is split into several parts, and each subgraph is quantized in turn.

## Supported Framework Model Matrix

| Types/Framework | ONNX Runtime |
| --- | --- |
| W8A8 Post Training Static Quantization | |
| Weight-only Quantization | RTN / AWQ / GPTQ |

## Examples
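
The snippet below enables layer-wise quantization for RTN weight-only quantization with ONNX Runtime. Here `model` is assumed to be the ONNX model to quantize (a loaded `onnx.ModelProto` or a path to the model file).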

```python
from onnx_neural_compressor.quantization import matmul_4bits_quantizer

# `model` is the FP32 LLM to quantize (an onnx.ModelProto or a path to the ONNX model file).
# Setting layer_wise_quant=True quantizes the graph subgraph by subgraph to limit peak memory.
algo_config = matmul_4bits_quantizer.RTNWeightOnlyQuantConfig(layer_wise_quant=True)
quant = matmul_4bits_quantizer.MatMul4BitsQuantizer(
    model,
    algo_config=algo_config,
)
quant.process()
qmodel = quant.model  # the layer-wise quantized model
```
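
Layer-wise quantization can be combined with the other listed weight-only algorithms by swapping the config class. Below is a minimal sketch for GPTQ, assuming `GPTQWeightOnlyQuantConfig` accepts the same `layer_wise_quant` flag plus a user-supplied calibration data reader (`data_reader` is a placeholder name here):

```python
from onnx_neural_compressor.quantization import matmul_4bits_quantizer

# Sketch: GPTQ weight-only quantization with layer-wise processing enabled.
# `data_reader` is a user-provided calibration data reader (following ONNX Runtime's
# CalibrationDataReader-style get_next() interface); `model` is the same ONNX model as above.
algo_config = matmul_4bits_quantizer.GPTQWeightOnlyQuantConfig(
    calibration_data_reader=data_reader,
    layer_wise_quant=True,
)
quant = matmul_4bits_quantizer.MatMul4BitsQuantizer(
    model,
    algo_config=algo_config,
)
quant.process()
qmodel = quant.model
```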