Large language models (LLMs) have shown exceptional performance across various tasks, but their substantial parameter size poses significant challenges for deployment. Layer-wise quantization (LWQ) can greatly reduce the memory footprint required to quantize an LLM, usually by 80-90%, so users can quantize LLMs on a single node with a GPU or CPU. Because only one layer is loaded and quantized at a time, even huge LLMs can be quantized on memory-constrained devices.
Figure 1: The process of layer-wise quantization for an ONNX model. The graph of the LLM is split into several parts, and each subgraph is quantized in turn.
| Types/Framework | Algorithm | ONNX Runtime |
|---|---|---|
| W8A8 Post Training Static Quantization | | ✕ |
| Weight-only Quantization | RTN | ✔ |
| Weight-only Quantization | AWQ | ✕ |
| Weight-only Quantization | GPTQ | ✔ |
```python
import onnx

from onnx_neural_compressor.quantization import matmul_4bits_quantizer

# Load the FP32 ONNX model to quantize (the path here is illustrative).
model = onnx.load("model.onnx")

# Enable layer-wise quantization so each subgraph is quantized in turn.
algo_config = matmul_4bits_quantizer.RTNWeightOnlyQuantConfig(layer_wise_quant=True)
quant = matmul_4bits_quantizer.MatMul4BitsQuantizer(
    model,
    algo_config=algo_config,
)
quant.process()
qmodel = quant.model
```
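Since the table above marks GPTQ as supported with layer-wise quantization, the same flow can be used with a GPTQ config, which additionally needs calibration data. The snippet below is a minimal sketch under the assumption that `GPTQWeightOnlyQuantConfig` accepts a `calibration_data_reader` together with the `layer_wise_quant` flag; the `CalibDataReader` class, input name, and tensor shape are placeholders that should be replaced with real calibration samples for the target model.

```python
import numpy as np
import onnx

from onnx_neural_compressor.quantization import matmul_4bits_quantizer

# Hypothetical calibration data reader: returns one batch of model inputs per
# get_next() call and None when exhausted. Replace the input name and shape
# with the real model's inputs.
class CalibDataReader:
    def __init__(self):
        self._batches = iter([{"input_ids": np.ones((1, 32), dtype=np.int64)}])

    def get_next(self):
        return next(self._batches, None)

    def rewind(self):
        self.__init__()


model = onnx.load("model.onnx")  # illustrative path
algo_config = matmul_4bits_quantizer.GPTQWeightOnlyQuantConfig(
    calibration_data_reader=CalibDataReader(),
    layer_wise_quant=True,
)
quant = matmul_4bits_quantizer.MatMul4BitsQuantizer(
    model,
    algo_config=algo_config,
)
quant.process()
qmodel = quant.model
```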