Large language models (LLMs) have shown exceptional performance across various tasks, but their substantial parameter size poses significant challenges for deployment. Layer-wise quantization (LWQ) can greatly reduce the memory footprint required to quantize an LLM, usually by 80-90%, so users can quantize LLMs on a single node with a GPU or CPU. Because only one layer is loaded and quantized at a time, even huge LLMs can be quantized on memory-constrained devices.
Figure 1: The process of layer-wise quantization for an ONNX model. The graph of the LLM is split into several parts, and each subgraph is quantized in turn.
| Types/Framework | Algorithm | ONNX Runtime |
|---|---|---|
| W8A8 Post Training Static Quantization | | ✕ |
| Weight-only Quantization | RTN | ✔ |
| Weight-only Quantization | AWQ | ✕ |
| Weight-only Quantization | GPTQ | ✔ |
```python
import onnx

from onnx_neural_compressor.quantization import matmul_4bits_quantizer

# Load the FP32 ONNX model to quantize (the path here is illustrative).
model = onnx.load("model.onnx")

# Enable layer-wise quantization so each subgraph is quantized in turn.
algo_config = matmul_4bits_quantizer.RTNWeightOnlyQuantConfig(layer_wise_quant=True)
quant = matmul_4bits_quantizer.MatMul4BitsQuantizer(
    model,
    algo_config=algo_config,
)
quant.process()
qmodel = quant.model
```
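Since the table above marks GPTQ as supported with layer-wise quantization, the same flow can be used with a GPTQ config, which additionally needs calibration data. The snippet below is a minimal sketch under the assumption that `GPTQWeightOnlyQuantConfig` accepts a `calibration_data_reader` together with the `layer_wise_quant` flag; the `CalibDataReader` class, input name, and tensor shape are placeholders that should be replaced with real calibration samples for the target model.

```python
import numpy as np
import onnx

from onnx_neural_compressor.quantization import matmul_4bits_quantizer

# Hypothetical calibration data reader: returns one batch of model inputs per
# get_next() call and None when exhausted. Replace the input name and shape
# with the real model's inputs.
class CalibDataReader:
    def __init__(self):
        self._batches = iter([{"input_ids": np.ones((1, 32), dtype=np.int64)}])

    def get_next(self):
        return next(self._batches, None)

    def rewind(self):
        self.__init__()


model = onnx.load("model.onnx")  # illustrative path
algo_config = matmul_4bits_quantizer.GPTQWeightOnlyQuantConfig(
    calibration_data_reader=CalibDataReader(),
    layer_wise_quant=True,
)
quant = matmul_4bits_quantizer.MatMul4BitsQuantizer(
    model,
    algo_config=algo_config,
)
quant.process()
qmodel = quant.model
```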