Neural Compressor provides ONNX model quantization techniques inherited from Intel Neural Compressor, including Post-training Quantization and Weight-only Quantization.
- [Features](#features)
- [Validated Configurations](#validated-configurations)
## Features
- Support Post-training Quantization, including both static and dynamic approaches (see the first sketch after this list)
- Support SmoothQuant for Post-training Quantization (second sketch below)
- Support Weight-only Quantization with several algorithms, including RTN, GPTQ, and AWQ (third sketch below)
- Support layer-wise quantization for RTN and GPTQ
- Validate popular LLMs such as Llama3, Phi-3, and Qwen2 with weight-only quantization on multiple Intel hardware platforms, such as Intel Xeon Scalable Processors and Intel Core Ultra Processors
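
As a minimal sketch of the post-training flow, the snippet below quantizes an ONNX model both statically (with a calibration data reader) and dynamically. The `quantize` entry point, the `StaticQuantConfig`/`DynamicQuantConfig` names, and the `CalibrationDataReader` base class are assumptions here, modeled on the onnxruntime-style quantization API; check the project documentation for the exact interface.

```python
# Hypothetical sketch: module, class, and parameter names are assumptions
# modeled on the onnxruntime-style quantization API; check the project
# documentation for the exact interface.
import numpy as np

from onnx_neural_compressor import data_reader
from onnx_neural_compressor.quantization import config, quantize


class RandomDataReader(data_reader.CalibrationDataReader):
    """Feeds a few random samples for calibration; replace with real data."""

    def __init__(self):
        self._data = [
            {"input_ids": np.random.randint(0, 100, (1, 32)).astype(np.int64)}
            for _ in range(8)
        ]
        self._iter = iter(self._data)

    def get_next(self):
        return next(self._iter, None)

    def rewind(self):
        self._iter = iter(self._data)


# Static PTQ pre-computes activation scales from calibration data;
# dynamic PTQ computes them at runtime and needs no calibration set.
static_cfg = config.StaticQuantConfig(calibration_data_reader=RandomDataReader())
quantize("model.onnx", "model_int8_static.onnx", static_cfg)

dynamic_cfg = config.DynamicQuantConfig()
quantize("model.onnx", "model_int8_dynamic.onnx", dynamic_cfg)
```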
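
SmoothQuant eases static quantization of LLM-style models by migrating activation outliers into the weights: activations are divided by a per-channel smoothing factor and the corresponding weight rows are multiplied by it, leaving the matmul output unchanged. Enabling it on top of static PTQ might look like the following sketch; the `extra_options` keys `SmoothQuant` and `SmoothQuantAlpha` are assumptions borrowed from the onnxruntime-style options dictionary.

```python
# Hypothetical sketch: option and class names are assumptions; verify
# against the project documentation.
from onnx_neural_compressor.quantization import config, quantize

# RandomDataReader is the calibration reader from the previous sketch.
sq_cfg = config.StaticQuantConfig(
    calibration_data_reader=RandomDataReader(),
    extra_options={
        "SmoothQuant": True,      # fold activation outliers into the weights
        "SmoothQuantAlpha": 0.5,  # split difficulty between activations and weights
    },
)
quantize("model.onnx", "model_int8_smoothquant.onnx", sq_cfg)
```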
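
For weight-only quantization of LLMs, a sketch along the following lines compresses MatMul weights to 4 bits with RTN; the `matmul_nbits_quantizer` module, its config classes, and the `layer_wise_quant` flag are assumptions patterned after the project's examples. GPTQ and AWQ would use their own config classes plus a calibration data reader, since both tune the quantized weights against real activations.

```python
# Hypothetical sketch: module and class names are assumptions; check the
# project documentation for the exact interface.
from onnx_neural_compressor.quantization import matmul_nbits_quantizer

# RTN rounds weights to the nearest quantized value with no calibration;
# GPTQ/AWQ configs would be swapped in for the calibration-based algorithms.
algo_config = matmul_nbits_quantizer.RTNWeightOnlyQuantConfig(
    layer_wise_quant=True,  # quantize one layer at a time to bound peak memory
)
quant = matmul_nbits_quantizer.MatMulNBitsQuantizer(
    "model.onnx",
    block_size=32,      # one scale per block of 32 weights
    is_symmetric=True,  # symmetric 4-bit range
    algo_config=algo_config,
)
quant.process()
quant.model.save_model_to_file("model_int4.onnx", use_external_data_format=True)
```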
## Validated Configurations
- OS version: CentOS 8.4, Ubuntu 22.04
- Python version: 3.10
- ONNX Runtime version: 1.18.1