Quantization

Quantization refers to processes that enable lower-precision inference and training by performing computations with fixed-point integers, which have lower precision than floating-point values. This often leads to smaller model sizes and faster inference. Quantization is particularly useful in deep learning inference and training, where moving data quickly and reducing memory bandwidth bottlenecks matter. Intel is actively working on lower-numerical-precision techniques: training with 16-bit multipliers, and inference with 8-bit or 16-bit multipliers. Refer to the Intel article on lower numerical precision inference and training in deep learning.
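To make the idea of mapping floating-point values onto fixed-point integers concrete, here is a minimal sketch of affine (scale and zero-point) 8-bit quantization in plain Python. The function names `quantize_params`, `quantize`, and `dequantize` are hypothetical and chosen for illustration only; they are not the API of any library described in this document, and real backends may use a different scheme (for example symmetric or per-channel quantization).

```python
def quantize_params(x_min, x_max, num_bits=8):
    """Return (scale, zero_point) mapping [x_min, x_max] onto unsigned integers.

    Hypothetical helper for illustration; not a library API.
    """
    qmax = 2 ** num_bits - 1
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    # Fall back to a scale of 1.0 if the range is empty (all values are zero).
    scale = (x_max - x_min) / qmax or 1.0
    zero_point = round(-x_min / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1], saturating at the ends."""
    qmax = 2 ** num_bits - 1
    return [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, zero_point = quantize_params(min(weights), max(weights))
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)

# Each restored value differs from the original by at most one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, restored, max_error)
```

Each quantized value fits in one byte instead of the four bytes of a 32-bit float, which is where the smaller model size and reduced memory traffic come from. The cost is a rounding error bounded by the step size `scale`.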

Quantization methods include the following three classes:

Post-Training Quantization (PTQ)
Quantization-Aware Training (QAT)
Dynamic Quantization

Note

Dynamic Quantization is currently supported only on the onnxruntime backend.
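The classes listed above differ mainly in when the quantization range is determined. The sketch below illustrates this in plain Python under stated assumptions: a static scheme fixes the range ahead of time from calibration data (as in post-training quantization), while a dynamic scheme recomputes it from each input at run time. The quantize-then-dequantize step (`fake_quantize`) is also the kind of operation that quantization-aware training commonly simulates in the forward pass. All names and the calibration data are hypothetical; this is not the implementation used by any particular backend.

```python
def affine_params(lo, hi, qmax=255):
    """Hypothetical helper: scale and zero point for the range [lo, hi]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / qmax or 1.0
    return scale, round(-lo / scale)


def fake_quantize(values, scale, zero_point, qmax=255):
    """Quantize to 8-bit integers, then dequantize back to floats.

    Values outside the range covered by (scale, zero_point) saturate.
    """
    out = []
    for v in values:
        q = max(0, min(qmax, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out


# Static: the range is fixed once from (made-up) calibration batches.
calibration_batches = [[-0.8, 0.1, 0.9], [-1.0, 0.3, 0.7]]
flat = [v for batch in calibration_batches for v in batch]
static_params = affine_params(min(flat), max(flat))


def run_static(activations):
    # Reuses the precomputed range; no per-input range computation.
    return fake_quantize(activations, *static_params)


def run_dynamic(activations):
    # Computes the range from the actual input on every call.
    params = affine_params(min(activations), max(activations))
    return fake_quantize(activations, *params)


# An input with a value outside the calibrated range.
outlier_input = [-0.5, 0.2, 3.0]
static_out = run_static(outlier_input)
dynamic_out = run_dynamic(outlier_input)
print(static_out, dynamic_out)
```

In this sketch the static path saturates the out-of-range value 3.0 near the calibrated maximum, while the dynamic path represents it closely at the cost of computing the range on every call and using a coarser step for the smaller values.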
