Quantization

Quantization refers to techniques that enable lower-precision inference and training by performing computations with fixed-point integers that use fewer bits than floating-point representations. This typically yields smaller model sizes and faster inference. Quantization is particularly useful in deep learning inference and training, where moving data quickly and reducing bandwidth bottlenecks matters most. Intel is actively working on techniques that use lower numerical precision, training with 16-bit multipliers and running inference with 8-bit or 16-bit multipliers. Refer to the Intel article on lower numerical precision inference and training in deep learning.
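The float-to-integer mapping can be made concrete with a small sketch. The affine scheme below is illustrative, not an API from any particular library; the names `quantize`, `dequantize`, `scale`, and `zero_point` and the chosen parameters are assumptions for the example:

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Affine mapping to int8: q = round(x / scale) + zero_point, clamped to the int8 range.
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    # Approximate recovery of the float value: x ≈ (q - zero_point) * scale.
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float32)
scale, zero_point = 4.0 / 255, 0  # hypothetical parameters covering roughly [-2.0, 2.0]
q = quantize(x, scale, zero_point)
print(q)                                 # [-64   0  32 127]
print(dequantize(q, scale, zero_point))  # values close to x, up to rounding error
```

Because the int8 values occupy a quarter of the memory of float32, the model shrinks and memory bandwidth per inference drops, at the cost of the small rounding error visible in the reconstructed values.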

Quantization methods include the following three classes:

- Post-Training Static Quantization (PTQ): weights and activations are quantized ahead of inference, using a calibration dataset to determine activation ranges.
- Post-Training Dynamic Quantization: weights are quantized ahead of inference, while activation scales are computed on the fly at runtime.
- Quantization-Aware Training (QAT): quantization effects are simulated during training so the model learns to compensate for the reduced precision.

Note: Dynamic Quantization currently only supports the onnxruntime backend.
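For a concrete picture of the dynamic class, the sketch below uses ONNX Runtime's own `quantize_dynamic` utility, in line with the note above; the model paths are hypothetical, and the options this project actually exposes may differ:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Convert a float32 ONNX model into one with int8 weights; activation
# scales are computed dynamically at runtime, so no calibration data is needed.
quantize_dynamic(
    model_input="model_fp32.onnx",   # hypothetical path to an exported ONNX model
    model_output="model_int8.onnx",  # hypothetical output path
    weight_type=QuantType.QInt8,
)
```

Since the weights are converted offline and activation ranges are measured at runtime, dynamic quantization needs no calibration dataset, which makes it the easiest of the three classes to apply.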