- 4.2.3.1 Test Environment
- 4.2.3.2 Installation
- 4.2.3.3 Post Training Quantization
- 4.2.3.4 Running vai_q_onnx
- 4.2.3.5 List of Vai_q_onnx Supported Quantized Ops
- 4.2.3.6 vai_q_onnx APIs
To quantize UIF models, use the Vitis™ AI PyTorch Quantizer tool (vai_q_pytorch). Refer to the vai_q_pytorch and the relevant user guide for detailed information.
- pt_quant_config.json is a special configuration for ZenDNN deployment. All quantization for ZenDNN deployment must apply this configuration.
- For multiple quantization strategy configurations, vai_q_pytorch supports a quantization configuration file in JSON format. You only need to pass the configuration file to the torch_quantizer API.
- For detailed information of the JSON file contents, refer to Quant_Config.md.
Using pt_resnet18_quant.py as an example:
- Copy pt_resnet18_quant.py to the Docker® environment.
- Download the pre-trained Resnet18 model:
wget https://download.pytorch.org/models/resnet18-5c106cde.pth -O resnet18.pth
- Prepare Imagenet validation images. See PyTorch example repo for reference.
- Modify default data_dir and model_dir in pt_resnet18_quant.py.
- Quantize, using a subset (200 images) of the validation data for calibration. Because this is the quantize calibration process, the displayed loss and accuracy are meaningless.
python pt_resnet18_quant.py --quant_mode calib --subset_len 200 --config_file ./pt_quant_config.json
- Evaluate the quantized model.
python pt_resnet18_quant.py --quant_mode test
- Import vai_q_pytorch modules.
from pytorch_nndct.apis import torch_quantizer
- Generate a quantizer with quantization needed input and get converted model.
input = torch.randn([batch_size, 3, 224, 224])
quantizer = torch_quantizer(
    quant_mode, model, (input), device=device, quant_config_file=config_file)
quant_model = quantizer.quant_model
- Forwarding with converted model.
acc1_gen, acc5_gen, loss_gen = evaluate(quant_model, val_loader, loss_fn)
- Output quantization result and deploy model.
if quant_mode == 'calib':
    quantizer.export_quant_config()
if quant_mode == 'test':
    quantizer.export_onnx_model()
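Putting these snippets together, a minimal calibration/test script might look like the following sketch. It assumes that `model`, `val_loader`, `loss_fn`, and the `evaluate` helper are defined as in pt_resnet18_quant.py; only the torch_quantizer calls are taken from the API above.

```python
# Minimal sketch of the vai_q_pytorch flow; model, val_loader, loss_fn, and
# evaluate are assumed to be defined as in pt_resnet18_quant.py.
import torch
from pytorch_nndct.apis import torch_quantizer

quant_mode = 'calib'                      # run with 'calib' first, then 'test'
config_file = './pt_quant_config.json'    # ZenDNN deployment configuration
batch_size = 32
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# The dummy input only defines the input shape used to trace the model
dummy_input = torch.randn([batch_size, 3, 224, 224])
quantizer = torch_quantizer(quant_mode, model, (dummy_input),
                            device=device, quant_config_file=config_file)
quant_model = quantizer.quant_model

# Forward the converted model; loss/accuracy are meaningless during calibration
acc1, acc5, loss = evaluate(quant_model, val_loader, loss_fn)

if quant_mode == 'calib':
    quantizer.export_quant_config()   # writes the quantization steps (quant_info.json)
elif quant_mode == 'test':
    quantizer.export_onnx_model()     # exports the deployable quantized model
```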
Before running the commands, a note on log messages in vai_q_pytorch: they have special colors and the keyword prefix "VAIQ_*". vai_q_pytorch log message types include error, warning, and note.
Pay attention to vai_q_pytorch log messages to check the flow status.
- Run the command with --quant_mode calib to quantize the model.
python pt_resnet18_quant.py --quant_mode calib --subset_len 200 --config_file ./pt_quant_config.json
When doing calibration forwarding, the float evaluation flow is borrowed to minimize code changes from the float script, which means that loss and accuracy information is displayed at the end. Because these loss and accuracy values are meaningless at this point in the process, they can be ignored. Pay attention to the colored log messages with the special keyword prefix VAIQ_*.
It is important to control iteration numbers during quantization and evaluation. Generally, 100-1000 images are enough for quantization, and the whole validation set is required for evaluation. The iteration numbers can be controlled in the data loading part.
In this case, the argument subset_len controls how many images are used for network forwarding. If the float evaluation script does not have an argument with a similar role, it is better to add one; otherwise, the iteration count must be changed manually.
If the quantization command runs successfully, two important files are generated in the output directory ./quantize_result:
- ResNet.py: the converted vai_q_pytorch format model.
- Quant_info.json: the quantization steps of the tensors. (Keep this file for evaluating the quantized model.)
To evaluate the quantized model, run the following command:
python pt_resnet18_quant.py --quant_mode test
When this command finishes, the displayed accuracy is the correct accuracy for the quantized model.
Sometimes direct quantization accuracy is not high enough, and finetuning of model parameters is necessary to recover accuracy:
- Fast finetuning is not real training of the model and only needs a limited number of iterations. For classification models on the Imagenet dataset, 5120 images are enough in general.
- It only requires some modification based on the evaluation model script, and does not require you to set up an optimizer for training.
- A function for model forwarding iteration is needed and is called as part of fast finetuning.
- Re-calibration with original inference code is highly recommended.
- Example code in pt_resnet18_quant.py is as follows:
# fast finetune model or load finetuned parameter before test
if finetune == True:
    ft_loader, _ = load_data(
        subset_len=5120,
        train=False,
        batch_size=batch_size,
        sample_method='random',
        data_dir=args.data_dir,
        model_name=model_name)
    if quant_mode == 'calib':
        quantizer.fast_finetune(evaluate, (quant_model, ft_loader, loss_fn))
    elif quant_mode == 'test':
        quantizer.load_ft_param()
- For pt_resnet18_quant.py, use the following command line to perform parameter fast finetuning and re-calibration:
python pt_resnet18_quant.py --quant_mode calib --fast_finetune --config_file ./pt_quant_config.json
- Use the following command line to test the fast finetuned quantized model accuracy:
python pt_resnet18_quant.py --quant_mode test --fast_finetune
- This mode can be used to finetune a quantized model (loading float model parameters), as well as to do quantization-aware-training (QAT) from scratch.
- It is necessary to add some vai_q_pytorch interface functions based on the float model training script.
- This mode requires that the model not use the +/- operators in its forward code. Replace them with torch.add/torch.sub (see the sketch after this list).
- For detailed information, refer to the Vitis AI User Guide.
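As a hedged illustration of the +/- restriction above, a residual connection in the forward code would use torch.add instead of the + operator. The module below is hypothetical and is not part of pt_resnet18_quant.py.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Hypothetical block showing the operator replacement required for QAT."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.conv(x)
        # return out + x          # the + operator is not allowed for QAT
        return torch.add(out, x)  # use torch.add (or torch.sub for -) instead
```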
vai_q_tensorflow is a Vitis AI quantizer for TensorFlow. It supports FPGA-friendly quantization for TensorFlow models. After quantization, models can be deployed to FPGA devices. vai_q_tensorflow is a component of Vitis AI, a development stack for AI inference on AMD hardware platforms.
Note: You can download the AMD prebuilt version in Vitis AI. See the Vitis AI User Guide for further details.
Tested environment:
- Ubuntu 16.04
- GCC 4.8
- Bazel 0.24.1
- Python 3.6
- CUDA 10.1 + CUDNN 7.6.5
Prerequisites:
- Install Bazel 0.24.1.
- (GPU version) Install CUDA and CUDNN.
- Install Python prerequisites:
$ pip install -r requirements.txt
Option 1. Build wheel package and install:
# CPU-only version
$ ./configure # Input "N" when asked "Do you wish to build TensorFlow with CUDA support?". For other questions, use default value or set as you wish.
$ sh rebuild_cpu.sh
# GPU version
$ ./configure # Input "Y" when asked "Do you wish to build TensorFlow with CUDA support?". For other questions, use default value or set as you wish.
$ sh rebuild_gpu.sh
Option 2. Build conda package (you need anaconda for this option):
# CPU-only version
$ conda build --python=3.6 vai_q_tensorflow_cpu_feedstock --output-folder ./conda_pkg/
# GPU version
$ conda build --python=3.6 vai_q_tensorflow_gpu_feedstock --output-folder ./conda_pkg/
# Install conda package
$ conda install --use-local ./conda_pkg/linux-64/vai_q_tensorflow-1.0-py36h605774d_1.tar.bz2
Validate installation:
$ vai_q_tensorflow --version
$ vai_q_tensorflow --help
Note: This tool is based on TensorFlow 1.15. For more information on build and installation, refer to the TensorFlow Installation Guide, specifically Docker linux build or Windows installation.
Before running vai_q_tensorflow, prepare the frozen inference TensorFlow model in floating-point format and calibration set, including the files listed in the following table.
Input Files for vai_q_tensorflow
No. | Name | Description |
---|---|---|
1 | frozen_graph.pb | Floating-point frozen inference graph. Ensure that the graph is the inference graph rather than the training graph. |
2 | calibration dataset | A subset of the training dataset containing 100 to 1000 images. |
3 | input_fn | An input function to convert the calibration dataset to the input data of the frozen_graph during quantize calibration. Usually performs data pre-processing and augmentation. |
Training a model with TensorFlow 1.x creates a folder containing a GraphDef file (usually ending with a .pb or .pbtxt extension) and a set of checkpoint files. What you need for mobile or embedded deployment is a single GraphDef file that has been frozen, that is, had its variables converted into inline constants so everything is in one file. To handle the conversion, TensorFlow provides freeze_graph.py, which is automatically installed with the vai_q_tensorflow quantizer.
An example of command-line usage is as follows:
$ freeze_graph \
--input_graph /tmp/inception_v1_inf_graph.pb \
--input_checkpoint /tmp/checkpoints/model.ckpt-1000 \
--input_binary true \
--output_graph /tmp/frozen_graph.pb \
--output_node_names InceptionV1/Predictions/Reshape_1
The --input_graph should be an inference graph rather than the training graph. Because the operations for data preprocessing and loss functions are not needed for inference and deployment, the frozen_graph.pb should only include the main part of the model. In particular, the data preprocessing operations should be handled in the input_fn to generate correct input data for quantize calibration.
Note: Some operations, such as dropout and batchnorm, behave differently in the training and inference phases. Ensure that they are in the inference phase when freezing the graph. For example, you can set the flag is_training=false when using tf.layers.dropout/tf.layers.batch_normalization. For models using tf.keras, call tf.keras.backend.set_learning_phase(0) before building the graph.
Tip: Type freeze_graph --help
for more options.
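For reference, the following is a minimal TensorFlow 1.x sketch of building the graph in inference mode before freezing. The model_fn, input shape, and output path are assumptions for illustration.

```python
import tensorflow as tf

# For tf.keras models, set the learning phase before any layer is built
tf.keras.backend.set_learning_phase(0)

inputs = tf.placeholder(tf.float32, [None, 224, 224, 3], name='input')
logits = model_fn(inputs, is_training=False)   # dropout/batchnorm in inference mode

# Write the inference GraphDef; freeze_graph then combines it with the trained checkpoint
tf.io.write_graph(tf.get_default_graph().as_graph_def(),
                  '/tmp', 'inception_v1_inf_graph.pb', as_text=False)
```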
The input and output node names vary depending on the model, but you can inspect and estimate them with the vai_q_tensorflow quantizer. See the following code snippet for an example:
$ vai_q_tensorflow inspect --input_frozen_graph=/tmp/inception_v1_inf_graph.pb
The estimated input and output nodes cannot be used for quantization if the graph has in-graph pre- and post-processing. This is because some operations are not quantizable and can cause errors when compiled by the Vitis AI compiler, if you deploy the quantized model to the DPU.
Another way to get the input and output name of the graph is by visualizing the graph. Both TensorBoard and Netron can do this. See the following example, which uses Netron:
$ pip install netron
$ netron /tmp/inception_v3_inf_graph.pb
The calibration set is usually a subset of the training/validation dataset or actual application images (at least 100 images for performance). The input function is a Python importable function to load the calibration dataset and perform data preprocessing. The vai_q_tensorflow quantizer can accept an input_fn to do the preprocessing, which is not saved in the graph. If the preprocessing subgraph is saved into the frozen graph, the input_fn only needs to read the images from dataset and return a feed_dict.
The format of the input function is module_name.input_fn_name (for example, my_input_fn.calib_input). The input_fn takes an int object as input, indicating the calibration step number, and returns a dict(placeholder_name, numpy.array) object for each call, which is fed into the placeholder nodes of the model when running inference. The placeholder_name is always the input node of the frozen graph, that is, the node receiving input data. The input_nodes option of vai_q_tensorflow indicates where quantization starts in the frozen graph. The placeholder_names and the input_nodes options are sometimes different. For example, when the frozen graph includes in-graph preprocessing, the placeholder_name is the input of the graph, although it is recommended that input_nodes be set to the last node of preprocessing. The shape of the numpy.array must be consistent with the placeholders. See the following pseudo code example:
# my_input_fn.py
def calib_input(iter):
    """
    A function that provides input data for the calibration
    Args:
        iter: An `int` object, indicating the calibration step number
    Returns:
        dict(placeholder_name, numpy.array): a `dict` object, which will be fed into the model
    """
    image = load_image(iter)
    preprocessed_image = do_preprocess(image)
    return {"placeholder_name": preprocessed_image}
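For a more concrete picture, the following sketch shows an input_fn that reads images from a folder and returns one preprocessed batch per calibration step. The folder path, batch size, preprocessing, and the placeholder name "input" are assumptions.

```python
# my_input_fn.py (sketch) -- paths, batch size, and preprocessing are illustrative
import os
import numpy as np
from PIL import Image

CALIB_DIR = '/tmp/calib_images'   # hypothetical calibration image folder
BATCH_SIZE = 10
IMAGES = sorted(os.listdir(CALIB_DIR))

def calib_input(iter):
    """Return one batch of preprocessed images for calibration step `iter`."""
    batch = []
    for name in IMAGES[iter * BATCH_SIZE:(iter + 1) * BATCH_SIZE]:
        img = Image.open(os.path.join(CALIB_DIR, name)).convert('RGB').resize((224, 224))
        batch.append(np.asarray(img, dtype=np.float32) / 255.0)  # simple normalization
    # The key must match the placeholder (input) node name of the frozen graph
    return {"input": np.stack(batch)}
```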
Run the following commands to quantize the model:
$vai_q_tensorflow quantize \
--input_frozen_graph frozen_graph.pb \
--input_nodes ${input_nodes} \
--input_shapes ${input_shapes} \
--output_nodes ${output_nodes} \
--input_fn input_fn \
[options]
The input_nodes and output_nodes arguments are the name lists of the input and output nodes of the quantize graph. They are the start and end points of quantization. The main graph between them is quantized if it is quantizable.
It is recommended to set --input_nodes to the last nodes of the preprocessing part and --output_nodes to the last nodes of the main graph part, because some operations in the pre- and post-processing parts are not quantizable and might cause errors when compiled by the Vitis AI compiler if you need to deploy the quantized model to the DPU.
The input nodes might not be the same as the placeholder nodes of the graph. If no in-graph preprocessing part is present in the frozen graph, the placeholder nodes should be set as the input nodes.
The input_fn
should be consistent with the placeholder nodes.
[options] stands for optional parameters. The most commonly used options are as follows:
- weight_bit: Bit width for quantized weight and bias (default is 8).
- activation_bit: Bit width for quantized activation (default is 8).
- method: Quantization method: 0 for non-overflow, 1 for min-diffs, and 2 for min-diffs with normalization. The non-overflow method ensures that no values are saturated.
After the vai_q_tensorflow command executes successfully, one output file, quantize_eval_model.pb, is generated in the ${output_dir} location. This file is used to evaluate the quantized model on the CPU/GPU and can be used to simulate the results on hardware. Run import tensorflow.contrib.decent_q explicitly to register the custom quantize operation, because tensorflow.contrib is now lazily loaded.
vai_q_tensorflow Output Files
No. | Name | Description |
---|---|---|
1 | deploy_model.pb | Quantized model for the Vitis AI compiler (extended TensorFlow format) for targeting DPUCZDX8G implementations. |
2 | quantize_eval_model.pb | Quantized model for evaluation (also, the Vitis AI compiler input for most DPU architectures, like DPUCAHX8H and DPUCADF8H). |
If you have scripts to evaluate floating-point models, like the models in the Vitis AI Model Zoo, apply the following two changes to evaluate the quantized model:
- Prepend the float evaluation script with from tensorflow.contrib import decent_q to register the quantize operation.
- Replace the float model path in the scripts with the quantization output model quantize_results/quantize_eval_model.pb.
- Run the modified script to evaluate the quantized model.
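For example, a float evaluation script might load the quantized model as in the following TensorFlow 1.x sketch; the tensor names and the `images` batch are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.contrib import decent_q   # registers the custom quantize ops

with tf.io.gfile.GFile('quantize_results/quantize_eval_model.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

tf.import_graph_def(graph_def, name='')
with tf.Session() as sess:
    input_tensor = sess.graph.get_tensor_by_name('input:0')         # assumed node name
    output_tensor = sess.graph.get_tensor_by_name('predictions:0')  # assumed node name
    preds = sess.run(output_tensor, feed_dict={input_tensor: images})  # images: preprocessed numpy batch
```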
vai_q_tensorflow dumps the simulation results with the quantize_eval_model.pb
generated by the quantizer. This allows you to compare the simulation results on the CPU/GPU with the output values on the DPU.
To dump the quantize simulation results, run the following commands:
$vai_q_tensorflow dump \
--input_frozen_graph quantize_results/quantize_eval_model.pb \
--input_fn dump_input_fn \
--max_dump_batches 1 \
--dump_float 0 \
--output_dir quantize_results
The input_fn for dumping is similar to the input_fn for quantize calibration, but the batch size is often set to 1 to be consistent with the DPU results.
If the command executes successfully, the dump results are generated in ${output_dir}. There are folders in ${output_dir}, and each folder contains the dump results for one batch of input data. Results for each node are saved separately. For each quantized node, the results are saved in *_int8.bin and *_int8.txt format. If dump_float is set to 1, the results for unquantized nodes are also dumped. In the saved file names, the / symbol is replaced by _ for simplicity. Examples of dump results are shown in the following table.
Examples for Dump Results
Batch No. | Quant | Node Name | Saved files |
---|---|---|---|
1 | Yes | resnet_v1_50/conv1/biases/wquant | {output_dir}/dump_results_1/resnet_v1_50_conv1_biases_wquant_int8.bin {output_dir}/dump_results_1/resnet_v1_50_conv1_biases_wquant_int8.txt |
2 | No | resnet_v1_50/conv1/biases | {output_dir}/dump_results_2/resnet_v1_50_conv1_biases.bin {output_dir}/dump_results_2/resnet_v1_50_conv1_biases.txt |
Quantization aware training (QAT) is similar to float model training/finetuning, but in QAT, the vai_q_tensorflow APIs are used to rewrite the float graph to convert it to a quantized graph before the training starts. The typical workflow is as follows:
- Preparation: Before QAT, prepare the following files:
Input Files for vai_q_tensorflow QAT
No. | Name | Description |
---|---|---|
1 | Checkpoint files | Floating-point checkpoint files to start from. Omit this if you are training the model from scratch. |
2 | Dataset | The training dataset with labels. |
3 | Train Scripts | The Python scripts to run float train/finetuning of the model. |
- Evaluate the float model (optional): Evaluate the float checkpoint files first before doing quantize finetuning to check the correctness of the scripts and dataset. The accuracy and loss values of the float checkpoint can also serve as a baseline for QAT.
- Modify the training scripts: To create the quantize training graph, modify the training scripts to call the function after the float graph is built. The following is an example:
# train.py
# ...
# Create the float training graph
model = model_fn(is_training=True)
# *Set the quantize configurations
from tensorflow.contrib import decent_q
q_config = decent_q.QuantizeConfig(input_nodes=['net_in'], output_nodes=['net_out'],input_shapes=[[-1, 224, 224, 3]])
# *Call Vai_q_tensorflow api to create the quantize training graph
decent_q.CreateQuantizeTrainingGraph(config=q_config)
# Create the optimizer
optimizer = tf.train.GradientDescentOptimizer()
# start the training/finetuning, you can use sess.run(), tf.train, tf.estimator, tf.slim and so on
# ...
The QuantizeConfig contains the configurations for quantization. Some basic configurations, such as input_nodes, output_nodes, and input_shapes, need to be set according to your model structure. Other configurations, such as weight_bit, activation_bit, and method, have default values and can be modified as needed. See vai_q_tensorflow Usage for detailed information on all the configurations.
- input_nodes/output_nodes: They are used together to determine the subgraph range you want to quantize. The pre-processing and post-processing components are usually not quantizable and should be out of this range. The input_nodes and output_nodes should be the same for the float training graph and the float evaluation graph to match the quantization operations between them. Note: Operations with multiple output tensors (such as FIFO) are currently not supported. You can add a tf.identity node to make an alias for the input_tensor so that the input node has a single output.
- input_shapes: The shape list of input_nodes. It must be a 4-dimension shape for each node, comma separated, for example, [[1,224,224,3], [1,128,128,1]]. Unknown size is supported for batch_size, for example, [[-1,224,224,3]].
- Evaluate the quantized model and generate the frozen model: After QAT, generate the frozen model after evaluating the quantized graph with a checkpoint file. This can be done by calling the following function after building the float evaluation graph. As the freeze process depends on the quantize evaluation graph, they are often called together.
Note: Function decent_q.CreateQuantizeTrainingGraph
and decent_q.CreateQuantizeEvaluationGraph
modify the default graph in TensorFlow. They need to be called on different graph phases. decent_q.CreateQuantizeTrainingGraph
needs to be called on the float training graph while decent_q.CreateQuantizeEvaluationGraph
needs to be called on the float evaluation graph. decent_q.CreateQuantizeEvaluationGraph
cannot be called right after calling function decent_q.CreateQuantizeTrainingGraph
, because the default graph has been converted to a quantize training graph. The correct way is to call it right after the float model creation function.
# eval.py
# ...
# Create the float evaluation graph
model = model_fn(is_training=False)
# *Set the quantize configurations
from tensorflow.contrib import decent_q
q_config = decent_q.QuantizeConfig(input_nodes=['net_in'], output_nodes=['net_out'], input_shapes=[[-1, 224, 224, 3]])
# *Call Vai_q_tensorflow api to create the quantize evaluation graph
decent_q.CreateQuantizeEvaluationGraph(config=q_config)
# *Call Vai_q_tensorflow api to freeze the model and generate the deploy model
decent_q.CreateQuantizeDeployGraph(checkpoint="path to checkpoint folder", config=q_config)
# start the evaluation, users can use sess.run, tf.train, tf.estimator, tf.slim and so on
# ...
After you have performed the previous steps, the following files are generated in the ${output_dir} location:
Generated File Information
Name | TensorFlow Compatible | Usage | Description |
---|---|---|---|
quantize_train_graph.pb | Yes | Train | The quantize train graph. |
quantize_eval_graph_{suffix}.pb | Yes | Evaluation with checkpoint | The quantize evaluation graph with quantize information frozen inside. There are weights in this file and it should be used together with the checkpoint file in evaluation. |
quantize_eval_model_{suffix}.pb | Yes | 1: Evaluation 2: Dump 3: Input to VAI compiler (DPUCAHX8H) | The frozen quantize evaluation graph, with the weights from the checkpoint and the quantize information frozen inside. It can be used to evaluate the quantized model on the host or to dump the outputs of each layer for cross-check with DPU outputs. The XIR compiler uses it as an input. |
The suffix contains the iteration information from the checkpoint file and the date information. For example, if the checkpoint file is "model.ckpt-2000.*" and the date is 20200611, then the suffix is "2000_20200611000000."
The following are some tips for QAT.
- Keras Model: For Keras models, set backend.set_learning_phase(1) before creating the float training graph, and set backend.set_learning_phase(0) before creating the float evaluation graph. Moreover, backend.set_learning_phase() should be called after backend.clear_session() (see the sketch after these tips). TensorFlow 1.x QAT APIs are designed for TensorFlow native training APIs. Using Keras model.fit() APIs in QAT might lead to some "nodes not executed" issues. It is recommended to use the QAT APIs in the TensorFlow 2 quantization tool with Keras APIs.
- Dropout: Experiments show that QAT works better without dropout ops. This tool does not support finetuning with dropout at the moment, so dropout ops should be removed or disabled before running QAT. This can be done by setting is_training=false when using tf.layers, or calling tf.keras.backend.set_learning_phase(0) when using tf.keras.layers.
- Hyper-parameters: QAT is like float model training/finetuning, so the techniques for float model training/finetuning are also needed. The optimizer type and the learning rate curve are some important parameters to tune.
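The learning-phase handling described in the Keras tip might look like the following sketch; build_model() is a placeholder for your own float model construction code.

```python
from tensorflow.keras import backend

# Before building the float training graph
backend.clear_session()
backend.set_learning_phase(1)
train_model = build_model()   # your float model construction function (assumed)
# ... create the quantize training graph and run QAT ...

# Before building the float evaluation graph
backend.clear_session()
backend.set_learning_phase(0)
eval_model = build_model()
# ... create the quantize evaluation graph and freeze the deploy model ...
```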
The following table lists the supported operations and APIs for vai_q_tensorflow.
Supported Operations and APIs for vai_q_tensorflow
Type | Operation Type | tf.nn | tf.layers | tf.keras.layers |
---|---|---|---|---|
Convolution | Conv2D DepthwiseConv2dNative | atrous_conv2d conv2d conv2d_transpose depthwise_conv2d_native separable_conv2d | Conv2D Conv2DTranspose SeparableConv2D | Conv2D Conv2DTranspose DepthwiseConv2D SeparableConv2D |
Fully Connected | MatMul | / | Dense | Dense |
BiasAdd | BiasAdd Add | bias_add | / | / |
Pooling | AvgPool Mean MaxPool | avg_pool max_pool | AveragePooling2D MaxPooling2D | AveragePooling2D MaxPooling2D |
Activation | ReLU ReLU6 Sigmoid Swish Hard-sigmoid Hard-swish | relu relu6 leaky_relu swish | / | ReLU LeakyReLU |
BatchNorm[#1] | FusedBatchNorm | batch_normalization batch_norm_with_global_normalization fused_batch_norm | BatchNormalization | BatchNormalization |
Upsampling | ResizeBilinear ResizeNearestNeighbor | / | / | UpSampling2D |
Concat | Concat ConcatV2 | / | / | Concatenate |
Others | Placeholder Const Pad Squeeze Reshape ExpandDims | dropout[#2] softmax[#3] | Dropout[#2] Flatten | Input Flatten Reshape ZeroPadding2D Softmax |
Notes:
#1. Only supports Conv2D/DepthwiseConv2D/Dense+BN. BN is folded to speed up inference.
#2. Dropout is deleted to speed up inference.
#3. vai_q_tensorflow does not quantize the softmax output.
The options supported by vai_q_tensorflow are shown in the following tables.
vai_q_tensorflow Options
Name | Type | Description |
---|---|---|
Common Configuration | ||
--input_frozen_graph | String | TensorFlow frozen inference GraphDef file for the floating-point model. It is used for quantize calibration. |
--input_nodes | String | Specifies the name list of input nodes of the quantize graph, used together with --output_nodes, comma separated. Input nodes and output nodes are the start and end points of quantization. The subgraph between them is quantized if it is quantizable. RECOMMENDED: Set --input_nodes as the last nodes of preprocessing and --output_nodes as the last nodes of post-processing because some of the operations required for pre- and post-processing are not quantizable and might cause errors when compiled by the Vitis AI compiler if you need to deploy the quantized model to the DPU. The input nodes might not be the same as the placeholder nodes of the graph. |
--output_nodes | String | Specifies the name list of output nodes of the quantize graph, used together with --input_nodes, comma separated. Input nodes and output nodes are the start and end points of quantization. The subgraph between them is quantized if it is quantizable. RECOMMENDED: Set --input_nodes as the last nodes of preprocessing and --output_nodes as the last nodes of post-processing because some of the operations required for pre- and post-processing are not quantizable and might cause errors when compiled by the Vitis AI compiler if you need to deploy the quantized model to the DPU. |
--input_shapes | String | Specifies the shape list of input nodes. Must be a 4-dimension shape for each node, comma separated, for example 1,224,224,3; support unknown size for batch_size, for example ?,224,224,3. In case of multiple input nodes, assign the shape list of each node separated by :, for example, ?,224,224,3:?,300,300,1. |
--input_fn | String | Provides the input data for the graph, used with the calibration dataset. The function format is module_name.input_fn_name (for example, my_input_fn.input_fn). The input_fn should take an int object as input, which indicates the calibration step, and should return a dict(placeholder_node_name, numpy.Array) object for each call, which is then fed into the placeholder operations of the model. For example, assign --input_fn to my_input_fn.calib_input, and write the calib_input function in my_input_fn.py as: def calib_input(iter): # read image and do some preprocessing; return {"placeholder_1": input_1_nparray, "placeholder_2": input_2_nparray} Note: You do not need to do in-graph pre-processing again in input_fn because the subgraph before --input_nodes remains during quantization. The pre-defined input functions (including default and random) have been removed because they are not commonly used. The preprocessing part that is not in the graph file should be handled in the input_fn. |
Quantize Configuration | ||
--weight_bit | Int32 | Specifies the bit width for quantized weight and bias. Default: 8 |
--activation_bit | Int32 | Specifies the bit width for quantized activation. Default: 8 |
--nodes_bit | String | Specifies the bit width of nodes. Node names and bit widths form a pair of parameters joined by a colon; the parameter pairs are comma separated. When a conv op name is specified, vai_q_tensorflow quantizes only the weights of that conv op using the specified bit width. For example, 'conv1/Relu:16,conv1/weights:8,conv1:16'. |
--method | Int32 | Specifies the method for quantization. 0: Non-overflow method in which no values are saturated during quantization. Sensitive to outliers. 1: Min-diffs method that allows saturation for quantization to get a lower quantization difference. Higher tolerance to outliers. Usually ends with narrower ranges than the non-overflow method. 2: Min-diffs method with strategy for depthwise. It allows saturation for large values during quantization to get smaller quantization errors. A special strategy is applied for depthwise weights. It is slower than method 0 but has higher endurance to outliers. Default value: 1 |
--nodes_method | String | Specifies the method of nodes. Node names and methods form a pair of parameters joined by a colon; the parameter pairs are comma separated. When a conv op name is specified, vai_q_tensorflow quantizes only the weights of that conv op using the specified method, for example, 'conv1/Relu:1,depthwise_conv1/weights:2,conv1:1'. |
--calib_iter | Int32 | Specifies the iterations of calibration. Total number of images for calibration = calib_iter * batch_size. Default value: 100 |
--ignore_nodes | String | Specifies the name list of nodes to be ignored during quantization. Ignored nodes are left unquantized during quantization. |
--skip_check | Int32 | If set to 1, the check for float model is skipped. Useful when only part of the input model is quantized. Range: [0, 1] Default value: 0 |
--align_concat | Int32 | Specifies the strategy for the alignment of the input quantize position for concat nodes. 0: Aligns all the concat nodes 1: Aligns the output concat nodes 2: Disables alignment Default value: 0 |
--simulate_dpu | Int32 | Set to 1 to enable the simulation of the DPU. The behavior of DPU for some operations is different from TensorFlow. For example, the dividing in LeakyRelu and AvgPooling are replaced by bit-shifting, so there might be a slight difference between DPU outputs and CPU/GPU outputs. The vai_q_tensorflow quantizer simulates the behavior of these operations if this flag is set to 1. Range: [0, 1] Default value: 1 |
--adjust_shift_bias | Int32 | Specifies the strategy for shift bias check and adjustment for DPU compiler. 0: Disables shift bias check and adjustment 1: Enables with static constraints 2: Enables with dynamic constraints Default value: 1 |
--adjust_shift_cut | Int32 | Specifies the strategy for shift cut check and adjustment for DPU compiler. 0: Disables shift cut check and adjustment 1: Enables with static constraints Default value: 1 |
--arch_type | String | Specifies the arch type for fix neuron. 'DEFAULT' means quantization range of both weights and activations are [-128, 127]. 'DPUCADF8H' means weights quantization range is [-128, 127] while activation is [-127, 127] |
--output_dir | String | Specifies the directory in which to save the quantization results. Default value: “./quantize_results” |
--max_dump_batches | Int32 | Specifies the maximum number of batches for dumping. Default value: 1 |
--dump_float | Int32 | If set to 1, the float weights and activations are dumped. Range: [0, 1] Default value: 0 |
--dump_input_tensors | String | Specifies the input tensor name of Graph when graph entrance is not a placeholder. Add a placeholder to the dump_input_tensor, so that `input_fn` can feed data. |
--scale_all_avgpool | Int32 | Set to 1 to enable scale output of AvgPooling op to simulate DPU. Only kernel_size <= 64 will be scaled. This operation does not affect the special case such as kernel_size=3,5,6,7,14 Default value: 1 |
--do_cle | Int32 | 1: Enables implement cross layer equalization to adjust the weights distribution 0: Skips cross layer equalization operation Default value: 0 |
--replace_relu6 | Int32 | Available only for do_cle=1. 1: Replaces ReLU6 with ReLU 0: Skips replacement. Default value: 1 |
Session Configurations | ||
--gpu | String | Specifies the IDs of the GPU device used for quantization separated by commas. |
--gpu_memory_fraction | float | Specifies the GPU memory fraction used for quantization, between 0-1. Default value: 0.5 |
Others | ||
--help | / | Shows all available options of vai_q_tensorflow. |
--version | / | Shows the version information for vai_q_tensorflow. |
#show help:
$vai_q_tensorflow --help
#quantize:
$vai_q_tensorflow quantize --input_frozen_graph frozen_graph.pb \
--input_nodes inputs \
--output_nodes predictions \
--input_shapes ?,224,224,3 \
--input_fn my_input_fn.calib_input
#dump quantized model:
$vai_q_tensorflow dump --input_frozen_graph quantize_results/quantize_eval_model.pb \
--input_fn my_input_fn.dump_input
Refer to Vitis AI Model Zoo for more TensorFlow model quantization examples.
The Vitis AI Quantizer for ONNX models is customized based on the quantization tool in ONNX Runtime.
- Python 3.7, 3.8
- ONNX>=1.12.0
- ONNX Runtime>=1.14.0
- onnxruntime-extensions>=0.4.2
You can install vai_q_onnx as follows:
To build vai_q_onnx, run the following command:
$ sh build.sh
$ pip install pkgs/*.whl
The static quantization method first runs the model using a set of inputs called calibration data. During these runs, the quantization parameters for each activation are computed. These quantization parameters are written as constants to the quantized model and used for all inputs. Our quantization tool supports the following calibration methods: MinMax, Entropy and Percentile, and MinMSE.
import vai_q_onnx
vai_q_onnx.quantize_static(
model_input,
model_output,
calibration_data_reader,
quant_format=vai_q_onnx.VitisQuantFormat.FixNeuron,
calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE)
Arguments
- model_input: (String) Represents the file path of the model to be quantized.
- model_output: (String) Represents the file path where the quantized model is saved.
- calibration_data_reader: (Object or None) Calibration data reader. It enumerates the calibration data and generates inputs for the original model (a data reader sketch follows this argument list). If you want to use random data for a quick test, you can set calibration_data_reader to None. The default value is None.
- quant_format: (String) Specifies the quantization format of the model. It has the following options:
QOperator: This option quantizes the model directly using quantized operators.
QDQ: This option quantizes the model by inserting QuantizeLinear/DeQuantizeLinear into the tensor. It supports 8-bit quantization only.
VitisQuantFormat.QDQ: This option quantizes the model by inserting VAIQuantizeLinear/VAIDeQuantizeLinear into the tensor. It supports a wider range of bit-widths and configurations.
VitisQuantFormat.FixNeuron: This option quantizes the model by inserting FixNeuron (a combination of QuantizeLinear and DeQuantizeLinear) into the tensor.
- calibrate_method: (String) For DPU devices, set calibrate_method to either 'vai_q_onnx.PowerOfTwoMethod.NonOverflow' or 'vai_q_onnx.PowerOfTwoMethod.MinMSE' to apply power-of-2 scale quantization. The PowerOfTwoMethod currently supports two methods: MinMSE and NonOverflow. The default method is MinMSE.
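The calibration_data_reader argument follows the ONNX Runtime CalibrationDataReader interface. The following is a minimal sketch; the dataset handling and the input name are assumptions for illustration.

```python
import numpy as np
import vai_q_onnx
from onnxruntime.quantization import CalibrationDataReader

class ImageCalibrationDataReader(CalibrationDataReader):
    """Feeds one preprocessed sample per call; returns None when the data is exhausted."""
    def __init__(self, calib_images, input_name='input'):
        # calib_images: iterable of preprocessed numpy arrays matching the model input shape
        self.data = iter([{input_name: np.expand_dims(img, axis=0)} for img in calib_images])

    def get_next(self):
        return next(self.data, None)

reader = ImageCalibrationDataReader(calib_images)   # calib_images prepared elsewhere
vai_q_onnx.quantize_static('float_model.onnx', 'quantized_model.onnx', reader,
                           quant_format=vai_q_onnx.VitisQuantFormat.FixNeuron,
                           calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE)
```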
Quantization in ONNX Runtime refers to the linear quantization of an ONNX model. We have developed the vai_q_onnx tool as a plugin for ONNX Runtime to support more post-training quantization(PTQ) functions for quantizing a deep learning model. Post-training quantization (PTQ) is a technique to convert a pre-trained float model into a quantized model with little degradation in model accuracy. A representative dataset is needed to run a few batches of inference on the float model to obtain the distributions of the activations, which is also called quantized calibration.
The usage of vai_q_onnx is as follows. Use the following steps to run PTQ with vai_q_onnx:
Before running vai_q_onnx, prepare the float model and calibration set, including the files listed in the following table.
Table 1. Input files for vai_q_onnx
No. | Name | Description |
---|---|---|
1 | float model | Floating-point ONNX models in onnx format. |
2 | calibration dataset | A subset of the training dataset or validation dataset to represent the input data distribution, usually 100 to 1000 images are enough. |
Pre-processing transforms the float32 model and prepares it for quantization. It consists of the following three optional steps:
- Symbolic shape inference: It is best suited for transformer models.
- Model Optimization: It uses ONNX Runtime native library to rewrite the computation graph, including merging computation nodes, and eliminating redundancies to improve runtime efficiency.
- ONNX shape inference.
The primary objective of these steps is to enhance quantization quality. The ONNX Runtime quantization tool performs optimally when the tensor's shape is known. Both symbolic shape inference and ONNX shape inference play a crucial role in determining tensor shapes. Symbolic shape inference is particularly effective for transformer-based models, whereas ONNX shape inference works well with other models. Model optimization performs certain operator fusion, making the quantization tool's job easier. For instance, a Convolution operator followed by BatchNormalization can be fused into one during the optimization, which enables effective quantization. ONNX Runtime has a known issue: model optimization cannot output a model larger than 2 GB, so for large models, optimization must be skipped. The pre-processing API is the quant_pre_process() function in the onnxruntime.quantization.shape_inference Python module:
from onnxruntime.quantization import shape_inference
shape_inference.quant_pre_process(
input_model_path: str,
output_model_path: str,
skip_optimization: bool = False,
skip_onnx_shape: bool = False,
skip_symbolic_shape: bool = False,
auto_merge: bool = False,
int_max: int = 2**31 - 1,
guess_output_rank: bool = False,
verbose: int = 0,
save_as_external_data: bool = False,
all_tensors_to_one_file: bool = False,
external_data_location: str = "./",
external_data_size_threshold: int = 1024,)
Arguments
- input_model_path: (String) Specifies the file path of the input model to be pre-processed for quantization.
- output_model_path: (String) Specifies the file path where the pre-processed model is saved.
- skip_optimization: (Boolean) Indicates whether to skip the model optimization step. If set to True, model optimization is skipped, which may cause ONNX shape inference failure for some models. The default value is False.
- skip_onnx_shape: (Boolean) Indicates whether to skip the ONNX shape inference step. The symbolic shape inference is most effective with transformer-based models. Skipping all shape inferences may reduce the effectiveness of quantization because a tensor with an unknown shape cannot be quantized. The default value is False.
- skip_symbolic_shape: (Boolean) Indicates whether to skip the symbolic shape inference step. Symbolic shape inference is most effective with transformer-based models. Skipping all shape inferences may reduce the effectiveness of quantization because a tensor with an unknown shape cannot be quantized. The default value is False.
- auto_merge: (Boolean) Determines whether to automatically merge symbolic dimensions when a conflict occurs during symbolic shape inference. The default value is False.
- int_max: (Integer) Specifies the maximum integer value that is to be considered as boundless for operations like slice during symbolic shape inference. The default value is 2**31 - 1.
- guess_output_rank: (Boolean) Indicates whether to guess the output rank to be the same as input 0 for unknown operations. The default value is False.
- verbose: (Integer) Controls the level of detailed information logged during inference. A value of 0 turns off logging, 1 logs warnings, and 3 logs detailed information. The default value is 0.
- save_as_external_data: (Boolean) Determines whether to save the ONNX model to external data. The default value is False.
- all_tensors_to_one_file: (Boolean) Indicates whether to save all the external data to one file. The default value is False.
- external_data_location: (String) Specifies the file location where the external file is saved. The default value is "./".
- external_data_size_threshold: (Integer) Specifies the size threshold for external data. The default value is 1024.
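A minimal usage sketch of this pre-processing step might look like the following; the file names are illustrative.

```python
from onnxruntime.quantization import shape_inference

shape_inference.quant_pre_process(
    input_model_path='float_model.onnx',
    output_model_path='float_model_preprocessed.onnx',
    skip_optimization=False,      # skip optimization for models larger than 2 GB
    skip_onnx_shape=False,
    skip_symbolic_shape=False)
```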
The static quantization method first runs the model using a set of inputs called calibration data. During these runs, the quantization parameters are computed for each activation. These quantization parameters are written as constants to the quantized model and used for all inputs. Vai_q_onnx quantization tool has expanded calibration methods to power-of-2 scale/float scale quantization methods. Float scale quantization methods include MinMax, Entropy, and Percentile. Power-of-2 scale quantization methods include MinMax and MinMSE:
vai_q_onnx.quantize_static(
model_input,
model_output,
calibration_data_reader,
quant_format=vai_q_onnx.VitisQuantFormat.FixNeuron,
calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
input_nodes=[],
output_nodes=[],
extra_options=None,)
Arguments
- model_input: (String) Specifies the path of the model to be quantized.
- model_output: (String) Specifies the file path where the quantized model is saved.
- calibration_data_reader: (Object or None) Calibration data reader that enumerates the calibration data and generates inputs for the original model. If you want to use random data for a quick test, you can set calibration_data_reader to None.
- quant_format: (Enum) Defines the quantization format for the model. It has the following options:
QOperator: This option quantizes the model directly using quantized operators.
QDQ: This option quantizes the model by inserting QuantizeLinear/DeQuantizeLinear into the tensor. It supports 8-bit quantization.
VitisQuantFormat.QDQ This option quantizes the model by inserting VAIQuantizeLinear/VAIDeQuantizeLinear into the tensor. It supports a wider range of bit-widths and configurations.
VitisQuantFormat.FixNeuron: This option quantizes the model by inserting FixNeuron (a combination of QuantizeLinear and DeQuantizeLinear) into the tensor. This is the default value.
- calibrate_method: (Enum) Used to set the power-of-2 scale quantization method for DPU devices. It currently supports two methods: 'vai_q_onnx.PowerOfTwoMethod.NonOverflow' and 'vai_q_onnx.PowerOfTwoMethod.MinMSE'. The default value is 'vai_q_onnx.PowerOfTwoMethod.MinMSE'.
- input_nodes: (List of Strings) List of the names of the starting nodes to be quantized. The nodes before these start nodes in the model are not optimized or quantized. For example, this argument can be used to skip some pre-processing nodes or stop quantizing the first node. The default value is [].
- output_nodes: (List of Strings) Names of the end nodes to be quantized. The nodes after these nodes in the model are not optimized or quantized. For example, this argument can be used to skip some post-processing nodes or stop quantizing the last node. The default value is [].
- extra_options: (Dict or None) Dictionary of additional options that can be passed to the quantization process. If there are no additional options to provide, this can be set to None. The default value is None.
If you have scripts to evaluate float models, like the models in AMD Model Zoo, you can replace the float model file with the quantized model for evaluation.
To support the customized FixNeuron op, the vai_dquantize module should be imported. The following is an example:
import onnxruntime as ort
from vai_q_onnx.operators.vai_ops.qdq_ops import vai_dquantize

so = ort.SessionOptions()
so.register_custom_ops_library(_lib_path())   # _lib_path(): path of the vai_q_onnx custom ops library
sess = ort.InferenceSession(dump_model, so)   # dump_model: path of the quantized model
input_name = sess.get_inputs()[0].name
results_outputs = sess.run(None, {input_name: input_data})  # input_data: preprocessed numpy batch
After that, evaluate the quantized model just as the float model.
Sometimes, after deploying the quantized model, it is essential to compare the simulation results on the CPU and GPU with the output values on the DPU. You can use the dump_model API of vai_q_onnx to dump the simulation results with the quantized_model:
# This function dumps the simulation results of the quantized model,
# including weights and activation results.
vai_q_onnx.dump_model(
model,
dump_data_reader=None,
output_dir='./dump_results',
dump_float=False)
Arguments
- model: (String) Specifies the file path of the quantized model whose simulation results are to be dumped.
- dump_data_reader: (Object or None) Data reader that is used for the dumping process. It generates inputs for the original model.
- output_dir: (String) Specifies the directory where the dumped simulation results are saved. After successful execution of the function, dump results are generated in this specified directory. The default value is './dump_results'.
- dump_float: (Boolean) Determines whether to dump the floating-point value of weights and activation results. If set to True, the float values are dumped. The default value is False.
Note: The batch_size of the dump_data_reader should be set to 1 for DPU debugging.
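A call might look like the following sketch; the model path and the dump_reader object (a data reader prepared with batch_size 1) are assumptions.

```python
import vai_q_onnx

vai_q_onnx.dump_model(
    model='quantized_model.onnx',
    dump_data_reader=dump_reader,   # data reader prepared with batch_size=1
    output_dir='./dump_results',
    dump_float=False)
```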
After successfully executing the command, the dump results are generated in the output_dir. Each quantized node's weights and activation results are saved separately in *.bin and *.txt formats. In cases where the node output is not quantized, such as the softmax node, the float activation results are saved in *_float.bin and *_float.txt formats if the option dump_float is set to True. The following table shows an example of the dump results.
Table 2. Example of Dumping Results
Quantized | Node Name | Saved Weights and Activations |
---|---|---|
Yes | resnet_v1_50_conv1 | {output_dir}/dump_results/quant_resnet_v1_50_conv1.bin {output_dir}/dump_results/quant_resnet_v1_50_conv1.txt |
Yes | resnet_v1_50_conv1_weights | {output_dir}/dump_results/quant_resnet_v1_50_conv1_weights.bin {output_dir}/dump_results/quant_resnet_v1_50_conv1_weights.txt |
No | resnet_v1_50_softmax | {output_dir}/dump_results/quant_resnet_v1_50_softmax_float.bin {output_dir}/dump_results/quant_resnet_v1_50_softmax_float.txt |
The following table lists the supported operations and APIs for vai_q_onnx.
Table 3. List of Vai_q_onnx Supported Quantized Ops
supported ops | Comments |
---|---|
Add | |
Conv | |
ConvTranspose | |
Gemm | |
Concat | |
Relu | |
Reshape | |
Transpose | |
Resize | |
MaxPool | |
GlobalAveragePool | |
AveragePool | |
MatMul | |
Mul | |
Sigmoid | |
Softmax |
quantize_static Method
vai_q_onnx.quantize_static(
model_input,
model_output,
calibration_data_reader,
quant_format=vai_q_onnx.VitisQuantFormat.FixNeuron,
calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
input_nodes=[],
output_nodes=[],
op_types_to_quantize=None,
per_channel=False,
reduce_range=False,
activation_type=QuantType.QInt8,
weight_type=QuantType.QInt8,
nodes_to_quantize=None,
nodes_to_exclude=None,
optimize_model=True,
use_external_data_format=False,
extra_options=None,)
Arguments
- model_input: (String) Specifies the file path of the model that is to be quantized.
- model_output: (String) Specifies the file path where the quantized model will be saved.
- calibration_data_reader: (Object or None) Calibration data reader that enumerates the calibration data and generates inputs for the original model. If you want to use random data for a quick test, you can set calibration_data_reader to None.
- quant_format: (Enum) Defines the quantization format for the model. It has the following options:
QOperator Quantizes the model directly using quantized operators.
QDQ Quantizes the model by inserting QuantizeLinear/DeQuantizeLinear into the tensor. It supports 8-bit quantization only.
VitisQuantFormat.QDQ Quantizes the model by inserting VAIQuantizeLinear/VAIDeQuantizeLinear into the tensor. It supports a wider range of bit-widths and configurations.
VitisQuantFormat.FixNeuron: Quantizes the model by inserting FixNeuron (a combination of QuantizeLinear and DeQuantizeLinear) into the tensor. This is the default value.
- calibrate_method: (Enum) Used to set the power-of-2 scale quantization method for DPU devices. It currently supports two methods: 'vai_q_onnx.PowerOfTwoMethod.NonOverflow' and 'vai_q_onnx.PowerOfTwoMethod.MinMSE'. The default value is 'vai_q_onnx.PowerOfTwoMethod.MinMSE'.
- input_nodes: (List of Strings) Names of the starting nodes to be quantized. Nodes in the model before these nodes will not be quantized. For example, this argument can be used to skip some pre-processing nodes or stop the first node from being quantized. The default value is an empty list ([]).
- output_nodes: (List of Strings) Names of the end nodes to be quantized. Nodes in the model after these nodes are not optimized or quantized. For example, this argument can be used to skip some post-processing nodes or stop the last node from being quantized. The default value is an empty list ([]).
- op_types_to_quantize: (List of Strings or None) If specified, only operators of the given types are quantized (For example, ['Conv'] to only quantize Convolutional layers). By default, all supported operators are quantized.
- per_channel: (Boolean) Determines whether weights should be quantized per channel. For DPU devices, this must be set to False as they currently do not support per-channel quantization.
- reduce_range: (Boolean) If True, quantizes weights with 7-bits. For DPU devices, this must be set to False as they currently do not support reduced range quantization.
- activation_type: (QuantType) Specifies the quantization data type for activations. For DPU devices, this must be set to QuantType.QInt8. For more details on data type selection, refer to the ONNX Runtime quantization documentation.
- weight_type: (QuantType) Specifies the quantization data type for weights. For DPU devices, this must be set to QuantType.QInt8.
- nodes_to_quantize: (List of Strings or None) If specified, only the nodes in this list are quantized. The list should contain the names of the nodes, for example, ['Conv__224', 'Conv__252'].
- nodes_to_exclude: (List of Strings or None) If specified, the nodes in this list are excluded from quantization.
- optimize_model: (Boolean) If True, optimizes the model before quantization. However, this is not recommended because optimization changes the computation graph, making debugging of quantization loss difficult.
- use_external_data_format: (Boolean) Used for large models (>2 GB). The default is False.
- extra_options: (Dictionary or None) Contains key-value pairs for various options in different cases. The currently used options are listed below (see the sketch after this list for an example call):
extra.Sigmoid.nnapi = True/False (Default is False)
ActivationSymmetric = True/False: If True, calibration data for activations is symmetrized. The default is False. When using PowerOfTwoMethod for calibration, this should always be set to True.
WeightSymmetric = True/False: If True, calibration data for weights is symmetrized. The default is True. When using PowerOfTwoMethod for calibration, this should always be set to True.
EnableSubgraph = True/False: If True, the subgraph is quantized. The default is False.
ForceQuantizeNoInputCheck = True/False: If True, latent operators such as maxpool and transpose always quantize their inputs, generating quantized outputs even if their inputs have not been quantized. This behavior can be overridden for specific nodes using nodes_to_exclude. The default is False.
MatMulConstBOnly = True/False: If True, only MatMul operations with a constant 'B' are quantized. The default is False for static mode.
AddQDQPairToWeight = True/False: If True, both QuantizeLinear and DeQuantizeLinear nodes are inserted for weight, maintaining its floating-point format. The default is False, which quantizes floating-point weight and feeds it solely to an inserted DeQuantizeLinear node. In the PowerOfTwoMethod calibration method, QDQ should always appear as a pair, hence this should be set to True.
OpTypesToExcludeOutputQuantization = list of op type: If specified, the output of operators with these types is not quantized. The default is an empty list.
DedicatedQDQPair = True/False: If True, an identical and dedicated QDQ pair is created for each node. The default is False, allowing multiple nodes to share a single QDQ pair as their inputs.
QDQOpTypePerChannelSupportToAxis = dictionary: Sets the channel axis for specific operator types (e.g., {'MatMul': 1}). This is only effective when per-channel quantization is supported and per_channel is True. If a specific operator type supports per-channel quantization but no channel axis is explicitly specified, the default channel axis is used. For DPU devices, this must be set to {} as per-channel quantization is currently unsupported.
CalibTensorRangeSymmetric = True/False: If True, the final range of the tensor during calibration is symmetrically set around the central point "0". The default is False. In PowerOfTwoMethod calibration method, this should always be set to True.
CalibMovingAverage = True/False: If True, the moving average of the minimum and maximum values is computed when the calibration method selected is MinMax. The default is False. In PowerOfTwoMethod calibration method, this should be set to False.
CalibMovingAverageConstant = float: Specifies the constant smoothing factor to use when computing the moving average of the minimum and maximum values. The default is 0.01. This is only effective when the calibration method selected is MinMax and CalibMovingAverage is set to True. In PowerOfTwoMethod calibration method, this option is unsupported.
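As a hedged example combining the options above, a quantize_static call for a DPU target using PowerOfTwoMethod calibration might look like the following sketch; the file names and the data reader are illustrative.

```python
import vai_q_onnx
from onnxruntime.quantization import QuantType

vai_q_onnx.quantize_static(
    'float_model.onnx',
    'quantized_model.onnx',
    calibration_data_reader=reader,   # a CalibrationDataReader prepared elsewhere
    quant_format=vai_q_onnx.VitisQuantFormat.FixNeuron,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    per_channel=False,                # DPU devices do not support per-channel quantization
    extra_options={
        'ActivationSymmetric': True,        # required with PowerOfTwoMethod
        'WeightSymmetric': True,
        'AddQDQPairToWeight': True,         # keep QuantizeLinear/DeQuantizeLinear as a pair
        'CalibTensorRangeSymmetric': True,
        'CalibMovingAverage': False,
    })
```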
UIF is licensed under Apache License Version 2.0. Refer to the LICENSE file for the full license text and copyright notice.
Contact uif_support@amd.com for questions, issues, and feedback on UIF.
Submit your questions, feature requests, and bug reports on the GitHub issues page.