Quantized models keep a floating-point scale and an integer zero-point per tensor, and some frameworks additionally have Quantize/Dequantize operators whose input or output is floating-point. This section briefs the data types each framework uses; the SNPS-Caffe implementation follows the floating-point precision of the target framework.
| | TFLite | ONNX | Caffe2 | SNPS Caffe |
|---|---|---|---|---|
| scale | double | float | float | |
| fp | template | float | float | float |
| round half | away from zero | toward even | toward even | |
| std:: | round | rint | nearbyint | |
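The difference between these rounding functions only shows up on ties; a quick standalone check (plain standard-library calls from the table above, not SNPS-Caffe code):

```cpp
#include <cfenv>
#include <cmath>
#include <cstdio>

int main() {
  std::fesetround(FE_TONEAREST);  // the default mode: round to nearest, ties to even
  std::printf("std::round(2.5)     = %.1f\n", std::round(2.5));      // 3.0, half away from zero (TFLite)
  std::printf("std::rint(2.5)      = %.1f\n", std::rint(2.5));       // 2.0, half toward even (ONNX)
  std::printf("std::nearbyint(2.5) = %.1f\n", std::nearbyint(2.5));  // 2.0, half toward even (Caffe2)
  return 0;
}
```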
`fp` generally denotes the data type of
- the input tensor for `Quantize`,
- the output tensor for `Dequantize`,
- the intermediate tensor if a specific operator uses floating-point registers for computation or for handling the `output_scale`.
  - e.g. ONNXruntime generally handles the `input_scale`-to-`output_scale` transformation with `MlasRequantizeOutput(int Input, int Output, float scale);`, which uses an intermediate floating-point representation (`float`); a sketch of this pattern follows the list.
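A minimal sketch of that float-intermediate pattern (the helper name `RequantizeValue`, the int8 range, and the rounding call are illustrative assumptions, not the actual MLAS code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Rescale an int32 accumulator through a float intermediate,
// round half toward even, then saturate into the int8 range.
int8_t RequantizeValue(int32_t acc, float scale, int32_t output_zero_point) {
  float scaled = static_cast<float>(acc) * scale;                   // float intermediate
  int32_t rounded = static_cast<int32_t>(std::nearbyintf(scaled));  // ties to even
  return static_cast<int8_t>(std::clamp(rounded + output_zero_point, -128, 127));
}
```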
We support the implementations of the different frameworks through the layer parameter `quantize_method`, which selects one when their results are not bit-exact with each other. You can also refer to FEATURES.md for other quantization-related parameters.
| operator \ quantize_method | TFLite | ONNX | Caffe2 |
|---|---|---|---|
| AveragePool | t | o | c |
| BiasAdd | | o | |
| Concat | ~ | | |
| Convolution | t | o | c |
| Deconvolution | | | c |
| EltwiseSum | t | c | c |
| InnerProduct | t | t | |
| LeakyReLU | t | | |
| Power* | t | o | c |
| ReLU | ~ | ~ | ~ |
| ResizeBilinear | ~ | | |
| Sigmoid | ~ | | |
| Softmax | ~ | | |
We denote the TFLite/ONNXruntime/Caffe2 implementations by t/o/c. The `~` entries indicate that the Caffe implementation computes in floating-point representation, for example:

```
// A Dequantize-Op-Quantize procedure, taking ReLU as an example.
float_in  = Dequantize(int_in, input_scale, input_zero_point);
float_out = ReLU(float_in);
int_out   = Quantize(float_out, output_scale, output_zero_point);
```
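The `Quantize`/`Dequantize` helpers are not spelled out in this document; a self-contained sketch of the standard affine scheme `real = scale * (q - zero_point)`, with hypothetical int8 signatures, could look like:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

float Dequantize(int8_t q, float scale, int32_t zero_point) {
  return scale * static_cast<float>(static_cast<int32_t>(q) - zero_point);
}

int8_t Quantize(float real, float scale, int32_t zero_point) {
  int32_t q = static_cast<int32_t>(std::nearbyintf(real / scale)) + zero_point;
  return static_cast<int8_t>(std::clamp(q, -128, 127));  // saturate to int8
}
```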
- Our model zoo does not cover all quantized operators across the frameworks. An entry is left empty if the (framework, operator) combination has not been seen yet.
- Quantized `bias_layer` only occurs in ONNX (which does not support `FC+Bias` fusion yet).
- Only `Quantize` and `Dequantize` operators are mapped to `Power_layer`.
- Since some quantized operators produce bit-exact results across frameworks, for such entries we adopt the implementation from another framework.
- `MaxPool` and `ArgMax` are seen, but they do nothing different for quantized versus floating-point numbers.
- `Convolution` comes in a number of variations; see the following section.
The `output_multiplier` is defined as `output_multiplier = input_scale * weight_scale / output_scale`. Remember that TFLite uses `double` for the scales, while ONNXruntime and Caffe2 use `float`.
The quantized multiplier is calculated as follows (`shift` is a power-of-two normalizer that brings `output_multiplier` into [0.5, 1)):

```
output_multiplier = <double>input_scale * <double>weight_scale / <double>output_scale;
quantized_multiplier = std::round(std::frexp(output_multiplier, &shift) * (1LL << 31));
// or, for channel-wise quantization:
// output_multiplier[ch] = <double>input_scale * <double>weight_scale[ch] / <double>output_scale;
// quantized_multiplier[ch] = std::round(std::frexp(output_multiplier[ch], &shift[ch]) * (1LL << 31));
```
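As a worked example (scale values invented for illustration): `input_scale = 0.02`, `weight_scale = 0.005`, and `output_scale = 0.1` give `output_multiplier = 0.001 = 0.512 * 2^-9`:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
  double output_multiplier = 0.02 * 0.005 / 0.1;            // = 0.001
  int shift = 0;
  double fraction = std::frexp(output_multiplier, &shift);  // 0.512, shift = -9
  int32_t quantized_multiplier =
      static_cast<int32_t>(std::round(fraction * (1LL << 31)));  // 1099511628
  std::printf("fraction=%f shift=%d qm=%d\n", fraction, shift, quantized_multiplier);
  return 0;
}
```

Note that `std::frexp` returns a negative exponent for multipliers below 1; the rounding code below divides by `1 << shift`, so `shift` there corresponds to the negated exponent.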
For convolutions, TFLite transforms to DepthwiseConv when `group` = `in_ch` = `out_ch`.
Different implementations are then derived in SNPS-Caffe to match TFLite:
Scales \ group | 1 | Depthwise | Pointwise* |
---|---|---|---|
PerTensor | D2 | F2 | F2* |
PerChannel | D1 | D2 | D1* |
Two kinds of rounding are used to approximate the affine transformation (from `input_scale` to `output_scale`, using the quantized multiplier):
- The first splits the transformation into two rounded steps, denoted *2-steps-rounding* (shown below).
- The second implements *rounding half toward positive infinity* in a single step, denoted *1-step-rounding*.
```
scaled_acc = SaturatingRoundingDoublingHighMul(<int>acc, <int>quantized_multiplier);
out_acc    = RoundingDivideByPOT(scaled_acc, shift);
// Approximates out_acc = acc * quantized_multiplier / (1 << 31) / (1 << shift),
// with a rounding at each of the two steps.
```
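Neither helper is defined in this document; the sketches below follow the well-known gemmlowp/TFLite reference implementations, treating `shift` as a non-negative right-shift amount:

```cpp
#include <cstdint>
#include <limits>

// High 32 bits of 2*a*b, rounding half away from zero; saturates the
// single overflow case a == b == INT32_MIN.
int32_t SaturatingRoundingDoublingHighMul(int32_t a, int32_t b) {
  bool overflow = (a == b) && (a == std::numeric_limits<int32_t>::min());
  int64_t ab = static_cast<int64_t>(a) * static_cast<int64_t>(b);
  int32_t nudge = ab >= 0 ? (1 << 30) : (1 - (1 << 30));
  int32_t ab_x2_high32 = static_cast<int32_t>((ab + nudge) / (1LL << 31));
  return overflow ? std::numeric_limits<int32_t>::max() : ab_x2_high32;
}

// Divides by 2^exponent, rounding to nearest with ties away from zero.
int32_t RoundingDivideByPOT(int32_t x, int exponent) {
  const int32_t mask = static_cast<int32_t>((1LL << exponent) - 1);
  const int32_t remainder = x & mask;
  const int32_t threshold = (mask >> 1) + ((x < 0) ? 1 : 0);
  return (x >> exponent) + ((remainder > threshold) ? 1 : 0);
}
```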
**D2:** Computes the `output_multiplier` in `double` and applies the 2-steps-rounding above.

**F2:** Uses `<float>` to calculate the `output_multiplier`, then applies the same 2-steps-rounding as D2.
**D1:** Calculates the `output_multiplier` per channel, and uses the simpler rounding to compute the approximate result:

```
scaled_acc = <int64>acc * <int>quantized_multiplier;  // widened 64-bit product
out_acc    = (scaled_acc + (1LL << (31 + shift - 1))) >> (31 + shift);
// That is, it rounds only once, half toward positive infinity.
```
When matching results for bit-exactness, the combination of PerTensor-F2 and PerChannel-D1 (the Pointwise* entries) was found by brute force.
For the other frameworks:
- **ONNXruntime** casts `<int>acc` to `<float>`, multiplies by `<float>output_multiplier`, then requantizes the result.
- **Caffe2** uses single-precision scales; the computation is otherwise the same as F2 above.
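Putting the D2 pieces together, a minimal sketch under the conventions above (`RequantizeD2` is a name invented here; it reuses the two helpers sketched earlier and assumes `output_multiplier` < 1):

```cpp
#include <cmath>
#include <cstdint>

int32_t SaturatingRoundingDoublingHighMul(int32_t a, int32_t b);  // sketched above
int32_t RoundingDivideByPOT(int32_t x, int exponent);             // sketched above

// Hypothetical end-to-end D2 requantization: normalize the double
// multiplier with frexp, then apply the 2-steps-rounding.
// (The fraction-rounds-to-exactly-2^31 corner case is ignored here.)
int32_t RequantizeD2(int32_t acc, double output_multiplier) {
  int exponent = 0;
  double fraction = std::frexp(output_multiplier, &exponent);  // in [0.5, 1)
  int32_t quantized_multiplier =
      static_cast<int32_t>(std::round(fraction * (1LL << 31)));
  int shift = -exponent;  // right-shift amount; positive for multipliers < 1
  int32_t scaled_acc = SaturatingRoundingDoublingHighMul(acc, quantized_multiplier);
  return RoundingDivideByPOT(scaled_acc, shift);
}
```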