Quantized models keep a floating-point scale and an integer zero-point per tensor, and some frameworks additionally have Quantize/Dequantize operators whose input or output is floating-point. This section briefs the data types each framework uses; the SNPS-Caffe implementation follows the floating-point precision of the target framework.
| | TFLite | ONNX | Caffe2 | SNPS Caffe |
|---|---|---|---|---|
| scale | double | float | float | |
| fp | template | float | float | float |
| round half | away from zero | toward even | toward even | |
| std:: | round | rint | nearbyint | |
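The difference between these rounding functions only shows up on ties; a quick standalone check (plain standard-library calls from the table above, not SNPS-Caffe code):

```cpp
#include <cfenv>
#include <cmath>
#include <cstdio>

int main() {
  std::fesetround(FE_TONEAREST);  // the default mode: round to nearest, ties to even
  std::printf("std::round(2.5)     = %.1f\n", std::round(2.5));      // 3.0, half away from zero (TFLite)
  std::printf("std::rint(2.5)      = %.1f\n", std::rint(2.5));       // 2.0, half toward even (ONNX)
  std::printf("std::nearbyint(2.5) = %.1f\n", std::nearbyint(2.5));  // 2.0, half toward even (Caffe2)
  return 0;
}
```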
`fp` generally denotes the data type of
- the input tensor for `Quantize`,
- the output tensor for `Dequantize`,
- the intermediate tensor if a specific operator uses floating-point registers for computation or for handling the `output_scale`.
  - e.g. ONNXruntime generally handles the `input_scale`-to-`output_scale` transformation with `MlasRequantizeOutput(int Input, int Output, float scale);`, which uses an intermediate floating-point representation (`float`); a sketch of this pattern follows the list.
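A minimal sketch of that float-intermediate pattern (the helper name `RequantizeValue`, the int8 range, and the rounding call are illustrative assumptions, not the actual MLAS code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Rescale an int32 accumulator through a float intermediate,
// round half toward even, then saturate into the int8 range.
int8_t RequantizeValue(int32_t acc, float scale, int32_t output_zero_point) {
  float scaled = static_cast<float>(acc) * scale;                   // float intermediate
  int32_t rounded = static_cast<int32_t>(std::nearbyintf(scaled));  // ties to even
  return static_cast<int8_t>(std::clamp(rounded + output_zero_point, -128, 127));
}
```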
We support the implementations of the different frameworks through the layer parameter `quantize_method`, which selects one when their results are not bit-exact with each other. You can also refer to FEATURES.md for other quantization-related parameters.
| operator \ quantize_method | TFLite | ONNX | Caffe2 |
|---|---|---|---|
| AveragePool | t | o | c |
| BiasAdd | | o | |
| Concat | ~ | | |
| Convolution | t | o | c |
| Deconvolution | | | c |
| EltwiseSum | t | c | c |
| InnerProduct | t | t | |
| LeakyReLU | t | | |
| Power* | t | o | c |
| ReLU | ~ | ~ | ~ |
| ResizeBilinear | ~ | | |
| Sigmoid | ~ | | |
| Softmax | ~ | | |
We denote the TFLite/ONNXruntime/Caffe2 implementations by t/o/c. The `~` entries indicate that the Caffe implementation computes in floating-point representation, for example:

```
// A Dequantize-Op-Quantize procedure, taking ReLU as an example.
float_in  = Dequantize(int_in, input_scale, input_zero_point);
float_out = ReLU(float_in);
int_out   = Quantize(float_out, output_scale, output_zero_point);
```
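The `Quantize`/`Dequantize` helpers are not spelled out in this document; a self-contained sketch of the standard affine scheme `real = scale * (q - zero_point)`, with hypothetical int8 signatures, could look like:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

float Dequantize(int8_t q, float scale, int32_t zero_point) {
  return scale * static_cast<float>(static_cast<int32_t>(q) - zero_point);
}

int8_t Quantize(float real, float scale, int32_t zero_point) {
  int32_t q = static_cast<int32_t>(std::nearbyintf(real / scale)) + zero_point;
  return static_cast<int8_t>(std::clamp(q, -128, 127));  // saturate to int8
}
```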
- Our model zoo does not cover all quantized operators across the frameworks. An entry is left empty if the (framework, operator) combination has not been seen yet.
- Quantized `bias_layer` only occurs in ONNX (which does not support `FC+Bias` fusion yet).
- Only `Quantize` and `Dequantize` operators are mapped to `Power_layer`.
- Since some quantized operators produce bit-exact results across frameworks, for such entries we adopt the implementation from another framework.
- `MaxPool` and `ArgMax` are seen, but they do nothing different for quantized versus floating-point numbers.
- `Convolution` comes in a number of variations; see the following section.
The `output_multiplier` is defined as `output_multiplier = input_scale * weight_scale / output_scale`. Remember that TFLite uses `double` for the scales, while ONNXruntime and Caffe2 use `float`.
The quantized multiplier is calculated as follows (`shift` is a power-of-two normalizer that brings `output_multiplier` into [0.5, 1)):

```
output_multiplier = <double>input_scale * <double>weight_scale / <double>output_scale;
quantized_multiplier = std::round(std::frexp(output_multiplier, &shift) * (1LL << 31));
// or, for channel-wise quantization:
// output_multiplier[ch] = <double>input_scale * <double>weight_scale[ch] / <double>output_scale;
// quantized_multiplier[ch] = std::round(std::frexp(output_multiplier[ch], &shift[ch]) * (1LL << 31));
```
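As a worked example (scale values invented for illustration): `input_scale = 0.02`, `weight_scale = 0.005`, and `output_scale = 0.1` give `output_multiplier = 0.001 = 0.512 * 2^-9`:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
  double output_multiplier = 0.02 * 0.005 / 0.1;            // = 0.001
  int shift = 0;
  double fraction = std::frexp(output_multiplier, &shift);  // 0.512, shift = -9
  int32_t quantized_multiplier =
      static_cast<int32_t>(std::round(fraction * (1LL << 31)));  // 1099511628
  std::printf("fraction=%f shift=%d qm=%d\n", fraction, shift, quantized_multiplier);
  return 0;
}
```

Note that `std::frexp` returns a negative exponent for multipliers below 1; the rounding code below divides by `1 << shift`, so `shift` there corresponds to the negated exponent.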
For convolutions, TFLite transforms to DepthwiseConv when `group` = `in_ch` = `out_ch`.
Different implementations are then derived in SNPS-Caffe to match TFLite:
Scales \ group | 1 | Depthwise | Pointwise* |
---|---|---|---|
PerTensor | D2 | F2 | F2* |
PerChannel | D1 | D2 | D1* |
Two kinds of rounding are used to approximate the affine transformation (from `input_scale` to `output_scale`, using the quantized multiplier):
- The first splits the transformation into two rounded steps, denoted *2-steps-rounding* (shown below).
- The second implements *rounding half toward positive infinity* in a single step, denoted *1-step-rounding*.
```
scaled_acc = SaturatingRoundingDoublingHighMul(<int>acc, <int>quantized_multiplier);
out_acc    = RoundingDivideByPOT(scaled_acc, shift);
// Approximates out_acc = acc * quantized_multiplier / (1 << 31) / (1 << shift),
// with a rounding at each of the two steps.
```
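Neither helper is defined in this document; the sketches below follow the well-known gemmlowp/TFLite reference implementations, treating `shift` as a non-negative right-shift amount:

```cpp
#include <cstdint>
#include <limits>

// High 32 bits of 2*a*b, rounding half away from zero; saturates the
// single overflow case a == b == INT32_MIN.
int32_t SaturatingRoundingDoublingHighMul(int32_t a, int32_t b) {
  bool overflow = (a == b) && (a == std::numeric_limits<int32_t>::min());
  int64_t ab = static_cast<int64_t>(a) * static_cast<int64_t>(b);
  int32_t nudge = ab >= 0 ? (1 << 30) : (1 - (1 << 30));
  int32_t ab_x2_high32 = static_cast<int32_t>((ab + nudge) / (1LL << 31));
  return overflow ? std::numeric_limits<int32_t>::max() : ab_x2_high32;
}

// Divides by 2^exponent, rounding to nearest with ties away from zero.
int32_t RoundingDivideByPOT(int32_t x, int exponent) {
  const int32_t mask = static_cast<int32_t>((1LL << exponent) - 1);
  const int32_t remainder = x & mask;
  const int32_t threshold = (mask >> 1) + ((x < 0) ? 1 : 0);
  return (x >> exponent) + ((remainder > threshold) ? 1 : 0);
}
```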
**D2:** Computes the `output_multiplier` in `double` and applies the 2-steps-rounding above.

**F2:** Uses `<float>` to calculate the `output_multiplier`, then applies the same 2-steps-rounding as D2.
**D1:** Calculates the `output_multiplier` per channel, and uses the simpler rounding to compute the approximate result:

```
scaled_acc = <int64>acc * <int>quantized_multiplier;  // widened 64-bit product
out_acc    = (scaled_acc + (1LL << (31 + shift - 1))) >> (31 + shift);
// That is, it rounds only once, half toward positive infinity.
```
When matching results for bit-exactness, the combination of PerTensor-F2 and PerChannel-D1 (the Pointwise* entries) was found by brute force.
For the other frameworks:
- **ONNXruntime** casts `<int>acc` to `<float>`, multiplies by `<float>output_multiplier`, then requantizes the result.
- **Caffe2** uses single-precision scales; the computation is otherwise the same as F2 above.
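Putting the D2 pieces together, a minimal sketch under the conventions above (`RequantizeD2` is a name invented here; it reuses the two helpers sketched earlier and assumes `output_multiplier` < 1):

```cpp
#include <cmath>
#include <cstdint>

int32_t SaturatingRoundingDoublingHighMul(int32_t a, int32_t b);  // sketched above
int32_t RoundingDivideByPOT(int32_t x, int exponent);             // sketched above

// Hypothetical end-to-end D2 requantization: normalize the double
// multiplier with frexp, then apply the 2-steps-rounding.
// (The fraction-rounds-to-exactly-2^31 corner case is ignored here.)
int32_t RequantizeD2(int32_t acc, double output_multiplier) {
  int exponent = 0;
  double fraction = std::frexp(output_multiplier, &exponent);  // in [0.5, 1)
  int32_t quantized_multiplier =
      static_cast<int32_t>(std::round(fraction * (1LL << 31)));
  int shift = -exponent;  // right-shift amount; positive for multipliers < 1
  int32_t scaled_acc = SaturatingRoundingDoublingHighMul(acc, quantized_multiplier);
  return RoundingDivideByPOT(scaled_acc, shift);
}
```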