Q: In the example, we set the pruning rate of the whole network. How to adjust the pruning rate of a specific layer?
```yaml
# example config
sparsity: 0.25
metrics: l2_norm # The available metrics are listed in `tinynn/graph/modifier.py`
```
A: After calling `pruner.prune()`, a new configuration file containing the sparsity of each operator is generated in place. You can use this file as the configuration for the pruner, or generate a new configuration file based on it (e.g. line 42 in `examples/oneshot/oneshot_prune.py`).
```yaml
# newly generated yaml
sparsity:
  default: 0.25
  model_0_0: 0.25
  model_1_3: 0.25
  model_2_3: 0.25
  model_3_3: 0.25
  model_4_3: 0.25
  model_5_3: 0.25
  model_6_3: 0.25
  model_7_3: 0.25
  model_8_3: 0.25
  model_9_3: 0.25
  model_10_3: 0.25
  model_11_3: 0.25
  model_12_3: 0.25
  model_13_3: 0.25
metrics: l2_norm # Other supported values: random, l1_norm, l2_norm, fpgm
```
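To change the pruning rate of a specific layer, edit its entry in the generated file and pass the file back to the pruner. A sketch, using the per-operator keys from the generated configuration (the values below are only illustrative):

```yaml
sparsity:
  default: 0.25
  model_0_0: 0.5   # prune this layer more aggressively
  model_1_3: 0.1   # prune this layer less
metrics: l2_norm
```

Layers without an explicit entry fall back to the `default` sparsity.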
Q: The training is slow. How can it be accelerated?
A: The training in TinyNeuralNetwork is based on PyTorch. Usually the bottleneck is in the data processing part; you can try LMDB or another in-memory database to accelerate it.
Q: Some operators, such as `max_pool2d_with_indices`, fail during quantization.
A: The quantization-aware training in TinyNeuralNetwork is based on that of PyTorch; it only reduces the complexity of operator fusion and computational graph rewriting.
Operators that are not natively supported by PyTorch's quantization, such as LeakyReLU, are not supported by TinyNeuralNetwork either. Please wrap those modules with `torch.quantization.QuantWrapper`.
(More operators are supported in newer versions of PyTorch, so if you encounter a failure, please consult us first or try a newer version.)
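A minimal sketch of the wrapping (the `Head` module is hypothetical): `QuantWrapper` places a `QuantStub` before and a `DeQuantStub` after the wrapped module, giving it an explicit quantization boundary.

```python
import torch
import torch.nn as nn
from torch.quantization import QuantWrapper

# Hypothetical submodule containing an op unsupported by quantization
class Head(nn.Module):
    def __init__(self):
        super().__init__()
        self.act = nn.LeakyReLU()

    def forward(self, x):
        return self.act(x)

# QuantWrapper surrounds the module with a QuantStub / DeQuantStub pair
head = QuantWrapper(Head())

# Before prepare()/convert(), the stubs are no-ops and the module runs in float
out = head(torch.ones(1, 4))
```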
Q: How to quantize only part of the graph, when the default is to quantize the whole graph?
```python
# Quantization with the whole graph
with model_tracer():
    quantizer = QATQuantizer(model, dummy_input, work_dir='out')
    qat_model = quantizer.quantize()
```
A: First, perform quantization on the whole graph. Then, manually adjust the positions of `QuantStub` and `DeQuantStub` in the generated model code. After that, use the code below to reload the model.
```python
# Reload the model, keeping the manual modifications
with model_tracer():
    quantizer = QATQuantizer(model, dummy_input, work_dir='out', config={'force_overwrite': False})
    qat_model = quantizer.quantize()
```
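For example, to quantize only the first part of a model, the stubs in the generated code can be moved so they enclose just that region (a hand-written sketch; in practice you would edit the file that TinyNeuralNetwork generates in `work_dir`):

```python
import torch
import torch.nn as nn
from torch.quantization import QuantStub, DeQuantStub

class PartiallyQuantized(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.conv = nn.Conv2d(3, 3, 1)
        self.dequant = DeQuantStub()
        # This layer stays outside the quantized region
        self.tail = nn.Conv2d(3, 3, 1)

    def forward(self, x):
        x = self.quant(x)    # quantized region starts here
        x = self.conv(x)
        x = self.dequant(x)  # quantized region ends here
        x = self.tail(x)     # runs in floating point
        return x

out = PartiallyQuantized()(torch.randn(1, 3, 8, 8))
```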
Q: Models may contain extra logic that is needed during training but not during inference, such as the model below (a common scenario in real-world OCR and face recognition). As a result, the quantized model code generated by codegen during training cannot be used for inference.
```python
class FloatModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 1)
        self.conv1 = nn.Conv2d(3, 3, 1)

    def forward(self, x):
        x = self.conv(x)
        # Extra branch that runs only in training mode
        if self.training:
            x = self.conv1(x)
        return x
```
A: There are generally two ways to tackle this problem.
- Use the code generator in TinyNeuralNetwork to create `qat_train_model.py` and `qat_eval_model.py`, with the model in `model.train()` and `model.eval()` mode respectively. Use `qat_train_model.py` for training, and then use `qat_eval_model.py` to load the trained weights when inference is needed. (Since there is no `self.conv1` in `qat_eval_model.py`, you need to set `strict=False` when calling `load_state_dict`.)
- As in the first option, generate two copies of the model, one in training mode and one in evaluation mode. Then make a copy of `qat_train_model.py` and manually replace its forward function with the one from `qat_eval_model.py`. Finally, use the modified script for evaluation.
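The `strict=False` handoff in the first option can be sketched with two toy modules mirroring `FloatModel` (the class names are illustrative stand-ins for the generated `qat_train_model.py` / `qat_eval_model.py`):

```python
import torch
import torch.nn as nn

# Training-mode model: has the extra conv1, like FloatModel in train()
class TrainModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 1)
        self.conv1 = nn.Conv2d(3, 3, 1)

# Evaluation-mode model: conv1 is absent, like FloatModel in eval()
class EvalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 1)

train_model = TrainModel()
eval_model = EvalModel()

# strict=False ignores the conv1.* keys that exist only in the training copy
result = eval_model.load_state_dict(train_model.state_dict(), strict=False)
```

`load_state_dict` returns the lists of missing and unexpected keys, which is a convenient way to verify that only the training-only parameters were skipped.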