A Model for Every User and Budget: Label-Free and Personalized Mixed-Precision Quantization
Edward Fish (edward.fish@surrey.ac.uk), Umberto Michieli (u.michieli@samsung.com), Mete Ozay (m.ozay@samsung.com)
Samsung Research UK, University of Surrey
Interspeech 2023
MyQASR is a mixed-precision quantization method that creates tailored quantization schemes for diverse users, all while meeting any memory requirement and without the need for fine-tuning. To do so, myQASR evaluates the quantization sensitivity of network layers by analyzing full-precision activation values given just a few unlabelled samples from the user. This simple but effective approach can generate a personalized mixed-precision quantization scheme that fits any predetermined memory budget with minimal performance degradation.
Key Takeaways:
- We show how different speakers require different mixed-precision quantization setups to preserve model performance.
- Since in most applications we have anonymized data and no labels, we explore the effect of different speaker profiles on quantization sensitivity by passing unlabelled data through the network and observing activation statistics.
- Our empirical observation is that mean activation values are a reasonable indicator of quantization sensitivity, so we design a method that generates a mixed-precision setup from this data.
- Since finding an optimal mixed-precision setup for an arbitrary memory budget is intractable, and large jumps in quantization resolution between layers harm model performance, we introduce a uniformity constraint when selecting bit depths for each layer (a toy sketch of this allocation idea follows the list).
- We show in the paper how this method works well for quantizing large ASR networks for different speaker profiles, including gender and language, evaluated on LibriSpeech, Common Voice, Google Speech Commands, and FLEURS.
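To make the allocation idea concrete, here is a minimal, self-contained toy sketch (not the exact algorithm from the paper or this repo): layers with smaller mean activation magnitude are assumed less sensitive and are lowered first, one bit at a time, so the final bit depths stay near-uniform while fitting a byte budget.

def assign_bit_depths(mean_acts, param_counts, budget_bytes, bit_choices=(4, 5, 6, 7, 8)):
    """Toy allocator: start every layer at the highest precision, then lower the
    least-sensitive layers (smallest mean activation) one bit at a time until
    the weight memory fits the budget, keeping bit depths near-uniform."""
    n = len(mean_acts)
    bits = [max(bit_choices)] * n
    order = sorted(range(n), key=lambda i: mean_acts[i])  # least sensitive first

    def mem_bytes(b):
        return sum(p * bi for p, bi in zip(param_counts, b)) / 8.0

    while mem_bytes(bits) > budget_bytes:
        lowered = False
        for i in order:
            # only lower layers currently at the maximum depth -> near-uniform setup
            if bits[i] > min(bit_choices) and bits[i] >= max(bits):
                bits[i] -= 1
                lowered = True
                if mem_bytes(bits) <= budget_bytes:
                    break
        if not lowered:  # budget unreachable with the allowed bit choices
            break
    return bits

# Example: four layers of 1M parameters each and a 3.3 MB weight budget.
print(assign_bit_depths([0.9, 0.2, 0.5, 0.1], [1_000_000] * 4, budget_bytes=3_300_000))
# -> [7, 6, 7, 6]: the least-sensitive layers give up precision first.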
We use HuggingFace Transformers for all models. Datasets are downloaded at runtime and partitioned into gender and language splits dynamically, depending on the experiment you would like to run, so no data or model pre-processing is required. The code can be understood as follows:
Data preparation and separation:
- Data is downloaded via the dataloader and partitioned into tune and test segments depending on the arguments provided (a rough sketch of this partitioning is shown below).
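As a rough illustration of the kind of split this produces (the real logic lives in the repo's dataloader; the speaker IDs below are illustrative, and the gender grouping in LibriSpeech actually comes from the speaker metadata), a user partition via HuggingFace datasets could look like:

from datasets import load_dataset

# Illustrative speaker IDs only; the repo's dataloader derives gender/language
# partitions and tune/test splits from the command-line arguments.
target_speakers = {121, 237, 260}

ds = load_dataset("librispeech_asr", "clean", split="test", streaming=True)
user_ds = ds.filter(lambda ex: ex["speaker_id"] in target_speakers)

tune_samples = list(user_ds.take(16))  # a few unlabelled samples for sensitivity/calibration
test_ds = user_ds.skip(16)             # held-out samples for WER evaluation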
Model Wrapping:
- A pre-trained model is downloaded via HuggingFace. We use the relevant configuration file in /configs/ to select which layers in the network will be quantized; they are initially set to 32-bit (full precision). This takes place in the line: wrapped_modules, bit_depths = net_wrap.wrap_modules_in_net(model, quant_cfg)
- wrapped_modules is now a module dictionary containing the new quantizable modules, with methods such as calibration_step1, calibration_step2, and sensitivity_step dispatched by the mode flag in the forward pass below:
def forward(self, x):
    # Layer is initialized as raw for sensitivity analysis
    if self.mode == 'raw':
        out = F.conv1d(x, self.weight, self.bias, self.stride,
                       self.padding, self.dilation, self.groups)
    # Optional mode for distance-based metrics - quantizes the layer temporarily to 8 bits
    elif self.mode == "sensitivity_step":
        out = self.sensitivity_step(x)
    # Stores raw layer inputs and gradients
    elif self.mode == "calibration_step1":
        out = self.calibration_step1(x)
    # Performs quantization calibration in step 2
    elif self.mode == "calibration_step2":
        out = self.calibration_step2(x)
    else:
        raise NotImplementedError
    return out
Sensitivity Analysis:
- Sensitivity analysis is performed by collecting activation statistics via forward hooks while the layer is in self.mode="raw". If a distance-based metric is used, the layer is switched to self.mode="sensitivity_step", which quantizes the layer and computes a loss between the raw and quantized outputs before resetting the layer to mode="raw" (see the sketch below).
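A minimal, self-contained sketch of the hook-based statistics collection; the toy model and random inputs stand in for the wrapped ASR model and the user's audio, and this is not the repo's exact implementation:

import torch
import torch.nn as nn

# Collect per-layer activation statistics with forward hooks while layers run
# in full precision ("raw" mode).
model = nn.Sequential(nn.Conv1d(1, 8, 3), nn.ReLU(), nn.Conv1d(8, 8, 3))
activation_stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Running list of mean absolute activation values for this layer.
        activation_stats.setdefault(name, []).append(output.detach().abs().mean().item())
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules()
           if isinstance(m, nn.Conv1d)]

with torch.no_grad():
    for _ in range(4):                     # stands in for a few unlabelled user samples
        model(torch.randn(1, 1, 16000))

for h in handles:
    h.remove()

# Layers with larger mean activations are treated as more quantization-sensitive.
sensitivity = {n: sum(v) / len(v) for n, v in activation_stats.items()}
print(sensitivity)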
Quant Calibration:
- Once a bit configuration has been set, we simply update the operators with their relevant bit depths using module.w_bit and module.w_qmax, and set calibration_step=2 for the following stages (a hedged sketch of this step follows).
- Hessian-aware and cosine-aware quantization is then performed to find optimal activation values for the layer, using the current weight and activation quantization bits defined in the configuration file for the model.
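Loosely, applying a chosen bit configuration before calibration amounts to something like the sketch below: wrapped_modules comes from the wrapping step above, bit_config is a hypothetical per-layer bit list from the sensitivity analysis, and the exact w_qmax convention should be checked in /quant_layers.

# Hedged sketch only: whether w_qmax is 2 ** w_bit or 2 ** (w_bit - 1) depends on
# the signed/unsigned convention used in /quant_layers.
for (name, module), w_bit in zip(wrapped_modules.items(), bit_config):
    module.w_bit = w_bit
    module.w_qmax = 2 ** (w_bit - 1)
    module.mode = "calibration_step2"  # calibration then runs in this mode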
Inference:
- The model weights and activations are updated in-place during the quant_calibrator.batching_quant_calib() call. This function returns a new module_dict with the activation and weight bits and the parameter count, along with additional info for logging. All results are logged under results/ as CSV files (an illustrative logging snippet is shown below).
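As an illustration of the logging format (the column names, layer names, and values here are assumptions, not the repo's exact schema), per-layer results could be written out like this:

import csv
import os

# Illustrative only: layer names, bit widths, and parameter counts are made up.
rows = [
    {"layer": "encoder.layer.0", "w_bit": 6, "a_bit": 8, "params": 1_179_648},
    {"layer": "encoder.layer.1", "w_bit": 5, "a_bit": 8, "params": 1_179_648},
]
os.makedirs("results", exist_ok=True)
with open("results/example_run.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["layer", "w_bit", "a_bit", "params"])
    writer.writeheader()
    writer.writerows(rows)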
Visualisation:
- A Jupyter notebook is included to create the same figures as in the paper, under /utils/create_figs.ipynb
If you are interested in trying out new sensitivity methods, take a look at the code in utils/quant_calib_orig.py. There you will find the sensitivity methods which compare the raw 32-bit activations with the quantized activations:
elif args.sensitivity_method == "dist_l1":
    dist = -torch.abs(raw_preds_flat - q_preds)
    dist = torch.abs(torch.mean(dist, dim=-1)).cpu()
The raw_preds_flat are the concatenated output values for that layer and the q_preds are the quantized ones. It should be simple to add your own method here and then call it via the arguments.
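For instance, a hypothetical L2-distance variant (not shipped in the repo) could be added in the same style, reusing raw_preds_flat and q_preds:

# Hypothetical addition, mirroring the dist_l1 branch above.
elif args.sensitivity_method == "dist_l2":
    dist = -(raw_preds_flat - q_preds) ** 2
    dist = torch.abs(torch.mean(dist, dim=-1)).cpu()

You would then pass "dist_l2" as the value of --sensitivity_method to use it.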
To add additional calibration methods for quantization, you can add a different calibration method in /quant_layers. You will want to update the code for each module to use an alternative to calibration_step2. For example:
def calibration_step2(self, x):
    # initialize intervals with min-max intervals
    self._initialize_intervals(x)
    # put raw outputs on the GPU; shape: B,1,oc,W,H
    self.raw_out = self.raw_out.to(x.device).unsqueeze(1)
    # put raw gradients on the GPU (if collected)
    self.raw_grad = self.raw_grad.to(x.device) if self.raw_grad is not None else None
    # YOUR NEW QUANTIZATION CALIBRATION SEARCH METHOD HERE!
    self.raw_grad = self.raw_grad.to("cpu") if self.raw_grad is not None else None
    # ensure you set the layer to calibrated or there will be issues
    self.calibrated = True
    out = self.quant_forward(x)
    del self.raw_input, self.raw_out, self.raw_grad
    return out
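As a standalone illustration of what could go in the search slot (independent of the repo's internals, and not its Hessian- or cosine-aware implementation), here is a small cosine-similarity search over symmetric weight-quantization scales:

import torch
import torch.nn.functional as F

def cosine_scale_search(weight, raw_out, x, w_bit=6, n_candidates=20):
    """Toy search for a symmetric per-tensor weight scale that maximizes cosine
    similarity between raw and fake-quantized linear outputs."""
    qmax = 2 ** (w_bit - 1) - 1
    best_scale, best_sim = None, -float("inf")
    base = weight.abs().max() / qmax
    for alpha in torch.linspace(0.5, 1.2, n_candidates):
        scale = base * alpha
        # fake-quantize the weights at the candidate scale
        w_q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax) * scale
        q_out = F.linear(x, w_q)
        sim = F.cosine_similarity(q_out.flatten(), raw_out.flatten(), dim=0)
        if sim > best_sim:
            best_sim, best_scale = sim, scale
    return best_scale

# Tiny self-contained usage with random data.
w = torch.randn(16, 32)
x = torch.randn(4, 32)
raw = F.linear(x, w)
print(cosine_scale_search(w, raw, x))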
First install all necessary libraries.
pip install -r requirements.txt
Our setup used:
python==3.9
torch==2.0.1
torchaudio==2.0.2
cuda==11.7
You can mix and match models, datasets, and quantization methods.
To quantize a wav2vec-base model to 75 MB with cosine calibration, the median sensitivity metric, and the male tune and test partitions of LibriSpeech (Table 3 in our results):
python3 main.py --model wav2vec-base --bs 32 --experiment reproduce --data libri --tune_gender male --test_gender male --sensitivity_method median --calibration_method cosine --target 75
We provide the script run_gender.sh to replicate the results from Table 3 of our main paper. Note: the final results will be slightly better than those reported in the paper, thanks to additional optimization of the sample length parameter during refactoring.
To run the same experiment but with a LibriSpeech user ID instead of a gender partition, swap --tune_gender and --test_gender for --tune_id and --test_id.
For example:
python3 main.py --model wav2vec-base --bs 32 --experiment git_test --data libri --tune_id 121 --test_id 121 --calibration_method cosine --target 75
To run experiments on Wav2Vec-Conformer with Google Speech Commands, tuning and testing on a user ID, change --data and --model:
python3 main.py --model wav2vec-conformer --bs 32 --experiment git_test --data speech_commands --tune_id 1 --test_id 1 --sensitivity_method median --calibration_method cosine --target 2000
Please cite our work if you use this codebase:
@article{fish2023model,
title={A Model for Every User and Budget: Label-Free and Personalized Mixed-Precision Quantization},
author={Fish, Edward and Michieli, Umberto and Ozay, Mete},
journal={INTERSPEECH},
year={2023}
}
The Hessian calibration method is adapted from PTQ4ViT. Please also cite their work if you use this codebase:
@article{PTQ4ViT,
title={PTQ4ViT: Post-Training Quantization Framework for Vision Transformers},
author={Zhihang Yuan and Chenhao Xue and Yiqi Chen and Qiang Wu and Guangyu Sun},
journal={European Conference on Computer Vision (ECCV)},
year={2022},
}