INT8EntropyCalibrator2 implicit quantization superseded by explicit quantization #4095
Comments
Does your project require the use of the specific calibrator from our samples or have you tried looking into Model Optimizer (https://github.com/NVIDIA/TensorRT-Model-Optimizer) for the calibration purposes (it should ensure your model has Q/DQ nodes for the explicit quant approach)? If the former, we can look into updating that sample code. |
Hey moraxu, Sorry for the late response. I missed your message. I haven't looked at or worked with Model Optimizer yet. Thanks for letting me know. I'm not sure I understood the question, but I apply INT8 Entropy Calibrator 2 quantization to my CNN models. It gives the best results. Is that what you meant? So, will implicit quantization still be supported, or are you guys switching completely to explicit quantization only? I would like to avoid changing things in my processing pipeline (training models in PyTorch, converting to ONNX, implicit PTQ quantization) due to possible incompatibility issues and such. Therefore, it would be great if you could update the INT8 Entropy Calibrator 2 sample code. Would that be possible? Thanks! |
Thanks for clarifying, in that case I'll request the sample code to be updated and provide more updates in this ticket - thanks for reporting this. |
That sounds great, moraxu! Thank you. |
@moraxu Hi, does this mean that, after tensorrt 10, the recommended method is: train-pytorch-model -> int8-with-model-optimizer -> export-onnx -> compile-with-tensorrt ? |
@CoinCheung , yes, if you're using PyTorch then Model Optimizer can automate more parts of that flow: https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_pytorch_quantization.html in contrast to manually configuring the optimization and quantization steps within TRT API (unless you need to do that for more control) |
@moraxu I've got a few questions about Model Optimizer, if you don't mind.
Sorry if this is too many questions but the more libraries involved the higher the chance something will be incompatible or not work from my experience. Hence, I would like to know if Model Optimizer can give me exactly the same results as the pipelines that I mentioned before:
Thanks! P.S. I would still appreciate if you guys could update the INT8 quantization sample code so I can use that while investigating and testing Model Optimizer. Thanks! |
Hi @adaber,
a) You mean how does it handle a quantized ONNX? TRT reads the scale factors from the ONNX file (for INT8 layers). These scale factors define how the weights and activations are quantized. If these scale factors are provided, TRT will use them directly.
c) In practice, TRT should not drastically change the quantization itself (i.e., INT8 scale factors should remain intact). Still, the engine-building process may involve optimizations that improve inference performance while maintaining accuracy. @akhilg-nv , please correct me or add anything, since I believe you've worked with quantized workflows more than me. |
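To make the role of the scale factors in the answer above concrete, here is a minimal sketch of symmetric INT8 quantize/dequantize arithmetic, the same shape of computation that Q/DQ nodes in an ONNX file describe. The function names and the example scale of 0.05 are illustrative assumptions and do not come from TensorRT or ModelOpt; a zero point of 0 is assumed.

```python
def quantize(x, scale):
    """Symmetric INT8 quantization: round(x / scale), clamped to [-128, 127]."""
    q = round(x / scale)  # Python's round() rounds halves to even
    return max(-128, min(127, q))


def dequantize(q, scale):
    """Map an INT8 code back to a real value."""
    return q * scale


# Hypothetical scale, standing in for a value stored in a Q/DQ node.
scale = 0.05

for value in [0.0, 0.12, -1.0, 10.0]:
    q = quantize(value, scale)
    print(value, "->", q, "->", dequantize(q, scale))
```

Values whose magnitude exceeds roughly 127 * scale saturate (10.0 maps to code 127 here), which is why the scale stored in the model matters for accuracy.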
Hi @moraxu Firstly, thank you so much for responding so quickly. It is very appreciated!
I will definitely check their forums. Please, let me know if you manage to find out what Model Optimizer's quantization config instance would yield similar results to INT8 Entropy Calibration 2 by talking to, I assume, your coworkers (Model Optimizer team). Thank you for providing the info on the TRT engine-creation process and timing caches, too. Thanks! |
@adaber please follow the ModelOpt example here or python API's to quantize an ONNX model. Note that, Then you can compile the output explicit ONNX model with TensorRT tool: |
Hi @riyadshairi979, Thanks for your input. It is appreciated. I'm familiar with these approaches; however, my concern is that there are posts where people complain about not getting as good a result with either ONNX Runtime or ModelOpt as with TensorRT's Int8EntropyCalibrator2-based INT8 quantization. You even mentioned something similar (NVIDIA/TensorRT-Model-Optimizer#46). I do appreciate that you guys have been working on ModelOpt. It seems like a really good tool. I am just trying to get familiar with some aspects of ModelOpt before I commit to spending time getting it to work and incorporating it into my processing pipeline. @riyadshairi979 Quick question. You mention using build_engine.p, but I assume I don't need to include the implicit IInt8EntropyCalibrator2 since the model has already been INT8 quantized and IInt8EntropyCalibrator2 is deprecated? @riyadshairi979 @moraxu Thank you again for your prompt responses and willingness to help! |
It means, sometimes TensorRT deployed EQ network latency > IQ network latency. ModelOpt team is actively working with TensorRT team to minimize this type of gap for various models.
Choice of calibrator might have impact on the accuracy of the model but not latency. If you see accuracy regression with modelopt quantization, please file a bug here with reproducible model and commands.
Right. |
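The point that the calibrator affects accuracy but not latency can be illustrated with a small sketch. It does not implement the entropy (KL-divergence) calibrator; the two scales below are simple stand-ins (a "max" scale that covers an outlier and a hypothetical clipped scale that ignores it), and the data is made up. Both choices produce the same number of INT8 operations, only the rounding and clipping error changes.

```python
def int8_roundtrip(x, scale):
    """Quantize x to symmetric INT8 with the given scale, then dequantize."""
    q = max(-128, min(127, round(x / scale)))
    return q * scale


def abs_error(x, scale):
    """Absolute error introduced by the INT8 round trip."""
    return abs(x - int8_roundtrip(x, scale))


# Hypothetical activations: a bulk of values in [-1, 1] plus one large outlier.
bulk = [i / 100.0 for i in range(-100, 101)]
outlier = 50.0

max_scale = outlier / 127.0   # range chosen to cover the outlier
clipped_scale = 1.0 / 127.0   # range chosen to cover only the bulk

bulk_worst_max = max(abs_error(x, max_scale) for x in bulk)
bulk_worst_clipped = max(abs_error(x, clipped_scale) for x in bulk)
outlier_err_max = abs_error(outlier, max_scale)
outlier_err_clipped = abs_error(outlier, clipped_scale)

print("worst bulk error, max scale:    ", bulk_worst_max)
print("worst bulk error, clipped scale:", bulk_worst_clipped)
print("outlier error, max scale:       ", outlier_err_max)
print("outlier error, clipped scale:   ", outlier_err_clipped)
```

The max scale represents the outlier almost exactly but gives the bulk coarse resolution; the clipped scale does the opposite. A calibrator's job is to pick a point on that trade-off, so two calibrators can give different accuracy on the same model.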
@moraxu Hi, can we use explicit INT8 quantization now? I mean using ModelOpt to quantize the PyTorch model to INT8, exporting it from PyTorch to ONNX, and then using TensorRT to build a TensorRT engine from it. I ask because I just tried that and got an error message like this: As for my code, I just commented out the INT8 calibration part on the TensorRT side and set the calibrator to nullptr: Do you know how I could make this work? |
Could you share your entire full code snippet, @CoinCheung ? I can open a bug internally and have someone update that calibrator code at the same time. Are you using ModelOpt in your code or not? |
@moraxu Hi, I packed up the code and the associated ONNX file; it is accessible here: https://github.com/CoinCheung/eewee/releases/download/0.0.0/code.zip There is a readme file in the zip file that describes the steps I take to trigger this error.
Yes, I used ModelOpt to quantize the model in the pytorch side, and then export it to onnx. Then I used the onnx file in the tensorrt side. |
Got a quick question. Does Model Optimizer work on Windows ? It says Linux but there is one post asking about an issue with the Python version and the listed environment is Windows. The person who helped with this issue didn't make any comments on that and so I assume it works on Windows, too. (NVIDIA/TensorRT-Model-Optimizer#26) Thanks again for all the help! |
@riyadshairi979 , would you be able to check if @CoinCheung 's ModelOpt code in his zipped export_onnx.py file is correct? If so, I'd open an internal bug on TRT Side for someone to look at it |
@adaber , I don't see Windows in https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html so I will let @riyadshairi979 confirm |
I have a somewhat related question on this topic. For the two pipelines described below, is there good evidence that inference times for #2 are significantly faster than #1? I assume gpu memory requirements are lower for #2.
I'm working with a CNN-based segmentation model, and the inference times I get for the two pipelines are similar. Also, if there is a complete, end-to-end tutorial script for pipeline #2 for a simple CNN model, can that please be shared? |
@ckolluru , in theory, INT8 inference should generally offer better performance than FP16 due to lower precision calculations, which use less computational power and memory bandwidth. However, I believe for CNN-based models, convolution operations might not be as bottlenecked by precision reduction so the speed improvement might not be that visible. Another possible reason for similar inference times between FP16/INT8 pipelines could be poor INT8 calibration? GPU memory consumption should indeed be lower with INT8 than FP16 because INT8 uses 1 byte per weight/activation, while FP16 uses 2 bytes.
If https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_pytorch_quantization.html is not sufficient then please check on https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues |
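The bytes-per-element argument in the reply above can be turned into quick arithmetic. This is only a back-of-the-envelope sketch: the parameter count is a made-up example, and it counts weight storage only, ignoring activations, workspace, per-tensor scales, and any layers TensorRT keeps in higher precision.

```python
# Storage size per element for each precision.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_bytes(num_params, dtype):
    """Approximate bytes needed to store num_params weights in dtype."""
    return num_params * BYTES_PER_ELEMENT[dtype]


num_params = 25_000_000  # hypothetical CNN size, not taken from the thread

for dtype in ("fp32", "fp16", "int8"):
    mib = weight_bytes(num_params, dtype) / (1024 ** 2)
    print(f"{dtype}: {mib:.1f} MiB")
```

For weights alone, INT8 halves the FP16 footprint; the real engine size and runtime memory depend on what the builder actually selects for each layer.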
@moraxu Hi, I have a somewhat related question. The TensorRT developer guide mentions that implicit quantization is deprecated: Section 7.1.2: Explicit vs Implicit Quantization
However, when working with DLA, the same guide mentions that DLA does not support explicit quantization: Section 13: Working with DLA
Is it your plan to eventually support explicit quantization with DLAs and in the meantime we have to use implicit quantization (which is deprecated)? |
@maisa32 sorry for the late reply, I've just checked with the PM team and:
|
Description
Hi,
I have been using the INT8 Entropy Calibrator 2 for INT8 quantization in Python and it’s been working well (TensorRT 10.0.1). The example of how I use the INT8 Entropy Calibrator 2 can be found in the official TRT GitHub repo (TensorRT/samples/python/efficientdet/build_engine.py at release/10.0 · NVIDIA/TensorRT · GitHub)
The warning I’ve been getting starting with TensorRT 10.1 is that the INT8 Entropy Calibrator 2 implicit quantization has been deprecated and superseded by explicit quantization.
I’ve read the official document on the difference between the implicit and explicit quantization processes (Developer Guide :: NVIDIA Deep Learning TensorRT Documentation) and they seem to work differently. The explicit quantization seems to expect a network to have QuantizeLayer and DequantizeLayer layers which my networks don’t. The implicit quantization can be used when those layers are not present in a network. Therefore, I am confused about how the implicit quantization can be superseded by the explicit quantization since they seem to work differently.
So, my question is what needs to be modified in the standard INT8 Calibrator 2 quantization method (TensorRT/samples/python/efficientdet/build_engine.py at release/10.0 · NVIDIA/TensorRT · GitHub) for the deprecation warning not to show up ? Or what is the proper way to implement the INT8 Calibrator 2 implicit quantization now that the current one is deprecated ? Couldn’t find any example using a newer TensorRT version (10.1 and up)
Thank you!
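One way to tell which of the two modes described above a model is prepared for is to check whether its graph contains quantize/dequantize operators. The sketch below assumes the model is an ONNX file, where these operators are named `QuantizeLinear` and `DequantizeLinear`; the helper name `uses_explicit_quantization` is hypothetical, and the function takes a plain list of operator type names so it runs without extra packages.

```python
QDQ_OPS = {"QuantizeLinear", "DequantizeLinear"}


def uses_explicit_quantization(op_types):
    """Return True if both Q and DQ operators appear among the op types."""
    return QDQ_OPS.issubset(set(op_types))


# With the onnx package installed, the list could be obtained like this
# (the file name is a placeholder):
#   import onnx
#   op_types = [node.op_type for node in onnx.load("model.onnx").graph.node]

plain_cnn = ["Conv", "Relu", "MaxPool", "Gemm"]
qdq_cnn = ["QuantizeLinear", "DequantizeLinear", "Conv", "Relu"]

print(uses_explicit_quantization(plain_cnn))  # False
print(uses_explicit_quantization(qdq_cnn))    # True
```

A model without these operators, like the networks described in this post, has no quantization information of its own, which is why implicit quantization needs a calibrator at build time.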
Environment
TensorRT Version: 10.1
NVIDIA GPU: 3090
Operating System: Windows 10
Python Version: 3.9.19