Accuracy loss of TensorRT 8.6 when running INT8 Quantized Resnet18 on GPU A4000 #4079
Comments
Did you start from a pretrained model for QAT? If yes, does the FP32 (un-quantized) model also show inconsistent results between TRT and onnxruntime? Have you tried a different TRT version? Have you tried other calibration methods between fine-tuning? |
@lix19937 Thank you very much for helping out!
Yes, to be specific, what I did was PTQ using TRT modelopt.
No, the FP32 model's outputs when converted to TRT were almost identical to the native PyTorch model's. I didn't verify onnxruntime for this.
Unfortunately this is not easy to do right now on my side. Are you expecting this will be fixed in TRT 10? I was under the impression that QDQ quantization has been supported even earlier than 8.6, so it won't be a version issue. But correct me if I am wrong.
Do you mean the implicit calibration methods within TRT? If so, I am not using that. The ONNX model I provided to TRT already contains the QDQ nodes. If you mean other calibration in TRT modelopt, I tried both Smoothing and the Default (MINMAX) calibration methods, and both show the same regression. Thanks and looking forward to your responses! |
Following up from the discussion on your issue in TRT ModelOpt, it is possible the accuracy degradation comes from the fusions performed with INT8 convolution. Could you try removing QDQ for all convolutional layers (not just the first one) and compare the accuracy with torch? You should be able to do this by modifying the config as you've done before, or by using a filter function for convolutional layers in your torch model. It is possible your specific application may work better for now without quantizing conv layers, though this will also help us in understanding and investigating the root cause of the accuracy discrepancy you are seeing.
|
Got it, thanks for the follow up @akhilg-nv. Sorry for the delay, I was away last & this week. I can try the experiment you suggested next week. To clarify: If we skip quantizing the conv layers, it means we will only have the last layer -- the fully connected classification layer. Is that okay?
This won't be true unfortunately. The goal of quantizing those layers is to reduce the model's inference latency. However, I will certainly conduct the experiment to see whether the regression goes away. |
Hi @akhilg-nv, I disabled quantization of all the Conv layers: what got quantized are:
I found that the TRT outputs still differ from the native outputs (classification distribution). They differ by >=1e-3. I also noticed that for all the examples where the native model makes a correct prediction, the TRT model also makes a correct prediction. In other words, although the distribution differs, the logits are equal after taking However, on the classifications that both versions get wrong, the logits differ (for example, the correct result is 0, the native model gives 1, and TRT can give 2). Does this mean there is something wrong with how TRT resolves QDQ in general? I would really expect the output of the TRT model to be close to the output of the native model, pointwise. Otherwise, if I want to perform a segmentation task for example, I won't be able to take a |
Thanks for running this experiment @YixuanSeanZhou. To confirm, you're seeing that the partially-quantized model run with ONNX runtime has matching accuracy with the torch quantized model, but the TRT model's accuracy differs even if you only quantize pooling, residual connections, and fully connected layer? This is a bit strange since iirc your previous bisection experiment revealed accuracy differences after convolution. Some suggestions to look into:
We will investigate this as well on our end, thanks for the detailed information. |
Hi @akhilg-nv, thanks for the response. I didn't check the diff between ONNX and native for this partially quantized model. I will double check that later and get back to you. Correct, this is surprising as well. I thought it was convolution that caused the issue, but maybe there is more under the hood. Also, to eliminate the issue of batch norm, in this model I actually chose the most basic resnet18, which doesn't even have batchnorm in it. Bisecting requires more effort; I can hopefully get to it next week! To provide some data points for your investigation, here are two ONNX models I have, quantized and unquantized: https://file.io/nA8lR0aZHmXz. I hope this helps in your investigation of what could potentially go wrong in TRT. (Or maybe you can figure out what user mistake I could be making!) Thanks again! |
Hi @YixuanSeanZhou, it's worth noting that there is some expected error after quantizing, so difference of 1e-3 with no change in positive classification may be expected. Double checking with torch/ORT will help confirm if the partially quantized model has important error or not. You mention your model does not have BN layers, could you share which model you use from torchvision? Also, the link for the ONNX file seems to have expired, could you re-upload it? |
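To give a rough sense of the "expected error after quantizing" mentioned above, here is a minimal NumPy sketch of a single Q/DQ pair. It assumes symmetric per-tensor INT8 quantization with scale = amax / 127 and max calibration; the helper name `fake_quantize_int8` is hypothetical, and this is not TensorRT's or ModelOpt's actual implementation.

```python
import numpy as np

def fake_quantize_int8(x, amax):
    """Quantize to the INT8 grid and dequantize again (Q followed by DQ).

    Assumption: symmetric per-tensor quantization with scale = amax / 127.
    """
    scale = amax / 127.0
    q = np.clip(np.round(x / scale), -127, 127)  # values on the integer grid
    return q * scale                             # back to floating point

# Toy activations spread over [-1, 1].
x = np.linspace(-1.0, 1.0, 1001)
amax = float(np.max(np.abs(x)))   # "max" calibration: largest magnitude seen
scale = amax / 127.0

x_dq = fake_quantize_int8(x, amax)
max_err = float(np.max(np.abs(x - x_dq)))

# Rounding to the nearest grid point costs at most half a step per element.
print(f"step = {scale:.6f}, max round-trip error = {max_err:.6f}")
```

With a range of [-1, 1] the step is 1/127, so a single Q/DQ pair can already move a value by up to about 3.9e-3, which is above a 1e-3 tolerance before any layer-to-layer accumulation.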
Hi @akhilg-nv, thanks for taking a look! Could you try this link: https://file.io/3M5bla347qa0? It seems to be working for me when I try to re-download. I apologize that the previous link didn't work. The torchvision model I used is the basic resnet with the last layer replaced to output 10 classes.
I think the diff can be worse than that. For negative classifications, we don't have alignment, and I think the element diff in the distribution can sometimes be pretty large (1e-1 level). Notice that 1e-1 is pretty significant as this is a distribution of 10 items (softmax on tensor of size 10). If we have such a regression how are we going to apply the model on other tasks -- e.g. segmentations?
I will try to find time to do it and get back to you. |
Hi @YixuanSeanZhou, the new link also doesn't work - I get the following error: "The transfer you requested has been deleted." Perhaps you could try sharing the ONNX model a different way? Regarding the resnet architecture, I am a bit confused why you say there is no batch norm, since I don't see you removing it in your sample code. Below I've pasted a snippet:
|
Hi @akhilg-nv, I am terribly sorry for the confusion... I didn't re-check the model architecture. I looked at the ONNX graph, saw there was no BatchNorm, and falsely assumed I had picked a model with batchnorm removed. I think maybe it's just fused with the Conv layers. If it is indeed fused with the Conv layers, then batchnorm is running in FP32. If you see the attached screenshot, the conv layers (and things in between) are running in FP32. Regarding the ONNX file, can you try this Google Drive link: https://drive.google.com/file/d/1AGHoPgYIRg3dt0ZJz7yVOgTnTR6hPGMw/view?usp=sharing. Thanks again! |
Hey @YixuanSeanZhou, I was unable to reproduce the accuracy issue you've observed. My steps are:
|
@YixuanSeanZhou if there's no update until 12/12 we'll close this ticket. |
Thank you for providing this repro. I was able to verify that with this approach ONNX Runtime and the TRT-compiled model produce the same output. In my workflow, we are using a wrapper to run TRT rather than directly executing it from docker. Given this, I will try to figure out whether the issue happens within our wrapper. Thank you so much for the help!! I think we can close the issue for now |
Description
When performing Resnet18 PTQ using TRT-modelopt, I encountered the following issue when compiling the model with TRT.
First off, I started with a pretrained resnet18 from torchvision. I replaced the last fully connected layer to fit my dataset (for example, CIFAR-10). I also replaced all the skip connections (the plus) with an ElementwiseAdd layer and defined its quantization layer myself as follows (code attached at the end). The reason I do this is to facilitate the Q/DQ fusion so that every layer can run in INT8.
Then, when compiling the exported ONNX model with TRT, I found that the TRT outputs are very different from those of the fake Q/DQ model in Python, and from those of the fake Q/DQ ONNX model when run with ONNX Runtime (np.allclose with 1e-3 as the threshold failed). Comparing TRT and native output, the classification results disagree for ~2.3% of the samples.
I discussed this with TRT modelopt in this issue and they suggested filing a bug report here.
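The two checks described above (a pointwise np.allclose test and a classification disagreement rate) can be sketched as follows. This is only an illustration on toy data: the helper name `compare_outputs` is hypothetical, and it assumes the 1e-3 threshold is an absolute tolerance, which the report does not specify.

```python
import numpy as np

def compare_outputs(reference, candidate, atol=1e-3):
    """Compare two (N, num_classes) output arrays from different runtimes."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    # Pointwise check; rtol=0 so only the absolute threshold applies (assumption).
    close = bool(np.allclose(reference, candidate, rtol=0.0, atol=atol))
    # Largest elementwise gap between the two outputs.
    max_abs_diff = float(np.max(np.abs(reference - candidate)))
    # Fraction of samples whose predicted class (argmax) differs.
    disagreement = float(
        np.mean(reference.argmax(axis=1) != candidate.argmax(axis=1))
    )
    return {"allclose": close, "max_abs_diff": max_abs_diff, "disagreement": disagreement}

# Toy data standing in for native and TRT outputs (not real measurements).
ref = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
out = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]])
report = compare_outputs(ref, out)
print(report)
```

Reporting the maximum absolute difference and the argmax disagreement separately helps distinguish small numerical drift from changes that actually flip predictions.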
Environment
TensorRT Version: 8.6.1
NVIDIA GPU: A4000
NVIDIA Driver Version: 535.183.01
CUDA Version: 12.2
Python Version (if applicable): 3.10.1
PyTorch Version (if applicable): '2.4.0+cu124'
Relevant Files
Model link: You can download the onnx model and the TRT engine here: https://file.io/GnuiEMNeebQ1
Steps To Reproduce
Run the TRT model using the Python API and the ONNX model on the CIFAR-10 dataset using the following data loader, and compare the results.
Have you tried the latest release?: Haven't tried TRT10, but we don't plan to upgrade in the short period. I was under the impression 8.6 should be okay.
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): Yes, ONNX Runtime shows 1% disagreement with the native model.
Appendix
Visualizing the TRT engine, I think it is completely within my expectation with everything being fused as INT8 kernels.