Lower-than-Expected Performance Improvement with INT8 Quantization in TensorRT 10.0 on A100 GPU #3776
Comments
+1
+1
Let me check with the author of the blog; I'll come back later :-)
@teith Could we ask you to post your TRT engine or send it to us for analysis, along with the environment you are using for repro?
@teith Hello, may I ask if you have compared the accuracy of FP16 and INT8?
What node are you using: AWS, GCP, or your own machine? Many other factors come into play here, for example memory size, concurrent workloads, temperature, and so on. If possible, can you share your ONNX model? The UNet part alone should be enough. @teith cc @TheBge12138
@TheBge12138 Thanks for the feedback. BTW, what quantization config did you use?
@jingyu-ml I didn't change any code in the demo, so it's probably the default?
Hi, @azhurkevich!
Here are the .plan models: And the environment:
Hi, @jingyu-ml
I've used an A100 40GB on Lambda Cloud.
@teith Is it possible for you to attach the INT8 UNet ONNX file somewhere?
Hi, @jingyu-ml!
@teith Apologies for the delayed response. I ran your models on our A100-PCIE-40G GPU. Here are the logs: FP16:
INT8:
That is a 1.25x speedup over FP16 TRT, which is somewhat lower than our internal benchmarks but still faster than FP16. This discrepancy may be due to server instability; we plan to conduct further testing with your models. It's important to mention that the performance figures reported in our previous blog were based on the Ada 6000 GPU, not the A100. Performance can vary significantly across different GPUs.

Additionally, could you run the following on your server, making sure trtexec has been updated to version 9.3 or 10.0? 10.0 should be faster than 9.3.

# Download the TensorRT tar file and unzip it
# cd into the folder
export LD_LIBRARY_PATH=$(pwd)/lib:$LD_LIBRARY_PATH
export PATH=$(pwd)/bin:$PATH
cd python
pip install tensorrt-<version>-cp<version>-cp<version>m-linux_x86_64.whl
cd ../onnx_graphsurgeon
pip install onnx_graphsurgeon-<version>-py2.py3-none-any.whl
# check the trtexec version

This gives you the newest trtexec and the matching TensorRT Python package. Then try this on your INT8 ONNX model:

trtexec --onnx=./unet.onnx --shapes=sample:2x4x128x128,timestep:1,encoder_hidden_states:2x77x2048,text_embeds:2x1280,time_ids:2x6 --fp16 --int8 --builderOptimizationLevel=4 --saveEngine=unetxl.int8.plan

Then remove the --int8 flag and try again on your FP16 ONNX model. If you are able to run these commands, please also share the full logs with me; trtexec prints the inference latency in the log. Then we can discuss the next step.
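As a quick sanity check after installing the wheel, you can confirm which TensorRT Python package is active (a minimal sketch, assuming the wheel above was installed into the current environment):

```python
# Minimal sanity check: print the active TensorRT Python package version,
# which should match the trtexec build set up above (10.0.x or 9.3.x).
import tensorrt as trt

print("TensorRT Python version:", trt.__version__)
assert trt.__version__.startswith(("10.", "9.3")), "unexpected TensorRT version"
```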
Adding to @jingyu-ml's response above, you can refer to this latest benchmark to see the expected speedup on other NVIDIA hardware. In general, we do observe a higher speedup on Ada-generation GPUs than on the A100. Note that the quantization techniques used in TensorRT have now been moved into a new NVIDIA product called TensorRT Model Optimizer. This does not change your workflow. We encourage you to check out the related resources, and we look forward to your feedback:
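For anyone who wants to try the Model Optimizer path directly, the post-training INT8 flow looks roughly like the sketch below. This is only an illustration: the toy model, the `calib_loader`, and the choice of `INT8_DEFAULT_CFG` are assumptions for demonstration, not details taken from this thread.

```python
# Rough sketch of post-training INT8 quantization with TensorRT Model Optimizer
# (nvidia-modelopt). A toy model and random data stand in for the SDXL UNet
# and real calibration prompts.
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
calib_loader = [torch.randn(8, 64) for _ in range(4)]

def forward_loop(m):
    # Run a few calibration batches so activation ranges can be collected.
    for batch in calib_loader:
        m(batch)

# Quantize with a stock INT8 config; export to ONNX and build the TensorRT
# engine afterwards, as in the demo.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```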
@TheBge12138
Description
I recently attempted to utilize INT8 quantization with Stable Diffusion XL to enhance inference performance based on the claims made in a recent TensorRT blog post, which suggested that this approach could achieve a performance improvement of nearly 2x. However, my experiences do not align with these expectations. After implementing INT8 quantization, the performance improvement was notably less than advertised.
Environment
TensorRT Version: 10.0.0b6
NVIDIA GPU: A100
Operating System:
Python Version: 3.10
Baremetal or Container (if so, version): Triton 24.03
Relevant Files
Logs: https://yaso.su/SDXLTestLogs
Steps To Reproduce
I closely followed the steps laid out in the README of the TensorRT repository for the Stable Diffusion XL demo, which you can find here: https://github.com/NVIDIA/TensorRT/tree/release/10.0/demo/Diffusion. Here's a brief rundown of what I did:
Expected Outcome: Based on NVIDIA's recommendations and claims, I was expecting that turning on INT8 quantization would almost double the performance compared to the standard run.
Actual Outcome: The performance boost from using INT8 quantization was much smaller than expected. To put it in numbers: without INT8 quantization, inference took about 2779.89 ms (0.36 images per second), while with INT8 quantization it improved only slightly, to about 2564.51 ms (0.39 images per second). This falls far short of the nearly 2x improvement that was claimed.
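For concreteness, the speedup implied by those two latencies (a quick back-of-the-envelope check on the numbers above, not additional benchmark data):

```python
# Speedup implied by the reported end-to-end latencies.
fp16_ms = 2779.89
int8_ms = 2564.51

print(f"INT8 speedup over FP16: {fp16_ms / int8_ms:.2f}x")  # ~1.08x, versus the ~2x expected
```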
Commands or scripts:
https://github.com/NVIDIA/TensorRT/tree/release/10.0/demo/Diffusion
Have you tried the latest release?:
Yes, the latest, 10.0.0b6. TensorRT 9.3.0 gives the same result.