Lower-than-Expected Performance Improvement with INT8 Quantization in TensorRT 10.0 on A100 GPU #3776

Open
teith opened this issue Apr 5, 2024 · 15 comments
Assignees
Labels
triaged Issue has been triaged by maintainers

Comments

@teith

teith commented Apr 5, 2024

Description

I recently attempted to utilize INT8 quantization with Stable Diffusion XL to enhance inference performance based on the claims made in a recent TensorRT blog post, which suggested that this approach could achieve a performance improvement of nearly 2x. However, my experiences do not align with these expectations. After implementing INT8 quantization, the performance improvement was notably less than advertised.

Environment

TensorRT Version:
10.0.0b6

NVIDIA GPU:
A100

Operating System:
Python Version:
3.10

Baremetal or Container (if so, version):
Triton 24.03

Relevant Files

Logs:
https://yaso.su/SDXLTestLogs

Steps To Reproduce

I closely followed the steps laid out in the README of the TensorRT repository for the Stable Diffusion XL demo, which you can find here: https://github.com/NVIDIA/TensorRT/tree/release/10.0/demo/Diffusion. Here's a brief rundown of what I did:

  1. I cloned the TensorRT repository and navigated to the section for the Stable Diffusion XL demo, as instructed in the README.
  2. I followed all the setup and installation instructions in the README to properly prepare my environment and the models for testing. This included setups for both the standard and INT8 quantized inferences.
  3. I first ran the model with the standard setup to get a baseline of how fast it performed.
  4. Then, I ran the model with INT8 quantization enabled to see how much the performance would improve.

Expected Outcome: Based on NVIDIA's recommendations and claims, I was expecting that turning on INT8 quantization would almost double the performance compared to the standard run.

Actual Outcome: The performance boost from using INT8 quantization was much less than expected. To put it in numbers, without INT8 quantization, the inference took about 2779.89 ms (equating to 0.36 images per second), but with INT8 quantization, it improved slightly to about 2564.51 ms (or 0.39 images per second). This improvement is much smaller than the nearly 2x faster performance I was anticipating, which is a significant difference from what was claimed.
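The gap between the reported numbers and the expected ~2x can be checked with a few lines of arithmetic. This is a minimal sketch that assumes one image per pipeline run; the function names are illustrative and not part of the demo.

```python
def images_per_second(latency_ms: float) -> float:
    """Throughput, assuming each pipeline run produces one image."""
    return 1000.0 / latency_ms


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


# Latencies reported above (end-to-end pipeline, in milliseconds).
baseline_ms = 2779.89  # without INT8 quantization
int8_ms = 2564.51      # with --int8 --quantization-level 3

print(f"baseline: {images_per_second(baseline_ms):.2f} images/s")  # 0.36
print(f"int8:     {images_per_second(int8_ms):.2f} images/s")      # 0.39
print(f"speedup:  {speedup(baseline_ms, int8_ms):.2f}x")           # 1.08x
```

With these figures the end-to-end speedup is about 1.08x, far from the nearly 2x that was expected.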

Commands or scripts:
https://github.com/NVIDIA/TensorRT/tree/release/10.0/demo/Diffusion

python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine
python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine-sdxl --int8 --quantization-level 3

Have you tried the latest release?:
Yes, the latest, 10.0.0b6
9.3.0 has the same result.

@D1-3105

D1-3105 commented Apr 5, 2024

+1

1 similar comment
@yaroslavMain

+1

@SkobelkinYaroslav

@zerollzeng
Collaborator

Let me check with the author of the blog and come back later :-)

@zerollzeng zerollzeng self-assigned this Apr 7, 2024
@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label Apr 7, 2024
@azhurkevich
Contributor

azhurkevich commented Apr 9, 2024

@teith Could you post your TRT engine, or send it to us, for analysis? Please also share the environment you are using for the repro.

@TheBge12138

TheBge12138 commented Apr 11, 2024

@teith Hello, may I ask whether you have compared the accuracy of fp16 and int8?
I ran the diffusion demo with fp16 and int8. The images generated with the same seed are quite different, and the int8 result is not as good as described in the blog. The timing on A100 matches yours.
[Images: fp16 (xl_base-a_photo_of-111-1-5594) and int8 outputs generated with the same seed = 111]
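One way to put a number on "quite different" is a pixel-level metric such as PSNR between the fp16 and int8 outputs for the same seed. The sketch below is an assumption on my part: nobody in this thread used PSNR, and the pixel values are toy data rather than real demo output. Inputs are flat sequences of 8-bit channel values.

```python
import math


def psnr(reference, candidate, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all channel values.
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy data standing in for flattened fp16 and int8 images (hypothetical values).
fp16_pixels = [12, 200, 34, 90, 255, 0]
int8_pixels = [10, 198, 40, 92, 250, 3]

print(f"PSNR: {psnr(fp16_pixels, int8_pixels):.2f} dB")
```

Higher values mean the two images are closer. A pixel metric like this does not capture perceptual quality, so it complements a visual comparison rather than replacing it.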

@jingyu-ml

jingyu-ml commented Apr 12, 2024

@teith

What node are you using? Are you on AWS, GCP, or your own machine? Many other factors can affect the result, for example memory size, concurrent workloads, temperature, and so on.

If possible, can you give me your ONNX model? The UNet part alone should be enough. @teith cc @TheBge12138

@jingyu-ml

jingyu-ml commented Apr 12, 2024

@TheBge12138 Thanks for the feedback.
The current code base is from 6 months ago and differs from what the blog describes. The team is refreshing the code in this repo and will publish it very soon; I will ping you again when the update is finished.

BTW, which quant config did you use?

@TheBge12138

TheBge12138 commented Apr 12, 2024

@TheBge12138 Thanks for the feedback, In the current code base came from 6 months ago, which is different from the Blog, the team is refreshing the code on this repo and will publish it very soon, and I will pin you again when the updating is finished.

BTW, what quant config you used?

@jingyu-ml I didn't change any code in the demo, so it probably uses the default?
Beyond the timing issue, I care more about accuracy. I know that the iterative steps in the UNet lose a lot of accuracy, so I'm very curious how you solve that. I heard in another issue that you will update ammo and publish new calibration scripts in the future; I look forward to your work.
Thanks!

@teith
Author

teith commented Apr 12, 2024

Hi, @azhurkevich!

Do you think we can ask you to post your TRT engine/send it to us for analysis? As well as env you are using for repro

Here are the .plan models:
https://mega.nz/folder/kCMVQDiR#DFofS7bZW1cBTRg0VJt6JA

And ENVS:
https://yaso.su/A100ENVS

@teith
Author

teith commented Apr 12, 2024

Hi, @jingyu-ml

What node are you using? Are you using AWS or GCP or your own machine?

I've used A100 40GB on Lambda Cloud

@jingyu-ml

@teith Would it be possible for you to upload the int8 UNet ONNX file somewhere?

@teith
Author

teith commented Apr 15, 2024

Hi, @jingyu-ml !

is that possible for you to attach the int8 unet onnx file at somewhere?

model.onnx
76d7dfc4-f8d8-11ee-a05c-0242ac120002

@jingyu-ml

jingyu-ml commented May 7, 2024

@teith Apologies for the delayed response.

I ran your models on our A100-PCIE-40G GPU.

Here are the logs:
fp16.log
int8.log

FP16:

[05/07/2024-18:08:35] [I] Latency: min = 91.6599 ms, max = 96.4124 ms, mean = 93.0172 ms, median = 92.9111 ms, percentile(90%) = 93.825 ms, percentile(95%) = 94.1382 ms, percentile(99%) = 96.4124 ms

INT8:

[05/07/2024-17:50:55] [I] Latency: min = 75.9916 ms, max = 77.4071 ms, mean = 76.4652 ms, median = 76.4912 ms, percentile(90%) = 76.7236 ms, percentile(95%) = 76.9514 ms, percentile(99%) = 77.4071 ms

That is roughly a 1.22x speedup over FP16 TRT (mean latency 93.02 ms vs. 76.47 ms), which is somewhat lower than our internal benchmarks but still faster than FP16. This discrepancy may be due to server instability; we plan to conduct further testing on your models. It's important to mention that the performance figures reported in our previous blog were measured on the RTX 6000 Ada GPU, not the A100. Performance can vary significantly across different GPUs.
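The ratio can be recomputed directly from the two trtexec latency summaries quoted above. The following is a minimal sketch: the regular expression only assumes the `mean = ... ms, median = ... ms` layout visible in those two log lines, and the helper name is illustrative.

```python
import re

# Matches the mean and median fields of a trtexec "Latency:" summary line.
LATENCY_RE = re.compile(r"Latency:.*?mean = ([\d.]+) ms, median = ([\d.]+) ms")


def parse_latency(line: str) -> dict:
    """Extract mean and median latency (ms) from a trtexec summary line."""
    match = LATENCY_RE.search(line)
    if match is None:
        raise ValueError("no latency summary found in line")
    return {"mean": float(match.group(1)), "median": float(match.group(2))}


FP16_LINE = ("[05/07/2024-18:08:35] [I] Latency: min = 91.6599 ms, max = 96.4124 ms, "
             "mean = 93.0172 ms, median = 92.9111 ms, percentile(90%) = 93.825 ms, "
             "percentile(95%) = 94.1382 ms, percentile(99%) = 96.4124 ms")
INT8_LINE = ("[05/07/2024-17:50:55] [I] Latency: min = 75.9916 ms, max = 77.4071 ms, "
             "mean = 76.4652 ms, median = 76.4912 ms, percentile(90%) = 76.7236 ms, "
             "percentile(95%) = 76.9514 ms, percentile(99%) = 77.4071 ms")

fp16 = parse_latency(FP16_LINE)
int8 = parse_latency(INT8_LINE)

print(f"mean speedup:   {fp16['mean'] / int8['mean']:.2f}x")      # 1.22x
print(f"median speedup: {fp16['median'] / int8['median']:.2f}x")  # 1.21x
```

These figures cover the UNet engine alone, so they are not directly comparable to the end-to-end pipeline timings in the original report.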

Additionally, could you execute this command line on your server, ensuring that you have updated trtexec to version 9.3 or 10.0? 10.0 would be faster than 9.3.

# Download the TensorRT tar file and unpack it
# cd into the folder
export LD_LIBRARY_PATH=$(pwd)/lib:$LD_LIBRARY_PATH
export PATH=$(pwd)/bin:$PATH

cd python
pip install tensorrt-<version>-cp<version>-cp<version>m-linux_x86_64.whl

cd ../onnx_graphsurgeon
pip install onnx_graphsurgeon-<version>-py2.py3-none-any.whl

# check the trtexec version

This gives you the newest trtexec and the matching TensorRT Python package.

Then try this on your int8 onnx model:

trtexec --onnx=./unet.onnx --shapes=sample:2x4x128x128,timestep:1,encoder_hidden_states:2x77x2048,text_embeds:2x1280,time_ids:2x6 --fp16 --int8 --builderOptimizationLevel=4 --saveEngine=unetxl.int8.plan

Then remove the --int8 flag and run the same command on your FP16 ONNX model. If you are able to run these commands, please also share the full logs with me, thanks. trtexec prints the inference latency in the log. Then we can discuss the next step.
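To keep the two runs identical apart from the precision flag, both command lines can be generated from one helper. This is only a sketch: the flags and shapes are copied from the command above, while the FP16 file names (`./unet_fp16.onnx`, `unetxl.fp16.plan`) are hypothetical placeholders.

```python
import shlex

# Input shapes copied verbatim from the trtexec command above.
SHAPES = {
    "sample": "2x4x128x128",
    "timestep": "1",
    "encoder_hidden_states": "2x77x2048",
    "text_embeds": "2x1280",
    "time_ids": "2x6",
}


def trtexec_cmd(onnx_path: str, engine_path: str, int8: bool) -> list:
    """Build the trtexec argument list; only the --int8 flag differs between runs."""
    shapes = ",".join(f"{name}:{dims}" for name, dims in SHAPES.items())
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--shapes={shapes}", "--fp16"]
    if int8:
        cmd.append("--int8")
    cmd += ["--builderOptimizationLevel=4", f"--saveEngine={engine_path}"]
    return cmd


# INT8 run: reproduces the command quoted above.
print(shlex.join(trtexec_cmd("./unet.onnx", "unetxl.int8.plan", int8=True)))
# FP16 run: same settings without --int8 (file names are hypothetical).
print(shlex.join(trtexec_cmd("./unet_fp16.onnx", "unetxl.fp16.plan", int8=False)))
```

The printed lines can be pasted into a shell, or the lists passed to `subprocess.run` with the output redirected to a log file for sharing.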

@hchings

hchings commented May 9, 2024

Hi @teith, @D1-3105,

Adding to @jingyu-ml's response above, you can refer to this latest benchmark to see expected speedup on other NVIDIA hardware. In general, we do observe a higher speedup on RTX 6000 Ada.

Note that the quantization techniques used in TensorRT have now been moved into a new NVIDIA product called TensorRT Model Optimizer. This does not change your workflows. We encourage you to check out the related resources, and we look forward to your feedback:

@TheBge12138
Re the image quality issue: the team has pointed out that there have been fixes in a recent release. Could you try the latest TensorRT demoDiffusion example or the Model Optimizer example, and let us know if it's still an issue? Note that these two examples have the same workflow, but Model Optimizer's repo has the FP8 plugin and the latest INT8 updates, which are not in the TensorRT repo yet.

9 participants