How to change the SDXL pipeline precision to bfloat16 by assigning a new value to the torch_dtype variable? #3691
Comments
There is an update: I printed the noise_pred variable in stable_diffusion_pipeline.py and I still see a different dtype. I did some deep diving and ended up in the infer() method in utilities.py, where I tried to change the dtypes of the tensors in self.tensors, along the lines of the sketch below.
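A minimal sketch of such a cast, assuming self.tensors maps binding names to pre-allocated torch tensors inside the demo's Engine wrapper (names and placement are assumptions, not the repo's exact code):

```python
import torch

# Inside Engine.infer() in utilities.py (hypothetical placement):
# cast every allocated binding tensor to bfloat16 before execution.
for name in self.tensors.keys():
    self.tensors[name] = self.tensors[name].to(torch.bfloat16)
```

Note that .to() allocates new tensors with new device addresses, so any bindings registered with the TensorRT execution context beforehand would go stale.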
But then I get a new error:
@rajeevsrao @ttyio ^ ^
It seems the latest version (9.2) added bfloat16 support, but I am still confused about how to implement it correctly.
The SD demo uses FP16; the code that enables the FP16 build is in TensorRT/demo/Diffusion/utilities.py, line 208 at commit c0c633c. What's the motivation to move to BF16?
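For reference, precision flags for a TensorRT build are set on the builder config; a minimal sketch of enabling FP16 and BF16 (the demo routes this through Polygraphy, so this is not its literal code, and the BF16 flag assumes TensorRT >= 9.0):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
config = builder.create_builder_config()

config.set_flag(trt.BuilderFlag.FP16)  # what the demo enables today
config.set_flag(trt.BuilderFlag.BF16)  # available starting with TensorRT 9.0
```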
We referenced the blog post below. We want to speed up our SDXL processes, which is the main reason for our interest in TensorRT. Apparently, the bfloat16 conversion will also give us the necessary acceleration: when we apply it to the VAE layer, we see a speedup of about 15%. Now we are trying to do the same for the UNet. We were using version 8.6, but bfloat16 support seems to have arrived with 9.2. Would this be useful to us?
Yes, bfloat16 helps in some of the kernels, but for MHA we can get more perf gain using FP16/INT8. FYI, we also have an INT8 SDXL: https://github.com/NVIDIA/TensorRT/tree/release/10.0/demo/Diffusion
Closing since there has been no activity for more than 3 weeks; please reopen if you still have questions. Thanks all!
Description
I am trying to develop a txt2img SDXL application based on the repo (forked from this repo) listed under Relevant Files.
If I run SDXL in a normal pipeline, it is enough to change the torch_dtype precision to bfloat16:
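A minimal sketch of that baseline in plain diffusers (the model ID is an assumption):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load SDXL with all submodules (UNet, VAE, text encoders) in bfloat16.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # assumed model ID
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("Astronaut in a jungle, cold color palette").images[0]
```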
My goal is to implement this precision change in the repo above. In stable_diffusion_pipeline.py there are two methods, initialize_latents and decode_latent; when I change the dtypes in these methods (sketched below), the VAE section speeds up by about 0.1 seconds.
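The casts are roughly as follows (a sketch with placeholder shapes and names, not the repo's exact code):

```python
import torch

# Placeholder SDXL latent shape at 1024x1024: (batch, 4, 128, 128).
latents_shape = (1, 4, 128, 128)
generator = torch.Generator(device="cuda").manual_seed(0)

# initialize_latents: allocate the initial noise directly in bfloat16.
latents = torch.randn(latents_shape, generator=generator,
                      device="cuda", dtype=torch.bfloat16)

# decode_latent: ensure the tensor entering the VAE stage is bfloat16.
latents = latents.to(torch.bfloat16)
```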
Current output looks like:
I've tried the precision change on the UNet. In the denoise_latent method I implemented a new timestep definition and checked it with a print, at line 539 (sketched below):
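Something along these lines (a hypothetical reconstruction; the variable names are assumptions):

```python
import torch

timestep = torch.tensor(999, device="cuda")  # placeholder scheduler timestep

# Hypothetical replacement for the float32 cast around line 539:
timestep_bf16 = (timestep.to(torch.bfloat16)
                 if timestep.dtype != torch.bfloat16 else timestep)
print(timestep_bf16.dtype)  # torch.bfloat16
```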
Even though the timestep dtype changes successfully (at least as far as I can see in the print output), my UNet section still runs for nearly 5 seconds. How can I implement the precision change properly?
The print outputs look like:
Environment
TensorRT Version: torch-tensorrt 1.5.0.dev0
NVIDIA GPU: A100 80GB
NVIDIA Driver Version: 535.86.10
CUDA Version: 12.2
CUDNN Version: nvidia-cudnn-cu12 8.9.7.29
Operating System: Google VM a2-ultragpu-1g type, Debian GNU/Linux 11
Python Version: 3.10.6
PyTorch Version: torch 2.1.0a0+4136153
Docker --version: Docker version 20.10.17, build 100c701
Relevant Files
Repo link: https://github.com/rajeevsrao/TensorRT/tree/release/8.6/demo/Diffusion
Steps To Reproduce
Follow the setup instructions in the repo's README, then run the command below.
Commands or scripts:
```
python3 demo_txt2img_xl.py "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" --width 1024 --height 1024 --denoising-steps 25 --repeat-prompt 2 --num-warmup-runs 0
```
If you run the code for the first time, compilation might take more than 30 minutes.