Allow bfloat16 computations on compatible CPUs with Intel Extension for PyTorch #3649
Modern CPUs have native AVX512 BF16 instructions, which significantly improve matmul and conv2d performance. With bfloat16 instructions, UNET steps are 40-50% faster on both AMD and Intel CPUs.
There are minor visible changes with bf16, but no avalanche effects, so this feature is enabled by default via the new `--use-cpu-bf16=auto` option. It can be disabled with `--use-cpu-bf16=no`.

With the following command (note: ComfyUI never mentions this, but setting the correct environment variables is highly important; see this page), the KSampler node is almost 2 times faster, and memory usage is proportionally smaller:

- `--use-cpu-bf16=no`: 1.68s/it
- `--use-cpu-bf16=auto`: 1.22it/s
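To illustrate why the visible changes are minor: bfloat16 keeps float32's full exponent range but only 8 bits of mantissa, so each value loses low-order precision without any risk of overflow or underflow surprises. The sketch below (plain Python, no PyTorch or IPEX required; `to_bfloat16` is a helper written for this illustration, not part of either library) emulates the float32-to-bfloat16 rounding that the hardware performs:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Emulate rounding a float32 value to bfloat16.

    bfloat16 is simply the top 16 bits of a float32 (same 8-bit
    exponent, mantissa truncated from 23 to 7 bits), rounded to
    nearest-even. NaN/inf handling is omitted for brevity.
    """
    # Reinterpret the float32 as its raw 32-bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round-to-nearest-even on the 16 dropped mantissa bits.
    rounded = bits + 0x7FFF + ((bits >> 16) & 1)
    # Keep only the upper 16 bits (the bfloat16 payload).
    bf16_bits = rounded & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bf16_bits))[0]

print(to_bfloat16(1.0))      # exactly representable: 1.0
print(to_bfloat16(3.14159))  # low mantissa bits lost: 3.140625
```

The relative error is at most about 0.4% per value, which matches the observation above: individual pixels shift slightly, but there is no avalanche effect through the sampler.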