[Benchmarks][Upstream PyTorch 2.5] Triton and XeTLA softmax performance degrades in comparison with torch 2.1 / ipex 2.1 test proxies #2106
Comments
@ESI-SYD what is the root cause of this issue? Can you pinpoint it to a particular

@anmyachev, to proceed further with the analysis / triaging, please create a minimal reproducer for the Triton kernel path.
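A minimal reproducer for the Triton kernel path would run the softmax kernel on an Intel GPU; since that environment is not available here, the sketch below uses NumPy on CPU purely to illustrate the harness structure a reproducer might follow (shape sweep, warmup, best-of-N timing, GB/s reporting). All shapes and repetition counts are illustrative assumptions, not taken from the benchmark suite.

```python
# Hypothetical skeleton for a minimal softmax benchmark reproducer.
# The real issue concerns Triton/XeTLA kernels on Intel GPU; NumPy on
# CPU is used here only to show the structure of the harness.
import time
import numpy as np

def softmax(x):
    # Numerically stable row-wise softmax (the op under test).
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bench(fn, x, warmup=3, rep=10):
    for _ in range(warmup):
        fn(x)
    times = []
    for _ in range(rep):
        t0 = time.perf_counter()
        fn(x)
        times.append(time.perf_counter() - t0)
    return min(times)  # best-of-N reduces scheduling noise

if __name__ == "__main__":
    # Illustrative shape sweep: small to large rows/columns.
    for n_rows, n_cols in [(1024, 256), (1024, 4096), (4096, 16384)]:
        x = np.random.rand(n_rows, n_cols).astype(np.float32)
        t = bench(softmax, x)
        gbps = 2 * x.nbytes * 1e-9 / t  # read + write traffic
        print(f"softmax {n_rows}x{n_cols}: {t*1e3:.3f} ms, {gbps:.1f} GB/s")
```

An actual reproducer would replace `softmax` with the Triton kernel launch and use device-side timing rather than host wall-clock.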
There are two main differences in the benchmark timing method after applying the draft
At the moment, the degradation of the absolute numbers has been fixed. The geometric mean difference is ~2% (between #1 and #2), which I believe can be considered within the margin of error.
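The ~2% figure above is a geometric mean over per-case ratios between the two runs. As a small sketch (with made-up timings, since the actual benchmark numbers are not reproduced here), the comparison can be computed like this:

```python
# Sketch: comparing two benchmark runs via the geometric mean of
# per-case time ratios. The timings below are hypothetical.
import math

def geomean(values):
    # Geometric mean via log-space averaging (robust for ratios).
    return math.exp(sum(math.log(v) for v in values) / len(values))

run_1 = [0.12, 0.50, 2.10, 8.40]   # hypothetical times, run #1 (ms)
run_2 = [0.13, 0.49, 2.05, 8.60]   # hypothetical times, run #2 (ms)

ratios = [b / a for a, b in zip(run_1, run_2)]
diff_pct = (geomean(ratios) - 1.0) * 100
print(f"geomean ratio difference: {diff_pct:+.2f}%")
```

The geometric mean is the conventional aggregate for benchmark ratios because it treats a 2x speedup and a 2x slowdown symmetrically.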
The new approach to measuring performance is less precise and is more influenced by the operations performed in the benchmarked functions before and after the kernel is launched. This influence is stronger where the kernel execution time is very small. For example, for the first combinations of

To sum up: for large dimensions the new benchmarking method is suitable and tells us that there is no degradation with upstream PyTorch; however, for small dimensions it cannot be used reliably, and we have to wait for a working Kineto + Intel GPU PTI solution.
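The effect described above can be illustrated with a toy calculation (all numbers are made up): a fixed host-side cost per measured call inflates the wall-clock result by a much larger relative amount when the kernel itself is short, which is why the small-dimension cases are the unreliable ones.

```python
# Illustration (hypothetical numbers) of why wall-clock timing is less
# reliable for short kernels: a fixed host-side overhead per call
# dominates the measurement when the kernel time is small.
overhead_ms = 0.05  # assumed launch/sync overhead per measured call

for kernel_ms in (0.01, 0.1, 1.0, 10.0):
    measured = kernel_ms + overhead_ms
    error_pct = overhead_ms / kernel_ms * 100
    print(f"kernel {kernel_ms:>5.2f} ms -> measured {measured:>5.2f} ms "
          f"(~{error_pct:.0f}% overestimate)")
```

Device-side timing (e.g. via Kineto + Intel GPU PTI, once available) sidesteps this by excluding the host-side portion entirely.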
The Triton/XeTLA ratios stay the same except for attention, where the XeTLA attention absolute numbers degraded. Both the Triton and XeTLA softmax cases degraded, so the Triton/XeTLA ratio did not change. Details: #1905 (comment)