As per this blog, ACL inference should be faster than inference on Intel systems for transformer models.
We ran inference with the Hugging Face BERT model in TensorFlow (Python code attached as a text file: TF_bert_inf - Copy.txt).
Below are the inference times in seconds:
| Env Variables | Graviton | Icelake |
|---|---|---|
| No Opts | 0.2294 | 0.145099 |
| TF_ENABLE_ONEDNN_OPTS=1 | 0.2191 | 0.144636 |
| ONEDNN_DEFAULT_FPMATH_MODE=BF16 | 1.49034 | 0.145511 |
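From the results above, we see that the ARM cores are roughly 1.5x to 1.6x slower than the Intel ones (0.2294 s vs 0.145099 s with no opts, 0.2191 s vs 0.144636 s with TF_ENABLE_ONEDNN_OPTS=1), and about 10x slower with BF16 enabled (1.49034 s vs 0.145511 s).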
The code was run on 2 cores on both the Intel and ARM systems.
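For anyone trying to reproduce these numbers, here is a minimal timing sketch. It is not the attached script: `time_inference` and its `warmup`/`iters` parameters are hypothetical names, and `run_once` stands for whatever callable performs a single BERT forward pass. The environment variables are the ones listed in the table; setting them before TensorFlow is imported is an assumption made to be safe.

```python
import os
import statistics
import time

# Assumption: set these before importing TensorFlow so they are seen at start-up.
os.environ.setdefault("TF_ENABLE_ONEDNN_OPTS", "1")
# os.environ["ONEDNN_DEFAULT_FPMATH_MODE"] = "BF16"  # uncomment for the BF16 run
# os.environ["ONEDNN_VERBOSE"] = "1"                 # uncomment to log oneDNN primitives


def time_inference(run_once, warmup=3, iters=10):
    """Return the median wall-clock time in seconds of run_once()."""
    for _ in range(warmup):
        run_once()  # discard warm-up runs (tracing, primitive creation, caches)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

How the run was limited to 2 cores is not stated here; on Linux, something like `taskset -c 0,1 python script.py` is one common way to do it.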
Another issue is that setting the FPMATH mode to BF16 degrades the performance.
From the oneDNN logs, we see that when BF16 is enabled, there is overhead when executing reorders on the ARM cores:
| Env Variables | Reorder Time (msecs) |
|---|---|
| TF_ENABLE_ONEDNN_OPTS=1 | 0.582031 |
| ONEDNN_DEFAULT_FPMATH_MODE=BF16 | 11.1628 |
This is observed only for larger MatMul operations. Here the size was 768x768, and the reorder uses the "simple:any" implementation instead of "jit:uni" in oneDNN.
Attaching the oneDNN verbose logs for both scenarios: Bert_TF12_issue_verbose_BF16.txt and Bert_TF12_issue_verbose_OPTS.txt
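A rough sketch of how the reorder time per implementation could be pulled out of such verbose logs. It assumes the usual comma-separated layout in which `exec` is followed by the engine, the primitive kind and the implementation name, and the last field is the execution time in milliseconds. `reorder_time_by_impl` is a hypothetical helper, and the sample lines use made-up timings; they are not taken from the attached logs.

```python
from collections import defaultdict


def reorder_time_by_impl(lines):
    """Sum execution time (ms) of reorder primitives, keyed by implementation."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.strip().split(",")
        # Accept both verbose markers; skip anything that is not an exec line.
        if fields[0] not in ("dnnl_verbose", "onednn_verbose") or "exec" not in fields:
            continue
        i = fields.index("exec")
        # Assumed layout after "exec": engine, primitive kind, implementation, ...
        if len(fields) < i + 5 or fields[i + 2] != "reorder":
            continue
        try:
            totals[fields[i + 3]] += float(fields[-1])  # last field: time in ms
        except ValueError:
            continue  # skip lines that do not end in a number
    return dict(totals)


# Illustrative lines only (made-up timings, not from the attached logs).
sample = [
    "dnnl_verbose,info,cpu,runtime:OpenMP",
    "dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:ab:f0 dst_f32::blocked:ba:f0,,,768x768,0.5",
    "dnnl_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:ab:f0 dst_bf16::blocked:ab:f0,,,768x768,1.5",
    "dnnl_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:ab:f0 dst_bf16::blocked:ab:f0,,,768x768,2.5",
    "dnnl_verbose,exec,cpu,matmul,gemm:acl,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 dst_f32::blocked:ab:f0,,,768x768:768x768:768x768,3.0",
]
print(reorder_time_by_impl(sample))
```

Run against a real log with something like `reorder_time_by_impl(open(path))`; a large total under "simple:any" would point at the reference reorder path rather than the JIT one.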
We would appreciate your views and comments on whether any other settings are needed to improve the performance.
Thanks for your report on these performance issues, and your initial analysis. The benchmarking covered in the blog article cited was applicable at the time of publishing, with the builds available at that time. As I'm sure you understand, things change over time. Since that post was made a year ago, some significant, and essential, re-engineering has gone into Compute Library, oneDNN, and TensorFlow. This included support for new, optimised memory layouts that allow the oneDNN+ACL backend in TensorFlow to support primitive caching (see tensorflow/tensorflow#57987).
One impact of some of these changes has been that we have an increased reliance on oneDNN-based re-orders in the execution path. Work to improve support on AArch64 for JITed re-orders (tensorflow/tensorflow#61296, tensorflow/tensorflow#61093) went in alongside the addition of re-order functions to Compute Library. These changes improve re-order performance in many fp32 cases, but did not include the fp32=>bf16 re-order. A follow-up PR (oneapi-src/oneDNN#1594), recently merged into oneDNN, should help here, but we're also looking at the potential for expanding the support for Compute Library based re-orders.
The net effect of many of the improvements over the last 12 months has been a gain in performance relative to the 'default' backend in TensorFlow.
But we're aware that there are regressions, and that there is still work to do to bring the bf16 performance back in line with expectations, on NLP workloads in particular. At the same time, we've also seen performance improvements on other platforms. Some of this is likely due to graph optimisations/fusions not supported with the AArch64 backend, which, in some cases, we've been able to address. However, more work is needed to identify the missed performance gains and the performance regressions; this work is underway.
Docker Container Version/Tag : r23.07-tf-2.12.0-onednn-acl
ARM System : Graviton 3 (c7g.8xlarge)
Architecture: aarch64
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: ARM
Model: 1
Thread(s) per core: 1
Core(s) per socket: 32
Caches (sum of all):
L1d: 2 MiB (32 instances)
L1i: 2 MiB (32 instances)
L2: 32 MiB (32 instances)
L3: 32 MiB (1 instance)
Intel System: Icelake (c6i.8xlarge)
Architecture: x86_64
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 6
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 768 KiB (16 instances)
L1i: 512 KiB (16 instances)
L2: 20 MiB (16 instances)
L3: 54 MiB (1 instance)