Tensorflow huggingface BERT Model is slower in ARM compared to Intel #194

Open
abhishek-rn opened this issue Aug 29, 2023 · 1 comment

@abhishek-rn

Docker Container Version/Tag: r23.07-tf-2.12.0-onednn-acl
ARM System: Graviton 3 (c7g.8xlarge)
Architecture: aarch64
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: ARM
Model: 1
Thread(s) per core: 1
Core(s) per socket: 32
Caches (sum of all):
L1d: 2 MiB (32 instances)
L1i: 2 MiB (32 instances)
L2: 32 MiB (32 instances)
L3: 32 MiB (1 instance)

Intel System: Icelake (c6i.8xlarge)
Architecture: x86_64
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 6
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 768 KiB (16 instances)
L1i: 512 KiB (16 instances)
L2: 20 MiB (16 instances)
L3: 54 MiB (1 instance)

As per this blog, ACL-backed inference on Graviton should be faster than on Intel systems for transformer models.

We ran the TensorFlow Hugging Face BERT model for inference (Python code attached as a txt file here):
TF_bert_inf - Copy.txt
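For context, below is a minimal sketch of the kind of benchmark we ran. The attached script is the authoritative version; the model name, sequence length, warm-up and iteration counts in this sketch are assumptions, not taken from the attachment.

```python
# Minimal sketch of a TF Hugging Face BERT inference benchmark.
# Assumptions (not from the attached script): bert-base-uncased,
# 128-token input, 2 intra-op threads, 20 timed iterations.
import os
import time

# oneDNN settings must be in the environment before TensorFlow is imported.
os.environ.setdefault("TF_ENABLE_ONEDNN_OPTS", "1")
# os.environ.setdefault("ONEDNN_DEFAULT_FPMATH_MODE", "BF16")

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

# Restrict TensorFlow to 2 cores, matching the 2-core comparison in this issue.
tf.config.threading.set_intra_op_parallelism_threads(2)
tf.config.threading.set_inter_op_parallelism_threads(1)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer(
    "This is a sample sentence for BERT inference benchmarking.",
    return_tensors="tf", padding="max_length", max_length=128,
)

# Warm-up runs so graph tracing and oneDNN primitive creation are excluded.
for _ in range(5):
    model(inputs)

runs = 20
start = time.perf_counter()
for _ in range(runs):
    model(inputs)
print(f"Average inference time: {(time.perf_counter() - start) / runs:.6f} s")
```

Note that the oneDNN environment variables are only picked up if they are set before TensorFlow initialises oneDNN, hence the os.environ calls before the import (setting them in the shell before launching the script has the same effect).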
Below are the inference times in seconds:

Env Variables                    | Graviton (s) | Icelake (s)
No Opts                          | 0.2294       | 0.145099
TF_ENABLE_ONEDNN_OPTS=1          | 0.2191       | 0.144636
ONEDNN_DEFAULT_FPMATH_MODE=BF16  | 1.49034      | 0.145511

From the results above, we see that the performance is almost 1.8x worse for ARM cores compared to Intel ones.
The code is run on 2 cores for both the Intel and ARM systems.
Another issue is that setting the FPMATH mode to BF16 degrades performance.
From the oneDNN logs, we see that when BF16 is enabled, there is significant overhead in the reorder executions on ARM cores:

Env Variables                    | Reorder Time (ms)
TF_ENABLE_ONEDNN_OPTS=1          | 0.582031
ONEDNN_DEFAULT_FPMATH_MODE=BF16  | 11.1628

This is observed only for larger MatMul operations; here the size was 768x768, and the reorder uses the "simple:any" implementation instead of "jit:uni" in oneDNN.
Attaching the oneDNN verbose logs for both scenarios:
Bert_TF12_issue_verbose_BF16.txt
Bert_TF12_issue_verbose_OPTS.txt
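For reference, here is a hedged sketch of how per-reorder times like those in the table above can be pulled out of a oneDNN verbose log (produced with ONEDNN_VERBOSE=1). The log filename is an assumption, and the field positions follow the common comma-separated oneDNN verbose exec format (primitive kind in the fourth field, implementation in the fifth, execution time in milliseconds in the last field); they may need adjusting for a specific oneDNN version.

```python
# Sketch: summarise oneDNN reorder times per implementation from a verbose log.
# Assumptions: default ONEDNN_VERBOSE=1 line format, log file path given below.
import sys
from collections import defaultdict

def summarise_reorders(path):
    totals = defaultdict(float)   # implementation name -> accumulated ms
    counts = defaultdict(int)
    with open(path) as log:
        for line in log:
            if not line.startswith("onednn_verbose,exec,"):
                continue
            fields = line.strip().split(",")
            # fields[3] is the primitive kind, fields[4] the implementation,
            # fields[-1] the execution time in milliseconds.
            if len(fields) < 5 or fields[3] != "reorder":
                continue
            impl = fields[4]      # e.g. "simple:any" or "jit:uni"
            try:
                totals[impl] += float(fields[-1])
                counts[impl] += 1
            except ValueError:
                continue
    for impl in sorted(totals):
        print(f"{impl}: {counts[impl]} calls, {totals[impl]:.4f} ms total")

if __name__ == "__main__":
    summarise_reorders(sys.argv[1] if len(sys.argv) > 1 else "onednn_verbose.log")
```

Aggregating by implementation makes it easy to see how much of the reorder time goes to "simple:any" versus "jit:uni".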

We would welcome your views and comments on whether any other settings are needed to improve the performance.

@nSircombe
Contributor

Hi @abhishek-rn,

Thanks for your report on these performance issues, and for your initial analysis. The benchmarking covered in the blog article cited was applicable at the time of publishing, with the builds available at that time. As I'm sure you understand, things change over time. Since that post was made a year ago, some significant, and essential, re-engineering has gone into Compute Library, oneDNN, and TensorFlow. This included support for new, optimised memory layouts that allow the oneDNN+ACL backend in TensorFlow to cache primitives (see tensorflow/tensorflow#57987).

There have also been changes to the threading model: oneapi-src/oneDNN#1328, oneapi-src/oneDNN#1293. This was a necessity for 'out of the box' support of the Compute Library backend. Threadpool was turned on after the TF 2.9 release (tensorflow/tensorflow#56924), but it did bring additional problems for the AArch64 build, which subsequent PRs have sought to address (tensorflow/tensorflow#58071, tensorflow/tensorflow#59253, tensorflow/tensorflow#60346, tensorflow/tensorflow#61235).

One impact of some of these changes has been that we have an increased reliance on oneDNN-based re-orders in the execution path. Work to improve support on AArch64 for JITed re-orders (tensorflow/tensorflow#61296, tensorflow/tensorflow#61093) went in alongside the addition of re-order functions to Compute Library. These changes improved re-order performance in many fp32 cases, but did not cover the fp32=>bf16 reorder. A follow-up PR (oneapi-src/oneDNN#1594), recently merged into oneDNN, should help here, but we're also looking at the potential for expanding the support for Compute Library based re-orders.

The net effect of many of the improvements over the last 12 months has been a gain in performance relative to the 'default' backend in TensorFlow. But we're aware that there are regressions, and that there is still work to do to bring bf16 performance back in line with expectations, on NLP workloads in particular. At the same time, we've also seen performance improvements on other platforms. Some of this is likely due to graph optimisations/fusions not supported with the AArch64 backend, which, in some cases, we've been able to address. However, more work is needed to identify the missed performance gains and the performance regressions; this work is underway.
