
When oneDNN is enabled, an unoptimized matmul is called by Tensorflow on aarch64 #168

Open
lilh9598 opened this issue Feb 27, 2023 · 9 comments
Assignees
Labels
question Further information is requested

Comments

@lilh9598

lilh9598 commented Feb 27, 2023

From the above code, we can see that if CBLAS is not enabled, an unoptimized matmul will be called by TensorFlow on aarch64, which causes performance degradation. So, I think a fully optimized ACL matmul should be added to dnnl_sgemm to make full use of aarch64's ISA and improve the performance of mkl_matmul (a TF op) on aarch64.

@nSircombe nSircombe self-assigned this Mar 7, 2023
@nSircombe nSircombe added the question Further information is requested label Mar 7, 2023
@nSircombe
Contributor

Hi @lilh9598,

Yes, you're entirely correct: there is currently no ACL acceleration for dnnl_sgemm, and it could be added.
At present it should be possible to use the Arm Performance Libraries CBLAS interface, via the oneDNN build option ONEDNN_BLAS_VENDOR, to provide optimised kernels for this route (see https://github.com/oneapi-src/oneDNN/blob/3db871474283976ef5e80f6fecbb10e8856100c7/doc/build/build_options.md).
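As a sketch of that route: the ONEDNN_BLAS_VENDOR option is real (see the build-options doc linked above), but the ARMPL value and the exact invocation below are assumptions based on oneDNN's build documentation, not something taken from this thread.

```shell
# Hypothetical oneDNN build using Arm Performance Libraries as the
# CBLAS backend for the gemm API (paths/values are illustrative).
cd oneDNN && mkdir -p build && cd build
cmake .. \
  -DONEDNN_BLAS_VENDOR=ARMPL \
  -DCMAKE_BUILD_TYPE=Release
make -j"$(nproc)"
```

This is a configuration fragment only; it assumes Arm Performance Libraries are installed and discoverable by CMake on the build machine.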

The integration of matmul support via ACL was motivated by the ops used in a small number of use cases we had been exposed to, the BERT-Large network from MLCommons for example. This led to the integration of ACL matmul support via mkl_batch_matmul_op.cc, for example.

Integration into dnnl_sgemm is on our 'to do' list, but I don't have a timescale at present. It will most likely appear as an experimental feature here in Tool Solutions ahead of being available upstream, though.

One issue with using ACL for sgemm support, though, is that ACL is not a "BLAS" library and does not provide all the functionality expected of one. In particular, there's no support for transpose at present. So supporting dnnl_sgemm via ACL would require a workaround: possibly rejecting the 'T' case, or adding a reorder, via oneDNN's reorder primitive, before passing to ACL. In the longer term there's a case for adding 'T' support to ACL, but there are no firm plans at present to do this.
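The reorder-based workaround described above can be sketched in plain Python (this is an illustrative toy, not oneDNN code): when the backend kernel only handles the non-transposed 'N' layout, a front-end can materialise a transposed copy of any 'T' operand before dispatching, which is what a oneDNN reorder would do for real.

```python
# Toy model of an sgemm front-end whose backend kernel only supports
# the 'N' (non-transposed) case; 'T' operands are re-ordered first.

def transpose(mat):
    """Return a row-major transposed copy of a 2-D list-of-lists."""
    return [list(row) for row in zip(*mat)]

def gemm_nn(a, b):
    """Naive C = A @ B for non-transposed row-major inputs."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    assert len(b) == k, "inner dimensions must agree"
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def sgemm(transa, transb, a, b):
    """Front-end: re-order any 'T' operand, then call the N/N kernel."""
    if transa == "T":
        a = transpose(a)   # stand-in for a oneDNN reorder primitive
    if transb == "T":
        b = transpose(b)
    return gemm_nn(a, b)
```

The other option mentioned above, rejecting the 'T' case, would simply raise an error instead of calling `transpose`; the copy-based route trades extra memory traffic for full coverage.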

Could I ask what models you're interested in, where you think the dnnl_sgemm support would bring tangible benefits? This would help us to understand the potential requirement for dnnl_sgemm support, and provide valuable test/benchmark cases.

@lilh9598
Author

lilh9598 commented Mar 20, 2023

Hi @nSircombe ,

Thanks for your detailed reply!

> Could I ask what models you're interested in, where you think the dnnl_sgemm support would bring tangible benefits? This would help us to understand the potential requirement for dnnl_sgemm support, and provide valuable test/benchmark cases.

I think the same problem exists in the transformer block. In its structure, the node following a matmul (with constant weight) is not always a BiasAdd, so it cannot be converted into _FusedMatMul (the ACL-optimised path). Maybe the unoptimized matmul could be handled through graph optimization in TensorFlow?
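The fusion constraint being described can be sketched as a toy rewrite pass (this is not TensorFlow's actual grappler code; the tuple-based graph representation is invented for illustration): a MatMul is collapsed into _FusedMatMul only when its sole consumer is a BiasAdd, so any other successor leaves it on the unfused path.

```python
# Toy graph-rewrite pass modelling the MatMul+BiasAdd fusion constraint.
# A graph is a topologically ordered list of (name, op_type, inputs).

def fuse_matmul_biasadd(graph):
    ops = {name: (op_type, inputs) for name, op_type, inputs in graph}
    consumers = {}
    for name, _op_type, inputs in graph:
        for src in inputs:
            consumers.setdefault(src, []).append(name)
    fused, absorbed = [], set()
    for name, op_type, inputs in graph:
        if name in absorbed:
            continue  # this BiasAdd was folded into a _FusedMatMul
        users = consumers.get(name, [])
        if (op_type == "MatMul" and len(users) == 1
                and ops[users[0]][0] == "BiasAdd"):
            bias_name = users[0]
            # carry the bias tensor input over to the fused op
            bias_extra = [i for i in ops[bias_name][1] if i != name]
            fused.append((bias_name, "_FusedMatMul", inputs + bias_extra))
            absorbed.add(bias_name)
        else:
            fused.append((name, op_type, inputs))
    return fused
```

Running this on a MatMul whose consumer is, say, an activation rather than a BiasAdd leaves the MatMul untouched, which mirrors the situation described above for transformer blocks.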

@nSircombe
Contributor

Hi @lilh9598,

Actually, one of the team has been looking at transformer models, and managed to spend a little time looking at the 'missing' gemm API support. We have an approach not entirely dissimilar to your suggestion: diverting the call to batched matmul, where there's existing support via ACL. We're hoping to get a proof-of-concept patch into the build here soon, and will look to raise an upstream PR with TensorFlow if all looks good.
I'll let you know once we've got something up you can experiment with, any feedback would be very welcome.

@lilh9598
Author

Hi @nSircombe ,

Looks really good! I am looking forward to applying your new experiments to my cases.

@nSircombe
Contributor

Hi @lilh9598,
@fadara01 has just upstreamed a patch here - #174
This will pass calls heading into oneDNN's gemm_api into BatchMatMulMkl, which has ACL support. We've seen some significant performance improvements from this approach, and would be very interested to know if you get any gains (or problems) if you try and use it for your workflows.
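The routing idea (forwarding a plain gemm into a batched-matmul implementation) can be illustrated numerically: a 2-D gemm is just a batched matmul with batch size one. This is a toy model of the dispatch, not the actual patch in #174.

```python
# Illustrative sketch: route a single 2-D gemm through a batched-matmul
# kernel by wrapping the operands as a batch of one.

def batched_matmul(a_batch, b_batch):
    """Batched C[i] = A[i] @ B[i] over lists of 2-D row-major matrices."""
    out = []
    for a, b in zip(a_batch, b_batch):
        k, n = len(b), len(b[0])
        out.append([[sum(row[p] * b[p][j] for p in range(k))
                     for j in range(n)] for row in a])
    return out

def gemm_via_batched(a, b):
    """Dispatch a plain gemm to the batched kernel (batch size 1)."""
    return batched_matmul([a], [b])[0]
```

In the real code path the batched kernel would be the ACL-backed BatchMatMulMkl implementation rather than this naive loop, but the wrapping/unwrapping shape logic is the same.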

@lilh9598
Author

lilh9598 commented Apr 7, 2023

Hi @nSircombe ,
I'm sorry to report that there is no performance improvement with patch #174 compared to Eigen in my C++ application.

Also, is this patch enabled in the image "armswdev/tensorflow-arm-neoverse:latest"? I would like to write a Python program for further performance profiling.

@nSircombe
Contributor

Could you give more details about what you're running? oneDNN verbose logs would be handy; they might help us spot missed opportunities to call ACL, or highlight shapes where ACL performance is worse than expected.

This patch is not yet in the dockerhub images - they're updated from this repo once a month. The next update is due next week.

@lilh9598
Author

lilh9598 commented Apr 10, 2023

Our application is very complex, so I need to extract the ML part to analyze first. The bottleneck there may be bandwidth.
If any performance issues are found, we would be glad to sync with Arm.

@nSircombe
Contributor

Ok @lilh9598 - would be interested to hear what you find.

On the matmul front, though, just a redacted version of the stdout log generated with the environment variable ONEDNN_VERBOSE=1 set, with and without the patch from #174 included in the image build, would give us valuable insight into whether the matmul changes in #174 are being exercised by your workflow. If they're not, we may be able to identify more cases we could support; if they are, then we can look at the bare performance for the shapes in question and see if we're leaving anything on the table.
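A minimal sketch of capturing such a log follows. ONEDNN_VERBOSE is a real oneDNN environment variable; the script name is a placeholder for the actual workload, and the grep patterns are assumptions about the verbose line format (lines begin with `onednn_verbose`), so adjust them against your own output.

```shell
# Capture oneDNN's verbose trace for one run of the workload
# ("your_script.py" is a placeholder for the real entry point).
ONEDNN_VERBOSE=1 python your_script.py > onednn_verbose.log 2>&1

# Keep only primitive-execution lines mentioning matmul, to see which
# shapes are hit and which implementation (e.g. ACL) was selected.
grep '^onednn_verbose' onednn_verbose.log | grep matmul
```

Comparing the filtered output from a build with and without #174 should show whether calls are reaching the batched-matmul path.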

Look forward to hearing from you.

...I'll leave this issue open for now.
