
When oneDNN is enabled, an unoptimized matmul is called by Tensorflow on aarch64 #168

Open
lilh9598 opened this issue Feb 27, 2023 · 9 comments
Assignees
Labels
question Further information is requested

Comments

@lilh9598

lilh9598 commented Feb 27, 2023

From the above code, we can see that if CBLAS is not enabled, an unoptimized matmul will be called by TensorFlow on aarch64, which causes performance degradation. So, I think a fully optimized ACL matmul should be added to dnnl_sgemm to make full use of aarch64's ISA and improve the performance of mkl_matmul (a TF op) on aarch64.

@nSircombe nSircombe self-assigned this Mar 7, 2023
@nSircombe nSircombe added the question Further information is requested label Mar 7, 2023
@nSircombe
Contributor

Hi @lilh9598,

Yes, you're entirely correct: there is currently no ACL acceleration for dnnl_sgemm, and it could be added.
At present it should be possible to use the Arm Performance Libraries CBLAS interface, via the oneDNN build option ONEDNN_BLAS_VENDOR, to provide optimised kernels for this route (see https://github.com/oneapi-src/oneDNN/blob/3db871474283976ef5e80f6fecbb10e8856100c7/doc/build/build_options.md).
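As a sketch of that route: the ONEDNN_BLAS_VENDOR option is real (see the build-options doc linked above), but the ARMPL value and the exact invocation below are assumptions based on oneDNN's build documentation, not something taken from this thread.

```shell
# Hypothetical oneDNN build using Arm Performance Libraries as the
# CBLAS backend for the gemm API (paths/values are illustrative).
cd oneDNN && mkdir -p build && cd build
cmake .. \
  -DONEDNN_BLAS_VENDOR=ARMPL \
  -DCMAKE_BUILD_TYPE=Release
make -j"$(nproc)"
```

This is a configuration fragment only; it assumes Arm Performance Libraries are installed and discoverable by CMake on the build machine.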

The integration of matmul support via ACL was motivated by the ops used in a small number of use cases we had been exposed to, the BERT-Large network from MLCommons for example. This led to the integration of ACL matmul support via mkl_batch_matmul_op.cc, for example.

Integration into dnnl_sgemm is on our 'to do' list, but I don't have a timescale at present. It will most likely appear as an experimental feature here in Tool Solutions ahead of being available upstream, though.

One issue with using ACL for sgemm support, though, is that ACL is not a "BLAS" library and does not provide all the functionality expected of one. In particular, there's no support for transpose at present. So supporting dnnl_sgemm via ACL would require a workaround: possibly rejecting the 'T' case, or adding a reorder, via oneDNN's reorder primitive, before passing to ACL. In the longer term there's a case for adding 'T' support to ACL, but there are no firm plans at present to do this.
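The reorder-based workaround described above can be sketched in plain Python (this is an illustrative toy, not oneDNN code): when the backend kernel only handles the non-transposed 'N' layout, a front-end can materialise a transposed copy of any 'T' operand before dispatching, which is what a oneDNN reorder would do for real.

```python
# Toy model of an sgemm front-end whose backend kernel only supports
# the 'N' (non-transposed) case; 'T' operands are re-ordered first.

def transpose(mat):
    """Return a row-major transposed copy of a 2-D list-of-lists."""
    return [list(row) for row in zip(*mat)]

def gemm_nn(a, b):
    """Naive C = A @ B for non-transposed row-major inputs."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    assert len(b) == k, "inner dimensions must agree"
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def sgemm(transa, transb, a, b):
    """Front-end: re-order any 'T' operand, then call the N/N kernel."""
    if transa == "T":
        a = transpose(a)   # stand-in for a oneDNN reorder primitive
    if transb == "T":
        b = transpose(b)
    return gemm_nn(a, b)
```

The other option mentioned above, rejecting the 'T' case, would simply raise an error instead of calling `transpose`; the copy-based route trades extra memory traffic for full coverage.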

Could I ask what models you're interested in, where you think the dnnl_sgemm support would bring tangible benefits? This would help us to understand the potential requirement for dnnl_sgemm support, and provide valuable test/benchmark cases.

@lilh9598
Author

lilh9598 commented Mar 20, 2023

Hi @nSircombe ,

Thanks for your detailed reply!

> Could I ask what models you're interested in, where you think the dnnl_sgemm support would bring tangible benefits? This would help us to understand the potential requirement for dnnl_sgemm support, and provide valuable test/benchmark cases.

I think the same problem exists in the transformer block. In its structure, the node following a matmul (with constant weight) is not always a BiasAdd, so it cannot be converted into _FusedMatMul (the ACL-optimised path). Maybe the unoptimized matmul could be handled through graph optimization in TensorFlow?
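The fusion constraint being described can be sketched as a toy rewrite pass (this is not TensorFlow's actual grappler code; the tuple-based graph representation is invented for illustration): a MatMul is collapsed into _FusedMatMul only when its sole consumer is a BiasAdd, so any other successor leaves it on the unfused path.

```python
# Toy graph-rewrite pass modelling the MatMul+BiasAdd fusion constraint.
# A graph is a topologically ordered list of (name, op_type, inputs).

def fuse_matmul_biasadd(graph):
    ops = {name: (op_type, inputs) for name, op_type, inputs in graph}
    consumers = {}
    for name, _op_type, inputs in graph:
        for src in inputs:
            consumers.setdefault(src, []).append(name)
    fused, absorbed = [], set()
    for name, op_type, inputs in graph:
        if name in absorbed:
            continue  # this BiasAdd was folded into a _FusedMatMul
        users = consumers.get(name, [])
        if (op_type == "MatMul" and len(users) == 1
                and ops[users[0]][0] == "BiasAdd"):
            bias_name = users[0]
            # carry the bias tensor input over to the fused op
            bias_extra = [i for i in ops[bias_name][1] if i != name]
            fused.append((bias_name, "_FusedMatMul", inputs + bias_extra))
            absorbed.add(bias_name)
        else:
            fused.append((name, op_type, inputs))
    return fused
```

Running this on a MatMul whose consumer is, say, an activation rather than a BiasAdd leaves the MatMul untouched, which mirrors the situation described above for transformer blocks.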

@nSircombe
Contributor

Hi @lilh9598,

Actually, one of the team has been looking at transformer models, and managed to spend a little time looking at the 'missing' gemm API support. We have an approach not entirely dissimilar to your suggestion: diverting the call to batched matmul, where there's existing support via ACL. We're hoping to get a proof-of-concept patch into the build here soon, and will look to raise an upstream PR with TensorFlow if all looks good.
I'll let you know once we've got something up you can experiment with, any feedback would be very welcome.

@lilh9598
Author

Hi @nSircombe ,

Looks really good! I am looking forward to applying your new experiments to my cases.

@nSircombe
Contributor

Hi @lilh9598,
@fadara01 has just upstreamed a patch here - #174
This will pass calls heading into oneDNN's gemm_api into BatchMatMulMkl, which has ACL support. We've seen some significant performance improvements from this approach, and would be very interested to know if you get any gains (or problems) if you try and use it for your workflows.
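The routing idea (forwarding a plain gemm into a batched-matmul implementation) can be illustrated numerically: a 2-D gemm is just a batched matmul with batch size one. This is a toy model of the dispatch, not the actual patch in #174.

```python
# Illustrative sketch: route a single 2-D gemm through a batched-matmul
# kernel by wrapping the operands as a batch of one.

def batched_matmul(a_batch, b_batch):
    """Batched C[i] = A[i] @ B[i] over lists of 2-D row-major matrices."""
    out = []
    for a, b in zip(a_batch, b_batch):
        k, n = len(b), len(b[0])
        out.append([[sum(row[p] * b[p][j] for p in range(k))
                     for j in range(n)] for row in a])
    return out

def gemm_via_batched(a, b):
    """Dispatch a plain gemm to the batched kernel (batch size 1)."""
    return batched_matmul([a], [b])[0]
```

In the real code path the batched kernel would be the ACL-backed BatchMatMulMkl implementation rather than this naive loop, but the wrapping/unwrapping shape logic is the same.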

@lilh9598
Author

lilh9598 commented Apr 7, 2023

Hi @nSircombe ,
I'm sorry to report that there is no performance improvement with patch #174 compared to Eigen in my C++ application.

Also, is this patch enabled in the image "armswdev/tensorflow-arm-neoverse:latest"? I would like to write a Python program for further performance profiling.

@nSircombe
Contributor

Could you give more details about what you're running? oneDNN verbose logs would be handy; they might help us spot missed opportunities to call ACL, or highlight shapes where ACL performance is worse than expected.

This patch is not yet in the dockerhub images - they're updated from this repo once a month. The next update is due next week.

@lilh9598
Author

lilh9598 commented Apr 10, 2023

Our application is very complex, so I need to extract the ML part to analyze first. The bottleneck there may be bandwidth.
If any performance issues are found, we would be glad to sync with Arm.

@nSircombe
Contributor

Ok @lilh9598 - would be interested to hear what you find.

On the matmul front, though, just a redacted version of the stdout log generated with the environment variable ONEDNN_VERBOSE=1 set, with and without the patch from #174 included in the image build, would give us valuable insight into whether the matmul changes in #174 are being exercised by your workflow. If they're not, we may be able to identify more cases we could support; if they are, then we can look at the bare performance for the shapes in question and see if we're leaving anything on the table.
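A minimal sketch of capturing such a log follows. ONEDNN_VERBOSE is a real oneDNN environment variable; the script name is a placeholder for the actual workload, and the grep patterns are assumptions about the verbose line format (lines begin with `onednn_verbose`), so adjust them against your own output.

```shell
# Capture oneDNN's verbose trace for one run of the workload
# ("your_script.py" is a placeholder for the real entry point).
ONEDNN_VERBOSE=1 python your_script.py > onednn_verbose.log 2>&1

# Keep only primitive-execution lines mentioning matmul, to see which
# shapes are hit and which implementation (e.g. ACL) was selected.
grep '^onednn_verbose' onednn_verbose.log | grep matmul
```

Comparing the filtered output from a build with and without #174 should show whether calls are reaching the batched-matmul path.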

Look forward to hearing from you.

...I'll leave this issue open for now.
