
SVE optimised float WSSJ kernel #2917

Open
wants to merge 3 commits into base: main
Conversation

rakshithgb-fujitsu
Contributor

@rakshithgb-fujitsu rakshithgb-fujitsu commented Sep 26, 2024

This PR introduces performance optimizations for the float WSSJ kernel using SVE intrinsics, resulting in significant improvements for the SVM algorithms on ARM.

Key Improvements:
Boser Method: 22% performance gain, leading to faster computation and better resource utilization.
Thunder Method: 5% performance gain, enhancing efficiency in scenarios where this method is used.

Changes:
Code Updates: New SVE-intrinsics-based float WSSJ kernel.

Impact:
Performance: Faster processing times and improved efficiency for SVM algorithms, observed and documented on a single core.

Performance on single core:
[benchmark chart: svm-perf]

@rakshithgb-fujitsu
Contributor Author

@keeranroth, please have a look at this.

Contributor

@keeranroth keeranroth left a comment


Can't comment too much on the algorithm, as I'm not familiar with SVM code. But it looks believable. There are just some style points I picked up on. Will let someone knowledgeable about the application area give some more guidance.

I can't help but feel that there is some instruction level parallelism being left on the table. A lot of the instructions are dependent. Having the masking, you might be able to duplicate the work, and simply select the result that you want at the end. I feel as though this implementation is going to be using only one pipeline at the moment. Would need profiling to confirm, though

Comment on lines +16 to +18
/*
* Contains optimizations for SVE.
*/
Contributor


This comment isn't adding any information. If you want to add more information about what the algorithm is doing into the comment, that would be ideal. Otherwise remove it

svint32_t Bj_vec = svdup_s32(-1);

// some constants used during optimization
// enum SVMVectorStatus low = 0x2
Contributor


Not sure where low would be defined before this, but maybe this isn't supposed to be a comment? The code below uses it, so I'm assuming this should be uncommented

Contributor Author


low is defined elsewhere; this comment is just a reminder of its value at this point in the code.

}
else
{
DAAL_ASSERT((sign & (sign - 1)) == 0) // used to make sure sign is always having 1 bit set
Contributor


Not sure what getSign returns, but this assert is also true when sign = 0, so the comment isn't correct. I suspect this might not be what you want to be checking on the result of getSign.

Contributor Author


The idea was to keep this optimization under a check: since low = 0x2, this debug assert would catch the case where that value ever changes. getSign is defined here:

DAAL_FORCEINLINE static char getSign(SignNuType signNuType)


size_t j_cur = jStart;

for (j_cur; j_cur < jEnd; j_cur += w)
Contributor


Suggested change
for (j_cur; j_cur < jEnd; j_cur += w)
for (; j_cur < jEnd; j_cur += w)

@rakshithgb-fujitsu
Contributor Author

rakshithgb-fujitsu commented Sep 26, 2024

Can't comment too much on the algorithm, as I'm not familiar with SVM code. But it looks believable. There are just some style points I picked up on. Will let someone knowledgeable about the application area give some more guidance.

I can't help but feel that there is some instruction level parallelism being left on the table. A lot of the instructions are dependent. Having the masking, you might be able to duplicate the work, and simply select the result that you want at the end. I feel as though this implementation is going to be using only one pipeline at the moment. Would need profiling to confirm, though

We do think there is more room for improvement and are exploring more ideas, but we've finalized this version for now; if we optimize it further we'll raise another PR. If you see any instruction-level bottlenecks, please do point them out.

@napetrov
Contributor

Great to see specialization for the algorithm.

@napetrov
Contributor

/intelci: run
