
SVE optimised float WSSJ kernel #2917

Open
wants to merge 3 commits into base: main
Conversation

rakshithgb-fujitsu
Contributor

@rakshithgb-fujitsu rakshithgb-fujitsu commented Sep 26, 2024

This PR introduces performance optimizations for the float WSSJ kernel using SVE intrinsics, resulting in significant improvements for the SVM algorithms on ARM.

Key Improvements:
Boser Method: 22% performance gain, leading to faster computation and better resource utilization.
Thunder Method: 5% performance gain, enhancing efficiency in scenarios where this method is used.

Changes:
Code Updates: New SVE-intrinsics-based float WSSJ kernel.

Impact:
Performance: Faster processing times and improved efficiency for SVM algorithms, observed and documented on a single core.

Performance on single core:
[benchmark chart: svm-perf]

@rakshithgb-fujitsu
Contributor Author

@keeranroth, please have a look at this.

Contributor

@keeranroth keeranroth left a comment


Can't comment too much on the algorithm, as I'm not familiar with SVM code. But it looks believable. There are just some style points I picked up on. Will let someone knowledgeable about the application area give some more guidance.

I can't help but feel that there is some instruction level parallelism being left on the table. A lot of the instructions are dependent. Having the masking, you might be able to duplicate the work, and simply select the result that you want at the end. I feel as though this implementation is going to be using only one pipeline at the moment. Would need profiling to confirm, though

Comment on lines +16 to +18
/*
* Contains optimizations for SVE.
*/
Contributor


This comment isn't adding any information. If you want to add more information about what the algorithm is doing into the comment, that would be ideal. Otherwise remove it

svint32_t Bj_vec = svdup_s32(-1);

// some constants used during optimization
// enum SVMVectorStatus low = 0x2
Contributor


Not sure where low would be defined before this, but maybe this isn't supposed to be a comment? The code below uses it, so I'm assuming this should be uncommented

Contributor Author


low is defined elsewhere; this comment is just a reminder of its value at this point in the code.

}
else
{
DAAL_ASSERT((sign & (sign - 1)) == 0) // used to make sure sign is always having 1 bit set
Contributor


Not sure what getSign returns, but this assert is also true when sign = 0, so the comment isn't correct. I suspect this might not be what you want to be checking on the result of getSign.

Contributor Author


The idea was to keep this optimization under a check: since low = 0x2, this debug assert would catch the case where that value ever changes. getSign is defined here:

DAAL_FORCEINLINE static char getSign(SignNuType signNuType)


size_t j_cur = jStart;

for (j_cur; j_cur < jEnd; j_cur += w)
Contributor


Suggested change
for (j_cur; j_cur < jEnd; j_cur += w)
for (; j_cur < jEnd; j_cur += w)

@rakshithgb-fujitsu
Contributor Author

rakshithgb-fujitsu commented Sep 26, 2024

Can't comment too much on the algorithm, as I'm not familiar with SVM code. But it looks believable. There are just some style points I picked up on. Will let someone knowledgeable about the application area give some more guidance.

I can't help but feel that there is some instruction level parallelism being left on the table. A lot of the instructions are dependent. Having the masking, you might be able to duplicate the work, and simply select the result that you want at the end. I feel as though this implementation is going to be using only one pipeline at the moment. Would need profiling to confirm, though

We do think there is more room for improvement and are exploring more ideas, but we've finalized this version for now; if we optimize it further we'll raise another PR. If you see any instruction-level bottlenecks, please do point them out.

@napetrov
Contributor

Great to see specialization for the algorithm.

@napetrov
Contributor

/intelci: run
