-
Notifications
You must be signed in to change notification settings - Fork 318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Math: Optimise 16-bit matrix multiplication functions. #9088
base: main
Are you sure you want to change the base?
Math: Optimise 16-bit matrix multiplication functions. #9088
Conversation
44a73f4
to
01f4acb
Compare
01f4acb
to
b3fa07c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine to me but @singalsu needs to review.
Long term @ShriramShastry @singalsu thinking aloud, I think it may make more sense to have Kconfig that will use the HiFi3/4/5 kernels from nnlib for xtensa targets i.e.
https://github.com/foss-xtensa/nnlib-hifi5/blob/master/xa_nnlib/algo/kernels/matXvec/hifi5/xa_nn_matXvec_16x16.c
3427edd
to
5493610
Compare
src/math/matrix.c
Outdated
z++; | ||
} | ||
} | ||
int64_t acc; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As style comment, please keep the original style and define local variables in beginning of function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Is the Cadence's license compatible with all SOF usages (firmware, plugin)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general - I'm not sure I understand what this PR is solving
src/math/matrix.c
Outdated
* non-negative (> 0). Since shift = shift_minus_one + 1, the check for non-negative | ||
* value is (shift_minus_one >= -1) | ||
*/ | ||
const int64_t offset = (shift_minus_one >= -1) ? (1LL << shift_minus_one) : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to search for this, and here's what I've found [1]:
The left-shift and right-shift operators should not be used for negative numbers. The result of is undefined behavior if any of the operands is a negative number. For example results of both 1 >> -1 and 1 << -1 is undefined.
But the examples under that comment show, that a left shift by -1 should be the same as a right shift by 1, i.e. 1 >> 1
which is 0. So, you can just replace >=
above with a >
and the result will be the same without that ambiguity.
[1] https://www.geeksforgeeks.org/left-shift-right-shift-operators-c-cpp/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you pointed out, it makes no sense to shift left by -1 (1 << -1
). The same goal can be achieved without ambiguity by changing the condition to shift > -1
.
src/math/matrix.c
Outdated
|
||
return 0; | ||
} | ||
const int shift = shift_minus_one + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about removing shift_minus_one
entirely and just using shift
with appropriate changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I could remove shift_minus_one
and work only with shift
. This essentially means that I'd be adding 1
to shift
where I had previously subtracted 1
from shift_minus_one
.
src/math/matrix.c
Outdated
/* If shift == 0, then shift_minus_one == -1, which means the data is Q16.0 | ||
* Otherwise, add the offset before the shift to round up if necessary | ||
*/ | ||
*z++ = (shift == 0) ? (int16_t)acc : (int16_t)((acc + offset) >> shift); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the ?:
operator can contain an implicit jump and therefore be rather expensive, so it should be avoided on performance sensitive paths. Here if shift == 0
if my "negative shift" understanding above is correct, then offset == 0
, so the ?:
isn't needed at all and you can just use (int16_t)((acc + offset) >> shift)
always
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, Since offset is 0
when shift is 0
, I can always use (int16_t)(acc + offset) >> shift
without using the ?:
operator, yeah probably simplifying the expression and we get potentially improving performance.
src/math/matrix.c
Outdated
if (shift < -1 || shift > 62) | ||
return -ERANGE; | ||
/* Offset for rounding */ | ||
const int64_t offset = (shift >= 0) ? (1LL << (shift - 1)) : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
>=
and a "negative shift" again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, to avoid ambiguity when shifting by a negative number, the condition should be changed from >= to >. The result ensures that offset is 0 when shift is 0, and that no negative number attempts are made. I will make the modification
src/math/matrix.c
Outdated
/* General case with shifting and offset for rounding */ | ||
for (i = 0; i < total_elements; i++) { | ||
acc = (int64_t)x[i] * (int64_t)y[i]; /* Cast to int64_t to avoid overflow */ | ||
z[i] = (int16_t)((acc + offset) >> shift); /* Apply shift and offset */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in commit description you say: "Both functions now use incremented pointers in loop expressions for
better legibility and potential compiler optimisation." and here you remove pointer incrementing and replace it with indexing. In general - while legibility is subjective so I wouldn't comment on it, does this PR really improve anything? Does it fix any bugs or brings any measurable and measured performance improvements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you've measured size improvements and also some performance improvements, but it would be good to understand which changes exactly introduce them. This would be important both for this PR and also for general understanding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lyakh fwiw, changes to the code could be easier for compiler to vectorize in certain places hence the speedup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgirdwood sure, and I'd very much like to know which exactly changes do that! If there are any recipes like pointer incrementing is slower than indexed access or anything similar, then it would be good to know those to optimise all our code. OTOH wouldn't it be the case that even slightly different code sequences generate more optimal assembly with different optimisations applied? E.g. sometimes loops like
for (int i = 0; i < N; i++)
x[i] = io_read32(register);
would generate faster code and sometimes it's
for (int i = 0; i < N; i++, x++)
*x = io_read32(register);
that runs faster? While in fact we'd expect the compiler to optimise both correctly, but in some more complex cases it might not be able to.
It is also possible for someone to modify some such hot-path code without checkout performance and cause a significant degradation. Maybe we need the CI to keep an eye on performance...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code snippet used to measure performance
int mat_multiply_elementwise_ptr_opt(struct mat_matrix_16b *a, struct mat_matrix_16b *b,
struct mat_matrix_16b *c)
{
int16_t *x = a->data;
int16_t *y = b->data;
int16_t *z = c->data;
int32_t p;
/* Check for NULL pointers */
if (!a || !b || !c)
return -EINVAL;
/* Check for NULL pointers */
if (!a || !b || !c || a->columns != b->columns || a->rows != b->rows)
return -EINVAL;
const int total_elements = a->rows * a->columns;
const int shift = a->fractions + b->fractions - c->fractions - 1;
if (shift == -1) {
// If no fixed-point fraction bits need to be adjusted
// Pointer arithmetic equivalent
for (int i = 0; i < total_elements; i++, x++, y++, z++)
*z = *x * *y;
} else {
// Pointer arithmetic equivalent
for (int i = 0; i < total_elements; i++, x++, y++, z++) {
p = (int32_t)(*x) * (*y);
*z = (int16_t)(((p >> shift) + 1) >> 1); // Shift to Qx.y
}
}
return 0;
}
int mat_multiply_elementwise_arr_opt(struct mat_matrix_16b *a, struct mat_matrix_16b *b,
struct mat_matrix_16b *c)
{
int16_t *x = a->data;
int16_t *y = b->data;
int16_t *z = c->data;
int32_t p;
/* Check for NULL pointers */
if (!a || !b || !c)
return -EINVAL;
/* Check for NULL pointers */
if (!a || !b || !c || a->columns != b->columns || a->rows != b->rows)
return -EINVAL;
const int total_elements = a->rows * a->columns;
const int shift = a->fractions + b->fractions - c->fractions - 1;
if (shift == -1) {
// If no fixed-point fraction bits need to be adjusted
// Array indexing method
for (int i = 0; i < total_elements; i++)
z[i] = x[i] * y[i];
} else {
// General case with shifting and rounding
// Array indexing method
for (int i = 0; i < total_elements; i++) {
p = (int32_t)x[i] * y[i];
z[i] = (int16_t)(((p >> shift) + 1) >> 1); // Shift to Qx.y
}
}
return 0;
}
int mat_multiply_elementwise_ws64_opt(struct mat_matrix_16b* a, struct mat_matrix_16b* b, struct mat_matrix_16b* c) {
// Dimension check
if (!a || !b || !c || a->columns != b->columns || a->rows != b->rows) {
return -EINVAL;
}
int16_t* x = a->data;
int16_t* y = b->data;
int16_t* z = c->data;
int total_elements = a->rows * a->columns;
const int shift = a->fractions + b->fractions - c->fractions;
for (int i = 0; i < total_elements; i++) {
int64_t p = ((int64_t)(*x)) * (*y); // Use 64-bit for intermediate product to avoid overflow
// Correct rounding
if (shift > 0) {
*z = (int16_t)((p + (1 << (shift - 1))) >> shift);
}
else {
*z = (int16_t)p; // No shift needed
}
//printf("Intermediate result z[%d]: %d (for x: %d, y: %d)\n", i, *z, *x, *y);
x++;
y++;
z++;
}
return 0;
}
Performance summary for xtensa testbench with the -O3 compiler optimisation option.
Four unit tests for 16-bit matrix multiplication are run; mat_multiply is called three times, while matrix element-wise multiplication is called once.
Compiler_tool | Function Name | Function (%) | Function (F) | Children (C) | Total (F+C) | Called | Size (bytes) |
---|---|---|---|---|---|---|---|
LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_arr_opt | 2.2 | 230 | 0 | 230 | 1 | 467 |
LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_original | 3.31 | 347 | 120 | 467 | 1 | 462 |
LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_ptr_opt | 2.18 | 228 | 0 | 228 | 1 | 457 |
LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_ws64_opt | 3.96 | 414 | 120 | 534 | 1 | 193 |
LX7HiFi5_RI_2022_9 | mat_multiply_optimized | 40.04 | 4,185 | 0 | 4,185 | 3 | 477 |
LX7HiFi5_RI_2022_9 | mat_multiply_original | 41.19 | 4,306 | 252 | 4,558 | 3 | 749 |
Generic_HiFi5 | mat_multiply_elementwise_arr_opt | 2.2 | 230 | 0 | 230 | 1 | 467 |
Generic_HiFi5 | mat_multiply_elementwise_original | 3.32 | 347 | 120 | 467 | 1 | 462 |
Generic_HiFi5 | mat_multiply_elementwise_ptr_opt | 2.18 | 228 | 0 | 228 | 1 | 457 |
Generic_HiFi5 | mat_multiply_elementwise_ws64_opt | 3.96 | 414 | 120 | 534 | 1 | 193 |
Generic_HiFi5 | mat_multiply_optimized | 40.07 | 4,185 | 0 | 4,185 | 3 | 477 |
Generic_HiFi5 | mat_multiply_original | 41.23 | 4,306 | 252 | 4,558 | 3 | 749 |
LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_arr_opt | 2.07 | 232 | 0 | 232 | 1 | 336 |
LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_original | 3.1 | 348 | 120 | 468 | 1 | 342 |
LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_ptr_opt | 2.05 | 230 | 0 | 230 | 1 | 336 |
LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_ws64_opt | 3.7 | 415 | 120 | 535 | 1 | 184 |
LX7HiFi4_RI_2022_9 | mat_multiply_optimized | 39.06 | 4,374 | 0 | 4,374 | 3 | 455 |
LX7HiFi4_RI_2022_9 | mat_multiply_original | 43.27 | 4,846 | 252 | 5,098 | 3 | 747 |
Generic_HiFi4 | mat_multiply_elementwise_arr_opt | 2.54 | 290 | 0 | 290 | 1 | 362 |
Generic_HiFi4 | mat_multiply_elementwise_original | 3.31 | 377 | 120 | 497 | 1 | 366 |
Generic_HiFi4 | mat_multiply_elementwise_ptr_opt | 2.55 | 291 | 0 | 291 | 1 | 378 |
Generic_HiFi4 | mat_multiply_elementwise_ws64_opt | 4.46 | 508 | 120 | 628 | 1 | 244 |
Generic_HiFi4 | mat_multiply_optimized | 37.17 | 4,233 | 0 | 4,233 | 3 | 544 |
Generic_HiFi4 | mat_multiply_original | 43.35 | 4,937 | 252 | 5,189 | 3 | 923 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the daily performance stats would be great to have and compare how PR impacts. In this case it would not help since we don't have test topology for MFCC. It's in my targets for quarter, so coming eventually (sof-hda-benchmark topologies, plenty already available). But the performance tracking is a bigger one that is now half done but the rest in risk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also worth to note that Sriram's figures are with maximized optimization level in Xplorer GUI based test while our builds are using just -O2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct , I have used -O3 optimization level during measurements.
src/math/matrix.c
Outdated
} else { | ||
/* General case with shifting and offset for rounding */ | ||
for (i = 0; i < total_elements; i++) { | ||
acc = (int64_t)x[i] * (int64_t)y[i]; /* Cast to int64_t to avoid overflow */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really wonder how 64 bit multiply can be faster than original 32 bits. This code avoids one shift but still, it's quite surprising. Did you test the non-Q0 case for cycles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Multiplication in 64 bits does not yield faster results than multiplication in 32 bits; instead, it requires twice as many resources (2x more multipliers).
My last check-in wasn't very successful. I've since fixed the error and included a summary of the results for the most recent check-in.
Thank you
5493610
to
37a0835
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your review, and I have made the necessary changes. Please read through and make a suggestion.
src/math/matrix.c
Outdated
|
||
return 0; | ||
} | ||
const int shift = shift_minus_one + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I could remove shift_minus_one
and work only with shift
. This essentially means that I'd be adding 1
to shift
where I had previously subtracted 1
from shift_minus_one
.
src/math/matrix.c
Outdated
* non-negative (> 0). Since shift = shift_minus_one + 1, the check for non-negative | ||
* value is (shift_minus_one >= -1) | ||
*/ | ||
const int64_t offset = (shift_minus_one >= -1) ? (1LL << shift_minus_one) : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you pointed out, it makes no sense to shift left by -1 (1 << -1
). The same goal can be achieved without ambiguity by changing the condition to shift > -1
.
src/math/matrix.c
Outdated
/* If shift == 0, then shift_minus_one == -1, which means the data is Q16.0 | ||
* Otherwise, add the offset before the shift to round up if necessary | ||
*/ | ||
*z++ = (shift == 0) ? (int16_t)acc : (int16_t)((acc + offset) >> shift); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, Since offset is 0
when shift is 0
, I can always use (int16_t)(acc + offset) >> shift
without using the ?:
operator, yeah probably simplifying the expression and we get potentially improving performance.
src/math/matrix.c
Outdated
if (shift < -1 || shift > 62) | ||
return -ERANGE; | ||
/* Offset for rounding */ | ||
const int64_t offset = (shift >= 0) ? (1LL << (shift - 1)) : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, to avoid ambiguity when shifting by a negative number, the condition should be changed from >= to >. The result ensures that offset is 0 when shift is 0, and that no negative number attempts are made. I will make the modification
src/math/matrix.c
Outdated
/* General case with shifting and offset for rounding */ | ||
for (i = 0; i < total_elements; i++) { | ||
acc = (int64_t)x[i] * (int64_t)y[i]; /* Cast to int64_t to avoid overflow */ | ||
z[i] = (int16_t)((acc + offset) >> shift); /* Apply shift and offset */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code snippet used to measure performance
int mat_multiply_elementwise_ptr_opt(struct mat_matrix_16b *a, struct mat_matrix_16b *b,
struct mat_matrix_16b *c)
{
int16_t *x = a->data;
int16_t *y = b->data;
int16_t *z = c->data;
int32_t p;
/* Check for NULL pointers */
if (!a || !b || !c)
return -EINVAL;
/* Check for NULL pointers */
if (!a || !b || !c || a->columns != b->columns || a->rows != b->rows)
return -EINVAL;
const int total_elements = a->rows * a->columns;
const int shift = a->fractions + b->fractions - c->fractions - 1;
if (shift == -1) {
// If no fixed-point fraction bits need to be adjusted
// Pointer arithmetic equivalent
for (int i = 0; i < total_elements; i++, x++, y++, z++)
*z = *x * *y;
} else {
// Pointer arithmetic equivalent
for (int i = 0; i < total_elements; i++, x++, y++, z++) {
p = (int32_t)(*x) * (*y);
*z = (int16_t)(((p >> shift) + 1) >> 1); // Shift to Qx.y
}
}
return 0;
}
int mat_multiply_elementwise_arr_opt(struct mat_matrix_16b *a, struct mat_matrix_16b *b,
struct mat_matrix_16b *c)
{
int16_t *x = a->data;
int16_t *y = b->data;
int16_t *z = c->data;
int32_t p;
/* Check for NULL pointers */
if (!a || !b || !c)
return -EINVAL;
/* Check for NULL pointers */
if (!a || !b || !c || a->columns != b->columns || a->rows != b->rows)
return -EINVAL;
const int total_elements = a->rows * a->columns;
const int shift = a->fractions + b->fractions - c->fractions - 1;
if (shift == -1) {
// If no fixed-point fraction bits need to be adjusted
// Array indexing method
for (int i = 0; i < total_elements; i++)
z[i] = x[i] * y[i];
} else {
// General case with shifting and rounding
// Array indexing method
for (int i = 0; i < total_elements; i++) {
p = (int32_t)x[i] * y[i];
z[i] = (int16_t)(((p >> shift) + 1) >> 1); // Shift to Qx.y
}
}
return 0;
}
int mat_multiply_elementwise_ws64_opt(struct mat_matrix_16b* a, struct mat_matrix_16b* b, struct mat_matrix_16b* c) {
// Dimension check
if (!a || !b || !c || a->columns != b->columns || a->rows != b->rows) {
return -EINVAL;
}
int16_t* x = a->data;
int16_t* y = b->data;
int16_t* z = c->data;
int total_elements = a->rows * a->columns;
const int shift = a->fractions + b->fractions - c->fractions;
for (int i = 0; i < total_elements; i++) {
int64_t p = ((int64_t)(*x)) * (*y); // Use 64-bit for intermediate product to avoid overflow
// Correct rounding
if (shift > 0) {
*z = (int16_t)((p + (1 << (shift - 1))) >> shift);
}
else {
*z = (int16_t)p; // No shift needed
}
//printf("Intermediate result z[%d]: %d (for x: %d, y: %d)\n", i, *z, *x, *y);
x++;
y++;
z++;
}
return 0;
}
Performance summary for xtensa testbench with the -O3 compiler optimisation option.
Four unit tests for 16-bit matrix multiplication are run; mat_multiply is called three times, while matrix element-wise multiplication is called once.
Compiler_tool | Function Name | Function (%) | Function (F) | Children (C) | Total (F+C) | Called | Size (bytes) |
---|---|---|---|---|---|---|---|
LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_arr_opt | 2.2 | 230 | 0 | 230 | 1 | 467 |
LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_original | 3.31 | 347 | 120 | 467 | 1 | 462 |
LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_ptr_opt | 2.18 | 228 | 0 | 228 | 1 | 457 |
LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_ws64_opt | 3.96 | 414 | 120 | 534 | 1 | 193 |
LX7HiFi5_RI_2022_9 | mat_multiply_optimized | 40.04 | 4,185 | 0 | 4,185 | 3 | 477 |
LX7HiFi5_RI_2022_9 | mat_multiply_original | 41.19 | 4,306 | 252 | 4,558 | 3 | 749 |
Generic_HiFi5 | mat_multiply_elementwise_arr_opt | 2.2 | 230 | 0 | 230 | 1 | 467 |
Generic_HiFi5 | mat_multiply_elementwise_original | 3.32 | 347 | 120 | 467 | 1 | 462 |
Generic_HiFi5 | mat_multiply_elementwise_ptr_opt | 2.18 | 228 | 0 | 228 | 1 | 457 |
Generic_HiFi5 | mat_multiply_elementwise_ws64_opt | 3.96 | 414 | 120 | 534 | 1 | 193 |
Generic_HiFi5 | mat_multiply_optimized | 40.07 | 4,185 | 0 | 4,185 | 3 | 477 |
Generic_HiFi5 | mat_multiply_original | 41.23 | 4,306 | 252 | 4,558 | 3 | 749 |
LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_arr_opt | 2.07 | 232 | 0 | 232 | 1 | 336 |
LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_original | 3.1 | 348 | 120 | 468 | 1 | 342 |
LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_ptr_opt | 2.05 | 230 | 0 | 230 | 1 | 336 |
LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_ws64_opt | 3.7 | 415 | 120 | 535 | 1 | 184 |
LX7HiFi4_RI_2022_9 | mat_multiply_optimized | 39.06 | 4,374 | 0 | 4,374 | 3 | 455 |
LX7HiFi4_RI_2022_9 | mat_multiply_original | 43.27 | 4,846 | 252 | 5,098 | 3 | 747 |
Generic_HiFi4 | mat_multiply_elementwise_arr_opt | 2.54 | 290 | 0 | 290 | 1 | 362 |
Generic_HiFi4 | mat_multiply_elementwise_original | 3.31 | 377 | 120 | 497 | 1 | 366 |
Generic_HiFi4 | mat_multiply_elementwise_ptr_opt | 2.55 | 291 | 0 | 291 | 1 | 378 |
Generic_HiFi4 | mat_multiply_elementwise_ws64_opt | 4.46 | 508 | 120 | 628 | 1 | 244 |
Generic_HiFi4 | mat_multiply_optimized | 37.17 | 4,233 | 0 | 4,233 | 3 | 544 |
Generic_HiFi4 | mat_multiply_original | 43.35 | 4,937 | 252 | 5,189 | 3 | 923 |
Thanks, measurements are good, but I think it would be good to understand which changes improve performance |
37a0835
to
175565c
Compare
Apologies for not responding to the question you asked earlier as well. Key Points of Improvement Original Code: (2) Streamlined Loop Structure and Conditional Logic
PR GitHub Current Code (mat_multiply):
Improvement Explanation: The current PR GitHub code seamlessly integrates the special handling of fractional bits within the main loop, eliminating the need for separate conditionals for each case. This minimises branching and increases loop efficiency. (3)Optimized Memory Access Patterns Original Code:Memory access patterns were less optimised due to additional variables and possibly unoptimized pointer arithmetic.
Improvement Explanation: More effective memory access and cache utilisation are consequently achieved by the PR GitHub current code by streamlining pointer arithmetic and lowering the number of temporary variables. Unified Handling of Fractional Adjustments While fractional bits adjustments are handled by both codes, the PR GitHub current implementation handles them more effectively by handling them directly within the computational loop, which lowers overhead.
PR GitHub Current Code (Snippet for mat_multiply):
Summary for PR GitHub code makes use of smaller data sizes, loop structures, streamlined conditional logic, and optimised memory access patterns to minimise memory usage and speed up matrix multiplication operations. Key Points of Improvement for Reduction in Accumulator Bit Width for Intermediate Results: Improvement Explanation: By eliminating the need to handle larger data types, using a 32-bit integer (int32_t) for intermediate multiplication results instead of a 64-bit integer (int64_t) can speed up the process on platforms where 32-bit instructions are more efficient. Streamlining the Conditional Logic for Handling Q0 (shift == -1):
PR GitHub Current Code:
Improvement Explanation: There is a special case for situations in which bit shift is not needed in both the original and PR GitHub current codes. Since this check is factored out of the loop, the current code performs it more effectively by using a single comparison rather than repeating it for every iteration. Original Code (Snippet for
// Handle no shift case (Q0 data) separately
// Default case with shift
PR GitHub Current Code (Snippet for
Perform multiplication with or without adjusting the fractional bits
Summary for I would be happy to add if you think that a breakdown of the instructions would be more helpful in a very specific aspects of the code changes between two codes. |
175565c
to
738540b
Compare
src/math/matrix.c
Outdated
} | ||
|
||
int64_t p; | ||
{ int64_t p; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 lines
a->columns != b->columns || a->rows != b->rows || | ||
c->columns != a->columns || c->rows != a->rows) { | ||
return -EINVAL; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you removed this in your previous commit
src/math/matrix.c
Outdated
int16_t *x = a->data; | ||
int16_t *y = b->data; | ||
int16_t *z = c->data; | ||
int i; | ||
const int shift_minus_one = a->fractions + b->fractions - c->fractions - 1; | ||
int32_t prod; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICS this is the only real change in this commit:
- int64_t p;
+ int32_t prod;
the rest is cosmetics. Please don't mix functional changes with cosmetics. This obscures the changes.
I just tested this with MFCC in TGL platform. This patch reduced CPU_PEAK(AVG) from 85.1 to 81.5 MCPS. With LL pipelines the STFT hop/size causes uneven load, so this time using peak as more representative perf measure instead of average. |
@singalsu do you have more tests to run or good to go with the optimization after cleanups are resolved ? |
I think this is good to merge after suggestions by @lyakh are addressed. Since the matrix operations are not a major part of MFCC these savings are quite significant. So no need for further performance tests. |
Great - @ShriramShastry you have a clear path now for merge after @lyakh comments are resolved. |
3d4f5fc
to
2276903
Compare
src/math/matrix.c
Outdated
if (shift == -1) | ||
*z = (int16_t)acc; | ||
else | ||
*z = (int16_t)(((acc >> shift) + 1) >> 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes it slower. Before this test was done once before all the looping. Now you do it in every iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check shift == -1
being inside the loop does introduce a minor inefficiency. I have addressed it.
2276903
to
234334f
Compare
src/math/matrix.c
Outdated
@@ -25,56 +25,66 @@ | |||
* -EINVAL if input dimensions do not allow for multiplication. | |||
* -ERANGE if the shift operation might cause integer overflow. | |||
*/ | |||
int mat_multiply(struct mat_matrix_16b *a, struct mat_matrix_16b *b, struct mat_matrix_16b *c) | |||
int mat_multiply(struct mat_matrix_16b *a, struct mat_matrix_16b *b, | |||
struct mat_matrix_16b *c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please, split this commit: a single commit per optimisation and separate commits with any variable renaming, comments, and similar cosmetic changes, unrelated to actual optimisations. And if you claim "36.31% reduction in memory usage" I'd like to understand where it comes from, so far I don't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'II make.
int i; | ||
const int shift_minus_one = a->fractions + b->fractions - c->fractions - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same in this commit - please check that the commit description still matches the contents and I'd like to be able to see that too. As it stands, it's difficult to see where "unnecessary conditionals" are eliminated
This patch introduces Doxygen-style documentation to the matrix multiplication functions. Clear descriptions and parameter details are provided to facilitate better understanding and ease of use. Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Added checks for integer overflow during shifting. - Validated matrix dimensions to prevent mismatches. - Ensured non-null pointers before operating on matrices. Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
Changed the accumulator data type from `int64_t` to `int32_t` to reduce instruction cycle count. This change results in an approximate 8.18% gain in performance for matrix multiplication operations. Performance Results: Compiler Settings: -O2 +------------+------+------+--------+-----------+-----------+----------+ | Test Name | Rows | Cols | Cycles | Max Error | RMS Error | Result | +------------+------+------+--------+-----------+-----------+----------+ | Test 1 | 3 | 5 | 6487 | 0.00 | 0.00 | Pass | | Test 2 | 6 | 8 | 6106 | 0.00 | 0.00 | Pass | +------------+------+------+--------+-----------+-----------+----------+ Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
Enhanced pointer arithmetic within loops to improve readability and reduce overhead. This change potentially reduces minor computational overhead, contributing to overall performance improvements of around 8.23% for Test 1 and 16.00% for Test 2. Performance Results: Compiler Settings: -O3 +------------+------+------+--------+-----------+-----------+----------+ | Test Name | Rows | Cols | Cycles | Max Error | RMS Error | Result | +------------+------+------+--------+-----------+-----------+----------+ | Test 1 | 3 | 5 | 5953 | 0.00 | 0.00 | Pass | | Test 2 | 6 | 8 | 5128 | 0.00 | 0.00 | Pass | +------------+------+------+--------+-----------+-----------+----------+ Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
Updated comments for better clarity and understanding. Made cosmetic changes such as reformatting code and renaming variables to enhance readability without impacting functionality. This resulted in approximately 7.97% and 15.00% performance improvements for Test 1 and Test 2, respectively. Performance Results: Compiler Settings: -O2 +------------+------+------+--------+-----------+-----------+----------+ | Test Name | Rows | Cols | Cycles | Max Error | RMS Error | Result | +------------+------+------+--------+-----------+-----------+----------+ | Test 1 | 3 | 5 | 5975 | 0.00 | 0.00 | Pass | | Test 2 | 6 | 8 | 5192 | 0.00 | 0.00 | Pass | +------------+------+------+--------+-----------+-----------+----------+ Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Enhanced data pointers for matrix elements - Streamlined loop iteration for matrix element-wise multiplication - Achieved a 0.09% performance improvement in cycle count | Rows | Cols | Cycles | Max Error | RMS Error | Result| +------+------+--------+-----------+-----------+-------+ | 5 | 6 | 3359 | 0.00 | 0.00 | Pass | Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Changed product variable from int64_t to int32_t - Improved performance by reducing data size - Achieved a 11.57% performance improvement in cycle count | Rows | Cols | Cycles | Max Error | RMS Error | Result | +------+------+--------+-----------+-----------+--------+ | 5 | 6 | 2972 | 0.00 | 0.00 | Pass | Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
234334f
to
c3eeab1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a general comment: changes are made not because they don't make anything worse, but because they improve something. When a PR is submitted it means, that the submitter is trying to convince reviewers and maintainers that his changes improve something, not just that they aren't breaking anything.
src/math/matrix.c
Outdated
@@ -31,7 +31,7 @@ int mat_multiply(struct mat_matrix_16b *a, struct mat_matrix_16b *b, struct mat_ | |||
if (a->columns != b->rows || a->rows != c->rows || b->columns != c->columns) | |||
return -EINVAL; | |||
|
|||
int64_t s; | |||
int32_t s; /* Changed from int64_t to int32_t */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a comment in the code. Nobody will see the history when reading it. A "changed" comment implies that before it was something different, then someone changed it. But when you read code, you aren't that interested in the history, you want to know how and why it works in its present version. So comments should describe the present code, not its history.
src/math/matrix.c
Outdated
@@ -51,12 +51,12 @@ int mat_multiply(struct mat_matrix_16b *a, struct mat_matrix_16b *b, struct mat_ | |||
x = a->data + a->columns * i; | |||
y = b->data + j; | |||
for (k = 0; k < b->rows; k++) { | |||
s += (int32_t)(*x) * (*y); | |||
x++; | |||
/* Enhanced pointer arithmetic */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I don't understand this commend. "Enhanced" as compared to what or when? Does it help understanding the code? Or is "enhanced point" a Maths term that I don't know?
src/math/matrix.c
Outdated
*z = (int16_t)s; /* For Q16.0 */ | ||
z++; | ||
/* Enhanced pointer arithmetic */ | ||
*z++ = (int16_t)s; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally also like expressions like
*x++ = value;
but I don't think they actually improve readability, they're just more compact, so help scan the code faster. And I certainly don't expect these changes to affect performance. If you really want to convince me, you need to compare assembly before and after this change.
|
||
/* Check shift to ensure no integer overflow occurs during shifting */ | ||
if (shift_minus_one < -1 || shift_minus_one > 31) | ||
if (shift < -1 || shift > 31) | ||
return -ERANGE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
your "cycles" figures before and after are very similar to the previous commit. This tells me that you probably measured a state before several changes (possibly the whole PR) and after them, not before and after each commit. So "cycles" don't seem correct.
y += y_inc; | ||
} | ||
/* Enhanced pointer arithmetic */ | ||
*z++ = (int16_t)s; | ||
*z = (int16_t)acc; | ||
z++; /* Move to the next element in the output matrix */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you just changed *z = x; z++;
to *z++ = x;
in the previous commit and claimed that it was an improvement.
} | ||
/* Enhanced pointer arithmetic */ | ||
*z++ = (int16_t)(((s >> shift_minus_one) + 1) >> 1); /*Shift to Qx.y */ | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of not really helpful comments, very confusing whether this changes anything. With this commit I don't think I can approve the PR
x++; | ||
y++; | ||
z++; | ||
*z = (int16_t)(((p >> shift_minus_one) + 1) >> 1); /* Shift to Qx.y */ | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this enhances or streamlines anything. NAK.
Release reminder - one week to v2.11-rc1. |
FYI @ShriramShastry pushing to v2.12 (cutoff date today). |
Improve
mat_multiply
andmat_multiply_elementwise
for 16-bit signed integers by refactoring operations and simplifying handling of Q0 data.Changes:
Replace int64_t with int32_t for accumulators in mat_multiply and
mat_multiply_elementwise, reducing cycle count by ~51.18% for elementwise
operations and by ~8.18% for matrix multiplication.
Enhance pointer arithmetic within loops for better readability and
compiler optimization opportunities.
Eliminate unnecessary conditionals by directly handling Q0 data in the
algorithm's core logic.
Update fractional bit shift and rounding logic for more accurate
fixed-point calculations.
Performance gains from these optimizations include a 1.08% reduction in
memory usage for elementwise functions and a 36.31% reduction for matrix
multiplication. The changes facilitate significant resource management
improvements in constrained environments.
Summary: