
Math: Optimise 16-bit matrix multiplication functions. #9088

Open

wants to merge 7 commits into base: main
Conversation

@ShriramShastry (Contributor) commented Apr 29, 2024

Improve mat_multiply and mat_multiply_elementwise for 16-bit signed integers by refactoring operations and simplifying handling of Q0 data.

Changes:

  • Replace int64_t with int32_t for accumulators in mat_multiply and
    mat_multiply_elementwise, reducing cycle count by ~51.18% for elementwise
    operations and by ~8.18% for matrix multiplication.

  • Enhance pointer arithmetic within loops for better readability and
    compiler optimization opportunities.

  • Eliminate unnecessary conditionals by directly handling Q0 data in the
    algorithm's core logic.

  • Update fractional bit shift and rounding logic for more accurate
    fixed-point calculations (see the rounding sketch below).

Performance gains from these optimizations include a 1.08% reduction in
memory usage for elementwise functions and a 36.31% reduction for matrix
multiplication. The changes facilitate significant resource management
improvements in constrained environments.
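
For reference, here is a minimal sketch of the fixed-point rounding convention these functions follow (illustrative only, not the PR's exact code; the helper name is made up): the product of two Q-format values carries the sum of their fraction bits, so converting to the output format needs a right shift with round-to-nearest.

#include <stdint.h>

/* Sketch: round a product down to the output's fraction count.
 * frac_in is the number of fraction bits in the product,
 * frac_out the number wanted in the result.
 */
static int16_t q_round_shift(int32_t prod, int frac_in, int frac_out)
{
	int shift = frac_in - frac_out;	/* bits to drop */

	if (shift <= 0)
		return (int16_t)prod;	/* e.g. Q16.0 data, nothing to drop */

	/* Shift by one bit less, add 1 to round to nearest, then drop it */
	return (int16_t)(((prod >> (shift - 1)) + 1) >> 1);
}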

Summary:

[performance summary image]

@ShriramShastry force-pushed the shastry_matrix_perf_optimization branch 2 times, most recently from 44a73f4 to 01f4acb on April 29, 2024 14:06
@ShriramShastry marked this pull request as ready for review on April 29, 2024 14:27
@ShriramShastry force-pushed the shastry_matrix_perf_optimization branch from 01f4acb to b3fa07c on April 29, 2024 15:02

@lgirdwood (Member) left a comment

This looks fine to me but @singalsu needs to review.

Long term, @ShriramShastry @singalsu, thinking aloud, I think it may make more sense to have a Kconfig option that uses the HiFi3/4/5 kernels from nnlib for xtensa targets, i.e.
https://github.com/foss-xtensa/nnlib-hifi5/blob/master/xa_nnlib/algo/kernels/matXvec/hifi5/xa_nn_matXvec_16x16.c

@ShriramShastry force-pushed the shastry_matrix_perf_optimization branch 5 times, most recently from 3427edd to 5493610 on May 2, 2024 03:34
z++;
}
}
int64_t acc;

Collaborator:

As a style comment, please keep the original style and define local variables at the beginning of the function.

Contributor Author:

Done

@singalsu (Collaborator) commented May 2, 2024

> This looks fine to me but @singalsu needs to review.
>
> Long term @ShriramShastry @singalsu thinking aloud, I think it may make more sense to have a Kconfig option that uses the HiFi3/4/5 kernels from nnlib for xtensa targets, i.e. https://github.com/foss-xtensa/nnlib-hifi5/blob/master/xa_nnlib/algo/kernels/matXvec/hifi5/xa_nn_matXvec_16x16.c

Is Cadence's license compatible with all SOF usages (firmware, plugin)?

@lyakh (Collaborator) left a comment

In general, I'm not sure I understand what this PR is solving.

* non-negative (> 0). Since shift = shift_minus_one + 1, the check for non-negative
* value is (shift_minus_one >= -1)
*/
const int64_t offset = (shift_minus_one >= -1) ? (1LL << shift_minus_one) : 0;

Collaborator:

I had to search for this, and here's what I've found [1]:

The left-shift and right-shift operators should not be used for negative numbers. The result is undefined behaviour if any of the operands is a negative number; for example, the results of both 1 >> -1 and 1 << -1 are undefined.

But the examples under that comment show that a left shift by -1 would behave like a right shift by 1, i.e. 1 >> 1, which is 0. So you can just replace the >= above with a > and the result will be the same without that ambiguity.

[1] https://www.geeksforgeeks.org/left-shift-right-shift-operators-c-cpp/

Contributor Author:

As you pointed out, it makes no sense to shift left by -1 (1 << -1). The same goal can be achieved without ambiguity by changing the condition to shift > -1.


return 0;
}
const int shift = shift_minus_one + 1;

Collaborator:

How about removing shift_minus_one entirely and just using shift, with appropriate changes?

Contributor Author:

Yes, I could remove shift_minus_one and work only with shift. This essentially means that I'd be adding 1 to shift where I had previously subtracted 1 from shift_minus_one.

/* If shift == 0, then shift_minus_one == -1, which means the data is Q16.0
* Otherwise, add the offset before the shift to round up if necessary
*/
*z++ = (shift == 0) ? (int16_t)acc : (int16_t)((acc + offset) >> shift);

Collaborator:

The ?: operator can contain an implicit jump and therefore be rather expensive, so it should be avoided on performance-sensitive paths. Here, if my "negative shift" understanding above is correct, then when shift == 0 the offset is also 0, so the ?: isn't needed at all and you can just use (int16_t)((acc + offset) >> shift) always.

Contributor Author:

OK. Since offset is 0 when shift is 0, I can always use (int16_t)((acc + offset) >> shift) without the ?: operator, which simplifies the expression and potentially improves performance.
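
A minimal sketch of the branchless form agreed on here (names are illustrative, not the actual patch): because the rounding offset is built only for a strictly positive shift, the Q16.0 case falls out of the same expression and the ?: disappears.

#include <stdint.h>

/* Sketch: assumes 0 <= shift <= 62 has already been validated */
static int16_t store_rounded(int64_t acc, int shift)
{
	/* Offset is non-zero only for shift >= 1, so 1LL << -1 never occurs */
	const int64_t offset = (shift > 0) ? (1LL << (shift - 1)) : 0;

	/* For shift == 0 this reduces to (int16_t)acc, so no conditional store is needed */
	return (int16_t)((acc + offset) >> shift);
}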

if (shift < -1 || shift > 62)
return -ERANGE;
/* Offset for rounding */
const int64_t offset = (shift >= 0) ? (1LL << (shift - 1)) : 0;

Collaborator:

>= and a "negative shift" again

Contributor Author:

Yes, to avoid ambiguity when shifting by a negative amount, the condition should be changed from >= to >. This ensures that offset is 0 when shift is 0 and that no shift by a negative amount is attempted. I will make the modification.

/* General case with shifting and offset for rounding */
for (i = 0; i < total_elements; i++) {
acc = (int64_t)x[i] * (int64_t)y[i]; /* Cast to int64_t to avoid overflow */
z[i] = (int16_t)((acc + offset) >> shift); /* Apply shift and offset */

Collaborator:

In the commit description you say "Both functions now use incremented pointers in loop expressions for better legibility and potential compiler optimisation", yet here you remove pointer incrementing and replace it with indexing. In general, legibility is subjective so I won't comment on it, but does this PR really improve anything? Does it fix any bugs or bring any measurable and measured performance improvements?

Collaborator:

I see that you've measured size improvements and also some performance improvements, but it would be good to understand exactly which changes introduce them. This would be important both for this PR and for general understanding.

Member:

@lyakh FWIW, the changes could make the code easier for the compiler to vectorize in certain places, hence the speedup.
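
As a generic illustration of that point (a sketch, not code from this PR; the function name is made up): a plain indexed loop over restrict-qualified pointers with a branch-free body is the kind of shape an auto-vectorizer tends to handle well.

#include <stdint.h>

/* Sketch: elementwise 16-bit multiply written so the compiler can vectorize it.
 * 'restrict' promises the buffers do not alias; the body has no branches or
 * loop-carried dependencies.
 */
static void mul16_rounded(int16_t *restrict z, const int16_t *restrict x,
			  const int16_t *restrict y, int n, int shift)
{
	for (int i = 0; i < n; i++) {
		int32_t p = (int32_t)x[i] * y[i];

		z[i] = (int16_t)(((p >> shift) + 1) >> 1);
	}
}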

Collaborator:

@lgirdwood sure, and I'd very much like to know exactly which changes do that! If there are any recipes, like pointer incrementing being slower than indexed access or anything similar, it would be good to know them so we can optimise all our code. OTOH, wouldn't it be the case that even slightly different code sequences generate more or less optimal assembly depending on the optimisations applied? E.g. sometimes loops like

for (int i = 0; i < N; i++)
    x[i] = io_read32(reg);

would generate faster code and sometimes it's

for (int i = 0; i < N; i++, x++)
    *x = io_read32(reg);

that runs faster? In fact we'd expect the compiler to optimise both correctly, but in some more complex cases it might not be able to.
It is also possible for someone to modify such hot-path code without checking performance and cause a significant degradation. Maybe we need the CI to keep an eye on performance...

@ShriramShastry (Contributor Author) commented May 3, 2024

Code snippets used to measure performance:

int mat_multiply_elementwise_ptr_opt(struct mat_matrix_16b *a, struct mat_matrix_16b *b,
				     struct mat_matrix_16b *c)
{
	int16_t *x, *y, *z;
	int32_t p;

	/* Check for NULL pointers and matching dimensions before using the matrices */
	if (!a || !b || !c || a->columns != b->columns || a->rows != b->rows)
		return -EINVAL;

	x = a->data;
	y = b->data;
	z = c->data;

	const int total_elements = a->rows * a->columns;
	const int shift = a->fractions + b->fractions - c->fractions - 1;

	if (shift == -1) {
		/* No fixed-point fraction bits need to be adjusted */
		/* Pointer arithmetic variant */
		for (int i = 0; i < total_elements; i++, x++, y++, z++)
			*z = *x * *y;
	} else {
		/* Pointer arithmetic variant */
		for (int i = 0; i < total_elements; i++, x++, y++, z++) {
			p = (int32_t)(*x) * (*y);
			*z = (int16_t)(((p >> shift) + 1) >> 1); /* Shift to Qx.y */
		}
	}

	return 0;
}

int mat_multiply_elementwise_arr_opt(struct mat_matrix_16b *a, struct mat_matrix_16b *b,
				     struct mat_matrix_16b *c)
{
	int16_t *x, *y, *z;
	int32_t p;

	/* Check for NULL pointers and matching dimensions before using the matrices */
	if (!a || !b || !c || a->columns != b->columns || a->rows != b->rows)
		return -EINVAL;

	x = a->data;
	y = b->data;
	z = c->data;

	const int total_elements = a->rows * a->columns;
	const int shift = a->fractions + b->fractions - c->fractions - 1;

	if (shift == -1) {
		/* No fixed-point fraction bits need to be adjusted */
		/* Array indexing variant */
		for (int i = 0; i < total_elements; i++)
			z[i] = x[i] * y[i];
	} else {
		/* General case with shifting and rounding */
		/* Array indexing variant */
		for (int i = 0; i < total_elements; i++) {
			p = (int32_t)x[i] * y[i];
			z[i] = (int16_t)(((p >> shift) + 1) >> 1); /* Shift to Qx.y */
		}
	}

	return 0;
}

int mat_multiply_elementwise_ws64_opt(struct mat_matrix_16b *a, struct mat_matrix_16b *b,
				      struct mat_matrix_16b *c)
{
	/* Check for NULL pointers and matching dimensions */
	if (!a || !b || !c || a->columns != b->columns || a->rows != b->rows)
		return -EINVAL;

	int16_t *x = a->data;
	int16_t *y = b->data;
	int16_t *z = c->data;
	int total_elements = a->rows * a->columns;
	const int shift = a->fractions + b->fractions - c->fractions;

	for (int i = 0; i < total_elements; i++) {
		/* Use a 64-bit intermediate product to avoid overflow */
		int64_t p = (int64_t)(*x) * (*y);

		if (shift > 0)
			*z = (int16_t)((p + (1LL << (shift - 1))) >> shift); /* Round to nearest */
		else
			*z = (int16_t)p; /* No shift needed */

		x++;
		y++;
		z++;
	}
	return 0;
}

Performance summary for xtensa testbench with the -O3 compiler optimisation option.
Four unit tests for 16-bit matrix multiplication are run; mat_multiply is called three times, while matrix element-wise multiplication is called once.

| Compiler_tool | Function name | Function (%) | Function (F) | Children (C) | Total (F+C) | Called | Size (bytes) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_arr_opt | 2.2 | 230 | 0 | 230 | 1 | 467 |
| LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_original | 3.31 | 347 | 120 | 467 | 1 | 462 |
| LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_ptr_opt | 2.18 | 228 | 0 | 228 | 1 | 457 |
| LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_ws64_opt | 3.96 | 414 | 120 | 534 | 1 | 193 |
| LX7HiFi5_RI_2022_9 | mat_multiply_optimized | 40.04 | 4,185 | 0 | 4,185 | 3 | 477 |
| LX7HiFi5_RI_2022_9 | mat_multiply_original | 41.19 | 4,306 | 252 | 4,558 | 3 | 749 |
| Generic_HiFi5 | mat_multiply_elementwise_arr_opt | 2.2 | 230 | 0 | 230 | 1 | 467 |
| Generic_HiFi5 | mat_multiply_elementwise_original | 3.32 | 347 | 120 | 467 | 1 | 462 |
| Generic_HiFi5 | mat_multiply_elementwise_ptr_opt | 2.18 | 228 | 0 | 228 | 1 | 457 |
| Generic_HiFi5 | mat_multiply_elementwise_ws64_opt | 3.96 | 414 | 120 | 534 | 1 | 193 |
| Generic_HiFi5 | mat_multiply_optimized | 40.07 | 4,185 | 0 | 4,185 | 3 | 477 |
| Generic_HiFi5 | mat_multiply_original | 41.23 | 4,306 | 252 | 4,558 | 3 | 749 |
| LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_arr_opt | 2.07 | 232 | 0 | 232 | 1 | 336 |
| LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_original | 3.1 | 348 | 120 | 468 | 1 | 342 |
| LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_ptr_opt | 2.05 | 230 | 0 | 230 | 1 | 336 |
| LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_ws64_opt | 3.7 | 415 | 120 | 535 | 1 | 184 |
| LX7HiFi4_RI_2022_9 | mat_multiply_optimized | 39.06 | 4,374 | 0 | 4,374 | 3 | 455 |
| LX7HiFi4_RI_2022_9 | mat_multiply_original | 43.27 | 4,846 | 252 | 5,098 | 3 | 747 |
| Generic_HiFi4 | mat_multiply_elementwise_arr_opt | 2.54 | 290 | 0 | 290 | 1 | 362 |
| Generic_HiFi4 | mat_multiply_elementwise_original | 3.31 | 377 | 120 | 497 | 1 | 366 |
| Generic_HiFi4 | mat_multiply_elementwise_ptr_opt | 2.55 | 291 | 0 | 291 | 1 | 378 |
| Generic_HiFi4 | mat_multiply_elementwise_ws64_opt | 4.46 | 508 | 120 | 628 | 1 | 244 |
| Generic_HiFi4 | mat_multiply_optimized | 37.17 | 4,233 | 0 | 4,233 | 3 | 544 |
| Generic_HiFi4 | mat_multiply_original | 43.35 | 4,937 | 252 | 5,189 | 3 | 923 |

performance plots: [images omitted]

Collaborator:

Yes, daily performance stats would be great to have for comparing how a PR impacts performance. In this case it would not help, since we don't have a test topology for MFCC. It's in my targets for the quarter, so it's coming eventually (sof-hda-benchmark topologies, plenty already available). But performance tracking is a bigger effort that is now half done, with the rest at risk.

Collaborator:

It's also worth noting that Shriram's figures are with the maximized optimization level in an Xplorer GUI based test, while our builds use just -O2.

Contributor Author:

Correct, I used the -O3 optimization level during measurements.

} else {
/* General case with shifting and offset for rounding */
for (i = 0; i < total_elements; i++) {
acc = (int64_t)x[i] * (int64_t)y[i]; /* Cast to int64_t to avoid overflow */

Collaborator:

I really wonder how a 64-bit multiply can be faster than the original 32-bit one. This code avoids one shift, but still, it's quite surprising. Did you test the non-Q0 case for cycles?

Contributor Author:

You're right. Multiplication in 64 bits does not yield faster results than multiplication in 32 bits; instead, it requires twice as many resources (2x more multipliers).

My last check-in wasn't very successful. I've since fixed the error and included a summary of the results for the most recent check-in.

Thank you

@ShriramShastry force-pushed the shastry_matrix_perf_optimization branch from 5493610 to 37a0835 on May 9, 2024 12:44

@ShriramShastry (Contributor Author) left a comment

Thank you for your review; I have made the necessary changes. Please read through and make suggestions.


@lyakh (Collaborator) commented May 10, 2024

> Thank you for your review; I have made the necessary changes. Please read through and make suggestions.

Thanks, the measurements are good, but I think it would be good to understand which changes improve performance.

@ShriramShastry force-pushed the shastry_matrix_perf_optimization branch from 37a0835 to 175565c on May 10, 2024 06:44

@ShriramShastry (Contributor Author) commented May 10, 2024

> > Thank you for your review; I have made the necessary changes. Please read through and make suggestions.
>
> Thanks, the measurements are good, but I think it would be good to understand which changes improve performance.

Apologies for not responding to the questions you asked earlier. I'll try to address them here.

Key Points of Improvement

(1) Accumulator Data Size Reduction

Original Code:
int64_t s;
Current PR Code:
int32_t acc;
Improvement Explanation: Using a 32-bit accumulator (int32_t rather than int64_t) halves the multiplier resources needed per operation, reduces memory usage, and improves processing speed, particularly on 32-bit systems where 64-bit arithmetic is more expensive.

(2) Streamlined Loop Structure and Conditional Logic
Original Code (mat_multiply):

if (shift_minus_one == -1) {
    for (...) { // Inner loop
        s += (int32_t)(*x) * (*y);
    }
    *z = (int16_t)s;
}

Current PR Code (mat_multiply):

acc = 0;
for (..., ..., ...) { // Inner loop
    acc += (int32_t)(*x++) * (*y);
    y += y_inc;
}
*z = (int16_t)(((acc >> shift) + 1) >> 1);

Improvement Explanation: The current PR code integrates the special handling of fractional bits into the main loop, eliminating the need for separate conditionals for each case. This minimises branching and increases loop efficiency.

(3) Optimized Memory Access Patterns

Original Code: memory access patterns were less optimised due to additional temporary variables and possibly unoptimised pointer arithmetic.

x = a->data + a->columns * i;
y = b->data + j;

Improvement Explanation: The current PR code achieves more effective memory access and cache utilisation by streamlining pointer arithmetic and reducing the number of temporary variables.

(4) Unified Handling of Fractional Adjustments

Both versions handle fractional-bit adjustments, but the current PR implementation does so directly within the computational loop, which lowers overhead.

Original Code (Snippet for mat_multiply):

int64_t s;
for (i = 0; i < a->rows; i++) {
    for (j = 0; j < b->columns; j++) {
        s = 0;
        x = a->data + a->columns * i;
        y = b->data + j;
        for (k = 0; k < b->rows; k++) {
            s += (int32_t)(*x) * (*y);
            x++;
            y += y_inc;
        }
        *z = (int16_t)(((s >> shift_minus_one) + 1) >> 1);
        z++;
    }
}

Current PR Code (Snippet for mat_multiply):

int32_t acc;
for (i = 0; i < a->rows; i++) {
    for (j = 0; j < b->columns; j++) {
        acc = 0;
        x = a->data + a->columns * i;
        y = b->data + j;
        for (k = 0; k < b->rows; k++) {
            acc += (int32_t)(*x++) * (*y);
            y += y_inc;
        }
        *z = (int16_t)(((acc >> shift) + 1) >> 1);
        z++;
    }
}

Summary for mat_multiply()

The PR code uses a smaller accumulator, a streamlined loop structure, simplified conditional logic, and optimised memory access patterns to minimise memory usage and speed up matrix multiplication.

Key Points of Improvement for mat_multiply_elementwise

Reduction in Accumulator Bit Width for Intermediate Results:
Original Code:
int64_t p;
Current PR Code:
int32_t prod;

Improvement Explanation: Using a 32-bit integer (int32_t) for intermediate multiplication results instead of a 64-bit integer (int64_t) avoids handling larger data types and can speed up the operation on platforms where 32-bit instructions are more efficient.

Streamlining the Conditional Logic for Handling Q0 (shift == -1):
Original Code:

if (shift_minus_one == -1) {
    *z = *x * *y;
}

Current PR Code:

if (shift == -1) {
    *z = *x * *y;
}

Improvement Explanation: Both the original and the current PR code special-case the situation in which no bit shift is needed. Since this check is factored out of the loop, the current code performs it more effectively, with a single comparison rather than one per iteration.

Original Code (Snippet for mat_multiply_elementwise):

int64_t p;
const int shift_minus_one = a->fractions + b->fractions - c->fractions - 1;

// Handle no shift case (Q0 data) separately

if (shift_minus_one == -1) {
    for (i = 0; i < a->rows * a->columns; i++) {
        *z = *x * *y;
        x++; y++; z++;
    }
    return 0;
}

// Default case with shift

for (i = 0; i < a->rows * a->columns; i++) {
    p = (int32_t)(*x) * *y;
    *z = (int16_t)(((p >> shift_minus_one) + 1) >> 1);
    x++; y++; z++;
}

Current PR Code (Snippet for mat_multiply_elementwise):

int32_t prod;
const int shift = a->fractions + b->fractions - c->fractions - 1;

/* Perform multiplication with or without adjusting the fractional bits */

for (int i = 0; i < a->rows * a->columns; i++) {
    if (shift == -1) {
        // Direct multiplication when no adjustment for fractional bits is needed
        *z = *x * *y;
    } else {
        // Multiply elements as int32_t and adjust with rounding
        prod = (int32_t)(*x) * (*y);
        *z = (int16_t)(((prod >> shift) + 1) >> 1);
    }
    x++; y++; z++;
}

Summary for mat_multiply_elementwise()
The two main optimisations in the elementwise multiplication function are the reduction of the accumulator bit width and the more effective handling of the no-shift case (Q0 data). The current code speeds up the multiplication by reducing the accumulator size, and it refactors the handling of the special case where shifting is not required, which lessens branching and increases the efficiency of the loop.

I would be happy to add an instruction-level breakdown if you think that would be more helpful for specific aspects of the changes between the two versions.

@lgirdwood added this to the v2.11 milestone on Jun 25, 2024
}

int64_t p;
{ int64_t p;

Collaborator:

2 lines

a->columns != b->columns || a->rows != b->rows ||
c->columns != a->columns || c->rows != a->rows) {
return -EINVAL;
}

Collaborator:

you removed this in your previous commit

int16_t *x = a->data;
int16_t *y = b->data;
int16_t *z = c->data;
int i;
const int shift_minus_one = a->fractions + b->fractions - c->fractions - 1;
int32_t prod;

Collaborator:

AFAICS this is the only real change in this commit:

-	int64_t p;
+	int32_t prod;

the rest is cosmetics. Please don't mix functional changes with cosmetics. This obscures the changes.

@singalsu (Collaborator):

I just tested this with MFCC on the TGL platform. This patch reduced CPU_PEAK(AVG) from 85.1 to 81.5 MCPS. With LL pipelines the STFT hop/size causes an uneven load, so this time I'm using the peak as the more representative perf measure instead of the average.

@lgirdwood (Member):

> I just tested this with MFCC on the TGL platform. This patch reduced CPU_PEAK(AVG) from 85.1 to 81.5 MCPS. With LL pipelines the STFT hop/size causes an uneven load, so this time I'm using the peak as the more representative perf measure instead of the average.

@singalsu do you have more tests to run, or is this good to go with the optimization once the cleanups are resolved?

@singalsu (Collaborator):

> > I just tested this with MFCC on the TGL platform. This patch reduced CPU_PEAK(AVG) from 85.1 to 81.5 MCPS. With LL pipelines the STFT hop/size causes an uneven load, so this time I'm using the peak as the more representative perf measure instead of the average.
>
> @singalsu do you have more tests to run, or is this good to go with the optimization once the cleanups are resolved?

I think this is good to merge after the suggestions by @lyakh are addressed. Since the matrix operations are not a major part of MFCC, these savings are quite significant. So no need for further performance tests.

@lgirdwood (Member):

> > > I just tested this with MFCC on the TGL platform. This patch reduced CPU_PEAK(AVG) from 85.1 to 81.5 MCPS. With LL pipelines the STFT hop/size causes an uneven load, so this time I'm using the peak as the more representative perf measure instead of the average.
> >
> > @singalsu do you have more tests to run, or is this good to go with the optimization once the cleanups are resolved?
>
> I think this is good to merge after the suggestions by @lyakh are addressed. Since the matrix operations are not a major part of MFCC, these savings are quite significant. So no need for further performance tests.

Great - @ShriramShastry, you now have a clear path to merge once @lyakh's comments are resolved.

@ShriramShastry force-pushed the shastry_matrix_perf_optimization branch 4 times, most recently from 3d4f5fc to 2276903 on August 17, 2024 20:38
if (shift == -1)
*z = (int16_t)acc;
else
*z = (int16_t)(((acc >> shift) + 1) >> 1);

Collaborator:

This makes it slower. Before, this test was done once, before all the looping; now you do it in every iteration.

Contributor Author:

Checking shift == -1 inside the loop does introduce a minor inefficiency. I have addressed it.
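
A minimal sketch of the hoisted form referred to here (illustrative names, not the actual patch): the shift == -1 test is evaluated once, and each loop body stays branch-free.

#include <stdint.h>

/* Sketch: decide the Q0 case once, then run a branch-free loop */
static void elementwise_mul(int16_t *z, const int16_t *x, const int16_t *y,
			    int n, int shift)
{
	int i;

	if (shift == -1) {
		/* Q0 data: no fractional adjustment needed */
		for (i = 0; i < n; i++)
			z[i] = x[i] * y[i];
		return;
	}

	for (i = 0; i < n; i++) {
		int32_t p = (int32_t)x[i] * y[i];

		z[i] = (int16_t)(((p >> shift) + 1) >> 1);
	}
}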

@@ -25,56 +25,66 @@
* -EINVAL if input dimensions do not allow for multiplication.
* -ERANGE if the shift operation might cause integer overflow.
*/
int mat_multiply(struct mat_matrix_16b *a, struct mat_matrix_16b *b, struct mat_matrix_16b *c)
int mat_multiply(struct mat_matrix_16b *a, struct mat_matrix_16b *b,
struct mat_matrix_16b *c)

Collaborator:

Please split this commit: one commit per optimisation, and separate commits for any variable renaming, comments, and similar cosmetic changes unrelated to the actual optimisations. And if you claim a "36.31% reduction in memory usage", I'd like to understand where it comes from; so far I don't.

Contributor Author:

Sure, I'll make those changes.

int i;
const int shift_minus_one = a->fractions + b->fractions - c->fractions - 1;

Collaborator:

Same in this commit - please check that the commit description still matches the contents; I'd like to be able to see that too. As it stands, it's difficult to see where "unnecessary conditionals" are eliminated.

This patch introduces Doxygen-style documentation to the matrix
multiplication functions. Clear descriptions and parameter details
are provided to facilitate better understanding and ease of use.

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Added checks for integer overflow during shifting.
- Validated matrix dimensions to prevent mismatches.
- Ensured non-null pointers before operating on matrices.

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
Changed the accumulator data type from `int64_t` to `int32_t` to reduce
instruction cycle count. This change results in an approximate 8.18% gain
in performance for matrix multiplication operations.

Performance Results:
Compiler Settings: -O2
+------------+------+------+--------+-----------+-----------+----------+
| Test Name  | Rows | Cols | Cycles | Max Error | RMS Error | Result   |
+------------+------+------+--------+-----------+-----------+----------+
| Test 1     | 3    | 5    | 6487   | 0.00      | 0.00      | Pass     |
| Test 2     | 6    | 8    | 6106   | 0.00      | 0.00      | Pass     |
+------------+------+------+--------+-----------+-----------+----------+

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
Enhanced pointer arithmetic within loops to improve readability and
reduce overhead. This change potentially reduces minor computational
overhead, contributing to overall performance improvements of around
8.23% for Test 1 and 16.00% for Test 2.

Performance Results:
Compiler Settings: -O3

+------------+------+------+--------+-----------+-----------+----------+
| Test Name  | Rows | Cols | Cycles | Max Error | RMS Error | Result   |
+------------+------+------+--------+-----------+-----------+----------+
| Test 1     | 3    | 5    | 5953   | 0.00      | 0.00      | Pass     |
| Test 2     | 6    | 8    | 5128   | 0.00      | 0.00      | Pass     |
+------------+------+------+--------+-----------+-----------+----------+

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
Updated comments for better clarity and understanding. Made cosmetic
changes such as reformatting code and renaming variables to enhance
readability without impacting functionality. This resulted in
approximately 7.97% and 15.00% performance improvements for
Test 1 and Test 2, respectively.

Performance Results:
Compiler Settings: -O2

+------------+------+------+--------+-----------+-----------+----------+
| Test Name  | Rows | Cols | Cycles | Max Error | RMS Error | Result   |
+------------+------+------+--------+-----------+-----------+----------+
| Test 1     | 3    | 5    | 5975   | 0.00      | 0.00      | Pass     |
| Test 2     | 6    | 8    | 5192   | 0.00      | 0.00      | Pass     |
+------------+------+------+--------+-----------+-----------+----------+

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Enhanced data pointers for matrix elements
- Streamlined loop iteration for matrix element-wise
 multiplication
- Achieved a 0.09% performance improvement in cycle count

| Rows | Cols | Cycles | Max Error | RMS Error | Result|
+------+------+--------+-----------+-----------+-------+
| 5    | 6    | 3359   | 0.00      | 0.00      | Pass  |

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Changed product variable from int64_t to int32_t
- Improved performance by reducing data size
- Achieved a 11.57% performance improvement in cycle count

| Rows | Cols | Cycles | Max Error | RMS Error | Result |
+------+------+--------+-----------+-----------+--------+
| 5    | 6    | 2972   | 0.00	   | 0.00      | Pass   |

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>

@lyakh (Collaborator) left a comment

A general comment: changes are made not because they don't make anything worse, but because they improve something. When a PR is submitted, the submitter is trying to convince reviewers and maintainers that the changes improve something, not just that they aren't breaking anything.

@@ -31,7 +31,7 @@ int mat_multiply(struct mat_matrix_16b *a, struct mat_matrix_16b *b, struct mat_
if (a->columns != b->rows || a->rows != c->rows || b->columns != c->columns)
return -EINVAL;

int64_t s;
int32_t s; /* Changed from int64_t to int32_t */

Collaborator:

This is a comment in the code. Nobody will see the history when reading it. A "changed" comment implies that it was something different before and someone changed it. But when you read code you aren't that interested in the history; you want to know how and why it works in its present version. So comments should describe the present code, not its history.

@@ -51,12 +51,12 @@ int mat_multiply(struct mat_matrix_16b *a, struct mat_matrix_16b *b, struct mat_
x = a->data + a->columns * i;
y = b->data + j;
for (k = 0; k < b->rows; k++) {
s += (int32_t)(*x) * (*y);
x++;
/* Enhanced pointer arithmetic */

Collaborator:

Sorry, I don't understand this comment. "Enhanced" as compared to what or when? Does it help with understanding the code? Or is "enhanced pointer arithmetic" a maths term that I don't know?

*z = (int16_t)s; /* For Q16.0 */
z++;
/* Enhanced pointer arithmetic */
*z++ = (int16_t)s;

Collaborator:

I personally also like expressions like

*x++ = value;

but I don't think they actually improve readability; they're just more compact, so they help you scan the code faster. And I certainly don't expect these changes to affect performance. If you really want to convince me, you need to compare the assembly before and after this change.


/* Check shift to ensure no integer overflow occurs during shifting */
if (shift_minus_one < -1 || shift_minus_one > 31)
if (shift < -1 || shift > 31)
return -ERANGE;

Collaborator:

Your "cycles" figures before and after are very similar to the previous commit's. This tells me that you probably measured the state before several changes (possibly the whole PR) and after them, not before and after each commit. So the "cycles" don't seem correct.

y += y_inc;
}
/* Enhanced pointer arithmetic */
*z++ = (int16_t)s;
*z = (int16_t)acc;
z++; /* Move to the next element in the output matrix */

Collaborator:

you just changed *z = x; z++; to *z++ = x; in the previous commit and claimed that it was an improvement.

}
/* Enhanced pointer arithmetic */
*z++ = (int16_t)(((s >> shift_minus_one) + 1) >> 1); /*Shift to Qx.y */
}
}

Collaborator:

Lots of not really helpful comments; it's very confusing whether this changes anything. With this commit I don't think I can approve the PR.

x++;
y++;
z++;
*z = (int16_t)(((p >> shift_minus_one) + 1) >> 1); /* Shift to Qx.y */
}

Collaborator:

I don't think this enhances or streamlines anything. NAK.

@kv2019i (Collaborator) commented Sep 6, 2024

Release reminder - one week to v2.11-rc1.

@kv2019i (Collaborator) commented Sep 13, 2024

FYI @ShriramShastry pushing to v2.12 (cutoff date today).

@kv2019i modified the milestones: v2.11, v2.12 on Sep 13, 2024