
Math: Optimise 16-bit matrix multiplication functions. #9088

Open

wants to merge 7 commits into base: main
Conversation

@ShriramShastry (Contributor) commented Apr 29, 2024

Improve mat_multiply and mat_multiply_elementwise for 16-bit signed integers by refactoring operations and simplifying handling of Q0 data.

Changes:

  • Replace int64_t with int32_t for accumulators in mat_multiply and
    mat_multiply_elementwise, reducing cycle count by ~51.18% for elementwise
    operations and by ~8.18% for matrix multiplication.

  • Enhance pointer arithmetic within loops for better readability and
    compiler optimization opportunities.

  • Eliminate unnecessary conditionals by directly handling Q0 data in the
    algorithm's core logic.

  • Update fractional bit shift and rounding logic for more accurate
    fixed-point calculations (see the rounding sketch below).

Performance gains from these optimizations include a 1.08% reduction in
memory usage for elementwise functions and a 36.31% reduction for matrix
multiplication. The changes facilitate significant resource management
improvements in constrained environments.
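
For reference, here is a minimal sketch of the fixed-point rounding convention these functions follow (illustrative only, not the PR's exact code; the helper name is made up): the product of two Q-format values carries the sum of their fraction bits, so converting to the output format needs a right shift with round-to-nearest.

#include <stdint.h>

/* Sketch: round a product down to the output's fraction count.
 * frac_in is the number of fraction bits in the product,
 * frac_out the number wanted in the result.
 */
static int16_t q_round_shift(int32_t prod, int frac_in, int frac_out)
{
	int shift = frac_in - frac_out;	/* bits to drop */

	if (shift <= 0)
		return (int16_t)prod;	/* e.g. Q16.0 data, nothing to drop */

	/* Shift by one bit less, add 1 to round to nearest, then drop it */
	return (int16_t)(((prod >> (shift - 1)) + 1) >> 1);
}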

Summary:

[performance summary image]

@ShriramShastry force-pushed the shastry_matrix_perf_optimization branch 2 times, most recently from 44a73f4 to 01f4acb on April 29, 2024 14:06
@ShriramShastry marked this pull request as ready for review on April 29, 2024 14:27
@ShriramShastry force-pushed the shastry_matrix_perf_optimization branch from 01f4acb to b3fa07c on April 29, 2024 15:02

@lgirdwood (Member) left a comment

This looks fine to me but @singalsu needs to review.

Long term, @ShriramShastry @singalsu, thinking aloud, I think it may make more sense to have a Kconfig option that uses the HiFi3/4/5 kernels from nnlib for xtensa targets, i.e.
https://github.com/foss-xtensa/nnlib-hifi5/blob/master/xa_nnlib/algo/kernels/matXvec/hifi5/xa_nn_matXvec_16x16.c

@ShriramShastry force-pushed the shastry_matrix_perf_optimization branch 5 times, most recently from 3427edd to 5493610 on May 2, 2024 03:34
z++;
}
}
int64_t acc;

Collaborator:

As a style comment, please keep the original style and define local variables at the beginning of the function.

Contributor Author:

Done

@singalsu (Collaborator) commented May 2, 2024

> This looks fine to me but @singalsu needs to review.
>
> Long term @ShriramShastry @singalsu thinking aloud, I think it may make more sense to have a Kconfig option that uses the HiFi3/4/5 kernels from nnlib for xtensa targets, i.e. https://github.com/foss-xtensa/nnlib-hifi5/blob/master/xa_nnlib/algo/kernels/matXvec/hifi5/xa_nn_matXvec_16x16.c

Is Cadence's license compatible with all SOF usages (firmware, plugin)?

@lyakh (Collaborator) left a comment

In general, I'm not sure I understand what this PR is solving.

* non-negative (> 0). Since shift = shift_minus_one + 1, the check for non-negative
* value is (shift_minus_one >= -1)
*/
const int64_t offset = (shift_minus_one >= -1) ? (1LL << shift_minus_one) : 0;

Collaborator:

I had to search for this, and here's what I've found [1]:

The left-shift and right-shift operators should not be used for negative numbers. The result is undefined behaviour if any of the operands is a negative number; for example, the results of both 1 >> -1 and 1 << -1 are undefined.

But the examples under that comment show that a left shift by -1 would behave like a right shift by 1, i.e. 1 >> 1, which is 0. So you can just replace the >= above with a > and the result will be the same without that ambiguity.

[1] https://www.geeksforgeeks.org/left-shift-right-shift-operators-c-cpp/

Contributor Author:

As you pointed out, it makes no sense to shift left by -1 (1 << -1). The same goal can be achieved without ambiguity by changing the condition to shift > -1.


return 0;
}
const int shift = shift_minus_one + 1;

Collaborator:

How about removing shift_minus_one entirely and just using shift, with appropriate changes?

Contributor Author:

Yes, I could remove shift_minus_one and work only with shift. This essentially means that I'd be adding 1 to shift where I had previously subtracted 1 from shift_minus_one.

/* If shift == 0, then shift_minus_one == -1, which means the data is Q16.0
* Otherwise, add the offset before the shift to round up if necessary
*/
*z++ = (shift == 0) ? (int16_t)acc : (int16_t)((acc + offset) >> shift);

Collaborator:

The ?: operator can contain an implicit jump and therefore be rather expensive, so it should be avoided on performance-sensitive paths. Here, if my "negative shift" understanding above is correct, then when shift == 0 the offset is also 0, so the ?: isn't needed at all and you can just use (int16_t)((acc + offset) >> shift) always.

Contributor Author:

OK. Since offset is 0 when shift is 0, I can always use (int16_t)((acc + offset) >> shift) without the ?: operator, which simplifies the expression and potentially improves performance.
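
A minimal sketch of the branchless form agreed on here (names are illustrative, not the actual patch): because the rounding offset is built only for a strictly positive shift, the Q16.0 case falls out of the same expression and the ?: disappears.

#include <stdint.h>

/* Sketch: assumes 0 <= shift <= 62 has already been validated */
static int16_t store_rounded(int64_t acc, int shift)
{
	/* Offset is non-zero only for shift >= 1, so 1LL << -1 never occurs */
	const int64_t offset = (shift > 0) ? (1LL << (shift - 1)) : 0;

	/* For shift == 0 this reduces to (int16_t)acc, so no conditional store is needed */
	return (int16_t)((acc + offset) >> shift);
}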

if (shift < -1 || shift > 62)
return -ERANGE;
/* Offset for rounding */
const int64_t offset = (shift >= 0) ? (1LL << (shift - 1)) : 0;

Collaborator:

>= and a "negative shift" again

Contributor Author:

Yes, to avoid ambiguity when shifting by a negative amount, the condition should be changed from >= to >. This ensures that offset is 0 when shift is 0 and that no shift by a negative amount is attempted. I will make the modification.

/* General case with shifting and offset for rounding */
for (i = 0; i < total_elements; i++) {
acc = (int64_t)x[i] * (int64_t)y[i]; /* Cast to int64_t to avoid overflow */
z[i] = (int16_t)((acc + offset) >> shift); /* Apply shift and offset */

Collaborator:

In the commit description you say "Both functions now use incremented pointers in loop expressions for better legibility and potential compiler optimisation", yet here you remove pointer incrementing and replace it with indexing. In general, legibility is subjective so I won't comment on it, but does this PR really improve anything? Does it fix any bugs or bring any measurable and measured performance improvements?

Collaborator:

I see that you've measured size improvements and also some performance improvements, but it would be good to understand exactly which changes introduce them. This would be important both for this PR and for general understanding.

Member:

@lyakh FWIW, the changes could make the code easier for the compiler to vectorize in certain places, hence the speedup.
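
As a generic illustration of that point (a sketch, not code from this PR; the function name is made up): a plain indexed loop over restrict-qualified pointers with a branch-free body is the kind of shape an auto-vectorizer tends to handle well.

#include <stdint.h>

/* Sketch: elementwise 16-bit multiply written so the compiler can vectorize it.
 * 'restrict' promises the buffers do not alias; the body has no branches or
 * loop-carried dependencies.
 */
static void mul16_rounded(int16_t *restrict z, const int16_t *restrict x,
			  const int16_t *restrict y, int n, int shift)
{
	for (int i = 0; i < n; i++) {
		int32_t p = (int32_t)x[i] * y[i];

		z[i] = (int16_t)(((p >> shift) + 1) >> 1);
	}
}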

Collaborator:

@lgirdwood sure, and I'd very much like to know exactly which changes do that! If there are any recipes, like pointer incrementing being slower than indexed access or anything similar, it would be good to know them so we can optimise all our code. OTOH, wouldn't it be the case that even slightly different code sequences generate more or less optimal assembly depending on the optimisations applied? E.g. sometimes loops like

for (int i = 0; i < N; i++)
    x[i] = io_read32(reg);

would generate faster code and sometimes it's

for (int i = 0; i < N; i++, x++)
    *x = io_read32(reg);

that runs faster? In fact we'd expect the compiler to optimise both correctly, but in some more complex cases it might not be able to.
It is also possible for someone to modify such hot-path code without checking performance and cause a significant degradation. Maybe we need the CI to keep an eye on performance...

@ShriramShastry (Contributor Author) commented May 3, 2024

Code snippets used to measure performance:

int mat_multiply_elementwise_ptr_opt(struct mat_matrix_16b *a, struct mat_matrix_16b *b,
				     struct mat_matrix_16b *c)
{
	int16_t *x, *y, *z;
	int32_t p;

	/* Check for NULL pointers and matching dimensions before using the matrices */
	if (!a || !b || !c || a->columns != b->columns || a->rows != b->rows)
		return -EINVAL;

	x = a->data;
	y = b->data;
	z = c->data;

	const int total_elements = a->rows * a->columns;
	const int shift = a->fractions + b->fractions - c->fractions - 1;

	if (shift == -1) {
		/* No fixed-point fraction bits need to be adjusted */
		/* Pointer arithmetic variant */
		for (int i = 0; i < total_elements; i++, x++, y++, z++)
			*z = *x * *y;
	} else {
		/* Pointer arithmetic variant */
		for (int i = 0; i < total_elements; i++, x++, y++, z++) {
			p = (int32_t)(*x) * (*y);
			*z = (int16_t)(((p >> shift) + 1) >> 1); /* Shift to Qx.y */
		}
	}

	return 0;
}

int mat_multiply_elementwise_arr_opt(struct mat_matrix_16b *a, struct mat_matrix_16b *b,
				     struct mat_matrix_16b *c)
{
	int16_t *x, *y, *z;
	int32_t p;

	/* Check for NULL pointers and matching dimensions before using the matrices */
	if (!a || !b || !c || a->columns != b->columns || a->rows != b->rows)
		return -EINVAL;

	x = a->data;
	y = b->data;
	z = c->data;

	const int total_elements = a->rows * a->columns;
	const int shift = a->fractions + b->fractions - c->fractions - 1;

	if (shift == -1) {
		/* No fixed-point fraction bits need to be adjusted */
		/* Array indexing variant */
		for (int i = 0; i < total_elements; i++)
			z[i] = x[i] * y[i];
	} else {
		/* General case with shifting and rounding */
		/* Array indexing variant */
		for (int i = 0; i < total_elements; i++) {
			p = (int32_t)x[i] * y[i];
			z[i] = (int16_t)(((p >> shift) + 1) >> 1); /* Shift to Qx.y */
		}
	}

	return 0;
}

int mat_multiply_elementwise_ws64_opt(struct mat_matrix_16b *a, struct mat_matrix_16b *b,
				      struct mat_matrix_16b *c)
{
	/* Check for NULL pointers and matching dimensions */
	if (!a || !b || !c || a->columns != b->columns || a->rows != b->rows)
		return -EINVAL;

	int16_t *x = a->data;
	int16_t *y = b->data;
	int16_t *z = c->data;
	int total_elements = a->rows * a->columns;
	const int shift = a->fractions + b->fractions - c->fractions;

	for (int i = 0; i < total_elements; i++) {
		/* Use a 64-bit intermediate product to avoid overflow */
		int64_t p = (int64_t)(*x) * (*y);

		if (shift > 0)
			*z = (int16_t)((p + (1LL << (shift - 1))) >> shift); /* Round to nearest */
		else
			*z = (int16_t)p; /* No shift needed */

		x++;
		y++;
		z++;
	}
	return 0;
}

Performance summary for xtensa testbench with the -O3 compiler optimisation option.
Four unit tests for 16-bit matrix multiplication are run; mat_multiply is called three times, while matrix element-wise multiplication is called once.

| Compiler_tool | Function name | Function (%) | Function (F) | Children (C) | Total (F+C) | Called | Size (bytes) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_arr_opt | 2.2 | 230 | 0 | 230 | 1 | 467 |
| LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_original | 3.31 | 347 | 120 | 467 | 1 | 462 |
| LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_ptr_opt | 2.18 | 228 | 0 | 228 | 1 | 457 |
| LX7HiFi5_RI_2022_9 | mat_multiply_elementwise_ws64_opt | 3.96 | 414 | 120 | 534 | 1 | 193 |
| LX7HiFi5_RI_2022_9 | mat_multiply_optimized | 40.04 | 4,185 | 0 | 4,185 | 3 | 477 |
| LX7HiFi5_RI_2022_9 | mat_multiply_original | 41.19 | 4,306 | 252 | 4,558 | 3 | 749 |
| Generic_HiFi5 | mat_multiply_elementwise_arr_opt | 2.2 | 230 | 0 | 230 | 1 | 467 |
| Generic_HiFi5 | mat_multiply_elementwise_original | 3.32 | 347 | 120 | 467 | 1 | 462 |
| Generic_HiFi5 | mat_multiply_elementwise_ptr_opt | 2.18 | 228 | 0 | 228 | 1 | 457 |
| Generic_HiFi5 | mat_multiply_elementwise_ws64_opt | 3.96 | 414 | 120 | 534 | 1 | 193 |
| Generic_HiFi5 | mat_multiply_optimized | 40.07 | 4,185 | 0 | 4,185 | 3 | 477 |
| Generic_HiFi5 | mat_multiply_original | 41.23 | 4,306 | 252 | 4,558 | 3 | 749 |
| LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_arr_opt | 2.07 | 232 | 0 | 232 | 1 | 336 |
| LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_original | 3.1 | 348 | 120 | 468 | 1 | 342 |
| LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_ptr_opt | 2.05 | 230 | 0 | 230 | 1 | 336 |
| LX7HiFi4_RI_2022_9 | mat_multiply_elementwise_ws64_opt | 3.7 | 415 | 120 | 535 | 1 | 184 |
| LX7HiFi4_RI_2022_9 | mat_multiply_optimized | 39.06 | 4,374 | 0 | 4,374 | 3 | 455 |
| LX7HiFi4_RI_2022_9 | mat_multiply_original | 43.27 | 4,846 | 252 | 5,098 | 3 | 747 |
| Generic_HiFi4 | mat_multiply_elementwise_arr_opt | 2.54 | 290 | 0 | 290 | 1 | 362 |
| Generic_HiFi4 | mat_multiply_elementwise_original | 3.31 | 377 | 120 | 497 | 1 | 366 |
| Generic_HiFi4 | mat_multiply_elementwise_ptr_opt | 2.55 | 291 | 0 | 291 | 1 | 378 |
| Generic_HiFi4 | mat_multiply_elementwise_ws64_opt | 4.46 | 508 | 120 | 628 | 1 | 244 |
| Generic_HiFi4 | mat_multiply_optimized | 37.17 | 4,233 | 0 | 4,233 | 3 | 544 |
| Generic_HiFi4 | mat_multiply_original | 43.35 | 4,937 | 252 | 5,189 | 3 | 923 |

performance plots: [images omitted]

Collaborator:

Yes, daily performance stats would be great to have for comparing how a PR impacts performance. In this case it would not help, since we don't have a test topology for MFCC. It's in my targets for the quarter, so it's coming eventually (sof-hda-benchmark topologies, plenty already available). But performance tracking is a bigger effort that is now half done, with the rest at risk.

Collaborator:

It's also worth noting that Shriram's figures are with the maximized optimization level in an Xplorer GUI based test, while our builds use just -O2.

Contributor Author:

Correct, I used the -O3 optimization level during measurements.

} else {
/* General case with shifting and offset for rounding */
for (i = 0; i < total_elements; i++) {
acc = (int64_t)x[i] * (int64_t)y[i]; /* Cast to int64_t to avoid overflow */

Collaborator:

I really wonder how a 64-bit multiply can be faster than the original 32-bit one. This code avoids one shift, but still, it's quite surprising. Did you test the non-Q0 case for cycles?

Contributor Author:

You're right. Multiplication in 64 bits does not yield faster results than multiplication in 32 bits; instead, it requires twice as many resources (2x more multipliers).

My last check-in wasn't very successful. I've since fixed the error and included a summary of the results for the most recent check-in.

Thank you

@ShriramShastry force-pushed the shastry_matrix_perf_optimization branch from 5493610 to 37a0835 on May 9, 2024 12:44

@ShriramShastry (Contributor Author) left a comment

Thank you for your review; I have made the necessary changes. Please read through and make suggestions.


@lyakh (Collaborator) commented May 10, 2024

> Thank you for your review; I have made the necessary changes. Please read through and make suggestions.

Thanks, the measurements are good, but I think it would be good to understand which changes improve performance.

@ShriramShastry force-pushed the shastry_matrix_perf_optimization branch from 37a0835 to 175565c on May 10, 2024 06:44

@ShriramShastry (Contributor Author) commented May 10, 2024

> > Thank you for your review; I have made the necessary changes. Please read through and make suggestions.
>
> Thanks, the measurements are good, but I think it would be good to understand which changes improve performance.

Apologies for not responding to the questions you asked earlier. I'll try to address them here.

Key Points of Improvement

(1) Accumulator Data Size Reduction

Original Code:
int64_t s;
Current PR Code:
int32_t acc;
Improvement Explanation: Using a 32-bit accumulator (int32_t rather than int64_t) halves the multiplier resources needed per operation, reduces memory usage, and improves processing speed, particularly on 32-bit systems where 64-bit arithmetic is more expensive.

(2) Streamlined Loop Structure and Conditional Logic
Original Code (mat_multiply):

if (shift_minus_one == -1) {
    for (...) { // Inner loop
        s += (int32_t)(*x) * (*y);
    }
    *z = (int16_t)s;
}

Current PR Code (mat_multiply):

acc = 0;
for (..., ..., ...) { // Inner loop
    acc += (int32_t)(*x++) * (*y);
    y += y_inc;
}
*z = (int16_t)(((acc >> shift) + 1) >> 1);

Improvement Explanation: The current PR code integrates the special handling of fractional bits into the main loop, eliminating the need for separate conditionals for each case. This minimises branching and increases loop efficiency.

(3) Optimized Memory Access Patterns

Original Code: memory access patterns were less optimised due to additional temporary variables and possibly unoptimised pointer arithmetic.

x = a->data + a->columns * i;
y = b->data + j;

Improvement Explanation: The current PR code achieves more effective memory access and cache utilisation by streamlining pointer arithmetic and reducing the number of temporary variables.

(4) Unified Handling of Fractional Adjustments

Both versions handle fractional-bit adjustments, but the current PR implementation does so directly within the computational loop, which lowers overhead.

Original Code (Snippet for mat_multiply):

int64_t s;
for (i = 0; i < a->rows; i++) {
    for (j = 0; j < b->columns; j++) {
        s = 0;
        x = a->data + a->columns * i;
        y = b->data + j;
        for (k = 0; k < b->rows; k++) {
            s += (int32_t)(*x) * (*y);
            x++;
            y += y_inc;
        }
        *z = (int16_t)(((s >> shift_minus_one) + 1) >> 1);
        z++;
    }
}

Current PR Code (Snippet for mat_multiply):

int32_t acc;
for (i = 0; i < a->rows; i++) {
    for (j = 0; j < b->columns; j++) {
        acc = 0;
        x = a->data + a->columns * i;
        y = b->data + j;
        for (k = 0; k < b->rows; k++) {
            acc += (int32_t)(*x++) * (*y);
            y += y_inc;
        }
        *z = (int16_t)(((acc >> shift) + 1) >> 1);
        z++;
    }
}

Summary for mat_multiply()

The PR code uses a smaller accumulator, a streamlined loop structure, simplified conditional logic, and optimised memory access patterns to minimise memory usage and speed up matrix multiplication.

Key Points of Improvement for mat_multiply_elementwise

Reduction in Accumulator Bit Width for Intermediate Results:
Original Code:
int64_t p;
Current PR Code:
int32_t prod;

Improvement Explanation: Using a 32-bit integer (int32_t) for intermediate multiplication results instead of a 64-bit integer (int64_t) avoids handling larger data types and can speed up the operation on platforms where 32-bit instructions are more efficient.

Streamlining the Conditional Logic for Handling Q0 (shift == -1):
Original Code:

if (shift_minus_one == -1) {
    *z = *x * *y;
}

Current PR Code:

if (shift == -1) {
    *z = *x * *y;
}

Improvement Explanation: Both the original and the current PR code special-case the situation in which no bit shift is needed. Since this check is factored out of the loop, the current code performs it more effectively, with a single comparison rather than one per iteration.

Original Code (Snippet for mat_multiply_elementwise):

int64_t p;
const int shift_minus_one = a->fractions + b->fractions - c->fractions - 1;

// Handle no shift case (Q0 data) separately

if (shift_minus_one == -1) {
    for (i = 0; i < a->rows * a->columns; i++) {
        *z = *x * *y;
        x++; y++; z++;
    }
    return 0;
}

// Default case with shift

for (i = 0; i < a->rows * a->columns; i++) {
    p = (int32_t)(*x) * *y;
    *z = (int16_t)(((p >> shift_minus_one) + 1) >> 1);
    x++; y++; z++;
}

Current PR Code (Snippet for mat_multiply_elementwise):

int32_t prod;
const int shift = a->fractions + b->fractions - c->fractions - 1;

/* Perform multiplication with or without adjusting the fractional bits */

for (int i = 0; i < a->rows * a->columns; i++) {
    if (shift == -1) {
        // Direct multiplication when no adjustment for fractional bits is needed
        *z = *x * *y;
    } else {
        // Multiply elements as int32_t and adjust with rounding
        prod = (int32_t)(*x) * (*y);
        *z = (int16_t)(((prod >> shift) + 1) >> 1);
    }
    x++; y++; z++;
}

Summary for mat_multiply_elementwise()
The two main optimisations in the elementwise multiplication function are the reduction of the accumulator bit width and the more effective handling of the no-shift case (Q0 data). The current code speeds up the multiplication by reducing the accumulator size, and it refactors the handling of the special case where shifting is not required, which lessens branching and increases the efficiency of the loop.

I would be happy to add an instruction-level breakdown if you think that would be more helpful for specific aspects of the changes between the two versions.

@lgirdwood added this to the v2.11 milestone on Jun 25, 2024
}

int64_t p;
{ int64_t p;

Collaborator:

2 lines

a->columns != b->columns || a->rows != b->rows ||
c->columns != a->columns || c->rows != a->rows) {
return -EINVAL;
}

Collaborator:

you removed this in your previous commit

int16_t *x = a->data;
int16_t *y = b->data;
int16_t *z = c->data;
int i;
const int shift_minus_one = a->fractions + b->fractions - c->fractions - 1;
int32_t prod;

Collaborator:

AFAICS this is the only real change in this commit:

-	int64_t p;
+	int32_t prod;

the rest is cosmetics. Please don't mix functional changes with cosmetics. This obscures the changes.

@singalsu (Collaborator):

I just tested this with MFCC on the TGL platform. This patch reduced CPU_PEAK(AVG) from 85.1 to 81.5 MCPS. With LL pipelines the STFT hop/size causes an uneven load, so this time I'm using the peak as the more representative perf measure instead of the average.

@lgirdwood (Member):

> I just tested this with MFCC on the TGL platform. This patch reduced CPU_PEAK(AVG) from 85.1 to 81.5 MCPS. With LL pipelines the STFT hop/size causes an uneven load, so this time I'm using the peak as the more representative perf measure instead of the average.

@singalsu do you have more tests to run, or is this good to go with the optimization once the cleanups are resolved?

@singalsu (Collaborator):

> > I just tested this with MFCC on the TGL platform. This patch reduced CPU_PEAK(AVG) from 85.1 to 81.5 MCPS. With LL pipelines the STFT hop/size causes an uneven load, so this time I'm using the peak as the more representative perf measure instead of the average.
>
> @singalsu do you have more tests to run, or is this good to go with the optimization once the cleanups are resolved?

I think this is good to merge after the suggestions by @lyakh are addressed. Since the matrix operations are not a major part of MFCC, these savings are quite significant. So no need for further performance tests.

@lgirdwood (Member):

> > > I just tested this with MFCC on the TGL platform. This patch reduced CPU_PEAK(AVG) from 85.1 to 81.5 MCPS. With LL pipelines the STFT hop/size causes an uneven load, so this time I'm using the peak as the more representative perf measure instead of the average.
> >
> > @singalsu do you have more tests to run, or is this good to go with the optimization once the cleanups are resolved?
>
> I think this is good to merge after the suggestions by @lyakh are addressed. Since the matrix operations are not a major part of MFCC, these savings are quite significant. So no need for further performance tests.

Great - @ShriramShastry, you now have a clear path to merge once @lyakh's comments are resolved.

@ShriramShastry force-pushed the shastry_matrix_perf_optimization branch 4 times, most recently from 3d4f5fc to 2276903 on August 17, 2024 20:38
if (shift == -1)
*z = (int16_t)acc;
else
*z = (int16_t)(((acc >> shift) + 1) >> 1);

Collaborator:

This makes it slower. Before, this test was done once, before all the looping; now you do it in every iteration.

Contributor Author:

Checking shift == -1 inside the loop does introduce a minor inefficiency. I have addressed it.
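
A minimal sketch of the hoisted form referred to here (illustrative names, not the actual patch): the shift == -1 test is evaluated once, and each loop body stays branch-free.

#include <stdint.h>

/* Sketch: decide the Q0 case once, then run a branch-free loop */
static void elementwise_mul(int16_t *z, const int16_t *x, const int16_t *y,
			    int n, int shift)
{
	int i;

	if (shift == -1) {
		/* Q0 data: no fractional adjustment needed */
		for (i = 0; i < n; i++)
			z[i] = x[i] * y[i];
		return;
	}

	for (i = 0; i < n; i++) {
		int32_t p = (int32_t)x[i] * y[i];

		z[i] = (int16_t)(((p >> shift) + 1) >> 1);
	}
}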

@@ -25,56 +25,66 @@
* -EINVAL if input dimensions do not allow for multiplication.
* -ERANGE if the shift operation might cause integer overflow.
*/
int mat_multiply(struct mat_matrix_16b *a, struct mat_matrix_16b *b, struct mat_matrix_16b *c)
int mat_multiply(struct mat_matrix_16b *a, struct mat_matrix_16b *b,
struct mat_matrix_16b *c)

Collaborator:

Please split this commit: one commit per optimisation, and separate commits for any variable renaming, comments, and similar cosmetic changes unrelated to the actual optimisations. And if you claim a "36.31% reduction in memory usage", I'd like to understand where it comes from; so far I don't.

Contributor Author:

Sure, I'll make those changes.

int i;
const int shift_minus_one = a->fractions + b->fractions - c->fractions - 1;

Collaborator:

Same in this commit - please check that the commit description still matches the contents; I'd like to be able to see that too. As it stands, it's difficult to see where "unnecessary conditionals" are eliminated.

This patch introduces Doxygen-style documentation to the matrix
multiplication functions. Clear descriptions and parameter details
are provided to facilitate better understanding and ease of use.

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Added checks for integer overflow during shifting.
- Validated matrix dimensions to prevent mismatches.
- Ensured non-null pointers before operating on matrices.

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
Changed the accumulator data type from `int64_t` to `int32_t` to reduce
instruction cycle count. This change results in an approximate 8.18% gain
in performance for matrix multiplication operations.

Performance Results:
Compiler Settings: -O2
+------------+------+------+--------+-----------+-----------+----------+
| Test Name  | Rows | Cols | Cycles | Max Error | RMS Error | Result   |
+------------+------+------+--------+-----------+-----------+----------+
| Test 1     | 3    | 5    | 6487   | 0.00      | 0.00      | Pass     |
| Test 2     | 6    | 8    | 6106   | 0.00      | 0.00      | Pass     |
+------------+------+------+--------+-----------+-----------+----------+

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
Enhanced pointer arithmetic within loops to improve readability and
reduce overhead. This change potentially reduces minor computational
overhead, contributing to overall performance improvements of around
8.23% for Test 1 and 16.00% for Test 2.

Performance Results:
Compiler Settings: -O3

+------------+------+------+--------+-----------+-----------+----------+
| Test Name  | Rows | Cols | Cycles | Max Error | RMS Error | Result   |
+------------+------+------+--------+-----------+-----------+----------+
| Test 1     | 3    | 5    | 5953   | 0.00      | 0.00      | Pass     |
| Test 2     | 6    | 8    | 5128   | 0.00      | 0.00      | Pass     |
+------------+------+------+--------+-----------+-----------+----------+

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
Updated comments for better clarity and understanding. Made cosmetic
changes such as reformatting code and renaming variables to enhance
readability without impacting functionality. This resulted in
approximately 7.97% and 15.00% performance improvements for
Test 1 and Test 2, respectively.

Performance Results:
Compiler Settings: -O2

+------------+------+------+--------+-----------+-----------+----------+
| Test Name  | Rows | Cols | Cycles | Max Error | RMS Error | Result   |
+------------+------+------+--------+-----------+-----------+----------+
| Test 1     | 3    | 5    | 5975   | 0.00      | 0.00      | Pass     |
| Test 2     | 6    | 8    | 5192   | 0.00      | 0.00      | Pass     |
+------------+------+------+--------+-----------+-----------+----------+

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Enhanced data pointers for matrix elements
- Streamlined loop iteration for matrix element-wise
 multiplication
- Achieved a 0.09% performance improvement in cycle count

| Rows | Cols | Cycles | Max Error | RMS Error | Result|
+------+------+--------+-----------+-----------+-------+
| 5    | 6    | 3359   | 0.00      | 0.00      | Pass  |

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Changed product variable from int64_t to int32_t
- Improved performance by reducing data size
- Achieved a 11.57% performance improvement in cycle count

| Rows | Cols | Cycles | Max Error | RMS Error | Result |
+------+------+--------+-----------+-----------+--------+
| 5    | 6    | 2972   | 0.00	   | 0.00      | Pass   |

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>

@lyakh (Collaborator) left a comment

A general comment: changes are made not because they don't make anything worse, but because they improve something. When a PR is submitted, the submitter is trying to convince reviewers and maintainers that the changes improve something, not just that they aren't breaking anything.

@@ -31,7 +31,7 @@ int mat_multiply(struct mat_matrix_16b *a, struct mat_matrix_16b *b, struct mat_
if (a->columns != b->rows || a->rows != c->rows || b->columns != c->columns)
return -EINVAL;

int64_t s;
int32_t s; /* Changed from int64_t to int32_t */

Collaborator:

This is a comment in the code. Nobody will see the history when reading it. A "changed" comment implies that it was something different before and someone changed it. But when you read code you aren't that interested in the history; you want to know how and why it works in its present version. So comments should describe the present code, not its history.

@@ -51,12 +51,12 @@ int mat_multiply(struct mat_matrix_16b *a, struct mat_matrix_16b *b, struct mat_
x = a->data + a->columns * i;
y = b->data + j;
for (k = 0; k < b->rows; k++) {
s += (int32_t)(*x) * (*y);
x++;
/* Enhanced pointer arithmetic */

Collaborator:

Sorry, I don't understand this comment. "Enhanced" as compared to what or when? Does it help with understanding the code? Or is "enhanced pointer arithmetic" a maths term that I don't know?

*z = (int16_t)s; /* For Q16.0 */
z++;
/* Enhanced pointer arithmetic */
*z++ = (int16_t)s;

Collaborator:

I personally also like expressions like

*x++ = value;

but I don't think they actually improve readability; they're just more compact, so they help you scan the code faster. And I certainly don't expect these changes to affect performance. If you really want to convince me, you need to compare the assembly before and after this change.


/* Check shift to ensure no integer overflow occurs during shifting */
if (shift_minus_one < -1 || shift_minus_one > 31)
if (shift < -1 || shift > 31)
return -ERANGE;

Collaborator:

Your "cycles" figures before and after are very similar to the previous commit's. This tells me that you probably measured the state before several changes (possibly the whole PR) and after them, not before and after each commit. So the "cycles" don't seem correct.

y += y_inc;
}
/* Enhanced pointer arithmetic */
*z++ = (int16_t)s;
*z = (int16_t)acc;
z++; /* Move to the next element in the output matrix */

Collaborator:

you just changed *z = x; z++; to *z++ = x; in the previous commit and claimed that it was an improvement.

}
/* Enhanced pointer arithmetic */
*z++ = (int16_t)(((s >> shift_minus_one) + 1) >> 1); /*Shift to Qx.y */
}
}

Collaborator:

Lots of not really helpful comments; it's very confusing whether this changes anything. With this commit I don't think I can approve the PR.

x++;
y++;
z++;
*z = (int16_t)(((p >> shift_minus_one) + 1) >> 1); /* Shift to Qx.y */
}

Collaborator:

I don't think this enhances or streamlines anything. NAK.

@kv2019i (Collaborator) commented Sep 6, 2024

Release reminder - one week to v2.11-rc1.

@kv2019i (Collaborator) commented Sep 13, 2024

FYI @ShriramShastry pushing to v2.12 (cutoff date today).

@kv2019i modified the milestones: v2.11, v2.12 on Sep 13, 2024