
Improve the performance of convolve (and correlate) #650

Merged: 4 commits from AngelEzquerra:faster_convolve into mratsim:master on May 23, 2024

Conversation

AngelEzquerra (Contributor)

The existing implementation of convolve (which is also used for correlate) was quite slow. The slowness was due to the fact that we were using regular (stride-aware, bounds-checked) tensor element access in the inner loop of the convolution, which is very expensive.

The solution was to ensure that the input tensors were contiguous (by cloning them if necessary) and then to use `unsafe_raw_buf` for all tensor element accesses, which is safe because we know that all indexes stay within the tensor boundaries.
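As a rough illustration, here is a minimal sketch of that pattern for a 1-D "valid" correlation (hypothetical code, not the exact implementation in this PR; `correlateValid` is a made-up name):

```nim
import arraymancer

proc correlateValid[T](f, g: Tensor[T]): Tensor[T] =
  # Ensure contiguous storage; asContiguous only clones when needed
  let f = f.asContiguous()
  let g = g.asContiguous()
  result = newTensor[T](f.size - g.size + 1)
  for n in 0 ..< result.size:
    var acc: T
    for k in 0 ..< g.size:
      # Raw buffer access: in bounds by construction (n + k < f.size)
      acc += f.unsafe_raw_buf[n + k] * g.unsafe_raw_buf[k]
    result.unsafe_raw_buf[n] = acc
```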

On my system this change takes a large convolution from ~1.5 seconds in release mode and ~0.25 seconds in danger mode down to ~65 msecs in both modes (i.e. a x23 reduction in release mode and a x3.8 reduction in danger mode)!

mratsim (Owner) commented May 2, 2024

Can you check the speed of cross-correlation vs im2colgemm_conv2d:

```nim
proc im2colgemm_conv2d*[T](input, kernel, bias: Tensor[T],
                           padding: Size2D = (0,0),
                           stride: Size2D = (1,1)): Tensor[T] =
  ## Compute cross-correlate for image with the given kernel weights
  # Implementation with ideas from http://cs231n.github.io/convolutional-networks/#conv
  let
    batch_size      = input.shape[^4]
    output_channels = kernel.shape[^4]
    kernel_size     = (height: kernel.nchw_height, width: kernel.nchw_width)
    output_height   = (input.nchw_height + (2*padding.height) - kernel.nchw_height) div stride.height + 1
    output_width    = (input.nchw_width  + (2*padding.width)  - kernel.nchw_width)  div stride.width  + 1
    channels_col    = input.nchw_channels * kernel.nchw_height * kernel.nchw_width
    kernel_col      = kernel.reshape(output_channels, channels_col)

  result = newTensorUninit[T](batch_size, output_channels, output_height, output_width)
  var input_col = newTensorUninit[T](channels_col, output_height * output_width)
  var output: Tensor[T]

  for i in 0..<batch_size: # TODO: batch matmul
    im2col(input.atAxisIndex(0, i).squeeze(0), kernel_size, padding, stride, input_col)
    # The following must be done without copy: GEMM will directly write in the result tensor
    output = result.atAxisIndex(0, i).reshape(kernel_col.shape[0], input_col.shape[1])
    gemm(1.T, kernel_col, input_col, 0.T, output)

  if bias.rank > 0:
    result +.= bias.unsqueeze(0)
```

You can put a batch size of 1 and the number of color channels to 1; the result should then be the same. Bias can be an empty Tensor.

I think it corresponds to "valid" for the convolve mode.
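For reference, here is a hypothetical way to run that comparison (the NCHW reshape, the empty bias, and the timing harness are my assumptions, not code from this thread):

```nim
import arraymancer, std/[monotimes, times]

let f = arange(0.0, 50_000.0)   # 1-D input signal
let g = arange(0.0, 200.0)      # 1-D filter

# View the 1-D signal as an NCHW "image": batch=1, channels=1, height=1
let input  = f.reshape(1, 1, 1, f.size)
let kernel = g.reshape(1, 1, 1, g.size)
var bias: Tensor[float]         # default (rank 0) tensor: no bias

let t0 = getMonoTime()
let res = im2colgemm_conv2d(input, kernel, bias)
echo "im2col + gemm took ", getMonoTime() - t0
# res.squeeze should match correlate(f, g) in "valid" mode
```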

AngelEzquerra (Contributor, Author) commented May 3, 2024

> Can you check the speed of cross-correlation vs im2colgemm_conv2d: […]
>
> You can put a batch size of 1 and the number of color channels to 1; the result should then be the same. Bias can be an empty Tensor.
>
> I think it corresponds to "valid" for the convolve mode.

I gave this a try and, as you said, `im2colgemm_conv2d` with the right arguments is equivalent to `correlate` called in valid mode. I made some measurements and the results are a little surprising. When working with some relatively large integer tensors (arange(50000) as the input tensor and arange(200) as the "filter" tensor) the performance of this PR's version of `convolve` and of `im2colgemm_conv2d` is basically the same (around 70 ms). However, the same test with the same inputs converted to float gave very different results: this PR's `convolve` worsened to ~100 ms, while `im2colgemm_conv2d` improved to just below 30 ms.

For reference, scipy's upfirdn still performs much better in many cases, but its behavior seems a little inconsistent (depending on the order of the inputs it can be worse). I'll need to do more tests to fully confirm this point and provide actual comparison measurements.

In any case it seems worth looking into using `im2colgemm_conv2d`'s method. However, it is not clear to me whether it would be easy to extend its result to "full" mode. Any thoughts on that @mratsim? And any ideas about how scipy might be so much faster?

AngelEzquerra (Contributor, Author)

> Can you check the speed of cross-correlation vs im2colgemm_conv2d: […]

> I gave this a try and, as you said, im2colgemm_conv2d with the right arguments is equivalent to correlate called in valid mode. […] In any case it seems worth looking into using im2colgemm_conv2d's method. However, it is not clear to me whether it would be easy to extend its result to "full" mode.

I saw that it is possible to replicate convolution's full and valid modes using the padding argument of `im2colgemm_conv2d`. It should also be possible to use the stride argument to efficiently replicate the `down` argument of scipy's `upfirdn` function.
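Concretely, something along these lines should work (a hypothetical sketch reusing the NCHW variables from the benchmark above; the padding arithmetic is the standard full-mode identity, not code from this PR):

```nim
# "full" mode: padding the width by kernel_length - 1 makes the valid
# correlation produce all f.size + g.size - 1 output samples
let fullRes = im2colgemm_conv2d(input, kernel, bias,
                                padding = (0, g.size - 1))

# upfirdn-style `down`: the stride argument skips output columns,
# i.e. it keeps only every down-th sample
let down = 4
let downRes = im2colgemm_conv2d(input, kernel, bias,
                                stride = (1, down))
```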

I still don't know why upfirdn is sometimes much faster than im2colgemm_conv2d though.

Vindaar (Collaborator) commented May 7, 2024

> When working with some relatively large integer tensors (arange(50000) as the input tensor and arange(200) as the "filter" tensor) the performance of the version of convolve in this PR and im2colgemm_conv2d is basically the same (around 70 ms). However the same test with the same inputs converted to float gave very different results: this PR's convolve worsened to ~100 ms, while im2colgemm_conv2d improved to just below 30 ms.

Possibly related to the fact that it uses gemm and for integer types it uses Mamy's laser backend, while for float it falls back to BLAS? iirc the laser gemm implementation only exists for integers.

Regarding the scipy implementation: you are compiling with OpenMP explicitly (-d:openmp), no? Easy to forget and I assume scipy will do so.

edit: The code for upfirdn doesn't look very complicated: https://github.com/scipy/scipy/blob/main/scipy/signal/_upfirdn.py

AngelEzquerra (Contributor, Author)

> Possibly related to the fact that it uses gemm and for integer types it uses Mamy's laser backend, while for float it falls back to BLAS? iirc the laser gemm implementation only exists for integers.
>
> Regarding the scipy implementation: you are compiling with OpenMP explicitly (-d:openmp), no? Easy to forget and I assume scipy will do so.

That seems a plausible explanation, thanks. Actually, I wasn't using that option. Unfortunately, when I compile with -d:openmp (on Windows) I get the following error:

```
initialization.nim: In function 'copyFromRaw__upfirdn_u348':
initialization.nim:287:63: error: invalid branch to/from OpenMP structured block
initialization.nim:287:213: error: invalid branch to/from OpenMP structured block
initialization.nim:289:63: error: invalid branch to/from OpenMP structured block
compilation terminated due to -fmax-errors=3.
```

> edit: The code for upfirdn doesn't look very complicated: https://github.com/scipy/scipy/blob/main/scipy/signal/_upfirdn.py

The key is that, to implement it efficiently, you must do the downsampling as part of the convolution. Doing it after the convolution step is much less efficient (you'd calculate `down` times more samples than needed). I was originally planning to add an upfirdn function to impulse, but because of this I think it makes more sense to add it to arraymancer. We can add the filter window creation (e.g. kaiser, etc.) to impulse, and use it to add a resample function (akin to Matlab's and scipy's resample functions) to impulse.
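Here is a minimal sketch of that point, reusing the contiguous + `unsafe_raw_buf` approach from this PR (hypothetical code; `correlateDown` is a made-up name):

```nim
import arraymancer

proc correlateDown[T](f, g: Tensor[T], down: int): Tensor[T] =
  ## "valid" correlation that keeps only every `down`-th output sample.
  ## The samples that would be discarded are never computed at all.
  let f = f.asContiguous()
  let g = g.asContiguous()
  let fullLen = f.size - g.size + 1               # full "valid" length
  result = newTensor[T]((fullLen + down - 1) div down)
  for i in 0 ..< result.size:
    let n = i * down                              # stride over the output
    var acc: T
    for k in 0 ..< g.size:
      acc += f.unsafe_raw_buf[n + k] * g.unsafe_raw_buf[k]
    result.unsafe_raw_buf[i] = acc
```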

AngelEzquerra (Contributor, Author)

@mratsim, I am having trouble getting the gemm-based correlation to work for integers. I'm sure it can be fixed, but perhaps it could make sense to merge this improvement (which already makes the performance significantly better) and switch to a gemm-based solution in a separate PR?

(The first commit's message repeats the PR description above.)
…GEMM

Change the algorithm used to calculate the convolution and the correlation to one based on the "gemm" BLAS operation. This is around 3 times faster than the previous "fast" algorithm for float and complex input tensors. For integer tensors the new algorithm is as fast as the previous one; the reason it is not faster is that for integers we do not use BLAS' gemm function.

By using this new algorithm we were also able to add support for a new `down` argument for the `convolve` and `correlate` procedures. This will be useful to implement an efficient `upfirdn` function in the future.

This commit also adds many new tests for the `convolve` and `correlate` procedures. These are needed because the gemm-based algorithm must handle the case in which the first input tensor is shorter than the second one differently. Another set of tests was added because handling the different convolution / correlation modes using gemm is a bit tricky.

Thanks to @mratsim for the suggestion, and especially to @Vindaar for reviewing these changes and fixing multiple issues.
This basic procedure, which is missing from Matlab, complements the `down` argument of `convolve` and `correlate` (as well as the slicing syntax, which can be used to downsample a tensor).
AngelEzquerra (Contributor, Author)

> @mratsim, I am having trouble getting the gemm-based correlation to work for integers. […]

Thanks to @Vindaar's help I was able to make the gemm-based convolution algorithm work. I just updated the PR with the changes.
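For completeness, a quick usage sketch of the new `down` argument (a hypothetical example; the argument name is taken from the commit message above):

```nim
import arraymancer

let f = arange(0.0, 10.0)
let g = [1.0, 1.0, 1.0].toTensor

# Full convolution: 10 + 3 - 1 = 12 output samples
let y = convolve(f, g)

# Fused downsampling: keep every 2nd output sample without
# computing the ones that are thrown away
let yDown = convolve(f, g, down = 2)
```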

Comment on lines 405 to 407:

```nim
# # Let's make sure both inputs are contiguous
# let f = f.asContiguous()
# let g = g.asContiguous()
```
Collaborator:

Can be removed, as it's now further up. :)

Vindaar (Collaborator) commented May 23, 2024

Great work!

@Vindaar merged commit 13af280 into mratsim:master on May 23, 2024
6 checks passed
@AngelEzquerra deleted the faster_convolve branch on May 23, 2024 at 09:08