Improve the performance of convolve (and correlate) #650
Conversation
Force-pushed from 47ac37c to 8474811
Can you check the speed of cross-correlation vs `im2colgemm_conv2d`? (Arraymancer/src/arraymancer/nn_primitives/fallback/conv.nim, lines 81 to 106 at 1448698)
You can use a batch size of 1 and set the number of color channels to 1; the result should be the same then. The bias can be an empty Tensor. I think it corresponds to the "valid" `mode` of `convolve`.
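For reference, a rough sketch of how such a comparison might be set up, reshaping a 1D signal into the 4D layout `im2colgemm_conv2d` expects. The exact signature, the empty-bias construction, and whether the proc is reachable from the top-level `arraymancer` import are assumptions here:

```nim
import arraymancer

# Hypothetical harness: 1D cross-correlation phrased as a 2D convolution
# with batch = channels = height = 1, as suggested above.
let f = [1'f32, 2, 3, 4, 5, 6, 7, 8].toTensor
let g = [1'f32, 0, -1].toTensor

let input  = f.reshape(1, 1, 1, f.size)   # [N, C_in, H, W]
let kernel = g.reshape(1, 1, 1, g.size)   # [C_out, C_in, kH, kW]
var bias: Tensor[float32]                 # empty bias tensor (assumed acceptable)

# No padding and stride 1 should correspond to mode="valid".
let res = im2colgemm_conv2d(input, kernel, bias)
echo res.reshape(res.size)                # back to a rank-1 tensor
```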
I gave this a try and, as you said, … For reference, scipy's … In any case it seems worth looking into using …
I saw that it is possible to replicate … I still don't know why …
Possibly related to the fact that it uses … Regarding the scipy implementation: are you compiling with OpenMP explicitly (…)? edit: The code for …
That seems a plausible explanation, thanks. Actually, I wasn't using that option. Unfortunately, when I compile with …
The key is that to implement it efficiently you must do the downsampling as part of the convolution. You cannot do it after the convolution step because that is much less efficient (you'd calculate `down` times more samples than needed). I was originally planning to add an …
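To illustrate why, here is a minimal plain-Nim sketch (not the PR's actual code) of a "valid" correlation that only computes every `down`-th output sample, instead of computing the full output and discarding samples afterwards:

```nim
proc correlateDown[T](f, g: openArray[T], down: int): seq[T] =
  ## Direct "valid" correlation that jumps straight to every `down`-th sample,
  ## so the inner loop runs roughly `down` times less often than a full pass.
  let fullLen = f.len - g.len + 1
  result = newSeq[T]((fullLen + down - 1) div down)
  for i in 0 ..< result.len:
    let base = i * down          # only every `down`-th output position
    var acc: T
    for j in 0 ..< g.len:
      acc += f[base + j] * g[j]
    result[i] = acc

echo correlateDown(@[1, 2, 3, 4, 5, 6], @[1, 1], 2)   # -> @[3, 7, 11]
```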
@mratsim, I am having trouble getting the gemm-based correlation to work for integers. I'm sure it can be fixed, but perhaps it could make sense to merge this improvement (which makes the performance significantly better) and switch to a gemm-based solution in a separate PR?
Force-pushed from 8474811 to ab284a9
The existing implementation of `convolve` (which is also used for `correlate`) was pretty slow. The slowness was due to the fact that we were using regular tensor element access in the inner loop of the convolution, which was very expensive. The solution was to ensure that the input tensors were contiguous (by cloning them if necessary) and then use `unsafe_raw_buf` for all the tensor element accesses, which is safe because we know that the indexes are all within the tensor boundaries. On my system this change makes a large convolution go from ~1.5 seconds in release mode and ~0.25 seconds in danger mode down to ~65 msecs in both modes (i.e. a x23 reduction in release mode and a x3.8 reduction in danger mode)!
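For illustration, a hedged sketch of the pattern described in the commit message. The raw-buffer accessor name comes from that message; whether it can be indexed exactly like this, and the helper name `correlateValid`, are assumptions:

```nim
import arraymancer

proc correlateValid[T: SomeNumber](f, g: Tensor[T]): Tensor[T] =
  # Make sure both inputs are contiguous, cloning only when necessary.
  let fc = if f.is_C_contiguous: f else: f.clone()
  let gc = if g.is_C_contiguous: g else: g.clone()
  result = newTensor[T](fc.size - gc.size + 1)
  # Raw-buffer access in the hot loop; every index stays within the tensor
  # boundaries by construction, so the "unsafe" access is actually safe here.
  let fbuf = fc.unsafe_raw_buf
  let gbuf = gc.unsafe_raw_buf
  let rbuf = result.unsafe_raw_buf
  for i in 0 ..< result.size:
    var acc: T
    for j in 0 ..< gc.size:
      acc += fbuf[i + j] * gbuf[j]
    rbuf[i] = acc
```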
Force-pushed from f33e92e to b9109f5
…GEMM Change the algorithm used to calculate the convolution and the correlation to one based on the "gemm" BLAS operation. This is around 3 times faster than the previous "fast" algorithm for float and complex input tensors. For integer tensors this new algorithm is as fast as the previous algorithm; the reason it is not faster is that for integers we do not use BLAS' gemm function. By using this new algorithm we were also able to add support for a new `down` argument for the `convolve` and `correlate` procedures. This will be useful to implement an efficient `upfirdn` function in the future. This commit also adds many new tests for the `convolve` and `correlate` procedures. These are needed because the gemm-based algorithm must handle the case in which the first input tensor is shorter than the second input tensor differently. Another set of tests is added because handling the different convolution / correlation modes using gemm is a bit tricky. Thanks to @mratsim for the suggestion, and especially to @Vindaar for reviewing these changes and fixing multiple issues.
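As a rough illustration of the approach (the name `correlateGemm` is hypothetical and this is not the PR's actual code): stack the sliding windows of the first input into an im2col-style matrix, then let a single matrix product, which Arraymancer dispatches to BLAS gemm for float types, perform all the dot products at once.

```nim
import arraymancer

proc correlateGemm[T: SomeFloat](f, g: Tensor[T]): Tensor[T] =
  ## "valid" 1D correlation via an im2col-style window matrix and one matmul.
  let n = f.size - g.size + 1
  var windows = newTensor[T](n, g.size)
  for i in 0 ..< n:
    # Each row holds one sliding window of the input signal.
    windows[i, _] = f[i ..< i + g.size].reshape(1, g.size)
  # [n, k] * [k, 1] -> [n, 1]; `*` on 2D float tensors uses BLAS gemm.
  result = (windows * g.reshape(g.size, 1)).reshape(n)

let f = [1.0, 2, 3, 4, 5].toTensor
let g = [1.0, 1].toTensor
echo correlateGemm(f, g)   # expected: 3 5 7 9
```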
This basic procedure, which is missing from Matlab, complements the `down` argument of `convolve` and `correlate` (as well as the slicing syntax, which can be used to downsample a tensor).
Thanks to @Vindaar's help I was able to make the gemm-based convolution algorithm work. I just updated the PR with the changes.
# # Let's make sure both inputs are contiguous
# let f = f.asContiguous()
# let g = g.asContiguous()
Can be removed, as it's now further up. :)
Great work!
Force-pushed from bcc1297 to d729311