Improve the performance of convolve (and correlate) #650
Conversation
Force-pushed from 47ac37c to 8474811
Can you check the speed of cross-correlation vs `im2colgemm_conv2d`? (Arraymancer/src/arraymancer/nn_primitives/fallback/conv.nim, lines 81 to 106 at 1448698)
You can use a batch size of 1 and set the number of color channels to 1; the result should be the same then. The bias can be an empty Tensor. I think it corresponds to the "valid" `mode` of `convolve`.
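For reference, a rough sketch of how such a comparison might be set up, reshaping a 1D signal into the 4D layout `im2colgemm_conv2d` expects. The exact signature, the empty-bias construction, and whether the proc is reachable from the top-level `arraymancer` import are assumptions here:

```nim
import arraymancer

# Hypothetical harness: 1D cross-correlation phrased as a 2D convolution
# with batch = channels = height = 1, as suggested above.
let f = [1'f32, 2, 3, 4, 5, 6, 7, 8].toTensor
let g = [1'f32, 0, -1].toTensor

let input  = f.reshape(1, 1, 1, f.size)   # [N, C_in, H, W]
let kernel = g.reshape(1, 1, 1, g.size)   # [C_out, C_in, kH, kW]
var bias: Tensor[float32]                 # empty bias tensor (assumed acceptable)

# No padding and stride 1 should correspond to mode="valid".
let res = im2colgemm_conv2d(input, kernel, bias)
echo res.reshape(res.size)                # back to a rank-1 tensor
```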
I gave this a try and, as you said, … For reference, scipy's … In any case it seems worth looking into using …
I saw that it is possible to replicate … I still don't know why …
Possibly related to the fact that it uses … Regarding the scipy implementation: are you compiling with OpenMP explicitly (…)? edit: The code for …
That seems a plausible explanation, thanks. Actually, I wasn't using that option. Unfortunately, when I compile with …
The key is that to implement it efficiently you must do the downsampling as part of the convolution. You cannot do it after the convolution step because that is much less efficient (you'd calculate `down` times more samples than needed). I was originally planning to add an …
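To illustrate why, here is a minimal plain-Nim sketch (not the PR's actual code) of a "valid" correlation that only computes every `down`-th output sample, instead of computing the full output and discarding samples afterwards:

```nim
proc correlateDown[T](f, g: openArray[T], down: int): seq[T] =
  ## Direct "valid" correlation that jumps straight to every `down`-th sample,
  ## so the inner loop runs roughly `down` times less often than a full pass.
  let fullLen = f.len - g.len + 1
  result = newSeq[T]((fullLen + down - 1) div down)
  for i in 0 ..< result.len:
    let base = i * down          # only every `down`-th output position
    var acc: T
    for j in 0 ..< g.len:
      acc += f[base + j] * g[j]
    result[i] = acc

echo correlateDown(@[1, 2, 3, 4, 5, 6], @[1, 1], 2)   # -> @[3, 7, 11]
```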
@mratsim, I am having trouble getting the gemm-based correlation to work for integers. I'm sure it can be fixed, but perhaps it could make sense to merge this improvement (which makes the performance significantly better) and switch to a gemm-based solution in a separate PR?
Force-pushed from 8474811 to ab284a9
The existing implementation of `convolve` (which is also used for `correlate`) was pretty slow. The slowness was due to the fact that we were using regular tensor element access in the inner loop of the convolution, which was very expensive. The solution was to ensure that the input tensors were contiguous (by cloning them if necessary) and then use `unsafe_raw_buf` for all the tensor element accesses, which is safe because we know that the indexes are all within the tensor boundaries. On my system this change makes a large convolution go from ~1.5 seconds in release mode and ~0.25 seconds in danger mode down to ~65 msecs in both modes (i.e. a x23 reduction in release mode and a x3.8 reduction in danger mode)!
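For illustration, a hedged sketch of the pattern described in the commit message. The raw-buffer accessor name comes from that message; whether it can be indexed exactly like this, and the helper name `correlateValid`, are assumptions:

```nim
import arraymancer

proc correlateValid[T: SomeNumber](f, g: Tensor[T]): Tensor[T] =
  # Make sure both inputs are contiguous, cloning only when necessary.
  let fc = if f.is_C_contiguous: f else: f.clone()
  let gc = if g.is_C_contiguous: g else: g.clone()
  result = newTensor[T](fc.size - gc.size + 1)
  # Raw-buffer access in the hot loop; every index stays within the tensor
  # boundaries by construction, so the "unsafe" access is actually safe here.
  let fbuf = fc.unsafe_raw_buf
  let gbuf = gc.unsafe_raw_buf
  let rbuf = result.unsafe_raw_buf
  for i in 0 ..< result.size:
    var acc: T
    for j in 0 ..< gc.size:
      acc += fbuf[i + j] * gbuf[j]
    rbuf[i] = acc
```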
Force-pushed from f33e92e to b9109f5
…GEMM Change the algorithm used to calculate the convolution and the correlation to one based on the "gemm" BLAS operation. This is around 3 times faster than the previous "fast" algorithm for float and complex input tensors. For integer tensors this new algorithm is as fast as the previous algorithm; the reason it is not faster is that for integers we do not use BLAS' gemm function. By using this new algorithm we were also able to add support for a new `down` argument for the `convolve` and `correlate` procedures. This will be useful to implement an efficient `upfirdn` function in the future. This commit also adds many new tests for the `convolve` and `correlate` procedures. These are needed because the gemm-based algorithm must handle the case in which the first input tensor is shorter than the second input tensor differently. Another set of tests is added because handling the different convolution / correlation modes using gemm is a bit tricky. Thanks to @mratsim for the suggestion, and especially to @Vindaar for reviewing these changes and fixing multiple issues.
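As a rough illustration of the approach (the name `correlateGemm` is hypothetical and this is not the PR's actual code): stack the sliding windows of the first input into an im2col-style matrix, then let a single matrix product, which Arraymancer dispatches to BLAS gemm for float types, perform all the dot products at once.

```nim
import arraymancer

proc correlateGemm[T: SomeFloat](f, g: Tensor[T]): Tensor[T] =
  ## "valid" 1D correlation via an im2col-style window matrix and one matmul.
  let n = f.size - g.size + 1
  var windows = newTensor[T](n, g.size)
  for i in 0 ..< n:
    # Each row holds one sliding window of the input signal.
    windows[i, _] = f[i ..< i + g.size].reshape(1, g.size)
  # [n, k] * [k, 1] -> [n, 1]; `*` on 2D float tensors uses BLAS gemm.
  result = (windows * g.reshape(g.size, 1)).reshape(n)

let f = [1.0, 2, 3, 4, 5].toTensor
let g = [1.0, 1].toTensor
echo correlateGemm(f, g)   # expected: 3 5 7 9
```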
This basic procedure, which is missing from Matlab, complements the `down` argument of `convolve` and `correlate` (as well as the slicing syntax, which can be used to downsample a tensor).
Thanks to @Vindaar's help I was able to make the gemm-based convolution algorithm work. I just updated the PR with the changes.
# # Let's make sure both inputs are contiguous
# let f = f.asContiguous()
# let g = g.asContiguous()
Can be removed, as it's now further up. :)
Great work!
Force-pushed from bcc1297 to d729311