Non-contiguous tensor iteration optimization #659
Conversation
…ooping over last two axes
…_size != tensor size
This is very nice, thank you!
I think we need an in-code comment describing the algorithm, which would also be helpful for future maintenance.
@@ -166,25 +207,133 @@ template stridedIterationYield*(strider: IterKind, data, i, iter_pos: typed) =
  elif strider == IterKind.Iter_Values: yield (i, data[iter_pos])
  elif strider == IterKind.Offset_Values: yield (iter_pos, data[iter_pos]) ## TODO: remove workaround for C++ backend

template stridedIterationLoop*(strider: IterKind, data, t, iter_offset, iter_size, prev_d, last_d: typed) =
This probably needs a comment describing the algorithm
Refactored the code a bit and added a comment
By the way, the same optimization can be applied to … Also, are 0-rank tensors supported? And can they be iterated over? Right now I added an assert to make sure that the rank is > 0.
IIRC I tried to make them work and ran into nasty compiler issues, but that was 7+ years ago. It's fine to assume they're unsupported for now.
Note that I have a refactoring that allows parallel iteration over a variadic number of tensors here: https://github.com/mratsim/Arraymancer/blob/v0.7.32/src/arraymancer/laser/strided_iteration/foreach_common.nim#L101-L119
Instead of doing the dual/triple versions, it would be more future-proof to modify those and then replace the old iteration procs. This was motivated by GRU / LSTM needing iteration over 4 tensors at once: https://github.com/mratsim/Arraymancer/blob/v0.7.32/src/arraymancer/nn_primitives/nnp_gru.nim#L138-L142
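For reference, here is a rough usage sketch of that variadic iteration. This is my reading of the `forEach` macro; whether it is re-exported from the top-level `arraymancer` module (rather than only from the laser strided_iteration modules linked above) is an assumption.

```nim
# Sketch only: assumes `forEach` is reachable via the main arraymancer import;
# otherwise it lives under the laser strided_iteration modules linked above.
import arraymancer

let
  a = [[1.0, 2.0], [3.0, 4.0]].toTensor
  b = [[10.0, 20.0], [30.0, 40.0]].toTensor
var dst = zeros[float64](2, 2)

# One fused strided traversal over three tensors; each loop variable is bound
# to the current element of the tensor named after `in`.
forEach x in dst, y in a, z in b:
  x = y + z

echo dst   # expected: element-wise a + b
```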
Cool, I'll check out the variadic version. It might take a while; I'm new to Nim and I see that the code is macro-heavy.
No, we can add a check and fall back to a slow path.
No worries; unfortunately that was, IIRC, the cleanest solution. The reference code I used while developing and generalizing the macros is this one: https://github.com/mratsim/laser/blob/master/benchmarks/loop_iteration/iter05_fusedpertensor.nim#L9-L143
By not resetting the offset here, operating on a Tensor view without cloning could cause undefined behavior, because we would be accessing elements outside the tensor buffer.
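To illustrate the point with a standalone toy (not Arraymancer's actual Tensor internals; `StridedView` below is a hypothetical stand-in): a view that starts partway into a shared buffer must begin every traversal at its own offset, otherwise the computed indices land on elements that do not belong to the view.

```nim
# Hypothetical stand-in for a tensor view over a shared buffer.
type StridedView = object
  buffer: seq[float]   # shared storage
  offset: int          # index of the view's first element in `buffer`
  shape: seq[int]      # 2-D only, for brevity
  strides: seq[int]

iterator items(v: StridedView): float =
  # Correct behaviour: every traversal starts at v.offset. If a stale offset
  # (e.g. 0) were used instead, the indices below would read memory that does
  # not belong to this view.
  for i in 0 ..< v.shape[0]:
    for j in 0 ..< v.shape[1]:
      yield v.buffer[v.offset + i * v.strides[0] + j * v.strides[1]]

when isMainModule:
  let buf = @[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
  # A 2x2 view of the bottom-right corner of a 3x3 row-major matrix.
  let v = StridedView(buffer: buf, offset: 4, shape: @[2, 2], strides: @[3, 1])
  for x in v:
    echo x             # 4.0, 5.0, 7.0, 8.0
```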
This PR caused a small regression related to operating on Tensor views; see the fix below.
* explicitly allow `openArray` in `[]`, `[]=` for tensors (this was simply an oversight, obviously)
* fix CI by compiling tests with `-d:ssl`
* need a space, duh
* use AWS mirror from PyTorch for MNIST download
* fix regression caused by PR #659: by not resetting the offset here, operating on a Tensor view without cloning could cause undefined behavior, because we would be accessing elements outside the tensor buffer
* add test case for regression of #659
What?
Speeding up iteration over a single non-contiguous tensor by looping explicitly over the last two dimensions. (Also changed `reshape` to use `map_inline` instead of `apply2_inline`.)
Why?
This operation is key to making non-contiguous tensors contiguous, and all other operations are typically much faster on contiguous tensors. The performance difference before and after the optimization can be 10x or more in some cases.
How?
Calling `advanceStridedIteration` at each step prevents proper vectorization, so instead we loop explicitly over the last two axes. This change is almost trivial when we iterate over a complete tensor, but it is a bit tricky when `iter_offset != 0` or `iter_size < t.size`; most of the code handles the "ragged" ends of the tensor.

We also reduce the rank of the tensor by coalescing axes where possible. Contiguous and uniformly strided tensors become rank 1, while non-contiguous tensors with non-uniform strides stay at least rank 2, so they always have two axes to loop over. Coalescing also makes the last two axes as large as possible, so the gain from the explicit loops is maximal.
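As a simplified, standalone illustration of both ideas (not the actual `stridedIterationLoop` template, and ignoring the `iter_offset` / `iter_size` handling):

```nim
# Standalone sketch: coalesce compatible axes, then loop explicitly over the
# remaining axes so the innermost loop is a plain strided loop the compiler
# can vectorize. Rank > 2 after coalescing is left out for brevity.

proc coalesceAxes(shape, strides: seq[int]): (seq[int], seq[int]) =
  ## Merge axis `ax` into the axis after it whenever
  ## strides[ax] == shape[ax+1] * strides[ax+1], i.e. one step along `ax`
  ## equals walking all the way across the next axis.
  var sh = @[shape[^1]]
  var st = @[strides[^1]]
  for ax in countdown(shape.len - 2, 0):
    if strides[ax] == sh[0] * st[0]:
      sh[0] *= shape[ax]          # fuse into one longer axis
    else:
      sh.insert(shape[ax], 0)     # keep as a separate outer axis
      st.insert(strides[ax], 0)
  result = (sh, st)

iterator stridedValues[T](data: seq[T]; shape, strides: seq[int]): T =
  ## Full-tensor iteration only (no iter_offset / iter_size, no ragged ends).
  let (sh, st) = coalesceAxes(shape, strides)
  case sh.len
  of 1:                           # contiguous or uniformly strided
    for i in 0 ..< sh[0]:
      yield data[i * st[0]]
  of 2:                           # explicit loops over the last two axes
    for i in 0 ..< sh[0]:
      let rowStart = i * st[0]
      for j in 0 ..< sh[1]:
        yield data[rowStart + j * st[1]]
  else:
    doAssert false, "this sketch only handles rank 1 and 2 after coalescing"

when isMainModule:
  # The transpose of a 3x4 row-major matrix: shape [4, 3], strides [1, 4].
  # It is non-contiguous, so coalescing leaves it at rank 2.
  let data = @[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
  var transposed: seq[int]
  for x in stridedValues(data, @[4, 3], @[1, 4]):
    transposed.add x
  echo transposed   # @[0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
```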
When the last axes have a very small number of elements, specialization removes the loops completely at compile time.
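One way to picture that specialization (a hypothetical sketch built around a sum reduction; the PR's actual dispatch is likely organized differently): map small runtime trip counts onto compile-time constants so the inner loop can be fully unrolled.

```nim
# Hypothetical sketch of the "specialize tiny inner axes" idea.

proc unrolledRowSum(data: seq[float]; rowStart, stride: int; n: static int): float {.inline.} =
  ## `n` is a compile-time value, so this loop has a constant trip count and
  ## the compiler can unroll (and for n == 1, eliminate) it.
  for j in 0 ..< n:
    result += data[rowStart + j * stride]

proc sumLastAxis(data: seq[float]; rows, rowStride, lastLen, lastStride: int): float =
  ## Sum over the last axis, dispatching tiny axis lengths to specialized,
  ## fully unrolled inner loops.
  for i in 0 ..< rows:
    let rowStart = i * rowStride
    case lastLen                      # runtime length -> compile-time constant
    of 1: result += unrolledRowSum(data, rowStart, lastStride, 1)
    of 2: result += unrolledRowSum(data, rowStart, lastStride, 2)
    of 3: result += unrolledRowSum(data, rowStart, lastStride, 3)
    else:                             # general loop for longer inner axes
      for j in 0 ..< lastLen:
        result += data[rowStart + j * lastStride]
```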
Benchmark
Code
bench.nim (collapsed)
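The actual bench.nim is collapsed above and not reproduced here; a minimal benchmark in the same spirit (names, sizes, and the use of `ones`/`transpose`/`clone` are illustrative assumptions, not the PR's code) could look like this:

```nim
# Illustrative only: times how long it takes to materialize a contiguous copy
# of a non-contiguous (transposed) tensor, which is the code path this PR
# optimizes. Sizes and iteration counts are arbitrary.
import std/[monotimes, times]
import arraymancer

proc benchClone(rows, cols, iters: int) =
  let t = ones[float32](rows, cols).transpose   # transposed view: non-contiguous
  var sink = 0.0'f32
  let start = getMonoTime()
  for _ in 0 ..< iters:
    let c = t.clone()                           # forces a strided read of `t`
    sink += c[0, 0]                             # keep the work from being elided
  let elapsed = getMonoTime() - start
  echo rows, "x", cols, " transposed clone x", iters, ": ",
       elapsed.inMilliseconds, " ms (", sink, ")"

when isMainModule:
  benchClone(1000, 1000, 100)
  benchClone(16, 100_000, 100)
```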
Results (for reference)
The timing tables comparing the original and optimized iteration are collapsed. They were measured with -d:release -d:danger, and with -d:release -d:danger -d:openmp --exceptions:setjmp.