Non-contiguous tensor iteration optimization #659
Conversation
…ooping over last two axes
…_size != tensor size
This is very nice, thank you!
I think we need an in-code comment describing the algorithm, which would also be helpful for future maintenance.
@@ -166,25 +207,133 @@ template stridedIterationYield*(strider: IterKind, data, i, iter_pos: typed) =
  elif strider == IterKind.Iter_Values: yield (i, data[iter_pos])
  elif strider == IterKind.Offset_Values: yield (iter_pos, data[iter_pos]) ## TODO: remove workaround for C++ backend

template stridedIterationLoop*(strider: IterKind, data, t, iter_offset, iter_size, prev_d, last_d: typed) =
This probably needs a comment describing the algorithm
Refactored the code a bit and added a comment
By the way, the same optimization can be applied to … Also, are 0-rank tensors supported? And can they be iterated over? Right now I added an assert to make sure that the rank is > 0.
IIRC I tried to make them work and ran into nasty compiler issues, but that was 7+ years ago. It's fine to assume they're unsupported for now.
Note that I have a refactoring that allows parallel iteration over a variadic number of tensors here: https://github.com/mratsim/Arraymancer/blob/v0.7.32/src/arraymancer/laser/strided_iteration/foreach_common.nim#L101-L119
Instead of doing the dual/triple versions, it would be more future-proof to modify those and then replace the old iteration procs. This was motivated by GRU / LSTM needing iteration over 4 tensors at once: https://github.com/mratsim/Arraymancer/blob/v0.7.32/src/arraymancer/nn_primitives/nnp_gru.nim#L138-L142
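For reference, here is a rough usage sketch of that variadic iteration. This is my reading of the `forEach` macro; whether it is re-exported from the top-level `arraymancer` module (rather than only from the laser strided_iteration modules linked above) is an assumption.

```nim
# Sketch only: assumes `forEach` is reachable via the main arraymancer import;
# otherwise it lives under the laser strided_iteration modules linked above.
import arraymancer

let
  a = [[1.0, 2.0], [3.0, 4.0]].toTensor
  b = [[10.0, 20.0], [30.0, 40.0]].toTensor
var dst = zeros[float64](2, 2)

# One fused strided traversal over three tensors; each loop variable is bound
# to the current element of the tensor named after `in`.
forEach x in dst, y in a, z in b:
  x = y + z

echo dst   # expected: element-wise a + b
```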
Cool, I'll check out the variadic version. It might take a while; I'm new to Nim and I see that the code is macro-heavy.
No, we can add a check and fall back to a slow path.
No worries; unfortunately that was, IIRC, the cleanest solution. The reference code I used while developing and generalizing the macros is this one: https://github.com/mratsim/laser/blob/master/benchmarks/loop_iteration/iter05_fusedpertensor.nim#L9-L143
By not resetting the offset here, operating on a Tensor view without cloning could cause undefined behavior, because we would be accessing elements outside the tensor buffer.
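To illustrate the point with a standalone toy (not Arraymancer's actual Tensor internals; `StridedView` below is a hypothetical stand-in): a view that starts partway into a shared buffer must begin every traversal at its own offset, otherwise the computed indices land on elements that do not belong to the view.

```nim
# Hypothetical stand-in for a tensor view over a shared buffer.
type StridedView = object
  buffer: seq[float]   # shared storage
  offset: int          # index of the view's first element in `buffer`
  shape: seq[int]      # 2-D only, for brevity
  strides: seq[int]

iterator items(v: StridedView): float =
  # Correct behaviour: every traversal starts at v.offset. If a stale offset
  # (e.g. 0) were used instead, the indices below would read memory that does
  # not belong to this view.
  for i in 0 ..< v.shape[0]:
    for j in 0 ..< v.shape[1]:
      yield v.buffer[v.offset + i * v.strides[0] + j * v.strides[1]]

when isMainModule:
  let buf = @[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
  # A 2x2 view of the bottom-right corner of a 3x3 row-major matrix.
  let v = StridedView(buffer: buf, offset: 4, shape: @[2, 2], strides: @[3, 1])
  for x in v:
    echo x             # 4.0, 5.0, 7.0, 8.0
```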
This PR caused a small regression related to operating on Tensor views; see the fix below.
* explicitly allow `openArray` in `[]`, `[]=` for tensors (this was simply an oversight, obviously)
* fix CI by compiling tests with `-d:ssl`
* need a space, duh
* use AWS mirror from PyTorch for MNIST download
* fix regression caused by PR #659: by not resetting the offset here, operating on a Tensor view without cloning could cause undefined behavior, because we would be accessing elements outside the tensor buffer
* add test case for regression of #659
What?
Speeding up iteration over a single non-contiguous tensor by looping explicitly over the last two dimensions. (Also changed `reshape` to use `map_inline` instead of `apply2_inline`.)
Why?
This operation is key to making non-contiguous tensors contiguous, and all other operations are typically much faster on contiguous tensors. The performance difference before and after the optimization can be 10x or more in some cases.
How?
Calling `advanceStridedIteration` at each step prevents proper vectorization, so instead we loop explicitly over the last two axes. This change is almost trivial when we iterate over a complete tensor, but it is a bit tricky when `iter_offset != 0` or `iter_size < t.size`; most of the code handles the "ragged" ends of the tensor.

We also reduce the rank of the tensor by coalescing axes where possible. Contiguous and uniformly strided tensors become rank 1, while non-contiguous tensors with non-uniform strides stay at least rank 2, so they always have two axes to loop over. Coalescing also makes the last two axes as large as possible, so the gain from the explicit loops is maximal.
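As a simplified, standalone illustration of both ideas (not the actual `stridedIterationLoop` template, and ignoring the `iter_offset` / `iter_size` handling):

```nim
# Standalone sketch: coalesce compatible axes, then loop explicitly over the
# remaining axes so the innermost loop is a plain strided loop the compiler
# can vectorize. Rank > 2 after coalescing is left out for brevity.

proc coalesceAxes(shape, strides: seq[int]): (seq[int], seq[int]) =
  ## Merge axis `ax` into the axis after it whenever
  ## strides[ax] == shape[ax+1] * strides[ax+1], i.e. one step along `ax`
  ## equals walking all the way across the next axis.
  var sh = @[shape[^1]]
  var st = @[strides[^1]]
  for ax in countdown(shape.len - 2, 0):
    if strides[ax] == sh[0] * st[0]:
      sh[0] *= shape[ax]          # fuse into one longer axis
    else:
      sh.insert(shape[ax], 0)     # keep as a separate outer axis
      st.insert(strides[ax], 0)
  result = (sh, st)

iterator stridedValues[T](data: seq[T]; shape, strides: seq[int]): T =
  ## Full-tensor iteration only (no iter_offset / iter_size, no ragged ends).
  let (sh, st) = coalesceAxes(shape, strides)
  case sh.len
  of 1:                           # contiguous or uniformly strided
    for i in 0 ..< sh[0]:
      yield data[i * st[0]]
  of 2:                           # explicit loops over the last two axes
    for i in 0 ..< sh[0]:
      let rowStart = i * st[0]
      for j in 0 ..< sh[1]:
        yield data[rowStart + j * st[1]]
  else:
    doAssert false, "this sketch only handles rank 1 and 2 after coalescing"

when isMainModule:
  # The transpose of a 3x4 row-major matrix: shape [4, 3], strides [1, 4].
  # It is non-contiguous, so coalescing leaves it at rank 2.
  let data = @[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
  var transposed: seq[int]
  for x in stridedValues(data, @[4, 3], @[1, 4]):
    transposed.add x
  echo transposed   # @[0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
```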
When the last axes have a very small number of elements, specialization removes the loops completely at compile time.
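One way to picture that specialization (a hypothetical sketch built around a sum reduction; the PR's actual dispatch is likely organized differently): map small runtime trip counts onto compile-time constants so the inner loop can be fully unrolled.

```nim
# Hypothetical sketch of the "specialize tiny inner axes" idea.

proc unrolledRowSum(data: seq[float]; rowStart, stride: int; n: static int): float {.inline.} =
  ## `n` is a compile-time value, so this loop has a constant trip count and
  ## the compiler can unroll (and for n == 1, eliminate) it.
  for j in 0 ..< n:
    result += data[rowStart + j * stride]

proc sumLastAxis(data: seq[float]; rows, rowStride, lastLen, lastStride: int): float =
  ## Sum over the last axis, dispatching tiny axis lengths to specialized,
  ## fully unrolled inner loops.
  for i in 0 ..< rows:
    let rowStart = i * rowStride
    case lastLen                      # runtime length -> compile-time constant
    of 1: result += unrolledRowSum(data, rowStart, lastStride, 1)
    of 2: result += unrolledRowSum(data, rowStart, lastStride, 2)
    of 3: result += unrolledRowSum(data, rowStart, lastStride, 3)
    else:                             # general loop for longer inner axes
      for j in 0 ..< lastLen:
        result += data[rowStart + j * lastStride]
```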
Benchmark
Code
bench.nim (collapsed)
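The actual bench.nim is collapsed above and not reproduced here; a minimal benchmark in the same spirit (names, sizes, and the use of `ones`/`transpose`/`clone` are illustrative assumptions, not the PR's code) could look like this:

```nim
# Illustrative only: times how long it takes to materialize a contiguous copy
# of a non-contiguous (transposed) tensor, which is the code path this PR
# optimizes. Sizes and iteration counts are arbitrary.
import std/[monotimes, times]
import arraymancer

proc benchClone(rows, cols, iters: int) =
  let t = ones[float32](rows, cols).transpose   # transposed view: non-contiguous
  var sink = 0.0'f32
  let start = getMonoTime()
  for _ in 0 ..< iters:
    let c = t.clone()                           # forces a strided read of `t`
    sink += c[0, 0]                             # keep the work from being elided
  let elapsed = getMonoTime() - start
  echo rows, "x", cols, " transposed clone x", iters, ": ",
       elapsed.inMilliseconds, " ms (", sink, ")"

when isMainModule:
  benchClone(1000, 1000, 100)
  benchClone(16, 100_000, 100)
```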
Results (for reference)
The timing tables comparing the original and optimized iteration are collapsed. They were measured with -d:release -d:danger, and with -d:release -d:danger -d:openmp --exceptions:setjmp.