
Performance difference between essentially the same einsum with dummy index interchanging #143

Open
refraction-ray opened this issue Jun 9, 2020 · 3 comments

Comments

@refraction-ray

Hi, there.

The relevant issue with this observation is google/TensorNetwork#650.

In simple code,

import timeit
import numpy as np
import opt_einsum as oe

D = 800
D1 = 12
D2 = 10
a = np.random.rand(D1, D, D)
b = np.random.rand(D, D, D2)
print(timeit.timeit(lambda: np.einsum("ijk, jkl->il", a, b, optimize=True), number=100))  # 2.5s
print(timeit.timeit(lambda: np.einsum("ikj, kjl->il", a, b, optimize=True), number=100))  # 19s
print(timeit.timeit(lambda: oe.contract("ijk, jkl->il", a, b), number=100))  # 19s
print(timeit.timeit(lambda: oe.contract("ikj, kjl->il", a, b), number=100))  # 2.5s

This huge difference is possibly due to np.tensordot:

print(timeit.timeit(lambda: np.tensordot(a, b, [[1, 2], [0, 1]]), number=100))  # takes ~2.7 seconds
print(timeit.timeit(lambda: np.tensordot(a, b, [[2, 1], [1, 0]]), number=100))  # takes ~8.5 seconds
np.allclose(np.tensordot(a, b, [[1, 2], [0, 1]]), np.tensordot(a, b, [[2, 1], [1, 0]]))  # True

The slower one may be invoking extra transpose (and copy) operations on top of the matrix multiplication.
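To illustrate, here is a minimal sketch of what np.tensordot roughly does internally (an assumption based on its documented behavior: transpose the contracted axes to the end of the first array and the front of the second, reshape both to 2-D, then dot; smaller dimensions used here to keep it quick):

```python
import numpy as np

D, D1, D2 = 100, 12, 10
a = np.random.rand(D1, D, D)
b = np.random.rand(D, D, D2)

# axes [[1, 2], [0, 1]]: the contracted axes are already the trailing axes
# of `a` and the leading axes of `b`, in order, so the transposes are
# identities and the 2-D reshapes are cheap views.
fast = a.reshape(D1, D * D) @ b.reshape(D * D, D2)

# axes [[2, 1], [1, 0]]: the contracted axes are requested in reversed
# order, so both transposes are non-trivial and each reshape must copy
# the data into a new C-contiguous buffer before the matmul.
slow = a.transpose(0, 2, 1).reshape(D1, D * D) @ b.transpose(1, 0, 2).reshape(D * D, D2)

print(np.allclose(fast, slow))                                  # True
print(np.allclose(fast, np.tensordot(a, b, [[1, 2], [0, 1]])))  # True
```

Both paths compute the same result; the difference is whether data has to be physically rearranged before BLAS is called.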

Any thoughts on this? And is there a consistent ordering of the axes argument that keeps tensordot fast?

@jcmgray
Collaborator

jcmgray commented Jun 9, 2020

I would have assumed that tensordot is fastest when the contracted axes are the contiguous last dimensions of the first array and the contiguous first dimensions of the second, so that a plain matrix multiplication can be used (but I'm slightly guessing here..).
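This guess can be checked directly with np.shares_memory: when the contracted axes are already the trailing dimensions in order, the pre-matmul reshape is a free view; any other order forces a copy (illustrative check with smaller dims):

```python
import numpy as np

a = np.random.rand(12, 100, 100)

# Contracted axes (1, 2) in order: the transpose is the identity, so the
# 2-D reshape is just a view -- no data movement before the matmul.
view = a.transpose(0, 1, 2).reshape(12, 100 * 100)
print(np.shares_memory(a, view))  # True

# Contracted axes requested as (2, 1): the transpose is non-trivial, so
# reshape must materialize a new C-contiguous copy first.
copy = a.transpose(0, 2, 1).reshape(12, 100 * 100)
print(np.shares_memory(a, copy))  # False
```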

If you use snakeviz you can see that it is indeed a reshape within tensordot that's costly on the first one:

[snakeviz profile screenshot "oe-1": the reshape call inside tensordot dominates the runtime]

The other dominant block is dot, and the time spent on that alone is similar to the total for the second contraction:

[snakeviz profile screenshot "oe-2": the dot call accounts for most of the runtime]

I guess the question for opt_einsum is whether there is a better default way to sort the axes supplied to tensordot at this point in the code.

@jcmgray
Collaborator

jcmgray commented Jun 9, 2020

Just to be clear, one is obviously free to apply the same permutation to both left_pos and right_pos (aka axes1 & axes2).

One canonical choice would be

left_pos, right_pos = zip(*sorted(zip(left_pos, right_pos)))

which should make the two contractions above the same, but is it always the fastest ordering?
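A small sketch of that canonicalization (canonical_axes is a hypothetical helper name, not part of opt_einsum): sorting the paired axes by the first array's axis order leaves the contraction result unchanged while normalizing the order handed to tensordot:

```python
import numpy as np

def canonical_axes(axes1, axes2):
    # Sort the (axes1[i], axes2[i]) pairs by the first array's axis index.
    axes1, axes2 = zip(*sorted(zip(axes1, axes2)))
    return list(axes1), list(axes2)

print(canonical_axes([2, 1], [1, 0]))  # ([1, 2], [0, 1])

# The contraction is unchanged, since the same pairs of axes are summed.
a = np.random.rand(4, 5, 6)
b = np.random.rand(6, 5, 3)
raw = np.tensordot(a, b, [[2, 1], [0, 1]])
canon = np.tensordot(a, b, canonical_axes([2, 1], [0, 1]))
print(np.allclose(raw, canon))  # True
```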

@dgasmith
Owner

dgasmith commented Jun 9, 2020

We talk about this some in #103. We can optimize the indices for a single contraction assuming that it is in C order, but it may cause issues down the line. For the binary contractions in this example, however, that concern doesn't apply.

Another way to solve this is a custom tensor_blas function. I don't recall why we apparently deprecated it in favor of pure np.tensordot. I also thought about pushing this into NumPy's tensordot implementation, but getting these things right for every edge case is a bit finicky.
