
Performance difference between essentially the same einsum with dummy index interchanging #143

Open
refraction-ray opened this issue Jun 9, 2020 · 3 comments

Comments

@refraction-ray

Hi, there.

The relevant issue with this observation is google/TensorNetwork#650.

In simple code,

import timeit
import numpy as np
import opt_einsum as oe

D = 800
D1 = 12
D2 = 10
a = np.random.rand(D1, D, D)
b = np.random.rand(D, D, D2)
print(timeit.timeit(lambda: np.einsum("ijk, jkl->il", a, b, optimize=True), number=100))  # 2.5s
print(timeit.timeit(lambda: np.einsum("ikj, kjl->il", a, b, optimize=True), number=100))  # 19s
print(timeit.timeit(lambda: oe.contract("ijk, jkl->il", a, b), number=100))  # 19s
print(timeit.timeit(lambda: oe.contract("ikj, kjl->il", a, b), number=100))  # 2.5s

This huge difference is possibly due to np.tensordot:

print(timeit.timeit(lambda: np.tensordot(a, b, [[1, 2], [0, 1]]), number=100))  # takes ~2.7 seconds
print(timeit.timeit(lambda: np.tensordot(a, b, [[2, 1], [1, 0]]), number=100))  # takes ~8.5 seconds
np.allclose(np.tensordot(a, b, [[1, 2], [0, 1]]), np.tensordot(a, b, [[2, 1], [1, 0]]))  # True

The slower one may be invoking extra transpose (and copy) operations on top of the matrix multiplication.
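To illustrate, here is a minimal sketch of what np.tensordot roughly does internally (an assumption based on its documented behavior: transpose the contracted axes to the end of the first array and the front of the second, reshape both to 2-D, then dot; smaller dimensions used here to keep it quick):

```python
import numpy as np

D, D1, D2 = 100, 12, 10
a = np.random.rand(D1, D, D)
b = np.random.rand(D, D, D2)

# axes [[1, 2], [0, 1]]: the contracted axes are already the trailing axes
# of `a` and the leading axes of `b`, in order, so the transposes are
# identities and the 2-D reshapes are cheap views.
fast = a.reshape(D1, D * D) @ b.reshape(D * D, D2)

# axes [[2, 1], [1, 0]]: the contracted axes are requested in reversed
# order, so both transposes are non-trivial and each reshape must copy
# the data into a new C-contiguous buffer before the matmul.
slow = a.transpose(0, 2, 1).reshape(D1, D * D) @ b.transpose(1, 0, 2).reshape(D * D, D2)

print(np.allclose(fast, slow))                                  # True
print(np.allclose(fast, np.tensordot(a, b, [[1, 2], [0, 1]])))  # True
```

Both paths compute the same result; the difference is whether data has to be physically rearranged before BLAS is called.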

Any thoughts on this? And is there a consistent ordering of the axes argument that keeps tensordot fast?

@jcmgray
Collaborator

jcmgray commented Jun 9, 2020

I would have assumed that tensordot is fastest when the contracted axes are the contiguous last dimensions of the first array and the contiguous first dimensions of the second, so that a plain matrix multiplication can be used (but I'm slightly guessing here..).
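This guess can be checked directly with np.shares_memory: when the contracted axes are already the trailing dimensions in order, the pre-matmul reshape is a free view; any other order forces a copy (illustrative check with smaller dims):

```python
import numpy as np

a = np.random.rand(12, 100, 100)

# Contracted axes (1, 2) in order: the transpose is the identity, so the
# 2-D reshape is just a view -- no data movement before the matmul.
view = a.transpose(0, 1, 2).reshape(12, 100 * 100)
print(np.shares_memory(a, view))  # True

# Contracted axes requested as (2, 1): the transpose is non-trivial, so
# reshape must materialize a new C-contiguous copy first.
copy = a.transpose(0, 2, 1).reshape(12, 100 * 100)
print(np.shares_memory(a, copy))  # False
```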

If you use snakeviz you can see that it is indeed a reshape within tensordot that's costly on the first one:

[snakeviz profile screenshot "oe-1": the reshape call inside tensordot dominates the runtime]

The other dominant block is dot, and the time spent on that alone is similar to the total for the second contraction:

[snakeviz profile screenshot "oe-2": the dot call accounts for most of the runtime]

I guess the question for opt_einsum is whether there is a better default way to sort the axes supplied to tensordot at this point in the code.

@jcmgray
Collaborator

jcmgray commented Jun 9, 2020

Just to be clear, one is obviously free to apply the same permutation to both left_pos and right_pos (aka axes1 & axes2).

One canonical choice would be

left_pos, right_pos = zip(*sorted(zip(left_pos, right_pos)))

which should make the two contractions above the same, but is it always the fastest ordering?
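A small sketch of that canonicalization (canonical_axes is a hypothetical helper name, not part of opt_einsum): sorting the paired axes by the first array's axis order leaves the contraction result unchanged while normalizing the order handed to tensordot:

```python
import numpy as np

def canonical_axes(axes1, axes2):
    # Sort the (axes1[i], axes2[i]) pairs by the first array's axis index.
    axes1, axes2 = zip(*sorted(zip(axes1, axes2)))
    return list(axes1), list(axes2)

print(canonical_axes([2, 1], [1, 0]))  # ([1, 2], [0, 1])

# The contraction is unchanged, since the same pairs of axes are summed.
a = np.random.rand(4, 5, 6)
b = np.random.rand(6, 5, 3)
raw = np.tensordot(a, b, [[2, 1], [0, 1]])
canon = np.tensordot(a, b, canonical_axes([2, 1], [0, 1]))
print(np.allclose(raw, canon))  # True
```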

@dgasmith
Owner

dgasmith commented Jun 9, 2020

We talk about this some in #103. We can optimize the indices for a single contraction assuming that it is in C order, but it may cause issues down the line. For the binary contractions in this example, however, that concern doesn't apply.

Another way to solve this is a custom tensor_blas function. I don't recall why we apparently deprecated it in favor of pure np.tensordot. I also thought about pushing this into NumPy's tensordot implementation, but getting these things right for every edge case is a bit finicky.
