'out' option in 'contract' makes a copy #209

Closed
wxj6000 opened this issue Nov 24, 2022 · 6 comments

wxj6000 commented Nov 24, 2022

When 'out' is given, 'contract' is supposed to write the result into 'out' directly. However, in the current version, when 'out' is given, opt_einsum makes two copies! Here are my sample code and the profiling result.

import cupy
from opt_einsum import contract

a = cupy.random.rand(2100, 6, 400)
out = cupy.random.rand(2100, 5, 400)
c2s = cupy.random.rand(6, 5)
contract('ijk,jp->ipk', a, c2s, out=out)

==6876== Profiling result:
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  51.39%   1.96283s  20000  98.141us  88.831us  112.29us  cupy_copy__float64_float64
                 48.57%   1.85507s  10000  185.51us  183.07us  192.03us  void gemmSN_NN_kernel<double, int=128, int=2, int=4, int=8, int=5, int=4, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched>(cublasGemmSmallNParams<double const, cublasGemvTensorStridedBatched, double>)

dgasmith (Owner) commented
It is likely that the first copy is a transpose, since the GEMM zip (contracted) index isn't aligned in memory. Please try profiling the following contraction, where the zip index is aligned:

a = cupy.random.rand(2100, 400, 6)
out = cupy.random.rand(2100, 400, 5)
c2s = cupy.random.rand(6, 5)
contract('ijk,kp->ijp', a, c2s, out=out)
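
For intuition, here is a minimal sketch of why the unaligned spec 'ijk,jp->ipk' costs an extra kernel (an assumption about the mechanism, not opt_einsum's actual internals): the contracted index j must be a trailing, contiguous axis for the GEMM, so the backend first makes a transposed copy, runs the aligned GEMM, and returns a transposed view.

import cupy

a = cupy.random.rand(2100, 6, 400)
c2s = cupy.random.rand(6, 5)

# Sketch only: the transpose copy below is the extra cupy_copy kernel in the profile.
a_t = cupy.ascontiguousarray(a.transpose(0, 2, 1))  # 'ijk' -> 'ikj' (copy)
res = cupy.tensordot(a_t, c2s, axes=([2], [0]))     # aligned GEMM: 'ikj,jp->ikp'
res = res.transpose(0, 2, 1)                        # 'ikp' -> 'ipk' (view, no copy)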

wxj6000 (Author) commented Dec 26, 2022

Sorry for the late reply; I missed the notification. The code above indeed removes one of the cupy copies:

==1850== Profiling result:
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  77.46%   317.16ms  1000   317.16us  315.68us  330.75us  volta_dgemm_128x64_tt
                 22.40%   91.722ms  1000   91.722us  90.528us  93.727us  cupy_copy__float64_float64

But there is still one copy operation. It should be possible to avoid it by writing the data into 'out' directly, right? Am I missing something?

No copy operation is needed if I contract via

out = contract('ijk,kp->ijp', a, c2s)

==1988== Profiling result:
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  99.82%   317.26ms  1000   317.26us  315.61us  330.62us  volta_dgemm_128x64_tt
                 0.07%    209.41us  1      209.41us  209.41us  209.41us  void generate_seed_pseudo<rng_config<curandStateXORWOW, curandOrdering=101>>(__int64, __int64, __int64, curandOrdering, curandStateXORWOW*, unsigned int*)

But the new issue is that out.flags becomes

C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False

The element-wise difference between out in the two cases, contract('ijk,kp->ijp', a, c2s, out=out) and out = contract('ijk,kp->ijp', a, c2s), is zero.
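
For reference, a minimal way to reproduce both observations (a sketch, reusing a and c2s from the aligned example above):

out = cupy.empty((2100, 400, 5))
contract('ijk,kp->ijp', a, c2s, out=out)    # with out=: one extra copy kernel
res = contract('ijk,kp->ijp', a, c2s)       # without out=: no copy

print(res.flags)                            # C_CONTIGUOUS, F_CONTIGUOUS, OWNDATA all False
print(float(cupy.abs(out - res).max()))     # 0.0, i.e. identical values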

dgasmith (Owner) commented Jan 4, 2023

In general, ML code bases implement only BLAS-like interfaces, which assume aligned GEMM operations such as ij,jk->ik. Any time you do not have an aligned GEMM (one whose zip index is contiguous in memory), a copy or a for loop is required. Passing out is generally as fast as not passing it, since the time it takes to allocate an array is usually much less than the computation time.

For this binary contraction, opt_einsum will simply call cupy.tensordot.
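
One hedged reading of where the remaining copy comes from (an assumption about the dispatch, not a guaranteed code path): the GEMM writes its result into freshly allocated memory, and that result is then copied into the user-supplied buffer.

res = cupy.tensordot(a, c2s, axes=([2], [0]))   # GEMM into a newly allocated array
out[...] = res                                  # the remaining cupy_copy kernel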

wxj6000 (Author) commented Jan 4, 2023

I see. CuPy can be backed by cuTENSOR, which is not a standard BLAS-like interface; the copy could probably be avoided there.

Also, the 'copy' operation is quite slow on the GPU. For example, with the following setup:

import cupy
from opt_einsum import contract

a = cupy.random.rand(800, 800, 800)
b = cupy.random.rand(800, 800)
c = cupy.zeros([800, 800, 800])

contract('ijk,kl->ijl', a, b) takes 48 ms, while contract('ijk,kl->ijl', a, b, out=c) takes 75 ms on an A100.
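
To reproduce such timings, one option is a sketch like the following (assuming CuPy >= 10, which provides cupyx.profiler.benchmark; the 48 ms / 75 ms figures above are from an A100 and will vary with hardware):

from cupyx.profiler import benchmark

# Time the contraction with and without a preallocated output buffer.
print(benchmark(contract, ('ijk,kl->ijl', a, b), n_repeat=100))
print(benchmark(contract, ('ijk,kl->ijl', a, b), kwargs={'out': c}, n_repeat=100))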

The other problem I mentioned is also discussed in another issue: #211

jcmgray (Collaborator) commented Jan 6, 2023

I think a lot of what I said in #211 (comment) also applies to the 'out' kwarg: essentially, there is no backend-agnostic way to handle it.

Maybe for these kinds of lower-level functionality/optimizations, one would want to extract the contraction path/specification and write a custom kernel.
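
For anyone taking that route, opt_einsum already exposes the path and per-step specification through opt_einsum.contract_path, so a starting point could look like this sketch:

import cupy
import opt_einsum

a = cupy.random.rand(800, 800, 800)
b = cupy.random.rand(800, 800)

# Inspect the contraction order and per-step specs before writing
# a custom kernel around them.
path, info = opt_einsum.contract_path('ijk,kl->ijl', a, b)
print(path)   # pairwise contraction order, e.g. [(0, 1)]
print(info)   # scaling, FLOP estimate, and intermediate shapes per step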

wxj6000 (Author) commented Jan 7, 2023

I see. Those kwargs are indeed outside opt_einsum's scope. Thank you for clarifying.

wxj6000 closed this as completed Jan 7, 2023