-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[abandoned] Allowing zerocopy: makeAvailable(queue, dst, srcView)
#1820
base: develop
Are you sure you want to change the base?
Conversation
BufAcc bufAccS{alpaka::zero_memcpy(queue, bufHostS)}; | ||
float* const ptrBufAccS{alpaka::getPtrNative(bufAccS)}; | ||
auto pitchBufAccS = alpaka::getPitchBytes<1u>(bufAccS); | ||
alpaka::memcpy(queue, bufAccS, bufHostS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Buffer creation is moved down to where the memory copies were performed, because allocation/memcpy is now one potential step hidden behind zero_memcpy
.
include/alpaka/mem/buf/Traits.hpp
Outdated
//! Makes the content of the source view available on the device associated with the destination queue. If the | ||
//! destination shares the same memory space as the source view, no copy is performed and the destination view is | ||
//! updated to share the same buffer as the source view. Otherwise, a memcpy is performed. | ||
template<typename TQueue, typename TViewDst, typename TViewSrc> | ||
ALPAKA_FN_HOST void zero_memcpy(TQueue& dstQueue, TViewDst& viewDstOpt, TViewSrc const& viewSrc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This version is used when the user already has a destination buffer. Typically, the case when transferring back from the device to the host buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From our offline discussion yesterday: IMO TQueue& dstQueue
must be the source queue because you should not provide the control back to the caller before the memory operations on the source device are finished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking of two possible use cases:
- "transfer" data from host to device
- "transfer" data from device to host
where device could be any device for which we do not want to make the copy, but take the zero-copy path; it could be from a CPU backend, or a GPU backend with unified memory, etc.
In both cases I would expect to use a queue associated to the device, not to the host, same as for the regular memcpy
function.
So, maybe this should simply be TQueue queue
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fwyzard I think it must always be the source queue because there zero copy operation is instantly finished, so it takes no time but it is not allowed to use viewDstOpt
before all operations on viewSrc
are finished. To guard this case you need to wait for the source queue. In cases we would introduce a zero_memcpy_async
we would need to return an event that provides the information when we can use viewDstOpt
.
Even for host to device
we need the source queue (host queue) because the user can run a host-accelerated kernel with viewSrc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mhh I am a little bit confused the function is called zero_memcpy
but is performing a copy in case the memory is not available on the destination device. So it is NOT zero-memcopy, it is some kind of an optimized (intelligent) copy.
In that case I think we need the source and destination queue to gard when the operation starts in case it is not zero copiable
and to secure that follow-up operations to the destination not start before the possible copy is finished.
IMO the function should not be called zero_memcpy
or I have another definition of zero memcopy in mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In both cases I would expect to use a queue associated to the device, not to the host, same as for the regular memcpy function.
So, maybe this should simply be TQueue queue ?
Exactly, so I renamed it yesterday like so, but GitHub is only showing Outdated
instead of the actual state of the PR.
it is not allowed to use viewDstOpt before all operations on viewSrc are finished. To guard this case you need to wait for the source queue.
So how do we guard against this in the standard alpaka::memcpy(queue, dstView, srcView)
? Isn't it the same problem here? So I wondered what the meaning of this queue actually is, and whether it needs to have any relation to the views. Looking into the implementation, there is actually no requirement on the queue related to the views. For CUDA, the associated devices of the views determine whether the memcpy kind is host-2-device, device-2-device, etc., but I could apparently use a queue to a device A and then memcpy between two views on different devices B and C. This may be a design bug, but it leads me to believe that the queue's meaning is to allow the user to sequence operations with respect to other operations. And if the user gets that wrong, then that's the users fault.
Even for host to device we need the source queue (host queue) because the user can run a host-accelerated kernel with viewSrc.
Could you point me to an example where we actually have a queue to the host device as well (in addition to the accelerator device) and how that would be used? This is the first time I hear of this use.
Mhh I am a little bit confused the function is called zero_memcpy but is performing a copy in case the memory is not available on the destination device.
Right, which is why I chose the provocative placeholder name zero_memcpy
. We still need to have that discussion on how to name this functionality.
I was tempted yesterday to define zero-copy as: zero_memcpy(queue, dev, srcView) -> optional<DstView>
or even omit the queue
parameter altogether, because there is no real work necessary in case the view is also accessible on the destination device. However, this is not the API I would want to have as a user and I would immediately need to implement:
if (auto dst = zero_memcpy(queue, dev, srcView))
return *dst;
else {
auto dst = allocBuf(...);
memcpy(queue, dst, src);
}
I think @j-stephan suggested in the past to use make_available
and @fwyzard suggested to use convey
or smart_memcpy
in #28.
In that case I think we need the source and destination queue to gard when the operation starts in case it is not zero copiable and to secure that follow-up operations to the destination not start before the possible copy is finished.
Again, this is the users responsibilty to use the queues correctly. If you create two queues and use two views on both of the queues, you can also create a race with just the normal alpaka::memcpy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you point me to an example where we actually have a queue to the host device as well (in addition to the accelerator device) and how that would be used? This is the first time I hear of this use.
I would say we do not have such an example because we do not use accelerators colocated in one application.
As I know we are very "dirty" in all our examples and run host code without a use of an accelerator because we run the host code sequentially.
Even in PIConGPU we use native OpenMP where normally an accelerator would be the better way to run parallel host code to re-use all device functions we implemented. The problem with PIConGPU is that we use cupla
where we can not use two accelerators at the same time.
To be fair but this is one reason why we made alpaka but we never created an example :-(
Our tests are using multiple accelerators in one test-app but not cooperative as I know :-(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I know we are very "dirty" in all our examples and run host code without a use of an accelerator because we run the host code sequentially.
I don't think this is dirty. Alpaka should just coexist with ordinary C++. If I use a std::for_each(std::execution::par, ...)
or a std::jthread
next to an alpaka kernel call, this should be allowed (given that I also get all my synchronization right).
Our tests are using multiple accelerators in one test-app but not cooperative as I know :-(
Yeah, I think we should have a test/example for that, see #1909.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it should co-exists with "normal" C++. Nver the less all our examples assume we do not need paralism on the host side and we simply use a sequential for loop e.g. to evaluate results instead of a kernel that can do it in parallel.
include/alpaka/mem/buf/Traits.hpp
Outdated
//! Makes the content of the source view available on the device associated with the destination queue. If the | ||
//! destination shares the same memory space as the source view, no copy is performed and the source view is | ||
//! returned. Otherwise | ||
template<typename TQueue, typename TViewSrc> | ||
ALPAKA_FN_HOST auto zero_memcpy(TQueue& dstQueue, TViewSrc const& viewSrc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This version is used when the user does not have a destination buffer yet. Typically, the case when transferring from the host to a (potentially newly allocated) device buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I had to add another overload so the user could specify the destination device, because this is not necessarily the same as the queue device.
include/alpaka/mem/buf/Traits.hpp
Outdated
return viewSrc; | ||
} | ||
|
||
auto dst = allocAsyncBufIfSupported<Elem<TViewSrc>, Idx<TViewSrc>>(dstQueue, getExtentVec(viewSrc)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What strategy should I use to allocate a new buffer here?
For now I implemented the new routines for buffers. I wondered whether it makes any sense to allow more generic views as well. I guess not, because views have no device associated with them? |
e57226c
to
ae0febb
Compare
include/alpaka/mem/buf/Traits.hpp
Outdated
namespace detail | ||
{ | ||
template<typename DevDst, typename DevSrc> | ||
auto canZeroCopy(const DevDst& devDst, const DevSrc& devSrc) -> bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be constexpr
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe. I chose to be conservative for now in case we ever need to call a runtime function to figure out whether we can zero-copy between two devices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood.
By the way, what should be the behaviour of zero_memcpy(queue, bufOnSomeGpu, bufOnOtherGPu);
?
Make a copy, or let SomeGpu
access the buffer from OtherGpu
memory through nvlink/PCIe/etc. ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a very good question and I don't know what's best. I added it to #28.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer the second version as it makes use of (fast) hardware capabilities. Now the question to me is: Can and should we decide if this is possible automatically or let the users declare what they want
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I renamed the feature makeAvailable
now, which is what its intend is. So makeAvailable(queue, bufOnSomeGpu, bufOnOtherGPu)
should make the memory available on another GPU. If zero-copying is possible by any chance (mapping memory through PCIe, nvlink etc), it should be done. If non of that is possible, a full copy is performed.
If the user intends to have a full copy anyway, they should just say so and write memcpy(queue, bufOnSomeGpu, bufOnOtherGPu)
. In that sense, we should decide this automatically. But that decision may depend on the user having declared the involved buffers in a certain way.
If the user intended to use a buffer over nvlink between 2 CUDA CPUs (a very explicit use), then they don't need to use alpaka and can just call the CUDA API directly.
ae0febb
to
3b670f5
Compare
a6d95c5
to
6cb3dae
Compare
f374037
to
92f086b
Compare
After discussion #1820 (comment) I think this method can be useful but is very dangerous for the user to use. There are two cases
As mentioned in #1820 (comment) I think this behavior must be reflected in the name of the "copy" function. |
I think the function Maybe the function should be called |
741852c
to
1760283
Compare
makeAvailable(queue, dst, srcView)
Adds an overload set `makeAvailable` that only copies a buffer if the destination device requires the buffer in a different memory space. Otherwise, no copy is performed and just the handles are adjusted. Fixes: alpaka-group#28
1760283
to
39b5dce
Compare
typename TViewSrc, | ||
std::enable_if_t<isDevice<TDevDst>, int> = 0, | ||
typename TViewDst = Buf<TDevDst, Elem<TViewSrc>, Dim<TViewSrc>, Idx<TViewSrc>>> | ||
ALPAKA_FN_HOST auto makeAvailable(TQueue& queue, TDevDst const& dstDev, TViewSrc const& viewSrc) -> TViewDst |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please extend the documentation and point to the fact that the user must take care that all operations on viewSrc
are finished.
This was one point where I was always confused in this PR but now I understand that the user must link the queue to avoid data races.
alpaka::enqueue(quereWhereViewSrcIsUsed, event);
wait(destQueue, event);
auto newBuff = alpaka::makeAvailable(destQueue, destDev, viewSrc));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bernhardmgruber dstDev
is available in the queue
parameter, is it really required to have this extra parameter queue
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please extend the documentation and point to the fact that the user must take care that all operations on viewSrc are finished.
If you really insist, I can do it. It is also not documented for alpaka::memcpy
btw. and I think it is kind of obvious. It's also not true. You can have pending operations on the source view:
auto bufHost = alpaka::allocBuf<int, Idx>(devHost, extents);
auto bufAcc = alpaka::makeAvailable(queue, devAcc, bufHost);
alpaka::exec<Acc>(queue, workdiv, Kernel{}, alpaka::getPtrNative(bufAcc));
alpaka::makeAvailable(queue, bufHost, bufAcc); // pending kernel execution on bufAcc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dstDev is available in the queue parameter, is it really required to have this extra parameter queue?
yes, because the device of the queue is not always the destination device. See example above, where the destination is the host, but the queue's device is the accelerator.
I don't think this function is dangerous. It's just that we don't guarantee what the state of the source buffer is after
Are you fine with the proposed
This is an internal function and I think it's bad. Because you actually need a mix of compile and runtime information to get the data types of the buffers right, and
Which is exactly what I want. |
makeAvailable(queue, dst, srcView)
makeAvailable(queue, dst, srcView)
I am abandoning work on this. Feel free to use or discard anything from the changeset provided here. |
Adds an overload set
zero_memcpy
that only copies a buffer if the destination device requires the buffer in a different memory space in order to access it. Otherwise, no copy is performed and just the handles are adjusted.Motivating use case (for me):
Avoid duplicating data if the accelerator device is the same as the host device. This is the case in the standard workflow where we create a buffer on the host, then a buffer on the accelerator device, and then copy from host -> device. If e.g. the TBB or OpenMP backend is active, there is no need for two buffers and an actual memory copy.
This zero copy idea can be extended further to also include various other technologies like CUDA's USM (managed memory), or remove memory access etc. For now, I do not have the time to work out all those details, so I am just proposing a minimal API to accomplish my motivating use case.
Example
randomCells2D
is rewritten to use the new API.Addresses part of: #28
TODO:
alpaka::experimental
)