[SYCL][Docs] Add sycl_ext_oneapi_virtual_mem extension and implementation #8954

Is there a guarantee that the returned granularity is >= the `numBytes` input parameter? I think it would be good to clarify this one way or the other in the spec.
---
I don't believe we can make any such guarantees. The granularity is some value to which the user must align both the pointer and the size.

As an example, consider a backend/device that always returns 1024 (note: CUDA doesn't care about `numBytes`). When the user queries the granularity for `numBytes = 32`, they get 1024 and must adjust accordingly, meaning they would need to reserve 1024 bytes. Continuing the example, if they asked for the granularity when `numBytes = 1025`, they would once again get 1024 and would adjust their reservation to 2048.
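To put that arithmetic in code, here is a minimal sketch of the rounding an application would do (the helper name is illustrative, not part of the proposed extension API):

```c++
#include <cassert>
#include <cstddef>

// Round a requested size up to the next multiple of the queried granularity.
// Illustrative helper only; not part of the proposed extension API.
std::size_t align_up(std::size_t numBytes, std::size_t granularity) {
  return ((numBytes + granularity - 1) / granularity) * granularity;
}

int main() {
  assert(align_up(32, 1024) == 1024);   // request of 32 bytes -> reserve 1024
  assert(align_up(1025, 1024) == 2048); // request of 1025 bytes -> reserve 2048
}
```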
---
This makes sense. I think we should just clarify this for `get_recommended_mem_granularity` and say that the returned granularity could be less than, greater than, or equal to `numBytes`.

There's something about `get_minimum_mem_granularity` that seems weird. For example, suppose the device supports page sizes of both 1024 and 4096. If the user calls `get_minimum_mem_granularity(4096)`, what will they get? Is the return value 1024 even though the requested `numBytes` doesn't fit in that granularity? This makes me wonder if the minimum granularity should depend on `numBytes` at all.

Another weird thing is the word "minimum". This word implies that the application could also choose a larger granularity, but that's not the case. Presumably, the device supports a fixed set of granularities, and the application must choose one of them. This makes me wonder if the API should instead just return a list of all the supported granularities, something like:
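(The declaration below is only a sketch of that idea; the name and exact signature are not from the extension document.)

```c++
#include <sycl/sycl.hpp>
#include <vector>

// Hypothetical query (name and signature are illustrative only): return every
// granularity the device supports, sorted smallest first.
std::vector<std::size_t>
get_supported_mem_granularities(const sycl::device &syclDevice,
                                const sycl::context &syclContext);
```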
If the application wants the minimum one, they can just use the first element in the returned vector.
---
Currently, the L0 interface corresponding to the granularity query (`zeVirtualMemQueryPageSize`) does not differentiate between recommended and minimum, but does take the size. Conversely, the CUDA query (`cuMemGetAllocationGranularity`) doesn't have a size argument, but has both a minimum and a recommended mode.

Maybe we can remove the size from the minimum query by passing `1` to the L0 query and returning the corresponding value, but it is not clear to me if that is actually always a valid granularity. @jandres742 - Do you know?

As for being able to return a list of valid granularities, I don't see how we can do that with the current L0 interfaces. For CUDA it would have one or two elements (minimum and recommended, or just one if they are the same).
---
Right, @steffenlarsen. `cuMemGetAllocationGranularity` doesn't have a size parameter, so what that API does is:

"Here are the minimum and recommended granularities; please adjust your requested size to them."

While `zeVirtualMemQueryPageSize` says:

"For the size you want, here is the minimum granularity you should use for functionality and performance."

So the semantics of the two APIs are different: CUDA's always returns the same numbers for a given type of allocation and the user needs to adjust the size, while L0's already returns the granularity adjusted to the size.
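For reference, the two queries being compared look roughly like this:

```c++
// From the CUDA driver API (<cuda.h>): no size parameter; the mode selects
// between the minimum and recommended granularity.
CUresult cuMemGetAllocationGranularity(size_t *granularity,
                                       const CUmemAllocationProp *prop,
                                       CUmemAllocationGranularity_flags option);

// From Level Zero (<level_zero/ze_api.h>): takes the allocation size and
// returns a page size suited to it.
ze_result_t zeVirtualMemQueryPageSize(ze_context_handle_t hContext,
                                      ze_device_handle_t hDevice,
                                      size_t size, size_t *pagesize);
```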
Now, the SYCL APIs proposed here accept a size, represented by `numBytes`. So I guess there's an expectation that the granularity returned by `sycl::get_recommended_mem_granularity` should take that size into account, which is what L0 is doing. So I don't think we should pass 1 to the L0 API. What I think we should do is modify `cuda_piextVirtualMemGranularityGetInfo` to not ignore the `mem_size` parameter and instead return the granularity based on that size, something like:
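(The snippet below is only one possible reading of "based on that size"; it is not the actual `cuda_piextVirtualMemGranularityGetInfo` body, and `mem_size`/`device_ordinal` stand in for values the PI call is assumed to already receive. Error handling is omitted.)

```c++
#include <cstddef>
#include <cuda.h>

// Sketch only: pick a granularity for the requested size, rather than
// always returning the recommended one.
size_t granularity_for_size(size_t mem_size, int device_ordinal) {
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = device_ordinal; // assumed device index available to the plugin

  size_t minimum = 0, recommended = 0;
  cuMemGetAllocationGranularity(&minimum, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
  cuMemGetAllocationGranularity(&recommended, &prop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED);

  // Let the requested size pick between the two: small requests get the
  // minimum granularity, larger ones get the recommended granularity.
  return (mem_size < recommended) ? minimum : recommended;
}
```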
Now, if the intention of `sycl::get_recommended_mem_granularity` is to return a set of granularities and have the user adjust the size, then the current implementation in the CUDA backend is fine, and for L0 a size of 1 could be passed.

So the main question here is: what is the intention of `sycl::get_recommended_mem_granularity`? Is it to return the granularity based on the size passed (which is what L0 does, and for which we would need changes in the CUDA backend), or to ignore the size and return standard granularities (in which case it would be better to remove `num_bytes` from `sycl::get_recommended_mem_granularity` and pass 1 to the L0 API)?
---
This latest change by @steffenlarsen aligns the SYCL API with CUDA, which makes migration easy, so that's good.
Does this cause us to lose some performance on Level Zero, though? Let's say the user wants a moderately big (1 MB) address range. With the current API, we'll call the Level Zero `zeVirtualMemQueryPageSize` with `size` set to `1`. Will this return a different answer than if we called it with `size` set to 1 MB? If it is different, will this result in worse performance?
---
An option is to add another PI function asking the backend to do the aligning for us. For CUDA we would just use the recommended granularity, and L0 could work its magic. It means we would have somewhat similar APIs, but we would get the best of both worlds: new users could leverage this instead of doing their own aligning, while people translating code keep their 1:1 mapping in the existing functions.
---
I don't think there's any value in adding an API that just applies the alignment to the user's size. It's easy enough for the application to do that themselves.
I'm wondering if Level Zero chooses a different recommended alignment for big vs. small sizes, for example. As a purely hypothetical example, let's say the h/w supports both small and big page sizes. In such a case, it would be better to allocate small data blocks using small pages and large data blocks using big pages. However, each page size would have a different alignment requirement. Is that what's going on with the Level Zero API?
---
This conversation seems to have stalled out waiting for a response from someone on the Level Zero team. Removing the "size" parameter from `get_mem_granularity` makes the API easier to use and easier for SYCLomatic to migrate CUDA code. That's all good.

I'm just a little worried that there will be some negative impact if we always pass a size of `1` to the Level Zero `zeVirtualMemQueryPageSize` call. Can someone on the Level Zero team say whether this will cause problems?
---
For PVC and ATS-M, `zeVirtualMemQueryPageSize` will return 64 KB for any size less than 2 MB, and will return 2 MB for any size equal to or greater than 2 MB. Given this, passing a size of 1 should probably not be an issue.
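To make that rule explicit, here is a tiny sketch of the behavior described above (illustrative only, not driver code):

```c++
#include <cstddef>

// Page size selection as described for PVC and ATS-M: 64 KB pages for
// requests below 2 MB, 2 MB pages at or above 2 MB.
std::size_t pvc_atsm_page_size(std::size_t size) {
  constexpr std::size_t KiB = 1024, MiB = 1024 * KiB;
  return (size < 2 * MiB) ? 64 * KiB : 2 * MiB;
}
```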
---
I just realized that there is no `device` parameter here. Does this mean that the call reserves the address range in all devices in `syclContext`? What if those devices have different required granularities? Must the address range satisfy the required granularity for all devices in the context? I wonder if there should be a `device` parameter here?

If we decide there is not a `device` parameter to this call, then I think we should remove the `device` parameter from `get_minimum_mem_granularity` and `get_recommended_mem_granularity`. These APIs all work together. It makes no sense to get the memory granularity for one specific device if the allocation API requires all devices in the context to have that granularity.
---
The CUDA interface has the restriction
while the Level Zero interface mentions the page size (used here as the granularity) as
Neither takes a device, despite the granularity queries taking a device in both interfaces. I am not sure whether either backend will ever return different minimums; I suspect the actual requirement on the alignment and size comes into play when you map the range onto physical memory, which is allocated on specific devices. Depending on how we should read the Level Zero requirement here, we could rephrase the reservation interface requirement to say that the range must be aligned in accordance with the granularity of any device it will be mapped to. Arguably, this is more of an implicit requirement from the map function though.
---
Since neither the Level Zero nor the CUDA API takes a device handle, I assume that both APIs must be reserving the address range in all devices contained by the context. Would you agree?
In that case, wouldn't it make sense to also remove the `device` parameter from `get_minimum_mem_granularity` and `get_recommended_mem_granularity`?
---
I would be okay with it, but what would it do if the devices report different granularities? I assume the best solution would be to find the smallest value that is a multiple of all the reported granularities.
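In other words, combine them with a least common multiple. A minimal sketch of that in plain C++ (not extension or runtime code):

```c++
#include <cstddef>
#include <numeric> // std::lcm
#include <vector>

// Smallest value that is a multiple of every reported granularity.
std::size_t combined_granularity(const std::vector<std::size_t> &granularities) {
  std::size_t result = 1;
  for (std::size_t g : granularities)
    result = std::lcm(result, g);
  return result;
}
```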
---
Let's ask @jandres742 about the Level Zero API.
The documentation for zeVirtualMemReserve says:
However, `zeVirtualMemQueryPageSize` takes an `hDevice` parameter while `zeVirtualMemReserve` does not. What does the statement I quote above mean, exactly? Does it mean that the application must call `zeVirtualMemQueryPageSize` for every device in the `hContext` and find an address that is aligned properly for every one of those devices?
---
Is this a reasonable restriction to have for other backends? Are page sizes on the Intel GPUs always a multiple of the host page size anyways?
---
@gmlueck: Both, as mentioned in the spec:
"The starting address and size must be page aligned. See zeVirtualMemQueryPageSize."
"The virtual start address and size must be page aligned. See zeVirtualMemQueryPageSize."
---
This conversation never got resolved. I think the core problem is that the following statement is unclear:
The function `get_mem_granularity` returns the allocation granularity of a particular device; however, `reserve_virtual_mem` does not take a device parameter. Therefore, it's not clear what granularity we are talking about in the statement I quote above.

I think the solution might be to remove the `syclDevice` parameter from `get_mem_granularity`. As a result, `get_mem_granularity` would return the allocation granularity for a context (not a device). This solves the problem because `reserve_virtual_mem` also takes a context.

However, this probably requires a change to Level Zero, because `zeVirtualMemQueryPageSize` returns a page size for a particular device, not for a context. See my comments in this thread above, though. I think the Level Zero API also has problems, which would be solved by changing the definition of `zeVirtualMemQueryPageSize`.
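(Just to sketch the suggestion; this is not a signature from the current proposal.)

```c++
#include <sycl/sycl.hpp>

// Hypothetical context-level granularity query, per the suggestion above.
// The context would report a granularity valid for all of its devices.
std::size_t get_mem_granularity(const sycl::context &syclContext);
```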
---
Examining Level Zero code: the `zeVirtualMemQueryPageSize()` API calls the following internal function:

```c++
ze_result_t ContextImp::queryVirtualMemPageSize(ze_device_handle_t hDevice,
                                                size_t size,
                                                size_t *pagesize) {
    // Retrieve the page size and heap required for this allocation size requested.
    getPageAlignedSizeRequired(size, nullptr, pagesize);
    return ZE_RESULT_SUCCESS;
}
```
So, the `hDevice` handle is not used. Going down the call tree, the following is eventually called:

```c++
size_t DrmMemoryManager::selectAlignmentAndHeap(size_t size, HeapIndex *heap) {
    AlignmentSelector::CandidateAlignment alignmentBase = alignmentSelector.selectAlignment(size);
    size_t pageSizeAlignment = alignmentBase.alignment;
    auto rootDeviceCount = this->executionEnvironment.rootDeviceEnvironments.size();
```
So all detected devices are cycled in the loop that follows. I verified this to be the case using a board with 4x ATS-M devices: all 4 devices were looped over. Based upon this, only one call to `zeVirtualMemQueryPageSize()` should be required for the given driver.
---
Any chance we can change the parameters of `zeVirtualMemQueryPageSize` to remove `hDevice`, or document that the `hDevice` parameter isn't used? It would be nice if SYCL could rely on documented behavior of Level Zero, rather than making an assumption based on the current implementation.
---
I assume there are limitations about mapping the same memory twice without an intervening unmap. There are two cases:

1. Can you map a new virtual address range over physical memory that is already mapped to a different address? I presume the answer is "no", but we should state this explicitly in the spec.
2. If a virtual memory address is mapped to some physical memory, can that virtual address also be mapped to some different physical memory? Again, I presume the answer is "no", so maybe the spec should say that explicitly too.
---
For the first case, I see nothing in either CUDA or L0 disallowing it.

For the second case, this is definitely not allowed. The suggested wording has been added. 👍
---
That's interesting. What happens in this case? Is the physical memory automatically unmapped from the old virtual address range?
---
I would assume it would be like having two pointers to the same memory location, but I don't know for certain.
---
Do we have any tests for this scenario? It seems like a weird case, so I think we should have some tests if we are going to say that it is supported. For example, the test should try writing a value to one address and then reading it back from the other.

I would actually be surprised if this works, because the two virtual addresses, `p1` and `p2`, would alias each other.
---
Discussed offline: this is unlikely to work, and even if it does, it may only work for certain cases. We restrict it for now, and we can consider loosening the restriction in the future if needed. The restriction has been added.
---
Must `ptr` and `numBytes` be the exact range from some previous call to `physical_mem::map`, or may they be a subrange?

If a subrange is allowed, do `ptr` and `numBytes` need to be a multiple of the minimum granularity?
---
It seems like both backends allow a sub-range here, but L0 talks about "page size" without referring to the `zeVirtualMemQueryPageSize` query the way the other APIs do, so I am unsure whether it means the host page size or the result of that query. I have assumed the latter.
---
Can you check your wording, then? The wording says:
This would correspond to the host page size, right? Not the value returned by `zeVirtualMemQueryPageSize`.
---
No, for L0 this would correspond to the LCM of `zeVirtualMemQueryPageSize` across all devices in that context. For CUDA it would be the host page size.
---
I see that you changed the wording in 8dd8cb3 to say that `ptr` must be aligned to the device's minimum granularity. This clarifies my concern about the wording, so I think we can resolve this issue.