[SYCL][Docs] Add sycl_ext_oneapi_virtual_mem extension and implementation #8954

Merged Jul 1, 2024 · 135 commits
Changes from 25 commits

Commits
10c344e
[SYCL][Docs] Add sycl_ext_oneapi_virtual_mem extension and implementa…
steffenlarsen Mar 22, 2023
81703b6
Fix declaration in ESIMD emu
steffenlarsen Apr 5, 2023
ce0829c
Initialize CUDA property structs
steffenlarsen Apr 6, 2023
0fef874
Add Windows symbols
steffenlarsen Apr 6, 2023
294dd96
Clarify need for unmapped free
steffenlarsen Apr 6, 2023
053cbdc
Change std::min to std::max
steffenlarsen Apr 6, 2023
1129750
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Apr 6, 2023
7b1b5ec
Fix formatting
steffenlarsen Apr 6, 2023
ac8843f
Reduce work-item count
steffenlarsen Apr 6, 2023
f8fad93
Fix formatting
steffenlarsen Apr 6, 2023
0a6e5f9
Rephrase recommended granularity
steffenlarsen Apr 6, 2023
d0b3229
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Apr 11, 2023
27c7200
Address comments
steffenlarsen Apr 11, 2023
92fe6f5
Add missing offset mention to map overload
steffenlarsen Apr 11, 2023
d2c92f0
Remove memory size argument in granularity queries
steffenlarsen Apr 18, 2023
d6745aa
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Apr 18, 2023
eb21fc3
Make physical_mem release in line with other similar interfaces
steffenlarsen Apr 19, 2023
f268d37
Fix formatting
steffenlarsen Apr 19, 2023
5f5b2f1
Remove use of context in HIP PI
steffenlarsen Apr 19, 2023
1f2d527
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Apr 19, 2023
99b7e9c
Change to use _ur_object
steffenlarsen Apr 19, 2023
14e64af
Change info query names
steffenlarsen Apr 19, 2023
d88e141
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Apr 24, 2023
0df58a4
Add new access mode enum
steffenlarsen Apr 24, 2023
f7004f1
Change to uintptr_t for ranges and adjust descriptions
steffenlarsen Apr 24, 2023
ffb2982
Fix wording
steffenlarsen Apr 25, 2023
a7b067d
Specify aspect in specification section
steffenlarsen Apr 25, 2023
bd3a7e7
Add missing aspect enum
steffenlarsen Apr 25, 2023
c75d8ee
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Apr 26, 2023
4fdf659
Add Windows symbols
steffenlarsen Apr 26, 2023
c1f51e5
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen May 3, 2023
8c8955a
Fix merge mistake
steffenlarsen May 3, 2023
c763606
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen May 12, 2023
452ff19
Remove old set_inaccessible declaration
steffenlarsen May 15, 2023
d3f58d5
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen May 15, 2023
9cb8dbe
Reduce granularity query surface
steffenlarsen May 16, 2023
2db3ac8
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen May 16, 2023
17a1f3b
Adjust for recent plugin changes
steffenlarsen May 16, 2023
7e2844c
Adjust Windows symbols
steffenlarsen May 16, 2023
99ee1f8
Change shift of flag to 1-increment
steffenlarsen May 17, 2023
6d8aede
Fix use of old function in test
steffenlarsen May 17, 2023
8c5b692
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Jun 20, 2023
62f5dba
Fix formatting
steffenlarsen Jun 20, 2023
358b083
Fix formatting again
steffenlarsen Jun 20, 2023
be0f060
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Jun 22, 2023
391116c
Use correct handle for CUDA physical memory
steffenlarsen Jun 22, 2023
6b62bec
CUDA implementation fixes
steffenlarsen Jun 22, 2023
2385c0b
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Jun 22, 2023
633184e
Remove remaining RT:: uses
steffenlarsen Jun 22, 2023
d1d129f
Fix handles and use of incomplete types
steffenlarsen Jun 22, 2023
77d0232
Fix formatting
steffenlarsen Jun 22, 2023
8278db7
Fix naming and static
steffenlarsen Jun 22, 2023
3eb4f9b
Fix missing symbols
steffenlarsen Jun 22, 2023
3acb43a
Fix small mistakes in virtual_mem.cpp
steffenlarsen Jun 22, 2023
58ab3cb
Add missing __SYCL_EXPORT to declarations
steffenlarsen Jun 23, 2023
2f5638b
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Jun 29, 2023
2e8b031
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Jul 6, 2023
5d889c3
Add new aspect to config
steffenlarsen Jul 6, 2023
41cb1e6
Add missing semi-colon
steffenlarsen Jul 6, 2023
d7f720e
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Sep 1, 2023
1045a5c
Fix formatting
steffenlarsen Sep 1, 2023
3c05124
Order source files
steffenlarsen Sep 1, 2023
9d3529d
Remove redundant asserts
steffenlarsen Sep 1, 2023
93d7368
Remove obsolete workaround
steffenlarsen Sep 1, 2023
fca62eb
Fix getDeviceOrdinal
steffenlarsen Sep 1, 2023
47ba688
Fix getDeviceOrdinal attempt 2
steffenlarsen Sep 1, 2023
5e72a00
Fix includes in CUDA device adapter
steffenlarsen Sep 1, 2023
219ad30
Add missing cuda and hip symbols to dump
steffenlarsen Sep 1, 2023
3c61360
Add missing windows symbols
steffenlarsen Sep 4, 2023
7c87219
Set protection mode none if unspecified
steffenlarsen Sep 4, 2023
81286e4
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Oct 9, 2023
8b21ef4
Remove new adapter files
steffenlarsen Oct 9, 2023
1e1fe34
Update tag
steffenlarsen Oct 9, 2023
7dcb46c
Update tag
steffenlarsen Oct 10, 2023
5e49a0a
Update tag
steffenlarsen Oct 11, 2023
7c882dd
Update tag
steffenlarsen Oct 13, 2023
0a564f1
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Oct 16, 2023
394f8ed
Move OpenCL adapter changes
steffenlarsen Oct 16, 2023
aff695d
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Nov 21, 2023
08553af
Fix missed merge conflict
steffenlarsen Nov 21, 2023
b7f91ae
Remove unused files
steffenlarsen Nov 21, 2023
331fa45
Update tag
steffenlarsen Nov 21, 2023
150b9dc
Update tag
steffenlarsen Nov 21, 2023
538af56
Update tag
steffenlarsen Nov 21, 2023
0f29473
Update tag
steffenlarsen Nov 21, 2023
2879161
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Nov 21, 2023
d812df0
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Dec 8, 2023
c7d22eb
Bump tag
steffenlarsen Dec 8, 2023
d14ea5f
Fix formatting
steffenlarsen Dec 8, 2023
05f25b1
Fix aspect and missed merge conflict
steffenlarsen Dec 8, 2023
586b3e2
Missing comma
steffenlarsen Dec 8, 2023
774a5a5
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Dec 15, 2023
337634f
Add missing windows symbol
steffenlarsen Dec 15, 2023
851153c
Implement L0 limitation workaround
steffenlarsen Dec 15, 2023
3996b52
Fix formatting
steffenlarsen Dec 15, 2023
4215fa2
Bump tag
steffenlarsen Dec 15, 2023
a5739f1
Merge branch 'sycl' into steffen/virtual_mem_ext
steffenlarsen Dec 17, 2023
3cc68a4
Merge branch 'sycl' into steffen/virtual_mem_ext
steffenlarsen Jan 2, 2024
260ab05
Merge branch 'sycl' into steffen/virtual_mem_ext
steffenlarsen Jan 22, 2024
21397f0
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Mar 6, 2024
09fb2fe
Address spec wording comments
steffenlarsen Mar 6, 2024
08bbd83
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Apr 19, 2024
d3ae658
Unmap per VA range
steffenlarsen Apr 19, 2024
d75bacd
Fix formatting
steffenlarsen Apr 19, 2024
e7b2635
Clarify returned pointer from map
steffenlarsen Apr 19, 2024
3d8261e
Disallow multi-unmapping for now.
steffenlarsen Apr 19, 2024
9c252ed
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen May 23, 2024
55f015a
Add native_cpu PI interfaces
steffenlarsen May 23, 2024
6ea3b4e
Add pi nativecpu symbols
steffenlarsen May 27, 2024
0826458
Ext changes
steffenlarsen May 29, 2024
9b7282b
Remove simplified map
steffenlarsen May 29, 2024
1916da3
Add exception for out-of-memory
steffenlarsen May 29, 2024
1fa82c0
Change back to 2023
steffenlarsen May 29, 2024
997cc39
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen May 29, 2024
3be01f4
Remove windows symbol
steffenlarsen May 29, 2024
743eb2e
Add new context granularity query
steffenlarsen May 30, 2024
b34d068
Amend granularity reqs
steffenlarsen May 30, 2024
93db42b
Fix windows build
steffenlarsen May 30, 2024
eeaaad0
Fix wrong name
steffenlarsen May 30, 2024
e8c98b5
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen May 30, 2024
e5483be
Add missing Windows symbol
steffenlarsen May 30, 2024
290401c
Clarify map and access functions
steffenlarsen May 31, 2024
6630795
Change from host page size to minimum granularity
steffenlarsen May 31, 2024
5c7330d
Address comments
steffenlarsen Jun 20, 2024
187b0f7
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Jun 20, 2024
8dd8cb3
Change requirement for access mode functions
steffenlarsen Jun 20, 2024
327687b
Reword and add restriction
steffenlarsen Jun 20, 2024
639aab3
Fix formatting
steffenlarsen Jun 20, 2024
42cbcde
Address comments
steffenlarsen Jun 27, 2024
95cdcbb
Merge remote-tracking branch 'intel/sycl' into steffen/virtual_mem_ext
steffenlarsen Jun 27, 2024
56ff29c
Fix typo
steffenlarsen Jun 27, 2024
24dea7a
Add noexcept(false) to dtors
steffenlarsen Jun 27, 2024
f85146f
Merge branch 'sycl' into steffen/virtual_mem_ext
steffenlarsen Jun 27, 2024
6df0773
Merge branch 'sycl' into steffen/virtual_mem_ext
steffenlarsen Jul 1, 2024
0bd2ce3
Add physical_mem dtor body to avoid RHEL issues
steffenlarsen Jul 1, 2024
@@ -0,0 +1,347 @@
= sycl_ext_oneapi_virtual_mem

:source-highlighter: coderay
:coderay-linenums-mode: table

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en
:dpcpp: pass:[DPC++]

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}


== Notice

[%hardbreaks]
Copyright (C) 2023-2023 Intel Corporation. All rights reserved.

Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by
permission by Khronos.


== Contact

To report problems with this extension, please open a new issue at:

https://github.com/intel/llvm/issues


== Dependencies

This extension is written against the SYCL 2020 revision 6 specification. All
references below to the "core SYCL specification" or to section numbers in the
SYCL specification refer to that revision.


== Status

This is an experimental extension specification, intended to provide early
access to features and gather community feedback. Interfaces defined in this
specification are implemented in {dpcpp}, but they are not finalized and may
change incompatibly in future versions of {dpcpp} without prior notice.
*Shipping software products should not rely on APIs defined in this
specification.*


== Backend support status

The APIs in this extension may be used only on a device that has
`aspect::ext_oneapi_virtual_mem`. The application must check that the devices
in the corresponding context have this aspect before using any of the APIs
introduced in this extension. If the application fails to do this, the
implementation throws a synchronous exception with the
`errc::feature_not_supported` error code.

== Overview

This extension adds the notion of "virtual memory ranges" to SYCL, introducing
a way to map an address range onto multiple allocations of physical memory.
This allows users to avoid expensive reallocations and the risk of running out
of device memory while relocating the corresponding memory.


== Specification

=== Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification. An implementation supporting this extension must predefine the
macro `SYCL_EXT_ONEAPI_VIRTUAL_MEM` to one of the values defined in the table
below. Applications can test for the existence of this macro to determine if
the implementation supports this feature, or applications can test the macro's
value to determine which of the extension's features the implementation
supports.

[%header,cols="1,5"]
|===
|Value
|Description

|1
|The APIs of this experimental extension are not versioned, so the
feature-test macro always has this value.
|===
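The macro lends itself to a compile-time guard so that code also builds with
implementations lacking the extension. A minimal sketch, assuming a plain C++
translation unit where the macro is not predefined (`virtual_mem_available` and
`virtual_mem_version` are illustrative names, not part of the extension):

```cpp
// Feature-test macro guard: SYCL_EXT_ONEAPI_VIRTUAL_MEM is predefined by an
// implementation supporting this extension; elsewhere it is absent, so the
// fallback branch is taken.
#if defined(SYCL_EXT_ONEAPI_VIRTUAL_MEM)
constexpr bool virtual_mem_available = true;
constexpr int virtual_mem_version = SYCL_EXT_ONEAPI_VIRTUAL_MEM;
#else
constexpr bool virtual_mem_available = false;
constexpr int virtual_mem_version = 0;
#endif
```

Code paths that call the APIs below can then be compiled conditionally on
`virtual_mem_available`.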


=== Memory granularity

Working with virtual address ranges and the underlying physical memory requires
the user to align pointers and adjust sizes in accordance with a specified
minimum granularity.

In addition, contexts may specify a recommended granularity for a given device,
which may yield higher performance. The distinction between minimum and
recommended granularity is determined by the context and may vary between
devices.

The interfaces for querying these granularities are defined as:

```c++
namespace sycl::ext::oneapi::experimental {

size_t get_minimum_mem_granularity(const device &syclDevice, const context &syclContext);
size_t get_minimum_mem_granularity(const queue &syclQueue);
size_t get_minimum_mem_granularity(const physical_mem &syclPhysicalMem);

size_t get_recommended_mem_granularity(const device &syclDevice, const context &syclContext);
size_t get_recommended_mem_granularity(const queue &syclQueue);
size_t get_recommended_mem_granularity(const physical_mem &syclPhysicalMem);

} // namespace sycl::ext::oneapi::experimental
```

Contributor:

Is there a guarantee that the returned granularity is >= the numBytes input parameter? I think it would be good to clarify this one way or the other in the spec.

Contributor Author:

I don't believe we can make any such guarantees. The granularity is some value the user must align both the pointer and the size to.

As an example, consider a backend/device that always returns 1024 (note: CUDA doesn't care about numBytes): when the user queries the granularity for numBytes = 32, they get 1024 and must adjust accordingly, meaning they would need to reserve 1024 bytes. Continuing the example, if they asked for the granularity with numBytes = 1025, they would once again get 1024 and would adjust their reservation to 2048.

Contributor:

This makes sense. I think we should just clarify this for get_recommended_mem_granularity and say that the returned granularity could be less than, greater than, or equal to numBytes.

There's something about get_minimum_mem_granularity that seems weird. For example, suppose the device supports page sizes of both 1024 and 4096. If the user calls get_minimum_mem_granularity(4096), what will they get? Is the return value 1024 even though the requested numBytes doesn't fit in that granularity? This makes me wonder whether the minimum granularity should depend on numBytes at all.

Another weird thing is the word "minimum". This word implies that the application could also choose a larger granularity, but that's not the case. Presumably, the device supports a fixed set of granularities, and the application must choose one of them. This makes me wonder if the API should instead just return a list of all the supported granularities, like:

std::vector<size_t> get_mem_granularities(const device &syclDevice, const context &syclContext);

If the application wants the minimum one, they can just use the first element of the returned vector.

Contributor Author:

Currently, the L0 interface corresponding to the granularity query (zeVirtualMemQueryPageSize) does not differentiate between recommended and minimum, but does take the size. Conversely, the CUDA query (cuMemGetAllocationGranularity) doesn't have the size argument, but has both a minimum and a recommended mode.

Maybe we can remove the size from the minimum query by passing 1 to the L0 query and returning the corresponding value, but it is not clear to me whether that is actually always a valid granularity. @jandres742 - Do you know?

As for being able to return a list of valid granularities, I don't see how we could do that with the current L0 interfaces. For CUDA the list would have one or two elements (minimum and recommended, or just one if they are the same).

Contributor:

Right, @steffenlarsen. cuMemGetAllocationGranularity doesn't have a size parameter, so what that API effectively says is:

"Here's the minimum and recommended granularities, please adjust your requested size to it."

While zeVirtualMemQueryPageSize says:

"For the size you want, here's the minimum granularity you should use for functionality and performance."

So the semantics of the two APIs differ: CUDA's always returns the same numbers for a given type of allocation and the user needs to adjust the size, while L0's already returns the granularity adjusted to the size.

Now, the SYCL APIs proposed here accept a size, represented by numBytes:

sycl::get_recommended_mem_granularity(size_t numBytes...)

So I guess there's an expectation that the granularity returned by sycl::get_recommended_mem_granularity should take that size into account, which is what L0 is doing. So I don't think we should pass 1 to the L0 API. What I think we should do is modify cuda_piextVirtualMemGranularityGetInfo to not ignore the mem_size parameter and instead return the granularity based on that size, something like:

```c++
pi_result cuda_piextVirtualMemGranularityGetInfo(
    pi_context context, pi_device device, size_t mem_size,
    pi_virtual_mem_granularity_info param_name, size_t param_value_size,
    void *param_value, size_t *param_value_size_ret) {
  // ...
  size_t granularity;
  result = PI_CHECK_ERROR(
      cuMemGetAllocationGranularity(&granularity, &alloc_props, flags));

  granularity = ROUND_UP(mem_size, granularity);
  // ...
```

Now, if the intention of sycl::get_recommended_mem_granularity is to return a set of granularities and have the user adjust the size, then the current implementation in the CUDA backend is OK, and for L0 a size of 1 could be passed.

So the main question here is: what is the intention of sycl::get_recommended_mem_granularity? Is it to return the granularity based on the size passed (which is what L0 does, and for which we would need changes in the CUDA backend), or to ignore the size and return standard granularities (in which case it would be better to remove num_bytes from sycl::get_recommended_mem_granularity and to pass 1 to the L0 API)?

Contributor:

This latest change by @steffenlarsen aligns the SYCL API with CUDA, which makes migration easy, so that's good.

Does this cause us to lose some performance on Level Zero, though? Let's say the user wants a moderately big (1 MB) address range. With the current API, we'll call Level Zero zeVirtualMemQueryPageSize with size set to 1. Will this return a different answer than if we called it with size set to 1 MB? If it is different, will this result in worse performance?

Contributor Author:

An option is to add another PI function asking the backend to align for us. For CUDA we would just be using the recommended granularity, and L0 could then work its magic. It means we would have somewhat similar APIs, but we get the best of both worlds: new users could just leverage this instead of doing their own aligning, while people translating code have their 1:1 mapping in the existing functions.

Contributor:

I don't think there's any value in adding an API that just applies the alignment to the user's size. It's easy enough for the application to do that themselves.

I'm wondering whether Level Zero chooses a different recommended alignment for big vs. small sizes, for example. As a purely hypothetical example, let's say the h/w supports both small and big page sizes. In such a case, it would be better to allocate small data blocks using small pages and large data blocks using big pages. However, each page size would have a different alignment requirement. Is that what's going on with the Level Zero API?

Contributor:

This conversation seems to have stalled out waiting for a response from someone on the Level Zero team. Removing the "size" parameter from get_mem_granularity makes the API easier to use and easier for SYCLomatic to migrate CUDA code. That's all good.

I'm just a little worried that there will be some negative impact if we always pass a size of "1" to the Level Zero zeVirtualMemQueryPageSize call. Can someone on the Level Zero team say whether this will cause problems?

Comment:

For PVC and ATS-M, zeVirtualMemQueryPageSize will return 64 KB for any size less than 2 MB, and will return 2 MB for any size equal to or greater than 2 MB. Given this, passing a size of 1 should probably not be an issue.

[frame="topbot",options="header,footer"]
|=====================
|Function |Description

|`size_t get_minimum_mem_granularity(const device &syclDevice, const context &syclContext)` |
Returns the minimum granularity of physical and virtual memory allocations on
`syclDevice` in the `syclContext`.

If `syclDevice` does not have `aspect::ext_oneapi_virtual_mem` the call throws
an exception with `errc::feature_not_supported`.

|`size_t get_minimum_mem_granularity(const queue &syclQueue)` |
Same as `get_minimum_mem_granularity(syclQueue.get_device(), syclQueue.get_context())`.

|`size_t get_minimum_mem_granularity(const physical_mem &syclPhysicalMem)` |
Same as `get_minimum_mem_granularity(syclPhysicalMem.get_device(), syclPhysicalMem.get_context())`.

|`size_t get_recommended_mem_granularity(const device &syclDevice, const context &syclContext)` |
Returns the recommended granularity of physical and virtual memory allocations
on `syclDevice` in the `syclContext`.

If `syclDevice` does not have `aspect::ext_oneapi_virtual_mem` the call throws
an exception with `errc::feature_not_supported`.

|`size_t get_recommended_mem_granularity(const queue &syclQueue)` |
Same as `get_recommended_mem_granularity(syclQueue.get_device(), syclQueue.get_context())`.

|`size_t get_recommended_mem_granularity(const physical_mem &syclPhysicalMem)` |
Same as `get_recommended_mem_granularity(syclPhysicalMem.get_device(), syclPhysicalMem.get_context())`.

|=====================
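Since both pointers and sizes passed to the extension's APIs must be multiples
of the queried granularity, a caller typically rounds a requested size up
before reserving or mapping. A minimal sketch of that rounding; `align_up` is a
hypothetical helper, not part of the extension, and the granularity value would
come from `get_minimum_mem_granularity` or `get_recommended_mem_granularity`:

```cpp
#include <cstddef>

// Hypothetical helper: round numBytes up to the nearest multiple of
// granularity, as a caller must do before reserving or mapping a range.
// Assumes granularity > 0.
constexpr std::size_t align_up(std::size_t numBytes, std::size_t granularity) {
  return ((numBytes + granularity - 1) / granularity) * granularity;
}

// With a granularity of 1024 (the example from the review discussion above):
// a request of 32 bytes must reserve 1024 bytes, and a request of 1025 bytes
// must reserve 2048 bytes.
```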

=== Reserving virtual address ranges

Virtual address ranges are represented by a pointer and a number of bytes
reserved for it. The pointer must be aligned in accordance with the minimum
granularity, as queried through `get_minimum_mem_granularity`, and likewise the
number of bytes must be a multiple of this granularity. It is the responsibility
of the user to manage the constituents of any virtual address range they
reserve.

The interfaces for reserving and freeing a virtual address range are defined
as:

```c++
namespace sycl::ext::oneapi::experimental {

uintptr_t reserve_virtual_mem(uintptr_t start, size_t numBytes, const context &syclContext);
uintptr_t reserve_virtual_mem(size_t numBytes, const context &syclContext);

void free_virtual_mem(uintptr_t ptr, size_t numBytes, const context &syclContext);

} // namespace sycl::ext::oneapi::experimental
```
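The reserve-then-map split mirrors how virtual address space is handled in OS
APIs. As an illustration only (this is a POSIX sketch, not the SYCL API),
reserving inaccessible address space and later granting access looks like
this; `reserve_range`, `commit_range`, and `release_range` are hypothetical
names playing the roles of `reserve_virtual_mem`, mapping, and
`free_virtual_mem`:

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Reserve numBytes of virtual address space with no access rights,
// loosely analogous to reserve_virtual_mem (POSIX sketch, not SYCL).
void *reserve_range(std::size_t numBytes) {
  void *p = mmap(nullptr, numBytes, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? nullptr : p;
}

// Grant read/write access to part of the reservation, loosely analogous
// to mapping physical memory into the range.
bool commit_range(void *p, std::size_t numBytes) {
  return mprotect(p, numBytes, PROT_READ | PROT_WRITE) == 0;
}

// Release the reservation, loosely analogous to free_virtual_mem.
void release_range(void *p, std::size_t numBytes) {
  munmap(p, numBytes);
}
```

As in the SYCL extension, the reserved range is unusable until memory is
mapped into it, and the pointer and size handed back to the free function must
describe the original reservation.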

[frame="topbot",options="header,footer"]
|=====================
|Function |Description

|`uintptr_t reserve_virtual_mem(uintptr_t start, size_t numBytes, const context &syclContext)` |
Reserves a virtual memory range in `syclContext` with `numBytes` bytes.
Contributor:

I just realized that there is no device parameter here. Does this mean that the call reserves the address range in all devices in syclContext? What if those devices have different required granularities? Must the address range satisfy the required granularity for all devices in the context? I wonder if there should be a device parameter here?

If we decide there is not a device parameter to this call, then I think we should remove the device parameter from get_minimum_mem_granularity and get_recommended_mem_granularity. These APIs all work together. It makes no sense to get the memory granularity for one specific device if the allocation API requires all devices in the context to have that granularity.

Contributor Author:

The CUDA interface has the restriction

The size and address parameters must be a multiple of the host page size and the alignment must be a power of two or zero for default alignment.

while the Level Zero interface mentions the page size (used here as the granularity) as

The starting address and size must be page aligned. See zeVirtualMemQueryPageSize.

Neither takes a device, despite the granularity queries taking a device in both interfaces. I am not sure whether either backend will ever return different minimums; I suspect the actual requirement for the alignment and size comes into play when you map the range onto physical memory, which is allocated on specific devices. Depending on how we should read the Level Zero requirement here, we could rephrase the reservation interface requirement to be that it must be aligned in accordance with the granularity of any device it will be mapped to. Arguably, this is more of an implicit requirement from the map function, though.

Contributor:

Since neither the Level Zero nor the CUDA API takes a device handle, I assume that both APIs must be reserving the address range in all devices contained by the context. Would you agree?

In that case, wouldn't it make sense to remove the device parameter from get_minimum_mem_granularity and get_recommended_mem_granularity also?

Contributor Author:

I would be okay with it, but what would it do if the devices report different granularities? I assume the best solution would be to try to find the smallest value that is a multiple of all the reported granularities.

Contributor:

Let's ask @jandres742 about the Level Zero API.

The documentation for zeVirtualMemReserve says:

The starting address and size must be page aligned. See zeVirtualMemQueryPageSize

However, zeVirtualMemQueryPageSize takes an hDevice parameter while zeVirtualMemReserve does not. What does the statement I quote above mean exactly? Does it mean that the application must call zeVirtualMemQueryPageSize for every device in the hContext and find an address that is aligned properly for every one of those devices?

Contributor:

Is this a reasonable restriction to have for other backends? Are page sizes on the Intel GPUs always a multiple of the host page size anyways?

Contributor:

@gmlueck :

Does zeVirtualMemReserve have any hard requirements on the alignment of the pStart and size parameters?
Or, is it only zeVirtualMemMap that has an alignment requirement?

Both, as mentioned in the spec:

"The starting address and size must be page aligned. See zeVirtualMemQueryPageSize."

"The virtual start address and size must be page aligned. See zeVirtualMemQueryPageSize."

Contributor:

This conversation never got resolved. I think the core problem is that the following statement is unclear:

start must be aligned in accordance with either the minimum or recommended granularity, as returned by a call to get_mem_granularity. Likewise, numBytes must be a multiple of the granularity.

The function get_mem_granularity returns the allocation granularity of a particular device, however reserve_virtual_mem does not take a device parameter. Therefore, it's not clear what granularity we are talking about in the statement I quote above.

I think the solution might be to remove the syclDevice parameter from get_mem_granularity. As a result, get_mem_granularity would return the allocation granularity for a context (not a device). This solves the problem because reserve_virtual_mem also takes a context.

However, this probably requires a change to Level Zero because zeVirtualMemQueryPageSize returns a page size for a particular device, not for a context. See my comments in this thread above, though. I think the Level Zero API also has problems, which would be solved by changing the definition of zeVirtualMemQueryPageSize.

Comment:

Examining Level Zero code: the zeVirtualMemQueryPageSize() API calls the following internal function:

```c++
ze_result_t ContextImp::queryVirtualMemPageSize(ze_device_handle_t hDevice,
                                                size_t size,
                                                size_t *pagesize) {
  // Retrieve the page size and heap required for this allocation size requested.
  getPageAlignedSizeRequired(size, nullptr, pagesize);
  return ZE_RESULT_SUCCESS;
}
```

So, the hDevice handle is not used. Going down the call tree, the following is eventually called:

```c++
size_t DrmMemoryManager::selectAlignmentAndHeap(size_t size, HeapIndex *heap) {
    AlignmentSelector::CandidateAlignment alignmentBase = alignmentSelector.selectAlignment(size);
    size_t pageSizeAlignment = alignmentBase.alignment;
    auto rootDeviceCount = this->executionEnvironment.rootDeviceEnvironments.size();

    // If all devices can support HEAP EXTENDED, then that heap is used,
    // otherwise the HEAP based on the size is used.
    for (auto rootDeviceIndex = 0u; rootDeviceIndex < rootDeviceCount; rootDeviceIndex++) {
        auto gfxPartition = getGfxPartition(rootDeviceIndex);
        if (gfxPartition->getHeapLimit(HeapIndex::heapExtended) > 0) {
            auto alignSize = size >= 8 * MemoryConstants::gigaByte && Math::isPow2(size);
            if (debugManager.flags.UseHighAlignmentForHeapExtended.get() != -1) {
                alignSize = !!debugManager.flags.UseHighAlignmentForHeapExtended.get();
            }

            if (alignSize) {
                pageSizeAlignment = Math::prevPowerOfTwo(size);
            }

            *heap = HeapIndex::heapExtended;
        } else {
            pageSizeAlignment = alignmentBase.alignment;
            *heap = alignmentBase.heap;
            break;
        }
    }
    return pageSizeAlignment;
}
```

So all detected devices are iterated over in this loop. I confirmed this on a board with 4x ATS-M devices: all 4 devices were looped over. Based on this, only one call to zeVirtualMemQueryPageSize() should be required for the given driver.

Contributor

Any chance we can change the parameters of zeVirtualMemQueryPageSize to remove hDevice or document that the hDevice parameter isn't used? It would be nice if SYCL could rely on documented behavior of Level Zero, rather than making an assumption based on the current implementation.


`start` specifies the requested start of the new virtual memory range
reservation. If the implementation is unable to reserve the virtual memory range
at the specified address, the implementation will pick another suitable address.

`start` must be aligned in accordance with the minimum granularity, as returned
by a call to `get_minimum_mem_granularity`. Likewise, `numBytes` must be a
multiple of the granularity. Attempting to call this function without meeting
these requirements results in undefined behavior.

If any of the devices in `syclContext` does not have
`aspect::ext_oneapi_virtual_mem`, the call throws an exception with
`errc::feature_not_supported`.

|`uintptr_t reserve_virtual_mem(size_t numBytes, const device &syclDevice, const context &syclContext)` |
Same as `reserve_virtual_mem(0, numBytes, syclDevice, syclContext)`.

|`void free_virtual_mem(uintptr_t ptr, size_t numBytes, const context &syclContext)` |
Frees a virtual memory range specified by `ptr` and `numBytes`. `ptr` must be
the same as returned by a call to `reserve_virtual_mem` and `numBytes` must be
the same as the size of the range specified in the reservation call.

The virtual memory range must not currently be mapped to physical memory. A call
to this function with a mapped virtual memory range results in undefined
behavior.

|=====================


=== Physical memory representation

:crs: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:reference-semantics

To represent the underlying physical device memory that a virtual address range
can be mapped to, the `physical_mem` class is added. This new class is defined
as:

```c++
namespace sycl::ext::oneapi::experimental {

enum class address_access_mode : /*unspecified*/ {
none,
read,
read_write
};

class physical_mem {
public:
physical_mem(const device &syclDevice, const context &syclContext, size_t numBytes);
physical_mem(const queue &syclQueue, size_t numBytes);

/* -- common interface members -- */

void *map(uintptr_t ptr, size_t numBytes, size_t offset = 0) const;
void *map(uintptr_t ptr, size_t numBytes, address_access_mode mode, size_t offset = 0) const;

context get_context() const;
device get_device() const;

size_t size() const noexcept;
};

} // namespace sycl::ext::oneapi::experimental
```

`physical_mem` has common reference semantics, as described in
{crs}[section 4.5.2. Common reference semantics].

[frame="topbot",options="header,footer"]
|============================
|Member function |Description

|`physical_mem(const device &syclDevice, const context &syclContext, size_t numBytes)` |
Constructs a `physical_mem` instance using the `syclDevice` provided. This
device must either be contained by `syclContext` or it must be a descendant
device of some device that is contained by that context; otherwise this function
throws a synchronous exception with the `errc::invalid` error code.

This will allocate `numBytes` of physical memory on the device. `numBytes` must
be a multiple of the minimum granularity, as returned by a call to
`get_minimum_mem_granularity`.

|`physical_mem(const queue &syclQueue, size_t numBytes)` |
Same as `physical_mem(syclQueue.get_device(), syclQueue.get_context(), numBytes)`.

|`void *map(uintptr_t ptr, size_t numBytes, size_t offset = 0)` |
Same as `map(ptr, numBytes, address_access_mode::none, offset)`.

|`void *map(uintptr_t ptr, size_t numBytes, address_access_mode mode, size_t offset = 0)` |
Maps a virtual memory range, specified by `ptr` and `numBytes`, to the physical
memory corresponding to this instance of `physical_mem`, starting at an offset
of `offset` bytes.

If `mode` is `address_access_mode::read` or `address_access_mode::read_write`,
the returned pointer is accessible after the call as read-only or read-write,
respectively. Otherwise, it is considered inaccessible, and accessing it results
in undefined behavior.

Writing to any address in the virtual memory range with access mode set to
`address_access_mode::read` results in undefined behavior.

An accessible pointer behaves the same as a pointer to device USM memory and can
be used in place of a device USM pointer in any interface accepting one.

Contributor

I assume there are limitations about mapping the same memory twice without an intervening unmap. There are two cases:

  • Can you map a new virtual address over physical memory that is already mapped to a different address? I presume the answer is "no", but we should state this explicitly. For example:

    Attempting to map virtual memory to any portion of physical memory that is already mapped results in undefined behavior.

  • If a virtual memory address is mapped to some physical memory, can that virtual memory also be mapped to some different physical memory? Again, I presume the answer is "no", so maybe:

    A virtual memory address cannot be simultaneously mapped to more than one physical memory region. Attempting to violate this results in undefined behavior.

Contributor Author

Can you map a new virtual address over physical memory that is already mapped to a different address?

I see nothing in both CUDA and L0 disallowing this.

If a virtual memory address is mapped to some physical memory, can that virtual memory also be mapped to some different physical memory?

This is definitely not allowed. The suggested wording has been added. 👍

Contributor

Can you map a new virtual address over physical memory that is already mapped to a different address?

I see nothing in both CUDA and L0 disallowing this.

That's interesting. What happens in this case? Is the physical memory automatically unmapped from the old virtual address range?

Contributor Author

I would assume it would be like having two pointers to the same memory location, but I don't know for certain.

Contributor

Do we have any tests for this scenario? It seems like a weird case, so I think we should have some tests if we are going to say that it is supported. For example, the test should try writing a value to one address and then reading from the other. Maybe something like:

```c++
int *p1 = /* some mapped address */;
int *p2 = /* a different address that points to the same physical memory */;

*p1 = VALUE1;
*p2 = VALUE2;
assert(*p1 == VALUE2);
```

I would actually be surprised if this works because:

  • The compiler would need to know that p1 and p2 alias each other.
  • The hardware cache would need to know that p1 and p2 alias each other.

Contributor Author

Discussion offline: Unlikely to work and even if it does, it may only work for certain cases. We restrict it for now and then we can consider loosening it in the future if needed. The restriction has been added.


|`context get_context() const` |
Returns the SYCL context associated with the instance of `physical_mem`.

|`device get_device() const` |
Returns the SYCL device associated with the instance of `physical_mem`.

|`size_t size() const` |
Returns the size of the corresponding physical memory in bytes.

|============================

Virtual memory address ranges are mapped to a `physical_mem` through its
`map` member functions, where the access mode can also be specified.
To subsequently get or set the access mode of a mapped virtual address range,
the user does not need the associated `physical_mem`; the following free
functions can be called instead.

```c++
namespace sycl::ext::oneapi::experimental {

void set_access_mode(const void *ptr, size_t numBytes, address_access_mode mode, const context &syclContext);

address_access_mode get_access_mode(const void *ptr, size_t numBytes, const context &syclContext);

void unmap(const void *ptr, size_t numBytes, const context &syclContext);

} // namespace sycl::ext::oneapi::experimental
```

[frame="topbot",options="header,footer"]
|=====================
|Function |Description

|`void set_access_mode(const void *ptr, size_t numBytes, address_access_mode mode, const context &syclContext)` |
Sets the access mode of a virtual memory range specified by `ptr` and
`numBytes`.

Contributor

Must ptr and numBytes be the exact range from some previous call to physical_mem::map, or may they be a subrange?

If a subrange is allowed, do ptr and numBytes need to be a multiple of the minimum granularity?

Contributor Author

It seems like both backends allow sub-range for this, but L0 talks about "page size" but doesn't refer to the zeVirtualMemQueryPageSize query like other APIs, so I am unsure whether it refers to the host page size or the result of that query. I have assumed the latter.

Contributor

Can you check your wording, then? The wording says:

ptr must be aligned to the minimum memory granularity of syclContext and numBytes must be a multiple of the minimum memory granularity of syclContext.

This would correspond to the host page size, right? Not the value returned by zeVirtualMemQueryPageSize.

Contributor Author

No, for L0 this would correspond to the LCM of zeVirtualMemQueryPageSize of all devices in that context. For CUDA it would be the host page size.

Contributor

I see that you changed the wording in 8dd8cb3 to say that the ptr must be aligned to the device's minimum granularity. This clarifies my concern about the wording, so I think we can resolve this issue.


If `mode` is `address_access_mode::read` or `address_access_mode::read_write`,
the pointer `ptr` is accessible after the call as read-only or read-write,
respectively. Otherwise, it is considered inaccessible, and accessing it results
in undefined behavior.

Writing to any address in the virtual memory range with access mode set to
`address_access_mode::read` results in undefined behavior.

An accessible pointer behaves the same as a pointer to device USM memory and can
be used in place of a device USM pointer in any interface accepting one.

|`address_access_mode get_access_mode(const void *ptr, size_t numBytes, const context &syclContext)` |
Returns the access mode of the virtual memory range specified by `ptr` and
`numBytes`. If the virtual memory range is inaccessible `std::nullopt` is
returned.
steffenlarsen marked this conversation as resolved.
Show resolved Hide resolved

|`void unmap(const void *ptr, size_t numBytes, const context &syclContext)` |
Unmaps the range specified by `ptr` and `numBytes`. The range must have been
mapped through a call to `physical_mem::map()` prior to calling this. The range
must not be a proper sub-range of a previously mapped range, but `ptr` and
`numBytes` may span multiple contiguous ranges. `syclContext` must be the same
as the context returned by the `get_context()` member function on the
`physical_mem` the address ranges are currently mapped to.

After this call, the full range will again be available to be mapped through
calls to `physical_mem::map()`.

|=====================
sycl/include/sycl/detail/pi.def (12 additions, 0 deletions)

@@ -163,4 +163,16 @@ _PI_API(piextQueueCreate2)
_PI_API(piextQueueGetNativeHandle2)
_PI_API(piextQueueCreateWithNativeHandle2)

// Virtual memory
_PI_API(piextVirtualMemGranularityGetInfo)
_PI_API(piextPhysicalMemCreate)
_PI_API(piextPhysicalMemRetain)
_PI_API(piextPhysicalMemRelease)
_PI_API(piextVirtualMemReserve)
_PI_API(piextVirtualMemFree)
_PI_API(piextVirtualMemMap)
_PI_API(piextVirtualMemUnmap)
_PI_API(piextVirtualMemSetAccess)
_PI_API(piextVirtualMemGetInfo)

#undef _PI_API