
Provide an Implementation of ONEAPI_DEVICE_SELECTOR #220

Closed
alycm opened this issue Feb 7, 2023 · 28 comments
Labels: enhancement (New feature or request)

@alycm commented Feb 7, 2023

Currently SYCL and OpenMP have their own implementations of ONEAPI_DEVICE_SELECTOR, see the compiler docs. These could be eliminated if Unified Runtime provided a single solution.

The expectation would be that Unified Runtime implements ONEAPI_DEVICE_SELECTOR directly while matching the current behavior, since end users may already rely on this feature.

Level Zero also has ZE_AFFINITY_MASK, see the Level Zero programming guide, which does something similar. I believe that ONEAPI_DEVICE_SELECTOR replaces ZE_AFFINITY_MASK, but their interaction is something to consider.

@alycm added the needs-discussion label Feb 7, 2023
@jandres742

@alycm thanks.

The idea is that oneAPI applications (or, in this case, applications using UR) set either ONEAPI_DEVICE_SELECTOR or ZE_AFFINITY_MASK.

The main idea is that with ONEAPI_DEVICE_SELECTOR, oneAPI libraries expose only a subset of devices to the application, while letting all middleware libraries (and UR) have full visibility of all the devices in the system.

This is very useful, for instance, for MPI+SYCL applications running on multi-device systems, where each rank has visibility of just one device or a subset of them, while MPI has full visibility of all devices, allowing it to implement optimizations that rely on knowledge of the full system topology.

Now, if an application chooses to use both ONEAPI_DEVICE_SELECTOR and ZE_AFFINITY_MASK, then ZE_AFFINITY_MASK defines what ONEAPI_DEVICE_SELECTOR may expose, given that L0 sits lower in the stack and closer to the devices.

In that sense, I think UR would just read ONEAPI_DEVICE_SELECTOR and mask the devices that L0 already exposes to the layers above:

Example 0 (typical usage):
8-GPU system, d0 to d7
ZE_AFFINITY_MASK not set: L0 exposes zeDevice0 (d0) to zeDevice7 (d7)
ONEAPI_DEVICE_SELECTOR set to expose only 2 devices: the application sees only 2 devices, zeDevice0=d0 and zeDevice1=d1

Example 1 (more complex, but still legal):
8-GPU system, d0 to d7
ZE_AFFINITY_MASK set to expose only d3 to d7: L0 exposes zeDevice0 (d3) to zeDevice4 (d7)
ONEAPI_DEVICE_SELECTOR set to expose only 2 devices: the application sees only 2 devices, zeDevice0=d3 and zeDevice1=d4

So, from the UR point of view, ONEAPI_DEVICE_SELECTOR would apply on top of whichever devices L0 decides to expose through the L0 device query interfaces.
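
A minimal sketch of that composition (not UR code; the helper and handle type are illustrative): L0 applies ZE_AFFINITY_MASK first, then UR applies the selector on top of whatever L0 exposes, so selector indices always refer to L0's already-masked numbering.

```cpp
#include <algorithm>
#include <vector>

using Device = void *; // stand-in for ur_device_handle_t

// `exposed` is what L0 reports after ZE_AFFINITY_MASK is applied, so its
// indices may already be remapped (zeDevice0 == physical d3 in Example 1).
std::vector<Device> applySelector(const std::vector<Device> &exposed,
                                  size_t selectorCount) {
  // "Expose only 2 devices" means the first two *exposed* devices,
  // whichever physical devices they happen to map to.
  size_t n = std::min(selectorCount, exposed.size());
  return {exposed.begin(), exposed.begin() + n};
}
```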

@alycm (Author) commented Jun 7, 2023

@Wee-Free-Scot is working on this, but I can't assign the ticket to him as he isn't a member of the project.

@Wee-Free-Scot (Contributor)

Thanks @alycm -- I am indeed working on this. Is there any way I can be added to the project?

@kbenzie (Contributor) commented Jun 8, 2023

I've made the request to add you @Wee-Free-Scot

@Wee-Free-Scot self-assigned this Jun 9, 2023
@Wee-Free-Scot (Contributor)

Goal: there is a single env var (ODS) that consistently affects how devices are exposed by many oneAPI components.

Relevance to UR: "consistently" implies a single code-base for the parts of the functionality that are common.

Code/API design concerns:

Some targets need to see all the "real" devices (i.e. ignore the ODS env var): MPI, oneCCL, and oneSHMEM.
Some targets need to see only "selected" devices (i.e. obey the ODS env var): SYCL, OpenMP, CHIP-SPV, Julia.
(Other oneAPI components: TBC.)

Of those targets that need "selected" devices:
some permit at most one backend: OpenMP and CHIP-SPV;
some permit multiple backends: SYCL.
(Others: TBC.)

Common functionality:

Figuring out whether the ODS env var is set.
Reading the ODS env var string from the environment.
Parsing the ODS string into terms.
If all terms are exclusions, prefix an additional "accept any device" term.
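
A sketch of those four common steps, assuming the documented term syntax ([!]backend:devicelist, terms separated by semicolons, e.g. "level_zero:0;!cuda:*"); the struct and function names are illustrative:

```cpp
#include <algorithm>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

struct OdsTerm {
  bool exclude;         // the term began with '!'
  std::string backend;  // e.g. "level_zero", "opencl", or "*"
  std::string devices;  // e.g. "0,1", "*", or "0.1" for a sub-device
};

std::vector<OdsTerm> parseOds() {
  const char *env = std::getenv("ONEAPI_DEVICE_SELECTOR"); // steps 1 and 2
  std::vector<OdsTerm> terms;
  std::stringstream ss(env ? env : "");
  std::string term;
  while (std::getline(ss, term, ';')) { // step 3: split into terms
    OdsTerm t{};
    t.exclude = !term.empty() && term.front() == '!';
    std::string body = t.exclude ? term.substr(1) : term;
    auto colon = body.find(':');
    t.backend = body.substr(0, colon);
    t.devices = (colon == std::string::npos) ? "*" : body.substr(colon + 1);
    terms.push_back(t);
  }
  // Step 4: if every term is an exclusion, prefix an implicit
  // "accept any device" term so something remains selectable.
  if (!terms.empty() &&
      std::all_of(terms.begin(), terms.end(),
                  [](const OdsTerm &t) { return t.exclude; }))
    terms.insert(terms.begin(), OdsTerm{false, "*", "*"});
  return terms;
}
```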

Diverging functionality:

Return list of devices, ignoring ODS entirely.
Return list of devices, obeying only ODS terms for backend X (supplied by caller).
Return list of devices, obeying all ODS terms.

Suggestion -- UR needs 3 APIs:

  1. urGetRealDevices
  2. urGetSelectedDevices
  3. urGetSelectedDevicesForPlatform
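
Hypothetical signatures for the three suggested entry points, modelled on the existing urDeviceGet convention (none of these names exist in UR today):

```cpp
ur_result_t urGetRealDevices( // ignore ODS entirely
    ur_platform_handle_t hPlatform, ur_device_type_t DeviceType,
    uint32_t NumEntries, ur_device_handle_t *phDevices, uint32_t *pNumDevices);

ur_result_t urGetSelectedDevices( // obey all ODS terms, across all platforms
    ur_device_type_t DeviceType, uint32_t NumEntries,
    ur_device_handle_t *phDevices, uint32_t *pNumDevices);

ur_result_t urGetSelectedDevicesForPlatform( // obey ODS terms for one backend
    ur_platform_handle_t hPlatform, ur_device_type_t DeviceType,
    uint32_t NumEntries, ur_device_handle_t *phDevices, uint32_t *pNumDevices);
```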

@pbalcer (Contributor) commented Jun 14, 2023

Instead of adding new APIs for enumerating devices, it might be easier to decide whether to use "OneAPI Device Selector" (ODS) based on some flag to urInit, either as an opt-in or an opt-out.

@Wee-Free-Scot (Contributor) commented Jun 14, 2023

The flag might need to be more than just yes/no -- some callers permit only one platform/backend and some might need to pass in the name of that one platform/backend.

CHIP-SPV assumes "there is one backend and it is Level Zero", unless an env var is set that switches it to "there is one backend and it is OpenCL". CHIP-SPV could read its env var, decide between only-L0 or only-OCL, and then initialize UR passing in the backend name, e.g. "level_zero" (string or an enum?).

OpenMP assumes "there is one backend" but does not (AFAIK) offer a way to control which is the chosen backend. There is an open question of how OpenMP should make the choice. One option is to choose the first backend mentioned in ODS.

So the flag would at least need values meaning "ignore ODS" (for MPI), "single platform: <this one>" (for CHIP-SPV), and "multi-platform" (for SYCL). Perhaps there is another value meaning "single platform: you choose for me" (for OpenMP).
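
As a sketch only, such a flag might look like the following; the enumerators are illustrative and not part of UR:

```cpp
// Hypothetical urInit() mode covering the four cases described above.
typedef enum ur_ods_mode_t {
  UR_ODS_MODE_IGNORE,           // MPI et al.: expose all "real" devices
  UR_ODS_MODE_SINGLE_PLATFORM,  // CHIP-SPV: caller names the one backend
  UR_ODS_MODE_SINGLE_AUTO,      // OpenMP: UR picks the one backend
  UR_ODS_MODE_MULTI_PLATFORM    // SYCL: obey every ODS term
} ur_ods_mode_t;
```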

@pbalcer (Contributor) commented Jun 14, 2023

Sounds like there's some overlap with what we wanted to do in #355.

@kbenzie (Contributor) commented Jun 14, 2023

It seems to me that the default path should respect ONEAPI_DEVICE_SELECTOR and that the real devices query should be the opt in path for the environments where that is needed.

Suggestion -- UR needs 3 APIs:
1. urGetRealDevices
2. urGetSelectedDevices
3. urGetSelectedDevicesForPlatform

Since we already have urDeviceGet I think it would make sense for it to remain the default path, i.e. the path that respects ONEAPI_DEVICE_SELECTOR. It already takes a platform handle as the first argument, so I think it fulfils API 3.

What differentiates 2 and 3 here? Is urGetSelectedDevices intended to return all devices for all platforms? If so, that's a mismatch with how urDeviceGet currently works, as we must first enumerate all platforms before enumerating all devices.

The flag might need to be more than just yes/no -- some callers permit only one platform/backend and some might need to pass in the name of that one platform/backend.

Given that ONEAPI_DEVICE_SELECTOR also filters backends (which have a 1:1 mapping to platforms in L0, CUDA, and HIP, but not in OpenCL), do we also need a way to get the list of "real" platforms/backends?

Instead of adding new APIs for enumerating devices, it might be easier to decide whether to use "OneAPI Device Selector" (ODS) based on some flag to urInit, either as an opt-in or an opt-out.

I've wondered about this also, but my mental model (which may be incorrect) for urInit is that it would result in a platform not even being initialized, and thus the real-devices query would not function as expected in those environments.

@Wee-Free-Scot (Contributor)

API (2) is intended to directly support this SYCL code:

auto devices = device::get_devices(info::device_type::gpu);

This returns all GPU devices exposed by the ODS env var (or all GPU devices from all backends/platforms, if ODS is not set, as if ODS had been set to ":").

Of course, this could be implemented as:
a. Get a list of all platforms (either all platforms known to UR or all platforms mentioned in ODS -- they have the same result)
b. For each platform in the list
c. Get a list of all devices for that platform -- urDeviceGet( /* platform */ ) -- API (3)
d. Concatenate the devices for that platform into a devices-found-so-far list
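
A sketch of steps (a)-(d), using the existing urDeviceGet as API (3) and assuming the platform list comes from urPlatformGet:

```cpp
#include <vector>
#include <ur_api.h>

std::vector<ur_device_handle_t>
getAllSelectedDevices(const std::vector<ur_platform_handle_t> &platforms) {
  std::vector<ur_device_handle_t> found; // (d) devices-found-so-far list
  for (ur_platform_handle_t platform : platforms) { // (b)
    uint32_t count = 0; // (c) query the count, then the handles
    urDeviceGet(platform, UR_DEVICE_TYPE_GPU, 0, nullptr, &count);
    std::vector<ur_device_handle_t> devices(count);
    urDeviceGet(platform, UR_DEVICE_TYPE_GPU, count, devices.data(), nullptr);
    found.insert(found.end(), devices.begin(), devices.end());
  }
  return found;
}
```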

This is how SYCL currently implements ODS support, probably because there is no abstraction layer that can hide the loop.

So, API (2) is not strictly necessary -- it is a wrapper over repeated calls to API (3).

If SYCL is really the only UR client that is interested in a cross-platform device list, then SYCL should be forced to call API (3) repeatedly. Alternatively, there could be a UR sentinel value (e.g. urWildcardPlatform) that means "all platforms".

@Wee-Free-Scot (Contributor)

backends (which are a 1:1 mapping to platforms in L0, CUDA, and HIP, but not OpenCL)

OpenCL is one of the backends that ODS understands.

I was assuming that backend, platform, and adapter are effectively synonyms in the context of UR. Is that not true?

@kbenzie (Contributor) commented Jun 14, 2023

OpenCL is one of the backends that ODS understands.

I was assuming that backend, platform, and adapter are effectively synonyms in the context of UR. Is that not true?

For CUDA and HIP, which assume only NVIDIA or only AMD respectively, there is one platform per adapter, and that is analogous to a SYCL backend.

For Level Zero I'm less sure, but looking at zeDriverGet, that looks a lot like urPlatformGet if you squint your eyes. I think that leaves the possibility that the Level Zero adapter could report multiple platforms, one per driver.

For the OpenCL adapter we actually pass on the underlying OpenCL platforms, so it depends on what OpenCL drivers are installed and enumerated by the OpenCL ICD Loader.

Stated differently:

  • A SYCL backend has a 1:1 mapping to a Unified Runtime adapter
  • A Unified Runtime adapter may expose multiple platforms

@pbalcer (Contributor) commented Jun 15, 2023

Instead of adding new APIs for enumerating devices, it might be easier to decide whether to use "OneAPI Device Selector" (ODS) based on some flag to urInit, either as an opt-in or an opt-out.

I've wondered about this also, but my mental model (which may be incorrect) for urInit is that it would result in a platform not even being initialized, and thus the real-devices query would not function as expected in those environments.

But should software ever need to query both "selected" and "real" devices? I'm not sure.

CHIP-SPV assumes "there is one backend and it is Level Zero", unless an env var is set that switches it to "there is one backend and it is OpenCL". CHIP-SPV could read its env var, decide between only-L0 or only-OCL, and then initialize UR passing in the backend name, e.g. "level_zero" (string or an enum?).

OpenMP assumes "there is one backend" but does not (AFAIK) offer a way to control which is the chosen backend. There is an open question of how OpenMP should make the choice. One option is to choose the first backend mentioned in ODS.

So the flag would at least need values meaning "ignore ODS" (for MPI), "single platform: <this one>" (for CHIP-SPV), and "multi-platform" (for SYCL). Perhaps there is another value meaning "single platform: you choose for me" (for OpenMP).

I think this conflates two separate things: the administrative task of filtering available platforms through an environment variable, and software programmatically selecting its supported configuration. The former lets the user of the software pick what they want to run with; the latter lets the software developer restrict their supported configurations. Both are valid functionalities, but I'm not sure they can or should be solved using the same mechanism. Especially since software can already selectively use platforms by querying UR_PLATFORM_INFO_BACKEND, I think the existing UR API covers the CHIP-SPV and OpenMP cases (provided there will be a way to opt out of ODS).
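
For instance, a client like CHIP-SPV could already restrict itself to one backend along these lines (a sketch; error handling omitted):

```cpp
#include <vector>
#include <ur_api.h>

std::vector<ur_platform_handle_t>
platformsWithBackend(const std::vector<ur_platform_handle_t> &platforms,
                     ur_platform_backend_t wanted) {
  std::vector<ur_platform_handle_t> out;
  for (ur_platform_handle_t p : platforms) {
    ur_platform_backend_t backend;
    urPlatformGetInfo(p, UR_PLATFORM_INFO_BACKEND, sizeof(backend), &backend,
                      nullptr);
    if (backend == wanted) // e.g. UR_PLATFORM_BACKEND_LEVEL_ZERO
      out.push_back(p);
  }
  return out;
}
```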

@Wee-Free-Scot (Contributor)

the existing UR API ... provided that there will be a way to opt-out of ODS

I think the idea behind ODS was to replace all of the different env vars that each component had invented to solve this common problem. We might instead be heading towards "one env var to rule them all" -- that is, keeping a bunch of the existing env vars but defining how they interact with the one env var (ODS).

should software ever need to query both "selected" and "real" devices?

A possible software application usage might be to try adding SYCL offload to a program that already uses MPI for inter-node communication. This software would query both: MPI would need to see all "real" devices; SYCL would want to see only the "selected" devices.

It is harder to come up with a non-pathological example of needing two different "selected" sets concurrently: for example, some part of the application does offload via SYCL, whereas another part does offload using OpenMP.

Mixing the MPI requirement (ignore ODS entirely) with exactly one of the "selected" requirements is an easy binary choice -- do or do not. An open question is: will the "special backend selected" set for OpenMP always be a subset of the "all selected" set for SYCL, and identical to the "special backend selected" set for CHIP-SPV and Julia? If not (which seems plausible to me), and if we wish to support multiple clients simultaneously using different special backends (less likely, IMHO), then UR should delegate the specializations to the clients -- but that would mean ODS needs to capture the superset, or a set of sets.

this conflates two separate things: the administrative task of filtering available platforms through an environment variable and software programmatically selecting its supported configuration

I tend to agree with this observation. UR could/should limit its responsibility to parsing the ODS env var and exposing the resulting information programmatically. Each client SW component (SYCL, OpenMP, CHIP-SPV, etc.) can then do whatever it wants with that information. Every client should document its interpretation algorithm.

It would dramatically simplify our implementation (and documentation) job if exactly one backend was the special one for all UR clients that treat one backend as special. We should consider imposing that rule, until/unless there is significant push-back from customers.

I can see two ways of choosing the backend that will be special:

  1. Augment the syntax of ODS -- for example, "the first backend mentioned in an inclusive term is the special one". What happens when all terms are exclusive? Pick Level Zero. Unless that is excluded -- then what? Hmm.
  2. Use a different env var -- for example, a new one like "ONEAPI_PREFERRED_BACKEND" (which could mean platform or adapter here). Naming it "UR_anything" suggests that it applies to UR and thereby to all UR clients, but MPI will definitely ignore it, SYCL will almost certainly ignore it, etc.

@pbalcer (Contributor) commented Jun 19, 2023

A possible software application usage might be to try adding SYCL offload to a program that already uses MPI for inter-node communication. This software would query both: MPI would need to see all "real" devices; SYCL would want to see only the "selected" devices.

Maybe a stupid question, but does this assume that both MPI and the program use the same UR "instance"? Or that UR has external linkage for its global state, so MPI and SYCL implicitly operate on the same set of platforms (i.e., urInit is called by both, but only the first call does anything)?
If so, then I see your point... Do we need it to happen this way? Can't we just not export the globals, so each component gets its own UR state?

@Wee-Free-Scot (Contributor)

Multiple concurrent instances of UR in a single OS process is a design choice that requires non-trivial support code inside UR, and requires documentation answering questions like "can I pass a UR handle that I got from my instance of UR into a library that also/separately initialized its own instance of UR?"

Specifically, the simplest cross-component use-case is something like this:

  1. allocate USM memory via SYCL with sycl::malloc_shared (needs a sycl::device input param)
  2. submit a kernel that uses the buffer to a SYCL queue with sycl::queue::submit
  3. ask MPI to send the (new) buffer content to another node via MPI_Send
  4. intra-node communication performs better if MPI can "see" all node-local devices, not just those SYCL can "see"

The memory backing the sycl::buffer is used by SYCL and by MPI; both need to be able to query properties using their instance of UR and get sensible answers, even though only one instance of UR was used to allocate that memory.
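
A minimal sketch of steps 1-3, assuming a GPU-aware MPI build (error handling and the matching receive on the peer rank are omitted):

```cpp
#include <mpi.h>
#include <sycl/sycl.hpp>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);              // MPI may internally see all "real" devices
  sycl::queue q{sycl::gpu_selector_v}; // SYCL sees only "selected" devices

  constexpr int n = 1024;
  int *buf = sycl::malloc_shared<int>(n, q);                 // step 1
  q.parallel_for(sycl::range<1>{n},                          // step 2
                 [=](sycl::id<1> i) { buf[i] = static_cast<int>(i[0]); })
      .wait();

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)
    MPI_Send(buf, n, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD); // step 3

  sycl::free(buf, q);
  MPI_Finalize();
}
```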

We can construct more elaborate MPI+SYCL examples and examples using MPI and OpenMP offload, but you get the idea.

Examples with oneCCL can ask questions about sharing device and context objects/handles because creating a communicator in oneCCL requires a device handle and a context handle. oneCCL uses the device to perform the communication (especially computations for reductions, but also data movement to copy engines). The user might submit their kernels via SYCL, whereas oneCCL will submit kernels directly to Level Zero -- in future, perhaps, "directly" via UR.

Here a scenario might be:

  1. use SYCL to enumerate all devices
  2. choose one (or more) to use for application kernels
  3. choose one SYCL device to use for oneCCL
  4. ask oneCCL to communicate with other processes (which are using devices unknown to SYCL)
  5. intra-node communication performs better if oneCCL can "see" all node-local devices, not just those SYCL can "see"

@pbalcer (Contributor) commented Jun 20, 2023

Right, but is all this necessary? It seems the only use case for more elaborate handling of the ODS variable (beyond a simple opt-out flag) is when the UR global state has external linkage. And I don't think that's what we want to do. Or is it? I was planning to add a linker version script so that UR only exports its public API symbols and nothing else.
@kbenzie can you chime in?

@kbenzie (Contributor) commented Jun 20, 2023

@pbalcer sure, I need to catch up on the discussion first, then I'll come back to this.

@kbenzie (Contributor) commented Jun 20, 2023

Multiple concurrent instances of UR in a single OS process is a design choice

Currently we have a single instance per process and I don't think we should change that.

I was planning to add a linker version script so that UR only exports its public API symbols and nothing else.

I support this; the only global symbols that should be visible from the loader are the functions which the user can link against and call, nothing else.


I'll come back to this tomorrow though; I feel things are tending towards becoming overcomplicated and I'd like a little more time to digest everything and come up with a proposal for next steps.

@kbenzie (Contributor) commented Jun 22, 2023

I'm a day late, apologies.

How about we start building the simplest possible prototype now, evaluate it once it's working, then iterate on the design from there?

I feel like the minimal changes to get something working would be:

  1. Introduce the ONEAPI_DEVICE_SELECTOR parsing code
  2. Invoke the parser & implement filtering inside urInit()
    • Keep the current behaviour of always initializing all adapters for simplicity
  3. Change urPlatformGet and urDeviceGet to return the filtered results
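
A rough sketch of step 3, assuming urInit() has parsed the selector and cached the surviving handles (gSelected is an assumed internal cache; device-type filtering is omitted):

```cpp
#include <map>
#include <vector>
#include <ur_api.h>

// Populated during urInit() from the parsed ONEAPI_DEVICE_SELECTOR terms.
std::map<ur_platform_handle_t, std::vector<ur_device_handle_t>> gSelected;

ur_result_t urDeviceGet(ur_platform_handle_t hPlatform, ur_device_type_t,
                        uint32_t NumEntries, ur_device_handle_t *phDevices,
                        uint32_t *pNumDevices) {
  const auto &devices = gSelected[hPlatform]; // already filtered
  if (pNumDevices)
    *pNumDevices = static_cast<uint32_t>(devices.size());
  for (uint32_t i = 0; i < NumEntries && i < devices.size(); ++i)
    phDevices[i] = devices[i];
  return UR_RESULT_SUCCESS;
}
```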

This definitely doesn't cover all our use cases, but I feel these things will need to be done no matter what future design decisions we make. It also feels like a reasonable jumping-off point for thinking about how to expose the unfiltered lists, because we'll have implementation details to look at to inform our design decisions moving forward.

Something that would also help me grok the scope of this would be a list of all the use cases in a single place. This would help with evaluating and prioritising our design and implementation efforts.

@pbalcer (Contributor) commented Jun 26, 2023

Currently we have a single instance per process and I don't think we should change that.

I'm not sure the current implementations are correct, though. How should urInit and urTeardown behave when called multiple times?
For teardown, I think the level-zero adapter will just crash, and the CUDA one will try to call invalid XPTI handles.

These are likely just bugs that can be easily fixed, but this highlights a bigger problem: this sort of API is unintuitive and might be error-prone, both for adapter developers and for users (who don't know what actually happens when they call urInit or urTeardown).

The urInit docs state that "The application may call this function multiple times with different flags or environment variables enabled", but also that "Only one instance of each adapter will be initialized per process". If we don't have some non-global state object that init and teardown operate on, the only thing we can actually do when an application calls urInit multiple times with different flags or environment variables enabled is to ignore all the subsequent calls (which is what the loader currently does for its own state). At best, the current wording in the docs is misleading.

How about we start building the simplest possible prototype now, evaluate it once its working, then iterate on the design from there?

The problem, I think, is that we have a more fundamental issue with the management of global-state lifetime, which is now causing difficulties in the design of this new feature. The implementation of #620 will likely hit the exact same problem.

My suggestion would be to take a step back and think about whether we need urInit and urTeardown in the first place. Most loader initialization takes place in the library constructor/destructor. Is there anything in the adapters needing these entry points that can't be similarly managed (I don't think so)? We could then create a dedicated filtering/configuration mechanism with local state for urPlatformGet or urDeviceGet.

@Wee-Free-Scot (Contributor)

I was not thinking that the device list(s) would be global state within UR, but as long as the env var is read exactly once (to avoid race conditions with env var modifications, if any), caching the result of the first call to an idempotent procedure and returning it for all subsequent calls is semantically indistinguishable from recomputing the identical output on each call.

Therefore, the discussions about managing global state are relevant and potentially problematic, as stated by @pbalcer.

As a point of reference, MPI has recently confronted this issue -- it has always had MPI_INIT and MPI_FINALIZE, which can be called at most once by each process; MPI-4.0 introduced the concept of MPI sessions, which permits multiple concurrent initializations of MPI (sessions) and re-initialization of MPI (destroy all sessions, wait a bit, create a new one). This allows each library to have its own handle to MPI that is isolated from all the others, but it raises problems with every MPI object that is not scoped by a session (such as MPI datatypes). This is now coming back to bite the MPI Forum and the Sessions WG, and is providing a great case study in how not to handle this kind of API change.

@kbenzie (Contributor) commented Jul 4, 2023

The urInit docs state that "The application may call this function multiple times with different flags or environment variables enabled", but also that "Only one instance of each adapter will be initialized per process".

Looks like this language was copied directly from zeInit. Perhaps we can look there to get an idea of exactly what the semantics look like in the code.

I would personally remove "may call this function multiple times with different flags or environment variables enabled" from UR because it feels at odds with having one instance per process to me.

Is there anything in the adapters needing these entry points that can't be similarly managed (I don't think so)?

As I linked above, I do believe Level Zero requires urInit to call zeInit; the CUDA adapter enables some tracing but nothing else, the HIP adapter doesn't do anything, and the OpenCL adapter sets up a function pointer cache but nothing else.

Currently in the pi2ur layer, urInit(0) is called inside piPlatformGet; this almost certainly happens multiple times, but with no change in flags.

Meanwhile, piTearDown is called only once, by a global handler which unloads the plugins. Given this de facto use, and that urTearDown doesn't have much to say about anything, stating that it should only be called once per process -- when absolutely nothing else will be done with any adapters -- seems prudent.

Overall I think this does point to keeping urInit/urTearDown around, but it's not that convincing on balance.

Another option might be to move the loader constructor/destructor logic into urInit/urTearDown. This would allow the application to completely avoid doing any loading if it so desired, rather than doing it at .so/.dll load time.

as long as the env var is read exactly once

My gut feeling would be to wrap the implementation details of urInit in std::call_once in addition to removing the clause about allowing multiple calls to it with different flags.
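
That gut feeling, sketched (flag handling elided; subsequent calls become no-ops):

```cpp
#include <mutex>
#include <ur_api.h>

ur_result_t urInit(ur_device_init_flags_t flags) {
  (void)flags; // only the first call's flags would take effect
  static std::once_flag initOnce;
  std::call_once(initOnce, [] {
    // read and parse ONEAPI_DEVICE_SELECTOR, initialize adapters, ...
  });
  return UR_RESULT_SUCCESS;
}
```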

@pbalcer (Contributor) commented Jul 5, 2023

I would personally remove may call this function multiple times with different flags or environment variables enabled from UR because it feels at odds with having one instance per process to me.

Agreed.

As I linked above, I do believe Level Zero requires urInit to call zeInit; the CUDA adapter enables some tracing but nothing else, the HIP adapter doesn't do anything, and the OpenCL adapter sets up a function pointer cache but nothing else.

But would there be any observable side-effects of loading the level-zero driver and calling zeInit during the library constructor? I found the implementation for compute-runtime here, but I'm unable to tell from that.

Meanwhile, piTearDown is called only once, by a global handler which unloads the plugins. Given this de facto use, and that urTearDown doesn't have much to say about anything, stating that it should only be called once per process -- when absolutely nothing else will be done with any adapters -- seems prudent.

But how can two independent modules that share a UR instance coordinate which one calls urTearDown? In compute-runtime, the actual driver teardown happens on library destruction.
In UR, if we want to keep the current teardown semantics, it might be a good idea to do refcounting.

Overall I think this does point to keeping urInit/urTearDown around, but it's not that convincing on balance.

Currently, software has no way of knowing whether its call to urInit actually does anything. The same, I think, is true of urTeardown. This makes things like flags, any filtering we would do at urInit, or the config we are planning on adding in #681, unreliable. What's worse, it would appear to work in most scenarios but break down in complex applications like the examples @Wee-Free-Scot described.

My gut feeling would be to wrap the implementation details of urInit in std::call_once in addition to removing the clause about allowing multiple calls to it with different flags.

This is what the loader does for its own layers, but the urInit call is then forwarded to the adapters every time -- should this also happen only once?

@Wee-Free-Scot (Contributor)

Using std::call_once works for setup because it implements "only the first call does something", but it isn't so good for tear-down because it has no way of implementing "only the last call does something" -- that needs ref-counting. That's a concern whichever level of the software stack uses it.
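
The difference, sketched (signatures simplified; doRealInit/doRealTearDown are placeholders for the actual adapter setup and teardown):

```cpp
#include <atomic>

void doRealInit();     // placeholder: actual adapter setup
void doRealTearDown(); // placeholder: actual adapter teardown

std::atomic<int> gInitCount{0};

void initSketch() {
  if (gInitCount.fetch_add(1) == 0)
    doRealInit(); // "first call does something" -- call_once can express this
}

void tearDownSketch() {
  if (gInitCount.fetch_sub(1) == 1)
    doRealTearDown(); // "last call does something" -- needs the counter
}
```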

Can the adapters be re-initialized? Call zeInit, then zeTearDown, then zeInit a second time? MPI is moving away from MPI_INIT and MPI_FINALIZE (which must be called exactly once in each MPI process) to MPI_SESSION_INIT and MPI_SESSION_FINALIZE (which support arbitrary concurrent sessions and chronological gaps with re-initialization). This was done (after years of debate) because "there can be only one!" was widely denounced as too restrictive and bad for fault tolerance.

If all the adapters protect themselves against multiple/concurrent calls to their setup and tear-down functions (either by ensuring only one call does something or by doing something on every call), then the higher software layers, like UR, don't need to implement that protection as well. Question: is it required that all adapters implement "only one active instance" semantics, or is an adapter permitted to do something active during more than one (or all) calls to the setup and tear-down functions? If all adapters must implement "call once" semantics, then it might make sense to unify that code, i.e., move the protection up one layer into UR -- but only if no adapter will need that protection when used without the UR intermediary. If adapters have freedom in this regard, then enforcing "call once" in UR is A Bad Idea(tm).

@kbenzie (Contributor) commented Jul 7, 2023

I agree urTearDown() in its current form is problematic.

With #681 in the works to enable programmatically enabling layers in the loader, I think this is another reason to keep urInit(), or some future equivalent, around as a concept.

@callumfare has also recently noticed some things which look like they could be resolved by the addition of an adapter handle:

  • urPlatformCreateWithNativeHandle() currently has no handle to dispatch to the correct adapter, so is currently broken
  • Properly supporting urPlatformGetLastError() in the SYCL RT requires either passing around a platform handle in a lot more places or, potentially less intrusively, using an adapter handle instead of a platform handle

So it seems like having urInit() return a reference-counted adapter handle would solve a number of issues.
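
One possible shape, sketched with illustrative names (these handles were not part of the UR API at the time of writing):

```cpp
// Init returns reference-counted adapter handles; the first get initializes,
// and the last release performs the real teardown.
ur_result_t urAdapterGet(uint32_t NumEntries, ur_adapter_handle_t *phAdapters,
                         uint32_t *pNumAdapters);
ur_result_t urAdapterRetain(ur_adapter_handle_t hAdapter);
ur_result_t urAdapterRelease(ur_adapter_handle_t hAdapter);
```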

This is what the loader does for its own layers, but the urInit call is then forwarded to the adapters every time -- should this also happen only once?

I noticed the L0 adapter has a std::call_once already.

@Wee-Free-Scot (Contributor)

More information for this issue -- the Sysman API has been delinked from the L0 Core API (see https://jira.devtools.intel.com/browse/LOCI-3748 and https://jira.devtools.intel.com/browse/LOCI-3950), so client software can now enumerate devices using L0 Core (which obeys ZE_AFFINITY_MASK) or via Sysman (which ignores ZE_AFFINITY_MASK).

This is related because ignoring ZE_AFFINITY_MASK must imply ignoring ONEAPI_DEVICE_SELECTOR as well.

It also adds another client that needs the "all devices" output: VTune enumerates all devices and all XeLinks, then gets HW counters from all of them.

It clarifies that (if not using UR) MPI would use Sysman to find all devices and L0 Core to submit compute/copy kernels only to the devices selected and not masked by the env vars.

Thus, for L0/Sysman platforms, we could/should have two different APIs in UR (urDeviceGet and urDeviceGetSelected in my draft PR #740) that delegate to two different lower-level functionalities (newly separated out) to support the two requirements we have already discussed (all devices and selected devices).
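
Presumably the selected-devices query mirrors the existing urDeviceGet signature; a sketch (see #740 for the actual shape):

```cpp
ur_result_t urDeviceGetSelected(ur_platform_handle_t hPlatform,
                                ur_device_type_t DeviceType,
                                uint32_t NumEntries,
                                ur_device_handle_t *phDevices,
                                uint32_t *pNumDevices);
```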

Naming questions

Terminology -- Sysman uses words like topology and vertex (AKA device) -- perhaps suggesting something like urTopologyGetVertices instead of urDeviceGet, which frees up that name to mean what urDeviceGetSelected means in my PR.

Portability -- does "Sysman is different from L0 Core" and/or the "topology/vertex" naming make any sense for CUDA or OpenCL platforms?

To me, this doesn't feel like the right direction for naming in UR (not sure why, tbh), but getting all devices in the node is functionality from Sysman, so we should ask: how faithful do we want to be to the Sysman naming terminology?

@kbenzie (Contributor) commented Feb 22, 2024

Fixed by #740.

@kbenzie closed this as completed Feb 22, 2024