Provide an Implementation of ONEAPI_DEVICE_SELECTOR #220
Comments
@alycm thanks. The idea is that oneAPI applications (or in this case, applications using UR) set either ONEAPI_DEVICE_SELECTOR or ZE_AFFINITY_MASK. The main idea is that with ONEAPI_DEVICE_SELECTOR, oneAPI libraries expose only a sub-set of devices to the application, while letting all middleware libraries (and UR) have full visibility of all the devices in the system. This is very useful, for instance, for MPI+SYCL applications running on multi-device systems, where each rank would have visibility of just one or a sub-set of those devices, while MPI would have full visibility of all devices, allowing it to implement any optimizations that rely on knowledge of the full topology of the system. Now, if an application chooses to use both ONEAPI_DEVICE_SELECTOR and ZE_AFFINITY_MASK, then ZE_AFFINITY_MASK defines what ONEAPI_DEVICE_SELECTOR may expose, given that L0 sits lower in the stack and closer to the devices. In that sense, I think UR would just read ONEAPI_DEVICE_SELECTOR, and mask the devices that L0 already exposes to the layers above: Example 0 (typical usage): Example 1 (more complex, but still legal): so from the UR point of view, ONEAPI_DEVICE_SELECTOR would expose the devices L0 decides to expose through the L0 device query interfaces. |
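For illustration, a minimal sketch of the masking step described above -- UR reads ONEAPI_DEVICE_SELECTOR and filters the devices the adapters already expose. The `DeviceDesc` type, the helper name, and the simplified "backend:index" grammar are assumptions for the sketch, not the actual implementation (the real selector grammar also has discard terms, sub-device syntax, and more).

```cpp
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical device descriptor; real UR code would work with
// ur_device_handle_t plus adapter queries for backend name and ordinal.
struct DeviceDesc {
    std::string backend;  // e.g. "level_zero", "opencl", "cuda"
    std::size_t ordinal;  // index within that backend, after ZE_AFFINITY_MASK
};

// Keep only devices matched by a "backend:idx[,idx...]" term (';' separated).
std::vector<DeviceDesc> applySelector(const std::vector<DeviceDesc> &exposed) {
    const char *ods = std::getenv("ONEAPI_DEVICE_SELECTOR");
    if (!ods)
        return exposed;  // unset: everything the adapters expose stays visible

    std::vector<DeviceDesc> selected;
    std::stringstream terms(ods);
    std::string term;
    while (std::getline(terms, term, ';')) {
        auto colon = term.find(':');
        if (colon == std::string::npos)
            continue;
        std::string backend = term.substr(0, colon);
        std::stringstream indices(term.substr(colon + 1));
        std::string idx;
        while (std::getline(indices, idx, ',')) {
            for (const auto &d : exposed) {
                bool backendMatch = (backend == "*" || backend == d.backend);
                bool indexMatch = (idx == "*" || std::stoul(idx) == d.ordinal);
                if (backendMatch && indexMatch)
                    selected.push_back(d);
            }
        }
    }
    return selected;
}
```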
I can't assign this ticket to him as he isn't a member of the project but @Wee-Free-Scot is working on this. |
Thanks @alycm -- I am indeed working on this. Is there any way I can be added to the project? |
I've made the request to add you @Wee-Free-Scot |
Goal: there is a single env var (ODS) that consistently affects how devices are exposed by many oneAPI components.

Relevance to UR: "consistently" implies a single code-base for the parts of the functionality that are common.

Code/API design concerns: Some targets need to see all the "real" devices (i.e. ignore the ODS env var): MPI, oneCCL, and oneSHMEM. Of those targets that need "selected" devices,

Common functionality: figuring out whether the ODS env var is set.

Diverging functionality: return a list of devices, ignoring ODS entirely.

Suggestion -- UR needs 3 APIs (one possible shape is sketched below):
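Purely as a sketch of that three-way split -- the names and exact parameters below are hypothetical and only loosely modelled on urDeviceGet; the handle types come from ur_api.h.

```cpp
#include <cstdint>
#include <ur_api.h>

// (1) All "real" devices on one platform, ignoring ONEAPI_DEVICE_SELECTOR
//     entirely -- what MPI / oneCCL / oneSHMEM would call.
ur_result_t urDeviceGetAll(ur_platform_handle_t hPlatform,
                           ur_device_type_t DeviceType, uint32_t NumEntries,
                           ur_device_handle_t *phDevices, uint32_t *pNumDevices);

// (2) Devices selected by ONEAPI_DEVICE_SELECTOR across every platform --
//     roughly what SYCL's device::get_devices() needs.
ur_result_t urDeviceGetSelectedAll(ur_device_type_t DeviceType,
                                   uint32_t NumEntries,
                                   ur_device_handle_t *phDevices,
                                   uint32_t *pNumDevices);

// (3) Devices selected by ONEAPI_DEVICE_SELECTOR on one given platform.
ur_result_t urDeviceGetSelected(ur_platform_handle_t hPlatform,
                                ur_device_type_t DeviceType, uint32_t NumEntries,
                                ur_device_handle_t *phDevices,
                                uint32_t *pNumDevices);
```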
|
Instead of adding new APIs for enumerating devices, it might be easier to decide whether to use "OneAPI Device Selector" (ODS) based on some flag to |
The flag might need to be more than just yes/no -- some callers permit only one platform/backend, and some might need to pass in the name of that one platform/backend. CHIP-SPV assumes "there is one backend and it is Level Zero", unless an env var is set that switches to "there is only one backend and it is OpenCL". CHIP-SPV could read its env var, decide between only-L0 or only-OCL, and then initialize UR passing in the backend name, e.g. "level_zero" (string or an enum?). OpenMP assumes "there is one backend" but does not (AFAIK) offer a way to control which backend is chosen. There is an open question of how OpenMP should make the choice. One option is to choose the first backend mentioned in ODS. So the flag would at least need values meaning "ignore ODS" (for MPI), "single platform: <this one>" (for CHIP-SPV), and "multi-platform" (for SYCL). Perhaps there is another value meaning "single platform: you choose for me" (for OpenMP). |
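As a sketch, such a flag could be a small enum passed at initialization; all names below are hypothetical, not existing UR API.

```cpp
// Hypothetical device-visibility flag passed at initialization time.
enum class ur_device_visibility_t {
    IGNORE_SELECTOR,        // MPI / oneCCL / oneSHMEM: all "real" devices
    SINGLE_PLATFORM_NAMED,  // CHIP-SPV: caller supplies the one backend name
    SINGLE_PLATFORM_ANY,    // OpenMP: "you choose a single backend for me"
    MULTI_PLATFORM,         // SYCL: honour ONEAPI_DEVICE_SELECTOR as written
};

// Hypothetical descriptor carrying the flag and, for SINGLE_PLATFORM_NAMED,
// the backend name (a string here; an enum would work equally well).
struct ur_device_visibility_desc_t {
    ur_device_visibility_t mode;
    const char *backendName;  // e.g. "level_zero"; only used when NAMED
};
```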
Sounds like there's some overlap with what we wanted to do in #355. |
It seems to me that the default path should respect ONEAPI_DEVICE_SELECTOR.
Since we already have What differentiates 2. and 3. here? Is
Given that
I've wondered about this also, but my mental model (which may be incorrect) for
API (2) is intended to directly support this SYCL code: `auto devices = device::get_devices(info::device_type::gpu);` This returns all GPU devices exposed by the ODS env var (or all GPU devices from all backends/platforms, if ODS is not set, as if ODS had been set to ":"). Of course, this could be implemented as a loop over platforms (see the sketch below): this is how SYCL currently implements ODS support, probably because there is no abstraction layer that can hide the loop. So, API (2) is not strictly necessary -- it is a wrapper over repeated calls to API (3). If SYCL is really the only UR client that is interested in a cross-platform device list, then SYCL should be forced to call API (3) repeatedly. Alternatively, there could be a UR sentinel value (e.g. urWildcardPlatform) that means "all platforms". |
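A sketch of that wrapper relationship, using hypothetical stand-ins rather than the real UR entry points -- only the shape of the loop matters here.

```cpp
#include <vector>

// Hypothetical stand-ins for the UR calls.
struct Platform {};
struct Device {};

std::vector<Platform> getAllPlatforms();                   // platform query
std::vector<Device> getSelectedDevices(const Platform &);  // API (3)

// API (2) is then nothing more than a loop over API (3): the cross-platform
// "selected devices" list is the concatenation of each platform's list.
std::vector<Device> getSelectedDevicesAllPlatforms() {
    std::vector<Device> result;
    for (const Platform &p : getAllPlatforms()) {
        std::vector<Device> perPlatform = getSelectedDevices(p);
        result.insert(result.end(), perPlatform.begin(), perPlatform.end());
    }
    return result;
}
```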
OpenCL is one of the backends that ODS understands. I was assuming that backend, platform, and adapter are effectively synonyms in the context of UR. Is that not true? |
For CUDA and HIP, which both assume either only NVIDIA or only AMD respectively, there is one platform per adapter, and that is analogous to a SYCL backend. For Level Zero, I'm less sure, but looking at zeDriverGet, that looks a lot like urPlatformGet if you squint your eyes. I think that leaves the possibility that the Level Zero adapter could report multiple platforms, one per driver. For the OpenCL adapter we actually pass on the underlying OpenCL platforms, so it
Stated differently:
|
But should software ever need to query both "selected" and "real" devices? I'm not sure.
I think this conflates two separate things: the administrative task of filtering available platforms through an environment variable and software programmatically selecting its supported configuration. The former is for the user of software to pick what they want to run with, and the latter is for the software developer to restrict their supported configurations. These are both valid functionalities, but I'm not sure if they can or should be solved using the same mechanism. Especially since software can already selectively use platforms by querying |
I think the idea behind ODS was to replace all of the different env vars that each component had invented to solve this common problem. We might instead be heading towards "one env var to rule them all" -- that is, keeping a bunch of the existing env vars but defining how they interact with the one env var (ODS).
A possible software application usage might be to try adding SYCL offload to a program that already uses MPI for inter-node communication. This software would query both: MPI would need to see all "real" devices; SYCL would want to see only the "selected" devices. It is harder to come up with a non-pathological example of needing two different "selected" sets concurrently: for example, some part of the application does offload via SYCL, whereas another part does offload using OpenMP. Mixing the MPI requirement (ignore ODS entirely) and exactly one of the "selected" requirements is an easy binary choice -- do or do not. An open question is: will the "special backend selected" set for OpenMP always be a subset of the "all selected" set for SYCL and identical to the "special backend selected" set for CHIP-SPV and Julia? If not (which seems plausible to me), and if we wish to support multiple clients simultaneously using different special backends (less likely, IMHO), then UR should delegate the specializations to the clients, but that would mean ODS needs to capture the superset or a set of sets.
I tend to agree with this observation. UR could/should limit its responsibility to parsing the ODS env var and exposing the resulting information programmatically. Each client SW component (SYCL, OpenMP, CHIP-SPV, etc.) can do whatever it wants with that information. Every client should document its interpretation algorithm. It would dramatically simplify our implementation (and documentation) job if exactly one backend was the special one for all UR clients that treat one backend as special. We should consider imposing that rule, until/unless there is significant push-back from customers. I can see two ways of choosing the backend that will be special:
|
Maybe a stupid question, but does this assume that both MPI and the program use the same UR "instance"? Or that UR has external linkage for its global state, and so MPI and SYCL implicitly operate on the same set of platforms (i.e., |
Supporting multiple concurrent instances of UR in a single OS process is a design choice that requires non-trivial support code inside UR and requires documentation that answers questions like "can I pass a UR handle that I got from my instance of UR into a library that also/separately initialized its own instance of UR?" Specifically, the simplest cross-component use-case is something like this:
The memory backing the sycl::buffer is used by SYCL and by MPI; both need to be able to query properties using their instance of UR and get sensible answers, even though only one instance of UR was used to allocate that memory. We can construct more elaborate MPI+SYCL examples and examples using MPI and OpenMP offload, but you get the idea. Examples with oneCCL can ask questions about sharing device and context objects/handles because creating a communicator in oneCCL requires a device handle and a context handle. oneCCL uses the device to perform the communication (especially computations for reductions, but also data movement to copy engines). The user might submit their kernels via SYCL, whereas oneCCL will submit kernels directly to Level Zero -- in future, perhaps, "directly" via UR. Here a scenario might be:
|
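For concreteness, a minimal sketch of the MPI+SYCL sharing pattern described above, assuming SYCL 2020 and a GPU-aware MPI, shown with a USM allocation rather than a sycl::buffer for brevity; rank/peer handling and error checking are elided.

```cpp
#include <cstddef>
#include <mpi.h>
#include <sycl/sycl.hpp>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);               // MPI wants to see every "real" device
    sycl::queue q{sycl::gpu_selector_v};  // SYCL sees only "selected" devices

    constexpr std::size_t n = 1024;
    float *buf = sycl::malloc_device<float>(n, q);  // one allocation...

    // ...filled by a SYCL kernel on the selected device...
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        buf[i] = static_cast<float>(i);
    }).wait();

    // ...and handed to (GPU-aware) MPI, which queries the same memory through
    // its own view of the devices. A peer rank 1 is assumed to exist.
    MPI_Send(buf, static_cast<int>(n), MPI_FLOAT, /*dest=*/1, /*tag=*/0,
             MPI_COMM_WORLD);

    sycl::free(buf, q);
    MPI_Finalize();
    return 0;
}
```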
Right, but is all this necessary? It seems the only use case for more elaborate handling of the ODS variable (than a simple opt-out flag) is when the UR global state has an external linkage. And I don't think that's what we want to do. Or is it? I was planning to add a linker version script so that UR only exports its public API symbols and nothing else. |
@pbalcer sure, I need to catch up on the discussion first, then I'll come back to this. |
Currently we have a single instance per process and I don't think we should change that.
I support this: the only global symbols that should be visible from the loader are the functions which the user can link against and call, nothing else. I'll come back to this tomorrow though; I feel things are tending towards becoming overcomplicated and I'd like a little more time to digest everything and come up with a proposal for next steps. |
I'm a day late, apologies. How about we start building the simplest possible prototype now, evaluate it once it's working, then iterate on the design from there? I feel like the minimal changes to get something working would be:
This definitely doesn't cover our use cases, but I feel like these things will need to be done no matter what future design decisions we make. It also feels like a reasonable jumping-off point for thinking about how to expose the unfiltered lists, because we'll have implementation details to look at and inform our design decisions moving forward. Something that would also help me grok the scope of this would be to have a list of all the use cases in a single place. This would help with evaluating and prioritising our design and implementation efforts. |
I'm not sure if the current implementations are correct though. How should These are likely just bugs that can be easily fixed, but this highlights a bigger problem - this sort of API is unintuitive and might be error-prone. For both adapter developers and users (they don't know what actually happens when they call The
The problem, I think, is that we have a more fundamental issue with the management of global state lifetime, which is now causing difficulties with the design of this new feature. The implementation of #620 will likely hit the exact same problem. My suggestion would be to take a step back and think about whether we need |
I was not thinking that the device list(s) would be global state within the UR, but as long as the env var is read exactly once (to avoid race conditions with env var modifications, if any), caching the result of the first call to an idempotent procedure and returning it for all subsequent calls (sketched below) is semantically indistinguishable from actually recomputing the identical output during each call. Therefore, the discussions about managing global state are relevant and potentially problematic, as stated by @pbalcer. As a point of reference, MPI has recently confronted this issue -- it has always had MPI_INIT and MPI_FINALIZE, which can only be called at most once by each process; in MPI-4.0, the concept of MPI sessions was introduced, which permits multiple concurrent initializations of MPI (sessions) and re-initialization of MPI (destroy all sessions, wait a bit, create a new one). This allows each library to have its own handle to MPI that is isolated from all the others, but it raises problems with every MPI object that is not scoped by session (such as MPI datatypes). This is now coming back to bite the MPI Forum and the Sessions WG and providing a great case study in how not to handle this kind of API change. |
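A sketch of that read-once-and-cache behaviour (the helper name is hypothetical):

```cpp
#include <cstdlib>
#include <mutex>
#include <optional>
#include <string>

// Read ONEAPI_DEVICE_SELECTOR at most once per process; every later call
// returns the cached value, so later changes to the environment (or races on
// it) cannot change what the selector logic sees.
const std::optional<std::string> &cachedDeviceSelector() {
    static std::optional<std::string> value;
    static std::once_flag flag;
    std::call_once(flag, [] {
        if (const char *raw = std::getenv("ONEAPI_DEVICE_SELECTOR"))
            value = std::string(raw);
    });
    return value;
}
```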
Looks like this language is copied directly from zeInit. Perhaps we can look there to get an idea of exactly what those semantics look like in the code. I would personally remove "may call this function multiple times with different flags or environment variables enabled" from UR, because it feels at odds with having one instance per process to me.
As I linked above I do believe LevelZero requires Currently in the pi2ur layer Meanwhile Overall I think this does point to keeping Another option might be to move loader constructor/destructor into
My gut feeling would be to wrap the implementation details of |
Agreed.
But would there be any observable side-effects of loading the level-zero driver and calling
But how can two independent modules that share a UR instance coordinate which one calls |
Currently, software has no way of knowing whether
This is what the loader does for its own layers, but the |
Using Can the adapters be re-initialized? Call If all the adapters protect themselves against multiple/concurrent calls to their setup and tear-down functions (either ensuring only one does something or by doing something on every call), then the higher software layers, like the UR, don't need to implement that protection as well. Question: is it required that all adapters must implement "only one active instance" semantics, or is it permitted for an adapter to do something active during more than one (or all) calls to the setup and tear-down functions? If all adapters must implement "call once" semantics, then it might make sense to unify that code, i.e., move the protection up one layer into the UR, but only if no adapter will need that protection when used without the UR intermediary. If adapters have freedom in this regard, then enforcing "call once" in UR is A Bad Idea(tm). |
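A sketch of the kind of per-adapter protection being described -- active work only on the first setup call and the last tear-down call, with everything serialised so that higher layers need no protection of their own. Class and method names are hypothetical.

```cpp
#include <cstdint>
#include <mutex>

// Hypothetical adapter-local guard implementing "call once" semantics via
// reference counting: setup happens on the first retain, tear-down on the
// last release, and concurrent callers are serialised by the mutex.
class AdapterLifetime {
public:
    void retain() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (refCount_++ == 0)
            doRealSetup();  // e.g. load the driver and call its init entry point
    }
    void release() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (refCount_ > 0 && --refCount_ == 0)
            doRealTearDown();  // free global adapter state
    }

private:
    void doRealSetup() {}
    void doRealTearDown() {}
    std::mutex mutex_;
    std::uint64_t refCount_ = 0;
};
```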
I agree. With #681 in the works to enable programmatically enabling layers in the loader, I think this is another reason to keep
@callumfare has also recently noticed some things which look like they could be resolved by the addition of an adapter handle:
So it seems like having
I noticed the L0 adapter has a |
More information for this issue -- the Sysman API has been delinked from the L0 Core API (see https://jira.devtools.intel.com/browse/LOCI-3748 and https://jira.devtools.intel.com/browse/LOCI-3950), so client software can now enumerate devices using L0 Core (which obeys ZE_AFFINITY_MASK) or via Sysman (which ignores ZE_AFFINITY_MASK). This is related because ignoring ZE_AFFINITY_MASK must imply ignoring ONEAPI_DEVICE_SELECTOR as well. It also adds another client SW that needs the "all devices" output: VTune enumerates all devices and all XeLinks, then gets HW counters from all of them. It clarifies that (if not using UR) MPI would use Sysman to find all devices and L0 Core to submit compute/copy kernels only to the devices selected and not masked by the env vars. Thus, for L0/Sysman platforms, we (could/should) have two different APIs in UR (urDeviceGet and urDeviceGetSelected in my draft PR #740) that delegate to two different lower-level functionalities (newly separated out) to support the two different requirements we had already discussed (all devices and selected devices).

Naming questions:
Terminology -- Sysman uses words like topology and vertex (AKA device) -- perhaps suggesting something like
Portability -- does "Sysman is different from L0 Core" and/or the "topology/vertex" naming make any sense for CUDA or OpenCL platforms? To me, this doesn't feel like the right direction for naming in UR (not sure why not, tbh), but getting all devices in the node is functionality from Sysman, so we should ask "how faithful do we want to be to the Sysman naming terminology?" |
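A usage sketch of the two-query split, assuming urDeviceGetSelected mirrors urDeviceGet's signature and UR's usual count-then-fetch calling convention; error handling is elided.

```cpp
#include <cstdint>
#include <vector>
#include <ur_api.h>

// Query either the full "real" device list or the ONEAPI_DEVICE_SELECTOR-
// filtered list for one platform, using the two-call pattern (first call with
// NumEntries = 0 to get the count, second call to fetch the handles).
std::vector<ur_device_handle_t> queryDevices(ur_platform_handle_t platform,
                                             bool honourSelector) {
    uint32_t count = 0;
    if (honourSelector)
        urDeviceGetSelected(platform, UR_DEVICE_TYPE_ALL, 0, nullptr, &count);
    else
        urDeviceGet(platform, UR_DEVICE_TYPE_ALL, 0, nullptr, &count);

    std::vector<ur_device_handle_t> devices(count);
    if (honourSelector)
        urDeviceGetSelected(platform, UR_DEVICE_TYPE_ALL, count,
                            devices.data(), &count);
    else
        urDeviceGet(platform, UR_DEVICE_TYPE_ALL, count, devices.data(), &count);
    return devices;
}

// MPI/VTune-style callers would pass honourSelector = false ("all real
// devices"); SYCL-style callers would pass true ("selected devices only").
```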
Fixed by #740. |
Currently SYCL and OpenMP have their own implementations of ONEAPI_DEVICE_SELECTOR, see the compiler docs. Those could be eliminated if Unified Runtime provides a single solution for this. The expectation would be that Unified Runtime implements ONEAPI_DEVICE_SELECTOR directly while matching current behavior, as end-users could already be making use of this feature. Level Zero also has ZE_AFFINITY_MASK, see the Level Zero programming guide, which does a similar thing. I believe that ONEAPI_DEVICE_SELECTOR replaces ZE_AFFINITY_MASK but their interaction is something to consider.