[uma] Improve reporting of pool/provider specific errors #640

igchor · 2023-06-20T18:08:36Z

Initially I wanted to implement the same get_last_error function for both memory pool and memory provider. However, I realized that the use cases are a bit different:

for memory provider, we want to get the native code (e.g. to propagate from L0-provider to L0-adapter)
for memory pool, we are mostly interested with UMA error code (why the allocation/de-allocation failed) - not necessarily with a native error code.

Becasue of that the interface for memory pool and provider is different.

Also, there is one problem with the current approach - return value from umaPoolGetLastAllocationError is meaningful only if the last allocation operation (malloc, aligned_malloc, etc.) failed (returned NULL). This means that it cannot be used to report errors from memory_pool::free() since that function does not return anything (and so, we don't know if it failed and whether it's safe to call umaPoolGetLastAllocationError).

If we want to make it work free we could make the return value well-defined in all cases (e.g. UMA_RESULT_SUCCESS if the previous operation succeeded). This would require any pool implementation to clear last error code on every operation. We could probably provide some callback in memory_pool_ops so that the pool-implementation does not need to do that manually and we'd just call it from memory_pool.c (in implementation of umaPoolMalloc, etc). @pbalcer @vinser52 do you think that would make sense?

igchor · 2023-06-20T18:23:48Z

As an example, in case there is a provider-specific error on allocation, the code would look like this:

void* ptr = umaPoolMalloc(hPool, size);
if (!ptr) {
    uma_result_t ret = umaPoolGetLastAllocationError(hPool);
    if (ret == UMA_RESULT_ERROR_MEMORY_PROVIDER_SPECIFIC) {
        const char *msg;
        int32_t native_error;
        umaMemoryProviderGetLastNativeError(getProvider(hPool), &msg, &native_error);
        // cast native_error to ze_result_t and handle it
    } else {
        ...
    }
}

pbalcer

Principle of least surprise. We should not be inventing new behaviors here IMHO. How free behaves with respect to errors is well defined - it doesn't directly return anything and preserves errno (or umaPoolGetLastAllocationError in this case). A user can take no useful action when free fails (assuming that anyone would check that error in the first place?).
Callbacks might make sense.

igchor · 2023-06-21T17:18:48Z

Principle of least surprise. We should not be inventing new behaviors here IMHO. How free behaves with respect to errors is well defined - it doesn't directly return anything and preserves errno (or umaPoolGetLastAllocationError in this case). A user can take no useful action when free fails (assuming that anyone would check that error in the first place?). Callbacks might make sense.

I was also thinking about integrating this with UR. Currently urUSMFree() can return errors such as DEVICE_LOST. I don't think there is any other reasonable action that the user can take, other than just resetting the device (or the application). Still, it might be useful to get this error sooner rather than later (on free, rather than on the next allocation) so I think we should have some way of communicating this error.

Callbacks might be a good idea in a general case: I guess it should be even possible to implement a C++ API that would use e.g. exceptions on top of that.

DamianDuy · 2023-06-26T07:13:44Z

source/common/uma_pools/disjoint_pool.cpp

@@ -67,6 +67,10 @@ static void *AlignPtrUp(void *Ptr, const size_t Alignment) {
 return static_cast<char *>(AlignedPtr) + Alignment;
 }

+struct MemoryProviderError {
+ uma_result_t code;


LGTM
nit: perhaps a better name for this variable would be retCode? I think it may be more descriptive.

pbalcer

Let's leave it as is for now. We can always return to the problem with free later.
Please rebase.

igchor · 2023-06-26T18:00:19Z

Let's leave it as is for now. We can always return to the problem with free later. Please rebase.

Done.

source/common/uma_pools/disjoint_pool.cpp

source/common/unified_memory_allocation/include/uma/memory_provider.h

vinser52 · 2023-06-27T15:13:51Z

source/common/unified_memory_allocation/include/uma/memory_provider.h

-umaMemoryProviderGetLastResult(uma_memory_provider_handle_t hProvider,
- const char **ppMessage);
+/// \param pError [out] pointer to an integer where the adapter specific error code will be stored
+void umaMemoryProviderGetLastNativeError(uma_memory_provider_handle_t hProvider,


what if some pool uses several memory providers, e.g. one for data and another one for metadata?
what will be the user flow?

That's a good point. User might need to know which (metadata or data) provider failed. I think we need to extend error codes to specify DATA_PROVIDER_ERROR and METADATA_PROVIDER_ERROR.

I do not like the idea of introducing provider type-specific errors, especially if we consider changing the way how we specify provider type in the pool.

We can think of storing the handle to the provider that failed and remove hProvider argument from the umaMemoryProviderGetLastNativeError function.

In general, I would like to see some code-snippet that demonstrates how these API suppose to be used by client (e.g. UR).

I've extended the error codes in #641

If we assume that umaMemoryProviderGetLastNativeError can only be called by the memory pool user, then we could probably replace the hProvider handle with a pool handle. I don't think we can remove the handle entirely because we need to know which function to call (from memory_provider_ops that is associated with that handle).

Still I think umaMemoryProviderGetLastNativeError might also be useful inside a memory pool implementation (e.g. when a pool always uses some specific metadata provider).

The code would look similar to what I proposed in the comment above (just with extra switch for metadata/data provider):

void* ptr = umaPoolMalloc(hPool, size); if (!ptr) { uma_result_t ret = umaPoolGetLastAllocationError(hPool); uma_memory_provider_handle_t provider; if (ret == UMA_RESULT_ERROR_DATA_MEMORY_PROVIDER_SPECIFIC) { provider = umaPoolGetDataMemoryProvider(hPool): } else if (ret == UMA_RESULT_ERROR_METADATA_MEMORY_PROVIDER_SPECIFIC { provider = umaPoolGetMetadataMemoryProvider(hPool): } else { .... } const char *msg; int32_t native_error; umaMemoryProviderGetLastNativeError(provider, &msg, &native_error); // cast native_error to ze_result_t and handle it }

But then how would you know which provider failed? (i.e. which provider to pass to the umaMemoryProviderGetLastNativeError)?

If we decide not to merge #641 we need some other way of specyfing which memory provider is the one that failed. The simplest approach would be just to have only one provider per pool (as we had originally).

@vinser52 Anyway, this PR does not introduce UMA_RESULT_ERROR_DATA_MEMORY_PROVIDER_SPECIFIC nor UMA_RESULT_ERROR_METADATA_MEMORY_PROVIDER_SPECIFIC so I think this discussion can continue on this thread: https://github.com/oneapi-src/unified-runtime/pull/641/files#r1244132297

Can this PR be merged or do you have any other comments?

In general, to interpret the native error code you should know the concrete implementation referred by the memory provider handle.

Also, I think it is useful to know which provider failed: whether it's metadata or data provider.

And it does not matter whether it DATA or METADATA memory provider. You just need to know which memory provider failed. For example, imagine the following situation:

You have two memory providers (different implementations - different ops structure) and pass them to umaPoolCreate function.

umaPoolMalloc call failed because one of the memory providers failed. Does it really matter whether failed memory provider was used as DATA or METADATA? One obvious solution could be just to have umaPoolGetLastFailedProvider function to get failed provider. It will help to avoid if-else if-...-else block.

But an even better idea is to look how similar problem solved in STL: std::error_code

Can this PR be merged or do you have any other comments?

It would be useful to understand how clients are going to use this native error? Do we have a code snippet from UR perspective? Let's discuss it by phone first.

Done. I've implemented umaGetLastFailedProvider as a standalone function that does not accept a pool handle so that it can be implemented in memory_provider.c (as opposed to in each memory pool separately).

source/common/unified_memory_allocation/include/uma/memory_pool.h

vinser52 · 2023-06-27T15:23:09Z

I was also thinking about integrating this with UR. Currently urUSMFree() can return errors such as DEVICE_LOST

I think it is a memory provider-specific error. You will not notice this error until coarse grain allocation/deallocation.

Memory provider: Remove the get_last_result function and replace it with get_last_error. This primary difference is the addition of the pError out parameter which returns a native error code emitted from a failed provider function which resulted in returning UMA_RESULT_ERROR_MEMORY_PROVIDER_SPECIFIC. Memory_pool: Remove the get_last_result function and replace it with get_last_allocation_status. For memory pool, I don't think we care about a pool-specific error codes. We do, however, want to know why the previous allocation/de-allocation failed. Memory pool allocation functions only return a pointer so we need an extra function (get_last_allocation_status) to get this information.

use get_last_allocation_error to return last error from allocation functions.

igchor · 2023-06-27T16:34:06Z

I was also thinking about integrating this with UR. Currently urUSMFree() can return errors such as DEVICE_LOST

I think it is a memory provider-specific error. You will not notice this error until coarse grain allocation/deallocation.

Yes, of course - but you probably need to know that the deallocation failed (due to whatever reason) so that you know that you need to query memory provider for the actual reason.

by returning the name directly instead of through out param. This function is not allowed to fail anyway.

which returns handle to the last failed provider. This makes error reporting API independent from how memory pool is using the providers (i.e. whether there is only a single provider or multiple)

igchor · 2023-06-29T14:30:47Z

@pbalcer could you take one more look and merge if it's OK? I've added one extra function umaGetLastFailedMemoryProvider() which is useful in case we have multiple memory providers (there's a discussion about this in the comments).

igchor requested a review from kswiecicki June 20, 2023 18:12

igchor mentioned this pull request Jun 20, 2023

[uma] Replace list of providers with explicit data and metadata providers. #641

Closed

pbalcer approved these changes Jun 21, 2023

View reviewed changes

kswiecicki approved these changes Jun 22, 2023

View reviewed changes

DamianDuy reviewed Jun 26, 2023

View reviewed changes

pbalcer approved these changes Jun 26, 2023

View reviewed changes

igchor force-pushed the uma_error branch from 5c2b851 to f4eb128 Compare June 26, 2023 18:00

vinser52 reviewed Jun 27, 2023

View reviewed changes

igchor added 2 commits June 27, 2023 16:16

[uma] Improve error reporing int DisjointPool

438c0d3

use get_last_allocation_error to return last error from allocation functions.

igchor force-pushed the uma_error branch from f4eb128 to 438c0d3 Compare June 27, 2023 16:36

igchor added 2 commits June 28, 2023 17:57

[uma] simplify get_name

599942a

by returning the name directly instead of through out param. This function is not allowed to fail anyway.

[uma] add umaGetLastFailedMemoryProvider

6f0df6c

which returns handle to the last failed provider. This makes error reporting API independent from how memory pool is using the providers (i.e. whether there is only a single provider or multiple)

vinser52 approved these changes Jun 29, 2023

View reviewed changes

pbalcer approved these changes Jun 29, 2023

View reviewed changes

pbalcer merged commit 3f89f04 into oneapi-src:main Jun 29, 2023
18 checks passed

igchor deleted the uma_error branch June 29, 2023 15:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[uma] Improve reporting of pool/provider specific errors #640

[uma] Improve reporting of pool/provider specific errors #640

igchor commented Jun 20, 2023

igchor commented Jun 20, 2023

pbalcer left a comment

igchor commented Jun 21, 2023

DamianDuy Jun 26, 2023

pbalcer left a comment

igchor commented Jun 26, 2023

vinser52 Jun 27, 2023

igchor Jun 27, 2023

vinser52 Jun 27, 2023

igchor Jun 27, 2023

igchor Jun 27, 2023 •

edited

Loading

igchor Jun 27, 2023

igchor Jun 27, 2023

igchor Jun 27, 2023 •

edited

Loading

vinser52 Jun 28, 2023 •

edited

Loading

igchor Jun 28, 2023

vinser52 commented Jun 27, 2023

igchor commented Jun 27, 2023

igchor commented Jun 29, 2023

[uma] Improve reporting of pool/provider specific errors #640

[uma] Improve reporting of pool/provider specific errors #640

Conversation

igchor commented Jun 20, 2023

igchor commented Jun 20, 2023

pbalcer left a comment

Choose a reason for hiding this comment

igchor commented Jun 21, 2023

Choose a reason for hiding this comment

pbalcer left a comment

Choose a reason for hiding this comment

igchor commented Jun 26, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

igchor Jun 27, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

igchor Jun 27, 2023 • edited Loading

Choose a reason for hiding this comment

vinser52 Jun 28, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vinser52 commented Jun 27, 2023

igchor commented Jun 27, 2023

igchor commented Jun 29, 2023

igchor Jun 27, 2023 •

edited

Loading

igchor Jun 27, 2023 •

edited

Loading

vinser52 Jun 28, 2023 •

edited

Loading