Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[uma] Improve reporting of pool/provider specific errors #640

Merged
merged 4 commits into from
Jun 29, 2023

Conversation

igchor
Copy link
Member

@igchor igchor commented Jun 20, 2023

Initially I wanted to implement the same get_last_error function for both memory pool and memory provider. However, I realized that the use cases are a bit different:

  • for memory provider, we want to get the native code (e.g. to propagate from L0-provider to L0-adapter)
  • for memory pool, we are mostly interested with UMA error code (why the allocation/de-allocation failed) - not necessarily with a native error code.

Becasue of that the interface for memory pool and provider is different.

Also, there is one problem with the current approach - return value from umaPoolGetLastAllocationError is meaningful only if the last allocation operation (malloc, aligned_malloc, etc.) failed (returned NULL). This means that it cannot be used to report errors from memory_pool::free() since that function does not return anything (and so, we don't know if it failed and whether it's safe to call umaPoolGetLastAllocationError).

If we want to make it work free we could make the return value well-defined in all cases (e.g. UMA_RESULT_SUCCESS if the previous operation succeeded). This would require any pool implementation to clear last error code on every operation. We could probably provide some callback in memory_pool_ops so that the pool-implementation does not need to do that manually and we'd just call it from memory_pool.c (in implementation of umaPoolMalloc, etc). @pbalcer @vinser52 do you think that would make sense?

@igchor igchor requested a review from kswiecicki June 20, 2023 18:12
@igchor
Copy link
Member Author

igchor commented Jun 20, 2023

As an example, in case there is a provider-specific error on allocation, the code would look like this:

void* ptr = umaPoolMalloc(hPool, size);
if (!ptr) {
    uma_result_t ret = umaPoolGetLastAllocationError(hPool);
    if (ret == UMA_RESULT_ERROR_MEMORY_PROVIDER_SPECIFIC) {
        const char *msg;
        int32_t native_error;
        umaMemoryProviderGetLastNativeError(getProvider(hPool), &msg, &native_error);
        // cast native_error to ze_result_t and handle it
    } else {
        ...
    }
}

Copy link
Contributor

@pbalcer pbalcer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Principle of least surprise. We should not be inventing new behaviors here IMHO. How free behaves with respect to errors is well defined - it doesn't directly return anything and preserves errno (or umaPoolGetLastAllocationError in this case). A user can take no useful action when free fails (assuming that anyone would check that error in the first place?).
Callbacks might make sense.

@igchor
Copy link
Member Author

igchor commented Jun 21, 2023

Principle of least surprise. We should not be inventing new behaviors here IMHO. How free behaves with respect to errors is well defined - it doesn't directly return anything and preserves errno (or umaPoolGetLastAllocationError in this case). A user can take no useful action when free fails (assuming that anyone would check that error in the first place?). Callbacks might make sense.

I was also thinking about integrating this with UR. Currently urUSMFree() can return errors such as DEVICE_LOST. I don't think there is any other reasonable action that the user can take, other than just resetting the device (or the application). Still, it might be useful to get this error sooner rather than later (on free, rather than on the next allocation) so I think we should have some way of communicating this error.

Callbacks might be a good idea in a general case: I guess it should be even possible to implement a C++ API that would use e.g. exceptions on top of that.

@@ -67,6 +67,10 @@ static void *AlignPtrUp(void *Ptr, const size_t Alignment) {
return static_cast<char *>(AlignedPtr) + Alignment;
}

struct MemoryProviderError {
uma_result_t code;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
nit: perhaps a better name for this variable would be retCode? I think it may be more descriptive.

Copy link
Contributor

@pbalcer pbalcer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave it as is for now. We can always return to the problem with free later.
Please rebase.

@igchor
Copy link
Member Author

igchor commented Jun 26, 2023

Let's leave it as is for now. We can always return to the problem with free later. Please rebase.

Done.

source/common/uma_pools/disjoint_pool.cpp Show resolved Hide resolved
umaMemoryProviderGetLastResult(uma_memory_provider_handle_t hProvider,
const char **ppMessage);
/// \param pError [out] pointer to an integer where the adapter specific error code will be stored
void umaMemoryProviderGetLastNativeError(uma_memory_provider_handle_t hProvider,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if some pool uses several memory providers, e.g. one for data and another one for metadata?
what will be the user flow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. User might need to know which (metadata or data) provider failed. I think we need to extend error codes to specify DATA_PROVIDER_ERROR and METADATA_PROVIDER_ERROR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not like the idea of introducing provider type-specific errors, especially if we consider changing the way how we specify provider type in the pool.

We can think of storing the handle to the provider that failed and remove hProvider argument from the umaMemoryProviderGetLastNativeError function.

In general, I would like to see some code-snippet that demonstrates how these API suppose to be used by client (e.g. UR).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've extended the error codes in #641

Copy link
Member Author

@igchor igchor Jun 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we assume that umaMemoryProviderGetLastNativeError can only be called by the memory pool user, then we could probably replace the hProvider handle with a pool handle. I don't think we can remove the handle entirely because we need to know which function to call (from memory_provider_ops that is associated with that handle).

Still I think umaMemoryProviderGetLastNativeError might also be useful inside a memory pool implementation (e.g. when a pool always uses some specific metadata provider).

The code would look similar to what I proposed in the comment above (just with extra switch for metadata/data provider):

void* ptr = umaPoolMalloc(hPool, size);
if (!ptr) {
    uma_result_t ret = umaPoolGetLastAllocationError(hPool);
    uma_memory_provider_handle_t provider;
    if (ret == UMA_RESULT_ERROR_DATA_MEMORY_PROVIDER_SPECIFIC) {
        provider = umaPoolGetDataMemoryProvider(hPool):
    } else if (ret == UMA_RESULT_ERROR_METADATA_MEMORY_PROVIDER_SPECIFIC {
        provider = umaPoolGetMetadataMemoryProvider(hPool):
    } else {
        ....
    }
    const char *msg;
    int32_t native_error;
    umaMemoryProviderGetLastNativeError(provider, &msg, &native_error);
    // cast native_error to ze_result_t and handle it
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then how would you know which provider failed? (i.e. which provider to pass to the umaMemoryProviderGetLastNativeError)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide not to merge #641 we need some other way of specyfing which memory provider is the one that failed. The simplest approach would be just to have only one provider per pool (as we had originally).

Copy link
Member Author

@igchor igchor Jun 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinser52 Anyway, this PR does not introduce UMA_RESULT_ERROR_DATA_MEMORY_PROVIDER_SPECIFIC nor UMA_RESULT_ERROR_METADATA_MEMORY_PROVIDER_SPECIFIC so I think this discussion can continue on this thread: https://github.com/oneapi-src/unified-runtime/pull/641/files#r1244132297

Can this PR be merged or do you have any other comments?

Copy link
Contributor

@vinser52 vinser52 Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, to interpret the native error code you should know the concrete implementation referred by the memory provider handle.

Also, I think it is useful to know which provider failed: whether it's metadata or data provider.

And it does not matter whether it DATA or METADATA memory provider. You just need to know which memory provider failed. For example, imagine the following situation:

  1. You have two memory providers (different implementations - different ops structure) and pass them to umaPoolCreate function.
  2. umaPoolMalloc call failed because one of the memory providers failed. Does it really matter whether failed memory provider was used as DATA or METADATA? One obvious solution could be just to have umaPoolGetLastFailedProvider function to get failed provider. It will help to avoid if-else if-...-else block.

But an even better idea is to look how similar problem solved in STL: std::error_code

Can this PR be merged or do you have any other comments?

It would be useful to understand how clients are going to use this native error? Do we have a code snippet from UR perspective? Let's discuss it by phone first.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I've implemented umaGetLastFailedProvider as a standalone function that does not accept a pool handle so that it can be implemented in memory_provider.c (as opposed to in each memory pool separately).

@vinser52
Copy link
Contributor

I was also thinking about integrating this with UR. Currently urUSMFree() can return errors such as DEVICE_LOST

I think it is a memory provider-specific error. You will not notice this error until coarse grain allocation/deallocation.

Memory provider:
 Remove the get_last_result function and replace it with
 get_last_error. This primary difference is the addition of
 the pError out parameter which returns a native error code
 emitted from a failed provider function which resulted in returning
 UMA_RESULT_ERROR_MEMORY_PROVIDER_SPECIFIC.

Memory_pool:
 Remove the get_last_result function and replace it with
 get_last_allocation_status. For memory pool, I don't think we care
 about a pool-specific error codes. We do, however, want to know why the
 previous allocation/de-allocation failed. Memory pool allocation
 functions only return a pointer so we need an extra function
 (get_last_allocation_status) to get this information.
use get_last_allocation_error to return last error
from allocation functions.
@igchor
Copy link
Member Author

igchor commented Jun 27, 2023

I was also thinking about integrating this with UR. Currently urUSMFree() can return errors such as DEVICE_LOST

I think it is a memory provider-specific error. You will not notice this error until coarse grain allocation/deallocation.

Yes, of course - but you probably need to know that the deallocation failed (due to whatever reason) so that you know that you need to query memory provider for the actual reason.

by returning the name directly instead of through out param. This
function is not allowed to fail anyway.
which returns handle to the last failed provider.

This makes error reporting API independent from how
memory pool is using the providers (i.e. whether there is
only a single provider or multiple)
@igchor
Copy link
Member Author

igchor commented Jun 29, 2023

@pbalcer could you take one more look and merge if it's OK? I've added one extra function umaGetLastFailedMemoryProvider() which is useful in case we have multiple memory providers (there's a discussion about this in the comments).

@pbalcer pbalcer merged commit 3f89f04 into oneapi-src:main Jun 29, 2023
18 checks passed
@igchor igchor deleted the uma_error branch June 29, 2023 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants