-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[uma] Improve reporting of pool/provider specific errors #640
Conversation
As an example, in case there is a provider-specific error on allocation, the code would look like this: void* ptr = umaPoolMalloc(hPool, size);
if (!ptr) {
uma_result_t ret = umaPoolGetLastAllocationError(hPool);
if (ret == UMA_RESULT_ERROR_MEMORY_PROVIDER_SPECIFIC) {
const char *msg;
int32_t native_error;
umaMemoryProviderGetLastNativeError(getProvider(hPool), &msg, &native_error);
// cast native_error to ze_result_t and handle it
} else {
...
}
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Principle of least surprise. We should not be inventing new behaviors here IMHO. How free behaves with respect to errors is well defined - it doesn't directly return anything and preserves errno (or umaPoolGetLastAllocationError
in this case). A user can take no useful action when free fails (assuming that anyone would check that error in the first place?).
Callbacks might make sense.
I was also thinking about integrating this with UR. Currently urUSMFree() can return errors such as Callbacks might be a good idea in a general case: I guess it should be even possible to implement a C++ API that would use e.g. exceptions on top of that. |
@@ -67,6 +67,10 @@ static void *AlignPtrUp(void *Ptr, const size_t Alignment) { | |||
return static_cast<char *>(AlignedPtr) + Alignment; | |||
} | |||
|
|||
struct MemoryProviderError { | |||
uma_result_t code; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
nit: perhaps a better name for this variable would be retCode? I think it may be more descriptive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave it as is for now. We can always return to the problem with free
later.
Please rebase.
Done. |
umaMemoryProviderGetLastResult(uma_memory_provider_handle_t hProvider, | ||
const char **ppMessage); | ||
/// \param pError [out] pointer to an integer where the adapter specific error code will be stored | ||
void umaMemoryProviderGetLastNativeError(uma_memory_provider_handle_t hProvider, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if some pool uses several memory providers, e.g. one for data and another one for metadata?
what will be the user flow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. User might need to know which (metadata or data) provider failed. I think we need to extend error codes to specify DATA_PROVIDER_ERROR and METADATA_PROVIDER_ERROR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like the idea of introducing provider type-specific errors, especially if we consider changing the way how we specify provider type in the pool.
We can think of storing the handle to the provider that failed and remove hProvider
argument from the umaMemoryProviderGetLastNativeError
function.
In general, I would like to see some code-snippet that demonstrates how these API suppose to be used by client (e.g. UR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've extended the error codes in #641
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we assume that umaMemoryProviderGetLastNativeError
can only be called by the memory pool user, then we could probably replace the hProvider
handle with a pool
handle. I don't think we can remove the handle entirely because we need to know which function to call (from memory_provider_ops that is associated with that handle).
Still I think umaMemoryProviderGetLastNativeError
might also be useful inside a memory pool implementation (e.g. when a pool always uses some specific metadata provider).
The code would look similar to what I proposed in the comment above (just with extra switch for metadata/data provider):
void* ptr = umaPoolMalloc(hPool, size);
if (!ptr) {
uma_result_t ret = umaPoolGetLastAllocationError(hPool);
uma_memory_provider_handle_t provider;
if (ret == UMA_RESULT_ERROR_DATA_MEMORY_PROVIDER_SPECIFIC) {
provider = umaPoolGetDataMemoryProvider(hPool):
} else if (ret == UMA_RESULT_ERROR_METADATA_MEMORY_PROVIDER_SPECIFIC {
provider = umaPoolGetMetadataMemoryProvider(hPool):
} else {
....
}
const char *msg;
int32_t native_error;
umaMemoryProviderGetLastNativeError(provider, &msg, &native_error);
// cast native_error to ze_result_t and handle it
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then how would you know which provider failed? (i.e. which provider to pass to the umaMemoryProviderGetLastNativeError
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide not to merge #641 we need some other way of specyfing which memory provider is the one that failed. The simplest approach would be just to have only one provider per pool (as we had originally).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vinser52 Anyway, this PR does not introduce UMA_RESULT_ERROR_DATA_MEMORY_PROVIDER_SPECIFIC nor UMA_RESULT_ERROR_METADATA_MEMORY_PROVIDER_SPECIFIC so I think this discussion can continue on this thread: https://github.com/oneapi-src/unified-runtime/pull/641/files#r1244132297
Can this PR be merged or do you have any other comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, to interpret the native error code you should know the concrete implementation referred by the memory provider handle.
Also, I think it is useful to know which provider failed: whether it's metadata or data provider.
And it does not matter whether it DATA or METADATA memory provider. You just need to know which memory provider failed. For example, imagine the following situation:
- You have two memory providers (different implementations - different ops structure) and pass them to
umaPoolCreate
function. umaPoolMalloc
call failed because one of the memory providers failed. Does it really matter whether failed memory provider was used as DATA or METADATA? One obvious solution could be just to haveumaPoolGetLastFailedProvider
function to get failed provider. It will help to avoidif-else if-...-else
block.
But an even better idea is to look how similar problem solved in STL: std::error_code
Can this PR be merged or do you have any other comments?
It would be useful to understand how clients are going to use this native error? Do we have a code snippet from UR perspective? Let's discuss it by phone first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I've implemented umaGetLastFailedProvider
as a standalone function that does not accept a pool
handle so that it can be implemented in memory_provider.c (as opposed to in each memory pool separately).
I think it is a memory provider-specific error. You will not notice this error until coarse grain allocation/deallocation. |
Memory provider: Remove the get_last_result function and replace it with get_last_error. This primary difference is the addition of the pError out parameter which returns a native error code emitted from a failed provider function which resulted in returning UMA_RESULT_ERROR_MEMORY_PROVIDER_SPECIFIC. Memory_pool: Remove the get_last_result function and replace it with get_last_allocation_status. For memory pool, I don't think we care about a pool-specific error codes. We do, however, want to know why the previous allocation/de-allocation failed. Memory pool allocation functions only return a pointer so we need an extra function (get_last_allocation_status) to get this information.
use get_last_allocation_error to return last error from allocation functions.
Yes, of course - but you probably need to know that the deallocation failed (due to whatever reason) so that you know that you need to query memory provider for the actual reason. |
by returning the name directly instead of through out param. This function is not allowed to fail anyway.
which returns handle to the last failed provider. This makes error reporting API independent from how memory pool is using the providers (i.e. whether there is only a single provider or multiple)
@pbalcer could you take one more look and merge if it's OK? I've added one extra function |
Initially I wanted to implement the same
get_last_error
function for both memory pool and memory provider. However, I realized that the use cases are a bit different:Becasue of that the interface for memory pool and provider is different.
Also, there is one problem with the current approach - return value from
umaPoolGetLastAllocationError
is meaningful only if the last allocation operation (malloc, aligned_malloc, etc.) failed (returned NULL). This means that it cannot be used to report errors from memory_pool::free() since that function does not return anything (and so, we don't know if it failed and whether it's safe to callumaPoolGetLastAllocationError
).If we want to make it work
free
we could make the return value well-defined in all cases (e.g. UMA_RESULT_SUCCESS if the previous operation succeeded). This would require any pool implementation to clear last error code on every operation. We could probably provide some callback in memory_pool_ops so that the pool-implementation does not need to do that manually and we'd just call it from memory_pool.c (in implementation of umaPoolMalloc, etc). @pbalcer @vinser52 do you think that would make sense?