Propagate errors from Parquet reader kernels back to host #14167
CC @nvdbaranec @etseidl
I can't think of a better way to do this. Do we want to define some constants for the error codes?
Definitely, just not sure if it should be in this PR. Related: should we return the error code as a bitmask? Would returning multiple errors even be useful?
I think a bitmask might be a bit much, and it limits us to 32 errors. There will probably be more ways to fail than that, especially if we also return errors from the preprocessing kernels.
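To make the trade-off concrete, here is a minimal sketch of named error constants laid out one bit per error type. The enum name, the flag names, and the helper functions are hypothetical and are not taken from the PR; the point is only that a 32-bit mask caps the number of distinct error kinds at 32.

```cpp
#include <cstdint>

// Hypothetical error flags, one bit each. Names are illustrative only.
enum class decode_error_flag : std::int32_t {
  none                 = 0,
  data_stream_overrun  = 1 << 0,
  level_stream_overrun = 1 << 1,
  unsupported_encoding = 1 << 2,
};

// Converts a flag to its raw bit pattern.
constexpr std::int32_t to_bits(decode_error_flag f)
{
  return static_cast<std::int32_t>(f);
}

// With one bit per error type, the mask width bounds the number of error kinds.
constexpr int max_flag_count = static_cast<int>(sizeof(std::int32_t)) * 8;

// Returns true if `mask` contains the given flag.
constexpr bool has_error(std::int32_t mask, decode_error_flag f)
{
  return (mask & to_bits(f)) != 0;
}
```

Plain sequential codes (1, 2, 3, ...) would lift the 32-kind limit, but then only one error can be reported per launch.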
cpp/src/io/parquet/page_decode.cuh
Outdated
cuda::atomic_ref<int32_t, cuda::thread_scope_block> ref{const_cast<int&>(error)};
ref.store(err, cuda::std::memory_order_relaxed);
Is this atomic necessary? I didn't see any places where anything other than thread 0 (of the block) sets the error code. I suppose that may not be the case in the future. Based on how this is called, I wonder if an atomic OR is better here so we can stash multiple error types as individual bits.
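The difference between a store and an atomic OR can be sketched on the host with std::atomic. This is an analogy only: the device code in the PR uses cuda::atomic_ref, and the function names below are hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Overwrites the error word: if two reporters write different codes,
// only the last value written survives.
inline void set_error_store(std::atomic<std::int32_t>& error, std::int32_t err)
{
  error.store(err, std::memory_order_relaxed);
}

// Accumulates into the error word: every reported bit survives,
// regardless of the order in which reporters run.
inline void set_error_or(std::atomic<std::int32_t>& error, std::int32_t err)
{
  error.fetch_or(err, std::memory_order_relaxed);
}
```

With the OR variant, reporting 0x1 and then 0x4 leaves 0x5 in the error word, whereas the store variant leaves only 0x4.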
I made it atomic since we probably don't need to worry about performance when failing. This seemed like a safe option for future checks as well.
About the error code as a mask: Ed is concerned about the limit on the number of errors that this would impose. I could be convinced to go either way; I don't expect the trade-off to be relevant in practice.
TBH the most common error condition is going to be a buffer overrun detected somewhere. We could probably get away without codes at all and have a single error bit. The host code calling the kernel can report which kernel failed. It just comes down to how fine-grained you want the error reporting to be.
I could see it either way. It's so hard to even know which thread failed and why (possibly because some other thread did something wrong) that having a set of bits could act as breadcrumbs to lead you to where things really went wrong. But on the other hand, you're a lot more limited in what you can report. I'm fine either way. Parallel error reporting is amusing in any case.
Looks like we're leaning towards a mask to aggregate errors. I'll make the changes.
…fea-read_parquet-error-report
…fea-read_parquet-error-report
…ule/cudf into fea-read_parquet-error-report
Yeah I like this mechanism. The explicit names also remove some of the mystery when reading the code itself too.
Co-authored-by: Yunsong Wang <yunsongw@nvidia.com>
…ule/cudf into fea-read_parquet-error-report
Looking good, just a few naming nits :D
LGTM
/merge
Fixes #13656. Uses the error reporting introduced in #14167 to report errors in header parsing.
Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)
Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)
URL: #14237
Description
Pass the error code to the host when a kernel detects invalid input.
If multiple error types are detected, they are combined using a bitwise OR so that the caller gets an aggregate error code that includes every type of error that occurred.
Does not change the kernel side checks.
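As a rough illustration of how a caller might consume such an aggregate code, the sketch below decodes a mask into messages and throws if any bit is set. The flag values, messages, and function names are assumptions made for this example and do not reflect the actual cuDF host code.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical mapping from a single error bit to a readable name.
struct flag_name {
  std::int32_t bit;
  char const* name;
};

// Lists the name of every known flag set in `mask`.
inline std::vector<std::string> describe_errors(std::int32_t mask)
{
  static constexpr flag_name table[] = {
    {0x1, "data stream overrun"},
    {0x2, "level stream overrun"},
    {0x4, "unsupported encoding"},
  };
  std::vector<std::string> out;
  for (auto const& f : table) {
    if ((mask & f.bit) != 0) { out.emplace_back(f.name); }
  }
  return out;
}

// Throws if the kernels reported any error; a zero mask means success.
inline void throw_on_error(std::int32_t mask)
{
  if (mask == 0) { return; }
  std::string msg = "Parquet decode failed:";
  for (auto const& n : describe_errors(mask)) { msg += " [" + n + "]"; }
  throw std::runtime_error(msg);
}
```

Because the bits are ORed together, a single check on the host after the kernels finish is enough to surface every error type that any thread block reported.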
Checklist