-
Notifications
You must be signed in to change notification settings - Fork 834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add FileReaderBuilder for arrow-ipc to allow reading large no. of column files #5136
Conversation
arrow-ipc/src/reader.rs
Outdated
pub fn with_verifier_options(mut self, verifier_options: VerifierOptions) -> Self { | ||
self.verifier_options = verifier_options; | ||
self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't sure if preferable to have this for maximum flexibility, or have something as suggested here: #4434 (comment)
Which might be more user friendly, though abstracts away the inner flatbuffers setting being changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would prefer the max columns option as it both avoids exposing flatbuffer types in our public API, and is more obvious to users why it might be relevant to them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering that keys in the schema custom metadata can also contribute to the table count in the footer flatbuffer, not sure if naming it something like with_max_columns()
would be accurate.
Is it possible to simply abstract over those flatbuffer settings without exposing the inner flatbuffer struct, such as
.with_flatbuffers_max_tables(10000000)
.with_flatbuffers_max_depth(100)
- In case a user has a file with a deeply nested schema and might want to tune this parameter as well, unlikely as it might be
Can then document these methods to explain what effect tuning them would have on the file reader, etc.
I've refactored to not expose the flatbuffer types, but am still keeping the flatbuffer terminology. I figured that since tables refers to both key-value metadata pairs and also columns, it would be better to just expose the flatbuffer setting and document it rather than name it Also added setting for depth while I was at it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you
I have double-checked that it is only the footer where we need to handle this, as the message encoding is already flattened and doesn't contain nested tables. |
Which issue does this PR close?
Closes #4432
Rationale for this change
Following suggestion as per #4434 (comment)
What changes are included in this PR?
New
FileReaderBuilder
for arrow-ipc reader to allow configuring theVerifierOptions
which can allow reading an IPC file with massive number of columns (over 1 million) if user configures correctly.Are there any user-facing changes?