Add FileReaderBuilder for arrow-ipc to allow reading large no. of column files #5136

Merged
merged 3 commits into from
Dec 26, 2023

Conversation

Jefffrey
Contributor

Which issue does this PR close?

Closes #4432

Rationale for this change

Follows the suggestion in #4434 (comment).

What changes are included in this PR?

Adds a new FileReaderBuilder to the arrow-ipc reader that allows configuring the VerifierOptions. With suitable settings, this makes it possible to read an IPC file with a massive number of columns (over 1 million).

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 28, 2023
Comment on lines 527 to 529
    pub fn with_verifier_options(mut self, verifier_options: VerifierOptions) -> Self {
        self.verifier_options = verifier_options;
        self
Contributor Author

I wasn't sure whether it is preferable to keep this for maximum flexibility or to use something like what was suggested here: #4434 (comment)

That might be more user-friendly, though it abstracts away the inner flatbuffers setting being changed.

Contributor

I think I would prefer the max columns option as it both avoids exposing flatbuffer types in our public API, and is more obvious to users why it might be relevant to them

Contributor Author

Since keys in the schema custom metadata can also contribute to the table count in the footer flatbuffer, I'm not sure a name like with_max_columns() would be accurate.

Would it be possible to simply abstract over those flatbuffer settings without exposing the inner flatbuffer struct? For example:

.with_flatbuffers_max_tables(10000000)
.with_flatbuffers_max_depth(100)
  • The depth setting is for a user whose file has a deeply nested schema and who might want to tune this parameter as well, unlikely as that might be

We could then document these methods to explain what effect tuning them has on the file reader.
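
The proposal above could look roughly like the following. This is a minimal, standard-library-only sketch and not the merged arrow-ipc code: SketchFileReaderBuilder and InnerVerifierOptions are hypothetical names, the inner struct only stands in for the flatbuffers verifier settings, and the default values (1,000,000 tables, depth 64) are assumed to mirror the flatbuffers defaults. The method names are the ones floated in this comment.

```rust
/// Hypothetical stand-in for the flatbuffers verifier settings.
/// It stays private so that no flatbuffers type appears in the public API.
#[derive(Debug, Clone, PartialEq)]
struct InnerVerifierOptions {
    max_tables: usize,
    max_depth: usize,
}

impl Default for InnerVerifierOptions {
    fn default() -> Self {
        // Assumed defaults, chosen to mirror the flatbuffers verifier.
        Self {
            max_tables: 1_000_000,
            max_depth: 64,
        }
    }
}

/// Illustrative builder that wraps the inner settings behind plain setters.
#[derive(Debug, Default)]
pub struct SketchFileReaderBuilder {
    options: InnerVerifierOptions,
}

impl SketchFileReaderBuilder {
    /// Starts from the assumed default limits.
    pub fn new() -> Self {
        Self::default()
    }

    /// Raises or lowers the number of tables the footer may contain.
    pub fn with_flatbuffers_max_tables(mut self, max_tables: usize) -> Self {
        self.options.max_tables = max_tables;
        self
    }

    /// Raises or lowers the nesting depth the footer may contain.
    pub fn with_flatbuffers_max_depth(mut self, max_depth: usize) -> Self {
        self.options.max_depth = max_depth;
        self
    }

    /// Currently configured table limit.
    pub fn max_tables(&self) -> usize {
        self.options.max_tables
    }

    /// Currently configured depth limit.
    pub fn max_depth(&self) -> usize {
        self.options.max_depth
    }
}
```

Callers only ever pass plain integers, so the flatbuffers dependency could change its options struct without breaking the public signature.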

@Jefffrey
Contributor Author

Jefffrey commented Dec 25, 2023

I've refactored this so that it no longer exposes the flatbuffer types, but I'm still keeping the flatbuffer terminology. Since tables refers to both key-value metadata pairs and columns, I figured it would be better to expose the flatbuffer setting and document it than to name it with_approx_max_columns or something similar. Such a name could cause confusion, since the limit is also affected by the number of metadata key-value pairs.

I also added a setting for depth while I was at it.
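
To illustrate why a column-only name could mislead, here is a deliberately simplified model of the table budget. It assumes that each column and each metadata key-value pair costs one table; the real footer encoding may well use more tables per column, so treat the estimate as a lower bound. Both function names are hypothetical and exist only for this sketch.

```rust
/// Simplified lower-bound estimate of the tables in a footer:
/// one per column plus one per metadata key-value pair (an assumption
/// made for illustration, not the exact flatbuffers accounting).
pub fn estimated_footer_tables(columns: usize, metadata_pairs: usize) -> usize {
    columns.saturating_add(metadata_pairs)
}

/// Returns true if the estimate stays within the configured table limit.
pub fn fits_table_budget(columns: usize, metadata_pairs: usize, max_tables: usize) -> bool {
    estimated_footer_tables(columns, metadata_pairs) <= max_tables
}
```

Under this model, a file with 500,000 columns fits a limit of 1,000,000 tables on its own, but the same file with 600,000 metadata pairs does not, even though its column count is unchanged.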

@Jefffrey Jefffrey marked this pull request as ready for review December 25, 2023 11:00
Contributor

@tustvold tustvold left a comment

Thank you

@tustvold
Contributor

I have double-checked that it is only the footer where we need to handle this, as the message encoding is already flattened and doesn't contain nested tables.

@tustvold tustvold merged commit add8f56 into apache:master Dec 26, 2023
25 checks passed
@Jefffrey Jefffrey deleted the ipc_file_reader_builder branch December 26, 2023 20:26
Labels
arrow Changes to the arrow crate
Development

Successfully merging this pull request may close these issues.

Implicit one million column limit on arrow files
2 participants