Add FileReaderBuilder for arrow-ipc to allow reading large no. of column files #5136

Jefffrey · 2023-11-28T12:44:36Z

Which issue does this PR close?

Rationale for this change

Following suggestion as per #4434 (comment)

What changes are included in this PR?

Adds a new FileReaderBuilder to the arrow-ipc reader that allows configuring the VerifierOptions. With suitable settings, this makes it possible to read an IPC file with a massive number of columns (over 1 million).

Are there any user-facing changes?

Jefffrey · 2023-11-28T12:45:49Z

arrow-ipc/src/reader.rs

+    pub fn with_verifier_options(mut self, verifier_options: VerifierOptions) -> Self {
+        self.verifier_options = verifier_options;
+        self


I wasn't sure whether it is preferable to expose this for maximum flexibility, or to have something like what was suggested here: #4434 (comment)

That might be more user-friendly, though it abstracts away the inner flatbuffers setting being changed.

tustvold · 2023-11-29

I think I would prefer the max columns option, as it both avoids exposing flatbuffer types in our public API and makes it more obvious to users why it might be relevant to them.

Jefffrey · 2023-12-01

Since keys in the schema's custom metadata also contribute to the table count in the footer flatbuffer, I'm not sure a name like with_max_columns() would be accurate.

Would it be possible to simply abstract over those flatbuffer settings without exposing the inner flatbuffer struct, such as:

    .with_flatbuffers_max_tables(10000000)
    .with_flatbuffers_max_depth(100)

This would also cover a user whose file has a deeply nested schema and who wants to tune the depth parameter as well, unlikely as that might be.

These methods can then be documented to explain what effect tuning them has on the file reader.
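To make the proposed shape concrete, here is a minimal sketch of a builder that accepts plain integers and keeps the settings struct private. Everything in it is a stand-in for illustration: `VerifierLimits` and `FileReaderBuilderSketch` are hypothetical types, not the arrow-ipc or flatbuffers ones, the method names are the ones floated in this thread, and the default values are assumed.

```rust
/// Hypothetical stand-in for the flatbuffers verifier settings.
/// It is kept private so that no flatbuffer type leaks into the public API.
#[derive(Debug, Clone, PartialEq)]
struct VerifierLimits {
    max_tables: usize,
    max_depth: usize,
}

impl Default for VerifierLimits {
    fn default() -> Self {
        // Default values assumed for this sketch only.
        Self {
            max_tables: 1_000_000,
            max_depth: 64,
        }
    }
}

/// Hypothetical builder exposing plain integers instead of the inner struct.
#[derive(Debug, Default)]
struct FileReaderBuilderSketch {
    limits: VerifierLimits,
}

impl FileReaderBuilderSketch {
    fn new() -> Self {
        Self::default()
    }

    /// Raise or lower the number of flatbuffer tables the verifier accepts.
    fn with_flatbuffers_max_tables(mut self, max_tables: usize) -> Self {
        self.limits.max_tables = max_tables;
        self
    }

    /// Raise or lower the nesting depth the verifier accepts.
    fn with_flatbuffers_max_depth(mut self, max_depth: usize) -> Self {
        self.limits.max_depth = max_depth;
        self
    }

    /// A real builder would consume these in a `build` step; the sketch
    /// only returns them for inspection.
    fn limits(&self) -> &VerifierLimits {
        &self.limits
    }
}

fn main() {
    let builder = FileReaderBuilderSketch::new()
        .with_flatbuffers_max_tables(10_000_000)
        .with_flatbuffers_max_depth(100);
    println!("{:?}", builder.limits());
}
```

Because callers only ever pass integers, the underlying verifier type could change without breaking this interface.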

Jefffrey · 2023-12-25T10:59:56Z

I've refactored this to no longer expose the flatbuffer types, but I'm keeping the flatbuffer terminology. Since "tables" covers both key-value metadata pairs and columns, I figured it would be better to expose the flatbuffer setting and document it than to name it with_approx_max_columns or similar, which could cause confusion because the limit is also affected by the number of metadata key-value pairs.

I also added a setting for depth while I was at it.
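As a rough illustration of why a column-based name would mislead, the sketch below computes a lower bound on the footer's table count, assuming only what is said above: every column and every metadata key-value pair contributes at least one table. The function names and the limit of 1,000,000 are assumptions made for this example; the true table count will be higher than this bound.

```rust
/// Lower bound on the number of tables in the footer: at least one per
/// column plus one per metadata key-value pair. The real count is higher.
fn min_footer_tables(columns: usize, metadata_pairs: usize) -> usize {
    columns.saturating_add(metadata_pairs)
}

/// Returns true when even the lower bound is above the configured limit.
/// A result of false does NOT guarantee that verification will pass.
fn definitely_exceeds(columns: usize, metadata_pairs: usize, max_tables: usize) -> bool {
    min_footer_tables(columns, metadata_pairs) > max_tables
}

fn main() {
    // Limit assumed for illustration.
    let max_tables = 1_000_000;

    // Only ten columns, but a very large metadata map still trips the limit.
    println!("{}", definitely_exceeds(10, 1_000_000, max_tables));

    // Many columns with little metadata stays under this lower bound.
    println!("{}", definitely_exceeds(500_000, 100, max_tables));
}
```

A file with few columns can therefore still need a higher table limit, which is the case a with_approx_max_columns name would hide.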

tustvold

Thank you

tustvold · 2023-12-26T13:04:15Z

I have double-checked that it is only the footer where we need to handle this, as the message encoding is already flattened and doesn't contain nested tables.

Add FileReaderBuilder for arrow-ipc

89be024

github-actions bot added the arrow (Changes to the arrow crate) label Nov 28, 2023

Jefffrey commented Nov 28, 2023

Jefffrey marked this pull request as draft November 30, 2023 21:38

tustvold mentioned this pull request Dec 22, 2023

Builder interface for arrow_ipc::FileWriter #5236

Open

Jefffrey added 2 commits December 25, 2023 16:19

Merge branch 'master' into ipc_file_reader_builder

f178b4f

Switch parameter to not expose flatbuffer types

67ce7d1

Jefffrey marked this pull request as ready for review December 25, 2023 11:00

tustvold approved these changes Dec 26, 2023

tustvold merged commit add8f56 into apache:master Dec 26, 2023
25 checks passed

Jefffrey deleted the ipc_file_reader_builder branch December 26, 2023 20:26
