feat(parquet): Add next_row_group API for ParquetRecordBatchStream #6907

Xuanwo · 2024-12-20T04:45:28Z

Which issue does this PR close?

Closes ParquetRecordBatchStream API to fetch the next row group while decoding #6559
Related to Low-Level Arrow Parquet Reader #5522
Closes Add a 'prefetch' option to ParquetRecordBatchStream to load the next row group while decoding #6676

Rationale for this change

Add async fn next_row_group() for ParquetRecordBatchStream so that users can fetch row groups based on their needs and decode the data separately.

This PR marks the first step in further decoupling the I/O and decoding processes of Parquet reading.

What changes are included in this PR?

Add new API:

pub async fn next_row_group(&mut self) -> Result<Option<ParquetRecordBatchReader>> { ... }

Are there any user-facing changes?

Yes.

Signed-off-by: Xuanwo <github@xuanwo.io>

Xuanwo · 2024-12-20T04:48:47Z

parquet/src/arrow/async_reader/mod.rs

+    /// - `Ok(None)` if the stream has ended.
+    /// - `Err(error)` if the stream has errored. All subsequent calls will return `Ok(None)`.
+    /// - `Ok(Some(reader))` which holds all the data for the row group.
+    pub async fn next_row_group(&mut self) -> Result<Option<ParquetRecordBatchReader>> {


I'm not sure if next_row_group is the best name, open to other options.

I think it is a good, clear name as it clearly explains what it does
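
Editorial sketch: the return contract documented in the diff above (`Ok(None)` at end of stream, `Err(error)` once followed by `Ok(None)` on later calls, `Ok(Some(reader))` per row group) can be modelled with a small synchronous stand-in. This is only an illustration under stated assumptions: `ToyStream`, its `Vec<i32>` "row groups", and `drain` are hypothetical names, while the real method is `async` and yields a `ParquetRecordBatchReader`.

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for the stream; NOT the parquet crate's type.
struct ToyStream {
    pending: VecDeque<Result<Vec<i32>, String>>,
    errored: bool,
}

impl ToyStream {
    fn new(items: Vec<Result<Vec<i32>, String>>) -> Self {
        ToyStream { pending: items.into(), errored: false }
    }

    // Mirrors the documented contract, minus `async`.
    fn next_row_group(&mut self) -> Result<Option<Vec<i32>>, String> {
        if self.errored {
            // After an error, every later call reports end of stream.
            return Ok(None);
        }
        match self.pending.pop_front() {
            None => Ok(None),
            Some(Ok(rows)) => Ok(Some(rows)),
            Some(Err(e)) => {
                self.errored = true;
                Err(e)
            }
        }
    }
}

// A caller loop: stop on `Ok(None)`, propagate the first error with `?`.
fn drain(stream: &mut ToyStream) -> Result<usize, String> {
    let mut total = 0;
    while let Some(rows) = stream.next_row_group()? {
        total += rows.len();
    }
    Ok(total)
}
```

With the real API the loop has the same shape, except that each call is awaited and the returned reader is iterated to decode record batches.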

tustvold

So this PR does have a certain elegant simplicity to it, however, it doesn't really solve the separation of IO and compute given that reader_factory.read_factory potentially performs CPU-bound parquet decoding as part of late materialization / filter pushdown. It also has no ability to be parallelised.

Given that this isn't adding a host of additional complexity, I don't object to merging this in, but I wanted to flag that a solution to that problem likely will require something a bit different.

tustvold · 2024-12-20T20:01:29Z

parquet/src/arrow/async_reader/mod.rs

+    pub async fn next_row_group(&mut self) -> Result<Option<ParquetRecordBatchReader>> {
+        loop {
+            match &mut self.state {
+                StreamState::Decoding(_) | StreamState::Reading(_) => unreachable!(),


I think this should probably return an error saying not to mix polling the stream and using this API
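
Editorial sketch of the suggestion above: instead of `unreachable!()`, the in-flight states can map to an ordinary error so that mixing the two APIs fails without panicking. The enum, the struct, and the error text below are hypothetical simplifications for illustration; the real `StreamState` variants carry data, and the message in the merged code may differ.

```rust
use std::collections::VecDeque;

// Hypothetical, data-free version of the stream's state machine.
#[derive(Debug, PartialEq)]
enum ToyState {
    Init,
    Decoding,
    Reading,
    Error,
}

struct ToyStream {
    state: ToyState,
    remaining: VecDeque<usize>,
}

impl ToyStream {
    fn new(state: ToyState, row_groups: Vec<usize>) -> Self {
        ToyStream { state, remaining: row_groups.into() }
    }

    fn next_row_group(&mut self) -> Result<Option<usize>, String> {
        match self.state {
            // The caller polled the stream earlier and left work in flight:
            // report misuse as an error rather than panicking.
            ToyState::Decoding | ToyState::Reading => Err(
                "cannot mix polling the stream with next_row_group".to_string(),
            ),
            // A previous failure ends the stream.
            ToyState::Error => Ok(None),
            // Normal path: hand out the next row group index, if any.
            ToyState::Init => Ok(self.remaining.pop_front()),
        }
    }
}
```

Returning an error keeps the misuse recoverable for the caller, which is the behaviour the follow-up commit "Returning error instead of using unreachable" moves to.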

alamb · 2024-12-20T21:02:51Z

So this PR does have a certain elegant simplicity to it, however, it doesn't really solve the separation of IO and compute given that reader_factory.read_factory potentially performs CPU-bound parquet decoding as part of late materialization / filter pushdown.

I agree it doesn't solve (nor claim to) separating CPU and compute. Also, neither does what is currently in the repo.

It also has no ability to be parallelised.

I don't understand the assertion that this can't be parallelized. Do you mean there is no way to have concurrent outstanding fetch requests?

As I understand it, once the reader is returned, reading from the returned stream actually decodes the parquet data so this PR would allow the next IO to be interleaved with actually decoding the data.
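
Editorial sketch of the interleaving described above, using only the standard library: one thread plays the role of the I/O side and keeps fetching, while the caller decodes what has already arrived. `fetch_row_group`, `decode`, and `interleaved_sum` are hypothetical stand-ins; the real API is async and does not require threads, so this only models the overlap, not the parquet crate's implementation.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Hypothetical stand-in for the I/O step: "download" one row group's bytes.
fn fetch_row_group(idx: usize) -> Vec<u8> {
    vec![idx as u8; 4]
}

// Hypothetical stand-in for CPU-bound decoding of one row group.
fn decode(raw: &[u8]) -> u64 {
    raw.iter().map(|b| *b as u64).sum()
}

fn interleaved_sum(num_row_groups: usize) -> u64 {
    // A bounded channel limits how far the fetcher can run ahead of decoding.
    let (tx, rx) = sync_channel::<Vec<u8>>(1);

    let fetcher = thread::spawn(move || {
        for idx in 0..num_row_groups {
            // Stop early if the decoding side has gone away.
            if tx.send(fetch_row_group(idx)).is_err() {
                break;
            }
        }
    });

    // Decode each row group as it arrives; the fetcher keeps working meanwhile.
    let mut total = 0;
    for raw in rx {
        total += decode(&raw);
    }
    fetcher.join().expect("fetcher thread panicked");
    total
}
```

The loop over `rx` ends once the fetcher finishes and drops its sender, which corresponds to the stream returning `Ok(None)`.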

Given that this isn't adding a host of additional complexity, I don't object to merging this in, but I wanted to flag that a solution to that problem likely will require something a bit different.

I think we could support concurrent download / decode on multiple row groups of the same file today by creating multiple ParquetRecordBatchStream (each for a different row group / set of row groups) 🤔 Maybe it doesn't need a new API

alamb

Thank you very much @Xuanwo

After @tustvold 's comments about error vs panic are addressed, I think this PR looks good to me.

@masonh22 can you give this PR a look and see if it would work for your usecase?

alamb · 2024-12-20T21:04:14Z

FYI @etseidl

tustvold · 2024-12-20T21:06:45Z

I agree it doesn't solve (nor claim to) separating CPU and compute. Also, neither does what is currently in the repo.

Right this was in response to #6676 (comment) which instigated this PR.

I'm mostly wary of merging an API if we're going to have to replace it in order to meet the desired use-case

I don't understand the assertion that this can't be parallelized. Do you mean there is no way to have concurrent outstanding fetch requests?

The PR attests to be related to #5522 which concerns this

Edit:

I think we could support concurrent download / decode on multiple row groups of the same file today by creating multiple ParquetRecordBatchStream (each for a different row group / set of row groups) 🤔 Maybe it doesn't need a new API

Yes, which is what Datafusion does today. It is somewhat arcane to get it to work, but is documented here

alamb · 2024-12-20T21:15:01Z

I'm mostly wary of merging an API if we're going to have to replace it in order to meet the desired use-case

I agree if we have some actual alternative in mind we should evaluate that prior to merging this PR

It seems to me this PR makes it possible to interleave IO and decode which the current API does not.

I agree it does not address the other parts of #5522 (like parallel decode of columns, for example). I updated the description to say it closed #6559

alamb · 2024-12-20T21:15:56Z

In my opinion, even if we add some newer low-level API, there is still value in this higher-level one that permits interleaved download and decode, as described in

ParquetRecordBatchStream API to fetch the next row group while decoding #6559

masonh22 · 2024-12-21T04:06:57Z

I like this! This will work for what I need.

tustvold

Happy for this to be merged, unless @Xuanwo it doesn't meet your requirements and you plan to add something different instead

Xuanwo · 2024-12-23T01:27:21Z

Thank you, @alamb and @tustvold, for the review. I will address the error-handling issues, and then we can proceed with merging!

Signed-off-by: Xuanwo <github@xuanwo.io>

alamb · 2024-12-24T14:21:46Z

fyi @thinkharderdev

alamb · 2024-12-24T14:22:30Z

I think this is an improvement so merging it in. If others have additional ideas on other improvements or changes please open another PR.

Thanks again @Xuanwo @tustvold and @masonh22

…pache#6907)

* feat(parquet): Add next_row_group API for ParquetRecordBatchStream
  Signed-off-by: Xuanwo <github@xuanwo.io>
* chore: Returning error instead of using unreachable
  Signed-off-by: Xuanwo <github@xuanwo.io>

---------

Signed-off-by: Xuanwo <github@xuanwo.io>

feat(parquet): Add next_row_group API for ParquetRecordBatchStream

e91fa31

Signed-off-by: Xuanwo <github@xuanwo.io>

github-actions bot added the parquet Changes to the parquet crate label Dec 20, 2024

Xuanwo mentioned this pull request Dec 20, 2024

Low-Level Arrow Parquet Reader #5522

Open

Xuanwo commented Dec 20, 2024

This was referenced Dec 20, 2024

refactor: Remove spawn and channel inside arrow reader apache/iceberg-rust#806

Merged

Add a 'prefetch' option to ParquetRecordBatchStream to load the next row group while decoding #6676

Closed

tustvold reviewed Dec 20, 2024

alamb approved these changes Dec 20, 2024

tustvold approved these changes Dec 21, 2024

chore: Returning error instead of using unreachable

5510b48

Signed-off-by: Xuanwo <github@xuanwo.io>

alamb merged commit 10cf03c into apache:main Dec 24, 2024
16 checks passed

Xuanwo deleted the add-next-row-group branch December 24, 2024 15:26
