improve: reuse Arc<dyn Array> in parquet record batch reader. #4864

Open
RinChanNOWWW opened this issue Sep 27, 2023 · 5 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@RinChanNOWWW
Contributor

In both arrow_reader and async_reader, if there are predicates, the reader first decodes arrays (wrapped in a RecordBatch) to evaluate the predicates and obtain a row selection. The reader then uses that row selection to decode and output the final arrays.

If some columns in the final output were already decoded during predicate evaluation, the reader has to decode them again. This is quite wasteful.

We should have a reasonable way to reuse the arrays.
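
For illustration, here is a minimal sketch of the two-pass flow described above, using the reader's public `RowFilter` API (the file name and column indices are made up). Column 0 is used both by the predicate and in the output projection, so today it is decoded twice:

```rust
use std::fs::File;

use arrow_array::{cast::AsArray, types::Int64Type, BooleanArray};
use parquet::arrow::arrow_reader::{
    ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?; // hypothetical input file
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Pass 1: column 0 is decoded to evaluate the predicate and derive a
    // row selection.
    let predicate_mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(predicate_mask, |batch| {
        let col = batch.column(0).as_primitive::<Int64Type>();
        Ok(col.iter().map(|v| v.map(|v| v > 100)).collect::<BooleanArray>())
    });

    // Pass 2: the output projection also contains column 0, so its pages are
    // decompressed and decoded a second time for the selected rows.
    let output_mask = ProjectionMask::leaves(builder.parquet_schema(), [0, 1]);

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .with_projection(output_mask)
        .build()?;

    for batch in reader {
        println!("{} rows", batch?.num_rows());
    }
    Ok(())
}
```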

@RinChanNOWWW RinChanNOWWW added the enhancement label Sep 27, 2023
@RinChanNOWWW
Contributor Author

RinChanNOWWW commented Sep 27, 2023

In my opinion, we could store the decoded arrays once they are read in the parquet record batch reader, and combine them into a RecordBatch when needed.
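
For concreteness, a hypothetical sketch of that idea (none of these types exist in arrow-rs today): cache the decoded `Arc<dyn Array>` columns keyed by parquet leaf index, then assemble the output `RecordBatch` from cached columns where possible. A real implementation would also need to apply the final row selection to the cached arrays (e.g. with the `filter` kernel) before assembly:

```rust
use std::collections::HashMap;

use arrow_array::{ArrayRef, RecordBatch};
use arrow_schema::{ArrowError, SchemaRef};

/// Hypothetical cache of columns decoded during predicate evaluation,
/// keyed by parquet leaf index.
#[derive(Default)]
struct DecodedColumnCache {
    columns: HashMap<usize, ArrayRef>,
}

impl DecodedColumnCache {
    /// Remember a column that was decoded while evaluating a predicate.
    fn insert(&mut self, leaf: usize, array: ArrayRef) {
        self.columns.insert(leaf, array);
    }

    /// Build the output batch, reusing cached arrays and calling `decode`
    /// only for columns that no predicate has already materialized.
    fn assemble(
        &self,
        schema: SchemaRef,
        leaves: &[usize],
        mut decode: impl FnMut(usize) -> ArrayRef,
    ) -> Result<RecordBatch, ArrowError> {
        let columns = leaves
            .iter()
            .map(|&leaf| self.columns.get(&leaf).cloned().unwrap_or_else(|| decode(leaf)))
            .collect();
        RecordBatch::try_new(schema, columns)
    }
}
```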

@tustvold
Contributor

The challenge is bounding the memory usage of the reader and balancing this against the cost of decoding, which for primitives is relatively low. I don't think it is as simple as "keep the arrays around"; it will probably require some sort of heuristic.

@RinChanNOWWW
Contributor Author

RinChanNOWWW commented Sep 27, 2023

> The challenge is bounding the memory usage of the reader and balancing this against the cost of decoding
> I don't think it is as simple as "keep the arrays around"

We could keep only the arrays we need to output and release the others.

> which for primitives is relatively low

Besides decoding, there is also decompression, and that can become a large cost for variable-length types like strings.

> it will probably require some sort of heuristic

Yes, for example: all of the output arrays are contained in the predicate arrays.
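
That particular heuristic is easy to check with the existing `ProjectionMask` API; a sketch (the helper itself is hypothetical, and `num_leaves` would come from the file's `SchemaDescriptor`):

```rust
use parquet::arrow::ProjectionMask;

/// Returns true when every leaf column in the output projection is also read
/// by the predicate projection, i.e. reuse alone could satisfy the output.
fn output_covered_by_predicates(
    output: &ProjectionMask,
    predicate: &ProjectionMask,
    num_leaves: usize,
) -> bool {
    (0..num_leaves).all(|leaf| !output.leaf_included(leaf) || predicate.leaf_included(leaf))
}
```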

@tustvold
Contributor

tustvold commented Sep 27, 2023

> We could keep only the arrays we need to output and release the others.

That is still potentially a non-trivial amount of data; it could theoretically be an entire column chunk's worth, which could easily blow your memory budget 😅

Perhaps we could add a configurable threshold for the maximum number of bytes to keep around in this way, and fall back to decoding columns again if this threshold is exceeded?
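
As a sketch of what that could look like (the option and the cache type are hypothetical; `get_array_memory_size` is arrow's existing estimate of an array's buffer usage):

```rust
use arrow_array::{Array, ArrayRef};

/// Hypothetical cache bounded by a byte budget, e.g. one configured through a
/// builder option such as `with_array_cache_limit`.
struct BoundedArrayCache {
    budget_bytes: usize,
    used_bytes: usize,
    columns: Vec<(usize, ArrayRef)>, // (leaf index, decoded array)
}

impl BoundedArrayCache {
    /// Cache the array if it fits in the remaining budget. Returns false if
    /// it does not, in which case the caller falls back to decoding that
    /// column again when producing the output.
    fn try_insert(&mut self, leaf: usize, array: ArrayRef) -> bool {
        let size = array.get_array_memory_size();
        if self.used_bytes + size > self.budget_bytes {
            return false;
        }
        self.used_bytes += size;
        self.columns.push((leaf, array));
        true
    }
}
```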

@alamb
Contributor

alamb commented Dec 20, 2024

🎣 -- I believe @XiangpengHao was thinking about working on this (I know you have a paper to write too, etc lol)
