improve: reuse Arc<dyn Array> in parquet record batch reader. #4864

Open
RinChanNOWWW opened this issue Sep 27, 2023 · 5 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@RinChanNOWWW
Contributor

In both arrow_reader and async_reader, if there are predicates, the reader first decodes arrays (wrapped in a RecordBatch) to evaluate the predicates and obtain a row selection. The reader then uses that row selection to decode and output the final arrays.

If some columns in the final output were already decoded during predicate evaluation, the reader has to decode them again. This is quite wasteful.

We should have a reasonable way to reuse the arrays.
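
For illustration, here is a minimal sketch of the two-pass flow described above, using the reader's public `RowFilter` API (the file name and column indices are made up). Column 0 is used both by the predicate and in the output projection, so today it is decoded twice:

```rust
use std::fs::File;

use arrow_array::{cast::AsArray, types::Int64Type, BooleanArray};
use parquet::arrow::arrow_reader::{
    ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?; // hypothetical input file
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Pass 1: column 0 is decoded to evaluate the predicate and derive a
    // row selection.
    let predicate_mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(predicate_mask, |batch| {
        let col = batch.column(0).as_primitive::<Int64Type>();
        Ok(col.iter().map(|v| v.map(|v| v > 100)).collect::<BooleanArray>())
    });

    // Pass 2: the output projection also contains column 0, so its pages are
    // decompressed and decoded a second time for the selected rows.
    let output_mask = ProjectionMask::leaves(builder.parquet_schema(), [0, 1]);

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .with_projection(output_mask)
        .build()?;

    for batch in reader {
        println!("{} rows", batch?.num_rows());
    }
    Ok(())
}
```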

@RinChanNOWWW RinChanNOWWW added the enhancement label Sep 27, 2023
@RinChanNOWWW
Contributor Author

RinChanNOWWW commented Sep 27, 2023

In my opinion, we could store the decoded arrays once they are read in the parquet record batch reader, and combine them into a RecordBatch when needed.
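
For concreteness, a hypothetical sketch of that idea (none of these types exist in arrow-rs today): cache the decoded `Arc<dyn Array>` columns keyed by parquet leaf index, then assemble the output `RecordBatch` from cached columns where possible. A real implementation would also need to apply the final row selection to the cached arrays (e.g. with the `filter` kernel) before assembly:

```rust
use std::collections::HashMap;

use arrow_array::{ArrayRef, RecordBatch};
use arrow_schema::{ArrowError, SchemaRef};

/// Hypothetical cache of columns decoded during predicate evaluation,
/// keyed by parquet leaf index.
#[derive(Default)]
struct DecodedColumnCache {
    columns: HashMap<usize, ArrayRef>,
}

impl DecodedColumnCache {
    /// Remember a column that was decoded while evaluating a predicate.
    fn insert(&mut self, leaf: usize, array: ArrayRef) {
        self.columns.insert(leaf, array);
    }

    /// Build the output batch, reusing cached arrays and calling `decode`
    /// only for columns that no predicate has already materialized.
    fn assemble(
        &self,
        schema: SchemaRef,
        leaves: &[usize],
        mut decode: impl FnMut(usize) -> ArrayRef,
    ) -> Result<RecordBatch, ArrowError> {
        let columns = leaves
            .iter()
            .map(|&leaf| self.columns.get(&leaf).cloned().unwrap_or_else(|| decode(leaf)))
            .collect();
        RecordBatch::try_new(schema, columns)
    }
}
```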

@tustvold
Contributor

The challenge is bounding the memory usage of the reader and balancing this against the cost of decoding, which for primitives is relatively low. I don't think it is as simple as "keep the arrays around"; it will probably require some sort of heuristic.

@RinChanNOWWW
Contributor Author

RinChanNOWWW commented Sep 27, 2023

> The challenge is bounding the memory usage of the reader and balancing this against the cost of decoding
> I don't think it is as simple as "keep the arrays around"

We could keep only the arrays we need to output and release the others.

> which for primitives is relatively low

Besides decoding, there is also decompression, and that can become a large cost for variable-length types like strings.

> it will probably require some sort of heuristic

Yes, for example: all of the output arrays are contained in the predicate arrays.
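
That particular heuristic is easy to check with the existing `ProjectionMask` API; a sketch (the helper itself is hypothetical, and `num_leaves` would come from the file's `SchemaDescriptor`):

```rust
use parquet::arrow::ProjectionMask;

/// Returns true when every leaf column in the output projection is also read
/// by the predicate projection, i.e. reuse alone could satisfy the output.
fn output_covered_by_predicates(
    output: &ProjectionMask,
    predicate: &ProjectionMask,
    num_leaves: usize,
) -> bool {
    (0..num_leaves).all(|leaf| !output.leaf_included(leaf) || predicate.leaf_included(leaf))
}
```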

@tustvold
Contributor

tustvold commented Sep 27, 2023

> We could keep only the arrays we need to output and release the others.

That is still potentially a non-trivial amount of data; it could theoretically be an entire column chunk's worth, which could easily blow your memory budget 😅

Perhaps we could add a configurable threshold for the maximum number of bytes to keep around in this way, and fall back to decoding columns again if this threshold is exceeded?
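
As a sketch of what that could look like (the option and the cache type are hypothetical; `get_array_memory_size` is arrow's existing estimate of an array's buffer usage):

```rust
use arrow_array::{Array, ArrayRef};

/// Hypothetical cache bounded by a byte budget, e.g. one configured through a
/// builder option such as `with_array_cache_limit`.
struct BoundedArrayCache {
    budget_bytes: usize,
    used_bytes: usize,
    columns: Vec<(usize, ArrayRef)>, // (leaf index, decoded array)
}

impl BoundedArrayCache {
    /// Cache the array if it fits in the remaining budget. Returns false if
    /// it does not, in which case the caller falls back to decoding that
    /// column again when producing the output.
    fn try_insert(&mut self, leaf: usize, array: ArrayRef) -> bool {
        let size = array.get_array_memory_size();
        if self.used_bytes + size > self.budget_bytes {
            return false;
        }
        self.used_bytes += size;
        self.columns.push((leaf, array));
        true
    }
}
```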

@alamb
Contributor

alamb commented Dec 20, 2024

🎣 -- I believe @XiangpengHao was thinking about working on this (I know you have a paper to write too, etc lol)
