
Improve memory overhead of parquet dictionary encoder #5828

Open
alamb opened this issue May 31, 2024 · 3 comments
Labels
enhancement (Any new improvement worthy of an entry in the changelog), parquet (Changes to the parquet crate)

Comments


alamb commented May 31, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As part of #5770, @XiangpengHao has been creating parquet files with large numbers of columns. He was not able to create a file with 10,000 columns and 1M rows per row group, where every value is the same floating point value (42.0), because the writer ran out of memory.

We believe the parquet encoder requires at least 8 bytes of memory per value, regardless of the actual value.

This is substantial when writing large numbers of columns. For example, writing 10,000 columns with 1M-row row groups requires 80GB of memory (8 bytes * 10,000 columns * 1,000,000 rows = 80,000,000,000 bytes).

His initial analysis showed that a significant share of the memory consumption comes from the dictionary encoder's indices: https://github.com/apache/arrow-rs/blob/master/parquet/src/encodings/encoding/dict_encoder.rs#L80 (permalink), where each value consumes 8 bytes of memory. This is evidenced by the observation that changing from f64 to f32 does not reduce memory usage.

To put it another way, the dictionary indices of a row group are kept in memory, which takes `row_count * num_columns * 8` bytes, regardless of what the actual values are.
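
For reference, here is a minimal sketch of the kind of workload that hits this limit (the file name, column count, and batch sizes are illustrative, not taken from the original report). Even though each input batch is small, the writer buffers the dictionary indices for the entire in-progress row group:

```rust
use std::{fs::File, sync::Arc};

use arrow_array::{ArrayRef, Float64Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    const NUM_COLUMNS: usize = 10_000;
    const ROWS_PER_BATCH: usize = 10_000;
    const NUM_BATCHES: usize = 100; // 1M rows per row group in total

    let schema = Arc::new(Schema::new(
        (0..NUM_COLUMNS)
            .map(|i| Field::new(format!("c{i}"), DataType::Float64, false))
            .collect::<Vec<_>>(),
    ));

    // Ask for 1M-row row groups, so all dictionary indices for a row group
    // are buffered before anything is flushed to disk.
    let props = WriterProperties::builder()
        .set_max_row_group_size(1_000_000)
        .build();

    let file = File::create("wide.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    for _ in 0..NUM_BATCHES {
        let columns: Vec<ArrayRef> = (0..NUM_COLUMNS)
            .map(|_| Arc::new(Float64Array::from(vec![42.0; ROWS_PER_BATCH])) as ArrayRef)
            .collect();
        let batch = RecordBatch::try_new(schema.clone(), columns)?;
        // Each value written here adds ~8 bytes of buffered dictionary
        // indices, even though the dictionary itself holds a single entry.
        writer.write(&batch)?;
    }
    writer.close()?;
    Ok(())
}
```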

Describe the solution you'd like

Some way to write large / wide parquet files with less memory required

Describe alternatives you've considered

@XiangpengHao reported that he tried disabling dictionary encoding and directly using RLE encoding, but RLE only supports boolean values: https://github.com/apache/arrow-rs/blob/master/parquet/src/encodings/encoding/mod.rs#L196
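
For context, a rough sketch of the writer properties such an attempt would use (an illustration, not a working configuration): building the properties succeeds, but writing non-boolean columns then fails because the RLE encoder only supports booleans.

```rust
use parquet::basic::Encoding;
use parquet::file::properties::WriterProperties;

// Attempted workaround: disable dictionary encoding and request RLE for all
// columns. Writing f64 data with these properties returns an error, because
// the RLE encoder only supports boolean values.
fn rle_only_properties() -> WriterProperties {
    WriterProperties::builder()
        .set_dictionary_enabled(false)
        .set_encoding(Encoding::RLE)
        .build()
}
```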

Additional context
Given we are simply encoding the same value over and over again, maybe this is not an important case to optimize.

However, the same thing might apply to very sparse columns (e.g. if you are encoding 1M values and all but two of them are NULL 🤔 )

@alamb added the parquet (Changes to the parquet crate) and enhancement (Any new improvement worthy of an entry in the changelog) labels on May 31, 2024

tustvold commented Jun 1, 2024

One could definitely envisage the dictionary encoder buffering the encoded indices.

However, one could question whether encoding 1M rows into a single page is advisable in the general case.
#5797 tracks setting a default row count limit, which I suspect would significantly reduce the impact of this issue
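
For illustration, a sketch of how the existing writer properties can bound how much is buffered per row group / page (the specific limits below are arbitrary examples, not recommendations):

```rust
use parquet::file::properties::WriterProperties;

// Smaller row groups / pages bound how many dictionary indices the encoder
// must buffer at any one time, at the cost of producing more row groups and
// pages in the file.
fn bounded_buffer_properties() -> WriterProperties {
    WriterProperties::builder()
        .set_max_row_group_size(64 * 1024)       // rows per row group
        .set_data_page_row_count_limit(8 * 1024) // best-effort rows per page
        .build()
}
```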

@tustvold changed the title from "Improve memory overhead of parquet encoder" to "Improve memory overhead of parquet dictionary encoder" on Jun 1, 2024

alamb commented Jun 1, 2024


alamb commented Jun 4, 2024

One thing that might also be worth considering is adding a memory_used API on the parquet writer that gives an estimate of the current memory used (including the dictionary encoder, etc.)

https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html
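
A sketch of how a caller might use such an API. Note that memory_used is the proposed (not yet existing) method, represented here by a placeholder; ArrowWriter::flush does exist and closes the in-progress row group:

```rust
use std::fs::File;

use arrow_array::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;

/// Placeholder for the proposed `memory_used` API; today a caller has no
/// direct way to get this estimate from the writer.
fn estimated_memory_used(_writer: &ArrowWriter<File>) -> usize {
    unimplemented!("proposed API, not yet implemented")
}

/// Sketch of how a caller could keep the writer under a memory budget.
fn write_with_budget(
    writer: &mut ArrowWriter<File>,
    batches: impl IntoIterator<Item = RecordBatch>,
    budget_bytes: usize,
) -> Result<()> {
    for batch in batches {
        writer.write(&batch)?;
        if estimated_memory_used(writer) > budget_bytes {
            // `flush` finalizes the in-progress row group, releasing the
            // buffered encoder state (including dictionary indices).
            writer.flush()?;
        }
    }
    Ok(())
}
```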
