Improve memory overhead of parquet dictionary encoder #5828
Labels: enhancement (Any new improvement worthy of an entry in the changelog), parquet (Changes to the parquet crate)
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As part of #5770, @XiangpengHao has been creating parquet files with large numbers of columns. He was not able to create a file with 10,000 columns and 1M rows in each row group, where every value is the same floating point number (42.0), due to running out of memory. We believe the parquet encoder requires at minimum 8 bytes per value, regardless of the actual value.
This is substantial when writing large numbers of columns. For example, writing 10,000 columns to 1M-row row groups requires 80GB of memory (8 * 10,000 * 1,000,000 = 80,000,000,000 bytes). His initial analysis showed that significant memory consumption comes from the dictionary encoder's indices: https://github.com/apache/arrow-rs/blob/master/parquet/src/encodings/encoding/dict_encoder.rs#L80 (permalink), where each value consumes 8 bytes of memory. This is evidenced by the observation that changing from f64 to f32 does not reduce memory usage.
To put it another way, the dictionary indices of a row group are kept in memory, which takes row_count * num_column * 8 bytes, regardless of what the actual values are.
Describe the solution you'd like
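To illustrate the buffering behaviour, here is a simplified sketch (for explanation only, not the actual `DictEncoder` code in `dict_encoder.rs`): the encoder keeps one 64-bit index per encoded value until the row group is flushed, so the index buffer grows with row count even when the dictionary itself contains a single entry.

```rust
/// Simplified model of the dictionary encoder's buffering (illustration only;
/// this is not the actual `DictEncoder` implementation).
#[derive(Default)]
struct SimplifiedDictEncoder {
    /// Unique values seen so far -- tiny when the data is repetitive.
    dictionary: Vec<f64>,
    /// One entry per encoded value. This buffer is what grows as roughly
    /// `row_count * 8` bytes per column until the row group is flushed.
    indices: Vec<u64>,
}

impl SimplifiedDictEncoder {
    fn put(&mut self, value: f64) {
        // The real encoder uses a hash table for the lookup; a linear scan is
        // enough to show the memory behaviour.
        let idx = match self.dictionary.iter().position(|v| *v == value) {
            Some(i) => i as u64,
            None => {
                self.dictionary.push(value);
                (self.dictionary.len() - 1) as u64
            }
        };
        // Even when every value is 42.0 and the dictionary holds one entry,
        // this vector still gains 8 bytes per row.
        self.indices.push(idx);
    }
}

fn main() {
    let mut enc = SimplifiedDictEncoder::default();
    for _ in 0..1_000_000 {
        enc.put(42.0);
    }
    // 1 dictionary entry, but ~8 MB of buffered indices for a single column.
    println!(
        "dictionary entries: {}, index bytes: {}",
        enc.dictionary.len(),
        enc.indices.len() * std::mem::size_of::<u64>()
    );
}
```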
Some way to write large / wide parquet files with less memory required
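One possible direction (a sketch only, not a committed design for the crate): buffer the dictionary indices run-length encoded rather than as one 64-bit entry per row, so repeated values cost a few bytes per run instead of `row_count * 8` bytes per column. The `RleIndexBuffer` type below is hypothetical.

```rust
/// Hypothetical sketch: keep buffered dictionary indices as (index, run length)
/// pairs instead of one 64-bit entry per row.
#[derive(Default)]
struct RleIndexBuffer {
    /// (dictionary index, run length) pairs.
    runs: Vec<(u32, u32)>,
}

impl RleIndexBuffer {
    fn push(&mut self, index: u32) {
        match self.runs.last_mut() {
            // Extend the current run when the same dictionary entry repeats.
            Some((last, len)) if *last == index && *len < u32::MAX => *len += 1,
            // Otherwise start a new run.
            _ => self.runs.push((index, 1)),
        }
    }
}

fn main() {
    let mut buf = RleIndexBuffer::default();
    for _ in 0..1_000_000 {
        buf.push(0); // every row maps to the same dictionary entry
    }
    // A single (index, length) pair instead of 1M * 8 bytes of indices.
    println!("runs buffered: {}", buf.runs.len());
}
```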
Describe alternatives you've considered
@XiangpengHao reported that he tried disabling dictionary encoding and using RLE encoding directly, but the RLE encoder only supports boolean values: https://github.com/apache/arrow-rs/blob/master/parquet/src/encodings/encoding/mod.rs#L196
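For reference, that kind of configuration can be expressed through the public `WriterProperties` API roughly as below (a sketch only, assuming the `arrow_array`/`arrow_schema` crates and the parquet crate's `arrow` feature; the column name `value` and file name are made up, and the write is expected to fail because the RLE encoder is only implemented for booleans).

```rust
use std::fs::File;
use std::sync::Arc;

use arrow_array::{Float64Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;
use parquet::basic::Encoding;
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "value",
        DataType::Float64,
        false,
    )]));

    // Disable dictionary encoding and request RLE for the (hypothetical)
    // `value` column instead.
    let props = WriterProperties::builder()
        .set_dictionary_enabled(false)
        .set_column_encoding(ColumnPath::from("value"), Encoding::RLE)
        .build();

    let file = File::create("wide.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    let batch = RecordBatch::try_new(
        schema,
        vec![Arc::new(Float64Array::from(vec![42.0_f64; 1024]))],
    )?;

    // Expected to error: the RLE encoder currently only supports booleans,
    // so there is no non-dictionary fallback that avoids the per-value
    // buffering for this column.
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```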
Additional context
Given that we are simply encoding the same value over and over again, maybe this is not an important case to optimize
However, the same thing might apply to very sparse columns (e.g. if you are encoding 1M values and all but two of them are NULL 🤔)