Improve memory overhead of parquet dictionary encoder #5828
Labels: enhancement (Any new improvement worthy of an entry in the changelog), parquet (Changes to the parquet crate)
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As part of #5770, @XiangpengHao has been creating parquet files with large numbers of columns. He was not able to create a file with 10,000 columns and 1M rows in each row group, where every value is the same floating point number (42.0), due to running out of memory. We believe the parquet encoder requires at minimum 8 bytes per value, regardless of the actual value.
This is substantial when writing large numbers of columns. For example, writing 10,000 columns to 1M-row row groups requires 80GB of memory (8 * 10,000 * 1,000,000 = 80,000,000,000 bytes). His initial analysis showed that significant memory consumption comes from the dictionary encoder's indices: https://github.com/apache/arrow-rs/blob/master/parquet/src/encodings/encoding/dict_encoder.rs#L80 (permalink), where each value consumes 8 bytes of memory. This is evidenced by the observation that changing from f64 to f32 does not reduce memory usage.
To put it another way, the dictionary indices of a row group are kept in memory, which takes row_count * num_column * 8 bytes, regardless of what the actual values are.
Describe the solution you'd like
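To illustrate the buffering behaviour, here is a simplified sketch (for explanation only, not the actual `DictEncoder` code in `dict_encoder.rs`): the encoder keeps one 64-bit index per encoded value until the row group is flushed, so the index buffer grows with row count even when the dictionary itself contains a single entry.

```rust
/// Simplified model of the dictionary encoder's buffering (illustration only;
/// this is not the actual `DictEncoder` implementation).
#[derive(Default)]
struct SimplifiedDictEncoder {
    /// Unique values seen so far -- tiny when the data is repetitive.
    dictionary: Vec<f64>,
    /// One entry per encoded value. This buffer is what grows as roughly
    /// `row_count * 8` bytes per column until the row group is flushed.
    indices: Vec<u64>,
}

impl SimplifiedDictEncoder {
    fn put(&mut self, value: f64) {
        // The real encoder uses a hash table for the lookup; a linear scan is
        // enough to show the memory behaviour.
        let idx = match self.dictionary.iter().position(|v| *v == value) {
            Some(i) => i as u64,
            None => {
                self.dictionary.push(value);
                (self.dictionary.len() - 1) as u64
            }
        };
        // Even when every value is 42.0 and the dictionary holds one entry,
        // this vector still gains 8 bytes per row.
        self.indices.push(idx);
    }
}

fn main() {
    let mut enc = SimplifiedDictEncoder::default();
    for _ in 0..1_000_000 {
        enc.put(42.0);
    }
    // 1 dictionary entry, but ~8 MB of buffered indices for a single column.
    println!(
        "dictionary entries: {}, index bytes: {}",
        enc.dictionary.len(),
        enc.indices.len() * std::mem::size_of::<u64>()
    );
}
```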
Some way to write large / wide parquet files with less memory required
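One possible direction (a sketch only, not a committed design for the crate): buffer the dictionary indices run-length encoded rather than as one 64-bit entry per row, so repeated values cost a few bytes per run instead of `row_count * 8` bytes per column. The `RleIndexBuffer` type below is hypothetical.

```rust
/// Hypothetical sketch: keep buffered dictionary indices as (index, run length)
/// pairs instead of one 64-bit entry per row.
#[derive(Default)]
struct RleIndexBuffer {
    /// (dictionary index, run length) pairs.
    runs: Vec<(u32, u32)>,
}

impl RleIndexBuffer {
    fn push(&mut self, index: u32) {
        match self.runs.last_mut() {
            // Extend the current run when the same dictionary entry repeats.
            Some((last, len)) if *last == index && *len < u32::MAX => *len += 1,
            // Otherwise start a new run.
            _ => self.runs.push((index, 1)),
        }
    }
}

fn main() {
    let mut buf = RleIndexBuffer::default();
    for _ in 0..1_000_000 {
        buf.push(0); // every row maps to the same dictionary entry
    }
    // A single (index, length) pair instead of 1M * 8 bytes of indices.
    println!("runs buffered: {}", buf.runs.len());
}
```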
Describe alternatives you've considered
@XiangpengHao reported that he tried disabling dictionary encoding and using RLE encoding directly, but the RLE encoder only supports boolean values: https://github.com/apache/arrow-rs/blob/master/parquet/src/encodings/encoding/mod.rs#L196
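For reference, that kind of configuration can be expressed through the public `WriterProperties` API roughly as below (a sketch only, assuming the `arrow_array`/`arrow_schema` crates and the parquet crate's `arrow` feature; the column name `value` and file name are made up, and the write is expected to fail because the RLE encoder is only implemented for booleans).

```rust
use std::fs::File;
use std::sync::Arc;

use arrow_array::{Float64Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;
use parquet::basic::Encoding;
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "value",
        DataType::Float64,
        false,
    )]));

    // Disable dictionary encoding and request RLE for the (hypothetical)
    // `value` column instead.
    let props = WriterProperties::builder()
        .set_dictionary_enabled(false)
        .set_column_encoding(ColumnPath::from("value"), Encoding::RLE)
        .build();

    let file = File::create("wide.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    let batch = RecordBatch::try_new(
        schema,
        vec![Arc::new(Float64Array::from(vec![42.0_f64; 1024]))],
    )?;

    // Expected to error: the RLE encoder currently only supports booleans,
    // so there is no non-dictionary fallback that avoids the per-value
    // buffering for this column.
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```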
Additional context
Given that we are simply encoding the same value over and over again, maybe this is not an important case to optimize
However, the same thing might apply to very sparse columns (e.g. if you are encoding 1M values and all but two of them are NULL 🤔)