Parquet: Explore ways to accelerate table writing #946

Open
rcaudy opened this issue Aug 1, 2021 · 3 comments

rcaudy commented Aug 1, 2021

Currently, there is nothing parallel about our Parquet table writing, aside from writing multiple tables/files concurrently (sketched below). We should investigate options here if this becomes a performance bottleneck for users.
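
For reference, a minimal sketch of that file-level workaround: submit one `writeTable` call per output file to a thread pool. This assumes `ParquetTools.writeTable(Table, String)` from `io.deephaven.parquet.table` and the `io.deephaven.engine.table.Table` interface; the exact packages and signatures depend on the Deephaven version in use, so adjust accordingly.

```java
import io.deephaven.engine.table.Table;
import io.deephaven.parquet.table.ParquetTools;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: writes each table to its own Parquet file on a thread pool,
// since a single writeTable call is serial today.
final class ParallelParquetWriteSketch {
    static void writeAll(final List<Table> tables, final String destDir) throws Exception {
        final ExecutorService pool = Executors.newFixedThreadPool(
                Math.min(tables.size(), Runtime.getRuntime().availableProcessors()));
        try {
            final List<Future<?>> futures = new ArrayList<>();
            for (int ti = 0; ti < tables.size(); ti++) {
                final Table table = tables.get(ti);
                final String dest = destDir + "/part-" + ti + ".parquet";
                // Assumed API: ParquetTools.writeTable(Table, String); adjust for the version in use.
                futures.add(pool.submit(() -> ParquetTools.writeTable(table, dest)));
            }
            for (final Future<?> future : futures) {
                future.get(); // propagate any write failure
            }
        } finally {
            pool.shutdown();
        }
    }
}
```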

rcaudy added the feature request and parquet labels on Aug 1, 2021
rcaudy added this to the Backlog milestone on Aug 1, 2021
rcaudy self-assigned this on Aug 1, 2021
rcaudy changed the title from "Explore ways to accelerate Parquet table writing" to "Parquet: Explore ways to accelerate table writing" on Aug 3, 2021

malhotrashivam commented Aug 28, 2023

Potential optimization idea (found during #4334):
Currently, when using dictionary encoding, if we hit the limit on dictionary size, we discard all the work done so far and fall back to plain encoding for all the pages. A better approach would be to first write a dictionary page containing the values collected so far, and then use plain encoding only for the following pages (sketched below).
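
A hedged sketch of that control flow, using made-up types rather than the actual Deephaven or parquet-mr writer classes. Note that a real column chunk must place the dictionary page before its data pages, so a real implementation would buffer pages until the chunk is flushed; this only illustrates the encoding decision.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep the dictionary-encoded pages already produced when the
// dictionary overflows, emit the dictionary collected so far, and switch to plain
// encoding only for the remaining pages. Types and names are illustrative only.
final class DictionaryFallbackSketch {
    static final int MAX_DICTIONARY_SIZE = 1 << 20;
    static final int PAGE_SIZE = 4096;

    interface PageSink {
        void addDictionaryPage(List<String> dictionaryValues);
        void addDictionaryEncodedPage(int[] keys, int numValues);
        void addPlainPage(String[] values);
    }

    static void writeColumn(final String[] values, final PageSink sink) {
        final Map<String, Integer> dictionary = new LinkedHashMap<>();
        final int[] keys = new int[PAGE_SIZE];
        int pos = 0;        // values buffered for the current page
        int pageStart = 0;  // index of the first value in the current plain page
        boolean usingDictionary = true;

        for (int vi = 0; vi < values.length; vi++) {
            if (usingDictionary) {
                Integer key = dictionary.get(values[vi]);
                if (key == null && dictionary.size() == MAX_DICTIONARY_SIZE) {
                    // Proposed fallback: flush what we have instead of discarding it.
                    if (pos > 0) {
                        sink.addDictionaryEncodedPage(Arrays.copyOf(keys, pos), pos);
                    }
                    sink.addDictionaryPage(new ArrayList<>(dictionary.keySet()));
                    usingDictionary = false;
                    pageStart = vi;
                    pos = 0;
                    vi--;   // re-process this value with plain encoding
                    continue;
                }
                if (key == null) {
                    key = dictionary.size();
                    dictionary.put(values[vi], key);
                }
                keys[pos++] = key;
                if (pos == PAGE_SIZE) {
                    sink.addDictionaryEncodedPage(Arrays.copyOf(keys, pos), pos);
                    pos = 0;
                }
            } else {
                pos++;
                if (pos == PAGE_SIZE || vi == values.length - 1) {
                    sink.addPlainPage(Arrays.copyOfRange(values, pageStart, vi + 1));
                    pageStart = vi + 1;
                    pos = 0;
                }
            }
        }
        if (usingDictionary) {
            // Dictionary never overflowed: flush the trailing page and the dictionary.
            if (pos > 0) {
                sink.addDictionaryEncodedPage(Arrays.copyOf(keys, pos), pos);
            }
            sink.addDictionaryPage(new ArrayList<>(dictionary.keySet()));
        }
    }
}
```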


malhotrashivam commented Oct 9, 2023

Some more optimization opportunities:

  1. In the current writing code, we copy the contents of the table into a buffer (inside the TransferObject class) and then use that buffer for writing to the Parquet file (inside the ColumnWriter class). We could skip this intermediate buffer and write directly to the Parquet file. An example for a long-typed column is in the demo commit of "Pull writing code one layer above" #4587.
  2. Currently, we use RLE only for dictionary-encoded strings and bit packing for booleans; everything else is written with plain encoding. We should look into using RLE encoding by default for all data types, since it can dramatically reduce the number of bytes written, especially for logical types such as byte or char values, which are all stored as Int32 on disk.
  3. As found during "Refactored parquet writing code" #4541, when writing identical content, pyarrow generally writes fewer pages per file than our code. Fewer pages means less per-page metadata, so this can be a performance benefit. One major difference is that pyarrow uses RLE whereas our code uses plain encoding, which could lead to significantly fewer bytes being written by pyarrow compared to Deephaven. The footer-inspection sketch after this list can help compare the two outputs.
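
As a hedged aid for that comparison (plain parquet-mr, not Deephaven code), something like the following dumps the encodings, value counts, compressed sizes, and, when an offset index is present, the data-page counts of each column chunk. `readOffsetIndex` is parquet-mr 1.11+ API and may return null for files written without column/offset indexes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;

import java.util.Set;

// Sketch: print per-column-chunk metadata for a Parquet file, e.g. one written by
// Deephaven and one written by pyarrow, to compare encodings and page counts.
final class InspectEncodings {
    public static void main(final String[] args) throws Exception {
        final String file = args[0];
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(new Path(file), new Configuration()))) {
            for (final BlockMetaData rowGroup : reader.getFooter().getBlocks()) {
                for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
                    final Set<Encoding> encodings = column.getEncodings();
                    // The offset index (when present) gives the number of data pages.
                    final OffsetIndex offsetIndex = reader.readOffsetIndex(column);
                    System.out.printf("%s: encodings=%s, values=%d, compressed=%d bytes, pages=%s%n",
                            column.getPath(), encodings, column.getValueCount(), column.getTotalSize(),
                            offsetIndex == null ? "n/a" : offsetIndex.getPageCount());
                }
            }
        }
    }
}
```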


malhotrashivam commented Oct 25, 2023

The first and third suggestions above didn't show much improvement; more details can be found in this doc.
