Parquet: Explore ways to accelerate table writing #946

Open
rcaudy opened this issue Aug 1, 2021 · 3 comments

rcaudy commented Aug 1, 2021

Currently, there is nothing parallel about our Parquet table writing, aside from writing multiple tables/files concurrently (sketched below). We should investigate options here if this becomes a performance bottleneck for users.
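
For reference, a minimal sketch of that file-level workaround: submit one `writeTable` call per output file to a thread pool. This assumes `ParquetTools.writeTable(Table, String)` from `io.deephaven.parquet.table` and the `io.deephaven.engine.table.Table` interface; the exact packages and signatures depend on the Deephaven version in use, so adjust accordingly.

```java
import io.deephaven.engine.table.Table;
import io.deephaven.parquet.table.ParquetTools;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: writes each table to its own Parquet file on a thread pool,
// since a single writeTable call is serial today.
final class ParallelParquetWriteSketch {
    static void writeAll(final List<Table> tables, final String destDir) throws Exception {
        final ExecutorService pool = Executors.newFixedThreadPool(
                Math.min(tables.size(), Runtime.getRuntime().availableProcessors()));
        try {
            final List<Future<?>> futures = new ArrayList<>();
            for (int ti = 0; ti < tables.size(); ti++) {
                final Table table = tables.get(ti);
                final String dest = destDir + "/part-" + ti + ".parquet";
                // Assumed API: ParquetTools.writeTable(Table, String); adjust for the version in use.
                futures.add(pool.submit(() -> ParquetTools.writeTable(table, dest)));
            }
            for (final Future<?> future : futures) {
                future.get(); // propagate any write failure
            }
        } finally {
            pool.shutdown();
        }
    }
}
```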

rcaudy added the feature request and parquet labels on Aug 1, 2021
rcaudy added this to the Backlog milestone on Aug 1, 2021
rcaudy self-assigned this on Aug 1, 2021
rcaudy changed the title from "Explore ways to accelerate Parquet table writing" to "Parquet: Explore ways to accelerate table writing" on Aug 3, 2021

malhotrashivam commented Aug 28, 2023

Potential optimization idea (found during #4334):
Currently, when using dictionary encoding, if we hit the limit on dictionary size, we discard all the work done so far and fall back to plain encoding for all the pages. A better approach would be to first write a dictionary page containing the values collected so far, and then use plain encoding only for the following pages (sketched below).
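
A hedged sketch of that control flow, using made-up types rather than the actual Deephaven or parquet-mr writer classes. Note that a real column chunk must place the dictionary page before its data pages, so a real implementation would buffer pages until the chunk is flushed; this only illustrates the encoding decision.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep the dictionary-encoded pages already produced when the
// dictionary overflows, emit the dictionary collected so far, and switch to plain
// encoding only for the remaining pages. Types and names are illustrative only.
final class DictionaryFallbackSketch {
    static final int MAX_DICTIONARY_SIZE = 1 << 20;
    static final int PAGE_SIZE = 4096;

    interface PageSink {
        void addDictionaryPage(List<String> dictionaryValues);
        void addDictionaryEncodedPage(int[] keys, int numValues);
        void addPlainPage(String[] values);
    }

    static void writeColumn(final String[] values, final PageSink sink) {
        final Map<String, Integer> dictionary = new LinkedHashMap<>();
        final int[] keys = new int[PAGE_SIZE];
        int pos = 0;        // values buffered for the current page
        int pageStart = 0;  // index of the first value in the current plain page
        boolean usingDictionary = true;

        for (int vi = 0; vi < values.length; vi++) {
            if (usingDictionary) {
                Integer key = dictionary.get(values[vi]);
                if (key == null && dictionary.size() == MAX_DICTIONARY_SIZE) {
                    // Proposed fallback: flush what we have instead of discarding it.
                    if (pos > 0) {
                        sink.addDictionaryEncodedPage(Arrays.copyOf(keys, pos), pos);
                    }
                    sink.addDictionaryPage(new ArrayList<>(dictionary.keySet()));
                    usingDictionary = false;
                    pageStart = vi;
                    pos = 0;
                    vi--;   // re-process this value with plain encoding
                    continue;
                }
                if (key == null) {
                    key = dictionary.size();
                    dictionary.put(values[vi], key);
                }
                keys[pos++] = key;
                if (pos == PAGE_SIZE) {
                    sink.addDictionaryEncodedPage(Arrays.copyOf(keys, pos), pos);
                    pos = 0;
                }
            } else {
                pos++;
                if (pos == PAGE_SIZE || vi == values.length - 1) {
                    sink.addPlainPage(Arrays.copyOfRange(values, pageStart, vi + 1));
                    pageStart = vi + 1;
                    pos = 0;
                }
            }
        }
        if (usingDictionary) {
            // Dictionary never overflowed: flush the trailing page and the dictionary.
            if (pos > 0) {
                sink.addDictionaryEncodedPage(Arrays.copyOf(keys, pos), pos);
            }
            sink.addDictionaryPage(new ArrayList<>(dictionary.keySet()));
        }
    }
}
```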


malhotrashivam commented Oct 9, 2023

Some more optimization opportunities:

  1. In the current writing code, we copy the contents of the table into a buffer (inside the TransferObject class) and then use that buffer for writing to the Parquet file (inside the ColumnWriter class). We could skip this intermediate buffer and write directly to the Parquet file. An example for a long-typed column is in the demo commit of "Pull writing code one layer above" #4587.
  2. Currently, we use RLE only for dictionary-encoded strings and bit packing for booleans; everything else is written with plain encoding. We should look into using RLE encoding by default for all data types, since it can dramatically reduce the number of bytes written, especially for logical types such as byte or char values, which are all stored as Int32 on disk.
  3. As found during "Refactored parquet writing code" #4541, when writing identical content, pyarrow generally writes fewer pages per file than our code. Fewer pages means less per-page metadata, so this can be a performance benefit. One major difference is that pyarrow uses RLE whereas our code uses plain encoding, which could lead to significantly fewer bytes being written by pyarrow compared to Deephaven. The footer-inspection sketch after this list can help compare the two outputs.
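
As a hedged aid for that comparison (plain parquet-mr, not Deephaven code), something like the following dumps the encodings, value counts, compressed sizes, and, when an offset index is present, the data-page counts of each column chunk. `readOffsetIndex` is parquet-mr 1.11+ API and may return null for files written without column/offset indexes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;

import java.util.Set;

// Sketch: print per-column-chunk metadata for a Parquet file, e.g. one written by
// Deephaven and one written by pyarrow, to compare encodings and page counts.
final class InspectEncodings {
    public static void main(final String[] args) throws Exception {
        final String file = args[0];
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(new Path(file), new Configuration()))) {
            for (final BlockMetaData rowGroup : reader.getFooter().getBlocks()) {
                for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
                    final Set<Encoding> encodings = column.getEncodings();
                    // The offset index (when present) gives the number of data pages.
                    final OffsetIndex offsetIndex = reader.readOffsetIndex(column);
                    System.out.printf("%s: encodings=%s, values=%d, compressed=%d bytes, pages=%s%n",
                            column.getPath(), encodings, column.getValueCount(), column.getTotalSize(),
                            offsetIndex == null ? "n/a" : offsetIndex.getPageCount());
                }
            }
        }
    }
}
```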


malhotrashivam commented Oct 25, 2023

The first and third suggestions above didn't show much improvement; more details can be found in this doc.
