Added support for LZ4_RAW compression codec for Parquet #4446

Merged: 18 commits merged into deephaven:main on Sep 18, 2023

@malhotrashivam (Contributor) commented Sep 5, 2023

Closes #3148

Also adds a new adapter for LZ4 that retries with the LZ4_RAW decompressor when decompression with the LZ4 decompressor fails. This retry is particularly useful for Parquet files that are compressed with LZ4_RAW but tagged as LZ4 in the metadata.
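
The retry idea can be sketched in a few lines of Python. This is only an illustration of the fallback pattern, not the adapter added in this PR: it uses the standard-library `zlib` module as a stand-in, with a zlib-wrapped stream playing the role of "LZ4" and a headerless raw deflate stream playing the role of "LZ4_RAW". All function names below are hypothetical.

```python
import zlib
from typing import Callable

Decompressor = Callable[[bytes], bytes]


def framed_decompress(data: bytes) -> bytes:
    # Stand-in for the "LZ4" decompressor (assumption): expects a zlib header.
    return zlib.decompress(data)


def raw_decompress(data: bytes) -> bytes:
    # Stand-in for the "LZ4_RAW" decompressor (assumption): headerless raw deflate.
    return zlib.decompress(data, wbits=-15)


def raw_compress(data: bytes) -> bytes:
    # Produce a headerless raw deflate stream, i.e. data "tagged" with the wrong codec.
    compressor = zlib.compressobj(wbits=-15)
    return compressor.compress(data) + compressor.flush()


def decompress_with_fallback(
    data: bytes, primary: Decompressor, fallback: Decompressor
) -> bytes:
    # Try the codec named in the metadata first; retry with the fallback on failure.
    try:
        return primary(data)
    except zlib.error:
        return fallback(data)


payload = b"deephaven parquet page " * 50
mislabeled = raw_compress(payload)    # raw stream that is tagged as framed
correct = zlib.compress(payload)      # stream that matches its tag

recovered_mislabeled = decompress_with_fallback(mislabeled, framed_decompress, raw_decompress)
recovered_correct = decompress_with_fallback(correct, framed_decompress, raw_decompress)
```

The primary decompressor is always tried first, so correctly tagged files pay no extra cost; only mislabeled files take the second attempt.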

These are some interesting links to read about the LZ4 format and the LZ4_RAW (LZ4 block) format.

Documentation Update:

  • On the writeTable Java page, add the following for the ParquetInstructions argument:
    • ParquetTools.UNCOMPRESSED - The output will not be compressed.
    • ParquetTools.SNAPPY - Aims for high speed and a reasonable amount of compression; based on the Snappy compression format by Google.
    • ParquetTools.LZ4_RAW - A codec based on the LZ4 block format; recommended instead of ParquetTools.LZ4.
    • For ParquetTools.LZ4, mention that it is deprecated and that users should use ParquetTools.LZ4_RAW.
    • Also mention that if ParquetInstructions is not specified, the output defaults to the ParquetTools.SNAPPY compression format.

Make similar changes on the Python page.

Overall, there should now be 7 compression formats: UNCOMPRESSED, LZ4, LZ4_RAW, LZO, GZIP, ZSTD, and SNAPPY, with SNAPPY as the default. The documentation needs to be updated everywhere to reflect this. On Java pages they are referred to as ParquetTools.<format> (for example, ParquetTools.LZ4); in Python they are referred to as "<format>" (for example, "LZ4").

  • Make similar changes to the instructions in the cheat sheet for Java and Python
  • Make similar changes to the "How to write and read single Parquet files" page in Java and Python
  • Make similar changes to the "How to write and read multiple Parquet files" page in Java and Python
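
For the Python pages, the rules listed above (seven codec names, SNAPPY as the default, LZ4 deprecated in favor of LZ4_RAW) could be summarized with a small sketch like the one below. `resolve_codec` is a hypothetical helper written for illustration only; it is not part of the Deephaven API and does not show how the library handles these names internally.

```python
import warnings
from typing import Optional

# The seven codec names listed in this PR, as they are spelled on the Python side.
SUPPORTED_CODECS = ("UNCOMPRESSED", "LZ4", "LZ4_RAW", "LZO", "GZIP", "ZSTD", "SNAPPY")
DEFAULT_CODEC = "SNAPPY"


def resolve_codec(name: Optional[str] = None) -> str:
    """Hypothetical helper: apply the default and flag the deprecated codec."""
    if name is None:
        # No codec requested: fall back to the documented default.
        return DEFAULT_CODEC
    if name not in SUPPORTED_CODECS:
        raise ValueError(
            f"Unsupported codec {name!r}; expected one of {', '.join(SUPPORTED_CODECS)}"
        )
    if name == "LZ4":
        # LZ4 still resolves, but the docs should steer users to LZ4_RAW.
        warnings.warn("LZ4 is deprecated; use LZ4_RAW instead", DeprecationWarning, stacklevel=2)
    return name
```

Calling `resolve_codec()` with no argument yields the default, while `resolve_codec("LZ4")` still returns the name but emits a deprecation warning pointing to LZ4_RAW.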

@malhotrashivam added the feature request, parquet, DocumentationNeeded, and NoReleaseNotesNeeded labels Sep 5, 2023
@malhotrashivam malhotrashivam added this to the September 2023 milestone Sep 5, 2023
@malhotrashivam malhotrashivam self-assigned this Sep 5, 2023
chipkent previously approved these changes Sep 6, 2023

@chipkent (Member) left a comment

Python LGTM. Did not review java.

niloc132 previously approved these changes Sep 6, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Sep 14, 2023
@deephaven deephaven unlocked this conversation Sep 14, 2023
@malhotrashivam malhotrashivam merged commit f59ad13 into deephaven:main Sep 18, 2023
10 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Sep 18, 2023
@deephaven-internal (Contributor)
Labels indicate documentation is required. Issues for documentation have been opened:

How-to: https://github.com/deephaven/deephaven.io/issues/3199
Reference: https://github.com/deephaven/deephaven.io/issues/3200

Labels: DocumentationNeeded, feature request, NoReleaseNotesNeeded, parquet

Linked issue: LZ4_RAW parquet support

5 participants