Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Format] Encoding spec incorrect for dictionary fallback #404

Open
asfimport opened this issue Jan 3, 2023 · 3 comments
Open

[Format] Encoding spec incorrect for dictionary fallback #404

asfimport opened this issue Jan 3, 2023 · 3 comments

Comments

@asfimport
Copy link
Collaborator

The spec for DICTIONARY_ENCODING states that:

If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding.

https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8

However, the parquet-mr implementation was deliberately changed to a different fallback mechanism in https://issues.apache.org/jira/browse/PARQUET-52

I'm assuming the parquet-mr implementation is authoritative here. But then the spec is incorrect and should be fixed to reflect expected behavior.

Reporter: Antoine Pitrou / @pitrou

Note: This issue was originally created as PARQUET-2221. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

@asfimport
Copy link
Collaborator Author

Gang Wu / @wgtmac:
IMHO, the specs is authoritative to the reader implementation to correctly read Parquet files created by different writers. But it is writer implementer's choice to fallback to any standard encoding. This is what the video coding standard does (e.g. H.264/AVC and H.265/HEVC).

What's more, the writer implementation can even rewrite the dictionary page and dictionary-encoded data pages to fallback encoding if fallback happens and discard the dictionary-encoded pages, just like what Apache ORC does. Mixing dictionary encoding and non-dictionary encoding in the same column chunk makes the implementation of features like reading dictionary and predicate pushdown much complicated.

cc [~shangx@uber.com]

@asfimport
Copy link
Collaborator Author

Micah Kornfield / @emkornfield:
I agree with @wgtmac here.  I think we should probably have language like we've done in previous cases like "for maximum compatibility" but then say any mix of page encodings is valid as long as the ordering is valid.

 

In terms of mixing dictionary encodings with others, it does make things a little bit harder but I don't think we should make it not-allowed (but point out the potential benefits of unified encoding).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants