ORC-1604: Deprecate non-utf8 bloom filter for Java writer #1776

Closed
wants to merge 5 commits into from

@cxzl25 cxzl25 commented Jan 31, 2024

What changes were proposed in this pull request?

This PR aims to deprecate the non-utf8 bloom filter for the Java writer.

  1. Deprecate org.apache.orc.OrcFile.WriterOptions#bloomFilterVersion
  2. Deprecate org.apache.orc.OrcFile.WriterOptions#getBloomFilterVersion
  3. Deprecate org.apache.orc.impl.writer.WriterContext#getBloomFilterVersion
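
For readers unfamiliar with the pattern, the deprecation described above can be sketched roughly as follows. This is a minimal illustration, not the actual ORC source: `BloomFilterDeprecationSketch`, its nested `WriterOptions` class, and the `BloomFilterVersion` enum are hypothetical stand-ins, and the `UTF8` default is an assumption. The point is only that the setter and getter stay in place and keep working, but are marked `@Deprecated` so callers get a compiler warning before any future removal.

```java
public class BloomFilterDeprecationSketch {

  /** Hypothetical stand-in for the bloom filter version setting. */
  enum BloomFilterVersion { ORIGINAL, UTF8 }

  /** Hypothetical stand-in for OrcFile.WriterOptions; not the real ORC class. */
  static class WriterOptions {
    // Assumption for this sketch: the UTF-8 bloom filter is the default.
    private BloomFilterVersion bloomFilterVersion = BloomFilterVersion.UTF8;

    /**
     * @deprecated kept only so existing callers continue to compile.
     */
    @Deprecated
    WriterOptions bloomFilterVersion(BloomFilterVersion version) {
      this.bloomFilterVersion = version;
      return this;
    }

    /**
     * @deprecated kept only so existing callers continue to compile.
     */
    @Deprecated
    BloomFilterVersion getBloomFilterVersion() {
      return bloomFilterVersion;
    }
  }

  public static void main(String[] args) {
    // Existing code that calls the deprecated methods still behaves the same.
    WriterOptions options =
        new WriterOptions().bloomFilterVersion(BloomFilterVersion.ORIGINAL);
    System.out.println(options.getBloomFilterVersion());
  }
}
```

Marking the methods deprecated rather than deleting them leaves existing callers source-compatible while signalling that the option is on its way out.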

Why are the changes needed?

  1. Setting orc.bloom.filter.write.version=original writes two copies of the bloom filter data instead of one. This increases the size of the ORC file and also causes Spark 2.x to fail when reading BloomFilterUtf8 (see comment-17800800 on ORC-297).
  2. The C++ writer does not implement original.
  3. There is a plan to remove the non-utf8 bloom filter from ORCv2.md in orc-format.
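
The "two copies instead of one" point in the first reason can be modelled with a small sketch. Everything here is illustrative and assumed rather than taken from the ORC code base: `BloomFilterStreamsSketch`, `streamsWritten`, and the two enums are hypothetical names, and the stream names only mirror the original versus UTF-8 distinction the PR describes.

```java
import java.util.ArrayList;
import java.util.List;

public class BloomFilterStreamsSketch {

  /** Hypothetical mirror of the orc.bloom.filter.write.version values. */
  enum BloomFilterVersion { ORIGINAL, UTF8 }

  /** Hypothetical names for the two bloom filter streams. */
  enum StreamKind { BLOOM_FILTER, BLOOM_FILTER_UTF8 }

  /**
   * Models which bloom filter streams a writer would emit per column.
   * Assumption: "original" adds the legacy stream alongside the UTF-8 one.
   */
  static List<StreamKind> streamsWritten(BloomFilterVersion version) {
    List<StreamKind> streams = new ArrayList<>();
    if (version == BloomFilterVersion.ORIGINAL) {
      streams.add(StreamKind.BLOOM_FILTER);
    }
    streams.add(StreamKind.BLOOM_FILTER_UTF8);
    return streams;
  }

  public static void main(String[] args) {
    for (BloomFilterVersion version : BloomFilterVersion.values()) {
      System.out.println(version + " writes " + streamsWritten(version));
    }
  }
}
```

Under this model, choosing the original version doubles the bloom filter streams per column, which is the size overhead the PR description refers to.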

How was this patch tested?

GA

Was this patch authored or co-authored using generative AI tooling?

No

@dongjoon-hyun dongjoon-hyun added this to the 2.0.0 milestone Jan 31, 2024

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @cxzl25 . I agree with your intention.

However, the Apache ORC community needs a deprecation process before removing any public API. I don't think we can remove the API.

Could you convert this PR into a deprecation PR instead of a removal PR?

This reverts commit d2e021a.
@cxzl25 cxzl25 changed the title ORC-1604: Remove non-utf8 bloom filter for writer ORC-1604: Deprecate non-utf8 bloom filter for writer Feb 1, 2024
@cxzl25 cxzl25 changed the title ORC-1604: Deprecate non-utf8 bloom filter for writer ORC-1604: Deprecate non-utf8 bloom filter for Java writer Feb 1, 2024

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

dongjoon-hyun pushed a commit that referenced this pull request Feb 3, 2024
### What changes were proposed in this pull request?
This PR aims to deprecate non-utf8 bloom filter for writer.
1. deprecate `org.apache.orc.OrcFile.WriterOptions#bloomFilterVersion`
2. deprecate `org.apache.orc.OrcFile.WriterOptions#getBloomFilterVersion`
3. deprecate `org.apache.orc.impl.writer.WriterContext#getBloomFilterVersion`

### Why are the changes needed?
1. `orc.bloom.filter.write.version=original` will write two copies of data instead of one, which increases the size of ORC and will also cause Spark 2.x to fail to read `BloomFilterUtf8`
   [comment-17800800](https://issues.apache.org/jira/browse/ORC-297?focusedCommentId=17800800&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17800800)
2. C++ writer does not implement original
3. Plan to remove non-utf8 bloom filter in `orc-format` `ORCv2.md`

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #1776 from cxzl25/ORC-1604.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 6c3c451)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun

Thank you for the updates. Merged to main/2.0 for Apache ORC 2.0.
