Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Hudi merged view files for partition path updates without compaction #24283

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

codope
Copy link
Contributor

@codope codope commented Dec 19, 2024

Description

Support Hudi merged view files for partition path updates without compaction. This is needed with Merge-on-Read Hudi tables when partition has been updated for a record and the table has not been compacted yet.

  • Add a client and session config to specify tables (comma-separated schemaName.tableName) where this support is needed.
  • Based on config, if the current tables needs to be listed using merged view, then do so in HudiDirectoryLister.
  • Add a test for Merge-on-Read table with a record being updated from partition p1 to p2 and no compaction.
  • Add some test fixtures to support above test scenario.

Motivation and Context

Support Hudi merged view files for partition path updates without compaction. This is needed with Merge-on-Read Hudi tables when partition has been updated for a record and the table has not been compacted yet.

Impact

Support Hudi merged view files for partition path updates without compaction. This is needed with Merge-on-Read Hudi tables when partition has been updated for a record and the table has not been compacted yet. As a result, even the read-optimized view of Merge-on-Read tables under partition path updates without compaction should not return any duplicates.

Test Plan

  • Add a test for Merge-on-Read table with a record being updated from partition p1 to p2 and no compaction.
  • Add some test fixtures to support above test scenario.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

@codope codope requested a review from a team as a code owner December 19, 2024 04:25
@codope codope requested a review from presto-oss December 19, 2024 04:25
@tdcmeehan tdcmeehan self-assigned this Dec 19, 2024
@@ -1647,6 +1648,19 @@ public boolean isHudiMetadataEnabled()
return this.hudiMetadataEnabled;
}

@Config("hive.hudi-tables-use-merged-view")
@ConfigDescription("For Hudi tables prefer to fetch the list of files from the merged file system view")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the description should be more clear about the type of value it's looking for. The type is a string, so for users it should be more explicit what the value should be. When I read this I initially thought it should be a boolean.

Based on looking at the tests it looks like it needs a dot-separated name of schema and table? If so, it should be clear about which tables should be added here

Suggested change
@ConfigDescription("For Hudi tables prefer to fetch the list of files from the merged file system view")
@ConfigDescription("For Hudi tables, A comma-separated list in the form of <schema>.<table> which should prefer to fetch the list of files from the merged file system view")

@@ -608,6 +609,11 @@ public HiveSessionProperties(HiveClientConfig hiveClientConfig, OrcFileWriterCon
"For Hudi tables prefer to fetch the list of file names, sizes and other metadata from the internal metadata table rather than storage",
hiveClientConfig.isHudiMetadataEnabled(),
false),
stringProperty(
HUDI_TABLES_USE_MERGED_VIEW,
"For Hudi tables, use merged view to read data",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update the description here as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants