fix(iceberg): Date partition value parse issue #12126

Open, wants to merge 1 commit into base: main

@nmahadevuni (Collaborator) commented Jan 20, 2025

Fixes prestodb/presto#24371

Iceberg partition values for date columns are already stored as daysSinceEpoch, but Velox assumed they were in date form and tried to convert them the same way as for Hive. This PR fixes that.
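To make the distinction concrete, here is a minimal, dependency-free sketch of the two parsing paths. It is not the Velox implementation: `TableFormat`, `daysFromCivil`, and `partitionDateToDays` are hypothetical stand-ins (the real change uses `SplitReader::TableFormat` and `DATE()->toDays`), and the Hive branch assumes a plain `YYYY-MM-DD` string. The value `17627` is the one the PR's test uses for `ds`; it corresponds to 2018-04-06.

```cpp
#include <charconv>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <system_error>

// Hypothetical stand-in for SplitReader::TableFormat.
enum class TableFormat { kHive, kIceberg };

// Days since 1970-01-01 for a proleptic Gregorian date
// (Howard Hinnant's days_from_civil algorithm).
inline int32_t daysFromCivil(int year, unsigned month, unsigned day) {
  year -= month <= 2;
  const int era = (year >= 0 ? year : year - 399) / 400;
  const unsigned yoe = static_cast<unsigned>(year - era * 400);
  const unsigned doy = (153 * (month > 2 ? month - 3 : month + 9) + 2) / 5 + day - 1;
  const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
  return era * 146097 + static_cast<int32_t>(doe) - 719468;
}

// Hypothetical helper: turns a date partition value into days since epoch.
inline int32_t partitionDateToDays(TableFormat format, const std::string& value) {
  if (format == TableFormat::kIceberg) {
    // Iceberg: the string already holds the day count, so parse it as an integer.
    int32_t days = 0;
    const char* end = value.data() + value.size();
    const auto parsed = std::from_chars(value.data(), end, days);
    if (parsed.ec != std::errc() || parsed.ptr != end) {
      throw std::invalid_argument("not an integer day count: " + value);
    }
    return days;
  }
  // Hive: the string is a calendar date (assumed YYYY-MM-DD in this sketch).
  int year = 0;
  unsigned month = 0;
  unsigned day = 0;
  if (std::sscanf(value.c_str(), "%d-%u-%u", &year, &month, &day) != 3) {
    throw std::invalid_argument("not a YYYY-MM-DD date: " + value);
  }
  return daysFromCivil(year, month, day);
}
```

Both `partitionDateToDays(TableFormat::kIceberg, "17627")` and `partitionDateToDays(TableFormat::kHive, "2018-04-06")` yield the same day count, which is the equivalence the fix relies on.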

@facebook-github-bot added the CLA Signed label Jan 20, 2025


return applyFilter(*filter, result.value());
int32_t result = 0;
if (tableFormat == SplitReader::TableFormat::kIceberg) {
result = boost::lexical_cast<int32_t>(partitionValue.c_str());
Collaborator

Can we use std::stoi here instead of including a Boost header?
In other comments I see that this function is slow. Could that be a problem?

With std::stoi we also wouldn't need a new include or a new dependency here.

Collaborator Author

std::stoi converts a string like "2022-04-05" to the int value 2022, so it's not safe and may lead to wrong results.
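The prefix-parsing behavior described here can be kept in check without Boost. The sketch below is one possible approach, not what the PR ended up doing: it uses the `pos` out-parameter of `std::stoi` to reject input that was only partially consumed. The helper name `parseDaysStrict` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the parsed value only if the whole string is an integer.
inline std::optional<int32_t> parseDaysStrict(const std::string& text) {
  try {
    std::size_t consumed = 0;
    const int value = std::stoi(text, &consumed);
    // std::stoi stops at the first character it cannot use; treat leftovers as an error.
    if (consumed != text.size()) {
      return std::nullopt;
    }
    return value;
  } catch (const std::invalid_argument&) {
    return std::nullopt; // No digits at all, e.g. "abc" or "".
  } catch (const std::out_of_range&) {
    return std::nullopt; // Value does not fit in int.
  }
}
```

With this check, `parseDaysStrict("17627")` holds 17627, while `parseDaysStrict("2022-04-05")` is empty instead of silently becoming 2022. Note that `std::stoi` still skips leading whitespace and accepts a leading `+`, so this is stricter than a bare call but not fully strict.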

Collaborator

Collaborator Author

std::stoi works for converting an integer string to an int. But it also doesn't throw an error if we pass a string like "2022-04-05"; it just converts it to 2022.

Collaborator

Well, what number does Boost parse 2022-04-05 into? It is not a valid integer in the first place, so neither function will work.

Also, you are forgetting that the string is actually the days since epoch, as you say in your description, which means it is not a date-formatted string. If it were, you would need a date parser here, not a string-to-int parse function.

Collaborator Author

Changed to use std::stoi.

@aditi-pandit (Collaborator) left a comment

@nmahadevuni: Thanks for this fix. I have a bunch of review comments.

VELOX_CHECK(!result.hasError());
return applyFilter(*filter, result.value());
int32_t result = 0;
if (tableFormat == SplitReader::TableFormat::kIceberg) {
Collaborator

Please add a comment here about this behavior.

Collaborator

Also please add to the PR description.

partitionKeys = {},
const std::vector<std::string> filters = {},
const std::string duckDbSql = "",
const int32_t numPrefetchSplits = 0) {
Collaborator

Why does the function have this parameter? It seems we never really test its use.

Collaborator Author

Removed this.

partitionKeys["ds"] = "17627";

std::vector<RowVectorPtr> dataVectors;
VectorPtr c0 = vectorMaker_.flatVector<int64_t>({1});
Collaborator

vectorMaker_ is deprecated. Use makeFlatVector API instead.

@@ -477,6 +514,15 @@ class HiveIcebergTest : public HiveConnectorTestBase {
return PlanBuilder(pool_.get()).tableScan(rowType_).planNode();
}

core::PlanNodePtr tableScanNode(
Collaborator

I don't think there is a need for this function, as it's only used in a single place and doesn't really represent anything.

using T = typename TypeTraits<kind>::NativeType;
if (!value.has_value()) {
return std::make_shared<ConstantVector<T>>(pool, size, true, type, T());
}

if (type->isDate()) {
auto days = DATE()->toDays(static_cast<folly::StringPiece>(value.value()));
int32_t days = 0;
Collaborator

This seems like the wrong level of abstraction to fix this problem especially if this is a generic function.

We should add this logic in SplitReader::setPartitionValue directly.

Collaborator Author

Since these are all Hive/Iceberg util functions, this should be OK? The issue happens in three cases. Only for this one case can we move the logic to SplitReader::setPartitionValue; in the other cases we cannot.

Collaborator

Do you anticipate this conversion for other info columns as well?

The name of this function was too generic, but when I look at its uses it is only for filter values. I feel we can rename this function to filterValueFromString, which makes its application more restrictive.

What are the 3 cases?

Collaborator Author

Rechecked, it's in 2 places. I see this conversion happens from SplitReader::setPartitionValue() and SplitReader::filterOnStats(). I think the current name is appropriate, as it's not only used for filters.

const std::unordered_map<std::string, std::optional<std::string>>
partitionKeys = {},
const std::vector<std::string> filters = {},
const std::string duckDbSql = "",
Collaborator

It is not reasonable to have an empty duckDbSql for this function, as that is the main SQL string to verify the plan results with. Since we know the plan we are generating, it might be better to build the duckDbSql in this function itself.

Collaborator Author

For this test case, we don't know how to generate the duckDbSql. When we add more test cases, we can enhance this function.

Collaborator Author

I added an empty string check.

Collaborator

This code sets up a very specific plan with a TableScanNode, partition filters, and column filters. This maps to quite precise SQL. We could just build it in the logic.

But I'm also fine with the sql passed as a parameter.

Collaborator

We should remove the default "" value for this parameter.

Collaborator Author

In this test case, we generate an int vector for the date values and create the data file from it. That is fine for Velox, but we cannot create a DuckDB table from the same vectors, so we use a SQL statement to verify.

auto scanNodeId = plan->id();
auto it = planStats.find(scanNodeId);
ASSERT_TRUE(it != planStats.end());
ASSERT_TRUE(it->second.peakMemoryBytes > 0);
Collaborator

Why are you testing this? Isn't matching results sufficient?

Collaborator Author

Not required; removed it.

@@ -225,6 +226,41 @@ class HiveIcebergTest : public HiveConnectorTestBase {
ASSERT_TRUE(it->second.peakMemoryBytes > 0);
}

void assertQuery(
Collaborator

Maybe rename this function to assertPartitionKey.

Collaborator Author

I want to keep this name generic, since it could be used to test any case.

Collaborator

This method builds a very specific plan with a TableScanNode and IcebergSplits, whereas the assertQuery function name is used in all the TestBase classes for very generic purposes.

if (tableFormat == SplitReader::TableFormat::kIceberg) {
result = boost::lexical_cast<int32_t>(partitionValue.c_str());
} else {
result = DATE()->toDays((folly::StringPiece)partitionValue);
Collaborator

Please fix this to use a C++ cast since we are touching this code.

Collaborator Author

Sorry, I didn't get it. toDays converts a date into daysSinceEpoch, so where should the C++ cast go?

Collaborator

The C++ style cast:

static_cast<folly::StringPiece>(partitionValue)

See the example in a line you actually removed (below in the review).
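As a small illustration of the difference, the sketch below uses `std::string_view` as a stand-in for `folly::StringPiece` (an assumption made only to keep the example dependency-free); the helper names are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical consumer that takes a non-owning view, like toDays takes a StringPiece.
inline std::size_t viewLength(std::string_view view) {
  return view.size();
}

inline std::size_t lengthViaStaticCast(const std::string& partitionValue) {
  // A C-style cast, (std::string_view)partitionValue, would also compile here,
  // but in general it may silently fall back to a const_cast or reinterpret_cast.
  // static_cast only performs the well-defined conversion or fails to compile.
  return viewLength(static_cast<std::string_view>(partitionValue));
}
```

The conversion itself is identical in this case; the benefit of `static_cast` is that the compiler rejects it if the types ever stop being convertible in a well-defined way.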

Collaborator Author

Added C++ cast in both places.

@@ -154,6 +157,7 @@ class SplitReader {
std::shared_ptr<HiveColumnHandle>>* const partitionKeys_;
const ConnectorQueryCtx* connectorQueryCtx_;
const std::shared_ptr<const HiveConfig> hiveConfig_;
const TableFormat tableFormat_;
Collaborator

We don't need const for enums and scalars.

Collaborator Author

I made it a const reference.

Collaborator

Are we missing something? This is just a plain const, not a const reference. Also, for enums you don't need a reference, and the argument to the constructor is not a const& either.

Collaborator Author

Sorry, I was looking at a different place altogether. You are right. I removed all const references.

@@ -634,12 +636,16 @@ namespace {
bool applyPartitionFilter(
const TypePtr& type,
const std::string& partitionValue,
common::Filter* filter) {
common::Filter* filter,
const SplitReader::TableFormat& tableFormat) {
Collaborator

It's better to move the read-only const& parameters before the writable * ones, so let's make tableFormat the first parameter.


@nmahadevuni force-pushed the fix_ice_date_partition_value_parse branch 2 times, most recently from d3564ed to c7843f4, January 24, 2025 06:55

@nmahadevuni (Collaborator Author)

Thank you @aditi-pandit @majetideepak @czentgr. I have addressed your comments. Please review.

@nmahadevuni force-pushed the fix_ice_date_partition_value_parse branch from c7843f4 to 67b025c, January 27, 2025 06:58
Development

Successfully merging this pull request may close these issues.

[native] Iceberg read from partitioned Date column fails
5 participants