fix(iceberg): Date partition value parse issue #12126
return applyFilter(*filter, result.value());
int32_t result = 0;
if (tableFormat == SplitReader::TableFormat::kIceberg) {
  result = boost::lexical_cast<int32_t>(partitionValue.c_str());
Can we use std::stoi here instead of including a Boost header?
Other comments mention that boost::lexical_cast is slow, which could be a problem.
Using std::stoi would also avoid a new include and a new dependency.
std::stoi converts a string like "2022-04-05" to the int value 2022, so it's not safe and may lead to wrong results.
std::stoi should work here. https://www.geeksforgeeks.org/convert-string-to-int-in-cpp/
std::stoi works for converting an integer string to an int. But it also doesn't throw an error for an input like "2022-04-05"; it just converts it to 2022.
Well, what number does boost parse 2022-04-05 into? It is not a valid integer in the first place, so neither function will work.
Also, you are forgetting that the string is actually the days since epoch, as stated in your description, which means it is not a date-formatted string. If it were, you would need a date parser here, not a string-to-int parse function.
Changed to use std::stoi.
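For reference, the behavior debated in this thread can be reproduced with a small standalone sketch. This is not the code in this PR: parseInt32Strict is a hypothetical helper name, and it assumes that rejecting trailing characters is the desired behavior. It relies on the optional pos argument of std::stoi to detect a partial parse such as "2022-04-05".

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Velox): parses the whole string as an
// int32_t. Returns false when std::stoi stops before the end of the input,
// when there is nothing to parse, or when the value is out of range.
// Note: std::stoi still accepts leading whitespace and a leading sign.
bool parseInt32Strict(const std::string& text, int32_t& out) {
  try {
    std::size_t consumed = 0;
    const int value = std::stoi(text, &consumed);
    if (consumed != text.size()) {
      // e.g. "2022-04-05": std::stoi parsed "2022" and stopped at '-'.
      return false;
    }
    out = static_cast<int32_t>(value);
    return true;
  } catch (const std::invalid_argument&) {
    return false; // No digits at all, e.g. an empty string.
  } catch (const std::out_of_range&) {
    return false; // Value does not fit in int.
  }
}
```

Under these assumptions, a plain std::stoi("2022-04-05") yields 2022, while parseInt32Strict rejects that input and accepts "17627".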
@nmahadevuni: Thanks for this fix. I have a bunch of review comments.
VELOX_CHECK(!result.hasError());
return applyFilter(*filter, result.value());
int32_t result = 0;
if (tableFormat == SplitReader::TableFormat::kIceberg) {
Please add a comment here about this behavior.
Also please add to the PR description.
    partitionKeys = {},
const std::vector<std::string> filters = {},
const std::string duckDbSql = "",
const int32_t numPrefetchSplits = 0) {
Why does the function have this parameter? It seems we never really test its use.
Removed this.
partitionKeys["ds"] = "17627";

std::vector<RowVectorPtr> dataVectors;
VectorPtr c0 = vectorMaker_.flatVector<int64_t>({1});
vectorMaker_ is deprecated. Use makeFlatVector API instead.
@@ -477,6 +514,15 @@ class HiveIcebergTest : public HiveConnectorTestBase {
return PlanBuilder(pool_.get()).tableScan(rowType_).planNode();
}

core::PlanNodePtr tableScanNode(
I don't think this function is needed, as it's only used in a single place and doesn't really represent anything.
using T = typename TypeTraits<kind>::NativeType;
if (!value.has_value()) {
  return std::make_shared<ConstantVector<T>>(pool, size, true, type, T());
}

if (type->isDate()) {
  auto days = DATE()->toDays(static_cast<folly::StringPiece>(value.value()));
  int32_t days = 0;
This seems like the wrong level of abstraction at which to fix this problem, especially if this is a generic function.
We should add this logic in SplitReader::setPartitionValue directly.
Since all of these are Hive/Iceberg util functions, this should be OK? This issue happens in three cases; we can move the logic to SplitReader::setPartitionValue for only one of them, and in the other cases we cannot.
Do you anticipate this conversion for other info columns as well?
The name of this function is too generic, but looking at its uses, it is only for filter values. I feel we can rename it to filterValueFromString so that its application is more restrictive.
What are the 3 cases?
Rechecked; it's in 2 places. This conversion happens from SplitReader::setPartitionValue() and SplitReader::filterOnStats(). I think the current name is appropriate, as it's not only used for filters.
const std::unordered_map<std::string, std::optional<std::string>>
    partitionKeys = {},
const std::vector<std::string> filters = {},
const std::string duckDbSql = "",
It is not reasonable to have an empty duckDbSql for this function, as that is the main SQL string to verify the plan results against. Since we know the plan we are generating, it might be better to build the duckDbSql in this function itself.
For this test case, we don't know how to generate the duckDbSql. When we add more test cases, we can enhance this function.
I added an empty string check.
This code sets up a very specific plan with a TableScanNode, partition filters, and column filters. This maps to quite precise SQL, so we could just build it in the logic.
But I'm also fine with the SQL being passed as a parameter.
We should remove the default "" value for this parameter.
In this test case, we generate an int vector for the date values and create the data file, which is OK for Velox, but we cannot create a DuckDB table with the same vectors, so we use a SQL statement to verify.
auto scanNodeId = plan->id();
auto it = planStats.find(scanNodeId);
ASSERT_TRUE(it != planStats.end());
ASSERT_TRUE(it->second.peakMemoryBytes > 0);
Why are you testing this? Isn't matching results sufficient?
Not required; removed it.
@@ -225,6 +226,41 @@ class HiveIcebergTest : public HiveConnectorTestBase {
ASSERT_TRUE(it->second.peakMemoryBytes > 0);
}

void assertQuery(
Maybe rename this function to assertPartitionKey.
I want to keep this name generic, since it could be used to test any case.
This method is building a very specific plan with TableScanNode and IcebergSplits... The assertQuery function name is being used in all the TestBase classes for very generic usage.
if (tableFormat == SplitReader::TableFormat::kIceberg) {
  result = boost::lexical_cast<int32_t>(partitionValue.c_str());
} else {
  result = DATE()->toDays((folly::StringPiece)partitionValue);
Please fix this to use a C++ cast, since we are touching this code.
Sorry, I didn't get it. toDays converts a date into daysSinceEpoch; where should the C++ cast be used?
The C++ style cast:
static_cast<folly::StringPiece>(partitionValue)
See the example in a line you actually removed (below in the review).
Added C++ cast in both places.
velox/connectors/hive/SplitReader.h
Outdated
@@ -154,6 +157,7 @@ class SplitReader {
std::shared_ptr<HiveColumnHandle>>* const partitionKeys_;
const ConnectorQueryCtx* connectorQueryCtx_;
const std::shared_ptr<const HiveConfig> hiveConfig_;
const TableFormat tableFormat_;
We don't need a const for enums and scalars.
I made it a const reference.
Are we missing something? This is just a plain const and not a const reference. Also for enums you don't need a reference. And the argument to the constructor is not a const & either.
Sorry, I was looking at a different place altogether. You are right. I removed all const references.
@@ -634,12 +636,16 @@ namespace {
bool applyPartitionFilter(
    const TypePtr& type,
    const std::string& partitionValue,
    common::Filter* filter) {
    common::Filter* filter,
    const SplitReader::TableFormat& tableFormat) {
It's better to move the read-only const& parameters before the writable * ones, so let's move tableFormat to be the first parameter.
d3564ed to c7843f4
Thank you @aditi-pandit @majetideepak @czentgr. I have addressed your comments. Please review.
c7843f4 to 67b025c
Fixes prestodb/presto#24371
Iceberg partition values are already in daysSinceEpoch, but Velox assumes they are in date form and tries to convert them as it does for Hive. This PR fixes that.
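The difference described above can be illustrated with a standalone sketch. It is not the Velox implementation: TableFormat, daysFromCivil, and partitionDateToDays are hypothetical stand-ins (Velox uses DATE()->toDays for the Hive path), and the date parser here only handles the plain YYYY-MM-DD form with a loose range check.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for SplitReader::TableFormat.
enum class TableFormat { kHive, kIceberg };

// Days from 1970-01-01 to a proleptic Gregorian date, using the widely
// known days-from-civil algorithm.
int32_t daysFromCivil(int year, int month, int day) {
  year -= month <= 2;
  const int era = (year >= 0 ? year : year - 399) / 400;
  const int yearOfEra = year - era * 400; // [0, 399]
  const int dayOfYear =
      (153 * (month > 2 ? month - 3 : month + 9) + 2) / 5 + day - 1;
  const int dayOfEra =
      yearOfEra * 365 + yearOfEra / 4 - yearOfEra / 100 + dayOfYear;
  return era * 146097 + dayOfEra - 719468;
}

// Hypothetical helper: converts a date partition value to days since epoch.
// Iceberg values are assumed to already be an integer day count; Hive values
// are assumed to be YYYY-MM-DD strings. Returns false on malformed input.
bool partitionDateToDays(
    const std::string& value,
    TableFormat format,
    int32_t& days) {
  if (format == TableFormat::kIceberg) {
    try {
      std::size_t consumed = 0;
      const int parsed = std::stoi(value, &consumed);
      if (consumed != value.size()) {
        return false; // Trailing characters, e.g. a date-formatted string.
      }
      days = static_cast<int32_t>(parsed);
      return true;
    } catch (const std::exception&) {
      return false; // Not a number, or out of range.
    }
  }
  int year = 0;
  int month = 0;
  int day = 0;
  char trailing = 0;
  // Exactly three fields must match; a fourth match means trailing input.
  if (std::sscanf(
          value.c_str(), "%d-%d-%d%c", &year, &month, &day, &trailing) != 3) {
    return false;
  }
  // Loose check only: does not validate the day against the month length.
  if (month < 1 || month > 12 || day < 1 || day > 31) {
    return false;
  }
  days = daysFromCivil(year, month, day);
  return true;
}
```

With this sketch, the Iceberg value "17627" and the Hive value "2018-04-06" both map to 17627, while feeding either string to the other format's path is rejected.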