Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Render date literals as strings in time range filters #724

Merged
merged 1 commit into from
Aug 16, 2023

Conversation

tlento
Copy link
Contributor

@tlento tlento commented Aug 15, 2023

MetricFlow's rendering has historically cast date literals to
the underlying engine type in date range comparison filters, so
if a user passed in a start_time of 2021-01-01 and an end_time
of 2021-01-31 we would render something like:

WHERE ds BETWEEN CAST('2021-01-01' AS TIMESTAMP) AND CAST('2021-01-31' AS TIMESTAMP)

This turns out to cause problems with time partition columns, because
they typically only apply pruning to literal predicates - any value
processed by a CAST function prevents pruning from kicking in, as the
engine takes the appropriately conservative approach of doing full
row-by-row comparisons between the cast output and the partition value.

In researching this issue two things became clear:

  1. We need a holistic answer to ensuring engines apply partition
    pruning, one which is beyond the scope of this change.
  2. We can solve part of the problem by removing the CAST call
    in time range filters based on literal value inputs.

It turns out every engine we support does implicit type coercion
between string values and date/time types. This means we don't have
to care what the appropriate time type column might be in this case,
because the standard behavior is to simply do the comparison correctly
regardless of underlying partition column type.

This change simple removes the CAST calls from this particular rendering
block, which should enable users to at least run queries that will do
some level of pruning, even if the solution is incomplete with respect
to the types of time partition columns it can address.

Note that we assume all engines accept and convert from ISO-standard
date strings, but this is not a change from the prior behavior, which
assumed the same thing but wrapped it in a CAST function call.

MetricFlow's rendering has historically cast date literals to
the underlying engine type in date range comparison filters, so
if a user passed in a start_time of 2021-01-01 and an end_time
of 2021-01-31 we would render something like:

`WHERE ds BETWEEN CAST('2021-01-01' AS TIMESTAMP) AND CAST('2021-01-31' AS TIMESTAMP)`

This turns out to cause problems with time partition columns, because
they typically only apply pruning to literal predicates - any value
processed by a CAST function prevents pruning from kicking in, as the
engine takes the appropriately conservative approach of doing full
row-by-row comparisons between the cast output and the partition value.

In researching this issue two things became clear:

1. We need a holistic answer to ensuring engines apply partition
pruning, one which is beyond the scope of this change.
2. We can solve part of the problem by removing the CAST call
in time range filters based on literal value inputs.

It turns out every engine we support does implicit type coercion
between string values and date/time types. This means we don't have
to care what the appropriate time type column might be in this case,
because the standard behavior is to simply do the comparison correctly
regardless of underlying partition column type.

This change simple removes the CAST calls from this particular rendering
block, which should enable users to at least run queries that will do
some level of pruning, even if the solution is incomplete with respect
to the types of time partition columns it can address.

Note that we assume all engines accept and convert from ISO-standard
date strings, but this is not a change from the prior behavior, which
assumed the same thing but wrapped it in a CAST function call.
Copy link
Contributor Author

tlento commented Aug 15, 2023

@tlento tlento linked an issue Aug 15, 2023 that may be closed by this pull request
3 tasks
@tlento tlento requested a review from plypaul August 15, 2023 06:16
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS August 15, 2023 06:17 — with GitHub Actions Inactive
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS August 15, 2023 06:17 — with GitHub Actions Inactive
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS August 15, 2023 06:17 — with GitHub Actions Inactive
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS August 15, 2023 06:17 — with GitHub Actions Inactive
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS August 15, 2023 06:17 — with GitHub Actions Inactive
Copy link
Contributor

@plypaul plypaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I recall, this was put in due to an issue with a previously supported engine.

Base automatically changed from remove-environment-variable-helper to main August 16, 2023 19:52
@tlento
Copy link
Contributor Author

tlento commented Aug 16, 2023

As I recall, this was put in due to an issue with a previously supported engine.

That's what I remember too but I can't find any evidence of it. It's possible the offending engine has updated their type handling. It's also possible we didn't understand BigQuery's coercion logic - their documentation on this is hidden well enough that it appears BigQuery simply doesn't support it, but they do (and they have to because of the way their partition pruning works).

@tlento tlento merged commit 82a278f into main Aug 16, 2023
24 checks passed
@tlento tlento deleted the patch-fix-time-partition-pruning-error branch August 16, 2023 19:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Add support for partition elimination in BigQuery
2 participants