feat(mapping-optimizer): Support in operator for mapping optimizer #5685
This was a TODO item. On the spans dataset, one easy-to-encounter situation is a condition like `sentry_tags[key] IN (value1, value2)`. This results in SQL like `` in((arrayElement(sentry_tags.value, indexOf(sentry_tags.key, 'key')) AS `_snuba_sentry_tags[key]`), ['value1', 'value2']) ``, which scans the entire `sentry_tags.key` and `sentry_tags.value` columns. The optimization here is to use the tags hash map, which gives us a condition like `hasAny(_sentry_tags_hash_map, array(cityHash64('environment=prod'), cityHash64('environment=production')))` (shown here for the key `environment` with the values `prod` and `production`).
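To make the rewrite concrete, here is a minimal sketch of how the `hasAny` condition could be assembled from a tag key and a list of values. It only builds the SQL text shown in the description; the function names `quote_literal` and `in_to_hash_map_condition` and the escaping rules are hypothetical and are not taken from the Snuba code base, which operates on its own expression AST rather than on strings.

```python
from typing import Sequence


def quote_literal(value: str) -> str:
    """Quote a string as a SQL string literal.

    Hypothetical minimal escaping (backslashes and single quotes only);
    this is an assumption, not the escaping used by the real optimizer.
    """
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def in_to_hash_map_condition(
    hash_map_column: str, key: str, values: Sequence[str]
) -> str:
    """Build a hasAny(...) condition equivalent to `tags[key] IN (values)`.

    Each candidate value becomes one cityHash64('key=value') entry, so the
    query can probe the precomputed hash map column instead of scanning the
    tag key and value arrays.
    """
    hashes = ", ".join(
        "cityHash64(" + quote_literal(key + "=" + value) + ")" for value in values
    )
    return "hasAny(" + hash_map_column + ", array(" + hashes + "))"


if __name__ == "__main__":
    print(
        in_to_hash_map_condition(
            "_sentry_tags_hash_map", "environment", ["prod", "production"]
        )
    )
```

Called with the key `environment` and the values `prod` and `production`, the sketch reproduces the condition quoted in the description.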
Codecov Report

All modified and coverable lines are covered by tests. All tests successful; no failed tests found.

Coverage diff, master vs. #5685:

| Metric | master | #5685 | +/- |
| --- | --- | --- | --- |
| Coverage | 89.92% | 89.93% | |
| Files | 898 | 898 | |
| Lines | 43453 | 43474 | +21 |
| Branches | 299 | 299 | |
| Hits | 39077 | 39098 | +21 |
| Misses | 4334 | 4334 | |
| Partials | 42 | 42 | |
@@ -265,7 +351,7 @@ def _get_condition_without_redundant_checks(
 if tag_exist_match:
     matched_tag_exists_conditions[condition_id] = tag_exist_match
 if not tag_exist_match:
-    eq_match = self.__optimizable_pattern.match(cond)
+    eq_match = self.__equals_condition_pattern.match(cond)
@volokluev would you happen to know if I need to implement this removal of redundant checks for IN conditions?
You could, but it's not strictly necessary. I'm not sure how often we get those cases with IN conditions. It's definitely something that can be added later.
I think this is less common on the older datasets but more likely to happen with the spans dataset, as we have `sentry_tags`, which contains some more commonly used columns.
The example I ran into was with `environment`. For a 24h period, it read >48GiB of data, and after applying this optimization, I saw that reduced to <24GiB. On 7-day periods, the query was already timing out, so this optimization should already be helpful.
Oh yes, absolutely. Merge the PR. I was saying that the redundant clause optimization is probably not going to be as applicable for IN clauses.
PR reverted: 8c6329d
Re-apply #5685

This was a TODO item. On the spans dataset, one easy-to-encounter situation is a condition like

```
sentry_tags[key] IN (value1, value2)
```

This results in SQL like

```
in((arrayElement(sentry_tags.value, indexOf(sentry_tags.key, 'key')) AS `_snuba_sentry_tags[key]`), ['value1', 'value2'])
```

which scans the entire `sentry_tags.key` and `sentry_tags.value` columns. The optimization here is to use the tags hash map, which gives us a condition like

```
hasAny(_sentry_tags_hash_map, array(cityHash64('key=value1'), cityHash64('key=value2')))
```
…5691) Re-apply #5685
…5691) (#5692) Re-apply #5685