
feat(mapping-optimizer): Support in operator for mapping optimizer #5685

Merged

Conversation

@Zylphrex Zylphrex commented Mar 25, 2024

This was a TODO item. But on the spans dataset, one easy-to-encounter situation is a condition like

```
sentry_tags[key] IN (value1, value2)
```

This results in SQL like

```
in((arrayElement(sentry_tags.value, indexOf(sentry_tags.key, 'key')) AS `_snuba_sentry_tags[key]`), ['value1', 'value2'])
```

which scans the entire `sentry_tags.key` and `sentry_tags.value` columns. The optimization here is to use the tags hash map, which gives us a condition like

```
hasAny(_sentry_tags_hash_map, array(cityHash64('key=value1'), cityHash64('key=value2')))
```
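
For illustration, here is a minimal Python sketch of that rewrite, assuming the hash-map column and the `cityHash64('key=value')` scheme from the description; the helper function itself is hypothetical and is not Snuba's actual optimizer code.

```
# Illustrative sketch only: mirrors the rewrite described above, but is not
# Snuba's optimizer implementation. The helper name is hypothetical; the
# _sentry_tags_hash_map column and cityHash64('key=value') scheme come from
# the PR description.
def optimize_in_condition(tag_key: str, values: list[str]) -> str:
    """Rewrite `sentry_tags[tag_key] IN (...)` as a hash-map membership test."""
    hashed = ", ".join(f"cityHash64('{tag_key}={value}')" for value in values)
    return f"hasAny(_sentry_tags_hash_map, array({hashed}))"


# Example:
# optimize_in_condition("environment", ["prod", "production"]) returns
# "hasAny(_sentry_tags_hash_map, array(cityHash64('environment=prod'), cityHash64('environment=production')))"
```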

@Zylphrex Zylphrex requested a review from a team as a code owner March 25, 2024 17:33

codecov bot commented Mar 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.93%. Comparing base (f5f9208) to head (cfe6365).
Report is 1 commit behind head on master.

✅ All tests successful. No failed tests found ☺️

Additional details and impacted files
```
@@           Coverage Diff           @@
##           master    #5685   +/-   ##
=======================================
  Coverage   89.92%   89.93%
=======================================
  Files         898      898
  Lines       43453    43474   +21
  Branches      299      299
=======================================
+ Hits        39077    39098   +21
  Misses       4334     4334
  Partials       42       42
```


```diff
@@ -265,7 +351,7 @@ def _get_condition_without_redundant_checks(
         if tag_exist_match:
             matched_tag_exists_conditions[condition_id] = tag_exist_match
         if not tag_exist_match:
-            eq_match = self.__optimizable_pattern.match(cond)
+            eq_match = self.__equals_condition_pattern.match(cond)
```
Member Author
@volokluev would you happen to know if I need to implement this removal of redundant checks for IN conditions?

Member
You could, but it's not strictly necessary. I'm not sure how often we get those cases with IN conditions. Definitely something that can be added later.

Member Author

I think this is less common on the older datasets, but it's more likely to happen with the spans dataset, as we have sentry_tags, which contains some more commonly used columns.

The example I ran into was with environment. For a 24h period, it read >48GiB of data, and after applying this optimization, I saw it reduced to <24GiB of data. Over 7-day periods, the query was already timing out, so this optimization should already be helpful.

Member

Oh yes, absolutely. Merge the PR. I was saying that the redundant-clause optimization is probably not going to be as applicable for IN clauses.
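
For readers unfamiliar with the redundant checks discussed in this thread, a hedged illustration follows (condition strings only; Snuba's optimizer matches the query AST, and these examples are not its actual code): a separate tag-existence check is redundant once the same tag is constrained by an equality, and the same reasoning would extend to an IN condition.

```
# Hypothetical illustration, not Snuba code: the existence check on the left is
# implied by the comparison on the same tag, so it can be dropped before the
# mapping optimizer rewrites the remaining condition into a hash-map lookup.
redundant_eq = (
    "has(sentry_tags.key, 'environment') AND sentry_tags[environment] = 'prod'"
)
redundant_in = (
    "has(sentry_tags.key, 'environment') "
    "AND sentry_tags[environment] IN ('prod', 'production')"
)

# After removing the redundant existence checks, only the comparisons remain:
simplified_eq = "sentry_tags[environment] = 'prod'"
simplified_in = "sentry_tags[environment] IN ('prod', 'production')"
```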

@Zylphrex Zylphrex merged commit cf89313 into master Mar 26, 2024
32 checks passed
@Zylphrex Zylphrex deleted the txiao/feat/support-in-operator-for-mapping-optimizers branch March 26, 2024 14:14
@getsentry-bot
Contributor

PR reverted: 8c6329d

getsentry-bot added a commit that referenced this pull request Mar 26, 2024
…mizer (#5685)"

This reverts commit cf89313.

Co-authored-by: Zylphrex <10239353+Zylphrex@users.noreply.github.com>
Zylphrex added a commit that referenced this pull request Mar 26, 2024
Re-apply #5685

Zylphrex added a commit that referenced this pull request Mar 26, 2024
…5691)

Re-apply #5685
Zylphrex added a commit that referenced this pull request Mar 27, 2024
…5691) (#5692)

Re-apply #5685
