-
-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ref(similarity): Run single query for events with large event counts #72183
Conversation
🔍 Existing Issues For ReviewYour pull request is modifying functions with the following pre-existing issues: 📄 File: src/sentry/tasks/backfill_seer_grouping_records.py
Did you find this useful? React with a 👍 or 👎 |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #72183 +/- ##
=======================================
Coverage 78.04% 78.05%
=======================================
Files 6596 6596
Lines 293886 293904 +18
Branches 50700 50706 +6
=======================================
+ Hits 229370 229394 +24
+ Misses 58256 58255 -1
+ Partials 6260 6255 -5
|
group_id_last_seen_no_embeddings_high_count = { | ||
group_id: last_seen | ||
for (group_id, _, times_seen, last_seen) in groups_to_backfill_batch | ||
if group_id in groups_to_backfill_with_no_embedding and times_seen >= HIGH_GROUP_EVENT_COUNT | ||
} | ||
group_id_last_seen_no_embeddings_low_count = { | ||
group_id: last_seen | ||
for (group_id, _, _, last_seen) in groups_to_backfill_batch | ||
if group_id not in group_id_last_seen_no_embeddings_high_count | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: i think it would be cleaner if we just returned one dictionary, and you can put the count in the dictionary, so it'd be
group_id: {last_seen: 12123123, times_seen: 34343}
which is returned from this function
snuba_results_high_count = get_data_from_snuba_single_group_query( | ||
project, group_id_last_seen_no_embeddings_high_count | ||
) | ||
snuba_results_low_count = get_data_from_snuba_bulk_groups_query( | ||
project, group_id_last_seen_no_embeddings_low_count | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be cleaner if you put this logic into one function which this main function would call, and then you'd combine the results in that function. so that way further down in the logic, it doesn't have to care what a high result / low result is, it's all snuba results that are treated the same
snuba_results_high_count, | ||
snuba_results_low_count, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this function's arguments could remain the same if you do the above refactor
@@ -61,6 +74,11 @@ class GroupStacktraceData(TypedDict): | |||
stacktrace_list: list[str] | |||
|
|||
|
|||
class EventGroupSnubaResult(TypedDict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't appear to be used?
match=events_entity, | ||
select=[ | ||
Column("group_id"), | ||
Function("max", [Column("event_id")], "event_id"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we change the function to be any
instead of max
?
and one final request: could we make it so that the bulk queries are controled by an option? e.g. if the option is false, we only run the single event queries. if the option is true, we'll try to do the bulk as well i have a feeling we may run into snuba issues even if we filter out high event counts from the bulk query. |
This pull request has gone three weeks without activity. In another week, I will close it. But! If you comment or otherwise update it, I will reset the clock, and if you add the label "A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀 |
Modify record backfill script to only run single queries for groups with over 1 million events
Run bulk queries with modified timestamp for groups with under 1 million events