ref(similarity): Run single query for events with large event counts #72183

Closed
wants to merge 9 commits into from

Member

@jangjodi jangjodi commented Jun 5, 2024

Modify record backfill script to only run single queries for groups with over 1 million events
Run bulk queries with modified timestamp for groups with under 1 million events

@github-actions github-actions bot added the Scope: Backend label Jun 5, 2024

sentry-io bot commented Jun 5, 2024

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: src/sentry/tasks/backfill_seer_grouping_records.py

Function: backfill_seer_grouping_records
Unhandled Issue: KeyError: 5172984251 sentry.tasks.backfill_seer_g...
Event Count: 1

@jangjodi jangjodi requested a review from a team June 5, 2024 23:14
Base automatically changed from jodi/record-backfill-use-event-message to master June 6, 2024 00:06

codecov bot commented Jun 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.05%. Comparing base (04ea071) to head (c9a236c).
Report is 862 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #72183   +/-   ##
=======================================
  Coverage   78.04%   78.05%           
=======================================
  Files        6596     6596           
  Lines      293886   293904   +18     
  Branches    50700    50706    +6     
=======================================
+ Hits       229370   229394   +24     
+ Misses      58256    58255    -1     
+ Partials     6260     6255    -5     
Files Coverage Δ
src/sentry/tasks/backfill_seer_grouping_records.py 93.28% <100.00%> (+4.34%) ⬆️

... and 4 files with indirect coverage changes

@jangjodi jangjodi marked this pull request as ready for review June 12, 2024 15:23
Comment on lines +326 to +335
group_id_last_seen_no_embeddings_high_count = {
    group_id: last_seen
    for (group_id, _, times_seen, last_seen) in groups_to_backfill_batch
    if group_id in groups_to_backfill_with_no_embedding and times_seen >= HIGH_GROUP_EVENT_COUNT
}
group_id_last_seen_no_embeddings_low_count = {
    group_id: last_seen
    for (group_id, _, _, last_seen) in groups_to_backfill_batch
    if group_id not in group_id_last_seen_no_embeddings_high_count
}
Member

nit: I think it would be cleaner if this function returned a single dictionary with the count stored alongside the timestamp, so each entry would look like

group_id: {last_seen: 12123123, times_seen: 34343}
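
A minimal sketch of that suggestion, assuming the batch rows are 4-tuples of (group_id, <unused>, times_seen, last_seen) as in the snippet above. The helper name `build_group_info` and the `GroupInfo` type are hypothetical and do not appear in the PR:

```python
from collections.abc import Collection
from typing import Any, TypedDict


class GroupInfo(TypedDict):
    # Hypothetical type; the PR does not show the timestamp type, so Any is used.
    last_seen: Any
    times_seen: int


def build_group_info(
    groups_to_backfill_batch: list[tuple[int, Any, int, Any]],
    groups_to_backfill_with_no_embedding: Collection[int],
) -> dict[int, GroupInfo]:
    """Return one dict keyed by group_id instead of separate high/low dicts."""
    return {
        group_id: {"last_seen": last_seen, "times_seen": times_seen}
        for (group_id, _, times_seen, last_seen) in groups_to_backfill_batch
        if group_id in groups_to_backfill_with_no_embedding
    }
```

Callers can then decide per group whether `times_seen` crosses the threshold, rather than receiving two pre-split dictionaries.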

Comment on lines +139 to +144
snuba_results_high_count = get_data_from_snuba_single_group_query(
    project, group_id_last_seen_no_embeddings_high_count
)
snuba_results_low_count = get_data_from_snuba_bulk_groups_query(
    project, group_id_last_seen_no_embeddings_low_count
)
Member

I think it would be cleaner to move this logic into one helper that the main function calls and that combines the results itself. That way the logic further down doesn't have to care what a high result or a low result is; it's all Snuba results that are treated the same.
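
One way this refactor could look, as a sketch only. It assumes each group maps to a dict with `last_seen` and `times_seen`, that both query functions return a list of result rows, and that the threshold is 1 million as stated in the PR description. The name `get_snuba_results` is hypothetical, and the two query functions are passed in as parameters so the sketch stays self-contained:

```python
from typing import Any, Callable

# Assumed value, taken from the "1 million events" wording in the PR description.
HIGH_GROUP_EVENT_COUNT = 1_000_000

# Assumed shape: (project, {group_id: last_seen}) -> list of result rows.
QueryFn = Callable[[Any, dict[int, Any]], list[dict[str, Any]]]


def get_snuba_results(
    project: Any,
    group_info: dict[int, dict[str, Any]],
    single_group_query: QueryFn,
    bulk_groups_query: QueryFn,
    threshold: int = HIGH_GROUP_EVENT_COUNT,
) -> list[dict[str, Any]]:
    """Split groups by event count, run the matching query, and merge the rows."""
    high = {
        gid: info["last_seen"]
        for gid, info in group_info.items()
        if info["times_seen"] >= threshold
    }
    low = {
        gid: info["last_seen"]
        for gid, info in group_info.items()
        if info["times_seen"] < threshold
    }

    results: list[dict[str, Any]] = []
    if high:
        results.extend(single_group_query(project, high))
    if low:
        results.extend(bulk_groups_query(project, low))
    return results
```

The caller receives a single list and never sees the high/low split.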

Comment on lines +150 to +151
snuba_results_high_count,
snuba_results_low_count,
Member

so this function's arguments could remain the same if you do the above refactor

@@ -61,6 +74,11 @@ class GroupStacktraceData(TypedDict):
    stacktrace_list: list[str]


class EventGroupSnubaResult(TypedDict):
Member

this doesn't appear to be used?

match=events_entity,
select=[
    Column("group_id"),
    Function("max", [Column("event_id")], "event_id"),
Member

can we change the function to be any instead of max?

@JoshFerge
Member

And one final request: could we make the bulk queries controlled by an option? E.g. if the option is false, we only run the single event queries.

If the option is true, we'll try to do the bulk queries as well.

I have a feeling we may run into Snuba issues even if we filter out high event counts from the bulk query.
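
A rough sketch of the option gate, under stated assumptions: the flag is passed in as a plain boolean because no option name is given in this thread, the threshold of 1 million comes from the PR description, and `plan_queries` is a hypothetical helper that only decides which groups go to which query type:

```python
def plan_queries(
    group_times_seen: dict[int, int],
    bulk_enabled: bool,
    threshold: int = 1_000_000,  # assumed from the PR description
) -> tuple[list[int], list[int]]:
    """Return (group ids for single queries, group ids for the bulk query)."""
    if not bulk_enabled:
        # Option off: every group is fetched with its own single query.
        return sorted(group_times_seen), []

    # Option on: only high-count groups use single queries; the rest go in bulk.
    single = sorted(g for g, n in group_times_seen.items() if n >= threshold)
    bulk = sorted(g for g, n in group_times_seen.items() if n < threshold)
    return single, bulk
```

With the option off, the bulk path is skipped entirely, which gives a way to fall back if the bulk query still causes Snuba problems.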

@getsantry
Contributor

getsantry bot commented Jul 4, 2024

This pull request has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you add the label WIP, I will leave it alone unless WIP is removed ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@getsantry getsantry bot added the Stale label Jul 4, 2024
@getsantry getsantry bot closed this Jul 12, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jul 27, 2024
Labels
Scope: Backend, Stale
2 participants