
ref(escalating-issues): Auto-transition tasks should update up to 500_000 groups per minute #56168

Conversation

NisanthanNanthakumar
Contributor

@NisanthanNanthakumar commented Sep 13, 2023

Objective:

The current implementation of the auto-transition tasks leads to spiky memory pressure on RabbitMQ, partly because of the hot shards of big orgs with a lot of groups. The alternative approach is to consistently send a fixed number of messages covering groups older than 7 days. There is no need to partition by org or project because all of those groups get the same status and substatus changes. This iteration sends up to 50 child tasks, each updating 10_000 groups. We will increase the number of tasks if the backlog of groups to be updated keeps growing.
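
For context, a minimal sketch of the fan-out described above. The constants, filters, and the keyword passed to the child task are assumptions for illustration, not the exact implementation in this PR.

from datetime import datetime, timedelta, timezone

from sentry.models import Group, GroupStatus
from sentry.tasks.auto_ongoing_issues import auto_transition_issues_new_to_ongoing
from sentry.types.group import GroupSubStatus
from sentry.utils.iterators import chunked
from sentry.utils.query import RangeQuerySetWrapper

ITERATOR_CHUNK = 10_000  # groups updated by each child task
MAX_CHILD_TASKS = 50     # cap per scheduler run, i.e. up to 500_000 groups per minute


def schedule_new_to_ongoing_sketch() -> None:
    # NEW groups older than 7 days; org/project no longer matter because every
    # matching group gets the same status/substatus change.
    seven_days_ago = datetime.now(timezone.utc) - timedelta(days=7)
    queryset = Group.objects.filter(
        status=GroupStatus.UNRESOLVED,
        substatus=GroupSubStatus.NEW,
        first_seen__lte=seven_days_ago,
    )
    for i, new_groups in enumerate(
        chunked(RangeQuerySetWrapper(queryset, step=ITERATOR_CHUNK), ITERATOR_CHUNK)
    ):
        if i >= MAX_CHILD_TASKS:
            break  # leave the remainder for the next run, one minute later
        # hypothetical call shape: each child task does one bulk update
        auto_transition_issues_new_to_ongoing.delay(
            group_ids=[group.id for group in new_groups]
        )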

github-actions bot added the Scope: Backend label (automatically applied to PRs that change backend components) Sep 13, 2023
codecov bot commented Sep 13, 2023

Codecov Report

Merging #56168 (7a6a69d) into master (0fffed9) will decrease coverage by 1.39%.
Report is 256 commits behind head on master.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #56168      +/-   ##
==========================================
- Coverage   79.99%   78.61%   -1.39%     
==========================================
  Files        5062     5079      +17     
  Lines      217728   218659     +931     
  Branches    36856    37014     +158     
==========================================
- Hits       174182   171895    -2287     
- Misses      38200    41204    +3004     
- Partials     5346     5560     +214     
Files Changed                              Coverage
src/sentry/conf/server.py                  ø
src/sentry/issues/update_inbox.py          ø
src/sentry/issues/ongoing.py               100.00%
src/sentry/tasks/auto_ongoing_issues.py    100.00%

@@ -162,20 +124,19 @@ def auto_transition_issues_new_to_ongoing(
for new_groups in chunked(
RangeQuerySetWrapper(
Group.objects.filter(
Contributor Author

this query needs the new index from this PR: #56180

@NisanthanNanthakumar changed the title from Escalating issues/ref hot org shards in auto transition tasks to ref(escalating-issues): Auto-transition tasks should update up to 500_000 groups per minute Sep 13, 2023
@NisanthanNanthakumar marked this pull request as ready for review September 13, 2023 21:49
@NisanthanNanthakumar requested a review from a team as a code owner September 13, 2023 21:49
# Run job every 10 minutes
"schedule": crontab(minute="*/10"),
# Run job every minute
"schedule": crontab(minute="*/1"),
Contributor Author

changing to run every minute.
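
For context, the surrounding beat entry in src/sentry/conf/server.py looks roughly like this after the change; the entry key, task path, and options below are illustrative, and only the crontab expression comes from the diff above.

from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    # illustrative key and task path
    "schedule_auto_transition_to_ongoing": {
        "task": "sentry.tasks.schedule_auto_transition_to_ongoing",
        # Run job every minute (previously */10)
        "schedule": crontab(minute="*/1"),
        # assumed option: let missed runs expire rather than pile up
        "options": {"expires": 3600},
    },
}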

@hubertsentry (Contributor)

@NisanthanNanthakumar do you know what the throughput for the worker is? We need to make sure workers can consume 50000 groups in a minute.

@armenzg (Member) left a comment

This looks very good. My only request for changes has to do with the change of signature for the tasks, since tasks already in the queue would face a conflict with the new signature.

first_seen_lte: int,
organization_id: int,
Member

Changing the signature of a task is a risky change.
You have to write a PR that can handle both the old signature and the new signature (probably by making the call use keyword args).
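
A hedged sketch of the backward-compatible shape being asked for here: keep the legacy parameters but default everything, so tasks already in the queue (old signature) and newly enqueued tasks (only first_seen_lte) both still run. The decorator arguments and parameter order are assumptions based on the snippets in this review.

from typing import List, Optional

from sentry.tasks.base import instrumented_task


@instrumented_task(name="sentry.tasks.auto_transition_issues_new_to_ongoing")
def auto_transition_issues_new_to_ongoing(
    project_ids: Optional[List[int]] = None,  # legacy arg, only used by the fallback path
    organization_id: Optional[int] = None,    # legacy arg, only used by the fallback path
    first_seen_lte: Optional[int] = None,     # new-style cutoff (unix timestamp)
    **kwargs,
) -> None:
    if first_seen_lte is None:
        # message enqueued by the old scheduler: fall back to the old per-org path
        ...
        return
    # new path: a single bulk update keyed only on first_seen_lte
    ...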

Contributor Author

@armenzg ahh yea, forgot about this! I will keep this signature for this PR. Once this PR gets deployed, new tasks will be instantiated with only first_seen_lte. Then I'll make a new PR to remove the unnecessary args.

@@ -201,41 +181,67 @@ def auto_transition_issues_new_to_ongoing(
@retry(on=(OperationalError,))
@log_error_if_queue_has_items
def auto_transition_issues_regressed_to_ongoing(
project_ids: List[int],
Member

Same comment as above.

silo_mode=SiloMode.REGION,
)
@retry(on=(OperationalError,))
def run_auto_transition_issues_escalating_to_ongoing(
Member

Optional feedback: This function and run_auto_transition_issues_regressed_to_ongoing are almost the same, in case you want to join them and pass the status/substatus as parameters.

@NisanthanNanthakumar (Contributor, Author) Sep 14, 2023

I want to leave it as separate child tasks for analytics purposes.
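
For reference, the consolidation suggested here (ultimately not adopted, to keep separate child tasks for analytics) might look roughly like this. GroupStatus and GroupSubStatus come from the codebase, but the helper bulk_transition_group_to_ongoing and its keyword arguments are assumptions loosely mirroring src/sentry/issues/ongoing.py from the coverage report.

from typing import List

from sentry.models import GroupStatus
from sentry.types.group import GroupSubStatus


def run_auto_transition_issues_to_ongoing(
    group_ids: List[int],
    from_substatus: int,  # e.g. GroupSubStatus.ESCALATING or GroupSubStatus.REGRESSED
) -> None:
    # One child task parameterized on the substatus being transitioned,
    # instead of near-identical escalating/regressed variants.
    from sentry.issues.ongoing import bulk_transition_group_to_ongoing  # assumed helper

    bulk_transition_group_to_ongoing(
        from_status=GroupStatus.UNRESOLVED,
        from_substatus=from_substatus,
        group_ids=group_ids,
    )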

@NisanthanNanthakumar (Contributor, Author)

@hubertsentry this graph shows the current duration of the auto_transition_issues_new_to_ongoing task.

It peaks at 12s for hot shards, but this iteration will limit each child task to a single bulk update of 10_000 groups, so I expect it to be much faster.

@hubertsentry (Contributor)

Sounds good. Since we have at least 1 worker, and 1 worker should have 2 processes, it might not be too bad.

@armenzg (Member) left a comment

This looks great! Good test.

NisanthanNanthakumar pushed a commit that referenced this pull request Sep 15, 2023
date_added_lte=int(seven_days_ago.timestamp()),
expires=now + timedelta(hours=1),
)
schedule_auto_transition_issues_new_to_ongoing.delay(
@NisanthanNanthakumar (Contributor, Author) Sep 19, 2023

renamed the child tasks so that I can remove the unnecessary positional arguments.

"most_recent_group_first_seen_seven_days_ago": most_recent_group_first_seen_seven_days_ago.id,
"first_seen_lte": first_seen_lte,
},
)

for new_groups in chunked(
base_queryset = Group.objects.filter(
Contributor Author

we will reuse the base queryset for the analytics query.

step=ITERATOR_CHUNK,
limit=ITERATOR_CHUNK * 50,
result_value_getter=lambda item: item,
callbacks=[get_last_id, get_total_count],
Contributor Author

use the callback functionality to get the id of the last object across all the potential iterations, as well as the total count
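
A hedged sketch of how such callbacks might be wired up, assuming RangeQuerySetWrapper calls each callback with the list of results from an iteration (ids here, because of result_value_getter); the variable names and the child-task hand-off are illustrative.

from typing import List, Optional, Tuple

from sentry.utils.iterators import chunked
from sentry.utils.query import RangeQuerySetWrapper

ITERATOR_CHUNK = 10_000


def iterate_new_groups_sketch(base_queryset) -> Tuple[Optional[int], int]:
    last_id: Optional[int] = None
    total_count = 0

    def get_last_id(results: List[int]) -> None:
        # remember the last group id the iterator reached across all iterations
        nonlocal last_id
        if results:
            last_id = results[-1]

    def get_total_count(results: List[int]) -> None:
        # running total of groups touched in this run
        nonlocal total_count
        total_count += len(results)

    for group_ids in chunked(
        RangeQuerySetWrapper(
            base_queryset.values_list("id", flat=True),
            step=ITERATOR_CHUNK,
            limit=ITERATOR_CHUNK * 50,
            result_value_getter=lambda item: item,
            callbacks=[get_last_id, get_total_count],
        ),
        ITERATOR_CHUNK,
    ):
        ...  # hand group_ids to a child task

    return last_id, total_count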

Member

This is very cool. Good usage 👍🏻

)

remaining_groups_queryset = base_queryset._clone()
Contributor Author

we use _clone() to ensure that we're working with separate queryset instances
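
Continuing the sketch: the clone gives an independent QuerySet (no shared result cache or iterator state), which can then be used to count whatever this run did not reach. Names are illustrative.

from typing import Optional

from django.db.models import QuerySet


def count_remaining_groups(base_queryset: QuerySet, last_id: Optional[int]) -> int:
    remaining_groups_queryset = base_queryset._clone()
    if last_id is not None:
        # only the groups beyond the last id the iterator reached
        remaining_groups_queryset = remaining_groups_queryset.filter(id__gt=last_id)
    # e.g. reported in an analytics event alongside the total count
    return remaining_groups_queryset.count()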

@armenzg (Member) left a comment

Good job on all those tests! 🎉

@NisanthanNanthakumar merged commit 58ee52a into master Sep 20, 2023
51 checks passed
@NisanthanNanthakumar deleted the escalating-issues/ref-hot-org-shards-in-auto-transition-tasks branch September 20, 2023 16:58
sentry-io bot commented Sep 20, 2023

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ OperationalError: QueryCanceled('canceling statement due to user request\n') sentry.tasks.schedule_auto_transition_issues_ne...
  • ‼️ OperationalError: QueryCanceled('canceling statement due to user request\nCONTEXT: parallel worker\n') sentry.tasks.schedule_auto_transition_issues_ne...


github-actions bot locked and limited conversation to collaborators Oct 11, 2023