
ref(escalating-issues): Auto-transition tasks should update up to 500_000 groups per minute #56168

Conversation

NisanthanNanthakumar
Contributor

@NisanthanNanthakumar commented Sep 13, 2023

Objective:

The current implementation of the auto-transition tasks leads to spiky memory pressure on RabbitMQ, partly because of the hot shards of big orgs with a lot of groups. The alternative approach is to consistently send a fixed number of messages covering groups older than 7 days. There is no need to partition by org or project because all of those groups get the same status and substatus changes. This iteration sends up to 50 child tasks, each updating 10_000 groups. We will increase the number of tasks if the backlog of groups to be updated keeps growing.
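
For context, a minimal sketch of the fan-out described above. The constants, filters, and the keyword passed to the child task are assumptions for illustration, not the exact implementation in this PR.

from datetime import datetime, timedelta, timezone

from sentry.models import Group, GroupStatus
from sentry.tasks.auto_ongoing_issues import auto_transition_issues_new_to_ongoing
from sentry.types.group import GroupSubStatus
from sentry.utils.iterators import chunked
from sentry.utils.query import RangeQuerySetWrapper

ITERATOR_CHUNK = 10_000  # groups updated by each child task
MAX_CHILD_TASKS = 50     # cap per scheduler run, i.e. up to 500_000 groups per minute


def schedule_new_to_ongoing_sketch() -> None:
    # NEW groups older than 7 days; org/project no longer matter because every
    # matching group gets the same status/substatus change.
    seven_days_ago = datetime.now(timezone.utc) - timedelta(days=7)
    queryset = Group.objects.filter(
        status=GroupStatus.UNRESOLVED,
        substatus=GroupSubStatus.NEW,
        first_seen__lte=seven_days_ago,
    )
    for i, new_groups in enumerate(
        chunked(RangeQuerySetWrapper(queryset, step=ITERATOR_CHUNK), ITERATOR_CHUNK)
    ):
        if i >= MAX_CHILD_TASKS:
            break  # leave the remainder for the next run, one minute later
        # hypothetical call shape: each child task does one bulk update
        auto_transition_issues_new_to_ongoing.delay(
            group_ids=[group.id for group in new_groups]
        )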

github-actions bot added the Scope: Backend label (automatically applied to PRs that change backend components) Sep 13, 2023
codecov bot commented Sep 13, 2023

Codecov Report

Merging #56168 (7a6a69d) into master (0fffed9) will decrease coverage by 1.39%.
Report is 256 commits behind head on master.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #56168      +/-   ##
==========================================
- Coverage   79.99%   78.61%   -1.39%     
==========================================
  Files        5062     5079      +17     
  Lines      217728   218659     +931     
  Branches    36856    37014     +158     
==========================================
- Hits       174182   171895    -2287     
- Misses      38200    41204    +3004     
- Partials     5346     5560     +214     
Files Changed                              Coverage
src/sentry/conf/server.py                  ø
src/sentry/issues/update_inbox.py          ø
src/sentry/issues/ongoing.py               100.00%
src/sentry/tasks/auto_ongoing_issues.py    100.00%

@@ -162,20 +124,19 @@ def auto_transition_issues_new_to_ongoing(
for new_groups in chunked(
RangeQuerySetWrapper(
Group.objects.filter(
Contributor Author

this query needs the new index from this PR: #56180

@NisanthanNanthakumar changed the title from Escalating issues/ref hot org shards in auto transition tasks to ref(escalating-issues): Auto-transition tasks should update up to 500_000 groups per minute Sep 13, 2023
@NisanthanNanthakumar marked this pull request as ready for review September 13, 2023 21:49
@NisanthanNanthakumar requested a review from a team as a code owner September 13, 2023 21:49
# Run job every 10 minutes
"schedule": crontab(minute="*/10"),
# Run job every minute
"schedule": crontab(minute="*/1"),
Contributor Author

changing to run every minute.
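
For context, the surrounding beat entry in src/sentry/conf/server.py looks roughly like this after the change; the entry key, task path, and options below are illustrative, and only the crontab expression comes from the diff above.

from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    # illustrative key and task path
    "schedule_auto_transition_to_ongoing": {
        "task": "sentry.tasks.schedule_auto_transition_to_ongoing",
        # Run job every minute (previously */10)
        "schedule": crontab(minute="*/1"),
        # assumed option: let missed runs expire rather than pile up
        "options": {"expires": 3600},
    },
}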

@hubertsentry (Contributor)

@NisanthanNanthakumar do you know what the throughput for the worker is? We need to make sure workers can consume 50000 groups in a minute.

@armenzg (Member) left a comment

This looks very good. My only request for changes has to do with the change of signature for the tasks, since tasks already in the queue would face a conflict with the new signature.

first_seen_lte: int,
organization_id: int,
Member

Changing the signature of a task is a risky change.
You have to write a PR that can handle both the old signature and the new signature (probably by making the call use keyword args).
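
A hedged sketch of the backward-compatible shape being asked for here: keep the legacy parameters but default everything, so tasks already in the queue (old signature) and newly enqueued tasks (only first_seen_lte) both still run. The decorator arguments and parameter order are assumptions based on the snippets in this review.

from typing import List, Optional

from sentry.tasks.base import instrumented_task


@instrumented_task(name="sentry.tasks.auto_transition_issues_new_to_ongoing")
def auto_transition_issues_new_to_ongoing(
    project_ids: Optional[List[int]] = None,  # legacy arg, only used by the fallback path
    organization_id: Optional[int] = None,    # legacy arg, only used by the fallback path
    first_seen_lte: Optional[int] = None,     # new-style cutoff (unix timestamp)
    **kwargs,
) -> None:
    if first_seen_lte is None:
        # message enqueued by the old scheduler: fall back to the old per-org path
        ...
        return
    # new path: a single bulk update keyed only on first_seen_lte
    ...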

Contributor Author

@armenzg ahh yea, forgot about this! I will keep this signature for this PR. Once this PR gets deployed, new tasks will be instantiated with only first_seen_lte. Then I'll make a new PR to remove the unnecessary args.

@@ -201,41 +181,67 @@ def auto_transition_issues_new_to_ongoing(
@retry(on=(OperationalError,))
@log_error_if_queue_has_items
def auto_transition_issues_regressed_to_ongoing(
project_ids: List[int],
Member

Same comment as above.

silo_mode=SiloMode.REGION,
)
@retry(on=(OperationalError,))
def run_auto_transition_issues_escalating_to_ongoing(
Member

Optional feedback: This function and run_auto_transition_issues_regressed_to_ongoing are almost the same, in case you want to join them and pass the status/substatus as parameters.

@NisanthanNanthakumar (Contributor, Author) Sep 14, 2023

I want to leave it as separate child tasks for analytics purposes.
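
For reference, the consolidation suggested here (ultimately not adopted, to keep separate child tasks for analytics) might look roughly like this. GroupStatus and GroupSubStatus come from the codebase, but the helper bulk_transition_group_to_ongoing and its keyword arguments are assumptions loosely mirroring src/sentry/issues/ongoing.py from the coverage report.

from typing import List

from sentry.models import GroupStatus
from sentry.types.group import GroupSubStatus


def run_auto_transition_issues_to_ongoing(
    group_ids: List[int],
    from_substatus: int,  # e.g. GroupSubStatus.ESCALATING or GroupSubStatus.REGRESSED
) -> None:
    # One child task parameterized on the substatus being transitioned,
    # instead of near-identical escalating/regressed variants.
    from sentry.issues.ongoing import bulk_transition_group_to_ongoing  # assumed helper

    bulk_transition_group_to_ongoing(
        from_status=GroupStatus.UNRESOLVED,
        from_substatus=from_substatus,
        group_ids=group_ids,
    )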

@NisanthanNanthakumar (Contributor, Author)

@hubertsentry this graph shows the current duration of the auto_transition_issues_new_to_ongoing task.

It peaks at 12s for hot shards, but this iteration will limit each child task to a single bulk update of 10_000 groups, so I expect it to be much faster.

@hubertsentry (Contributor)

Sounds good. Since we have at least 1 worker, and 1 worker should have 2 processes, it might not be too bad.

@armenzg (Member) left a comment

This looks great! Good test.

NisanthanNanthakumar pushed a commit that referenced this pull request Sep 15, 2023
date_added_lte=int(seven_days_ago.timestamp()),
expires=now + timedelta(hours=1),
)
schedule_auto_transition_issues_new_to_ongoing.delay(
@NisanthanNanthakumar (Contributor, Author) Sep 19, 2023

renamed the child tasks so that I can remove the unnecessary positional arguments.

"most_recent_group_first_seen_seven_days_ago": most_recent_group_first_seen_seven_days_ago.id,
"first_seen_lte": first_seen_lte,
},
)

for new_groups in chunked(
base_queryset = Group.objects.filter(
Contributor Author

we will reuse the base queryset for the analytics query.

step=ITERATOR_CHUNK,
limit=ITERATOR_CHUNK * 50,
result_value_getter=lambda item: item,
callbacks=[get_last_id, get_total_count],
Contributor Author

use the callback functionality to get the id of the last object across all the potential iterations, as well as the total count
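
A hedged sketch of how such callbacks might be wired up, assuming RangeQuerySetWrapper calls each callback with the list of results from an iteration (ids here, because of result_value_getter); the variable names and the child-task hand-off are illustrative.

from typing import List, Optional, Tuple

from sentry.utils.iterators import chunked
from sentry.utils.query import RangeQuerySetWrapper

ITERATOR_CHUNK = 10_000


def iterate_new_groups_sketch(base_queryset) -> Tuple[Optional[int], int]:
    last_id: Optional[int] = None
    total_count = 0

    def get_last_id(results: List[int]) -> None:
        # remember the last group id the iterator reached across all iterations
        nonlocal last_id
        if results:
            last_id = results[-1]

    def get_total_count(results: List[int]) -> None:
        # running total of groups touched in this run
        nonlocal total_count
        total_count += len(results)

    for group_ids in chunked(
        RangeQuerySetWrapper(
            base_queryset.values_list("id", flat=True),
            step=ITERATOR_CHUNK,
            limit=ITERATOR_CHUNK * 50,
            result_value_getter=lambda item: item,
            callbacks=[get_last_id, get_total_count],
        ),
        ITERATOR_CHUNK,
    ):
        ...  # hand group_ids to a child task

    return last_id, total_count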

Member

This is very cool. Good usage 👍🏻

)

remaining_groups_queryset = base_queryset._clone()
Contributor Author

we use _clone() to ensure that we're working with separate queryset instances
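
Continuing the sketch: the clone gives an independent QuerySet (no shared result cache or iterator state), which can then be used to count whatever this run did not reach. Names are illustrative.

from typing import Optional

from django.db.models import QuerySet


def count_remaining_groups(base_queryset: QuerySet, last_id: Optional[int]) -> int:
    remaining_groups_queryset = base_queryset._clone()
    if last_id is not None:
        # only the groups beyond the last id the iterator reached
        remaining_groups_queryset = remaining_groups_queryset.filter(id__gt=last_id)
    # e.g. reported in an analytics event alongside the total count
    return remaining_groups_queryset.count()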

@armenzg (Member) left a comment

Good job on all those tests! 🎉

@NisanthanNanthakumar merged commit 58ee52a into master Sep 20, 2023
51 checks passed
@NisanthanNanthakumar deleted the escalating-issues/ref-hot-org-shards-in-auto-transition-tasks branch September 20, 2023 16:58
sentry-io bot commented Sep 20, 2023

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ OperationalError: QueryCanceled('canceling statement due to user request\n') sentry.tasks.schedule_auto_transition_issues_ne...
  • ‼️ OperationalError: QueryCanceled('canceling statement due to user request\nCONTEXT: parallel worker\n') sentry.tasks.schedule_auto_transition_issues_ne...


github-actions bot locked and limited conversation to collaborators Oct 11, 2023