ref(crons): Correct logic for mark_ok #57730

Merged

Conversation

evanpurkhiser
Member

Previously, when incidents were enabled (a recovery threshold was set), a
monitor would not have its `next_checkin` or `next_checkin_latest`
updated until it recovered.

@evanpurkhiser evanpurkhiser requested a review from a team as a code owner October 6, 2023 21:52
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Oct 6, 2023
Comment on lines -22 to -23
@with_feature("organizations:issue-platform")
@patch("sentry.issues.producer.produce_occurrence_to_kafka")
Member Author

I've made this more unit-test like by removing the dependency on mark_failed from this test.

"-date_added"
)[:recovery_threshold]

# Incident recovers when ew have successive threshold check-ins
Contributor

nit: we*

# is not recovering
allow_status_update = True

# Resolve the incident if we have met the recovery recovery_threshold
Contributor

should this just be if we have met the recovery_threshold?

Member Author

yeah

)

# Resolve the incident
if incident_recovering and monitor_env.status != MonitorStatus.OK:
Contributor

@davidenwang davidenwang Oct 6, 2023

could we simplify the logic/execution overall here? Right now it seems like the order of conditions is:

  1. Check if we have a recovery threshold (if not just update the status)
  2. If we have, fetch N recent check-ins and check that they are all ok
  3. If they are all ok check to see if our monitor is in an incident (!= MonitorStatus.OK)
  4. If true, resolve the incident, and then allow a status update on the monitor

But would it be better to instead

  1. Check monitor_env.status != MonitorStatus.OK first, if it IS OK then we can do nothing
  2. If not OK, then check for a recovery threshold and proceed with the previous steps 2 and 4

Only mentioning this because, hopefully, the majority of the time users' monitor statuses should be OK, which means we shouldn't check the N most recent check-ins every time they send an OK check-in (assuming they have a recovery threshold).
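
A minimal sketch of the reordering suggested above, not the actual Sentry implementation. `Status`, `MonitorEnv`, `fetch_recent`, and `mark_ok_sketch` are hypothetical stand-ins; the real code works on `monitor_env` and `MonitorStatus` and queries check-ins from the database. It only models the status transition and leaves out the `next_checkin` / `next_checkin_latest` updates that this PR is about.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    OK = "ok"
    ERROR = "error"


@dataclass
class MonitorEnv:
    """Hypothetical in-memory stand-in for a monitor environment."""

    status: Status
    recovery_threshold: int = 0  # 0 means incidents are not enabled
    # Recent check-in statuses, newest first (stands in for a DB query)
    recent_checkins: List[Status] = field(default_factory=list)
    queries: int = 0  # counts how often recent check-ins were fetched

    def fetch_recent(self, n: int) -> List[Status]:
        self.queries += 1
        return self.recent_checkins[:n]


def mark_ok_sketch(env: MonitorEnv) -> bool:
    """Return True if the monitor is OK after handling an OK check-in."""
    # 1. Already OK: nothing to recover from, so skip the check-in lookup.
    if env.status == Status.OK:
        return True

    # 2. Not OK and no recovery threshold: update the status right away.
    if not env.recovery_threshold:
        env.status = Status.OK
        return True

    # 3. Not OK with a threshold: require N successive OK check-ins.
    recent = env.fetch_recent(env.recovery_threshold)
    recovered = len(recent) == env.recovery_threshold and all(
        s == Status.OK for s in recent
    )
    if recovered:
        env.status = Status.OK  # this is where the incident would be resolved
    return recovered
```

With this ordering, a monitor that is already OK never triggers the lookup of the N most recent check-ins, which is the common case the comment is concerned with.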

)
return

recovery_threshold = monitor_env.monitor.config.get("recovery_threshold", 0)
Contributor

this should be 1 as the default? or idk maybe a comment. the min value is 1

Member Author

I think you have the default here as 0 since if it's not set we don't want to enable incidents yet

return

recovery_threshold = monitor_env.monitor.config.get("recovery_threshold", 0)
using_incidents = bool(recovery_threshold)
Contributor

maybe just change this to recovery_threshold > 1 which is functionally the same thing for now

Member Author

Yeah, I basically just duplicated what the logic was before: `if recovery_threshold:`

@evanpurkhiser
Member Author

@rjo100 mind just approving as is and we can clean up after?

@evanpurkhiser evanpurkhiser enabled auto-merge (squash) October 10, 2023 19:51
@evanpurkhiser evanpurkhiser merged commit dcd6da4 into master Oct 10, 2023
49 checks passed
@evanpurkhiser evanpurkhiser deleted the evanpurkhiser/ref-crons-correct-logic-for-mark-ok branch October 10, 2023 20:00
@sentry-io

sentry-io bot commented Oct 10, 2023

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ OperationalError: QueryCanceled('canceling statement due to user request\n') monitors.monitor_consumer View Issue


@evanpurkhiser evanpurkhiser added the Trigger: Revert add to a merged PR to revert it (skips CI) label Oct 10, 2023
@getsentry-bot
Contributor

PR reverted: 8aa99c7

getsentry-bot added a commit that referenced this pull request Oct 10, 2023
This reverts commit dcd6da4.

Co-authored-by: evanpurkhiser <1421724+evanpurkhiser@users.noreply.github.com>
@github-actions github-actions bot locked and limited conversation to collaborators Oct 26, 2023
Labels
Scope: Backend Automatically applied to PRs that change backend components Trigger: Revert add to a merged PR to revert it (skips CI)

4 participants