-
-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ref(crons): Correct logic for mark_ok #57730
ref(crons): Correct logic for mark_ok #57730
Conversation
@with_feature("organizations:issue-platform") | ||
@patch("sentry.issues.producer.produce_occurrence_to_kafka") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made this more unit-test like by removing the dependency on mark_failed
from this test.
src/sentry/monitors/logic/mark_ok.py
Outdated
"-date_added" | ||
)[:recovery_threshold] | ||
|
||
# Incident recovers when ew have successive threshold check-ins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we*
src/sentry/monitors/logic/mark_ok.py
Outdated
# is not recovering | ||
allow_status_update = True | ||
|
||
# Resolve the incident if we have met the recovery recovery_threshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this just be if we have met the recovery_threshold
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah
src/sentry/monitors/logic/mark_ok.py
Outdated
) | ||
|
||
# Resolve the incident | ||
if incident_recovering and monitor_env.status != MonitorStatus.OK: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we simplify the logic/execution overall here? Right now it seems like the order of conditions are:
- Check if we have a recovery threshold (if not just update the status)
- If we have, fetch N recent check-ins and check that they are all ok
- If they are all ok check to see if our monitor is in an incident (!= MonitorStatus.OK)
- If true, resolve the incident, and then allow a status update on the monitor
But would it be better to instead
- Check
monitor_env.status != MonitorStatus.OK
first, if it ISOK
then we can do nothing - If not
OK
then chcek for a recovery threshold, proceed with the previous steps 2 and then 4
Only mentioning this because hopefully the majority of the time, user's monitor statuses should be OK which means we shouldn't check the N most recent check-ins every time they send an OK check-in (assuming they have a recovery threshold)
481f494
to
63fc2e1
Compare
63fc2e1
to
57ebe7e
Compare
Previously when incidents were enabled (a recovery threshold was set) a monitor would not have it's `next_checkin` or `next_checkin_latest` updated until it recovered.
57ebe7e
to
5893654
Compare
) | ||
return | ||
|
||
recovery_threshold = monitor_env.monitor.config.get("recovery_threshold", 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be 1
as the default? or idk maybe a comment. the min value is 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you have the default here as 0
since if it's not set we don't want to enable incidents yet
return | ||
|
||
recovery_threshold = monitor_env.monitor.config.get("recovery_threshold", 0) | ||
using_incidents = bool(recovery_threshold) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe just change this to recovery_threshold > 1
which is functionally the same thing for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I basically just duplicated what the logic was before if recovery_threshold:
@rjo100 mind just approving as is and we can clean up after? |
Suspect IssuesThis pull request was deployed and Sentry observed the following issues:
Did you find this useful? React with a 👍 or 👎 |
PR reverted: 8aa99c7 |
This reverts commit dcd6da4. Co-authored-by: evanpurkhiser <1421724+evanpurkhiser@users.noreply.github.com>
Previously when incidents were enabled (a recovery threshold was set) a
monitor would not have it's
next_checkin
ornext_checkin_latest
updated until it recovered.