Post mortem report for incident on Aug 26, 2024 #473
@@ -0,0 +1,75 @@
---
layout: post
title: "Degraded performance of OBS Web and Email Notifications system"
category: deployments
author: Rubhan Azeem <rubhan.azeem@suse.com>
---

<!--
Classify the severity of this problem. We usually say:
- service degradation: if only a few customers were impacted
- severe service degradation: if nearly every customer was impacted
- downtime: if every visit to the OBS ended up on some error page
-->

Review comment on lines +8 to +13 (with a suggested change): Maybe this will revive the excerpt on the index page...

Our Email and Web notifications system experienced a service degradation.

<!--
What happened, for how long and who was impacted by it?
For customers to be able to identify if their problem is related to this post mortem or not.
-->

Email and Web notifications stopped working on Friday, August 23rd at 11:59 UTC and remained down until Monday, August 26th at 10:16 UTC.

In the notification creation process, the `payload` field in the `events` table is copied to the `event_payload` field in the `notifications` table. Both columns need to have the same size, but we discovered that they did not.

As a result, when the system attempted to create a notification for an event with a long description in its payload, an exception was raised, causing the notification creation process to fail.

## Detection

We received the first Grafana alert on Friday, August 23rd at 15:00 UTC, indicating that no notifications had been sent in the last three hours. Upon checking Errbit, we found a single exception with over 100,000 occurrences.

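The alert condition described above (no notifications sent within three hours) can be sketched as a simple staleness check. This is only an illustration in Python, not the actual Grafana rule: the helper name `notifications_stalled` is hypothetical, and treating 11:59 UTC as the time of the last sent notification is an assumption based on the timeline.

```python
from datetime import datetime, timedelta, timezone


# Hypothetical helper: the real alert is a Grafana rule, not Python code.
def notifications_stalled(last_sent_at, now, max_silence=timedelta(hours=3)):
    """Return True if no notification was sent within `max_silence`."""
    return now - last_sent_at >= max_silence


# Timestamps taken from the timeline (UTC); "last sent" is an assumption.
last_sent = datetime(2024, 8, 23, 11, 59, tzinfo=timezone.utc)
alert_time = datetime(2024, 8, 23, 15, 0, tzinfo=timezone.utc)

print(notifications_stalled(last_sent, alert_time))
```

With these timestamps the silence is three hours and one minute, so the check fires.
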
## Root Cause

The root cause of the problem was a mismatch in column sizes between the `payload` field in the `events` table and the `event_payload` field in the `notifications` table. Because of this mismatch, when the system attempted to create a notification from an event with a very long description in its payload, an exception was raised that blocked the creation of further email and web notifications.

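The failure mode can be illustrated with a minimal Python sketch. It is not the OBS code: the byte limits are assumptions (the MySQL `TEXT` and `MEDIUMTEXT` maximums, chosen purely for illustration, since the post does not state the actual column types), and `copy_payload` and `PayloadTooLong` are hypothetical names.

```python
# Assumed limits for illustration only (MySQL MEDIUMTEXT and TEXT maximums, in bytes).
EVENTS_PAYLOAD_MAX_BYTES = 16_777_215       # source column: events.payload
NOTIFICATIONS_PAYLOAD_MAX_BYTES = 65_535    # target column: notifications.event_payload


class PayloadTooLong(Exception):
    """Raised when a payload does not fit the target column."""


def copy_payload(payload, max_bytes=NOTIFICATIONS_PAYLOAD_MAX_BYTES):
    """Return the payload unchanged if it fits the target column, else raise."""
    size = len(payload.encode("utf-8"))
    if size > max_bytes:
        raise PayloadTooLong(f"payload is {size} bytes, column holds {max_bytes}")
    return payload


short_payload = '{"description": "ok"}'
long_payload = '{"description": "' + "x" * 70_000 + '"}'

copy_payload(short_payload)  # fits both columns
try:
    # Fits the larger source column but not the smaller target column.
    copy_payload(long_payload)
except PayloadTooLong as error:
    print(f"notification creation failed: {error}")
```

The long payload is valid in the source column, so the event itself is stored without complaint; the error only surfaces later, when the copy into the smaller column is attempted.
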
## Resolution
<!--
How did you resolve or work around this problem?
For customers and community to understand what happened technically.
-->

We identified the problematic event, took a backup of it, and then deleted it. After this, the system began to recover gradually.

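The post does not say how the problematic event was located. One possible approach, sketched here in Python under the assumption that events are available as `(id, payload)` pairs and that the target column limit is known, is to scan for payloads that exceed that limit. The function name and the sample data are hypothetical.

```python
def find_oversized_events(events, max_bytes):
    """Return ids of events whose UTF-8 encoded payload exceeds `max_bytes`."""
    return [
        event_id
        for event_id, payload in events
        if len(payload.encode("utf-8")) > max_bytes
    ]


# Hypothetical sample data: (event id, payload) pairs.
events = [(1, "short"), (2, "x" * 70_000), (3, "also short")]

# 65_535 is an assumed limit for the target column.
print(find_oversized_events(events, max_bytes=65_535))  # prints [2]
```
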
### Action Items

<!--
Are there any actions we are going to do that are not done yet?
For customers and community to be able to follow up on this.
-->

Additionally, we have created some follow-up action items:

| Action Item | Owner |
|--- |--- |
| [Data migration to update column size](https://github.com/openSUSE/open-build-service/pull/16751) | Developer Team |
| [Improve exception handling](https://github.com/openSUSE/open-build-service/pull/16751#issuecomment-2309776466) | Developer Team |

Review comment on the "Improve exception handling" item: A comment in a PR is not a way for us to track work...

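The post does not describe what the improved exception handling will look like. One common pattern, shown here as a hedged Python sketch rather than the planned OBS change, is to handle failures per event so that a single bad record cannot block the rest of the queue. All names (`create_notifications`, `create_one`) and the 65,535-byte limit are assumptions for illustration.

```python
import logging

logger = logging.getLogger("notifications")


def create_notifications(events, create_one):
    """Try every event; collect failures instead of stopping at the first one."""
    created, failed_ids = [], []
    for event_id, payload in events:
        try:
            created.append(create_one(payload))
        except Exception:
            # Log with traceback, remember the event, and keep going.
            logger.exception("could not create notification for event %s", event_id)
            failed_ids.append(event_id)
    return created, failed_ids


def create_one(payload, max_bytes=65_535):
    # Hypothetical stand-in for the real insert; 65_535 is an assumed column limit.
    if len(payload.encode("utf-8")) > max_bytes:
        raise ValueError("payload too long for event_payload column")
    return {"event_payload": payload}


events = [(1, "first"), (2, "x" * 70_000), (3, "third")]
created, failed_ids = create_notifications(events, create_one)
print(len(created), failed_ids)  # prints: 2 [2]
```

Returning the failed ids keeps the problematic records visible for follow-up instead of silently dropping them.
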
## Lessons Learned
<!--
Describe what went well, what went wrong and where we got lucky during the resolution of this problem.
-->

- Ensure that all connected tables have consistent column sizes by thoroughly reviewing the system architecture. We recognize that the components are closely coupled and that this design can be improved.
  Review comment: And where are we going to do that? And when?
- The failed notifications remained in delayed jobs, and that saved us from data loss.
  Review comment: I'm not sure I understand what you want to say here...
- Errbit and Grafana alerts helped us notice the problem.

## Timeline (All Times in UTC)

- *August 23rd, 11:59* The system stopped creating web and email notifications
- *August 23rd, 15:00* We received the first alert from Grafana
- *August 26th, 06:25* We checked the exceptions on Errbit
- *August 26th, 09:31* We declared the incident
- *August 26th, 10:16* We identified and deleted the event record causing the problem. Failed notifications began to recover slowly
- *August 26th, 10:20* We declared the incident resolved