Post mortem report for incident on Aug 26, 2024 #473

Open
wants to merge 1 commit into
base: master
75 changes: 75 additions & 0 deletions _posts/deployments/2024-08-26-post-mortem.md
@@ -0,0 +1,75 @@
---
layout: post
title: "Degraded performance of OBS Web and Email Notifications system"
Member

Suggested change
title: "Degraded performance of OBS Web and Email Notifications system"
title: "Service degradation of OBS Web and Email Notifications system"

category: deployments
author: Rubhan Azeem <rubhan.azeem@suse.com>
---

<!--
Classify the severity of this problem. We usually say:
- service degradation: if only a few customers where impacted
- severe service degradation: if nearly every customer was impacted
- downtime: if every visit to the OBS ended up on some error page
-->
Comment on lines +8 to +13
Member

Suggested change
<!--
Classify the severity of this problem. We usually say:
- service degradation: if only a few customers where impacted
- severe service degradation: if nearly every customer was impacted
- downtime: if every visit to the OBS ended up on some error page
-->

Member
@hennevogel, Aug 29, 2024

Maybe this will revive the excerpt on the index page...


Our Email and Web notifications system experienced a service degradation.

<!--
What happened, for how long and who was impacted by it?
For customers to be able to identify if their problem related to this post mortem or not.
-->
Comment on lines +17 to +20
Member

Suggested change
<!--
What happened, for how long and who was impacted by it?
For customers to be able to identify if their problem related to this post mortem or not.
-->


On Friday, August 23rd at 11:59 UTC, Email and Web notifications stopped working until August 26th at 10:16 UTC.
Member

Suggested change
On Friday, August 23rd at 11:59 UTC, Email and Web notifications stopped working until August 26th at 10:16 UTC.
No user of build.opensuse.org received notifications via the web interface or email from Friday, August 23rd at 11:59 UTC until Monday, August 26th at 10:16 UTC. No notification or email was lost; all of them were delivered with a delay, some with a delay of several days.


In the notification creation process, the `payload` field in the `events` table is copied to the `event_payload` field in the `notifications` table. These fields are required to have the same column size, but we discovered they don't have the same size.
Member

Suggested change
In the notification creation process, the `payload` field in the `events` table is copied to the `event_payload` field in the `notifications` table. These fields are required to have the same column size, but we discovered they don't have the same size.
The cause was a mismatch in database column size limits that we had not noticed.


As a result, when the system attempted to create a notification for an event with a long description in its payload, an exception was raised, causing the notification creation process to fail.
Member

Suggested change
As a result, when the system attempted to create a notification for an event with a long description in its payload, an exception was raised, causing the notification creation process to fail.
As a result, when the system attempted to create a notification, an exception was raised, causing the notification creation process to fail entirely.


## Detection

We received the first Grafana alert on Friday, August 23rd at 15:00 UTC, indicating that no notifications had been sent in the last three hours. Upon checking Errbit, we found a single exception with over 100,000 occurrences
Member

Suggested change
We received the first Grafana alert on Friday, August 23rd at 15:00 UTC, indicating that no notifications had been sent in the last three hours. Upon checking Errbit, we found a single exception with over 100,000 occurrences
We received the first alert on Friday, August 23rd at 15:00 UTC, indicating that no notifications had been sent in the last three hours.
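
The alert condition described here ("no notifications sent in the last three hours") can be sketched as a simple staleness check. This is only an illustration: the function name, the three-hour default, and the idea of comparing against the timestamp of the last sent notification are assumptions, not the actual Grafana rule.

```python
from datetime import datetime, timedelta


def notifications_stalled(last_sent: datetime, now: datetime,
                          threshold: timedelta = timedelta(hours=3)) -> bool:
    """Return True when nothing has been sent for longer than the threshold."""
    return now - last_sent > threshold


# Hypothetical timestamps taken from the timeline in this post mortem (UTC).
last_sent = datetime(2024, 8, 23, 11, 59)
fires_at_alert_time = notifications_stalled(last_sent, datetime(2024, 8, 23, 15, 0))
fires_one_hour_earlier = notifications_stalled(last_sent, datetime(2024, 8, 23, 14, 0))
```

With a three-hour threshold, a stall that starts at 11:59 cannot be flagged before 14:59 at the earliest, which is consistent with the first alert arriving at 15:00.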


## Root Cause

The root cause of the problem was a mismatch in column sizes between the `payload` field in the `events` table and the `event_payload` field in the `notifications` table. Because of this mismatch, when the system attempted to create a notification from an event with a very long description in its payload, an exception was raised, that blocked the creation of further email and web notifications.
Member

Suggested change
The root cause of the problem was a mismatch in column sizes between the `payload` field in the `events` table and the `event_payload` field in the `notifications` table. Because of this mismatch, when the system attempted to create a notification from an event with a very long description in its payload, an exception was raised, that blocked the creation of further email and web notifications.
In February we changed the type of the `payload` field in the `events` table from TEXT to MEDIUMTEXT, raising its size limit ([PR#15649](https://github.com/openSUSE/open-build-service/pull/15649)). During the notification process we copy the content of this column to the `event_payload` field in the `notifications` table, which is still TEXT. This mismatch caused an exception to be raised when the system attempted to create a notification from an event with a very long `payload` column.
Additionally, the queue that handles notification creation works sequentially, with very many retries and a very long hold-off time. That blocked the creation of *all* email and web notifications.
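
To make the size mismatch concrete, here is a minimal sketch in Python rather than in the application's own code. The byte limits are those of the MySQL/MariaDB TEXT and MEDIUMTEXT types; the `store` and `create_notification` helpers and the `DataTooLong` exception are hypothetical stand-ins, and the sketch assumes the database rejects oversized values instead of truncating them.

```python
# Byte limits of the MySQL/MariaDB column types named above.
TEXT_MAX_BYTES = 2**16 - 1        # 65,535
MEDIUMTEXT_MAX_BYTES = 2**24 - 1  # 16,777,215


class DataTooLong(Exception):
    """Hypothetical stand-in for the database's 'data too long' error."""


def store(value: str, max_bytes: int) -> str:
    """Accept the value only if its UTF-8 encoding fits into the column."""
    size = len(value.encode("utf-8"))
    if size > max_bytes:
        raise DataTooLong(f"{size} bytes exceeds column limit of {max_bytes}")
    return value


def create_notification(payload: str) -> str:
    """Copy an event payload (MEDIUMTEXT) into the notification column (TEXT)."""
    event_payload = store(payload, MEDIUMTEXT_MAX_BYTES)  # fits in events.payload
    return store(event_payload, TEXT_MAX_BYTES)           # may not fit in notifications.event_payload
```

Any payload between 65,536 bytes and 16,777,215 bytes is valid in the wider column but fails on the copy, which is the window this incident fell into.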


## Resolution
<!--
How did you resolve or work around this problem?
For customers and community to understand what happened technically.
-->
Comment on lines +37 to +40
Member

Suggested change
<!--
How did you resolve or work around this problem?
For customers and community to understand what happened technically.
-->


We were able to identify the problematic event, take a backup of that event, and then delete it. After this, the system began to recover gradually.
Member

Suggested change
We were able to identify the problematic event, take a backup of that event, and then delete it. After this, the system began to recover gradually.
We identified the problematic event and took it out of the queue. After this, the system gradually began to recover on its own.
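
Why removing a single event was enough can be illustrated with a simplified model of a strictly sequential queue. This is a sketch under stated assumptions, not the real job runner: the `drain` function, the `max_attempts` value, and the job names are hypothetical, and real retry back-off is left out.

```python
from collections import deque


def create_notification(job):
    """Hypothetical handler that fails only for the one oversized event."""
    if job == "oversized":
        raise ValueError("payload too long for column")


def drain(queue, handler, max_attempts=3):
    """Process jobs strictly in order; a failing head job is retried in place."""
    done = []
    attempts = 0
    while queue and attempts < max_attempts:
        job = queue[0]
        try:
            handler(job)
        except ValueError:
            attempts += 1  # the head job stays put, everything behind it waits
            continue
        queue.popleft()
        done.append(job)
        attempts = 0
    return done


queue = deque(["a", "oversized", "b", "c"])
before = drain(queue, create_notification)  # only "a" gets through
queue.remove("oversized")                   # take the bad event out of the queue
after = drain(queue, create_notification)   # the backlog now drains
```

In this model the jobs behind the failing one are never lost, only delayed, which matches the observation that all notifications were eventually delivered.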


### Action Items

<!--
Are there any actions we are going to do that are not done yet?
For customers and community to be able to follow up on this.
-->
Comment on lines +46 to +49
Member

Suggested change
<!--
Are there any actions we are going to do that are not done yet?
For customers and community to be able to follow up on this.
-->


Additionally we have created some follow up action items:

| Action Item | Owner |
|--- |--- |
| [Data migration to update column size](https://github.com/openSUSE/open-build-service/pull/16751) | Developer Team |
| [Improve exception handling](ihttps://github.com/openSUSE/open-build-service/pull/16751#issuecomment-2309776466) | Developer Team |
Member

A comment in a PR is not a way for us to track work...

Contributor

Suggested change
| [Improve exception handling](ihttps://github.com/openSUSE/open-build-service/pull/16751#issuecomment-2309776466) | Developer Team |
| [Improve exception handling](https://github.com/openSUSE/open-build-service/pull/16751#issuecomment-2309776466) | Developer Team |


## Lessons Learned
<!--
Describe what went well, what went wrong and where we go lucky during the resolution of this problem.
-->
Comment on lines +59 to +61
Member

Suggested change
<!--
Describe what went well, what went wrong and where we go lucky during the resolution of this problem.
-->


- Ensure that all connected tables have consistent column sizes by thoroughly reviewing the system architecture. We recognize that there is close coupling of components, and this design can be improved.
Member

And where are we going to do that? And when?

- We had failed notifications in delayed jobs, and that saved us from data loss.
Member

I'm not sure I understand what you want to say here...

- Errbit and Grafana alerts helped us realize the problem.

## Timeline (All Times in UTC)

- *August 23rd, 11:59* The system stopped creating web and email notifications
- *August 23rd, 15:00* We received the first alert from Grafana
- *August 26th, 06:25* We checked the exceptions on Errbit
- *August 26th, 09:31* We declared the incident
- *August 26th, 10:16* We identified and deleted the event record causing the problem. Failed notifications began to recover slowly
- *August 26th, 10:20* We declared the incident resolved