Post mortem report for incident on Aug 26, 2024 #473

Open

rubhanazeem
Member

No description provided.

@rubhanazeem rubhanazeem changed the title from "Postmortem report for incident on Aug 26, 2024" to "Post mortem report for incident on Aug 26, 2024" on Aug 27, 2024

netlify bot commented Aug 27, 2024

Deploy Preview for openbuildservice ready!

| Name | Link |
|--- |--- |
| 🔨 Latest commit | 2cd51cb |
| 🔍 Latest deploy log | https://app.netlify.com/sites/openbuildservice/deploys/66cf440ce5783a00084b4512 |
| 😎 Deploy Preview | https://deploy-preview-473--openbuildservice.netlify.app |

Describe what went well, what went wrong and where we got lucky during the resolution of this problem.
-->

- Ensure all connected tables have the same column sizes.
Member

How?

Member Author

There is more than one way to do this, and it requires knowledge of the system architecture.
One thing I can point to is a thorough review.

Member

@hennevogel hennevogel Aug 28, 2024

Then this isn't something we learned. We know we should do code-review :)

Member

Maybe it rather points to a bad architectural decision that we should revise?

Member Author

I've improved this reasoning 🙂

@hennevogel
Member

@darix brought up some worthwhile thoughts in openSUSE/open-build-service#16751

@rubhanazeem rubhanazeem force-pushed the post_mortem_08262024 branch 2 times, most recently from 1975ebb to 0c3cf5b, on August 28, 2024 15:01
@rubhanazeem
Member Author

> @darix brought up some worthwhile thoughts in openSUSE/open-build-service#16751

I've checked the conversation. Sounds like we are going with the migration to change column size

@hennevogel
Member

hennevogel commented Aug 28, 2024

> Sounds like we are going with the migration to change column size

For sure but what about the other things mentioned? Like more sophisticated exception handling...

@rubhanazeem rubhanazeem force-pushed the post_mortem_08262024 branch from 0c3cf5b to 2cd51cb on August 28, 2024 15:36
@rubhanazeem
Member Author

rubhanazeem commented Aug 28, 2024

> > Sounds like we are going with the migration to change column size
>
> For sure but what about the other things mentioned? Like more sophisticated exception handling...

Done. I've mentioned this in the action items.

Comment on lines +8 to +13
<!--
Classify the severity of this problem. We usually say:
- service degradation: if only a few customers were impacted
- severe service degradation: if nearly every customer was impacted
- downtime: if every visit to the OBS ended up on some error page
-->
Member

Suggested change
<!--
Classify the severity of this problem. We usually say:
- service degradation: if only a few customers were impacted
- severe service degradation: if nearly every customer was impacted
- downtime: if every visit to the OBS ended up on some error page
-->

Member

@hennevogel hennevogel Aug 29, 2024

Maybe this will revive the excerpt on the index page...

Comment on lines +17 to +20
<!--
What happened, for how long and who was impacted by it?
For customers to be able to identify if their problem is related to this post mortem or not.
-->
Member

Suggested change
<!--
What happened, for how long and who was impacted by it?
For customers to be able to identify if their problem is related to this post mortem or not.
-->

Comment on lines +37 to +40
<!--
How did you resolve or work around this problem?
For customers and community to understand what happened technically.
-->
Member

Suggested change
<!--
How did you resolve or work around this problem?
For customers and community to understand what happened technically.
-->

Comment on lines +46 to +49
<!--
Are there any actions we are going to do that are not done yet?
For customers and community to be able to follow up on this.
-->
Member

Suggested change
<!--
Are there any actions we are going to do that are not done yet?
For customers and community to be able to follow up on this.
-->

Comment on lines +59 to +61
<!--
Describe what went well, what went wrong and where we got lucky during the resolution of this problem.
-->
Member

Suggested change
<!--
Describe what went well, what went wrong and where we got lucky during the resolution of this problem.
-->


On Friday, August 23rd at 11:59 UTC, Email and Web notifications stopped working until August 26th at 10:16 UTC.

In the notification creation process, the `payload` field in the `events` table is copied to the `event_payload` field in the `notifications` table. These fields are required to have the same column size, but we discovered they don't have the same size.
Member

Suggested change
In the notification creation process, the `payload` field in the `events` table is copied to the `event_payload` field in the `notifications` table. These fields are required to have the same column size, but we discovered they don't have the same size.
The cause was a mismatch in database column size limits that we had not noticed.

For customers to be able to identify if their problem is related to this post mortem or not.
-->

On Friday, August 23rd at 11:59 UTC, Email and Web notifications stopped working until August 26th at 10:16 UTC.
Member

Suggested change
On Friday, August 23rd at 11:59 UTC, Email and Web notifications stopped working until August 26th at 10:16 UTC.
No user of build.opensuse.org received notifications via the web interface or email from Friday, August 23rd at 11:59 UTC until Monday, August 26th at 10:16 UTC. No notification or email was lost; all of them were delivered with a delay, some of them several days late.

@@ -0,0 +1,75 @@
---
layout: post
title: "Degraded performance of OBS Web and Email Notifications system"
Member

Suggested change
title: "Degraded performance of OBS Web and Email Notifications system"
title: "Service degradation of OBS Web and Email Notifications system"


In the notification creation process, the `payload` field in the `events` table is copied to the `event_payload` field in the `notifications` table. These fields are required to have the same column size, but we discovered they don't have the same size.

As a result, when the system attempted to create a notification for an event with a long description in its payload, an exception was raised, causing the notification creation process to fail.
Member

Suggested change
As a result, when the system attempted to create a notification for an event with a long description in its payload, an exception was raised, causing the notification creation process to fail.
As a result, when the system attempted to create a notification, an exception was raised, causing the notification creation process to fail entirely.


## Detection

We received the first Grafana alert on Friday, August 23rd at 15:00 UTC, indicating that no notifications had been sent in the last three hours. Upon checking Errbit, we found a single exception with over 100,000 occurrences
Member

Suggested change
We received the first Grafana alert on Friday, August 23rd at 15:00 UTC, indicating that no notifications had been sent in the last three hours. Upon checking Errbit, we found a single exception with over 100,000 occurrences
We received the first alert on Friday, August 23rd at 15:00 UTC, indicating that no notifications had been sent in the last three hours.


## Root Cause

The root cause of the problem was a mismatch in column sizes between the `payload` field in the `events` table and the `event_payload` field in the `notifications` table. Because of this mismatch, when the system attempted to create a notification from an event with a very long description in its payload, an exception was raised, that blocked the creation of further email and web notifications.
Member

Suggested change
The root cause of the problem was a mismatch in column sizes between the `payload` field in the `events` table and the `event_payload` field in the `notifications` table. Because of this mismatch, when the system attempted to create a notification from an event with a very long description in its payload, an exception was raised, that blocked the creation of further email and web notifications.
In February we changed the size limit of the `payload` field in the `events` table from TEXT to MEDIUMTEXT ([PR#15649](https://github.com/openSUSE/open-build-service/pull/15649)). During the notification process we copy the content of this column to the `event_payload` field in the `notifications` table, which is of type TEXT. This mismatch caused an exception to be raised when the system attempted to create a notification from an event with a very long `payload` column.
Additionally, the queue that handles notification creation works sequentially, with very many retries and a very long hold-off time. That blocked the creation of *all* email and web notifications.
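
To make the size mismatch concrete, here is a minimal sketch in Python. It is not code from OBS. The byte limits are the documented maximums for MySQL/MariaDB text types; the `fits` helper and the 70,000-character payload are assumptions made up for illustration.

```python
# Maximum storage sizes, in bytes, of MySQL/MariaDB text column types.
TEXT_LIMITS = {
    "TINYTEXT": 255,
    "TEXT": 65_535,
    "MEDIUMTEXT": 16_777_215,
    "LONGTEXT": 4_294_967_295,
}


def fits(payload: str, column_type: str) -> bool:
    """Return True if the UTF-8 encoded payload fits into the column type."""
    return len(payload.encode("utf-8")) <= TEXT_LIMITS[column_type]


# Hypothetical event payload that is larger than a TEXT column allows.
payload = "x" * 70_000

print(fits(payload, "MEDIUMTEXT"))  # source column: events.payload
print(fits(payload, "TEXT"))        # destination column: notifications.event_payload
```

A payload like this is accepted by the source column but not by the destination column, which is the situation the copy step ran into.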

For customers and community to understand what happened technically.
-->

We were able to identify the problematic event, take a backup of that event, and then delete it. After this, the system began to recover gradually.
Member

Suggested change
We were able to identify the problematic event, take a backup of that event, and then delete it. After this, the system began to recover gradually.
We identified the problematic event and took it out of the queue. After this, the system gradually began to recover on its own.
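
As a rough illustration of why removing one event unblocked everything, and of what more sophisticated exception handling could look like, here is a small Python simulation. It does not reflect how the OBS job queue is implemented. `PayloadTooLong`, `create_notification`, `drain` and the retry count are all hypothetical names and values.

```python
from collections import deque


class PayloadTooLong(Exception):
    """Stands in for the database error raised on an oversized payload."""


def create_notification(event: dict, limit: int = 65_535) -> str:
    # Hypothetical handler: fails when the payload exceeds the column limit.
    if len(event["payload"].encode("utf-8")) > limit:
        raise PayloadTooLong(f"event {event['id']} payload too long")
    return f"notification for event {event['id']}"


def drain(queue: deque, max_attempts: int = 3):
    """Process jobs in order; park a job that keeps failing instead of blocking."""
    delivered, parked = [], []
    while queue:
        event = queue.popleft()
        for attempt in range(1, max_attempts + 1):
            try:
                delivered.append(create_notification(event))
                break
            except PayloadTooLong:
                if attempt == max_attempts:
                    parked.append(event["id"])  # keep for later inspection
    return delivered, parked


events = deque([
    {"id": 1, "payload": "short"},
    {"id": 2, "payload": "x" * 70_000},  # the problematic event
    {"id": 3, "payload": "short"},
])
delivered, parked = drain(events)
print(delivered)
print(parked)
```

In this sketch the failing event is set aside after a bounded number of attempts, so the events queued behind it are still processed.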

| Action Item | Owner |
|--- |--- |
| [Data migration to update column size](https://github.com/openSUSE/open-build-service/pull/16751) | Developer Team |
| [Improve exception handling](ihttps://github.com/openSUSE/open-build-service/pull/16751#issuecomment-2309776466) | Developer Team |
Member

A comment in a PR is not a way for us to track work...

Describe what went well, what went wrong and where we got lucky during the resolution of this problem.
-->

- Ensure that all connected tables have consistent column sizes by thoroughly reviewing the system architecture. We recognize that there is close coupling of components, and this design can be improved.
Member

And where are we going to do that? And when?
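
One possible shape for such a check, sketched in Python under stated assumptions: the `schema` snapshot and the `copies` list are hypothetical and written by hand here, whereas a real check would read the column types from the database or schema file. Only the capacity ordering of the MySQL/MariaDB text types is taken as given.

```python
# Capacity order of MySQL/MariaDB text types, smallest first.
CAPACITY = ["TINYTEXT", "TEXT", "MEDIUMTEXT", "LONGTEXT"]

# Hypothetical schema snapshot: (table, column) -> declared type.
schema = {
    ("events", "payload"): "MEDIUMTEXT",
    ("notifications", "event_payload"): "TEXT",
}

# Pairs where the first column is copied into the second.
copies = [(("events", "payload"), ("notifications", "event_payload"))]


def too_small(schema, copies):
    """Return copy pairs whose destination holds less than the source."""
    return [
        (src, dst)
        for src, dst in copies
        if CAPACITY.index(schema[dst]) < CAPACITY.index(schema[src])
    ]


print(too_small(schema, copies))
```

Run as part of the test suite, a check along these lines would flag the pair as soon as one side of a copy is widened without the other.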

-->

- Ensure that all connected tables have consistent column sizes by thoroughly reviewing the system architecture. We recognize that there is close coupling of components, and this design can be improved.
- We had failed notifications in delayed jobs, and that saved us from data loss.
Member

I'm not sure I understand what you want to say here...

| Action Item | Owner |
|--- |--- |
| [Data migration to update column size](https://github.com/openSUSE/open-build-service/pull/16751) | Developer Team |
| [Improve exception handling](ihttps://github.com/openSUSE/open-build-service/pull/16751#issuecomment-2309776466) | Developer Team |
Contributor

Suggested change
| [Improve exception handling](ihttps://github.com/openSUSE/open-build-service/pull/16751#issuecomment-2309776466) | Developer Team |
| [Improve exception handling](https://github.com/openSUSE/open-build-service/pull/16751#issuecomment-2309776466) | Developer Team |
