openSUSE · rubhanazeem · Aug 27, 2024 · hennevogel · Aug 29, 2024 · hennevogel
diff --git a/_posts/deployments/2024-08-26-post-mortem.md b/_posts/deployments/2024-08-26-post-mortem.md
@@ -0,0 +1,75 @@
+---
+layout: post
+title: "Degraded performance of OBS Web and Email Notifications system"
-title: "Degraded performance of OBS Web and Email Notifications system"
+title: "Service degradation of OBS Web and Email Notifications system"
+category: deployments
+author: Rubhan Azeem <rubhan.azeem@suse.com>
+---
+
+<!--
+  Classify the severity of this problem. We usually say:
+  - service degradation: if only a few customers where impacted
+  - severe service degradation: if nearly every customer was impacted
+  - downtime: if every visit to the OBS ended up on some error page
+-->
-<!--
-  Classify the severity of this problem. We usually say:
-  - service degradation: if only a few customers where impacted
-  - severe service degradation: if nearly every customer was impacted
-  - downtime: if every visit to the OBS ended up on some error page
-->
+
+Our Email and Web notifications system experienced a service degradation.
+
+<!--
+  What happened, for how long and who was impacted by it?
+  For customers to be able to identify if their problem related to this post mortem or not.
+-->
-<!--
-  What happened, for how long and who was impacted by it?
-  For customers to be able to identify if their problem related to this post mortem or not.
-->
+
+On Friday, August 23rd at 11:59 UTC, Email and Web notifications stopped working until August 26th at 10:16 UTC.
-On Friday, August 23rd at 11:59 UTC, Email and Web notifications stopped working until August 26th at 10:16 UTC.
+No user of build.opensuse.org received notifications via the web interface or email from Friday, August 23rd at 11:59 UTC until Monday, August 26th at 10:16 UTC. No notification or email was lost; all of them were delivered late, some with a delay of several days.
+
+In the notification creation process, the `payload` field in the `events` table is copied to the `event_payload` field in the `notifications` table. These fields are required to have the same column size, but we discovered they don't have the same size.
-In the notification creation process, the `payload` field in the `events` table is copied to the `event_payload` field in the `notifications` table. These fields are required to have the same column size, but we discovered they don't have the same size.
+The cause was a mismatch between the size limits of two database table columns, which we had not noticed.
+
+As a result, when the system attempted to create a notification for an event with a long description in its payload, an exception was raised, causing the notification creation process to fail.
-As a result, when the system attempted to create a notification for an event with a long description in its payload, an exception was raised, causing the notification creation process to fail.
+As a result, when the system attempted to create a notification, an exception was raised, causing the notification creation process to fail entirely.
+
+## Detection
+
+We received the first Grafana alert on Friday, August 23rd at 15:00 UTC, indicating that no notifications had been sent in the last three hours. Upon checking Errbit, we found a single exception with over 100,000 occurrences
-We received the first Grafana alert on Friday, August 23rd at 15:00 UTC, indicating that no notifications had been sent in the last three hours. Upon checking Errbit, we found a single exception with over 100,000 occurrences
+We received the first alert on Friday, August 23rd at 15:00 UTC, indicating that no notifications had been sent in the last three hours.
+
+## Root Cause
+
+The root cause of the problem was a mismatch in column sizes between the `payload` field in the `events` table and the `event_payload` field in the `notifications` table. Because of this mismatch, when the system attempted to create a notification from an event with a very long description in its payload, an exception was raised, that blocked the creation of further email and web notifications.
-The root cause of the problem was a mismatch in column sizes between the `payload` field in the `events` table and the `event_payload` field in the `notifications` table. Because of this mismatch, when the system attempted to create a notification from an event with a very long description in its payload, an exception was raised, that blocked the creation of further email and web notifications.
+In February we changed the type of the `payload` field in the `events` table from TEXT to MEDIUMTEXT ([PR#15649](https://github.com/openSUSE/open-build-service/pull/15649)). During notification creation we copy the content of this column to the `event_payload` field in the `notifications` table, which was still of type TEXT. This mismatch caused an exception to be raised when the system attempted to create a notification from an event with a very long `payload`.
+
+Additionally, the queue that handles notification creation works sequentially, with many retries and a very long hold-off time. That blocked the creation of *all* email and web notifications.
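The two effects described in the root cause can be illustrated with a minimal Python sketch. It is not the OBS code: the function names, the `DataTooLong` exception, and the retry count are hypothetical stand-ins, and the hold-off time between retries is left out. The only outside facts it relies on are the MySQL/MariaDB limits of 65,535 bytes for TEXT and 16,777,215 bytes for MEDIUMTEXT.

```python
# Hypothetical model of the failure mode; names are illustrative only.
TEXT_MAX_BYTES = 2**16 - 1        # MySQL/MariaDB TEXT limit (65,535 bytes)
MEDIUMTEXT_MAX_BYTES = 2**24 - 1  # MySQL/MariaDB MEDIUMTEXT limit (16,777,215 bytes)


class DataTooLong(Exception):
    """Stand-in for the database error raised on an oversized value."""


def copy_payload(payload: str, dest_limit: int = TEXT_MAX_BYTES) -> str:
    """Mimic copying events.payload into notifications.event_payload."""
    size = len(payload.encode("utf-8"))
    if size > dest_limit:
        raise DataTooLong(f"{size} bytes exceeds column limit of {dest_limit}")
    return payload


def process_sequentially(events, max_attempts=3):
    """Process events in order; a failing event blocks everything behind it."""
    delivered = []
    for event in events:
        for _attempt in range(max_attempts):
            try:
                delivered.append(copy_payload(event))
                break
            except DataTooLong:
                continue  # retry the same event (hold-off time omitted)
        else:
            # Retries exhausted: this event and all later ones stay queued.
            return delivered, events[len(delivered):]
    return delivered, []


small = "x" * 100
huge = "x" * (TEXT_MAX_BYTES + 1)  # fits MEDIUMTEXT, but not TEXT
delivered, blocked = process_sequentially([small, huge, small])
```

In this model, `delivered` contains only the first small event, while `blocked` holds the oversized event and the unrelated small event queued behind it. That mirrors how one event with a very long payload held up all other notifications.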
+
+## Resolution
+<!--
+  How did you resolve or work around this problem?
+  For customers and community to understand what happened technically.
+-->
-<!--
-  How did you resolve or work around this problem?
-  For customers and community to understand what happened technically.
-->
+
+We were able to identify the problematic event, take a backup of that event, and then delete it. After this, the system began to recover gradually.
-We were able to identify the problematic event, take a backup of that event, and then delete it. After this, the system began to recover gradually.
+We identified the problematic event and took it out of the queue. After this, the system gradually recovered on its own.
+
+### Action Items
+
+<!--
+  Are there any actions we are going to do that are not done yet?
+  For customers and community to be able to follow up on this.
+-->
-<!--
-  Are there any actions we are going to do that are not done yet?
-  For customers and community to be able to follow up on this.
-->
+
+We have also created the following follow-up action items:
+
+| Action Item | Owner |
+|---          |---    |
+| [Data migration to update column size](https://github.com/openSUSE/open-build-service/pull/16751) | Developer Team |
+| [Improve exception handling](ihttps://github.com/openSUSE/open-build-service/pull/16751#issuecomment-2309776466) | Developer Team |
-| [Improve exception handling](ihttps://github.com/openSUSE/open-build-service/pull/16751#issuecomment-2309776466) | Developer Team |
+| [Improve exception handling](https://github.com/openSUSE/open-build-service/pull/16751#issuecomment-2309776466) | Developer Team |
+
+## Lessons Learned
+<!--
+  Describe what went well, what went wrong and where we go lucky during the resolution of this problem.
+-->
-<!--
-  Describe what went well, what went wrong and where we go lucky during the resolution of this problem.
-->
+
+- Ensure that all connected tables have consistent column sizes by thoroughly reviewing the system architecture. We recognize that there is close coupling of components, and this design can be improved.
+- Failed notifications were kept as delayed jobs, which saved us from data loss.
+- Errbit and Grafana alerts helped us notice the problem.
+
+## Timeline (All Times in UTC)
+
+- *August 23rd, 11:59* The system stopped creating web and email notifications
+- *August 23rd, 15:00* We received the first alert from Grafana
+- *August 26th, 06:25* We checked the exceptions on Errbit
+- *August 26th, 09:31* We declared the incident
+- *August 26th, 10:16* We identified and deleted the event record causing the problem. Failed notifications began to recover slowly
+- *August 26th, 10:20* We declared the incident resolved
+