[sophora-server] add alerting rules
philmtd committed Oct 12, 2023
1 parent a022bdb commit 1f2ce47
Showing 4 changed files with 185 additions and 2 deletions.
2 changes: 1 addition & 1 deletion charts/sophora-server/Chart.yaml
@@ -15,7 +15,7 @@ type: application
 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
 # Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: 1.7.0
+version: 1.8.0
 
 # This is the version number of the application being deployed. This version number should be
 # incremented each time you make changes to the application. Versions are not expected to
95 changes: 95 additions & 0 deletions charts/sophora-server/alerting-runbook.md
@@ -0,0 +1,95 @@
# Alerting Runbook

This document is a reference for the alerts that this Helm chart can fire.

## Sophora Server: General

### SophoraServerOffline

**Severity:** medium

**Summary:** The Sophora server has been offline for more than 10 minutes.

**Remediation steps:**

* Check if the server is down for maintenance or incident remediation
* Check whether the server is in a crash loop
* Check the server logs for error messages
* Try to restart the server

### SophoraServerAPISlow

**Severity:** medium

**Summary:** The server's API response time at the 95th percentile has exceeded 150 ms for more than 5 minutes.

**Remediation steps:**

* Check if the server is experiencing a higher API call volume than usual (e.g. imports)
* Check the server's logs for errors that could be related to a slower API response time
* Check if the server has enough RAM and CPU available
* If the server is a staging server, consider scaling up the StatefulSet to cover higher loads
* Check if a newly added or modified server script is inefficient and adds overhead to many API calls
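
If resource pressure turns out to be the cause, one option is to raise the chart's `resources` values. The following is a sketch only: the keys mirror the chart's default `values.yaml`, while the numbers are illustrative assumptions that need to be sized for the actual installation.

```yaml
# Illustrative override; the figures are assumptions, not recommendations.
resources:
  requests:
    cpu: 4          # chart default: 2
    memory: 24Gi    # chart default: 16Gi
  limits:
    memory: 24Gi    # kept equal to the request, as in the chart default
```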

## Sophora Server: State-related alerts

### SophoraServerStateUnknown

**Severity:** medium

**Summary:** Sophora server's state is unknown

**Remediation steps:**

* Check the logs of the server

### SophoraServerStateSynchronizationDelayed

**Severity:** medium

**Summary:** Sophora server's synchronization is delayed

**Remediation steps:**

* Check the logs of the server
* See if the issue persists after waiting a little longer
* Try to fix the issue by restarting the server
* Check the logs of the primary server for any related errors

### SophoraServerStateQueueTooLong

**Severity:** medium

**Summary:** Sophora server's queue is too long and the server is not up to date

**Remediation steps:**

* Check the logs of the server
* See if the issue persists after waiting a little longer
* Try to fix the issue by restarting the server
* Check the logs of the primary server for any related errors

### SophoraServerStateUnavailable

**Severity:** high

**Summary:** The Sophora server is unavailable and the cause should be investigated.

**Remediation steps:**

* Check the logs of the server
* Check the logs of the primary server for any related errors
* Restart the server

### SophoraServerStateConnectionLost

**Severity:** high

**Summary:** The Sophora server is disconnected from its primary server and cannot receive replication events.

**Remediation steps:**

* Check if the primary server is running
* Check the logs of the server
* Check the logs of the primary server
* Check whether there are any network issues
82 changes: 82 additions & 0 deletions charts/sophora-server/templates/prometheusrule.yaml
@@ -0,0 +1,82 @@
{{- if .Values.prometheusRule.enabled }}
{{- $defaultRulesEnabled := .Values.prometheusRule.defaultRulesEnabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ include "sophora-server.fullname" . }}
  labels: {{- include "sophora-server.labels" . | nindent 4 }}
spec:
  groups:
    - name: {{ template "sophora-server.fullname" $ }}
      rules:
        {{- if $defaultRulesEnabled }}
        - alert: SophoraServerOffline
          for: 10m
          expr: 'up{container="sophora-server", job="{{ include "sophora-server.fullname" . }}"} != 1'
          labels:
            severity: medium
          annotations:
            summary: Sophora Server offline.
            description: The server "{{`{{ $labels.service }}`}}" has been offline for more than 10 minutes.
            runbook_url: 'https://github.com/subshell/helm-charts/blob/main/charts/sophora-server/alerting-runbook.md'
        - alert: SophoraServerAPISlow
          for: 5m
          expr: 'histogram_quantile(0.95, sum(rate(sophora_server_contentmanager_call_duration_seconds_bucket{job="{{ include "sophora-server.fullname" . }}"}[1m])) by (pod, le)) > 0.15'
          labels:
            severity: medium
          annotations:
            summary: Sophora Server API is slow
            description: The API of the server "{{`{{ $labels.pod }}`}}" exhibits a response time exceeding 150ms for more than 5 minutes at the 95th percentile.
            runbook_url: 'https://github.com/subshell/helm-charts/blob/main/charts/sophora-server/alerting-runbook.md'
        # -- start of rules for unready server states
        - alert: SophoraServerStateUnknown
          for: 5m
          expr: 'sophora_server_state{job="{{ include "sophora-server.fullname" . }}"} == -1'
          labels:
            severity: medium
          annotations:
            summary: Sophora server's state is unknown
            description: The Sophora server's state is unknown.
            runbook_url: 'https://github.com/subshell/helm-charts/blob/main/charts/sophora-server/alerting-runbook.md'
        - alert: SophoraServerStateSynchronizationDelayed
          for: 10m
          expr: 'sophora_server_state{job="{{ include "sophora-server.fullname" . }}"} == 3'
          labels:
            severity: medium
          annotations:
            summary: Sophora server's synchronization is delayed
            description: The synchronization to the server "{{`{{ $labels.pod }}`}}" is delayed.
            runbook_url: 'https://github.com/subshell/helm-charts/blob/main/charts/sophora-server/alerting-runbook.md'
        - alert: SophoraServerStateQueueTooLong
          for: 10m
          expr: 'sophora_server_state{job="{{ include "sophora-server.fullname" . }}"} == 4'
          labels:
            severity: medium
          annotations:
            summary: Sophora server's queue is too long
            description: The server "{{`{{ $labels.pod }}`}}" is not up to date because its queue is too long.
            runbook_url: 'https://github.com/subshell/helm-charts/blob/main/charts/sophora-server/alerting-runbook.md'
        - alert: SophoraServerStateUnavailable
          for: 10m
          expr: 'sophora_server_state{job="{{ include "sophora-server.fullname" . }}"} == 5'
          labels:
            severity: high
          annotations:
            summary: Sophora server unavailable
            description: The server "{{`{{ $labels.pod }}`}}" is unavailable and the cause should be investigated
            runbook_url: 'https://github.com/subshell/helm-charts/blob/main/charts/sophora-server/alerting-runbook.md'
        - alert: SophoraServerStateConnectionLost
          for: 10m
          expr: 'sophora_server_state{job="{{ include "sophora-server.fullname" . }}"} == 6'
          labels:
            severity: high
          annotations:
            summary: Sophora server lost connection to primary
            description: The server "{{`{{ $labels.pod }}`}}" is disconnected from its primary server
            runbook_url: 'https://github.com/subshell/helm-charts/blob/main/charts/sophora-server/alerting-runbook.md'
        # -- end of state alert rules
        {{- end }}
        {{- with .Values.prometheusRule.rules }}
        {{ tpl (toYaml .) $ | nindent 8 }}
        {{- end }}
{{- end }}
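
The backtick-wrapped label references in this template are Helm's way of emitting a literal `{{ $labels.pod }}`, so that Prometheus, not Helm, resolves the label when the alert fires. As a rough sketch, assuming the `sophora-server.fullname` helper resolves to `sophora-server` (an assumed name, not taken from the commit), the body of the API latency rule would render roughly as follows:

```yaml
# Rendered fragment of the API latency rule (the rule's identifying key is omitted).
# Assumption: "sophora-server.fullname" resolves to "sophora-server".
for: 5m
expr: 'histogram_quantile(0.95, sum(rate(sophora_server_contentmanager_call_duration_seconds_bucket{job="sophora-server"}[1m])) by (pod, le)) > 0.15'
labels:
  severity: medium
annotations:
  summary: Sophora Server API is slow
  # Helm has unwrapped the backtick escape; Prometheus fills in the pod label at alert time.
  description: The API of the server "{{ $labels.pod }}" exhibits a response time exceeding 150ms for more than 5 minutes at the 95th percentile.
  runbook_url: 'https://github.com/subshell/helm-charts/blob/main/charts/sophora-server/alerting-runbook.md'
```

Running `helm template` against the chart with `prometheusRule.enabled=true` is one way to check the real output for a given release.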
8 changes: 7 additions & 1 deletion charts/sophora-server/values.yaml
@@ -322,13 +322,19 @@ serviceMonitor:
   enabled: false
   interval: 10s
 
+prometheusRule:
+  enabled: false
+  defaultRulesEnabled: true
+  rules: []
+
 resources:
   requests:
     cpu: 2
     memory: 16Gi
   limits:
     memory: 16Gi
 
+# This PDB should only be used for staging server setups
 podDisruptionBudget:
   ## @param enabled Whether the pod disruption budget resource should be deployed
   ##
@@ -338,4 +344,4 @@ podDisruptionBudget:
   minAvailable: 1
   ## @param podDisruptionBudget.maxUnavailable Max number of pods that can be unavailable after the eviction
   ##
-  maxUnavailable: ""
+  maxUnavailable: ""
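
The new `prometheusRule` block is disabled by default, so the alerts only appear once a values override enables it. The sketch below also adds one custom rule through `prometheusRule.rules`; the alert name `SophoraServerOfflineLong`, its 30-minute window, and the file name are hypothetical. Because the template passes custom rules through `tpl`, Helm expressions such as the fullname include should be evaluated, and Prometheus label references presumably need the same backtick escaping as the built-in rules.

```yaml
# my-values.yaml (hypothetical file name)
prometheusRule:
  enabled: true              # deploy the PrometheusRule resource
  defaultRulesEnabled: true  # keep the built-in Sophora alerts
  rules:
    # Hypothetical extra alert: escalate when the server stays down for 30 minutes.
    - alert: SophoraServerOfflineLong
      for: 30m
      expr: 'up{container="sophora-server", job="{{ include "sophora-server.fullname" . }}"} != 1'
      labels:
        severity: high
      annotations:
        summary: Sophora Server offline for an extended period.
        # Backticks keep Helm from expanding the Prometheus template.
        description: 'The server "{{`{{ $labels.service }}`}}" has been offline for more than 30 minutes.'
```

Setting `defaultRulesEnabled: false` while keeping `enabled: true` would, according to the template, render only the custom entries.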