[sophora-cluster-common] add Sophora cluster common chart (#52)
* [sophora-cluster-common] add helm chart with some common resources useful in sophora cloud setups

* add readme, add documentation, add test-values file

* fix template

* fix alerts
philmtd authored Oct 13, 2023
1 parent a16221d commit fd60cea
Showing 12 changed files with 441 additions and 0 deletions.
23 changes: 23 additions & 0 deletions charts/sophora-cluster-common/.helmignore
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
6 changes: 6 additions & 0 deletions charts/sophora-cluster-common/Chart.yaml
@@ -0,0 +1,6 @@
apiVersion: v2
name: sophora-cluster-common
description: A Helm chart containing some common resources useful for Sophora cloud setups
type: application
version: 1.0.0
appVersion: "4"
77 changes: 77 additions & 0 deletions charts/sophora-cluster-common/README.md
@@ -0,0 +1,77 @@
# Sophora Cluster Common

This Helm chart contains resources that are generally useful for a Sophora cloud installation and are not tied to
one specific product.

The resources in this chart are described below. All of them are optional and can be configured as needed.

## Available Resources

### PodDisruptionBudget for Cluster Servers

This will install a PodDisruptionBudget for the Sophora Cluster Servers (primary and replicas) to prevent situations
where all servers are shut down simultaneously. The PDB for staging servers can be installed via the server Helm chart.
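
As a minimal sketch, the PDB could be enabled with values like the following. The keys and defaults are taken from the parameter table in this README; placing them in a custom values file is an assumption.

```yaml
# values.yaml (sketch): enable the PDB for the cluster servers
podDisruptionBudget:
  enabled: true
  name: sophora-cluster-server
  # keep at least two cluster servers available during voluntary disruptions
  minAvailable: 2
  matchLabels:
    sophora.cloud/app: cluster-server
```

With `minAvailable: 2`, a voluntary eviction (for example during a node drain) is only permitted if at least two matching pods remain available afterwards.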

### "LoadBalancer" for Cluster Servers

This is not actually a load balancer but rather a service and ingress definition always pointing to the primary Sophora
server. Typically, this is used to create a deterministic endpoint that can be entered by users to log in to Sophora.
To work out of the box, this requires the *Server Mode Labeler* sidecar container of the servers to be active (it should
be by default).
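
The following values are a sketch of how the service and ingress might be enabled. The keys come from the parameter table and the templates of this chart; the hostname is the one from `test-values.yaml`, while the ingress class, `path`, and `pathType` are assumptions that depend on the cluster.

```yaml
# values.yaml (sketch): stable endpoint that always targets the primary server
clusterServerLb:
  enabled: true
  name: cluster-server-lb
  ingress:
    enabled: true
    ingressClassName: "nginx"       # assumption: an nginx ingress controller is installed
    hosts:
      - host: "cms.mysophora.com"   # example hostname from test-values.yaml
        path: /                     # assumption
        pathType: Prefix            # assumption; the template falls back to ImplementationSpecific
  service:
    selectorLabels:
      sophora.cloud/app: cluster-server
      # label maintained by the Server Mode Labeler sidecar
      server.sophora.cloud/server-mode: primary
```

Because the service selects on the `server-mode` label, the endpoint follows the primary when the label moves to another server.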

### Alerts

This will install alerts that are not tied to one specific application but rather to the general Sophora cluster state.
See [alerting-runbook.md](./alerting-runbook.md) for the available alerts. Also check the individual application
charts for application-specific alerts.

## Parameters

### Common parameters

| Name | Description | Value |
| ------------------ | ----------------------------------------- | ----- |
| `nameOverride` | String to partially override the name | `""` |
| `fullnameOverride` | String to fully override the release name | `""` |

### Cluster Server Loadbalancer

| Name | Description | Value |
| ------------------------------------------------------------------------- | --------------------------------------------------------- | ------------------- |
| `clusterServerLb.enabled` | whether the service and ingress should be deployed or not | `false` |
| `clusterServerLb.name` | names of the resources | `cluster-server-lb` |
| `clusterServerLb.ingress.enabled` | whether the ingress should be enabled | `true` |
| `clusterServerLb.ingress.ingressClassName` | name of the ingressClass used for the ingress | `""` |
| `clusterServerLb.ingress.annotations` | annotations for the ingress | `{}` |
| `clusterServerLb.ingress.hosts` | array of host entries (`host`, optional `path` and `pathType`) used for the ingress | `[]` |
| `clusterServerLb.service.type` | Kubernetes service type | `ClusterIP` |
| `clusterServerLb.service.selectorLabels.sophora.cloud/app` | labels used to select the primary Sophora server | `cluster-server` |
| `clusterServerLb.service.selectorLabels.server.sophora.cloud/server-mode` | labels used to select the primary Sophora server | `primary` |
| `clusterServerLb.service.httpPort` | the Sophora server's http port | `1196` |
| `clusterServerLb.service.jmsPort` | the Sophora server's jms port | `1197` |
| `clusterServerLb.service.publishNotReadyAddresses` | whether the service should publish not ready addresses | `true` |

### Cluster Server Pod Disruption Budget

| Name | Description | Value |
| --------------------------------------------------- | ------------------------------------------ | ------------------------ |
| `podDisruptionBudget.enabled` | whether the PDB should be installed or not | `false` |
| `podDisruptionBudget.name` | name of the PDB | `sophora-cluster-server` |
| `podDisruptionBudget.minAvailable` | minimum available replicas | `2` |
| `podDisruptionBudget.matchLabels.sophora.cloud/app` | selector label for the cluster servers | `cluster-server` |

### Alerting / Prometheus Rules

| Name | Description | Value |
| ------------------------------------- | --------------------------------------------- | ------- |
| `prometheusRules.enabled` | Whether the alerts should be installed | `false` |
| `prometheusRules.defaultRulesEnabled` | Whether the default rules should be installed | `true` |
| `prometheusRules.rules` | Allows adding custom rules | `[]` |
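
As a sketch, custom rules could be added next to the default ones like this. The alert name, metric, and threshold are hypothetical placeholders. The `PrometheusRule` resource also assumes that the Prometheus Operator CRDs are installed in the cluster.

```yaml
# values.yaml (sketch): default rules plus one custom rule
prometheusRules:
  enabled: true
  defaultRulesEnabled: true
  rules:
    # hypothetical custom rule; name, metric and threshold are placeholders
    - alert: ExampleCustomAlert
      expr: example_metric > 10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Example custom alert
```

Custom rules are rendered through `tpl`, so Prometheus template expressions such as `{{ $labels.pod }}` would need to be escaped in the same way the default rules escape them.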

### Extra Deploy

| Name | Description | Value |
| ------------- | ---------------------------------------------------------- | ----- |
| `extraDeploy` | Allows specifying custom resources that should be deployed | `[]` |
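
A minimal sketch of an `extraDeploy` entry follows. The ConfigMap, its name, and its data are purely illustrative. Each list entry is rendered as its own YAML document and passed through `tpl`, so template expressions are expected to be resolved at install time.

```yaml
# values.yaml (sketch): deploy one additional resource with the chart
extraDeploy:
  # hypothetical ConfigMap; name and data are illustrative only
  - apiVersion: v1
    kind: ConfigMap
    metadata:
      # quoted so that YAML does not interpret the braces
      name: "{{ .Release.Name }}-extra-config"
    data:
      greeting: hello
```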

45 changes: 45 additions & 0 deletions charts/sophora-cluster-common/alerting-runbook.md
@@ -0,0 +1,45 @@
# Alerting Runbook

This document is a reference to the alerts this Helm chart can fire.

## Sophora Cluster Common

### NoPrimarySophoraServer

**Severity:** critical

**Summary:** The Sophora Cluster has no primary server. No operations with client tools will succeed and no further
replication will happen to other running servers, if there are any.

**Remediation steps:**

* Check whether the Sophora cluster is down because of planned maintenance or the remediation of another incident
* Check if the deployment has been uninstalled by mistake
* Check whether the server might have crashed
* Check the server logs for error messages
* Check whether another cluster server can be elected as the primary. This should be done carefully to ensure no data is lost.
* Try to restart the server if it is running but unresponsive
* Restore the server from a working backup

### SophoraServerNotInSync

**Severity:** high

**Summary:** The Sophora server is not in sync. This is concluded from comparing the server's *SourceTime* with the
SourceTime of the primary server. The SourceTime is the timestamp of the latest event that occurred on the primary server.
Usually, the SourceTimes of the servers should not diverge much and should be equal when compared over a short time frame.

**Remediation steps:**

* Check if the primary server logged a message containing "ReplicationMaster stopped". If yes: The primary server needs to be
restarted **without electing another server as the primary**. The last part is absolutely critical to prevent data loss. As
the servers automatically switch using a shutdown hook, a workaround is to exec into the container and replace the
shutdown hook located in the `/tools/` directory with an empty executable file before restarting the server. Note that
working with Sophora will not be possible for a few minutes during the restart. If the error persists, check the logs of
the primary for error messages hinting at the root cause of the problem.
* Check if there is a large replication queue (e.g. due to a large number of imports), which would result in a short replication
delay
* Check whether the not-in-sync server is in an erroneous state and has stopped receiving replication messages
* Check whether there are network connection issues between the server and the primary server
* Check the server's and the primary server's logs for errors or warnings
* Restart the server
75 changes: 75 additions & 0 deletions charts/sophora-cluster-common/templates/_helpers.tpl
@@ -0,0 +1,75 @@
{{/*
Expand the name of the chart.
*/}}
{{- define "sophora-cluster-common.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Create a default fully qualified app name.
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
If release name contains chart name it will be used as a full name.
*/}}
{{- define "sophora-cluster-common.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}

{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "sophora-cluster-common.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Common labels
*/}}
{{- define "sophora-cluster-common.labels" -}}
helm.sh/chart: {{ include "sophora-cluster-common.chart" . }}
{{ include "sophora-cluster-common.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}

{{/*
Selector labels
*/}}
{{- define "sophora-cluster-common.selectorLabels" -}}
app.kubernetes.io/name: {{ include "sophora-cluster-common.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

{{/*
Create the name of the service account to use
*/}}
{{- define "sophora-cluster-common.serviceAccountName" -}}
{{- if .Values.serviceAccount.create }}
{{- default (include "sophora-cluster-common.fullname" .) .Values.serviceAccount.name }}
{{- else }}
{{- default "default" .Values.serviceAccount.name }}
{{- end }}
{{- end }}

{{/*
Renders a value that contains template.
Usage:
{{ include "common.tplvalues.render" ( dict "value" .Values.path.to.the.Value "context" $) }}
*/}}
{{- define "common.tplvalues.render" -}}
{{- if typeIs "string" .value }}
{{- tpl .value .context }}
{{- else }}
{{- tpl (.value | toYaml) .context }}
{{- end }}
{{- end -}}
35 changes: 35 additions & 0 deletions charts/sophora-cluster-common/templates/alerts/prometheusrule.yaml
@@ -0,0 +1,35 @@
{{- if .Values.prometheusRules.enabled }}
{{- $defaultRulesEnabled := .Values.prometheusRules.defaultRulesEnabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ include "sophora-cluster-common.fullname" . }}
  labels: {{- include "sophora-cluster-common.labels" . | nindent 4 }}
spec:
  groups:
    - name: {{ template "sophora-cluster-common.fullname" $ }}
      rules:
        {{- if $defaultRulesEnabled }}
        - alert: NoPrimarySophoraServer
          for: 2m
          expr: '(count(sophora_server_replication_mode == 1) or vector(0)) == 0'
          labels:
            severity: critical
          annotations:
            summary: The Sophora Cluster has no primary.
            description: No primary elected in the cluster for more than 2 minutes.
            runbook_url: 'https://github.com/subshell/helm-charts/blob/main/charts/sophora-cluster-common/alerting-runbook.md'
        - alert: SophoraServerNotInSync
          for: 2m
          expr: 'scalar(max(sophora_server_source_time and sophora_server_is_primary_server == 1)) - max by (pod) (sophora_server_source_time and sophora_server_state == 2) > 60000'
          labels:
            severity: high
          annotations:
            summary: Server is not in sync
            description: The server "{{`{{ $labels.pod }}`}}" is not in sync for more than 2 minutes.
            runbook_url: 'https://github.com/subshell/helm-charts/blob/main/charts/sophora-cluster-common/alerting-runbook.md'
        {{- end }}
        {{- with .Values.prometheusRules.rules }}
        {{ tpl (toYaml .) $ | nindent 8 }}
        {{- end }}
{{- end }}
5 changes: 5 additions & 0 deletions charts/sophora-cluster-common/templates/extra-deploy.yaml
@@ -0,0 +1,5 @@
{{- range .Values.extraDeploy }}
---
{{ include "common.tplvalues.render" (dict "value" . "context" $) }}
{{- end }}

39 changes: 39 additions & 0 deletions charts/sophora-cluster-common/templates/lb/ingress.yaml
@@ -0,0 +1,39 @@
{{- if and .Values.clusterServerLb.enabled .Values.clusterServerLb.ingress.enabled }}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ .Values.clusterServerLb.name }}
  labels:
    {{- include "sophora-cluster-common.labels" . | nindent 4 }}
  {{- with .Values.clusterServerLb.ingress.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
spec:
  {{- if .Values.clusterServerLb.ingress.ingressClassName }}
  ingressClassName: {{ .Values.clusterServerLb.ingress.ingressClassName }}
  {{- end -}}
  {{- if .Values.clusterServerLb.ingress.tls }}
  tls:
    {{- range .Values.clusterServerLb.ingress.tls }}
    - hosts:
        {{- range .hosts }}
        - {{ . | quote }}
        {{- end }}
      secretName: {{ .secretName }}
    {{- end }}
  {{- end }}
  rules:
    {{- range .Values.clusterServerLb.ingress.hosts }}
    - host: {{ .host | quote }}
      http:
        paths:
          - path: {{ .path }}
            pathType: {{ default "ImplementationSpecific" .pathType }}
            backend:
              service:
                name: {{ $.Values.clusterServerLb.name }}
                port:
                  number: {{ $.Values.clusterServerLb.service.httpPort }}
    {{- end }}
{{- end }}
25 changes: 25 additions & 0 deletions charts/sophora-cluster-common/templates/lb/service.yaml
@@ -0,0 +1,25 @@
{{- if .Values.clusterServerLb.enabled }}
apiVersion: v1
kind: Service
metadata:
  name: {{ .Values.clusterServerLb.name }}
  labels: {{- include "sophora-cluster-common.labels" . | nindent 4 }}
  annotations: {{- toYaml .Values.clusterServerLb.service.annotations | nindent 4 }}
spec:
  type: {{ .Values.clusterServerLb.service.type | default "ClusterIP" }}
  selector: {{- toYaml .Values.clusterServerLb.service.selectorLabels | nindent 4 }}
  sessionAffinity: ClientIP
  publishNotReadyAddresses: {{ .Values.clusterServerLb.service.publishNotReadyAddresses }}
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600
  ports:
    - port: {{ .Values.clusterServerLb.service.httpPort }}
      targetPort: http
      protocol: TCP
      name: http
    - port: {{ .Values.clusterServerLb.service.jmsPort }}
      targetPort: jms
      protocol: TCP
      name: jms
{{- end }}
14 changes: 14 additions & 0 deletions charts/sophora-cluster-common/templates/pdb/pdb.yaml
@@ -0,0 +1,14 @@
{{- if .Values.podDisruptionBudget.enabled }}
{{- with .Values.podDisruptionBudget }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ .name }}
  labels: {{- include "sophora-cluster-common.labels" $ | nindent 4 }}
spec:
  minAvailable: {{ .minAvailable }}
  selector:
    matchLabels:
      {{- .matchLabels | toYaml | nindent 6 }}
{{- end }}
{{- end }}
19 changes: 19 additions & 0 deletions charts/sophora-cluster-common/test-values.yaml
@@ -0,0 +1,19 @@
clusterServerLb:
  enabled: true
  ingress:
    ingressClassName: "nginx"
    hosts:
      - host: "cms.mysophora.com"

podDisruptionBudget:
  enabled: true

prometheusRules:
  enabled: true
  defaultRulesEnabled: true
  rules:
    - alert: Foo
      expr: bar_metric > 10

extraDeploy:
  - apiVersion: subshell/v2