From 2a16a71a49580ebf4bb0f50b60c12c8a81314595 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Tom=C3=A1=C5=A1=20Dohn=C3=A1lek?=
Date: Wed, 18 Oct 2023 17:36:54 +0200
Subject: [PATCH 1/3] Fill external runbooks urls

Follow-up of https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/868

The current documentation on the mixin's runbook page is not ideal and
provides no useful information for most alerts. PrometheusOperator provides
some really useful descriptions for most of our alerts. This changeset links
PrometheusOperator's runbooks to most of our alerts.
---
 runbook.md | 41 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/runbook.md b/runbook.md
index e136d1e59..9b93e4c76 100644
--- a/runbook.md
+++ b/runbook.md
@@ -13,6 +13,7 @@ This page collects this repositories alerts and begins the process of describing
 ##### Alert Name: "KubeAPIDown"
 + *Message*: `KubeAPI has disappeared from Prometheus target discovery.`
 + *Severity*: critical
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapidown/)
 ##### Alert Name: "KubeControllerManagerDown"
 + *Message*: `KubeControllerManager has disappeared from Prometheus target discovery.`
 + *Severity*: critical
@@ -24,6 +25,7 @@ This page collects this repositories alerts and begins the process of describing
 ##### Alert Name: KubeletDown
 + *Message*: `Kubelet has disappeared from Prometheus target discovery.`
 + *Severity*: critical
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletdown/)
 ##### Alert Name: KubeProxyDown
 + *Message*: `KubeProxy has disappeared from Prometheus target discovery`
 + *Severity*: critical
@@ -32,37 +34,47 @@ This page collects this repositories alerts and begins the process of describing
 ##### Alert Name: KubePodCrashLooping
 + *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf \"%.2f\" $value }} / second`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodcrashlooping/)
 ##### Alert Name: "KubePodNotReady"
 + *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} is not ready.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodnotready/)
 ##### Alert Name: "KubeDeploymentGenerationMismatch"
 + *Message*: `Deployment {{ $labels.namespace }}/{{ $labels.deployment }} generation mismatch`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentgenerationmismatch/)
 ##### Alert Name: "KubeDeploymentReplicasMismatch"
 + *Message*: `Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replica mismatch`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentreplicasmismatch/)
 ##### Alert Name: "KubeDeploymentRolloutStuck"
 + *Message*: `Rollout of deployment {{ $labels.namespace }}/{{ $labels.deployment }} is not progressing`
 + *Severity*: warning
 ##### Alert Name: "KubeStatefulSetReplicasMismatch"
 + *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} replica mismatch`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetreplicasmismatch/)
 ##### Alert Name: "KubeStatefulSetGenerationMismatch"
 + *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} generation mismatch`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetgenerationmismatch/)
 ##### Alert Name: "KubeDaemonSetRolloutStuck"
 + *Message*: `Only {{$value | humanizePercentage }} of desired pods scheduled and ready for daemon set {{$labels.namespace}}/{{$labels.daemonset}}`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetrolloutstuck/)
 ##### Alert Name: "KubeContainerWaiting"
 + *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecontainerwaiting/)
 ##### Alert Name: "KubeDaemonSetNotScheduled"
 + *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are not scheduled.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetnotscheduled/)
 
 ##### Alert Name: "KubeDaemonSetMisScheduled"
 + *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are running where they are not supposed to run.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetmisscheduled/)
 
 ##### Alert Name: "KubeJobNotCompleted"
 + *Message*: `Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than {{ "%(kubeJobTimeoutDuration)s" | humanizeDuration }} to complete.`
@@ -73,91 +85,116 @@ This page collects this repositories alerts and begins the process of describing
 + *Message*: `Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.`
 + *Severity*: warning
 + *Action*: Check the job using `kubectl describe job ` and look at the pod logs using `kubectl logs ` for further information.
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobfailed/)
 
 ### Group Name: "kubernetes-resources"
 ##### Alert Name: "KubeCPUOvercommit"
 + *Message*: `Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuovercommit/)
 ##### Alert Name: "KubeMemOvercommit"
 + *Message*: `Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.`
 + *Severity*: warning
 ##### Alert Name: "KubeCPUQuotaOvercommit"
 + *Message*: `Cluster has overcommitted CPU resource requests for Namespaces.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuquotaovercommit/)
 ##### Alert Name: "KubeMemQuotaOvercommit"
 + *Message*: `Cluster has overcommitted memory resource requests for Namespaces.`
 + *Severity*: warning
 ##### Alert Name: "KubeQuotaAlmostFull"
 + *Message*: `{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.`
 + *Severity*: info
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotaalmostfull/)
 ##### Alert Name: "KubeQuotaFullyUsed"
 + *Message*: `{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.`
 + *Severity*: info
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotafullyused/)
 ##### Alert Name: "KubeQuotaExceeded"
 + *Message*: `{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotaexceeded/)
 ### Group Name: "kubernetes-storage"
 ##### Alert Name: "KubePersistentVolumeFillingUp"
 + *Message*: `The persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has {{ $value | humanizePercentage }} free.`
 + *Severity*: critical
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup/)
 ##### Alert Name: "KubePersistentVolumeFillingUp"
 + *Message*: `Based on recent sampling, the persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is expected to fill up within four days.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup/)
 ### Group Name: "kubernetes-system"
 ##### Alert Name: "KubeNodeNotReady"
 + *Message*: `{{ $labels.node }} has been unready for more than 15 minutes."`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodenotready/)
 ##### Alert Name: "KubeNodeUnreachable"
 + *Message*: `{{ $labels.node }} is unreachable and some workloads may be rescheduled.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodeunreachable/)
 ##### Alert Name: "KubeletTooManyPods"
 + *Message*: `Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity.`
 + *Severity*: info
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubelettoomanypods/)
 ##### Alert Name: "KubeNodeReadinessFlapping"
 + *Message*: `The readiness status of node {{ $labels.node }} has changed {{ $value }} times in the last 15 minutes.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodereadinessflapping/)
 ##### Alert Name: "KubeletPlegDurationHigh"
 + *Message*: `The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletplegdurationhigh/)
 ##### Alert Name: "KubeletPodStartUpLatencyHigh"
 + *Message*: `Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on node {{ $labels.node }}.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletpodstartuplatencyhigh/)
 ##### Alert Name: "KubeletClientCertificateExpiration"
 + *Message*: `Client certificate for Kubelet on node {{ $labels.node }} expires in 7 days.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletclientcertificateexpiration/)
 ##### Alert Name: "KubeletClientCertificateExpiration"
 + *Message*: `Client certificate for Kubelet on node {{ $labels.node }} expires in 1 day.`
 + *Severity*: critical
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletclientcertificateexpiration/)
 ##### Alert Name: "KubeletServerCertificateExpiration"
 + *Message*: `Server certificate for Kubelet on node {{ $labels.node }} expires in 7 days.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletservercertificateexpiration/)
 ##### Alert Name: "KubeletServerCertificateExpiration"
 + *Message*: `Server certificate for Kubelet on node {{ $labels.node }} expires in 1 day.`
 + *Severity*: critical
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletservercertificateexpiration/)
 ##### Alert Name: "KubeletClientCertificateRenewalErrors"
 + *Message*: `Kubelet on node {{ $labels.node }} has failed to renew its client certificate ({{ $value | humanize }} errors in the last 15 minutes).`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletclientcertificaterenewalerrors/)
 ##### Alert Name: "KubeletServerCertificateRenewalErrors"
 + *Message*: `Kubelet on node {{ $labels.node }} has failed to renew its server certificate ({{ $value | humanize }} errors in the last 5 minutes).`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletservercertificaterenewalerrors/)
 ##### Alert Name: "KubeVersionMismatch"
 + *Message*: `There are {{ $value }} different versions of Kubernetes components running.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeversionmismatch/)
 ##### Alert Name: "KubeClientErrors"
 + *Message*: `Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors.'`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclienterrors/)
 ##### Alert Name: "KubeClientCertificateExpiration"
 + *Message*: `A client certificate used to authenticate to the apiserver is expiring in less than 7 days.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclientcertificateexpiration/)
 ##### Alert Name: "KubeClientCertificateExpiration"
 + *Message*: `A client certificate used to authenticate to the apiserver is expiring in less than 1 day.`
 + *Severity*: critical
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclientcertificateexpiration/)
 ##### Alert Name: "KubeAPITerminatedRequests"
 + *Message*: `The apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.`
 + *Severity*: warning
 + *Action*: Use the `apiserver_flowcontrol_rejected_requests_total` metric to determine which flow schema is throttling the traffic to the API Server. The flow schema also provides information on the affected resources and subjects.
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapiterminatedrequests/)
 
 ## Other Kubernetes Runbooks and troubleshooting
-+ [Troubleshoot Clusters ](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/)
-+ [Cloud.gov Kubernetes Runbook ](https://landing.app.cloud.gov/docs/ops/runbook/troubleshooting-kubernetes/)
++ [Troubleshoot Clusters](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/)
++ [Cloud.gov Kubernetes Runbook](https://landing.app.cloud.gov/docs/ops/runbook/troubleshooting-kubernetes/)
 + [Recover a Broken Cluster](https://codefresh.io/Kubernetes-Tutorial/recover-broken-kubernetes-cluster/)

From f72b6823a0fd160aa8287c24937ae939787c61d7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Tom=C3=A1=C5=A1=20Dohn=C3=A1lek?=
Date: Thu, 19 Oct 2023 12:33:10 +0200
Subject: [PATCH 2/3] add missing alerts

---
 runbook.md | 42 +++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/runbook.md b/runbook.md
index 9b93e4c76..7e2fe0e44 100644
--- a/runbook.md
+++ b/runbook.md
@@ -30,6 +30,7 @@ This page collects this repositories alerts and begins the process of describing
 + *Message*: `KubeProxy has disappeared from Prometheus target discovery`
 + *Severity*: critical
 + *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeproxydown/)
+
 ### Group Name: kubernetes-apps
 ##### Alert Name: KubePodCrashLooping
 + *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf \"%.2f\" $value }} / second`
@@ -70,17 +71,26 @@ This page collects this repositories alerts and begins the process of describing
 + *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are not scheduled.`
 + *Severity*: warning
 + *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetnotscheduled/)
-
+##### Alert Name: "KubeStatefulSetUpdateNotRolledOut"
++ *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.`
++ *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetupdatenotrolledout/)
+##### Alert Name: "KubeHpaReplicasMismatch"
++ *Message*: `'HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has not matched the desired number of replicas for longer than 15 minutes.`
++ *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpareplicasmismatch/)
+##### Alert Name: "KubeHpaMaxedOut"
++ *Message*: `HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has been running at max replicas for longer than 15 minutes.`
++ *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpamaxedout/)
 ##### Alert Name: "KubeDaemonSetMisScheduled"
 + *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are running where they are not supposed to run.`
 + *Severity*: warning
 + *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetmisscheduled/)
-
 ##### Alert Name: "KubeJobNotCompleted"
 + *Message*: `Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than {{ "%(kubeJobTimeoutDuration)s" | humanizeDuration }} to complete.`
 + *Severity*: warning
 + *Action*: Check the job using `kubectl describe job ` and look at the pod logs using `kubectl logs ` for further information.
-
 ##### Alert Name: "KubeJobFailed"
 + *Message*: `Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.`
 + *Severity*: warning
@@ -114,6 +124,11 @@ This page collects this repositories alerts and begins the process of describing
 + *Message*: `{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.`
 + *Severity*: warning
 + *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotaexceeded/)
+##### Alert Name: "CPUThrottlingHigh"
++ *Message*: `Processes experience elevated CPU throttling.`
++ *Severity*: info
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/cputhrottlinghigh/)
+
 ### Group Name: "kubernetes-storage"
 ##### Alert Name: "KubePersistentVolumeFillingUp"
 + *Message*: `The persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has {{ $value | humanizePercentage }} free.`
@@ -123,6 +138,13 @@ This page collects this repositories alerts and begins the process of describing
 + *Message*: `Based on recent sampling, the persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is expected to fill up within four days.`
 + *Severity*: warning
 + *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup/)
+##### Alert Name: "KubePersistentVolumeInodesFillingUp"
++ *Message*: `PersistentVolume is filling up.`
+##### Alert Name: "KubePersistentVolumeErrors"
++ *Message*: `PersistentVolume is having issues with provisioning.`
++ *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumeerrors/)
+
 ### Group Name: "kubernetes-system"
 ##### Alert Name: "KubeNodeNotReady"
 + *Message*: `{{ $labels.node }} has been unready for more than 15 minutes."`
@@ -193,6 +215,20 @@ This page collects this repositories alerts and begins the process of describing
 + *Severity*: warning
 + *Action*: Use the `apiserver_flowcontrol_rejected_requests_total` metric to determine which flow schema is throttling the traffic to the API Server. The flow schema also provides information on the affected resources and subjects.
 + *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapiterminatedrequests/)
+##### Alert Name: "KubeAggregatedAPIErrors"
++ *Message*: `Kubernetes aggregated API has reported errors.`
++ *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeaggregatedapierrors/)
+##### Alert Name: "KubeAggregatedAPIDown"
++ *Message*: `Kubernetes aggregated API is down.`
++ *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeaggregatedapidown/)
+
+### Group Name: "kube-apiserver-slos"
+##### Alert Name: "KubeAPIErrorBudgetBurn"
++ *Message*: `The API server is burning too much error budget.`
++ *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapierrorbudgetburn/)
 
 ## Other Kubernetes Runbooks and troubleshooting
 + [Troubleshoot Clusters](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/)

From c3f0d70228d8eea84c9a5ba9ca94c589a845640a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Tom=C3=A1=C5=A1=20Dohn=C3=A1lek?=
Date: Thu, 19 Oct 2023 12:34:26 +0200
Subject: [PATCH 3/3] follow up of https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/386

---
 runbook.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/runbook.md b/runbook.md
index 7e2fe0e44..eac7c05fa 100644
--- a/runbook.md
+++ b/runbook.md
@@ -102,14 +102,14 @@ This page collects this repositories alerts and begins the process of describing
 + *Message*: `Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.`
 + *Severity*: warning
 + *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuovercommit/)
-##### Alert Name: "KubeMemOvercommit"
+##### Alert Name: "KubeMemoryOvercommit"
 + *Message*: `Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.`
 + *Severity*: warning
 ##### Alert Name: "KubeCPUQuotaOvercommit"
 + *Message*: `Cluster has overcommitted CPU resource requests for Namespaces.`
 + *Severity*: warning
 + *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuquotaovercommit/)
-##### Alert Name: "KubeMemQuotaOvercommit"
+##### Alert Name: "KubeMemoryQuotaOvercommit"
 + *Message*: `Cluster has overcommitted memory resource requests for Namespaces.`
 + *Severity*: warning
 ##### Alert Name: "KubeQuotaAlmostFull"