diff --git a/runbook.md b/runbook.md
index e136d1e59..eac7c05fa 100644
--- a/runbook.md
+++ b/runbook.md
@@ -13,6 +13,7 @@ This page collects this repositories alerts and begins the process of describing
 ##### Alert Name: "KubeAPIDown"
 + *Message*: `KubeAPI has disappeared from Prometheus target discovery.`
 + *Severity*: critical
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapidown/)
 ##### Alert Name: "KubeControllerManagerDown"
 + *Message*: `KubeControllerManager has disappeared from Prometheus target discovery.`
 + *Severity*: critical
@@ -24,140 +25,212 @@ This page collects this repositories alerts and begins the process of describing
 ##### Alert Name: KubeletDown
 + *Message*: `Kubelet has disappeared from Prometheus target discovery.`
 + *Severity*: critical
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletdown/)
 ##### Alert Name: KubeProxyDown
 + *Message*: `KubeProxy has disappeared from Prometheus target discovery`
 + *Severity*: critical
 + *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeproxydown/)
+
 ### Group Name: kubernetes-apps
 ##### Alert Name: KubePodCrashLooping
 + *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf \"%.2f\" $value }} / second`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodcrashlooping/)
 ##### Alert Name: "KubePodNotReady"
 + *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} is not ready.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodnotready/)
 ##### Alert Name: "KubeDeploymentGenerationMismatch"
 + *Message*: `Deployment {{ $labels.namespace }}/{{ $labels.deployment }} generation mismatch`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentgenerationmismatch/)
 ##### Alert Name: "KubeDeploymentReplicasMismatch"
 + *Message*: `Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replica mismatch`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentreplicasmismatch/)
 ##### Alert Name: "KubeDeploymentRolloutStuck"
 + *Message*: `Rollout of deployment {{ $labels.namespace }}/{{ $labels.deployment }} is not progressing`
 + *Severity*: warning
 ##### Alert Name: "KubeStatefulSetReplicasMismatch"
 + *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} replica mismatch`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetreplicasmismatch/)
 ##### Alert Name: "KubeStatefulSetGenerationMismatch"
 + *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} generation mismatch`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetgenerationmismatch/)
 ##### Alert Name: "KubeDaemonSetRolloutStuck"
 + *Message*: `Only {{$value | humanizePercentage }} of desired pods scheduled and ready for daemon set {{$labels.namespace}}/{{$labels.daemonset}}`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetrolloutstuck/)
 ##### Alert Name: "KubeContainerWaiting"
 + *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecontainerwaiting/)
 ##### Alert Name: "KubeDaemonSetNotScheduled"
 + *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are not scheduled.`
 + *Severity*: warning
-
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetnotscheduled/)
+##### Alert Name: "KubeStatefulSetUpdateNotRolledOut"
++ *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.`
++ *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetupdatenotrolledout/)
+##### Alert Name: "KubeHpaReplicasMismatch"
++ *Message*: `HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has not matched the desired number of replicas for longer than 15 minutes.`
++ *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpareplicasmismatch/)
+##### Alert Name: "KubeHpaMaxedOut"
++ *Message*: `HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has been running at max replicas for longer than 15 minutes.`
++ *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpamaxedout/)
 ##### Alert Name: "KubeDaemonSetMisScheduled"
 + *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are running where they are not supposed to run.`
 + *Severity*: warning
-
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetmisscheduled/)
 ##### Alert Name: "KubeJobNotCompleted"
 + *Message*: `Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than {{ "%(kubeJobTimeoutDuration)s" | humanizeDuration }} to complete.`
 + *Severity*: warning
 + *Action*: Check the job using `kubectl describe job ` and look at the pod logs using `kubectl logs ` for further information.
-
 ##### Alert Name: "KubeJobFailed"
 + *Message*: `Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.`
 + *Severity*: warning
 + *Action*: Check the job using `kubectl describe job ` and look at the pod logs using `kubectl logs ` for further information.
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobfailed/)
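+
+A minimal sketch of the triage suggested in the *Action* items above, assuming the job is named `my-job` in namespace `my-namespace` (both hypothetical placeholders):
+
+```shell
+# Show the job's status, conditions, and recent events.
+kubectl describe job my-job -n my-namespace
+
+# Find the pods the job created and read their logs via the job-name label.
+kubectl get pods -n my-namespace -l job-name=my-job
+kubectl logs -n my-namespace -l job-name=my-job --all-containers --tail=100
+```
+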
 ### Group Name: "kubernetes-resources"
 ##### Alert Name: "KubeCPUOvercommit"
 + *Message*: `Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.`
 + *Severity*: warning
-##### Alert Name: "KubeMemOvercommit"
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuovercommit/)
+##### Alert Name: "KubeMemoryOvercommit"
 + *Message*: `Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.`
 + *Severity*: warning
 ##### Alert Name: "KubeCPUQuotaOvercommit"
 + *Message*: `Cluster has overcommitted CPU resource requests for Namespaces.`
 + *Severity*: warning
-##### Alert Name: "KubeMemQuotaOvercommit"
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuquotaovercommit/)
+##### Alert Name: "KubeMemoryQuotaOvercommit"
 + *Message*: `Cluster has overcommitted memory resource requests for Namespaces.`
 + *Severity*: warning
 ##### Alert Name: "KubeQuotaAlmostFull"
 + *Message*: `{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.`
 + *Severity*: info
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotaalmostfull/)
 ##### Alert Name: "KubeQuotaFullyUsed"
 + *Message*: `{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.`
 + *Severity*: info
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotafullyused/)
 ##### Alert Name: "KubeQuotaExceeded"
 + *Message*: `{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotaexceeded/)
+##### Alert Name: "CPUThrottlingHigh"
++ *Message*: `Processes experience elevated CPU throttling.`
++ *Severity*: info
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/cputhrottlinghigh/)
+
 ### Group Name: "kubernetes-storage"
 ##### Alert Name: "KubePersistentVolumeFillingUp"
 + *Message*: `The persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has {{ $value | humanizePercentage }} free.`
 + *Severity*: critical
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup/)
 ##### Alert Name: "KubePersistentVolumeFillingUp"
 + *Message*: `Based on recent sampling, the persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is expected to fill up within four days.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup/)
+##### Alert Name: "KubePersistentVolumeInodesFillingUp"
++ *Message*: `PersistentVolume is filling up.`
+##### Alert Name: "KubePersistentVolumeErrors"
++ *Message*: `PersistentVolume is having issues with provisioning.`
++ *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumeerrors/)
+
 ### Group Name: "kubernetes-system"
 ##### Alert Name: "KubeNodeNotReady"
 + *Message*: `{{ $labels.node }} has been unready for more than 15 minutes."`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodenotready/)
 ##### Alert Name: "KubeNodeUnreachable"
 + *Message*: `{{ $labels.node }} is unreachable and some workloads may be rescheduled.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodeunreachable/)
 ##### Alert Name: "KubeletTooManyPods"
 + *Message*: `Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity.`
 + *Severity*: info
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubelettoomanypods/)
 ##### Alert Name: "KubeNodeReadinessFlapping"
 + *Message*: `The readiness status of node {{ $labels.node }} has changed {{ $value }} times in the last 15 minutes.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodereadinessflapping/)
 ##### Alert Name: "KubeletPlegDurationHigh"
 + *Message*: `The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletplegdurationhigh/)
 ##### Alert Name: "KubeletPodStartUpLatencyHigh"
 + *Message*: `Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on node {{ $labels.node }}.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletpodstartuplatencyhigh/)
 ##### Alert Name: "KubeletClientCertificateExpiration"
 + *Message*: `Client certificate for Kubelet on node {{ $labels.node }} expires in 7 days.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletclientcertificateexpiration/)
 ##### Alert Name: "KubeletClientCertificateExpiration"
 + *Message*: `Client certificate for Kubelet on node {{ $labels.node }} expires in 1 day.`
 + *Severity*: critical
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletclientcertificateexpiration/)
 ##### Alert Name: "KubeletServerCertificateExpiration"
 + *Message*: `Server certificate for Kubelet on node {{ $labels.node }} expires in 7 days.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletservercertificateexpiration/)
 ##### Alert Name: "KubeletServerCertificateExpiration"
 + *Message*: `Server certificate for Kubelet on node {{ $labels.node }} expires in 1 day.`
 + *Severity*: critical
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletservercertificateexpiration/)
 ##### Alert Name: "KubeletClientCertificateRenewalErrors"
 + *Message*: `Kubelet on node {{ $labels.node }} has failed to renew its client certificate ({{ $value | humanize }} errors in the last 15 minutes).`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletclientcertificaterenewalerrors/)
 ##### Alert Name: "KubeletServerCertificateRenewalErrors"
 + *Message*: `Kubelet on node {{ $labels.node }} has failed to renew its server certificate ({{ $value | humanize }} errors in the last 5 minutes).`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletservercertificaterenewalerrors/)
 ##### Alert Name: "KubeVersionMismatch"
 + *Message*: `There are {{ $value }} different versions of Kubernetes components running.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeversionmismatch/)
 ##### Alert Name: "KubeClientErrors"
 + *Message*: `Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors.'`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclienterrors/)
 ##### Alert Name: "KubeClientCertificateExpiration"
 + *Message*: `A client certificate used to authenticate to the apiserver is expiring in less than 7 days.`
 + *Severity*: warning
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclientcertificateexpiration/)
 ##### Alert Name: "KubeClientCertificateExpiration"
 + *Message*: `A client certificate used to authenticate to the apiserver is expiring in less than 1 day.`
 + *Severity*: critical
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclientcertificateexpiration/)
 ##### Alert Name: "KubeAPITerminatedRequests"
 + *Message*: `The apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.`
 + *Severity*: warning
 + *Action*: Use the `apiserver_flowcontrol_rejected_requests_total` metric to determine which flow schema is throttling the traffic to the API Server. The flow schema also provides information on the affected resources and subjects.
++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapiterminatedrequests/)
"KubeClientErrors" + *Message*: `Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors.'` + *Severity*: warning ++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclienterrors/) ##### Alert Name: "KubeClientCertificateExpiration" + *Message*: `A client certificate used to authenticate to the apiserver is expiring in less than 7 days.` + *Severity*: warning ++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclientcertificateexpiration/) ##### Alert Name: "KubeClientCertificateExpiration" + *Message*: `A client certificate used to authenticate to the apiserver is expiring in less than 1 day.` + *Severity*: critical ++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclientcertificateexpiration/) ##### Alert Name: "KubeAPITerminatedRequests" + *Message*: `The apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.` + *Severity*: warning + *Action*: Use the `apiserver_flowcontrol_rejected_requests_total` metric to determine which flow schema is throttling the traffic to the API Server. The flow schema also provides information on the affected resources and subjects. ++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapiterminatedrequests/) +##### Alert Name: "KubeAggregatedAPIErrors" ++ *Message*: `Kubernetes aggregated API has reported errors.` ++ *Severity*: warning ++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeaggregatedapierrors/) +##### Alert Name: "KubeAggregatedAPIDown" ++ *Message*: `Kubernetes aggregated API is down.` ++ *Severity*: warning ++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeaggregatedapidown/) + +### Group Name: "kube-apiserver-slos" +##### Alert Name: "KubeAPIErrorBudgetBurn" ++ *Message*: `The API server is burning too much error budget.` ++ *Severity*: warning ++ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapierrorbudgetburn/) ## Other Kubernetes Runbooks and troubleshooting -+ [Troubleshoot Clusters ](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/) -+ [Cloud.gov Kubernetes Runbook ](https://landing.app.cloud.gov/docs/ops/runbook/troubleshooting-kubernetes/) ++ [Troubleshoot Clusters](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/) ++ [Cloud.gov Kubernetes Runbook](https://landing.app.cloud.gov/docs/ops/runbook/troubleshooting-kubernetes/) + [Recover a Broken Cluster](https://codefresh.io/Kubernetes-Tutorial/recover-broken-kubernetes-cluster/)