As Rob Ewaschuk puts it:
Playbooks (or runbooks) are an important part of an alerting system; it's best to have an entry for each alert or family of alerts that catch a symptom, which can further explain what the alert means and how it might be addressed.
It is a recommended practice to add a "runbook" annotation to every Prometheus alert, linking to a clear description of its meaning and suggested remediation or mitigation. While some problems will require private and custom solutions, most common problems have common solutions. In practice, you'll want to automate many of the procedures (rather than leaving them in a wiki), but even a self-correcting problem should explain to observers what happened and why.
Matthew Skelton & Rob Thatcher have an excellent run book template. This template will help teams to fully consider most aspects of reliably operating most interesting software systems, if only to confirm that "this section definitely does not apply here" - a valuable realization.
This page collects this repository's alerts and begins the process of describing what they mean and how they might be addressed. Links from alerts to this page are added automatically.
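The automatic linking mentioned above could work roughly as follows. This is a minimal sketch, not this repository's actual implementation: the URL pattern, the lowercase-anchor scheme, and the example alert name are hypothetical, and only the "runbook" annotation key is taken from the text above.

```python
import copy

# Hypothetical URL pattern: the real base URL and anchor scheme are not stated on this page.
RUNBOOK_URL_PATTERN = "https://example.com/runbook.md#alert-name-{anchor}"


def add_runbook_links(groups, url_pattern=RUNBOOK_URL_PATTERN):
    """Return a copy of the rule groups with a runbook annotation on every alert."""
    result = copy.deepcopy(groups)  # leave the caller's rules untouched
    for group in result:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules have no alert name, so skip them
            annotations = rule.setdefault("annotations", {})
            # Keep any hand-written link and only fill in missing ones.
            annotations.setdefault(
                "runbook", url_pattern.format(anchor=rule["alert"].lower())
            )
    return result


# Example rule groups in the shape of a Prometheus rules file (names are made up).
groups = [
    {
        "name": "example",
        "rules": [
            {"alert": "ExampleJobFailed", "labels": {"severity": "warning"}},
            {"record": "example:rate5m", "expr": "vector(1)"},
        ],
    }
]

linked = add_runbook_links(groups)
print(linked[0]["rules"][0]["annotations"]["runbook"])
# prints https://example.com/runbook.md#alert-name-examplejobfailed
```

Filling in only missing annotations means an alert that already points at a more specific runbook keeps its own link.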
- Message:
KubeAPI has disappeared from Prometheus target discovery.
- Severity: critical
- Message:
KubeControllerManager has disappeared from Prometheus target discovery.
- Severity: critical
- Runbook: Link
- Message:
KubeScheduler has disappeared from Prometheus target discovery.
- Severity: critical
- Runbook: Link
- Message:
Kubelet has disappeared from Prometheus target discovery.
- Severity: critical
- Message:
KubeProxy has disappeared from Prometheus target discovery.
- Severity: critical
- Runbook: Link
- Message:
{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} / second
- Severity: warning
- Message:
{{ $labels.namespace }}/{{ $labels.pod }} is not ready.
- Severity: warning
- Message:
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} generation mismatch
- Severity: warning
- Message:
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replica mismatch
- Severity: warning
- Message:
Rollout of deployment {{ $labels.namespace }}/{{ $labels.deployment }} is not progressing
- Severity: warning
- Message:
StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} replica mismatch
- Severity: warning
- Message:
StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} generation mismatch
- Severity: warning
- Message:
Only {{$value | humanizePercentage }} of desired pods scheduled and ready for daemon set {{$labels.namespace}}/{{$labels.daemonset}}
- Severity: warning
- Message:
{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state.
- Severity: warning
- Message:
A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are not scheduled.
- Severity: warning
- Message:
A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are running where they are not supposed to run.
- Severity: warning
- Message:
Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than {{ "%(kubeJobTimeoutDuration)s" | humanizeDuration }} to complete.
- Severity: warning
- Action: Check the job using
kubectl describe job <job>
and look at the pod logs using
kubectl logs <pod>
for further information.
- Message:
Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.
- Severity: warning
- Action: Check the job using
kubectl describe job <job>
and look at the pod logs using
kubectl logs <pod>
for further information.
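The Action given for the two Job alerts above can be scripted. The sketch below only assembles the kubectl command lines from an alert's namespace and job_name labels. The helper names and the example namespace and job are hypothetical, and the pod lookup assumes the pods carry the `job-name` label that the Job controller normally sets.

```python
import subprocess


def job_triage_commands(namespace, job):
    """Build the kubectl invocations for inspecting a Job and its pods."""
    return [
        ["kubectl", "describe", "job", job, "--namespace", namespace],
        # Assumption: the Job's pods are labelled job-name=<job>.
        ["kubectl", "get", "pods", "--namespace", namespace,
         "--selector", f"job-name={job}"],
        # "job/<name>" asks kubectl to pick a pod belonging to the Job.
        ["kubectl", "logs", f"job/{job}", "--namespace", namespace],
    ]


def run_triage(namespace, job):
    """Run each command in turn; requires kubectl and access to the cluster."""
    for argv in job_triage_commands(namespace, job):
        print("$", " ".join(argv))
        subprocess.run(argv, check=False)  # keep going even if one command fails


# Hypothetical label values, as they would appear on a firing alert.
commands = job_triage_commands("batch", "nightly-report")
for argv in commands:
    print(" ".join(argv))
```

Calling `run_triage("batch", "nightly-report")` would execute the three commands against the current kubectl context.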
- Message:
Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.
- Severity: warning
- Message:
Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.
- Severity: warning
- Message:
Cluster has overcommitted CPU resource requests for Namespaces.
- Severity: warning
- Message:
Cluster has overcommitted memory resource requests for Namespaces.
- Severity: warning
- Message:
{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.
- Severity: info
- Message:
{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.
- Severity: info
- Message:
{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.
- Severity: warning
- Message:
The persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has {{ $value | humanizePercentage }} free.
- Severity: critical
- Message:
Based on recent sampling, the persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is expected to fill up within four days.
- Severity: warning
- Message:
{{ $labels.node }} has been unready for more than 15 minutes.
- Severity: warning
- Message:
{{ $labels.node }} is unreachable and some workloads may be rescheduled.
- Severity: warning
- Message:
Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity.
- Severity: info
- Message:
The readiness status of node {{ $labels.node }} has changed {{ $value }} times in the last 15 minutes.
- Severity: warning
- Message:
The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}.
- Severity: warning
- Message:
Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on node {{ $labels.node }}.
- Severity: warning
- Message:
Client certificate for Kubelet on node {{ $labels.node }} expires in 7 days.
- Severity: warning
- Message:
Client certificate for Kubelet on node {{ $labels.node }} expires in 1 day.
- Severity: critical
- Message:
Server certificate for Kubelet on node {{ $labels.node }} expires in 7 days.
- Severity: warning
- Message:
Server certificate for Kubelet on node {{ $labels.node }} expires in 1 day.
- Severity: critical
- Message:
Kubelet on node {{ $labels.node }} has failed to renew its client certificate ({{ $value | humanize }} errors in the last 15 minutes).
- Severity: warning
- Message:
Kubelet on node {{ $labels.node }} has failed to renew its server certificate ({{ $value | humanize }} errors in the last 5 minutes).
- Severity: warning
- Message:
There are {{ $value }} different versions of Kubernetes components running.
- Severity: warning
- Message:
Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors.
- Severity: warning
- Message:
A client certificate used to authenticate to the apiserver is expiring in less than 7 days.
- Severity: warning
- Message:
A client certificate used to authenticate to the apiserver is expiring in less than 1 day.
- Severity: critical
- Message:
The apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.
- Severity: warning
- Action: Use the
apiserver_flowcontrol_rejected_requests_total
metric to determine which flow schema is throttling the traffic to the API Server. The flow schema also provides information on the affected resources and subjects.
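One way to follow the Action above is to break the rejection rate down per flow schema. The sketch below builds such a PromQL query and the matching URL for the Prometheus HTTP query API. It assumes the metric carries the `flow_schema`, `priority_level`, and `reason` labels described in the Kubernetes API Priority and Fairness documentation; the helper names, the 5m window, and the Prometheus address are illustrative choices.

```python
from urllib.parse import urlencode


def rejected_requests_query(window="5m"):
    """PromQL: per-second rejection rate grouped by flow schema, priority level, and reason."""
    return (
        "sum by (flow_schema, priority_level, reason) "
        f"(rate(apiserver_flowcontrol_rejected_requests_total[{window}]))"
    )


def instant_query_url(base_url, query):
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


query = rejected_requests_query()
# Assumed address of a locally reachable Prometheus server.
url = instant_query_url("http://localhost:9090", query)
print(query)
print(url)
```

The flow schemas with the highest rates in the result are the ones throttling traffic; the query can also be pasted directly into the Prometheus expression browser.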