Fill external runbooks urls #878

Merged: 3 commits, Oct 30, 2023
87 changes: 80 additions & 7 deletions runbook.md
@@ -13,6 +13,7 @@ This page collects this repositories alerts and begins the process of describing
##### Alert Name: "KubeAPIDown"
+ *Message*: `KubeAPI has disappeared from Prometheus target discovery.`
+ *Severity*: critical
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapidown/)
##### Alert Name: "KubeControllerManagerDown"
+ *Message*: `KubeControllerManager has disappeared from Prometheus target discovery.`
+ *Severity*: critical
@@ -24,140 +25,212 @@ This page collects this repositories alerts and begins the process of describing
##### Alert Name: KubeletDown
+ *Message*: `Kubelet has disappeared from Prometheus target discovery.`
+ *Severity*: critical
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletdown/)
##### Alert Name: KubeProxyDown
+ *Message*: `KubeProxy has disappeared from Prometheus target discovery.`
+ *Severity*: critical
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeproxydown/)

### Group Name: kubernetes-apps
##### Alert Name: KubePodCrashLooping
+ *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} / second`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodcrashlooping/)
##### Alert Name: "KubePodNotReady"
+ *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} is not ready.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodnotready/)
##### Alert Name: "KubeDeploymentGenerationMismatch"
+ *Message*: `Deployment {{ $labels.namespace }}/{{ $labels.deployment }} generation mismatch`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentgenerationmismatch/)
##### Alert Name: "KubeDeploymentReplicasMismatch"
+ *Message*: `Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replica mismatch`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentreplicasmismatch/)
##### Alert Name: "KubeDeploymentRolloutStuck"
+ *Message*: `Rollout of deployment {{ $labels.namespace }}/{{ $labels.deployment }} is not progressing`
+ *Severity*: warning
##### Alert Name: "KubeStatefulSetReplicasMismatch"
+ *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} replica mismatch`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetreplicasmismatch/)
##### Alert Name: "KubeStatefulSetGenerationMismatch"
+ *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} generation mismatch`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetgenerationmismatch/)
##### Alert Name: "KubeDaemonSetRolloutStuck"
+ *Message*: `Only {{$value | humanizePercentage }} of desired pods scheduled and ready for daemon set {{$labels.namespace}}/{{$labels.daemonset}}`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetrolloutstuck/)
##### Alert Name: "KubeContainerWaiting"
+ *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecontainerwaiting/)
##### Alert Name: "KubeDaemonSetNotScheduled"
+ *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are not scheduled.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetnotscheduled/)
##### Alert Name: "KubeStatefulSetUpdateNotRolledOut"
+ *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetupdatenotrolledout/)
##### Alert Name: "KubeHpaReplicasMismatch"
+ *Message*: `HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has not matched the desired number of replicas for longer than 15 minutes.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpareplicasmismatch/)
##### Alert Name: "KubeHpaMaxedOut"
+ *Message*: `HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has been running at max replicas for longer than 15 minutes.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpamaxedout/)
##### Alert Name: "KubeDaemonSetMisScheduled"
+ *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are running where they are not supposed to run.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetmisscheduled/)
##### Alert Name: "KubeJobNotCompleted"
+ *Message*: `Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than {{ "%(kubeJobTimeoutDuration)s" | humanizeDuration }} to complete.`
+ *Severity*: warning
+ *Action*: Check the job using `kubectl describe job <job>` and look at the pod logs using `kubectl logs <pod>` for further information.

##### Alert Name: "KubeJobFailed"
+ *Message*: `Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.`
+ *Severity*: warning
+ *Action*: Check the job using `kubectl describe job <job>` and look at the pod logs using `kubectl logs <pod>` for further information.
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobfailed/)
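
The *Action* steps for the two job alerts above can be sketched as a small shell helper. This is an illustrative sketch, not part of the runbook: the function name `debug_job` and the `KUBECTL` override (handy for a dry run with `KUBECTL=echo`) are assumptions. It relies on the `job-name=<job>` label that the Job controller sets on the pods it creates.

```shell
# Sketch: first-pass information for a Job that is stuck or has failed.
# debug_job and the KUBECTL override are hypothetical names for illustration.
debug_job() {
  namespace="$1"
  job="$2"
  kubectl_bin="${KUBECTL:-kubectl}"

  # Status, conditions, and recent events for the Job itself.
  "$kubectl_bin" describe job "$job" --namespace "$namespace"

  # Pods created by the Job carry the label job-name=<job>.
  "$kubectl_bin" get pods --namespace "$namespace" --selector "job-name=$job"

  # Last 100 log lines from every container in those pods.
  "$kubectl_bin" logs --namespace "$namespace" --selector "job-name=$job" \
    --all-containers --tail=100
}
```

Call it with the `namespace` and `job_name` labels from the alert, for example `debug_job batch nightly-report` (both values are placeholders).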

### Group Name: "kubernetes-resources"
##### Alert Name: "KubeCPUOvercommit"
+ *Message*: `Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuovercommit/)
##### Alert Name: "KubeMemoryOvercommit"
+ *Message*: `Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.`
+ *Severity*: warning
##### Alert Name: "KubeCPUQuotaOvercommit"
+ *Message*: `Cluster has overcommitted CPU resource requests for Namespaces.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuquotaovercommit/)
##### Alert Name: "KubeMemoryQuotaOvercommit"
+ *Message*: `Cluster has overcommitted memory resource requests for Namespaces.`
+ *Severity*: warning
##### Alert Name: "KubeQuotaAlmostFull"
+ *Message*: `{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.`
+ *Severity*: info
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotaalmostfull/)
##### Alert Name: "KubeQuotaFullyUsed"
+ *Message*: `{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.`
+ *Severity*: info
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotafullyused/)
##### Alert Name: "KubeQuotaExceeded"
+ *Message*: `{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotaexceeded/)
##### Alert Name: "CPUThrottlingHigh"
+ *Message*: `Processes experience elevated CPU throttling.`
+ *Severity*: info
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/cputhrottlinghigh/)

### Group Name: "kubernetes-storage"
##### Alert Name: "KubePersistentVolumeFillingUp"
+ *Message*: `The persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has {{ $value | humanizePercentage }} free.`
+ *Severity*: critical
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup/)
##### Alert Name: "KubePersistentVolumeFillingUp"
+ *Message*: `Based on recent sampling, the persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is expected to fill up within four days.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup/)
##### Alert Name: "KubePersistentVolumeInodesFillingUp"
+ *Message*: `PersistentVolume is filling up.`
##### Alert Name: "KubePersistentVolumeErrors"
+ *Message*: `PersistentVolume is having issues with provisioning.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumeerrors/)

### Group Name: "kubernetes-system"
##### Alert Name: "KubeNodeNotReady"
+ *Message*: `{{ $labels.node }} has been unready for more than 15 minutes.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodenotready/)
##### Alert Name: "KubeNodeUnreachable"
+ *Message*: `{{ $labels.node }} is unreachable and some workloads may be rescheduled.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodeunreachable/)
##### Alert Name: "KubeletTooManyPods"
+ *Message*: `Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity.`
+ *Severity*: info
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubelettoomanypods/)
##### Alert Name: "KubeNodeReadinessFlapping"
+ *Message*: `The readiness status of node {{ $labels.node }} has changed {{ $value }} times in the last 15 minutes.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodereadinessflapping/)
##### Alert Name: "KubeletPlegDurationHigh"
+ *Message*: `The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletplegdurationhigh/)
##### Alert Name: "KubeletPodStartUpLatencyHigh"
+ *Message*: `Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on node {{ $labels.node }}.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletpodstartuplatencyhigh/)
##### Alert Name: "KubeletClientCertificateExpiration"
+ *Message*: `Client certificate for Kubelet on node {{ $labels.node }} expires in 7 days.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletclientcertificateexpiration/)
##### Alert Name: "KubeletClientCertificateExpiration"
+ *Message*: `Client certificate for Kubelet on node {{ $labels.node }} expires in 1 day.`
+ *Severity*: critical
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletclientcertificateexpiration/)
##### Alert Name: "KubeletServerCertificateExpiration"
+ *Message*: `Server certificate for Kubelet on node {{ $labels.node }} expires in 7 days.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletservercertificateexpiration/)
##### Alert Name: "KubeletServerCertificateExpiration"
+ *Message*: `Server certificate for Kubelet on node {{ $labels.node }} expires in 1 day.`
+ *Severity*: critical
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletservercertificateexpiration/)
##### Alert Name: "KubeletClientCertificateRenewalErrors"
+ *Message*: `Kubelet on node {{ $labels.node }} has failed to renew its client certificate ({{ $value | humanize }} errors in the last 15 minutes).`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletclientcertificaterenewalerrors/)
##### Alert Name: "KubeletServerCertificateRenewalErrors"
+ *Message*: `Kubelet on node {{ $labels.node }} has failed to renew its server certificate ({{ $value | humanize }} errors in the last 5 minutes).`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletservercertificaterenewalerrors/)
##### Alert Name: "KubeVersionMismatch"
+ *Message*: `There are {{ $value }} different versions of Kubernetes components running.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeversionmismatch/)
##### Alert Name: "KubeClientErrors"
+ *Message*: `Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclienterrors/)
##### Alert Name: "KubeClientCertificateExpiration"
+ *Message*: `A client certificate used to authenticate to the apiserver is expiring in less than 7 days.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclientcertificateexpiration/)
##### Alert Name: "KubeClientCertificateExpiration"
+ *Message*: `A client certificate used to authenticate to the apiserver is expiring in less than 1 day.`
+ *Severity*: critical
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclientcertificateexpiration/)
##### Alert Name: "KubeAPITerminatedRequests"
+ *Message*: `The apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.`
+ *Severity*: warning
+ *Action*: Use the `apiserver_flowcontrol_rejected_requests_total` metric to determine which flow schema is throttling the traffic to the API Server. The flow schema also provides information on the affected resources and subjects.
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapiterminatedrequests/)
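
One way to act on the `apiserver_flowcontrol_rejected_requests_total` hint above is to rank flow schemas by their recent rejection rate. The sketch below is an assumption-laden example, not part of the runbook: `PROM_URL`, the function names, and the 5-minute default window are hypothetical, and the `flow_schema`, `priority_level`, and `reason` labels are taken from the Kubernetes API Priority and Fairness documentation, so they may differ on older clusters.

```shell
# Sketch: find which flow schema is rejecting requests.
# PROM_URL is a hypothetical Prometheus address; adjust it for your environment.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print a PromQL expression for the per-second rejection rate,
# grouped by flow schema, priority level, and rejection reason.
rejected_by_flow_schema_query() {
  window="${1:-5m}"
  printf 'sum by (flow_schema, priority_level, reason) (rate(apiserver_flowcontrol_rejected_requests_total[%s]))' "$window"
}

# Send the expression to the Prometheus instant-query endpoint.
query_rejections() {
  curl --silent --get "$PROM_URL/api/v1/query" \
    --data-urlencode "query=$(rejected_by_flow_schema_query "$1")"
}
```

Running `query_rejections 15m` returns JSON with one series per flow schema. The series with the highest value points at the schema to inspect with `kubectl get flowschema`.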
##### Alert Name: "KubeAggregatedAPIErrors"
+ *Message*: `Kubernetes aggregated API has reported errors.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeaggregatedapierrors/)
##### Alert Name: "KubeAggregatedAPIDown"
+ *Message*: `Kubernetes aggregated API is down.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeaggregatedapidown/)

### Group Name: "kube-apiserver-slos"
##### Alert Name: "KubeAPIErrorBudgetBurn"
+ *Message*: `The API server is burning too much error budget.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapierrorbudgetburn/)
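
Every runbook link added in this diff appears to follow one pattern: a fixed base URL followed by the lowercased alert name. A minimal sketch of that convention is shown below. The helper name `runbook_url` is hypothetical, and the pattern is inferred from the links above rather than documented, so a page is not guaranteed to exist for every alert.

```shell
# Sketch: derive a runbook URL from an alert name.
# The pattern is inferred from the links in this diff, not from an official rule.
runbook_url() {
  alert_lower="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  printf 'https://runbooks.prometheus-operator.dev/runbooks/kubernetes/%s/\n' "$alert_lower"
}
```

For example, `runbook_url KubeJobFailed` prints the same URL that the KubeJobFailed entry links to.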

## Other Kubernetes Runbooks and troubleshooting
+ [Troubleshoot Clusters](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/)
+ [Cloud.gov Kubernetes Runbook](https://landing.app.cloud.gov/docs/ops/runbook/troubleshooting-kubernetes/)
+ [Recover a Broken Cluster](https://codefresh.io/Kubernetes-Tutorial/recover-broken-kubernetes-cluster/)