Prefer kube-scheduler's resource metrics to kube-state-metrics' #815
Conversation
As much as I like this change, I can see a problem with it in managed solutions (like EKS) where access to kube-scheduler is forbidden. In those cases, alerts and dashboards based on kube-scheduler data won't be useful at all. Given that, could we use OR statements instead of deprecating the kube-state-metrics data? I think something like kube_pod_resource_request OR kube_pod_container_resource_request should do the trick.
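For illustration, a minimal sketch of that fallback (matchers simplified; the mixin's actual selectors and label placeholders would differ):

# Intended behaviour: use kube-scheduler's metric where it is scraped,
# and kube-state-metrics' metric where it is not.
kube_pod_resource_request{resource="memory"}
  or
kube_pod_container_resource_requests{resource="memory"}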
Note to self: Revisit this issue once prometheus/prometheus#9624 is implemented.
Just wondering if there's a better (or any) way to format embedded PromQL expressions in
This PR has been automatically marked as stale because it has not had recent activity. The next time this stale check runs, the stale label will be applied if there is no new activity. Thank you for your contributions!
Rebasing.
Refactor kube_pod_status_phase, since statuses other than "Pending" or "Running" are excluded or deprecated. Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
There seem to be quite a few breaking changes here: removing recording rules and creating new ones (as a result of changing the rule names) would affect any queries/dashboards written outside of this repo (i.e. downstream projects).
I would suggest either of the following to maintain compatibility:
- keeping the existing rules and creating new rules with the desired changes, or
- keeping the existing rule names and updating the rule expressions with the changes
What do you think?
Added the recording rules back in, makes sense. PTAL. :)
Yeah, this feels better to me, thanks! Makes it easier to discuss the new metrics, too.
I'd vote for creating new recording rules based on kube-scheduler metrics. For instance:
- namespace_memory:kube_pod_container_resource_requests:sum would continue to work on the kube-state-metrics kube_pod_container_resource_requests metric.
- there would be a new namespace_memory:kube_pod_resource_request:sum recording rule for the kube-scheduler kube_pod_resource_request metric.
- wherever we use namespace_memory:kube_pod_container_resource_requests:sum today, it would be changed to namespace_memory:kube_pod_resource_request:sum or namespace_memory:kube_pod_container_resource_requests:sum.
It would work whichever of the kube-state-metrics and kube-scheduler metrics are present, while staying efficient. Ideally there could be config options in the mixin to opt out of the kube-state-metrics or kube-scheduler recording rules.
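A rough sketch of how that could look in the mixin's jsonnet (rule names follow the convention above; the expressions are simplified and purely illustrative, not the final definitions):

{
  // Existing rule, unchanged: driven by kube-state-metrics only.
  record: 'namespace_memory:kube_pod_container_resource_requests:sum',
  expr: |||
    sum by (namespace, %(clusterLabel)s) (
      kube_pod_container_resource_requests{resource="memory",%(kubeStateMetricsSelector)s}
    )
  ||| % $._config,
},
{
  // New rule: same aggregation, driven by the kube-scheduler metric.
  record: 'namespace_memory:kube_pod_resource_request:sum',
  expr: |||
    sum by (namespace, %(clusterLabel)s) (
      kube_pod_resource_request{resource="memory",%(kubeSchedulerSelector)s}
    )
  ||| % $._config,
},

Dashboards and alerts would then query namespace_memory:kube_pod_resource_request:sum or namespace_memory:kube_pod_container_resource_requests:sum and pick up whichever rule has data.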
    kube_pod_container_resource_requests{resource="memory",%(kubeStateMetricsSelector)s}
  ) * on(namespace, pod, %(clusterLabel)s) group_left() max by (namespace, pod, %(clusterLabel)s) (
    kube_pod_status_phase{phase=~"Pending|Running"} == 1
    kube_pod_resource_request{resource="memory",%(kubeSchedulerSelector)s} or kube_pod_container_resource_requests{resource="memory",%(kubeStateMetricsSelector)s}
I'm not quite sure that it will do the right thing, since kube_pod_resource_request and kube_pod_container_resource_requests don't have the exact same labels, if I understand correctly.
I could be wrong here, but wouldn't the differing labelsets (kube-scheduler metrics potentially having the additional scheduler and priority labels) after the or be sanitized down to the set of specified labels in the max operation, so that doing an ignoring(scheduler, priority) would eventually have no effect on the final result?
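A small sketch of the label behaviour being discussed (label names assumed, selectors omitted):

# Because the two metrics never share an identical labelset (the kube-scheduler
# series can carry extra labels such as scheduler and priority), `or` keeps the
# series from both sides; the outer aggregation then collapses everything onto
# the listed labels, so the extra labels never reach the result.
max by (namespace, pod) (
  kube_pod_resource_request{resource="memory"}
    or
  kube_pod_container_resource_requests{resource="memory"}
)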
{
  record: 'cluster:namespace:pod_memory:active:kube_pod_resource_request_or_kube_pod_container_resource_requests',
  expr: |||
    (kube_pod_resource_request{resource="memory",%(kubeSchedulerSelector)s} or kube_pod_container_resource_requests{resource="memory",%(kubeStateMetricsSelector)s})
As kube_pod_container_resource_requests is container-level and kube_pod_resource_request is not, is it worth aggregating kube_pod_container_resource_requests up to the pod level so that these rules always have consistent labelsets? So, for example, maybe wrapping the whole expression in a sum by (%(clusterLabel)s, %(namespaceLabel)s, pod, node) (...)?
I wonder if that would make the data a bit easier to work with, regardless of whether you have scheduler or KSM data.
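Presumably something along these lines (purely illustrative; matchers trimmed, plain label names standing in for the mixin's %(clusterLabel)s-style placeholders):

# Summing per pod gives both sources the same output labelset: the container-level
# kube-state-metrics series are rolled up per pod, while the pod-level scheduler
# series pass through with their extra labels dropped.
sum by (cluster, namespace, pod, node) (
  kube_pod_resource_request{resource="memory"}
    or
  kube_pod_container_resource_requests{resource="memory"}
)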
I don't think that this rule should be changed: as you said, kube-scheduler metrics have a pod-level granularity while this recorded metric is per-container.
sum by (namespace, %(clusterLabel)s) (
  sum by (namespace, pod, %(clusterLabel)s) (
    max by (namespace, pod, container, %(clusterLabel)s) (
      kube_pod_container_resource_limits{resource="memory",%(kubeStateMetricsSelector)s}
Think you maybe intended to or the kube_pod_resource_limit metric here?
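i.e. presumably something like the line below (illustrative only, mirroring the requests rule above):

kube_pod_resource_limit{resource="memory",%(kubeSchedulerSelector)s} or kube_pod_container_resource_limits{resource="memory",%(kubeStateMetricsSelector)s}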
{
  record: 'cluster:namespace:pod_cpu:active:kube_pod_resource_limit_or_kube_pod_container_resource_limits',
  expr: |||
    (kube_pod_resource_limit{resource="memory",%(kubeSchedulerSelector)s} or kube_pod_container_resource_limits{resource="cpu",%(kubeStateMetricsSelector)s})
Suggested change:
- (kube_pod_resource_limit{resource="memory",%(kubeSchedulerSelector)s} or kube_pod_container_resource_limits{resource="cpu",%(kubeStateMetricsSelector)s})
+ (kube_pod_resource_limit{resource="cpu",%(kubeSchedulerSelector)s} or kube_pod_container_resource_limits{resource="cpu",%(kubeStateMetricsSelector)s})
sum by (namespace, %(clusterLabel)s) (
  sum by (namespace, pod, %(clusterLabel)s) (
    max by (namespace, pod, container, %(clusterLabel)s) (
      kube_pod_resource_limit{resource="memory",%(kubeSchedulerSelector)s} or kube_pod_container_resource_limits{resource="cpu",%(kubeStateMetricsSelector)s}
Suggested change:
- kube_pod_resource_limit{resource="memory",%(kubeSchedulerSelector)s} or kube_pod_container_resource_limits{resource="cpu",%(kubeStateMetricsSelector)s}
+ kube_pod_resource_limit{resource="cpu",%(kubeSchedulerSelector)s} or kube_pod_container_resource_limits{resource="cpu",%(kubeStateMetricsSelector)s}
@@ -54,7 +54,7 @@ Jsonnet offers the ability to parameterise configuration, allowing for basic cus
   alert: "KubePodNotReady",
   expr: |||
     sum by (namespace, pod) (
-      kube_pod_status_phase{%(kubeStateMetricsSelector)s, phase!~"Running|Succeeded"}
+      kube_pod_status_phase{%(kubeStateMetricsSelector)s, phase!~"Running"}
not sure about this change, is it needed?
@@ -33,7 +33,7 @@
   expr: |||
     sum by (namespace, pod, %(clusterLabel)s) (
       max by(namespace, pod, %(clusterLabel)s) (
-        kube_pod_status_phase{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s, phase=~"Pending|Unknown|Failed"}
+        kube_pod_status_phase{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s, phase="Pending"}
same here
This PR has been automatically marked as stale because it has not had recent activity. The next time this stale check runs, the stale label will be applied if there is no new activity. Thank you for your contributions!
Use kube-scheduler's metrics instead of kube-state-metrics', as they are more precise. Refer to the links below for more details.
Also, refactor kube_pod_status_phase, since statuses other than "Pending" or "Running" are excluded or deprecated.