kep: Pod resource metrics #1916

Merged · 1 commit merged into kubernetes:master on Oct 16, 2020

Conversation

smarterclayton
Contributor

Report metrics about pod resource reservation as observed by the scheduler, quota, and kubelet subsystems, in a way that lets administrators easily build dashboards and report on real resource usage.

Tied to #1748
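
To make the goal concrete, here is a purely illustrative sketch of the kind of series this could expose. The metric names and labels below are assumptions for illustration, drawn from the direction described in the KEP text, not the final implementation.

```go
// Illustrative only: assumed metric names and labels, printed in Prometheus
// exposition format so the "dashboards and reporting" goal is concrete.
package main

import "fmt"

func main() {
	samples := []string{
		`kube_pod_resource_request{namespace="default",pod="web-1",node="node-a",resource="cpu",unit="cores"} 0.3`,
		`kube_pod_resource_request{namespace="default",pod="web-1",node="node-a",resource="memory",unit="bytes"} 2.68435456e+08`,
		`kube_pod_resource_limit{namespace="default",pod="web-1",node="node-a",resource="cpu",unit="cores"} 0.5`,
	}
	// With series shaped like this, an administrator could aggregate by
	// namespace or node, e.g. sum by (namespace) (kube_pod_resource_request{resource="cpu"}).
	for _, s := range samples {
		fmt.Println(s)
	}
}
```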

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jul 30, 2020
@k8s-ci-robot k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Jul 30, 2020
@ehashman
Member

/assign

I'll TAL next week when I'm catching up on my Kubernetes backlog.

Member

@brancz brancz left a comment

There are some details that still need to be figured out, I think, but the use cases and intentions are sound and important.

Maybe this is different for people very familiar with the scheduler code base, but I am missing a bit more implementation detail. For example, I wonder whether we would actually weave this into the existing workflow the scheduler performs, or whether we would just reuse the scheduler's calculations, roughly speaking, as a library. If the latter, then maybe a first step for quick iteration could be building this as a separate component before integrating it more tightly.

* The kubelet will create a cgroup for the pod that expects to get roughly `300m` cores, but the container cgroup created for `copy-files` will request `100m` while that init container is running
* The quota subsystem will block this pod from being created if there is less than `300m` available.

Once a pod is created, it passes through four high-level lifecycle phases as seen by the total Kubernetes system. The first phase is `Pending` - the time before the pod is scheduled to a node. The second phase is `Initializing` - the time between when the pod is scheduled on a node and when all init containers have completed successfully and the `Initialized` condition on the pod is set to `true`. A pod without init containers may only remain in this phase very briefly. The third phase is `Running`, which begins when all init containers have completed and continues until the pod is deleted or reaches a terminal state of success or failure. The final phase is `Completed`, which means the pod has no running containers, all resources have been released or cleaned up, and the pod will never again consume those resources.
Member

I think this will need some deeper thought. As it stands this sounds like we will have a lot of very short lived series, which we should prevent.

Contributor Author

Hrm. Short-lived pods already have this problem for conditions, so this should be strictly less churn than almost every other kube-state-metrics (ksm) style metric that exposes conditions (which change more frequently than this). Maybe we should remove other short-lived series in order to do this, if the short-lived series budget is a concern?

Member

Because of the proposal https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/20200415-cardinality-enforcement.md, the cluster admins/vendors already have tools to prevent any kind of series explosion if that is what you meant @brancz ?
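
As background for the `300m` figure in the excerpt quoted at the top of this thread, here is a minimal sketch of the effective pod request calculation, assuming the standard rule that a pod reserves the larger of its biggest init container request and the sum of its app container requests (pod overhead ignored). The concrete container values below are hypothetical.

```go
package main

import "fmt"

// effectiveCPURequestMillis returns the pod-level CPU reservation in millicores:
// the larger of the biggest init container request and the sum of app container
// requests.
func effectiveCPURequestMillis(initRequests, appRequests []int64) int64 {
	var maxInit, sumApp int64
	for _, r := range initRequests {
		if r > maxInit {
			maxInit = r
		}
	}
	for _, r := range appRequests {
		sumApp += r
	}
	if maxInit > sumApp {
		return maxInit
	}
	return sumApp
}

func main() {
	// Hypothetical pod: a "copy-files" init container requesting 100m and two app
	// containers requesting 200m and 100m. The pod-level cgroup is sized for roughly
	// 300m, while the init container's own cgroup requests only 100m while it runs.
	fmt.Printf("%dm\n", effectiveCPURequestMillis([]int64{100}, []int64{200, 100})) // prints 300m
}
```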

@smarterclayton
Contributor Author

smarterclayton commented Jul 31, 2020

wonder whether we would actually weave this into the existing workflow the scheduler performs, or whether we would just reuse the scheduler's calculations, roughly speaking, as a library.

The scheduler is tens of related workflows with a core loop for placement, but I don't think metrics calculation is sufficient to justify a new top-level component (in this context). In general I'm -1 on adding metrics-gathering components that must cache all pods - I don't have much truck with microservices for their own sake, especially when the scheduler's domain is "the resource model" and, I would argue, the metrics it uses to make decisions. Splitting increases deployment complexity - while this is positioned as optional, I would expect every distribution of Kube that has some form of metrics to enable it. I'll list some alternative placements, such as KCM (the scheduler is better) and a separate component (extra cost to all deployments, no real reason to separate).

Member

@wojtek-t wojtek-t left a comment

@ahg-g - FYI

@brancz
Member

brancz commented Aug 3, 2020

Splitting increases deployment complexity - while this is positioned as optional, I would expect every distribution of Kube that has some form of metrics to enable it.

This is largely what I was getting at. I haven't checked recently, but I think most providers don't give access to the metrics endpoints of the scheduler, which is why I was hoping it would still be possible for those people to make use of these metrics somehow. Ultimately that's the provider's choice, but I would want to maximize the usefulness people can get out of this with any Kubernetes cluster.

@serathius
Contributor

/cc

@k8s-ci-robot k8s-ci-robot requested a review from serathius August 6, 2020 18:55
Member

@ehashman ehashman left a comment

Overall, as a cluster administrator I am intimately familiar with the problem this KEP is addressing and I would love for this feature to be available out of the box on clusters. I am generally quite enthusiastic about this KEP!

@k8s-ci-robot k8s-ci-robot requested review from pigletfly and removed request for pigletfly August 27, 2020 15:27
@smarterclayton
Contributor Author

Updated with a large number of comments addressed. The remaining issue appears to be the justification for lifecycle as a separate metric, which I will respond to in more detail shortly.

@smarterclayton
Contributor Author

So another suggestion - instead of making this optional, we could simply expose it as an endpoint /metrics/resources on the scheduler, and anyone who wants the data can scrape it. That requires no flags and incurs no additional cost to the scheduler (the cost is only incurred when scraping the metrics, as having the collector is free unless metrics are scraped).
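
A minimal sketch of what that could look like, assuming a dedicated registry, a hypothetical collector type, and a placeholder port (this is not the actual kube-scheduler wiring): the pod-level series are served from their own handler, and all work happens inside `Collect`, so an endpoint that is never scraped costs nothing.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Assumed metric name and labels, for illustration only.
var requestDesc = prometheus.NewDesc(
	"kube_pod_resource_request",
	"Resources requested by pods, as seen by the scheduler's resource model.",
	[]string{"namespace", "pod", "resource", "unit"}, nil,
)

// podResourceCollector computes series at scrape time from whatever snapshot of
// pods the scheduler already holds; here the snapshot is a hard-coded stand-in.
type podResourceCollector struct{}

func (c podResourceCollector) Describe(ch chan<- *prometheus.Desc) { ch <- requestDesc }

func (c podResourceCollector) Collect(ch chan<- prometheus.Metric) {
	// In a real scheduler this would iterate the informer cache; no state is
	// kept between scrapes, so an unscraped endpoint costs nothing.
	ch <- prometheus.MustNewConstMetric(requestDesc, prometheus.GaugeValue,
		0.3, "default", "example-pod", "cpu", "cores")
}

func main() {
	// Dedicated registry so these pod-level series stay separate from the
	// component's own /metrics.
	reg := prometheus.NewRegistry()
	reg.MustRegister(podResourceCollector{})

	mux := http.NewServeMux()
	mux.Handle("/metrics/resources", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	_ = http.ListenAndServe(":10251", mux) // placeholder port
}
```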

@brancz
Member

brancz commented Sep 18, 2020

I actually like that separation better either way. Essentially it's splitting metrics about Pods ("cluster state") from the metrics that are actually about the scheduler itself.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 15, 2020
@smarterclayton
Contributor Author

Updated with nits. @ehashman can you do a reread and add your approval if you are ok? (Nothing substantive has changed on the metrics end.)

Member

@alculquicondor alculquicondor left a comment

/lgtm

Thanks for working on this. It's going to be very helpful.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 15, 2020
@dashpole
Contributor

/approve

I did another pass, and it still looks good. I reopened a minor comment above about the cgroup example. You can remove the hold once that and the other open comments are resolved.
Please include me in the discussion about container-granularity metrics :).

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, dashpole, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 15, 2020
Report metrics about pod resource reservation as observed by
scheduler, quota, and kubelet subsystems in a way that makes
administrators able to easily build dashboards and perform
reporting on real resource usage.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 16, 2020
@smarterclayton
Contributor Author

Updated with the last comment; can someone remove the hold and add back lgtm? Thanks for the review (and sorry to sig-scheduling that this slipped through the cracks, I appreciate the improvements suggested). After implementation I will include some dashboard queries created for this.

@ahg-g
Member

ahg-g commented Oct 16, 2020

/hold cancel
/lgtm

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 16, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 16, 2020
@k8s-ci-robot k8s-ci-robot merged commit 12266f8 into kubernetes:master Oct 16, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Oct 16, 2020
@ahg-g
Member

ahg-g commented Oct 19, 2020

@smarterclayton I know that the KEP is focused on resource metrics, but I wonder if we can follow the same approach to report per-pod scheduling latency metrics. This should make it easier to implement scheduling latency SLOs.

@smarterclayton
Contributor Author

but I wonder if we can follow the same approach to report per-pod scheduling latency metrics. This should make it easier to implement scheduling latency SLOs.

The pattern of having a separate endpoint to report high-cardinality optional metrics that can be computed from static caches is pretty scalable, and it generally works when you want to avoid doing something expensive by default. However, if you need to track data yourself (using a Prometheus histogram or gauge), you'll still pay the memory cost of keeping that instance around if you need higher precision than what the API captures.
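
To illustrate that trade-off, here is a hedged sketch with hypothetical metric names: a per-pod histogram you maintain yourself stays resident in the registry between scrapes, whereas a const metric built inside a collector's `Collect` from data the component already caches only costs something when the endpoint is scraped.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Option A: self-tracked data. One histogram child per observed pod is retained
// in memory for as long as the series is needed, independent of scraping.
var schedulingLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "scheduler_pod_scheduling_duration_seconds",
		Help:    "Per-pod scheduling latency (hypothetical).",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"namespace", "pod"},
)

// Option B: derive a value at scrape time from state that already exists (for
// example, timestamps on the Pod object), emitting a const metric from within a
// prometheus.Collector instead of keeping an instrument per pod.
var schedulingLatencyDesc = prometheus.NewDesc(
	"kube_pod_scheduling_duration_seconds",
	"Per-pod scheduling latency computed on scrape (hypothetical).",
	[]string{"namespace", "pod"}, nil,
)

func emitAtScrape(ch chan<- prometheus.Metric, namespace, pod string, seconds float64) {
	ch <- prometheus.MustNewConstMetric(schedulingLatencyDesc, prometheus.GaugeValue,
		seconds, namespace, pod)
}

func main() {
	// Option A pays memory per pod but records whatever precision you observe.
	schedulingLatency.WithLabelValues("default", "web-1").Observe(0.042)

	// Option B is only as precise as the data the API already captures, which is
	// exactly the caveat raised in the comment above.
	_ = emitAtScrape // called from a Collector's Collect method in practice
}
```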

@brancz
Member

brancz commented Oct 20, 2020

@ahg-g I would suggest opening a new issue discussing how the existing metrics are not sufficient for what you are trying to achieve (I'm aware of people using them for a scheduling latency SLO, I'd like to understand the gap 🙂 ).

@smarterclayton
Contributor Author

smarterclayton commented Nov 9, 2020

Implementation is up in kubernetes/kubernetes#94866

@alculquicondor
Member

Is it kubernetes/kubernetes#95839? Why is it on pkg/kubelet?

@smarterclayton
Contributor Author

Wrong cut and paste, I was reviewing something else. The correct PR is in the comment. The other PR is related to this but not directly part of the core implementation.

Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/kep - Categorizes KEP tracking issues and PRs modifying the KEP directory.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
sig/architecture - Categorizes an issue or PR as relevant to SIG Architecture.
sig/instrumentation - Categorizes an issue or PR as relevant to SIG Instrumentation.
sig/scheduling - Categorizes an issue or PR as relevant to SIG Scheduling.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.