kep: Pod resource metrics #1916
Conversation
/assign I'll TAL next week when I'm catching up on my Kubernetes backlog.
There are some details that still need to be figured out, but I think the use cases and intentions are sound and important.
Maybe this is different for people very familiar with the scheduler code base, but I am missing a bit more implementation detail. For example, I wonder whether we would actually weave this into the existing workflow the scheduler performs, or whether we would just reuse the scheduler's calculations, roughly speaking, as a library. If the latter, then maybe a first step for quick iteration could be building this as a separate component first, before integrating it more tightly.
* The kubelet will create a cgroup for the pod that expects to get roughly `300m` cores, but the container cgroup created for `copy-files` will request `100m` while that init container is running
* The quota subsystem will block this pod from being created if there is less than `300m` available.
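To make the arithmetic behind the two bullets above concrete, here is a minimal editorial sketch of the effective-request rule (plain millicore integers instead of the real `resource.Quantity` type; the helper name is made up for illustration, not Kubernetes API code): the pod-level reservation must cover the larger of any single init container request and the sum of the regular container requests.

```go
package main

import "fmt"

// effectiveCPURequestMilli is an illustrative helper (not Kubernetes API code):
// the pod-level cgroup/quota reservation covers the larger of the biggest single
// init container request and the sum of the regular container requests.
func effectiveCPURequestMilli(initRequests, containerRequests []int64) int64 {
	var maxInit, sumContainers int64
	for _, r := range initRequests {
		if r > maxInit {
			maxInit = r
		}
	}
	for _, r := range containerRequests {
		sumContainers += r
	}
	if maxInit > sumContainers {
		return maxInit
	}
	return sumContainers
}

func main() {
	// copy-files asks for 100m while it runs; the regular containers sum to 300m,
	// so the pod cgroup is sized for roughly 300m and quota checks against 300m.
	fmt.Println(effectiveCPURequestMilli([]int64{100}, []int64{200, 100})) // 300
}
```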
Once a pod is created, it passes through four high level lifecycle phases as seen by the total Kubernetes system. The first phase is `Pending` - the time before the pod is scheduled to a node. The second phase is `Initializing` - the time between when the pod is scheduled on a node and when all init containers have completed successfully and the `Initialized` condition on the pod is set to `true`. A pod without init containers may only remain in this phase very briefly. The third phase is `Running`, when all init containers have completed, and continues until the pod is deleted or reaches a terminal state of success or failure. The final state is `Completed`, which means the pod has no running containers, all resources have been released or cleaned up, and the pod will never again consume those resources.
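As a rough editorial illustration of how an exporter could map a pod onto the four phases named above (the `PodView` type and its fields are hypothetical, not the KEP's data model):

```go
package lifecycle

// PodView is a hypothetical, simplified view of pod state for this sketch.
type PodView struct {
	Scheduled   bool // the pod has been bound to a node
	Initialized bool // the Initialized condition is true
	Terminal    bool // the pod succeeded, failed, or was deleted and released its resources
}

// lifecyclePhase maps a pod onto the phase names used in the paragraph above.
func lifecyclePhase(p PodView) string {
	switch {
	case p.Terminal:
		return "Completed"
	case !p.Scheduled:
		return "Pending"
	case !p.Initialized:
		return "Initializing"
	default:
		return "Running"
	}
}
```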
I think this will need some deeper thought. As it stands, this sounds like we will have a lot of very short-lived series, which we should prevent.
Hrm. Short-lived pods already have this problem for conditions, so this should be strictly less than almost every other ksm-style metric that exposes conditions (which change more frequently than this). Maybe we should remove other short-lived series in order to do this, if the short-lived series budget is a concern?
Because of the proposal https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/20200415-cardinality-enforcement.md, cluster admins/vendors already have tools to prevent any kind of series explosion, if that is what you meant, @brancz?
The scheduler is tens of related workflows, with a core loop for placement, but I don't think metrics calculation is sufficient to justify a new top-level component (in this context). In general I'm -1 to adding metrics-gathering components that must cache all pods - I don't have a lot of truck with microservices for their own sake, especially when the domain of the scheduler is "the resource model" and, I would argue, the metrics it is using to make decisions. Splitting increases deployment complexity - while this is positioned as optional, I would expect every distribution of Kube that has some form of metrics to enable it. I'll list some alternative placements such as KCM (scheduler is better) and a separate component (extra cost to all deployments, no real reason to separate).
@ahg-g - FYI
This is largely what I was going towards. I haven't checked recently, but I think most providers don't give access to metrics endpoints of the scheduler, which is why I was hoping it would still be possible for those people to make use of these metrics somehow. Ultimately that's the provider's choice, but I would want to maximize the usefulness people can get out of this with any Kubernetes cluster.
/cc
Overall, as a cluster administrator I am intimately familiar with the problem this KEP is addressing and I would love for this feature to be available out of the box on clusters. I am generally quite enthusiastic about this KEP!
Updated with a large number of comments addressed. The remaining issue appears to be the justification for lifecycle as a separate metric, which I will respond to in more detail shortly.
So another suggestion: instead of making this optional, we could simply expose it as a separate endpoint.
I actually like that separation better either way. Essentially it's splitting metrics about Pods ("cluster state") from the metrics that are actually about the scheduler itself.
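For illustration, a minimal editorial sketch of that split (the `/metrics/resources` path, the port, and the registry wiring are assumptions for this example, not what the KEP mandates): the pod resource metrics get their own registry and HTTP path, separate from the scheduler's own operational metrics.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Two registries: one for the scheduler's own operational metrics, one for the
	// optional, higher-cardinality pod resource ("cluster state") metrics.
	schedulerReg := prometheus.NewRegistry()
	resourcesReg := prometheus.NewRegistry()

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.HandlerFor(schedulerReg, promhttp.HandlerOpts{}))
	mux.Handle("/metrics/resources", promhttp.HandlerFor(resourcesReg, promhttp.HandlerOpts{}))

	log.Fatal(http.ListenAndServe(":10251", mux))
}
```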
Updated with nits. @ehashman, can you do a reread and add your approval if you are ok? (Nothing substantive has changed on the metrics end.)
/lgtm
Thanks for working on this. It's going to be very helpful.
/approve I did another pass, and it still looks good. I reopened a minor comment above about the cgroup example. You can remove the hold once that and the other open comments are resolved.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahg-g, dashpole, smarterclayton.
Updated with the last comment addressed; can someone remove the hold and add back the lgtm? Thanks for the review (and sorry to sig-scheduling that this dropped through the cracks; I appreciate the improvements suggested). After implementation I will include some dashboard queries created for this.
/hold cancel
@smarterclayton I know that the KEP is focused on resource metrics, but I wonder if we can follow the same approach to report per-pod scheduling latency metrics. This should make it easier to implement scheduling latency SLOs.
The pattern of having a separate endpoint to report high-cardinality optional metrics that can be captured from static caches is pretty scalable, and generally works when you want to avoid doing something expensive by default. However, if you need to track data yourself (using a prometheus histogram or gauge), you'll still pay the memory cost to keep that instance if you need higher precision than what the API captures.
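A minimal editorial sketch of that "compute on scrape from an existing cache" pattern (the `podCache` interface and the metric name are made up for illustration): a custom `prometheus.Collector` walks the cache at collection time, so no per-pod gauge instances have to be kept in memory.

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// podCache is a hypothetical stand-in for an informer-backed cache that is
// already maintained for other reasons (e.g. by the scheduler).
type podCache interface {
	// ListCPURequestsMilli returns pod name -> effective CPU request in millicores.
	ListCPURequestsMilli() map[string]int64
}

// podResourceCollector emits one sample per pod on every scrape, reading the
// cache on demand instead of keeping live gauge instances around.
type podResourceCollector struct {
	cache podCache
	desc  *prometheus.Desc
}

func newPodResourceCollector(c podCache) *podResourceCollector {
	return &podResourceCollector{
		cache: c,
		desc: prometheus.NewDesc(
			"example_pod_cpu_request_millicores", // illustrative name, not the KEP's
			"Effective CPU request per pod, read from the cache at scrape time.",
			[]string{"pod"}, nil,
		),
	}
}

func (c *podResourceCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *podResourceCollector) Collect(ch chan<- prometheus.Metric) {
	for pod, millis := range c.cache.ListCPURequestsMilli() {
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, float64(millis), pod)
	}
}
```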
@ahg-g I would suggest opening a new issue discussing how the existing metrics are not sufficient for what you are trying to achieve (I'm aware of people using them for a scheduling latency SLO, I'd like to understand the gap 🙂).
Implementation is up in kubernetes/kubernetes#94866
Is it kubernetes/kubernetes#95839? Why is it on
Wrong cut and paste, I was reviewing something else. The correct PR is in the comment. The other PR is related to this but not directly part of the core impl.
Report metrics about pod resource reservation as observed by the scheduler, quota, and kubelet subsystems, in a way that enables administrators to easily build dashboards and perform reporting on real resource usage.
Tied to #1748