kep: Pod resource metrics #1916

Merged · 1 commit merged into kubernetes:master on Oct 16, 2020

Conversation

smarterclayton
Contributor

Report metrics about pod resource reservation as observed by the scheduler, quota, and kubelet subsystems, in a way that lets administrators easily build dashboards and report on real resource usage.

Tied to #1748
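
To make the goal concrete, here is a purely illustrative sketch of the kind of series this could expose. The metric names and labels below are assumptions for illustration, drawn from the direction described in the KEP text, not the final implementation.

```go
// Illustrative only: assumed metric names and labels, printed in Prometheus
// exposition format so the "dashboards and reporting" goal is concrete.
package main

import "fmt"

func main() {
	samples := []string{
		`kube_pod_resource_request{namespace="default",pod="web-1",node="node-a",resource="cpu",unit="cores"} 0.3`,
		`kube_pod_resource_request{namespace="default",pod="web-1",node="node-a",resource="memory",unit="bytes"} 2.68435456e+08`,
		`kube_pod_resource_limit{namespace="default",pod="web-1",node="node-a",resource="cpu",unit="cores"} 0.5`,
	}
	// With series shaped like this, an administrator could aggregate by
	// namespace or node, e.g. sum by (namespace) (kube_pod_resource_request{resource="cpu"}).
	for _, s := range samples {
		fmt.Println(s)
	}
}
```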

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jul 30, 2020
@k8s-ci-robot k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Jul 30, 2020
@ehashman
Member

/assign

I'll TAL next week when I'm catching up on my Kubernetes backlog.

Member

@brancz brancz left a comment

There are some details that still need to be figured out, I think, but the use cases and intentions are sound and important.

Maybe this is different for people very familiar with the scheduler code base, but I am missing a bit more implementation detail. For example, I wonder whether we would actually weave this into the existing workflow the scheduler performs, or whether we would just reuse the scheduler's calculations, roughly speaking, as a library. If the latter, then maybe a first step for quick iteration could be building this as a separate component before integrating it more tightly.

* The kubelet will create a cgroup for the pod that expects to get roughly `300m` cores, but the container cgroup created for `copy-files` will request `100m` while that init container is running
* The quota subsystem will block this pod from being created if there is less than `300m` available.

Once a pod is created, it passes through four high-level lifecycle phases as seen by the total Kubernetes system. The first phase is `Pending` - the time before the pod is scheduled to a node. The second phase is `Initializing` - the time between when the pod is scheduled on a node and when all init containers have completed successfully and the `Initialized` condition on the pod is set to `true`. A pod without init containers may only remain in this phase very briefly. The third phase is `Running`, which begins when all init containers have completed and continues until the pod is deleted or reaches a terminal state of success or failure. The final phase is `Completed`, which means the pod has no running containers, all resources have been released or cleaned up, and the pod will never again consume those resources.
Member

I think this will need some deeper thought. As it stands this sounds like we will have a lot of very short lived series, which we should prevent.

Contributor Author

Hrm. Short-lived pods already have this problem for conditions, so this should be strictly less churn than almost every other kube-state-metrics (ksm) style metric that exposes conditions (which change more frequently than this). Maybe we should remove other short-lived series in order to do this, if the short-lived series budget is a concern?

Member

Because of the proposal https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/20200415-cardinality-enforcement.md, the cluster admins/vendors already have tools to prevent any kind of series explosion if that is what you meant @brancz ?
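
As background for the `300m` figure in the excerpt quoted at the top of this thread, here is a minimal sketch of the effective pod request calculation, assuming the standard rule that a pod reserves the larger of its biggest init container request and the sum of its app container requests (pod overhead ignored). The concrete container values below are hypothetical.

```go
package main

import "fmt"

// effectiveCPURequestMillis returns the pod-level CPU reservation in millicores:
// the larger of the biggest init container request and the sum of app container
// requests.
func effectiveCPURequestMillis(initRequests, appRequests []int64) int64 {
	var maxInit, sumApp int64
	for _, r := range initRequests {
		if r > maxInit {
			maxInit = r
		}
	}
	for _, r := range appRequests {
		sumApp += r
	}
	if maxInit > sumApp {
		return maxInit
	}
	return sumApp
}

func main() {
	// Hypothetical pod: a "copy-files" init container requesting 100m and two app
	// containers requesting 200m and 100m. The pod-level cgroup is sized for roughly
	// 300m, while the init container's own cgroup requests only 100m while it runs.
	fmt.Printf("%dm\n", effectiveCPURequestMillis([]int64{100}, []int64{200, 100})) // prints 300m
}
```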

@smarterclayton
Contributor Author

smarterclayton commented Jul 31, 2020

wonder whether we would actually weave this into the existing workflow the scheduler performs, or whether we would just reuse the scheduler's calculations, roughly speaking, as a library.

The scheduler is tens of related workflows with a core loop for placement, but I don't think metrics calculation is sufficient to justify a new top-level component (in this context). In general I'm -1 on adding metrics-gathering components that must cache all pods - I don't have much truck with microservices for their own sake, especially when the scheduler's domain is "the resource model" and, I would argue, the metrics it uses to make decisions. Splitting increases deployment complexity - while this is positioned as optional, I would expect every distribution of Kube that has some form of metrics to enable it. I'll list some alternative placements, such as KCM (the scheduler is better) and a separate component (extra cost to all deployments, no real reason to separate).

Member

@wojtek-t wojtek-t left a comment

@ahg-g - FYI

@brancz
Member

brancz commented Aug 3, 2020

Splitting increases deployment complexity - while this is positioned as optional, I would expect every distribution of Kube that has some form of metrics to enable it.

This is largely what I was getting at. I haven't checked recently, but I think most providers don't give access to the metrics endpoints of the scheduler, which is why I was hoping it would still be possible for those people to make use of these metrics somehow. Ultimately that's the provider's choice, but I would want to maximize the usefulness people can get out of this with any Kubernetes cluster.

@serathius
Contributor

/cc

@k8s-ci-robot k8s-ci-robot requested a review from serathius August 6, 2020 18:55
Member

@ehashman ehashman left a comment

Overall, as a cluster administrator I am intimately familiar with the problem this KEP is addressing and I would love for this feature to be available out of the box on clusters. I am generally quite enthusiastic about this KEP!

@k8s-ci-robot k8s-ci-robot requested review from pigletfly and removed request for pigletfly August 27, 2020 15:27
@smarterclayton
Contributor Author

Updated with a large number of comments addressed. The remaining issue appears to be the justification for lifecycle as a separate metric, which I will respond to in more detail shortly.

@smarterclayton
Contributor Author

So another suggestion - instead of making this optional, we could simply expose it as an endpoint /metrics/resources on the scheduler, and anyone who wants the data can scrape it. That requires no flags and incurs no additional cost to the scheduler (the cost is only incurred when scraping the metrics, as having the collector is free unless metrics are scraped).
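
A minimal sketch of what that could look like, assuming a dedicated registry, a hypothetical collector type, and a placeholder port (this is not the actual kube-scheduler wiring): the pod-level series are served from their own handler, and all work happens inside `Collect`, so an endpoint that is never scraped costs nothing.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Assumed metric name and labels, for illustration only.
var requestDesc = prometheus.NewDesc(
	"kube_pod_resource_request",
	"Resources requested by pods, as seen by the scheduler's resource model.",
	[]string{"namespace", "pod", "resource", "unit"}, nil,
)

// podResourceCollector computes series at scrape time from whatever snapshot of
// pods the scheduler already holds; here the snapshot is a hard-coded stand-in.
type podResourceCollector struct{}

func (c podResourceCollector) Describe(ch chan<- *prometheus.Desc) { ch <- requestDesc }

func (c podResourceCollector) Collect(ch chan<- prometheus.Metric) {
	// In a real scheduler this would iterate the informer cache; no state is
	// kept between scrapes, so an unscraped endpoint costs nothing.
	ch <- prometheus.MustNewConstMetric(requestDesc, prometheus.GaugeValue,
		0.3, "default", "example-pod", "cpu", "cores")
}

func main() {
	// Dedicated registry so these pod-level series stay separate from the
	// component's own /metrics.
	reg := prometheus.NewRegistry()
	reg.MustRegister(podResourceCollector{})

	mux := http.NewServeMux()
	mux.Handle("/metrics/resources", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	_ = http.ListenAndServe(":10251", mux) // placeholder port
}
```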

@brancz
Member

brancz commented Sep 18, 2020

I actually like that separation better either way. Essentially it's splitting metrics about Pods ("cluster state") from the metrics that are actually about the scheduler itself.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 15, 2020
@smarterclayton
Contributor Author

Updated with nits. @ehashman can you do a reread and add your approval if you are ok? (Nothing substantive has changed on the metrics end.)

Member

@alculquicondor alculquicondor left a comment

/lgtm

Thanks for working on this. It's going to be very helpful.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 15, 2020
@dashpole
Contributor

/approve

I did another pass, and it still looks good. I reopened a minor comment above about the cgroup example. You can remove the hold once that and the other open comments are resolved.
Please include me in the discussion about container-granularity metrics :).

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, dashpole, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 15, 2020
Report metrics about pod resource reservation as observed by
scheduler, quota, and kubelet subsystems in a way that makes
administrators able to easily build dashboards and perform
reporting on real resource usage.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 16, 2020
@smarterclayton
Contributor Author

Updated with the last comment; can someone remove the hold and add back lgtm? Thanks for the review (and sorry to sig-scheduling that this slipped through the cracks, I appreciate the improvements suggested). After implementation I will include some dashboard queries created for this.

@ahg-g
Member

ahg-g commented Oct 16, 2020

/hold cancel
/lgtm

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 16, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 16, 2020
@k8s-ci-robot k8s-ci-robot merged commit 12266f8 into kubernetes:master Oct 16, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Oct 16, 2020
@ahg-g
Member

ahg-g commented Oct 19, 2020

@smarterclayton I know that the KEP is focused on resource metrics, but I wonder if we can follow the same approach to report per-pod scheduling latency metrics. This should make it easier to implement scheduling latency SLOs.

@smarterclayton
Contributor Author

but I wonder if we can follow the same approach to report per-pod scheduling latency metrics. This should make it easier to implement scheduling latency SLOs.

The pattern of having a separate endpoint to report high-cardinality optional metrics that can be computed from static caches is pretty scalable, and it generally works when you want to avoid doing something expensive by default. However, if you need to track data yourself (using a Prometheus histogram or gauge), you'll still pay the memory cost of keeping that instance around if you need higher precision than what the API captures.
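
To illustrate that trade-off, here is a hedged sketch with hypothetical metric names: a per-pod histogram you maintain yourself stays resident in the registry between scrapes, whereas a const metric built inside a collector's `Collect` from data the component already caches only costs something when the endpoint is scraped.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Option A: self-tracked data. One histogram child per observed pod is retained
// in memory for as long as the series is needed, independent of scraping.
var schedulingLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "scheduler_pod_scheduling_duration_seconds",
		Help:    "Per-pod scheduling latency (hypothetical).",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"namespace", "pod"},
)

// Option B: derive a value at scrape time from state that already exists (for
// example, timestamps on the Pod object), emitting a const metric from within a
// prometheus.Collector instead of keeping an instrument per pod.
var schedulingLatencyDesc = prometheus.NewDesc(
	"kube_pod_scheduling_duration_seconds",
	"Per-pod scheduling latency computed on scrape (hypothetical).",
	[]string{"namespace", "pod"}, nil,
)

func emitAtScrape(ch chan<- prometheus.Metric, namespace, pod string, seconds float64) {
	ch <- prometheus.MustNewConstMetric(schedulingLatencyDesc, prometheus.GaugeValue,
		seconds, namespace, pod)
}

func main() {
	// Option A pays memory per pod but records whatever precision you observe.
	schedulingLatency.WithLabelValues("default", "web-1").Observe(0.042)

	// Option B is only as precise as the data the API already captures, which is
	// exactly the caveat raised in the comment above.
	_ = emitAtScrape // called from a Collector's Collect method in practice
}
```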

@brancz
Member

brancz commented Oct 20, 2020

@ahg-g I would suggest opening a new issue discussing how the existing metrics are not sufficient for what you are trying to achieve (I'm aware of people using them for a scheduling latency SLO, I'd like to understand the gap 🙂 ).

@smarterclayton
Contributor Author

smarterclayton commented Nov 9, 2020

Implementation is up in kubernetes/kubernetes#94866

@alculquicondor
Member

Is it kubernetes/kubernetes#95839? Why is it on pkg/kubelet?

@smarterclayton
Contributor Author

Wrong cut and paste, I was reviewing something else. The correct PR is in the comment. The other PR is related to this but not directly part of the core implementation.

Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/kep - Categorizes KEP tracking issues and PRs modifying the KEP directory.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
sig/architecture - Categorizes an issue or PR as relevant to SIG Architecture.
sig/instrumentation - Categorizes an issue or PR as relevant to SIG Instrumentation.
sig/scheduling - Categorizes an issue or PR as relevant to SIG Scheduling.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.