
fix: Add back the helm suspended metric #1111

Closed
wants to merge 1 commit

Conversation

jmleddy

@jmleddy jmleddy commented Nov 18, 2024

At some point we had this metric and then we lost it. We discovered this after we started suspending a bunch of things but could not get the metric to appear, meaning we are currently in a quasi-state of having releases suspended across all our clusters that we don't know about.

@stefanprodan
Member

stefanprodan commented Nov 19, 2024

The suspend label has been included in the gotk_resource_info metric, which is provided by kube-state-metrics. I recommend you migrate your alerts and dashboards to the new metric, as the old ones were deprecated long ago. Docs here: https://fluxcd.io/flux/monitoring/metrics/#resource-metrics
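(For anyone migrating, a minimal PrometheusRule sketch built on that metric might look like the following. The metric name comes from the Flux docs linked above; the alert name, duration, and the exact label names (`suspended`, `exported_namespace`) depend on the kube-state-metrics config you ship, so treat them as assumptions.)

```yaml
# Minimal sketch of an alert on the new metric, assuming the label names
# used by the flux2-monitoring-example kube-state-metrics config.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flux-suspended-resources
spec:
  groups:
    - name: flux-suspend
      rules:
        - alert: FluxHelmReleaseSuspended
          # gotk_resource_info is exported by kube-state-metrics, not by the controllers
          expr: gotk_resource_info{customresource_kind="HelmRelease", suspended="true"} == 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "HelmRelease {{ $labels.exported_namespace }}/{{ $labels.name }} is suspended"
```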

@jmleddy
Author

jmleddy commented Nov 19, 2024

This is extra toil for us to get back a metric that we were alerting on and have completely lost. We have no idea how many Helm releases are paused and not applying resource request updates or whatever. And it's inconsistently applied: why do our Kustomizations still report when they are stalled, but our Helm releases don't? I realize that there are different maintainers who have different opinions about which metrics should be exposed, but to the end user this all just looks like "flux", since all the controllers come with Flux.

For anyone who might find this PR and wonders what the kube-state-metrics config is, it seems to be here.
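To give a rough idea of the shape of that config, here is a trimmed-down sketch of a kube-state-metrics CustomResourceState entry for HelmRelease. The field names follow the kube-state-metrics CustomResourceState API; the group/version and the exact label paths Flux uses are assumptions, so check the flux2-monitoring-example repo for the real thing.

```yaml
# Sketch of a CustomResourceState config that would produce a
# gotk_resource_info metric for HelmRelease objects (trimmed down;
# the real config covers all Flux kinds and more labels).
kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: helm.toolkit.fluxcd.io
        version: v2          # may be v2beta2 on older Flux versions
        kind: HelmRelease
      metricNamePrefix: gotk
      metrics:
        - name: resource_info
          help: "The current state of a Flux HelmRelease resource."
          each:
            type: Info
            info:
              labelsFromPath:
                name: [metadata, name]
                suspended: [spec, suspend]
```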

@stefanprodan
Member

stefanprodan commented Nov 19, 2024

> I realize that there are different maintainers that have different opinions

The core maintainers make the decisions about the common behaviour of all Flux controllers, and metrics fall into this category. We made the decision to drop the resource-specific metrics from the controllers' exporters and rely on kube-prometheus-stack. The deprecation notice can be found here: https://fluxcd.io/flux/monitoring/metrics/#warning-deprecated-resource-metrics

This controller was the last one promoted to GA, so we removed the deprecated metrics from it, but we should've done that in all controllers. We'll make sure the old metrics are removed across all Flux components in the next release.

@jmleddy
Author

jmleddy commented Nov 19, 2024

Thank you, though I would prefer you add back this metric everywhere rather than requiring everyone to add 275 lines of YAML to their kube-prometheus-stack Helm chart. The inconsistency is even worse than the first decision, as it feels uneven, and it probably also led to slower detection of the issue on our side, since we were still finding suspends "sometimes". So I'm looking forward to a consistent view from the Flux controller maintainers here :)

@jmleddy
Author

jmleddy commented Nov 19, 2024

Also, is this you? fluxcd/flux2-monitoring-example#35 (comment)

@stefanprodan
Member

@jmleddy you can read our motivation in this issue: fluxcd/flux2#4128

If you don't like the kube-state-metrics approach, feel free to use the Flux Operator; the tradeoff is that you can't customise those metrics in any way.

@jmleddy
Author

jmleddy commented Nov 19, 2024

Okay, I thought the controller was running as part of the operator; I must not have my kubeconfig right. I'll run it as part of ksm (kube-state-metrics), thanks!

@jmleddy jmleddy closed this Nov 19, 2024