Pyrra causes instability of Prometheus #1153
-
Hi there, and thank you for Pyrra! I'm aware this is going to be a really vague issue report, but we've been plagued with Prometheus stability issues for the last month and have come to realise that Pyrra is the cause. Our Prometheus pod is being killed by Kubernetes and logs that it received a SIGTERM; there are no OOM messages and no probe failures on the container. This happens about every 10-40 minutes. You can see a graph here where we removed the entire Pyrra helm chart for a few days and then turned it back on today.

I'd like to dig into why this might be, but I'm not really sure where to start. It took me several days of digging to even realise it was Pyrra at "fault". Perhaps you could point me in the right direction?
-
We do see some errors in the Pyrra pods.
The SLO is defined like so:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: inmusicprofile-authorised-devices
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
    pyrra.dev/team: webservices
    pyrra.dev/ns: inmusicprofile
    pyrra.dev/service: AuthorisedDevicesService
    pyrra.dev/tier: "4"
spec:
  target: "99"
  window: 4w
  description: AuthorisedDevicesService public endpoints.
  indicator:
    ratio:
      errors:
        metric: traces_spanmetrics_latency_count{span_name=~"inmusicapi\\.v1\\.AuthorisedDevicesService\\/.*", status_code="STATUS_CODE_ERROR"}
      total:
        metric: traces_spanmetrics_latency_count{span_name=~"inmusicapi\\.v1\\.AuthorisedDevicesService\\/.*"}
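For context, a ratio SLO like this is compiled by Pyrra into recording rules whose expressions are roughly of the shape below, plus a long-range increase rule over the full 4w window. This is only a sketch: the exact rule names, grouping labels, and burn-rate windows are illustrative guesses, not Pyrra's literal output.

# Illustrative short-window error burn rate; several of these are generated
# at different window lengths for multi-window burn-rate alerting:
sum(rate(traces_spanmetrics_latency_count{span_name=~"inmusicapi\\.v1\\.AuthorisedDevicesService\\/.*", status_code="STATUS_CODE_ERROR"}[5m]))
/
sum(rate(traces_spanmetrics_latency_count{span_name=~"inmusicapi\\.v1\\.AuthorisedDevicesService\\/.*"}[5m]))

The long-range rule is the one most likely to hurt: an increase over [4w] has to read four weeks of samples for every matching series each time it is evaluated.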
-
@snikch Hello! Thanks in advance.
-
It would be interesting to see what the final recording rules look like.

Overall, I don't think the issue is specific to Pyrra; it's more about whatever recording rules you feed into your Prometheus. A similar recording rule coming from somewhere other than Pyrra could have the same impact.

That being said, I would investigate the cardinality of your traces_spanmetrics_latency_count metric, and then how expensive it is to run increase(traces_spanmetrics_latency_count[4w]) for that metric.
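A couple of ad-hoc queries can give a rough feel for both; these are generic PromQL examples, not something Pyrra ships:

# Active series behind the metric, overall and per span_name:
count(traces_spanmetrics_latency_count)
count by (span_name) (traces_spanmetrics_latency_count)

# The long-range query implied by the 4w SLO window; watch query duration and
# Prometheus memory while it runs:
sum(increase(traces_spanmetrics_latency_count{span_name=~"inmusicapi\\.v1\\.AuthorisedDevicesService\\/.*"}[4w]))

The TSDB status page in the Prometheus UI also lists the metric names and label pairs with the highest series counts, which is a quick way to see whether span_name is blowing up the cardinality.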