Pyrra causes instability of Prometheus #1153
-
Hi there, and thank you for Pyrra! I'm aware this is going to be a really vague issue report, but we've been plagued with Prometheus stability issues for the last month and have come to realise that Pyrra is the cause. Our Prometheus pod is being killed by Kubernetes and logs that it received a SIGTERM; there are no OOM messages and no probe failures on the container. This happens about every 10-40 minutes. You can see a graph here where we removed the entire Pyrra helm chart for a few days and then turned it back on today.

I'd like to dig into why this might be, but I'm not really sure where to start. It took me several days of digging to even realise it was Pyrra at "fault". Perhaps you could point me in the right direction?
-
We do see some errors in the Pyrra pods.
The SLO is defined like so:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: inmusicprofile-authorised-devices
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
    pyrra.dev/team: webservices
    pyrra.dev/ns: inmusicprofile
    pyrra.dev/service: AuthorisedDevicesService
    pyrra.dev/tier: "4"
spec:
  target: "99"
  window: 4w
  description: AuthorisedDevicesService public endpoints.
  indicator:
    ratio:
      errors:
        metric: traces_spanmetrics_latency_count{span_name=~"inmusicapi\\.v1\\.AuthorisedDevicesService\\/.*", status_code="STATUS_CODE_ERROR"}
      total:
        metric: traces_spanmetrics_latency_count{span_name=~"inmusicapi\\.v1\\.AuthorisedDevicesService\\/.*"}
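For context, a ratio SLO like this is compiled by Pyrra into recording rules whose expressions are roughly of the shape below, plus a long-range increase rule over the full 4w window. This is only a sketch: the exact rule names, grouping labels, and burn-rate windows are illustrative guesses, not Pyrra's literal output.

# Illustrative short-window error burn rate; several of these are generated
# at different window lengths for multi-window burn-rate alerting:
sum(rate(traces_spanmetrics_latency_count{span_name=~"inmusicapi\\.v1\\.AuthorisedDevicesService\\/.*", status_code="STATUS_CODE_ERROR"}[5m]))
/
sum(rate(traces_spanmetrics_latency_count{span_name=~"inmusicapi\\.v1\\.AuthorisedDevicesService\\/.*"}[5m]))

The long-range rule is the one most likely to hurt: an increase over [4w] has to read four weeks of samples for every matching series each time it is evaluated.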
-
@snikch Hello! Thanks in advance.
-
It would be interesting to see what the final recording rules look like.

Overall, I don't think the issue is specific to Pyrra; it's more about whatever recording rules you feed into your Prometheus. A similar recording rule coming from somewhere other than Pyrra could have the same impact.

That being said, I would investigate the cardinality of your traces_spanmetrics_latency_count metric, and then how expensive it is to run increase(traces_spanmetrics_latency_count[4w]) for that metric.
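A couple of ad-hoc queries can give a rough feel for both; these are generic PromQL examples, not something Pyrra ships:

# Active series behind the metric, overall and per span_name:
count(traces_spanmetrics_latency_count)
count by (span_name) (traces_spanmetrics_latency_count)

# The long-range query implied by the 4w SLO window; watch query duration and
# Prometheus memory while it runs:
sum(increase(traces_spanmetrics_latency_count{span_name=~"inmusicapi\\.v1\\.AuthorisedDevicesService\\/.*"}[4w]))

The TSDB status page in the Prometheus UI also lists the metric names and label pairs with the highest series counts, which is a quick way to see whether span_name is blowing up the cardinality.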