Hello,
I'm looking for advice, or at the very least some insight into how others are leveraging autoscaling with Thanos, specifically in a router-ingestor configuration that uses the thanos-receive-controller to manage the hashring. I've run into a few quirks with Thanos while implementing KEDA for the ingestor pods. For example, similar to the user in this discussion, I've been noticing a rise in 500s and 503s when scaling events occur. I've monitored the hashring to confirm it is updated promptly during scaling events (it is). I've also noticed a complete halt in ingestion once we reach a high replica count (15+ pods), which I haven't been able to explain.
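For context, this is roughly the query we watch to catch those errors during scaling events. It assumes the default Thanos HTTP instrumentation (`http_requests_total` with `handler` and `code` labels) and a `job="thanos-receive"` label; adjust to your own labeling:

```promql
# 5xx rate on the receive handler, broken out by status code,
# so scale-event spikes (500s vs 503s) are visible separately
sum by (code) (
  rate(http_requests_total{job="thanos-receive", handler="receive", code=~"5.."}[5m])
)
```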
I know there are also concerns about scaling down too fast and accidentally losing data if pods are removed before the retention window expires, so we intentionally slow down the scale-down operation. However, I've also read in some places that scaling up can be disruptive to Thanos as well. All in all, there doesn't seem to be much documentation on autoscaling Thanos components, so I wanted to see if/how everyone else is doing it and whether there are any known best practices for pairing the two. The router component is also scaled by the Kubernetes HPA, but we wanted to use KEDA for the ingestor so we could scale on custom metrics.
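To make the setup concrete, here is a minimal sketch of the kind of KEDA `ScaledObject` I mean, with the scale-down deliberately slowed via the HPA `behavior` passthrough. All names, thresholds, and addresses are illustrative placeholders, not our production values:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: thanos-receive-ingestor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: thanos-receive-ingestor   # hypothetical StatefulSet name
  minReplicaCount: 3
  maxReplicaCount: 15
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          # Slow scale-down so a pod's TSDB blocks can be flushed/uploaded
          # before it is removed; the window should comfortably exceed the
          # local retention configured on the receivers.
          stabilizationWindowSeconds: 3600
          policies:
            - type: Pods
              value: 1
              periodSeconds: 600   # remove at most 1 pod per 10 minutes
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # placeholder
        # Example custom metric: total remote-write request rate
        query: sum(rate(http_requests_total{job="thanos-receive", handler="receive"}[5m]))
        threshold: "1000"   # illustrative per-replica target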