What happened:
Thanos Querier aggregate functions (count, sum, min, max) over metrics are not honouring the --query.timeout setting. We pass --query.timeout=30s as an argument to the Thanos Querier deployment to keep Prometheus/Thanos/sidecar from being OOM-killed by long-running or expensive queries. A raw query honours the timeout and is terminated after 30 seconds, whereas an aggregate query keeps running beyond the configured timeout.
What you expected to happen:
Thanos Querier should terminate long-running queries that exceed the 30-second timeout, whether they are raw or aggregated.
How to reproduce it (as minimally and precisely as possible):
Run an aggregated query such as `count({__name__!=""}) by (__name__)` and check whether the query timeout is applied to long-running queries.
FYI: a total of 3M samples are available in Prometheus.
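For reference, a minimal Go sketch (not part of the original report) that sends the same aggregate query to the Querier's Prometheus-compatible HTTP API and measures wall-clock time. The address localhost:10902 is an assumption (the usual Thanos HTTP default) and should be adjusted to the actual deployment:

```go
// query_timeout_check.go
//
// Sketch only: sends the aggregate query from the report to the Querier's
// Prometheus-compatible HTTP API and measures how long the request takes.
// The endpoint address is an assumption; adjust it to your deployment.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	base := "http://localhost:10902/api/v1/query" // assumed Querier HTTP address

	params := url.Values{}
	params.Set("query", `count({__name__!=""}) by (__name__)`) // aggregate query from the report

	start := time.Now()
	resp, err := http.Get(base + "?" + params.Encode())
	elapsed := time.Since(start)
	if err != nil {
		fmt.Printf("request failed after %s: %v\n", elapsed, err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)

	// With --query.timeout=30s the request should come back within roughly 30s
	// (typically with an error); if it runs much longer, the timeout is not
	// being enforced for this query.
	fmt.Printf("status=%s elapsed=%s body=%.200s\n", resp.Status, elapsed, body)
}
```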
The screenshot below shows the working scenario, where the query is terminated:
The screenshot below shows the failing scenario, where the query is not terminated:
Hey @mohaabduvisa! 👋 I'm here to help you with your tech issues and questions as you wait for a human maintainer.
The issue you're experiencing with Thanos Querier not honoring the --query.timeout setting for aggregate functions like count, sum, min, and max could be related to how the timeout is applied in the query processing pipeline. The --query.timeout flag is implemented to set a maximum time for processing a query by configuring the Timeout field in the promql.EngineOpts struct [1].
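Purely for illustration (a general sketch, not Thanos's actual code): a timeout configured the way promql.EngineOpts.Timeout is typically works by deriving a context with a deadline that the evaluation path checks at certain points; work that runs for a long time between those checks can overshoot the deadline, which would match the reported behaviour for expensive aggregations:

```go
// timeout_sketch.go
//
// Illustrative sketch only (not Thanos code): a timeout like
// promql.EngineOpts.Timeout is usually enforced via a context deadline that
// the evaluation loop checks at checkpoints. If an expensive step runs for a
// long time between checks, the query overshoots the configured timeout.
package main

import (
	"context"
	"fmt"
	"time"
)

// evaluate simulates a query evaluation that only checks the context once per
// "chunk" of work; the coarser the chunks, the later a timeout is noticed.
func evaluate(ctx context.Context, chunks int, perChunk time.Duration) error {
	for i := 0; i < chunks; i++ {
		select {
		case <-ctx.Done():
			return ctx.Err() // timeout observed at a checkpoint
		default:
		}
		time.Sleep(perChunk) // stand-in for expensive, uninterruptible work
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()

	start := time.Now()
	err := evaluate(ctx, 4, 150*time.Millisecond)
	fmt.Printf("finished after %s, err=%v\n", time.Since(start), err)
	// The deadline is only noticed at the next checkpoint, so the total runtime
	// exceeds the 200ms deadline, analogous to an aggregate query overrunning
	// --query.timeout.
}
```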
However, there might be differences in how timeouts are enforced for raw versus aggregate queries, especially in distributed execution models. The distributed execution model can affect timeout enforcement by breaking down queries into fragments and delegating them to different components, which might lead to inconsistencies in timeout behavior for complex queries like aggregates [2][3].
To address this, you might want to check whether any specific configuration affects how aggregate functions interact with timeouts. It is also worth reviewing the distributed execution settings and ensuring that timeout configuration is applied consistently across all components involved in query processing. If the issue persists, it may require further investigation into the implementation details or potential limitations of the Thanos version you are using.
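As a possible stop-gap while this is investigated (a sketch under assumptions, not an official recommendation): the Prometheus-compatible query API also accepts a per-request timeout parameter, and the client can bound its own wait with a request context so it never blocks longer than expected even if the server keeps evaluating. The endpoint address below is hypothetical:

```go
// client_guard.go
//
// Sketch of a client-side guard (a workaround under assumptions, not a fix):
// pass the API's per-request "timeout" parameter and also cap the request
// with a context deadline so the client never waits longer than expected,
// even if the server keeps evaluating.
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	params := url.Values{}
	params.Set("query", `count({__name__!=""}) by (__name__)`)
	params.Set("timeout", "30s") // per-request evaluation timeout, capped by --query.timeout

	// Assumed Querier address; adjust to your deployment.
	endpoint := "http://localhost:10902/api/v1/query?" + params.Encode()

	// Client-side deadline, set slightly above the server timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 35*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request aborted:", err) // hit the client-side deadline
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%s body=%.200s\n", resp.Status, body)
}
```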
Environment:
Thanos: 0.35.1
prometheus: 2.52.0
golang: 1.22.5