Thanos Querier aggregate functions (count, sum, min, max) do not honour the --query.timeout setting #7846

Open
mohaabduvisa opened this issue Oct 21, 2024 · 1 comment

@mohaabduvisa

What happened:
Aggregate functions (count, sum, min, max) in Thanos Querier do not honour the --query.timeout setting. We pass --query.timeout=30s as an argument in the Thanos Querier deployment so that long-running or expensive queries do not get Prometheus, Thanos, or the sidecar OOM-killed. A raw query honours the timeout and is terminated once it runs past 30 seconds, whereas an aggregate query is not terminated after the configured timeout.

What you expected to happen:
Thanos Querier should terminate any query that runs longer than 30 seconds, whether it is a raw or an aggregate query.

How to reproduce it (as minimally and precisely as possible):
Run an aggregate query such as count({__name__!=""}) by (__name__) and check whether the query timeout is applied to the long-running query.

FYI: a total of 3M samples are available in Prometheus.

  1. The screenshot below shows the working scenario, where the query is terminated:
     [screenshot: timeout_wrkng]
  2. The screenshot below shows the failing scenario, where the query is not terminated:
     [screenshot: timeout_not_wrkng]

Environment:
Thanos: 0.35.1
prometheus: 2.52.0
golang: 1.22.5

dosubot bot commented Oct 21, 2024

Hey @mohaabduvisa! 👋 I'm here to help you with your tech issues and questions as you wait for a human maintainer.

The issue you're experiencing with Thanos Querier not honoring the --query.timeout setting for aggregate functions like count, sum, min, and max could be related to how the timeout is applied in the query processing pipeline. The --query.timeout flag is implemented to set a maximum time for processing a query by configuring the Timeout field in the promql.EngineOpts struct [1].

However, there might be differences in how timeouts are enforced for raw versus aggregate queries, especially in distributed execution models. The distributed execution model can affect timeout enforcement by breaking down queries into fragments and delegating them to different components, which might lead to inconsistencies in timeout behavior for complex queries like aggregates [2][3].

To address this, you might want to check if there are any specific configurations or settings that could affect the behavior of aggregate functions with respect to timeouts. Additionally, consider reviewing the distributed execution settings and ensure that the timeout configurations are consistently applied across all components involved in query processing. If the issue persists, it might require further investigation into the specific implementation details or potential limitations in the current version of Thanos you are using.
