
Add limit for max range query splits by interval #6458

Draft: wants to merge 3 commits into base: master

Conversation

afhassan
Contributor

What this PR does:
Cortex only supports using a static interval to split range queries. This PR adds a new limit, split_queries_by_interval_max_splits, which dynamically changes the split interval to a multiple of split_queries_by_interval so that the total number of splits stays at or below the configured maximum.

Example:
split_queries_by_interval = 24h
split_queries_by_interval_max_splits = 30
A 30-day range query is split into 30 queries using a 24h interval
A 40-day range query is split into 20 queries using a 48h interval
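
As a rough sketch of the calculation this example implies, assuming the interval is simply grown to the smallest multiple of split_queries_by_interval that keeps the split count at or below the limit (function and variable names below are illustrative, not the PR's actual code):

package main

import (
	"fmt"
	"math"
	"time"
)

// dynamicSplitInterval is an illustrative sketch, not the PR's implementation:
// it widens the base interval to the smallest multiple that keeps the number
// of range splits at or below maxSplits.
func dynamicSplitInterval(queryRange, baseInterval time.Duration, maxSplits int) time.Duration {
	if maxSplits <= 0 || queryRange <= 0 {
		return baseInterval // limit disabled, keep the static interval
	}
	splits := math.Ceil(float64(queryRange) / float64(baseInterval))
	if int(splits) <= maxSplits {
		return baseInterval
	}
	multiple := math.Ceil(splits / float64(maxSplits))
	return baseInterval * time.Duration(multiple)
}

func main() {
	// Reproduces the example above: 30 days -> 24h interval (30 splits),
	// 40 days -> 48h interval (20 splits).
	fmt.Println(dynamicSplitInterval(30*24*time.Hour, 24*time.Hour, 30)) // 24h0m0s
	fmt.Println(dynamicSplitInterval(40*24*time.Hour, 24*time.Hour, 30)) // 48h0m0s
}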

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Ahmed Hassan <afayekhassan@gmail.com>
staticIntervalFn := func(_ tripperware.Request) time.Duration { return cfg.SplitQueriesByInterval }
queryRangeMiddleware = append(queryRangeMiddleware, tripperware.InstrumentMiddleware("split_by_interval", metrics), SplitByIntervalMiddleware(staticIntervalFn, limits, prometheusCodec, registerer))
intervalFn := func(_ tripperware.Request) time.Duration { return cfg.SplitQueriesByInterval }
if cfg.SplitQueriesByIntervalMaxSplits != 0 {
Contributor

Shouldn't the limit be applied to both range splits and vertical splits?

func (s shardBy) Do(ctx context.Context, r Request) (Response, error) {

Contributor Author
@afhassan Dec 30, 2024

Technically this sets a limit on the combined total of range and vertical splits for a given query. The number of vertical shards is static, so the maximum number of splits for a given query becomes split_queries_by_interval_max_splits x query_vertical_shard_size (for example, 30 x 3 = 90 sub-queries at most). Because of this, adding a separate limit for vertical sharding while the number of vertical shards is a static config would be redundant; it is already bounded.

Signed-off-by: Ahmed Hassan <afayekhassan@gmail.com>
@pull-request-size pull-request-size bot added size/M and removed size/S labels Dec 31, 2024
@yeya24
Contributor

yeya24 commented Dec 31, 2024

Instead of changing the split interval using a max number of split queries, can we try to combine it with an estimate of the data to fetch?

For example, a query up[30d] is very expensive to split into 30 splits, as each split query still fetches 30 days of data, so the 30 splits end up fetching 900 days of data.

Instead of having a limit on total splits, should we use total days of data to fetch?

@afhassan
Contributor Author

Instead of changing the split interval using a max number of split queries, can we try to combine it with an estimate of the data to fetch?

For example, a query up[30d] is very expensive to split into 30 splits, as each split query still fetches 30 days of data, so the 30 splits end up fetching 900 days of data.

Instead of having a limit on total splits, should we use total days of data to fetch?

That's a good idea. I can add a new limit for the total hours of data fetched and adjust the interval so that it is not exceeded.

We can still keep the max number of splits, since it gives more flexibility to limit the number of shards for queries with a long time range even if they don't fetch many days of data, like the example you mentioned.
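
As a rough sketch of what such a limit could look like (assuming the same time and math imports as the sketch in the description; the name and the per-split overhead estimate are assumptions, not the PR's code), the interval keeps growing in multiples of the base interval until the estimated data fetched fits the budget:

// intervalForDataBudget is an illustrative sketch: widen the interval in
// multiples of the base interval until the estimated total data fetched by all
// splits fits within maxData. extraPerSplit approximates the data every split
// re-fetches, e.g. matrix selector ranges plus the lookback delta.
func intervalForDataBudget(queryRange, baseInterval, extraPerSplit, maxData time.Duration) time.Duration {
	interval := baseInterval
	for {
		splits := math.Ceil(float64(queryRange) / float64(interval))
		fetched := time.Duration(splits) * (interval + extraPerSplit)
		if fetched <= maxData || splits <= 1 {
			return interval
		}
		interval += baseInterval // fewer splits means less data re-fetched overall
	}
}

For the up[30d] example, extraPerSplit would be roughly 30 days, so the interval keeps widening until only a handful of splits remain.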

Signed-off-by: Ahmed Hassan <afayekhassan@gmail.com>
@pull-request-size pull-request-size bot added size/L and removed size/M labels Jan 16, 2025
# If vertical sharding is enabled for a query, the combined total number of
# vertical and interval shards is kept below this limit
# CLI flag: -querier.split-queries-by-interval-max-splits
[split_queries_by_interval_max_splits: <int> | default = 0]
Contributor

Should this be generated by running make doc?

@@ -62,6 +62,9 @@ type Config struct {
// Limit of number of steps allowed for every subquery expression in a query.
MaxSubQuerySteps int64 `yaml:"max_subquery_steps"`

// Max number of days of data fetched for a query, used to calculate appropriate interval and vertical shard size.
MaxDaysOfDataFetched int `yaml:"max_days_of_data_fetched"`
Contributor

Does MaxDurationOfDataFetchedFromStoragePerQuery sound better?
Should this be part of QueryRange configuration?

@@ -131,6 +134,7 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
f.Int64Var(&cfg.MaxSubQuerySteps, "querier.max-subquery-steps", 0, "Max number of steps allowed for every subquery expression in query. Number of steps is calculated using subquery range / step. A value > 0 enables it.")
f.BoolVar(&cfg.IgnoreMaxQueryLength, "querier.ignore-max-query-length", false, "If enabled, ignore max query length check at Querier select method. Users can choose to ignore it since the validation can be done before Querier evaluation like at Query Frontend or Ruler.")
f.BoolVar(&cfg.EnablePromQLExperimentalFunctions, "querier.enable-promql-experimental-functions", false, "[Experimental] If true, experimental promQL functions are enabled.")
f.IntVar(&cfg.MaxDaysOfDataFetched, "querier.max-days-of-data-fetched", 0, "Max number of days of data fetched for a query. This can be used to calculate appropriate interval and vertical shard size dynamically.")
Contributor

Could more details be added to the explanation? Also add "0 to disable".

CacheResults bool `yaml:"cache_results"`
MaxRetries int `yaml:"max_retries"`
SplitQueriesByInterval time.Duration `yaml:"split_queries_by_interval"`
SplitQueriesByIntervalMaxSplits int `yaml:"split_queries_by_interval_max_splits"`
Contributor

Maybe these both could be nested inside another config called DynamicQuerySplits?
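
Something like the following, sketched under the assumption that both knobs move into one nested block (field and YAML names are made up for illustration):

// Hypothetical nested configuration; names are illustrative only.
type DynamicQuerySplitsConfig struct {
	MaxSplitsPerQuery              int           `yaml:"max_splits_per_query"`
	MaxFetchedDataDurationPerQuery time.Duration `yaml:"max_fetched_data_duration_per_query"`
}

// The two flat fields on the query range Config would then be replaced by:
// DynamicQuerySplits DynamicQuerySplitsConfig `yaml:"dynamic_query_splits"`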

)

type IntervalFn func(r tripperware.Request) time.Duration
// dayMillis is the 24h block range in milliseconds.
Contributor

The 24h block range is configurable in Cortex. Do we have to tie it to the block range? Could the configuration itself be of type time.Duration?

reqs, err := splitQuery(r, s.interval(r))
interval, err := s.interval(ctx, r)
if err != nil {
return nil, httpgrpc.Errorf(http.StatusBadRequest, err.Error())
Contributor

nit: This should be an InternalServerError
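
The suggested change would look something like this, reusing the httpgrpc helper already present in the hunk above:

if err != nil {
	// An error computing the split interval is a server-side failure,
	// not a malformed client request.
	return nil, httpgrpc.Errorf(http.StatusInternalServerError, err.Error())
}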

if err != nil {
return nil, httpgrpc.Errorf(http.StatusBadRequest, err.Error())
}
s.splitByCounter.Add(float64(len(reqs)))

stats := querier_stats.FromContext(ctx)
Contributor

What are the stats used for? Are they only used for logging in the query-frontend?

}
}

func dynamicIntervalFn(cfg Config, limits tripperware.Limits, queryAnalyzer querysharding.Analyzer, queryStoreAfter time.Duration, lookbackDelta time.Duration, maxDaysOfDataFetched int) func(ctx context.Context, r tripperware.Request) (time.Duration, error) {
Contributor

Could all of these be passed through the cfg?
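
One possible shape of that suggestion, assuming the extra parameters become (hypothetical) fields on the Config rather than separate arguments:

// Sketch only: queryAnalyzer, queryStoreAfter, lookbackDelta and the max-data
// limit are assumed to live on cfg instead of being passed separately.
func dynamicIntervalFn(cfg Config, limits tripperware.Limits) func(ctx context.Context, r tripperware.Request) (time.Duration, error) {
	return func(ctx context.Context, r tripperware.Request) (time.Duration, error) {
		// ... read cfg.QueryAnalyzer, cfg.QueryStoreAfter, cfg.LookbackDelta and
		// cfg.MaxDaysOfDataFetched (hypothetical fields) here ...
		return cfg.SplitQueriesByInterval, nil
	}
}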

return cfg.SplitQueriesByInterval, err
}

queryDayRange := int((r.GetEnd() / dayMillis) - (r.GetStart() / dayMillis) + 1)
Contributor

Could we avoid using day here? Other users of Cortex might choose to split by multiple days or by less than a day.
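
One way to avoid the hard-coded day, sketched only as an idea, is to measure the range in units of the configured split interval instead:

// Illustrative only: count buckets of the configured base interval instead of days.
intervalMillis := cfg.SplitQueriesByInterval.Milliseconds()
queryIntervalRange := int((r.GetEnd()/intervalMillis) - (r.GetStart()/intervalMillis) + 1)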

return int64(d / (time.Millisecond / time.Nanosecond))
}

func GetTimeRangesForSelector(start, end int64, lookbackDelta time.Duration, n *parser.VectorSelector, path []parser.Node, evalRange time.Duration) (int64, int64) {
Contributor

Could you add some tests for these util methods?
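
A minimal skeleton of such a test (assuming the standard testing, time, and prometheus promql/parser imports; the assertion is deliberately weak since the exact expected values depend on the implementation):

func TestGetTimeRangesForSelector(t *testing.T) {
	// A plain vector selector with no offset or @ modifier; the helper should only
	// widen the window by the lookback delta, so start must not end up after end.
	selector := &parser.VectorSelector{Name: "up"}
	start := int64(0)
	end := int64(time.Hour / time.Millisecond)
	gotStart, gotEnd := GetTimeRangesForSelector(start, end, 5*time.Minute, selector, nil, 0)
	if gotStart > gotEnd {
		t.Fatalf("expected start <= end, got start=%d end=%d", gotStart, gotEnd)
	}
}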
