Proposal to change default query memory limits to allow full cluster utilization #6573

erichwang · 2021-01-12T06:20:25Z

erichwang
Jan 12, 2021
Collaborator

Problem Statement:
We've often seen users getting confused on how to manage Trino memory limits. In particular, over the last so many weeks, I've observed at least several instances of users complaining about hitting max cluster-wide query limits with the default cluster settings, and not knowing how to proceed with tuning. Upon deeper investigation, it seems that it may actually be unreasonable to ask users to configure some of these limits because of the relationships to certain environmental properties. For example: consider the default config for the total cluster-wide memory limit per query of 40GB. This limit encompasses memory reservations for user memory, but also system memory (which users don't intuit very well).

In order to set this value, how does one estimate how much system memory a query might take at any given instant?

The funny thing is that this depends upon a number of factors:
(A) The number of nodes in the cluster
The more nodes that you have, the more concurrency you can have, and thus the more system memory can be consumed. We've seen users complain about them hitting memory limits with only pure table scan queries. When node auto-scaling is enacted, this becomes next to impossible to manage.
(B) The number of splits generated by the connector
As in (A), the more splits that you have, the more concurrency you can have, which can increase the burst system memory size required.
(C) The split scheduling policy of the scheduler
The scheduler dictates how and when splits get scheduled, which affect the instantaneous system memory consumption of the cluster. Users do not have a huge amount of control over this today.
(D) Other concurrent queries in the cluster
The irony here is that if you have more queries running at the same time, less concurrency gets allocated per query (and thus system memory), and subsequently your query that was failing before on an empty cluster might actually succeed on a saturated cluster.

Given these things, it seems like this limit threshold does create a somewhat unpredictable experience for end users. Something that could be working at one point, could all of a sudden start failing in another instant.

Why were the query.max-memory, query.max-total-memory, query.max-memory-per-node, and query.max-total-memory-per-node limits added in the past?

In the past, we needed these two properties to be able to guarantee the behavior of Trino reserved memory pools: whereby a cluster under memory pressure could resort to using the reserve memory pools to ensure that at least one query could proceed at any given time (given that we know the max amount of memory required for a query at a per-node and cluster-wide level).

However, as of November of 2019, the reserved memory pool implementation has effectively been disabled by default, and the ability to enable it is only possible via a deprecated config option (#2006).

Proposal:
I'm proposing that we change the Trino default memory settings so that a query can leverage the entire cluster (if possible). This could potentially look like the following:

query.max-memory:
20GB => INFINITE (which will allow failing via per-node limits)

query.max-total-memory:
query.max-memory * 2 => INFINITE (which will allow failing via per-node limits)

query.max-memory-per-node:
JVM max memory * 0.1 => JVM max memory * 0.7

query.max-total-memory-per-node:
JVM max memory * 0.3 => JVM max memory * 0.7

memory.heap-headroom-per-node:
JVM max memory * 0.3 => SAME

query.low-memory-killer.policy:
total-reservation-on-blocked-nodes => SAME

Users could in theory tune these values down IF they want to enforce memory policies on query limits, however my guess is that few users actually do so (someone please correct if this is wrong). And if they do enforce limits, my guess is that these are probably more at the user memory limit level, rather than the user+system memory limit level. I don't see how anyone could reasonably set system memory limit levels.

Future:

Longer term, if there are not that many users utilizing the memory pools implementation option (experimental.reserved-pool-enabled=true), we should consider getting rid of the whole memory pools concept altogether. This would dramatically simplify the mental load when discussing the topic of memory. In doing so, we may not either need the whole separation between user and system memory tracking either (although there is no harm in keeping it around if we like).

I'm curious to hear if anyone else has any thoughts on this?

findepi · 2021-01-14T13:05:47Z

findepi
Jan 14, 2021
Collaborator

Why were the query.max-memory, query.max-total-memory, query.max-memory-per-node, and query.max-total-memory-per-node limits added in the past?

If user memory was comparable to e.g. file sizes of the input files, it could be something comprehendible to me.
Since it's not (internal representation is different, there are various processing overheaders, the memory is consumed on multiple stages, and the original files are often compressed), the "user memory" is just an abstract metric. More data means more memory, but I cannot attach and special meaning to a particular value.

Because of this, I see no benefit from distinguishing user and system memory at all. If anyone asks me about how to set configure them, I always recommend setting them to the same value.

If we could get rid of the distinction, the memory configuration would be 2x simpler.

query.max-memory:
20GB => INFINITE (which will allow failing via per-node limits)

👍

query.max-memory-per-node:
JVM max memory * 0.1 => JVM max memory * 0.7

I am not sure this is an OK value or not, but certainly it requires removing reserved pool completely first.

2 replies

erichwang Jan 19, 2021
Collaborator Author

The way that I've cluster memory managed in the past is that USER memory should in theory be somewhat more repeatable per query than SYSTEM memory. So I've seen cluster admins run canonical query sets over fixed data to determine how to size these limits. That being said, if nobody is actually currently using this distinction in practice, than I see no reason to keep it.

I'm hoping that anyone who cares will speak up.

electrum Mar 2, 2021
Maintainer

User memory was supposed to be memory that was under the user's control, based on their query and data, e.g., the size of aggregations. Determinism is an important effect of that. The same query over the same data that runs today should also run tomorrow, regardless of things like machine count, buffer sizing, etc. In practice, for many reasons we never fully achieved this, and it's not clear that this is a particularly useful goal or distinction to most users or administrators.

sopel39 · 2021-01-14T13:24:20Z

sopel39
Jan 14, 2021
Collaborator

Longer term, if there are not that many users utilizing the memory pools implementation option (experimental.reserved-pool-enabled=true)

I don't think we should wait with removing of reserved pool. I think we could do it soon, but we need to gauge it's usage by community. I started community discussion around it: https://trinodb.slack.com/archives/CP1MUNEUX/p1610630362006000

In doing so, we may not either need the whole separation between user and system memory tracking either (although there is no harm in keeping it around if we like).

I think we should remove system memory accounting too at this point. Objections @dain @electrum @martint ?

query.max-memory-per-node:
JVM max memory * 0.1 => JVM max memory * 0.7

What is the risk of doing so? I see some potential issues:

it could happen that there will be some rogue queries that could reduce concurrency. Are queries killed (on OOM) only where all queries are stuck waiting for memory?
it could happen that rogue queries will cause excessive spilling (when spilling is enabled).

3 replies

erichwang Jan 19, 2021
Collaborator Author

it could happen that there will be some rogue queries that could reduce concurrency. Are queries killed (on OOM) only where all queries are stuck waiting for memory?

It is not so much about the reduction of concurrency (since you can still have this problem even without hitting memory limits). But rather once we hit an OOM condition, the query selected to be killed could feel some what arbitrary and difficult to control to the end users. This is especially noticeable in a multi-tenant system where one of your tasks might get killed for something that is not of your own doing (it could be someone else misbehaving).

it could happen that rogue queries will cause excessive spilling (when spilling is enabled).

good point

hashhar Jan 20, 2021
Collaborator

Does removing the system memory accounting in any way mean that Trino might try to overcommit memory from the heap?

With the memory limits in place today it's very unlikely for a Trino process to hit JVM OOM - the queries hit the limit (user+system) and end up being picked by the low memory killer.

Having administered a moderately large cluster in the past I agree that system and user memory distinction wasn't useful since you can't reason about it.

As for the changing of default config values, I'm totally onboard with that since that's similar to what I ended up configuring and I assume many others do the same for a single-tenant setup.

erichwang Jan 20, 2021
Collaborator Author

Does removing the system memory accounting in any way mean that Trino might try to overcommit memory from the heap?

No, he just means that we remove the system memory distinction from user memory and track the aggregate memory as a single resource.

sopel39 · 2021-01-20T10:49:50Z

sopel39
Jan 20, 2021
Collaborator

But rather once we hit an OOM condition, the query selected to be killed could feel some what arbitrary and difficult to control to the end users.

IIRC that would be biggest query, no? Then if somebody runs such query, he will get penalized first for over allocations

2 replies

erichwang Jan 21, 2021
Collaborator Author

Yes, the current implementation is the biggest query. However, you need to keep in mind the perspective of each user in this multi-tenant system. If I am submitting a single query, how can I know it is going to be the biggest query in the context of all queries being run across all users? And the success of my single query would then depend on how others are using the cluster even though I can not see the others using the cluster. For example, it could be a query that occupies 10% of the available memory, but gets killed sometimes because someone else submitted a bunch of queries occupying 5% of the memory. So from the perspective of one user, the enforcement can feel arbitrary. This is what I mean.

dain Mar 2, 2021
Maintainer

The current implementation is something like to kill query that unblocks the most other queries. We used to use a mode where we killed the biggest query, but we found it was in effective. We found that blocking was mostly caused by skewed queries... basically on query sucks down all the memory on one machine, and that ends up blocking all queries. I don't remember the behavior when there isn't a skewed query. Maybe @electrum or @martint remember this detail.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal to change default query memory limits to allow full cluster utilization #6573

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 7 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Proposal to change default query memory limits to allow full cluster utilization #6573

erichwang Jan 12, 2021 Collaborator

Replies: 3 comments · 7 replies

findepi Jan 14, 2021 Collaborator

erichwang Jan 19, 2021 Collaborator Author

electrum Mar 2, 2021 Maintainer

sopel39 Jan 14, 2021 Collaborator

erichwang Jan 19, 2021 Collaborator Author

hashhar Jan 20, 2021 Collaborator

erichwang Jan 20, 2021 Collaborator Author

sopel39 Jan 20, 2021 Collaborator

erichwang Jan 21, 2021 Collaborator Author

dain Mar 2, 2021 Maintainer

erichwang
Jan 12, 2021
Collaborator

Replies: 3 comments 7 replies

findepi
Jan 14, 2021
Collaborator

erichwang Jan 19, 2021
Collaborator Author

electrum Mar 2, 2021
Maintainer

sopel39
Jan 14, 2021
Collaborator

erichwang Jan 19, 2021
Collaborator Author

hashhar Jan 20, 2021
Collaborator

erichwang Jan 20, 2021
Collaborator Author

sopel39
Jan 20, 2021
Collaborator

erichwang Jan 21, 2021
Collaborator Author

dain Mar 2, 2021
Maintainer