-
If user memory were comparable to, e.g., the file sizes of the input files, it would be something comprehensible to me. Since it isn't, I see no benefit from distinguishing user and system memory at all. If anyone asks me how to configure them, I always recommend setting them to the same value. If we could get rid of the distinction, the memory configuration would be 2x simpler.
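To make that concrete, here is a minimal sketch of what I mean in config.properties. The 100GB and 10GB figures are placeholders, not recommendations; the point is only that each "user" limit and its "user + system" counterpart get the same value:

```properties
# Sketch only: pick one budget and use it for both the user limit and the
# user+system limit, so the distinction stops mattering in practice.
query.max-memory=100GB
query.max-total-memory=100GB
query.max-memory-per-node=10GB
query.max-total-memory-per-node=10GB
```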
I am not sure whether this is an OK value, but it certainly requires removing the reserved pool completely first.
-
I don't think we should wait on removing the reserved pool. I think we could do it soon, but we need to gauge its usage by the community. I started a community discussion around it: https://trinodb.slack.com/archives/CP1MUNEUX/p1610630362006000
I think we should remove system memory accounting too at this point. Objections, @dain @electrum @martint?
What is the risk of doing so? I see some potential issues:
-
IIRC that would be the biggest query, no? Then if somebody runs such a query, they will get penalized first for over-allocations.
-
Problem Statement:
We've often seen users confused about how to manage Trino memory limits. In particular, over the last several weeks I've observed multiple instances of users complaining about hitting the maximum cluster-wide per-query memory limit with the default cluster settings, and not knowing how to proceed with tuning. Upon deeper investigation, it seems it may actually be unreasonable to ask users to configure some of these limits, because of their relationship to certain environmental properties. For example, consider the default total cluster-wide memory limit per query of 40GB. This limit encompasses memory reservations for user memory, but also for system memory (which users don't intuit very well).
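For reference, that 40GB figure falls out of the built-in defaults; written out explicitly in config.properties they would look roughly like this:

```properties
# Built-in defaults for the cluster-wide per-query limits:
query.max-memory=20GB          # user memory only
query.max-total-memory=40GB    # user + system memory (defaults to query.max-memory * 2)
```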
In order to set this value, how does one estimate how much system memory a query might take at any given instant?
The funny thing is that this depends upon a number of factors:
(A) The number of nodes in the cluster
The more nodes you have, the more concurrency you can have, and thus the more system memory can be consumed. We've seen users complain about hitting memory limits with pure table scan queries alone (a rough back-of-the-envelope example follows this list). When node auto-scaling is in play, this becomes next to impossible to manage.
(B) The number of splits generated by the connector
As in (A), the more splits you have, the more concurrency you can have, which can increase the burst system memory required.
(C) The split scheduling policy of the scheduler
The scheduler dictates how and when splits get scheduled, which affects the instantaneous system memory consumption of the cluster. Users do not have much control over this today.
(D) Other concurrent queries in the cluster
The irony here is that when more queries run at the same time, less concurrency gets allocated per query (and thus less system memory), so a query that was failing on an empty cluster might actually succeed on a saturated one.
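To make (A) and (B) concrete with purely hypothetical numbers: if each concurrently running split buffers on the order of 16MB of system memory (exchange and output buffers and the like), then a pure table scan running 32 concurrent splits per node on a 100-node cluster can momentarily reserve roughly 100 × 32 × 16MB ≈ 50GB of system memory cluster-wide, which is already past the default 40GB total limit even though almost no user memory is in use. Double the cluster (or the split concurrency) and the burst doubles with it.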
Given all of this, the limit creates a somewhat unpredictable experience for end users: something that works at one point in time can suddenly start failing at another.
Why were the query.max-memory, query.max-total-memory, query.max-memory-per-node, and query.max-total-memory-per-node limits added in the past?
In the past, we needed these properties to guarantee the behavior of Trino's reserved memory pool: under memory pressure, a cluster could fall back to the reserved pool to ensure that at least one query could always proceed, given that we know the maximum amount of memory a query may require at both the per-node and the cluster-wide level.
However, as of November 2019, the reserved memory pool has effectively been disabled by default, and it can only be enabled via a deprecated config option (#2006).
Proposal:
I'm proposing that we change the Trino default memory settings so that a single query can leverage the entire cluster (if possible). This could potentially look like the following (a concrete sketch of the resulting configuration follows the list):
query.max-memory:
20GB => INFINITE (which will allow failing via per-node limits)
query.max-total-memory:
query.max-memory * 2 => INFINITE (which will allow failing via per-node limits)
query.max-memory-per-node:
JVM max memory * 0.1 => JVM max memory * 0.7
query.max-total-memory-per-node:
JVM max memory * 0.3 => JVM max memory * 0.7
memory.heap-headroom-per-node:
JVM max memory * 0.3 => SAME
query.low-memory-killer.policy:
total-reservation-on-blocked-nodes => SAME
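To illustrate what the proposal would look like written out, here is a hypothetical config.properties for a worker with, say, a 100GB JVM heap. The heap size and the "effectively unlimited" 1PB values are illustrative assumptions on my part, not part of the proposal itself:

```properties
# Cluster-wide per-query limits: effectively unlimited, so failures happen
# via the per-node limits instead.
query.max-memory=1PB
query.max-total-memory=1PB

# Per-node limits: 0.7 * JVM max memory (assuming a 100GB heap here).
query.max-memory-per-node=70GB
query.max-total-memory-per-node=70GB

# Unchanged defaults.
memory.heap-headroom-per-node=30GB   # 0.3 * JVM max memory
query.low-memory-killer.policy=total-reservation-on-blocked-nodes
```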
Users could in theory tune these values down IF they want to enforce memory policies on query limits; however, my guess is that few users actually do so (someone please correct me if this is wrong). And if they do enforce limits, my guess is that these are probably at the user memory limit level rather than the user+system memory limit level. I don't see how anyone could reasonably set system memory limits.
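For anyone who does want such a policy, "tuning these values down" would just mean putting finite values back on the user-level limits, e.g. (values purely illustrative):

```properties
# Enforce a per-query policy on user memory, while leaving the combined
# user+system limits effectively unlimited.
query.max-memory=50GB
query.max-memory-per-node=5GB
```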
Future:
Longer term, if there are not many users using the memory pools option (experimental.reserved-pool-enabled=true), we should consider getting rid of the whole memory pools concept altogether. This would dramatically simplify the mental load when discussing memory. In doing so, we may no longer need the separation between user and system memory tracking either (although there is no harm in keeping it around if we like).
I'm curious to hear whether anyone else has thoughts on this.