NodeDb.ScheduleMany can be quite slow in the case where there are a large number of nodes and jobs cannot be assigned to any node. This is important because it represents the very common case where a cluster is full and we have a large number of jobs in the queue.
Here's a very simple benchmark:
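Something along these lines (a sketch only: the `testNodeDb`/`testNJobs` helpers and the exact `ScheduleMany` call are placeholders for illustration, not real Armada code):

```go
package nodedb_test

import "testing"

// testNodeDb and testNJobs are hypothetical helpers: the first builds a
// NodeDb containing n identical nodes, the second a slice of n identical
// jobs small enough that a node fits several of them.
func BenchmarkScheduleManyFullCluster(b *testing.B) {
	jobs := testNJobs(4000)
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		nodeDb := testNodeDb(1000) // fresh 1000-node cluster each iteration
		b.StartTimer()

		// The first ~1000 jobs fill the cluster; the remaining ~3000 cannot
		// be scheduled anywhere, which is the slow path discussed here.
		nodeDb.ScheduleMany(jobs)
	}
}
```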
This benchmark creates a 1000-node cluster and then tries to schedule 4000 jobs onto it. The first 1K jobs fill up the cluster; the remaining 3K all fail to schedule because there is no free resource. On my laptop the current master code runs this in 13.5 seconds. I have a PR (#2318) with some trivial changes, which reduces this to 9 seconds.
To improve this further, I think we would need to change the way Armada works. I've listed the possible improvements in the table below, and hopefully we can come to an agreement about which of them should be implemented.
| Change | How it helps | Notes |
| --- | --- | --- |
| `PodSchedulingContext.NumExcludedNodesByReason` to be keyed by error rather than by string | Building strings is expensive: notably, 11% of CPU time in the benchmark is spent in `InsufficientResources.String()`. | Arguably this is also a good thing to do from a code-cleanliness point of view, as the reporting layer can then decide how reasons should be formatted. |
| `InsufficientResources.String()` to report only the resource that was insufficient, rather than the values | Reduces allocation: right now we have to allocate one error per job per node. If we only reported the resource, we could cache one error per resource type and reduce allocation to almost zero (sketched after the table). | This would mean losing the information about how much free resource was present on each node, but I'm not sure that's too bad. |
| Make `NodeDb` operate on a node struct we control (i.e. not a grpc-generated struct) | We spend a lot of time calculating trivial facts about the node (`AvailableQuantityByQueueAndResource`, `DominantQueue`, etc.). In theory we only need to calculate these when we schedule a job onto the node, but in practice we have nowhere to store the values because the structs involved are grpc-generated, so we end up recalculating them on every job scheduling event. | I think there's a general point here about not using grpc objects inside the scheduler, as our lack of control over these objects' code greatly limits what we can do. |
| Implement a custom resource list that special-cases cpu and memory (possibly also gpu and ephemeral storage) so that they don't have to be looked up in maps | We spend lots of time doing map lookups where accessing a struct field would be much faster (sketched after the table). | This needs a bit of thought, as we also iterate over these maps and Go doesn't let you write a drop-in replacement for the built-in map. |
| Change the scheduler so that we filter nodes by used resource in the index | I admit I haven't fully delved into this, but it looks to me as if we end up checking every node for each job when all nodes are full. We should be able to order nodes by free resource and stop checking once we reach a node whose resources are insufficient. | |
| Implement key-based checking | The common case is that jobs in the same queue have largely homogeneous requirements, so we can remember the requirements of jobs we have already failed to schedule and mark subsequent jobs with the same requirements as unschedulable, without an exhaustive check of the nodes (sketched after the table). | This is a no-brainer, but we do need to plan out where to plumb it in and how it interacts with blackholing nodes. |
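To make the cached-errors row concrete, here is a minimal sketch of allocating one error per resource type up front and returning the cached value; the type and function names are illustrative, not the current Armada code:

```go
package scheduler

// insufficientResource reports only which resource could not be satisfied,
// not the requested/available quantities, so one value per resource type
// is enough for the whole scheduling round.
type insufficientResource struct {
	resource string
}

func (e *insufficientResource) Error() string {
	return "insufficient " + e.resource
}

// One cached error per resource type; returning these allocates nothing.
var insufficientResourceErrors = map[string]*insufficientResource{
	"cpu":               {resource: "cpu"},
	"memory":            {resource: "memory"},
	"nvidia.com/gpu":    {resource: "nvidia.com/gpu"},
	"ephemeral-storage": {resource: "ephemeral-storage"},
}

func errInsufficient(resource string) error {
	if err, ok := insufficientResourceErrors[resource]; ok {
		return err
	}
	// Uncommon resources fall back to a per-call allocation.
	return &insufficientResource{resource: resource}
}
```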
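For the custom resource list, the idea is that the hot resources live in plain struct fields and only exotic resources go through a map. Again a sketch, with illustrative names rather than anything in the codebase:

```go
package scheduler

import "k8s.io/apimachinery/pkg/api/resource"

// resourceList keeps the hot resources in plain struct fields so the inner
// scheduling loop does field access instead of map lookups; anything else
// (extended resources, vendor-specific gpu names, ...) falls back to a map.
type resourceList struct {
	Cpu              resource.Quantity
	Memory           resource.Quantity
	EphemeralStorage resource.Quantity
	Other            map[string]resource.Quantity
}

func (rl *resourceList) Get(name string) resource.Quantity {
	switch name {
	case "cpu":
		return rl.Cpu
	case "memory":
		return rl.Memory
	case "ephemeral-storage":
		return rl.EphemeralStorage
	default:
		return rl.Other[name]
	}
}
```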
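And for key-based checking, the core is computing a scheduling key from a job's requirements and remembering keys that have already failed this round, so identical follow-up jobs are rejected without touching any nodes. A sketch, assuming such a key can be derived cheaply; the key function and the hook into `ScheduleMany` are left out:

```go
package scheduler

// unschedulableKeys remembers scheduling keys that have already failed in the
// current scheduling round. Jobs in the same queue mostly share requirements,
// so later identical jobs can be rejected without scanning any nodes.
type unschedulableKeys struct {
	failed map[string]error
}

func newUnschedulableKeys() *unschedulableKeys {
	return &unschedulableKeys{failed: make(map[string]error)}
}

// TrySchedule short-circuits if a job with the same key already failed this
// round; otherwise it runs the existing exhaustive per-node check and records
// any failure against the key.
//
// schedulingKey is assumed to be something like a hash of the job's resource
// requests, node selectors and tolerations; scheduleAcrossNodes stands in for
// the existing NodeDb check.
func (u *unschedulableKeys) TrySchedule(schedulingKey string, scheduleAcrossNodes func() error) error {
	if err, ok := u.failed[schedulingKey]; ok {
		return err
	}
	if err := scheduleAcrossNodes(); err != nil {
		u.failed[schedulingKey] = err
		return err
	}
	return nil
}
```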
The above is probably enough to be getting on with for now. After that there are further optimizations we could do, such as adding `resource.quantity.approxDoubleValue` to the relevant structs so it doesn't need to be continually recalculated, or replacing the memdb with something a bit faster.
cc @severinson @JamesMurkin