NodeDb.ScheduleMany can be quite slow in the case where there are a large number of nodes and jobs cannot be assigned to any node. This is important because it represents the very common case where a cluster is full and we have a large number of jobs in the queue.
Here's a very simple benchmark:
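Something along these lines (a sketch only: the `testNodeDb`/`testNJobs` helpers and the exact `ScheduleMany` call are placeholders for illustration, not real Armada code):

```go
package nodedb_test

import "testing"

// testNodeDb and testNJobs are hypothetical helpers: the first builds a
// NodeDb containing n identical nodes, the second a slice of n identical
// jobs small enough that a node fits several of them.
func BenchmarkScheduleManyFullCluster(b *testing.B) {
	jobs := testNJobs(4000)
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		nodeDb := testNodeDb(1000) // fresh 1000-node cluster each iteration
		b.StartTimer()

		// The first ~1000 jobs fill the cluster; the remaining ~3000 cannot
		// be scheduled anywhere, which is the slow path discussed here.
		nodeDb.ScheduleMany(jobs)
	}
}
```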
This benchmark creates a 1000-node cluster and then tries to schedule 4000 jobs onto it. The first 1K jobs fill up the cluster; the remaining 3K all fail to schedule because there is no free resource. On my laptop the current master code runs this in 13.5 seconds. I have a PR (#2318) with some trivial changes, which reduces this to 9 seconds.
To improve this further, I think we would need to change the way Armada works. I've listed the possible improvements in the table below, and hopefully we can come to an agreement about which of them should be implemented.
| Change | How it helps | Notes |
| --- | --- | --- |
| `PodSchedulingContext.NumExcludedNodesByReason` to be keyed by error rather than by string | Building strings is expensive: notably, 11% of CPU time in the benchmark is spent in `InsufficientResources.String()`. | Arguably this is also a good thing to do from a code-cleanliness point of view, as the reporting layer can then decide how reasons should be formatted. |
| `InsufficientResources.String()` to report only the resource that was insufficient, rather than the values | Reduces allocation: right now we have to allocate one error per job per node. If we only reported the resource, we could cache one error per resource type and reduce allocation to almost zero (sketched after the table). | This would mean losing the information about how much free resource was present on each node, but I'm not sure that's too bad. |
| Make `NodeDb` operate on a node struct we control (i.e. not a grpc-generated struct) | We spend a lot of time calculating trivial facts about the node (`AvailableQuantityByQueueAndResource`, `DominantQueue`, etc.). In theory we only need to calculate these when we schedule a job onto the node, but in practice we have nowhere to store the values because the structs involved are grpc-generated, so we end up recalculating them on every job scheduling event. | I think there's a general point here about not using grpc objects inside the scheduler, as our lack of control over these objects' code greatly limits what we can do. |
| Implement a custom resource list that special-cases cpu and memory (possibly also gpu and ephemeral storage) so that they don't have to be looked up in maps | We spend lots of time doing map lookups where accessing a struct field would be much faster (sketched after the table). | This needs a bit of thought, as we also iterate over these maps and Go doesn't let you write a drop-in replacement for the built-in map. |
| Change the scheduler so that we filter nodes by used resource in the index | I admit I haven't fully delved into this, but it looks to me as if we end up checking every node for each job when all nodes are full. We should be able to order nodes by free resource and stop checking once we reach a node whose resources are insufficient. | |
| Implement key-based checking | The common case is that jobs in the same queue have largely homogeneous requirements, so we can remember the requirements of jobs we have already failed to schedule and mark subsequent jobs with the same requirements as unschedulable, without an exhaustive check of the nodes (sketched after the table). | This is a no-brainer, but we do need to plan out where to plumb it in and how it interacts with blackholing nodes. |
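To make the cached-errors row concrete, here is a minimal sketch of allocating one error per resource type up front and returning the cached value; the type and function names are illustrative, not the current Armada code:

```go
package scheduler

// insufficientResource reports only which resource could not be satisfied,
// not the requested/available quantities, so one value per resource type
// is enough for the whole scheduling round.
type insufficientResource struct {
	resource string
}

func (e *insufficientResource) Error() string {
	return "insufficient " + e.resource
}

// One cached error per resource type; returning these allocates nothing.
var insufficientResourceErrors = map[string]*insufficientResource{
	"cpu":               {resource: "cpu"},
	"memory":            {resource: "memory"},
	"nvidia.com/gpu":    {resource: "nvidia.com/gpu"},
	"ephemeral-storage": {resource: "ephemeral-storage"},
}

func errInsufficient(resource string) error {
	if err, ok := insufficientResourceErrors[resource]; ok {
		return err
	}
	// Uncommon resources fall back to a per-call allocation.
	return &insufficientResource{resource: resource}
}
```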
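For the custom resource list, the idea is that the hot resources live in plain struct fields and only exotic resources go through a map. Again a sketch, with illustrative names rather than anything in the codebase:

```go
package scheduler

import "k8s.io/apimachinery/pkg/api/resource"

// resourceList keeps the hot resources in plain struct fields so the inner
// scheduling loop does field access instead of map lookups; anything else
// (extended resources, vendor-specific gpu names, ...) falls back to a map.
type resourceList struct {
	Cpu              resource.Quantity
	Memory           resource.Quantity
	EphemeralStorage resource.Quantity
	Other            map[string]resource.Quantity
}

func (rl *resourceList) Get(name string) resource.Quantity {
	switch name {
	case "cpu":
		return rl.Cpu
	case "memory":
		return rl.Memory
	case "ephemeral-storage":
		return rl.EphemeralStorage
	default:
		return rl.Other[name]
	}
}
```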
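And for key-based checking, the core is computing a scheduling key from a job's requirements and remembering keys that have already failed this round, so identical follow-up jobs are rejected without touching any nodes. A sketch, assuming such a key can be derived cheaply; the key function and the hook into `ScheduleMany` are left out:

```go
package scheduler

// unschedulableKeys remembers scheduling keys that have already failed in the
// current scheduling round. Jobs in the same queue mostly share requirements,
// so later identical jobs can be rejected without scanning any nodes.
type unschedulableKeys struct {
	failed map[string]error
}

func newUnschedulableKeys() *unschedulableKeys {
	return &unschedulableKeys{failed: make(map[string]error)}
}

// TrySchedule short-circuits if a job with the same key already failed this
// round; otherwise it runs the existing exhaustive per-node check and records
// any failure against the key.
//
// schedulingKey is assumed to be something like a hash of the job's resource
// requests, node selectors and tolerations; scheduleAcrossNodes stands in for
// the existing NodeDb check.
func (u *unschedulableKeys) TrySchedule(schedulingKey string, scheduleAcrossNodes func() error) error {
	if err, ok := u.failed[schedulingKey]; ok {
		return err
	}
	if err := scheduleAcrossNodes(); err != nil {
		u.failed[schedulingKey] = err
		return err
	}
	return nil
}
```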
The above is probably enough to be getting on with for now. After that there are further optimizations we could do, such as adding `resource.quantity.approxDoubleValue` to the relevant structs so it doesn't need to be continually recalculated, or replacing the memdb with something a bit faster.
cc @severinson @JamesMurkin