Cordon at large scale #89
Closed
Can we get an update on this thread? Could you please explain why the cordon limiter is not being considered? Please update the thread, and I'll be happy to take it up from there.
We hit a case where a lot of nodes (>75% of the cluster) received a condition that triggers `draino`. `draino` can delay/schedule the drain activity so as not to create an outage, but the cordon is always applied immediately. This led to too many nodes being cordoned at the same time in the cluster, which made some operations (rollout of a huge application) impossible.
In our case, let's say that cluster capacity is limited to C nodes and/or that a given application cannot be allocated more than A nodes (quota). Let's imagine that a huge percentage of the nodes are cordoned and some drains are scheduled by draino. At that time an application is rolled out. If all current nodes of the application are cordoned, the system will try to provision (with cluster autoscaler, for example) some more nodes to host the new pods of the application. Sometimes this provisioning is impossible because it would break either the C or the A limit.
In that case the system is blocked: the application cannot be rolled out. This is critical because we may not want to wait for all the ongoing drain activities to complete (which could take hours if there are hundreds of nodes) before being able to roll out.
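To make the blocked case concrete, here is a minimal Go sketch of the arithmetic described above. The function name `rolloutBlocked` and all the numbers are hypothetical and chosen only for illustration; it assumes every node the application currently uses is cordoned, so each new pod placement needs a freshly provisioned node.

```go
package main

import "fmt"

// rolloutBlocked reports whether provisioning `needed` extra nodes would
// break either the cluster capacity C or the application quota A.
// All parameters are illustrative; none of this is draino code.
func rolloutBlocked(clusterNodes, appNodes, needed, capacityC, quotaA int) bool {
	return clusterNodes+needed > capacityC || appNodes+needed > quotaA
}

func main() {
	// A 100-node cluster capped at C=110; the application owns 20 nodes
	// (quota A=25), all of them cordoned. Replacing them needs 20 new nodes.
	fmt.Println(rolloutBlocked(100, 20, 20, 110, 25)) // true: both limits would be exceeded

	// If only 5 extra nodes were needed, the rollout could proceed.
	fmt.Println(rolloutBlocked(100, 20, 5, 110, 25)) // false
}
```

With fewer nodes cordoned at once, `needed` stays small and the rollout fits within C and A, which is the motivation for the limits proposed below.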
Possible solutions (that can be combined):
A- Limit the number of nodes that can be cordoned on the cluster simultaneously: draino would not cordon if this limit is reached, and would wait for a cordon slot to free up. New flag: `--max-simultaneous-cordon`, value format `(int | int%)`, default `-1` meaning no limit.
B- Limit the number of nodes that can be cordoned simultaneously for a given set of label keys: draino would not cordon if this limit is reached, and would wait for a cordon slot to free up. New flag: `--max-simultaneous-cordon-for-labels`, value format `(int | int%),labelKey+`.
C- Same as B, but using taint keys instead of label keys: `--max-simultaneous-cordon-for-taints`.
D- Initially, instead of cordoning, apply a `PreferNoSchedule` taint. Then cordon only just before the start of the drain activity (according to the current schedule). New flag: `--use-preferred-no-schedule-taint`.
What do you think?
I can start implementing A, which is simple and can already help protect the system.
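
As a starting point for discussion, option A could look roughly like the Go sketch below. It is not draino code: the type `cordonLimit` and the functions `parseCordonLimit` and `canCordon` are hypothetical names, and rounding a percentage down to a whole number of nodes is an assumption that the proposal does not specify.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cordonLimit is a hypothetical parsed form of a --max-simultaneous-cordon
// value: either an absolute node count or a percentage of the cluster.
type cordonLimit struct {
	value     int
	isPercent bool
}

// parseCordonLimit accepts "int" or "int%". The value "-1" means no limit.
func parseCordonLimit(s string) (cordonLimit, error) {
	s = strings.TrimSpace(s)
	isPercent := strings.HasSuffix(s, "%")
	n, err := strconv.Atoi(strings.TrimSuffix(s, "%"))
	if err != nil {
		return cordonLimit{}, fmt.Errorf("invalid cordon limit %q: %w", s, err)
	}
	if n < -1 || (isPercent && (n < 0 || n > 100)) {
		return cordonLimit{}, fmt.Errorf("cordon limit %q out of range", s)
	}
	return cordonLimit{value: n, isPercent: isPercent}, nil
}

// canCordon reports whether one more node may be cordoned, given the total
// number of nodes and the number already cordoned.
func (l cordonLimit) canCordon(totalNodes, cordonedNodes int) bool {
	if l.value < 0 {
		return true // -1: no limit
	}
	limit := l.value
	if l.isPercent {
		limit = totalNodes * l.value / 100 // assumption: round down
	}
	return cordonedNodes < limit
}

func main() {
	limit, err := parseCordonLimit("25%")
	if err != nil {
		panic(err)
	}
	// With 200 nodes, 25% allows at most 50 cordoned nodes at a time.
	fmt.Println(limit.canCordon(200, 49)) // true
	fmt.Println(limit.canCordon(200, 50)) // false: wait for a cordon slot
}
```

When `canCordon` returns false, the node would simply be skipped and retried later, which matches the "waiting for a cordon slot" behavior described in A.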