970: Taint nodes even if reboot is currently blocked #971
base: main
Conversation
- If using the --prefer-no-schedule-taint flag, apply the taint even if the reboot is currently blocked. This makes it less likely for Pods to get scheduled to nodes that are pending a reboot once the blocking Pods / alerts are cleared.

Signed-off-by: Matt Yaraskavitch <62650344+yaraskm@users.noreply.github.com>
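To make the described behaviour concrete, here is a minimal, hedged sketch of applying a PreferNoSchedule taint to a node with client-go. The helper name and taint key are illustrative only; this is not the PR's actual implementation (kured uses its own taint handling internally).

```go
package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// addPreferNoScheduleTaint is an illustrative helper (not kured's code):
// it adds a PreferNoSchedule taint to the named node if it is not already set.
func addPreferNoScheduleTaint(ctx context.Context, client kubernetes.Interface, nodeName, taintKey string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("get node %s: %w", nodeName, err)
	}
	// Idempotence: skip the update if the taint is already present.
	for _, t := range node.Spec.Taints {
		if t.Key == taintKey && t.Effect == corev1.TaintEffectPreferNoSchedule {
			return nil
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    taintKey,
		Value:  "true",
		Effect: corev1.TaintEffectPreferNoSchedule,
	})
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```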
@@ -761,16 +761,24 @@ func rebootAsRequired(nodeID string, booter reboot.Reboot, sentinelCommand []str | |||
blockCheckers = append(blockCheckers, KubernetesBlockingChecker{client: client, nodename: nodeID, filter: podSelectors}) | |||
} | |||
|
|||
rebootBlocked := rebootBlocked(blockCheckers...) |
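For readers skimming the hunk above: the change under discussion is about when the PreferNoSchedule taint is applied relative to this rebootBlocked check. A rough sketch of the intended ordering follows; it is not the PR's exact code, and the Taint and RebootBlocker shapes are assumptions for this example.

```go
// Sketch only: illustrates tainting before the block check inside a
// rebootAsRequired-style loop iteration.
package example

type Taint interface {
	Enable()
	Disable()
}

type RebootBlocker interface {
	isBlocked() bool
}

func anyBlocked(blockers ...RebootBlocker) bool {
	for _, b := range blockers {
		if b.isBlocked() {
			return true
		}
	}
	return false
}

// oneTick is one iteration of the reboot loop under the proposed ordering.
func oneTick(rebootRequired bool, softTaint Taint, blockers []RebootBlocker) {
	if !rebootRequired {
		softTaint.Disable() // nothing pending; leave the node schedulable as usual
		return
	}

	// Taint first: a node that is pending a reboot is deprioritized for new
	// Pods even while blocking Pods or alerts delay the reboot itself.
	softTaint.Enable()

	if anyBlocked(blockers...) {
		return // stay tainted and retry on the next tick
	}

	// ... acquire the lock, drain the node, and reboot ...
}
```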
I wonder if the easiest and most correct way to improve this is to move this right after L32 above, immediately after we determine that the node will need to be rebooted.
As far as I can see there isn't any harm in always adding this "prefer no schedule" taint. Right now we're only doing it if we aren't able to rapidly progress to the drain operation, but why not do it no matter what?
This would solve your current problem as well.
Thoughts?
Sounds good to me, I'll make the change later today.
Hey @jackfrancis, which line are you referring to? I don't think L32 is correct.
I've been able to successfully run with this PR in prod for the last few weeks now and it definitely seems to have lowered the average time for a node to be blocked. I'm definitely open to adding the taint all of the time, regardless of whether we're in an acceptable reboot window or not.
@jackfrancis do you truly believe that adding the prefer no schedule taint can be done in all cases?
Wouldn't it be a bit moot to have every node carry PreferNoSchedule after a security update (since all nodes would get the taint)? When Kubernetes schedules across nodes that all have PreferNoSchedule, there is no distinction between them anymore... Am I missing something?
I am not convinced this is a view everyone will share. I believe some people will prefer always adding the PreferNoSchedule taint, while others will consider that the taint shouldn't be liberally applied to all nodes, including the ones that can't reboot.

I also don't like changing the rebootAsRequired loop through a variable. It makes the loop very hard to understand and maintain in the long run. On top of that, most of the reasons people have adjusted rebootAsRequired over time came down to personal preference: "In my case, I would like to do x before y." It all made sense for some people, while it didn't make sense for others. I truly believe your case makes sense. Yet, I am not certain that it's the case for everyone. Two choices:
I definitely agree with you here, and making it a configurable behaviour change makes the most sense to me as well. I'll wait for your feedback on the discussions and am happy to chip in on implementation changes if needed.
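If the behaviour does end up behind an opt-in, a hedged sketch of the wiring, assuming kured's cobra-based CLI, could look like the following. The flag name --taint-when-blocked and the variable are hypothetical, not something this PR defines; keeping the default at false would preserve today's behaviour.

```go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	// Hypothetical opt-in: only users who set the flag get the taint applied
	// while the reboot is still blocked.
	var taintWhenBlocked bool

	rootCmd := &cobra.Command{
		Use: "kured",
		Run: func(cmd *cobra.Command, args []string) {
			fmt.Println("apply PreferNoSchedule taint while blocked:", taintWhenBlocked)
		},
	}
	rootCmd.PersistentFlags().BoolVar(&taintWhenBlocked, "taint-when-blocked", false,
		"apply the prefer-no-schedule taint even while the reboot is blocked")

	if err := rootCmd.Execute(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```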
Resolves #970