[BUG] - Operator Pod Error Loops if the Pod has to recreate #22

Open
odellem opened this issue Aug 1, 2024 · 0 comments
Labels
bug Something isn't working


odellem commented Aug 1, 2024

Describe the bug
After a payload is applied to the Slurm cluster, the operator creates the daemonset for the slurmabler pods. If the operator pod then crashes or restarts, it goes into an error loop because the daemonset already exists.

To Reproduce
Steps to reproduce the behavior:

  1. Install Slik
  2. Apply either payload
  3. Let the slurmabler pods be created
  4. Delete the operator pod, allowing the deployment to recreate it, and check the logs for the error loop.

Expected behavior
The operator should handle the error gracefully. If the daemonset really does need to be created again, the operator should delete the existing daemonset and then recreate it.
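
To illustrate what graceful handling could look like, here is a minimal sketch of an idempotent "ensure" step. It is not the operator's actual code: this issue does not say which client library the operator uses, so `FakeCluster`, `AlreadyExists`, and `ensure_daemonset` are hypothetical names, and the in-memory dictionary only stands in for the Kubernetes API. The idea is to treat "already exists" as a normal outcome on restart and reconcile the existing daemonset instead of failing.

```python
class AlreadyExists(Exception):
    """Hypothetical stand-in for the API server's 'already exists' (HTTP 409) error."""


class FakeCluster:
    """Hypothetical in-memory stand-in for the Kubernetes API (illustration only)."""

    def __init__(self):
        self.daemonsets = {}

    def create_daemonset(self, name, spec):
        # Mirrors the behavior in this report: creating twice is an error.
        if name in self.daemonsets:
            raise AlreadyExists(name)
        self.daemonsets[name] = dict(spec)

    def replace_daemonset(self, name, spec):
        self.daemonsets[name] = dict(spec)


def ensure_daemonset(cluster, name, spec):
    """Create the daemonset, or reconcile it in place if it already exists.

    Returns "created", "unchanged", or "updated".
    """
    try:
        cluster.create_daemonset(name, spec)
        return "created"
    except AlreadyExists:
        # A restarted operator lands here; this is expected, not fatal.
        if cluster.daemonsets[name] == spec:
            return "unchanged"
        cluster.replace_daemonset(name, spec)
        return "updated"
```

With this shape, a restarted operator that calls `ensure_daemonset` again for the same name gets "unchanged" or "updated" rather than an error, so there is nothing to loop on. Replacing in place is swapped in here for the delete-and-recreate approach suggested above; either would avoid the loop, but replacing avoids tearing down running slurmabler pods.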

Additional context
Deleting the daemonset and restarting the operator pod fixes the problem. However, a cluster upgrade moves pods around during the rolling update, so any cluster upgrade will break the Slurm operator.
