bug: support for affinity rules #58

Open
vsoch opened this issue Jan 15, 2024 · 0 comments
Labels
bug Something isn't working

Comments

Member

vsoch commented Jan 15, 2024

When we parse the pod, it looks like we don't take affinity rules into account (e.g., for the Flux Operator here). Regardless of the CPU limits/requests, a pod could have an affinity rule that asks for the entire node. In that case we would ignore the rule and still pass in the cpu/memory via the jobspec here, and fluxion could decide to put two pods on one node (if I understand that correctly). I think affinity rules are typically applied in Filter (the step after PreFilter), which we implement here but without accounting for them. So we might ignore the affinity rule altogether, which could result in multiple pods per node for the MiniCluster unless the resource limits are also set.
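
To make the concern concrete, here is a rough sketch of what a Filter-style check for this could look like. It is written in Python against a pod manifest loaded as a plain dict, not against the actual plugin code, and it assumes the rule in question is a required pod anti-affinity keyed on `kubernetes.io/hostname`. The helper names (`node_is_feasible`, `example_pod`) and the `app: demo` label are hypothetical; only the manifest field names come from the Kubernetes pod spec.

```python
HOSTNAME_KEY = "kubernetes.io/hostname"


def required_anti_affinity_terms(pod: dict) -> list:
    """Return the pod's required pod anti-affinity terms (empty list if none)."""
    affinity = pod.get("spec", {}).get("affinity") or {}
    anti = affinity.get("podAntiAffinity") or {}
    return anti.get("requiredDuringSchedulingIgnoredDuringExecution") or []


def selector_matches(term: dict, labels: dict) -> bool:
    """Check a term's matchLabels against a label set.

    Simplifications in this sketch: matchExpressions and namespaces are
    ignored, and an empty selector is treated as matching nothing.
    """
    wanted = (term.get("labelSelector") or {}).get("matchLabels") or {}
    return bool(wanted) and all(labels.get(k) == v for k, v in wanted.items())


def node_is_feasible(pod: dict, pods_on_node: list) -> bool:
    """Filter-style check: reject a node that already hosts an excluded pod.

    Only the incoming pod's own terms are checked; anti-affinity declared by
    pods already on the node is not considered here.
    """
    for term in required_anti_affinity_terms(pod):
        if term.get("topologyKey") != HOSTNAME_KEY:
            continue  # only per-node (hostname) terms are handled in this sketch
        for other in pods_on_node:
            other_labels = other.get("metadata", {}).get("labels") or {}
            if selector_matches(term, other_labels):
                return False
    return True


def example_pod(name: str) -> dict:
    """Hypothetical pod that refuses to share a node with pods labelled app=demo."""
    return {
        "metadata": {"name": name, "labels": {"app": "demo"}},
        "spec": {
            "affinity": {
                "podAntiAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": [
                        {
                            "labelSelector": {"matchLabels": {"app": "demo"}},
                            "topologyKey": HOSTNAME_KEY,
                        }
                    ]
                }
            }
        },
    }


if __name__ == "__main__":
    first, second = example_pod("pod-0"), example_pod("pod-1")
    print(node_is_feasible(second, []))       # empty node
    print(node_is_feasible(second, [first]))  # node already hosts a sibling pod
```

With a check like this, the second pod would be turned away from a node that already hosts its sibling even when the cpu/memory requests alone would fit. If only the resource requests reach the jobspec, nothing expresses that constraint.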

For context, I'm trying to brainstorm about the behavior I'm seeing in the latest experiments. It's most likely that I did something wrong, but I think there are features of the Flux Operator that need to be taken into account (such as this one). If the default scheduler accounts for affinity, that is at minimum a subtle difference (even if it is not the exact problem here). What is likely needed is careful debugging of an entire scheduling session, checking every output. I'll keep trying to think of more subtle differences and open issues as I do.

vsoch added the bug (Something isn't working) label Jan 15, 2024