bug: support for affinity rules #58

vsoch · 2024-01-15T08:51:19Z

When we parse the pod, it looks like we don't take affinity rules into account (e.g., for the Flux Operator here). Regardless of the CPU limits/requests, a pod could have an affinity rule that asks for the entire node. In that case we would ignore the rule and still pass the cpu/memory in via the jobspec here, and fluxion could decide to put two pods on one node (if I understand that correctly). I think affinity rules are typically applied in Filter (the step after PreFilter), and we implement it here but don't account for affinity. So we might ignore the affinity rule altogether, which could result in multiple pods per node for the MiniCluster unless the resource limits are also set.
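To make the concern concrete, here is a minimal sketch of the kind of check that appears to be skipped. It assumes the rule in question is a required pod anti-affinity term keyed on `kubernetes.io/hostname` (the issue does not say which kind of affinity the Flux Operator sets), and it works on plain dicts shaped like a pod manifest rather than on the scheduler's real types. The function names and the `app: minicluster` label are hypothetical, only `matchLabels` is evaluated, and the symmetric check against existing pods' own rules is left out.

```python
HOSTNAME_KEY = "kubernetes.io/hostname"


def required_anti_affinity_terms(pod):
    """Return the required pod anti-affinity terms of a pod manifest dict."""
    spec = pod.get("spec") or {}
    affinity = spec.get("affinity") or {}
    anti = affinity.get("podAntiAffinity") or {}
    return anti.get("requiredDuringSchedulingIgnoredDuringExecution") or []


def selector_matches(selector, labels):
    """Evaluate matchLabels only; matchExpressions are ignored in this sketch."""
    if selector is None:
        return False  # a missing selector selects nothing
    wanted = selector.get("matchLabels") or {}
    return all(labels.get(key) == value for key, value in wanted.items())


def repels_own_peers(pod):
    """True if the pod refuses to share a node with pods labeled like itself."""
    labels = (pod.get("metadata") or {}).get("labels") or {}
    return any(
        term.get("topologyKey") == HOSTNAME_KEY
        and selector_matches(term.get("labelSelector"), labels)
        for term in required_anti_affinity_terms(pod)
    )


def node_is_feasible(pod, pods_on_node):
    """Filter-style check: reject a node that already hosts a repelled pod."""
    for term in required_anti_affinity_terms(pod):
        if term.get("topologyKey") != HOSTNAME_KEY:
            continue  # other topology domains are out of scope here
        for other in pods_on_node:
            other_labels = (other.get("metadata") or {}).get("labels") or {}
            if selector_matches(term.get("labelSelector"), other_labels):
                return False
    return True


# Hypothetical MiniCluster worker pod; the label is made up for illustration.
worker = {
    "metadata": {"labels": {"app": "minicluster"}},
    "spec": {
        "affinity": {
            "podAntiAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "topologyKey": HOSTNAME_KEY,
                        "labelSelector": {"matchLabels": {"app": "minicluster"}},
                    }
                ]
            }
        }
    },
}
```

With this shape, `repels_own_peers(worker)` flags a pod that effectively wants a node to itself regardless of its cpu/memory requests, and `node_is_feasible(worker, [worker])` rejects a node that already hosts a matching pod. A jobspec built only from resource requests carries neither piece of information.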

For context, I'm trying to brainstorm explanations for the behavior I'm seeing with the latest experiments. It's most likely I did something wrong, but I think there are features of the Flux Operator that need to be taken into account (such as this one). If the default scheduler is accounting for affinity, that is minimally a subtle difference (even if not the exact problem here). I think what is likely needed is careful debugging of an entire scheduling session, checking every output. I'll continue to think of more subtle differences and open issues as I do.

vsoch added the bug Something isn't working label Jan 15, 2024
