xgboost "Allreduce failed" with the dask operator specifically #898
So it turns out that this error can surface when the worker `--memory-limit` is not set explicitly. How would you feel about exposing it via `make_cluster_spec`? Would you prefer that it default to `auto`? Happy to contribute if you think this change makes sense (bandwidth permitting).
With the operator the flag is not set, so it defaults to `auto`. If I remember correctly we added setting things explicitly in the classic mode because in the early days Kubernetes/Linux cgroups didn't always report the right memory limit and would report the limit of the whole node. I wouldn't expect that to be an issue these days, but perhaps the problem you ran into here shows that it still can be.

I'm very conservative about adding new options to `make_cluster_spec`, but you can already set the flag yourself by customizing the spec:

```python
from dask_kubernetes.operator import KubeCluster, make_cluster_spec

cluster_spec = make_cluster_spec(name="foo")
cluster_spec["spec"]["worker"]["spec"]["containers"][0]["args"] += ["--memory-limit", "4GB"]
cluster = KubeCluster(custom_cluster_spec=cluster_spec)
```

I've been thinking lately about making the customize-your-cluster API a little more pleasant, and I think that would be useful here. I'll open a separate issue for that.
Thanks, @jacobtomlinson!
Dumb question: how do I check what "auto" evaluates to?
You should be able to see the worker memory on the dashboard.
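For reference, here's a minimal sketch of checking the resolved limit programmatically rather than on the dashboard, assuming a connected `Client` (the `cluster` variable is borrowed from the earlier example); `scheduler_info()` reports each worker's `memory_limit` in bytes:

```python
from dask.distributed import Client
from dask.utils import format_bytes

client = Client(cluster)  # `cluster` assumed to be the KubeCluster created above

# Each worker reports the memory limit it resolved at startup, which is what
# "auto" evaluated to; scheduler_info() returns it in bytes per worker.
for addr, info in client.scheduler_info()["workers"].items():
    print(addr, format_bytes(info["memory_limit"]))
```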
Looks like it was inferred to a reasonable value and setting it manually was a red herring. Thanks, @jacobtomlinson!
So did you find what was causing the `Allreduce failed` error?
Turns out it was the cluster autoscaler trying to bin-pack workers and the specific xgboost computation (RFE) not being resilient to worker loss. Interestingly, we didn't see this behavior in classic when running the same computation. Take this with a huge grain of salt, but I'm wondering if it has something to do with these annotations. Were they excluded from the operator spec for any specific reason? In any case, adding them to the operator worker pods resolved the issue for us.
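For illustration, a hedged sketch of attaching an eviction-blocking annotation to worker pods via a custom spec. The `spec.worker.metadata` placement is an assumption (whether the CRD propagates pod annotations this way depends on the dask-kubernetes release, and the thread doesn't confirm exactly what was added); the annotation key is the standard cluster-autoscaler `safe-to-evict` marker:

```python
from dask_kubernetes.operator import KubeCluster, make_cluster_spec

spec = make_cluster_spec(name="training")  # hypothetical cluster name

# Assumed placement: pod metadata under spec.worker.metadata. Verify that your
# dask-kubernetes version propagates this to worker pods before relying on it.
annotations = spec["spec"]["worker"].setdefault("metadata", {}).setdefault("annotations", {})
annotations["cluster-autoscaler.kubernetes.io/safe-to-evict"] = "false"

cluster = KubeCluster(custom_cluster_spec=spec)
```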
Those tolerations are only there so that users can set up dedicated Dask nodes by adding the corresponding taint. Features like this were added by users who needed specific functionality. When we wrote the operator we intentionally left out a load of things like this because it's much easier to just add them yourself now, and it helps keep the package simpler.

Generally Dask workers should be safe to evict. When they receive a SIGTERM they attempt to hand off any tasks and memory to another worker before shutting down. I think it's actually XGBoost that isn't resilient to this, so I don't think we should add a default just for that single use case. I think this definitely feeds into #899. Adding a convenience method to make it easy to add annotations would be very helpful.
Hey, @jacobtomlinson!
We're trying to run a training job that uses xgboost + dask + distributed + dask-kubernetes. It works fine as long as we provision the dask cluster with KubeCluster classic. As soon as we provision the cluster with the KubeCluster operator, the exact same computation fails with the "Allreduce failed" error in the title.
All other dependencies are identical! We're even using the same version of dask-kubernetes (2023.10.0).
I'm sorry for not providing an MCVE. It would be really hard to extricate the computation and replicate it without the proprietary dataset.
Have you ever run into anything like this before? Any advice would be greatly appreciated. Thank you!
Environment: