Question regarding source code #207
Comments
And also here, I think
Hi @eedalong, we test both elastic and non-elastic fault tolerance in the unit tests and in the Ray release tests, so it should generally work. Do you have an example where it does not? As for the code: in the first snippet you linked, the relevant part is this:
which will not force an actor restart before we commence training. In the second snippet you linked, the relevant part is:
which updates the training state. This triggers multiple times, because the training futures will usually not be ready yet. Once they are ready, training is over, so we no longer care about actor states. Does this make sense?
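To illustrate the polling pattern described above, here is a minimal sketch that uses only the Python standard library as a stand-in. It is not the project's actual code: `fake_train`, `run_training`, and the `state_updates` counter are hypothetical names, and `concurrent.futures.wait` plays the role that a call such as `ray.wait(futures, timeout=...)` would presumably play in a Ray-based loop.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def fake_train(seconds):
    # Hypothetical stand-in for one worker's training task.
    time.sleep(seconds)
    return "done"


def run_training(num_workers=2, train_seconds=0.3, poll_interval=0.05):
    state_updates = 0
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(fake_train, train_seconds) for _ in range(num_workers)]
        while True:
            # Block for at most poll_interval, then report which futures finished.
            done, not_done = wait(futures, timeout=poll_interval)
            if not not_done:
                # All training futures are ready: training is over, so
                # worker state no longer needs to be tracked.
                break
            # Futures are still pending: this is where the training state
            # (for example, which workers are alive) would be refreshed.
            # Assumption: a simple counter stands in for that update.
            state_updates += 1
        results = [f.result() for f in futures]
    return results, state_updates
```

Calling `run_training()` returns the per-worker results together with the number of state updates performed. Because the futures are still pending on most polls, the update branch runs several times before the loop exits, which matches the behavior described in the reply.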
Elastic and non-elastic training seem to have the same failure-handling strategy: both restart all failed workers and wait until those workers finish loading data (source code here).