Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inform submission about accumulated_submission_time #785

Closed
Niccolo-Ajroldi opened this issue Sep 12, 2024 · 2 comments
Closed

Inform submission about accumulated_submission_time #785

Niccolo-Ajroldi opened this issue Sep 12, 2024 · 2 comments

Comments

@Niccolo-Ajroldi
Copy link
Contributor

Description

Currently, update_params has no up-to-date information about the elapsed time since start.

My motivation for adding this feature is to simplify the implementation of a time-based learning rate schedule.

Can't a submission just keep track of time or estimate it?
In theory yes, this is allowed by the rules and feasible. However, such implementation would require synchronization among devices inside update_params when training in distributed mode, which would penalize such submission.

Why is a time-based scheduler useful?
Currently, a submission can implement a LR scheduler using step_hint as a step budget. This is a reliable estimate of the number of steps needed for (N)AdamW to reach max_runtime. However, a submission could be faster/slower than (N)AdamW, and the extent of this difference can vary based on the workload itself. This makes deriving a custom step budget from step_hint suboptimal.

Implementation

We could simply pass train_state to update_params, or even just train_state['accumulated_submission_time'].
Notice that update_params already receives in input eval_results, which informs the submission about the elapsed time at the moment of the last evaluation, which is not up-to date with accumulated_submission_time.

@Niccolo-Ajroldi Niccolo-Ajroldi changed the title Inform submission about 'accumulated_submission_time' Inform submission about accumulated_submission_time Sep 12, 2024
@Niccolo-Ajroldi
Copy link
Contributor Author

Niccolo-Ajroldi commented Oct 3, 2024

Remark: we should avoid that a submission modifies train_state.

A straightforward solution would be to pass it by copy. A shallow copy should be enough, since all values in train_state are primitive types, and cannot be modified in-place. Although very fast, this copying operation would however be counted as part of the submission time.

Another solution would be to just pass train_state['accumulated_submission_time'] instead.

@priyakasimbeg
Copy link
Contributor

Fixed in #790

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants