Currently, `update_params` has no up-to-date information about the elapsed time since the start of training.
My motivation for adding this feature is to simplify the implementation of a time-based learning rate schedule.
Can't a submission just keep track of time or estimate it?
In theory, yes: this is allowed by the rules and is feasible. However, such an implementation would require synchronization across devices inside `update_params` when training in distributed mode, which would penalize such a submission.
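To illustrate the cost, here is a minimal sketch of what self-tracking could look like in distributed PyTorch (assuming an initialized process group; `synced_elapsed_time` and `start_time` are hypothetical names, not part of the AlgoPerf API):

```python
import time

import torch
import torch.distributed as dist


def synced_elapsed_time(start_time: float) -> float:
    """Elapsed time agreed upon by all ranks.

    Per-device wall clocks can drift, so rank 0's measurement is
    broadcast to every rank on each call. This collective is exactly
    the synchronization overhead mentioned above, and it would count
    against the submission's runtime.
    """
    elapsed = torch.tensor(
        [time.monotonic() - start_time],
        dtype=torch.float64,
        device='cuda' if torch.cuda.is_available() else 'cpu')
    dist.broadcast(elapsed, src=0)  # all ranks adopt rank 0's clock
    return elapsed.item()
```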
Why is a time-based scheduler useful?
Currently, a submission can implement an LR scheduler using `step_hint` as a step budget. This is a reliable estimate of the number of steps needed for (N)AdamW to reach `max_runtime`. However, a submission could be faster or slower than (N)AdamW, and the extent of this difference varies with the workload itself. This makes deriving a custom step budget from `step_hint` suboptimal.
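For concreteness, a minimal sketch of the status quo: a schedule that spends the `step_hint` budget (the cosine shape is illustrative, not prescribed by the rules):

```python
import math


def lr_from_step_budget(step: int, step_hint: int,
                        base_lr: float = 1e-3) -> float:
    # Progress through the run, measured in steps. step_hint is
    # calibrated for (N)AdamW, so a faster or slower submission will
    # exhaust the cosine too early or too late relative to max_runtime.
    progress = min(step / step_hint, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```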
Implementation
We could simply pass `train_state` to `update_params`, or even just `train_state['accumulated_submission_time']`.
Note that `update_params` already receives `eval_results` as input, which informs the submission about the elapsed time at the moment of the last evaluation; that value, however, is not up to date with `accumulated_submission_time`.
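With such a change, the time-based analogue of the schedule above becomes trivial. A minimal sketch, assuming the harness passed `accumulated_submission_time` into `update_params` (hypothetical here) and that the workload's `max_runtime` is known:

```python
import math


def lr_from_time_budget(accumulated_submission_time: float,
                        max_runtime: float,
                        base_lr: float = 1e-3) -> float:
    # Progress is measured in wall-clock time rather than steps, so the
    # schedule tracks the actual runtime budget regardless of how fast
    # the optimizer steps through the workload.
    progress = min(accumulated_submission_time / max_runtime, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```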
Remark: we should prevent a submission from modifying `train_state`.
A straightforward solution would be to pass it by copy; a shallow copy should be enough, since all values in `train_state` are primitive types and cannot be modified in place (see the sketch below). Although very fast, this copy operation would nevertheless be counted as part of the submission time.
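A self-contained illustration of why a shallow copy suffices (the keys shown are examples, not the full `train_state`):

```python
train_state = {
    'accumulated_submission_time': 123.4,  # example values only
    'is_time_remaining': True,
}

# dict() copies only the top-level mapping, but since every value is an
# immutable primitive, a submission cannot mutate the harness's copy.
safe_copy = dict(train_state)
safe_copy['accumulated_submission_time'] = 0.0

assert train_state['accumulated_submission_time'] == 123.4
```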
Another solution would be to just pass `train_state['accumulated_submission_time']` instead.