Resuming training via `--load_step` #30

justinchiu · 2023-12-30T07:58:07Z

Thanks for the code release!

Heads up for other users who want to resume training from a checkpoint. You will want to:

- de-indent `DDP_main.py:80` so that all devices can load the checkpoint
- load the optimizer and scheduler states at `DDP_main.py:146`
- set the index of the dataloader to the correct example before actually training
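
For anyone who wants a starting point, here is a rough sketch of those three steps in plain PyTorch. It is not the code from `DDP_main.py`: the function names, the checkpoint dictionary keys, and `steps_per_epoch` are all made up for illustration, and the toy model and scheduler only exist to make the example runnable.

```python
import itertools
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, scheduler, step):
    # Hypothetical layout: one dict holding everything needed to resume.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "step": step,
        },
        path,
    )


def load_checkpoint(path, model, optimizer, scheduler, device="cpu"):
    # Call this on every rank, not only rank 0, so each device gets the weights.
    # map_location moves the tensors onto this rank's own device.
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])  # momentum etc. and current lr
    scheduler.load_state_dict(ckpt["scheduler"])  # position in the lr schedule
    return ckpt["step"]


def skip_consumed_batches(loader, step, steps_per_epoch):
    # Drop the batches already seen in the interrupted epoch.
    # Note: islice still iterates over the skipped batches, so this is simple
    # but not fast for large offsets.
    return itertools.islice(loader, step % steps_per_epoch, None)


def build():
    # Toy stand-ins for the real model, optimizer and scheduler.
    model = nn.Linear(2, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
    return model, optimizer, scheduler


# Train for three steps, save, then resume into freshly built objects.
model, optimizer, scheduler = build()
for _ in range(3):
    optimizer.zero_grad()
    model(torch.ones(1, 2)).sum().backward()
    optimizer.step()
    scheduler.step()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ckpt.pt")
    save_checkpoint(path, model, optimizer, scheduler, step=3)
    new_model, new_optimizer, new_scheduler = build()
    resumed_step = load_checkpoint(path, new_model, new_optimizer, new_scheduler)

# A plain range stands in for the dataloader: global step 7 with 5 steps per
# epoch means 2 batches of the current epoch were already consumed.
remaining = list(skip_consumed_batches(range(5), step=7, steps_per_epoch=5))
```

In a real DDP run you would pass this rank's device as `device`, and if you use a `DistributedSampler` you probably also need to call `set_epoch` with the resumed epoch so the shuffle order matches the interrupted run.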

I'm not totally sure this covers everything (logging, for example), but it might work OK.

Note: There's also a separate issue where your checkpoints might get overwritten between epochs, so double-check that you're loading the right file and saving to the path you intend.
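
One possible way around the overwriting, as a sketch only: put the epoch and step into the checkpoint filename so later saves never clobber earlier ones. The naming scheme and both helper functions below are hypothetical and not what the repo does.

```python
import tempfile
from pathlib import Path


def checkpoint_path(save_dir, epoch, step):
    # Zero-padded fields keep alphabetical order equal to training order.
    return Path(save_dir) / f"checkpoint_epoch{epoch:03d}_step{step:08d}.pt"


def latest_checkpoint(save_dir):
    # Returns the most recent checkpoint by name, or None if there is none.
    candidates = sorted(Path(save_dir).glob("checkpoint_epoch*_step*.pt"))
    return candidates[-1] if candidates else None


# Demo with empty placeholder files instead of real checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    none_yet = latest_checkpoint(tmp)
    for epoch, step in [(0, 500), (1, 250), (0, 1000)]:
        checkpoint_path(tmp, epoch, step).touch()
    latest_name = latest_checkpoint(tmp).name
```

Printing the resolved path right before loading is a cheap way to confirm you are resuming from the file you expect.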
