
Training doesn't resume from previous checkpoint using max_train_steps #1172

Open

playerzer0x opened this issue Nov 21, 2024 · 3 comments

  1. I train a model to 10k steps.
  2. I change max_train_steps in the config to 15000.
  3. I swap out the data loader for an updated multidatabackend.
  4. I start training and receive this message:
    2024-11-21 01:52:34,920 [INFO] Reached the end (58 epochs) of our training run (42 epochs). This run will do zero steps.
  5. Training doesn't continue.

If I set max_train_steps to 0 and change num_train_epochs to 100, training starts fine. I haven't counted, but the updated dataset used for the resume may be smaller than the original dataset.

My brain thinks in steps, so I'd prefer to use steps over epochs.

bghira (Owner) commented Nov 21, 2024

well, that is normal. you are no longer resuming the old training run, as you have changed everything.

it's not really recommended to change anything within a single training run, let alone the entire dataset or the step schedule
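
A minimal sketch of how the two epoch counts in the log above can diverge once the dataloader changes, assuming resume logic along the lines of the diffusers training loop; every name and number below is illustrative, not SimpleTuner's actual code:

```python
import math

# Illustrative resume arithmetic (assumed, diffusers-style; not SimpleTuner's code).
max_train_steps = 15000
resumed_epoch = 58         # hypothetical: epoch count restored from the 10k-step checkpoint
new_steps_per_epoch = 358  # hypothetical: steps per epoch with the swapped-in dataloader

# The run's total epochs are recomputed from the fixed step budget and the
# *new* dataset length, while the resume position still reflects the old data.
num_train_epochs = math.ceil(max_train_steps / new_steps_per_epoch)  # 42

if resumed_epoch >= num_train_epochs:
    print(f"Reached the end ({resumed_epoch} epochs) of our training run "
          f"({num_train_epochs} epochs). This run will do zero steps.")
```

With the restored step counter already past the recomputed epoch budget, the loop concludes there is nothing left to train.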

playerzer0x (Author) commented
This change would be across two separate training runs. I'm following Caith's recommendation on training new subjects into a "base LoKR" that was previously trained on styles.

bghira (Owner) commented Nov 22, 2024

you want to use --init_lora to begin a new training run from the old lokr then. it takes a path to the safetensors file
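
A hedged sketch of what that invocation might look like; only --init_lora and its safetensors-path argument come from the comment above, while the entry-point name, checkpoint path, and other flags are placeholders:

```python
import subprocess

# Hedged sketch: start a *fresh* run seeded from the old LoKR weights.
# Only --init_lora (path to the safetensors file) is confirmed above;
# the script name, checkpoint path, and remaining flags are placeholders.
subprocess.run(
    [
        "python", "train.py",
        "--init_lora", "/path/to/base_lokr.safetensors",  # hypothetical path
        "--max_train_steps", "15000",
    ],
    check=True,
)
```

Because this begins a new run rather than resuming one, the step counter starts at zero, so max_train_steps no longer collides with the old checkpoint's epoch count.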
