Training Collapse #18

Open
JewelChen2019 opened this issue Jul 25, 2024 · 0 comments

JewelChen2019 commented Jul 25, 2024

Hey, thanks for your excellent work. I'm following the open-sourced code and have a few questions about the training procedure:

  1. I pulled the latest code from GitHub and ran the stage-1 training on ImageNet from scratch on a single machine with 8 A100 GPUs, but the training log looks abnormal. The reconstruction loss appears to diverge and the visualization results degrade. (See the attached images.)

  2. The training command uses '-num_nodes 4'. What does this hyperparameter mean?

  3. The default training code saves a checkpoint every n steps rather than keeping the top-K checkpoints by 'val/recon_loss'. Should I use a top-K checkpoint callback instead?
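On question 2, my current reading is sketched below. It assumes the flag behaves like the `num_nodes` argument of PyTorch Lightning's `Trainer`, where the total number of processes is the node count times the devices per node. I have not confirmed that this repository uses it that way. The function names and the per-GPU batch size of 16 are made up for illustration.

```python
def world_size(num_nodes: int, devices_per_node: int) -> int:
    """Total number of training processes, one per GPU."""
    return num_nodes * devices_per_node


def global_batch_size(per_gpu_batch: int, num_nodes: int, devices_per_node: int) -> int:
    """Samples consumed per optimizer step across all GPUs (no gradient accumulation)."""
    return per_gpu_batch * world_size(num_nodes, devices_per_node)


# Hypothetical per-GPU batch size; the real value comes from the training config.
PER_GPU_BATCH = 16

# Default command ('-num_nodes 4'), assuming 8 GPUs per node.
default_setup = global_batch_size(PER_GPU_BATCH, num_nodes=4, devices_per_node=8)
# My machine: a single node with 8 GPUs.
my_setup = global_batch_size(PER_GPU_BATCH, num_nodes=1, devices_per_node=8)

print(default_setup, my_setup, default_setup // my_setup)
```

If that reading is right, my single 8-GPU machine trains with a global batch 4x smaller than the 4-node default, so the learning rate may need adjusting. That could be related to the divergence in point 1, but I am not sure.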

[Screenshot: training_2024-07-25 11 14 21]

training: [Figure: train-recon]

validation: [Figure: val-recon]
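To make question 3 concrete, here is a minimal sketch of the two checkpoint retention policies I am comparing. It is plain Python and not the repository's code: `TopKTracker` and `every_n_steps` are hypothetical helpers, and the loss values are invented to mimic a curve that improves and then diverges.

```python
import heapq


class TopKTracker:
    """Keep the k checkpoints with the lowest monitored value (hypothetical helper)."""

    def __init__(self, k: int):
        self.k = k
        # Losses are stored negated, so the heap root is the worst kept checkpoint.
        self._heap = []

    def update(self, step: int, val_loss: float) -> None:
        heapq.heappush(self._heap, (-val_loss, step))
        if len(self._heap) > self.k:
            heapq.heappop(self._heap)  # drop the checkpoint with the highest loss

    def kept_steps(self) -> list:
        return sorted(step for _, step in self._heap)


def every_n_steps(steps, n):
    """Periodic policy: keep every checkpoint whose step is a multiple of n."""
    return [s for s in steps if s % n == 0]


# Made-up validation losses that improve and then diverge.
val_recon = {1000: 0.90, 2000: 0.55, 3000: 0.40, 4000: 0.38, 5000: 0.75, 6000: 1.60}

tracker = TopKTracker(k=2)
for step, loss in val_recon.items():
    tracker.update(step, loss)

print(tracker.kept_steps())            # steps kept by the top-K policy
print(every_n_steps(val_recon, 2000))  # steps kept by the periodic policy
```

With these numbers the top-K policy keeps only the two checkpoints from before the collapse, while the periodic policy also keeps the diverged one at step 6000. That is why I am asking whether top-K on 'val/recon_loss' is the better default.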
