Training Collapse #18

JewelChen2019 · 2024-07-25T07:20:21Z

Hey, thanks for your excellent work. I'm following the open-sourced code and have a few questions about the training procedure:

I pulled the latest code from GitHub and ran the stage1 training code on ImageNet from scratch on an 8-GPU A100 machine, but the training log looks abnormal: the recon loss appears to diverge and the visualization results get worse. (See the attached image in the email.)
The training code uses '-num_nodes 4'. What does this hparam mean?
The default training code saves a checkpoint every n steps rather than keeping the top-K checkpoints by 'val/recon_loss'. Should I use a top-K checkpoint callback instead?
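
To make the last question concrete, here is a minimal sketch of what I mean by top-K checkpointing. It is plain Python and not taken from the repo: the class name `TopKCheckpointTracker` and its methods are hypothetical, and the step/loss values in the usage note are made up for illustration. It only decides which checkpoints to keep and does not write any files.

```python
import heapq


class TopKCheckpointTracker:
    """Hypothetical helper: keeps the k checkpoints with the lowest monitored loss."""

    def __init__(self, k):
        self.k = k
        # heapq is a min-heap, so losses are stored negated: the entry at
        # index 0 is then the WORST (highest-loss) checkpoint currently kept.
        self._heap = []

    def update(self, step, loss):
        """Record a validation result; return True if this checkpoint is kept."""
        entry = (-loss, step)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
            return True
        worst_loss = -self._heap[0][0]
        if loss < worst_loss:
            # Evict the worst kept checkpoint and keep the new one.
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def best(self):
        """Return the kept (step, loss) pairs, lowest loss first."""
        pairs = [(step, -neg_loss) for neg_loss, step in self._heap]
        return sorted(pairs, key=lambda pair: pair[1])
```

For example, with `k=2`, feeding in `(1000, 0.5)`, `(2000, 0.3)`, `(3000, 0.9)`, `(4000, 0.2)` keeps steps 4000 and 2000 and rejects step 3000. If the recon loss diverges late in training, this would retain the checkpoints from before the divergence, whereas saving every n steps keeps whatever was written most recently. If the training code is built on PyTorch Lightning (my assumption), I think `ModelCheckpoint` with `monitor='val/recon_loss'`, `mode='min'`, and `save_top_k` would give the same behavior.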

Training:

Validation:
