About "train_predict_loss": NaN, "train_edge_loss": NaN #35

Open

fangfangbuaibiancheng opened this issue Oct 30, 2024 · 6 comments

@fangfangbuaibiancheng

Hello author, I ran into gradient explosion during training. Below is the log output I printed:
{"train_lr": 1.2480487331573042e-05, "train_predict_loss": 0.7279542872231782, "train_edge_loss": 0.0554991626542601, "test_average_f1": 0.13728936245250195, "epoch": 0}
{"train_lr": 3.74804873315729e-05, "train_predict_loss": 0.6461437258957582, "train_edge_loss": 0.0565992421235376, "epoch": 1}
{"train_lr": 6.248048733157299e-05, "train_predict_loss": 0.3960253125017053, "train_edge_loss": 0.055867052419107784, "epoch": 2}
{"train_lr": 8.748048733157295e-05, "train_predict_loss": 0.18236436514146548, "train_edge_loss": 0.06066870497076324, "epoch": 3}
{"train_lr": 9.999787476642203e-05, "train_predict_loss": NaN, "train_edge_loss": NaN, "test_average_f1": 0.0, "epoch": 4}
Below is my train.sh:
CUDA_VISIBLE_DEVICES=0,1,2,3 \
torchrun \
  --standalone \
  --nnodes=1 \
  --nproc_per_node=4 \
  main_train.py \
  --world_size 1 \
  --batch_size 1 \
  --data_path '/root/data1/IML-ViT-main/Dataset/CASIA2.0' \
  --epochs 200 \
  --lr 1e-4 \
  --min_lr 5e-7 \
  --weight_decay 0.05 \
  --edge_lambda 20 \
  --predict_head_norm "BN" \
  --vit_pretrain_path '/root/data1/IML-ViT-main/pretrained-weights/mae_pretrain_vit_base.pth' \
  --test_data_path '/root/data1/IML-ViT-main/Dataset/CASIA1.0' \
  --warmup_epochs 4 \
  --output_dir ./output_dir/ \
  --log_dir ./output_dir/ \
  --accum_iter 8 \
  --seed 42 \
  --test_period 4 \
  --num_workers 8 \
  2> train_error.log 1>train_log.log
Is the gradient explosion caused by the learning rate being too high? Could I change lr and min_lr to 1e-5 and 5e-6? GPT suggested increasing the batch_size, but my batch_size is currently 1, and setting it to 2 runs out of GPU memory, since I only have four cards with 16 GB of VRAM each. Looking forward to your reply, thank you!
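
A rough sketch of the effective batch size implied by the flags above, assuming --accum_iter in main_train.py is ordinary gradient accumulation over accum_iter steps, as in MAE-style training code (an assumption, not confirmed here). If that holds, raising --accum_iter increases the effective batch size without extra GPU memory, unlike raising --batch_size:

# Effective-batch-size check for the command above (values copied from train.sh).
BATCH_SIZE=1    # --batch_size, per-GPU batch size
NPROC=4         # --nproc_per_node, number of GPUs
ACCUM_ITER=8    # --accum_iter, assumed gradient accumulation steps
echo "effective batch size = $((BATCH_SIZE * NPROC * ACCUM_ITER))"  # prints 32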

@SunnyHaze
Owner

Hi, this setting is very close to the one we commonly train with. I think changing the seed may be enough, or you can try changing accum_iter slightly to 4 or 16, which should also help.

@SunnyHaze
Owner

If that still doesn't work, you can lower the learning rate a bit, though it will slow down convergence.
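
Putting these suggestions together, a minimal sketch of an adjusted train.sh. The changes are listed in the leading comments and shown in one command only for compactness; in practice you would try one change at a time, as suggested above. The alternative seed is an arbitrary example, while the accum_iter and learning-rate values are the ones tried later in this thread:

# Same command as the train.sh posted above, with only these flags changed:
#   --lr         1e-4 -> 1e-5   (lower peak learning rate)
#   --min_lr     5e-7 -> 5e-6   (value tried later in this thread)
#   --accum_iter    8 -> 4      (or 16)
#   --seed         42 -> 3407   (arbitrary alternative seed, example only)
CUDA_VISIBLE_DEVICES=0,1,2,3 \
torchrun \
  --standalone \
  --nnodes=1 \
  --nproc_per_node=4 \
  main_train.py \
  --world_size 1 \
  --batch_size 1 \
  --data_path '/root/data1/IML-ViT-main/Dataset/CASIA2.0' \
  --epochs 200 \
  --lr 1e-5 \
  --min_lr 5e-6 \
  --weight_decay 0.05 \
  --edge_lambda 20 \
  --predict_head_norm "BN" \
  --vit_pretrain_path '/root/data1/IML-ViT-main/pretrained-weights/mae_pretrain_vit_base.pth' \
  --test_data_path '/root/data1/IML-ViT-main/Dataset/CASIA1.0' \
  --warmup_epochs 4 \
  --output_dir ./output_dir/ \
  --log_dir ./output_dir/ \
  --accum_iter 4 \
  --seed 3407 \
  --test_period 4 \
  --num_workers 8 \
  2> train_error.log 1>train_log.log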

@fangfangbuaibiancheng
Author

Hello author, I tried changing accum_iter to 4, but the gradient explosion still appears in epoch 4. I also tried lowering the learning rate, changing --lr 1e-4 --min_lr 5e-7 to 1e-5 and 5e-6, which does indeed slow down convergence. You mentioned changing the seed; the previous seed was 42, so what value would be more suitable? Thanks for your reply!

@SunnyHaze
Owner

SunnyHaze commented Nov 2, 2024 via email

@fangfangbuaibiancheng
Author

Okay, raising edge_lambda to 30 still led to gradient explosion, but lowering the learning rate fixed it. Thank you!

@SunnyHaze
Owner

SunnyHaze commented Nov 15, 2024 via email
