About "train_predict_loss": NaN, "train_edge_loss": NaN #35
Comments
Hi, this setting is very close to the training setup we usually use. I think simply changing the seed may be enough, or you could try setting accum_iter to 4 or 16, which should also help.
If that still doesn't work, you can lower the learning rate a bit, though that will slow down convergence.
Hi author, I tried changing accum_iter to 4, but the gradient still explodes at epoch 4. I also tried lowering the learning rate, changing --lr 1e-4 --min_lr 5e-7 to 1e-5 and 5e-6, which does slow down convergence. You suggested changing the seed; the previous seed was 42, so what value would be more suitable? Thank you for your reply!
The seed is just the random seed, so any value will do; the point is to try a different initialization and see whether it behaves better.
If things still look wrong, you could also try raising edge lambda to 30.
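A minimal sketch of those suggestions, written against the main_train.py flags shown in the poster's train.sh further below (only the flags to change are listed; the seed value here is an arbitrary example, not a recommendation from the author):

# try a different random initialization and accumulation setting
--seed 3407          # any value other than 42 is fine
--accum_iter 4       # or 16, instead of the original 8
# if the losses still go to NaN, raise the edge loss weight
--edge_lambda 30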
OK, raising edge lambda to 30 still led to gradient explosion, but lowering the learning rate fixed it. Thank you!
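For reference, a sketch of the change that resolved it, assuming the values the poster mentioned earlier in the thread (relative to the train.sh below):

--lr 1e-5       # lowered from 1e-4
--min_lr 5e-6   # changed from 5e-7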
Great, glad it's sorted out! Best wishes!
Hi author, I ran into a gradient explosion problem during training. Below is the log output I printed:
{"train_lr": 1.2480487331573042e-05, "train_predict_loss": 0.7279542872231782, "train_edge_loss": 0.0554991626542601, "test_average_f1": 0.13728936245250195, "epoch": 0}
{"train_lr": 3.74804873315729e-05, "train_predict_loss": 0.6461437258957582, "train_edge_loss": 0.0565992421235376, "epoch": 1}
{"train_lr": 6.248048733157299e-05, "train_predict_loss": 0.3960253125017053, "train_edge_loss": 0.055867052419107784, "epoch": 2}
{"train_lr": 8.748048733157295e-05, "train_predict_loss": 0.18236436514146548, "train_edge_loss": 0.06066870497076324, "epoch": 3}
{"train_lr": 9.999787476642203e-05, "train_predict_loss": NaN, "train_edge_loss": NaN, "test_average_f1": 0.0, "epoch": 4}
Below is my train.sh:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=4 \
    main_train.py \
    --world_size 1 \
    --batch_size 1 \
    --data_path '/root/data1/IML-ViT-main/Dataset/CASIA2.0' \
    --epochs 200 \
    --lr 1e-4 \
    --min_lr 5e-7 \
    --weight_decay 0.05 \
    --edge_lambda 20 \
    --predict_head_norm "BN" \
    --vit_pretrain_path '/root/data1/IML-ViT-main/pretrained-weights/mae_pretrain_vit_base.pth' \
    --test_data_path '/root/data1/IML-ViT-main/Dataset/CASIA1.0' \
    --warmup_epochs 4 \
    --output_dir ./output_dir/ \
    --log_dir ./output_dir/ \
    --accum_iter 8 \
    --seed 42 \
    --test_period 4 \
    --num_workers 8 \
    2> train_error.log 1> train_log.log
Is the gradient explosion caused by the learning rate being too high? Could I change lr and min_lr to 1e-5 and 5e-6? GPT suggested increasing the batch_size, but my batch_size is currently set to 1; when I set it to 2 I run out of GPU memory, since I only have four 16 GB GPUs. Looking forward to your reply, thanks!
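A side note on the batch-size suggestion, assuming IML-ViT follows the usual MAE-style convention in which gradient accumulation multiplies the effective batch size (an assumption about the training script, not something confirmed in this thread):

# effective batch size = per-GPU batch_size x nproc_per_node x accum_iter
# with the settings above: 1 x 4 x 8 = 32
# so raising --accum_iter (e.g. to 16) grows the effective batch size without
# increasing per-GPU memory, unlike raising --batch_size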