What is considered a large Gradient Norm? #30

Instead of trying to answer this directly (the threshold will vary from model to model), I would look for signs of being instability-bound when you do a learning rate sweep. What happens when you take your best learning rate lr* and run at 2·lr* or 4·lr*? Do you see loss instability? If so, that's a sign you should be able to improve performance by dealing with the instability in some way. Warmup and gradient clipping are the easiest ways to tackle this.
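
For concreteness, here is a minimal sketch of those two easy mitigations, linear warmup plus gradient-norm clipping, assuming PyTorch. The model, data, and hyperparameter values below are illustrative placeholders, not part of the original answer.

```python
import torch

# Placeholder model and optimizer; substitute your own.
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr* from your sweep

warmup_steps = 1000    # assumption: tune for your setup
max_grad_norm = 1.0    # assumption: a common default for clipping

# Linear warmup: scale the LR from ~0 up to lr* over the first warmup_steps.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)

for step in range(10_000):  # stand-in for iterating over a real data loader
    x = torch.randn(32, 128)
    y = torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # Clip the global gradient norm. The return value is the *pre-clip*
    # norm, which is worth logging to see how large your norms actually
    # are and how often clipping fires.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    scheduler.step()
```

Logging the pre-clip norm this way also addresses the original question empirically: rather than looking for a universal threshold, watch for spikes relative to the norm your own run typically reports.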

Answer selected by adrian-dalessandro