What is considered a large Gradient Norm? #30
-
When the guide mentions that "Clipping can fix either early training instability (large gradient norm early)", what counts as a large gradient norm? Is it a fixed number, or is it relative? I often see the clipping threshold set to 10 for common classification problems. However, many image regression problems can have very large gradient norms (even though the individual gradients themselves aren't too large). Similarly, when working on problems with high-resolution images, I believe the gradients will be larger as well, since each kernel weight receives a gradient that is the sum of gradients across the entire receptive field (correct me if I'm wrong here). Is this a problem, or is clipping simply relative?
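For concreteness, here is a minimal PyTorch-style sketch of how one could log the global gradient norm per step to see what "large" means for a given model. The model, dummy data, and the clipping threshold of 10.0 are placeholders, not recommendations; `clip_grad_norm_` conveniently returns the total norm computed before clipping, so it doubles as a diagnostic.

```python
import torch
import torch.nn as nn

# Placeholder model/data: a tiny conv regressor on random high-ish resolution inputs.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(100):
    x = torch.randn(8, 3, 64, 64)   # dummy image batch
    y = torch.randn(8, 1)           # dummy regression target
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Returns the total gradient norm *before* clipping, then clips in place.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
    if step % 10 == 0:
        print(f"step {step}: grad norm = {total_norm:.3f}")
```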
Replies: 1 comment
-
Instead of trying to pin down a number (it will vary from model to model), I would look for signs that you are instability-bound when you do a learning rate sweep. What happens when you take your best learning rate lr* and run at 2·lr* or 4·lr*? Do you see loss instability? If so, that's a sign you should be able to improve performance by dealing with the instability in some way. Warmup and clipping are the easiest ways to tackle this.
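A hedged sketch of that check: after a sweep finds lr*, re-run training at 2x and 4x that rate and look for loss spikes or divergence. The toy model, the value of `best_lr`, and the spike heuristic below are illustrative stand-ins for your real training loop and monitoring, not a prescribed implementation.

```python
import math
import torch
import torch.nn as nn

def train_and_return_losses(lr, steps=200):
    """Stand-in training loop on a toy regression problem; returns per-step losses."""
    torch.manual_seed(0)
    model = nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    x = torch.randn(256, 10)
    y = x.sum(dim=1, keepdim=True)
    losses = []
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

def is_unstable(losses, spike_factor=2.0):
    # Crude heuristic: a non-finite loss, or any loss jumping to more than
    # spike_factor times the best loss seen so far, counts as instability.
    best = float("inf")
    for loss in losses:
        if not math.isfinite(loss) or loss > spike_factor * best:
            return True
        best = min(best, loss)
    return False

best_lr = 0.05  # placeholder for the lr* your sweep found
for mult in (1, 2, 4):
    losses = train_and_return_losses(mult * best_lr)
    print(f"{mult}x lr*: final loss = {losses[-1]:.4f}, unstable = {is_unstable(losses)}")
```

If the 2x or 4x runs show spikes that the 1x run does not, that's the signal to try warmup and/or clipping before concluding the model is already tuned.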