Replies: 7 comments
-
it would need to support stochastic rounding and/or Kahan summation. furthermore, any new optimiser needs to be proven in a toy ViT training session against AdamW in bf16 and Adam in fp32. if you are interested, please open a pull request with this data.
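for reference, stochastic rounding for bf16 parameter updates is commonly implemented by adding uniform noise over the 16 low bits that bf16 drops and then truncating. a minimal PyTorch sketch, not SimpleTuner code, with a hypothetical helper name:

```python
import torch

def add_stochastic_(param_bf16: torch.Tensor, update_fp32: torch.Tensor) -> None:
    """Apply an fp32 update to a bf16 parameter with stochastic rounding.

    Hypothetical helper, not SimpleTuner's implementation: the fp32 result is
    rounded to one of the two neighbouring bf16 values with probability
    proportional to distance, so rounding error averages out over many steps
    instead of being silently truncated every step.
    """
    exact = param_bf16.float() + update_fp32        # do the update in fp32
    bits = exact.view(torch.int32)                  # reinterpret the fp32 bit pattern
    noise = torch.randint_like(bits, 0, 1 << 16)    # uniform noise over the 16 dropped bits
    rounded = (bits + noise) & -65536               # the carry decides round-up vs round-down
    param_bf16.copy_(rounded.view(torch.float32).bfloat16())
```

inside an optimiser step you would keep the Lion update maths in fp32 and finish with something like `add_stochastic_(p.data, -lr * update)` under `torch.no_grad()`.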
-
fwiw, D-Adaptation can only hit 80% accuracy on CIFAR-10 after 20 epochs. this doesn't bode well for the optimiser's performance, especially versus a more robust option like Adam(W).
-
i know that it isn't good at long runs because it over-adapts after maybe 100,000 steps. my use case is to do a couple of short runs with dadapt to find a learning rate, then do a full run with regular lion using that value.
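here's a rough sketch of that two-stage workflow, assuming the `dadaptation` and `lion-pytorch` packages; the `DAdaptLion` import and the `'d'` param-group key are assumptions based on how the other D-Adaptation optimisers report their adapted step size, so verify against the version you use:

```python
import torch
from torch import nn
from dadaptation import DAdaptLion   # assumes a dadaptation release that ships DAdaptLion
from lion_pytorch import Lion

# toy stand-ins so the sketch runs on its own; swap in your real model and data
model = nn.Linear(32, 10)
loss_fn = nn.CrossEntropyLoss()

def batches(n_steps, batch_size=64):
    for _ in range(n_steps):
        yield torch.randn(batch_size, 32), torch.randint(0, 10, (batch_size,))

# stage 1: short probing run. with lr=1.0 the effective step size is just the adapted d
probe_opt = DAdaptLion(model.parameters(), lr=1.0)
for x, y in batches(2_000):          # stop well before dadapt starts to over-adapt
    loss_fn(model(x), y).backward()
    probe_opt.step()
    probe_opt.zero_grad()

# assumption: like the other D-Adaptation optimisers, the adapted step size is
# exposed as 'd' in the param group, and the effective LR is d * lr
group = probe_opt.param_groups[0]
found_lr = group["d"] * group["lr"]
print(f"plateau LR from dadapt-lion: {found_lr:.2e}")

# stage 2: full run with plain lion at the discovered learning rate
full_opt = Lion(model.parameters(), lr=found_lr, weight_decay=1e-2)
```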
-
the above is what the LR curve for dadapt usually looks like; i'm grabbing the value from the first plateau.
-
learning rates don't really transfer across optimisers like that, so i'm still not sure it's a useful approach.
-
Eh... I'm sure there will still be some differences, but at least they are still both lion. quote from a web search: "Yes, you can use 'Dadapt AdamW' to find a good learning rate for AdamW."
-
as mentioned before, if you want to see this in simpletuner you'll have to provide the implementation with those requirements met, as well as data showing its bf16 training effectiveness against Adam(W) in fp32 and in bf16 with stochastic rounding.
-
coming over here from OneTrainer.
It supports LION and DADAPT-LION.
You support LION, and... quantized lion?
What about adaptive lion?