[WIP] Testing the lion optimizer #432
Conversation
@mitchellnw oh nice! yea let us try it out and add your anecdata here. in my mind, a learning rate cooldown is still needed with this technique
@rwightman have you tried Lion yet for any of the vision transformers training? your voice is one of the more definitive ones out there
So far at small scale (short B/32 run, batch size 16k), well tuned Lion slightly outperforms AdamW (still tuning AdamW). AdamW (LR 2e-3, WD 0.2, betas=0.9, 0.95) = 42.1. Trying to run analogous experiments for H but resources have been busy..
@lucidrains inconclusive so far, managed to almost match some recent adamw results for a large fine-tune, but it took a fair bit of search. I feel unless very resource constrained, adamw will still be the go-to due to hparam familiarity...
nice! maybe this is the perfect fit then, with the large batch size training needed for clip (in the paper they show a growing advantage of lion over adamw with increasing batch size, so this lines up)
good to know! seems like everyone is reporting needing to fiddle around with hparams before seeing comparable results..
@mitchellnw wanted to thank you for running and sharing this btw! honestly, i was on the fence about this technique, but now i believe it should be used in the large batch size regime
@mitchellnw Thanks for the experiments! I observe that you used betas=0.9, 0.95 for AdamW compared to the default betas=0.9, 0.999, while for Lion it's still the default betas=0.9, 0.99. Could you please try betas=0.95, 0.98? In our experiments, we use this setting if the beta2 in AdamW is 0.99, though in your case it's an even smaller 0.95. Really appreciate it!
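For reference, a minimal sketch of the beta settings being compared here, using the lion-pytorch implementation this PR pulls in; the model, learning rate, and weight decay below are placeholders rather than values tuned in this thread.

```python
# Sketch: Lion's default betas vs. the (0.95, 0.98) variant suggested above.
# Assumes the lion-pytorch package; model / lr / weight_decay are placeholders.
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(512, 512)  # placeholder model

# default Lion betas
opt_default = Lion(model.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=1e-2)

# suggested variant when the AdamW baseline runs with a small beta2
opt_variant = Lion(model.parameters(), lr=1e-4, betas=(0.95, 0.98), weight_decay=1e-2)
```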
@xiangning-chen 👋 just heard another positive result this morning from someone trustworthy! 💯 while you are here, have you figured out which learning rate scheduler is optimal with Lion? it seems like it would matter
@lucidrains Thanks for sharing the good news! We always used the same learning rate schedule as AdamW in our experiments, including cosine decay, linear decay, and constant (all with 10K steps warmup). We also tried rsqrt decay when pre-training ViT on JFT, but the gain of Lion was lower compared to cosine decay, which we also observed in the ViT proxy task. So I would say on ViT, the Lion optimizer is better suited for cosine decay, where the learning rate decays to either zero or a very small value, compared to rsqrt decay.
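For anyone reproducing this, here is a minimal sketch of the linear-warmup-then-cosine-decay schedule described above; the optimizer, warmup length, and total steps are placeholders, not the exact training configuration.

```python
# Sketch: linear warmup followed by cosine decay, as described above.
# warmup_steps / total_steps / optimizer are placeholders.
import math
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup_steps = 10_000
total_steps = 100_000

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)              # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay toward zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# call scheduler.step() once per optimizer step
```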
@xiangning-chen thank you for your recommendation!
Thanks @xiangning-chen, will try the other betas, and potentially a higher LR when making that change! When raising the LR, would you also raise the WD?
@mitchellnw Thank you! Actually I would decrease the WD when raising the LR to maintain the effective weight decay strength LR*WD.
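In other words, keep the product LR * WD roughly constant. A toy illustration with made-up numbers:

```python
# Toy illustration of keeping the effective decay strength LR * WD constant
# when raising the learning rate (the numbers are arbitrary examples).
base_lr, base_wd = 1e-4, 1.0
effective_decay = base_lr * base_wd   # 1e-4

new_lr = 2e-4                         # raise LR by 2x...
new_wd = effective_decay / new_lr     # ...so halve WD to keep LR * WD unchanged
assert abs(new_lr * new_wd - effective_decay) < 1e-12
```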
Thanks, and congrats on the work by the way. Really cool results and quite an interesting optimizer you found!
Ran a short experiment (20k iterations) with batch size 16k and H/14 on LAION-2B. Not as much room for hparam tuning as the experiments are compute intensive, so still finding Lion falling a bit short of AdamW. Please let me know which other hparams you'd recommend @xiangning-chen.
Thanks for the experiments!
yea, i'm starting to hear more negative reports coming in unfortunately. the common story i hear is that it converges faster, but generalizes worse
@lucidrains May I know on what domains they observe a faster convergence but worse generalization? Thanks!
@xiangning-chen thanks for the recommendations! Yes 20k is extremely short but sadly these experiments are already very expensive so I don't have much other option. Hmm, so you'd say just re-run red with LR 4e-4 instead of 5e-4? My guess would be such a small change won't fix red but you'd definitely know best here :)
@mitchellnw May I know the warmup iterations and learning rate schedule? Thanks!
Yep! 5k warmup iterations (linear warmup) then cosine decay. And weight decay for the AdamW baseline is 0.2. |
@mitchellnw Thanks for the information!
So it seems like Lion works fine even with few training steps. I will provide an update on the CLIP training result ASAP.
Very interesting, thanks for sharing! A few comments/questions:
With the same learning rate, I don't think this would make a big difference. But using beta2=0.95 here would definitely help with training stability, which means that a larger learning rate can be used without NaNs.
I used gradient clipping 1.0 for both optimizers.
Oh this is not zero-shot accuracy, I just quickly tested on supervised image classification to see whether the short 20k training steps matter. Currently I'm having some permission issues with the internal image-text dataset. As soon as I regain access, I will proceed with the CLIP training.
Sure, glad to help with the hparam tuning and really hope that Lion can benefit the OpenCLIP project! Thank you so much!
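Regarding the gradient clipping of 1.0 mentioned above, here is a minimal PyTorch sketch of global gradient-norm clipping before the optimizer step; the model, data, and loss are placeholders.

```python
# Sketch: clip the global gradient norm to 1.0 before stepping the optimizer.
# model / optimizer / loss are placeholders.
import torch

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(4, 16)
loss = model(x).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```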
@mitchellnw I have some updates to share! I discovered that the initial temperature value has an impact, and tuning it has resulted in better performance for Lion compared to AdamW. I conducted experiments using base-sized vision and text encoders, with a learning rate schedule of 10K steps warmup followed by cosine decay. The batch size was 16K, and the dataset I used was WebLI. With the default initial temperature, Lion indeed performed worse than AdamW on both ImageNet zero-shot accuracy and validation error, but with a higher initial temperature Lion became the clear winner.
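For context, in CLIP-style models the learnable "temperature" shows up as a scale on the image-text similarity logits (CLIP stores it in log space, initialized to log(1/0.07)). Below is a generic sketch of a learnable logit scale with a configurable initial value; the class, parameter name, and the value passed at the end are illustrative only, not the exact open_clip API or the setting found above.

```python
# Generic sketch: a learnable logit scale ("temperature") for contrastive logits.
# `init_temperature` and the value 100.0 are illustrative, not open_clip's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveHead(nn.Module):
    def __init__(self, init_temperature: float = 1 / 0.07):  # CLIP's default init
        super().__init__()
        # keep the scale in log space so it stays positive during training
        self.logit_scale = nn.Parameter(torch.tensor(float(init_temperature)).log())

    def forward(self, image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        image_feats = F.normalize(image_feats, dim=-1)
        text_feats = F.normalize(text_feats, dim=-1)
        return self.logit_scale.exp() * image_feats @ text_feats.t()

# raising the initial value, as explored in the experiments above (100.0 is a placeholder)
head = ContrastiveHead(init_temperature=100.0)
```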
This is super interesting @xiangning-chen, thanks a lot for the exhaustive exploration! I would not have thought of modifying the temperature, how did you think of this? I am really looking forward to trying this modification in our setting, it looks very promising! It may now take some time, though, as the cluster is extremely busy for the month of March so I'm not able to run jobs. Rest assured I will get to this -- and if anyone else wants to try this out first for OpenCLIP on LAION-5B, please do get in touch. Otherwise I will send the latest updates here when the cluster is back to normal in April :). Thanks again!
@xiangning-chen oh this is really interesting! what initial temperature value did the contrastive learning networks (LiT and BASIC) you tested on have?
@lucidrains I used an initial temperature of 10 in the paper. But in LiT and BASIC, the vision tower is loaded from a pre-trained checkpoint and is fixed during training, while in OpenCLIP both vision and language towers are initialized from scratch.
@mitchellnw I further validated on the large and giant size CLIP, each with 20K steps (10K steps warmup then cosine decay). From the table, Lion is still the better one, and it also offers a runtime improvement of 10%-15%.
Great! ETA for continuing to test in OpenCLIP is late March or early April, looking forward!
i am readying an 8-bit version of Lion over at the bitsandbytes repository, and hope to get it merged sometime near the end of this month
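A hypothetical usage sketch of such an 8-bit Lion, assuming it lands in bitsandbytes with the same calling convention as the existing 8-bit optimizers (e.g. Adam8bit); the class name Lion8bit and the hyperparameters below are assumptions, not a confirmed API.

```python
# Hypothetical sketch: an 8-bit Lion via bitsandbytes, assuming it follows the
# calling convention of existing 8-bit optimizers such as bnb.optim.Adam8bit.
# The name `Lion8bit` and the hyperparameters are assumptions, not a confirmed API.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024)
optimizer = bnb.optim.Lion8bit(model.parameters(), lr=1e-4, weight_decay=1e-2)
```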
@mitchellnw Just wondering, do you have any update on using Lion with a higher initial temperature? Thanks!
Thanks for the reminder! I was behind schedule, as expected, but just launched the jobs!
That did help narrow the gap a lot! AdamW and Lion are now within 1 percentage point of each other. Do you think I should go even higher on the init temp? Feel free to let me know exactly what hparam setting you'd try next and I'll run it.
* I should also mention -- the AdamW I'm using here uses "update clipping" from Adafactor to get rid of loss spikes. To use vanilla AdamW I need to decrease beta2 a lot for stability. The best vanilla AdamW result is AdamW (2e-3, 0.9, 0.9), wd 0.2: 57.29 -- so within 0.3pp of Lion.
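For readers unfamiliar with it, the "update clipping" in Adafactor rescales the parameter update itself (rather than the gradient) whenever its root-mean-square exceeds a threshold d. A minimal standalone sketch of that rule, outside of any particular optimizer implementation:

```python
# Sketch of Adafactor-style update clipping: rescale the raw update so that its
# RMS does not exceed the threshold d (commonly d = 1.0).
import torch

def clip_update(update: torch.Tensor, d: float = 1.0) -> torch.Tensor:
    rms = update.pow(2).mean().sqrt()
    return update / torch.clamp(rms / d, min=1.0)

# example: an update with RMS around 3 gets rescaled to RMS around 1
u = torch.randn(1000) * 3.0
print(clip_update(u).pow(2).mean().sqrt())  # ~1.0
```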
Thanks for the result! I used gradient clipping in my experiments but not update clipping.
I do find update clipping helps compared to gradient clipping, but even without update clipping and without gradient clipping the result is still fairly good (57.33). But yes, as you point out, the beta2 is very low here (0.9) for stability reasons. With Lion we don't observe any stability problems though.
not strictly on topic, but I thought you would appreciate hearing this @xiangning-chen -- I'm testing out the Lion optimizer for a text task and finding really nice performance there.
@mitchellnw Thanks for letting me know, that's really good news!
wish there was a way to convert the comments here to a discussion as it's a great bit of info, but closing PR in favour of #979 as lion is among numerous options there. |
Used the implementation from https://github.com/lucidrains/lion-pytorch (thanks @lucidrains)
Paper https://arxiv.org/abs/2302.06675