Llama 3 #565
Comments
@Helw150 said it worked out of the box. Just configs, I think.
That's fantastic, @dlwh! It would be great if you could share your configs, @Helw150. I must admit I have not dug into the details here yet, but I understood the biggest architectural changes were using a larger tokenizer and adding GQA to the smaller models. I haven't seen GQA used in any of the Levanter models, but found a post saying it was supported. Can this also just be enabled through the configs? I also read a post about them doing some masking on longer sequences so that the attention did not "spill over" to new documents.
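(On the document-boundary masking point: the usual trick when packing multiple documents into one training sequence is to combine the causal mask with a "same document" mask so attention never crosses a boundary. A minimal, framework-agnostic sketch of the idea follows; it is an illustration, not Levanter's actual implementation.)

```python
import numpy as np

def packed_document_mask(doc_ids: np.ndarray) -> np.ndarray:
    """doc_ids: [seq] array marking which packed document each token belongs to.
    Returns a [seq, seq] boolean mask that is causal AND blocks attention across
    document boundaries, so packed sequences don't "spill over" into each other."""
    seq = len(doc_ids)
    causal = np.tril(np.ones((seq, seq), dtype=bool))
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    return causal & same_doc

# Example: three short documents packed into one length-6 sequence.
print(packed_document_mask(np.array([0, 0, 1, 1, 1, 2])).astype(int))
```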
The model seems to start training. However, I keep getting a message during training, and I am not really sure what is causing it.
@dlwh Unfortunately, I cannot seem to get it to work right out of the box. The model is training, but when trying to train on a domain-specific corpus, the loss starts way too high and never fully recovers. I am pretty sure the issue is the vocab size here: I cannot seem to override the vocab size in the model config. This line seems to return the default Llama tokenizer: levanter/src/levanter/main/train_lm.py line 62 (commit bd2aad6). While it is overwritten later, I think this is the main issue. I have tried both reading the configs from HF and creating them from scratch. Please advise.
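(As a quick sanity check for the vocab-size hypothesis, it can help to compare the tokenizer's vocabulary against the vocab size recorded in the checkpoint config. Below is a minimal sketch using the Hugging Face transformers API; the repo id is only an example and not taken from this thread.)

```python
from transformers import AutoConfig, AutoTokenizer

repo = "meta-llama/Meta-Llama-3-8B"  # example repo id; substitute your own checkpoint

tokenizer = AutoTokenizer.from_pretrained(repo)
config = AutoConfig.from_pretrained(repo)

# Llama 3 ships a ~128k-entry tokenizer, while Llama 2 / Mistral use 32k.
# If the tokenizer and the embedding table disagree, token ids no longer line up
# with embedding rows, and the loss will start far higher than expected.
print("tokenizer vocab size:", len(tokenizer))
print("config vocab_size:   ", config.vocab_size)
```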
OK, I'll try to take a look this weekend. Do you have a full config I can use as a reproducer, by any chance?
Do you have a reproduction of a case where the Levanter implementation gives you a different prediction than the HuggingFace implementation? As an example, here's a round-trip test I used to verify the Whisper implementation: levanter/tests/whisper_test.py line 130 (commit 407d54b).
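(For readers following along, the shape of such a round-trip test is: run identical token ids through both implementations and assert the logits agree. The sketch below stands in for the real test by saving and reloading a Hugging Face model; in the actual comparison, one side would be the Levanter port rather than the reloaded copy. The repo id is an example.)

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "meta-llama/Meta-Llama-3-8B"  # example id; any causal LM works for the pattern
tokenizer = AutoTokenizer.from_pretrained(repo)
reference = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float32)

# Save and reload: a stand-in for "export from one implementation, load in the other".
reference.save_pretrained("/tmp/roundtrip")
candidate = AutoModelForCausalLM.from_pretrained("/tmp/roundtrip", torch_dtype=torch.float32)

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    ref_logits = reference(**inputs).logits
    cand_logits = candidate(**inputs).logits

# If the two implementations compute the same function, the logits match to tolerance.
np.testing.assert_allclose(ref_logits.numpy(), cand_logits.numpy(), atol=1e-4, rtol=1e-4)
```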
The only architectural change in Llama 3 is grouped-query attention (GQA), which is supported here: levanter/src/levanter/models/llama.py line 236 (commit 407d54b).
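(For anyone unfamiliar with the change: in grouped-query attention each key/value head is shared by a group of query heads, which is commonly implemented by repeating the KV heads before ordinary attention. A framework-agnostic sketch of that mechanism, without masking, follows; it is an illustration, not the code linked above.)

```python
import numpy as np

def gqa(q, k, v):
    """q: [q_heads, seq, d]; k, v: [kv_heads, seq, d], with q_heads % kv_heads == 0."""
    q_heads, seq, d = q.shape
    group = q_heads // k.shape[0]
    # Each KV head serves a whole group of query heads: repeat it across the group.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)             # [q_heads, seq, seq]
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)                   # softmax over keys
    return weights @ v                                           # [q_heads, seq, d]
```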
I've exported a few Llama 3 finetunes from Levanter to HuggingFace successfully, and the models seem to work as expected for inference, so it's unclear to me whether the above case suggests a bug or is a function of the much larger vocab size of Llama 3 vs. Mistral. I'm not sure what the data mix is above, but if it's multilingual, it's also likely Mistral starts from a lower loss because it's more explicitly designed for multilinguality. If you send over a case where HuggingFace and Levanter output different logits for the Llama 3 weights, I'd be happy to take on the debugging from there!
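(One rough intuition for why vocab size moves the loss scale: a model that spreads probability close to uniformly over the vocabulary sits at a cross-entropy of about ln(vocab_size), so a 128k vocabulary has a higher "uninformed" baseline than a 32k one. The vocab sizes below are the published ones for Llama 3 and Mistral/Llama 2; the pretrained models discussed here start well below this baseline, so it only bounds the size of the effect.)

```python
import math

# Cross-entropy of a uniform prediction over the vocabulary, in nats.
print(math.log(128_256))  # Llama 3 tokenizer   -> ~11.76
print(math.log(32_000))   # Mistral / Llama 2   -> ~10.37
```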
I am trying to debug this and test on downstream tasks by exporting to HF. However, I noticed that for Llama 3, no tokenizer.model file is created when saving to HF. Have you experienced this, @Helw150? Edit: I see the reason for this is that the HF repos do not contain any tokenizer.model file.
Reopening this. I have trained a bit more, and I am really not satisfied with the result, even if the train/eval loss looks fine. Do you have a working Llama 3 config file, @Helw150? I want to double-check whether I have made any mistakes here.
Hi! My use case is a bit non-standard (training multi-modal encoders), so I'm not sure my configs will help so much. If you want to check them anyway, you can find them on the … Could you give a bit more detail about the issue you are facing? Does it seem like the model isn't training properly? Or is it that the results aren't satisfactory? If it's the latter, additional context (e.g. specific symptoms, expected behavior) would help me understand whether there's an underlying bug that could cause this, or if it's a matter of hyperparameters/underlying capabilities!
What revision/commit were you using to train? My usage of the TPU splash attention had/has a bug that messed everything up. I'm like 60% sure I know how to fix it (and you can probably fix your checkpoints post hoc), but I need another day or so. If you want to try something, can you pre-multiply all of the q_proj weights by sqrt(head_dim)? I haven't verified that yet, but I strongly suspect it.
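(A rough sketch of that post-hoc fix applied to an exported Hugging Face checkpoint, assuming the bug amounts to a missing sqrt(head_dim) factor on the query projections. The paths are placeholders and the parameter-name pattern is an assumption; verify both against your own export before trusting the result.)

```python
import math
import torch
from transformers import AutoModelForCausalLM

# Placeholder path for a checkpoint exported from one of the affected runs.
model = AutoModelForCausalLM.from_pretrained("path/to/exported-checkpoint")
head_dim = model.config.hidden_size // model.config.num_attention_heads

with torch.no_grad():
    for name, param in model.named_parameters():
        # Fold the suspected missing sqrt(head_dim) factor into each query projection.
        if "q_proj" in name and name.endswith("weight"):
            param.mul_(math.sqrt(head_dim))

model.save_pretrained("path/to/fixed-checkpoint")
```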
Ah yes, worth noting that I haven't pulled in the Splash Attention changes yet |
splash attention is currently disabled so main is fine 🤞 right now |
I was using splash attention, so that might have caused the error. However, I suspected this was a tokenizer-size issue; I remember also getting some warning about non-matching tokenizers here. But I can retry this without splash and see if that is related.
I believe splash is now fixed in the latest main, but it's now off by default. Can you try --model.attn_backend splash and --model.attn_backend jax_flash and let me know if things seem OK?
Awesome! I have not been training for long, but in general my good runs have been starting with an eval loss of around 2.5, while the broken runs have started around 6. In the latest main, this seems to start with a 2.5 loss both with and without flash attention. Looks very good. For reference (in case others are having the same issue), the backend names on the command line must be uppercase (SPLASH / JAX_FLASH). Splash automatically upcasts to 32-bit, since 16-bit is not working. I understand this is expected.
Awesome, thanks for your patience. Yeah, for whatever reason they don't support bf16 for attention with that kernel yet. The uppercase thing can be fixed by upgrading draccus to >=0.8.0.
@peregilk Llama 3 shouldn't work out of the box nicely, as it uses a different RoPE scaling scheme.
That's not great. Probably need to spend some time in a debugger.
I probably won't get to this for at least a few days myself, but happy to provide some support.
@dlwh any progress on this one? I was thinking of switching to Levanter from Composer.
I don't really understand the issue. We have a unit test (which I recognize is not necessarily proof it's correct) and support RoPE scaling now. Does someone have code that fails?
OK, I see. This is becoming a priority for me, so I will try to tackle it by Wednesday.
I haven't fully tested it, but can you try main? I added the new Llama 3 RoPE stuff.
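(For context on what the "new Llama 3 RoPE stuff" typically refers to: Llama 3.1 rescales the RoPE inverse frequencies with a wavelength-dependent factor rather than using plain RoPE. The sketch below follows the commonly published reference parameters (scale factor 8, low/high-frequency factors 1 and 4, original context 8192); it illustrates the scheme and is not Levanter's code.)

```python
import math
import numpy as np

def llama3_scale_inv_freq(inv_freq, factor=8.0, low_freq_factor=1.0,
                          high_freq_factor=4.0, original_ctx=8192):
    """Rescale RoPE inverse frequencies the way the Llama 3.1 reference does."""
    low_wavelen = original_ctx / low_freq_factor    # boundary for "low frequency" bands
    high_wavelen = original_ctx / high_freq_factor  # boundary for "high frequency" bands
    out = []
    for f in inv_freq:
        wavelen = 2 * math.pi / f
        if wavelen < high_wavelen:        # high-frequency bands: left untouched
            out.append(f)
        elif wavelen > low_wavelen:       # low-frequency bands: slowed down by `factor`
            out.append(f / factor)
        else:                             # in between: interpolate smoothly
            smooth = (original_ctx / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
            out.append((1 - smooth) * f / factor + smooth * f)
    return np.array(out)
```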
@mayankjobanputra did you have a chance to try it? |
@dlwh I haven't tried it yet. Still preprocessing the data and meanwhile writing some infra code around the framework. If everything goes smoothly I should be able to answer your question in 15ish days. |
Do you have any plans for adding support for Llama-3? Any idea how complex this would be, apart from new configs?