Updates to accommodate OpenLLM leaderboard v2 tasks and change Meta Llama 3.1 to Llama 3.1 #639
Conversation
As Meta Llama models gain popularity, evaluating these models has become increasingly important. We have released all the evaluation details for Meta-Llama 3.1 models as datasets in the [3.1 evals Hugging Face collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). This recipe demonstrates how to closely reproduce the Llama 3.1 reported benchmark numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library and our prompts from the 3.1 evals datasets on selected tasks.
As Meta Llama models gain popularity, evaluating these models has become increasingly important. We have released all the evaluation details for Meta-Llama 3.1 models as datasets in the [3.1 evals Hugging Face collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). This recipe demonstrates how to calculate the Llama 3.1 reported benchmark numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library and our prompts from the 3.1 evals datasets on selected tasks.
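For anyone wanting to browse the released 3.1 evals datasets referenced above, here is a minimal, hedged sketch using the `datasets` library. The repo id is an assumption based on the collection's naming, and these datasets are gated, so you must accept the license on Hugging Face and be logged in before loading them.

```python
# Hedged sketch: browsing one of the released 3.1 evals datasets with the
# `datasets` library. The repo id below is assumed from the Hugging Face
# collection and may differ for other model sizes or names.
from datasets import get_dataset_config_names, load_dataset

repo_id = "meta-llama/Meta-Llama-3.1-8B-Instruct-evals"  # assumed repo id

# Each benchmark's evaluation details are published as a separate config.
configs = get_dataset_config_names(repo_id)
print(configs[:5])

# Load one config and inspect the stored prompts and outputs.
ds = load_dataset(repo_id, configs[0])
print(ds)
```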
Let's clarify here that we are providing this because our internal repo isn't public, but that it will not reproduce the evals exactly. It is just meant to be an aid for someone wanting to use the dataset we released.
Sure, this line has been added.
## Disclaimer
1. **This recipe is not the official implementation** of Meta Llama evaluation. It is based on public third-party libraries; since this implementation does not mirror the Meta Llama evaluation, there may be minor differences in the reproduced numbers.
1. **This recipe is not the official implementation** of Meta Llama evaluation. It is based on public third-party libraries; since this implementation does not mirror the Meta Llama evaluation, there may be minor differences in the produced numbers.
See comment above. I think this point addresses my comment partially.
Previously, the OpenLLM leaderboard V1 was hard to reproduce because it did not include an easy command to run, so we created a customized eval.py to load a folder of selected task templates and run the OpenLLM leaderboard V1 evaluation. OpenLLM leaderboard V2 has not only updated the selected tasks but also added an easy-to-use command for running them. Therefore, this PR deletes the previous work for OpenLLM leaderboard V1 and updates the README to accommodate the OpenLLM leaderboard V2 tasks.
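For context, this is roughly what the easier V2 invocation looks like through the harness's Python API. This is a hedged sketch only, not the exact command from this PR: the model id and the `leaderboard` task-group name are assumptions and may vary across lm-evaluation-harness versions.

```python
# Hedged sketch: running the Open LLM Leaderboard v2 task group via
# lm-evaluation-harness's Python API. Model id and group name are assumed.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16",
    tasks=["leaderboard"],  # assumed v2 group name; verify against your installed task list
    batch_size=8,
)

# Per-task metrics live under the "results" key of the returned dict.
print(results["results"])
```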