Merge pull request #611 from ymcui/ceval-notebook
Add C-Eval notebook & Release v4.1
ymcui authored Jun 16, 2023
2 parents c7e9782 + 3be07ec commit 1736c0a
Showing 5 changed files with 9,238 additions and 4 deletions.
README.md: 6 changes (4 additions & 2 deletions)
@@ -37,7 +37,9 @@

## News

**[2023/06/08] [v4.0 release](https://github.com/ymcui/Chinese-LLaMA-Alpaca/releases/tag/v4.0): Released Chinese LLaMA/Alpaca-33B, added a privateGPT usage example, added C-Eval results, etc.**
**[2023/06/16] [v4.1 release](https://github.com/ymcui/Chinese-LLaMA-Alpaca/releases/tag/v4.1): Released a new technical report, added a C-Eval decoding script, added a low-resource model merging script, etc.**

[2023/06/08] [v4.0 release](https://github.com/ymcui/Chinese-LLaMA-Alpaca/releases/tag/v4.0): Released Chinese LLaMA/Alpaca-33B, added a privateGPT usage example, added C-Eval results, etc.

[2023/06/05] llama.cpp now supports Apple Silicon GPU decoding, with a large speedup; see [Discussions: Developer Announcement](https://github.com/ymcui/Chinese-LLaMA-Alpaca/discussions/505)

@@ -229,7 +231,7 @@ chinese_llama_lora_7b/

### Objective Performance Evaluation

This project also tested the relevant models on objective "NLU"-style evaluation sets. The results of such evaluations are not subjective: the model only needs to output a given label (a label-mapping strategy must be designed), so they offer another perspective on the capabilities of large models. This project tested the relevant models on the recently released [C-Eval benchmark](https://cevalbenchmark.com), whose test set contains 12.3K multiple-choice questions covering 52 subjects. Below are the average results of some models on the valid and test sets; the complete results will be updated in the [technical report](https://arxiv.org/abs/2304.08177) later.
This project also tested the relevant models on objective "NLU"-style evaluation sets. The results of such evaluations are not subjective: the model only needs to output a given label (a label-mapping strategy must be designed), so they offer another perspective on the capabilities of large models. This project tested the relevant models on the recently released [C-Eval benchmark](https://cevalbenchmark.com), whose test set contains 12.3K multiple-choice questions covering 52 subjects. Below are the average results of some models on the valid and test sets; for the complete results, please refer to the [technical report](https://arxiv.org/abs/2304.08177).

| 模型 | Valid (zero-shot) | Valid (5-shot) | Test (zero-shot) | Test (5-shot) |
| ----------------------- | :---------------: | :------------: | :--------------: | :-----------: |
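To make the label-mapping idea above concrete, here is a minimal sketch (not the project's actual C-Eval evaluation script) that scores a multiple-choice question by comparing the model's next-token logits for the option letters instead of generating free-form text. The model path is a placeholder, and treating each option letter as a single token is an assumption.

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

MODEL_PATH = "path/to/chinese-alpaca"  # hypothetical path to merged weights

tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def predict_choice(question: str, options: dict) -> str:
    """Pick the option letter whose first token has the highest next-token logit."""
    prompt = (
        question + "\n"
        + "\n".join(f"{k}. {v}" for k, v in options.items())
        + "\n答案:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # distribution after "答案:"
    scores = {
        k: next_token_logits[tokenizer.encode(k, add_special_tokens=False)[0]].item()
        for k in options
    }
    return max(scores, key=scores.get)

print(predict_choice("中国的首都是哪座城市?", {"A": "上海", "B": "北京", "C": "广州", "D": "深圳"}))
```

A real harness would additionally batch questions and handle option letters that tokenize into more than one piece.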
README_EN.md: 6 changes (4 additions & 2 deletions)
@@ -39,7 +39,9 @@ To promote open research of large models in the Chinese NLP community, this proj

## News

**[June 8, 2023] [Release v4.0](https://github.com/ymcui/Chinese-LLaMA-Alpaca/releases/tag/v4.0): LLaMA/Alpaca 33B versions are available. We also added a privateGPT demo, C-Eval results, etc.**
**[June 16, 2023] [Release v4.1](https://github.com/ymcui/Chinese-LLaMA-Alpaca/releases/tag/v4.1): Released a new technical report, added a C-Eval inference script, added a low-resource model merging script, etc.**

[June 8, 2023] [Release v4.0](https://github.com/ymcui/Chinese-LLaMA-Alpaca/releases/tag/v4.0): LLaMA/Alpaca 33B versions are available. We also added a privateGPT demo, C-Eval results, etc.

[May 16, 2023] [Release v3.2](https://github.com/ymcui/Chinese-LLaMA-Alpaca/releases/tag/v3.2): Added SFT scripts, LangChain support, a Gradio-based web demo, etc.

@@ -233,7 +235,7 @@ In order to quickly evaluate the actual performance of related models, this proj

### NLU Performance Test

This project also tested the relevant models on objective "NLU"-style evaluation sets. Such evaluations are objective, requiring only the output of a given label (with a label-mapping strategy), so they offer another perspective on the capabilities of large models. This project tested the relevant models on the recently released [C-Eval dataset](https://cevalbenchmark.com/), whose test set contains 12.3K multiple-choice questions covering 52 subjects. The following are the average results of some models on the validation and test sets; the complete results will be updated in the [technical report](https://arxiv.org/abs/2304.08177) later.
This project also tested the relevant models on objective "NLU"-style evaluation sets. Such evaluations are objective, requiring only the output of a given label (with a label-mapping strategy), so they offer another perspective on the capabilities of large models. This project tested the relevant models on the recently released [C-Eval dataset](https://cevalbenchmark.com/), whose test set contains 12.3K multiple-choice questions covering 52 subjects. The following are the average results of some models on the validation and test sets. For complete results, please refer to our [technical report](https://arxiv.org/abs/2304.08177).

| Models | Valid (zero-shot) | Valid (5-shot) | Test (zero-shot) | Test (5-shot) |
| ----------------------- | :---------------: | :------------: | :--------------: | :-----------: |
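As an illustration of how a 5-shot prompt might be assembled for the numbers in the table above, the sketch below prepends solved examples from the dev split to the target question. The column names (question, A–D, answer) follow the public C-Eval CSV layout, but this is an assumed, simplified construction rather than the project's actual inference script.

```python
from typing import Dict, List

def format_question(row: Dict[str, str], with_answer: bool) -> str:
    """Render one C-Eval row as a prompt fragment ("答案:" means "Answer:")."""
    text = (
        f"{row['question']}\n"
        f"A. {row['A']}\nB. {row['B']}\nC. {row['C']}\nD. {row['D']}\n"
        "答案:"
    )
    if with_answer:
        text += row["answer"] + "\n\n"
    return text

def build_few_shot_prompt(dev_rows: List[Dict[str, str]],
                          test_row: Dict[str, str], k: int = 5) -> str:
    """Prepend k solved dev examples, then the unanswered target question."""
    shots = "".join(format_question(r, with_answer=True) for r in dev_rows[:k])
    return shots + format_question(test_row, with_answer=False)

# Example usage with made-up rows:
dev = [{"question": "1+1等于几?", "A": "1", "B": "2", "C": "3", "D": "4", "answer": "B"}]
test = {"question": "2+2等于几?", "A": "2", "B": "3", "C": "4", "D": "5", "answer": ""}
print(build_few_shot_prompt(dev, test, k=1))
```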
notebooks/README.md: 8 changes (8 additions & 0 deletions)
@@ -1,5 +1,13 @@
# Notebook Examples

### ceval_example_for_chinese_alpaca.ipynb

Example of decoding the C-Eval dataset with the Chinese Alpaca model.

It is recommended to check the latest version of this notebook on Colab: <a href="https://colab.research.google.com/drive/12YewimRT7JuqJGOejxN7YG8jq2de4DnF?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
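For reference without opening Colab, a hypothetical fragment of the notebook's decode-then-parse step might look like the following; the reply format and regex are assumptions, not the notebook's exact code.

```python
import re
from typing import Optional

def extract_choice(reply: str) -> Optional[str]:
    """Return the first option letter A-D found in a free-form model reply, if any."""
    m = re.search(r"[ABCD]", reply)
    return m.group(0) if m else None

print(extract_choice("答案是B,因为北京是中国的首都。"))  # -> "B"
```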

### convert_and_quantize_chinese_llama_and_alpaca.ipynb

Example of converting and quantizing Chinese LLaMA/Alpaca models (including the Plus versions) on Colab (for workflow reference only).