This project is based on the Llama-2, released by Meta, and it is the second generation of the Chinese LLaMA & Alpaca LLM project. We open-source Chinese LLaMA-2 (foundation model) and Alpaca-2 (instruction-following model). These models have been expanded and optimized with Chinese vocabulary beyond the original Llama-2. We used large-scale Chinese data for incremental pre-training, which further improved the fundamental semantic understanding of the Chinese language, resulting in a significant performance improvement compared to the first-generation models. Standard version supports 4K context, and long context version supports 16K context. All models' context size can be further extended with NTK method (up to 24K+).

Main Contents

🚀 New extended Chinese vocabulary beyond Llama-2, open-sourcing the Chinese LLaMA-2 and Alpaca-2 LLMs.
🚀 Open-sourced the pre-training and instruction finetuning (SFT) scripts for further tuning on user's data
🚀 Quickly deploy and experience the quantized LLMs on CPU/GPU of personal PC
🚀 Support for LLaMA ecosystems like 🤗transformers, llama.cpp, text-generation-webui, LangChain, privateGPT, vLLM etc.

Open-sourced Models

Base model: Chinese-LLaMA-2-1.3B, Chinese-LLaMA-2-7B, Chinese-LLaMA-2-13B
Instruction/chat model: Chinese-Alpaca-2-1.3B, Chinese-Alpaca-2-7B, Chinese-Alpaca-2-13B
Long context model: Chinese-LLaMA-2-7B-16K, Chinese-LLaMA-2-13B-16K, Chinese-Alpaca-2-7B-16K, Chinese-Alpaca-2-13B-16K

News

[Sep 01, 2023] Release long context models: Chinese-Alpaca-2-7B-16K and Chinese-Alpaca-2-13B-16K, which can be directly used in downstream tasks, such as privateGPT. For details, see 📚 v3.1 release note

[Aug 25, 2023] Release long context models: Chinese-LLaMA-2-7B-16K and Chinese-LLaMA-2-13B-16K, which support 16K context and can be further extended up to 24K+ using NTK. For details, see 📚 v3.0 release note

[Aug 14, 2023] Release Chinese-LLaMA-2-13B and Chinese-Alpaca-2-13B. Add text-generation-webui/LangChain/privateGPT support. Add CFG sampling, etc. For details, see 📚 v2.0 release note

[Aug 02, 2023] Add FlashAttention-2 training support, vLLM-based inference acceleration support, a new system prompt that generates longer response, etc. For details, see 📚 v1.1 release note

[July 31, 2023] Release Chinese-LLaMA-2-7B (base model), trained with 120GB Chinese data. It was further fine-tuned using 5M instruction data, resulting in the Chinese-Alpaca-2-7B (instruction/chat model). For details, see 📚 v1.0 release notes

[July 19, 2023] 🚀Launched the Chinese LLaMA-2 and Alpaca-2 open-source LLM project

Content Guide

Section	Description
💁🏻‍♂️Introduction	Briefly introduces the technical features of the models in this project
⏬Download	Download links for Chinese LLaMA-2 and Alpaca-2
💻Inference and Deployment	Introduces how to quantify models and deploy and experience large models using a personal computer
💯System Performance	Experimental results on several tasks
📝Training and Fine-tuning	Introduces how to perform further training and fine-tuning on Chinese LLaMA-2 and Alpaca-2
❓Frequently Asked Questions	Responses to some common questions

Introduction

This project launches the Chinese LLaMA-2 and Alpaca-2 models based on Llama-2. Compared to the first generation of the project, the main features include:

📖 Optimized Chinese Vocabulary

In the first generation of the project, we expanded Chinese words and characters for the first-generation Chinese LLaMA model (LLaMA: 49953, Alpaca: 49954) to improve the model's encoding and decoding efficiency of Chinese texts.
In this project, we redesigned the new vocabulary (size: 55296) to further improve the coverage of Chinese words and characters. We also unified the LLaMA/Alpaca vocabulary to avoid problems due to mixed use.

⚡ Efficient FlashAttention-2

FlashAttention-2 is an implementation of efficient attention mechanisms, offering faster speed and optimized memory usage compared to its first-generation.
When the context length is longer, using efficient attention technology is essential to prevent explosive growth in memory usage.

🚄 Adaptive Context Extension based on PI and NTK

In the first generation of the project, we implemented the context extension based on NTK, which can support longer contexts without further training the model.
We release long context models, using PI and NTK methods, supporting 16K context, and can be further extended up to 24K-32K
Based on the above, we further designed a convenient adaptive empirical formula that does not require manually setting corresponding hyperparameters for different context lengths.

🤖 Simplified Bilingual System Prompt

In the first generation of the project, we use Stanford Alpaca template for our Chinese Alpaca models
Through preliminary experiments, we found that the lengthy system prompt by Llama-2-Chat is not as effective as a simple one
We use a very simple system prompt while keeping the Llama-2-Chat template to better adapt to relevant ecosystems

The following figure depicts all open-sourced models for our projects (including the first-gen project).

Download

Model Selection Guide

Below is a basic comparison between the Chinese LLaMA-2 and Alpaca-2 models, as well as recommended use cases. Use Alpaca for ChatGPT-like interaction.

Comparison	Chinese LLaMA-2	Chinese Alpaca-2
Model Type	Base Model	Instruction/Chat Model (like ChatGPT)
Released Sizes	1.3B, 7B, 13B	1.3B, 7B, 13B
Training Method	Causal-LM (CLM)	Instruction fine-tuning
Training Parts	7B, 13B: LoRA + emb/lm-head 1.3B: full params	7B, 13B: LoRA + emb/lm-head 1.3B: full params
Trained on	Original Llama-2 (non-chat)	Chinese LLaMA-2
Training Corpus	Unlabeled general corpus (120G raw text)	Labeled instruction data (5M samples)
Vocabulary Size^[1]	55,296	55,296
Context Size^[2]	Standard: 4K (12K-18K) Long ctx: 16K (24K-32K)	Standard: 4K (12K-18K) Long ctx: 16K (24K-32K)
Input Template	Not required	Requires specific templates^[3]
Suitable Scenarios	Text continuation: Given the context, the model generates the following text	Instruction understanding: Q&A, writing, chatting, interaction, etc.
Unsuitable Scenarios	Instruction understanding, multi-turn chat, etc.	Unrestricted text generation

Note

[1] The vocabulary of the first and second generation models in this project are different, do not mix them. The vocabularies of the second generation LLaMA and Alpaca are the same.
[2] Extended context size with NTK method is depicted in brackets.
[3] Alpaca-2 uses the Llama-2-chat series templates (different prompts), not the templates of the first-generation Alpaca, do not mix them.
[4] 1.3B models are not intended for standalone use; instead, use it together with larger models (7B, 13B) through speculative sampling.

Full Model Download

Below are the full models, which can be used directly afterwards, without additional merging steps. Recommended for users with sufficient network bandwidth.

Model Name	Type	Size	Download Link
Chinese-LLaMA-2-13B	Base model	24.7 GB	[Baidu] [Google] [🤗HF]
Chinese-LLaMA-2-7B	Base model	12.9 GB	[Baidu] [Google] [🤗HF]
Chinese-LLaMA-2-1.3B	Base model	2.4 GB	[🤗HF]
Chinese-Alpaca-2-13B	Chat Model	24.7 GB	[Baidu] [Google] [🤗HF]
Chinese-Alpaca-2-7B	Chat Model	12.9 GB	[Baidu] [Google] [🤗HF]
Chinese-Alpaca-2-1.3B	Chat model	2.4 GB	[🤗HF]

The followings are long context models, which are recommended for long context tasks.

Model Name	Type	Size	Download Link
Chinese-LLaMA-2-13B-16K	Base model	24.7 GB	[Baidu] [Google] [🤗HF]
Chinese-LLaMA-2-7B-16K	Base model	12.9 GB	[Baidu] [Google] [🤗HF]
Chinese-Alpaca-2-13B-16K 🆕	Chat Model	24.7 GB	[百度] [Google] [🤗HF]
Chinese-Alpaca-2-7B-16K 🆕	Chat Model	12.9 GB	[百度] [Google] [🤗HF]

Important

When using long context models, please follow our wiki to correctly use them. See Wiki.

LoRA Model Download

Below are the LoRA models, which cannot be used directly and must be merged with the refactored models according to the tutorial. Recommended for users with insufficient network bandwidth, who already have the original Llama-2 and light-weight download.

Model Name	Type	Required Model for merging	Size	LoRA Download Link
Chinese-LLaMA-2-LoRA-13B	Base model	Llama-2-13B-hf	1.5 GB	[Baidu] [Google] [🤗HF]
Chinese-LLaMA-2-LoRA-7B	Base model	Llama-2-7B-hf	1.1 GB	[Baidu] [Google] [🤗HF]
Chinese-Alpaca-2-LoRA-13B	Chat Model	Llama-2-13B-hf	1.5 GB	[Baidu] [Google] [🤗HF]
Chinese-Alpaca-2-LoRA-7B	Chat Model	Llama-2-7B-hf	1.1 GB	[Baidu] [Google] [🤗HF]

The followings are long context models, which are recommended for long context tasks.

Model Name	Type	Required Model for merging	Size	Download Link
Chinese-LLaMA-2-LoRA-13B-16K	Base model	Llama-2-13B-hf	1.5 GB	[Baidu] [Google] [🤗HF]
Chinese-LLaMA-2-LoRA-7B-16K	Base model	Llama-2-7B-hf	1.1 GB	[Baidu] [Google] [🤗HF]
Chinese-Alpaca-2-LoRA-13B-16K 🆕	Chat Model	Llama-2-13B-hf	1.5 GB	[Baidu] [Google] [🤗HF]
Chinese-Alpaca-2-LoRA-7B-16K 🆕	Chat Model	Llama-2-7B-hf	1.1 GB	[Baidu] [Google] [🤗HF]

Important

As the LoRA models cannot be used separately, they must be merged with the original Llama-2 to form a complete model for model inference, quantization, or further training. Please choose one of the following methods to merge these models.

Online Conversion: Colab users can use the notebook provided by this project for online conversion and model quantization
Manual Conversion: Offline method of conversion, generating different formats of models for quantization or further fine-tuning

Inference and Deployment

The models in this project mainly support the following quantization, inference, and deployment methods.

Tool	Features	CPU	GPU	Quant	GUI	API	vLLM^§	16K^‡	Speculative Sampling	Tutorial
llama.cpp	Rich quantization options and efficient local inference	✅	✅	✅	❌	✅	❌	✅	✅	link
🤗Transformers	Native transformers inference interface	✅	✅	✅	✅	❌	✅	✅	✅	link
Colab Demo	Running a Gradio web demo in Colab	✅	✅	✅	✅	❌	✅	✅	✅	link
OpenAI API Calls	A server that implements OpenAI API	✅	✅	✅	❌	✅	✅	✅	❌	link
text-generation-webui	A tool for deploying model as a web UI	✅	✅	✅	✅	✅^†	❌	✅	❌	link
LangChain	LLM application development framework, suitable for secondary development	✅^†	✅	✅^†	❌	❌	❌	✅	❌	link
privateGPT	LangChain-based multi-document QA framework	✅	✅	✅	❌	❌	❌	✅	❌	link

Note

^†: Supported by this tool, but not implemented in the tutorial. Please refer to the official documentation for details.
^‡: Support 16K long context or not (requires customized RoPE support)
^§: vLLM backend does not support our 16K long context models.

System Performance

Generation Performance Evaluation

In order to intuitively understand the generation performance of the model, this project has launched an online model arena platform imitating Fastchat Chatbot Arena, where you can browse and evaluate the quality of model responses. The arena platform provides evaluation indicators such as win rate and Elo score, and you can view the win rate of battles between two models. The question bank comes from 200 questions manually created in the first-generation project, and additional questions added on this basis. Generated replies are subject to randomness and are influenced by decoding hyperparameters, random seeds, etc., so the related evaluations are not absolutely rigorous. The results are only for reference, and you are welcome to experience it yourself. Please see the examples directory for some generated examples.

⚔️ Online Chatbot Arena: http://llm-arena.ymcui.com

System	Win Rate (no tie)↓	Elo Rating
Chinese-Alpaca-2-13B-16K	86.84%	1580
Chinese-Alpaca-2-13B	72.01%	1579
Chinese-Alpaca-Pro-33B	64.87%	1548
Chinese-Alpaca-2-7B	64.11%	1572
Chinese-Alpaca-Pro-7B	62.05%	1500
Chinese-Alpaca-2-7B-16K	61.67%	1540
Chinese-Alpaca-Pro-13B	61.26%	1567
Chinese-Alpaca-Plus-33B	31.29%	1401
Chinese-Alpaca-Plus-13B	23.43%	1329
Chinese-Alpaca-Plus-7B	20.92%	1379

Note

Results timestamp: Sep 1. 2023 . For the latest results, see ⚔️Arena.

NLU Performance Evaluation: C-Eval

C-Eval is a comprehensive Chinese basic model evaluation suite. The validation set contains 1.3K multiple-choice questions, and the test set contains 12.3K multiple-choice questions, covering 52 subjects. The type of questions is multiple-choice. The experimental results are presented in the format of "zero-shot / 5-shot". For C-Eval inference code, please refer to this project's 📖GitHub Wiki.

LLaMA Models	Valid	Test	Alpaca Models	Valid	Test
Chinese-LLaMA-2-13B	40.6 / 42.7	38.0 / 41.6	Chinese-Alpaca-2-13B	44.3 / 45.9	42.6 / 44.0
Chinese-LLaMA-2-7B	28.2 / 36.0	30.3 / 34.2	Chinese-Alpaca-2-7B	41.3 / 42.9	40.3 / 39.5
Chinese-LLaMA-Plus-33B	37.4 / 40.0	35.7 / 38.3	Chinese-Alpaca-Plus-33B	46.5 / 46.3	44.9 / 43.5
Chinese-LLaMA-Plus-13B	27.3 / 34.0	27.8 / 33.3	Chinese-Alpaca-Plus-13B	43.3 / 42.4	41.5 / 39.9
Chinese-LLaMA-Plus-7B	27.3 / 28.3	26.9 / 28.4	Chinese-Alpaca-Plus-7B	36.7 / 32.9	36.4 / 32.3

NLU Performance Evaluation: CMMLU

CMMLU is another comprehensive Chinese evaluation dataset, specifically designed to evaluate the knowledge and reasoning abilities of language models in a Chinese context. It covers 67 topics ranging from basic subjects to advanced professional levels, with a total of 11.5K test cases. The type of questions is multiple-choice. For CMMLU inference code, please refer to this project's 📖GitHub Wiki.

LLaMA Models	Test (0/few-shot)	Alpaca Models	Test (0/few-shot)
Chinese-LLaMA-2-13B	38.9 / 42.5	Chinese-Alpaca-2-13B	43.2 / 45.5
Chinese-LLaMA-2-7B	27.9 / 34.1	Chinese-Alpaca-2-7B	40.0 / 41.8
Chinese-LLaMA-Plus-33B	35.2 / 38.8	Chinese-Alpaca-Plus-33B	46.6 / 45.3
Chinese-LLaMA-Plus-13B	29.6 / 34.0	Chinese-Alpaca-Plus-13B	40.6 / 39.9
Chinese-LLaMA-Plus-7B	25.4 / 26.3	Chinese-Alpaca-Plus-7B	36.8 / 32.6

Long Context Model (16K) Evaluation

LongBench is a benchmark for testing LLM's long context ability, consisting of 6 categories and 20 tasks. The average length of most of the task ranges from 5K to 15K. LongBench has 4.5K test samples in total. The followings are the results on Chinese subtasks. For LongBench inference code, please refer to this project's 📖GitHub Wiki

Models	Single-doc QA	Multi-doc QA	Summarization	Few-shot Learning	Code Completion	Synthetic Task	Avg
Chinese-Alpaca-2-13B-16K	47.9	26.7	13.0	22.3	46.6	21.5	29.7
Chinese-Alpaca-2-13B	38.4	20.0	11.9	17.3	46.5	8.0	23.7
Chinese-Alpaca-2-7B-16K	46.4	23.3	14.3	29.0	49.6	9.0	28.6
Chinese-Alpaca-2-7B	34.0	17.4	11.8	21.3	50.3	4.5	23.2
Chinese-LLaMA-2-13B-16K	36.7	17.7	3.1	29.8	13.8	3.0	17.3
Chinese-LLaMA-2-13B	28.3	14.4	4.6	16.3	10.4	5.4	13.2
Chinese-LLaMA-2-7B-16K	33.2	15.9	6.5	23.5	10.3	5.3	15.8
Chinese-LLaMA-2-7B	19.0	13.9	6.4	11.0	11.0	4.7	11.0

Quantization Evaluation

To understand the quality loss brought by quantization, taking Chinese-LLaMA-2-7B as an example, we report the model size, PPL, C-eval results under different quantization levels. PPL is calculated under 4K context, and we report zero-shot and 5-shot results on C-Eval valid set.

Precision	Model Size	PPL	C-Eval
FP16	12.9 GB	9.373	28.2 / 36.0
8-bit quantized	6.8 GB	9.476	26.8 / 35.4
4-bit quantized	3.7 GB	10.132	25.5 / 32.8

Specifically, the followings are the benchmark for different quantization methods in llama.cpp. The speed is presented with ms/tok. For details, see our Wiki.

llama.cpp	F16	Q2_K	Q3_K	Q4_0	Q4_1	Q4_K	Q5_0	Q5_1	Q5_K	Q6_K	Q8_0
PPL	9.128	11.107	9.576	9.476	9.576	9.240	9.156	9.213	9.168	9.133	9.129
Size	12.91G	2.41G	3.18G	3.69G	4.08G	3.92G	4.47G	4.86G	4.59G	5.30G	6.81G
CPU Speed	117	42	51	39	44	43	48	51	50	54	65
GPU Speed	53	19	21	17	18	20	x	x	25	26	x

Speculative Sampling Evaluation

Using speculative sampling and leveraging Chinese-LLaMA-2-1.3B and Chinese-Alpaca-2-1.3B can accelerate the inference speed of 7B and 13B LLaMA and Alpaca models. The followings are the inference speeds (ms/token) evaluated on the questions in Generation Performance Evaluation on 1*A40-48G. All the models are in fp16 format. For details, see our Wiki.

Draft Model	Draft Model Speed	Target Model	Target Model Speed	Speculative Sampling Speed
Chinese-LLaMA-2-1.3B	7.6	Chinese-LLaMA-2-7B	49.3	36.0（1.37x）
Chinese-LLaMA-2-1.3B	7.6	Chinese-LLaMA-2-13B	66.0	47.1（1.40x）
Chinese-Alpaca-2-1.3B	8.1	Chinese-Alpaca-2-7B	50.2	34.9（1.44x）
Chinese-Alpaca-2-1.3B	8.2	Chinese-Alpaca-2-13B	67.0	41.6（1.61x）

Training and Fine-tuning

Please refer to the corresponding Wiki for information on pre-training (Chinese LLaMA-2 training) and instruction fine-tuning (Chinese Alpaca-2 training).

Pre-training: The code is adapted from run_clm.py in 🤗transformers. For usage, see the Pre-training Script Wiki.
Instruction Fine-tuning: The code refers to the relevant parts of dataset handling in the Stanford Alpaca project. For usage, see the Instruction Fine-tuning Script Wiki.

FAQ

Please make sure to check if there is a solution in the FAQ before raising an Issue.

Question 1: What is the difference between this project and the first-gen project?
Question 2: Can the model be commercialized?
Question 3: Do you accept third-party Pull Requests?
Question 4: Why not perform full pre-training but use LoRA instead?
Question 5: Does Llama-2 series support tools that support the first-gen LLaMA?
Question 6: Is Chinese-Alpaca-2 trained from Llama-2-Chat?
Question 7: Why does training with 24GB VRAM lead to an OOM error when fine-tuning chinese-alpaca-2-7b?
Question 8: Can the 16K long-context version model replace the standard version model?
Question 9: How to interprete the results of third-party benchmarks?
Question 10: Will you release 34B or 70B models?
Question 11: Why the long-context model is 16K context, not 32K or 100K?
Question 12: Why does the Alpaca model reply that it is ChatGPT?
Question 13: Why is the adapter_model.bin in the pt_lora_mdoel or sft_lora_model folder only a few hundred kb?

For specific questions and answers, please refer to the project >>> 📚 GitHub Wiki

Citation

If you use the resources related to this project, please refer to and cite this project's technical report: https://arxiv.org/abs/2304.08177

@article{Chinese-LLaMA-Alpaca,
    title={Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca},
    author={Cui, Yiming and Yang, Ziqing and Yao, Xin},
    journal={arXiv preprint arXiv:2304.08177},
    url={https://arxiv.org/abs/2304.08177},
    year={2023}
}

Acknowledgments

This project is mainly based on the following open-source projects, and we would like to express our gratitude to the related projects and research developers.

Llama-2 by Meta
llama.cpp by @ggerganov
FlashAttention-2 by Dao-AILab

We also appreciate the contributors of Chinese-LLaMA-Alpaca (the first-gen project) and the associated projects and personnel.

Disclaimer

This project is predicated on the utilization of the Llama-2 model, as released by Meta. As such, we respectfully request all users to adhere diligently to the provisions of the open-source license agreement pertinent to the Llama-2 model. In instances where third-party code is integrated, strict adherence to the appropriate open-source license agreement is also essential. Please be advised that the precision of the content generated by the model is subject to variability due to computational methodologies, random elements, and potential degradation of quantization accuracy. Consequently, this project makes no warranties, express or implied, regarding the accuracy of the model output. Furthermore, this project cannot be held accountable for any losses, whether direct or consequential, that may arise from the use of associated resources and the results derived therefrom. In cases where the models associated with this project are employed for commercial purposes, it is incumbent upon developers to act in accordance with local laws and regulations, thereby ensuring the legality of the content generated by the model. Finally, please note that this project does not accept any liability for products or services that may be developed based on its models.

Limitation Statement

Although the models in this project have significantly improved Chinese understanding and generation capabilities compared to the original LLaMA and Alpaca, there are also the following limitations:

It may produce unpredictable harmful content and content that does not conform to human preferences and values.
Due to computing power and data issues, the training of the related models is not sufficient, and the Chinese understanding ability needs to be further improved.
There is no online interactive demo available for now (Note: users can still deploy it locally themselves).

Feedback

If you have any questions, please submit them in GitHub Issues.

Before submitting a question, please check if the FAQ can solve the problem and consult past issues to see if they can help.
Please use our dedicated issue template for submitting.
Duplicate and unrelated issues will be handled by stable-bot; please understand.
Raise questions politely and help build a harmonious discussion community.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README_EN.md

README_EN.md

Main Contents

Open-sourced Models

News

Content Guide

Introduction

Download

Model Selection Guide

Full Model Download

LoRA Model Download

Inference and Deployment

System Performance

Generation Performance Evaluation

NLU Performance Evaluation: C-Eval

NLU Performance Evaluation: CMMLU

Long Context Model (16K) Evaluation

Quantization Evaluation

Speculative Sampling Evaluation

Training and Fine-tuning

FAQ

Citation

Acknowledgments

Disclaimer

Feedback

Files

README_EN.md

Latest commit

History

README_EN.md

File metadata and controls

Main Contents

Open-sourced Models

News

Content Guide

Introduction

Download

Model Selection Guide

Full Model Download

LoRA Model Download

Inference and Deployment

System Performance

Generation Performance Evaluation

NLU Performance Evaluation: C-Eval

NLU Performance Evaluation: CMMLU

Long Context Model (16K) Evaluation

Quantization Evaluation

Speculative Sampling Evaluation

Training and Fine-tuning

FAQ

Citation

Acknowledgments

Disclaimer

Feedback