Train large language models faster, more easily, and at lower cost.
✦ Both GPU and NPU are supported.
✦ Pre-train directly on entire large datasets written by Spark.
✦ Very fast fine-tuning, with no redundant computation.
✦ Makes PCIe as fast as NVLink for models below the 20-billion-parameter level.
pip wheel -e . --no-deps && pip install jllm-*-py3-none-any.whl
This step is optional but recommended, especially when your data is too big to load into CPU memory at once.
Convert the raw data to token ids stored in parquet files.
python -m jllm.raw2ids \
--tokenizer Qwen1.5-14B-Chat \
-i dataset0.jsonl \
-o dataset0_Qwen1.5-14B-Chat
Note: Samples in a pre-training dataset should be separated by '\n\n' in text files, or be the value of the key 'text' in jsonl files. A fine-tuning dataset should have the format [{'system':content},{'user':content},{'assistant':content},...] in each row of the jsonl file. The key 'system' is optional.
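The two input layouts described in the note can be produced with a few lines of Python. This is a minimal sketch, not part of jllm: the helper names and file names are hypothetical, and it assumes each jsonl row is standard JSON (double-quoted), although the note writes the format with single quotes.

```python
import json
import tempfile
from pathlib import Path


def write_pretrain_jsonl(path, documents):
    """Write one {"text": ...} object per line for pre-training."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in documents:
            f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")


def write_finetune_jsonl(path, conversations):
    """Write one conversation (a list of single-key dicts) per line."""
    allowed = {"system", "user", "assistant"}
    with open(path, "w", encoding="utf-8") as f:
        for turns in conversations:
            for turn in turns:
                # Each turn is expected to hold exactly one role key.
                if len(turn) != 1 or not set(turn) <= allowed:
                    raise ValueError(f"unexpected turn: {turn!r}")
            f.write(json.dumps(turns, ensure_ascii=False) + "\n")


def split_pretrain_text(raw_text):
    """Split a plain-text pre-training file into samples on '\\n\\n'."""
    return [s for s in raw_text.split("\n\n") if s.strip()]


out_dir = Path(tempfile.mkdtemp())  # hypothetical scratch location
write_pretrain_jsonl(out_dir / "pretrain.jsonl", ["first document", "second document"])
write_finetune_jsonl(
    out_dir / "dataset0.jsonl",
    [
        [{"system": "You are a helpful assistant."}, {"user": "Hi"}, {"assistant": "Hello!"}],
        [{"user": "1+1?"}, {"assistant": "2"}],  # the 'system' turn is omitted here
    ],
)
samples = split_pretrain_text("first sample\n\nsecond sample")
```

The resulting dataset0.jsonl could then be passed to jllm.raw2ids with -i as in the command above.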
For Vision Language Model:
python -m jllm.raw2ids \
--tokenizer Qwen2-VL-7B-Instruct \
-i dataset_vl.jsonl \
--image_path images \
--max_len 8192
The folder images stores all the image data. Each row of dataset_vl.jsonl has the following format:
[{'user':['Give a description of these pictures please.\n <image>....','image0.jpg',...]},{'assistant':'This is ....'}]
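A row in this shape can be assembled programmatically. The sketch below is illustrative only: make_vl_row is a hypothetical helper, and using one <image> placeholder per listed file is an assumption that the format line above does not confirm.

```python
import json


def make_vl_row(prompt, image_files, answer):
    """Build one row: the user turn lists the prompt, then the image file names."""
    return [{"user": [prompt, *image_files]}, {"assistant": answer}]


row = make_vl_row(
    # Assumption: one <image> placeholder per image file.
    "Give a description of these pictures please.\n <image><image>",
    ["image0.jpg", "image1.jpg"],  # file names relative to the images folder
    "This is ....",
)
line = json.dumps(row, ensure_ascii=False)  # one line of dataset_vl.jsonl
```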
If you have multiple datasets, do not skip this step. It shuffles all the datasets globally by row, as Spark does.
First, move all the datasets stored in parquet folders into one directory, such as datasets:
datasets
├── dataset0_Qwen1.5-14B-Chat
│   ├── dataset0-00000
│   │   ├── dataset0-00000-00000.gzip.parquet
│   │   └── dataset0-00000-00001.gzip.parquet
│   └── dataset0-00001
│       ├── dataset0-00001-00000.gzip.parquet
│       └── dataset0-00001-00001.gzip.parquet
└── dataset1_Qwen1.5-14B-Chat
    ├── dataset1-00000
    │   ├── dataset1-00000-00000.gzip.parquet
    │   └── dataset1-00000-00001.gzip.parquet
    └── dataset1-00001
        ├── dataset1-00001-00000.gzip.parquet
        └── dataset1-00001-00001.gzip.parquet
Then run the following command to shuffle the rows within each dataset and distribute them to new blocks. num_block should preferably be a multiple of the next step's repartition number.
python -m jllm.shuffle_datasets -d datasets -o shuffled_datasets -n 4
Each dataset is shuffled and placed in shuffled_datasets as a multiple of num_block parquet files:
shuffled_datasets/
├── dataset0_Qwen1.5-14B-Chat-00000
│   ├── dataset0_Qwen1.5-14B-Chat-00000-00000.gzip.parquet
│   ├── dataset0_Qwen1.5-14B-Chat-00000-00001.gzip.parquet
│   ├── dataset0_Qwen1.5-14B-Chat-00000-00002.gzip.parquet
│   └── dataset0_Qwen1.5-14B-Chat-00000-00003.gzip.parquet
└── dataset1_Qwen1.5-14B-Chat-00000
    ├── dataset1_Qwen1.5-14B-Chat-00000-00000.gzip.parquet
    ├── dataset1_Qwen1.5-14B-Chat-00000-00001.gzip.parquet
    ├── dataset1_Qwen1.5-14B-Chat-00000-00002.gzip.parquet
    └── dataset1_Qwen1.5-14B-Chat-00000-00003.gzip.parquet
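The idea of shuffling rows and spreading them over num_block blocks can be illustrated in plain Python. This is a conceptual sketch only and not how jllm.shuffle_datasets is implemented: the function name, the fixed seed, and the round-robin distribution are all assumptions made for the example.

```python
import random


def shuffle_into_blocks(rows, num_block, seed=0):
    """Shuffle rows, then deal them round-robin into num_block blocks."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the example is repeatable
    blocks = [[] for _ in range(num_block)]
    for i, row in enumerate(shuffled):
        blocks[i % num_block].append(row)
    return blocks


# Ten stand-in rows spread over four blocks, matching -n 4 above.
blocks = shuffle_into_blocks(range(10), num_block=4)
```

Every row lands in exactly one block, and block sizes differ by at most one row.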
Optional but recommended. 1B token ids in parquet files take up at most 2 GB of disk space but require approximately 10 GB of CPU memory. Set num_partition according to the CPU memory of each worker.
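As a rough aid for choosing num_partition, the rule of thumb above (about 10 GB of CPU memory per 1B token ids) can be turned into a small calculation. The helper below is hypothetical, and it assumes a worker holds one partition in memory at a time and that worker_memory_gb is the budget you are willing to spend on data.

```python
import math

GB_PER_BILLION_TOKENS = 10  # rule of thumb quoted in the text above


def min_num_partition(total_tokens, worker_memory_gb):
    """Smallest partition count whose per-partition footprint fits the budget."""
    total_gb = total_tokens / 1e9 * GB_PER_BILLION_TOKENS
    return max(1, math.ceil(total_gb / worker_memory_gb))


# Example: 20B token ids and a 64 GB data budget per worker.
n = min_num_partition(total_tokens=20e9, worker_memory_gb=64)
```

In this example 20B token ids need about 200 GB in total, so at least 4 partitions are required for each one to fit in 64 GB.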
python -m jllm.repartition -d shuffled_datasets -n 4
The datasets will be:
shuffled_datasets/
├── 5984729befe338e6a7-part-00000
│   ├── dataset0_Qwen1.5-14B-Chat-00000-00000.gzip.parquet
│   └── dataset1_Qwen1.5-14B-Chat-00000-00000.gzip.parquet
├── 5984729befe338e6a7-part-00001
│   ├── dataset0_Qwen1.5-14B-Chat-00000-00001.gzip.parquet
│   └── dataset1_Qwen1.5-14B-Chat-00000-00001.gzip.parquet
├── 5984729befe338e6a7-part-00002
│   ├── dataset0_Qwen1.5-14B-Chat-00000-00002.gzip.parquet
│   └── dataset1_Qwen1.5-14B-Chat-00000-00002.gzip.parquet
├── 5984729befe338e6a7-part-00003
│   ├── dataset0_Qwen1.5-14B-Chat-00000-00003.gzip.parquet
│   └── dataset1_Qwen1.5-14B-Chat-00000-00003.gzip.parquet
└── data.info
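The layout above suggests that each part folder collects the block with the same index from every dataset. The sketch below reproduces that grouping from file names alone. It is inferred from the listing rather than taken from jllm.repartition, it only covers the case shown (as many blocks as partitions), and the PREFIX placeholder stands in for the generated folder prefix.

```python
import re
from collections import defaultdict


def group_by_block(file_names, prefix="PREFIX"):
    """Group parquet files by their trailing block index into part folders."""
    parts = defaultdict(list)
    for name in file_names:
        match = re.search(r"-(\d{5})\.gzip\.parquet$", name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        parts[f"{prefix}-part-{match.group(1)}"].append(name)
    return dict(parts)


files = [
    "dataset0_Qwen1.5-14B-Chat-00000-00000.gzip.parquet",
    "dataset0_Qwen1.5-14B-Chat-00000-00001.gzip.parquet",
    "dataset1_Qwen1.5-14B-Chat-00000-00000.gzip.parquet",
    "dataset1_Qwen1.5-14B-Chat-00000-00001.gzip.parquet",
]
parts = group_by_block(files)
```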
Note: You can also use PySpark to do these steps. jllm can directly read token ids from parquet files written by Spark.
deepspeed -H $HOSTFILE \
--module jllm.train_pipe \
--model Qwen2-VL-7B-Instruct \
--train_data dataset_vl_Qwen2-VL-7B-Instruct \
--pipe_parallel_size 8 \
--encoder_pipe_parallel_size 2 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 32 \
--only_ckpt_model \
--checkpoint checkpoint \
--max_num_checkpoints 2 \
--partition_method 11,2 \
--split_dlayer \
--checkpoint_grad_interval 1
Note: The arguments train_data and eval_data also accept jsonl files. Run python -m jllm.train_pipe -h to see more arguments.
Normally, every GPU process reads one piece of data, so a node with 8 GPUs needs to allocate 8x the CPU memory for data. With the optimizations in this project, those GPUs need only 1x if they belong to the same pipeline. I strongly recommend training your model with Pipeline Parallelism, which is faster and cheaper, rather than ZeRO. The pipeline engine can load and save model weights directly in HuggingFace format, and it can also load weights from a checkpoint. To resume an interrupted run, do not modify any training-related configuration.
By default, the engine saves checkpoints in a background process to leave more time for training. To avoid running out of CPU memory, don't save checkpoints too frequently unless you disable background checkpointing with the argument '--background_executor none'.
Setting --partition_method to fast always gives faster training when there is enough GPU memory.
If the argument --only_ckpt_model is enabled, the engine checkpoints only the model's weights, directly in HF format.
You can also use jllm.train_pipe to convert model weights from a DeepSpeed checkpoint to HF format, for example:
deepspeed -H $HOSTFILE \
--module jllm.train_pipe \
--model Qwen2-VL-7B-Instruct \
--train_data dataset_vl_Qwen2-VL-7B-Instruct \
--pipe_parallel_size 8 \
--encoder_pipe_parallel_size 2 \
--partition_method 11,2 \
--split_dlayer \
--num_train_epochs 0 \
--from_ckpt checkpoint --tag 1000 \
--output_dir output_path
Model | Training Speed (tokens/s) |
---|---|
llama-13b | 92749.82(old) |
baichuan-13b | 79765.50(old) |
qwen-14b | 80749.57(old) |
qwen2-moe | - |
internlm2 | - |
internvl2 | - |
qwen2-vl | - |
Note: The training speed of each model was measured on 64 NVIDIA A100-PCIE-40GB GPUs connected by 100 Gb/s InfiniBand, with the bfloat16 data type and a batch token size of 2048*2048 (batch_size*sequence_length, where batch_size = micro_batch_size * gradient_accumulation_steps).
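To relate the batch token size in the note to the throughput figures, the arithmetic can be written out as below. This is only a worked example derived from the numbers above: the helper names are hypothetical, and the step time and per-GPU rate are computed values, not additional measurements.

```python
def tokens_per_step(batch_size, sequence_length):
    """Tokens consumed by one optimizer step."""
    return batch_size * sequence_length


def seconds_per_step(batch_tokens, tokens_per_second):
    """Time one optimizer step takes at a given throughput."""
    return batch_tokens / tokens_per_second


batch_tokens = tokens_per_step(2048, 2048)  # 4,194,304 tokens per step
llama_13b_step = seconds_per_step(batch_tokens, 92749.82)  # roughly 45 seconds
llama_13b_per_gpu = 92749.82 / 64  # throughput divided over the 64 GPUs
```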
Model | Training Speed (tokens/s) |
---|---|
llama-7b | 26335.232 |
Note: Measured on 8 NVIDIA A100-PCIE-40GB GPUs with data type of bfloat16 and batch token size of 2304*2048.
vLLM is used here for inference.
python batch_infer.py \
--model Qwen1.5-14B-Chat-Finetune \
--prompt_file prompt.txt
Start the server:
python server.py --model Qwen1.5-14B-Chat-Finetune
Query the model:
curl http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{
"messages":[{"user": "San Francisco is a"}],
"sampling":{"max_tokens":32}
}'
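The same query can be sent from Python with the standard library. This is a minimal sketch that mirrors the curl call above: the URL and payload come from that command, while the function names are hypothetical and the response is returned as raw text because its format is not described here.

```python
import json
import urllib.request


def build_generate_request(messages, max_tokens=32,
                           url="http://localhost:8000/generate"):
    """Build a POST request carrying the same JSON body as the curl example."""
    payload = {"messages": messages, "sampling": {"max_tokens": max_tokens}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(messages, **kwargs):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_generate_request(messages, **kwargs)) as resp:
        return resp.read().decode("utf-8")


request = build_generate_request([{"user": "San Francisco is a"}])
# With server.py running, the query itself would be:
# print(generate([{"user": "San Francisco is a"}]))
```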
If you find flash-finetuning useful or use flash-finetuning code in your research, please cite it in your publications.
@misc{flash-finetuning,
author = {Jian Lu},
title = {Flash Fine Tuning: Training Large Language Model faster, easily and low-cost.},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/janelu9/flash-finetuning.git}},
}
This repository benefits from DeepSpeed, Megatron-LM and Flash-Attention.