version 1.0.5, fix resume checkpoint, model save and generation #148

KaiLv69 · 2024-01-03T03:19:22Z

No description provided.

x54-729 and others added 30 commits July 25, 2023 12:33

set checkpointing to model_config

4f6e248

update pipelinemodel to newest

3f2f46f

fix llama's save_parallel_state_dict and support llama2-70b

7e5aaaa

fix: bugs for python3.11

150fd93

support save config to petrel

8a076ac

small conflict

8524457

support saving config to petrel

984fab4

support saving config to petrel: llama

43aa047

support saving config to petrel: moss

087e5fb

support saving config to petrel: moss_moon

31eea10

support saving config to petrel: internlm

d16ca2f

support saving config to petrel: chatglm

1bfdf30

support saving config to petrel: chatglm2

2b421b5

fix: bugs in saving llama

3f2976e

Merge branch 'dev' of https://github.com/OpenLMLab/collie into dev

ca14f4b

change pipeline engine and several models to fit new pipeline format

65696b0

change pipeline engine and several models to fit new pipeline format

859a495

update save_peft and load_peft(lora)

e025bd7

Merge branch 'dev' of https://github.com/OpenLMLab/collie into dev

eeb56f5

fix get_input_embedding

fa85494

add: flashv2

418b0c5

Merge branch 'dev' of https://github.com/OpenLMLab/collie into dev

5d87c08

Delete peft

c9b45fb

Update requirements.txt

d9ef6a3

remove extra dropout in llama

73aeda6

update memory profile

88501cb

add function flash_attention

ffefc52

fix lomo with lr_scheduler

3b69924

Merge pull request #97 from KaiLv69/dev

7815c07

fix lomo with lr_scheduler

add merge_index_dict to models/utils.py

b0aa25f

KaiLv69 and others added 27 commits November 10, 2023 20:44

fix: split lm_head for tp

a75dfc4

fix: fix rich progress for some slurm cluster

4b9aac6

Add --unbuffered to the srun command to make the progress bar display.

update glm&glm2 model & add tests

f79521f

add glm tests

52d3674

fix: typos in docs

99ea677

add resume train

8023389

fix print

91e54df

Merge branch 'kv_cache_resume_checkpoint' of github.com:OpenLMLab/collie into kv_cache_resume_checkpoint

722ebc7

fix chatglm2 for generation in pp

24caa67

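With pp (pipeline parallelism), the batch is split into micro batches along the first dimension, but the first dimension of chatglm2's kv cache is seq_len, so the cache has to be permuted first.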
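
The commit note above can be illustrated with a toy sketch. It assumes nested Python lists standing in for tensors and two hypothetical helpers, `split_first_dim` and `swap_first_two_dims`; neither name comes from the repository. The idea shown is only the general one: move the batch dimension to the front, split into micro batches, then restore the seq_len-first layout for each micro batch.

```python
def split_first_dim(x, num_micro_batches):
    """Split a nested list into equal chunks along its first dimension."""
    size = len(x) // num_micro_batches
    return [x[i * size:(i + 1) * size] for i in range(num_micro_batches)]


def swap_first_two_dims(x):
    """Toy stand-in for permuting the first two tensor dimensions."""
    return [list(row) for row in zip(*x)]


seq_len, batch = 3, 4

# A seq_len-first cache: kv_cache[s][b] is the entry for position s, sample b.
kv_cache = [[f"s{s}b{b}" for b in range(batch)] for s in range(seq_len)]

# Splitting kv_cache directly would cut along seq_len, not along the batch.
# Put the batch dimension first, then split into two micro batches.
batch_first = swap_first_two_dims(kv_cache)
micro_batches = split_first_dim(batch_first, 2)

# Each micro batch is permuted back to the seq_len-first layout.
restored = [swap_first_two_dims(m) for m in micro_batches]
```

Here `restored[0]` holds all three positions for samples 0 and 1 only, which is what a per-micro-batch cache should contain. With real tensors the same swap would be a permute or transpose call.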

fix llama's kv_cache in pp

c552a73

fix kv_cache in PipelineGenerationMixin

3f7d6a5

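The previous implementation was overcomplicated. The inputs to generate_batch can simply all be dicts, and they only need to be converted to a tuple on return.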
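
A minimal sketch of the pattern this note describes, under stated assumptions: `forward_step`, `to_tuple` and `OUTPUT_KEYS` are hypothetical names invented for illustration and are not the signature of generate_batch. Everything is passed around as a dict, and the conversion to a tuple happens once, at the boundary.

```python
from typing import Any, Dict, Tuple

# Hypothetical, fixed ordering used only when converting to a tuple.
OUTPUT_KEYS = ("logits", "past_key_values")


def forward_step(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Toy stand-in for one generation step: dict in, dict out."""
    past = inputs.get("past_key_values") or []
    new_past = past + [inputs["input_ids"]]
    return {"logits": len(new_past), "past_key_values": new_past}


def to_tuple(outputs: Dict[str, Any]) -> Tuple[Any, ...]:
    """Convert the dict to a tuple only when returning to the caller."""
    return tuple(outputs[key] for key in OUTPUT_KEYS)


state = {"input_ids": [1, 2], "past_key_values": None}
outputs = forward_step(state)
result = to_tuple(outputs)
```

Keeping dicts internally means no code has to track positional order until the final conversion.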

add comments

850510e

remove useless comments

254201c

Merge pull request #146 from OpenLMLab/kv_cache_resume_checkpoint

a29d086

fix resume checkpoint

Merge branch 'dev' into kv_cache

412d180

update tests

5da8d00

fix: only permute kv_cache for pp

b6f57b1

remove src/peft

8cc47e9

fix: update import transformers.deepspeed

eb4d154

add: test model save

c2e8155

fix: save model in tp+pp

53d061b

Fixes the hang when saving a model with tp and pp enabled together. gather weight_map in pp group
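
As a rough sketch of what gathering a weight_map over the pp group amounts to: each pipeline stage knows only the parameters it saved, so the per-stage maps have to be combined into one index. The helper `merge_weight_maps` and the file names below are hypothetical. In a real run the list of per-stage maps would come from a collective restricted to the pp group (for example `torch.distributed.all_gather_object` with a `group` argument), which is an assumption here, not something the commit states.

```python
from typing import Dict, List


def merge_weight_maps(stage_maps: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge per-stage {parameter name: file name} maps into one index."""
    merged: Dict[str, str] = {}
    for stage, weight_map in enumerate(stage_maps):
        for name, filename in weight_map.items():
            # The same parameter must not be claimed by two different files.
            if name in merged and merged[name] != filename:
                raise ValueError(f"stage {stage}: conflicting entry for {name}")
            merged[name] = filename
    return merged


# Simulated result of gathering one weight_map per pipeline stage.
stage_maps = [
    {"embed.weight": "model-stage0.bin", "layers.0.weight": "model-stage0.bin"},
    {"layers.1.weight": "model-stage1.bin", "lm_head.weight": "model-stage1.bin"},
]
index = merge_weight_maps(stage_maps)
```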

fix: save chatglm2 in tp and pp

09f10a0

Do the tp gather first, then rename and cat query_key_value.
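
The toy example below shows one reason the order of these two steps can matter. It assumes, purely for illustration, that each tp rank holds separate query, key and value shards and that the fused query_key_value layout is all query rows, then all key rows, then all value rows. Lists of labels stand in for weight rows, and `tp_gather` is a hypothetical helper, not the repository's function.

```python
def tp_gather(shards):
    """Concatenate per-rank shards in rank order (toy stand-in for a tp gather)."""
    return [row for shard in shards for row in shard]


# Two tp ranks, each holding one row of the query, key and value projections.
rank_shards = [
    {"query": ["q0"], "key": ["k0"], "value": ["v0"]},
    {"query": ["q1"], "key": ["k1"], "value": ["v1"]},
]

# Gather each projection across ranks first, then cat into the fused weight.
gathered = {
    name: tp_gather([rank[name] for rank in rank_shards])
    for name in ("query", "key", "value")
}
fused = gathered["query"] + gathered["key"] + gathered["value"]

# For comparison: cat on each rank first, then gather across ranks.
per_rank_fused = [r["query"] + r["key"] + r["value"] for r in rank_shards]
interleaved = tp_gather(per_rank_fused)
```

Under these assumptions `fused` comes out as q0, q1, k0, k1, v0, v1, while `interleaved` mixes the projections rank by rank (q0, k0, v0, q1, k1, v1).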

add: llama test save

5e14eeb

Merge pull request #147 from OpenLMLab/kv_cache

cddb8ed

update kv cache for generation and fix chatglm2

Merge branch 'main' into dev

0954aaa

update description

bcdb9e4

Merge branch 'dev' of https://github.com/OpenLMLab/collie into dev

4c60129

KaiLv69 changed the title ~~version 1.0.4, fix resume checkpoint, model save and generation~~ version 1.0.5, fix resume checkpoint, model save and generation Jan 3, 2024

MorningForest approved these changes Jan 3, 2024


KaiLv69 merged commit 5a30412 into main Jan 3, 2024
1 check failed
