Version v0.3.1 Released Today!
What's Changed
Release
- [release] update version (#4332) by Hongxin Liu
Chat
- [chat] fix compute_approx_kl (#4338) by Wenhao Chen
- [chat] removed cache file (#4155) by Frank Lee
- [chat] use official transformers and fix some issues (#4117) by Wenhao Chen
- [chat] remove naive strategy and split colossalai strategy (#4094) by Wenhao Chen
- [chat] refactor trainer class (#4080) by Wenhao Chen
- [chat] fix possible chat evaluation bug (#4064) by Michelle
- [chat] refactor strategy class with booster api (#3987) by Wenhao Chen
- [chat] refactor actor class (#3968) by Wenhao Chen
- [chat] add distributed PPO trainer (#3740) by Hongxin Liu
Zero
- [zero] optimize the optimizer step time (#4221) by LuGY
- [zero] support shard optimizer state dict of zero (#4194) by LuGY
- [zero] add state dict for low level zero (#4179) by LuGY
- [zero] allow passing process group to zero12 (#4153) by LuGY
- [zero] support no_sync method for zero1 plugin (#4138) by LuGY (see the sketch below)
- [zero] refactor low level zero for shard evenly (#4030) by LuGY
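
Taken together, these changes round out the low-level zero plugin: gradient accumulation via `no_sync` and sharded optimizer checkpoints. Below is a minimal sketch of how the pieces might fit together through the booster API; it is an illustration based on the PR titles (#4138, #4194), not code from this release, and the exact `no_sync` signature is an assumption.

```python
# Hedged sketch only: gradient accumulation with the zero1 plugin's no_sync
# (#4138) plus a sharded optimizer checkpoint (#4194).
from contextlib import nullcontext

import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import LowLevelZeroPlugin

colossalai.launch_from_torch(config={})  # run this script with `colossalai run` or torchrun
booster = Booster(plugin=LowLevelZeroPlugin(stage=1))  # zero1; stage=2 also shards gradients

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters())
criterion = torch.nn.MSELoss()
model, optimizer, criterion, *_ = booster.boost(model, optimizer, criterion)

# toy data so the sketch is self-contained
dataloader = [(torch.randn(8, 1024, device="cuda"),
               torch.randn(8, 1024, device="cuda")) for _ in range(8)]

accum_steps = 4
for step, (inputs, targets) in enumerate(dataloader):
    sync = (step + 1) % accum_steps == 0
    # no_sync skips gradient synchronization on accumulation steps;
    # the (model, optimizer) argument pair is an assumption
    ctx = nullcontext() if sync else booster.no_sync(model, optimizer)
    with ctx:
        loss = criterion(model(inputs), targets) / accum_steps
        booster.backward(loss, optimizer)
    if sync:
        optimizer.step()
        optimizer.zero_grad()

# shard the optimizer state dict across ranks when saving (#4194)
booster.save_optimizer(optimizer, "optim_ckpt", shard=True)
```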
Nfc
- [NFC] polish applications/Chat/coati/models/utils.py codestyle (#4277) by yuxuan-lou
- [NFC] polish applications/Chat/coati/trainer/strategies/base.py code style (#4278) by Zirui Zhu
- [NFC] polish applications/Chat/coati/models/generation.py code style (#4275) by RichardoLuo
- [NFC] polish applications/Chat/inference/server.py code style (#4274) by Yuanchen
- [NFC] fix format of application/Chat/coati/trainer/utils.py (#4273) by アマデウス
- [NFC] polish applications/Chat/examples/train_reward_model.py code style (#4271) by Xu Kai
- [NFC] fix: format (#4270) by dayellow
- [NFC] polish runtime_preparation_pass style (#4266) by Wenhao Chen
- [NFC] polish unary_elementwise_generator.py code style (#4267) by YeAnbang
- [NFC] polish applications/Chat/coati/trainer/base.py code style (#4260) by shenggan
- [NFC] polish applications/Chat/coati/dataset/sft_dataset.py code style (#4259) by Zheng Zangwei (Alex Zheng)
- [NFC] polish colossalai/booster/plugin/low_level_zero_plugin.py code style (#4256) by 梁爽
- [NFC] polish colossalai/auto_parallel/offload/amp_optimizer.py code style (#4255) by Yanjia0
- [NFC] polish colossalai/cli/benchmark/utils.py code style (#4254) by ocd_with_naming
- [NFC] polish applications/Chat/examples/ray/mmmt_prompt.py code style (#4250) by CZYCW
- [NFC] polish applications/Chat/coati/models/base/actor.py code style (#4248) by Junming Wu
- [NFC] polish applications/Chat/inference/requirements.txt code style (#4265) by Camille Zhong
- [NFC] Fix format for mixed precision (#4253) by Jianghai
- [nfc] fix ColossalaiOptimizer is not defined (#4122) by digger yu
- [nfc] fix dim not defined and fix typo (#3991) by digger yu
- [nfc] fix typo colossalai/zero (#3923) by digger yu
- [nfc] fix typo colossalai/pipeline tensor nn (#3899) by digger yu
- [nfc] fix typo colossalai/nn (#3887) by digger yu
- [nfc] fix typo colossalai/cli fx kernel (#3847) by digger yu
Example
- Fix/format (#4261) by Michelle
- [example] add llama pretraining (#4257) by binmakeswell
- [example] fix bucket size in example of gpt gemini (#4028) by LuGY
- [example] update ViT example using booster api (#3940) by Baizhou Zhang
- Merge pull request #3905 from MaruyamaAya/dreambooth by Liu Ziming
- [example] update opt example using booster api (#3918) by Baizhou Zhang
- [example] Modify palm example with the new booster API (#3913) by Liu Ziming
- [example] update gemini examples (#3868) by jiangmingyan
Ci
- [ci] support testmon core pkg change detection (#4305) by Hongxin Liu
Checkpointio
- [checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin (#4302) by Baizhou Zhang (see the sketch below)
- [checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin (#4141) by Baizhou Zhang
- [checkpointio] sharded optimizer checkpoint for DDP plugin (#4002) by Baizhou Zhang
- [checkpointio] General Checkpointing of Sharded Optimizers (#3984) by Baizhou Zhang
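
The thrust of these PRs is that optimizer state can now be saved and restored in shards through the booster's checkpoint IO. A rough sketch with the Gemini plugin follows; paths are placeholders and the flow is inferred from the PR titles, not taken from release code.

```python
# Hedged sketch: sharded model and optimizer checkpointing with the
# Gemini plugin (#4302).
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch(config={})
booster = Booster(plugin=GeminiPlugin())

model = torch.nn.Linear(1024, 1024)
optimizer = HybridAdam(model.parameters())  # Gemini is typically paired with HybridAdam
model, optimizer, *_ = booster.boost(model, optimizer)

booster.save_model(model, "model_ckpt", shard=True)          # sharded model checkpoint
booster.save_optimizer(optimizer, "optim_ckpt", shard=True)  # sharded optimizer checkpoint (#4302)

# on resume
booster.load_model(model, "model_ckpt")
booster.load_optimizer(optimizer, "optim_ckpt")
```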
Lazy
- [lazy] support init on cuda (#4269) by Hongxin Liu (see the sketch below)
- [lazy] fix compatibility problem on torch 1.13 (#3911) by Hongxin Liu
- [lazy] refactor lazy init (#3891) by Hongxin Liu
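
Lazy initialization defers weight materialization so that large models can be constructed cheaply. A short sketch of init on CUDA (#4269); the `default_device` keyword is an assumption based on the PR titles and the lazy init tutorial (#3922).

```python
# Hedged sketch of lazy init on cuda (#4269); LazyInitContext's keyword
# arguments are assumptions, not confirmed API.
from transformers import GPT2Config, GPT2LMHeadModel
from colossalai.lazy import LazyInitContext

with LazyInitContext(default_device="cuda"):
    # no real allocation happens here, so even a huge model "builds" instantly
    model = GPT2LMHeadModel(GPT2Config())

# parameters materialize later, e.g. when a booster plugin shards or places the model
```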
Kernels
- [Kernels] added triton implementation of self-attention for colossal-ai (#4241) by Cuiqing Li
Docker
- [docker] fixed ninja build command (#4203) by Frank Lee
- [docker] added ssh and rdma support for docker (#4192) by Frank Lee
Dtensor
- [dtensor] fixed readme file name and removed deprecated file (#4162) by Frank Lee
- [dtensor] updated api and doc (#3845) by Frank Lee
Workflow
- [workflow] show test duration (#4159) by Frank Lee
- [workflow] added status check for test coverage workflow (#4106) by Frank Lee
- [workflow] cover all public repositories in weekly report (#4069) by Frank Lee
- [workflow] fixed the directory check in build (#3980) by Frank Lee
- [workflow] cancel duplicated workflow jobs (#3960) by Frank Lee
- [workflow] added docker latest tag for release (#3920) by Frank Lee
- [workflow] fixed workflow check for docker build (#3849) by Frank Lee
Cli
- [cli] hotfix launch command for multi-nodes (#4165) by Hongxin Liu
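
For reference, a multi-node launch with the CLI looks roughly like this; the hostnames and script name are placeholders, not from the release.

```bash
# hypothetical invocation of the colossalai launcher fixed in #4165
colossalai run --nproc_per_node 8 --host host1,host2 --master_addr host1 train.py
```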
Format
- [format] applied code formatting on changed files in pull request 4152 (#4157) by github-actions[bot]
- [format] applied code formatting on changed files in pull request 4021 (#4022) by github-actions[bot]
Shardformer
- [shardformer] added development protocol for standardization (#4149) by Frank Lee
- [shardformer] made tensor parallelism configurable (#4144) by Frank Lee (see the sketch at the end of this section)
- [shardformer] refactored some doc and api (#4137) by Frank Lee
- [shardformer] write a shardformer example with bert finetuning (#4126) by jiangmingyan
- [shardformer] added embedding gradient check (#4124) by Frank Lee
- [shardformer] import huggingface implicitly (#4101) by Frank Lee
- [shardformer] integrate with data parallelism (#4103) by Frank Lee
- [shardformer] supported fused normalization (#4112) by Frank Lee
- [shardformer] supported bloom model (#4098) by Frank Lee
- [shardformer] support vision transformer (#4096) by Kun Lin
- [shardformer] shardformer support opt models (#4091) by jiangmingyan
- [shardformer] refactored layernorm (#4086) by Frank Lee
- [shardformer] Add layernorm (#4072) by FoolPlayer
- [shardformer] supported fused qkv checkpoint (#4073) by Frank Lee
- [shardformer] add linearconv1d test (#4067) by FoolPlayer
- [shardformer] support module saving and loading (#4062) by Frank Lee
- [shardformer] refactored the shardformer layer structure (#4053) by Frank Lee
- [shardformer] adapted T5 and LLaMa test to use kit (#4049) by Frank Lee
- [shardformer] add gpt2 test and layer class refactor (#4041) by FoolPlayer
- [shardformer] supported T5 and its variants (#4045) by Frank Lee
- [shardformer] adapted llama to the new API (#4036) by Frank Lee
- [shardformer] fix bert and gpt downstream with new api (#4024) by FoolPlayer
- [shardformer] updated doc (#4016) by Frank Lee
- [shardformer] removed inplace tensor sharding (#4018) by Frank Lee
- [shardformer] refactored embedding and dropout to parallel module (#4013) by Frank Lee
- [shardformer] integrated linear 1D with dtensor (#3996) by Frank Lee
- [shardformer] Refactor shardformer api (#4001) by FoolPlayer
- [shardformer] fix an error in readme (#3988) by FoolPlayer
- [Shardformer] Downstream bert (#3979) by FoolPlayer
- [shardformer] shardformer support t5 model (#3994) by wukong1992
- [shardformer] support llama model using shardformer (#3969) by wukong1992
- [shardformer] Add dropout layer in shard model and refactor policy api (#3949) by FoolPlayer
- [shardformer] Unit test (#3928) by FoolPlayer
- [shardformer] Align bert value (#3907) by FoolPlayer
- [shardformer] add gpt2 policy and modify shard and slicer to support (#3883) by FoolPlayer
- [shardformer] add Dropout layer support different dropout pattern (#3856) by FoolPlayer
- [shardformer] update readme with modules implement doc (#3834) by FoolPlayer
- [shardformer] refactored the user api (#3828) by Frank Lee
- [shardformer] updated readme (#3827) by Frank Lee
- [shardformer]: Feature/shardformer, add some docstring and readme (#3816) by FoolPlayer
- [shardformer] init shardformer code structure (#3731) by FoolPlayer
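
The net effect of this series is a user-facing API for sharding a Hugging Face model along tensor-parallel lines. A rough sketch under the refactored API (#3828, #4001, #4144); the `ShardConfig` fields and the `optimize` method name are assumptions drawn from the PR titles, not verified release code.

```python
# Hedged sketch of the refactored shardformer user API.
import colossalai
from transformers import BertForSequenceClassification
from colossalai.shardformer import ShardConfig, ShardFormer

colossalai.launch_from_torch(config={})  # tensor parallel group assumed to default to the world group
shard_config = ShardConfig(enable_fused_normalization=True)  # fused layernorm (#4112)
shard_former = ShardFormer(shard_config=shard_config)

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
# optimize() is assumed to return the sharded model plus any cross-shard shared parameters
sharded_model, shared_params = shard_former.optimize(model)
```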
Test
- [test] fixed tests failed due to dtensor change (#4082) by Frank Lee
- [test] fixed codefactor format report (#4026) by Frank Lee
Hotfix
- [hotfix] fix import bug in checkpoint_io (#4142) by Baizhou Zhang
- [hotfix] fix argument naming in docs and examples (#4083) by Baizhou Zhang
Doc
- [doc] update and revise some typos and errs in docs (#4107) by Jianghai
- [doc] add a note about unit-testing to CONTRIBUTING.md (#3970) by Baizhou Zhang
- [doc] add lazy init tutorial (#3922) by Hongxin Liu
- [doc] fix docs about booster api usage (#3898) by Baizhou Zhang
- [doc] update MoE Chinese document (#3890) by jiangmingyan
- [doc] update document of zero with chunk (#3855) by jiangmingyan
- [doc] update nvme offload documents (#3850) by jiangmingyan
Gemini
- Merge pull request #4056 from Fridge003/hotfix/fix_gemini_chunk_config_searching by Baizhou Zhang
- [gemini] fix argument naming during chunk configuration searching by Baizhou Zhang
- [gemini] fixed the gemini checkpoint io (#3934) by Frank Lee
Devops
- [devops] fix build on pr ci (#4043) by Hongxin Liu
- [devops] update torch version in compatibility test (#3919) by Hongxin Liu
- [devops] hotfix testmon cache clean logic (#3917) by Hongxin Liu
- [devops] hotfix CI about testmon cache (#3910) by Hongxin Liu
- [devops] improving testmon cache (#3902) by Hongxin Liu
Sync
- Merge pull request #4025 from hpcaitech/develop by Frank Lee
- Merge pull request #3967 from ver217/update-develop by Frank Lee
- Merge pull request #3942 from hpcaitech/revert-3931-sync/develop-to-shardformer by FoolPlayer
- Revert "[sync] sync feature/shardformer with develop" by Frank Lee
- Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer by FoolPlayer
- Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop by Frank Lee
- Merge pull request #3915 from FrankLeeeee/update/develop by Frank Lee
Booster
- [booster] make optimizer argument optional for boost (#3993) by Wenhao Chen (see the sketch below)
- [booster] update bert example, using booster api (#3885) by wukong1992
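
Since #3993, `boost()` no longer requires an optimizer, which is convenient for inference-only workloads. A one-line sketch, assuming a booster and model constructed as in the earlier examples:

```python
# Hedged sketch: boosting a model without an optimizer (#3993);
# `booster` and `model` are assumed to exist as in the examples above.
model, *_ = booster.boost(model)  # optimizer argument is now optional
```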
Bf16
- [bf16] add bf16 support (#3882) by Hongxin Liu
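
bf16 support is exposed through the booster's mixed-precision setting. A minimal sketch, assuming the string form of the argument:

```python
# Hedged sketch of bf16 mixed precision (#3882); the string-valued
# mixed_precision argument is an assumption about the booster API.
from colossalai.booster import Booster

booster = Booster(mixed_precision="bf16")  # or "fp16"
```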
Full Changelog: v0.3.0...v0.3.1