Version v0.3.1 Released Today!
What's Changed
Release
- [release] update version (#4332) by Hongxin Liu
Chat
- [chat] fix compute_approx_kl (#4338) by Wenhao Chen
- [chat] removed cache file (#4155) by Frank Lee
- [chat] use official transformers and fix some issues (#4117) by Wenhao Chen
- [chat] remove naive strategy and split colossalai strategy (#4094) by Wenhao Chen
- [chat] refactor trainer class (#4080) by Wenhao Chen
- [chat] fix possible chat evaluation bug (#4064) by Michelle
- [chat] refactor strategy class with booster api (#3987) by Wenhao Chen
- [chat] refactor actor class (#3968) by Wenhao Chen
- [chat] add distributed PPO trainer (#3740) by Hongxin Liu
Zero
- [zero] optimize the optimizer step time (#4221) by LuGY
- [zero] support shard optimizer state dict of zero (#4194) by LuGY
- [zero] add state dict for low level zero (#4179) by LuGY
- [zero] allow passing process group to zero12 (#4153) by LuGY
- [zero] support no_sync method for zero1 plugin (#4138) by LuGY (see the sketch below)
- [zero] refactor low level zero for shard evenly (#4030) by LuGY
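
Taken together, these changes round out the low-level zero plugin: gradient accumulation via `no_sync` and sharded optimizer checkpoints. Below is a minimal sketch of how the pieces might fit together through the booster API; it is an illustration based on the PR titles (#4138, #4194), not code from this release, and the exact `no_sync` signature is an assumption.

```python
# Hedged sketch only: gradient accumulation with the zero1 plugin's no_sync
# (#4138) plus a sharded optimizer checkpoint (#4194).
from contextlib import nullcontext

import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import LowLevelZeroPlugin

colossalai.launch_from_torch(config={})  # run this script with `colossalai run` or torchrun
booster = Booster(plugin=LowLevelZeroPlugin(stage=1))  # zero1; stage=2 also shards gradients

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters())
criterion = torch.nn.MSELoss()
model, optimizer, criterion, *_ = booster.boost(model, optimizer, criterion)

# toy data so the sketch is self-contained
dataloader = [(torch.randn(8, 1024, device="cuda"),
               torch.randn(8, 1024, device="cuda")) for _ in range(8)]

accum_steps = 4
for step, (inputs, targets) in enumerate(dataloader):
    sync = (step + 1) % accum_steps == 0
    # no_sync skips gradient synchronization on accumulation steps;
    # the (model, optimizer) argument pair is an assumption
    ctx = nullcontext() if sync else booster.no_sync(model, optimizer)
    with ctx:
        loss = criterion(model(inputs), targets) / accum_steps
        booster.backward(loss, optimizer)
    if sync:
        optimizer.step()
        optimizer.zero_grad()

# shard the optimizer state dict across ranks when saving (#4194)
booster.save_optimizer(optimizer, "optim_ckpt", shard=True)
```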
Nfc
- [NFC] polish applications/Chat/coati/models/utils.py codestyle (#4277) by yuxuan-lou
- [NFC] polish applications/Chat/coati/trainer/strategies/base.py code style (#4278) by Zirui Zhu
- [NFC] polish applications/Chat/coati/models/generation.py code style (#4275) by RichardoLuo
- [NFC] polish applications/Chat/inference/server.py code style (#4274) by Yuanchen
- [NFC] fix format of application/Chat/coati/trainer/utils.py (#4273) by アマデウス
- [NFC] polish applications/Chat/examples/train_reward_model.py code style (#4271) by Xu Kai
- [NFC] fix: format (#4270) by dayellow
- [NFC] polish runtime_preparation_pass style (#4266) by Wenhao Chen
- [NFC] polish unary_elementwise_generator.py code style (#4267) by YeAnbang
- [NFC] polish applications/Chat/coati/trainer/base.py code style (#4260) by shenggan
- [NFC] polish applications/Chat/coati/dataset/sft_dataset.py code style (#4259) by Zheng Zangwei (Alex Zheng)
- [NFC] polish colossalai/booster/plugin/low_level_zero_plugin.py code style (#4256) by 梁爽
- [NFC] polish colossalai/auto_parallel/offload/amp_optimizer.py code style (#4255) by Yanjia0
- [NFC] polish colossalai/cli/benchmark/utils.py code style (#4254) by ocd_with_naming
- [NFC] polish applications/Chat/examples/ray/mmmt_prompt.py code style (#4250) by CZYCW
- [NFC] polish applications/Chat/coati/models/base/actor.py code style (#4248) by Junming Wu
- [NFC] polish applications/Chat/inference/requirements.txt code style (#4265) by Camille Zhong
- [NFC] Fix format for mixed precision (#4253) by Jianghai
- [nfc] fix ColossalaiOptimizer is not defined (#4122) by digger yu
- [nfc] fix dim not defined and fix typo (#3991) by digger yu
- [nfc] fix typo colossalai/zero (#3923) by digger yu
- [nfc] fix typo colossalai/pipeline tensor nn (#3899) by digger yu
- [nfc] fix typo colossalai/nn (#3887) by digger yu
- [nfc] fix typo colossalai/cli fx kernel (#3847) by digger yu
Example
- Fix/format (#4261) by Michelle
- [example] add llama pretraining (#4257) by binmakeswell
- [example] fix bucket size in example of gpt gemini (#4028) by LuGY
- [example] update ViT example using booster api (#3940) by Baizhou Zhang
- Merge pull request #3905 from MaruyamaAya/dreambooth by Liu Ziming
- [example] update opt example using booster api (#3918) by Baizhou Zhang
- [example] Modify palm example with the new booster API (#3913) by Liu Ziming
- [example] update gemini examples (#3868) by jiangmingyan
Ci
- [ci] support testmon core pkg change detection (#4305) by Hongxin Liu
Checkpointio
- [checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin (#4302) by Baizhou Zhang (see the sketch below)
- [checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin (#4141) by Baizhou Zhang
- [checkpointio] sharded optimizer checkpoint for DDP plugin (#4002) by Baizhou Zhang
- [checkpointio] General Checkpointing of Sharded Optimizers (#3984) by Baizhou Zhang
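
The thrust of these PRs is that optimizer state can now be saved and restored in shards through the booster's checkpoint IO. A rough sketch with the Gemini plugin follows; paths are placeholders and the flow is inferred from the PR titles, not taken from release code.

```python
# Hedged sketch: sharded model and optimizer checkpointing with the
# Gemini plugin (#4302).
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch(config={})
booster = Booster(plugin=GeminiPlugin())

model = torch.nn.Linear(1024, 1024)
optimizer = HybridAdam(model.parameters())  # Gemini is typically paired with HybridAdam
model, optimizer, *_ = booster.boost(model, optimizer)

booster.save_model(model, "model_ckpt", shard=True)          # sharded model checkpoint
booster.save_optimizer(optimizer, "optim_ckpt", shard=True)  # sharded optimizer checkpoint (#4302)

# on resume
booster.load_model(model, "model_ckpt")
booster.load_optimizer(optimizer, "optim_ckpt")
```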
Lazy
- [lazy] support init on cuda (#4269) by Hongxin Liu (see the sketch below)
- [lazy] fix compatibility problem on torch 1.13 (#3911) by Hongxin Liu
- [lazy] refactor lazy init (#3891) by Hongxin Liu
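
Lazy initialization defers weight materialization so that large models can be constructed cheaply. A short sketch of init on CUDA (#4269); the `default_device` keyword is an assumption based on the PR titles and the lazy init tutorial (#3922).

```python
# Hedged sketch of lazy init on cuda (#4269); LazyInitContext's keyword
# arguments are assumptions, not confirmed API.
from transformers import GPT2Config, GPT2LMHeadModel
from colossalai.lazy import LazyInitContext

with LazyInitContext(default_device="cuda"):
    # no real allocation happens here, so even a huge model "builds" instantly
    model = GPT2LMHeadModel(GPT2Config())

# parameters materialize later, e.g. when a booster plugin shards or places the model
```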
Kernels
- [Kernels] added triton implementation of self-attention for colossal-ai (#4241) by Cuiqing Li
Docker
- [docker] fixed ninja build command (#4203) by Frank Lee
- [docker] added ssh and rdma support for docker (#4192) by Frank Lee
Dtensor
- [dtensor] fixed readme file name and removed deprecated file (#4162) by Frank Lee
- [dtensor] updated api and doc (#3845) by Frank Lee
Workflow
- [workflow] show test duration (#4159) by Frank Lee
- [workflow] added status check for test coverage workflow (#4106) by Frank Lee
- [workflow] cover all public repositories in weekly report (#4069) by Frank Lee
- [workflow] fixed the directory check in build (#3980) by Frank Lee
- [workflow] cancel duplicated workflow jobs (#3960) by Frank Lee
- [workflow] added docker latest tag for release (#3920) by Frank Lee
- [workflow] fixed workflow check for docker build (#3849) by Frank Lee
Cli
- [cli] hotfix launch command for multi-nodes (#4165) by Hongxin Liu
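
For reference, a multi-node launch with the CLI looks roughly like this; the hostnames and script name are placeholders, not from the release.

```bash
# hypothetical invocation of the colossalai launcher fixed in #4165
colossalai run --nproc_per_node 8 --host host1,host2 --master_addr host1 train.py
```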
Format
- [format] applied code formatting on changed files in pull request 4152 (#4157) by github-actions[bot]
- [format] applied code formatting on changed files in pull request 4021 (#4022) by github-actions[bot]
Shardformer
- [shardformer] added development protocol for standardization (#4149) by Frank Lee
- [shardformer] made tensor parallelism configurable (#4144) by Frank Lee (see the sketch at the end of this section)
- [shardformer] refactored some doc and api (#4137) by Frank Lee
- [shardformer] write a shardformer example with bert finetuning (#4126) by jiangmingyan
- [shardformer] added embedding gradient check (#4124) by Frank Lee
- [shardformer] import huggingface implicitly (#4101) by Frank Lee
- [shardformer] integrate with data parallelism (#4103) by Frank Lee
- [shardformer] supported fused normalization (#4112) by Frank Lee
- [shardformer] supported bloom model (#4098) by Frank Lee
- [shardformer] support vision transformer (#4096) by Kun Lin
- [shardformer] shardformer support opt models (#4091) by jiangmingyan
- [shardformer] refactored layernorm (#4086) by Frank Lee
- [shardformer] Add layernorm (#4072) by FoolPlayer
- [shardformer] supported fused qkv checkpoint (#4073) by Frank Lee
- [shardformer] add linearconv1d test (#4067) by FoolPlayer
- [shardformer] support module saving and loading (#4062) by Frank Lee
- [shardformer] refactored the shardformer layer structure (#4053) by Frank Lee
- [shardformer] adapted T5 and LLaMa test to use kit (#4049) by Frank Lee
- [shardformer] add gpt2 test and layer class refactor (#4041) by FoolPlayer
- [shardformer] supported T5 and its variants (#4045) by Frank Lee
- [shardformer] adapted llama to the new API (#4036) by Frank Lee
- [shardformer] fix bert and gpt downstream with new api (#4024) by FoolPlayer
- [shardformer] updated doc (#4016) by Frank Lee
- [shardformer] removed inplace tensor sharding (#4018) by Frank Lee
- [shardformer] refactored embedding and dropout to parallel module (#4013) by Frank Lee
- [shardformer] integrated linear 1D with dtensor (#3996) by Frank Lee
- [shardformer] Refactor shardformer api (#4001) by FoolPlayer
- [shardformer] fix an error in readme (#3988) by FoolPlayer
- [Shardformer] Downstream bert (#3979) by FoolPlayer
- [shardformer] shardformer support t5 model (#3994) by wukong1992
- [shardformer] support llama model using shardformer (#3969) by wukong1992
- [shardformer] Add dropout layer in shard model and refactor policy api (#3949) by FoolPlayer
- [shardformer] Unit test (#3928) by FoolPlayer
- [shardformer] Align bert value (#3907) by FoolPlayer
- [shardformer] add gpt2 policy and modify shard and slicer to support (#3883) by FoolPlayer
- [shardformer] add Dropout layer support different dropout pattern (#3856) by FoolPlayer
- [shardformer] update readme with modules implement doc (#3834) by FoolPlayer
- [shardformer] refactored the user api (#3828) by Frank Lee
- [shardformer] updated readme (#3827) by Frank Lee
- [shardformer]: Feature/shardformer, add some docstring and readme (#3816) by FoolPlayer
- [shardformer] init shardformer code structure (#3731) by FoolPlayer
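
The net effect of this series is a user-facing API for sharding a Hugging Face model along tensor-parallel lines. A rough sketch under the refactored API (#3828, #4001, #4144); the `ShardConfig` fields and the `optimize` method name are assumptions drawn from the PR titles, not verified release code.

```python
# Hedged sketch of the refactored shardformer user API.
import colossalai
from transformers import BertForSequenceClassification
from colossalai.shardformer import ShardConfig, ShardFormer

colossalai.launch_from_torch(config={})  # tensor parallel group assumed to default to the world group
shard_config = ShardConfig(enable_fused_normalization=True)  # fused layernorm (#4112)
shard_former = ShardFormer(shard_config=shard_config)

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
# optimize() is assumed to return the sharded model plus any cross-shard shared parameters
sharded_model, shared_params = shard_former.optimize(model)
```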
Test
- [test] fixed tests failed due to dtensor change (#4082) by Frank Lee
- [test] fixed codefactor format report (#4026) by Frank Lee
Hotfix
- [hotfix] fix import bug in checkpoint_io (#4142) by Baizhou Zhang
- [hotfix] fix argument naming in docs and examples (#4083) by Baizhou Zhang
Doc
- [doc] update and revise some typos and errs in docs (#4107) by Jianghai
- [doc] add a note about unit-testing to CONTRIBUTING.md (#3970) by Baizhou Zhang
- [doc] add lazy init tutorial (#3922) by Hongxin Liu
- [doc] fix docs about booster api usage (#3898) by Baizhou Zhang
- [doc] update MoE Chinese document (#3890) by jiangmingyan
- [doc] update document of zero with chunk (#3855) by jiangmingyan
- [doc] update nvme offload documents (#3850) by jiangmingyan
Gemini
- Merge pull request #4056 from Fridge003/hotfix/fix_gemini_chunk_config_searching by Baizhou Zhang
- [gemini] fix argument naming during chunk configuration searching by Baizhou Zhang
- [gemini] fixed the gemini checkpoint io (#3934) by Frank Lee
Devops
- [devops] fix build on pr ci (#4043) by Hongxin Liu
- [devops] update torch version in compatibility test (#3919) by Hongxin Liu
- [devops] hotfix testmon cache clean logic (#3917) by Hongxin Liu
- [devops] hotfix CI about testmon cache (#3910) by Hongxin Liu
- [devops] improving testmon cache (#3902) by Hongxin Liu
Sync
- Merge pull request #4025 from hpcaitech/develop by Frank Lee
- Merge pull request #3967 from ver217/update-develop by Frank Lee
- Merge pull request #3942 from hpcaitech/revert-3931-sync/develop-to-shardformer by FoolPlayer
- Revert "[sync] sync feature/shardformer with develop" by Frank Lee
- Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer by FoolPlayer
- Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop by Frank Lee
- Merge pull request #3915 from FrankLeeeee/update/develop by Frank Lee
Booster
- [booster] make optimizer argument optional for boost (#3993) by Wenhao Chen (see the sketch below)
- [booster] update bert example, using booster api (#3885) by wukong1992
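
Since #3993, `boost()` no longer requires an optimizer, which is convenient for inference-only workloads. A one-line sketch, assuming a booster and model constructed as in the earlier examples:

```python
# Hedged sketch: boosting a model without an optimizer (#3993);
# `booster` and `model` are assumed to exist as in the examples above.
model, *_ = booster.boost(model)  # optimizer argument is now optional
```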
Bf16
- [bf16] add bf16 support (#3882) by Hongxin Liu
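
bf16 support is exposed through the booster's mixed-precision setting. A minimal sketch, assuming the string form of the argument:

```python
# Hedged sketch of bf16 mixed precision (#3882); the string-valued
# mixed_precision argument is an assumption about the booster API.
from colossalai.booster import Booster

booster = Booster(mixed_precision="bf16")  # or "fp16"
```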
Full Changelog: v0.3.0...v0.3.1