Version v0.3.0 Released Today!
What's Changed
NFC
- [NFC] fix typos in colossalai/ and applications/ (#3831) by digger yu
- [NFC] fix typos in colossalai/auto_parallel, nn, utils, etc. (#3779) by digger yu
- [NFC] fix typos in colossalai/amp, auto_parallel, and autochunk (#3756) by digger yu
- [NFC] fix typos in colossalai/auto_parallel/tensor_shard (#3742) by digger yu
- [NFC] fix typos in applications/ and colossalai/ (#3735) by digger-yu
- [NFC] polish colossalai/engine/gradient_handler/__init__.py code style (#3329) by Ofey Chan
- [NFC] polish colossalai/context/random/__init__.py code style (#3327) by yuxuan-lou
- [NFC] polish colossalai/fx/tracer/_tracer_utils.py (#3323) by Michelle
- [NFC] polish colossalai/gemini/paramhooks/_param_hookmgr.py code style by Xu Kai
- [NFC] polish initializer_data.py code style (#3287) by RichardoLuo
- [NFC] polish colossalai/cli/benchmark/models.py code style (#3290) by Ziheng Qin
- [NFC] polish initializer_3d.py code style (#3279) by Kai Wang (Victor Kai)
- [NFC] polish colossalai/engine/gradient_accumulation/_gradient_accumulation.py code style (#3277) by Sze-qq
- [NFC] polish colossalai/context/parallel_context.py code style (#3276) by Arsmart1
- [NFC] polish colossalai/engine/schedule/_pipeline_schedule_v2.py code style (#3275) by Zirui Zhu
- [NFC] polish colossalai/nn/_ops/addmm.py code style (#3274) by Tong Li
- [NFC] polish colossalai/amp/__init__.py code style (#3272) by lucasliunju
- [NFC] polish code style (#3273) by Xuanlei Zhao
- [NFC] polish colossalai/fx/proxy.py code style (#3269) by CZYCW
- [NFC] polish code style (#3268) by Yuanchen
- [NFC] polish tensor_placement_policy.py code style (#3265) by Camille Zhong
- [NFC] polish colossalai/fx/passes/split_module.py code style (#3263) by CsRic
- [NFC] polish colossalai/global_variables.py code style (#3259) by jiangmingyan
- [NFC] polish colossalai/engine/gradient_handler/_moe_gradient_handler.py (#3260) by LuGY
- [NFC] polish colossalai/fx/profiler/experimental/profiler_module/embedding.py code style (#3256) by dayellow
Doc
- [doc] update the Gemini instruction document (#3842) by jiangmingyan
- Merge pull request #3810 from jiangmingyan/amp by jiangmingyan
- [doc] fix by jiangmingyan
- [doc] fix by jiangmingyan
- [doc] add warning about fsdp plugin (#3813) by Hongxin Liu
- [doc] add removed change of config.py by jiangmingyan
- [doc] add removed warning by jiangmingyan
- [doc] update amp document by Mingyan Jiang
- [doc] update amp document by Mingyan Jiang
- [doc] update amp document by Mingyan Jiang
- [doc] update gradient accumulation (#3771) by jiangmingyan
- [doc] update gradient clipping document (#3778) by jiangmingyan
- [doc] add deprecation warning to the Basics section of the docs (#3754) by Yanjia0
- [doc] add booster docstring and fix autodoc (#3789) by Hongxin Liu
- [doc] add tutorial for booster checkpoint (#3785) by Hongxin Liu
- [doc] add tutorial for booster plugins (#3758) by Hongxin Liu
- [doc] add tutorial for cluster utils (#3763) by Hongxin Liu
- [doc] update hybrid parallelism doc (#3770) by jiangmingyan
- [doc] update booster tutorials (#3718) by jiangmingyan
- [doc] fix chat spelling error (#3671) by digger-yu
- [Doc] enhancement on README.md for chat examples (#3646) by Camille Zhong
- [doc] Fix typos under colossalai and doc (#3618) by digger-yu
- [doc] .github/workflows/README.md (#3605) by digger-yu
- [doc] fix setup.py typo (#3603) by digger-yu
- [doc] fix op_builder/README.md (#3597) by digger-yu
- [doc] Update .github/workflows/README.md (#3577) by digger-yu
- [doc] Update 1D_tensor_parallel.md (#3573) by digger-yu
- [doc] Update 1D_tensor_parallel.md (#3563) by digger-yu
- [doc] Update README.md (#3549) by digger-yu
- [doc] Update README-zh-Hans.md (#3541) by digger-yu
- [doc] hide diffusion in application path (#3519) by binmakeswell
- [doc] add requirement and highlight application (#3516) by binmakeswell
- [doc] Add docs for clip args in zero optim (#3504) by YH
- [doc] updated contributor list (#3474) by Frank Lee
- [doc] polish diffusion example (#3386) by Jan Roudaut
- [doc] add Intel cooperation news (#3333) by binmakeswell
- [doc] added authors to the chat application (#3307) by Fazzie-Maqianli
Workflow
- [workflow] supported test on CUDA 10.2 (#3841) by Frank Lee
- [workflow] fixed testmon cache in build CI (#3806) by Frank Lee
- [workflow] changed doc build to run on schedule and release (#3825) by Frank Lee
- [workflow] enabled doc build from a forked repo (#3815) by Frank Lee
- [workflow] enable testing for develop & feature branch (#3801) by Frank Lee
- [workflow] fixed the docker build workflow (#3794) by Frank Lee
Booster
- [booster] add warning for torch fsdp plugin doc (#3833) by wukong1992
- [booster] fix torch fsdp checkpointing (#3788) by wukong1992
- [booster] removed models that don't support fsdp (#3744) by wukong1992
- [booster] support torch fsdp plugin in booster (#3697) by wukong1992
- [booster] add tests for ddp and low level zero's checkpointio (#3715) by jiangmingyan
- [booster] fix no_sync method (#3709) by Hongxin Liu
- [booster] update prepare dataloader method for plugin (#3706) by Hongxin Liu
- [booster] refactor all dp fashion plugins (#3684) by Hongxin Liu
- [booster] gemini plugin support shard checkpoint (#3610) by jiangmingyan
- [booster] add low level zero plugin (#3594) by Hongxin Liu
- [booster] fixed the torch ddp plugin with the new checkpoint api (#3442) by Frank Lee
- [booster] implement Gemini plugin (#3352) by ver217
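
Taken together, the entries above converge on a single `Booster` entry point. Below is a minimal sketch of how the pieces fit, assuming a `torchrun` launch and an available GPU; the toy model, optimizer, and plugin choice are illustrative only, not the library's canonical example:

```python
# Minimal booster sketch; the toy model/optimizer and the plugin choice
# are illustrative assumptions, not the only supported configuration.
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

colossalai.launch_from_torch(config={})  # expects torchrun-style env vars

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()

# boost() wraps the raw objects with the plugin's distributed logic
booster = Booster(plugin=TorchDDPPlugin())
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)

x, y = torch.randn(8, 16).cuda(), torch.randn(8, 4).cuda()
loss = criterion(model(x), y)
booster.backward(loss, optimizer)  # used in place of loss.backward()
optimizer.step()
```

Because #3684 refactors all data-parallel plugins behind the same interface, swapping in `GeminiPlugin` (#3352), `LowLevelZeroPlugin` (#3594), or `TorchFSDPPlugin` (#3697) should leave the training loop itself unchanged.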
API
- [API] add docstrings and initialization to apex amp, naive amp (#3783) by jiangmingyan
Test
- [test] fixed lazy init test import error (#3799) by Frank Lee
- Update test_ci.sh by Camille Zhong
- [test] refactor tests with spawn (#3452) by Frank Lee
- [test] reorganize zero/gemini tests (#3445) by ver217
- [test] fixed gemini plugin test (#3411) by Frank Lee
Format
- [format] applied code formatting on changed files in pull request 3786 (#3787) by github-actions[bot]
- [format] Run lint on colossalai.engine (#3367) by Hakjin Lee
Plugin
- [plugin] a workaround for zero plugins' optimizer checkpoint (#3780) by Hongxin Liu
- [plugin] torch ddp plugin supports sharded model checkpoint (#3775) by Hongxin Liu
Chat
- [chat] add performance and tutorial (#3786) by binmakeswell
- [chat] fix bugs in stage 3 training (#3759) by Yuanchen
- [chat] fix community example ray (#3719) by MisterLin1995
- [chat] fix train_prompts.py gemini strategy bug (#3666) by zhang-yi-chi
- [chat] PPO stage3 doc enhancement (#3679) by Camille Zhong
- [chat] add opt attn kernel (#3655) by Hongxin Liu
- [chat] typo accimulation_steps -> accumulation_steps (#3662) by tanitna
- Merge pull request #3656 from TongLi3701/chat/update_eval by Tong Li
- [chat] set default zero2 strategy (#3667) by binmakeswell
- [chat] refactor model save/load logic (#3654) by Hongxin Liu
- [chat] remove lm model class (#3653) by Hongxin Liu
- [chat] refactor trainer (#3648) by Hongxin Liu
- [chat] polish performance evaluator (#3647) by Hongxin Liu
- Merge pull request #3621 from zhang-yi-chi/fix/chat-train-prompts-single-gpu by Tong Li
- [Chat] Remove duplicate functions (#3625) by ddobokki
- [chat] fix bug when enabling single-GPU training by zhang-yi-chi
- [chat] polish code note typo (#3612) by digger-yu
- [chat] update reward model sh (#3578) by binmakeswell
- [chat] ChatGPT train prompts on ray example (#3309) by MisterLin1995
- [chat] polish tutorial doc (#3551) by binmakeswell
- [chat] add examples of training with limited resources in chat readme (#3536) by Yuanchen
- [chat] add vf_coef argument for PPOTrainer (#3318) by zhang-yi-chi
- [chat] add zero2 cpu strategy for sft training (#3520) by ver217
- [chat] fix stage3 PPO sample sh command (#3477) by binmakeswell
- [Chat] Add Peft support & fix the ptx bug (#3433) by YY Lin
- [chat] fix save_model (#3377) by Dr-Corgi
- [chat] fix readme (#3429) by kingkingofall
- [Chat] fix the tokenizer "int too big to convert" error in SFT training (#3453) by Camille Zhong
- [chat] fix sft training for bloom, gpt and opt (#3418) by Yuanchen
- [chat] correcting a few obvious typos and grammar errors (#3338) by Andrew
DevOps
- [devops] fix doc test on pr (#3782) by Hongxin Liu
- [devops] fix ci for document check (#3751) by Hongxin Liu
- [devops] make build on PR run automatically (#3748) by Hongxin Liu
- [devops] update torch version of CI (#3725) by Hongxin Liu
- [devops] fix chat ci (#3628) by Hongxin Liu
AMP
- [amp] Add naive amp demo (#3774) by jiangmingyan
Auto
- [auto] fix install cmd (#3772) by binmakeswell
Fix
- [fix] Add init to fix import error when importing _analyzer (#3668) by Ziyue Jiang
CI
- [CI] fix typo with tests/ etc. (#3727) by digger-yu
- [CI] fix typo with tests components (#3695) by digger-yu
- [CI] fix some spelling errors (#3707) by digger-yu
- [CI] Update test_sharded_optim_with_sync_bn.py (#3688) by digger-yu
Example
- [example] add train resnet/vit with booster example (#3694) by Hongxin Liu
- [example] add finetune bert with booster example (#3693) by Hongxin Liu
- [example] fix community doc (#3586) by digger-yu
- [example] reorganize for community examples (#3557) by binmakeswell
- [example] remove redundant texts & update roberta (#3493) by mandoxzhang
- [example] update roberta with newer ColossalAI (#3472) by mandoxzhang
- [example] update examples related to zero/gemini (#3431) by ver217
Tensor
- [tensor] Refactor handle_trans_spec in DistSpecManager by YH
Zero
- [zero] rename confusing variable names in the ZeRO optimizer (#3173) by YH
- [zero] reorganize zero/gemini folder structure (#3424) by ver217
Gemini
- [gemini] accelerate inference (#3641) by Hongxin Liu
- [gemini] state dict supports fp16 (#3590) by Hongxin Liu
- [gemini] support save state dict in shards (#3581) by Hongxin Liu
- [gemini] gemini supports lazy init (#3379) by Hongxin Liu
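
Pairing #3379 with the Gemini plugin defers parameter materialization until `boost()` shards the model. A speculative sketch, assuming `LazyInitContext` lives at `colossalai.lazy` and that `HybridAdam` is the companion optimizer; both import paths may differ in this release:

```python
# Speculative sketch of Gemini + lazy init; import paths are assumptions.
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.lazy import LazyInitContext     # assumed location
from colossalai.nn.optimizer import HybridAdam  # assumed companion optimizer

colossalai.launch_from_torch(config={})

# parameters stay unmaterialized until boost() shards them, so the
# full model never needs to fit on a single device
with LazyInitContext():
    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU())

optimizer = HybridAdam(model.parameters(), lr=1e-3)
booster = Booster(plugin=GeminiPlugin())
model, optimizer, _, _, _ = booster.boost(model, optimizer)
```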
Bot
- [bot] Automated submodule synchronization (#3596) by github-actions[bot]
Misc
- [misc] op_builder/builder.py (#3593) by digger-yu
- [misc] add verbose arg for zero and op builder (#3552) by Hongxin Liu
Coati
- [coati] fix install cmd (#3592) by binmakeswell
- [coati] add custom model support guide (#3579) by Fazzie-Maqianli
- [coati] Fix LlamaCritic (#3475) by gongenlei
FX
- [fx] fix meta tensor registration (#3589) by Hongxin Liu
ChatGPT
- [chatgpt] Detached PPO Training (#3195) by csric
- [chatgpt] add pre-trained model RoBERTa for RLHF stage 2 & 3 (#3223) by Camille Zhong
Lazyinit
- [lazyinit] fix clone and deepcopy (#3553) by Hongxin Liu
Checkpoint
- [checkpoint] make sharded checkpoints compatible with the naming format of HF checkpoint files (#3479) by jiangmingyan
- [checkpoint] support huggingface style sharded checkpoint (#3461) by jiangmingyan
- [checkpoint] refactored the API and added safetensors support (#3427) by Frank Lee
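
These entries route Hugging Face-style sharded checkpoints through the booster's checkpoint I/O. A hedged sketch, assuming the `shard` and `size_per_shard` keyword names match the PRs above:

```python
# Hedged sketch of sharded checkpointing; keyword names may vary.
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

colossalai.launch_from_torch(config={})
booster = Booster(plugin=TorchDDPPlugin())
model, _, _, _, _ = booster.boost(torch.nn.Linear(16, 4))

# shard=True writes HF-style weight shards plus an index file;
# size_per_shard caps the size of each shard file (in MB)
booster.save_model(model, "ckpt_dir", shard=True, size_per_shard=1024)
booster.load_model(model, "ckpt_dir")
```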
Chat Community
- [Chat Community] Update README.md (fixed #3487) (#3506) by NatalieC323
Dreambooth
- Revert "[dreambooth] fixing the incompatibity in requirements.txt (#3190) (#3378)" (#3481) by NatalieC323
- [dreambooth] fixing the incompatibity in requirements.txt (#3190) (#3378) by NatalieC323
Autoparallel
- [autoparallel] integrate auto parallel feature with new tracer (#3408) by YuliangLiu0306
- [autoparallel] adapt autoparallel with new analyzer (#3261) by YuliangLiu0306
Hotfix
- [hotfix] meta tensor compatibility with torch2 by YuliangLiu0306
Full Changelog: v0.2.8...v0.3.0