
Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

NEXT - TBD

Fixed

  • FSDP: work around an AMP autocast cache issue with the clear_autocast_cache flag (see the sketch after this list)
  • setup.py: hide CUDA extensions behind BUILD_CUDA_EXTENSIONS envvar
  • SDP: re-expose the module property (#647)
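
A minimal sketch of opting into the workaround above. It assumes FSDP exposes a clear_autocast_cache constructor flag as described and that distributed training is already set up; treat the keyword names as assumptions rather than a definitive reference.

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group(...) has already been called
# and that a GPU is available.
model = torch.nn.Linear(128, 128).cuda()

# clear_autocast_cache asks FSDP to flush the AMP autocast weight cache
# between steps, trading a little speed for lower peak memory.
model = FSDP(model, mixed_precision=True, clear_autocast_cache=True)
```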

Added

  • FSDP: better memory usage for reduce bucket (#633)

[0.3.6] - 2021-04-26

Added

  • FSDP: Consolidate cpu_adam optimizer state dict (#607)

Fixed

  • FSDP: handle models with multiple forward passes and checkpointing (#621)
  • FSDP & SDP: check before calling _specify_ddp_gpu_num (#626)
  • FSDP: relax checking root condition (#620)
  • SDP: removing an assert that is not always accurate (#625)
  • FSDP: changing FSDP init to bypass process group (pg) validation (#619)
  • OSS: bring test coverage to 100% (#618)

[0.3.5] - 2021-04-19

Added

  • [offload] Add API, tutorial, and smaller docstring changes (#576)
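
A minimal sketch of the offload API referenced above. The constructor arguments (model, device, offload_device, num_slices) are assumptions based on the tutorial; consult that tutorial for authoritative usage.

```python
import torch
from fairscale.experimental.nn.offload import OffloadModel

# A sequential model, so it can be split into slices that are streamed to the
# GPU one at a time while the rest of the parameters stay on the CPU.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)

offload_model = OffloadModel(
    model=model,
    device=torch.device("cuda"),          # where compute happens
    offload_device=torch.device("cpu"),   # where parameters are parked
    num_slices=3,                         # how many shards to stream in
)
```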

Fixed

  • FSDP: fixing training with frozen weights (#614)
  • SDP: privatizing all the things (#611)
  • FSDP: Make _get_default_cuda_device more robust to modules without params (#606)
  • OffloadModel: Add back the previous codepath for using OffloadModel without activation checkpointing (#608)

[0.3.4] - 2021-04-13

Added

  • FSDP: Add no broadcast optim state option (#560)

Fixed

  • ShardedDDP: Properly handle .eval() mode (#587)
  • ShardedDDP: Handle model being moved back to CPU prior to state consolidation (#573)
  • FSDP: much faster state consolidation (#595)
  • FSDP: Add gradient pre-dedivide to prevent overflow with large world sizes (#565)
  • Offload: (experimental) Fix activation offloading to CPU (#588)

[0.3.3] - 2021-04-01

Added

  • FSDP: changed the auto_wrap_bn utility function so that a single FSDP group is optional (#556)
  • FSDP: optimizer state load/save (#537)
  • FSDP: fix weight init when using apply() (#543)
  • Multiprocess Pipe: retired old implementation
  • Experimental: xpipe

Fixed

  • ShardedDDP deferred init (#558)

[0.3.2] - 2021-03-18

Added

  • Experimental: Add spectrain support (#372)
  • FSDP: enabled PyTorch SyncBN (no longer asserting) (#527)
  • FSDP: added auto_wrap_bn utility function (#531)

Fixed

  • OSS: fix a compatibility problem with Lightning w.r.t. the optimizer state dict (#510)
  • FSDP: fixed a bug when part of the autograd graph is traversed multiple times in mixed precision mode (#513)

[0.3.1] - 2021-03-09

Added

  • FSDP docs (#455)
  • enable_wrap and auto_wrap APIs (#446) (usage sketched after this list)
  • Added the experimental.nn.OffloadModel API for training large models on a single GPU (#432)
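
A minimal sketch of the enable_wrap / auto_wrap pattern, assuming FSDP as the wrapper class. The import paths and keyword names follow the public fairscale docs and may differ slightly in this release; a process group is assumed to be initialized already.

```python
import torch
from fairscale.nn import auto_wrap, enable_wrap
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group(...) has already been called.
big_model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
)

# Inside the context, auto_wrap recursively wraps eligible submodules with the
# configured wrapper class; the root module is wrapped explicitly afterwards.
with enable_wrap(wrapper_cls=FSDP):
    big_model = auto_wrap(big_model)
big_model = FSDP(big_model)
```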

Fixed

  • OSS: fix a broken state dict when using non-contiguous param groups
  • Several SDP fixes around performance and corner cases
  • Many FSDP fixes
  • AdaScale & SDP/FSDP tests added, but not officially supported

[0.3.0] - 2021-02-22

Added

  • FullyShardedDataParallel (FSDP) (#413) (usage sketched after this list)
  • ShardedDDP fp16 grad reduction option (#402)
  • Expose experimental algorithms within the pip package (#410)
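
A minimal sketch of wrapping a model with the new FullyShardedDataParallel, assuming torch.distributed is already initialized; it is not a substitute for the FSDP documentation.

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group(...) has already been called.
model = FSDP(torch.nn.Linear(1024, 1024).cuda())

# Build the optimizer after wrapping, so it sees the sharded parameters.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```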

Fixed

  • Catch corner case when the model is too small with respect to the world size, and shards are empty (#406)
  • Memory leak in checkpoint_wrapper (#412)

[0.1.7] - 2021-02-19

Fixed

  • ShardedDDP and OSS handle model trainability changes during training (#369)
  • ShardedDDP state dict load/save bug (#386)
  • ShardedDDP handle train/eval modes (#393)
  • AdaScale handling custom scaling factors (#401)

Added

  • ShardedDDP manual reduce option for checkpointing (#389)

[0.1.6] - 2021-02-10

Added

  • Checkpointing model wrapper (#376) (usage sketched after this list)
  • Faster OSS, flatbuffers (#371)
  • Small speedup in OSS clipgradnorm (#363)
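
A minimal sketch of the checkpointing wrapper added above; the import path is an assumption based on the public fairscale API and may not match this exact release.

```python
import torch
from fairscale.nn import checkpoint_wrapper

block = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())

# Activations inside the wrapped block are recomputed during the backward
# pass instead of being kept in memory for the whole forward pass.
block = checkpoint_wrapper(block)
```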

Fixed

  • Bug in ShardedDDP with 0.1.5 depending on the init (KeyError / OSS)
  • Much refactoring in Pipe (#357, #358, #360, #362, #370, #373)
  • Better pip integration / resident pytorch (#375)

[0.1.5] - 2021-02-03

Added

  • PyTorch compatibility for OSS checkpoints (#310)
  • Elastic checkpoints for OSS: the world size can vary between saves and loads (#310)
  • Tensor views for OSS bucketing, reduced CPU use (#300)
  • Bucket calls in ShardedDDP for faster inter-node communication (#327) (usage sketched after this list)
  • FlattenParamWrapper, which flattens module parameters into a single tensor seamlessly (#317)
  • AMPnet experimental support (#304)
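
A minimal sketch of the OSS + ShardedDDP pairing these entries refer to, assuming a torch.distributed process group is already initialized; argument names follow the public API and are not specific to this release.

```python
import torch
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP
from fairscale.optim.oss import OSS

model = torch.nn.Linear(1024, 1024).cuda()

# OSS shards optimizer state across ranks; its state_dict()/load_state_dict()
# are the entry points for the elastic / PyTorch-compatible checkpoints above.
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.01)

# ShardedDDP reduces each gradient directly to the rank that owns it,
# using the bucketed calls mentioned above.
model = ShardedDDP(model, optimizer)
```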

Fixed

  • ShardedDDP properly handles device changes via .to() (#353)
  • Add a new interface for AdaScale, AdaScaleWrapper, which makes it compatible with OSS (#347)

[0.1.4] - 2021-01-07

Fixed

  • Missing cu files in the pip package

[0.1.3] - 2021-01-04

Fixed

  • Release numbering within Python and from PyPI

[0.1.2] - 2021-01-04

Added

  • AdaScale:
      • Added gradient accumulation feature (#202) (see the sketch after this list)
      • Added support for torch.lr_scheduler (#229)
      • Added support for add_param_groups (#266)
      • Added support for scale != world_size (#266)
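
A minimal sketch of AdaScale with gradient accumulation. The num_gradients_to_accumulate keyword and the single-process setup are assumptions; AdaScale is normally driven by a distributed data-parallel run.

```python
import torch
import torch.nn.functional as F
from fairscale.optim import AdaScale

ACCUM_STEPS = 4
model = torch.nn.Linear(128, 10)
optimizer = AdaScale(
    torch.optim.SGD(model.parameters(), lr=0.1),
    num_gradients_to_accumulate=ACCUM_STEPS,  # assumed keyword for the #202 feature
)

# Tiny synthetic dataset so the sketch is self-contained.
batches = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(8)]

for step, (x, y) in enumerate(batches):
    F.cross_entropy(model(x), y).backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()       # AdaScale adjusts the effective LR via its gain
        optimizer.zero_grad()
```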

Fixed

  • AdaScale: smoothing factor value fixed when using gradient accumulation (#235)
  • Pipe: documentation on balancing functions (#243)
  • ShardedDDP: handle typical NLP models
  • ShardedDDP: better partitioning when finetuning

[0.1.1] - 2020-12-01

Fixed

  • make sure pip package includes header files (#221)

[0.1.0] - 2020-12-01

Added

  • ShardedDataParallel with autoreduce (#157)
  • CPU support for Pipe (#188)
  • ShardedOptim: Distributed Grad Scaler (for torch AMP) (#182)
  • OSS-aware clip grads, bridge sharded states (#167)
  • oss: add rank_local_state_dict staticmethod (#174)
  • support for PyTorch 1.7.0 (#171)
  • Add implementation of AdaScale (#139)

Fixed

  • pip package install (#196, #200)

[0.0.3] - 2020-10-14

Added

  • multi-process pipe

Fixed

  • multiple OSS fixes
  • MegaTron+OSS DDP fix

[0.0.2] - 2020-08-28

Added

  • add a DDP that works with OSS, using reduce() instead of all_reduce() (#19)
  • support for PyTorch v1.6
  • add mixed precision Adam (#40)
  • Adam optimizer state scaling (#44)

Fixed

  • properly restore a sharded optim state (#39)
  • OSS restore state to proper device (#46)
  • optim/oss: support optimizers with additional step kwargs (#53)
  • optim/oss: fix state cast (#56)
  • fix eval for oss_ddp (#55)
  • optim/oss: work correctly with LRScheduler (#58)

[0.0.1] - 2020-07-31

  • Initial release.