Replies: 1 comment 1 reply
-
What Flax is trying to do with mixed precision is provide very explicit APIs that allow you to control the dtypes used in each part of your code. For larger operations like normalization layers we try to do something sensible in the implementation, where "sensible" means something that is empirically found to be stable in most cases. PyTorch AMP is a more general-purpose transformation: it takes code that doesn't deal with mixed precision types at all and does a best-effort rewrite of the computation so that it uses half-precision types as much as possible. A tool like that could be very valuable in the JAX ecosystem as well, but it has less to do with Flax. In JAX I would imagine the equivalent of AMP to be a functional transformation just like jit or vmap, although it could be provided by a separate library. So you would have something like:
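The code snippet that originally followed is not preserved in this thread. The block below is therefore only a hypothetical sketch of the kind of usage the reply describes: an `amp`-style wrapper (the name and signature are invented for illustration) that composes with `jit` like any other transform. For simplicity it just casts floating-point inputs and outputs rather than doing real per-op autocasting.

```python
import jax
import jax.numpy as jnp

def amp(fn, half_dtype=jnp.bfloat16):
    """Toy stand-in for an AMP-like transform: cast floating-point array
    arguments to half precision before calling `fn`, and cast floating-point
    outputs back to float32. A real transform would instead rewrite the
    traced computation op by op."""
    def _cast(x, dtype):
        if hasattr(x, "dtype") and jnp.issubdtype(x.dtype, jnp.floating):
            return x.astype(dtype)
        return x

    def wrapped(*args, **kwargs):
        args, kwargs = jax.tree_util.tree_map(lambda x: _cast(x, half_dtype), (args, kwargs))
        out = fn(*args, **kwargs)
        return jax.tree_util.tree_map(lambda x: _cast(x, jnp.float32), out)

    return wrapped

# Composes like any other functional transform:
@jax.jit
@amp
def loss_fn(w, x):
    return jnp.mean((x @ w) ** 2)
```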
TL;DR: I am proposing a JAX interpreter for mixed precision training that automatically converts dtypes as appropriate. This topic asks how such a transform fits within the roadmap for mixed precision training in Flax, if it has already been considered, or pitches it otherwise.
Background
Training with reduced precision can lead to problems with certain operations such as normalization layers, where large reduction ops can cause overflows. For that reason, the AMP implementation in PyTorch keeps a whitelist and a blacklist of operations which respectively should and should not be carried out in lower precision, falling back to f32 when necessary. Currently there seems to be no way to do that in Flax: I can set the dtype of module parameters (`param_dtype`) and output types (`dtype`), but, for example, normalization layers will still perform reduction operations at the precision of their inputs.
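For concreteness, a minimal illustration of the two existing knobs mentioned above (the specific layer and dtypes are just an example); they control parameter storage and the computation/output dtype, but, per the point above, do not by themselves force the internal reductions into float32:

```python
import jax.numpy as jnp
import flax.linen as nn

# param_dtype: dtype the parameters are stored in
# dtype: dtype of the layer's computation/output
norm = nn.LayerNorm(dtype=jnp.bfloat16, param_dtype=jnp.float32)
```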
My understanding of the state of AMP in Flax and comparison with PyTorch
PR #1803 is introducing `computation_dtype` (alongside the already existing `param_dtype` and `dtype`), which is a step towards solving this issue. My worry, however, is that keeping track of those parameters and propagating them all the way through our NN definitions is cumbersome and introduces a lot of boilerplate (see the sketch below). It is also not very backwards-friendly: it avoids breakage, but a library implementing, for example, a ResNet50 backbone cannot be used in mixed precision training until it releases a patch that propagates `computation_dtype` from the top-level call. A PyTorch-style solution could be to traverse the tree of nested modules after creation of the root module (ResNet50) and recursively change their `computation_dtype`. This is, however, not very effective, because with the `@linen.compact` idiom many submodules are defined in `Module.__call__` and not stored statefully. This is in contrast to the imperative approach of PyTorch, where all learnable parameters have to be defined in `Module.__init__` and can thus be traversed.
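To illustrate the propagation burden (using only the already existing `dtype`/`param_dtype` arguments, since the exact shape of the `computation_dtype` API from PR #1803 isn't settled here), every intermediate module has to accept and forward the dtype arguments for the leaf layers to ever see them:

```python
from typing import Any

import jax.numpy as jnp
import flax.linen as nn

Dtype = Any

class Block(nn.Module):
    features: int
    dtype: Dtype = jnp.float32
    param_dtype: Dtype = jnp.float32

    @nn.compact
    def __call__(self, x):
        # Every leaf layer needs the dtypes forwarded explicitly.
        x = nn.Dense(self.features, dtype=self.dtype, param_dtype=self.param_dtype)(x)
        return nn.LayerNorm(dtype=self.dtype, param_dtype=self.param_dtype)(x)

class Backbone(nn.Module):
    dtype: Dtype = jnp.float32
    param_dtype: Dtype = jnp.float32

    @nn.compact
    def __call__(self, x):
        # Forgetting to forward the dtypes at any level silently leaves that
        # subtree at its default precision.
        x = Block(128, dtype=self.dtype, param_dtype=self.param_dtype)(x)
        return Block(64, dtype=self.dtype, param_dtype=self.param_dtype)(x)
```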
My suggestion
My suggestion for this problem is to define a context manager/decorator which implements mixed precision via JAX interpreter mechanics. We would have a whitelist of operations which should run in reduced precision and a blacklist of those which shouldn't. Some operations, like reshaping, are type-agnostic and would be in neither list. The interpreter then considers the entire computation graph and introduces casts from/to lower precision where appropriate (a rough sketch follows below). This way the technicalities of precision choices can be abstracted away from the model definition, following the same philosophy as `xmap`: define models at a high level of abstraction and adjust the details post hoc, via JAX transforms. To illustrate the savings in terms of boilerplate, this approach would make all of `computation_dtype`, `param_dtype` and `dtype` obsolete (in the context of automatic mixed precision training, that is).
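To make the idea concrete, here is a rough, hypothetical sketch of such a transform built on a jaxpr walk. The names `autocast`, `WHITELIST` and `BLACKLIST` are invented for illustration, the op lists are deliberately incomplete, and a real implementation would also need to handle higher-order primitives (scan, cond, custom derivatives), pytree inputs and gradient scaling:

```python
import jax
import jax.numpy as jnp

WHITELIST = {"dot_general", "conv_general_dilated"}      # run in half precision
BLACKLIST = {"reduce_sum", "reduce_max", "exp", "log"}   # keep in float32

def _cast(x, dtype):
    # Only touch floating-point values; integers, bools, etc. pass through.
    return x.astype(dtype) if jnp.issubdtype(jnp.result_type(x), jnp.floating) else x

def autocast(fn, half_dtype=jnp.bfloat16):
    def wrapped(*args):
        closed = jax.make_jaxpr(fn)(*args)
        env = {}

        def read(v):
            # Literals carry their value inline; everything else lives in env.
            return v.val if hasattr(v, "val") else env[v]

        for var, val in zip(closed.jaxpr.invars, args):
            env[var] = val
        for var, val in zip(closed.jaxpr.constvars, closed.consts):
            env[var] = val

        for eqn in closed.jaxpr.eqns:
            invals = [read(v) for v in eqn.invars]
            name = eqn.primitive.name
            if name in WHITELIST:
                invals = [_cast(x, half_dtype) for x in invals]
            elif name in BLACKLIST:
                invals = [_cast(x, jnp.float32) for x in invals]
            else:
                # Type-agnostic ops: just keep the floating dtypes consistent.
                floats = [jnp.result_type(x) for x in invals
                          if jnp.issubdtype(jnp.result_type(x), jnp.floating)]
                if floats:
                    invals = [_cast(x, jnp.result_type(*floats)) for x in invals]
            outvals = eqn.primitive.bind(*invals, **eqn.params)
            outvals = outvals if eqn.primitive.multiple_results else [outvals]
            for var, val in zip(eqn.outvars, outvals):
                env[var] = val

        outs = [read(v) for v in closed.jaxpr.outvars]
        return outs[0] if len(outs) == 1 else tuple(outs)
    return wrapped

# Usage: precision is chosen per primitive, not threaded through the model code.
@autocast
def fn(x, w):
    y = x @ w               # dot_general -> bfloat16
    return jnp.sum(y ** 2)  # reduce_sum  -> float32
```

The point of the sketch is only that precision handling lives in a transform rather than in every Module; op coverage, loss scaling and interaction with jit/pmap would of course need proper design.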
Notes
Note 1: I imagine an idea like this has likely been discussed before, so I would be happy to know where it fits within the bigger picture of where Flax is going with mixed precision training. I originally intended this topic to just ask about the state of the work, but I figured I may as well pitch the solution as I see it, to better understand the trade-offs between the different approaches.
Note 2: It's certainly up for discussion whether this kind of machinery belongs in Flax, Optax or elsewhere entirely; please feel free to move it as appropriate.