Equivalent of torch.no_grad? #15897

zw615 · 2023-05-06T01:01:25Z

zw615
May 6, 2023

Hi there! I want to turn off gradient computations when a model forwards some input, which is important in some cases like accumulative gradient implementation when memory is limited. I have searched the issues and discussion panels and find this post: #1937, which talks about jax.lax.stop_gradient. However, I find the code below only disables the gradient flow through the op jax.lax.stop_gradient, but still performs computational graph building/tracing. As a result, the accumulative gradient technique does not save memory at all. I wonder how I can extract features without any gradient operation, just like inference under torch.no_grad?

Thanks a lot!

Note that this accumulative gradient implementation is different from the one commonly used in supervised training like here https://github.com/google-research/big_vision/blob/47ac2fd075fcb66cadc0e39bd959c78a6080070d/big_vision/utils.py#L296. This implementation is useful in contrastive learning like CLIP.


def loss_fn(params, images, labels, accum_zimg=None, accum_ztxt=None, accum_iter_curr=None):
    zimg, ztxt = model.apply(
        {"params": params}, images, labels)

    if accum_zimg is not None and accum_ztxt is not None:
        zimg = jnp.concatenate(accum_zimg[:accum_iter_curr] + [zimg,] + accum_zimg[accum_iter_curr+1:], axis=0)
        ztxt = jnp.concatenate(accum_ztxt[:accum_iter_curr] + [ztxt,] + accum_ztxt[accum_iter_curr+1:], axis=0)

   # compute some loss here
    return loss(zimg, ztxt)

@pmap
def update_fn(params, opt, batch):
      images = batch["image"]
      labels = batch["labels"]

      # implement accumulative iter here
      ######################################
      # How to disable graph buliding here like torch.no_grad?
      if accum_iter > 1:
          subset_images_list = jnp.split(images, accum_iter, axis=0)
          subset_labels_list = jnp.split(labels, accum_iter, axis=0)
          params_nograd = jax.lax.stop_gradient(params)
          accum_zimg = []
          accum_ztxt = []
          for i in range(accum_iter):
              subset_images_nograd = jax.lax.stop_gradient(subset_images_list[i])
              subset_labels_nograd = jax.lax.stop_gradient(subset_labels_list[i])
              subset_zimg_nograd, subset_ztxt_nograd, _ = model.apply(
                  {"params": params_nograd}, subset_images_nograd, subset_labels_nograd)
              subset_zimg_nograd = jax.lax.stop_gradient(subset_zimg_nograd)
              subset_ztxt_nograd = jax.lax.stop_gradient(subset_ztxt_nograd)
              accum_zimg.append(subset_zimg_nograd)
              accum_ztxt.append(subset_ztxt_nograd)
    ######################################

          subset_l_list = []
          subset_grads_list = []
          for i in range(accum_iter):
              subset_images = subset_images_list[i]
              subset_labels = subset_labels_list[i]
              subset_l, subset_grads = jax.value_and_grad(
                  loss_fn)(params, subset_images, subset_labels, accum_zimg=accum_zimg, accum_ztxt=accum_ztxt, accum_iter_curr=i)
              subset_l_list.append(subset_l)
              subset_grads_list.append(subset_grads)

          l = sum(subset_l_list) / accum_iter
          grads = subset_grads_list[0]
          for i in range(1, accum_iter):
              grads = jax.tree_util.tree_map(lambda x, y: x + y, subset_grads_list[i], grads)
     else:
         # do normal forward backward here

      l, grads = jax.lax.pmean((l, grads), axis_name="batch")
      updates, opt = tx.update(grads, opt, params)
      params = optax.apply_updates(params, updates)

      return params, opt

zw615 · 2023-05-06T02:05:08Z

zw615
May 6, 2023
Author

Alternatively, I guess what I am asking is how to release the memory occupied by the graph to compute accum_zimg and accum_ztxt, after I have already obtained their values. I am not sure if it is possible in the pmaped update_fn.

I find in jax, the forward part of predict_fn during evaluation is also model.apply (https://github.com/google-research/big_vision/blob/47ac2fd075fcb66cadc0e39bd959c78a6080070d/big_vision/train.py#L182) , which is really the same as the one in update_fn during training (https://github.com/google-research/big_vision/blob/47ac2fd075fcb66cadc0e39bd959c78a6080070d/big_vision/train.py#L206). So there is no counterparts of torch.no_grad called even during evaluation in jax. Maybe jax will automatically release those intermedaite activation values if all the references to the output are deleted?

0 replies

cgarciae · 2023-05-08T13:02:12Z

cgarciae
May 8, 2023
Collaborator

Hey @zw615, there are 3 options of how to stop gradients in JAX:

Using jax.lax.stop_gradient, as you point out gradients will still be computed but they are just zeroed, this is typically the least performant option but depending on your problem you might have to use it.
Filtering the parameters for jax.grad, this means you only compute gradients w.r.t. a subset of parameters that will get gradients, you have to do this manually.
Using optax.multi_transform and selecting the optax.set_to_zero "optimizer" (its a no-op) for the parameter subset that should not get gradients, jit will detect that the gradients for these parameters are not used and will result in a program as efficient as filtering the parameters.

For more info checkout Flax's Transfer Learning guide.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Equivalent of torch.no_grad? #15897

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Equivalent of torch.no_grad? #15897

zw615 May 6, 2023

Replies: 2 comments

zw615 May 6, 2023 Author

cgarciae May 8, 2023 Collaborator

zw615
May 6, 2023

zw615
May 6, 2023
Author

cgarciae
May 8, 2023
Collaborator