I've been looking at traces of model-parallel training I've implemented in JAX using pjit and noticed a curious thing: every call to the pjit-ted function hits a function called `cache_miss` and does quite a lot of computation in Python. I'm wondering whether this is expected, or whether I've set something up wrong and JAX is re-doing work it's supposed to do only once. This has practical significance for me because the overhead from these Python-side pjit activities can sometimes be large enough to make the GPU pipeline wait.

The training step function I pjit looks something like this:
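(A simplified sketch of its shape; the real model and loss are much bigger, and the names here are just placeholders:)

```python
import jax
import jax.numpy as jnp

# Placeholder step: the real one runs a much larger model, but the structure
# is the same -- compute the loss, take grads, apply an SGD-style update.
def train_step(params, batch):
    def loss_fn(p):
        preds = batch["inputs"] @ p["w"] + p["b"]
        return jnp.mean((preds - batch["targets"]) ** 2)

    loss, grads = jax.value_and_grad(loss_fn)(params)
    new_params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
    return new_params, loss
```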
Here's how I compile it:
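(Roughly like this; the mesh shape, axis names and partition specs are stand-ins for the real ones, and older JAX versions spell the sharding arguments `in_axis_resources`/`out_axis_resources`:)

```python
import numpy as np
import jax
from jax.experimental.pjit import pjit
from jax.sharding import Mesh, PartitionSpec as P

# Stand-in 2D mesh over all local devices; the real mesh and axis names differ.
devices = np.asarray(jax.devices()).reshape(1, -1)
mesh = Mesh(devices, axis_names=("data", "model"))

p_train_step = pjit(
    train_step,
    # Prefix specs: params replicated, batch sharded along the "data" axis;
    # both outputs (params and the scalar loss) come back replicated.
    in_shardings=(P(), P("data")),
    out_shardings=(P(), P()),
)
```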
On every training step I just call it like this:
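(Just a plain Python loop under the mesh context; `data_iter` stands in for my input pipeline:)

```python
# Every iteration goes back through pjit's Python dispatch, which is where
# `cache_miss` shows up in my trace.
with mesh:
    for batch in data_iter:
        params, loss = p_train_step(params, batch)
```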
Replies: 1 comment 12 replies

Any updates on this? I'm running into a similar issue.