Flax NNX GSPMD guide #4220
Conversation
docs_nnx/guides/flax_gspmd.md (Outdated)

> ## Flax and `jax.jit` scaled up
Maybe we should change the writing here to talk about `nnx.jit`?
I kinda want to convey the idea that essentially we are using JAX's compilation machinery for the scaling-up work. I renamed the title and added another paragraph explaining this (and mentioning `nnx.jit` there).
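For instance, a minimal sketch of an `nnx.jit` training step (illustrative only, not the guide's exact code; the squared-error loss and the `optimizer` argument are assumptions) shows that the same compilation path as `jax.jit` is used, while the module and optimizer stay stateful:

```python
import jax.numpy as jnp
from flax import nnx

# Hedged sketch: nnx.jit drives the same XLA compilation as jax.jit, but it
# understands stateful NNX objects, so the model and optimizer can be passed
# in and updated in place. `optimizer` is assumed to be an nnx.Optimizer,
# e.g. optimizer = nnx.Optimizer(model, optax.adamw(1e-3)).
@nnx.jit
def train_step(model, optimizer, x, y):
  def loss_fn(model):
    y_pred = model(x)
    return jnp.mean((y_pred - y) ** 2)  # illustrative squared-error loss

  loss, grads = nnx.value_and_grad(loss_fn)(model)
  optimizer.update(grads)  # in-place state update, no pytree plumbing
  return loss
```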
docs_nnx/guides/flax_gspmd.md (Outdated)

```python
self.w2 = nnx.Param(
  nnx.with_partitioning(init_fn, ('model', None))(
    rngs.params(), (depth, depth))  # RNG key and shape for W2 creation
)
```
This is a good opportunity to show how to manually add the `sharding` metadata:

```diff
-self.w2 = nnx.Param(
-  nnx.with_partitioning(init_fn, ('model', None))(
-    rngs.params(), (depth, depth))  # RNG key and shape for W2 creation
-)
+self.w2 = nnx.Param(
+  init_fn(rngs.params(), (depth, depth)),  # RNG key and shape for W2 creation
+  sharding=('model', None),
+)
```
Good idea!
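For context, here is a small self-contained sketch of the two equivalent annotation styles discussed above; `depth`, `init_fn`, and the RNG setup are illustrative assumptions, not the guide's exact values:

```python
from flax import nnx

depth = 8
rngs = nnx.Rngs(0)
init_fn = nnx.initializers.lecun_normal()

# Style 1: wrap the initializer so the sharding metadata rides along with it.
w2_wrapped = nnx.Param(
  nnx.with_partitioning(init_fn, ('model', None))(
    rngs.params(), (depth, depth))
)

# Style 2: create the value directly and pass the annotation to nnx.Param.
w2_direct = nnx.Param(
  init_fn(rngs.params(), (depth, depth)),
  sharding=('model', None),
)
```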
```python
# In data parallelism, input / intermediate value's first dimension (batch)
# will be sharded on `data` axis
y = jax.lax.with_sharding_constraint(y, PartitionSpec('data', 'model'))
z = jnp.dot(y, self.w2.value)
```
Variables can be used as JAX arrays thanks to the `__jax_array__` protocol.

```diff
-z = jnp.dot(y, self.w2.value)
+z = jnp.dot(y, self.w2)
```
For some reason this will fail later when I do:

```python
with mesh:
  output = sharded_model(input)
```

with the error `AttributeError: 'tuple' object has no attribute '_device_assignment'`.

I'll keep this as-is for now.
docs_nnx/guides/flax_gspmd.md (Outdated)

```python
print(unsharded_model.w2.value.sharding)  # SingleDeviceSharding
```

> We should leverage JAX's compilation mechanism, aka. `jax.jit`, to create the sharded model. The key is to initialize a model and assign shardings to the model state within a jitted function:
```diff
-We should leverage JAX's compilation mechanism, aka. `jax.jit`, to create the sharded model. The key is to initialize a model and assign shardings to the model state within a jitted function:
+We should leverage JAX's compilation mechanism, via `nnx.jit`, to create the sharded model. The key is to initialize a model and assign shardings to the model state within a jitted function:
```
Done.
docs_nnx/guides/flax_gspmd.md (Outdated)

> 1. Call [`jax.lax.with_sharding_constraint`](https://jax.readthedocs.io/en/latest/_autosummary/jax.lax.with_sharding_constraint.html) to bind the model state with the sharding annotations. This API tells the top-level `jax.jit` how to shard a variable!
>
> 1. Throw away the unsharded state and return the model based upon the sharded state.
>
> 1. Compile the whole function with `nnx.jit` instead of `jax.jit` because it allows the output to be a stateful NNX module.
>
> 1. Run it under a device mesh context so that JAX knows which devices to shard it to.
Suggestion: replaced `jax.jit` with `nnx.jit` in the other points and removed the point where you suggest using `nnx.jit` instead of `jax.jit`.

```diff
-1. Call [`jax.lax.with_sharding_constraint`](https://jax.readthedocs.io/en/latest/_autosummary/jax.lax.with_sharding_constraint.html) to bind the model state with the sharding annotations. This API tells the top-level `jax.jit` how to shard a variable!
-1. Throw away the unsharded state and return the model based upon the sharded state.
-1. Compile the whole function with `nnx.jit` instead of `jax.jit` because it allows the output to be a stateful NNX module.
-1. Run it under a device mesh context so that JAX knows which devices to shard it to.
+1. Call [`jax.lax.with_sharding_constraint`](https://jax.readthedocs.io/en/latest/_autosummary/jax.lax.with_sharding_constraint.html) to bind the model state with the sharding annotations. This API tells the top-level `nnx.jit` how to shard a variable!
+1. Throw away the unsharded state and return the model based upon the sharded state.
+1. Run it under a device mesh context so that JAX knows which devices to shard it to.
```
Hmm... I think we should still briefly explain why using `nnx.jit` is a better pattern. Especially since we are making transforms closer to JAX style now, we should assume some users have experience with `jax.jit`. I can remove the mentions of `jax.jit` here and direct users more explicitly to `nnx.jit`.
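Putting the listed steps together, a hedged sketch of this initialization pattern might look like the following; the module definition, sizes, and the 2x4 mesh are illustrative assumptions rather than the guide's exact code:

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh
from flax import nnx

# Assumes 8 devices, e.g. run with XLA_FLAGS=--xla_force_host_platform_device_count=8.
mesh = Mesh(np.array(jax.devices()).reshape(2, 4), ('data', 'model'))

class DotReluDot(nnx.Module):  # tiny stand-in module, annotated with shardings
  def __init__(self, depth: int, rngs: nnx.Rngs):
    init_fn = nnx.initializers.lecun_normal()
    self.w1 = nnx.Param(
      nnx.with_partitioning(init_fn, (None, 'model'))(rngs.params(), (depth, depth)))
    self.w2 = nnx.Param(
      init_fn(rngs.params(), (depth, depth)), sharding=('model', None))

  def __call__(self, x):
    y = nnx.relu(jnp.dot(x, self.w1.value))
    return jnp.dot(y, self.w2.value)

@nnx.jit
def create_sharded_model():
  model = DotReluDot(1024, rngs=nnx.Rngs(0))        # state is unsharded here
  state = nnx.state(model)                          # pull out the model state
  pspecs = nnx.get_partition_spec(state)            # read the sharding annotations
  sharded_state = jax.lax.with_sharding_constraint(state, pspecs)
  nnx.update(model, sharded_state)                  # swap in the sharded state
  return model                                      # nnx.jit can return a stateful module

with mesh:
  sharded_model = create_sharded_model()
```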
docs_nnx/guides/flax_gspmd.md (Outdated)

> Now, from initialization or from checkpoint, we have a sharded model. To carry out the compiled, scaled up training, we need to shard the inputs as well. In this data parallelism example, the training data has its batch dimension sharded across the `data` device axis, so you should put your data in sharding `('data', None)`. You can use `jax.device_put` for this.
>
> Note that with the correct sharding for all inputs, the output will be sharded in the most natural way even without `jax.jit`. See the example below - even without `jax.lax.with_sharding_constraint` on the output `y`, it was still sharded as `('data', None)`.
```diff
-Note that with the correct sharding for all inputs, the output will be sharded in the most natural way even without `jax.jit`. See the example below - even without `jax.lax.with_sharding_constraint` on the output `y`, it was still sharded as `('data', None)`.
+Note that with the correct sharding for all inputs, the output will be sharded in the most natural way even without `nnx.jit`. See the example below - even without `jax.lax.with_sharding_constraint` on the output `y`, it was still sharded as `('data', None)`.
```
Done.
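To make the input-sharding step concrete, here is a sketch using `jax.device_put`; the mesh, batch size, and feature width are illustrative assumptions:

```python
import jax
import jax.numpy as jnp
from jax.sharding import NamedSharding, PartitionSpec

# Shard the batch (first) dimension across the `data` mesh axis and replicate
# the feature dimension. `mesh` is the ('data', 'model') mesh from the sketch above.
data_sharding = NamedSharding(mesh, PartitionSpec('data', None))

input = jax.device_put(jnp.ones((32, 1024)), data_sharding)
label = jax.device_put(jnp.ones((32, 1024)), data_sharding)
jax.debug.visualize_array_sharding(input)  # confirm the batch dim is split across `data`
```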
```python
new_state = block_all(train_step(sharded_model, optimizer, input, label))
```

> ## Logical axis annotation
Nice! I didn't know about `sharding_rules`. In `nnx_lm1b` we have this other pattern which maps the mesh axes in the constructor:

- `flax/examples/lm1b_nnx/configs/default.py`, line 22 in e3772b2: `class MeshRules:`
- `flax/examples/lm1b_nnx/models.py`, line 214 in e3772b2: `config.axis_rules('mlp'),`
Yeah, I added them recently to align with Linen's `LogicallyPartitioned`. It's just annotations, so there's actually a ton of ways to make them work, and I like how you made it in `nnx_lm1b`!
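For comparison, here is a hedged sketch of that constructor-mapping pattern; the field names and defaults are illustrative, not the exact ones from the `lm1b_nnx` example:

```python
import dataclasses
from typing import Optional

@dataclasses.dataclass
class MeshRules:
  # Map logical axis names to physical mesh axis names (None = replicated).
  embed: Optional[str] = None
  mlp: Optional[str] = 'model'
  data: Optional[str] = 'data'

  def __call__(self, *keys: str) -> tuple:
    # Look up each logical name and return the corresponding mesh axes.
    return tuple(getattr(self, key) for key in keys)

axis_rules = MeshRules()
# In a module constructor, the mapped axes feed the sharding annotation, e.g.:
#   nnx.with_partitioning(init_fn, axis_rules('embed', 'mlp'))
print(axis_rules('embed', 'mlp'))  # (None, 'model')
```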
Add a guide to do GSPMD-style sharding annotation on NNX models.
Covered everything in the Linen pjit guide, but with better explanations, better demonstrations, and more concise code!
Also added a small example for loading sharded model from checkpoint.