Simplest way to do data-parallel training #4300
Replies: 3 comments 3 replies
-
Hi @kriscao-cohere, I'm curious about this too. Assuming you already checked the test examples, I don't think I see a simple SPMD example there either.
-
Okay, I tried this on a Colab TPU here (a v2-8), and the following seems to work:

from flax import nnx
from functools import partial
import jax
import jax.numpy as jnp

class CNN(nnx.Module):
  """A simple CNN model."""

  def __init__(self, *, rngs: nnx.Rngs):
    self.conv1 = nnx.Conv(1, 32, kernel_size=(3, 3), rngs=rngs)
    self.conv2 = nnx.Conv(32, 64, kernel_size=(3, 3), rngs=rngs)
    self.avg_pool = partial(nnx.avg_pool, window_shape=(2, 2), strides=(2, 2))
    self.linear1 = nnx.Linear(3136, 256, rngs=rngs)
    self.linear2 = nnx.Linear(256, 10, rngs=rngs)

  def __call__(self, x):
    # print(f'In call, {type(x) = }')
    x = self.avg_pool(nnx.relu(self.conv1(x)))
    x = self.avg_pool(nnx.relu(self.conv2(x)))
    x = x.reshape(x.shape[0], -1)
    x = nnx.relu(self.linear1(x))
    x = self.linear2(x)
    return x

model = CNN(rngs=nnx.Rngs(0))
x = jnp.ones((8, 10, 28, 28, 1))  # (devices, per-device batch, H, W, C)

# nnx.pmap's `in_axes` argument accepts a tuple of integers and nnx.StateAxes objects:
# integers are passed along to jax.pmap, while StateAxes designates separate axes for
# different state types. Here `...` (the filter for "everything") maps all state types
# to None, meaning the model state is broadcast (replicated) across devices.
state_axes = nnx.StateAxes({...: None})
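# For example, nnx.StateAxes({nnx.Param: None, ...: 0}) would broadcast the
# parameters while mapping all remaining state over the leading device axis.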
@nnx.pmap(in_axes=(state_axes, 0), out_axes=0, devices=jax.devices())
@nnx.jit
def fwd(model, x):
  y = model(x)
  return y

y = fwd(model, x)
# print(type(x))
# nnx.display(x.shape)
nnx.display(y.shape)  # (8, 10, 10)
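If your data arrives as one flat batch, it only needs a reshape so the leading axis matches the device count before calling fwd. A small sketch (the flat batch of 80 examples here is made up for illustration):

n_devices = jax.local_device_count()
global_batch = jnp.ones((80, 28, 28, 1))          # hypothetical flat batch
per_device = global_batch.shape[0] // n_devices   # assumes it divides evenly
device_batch = global_batch.reshape(n_devices, per_device, 28, 28, 1)
y = fwd(model, device_batch)                      # (n_devices, per_device, 10)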
-
I'm adding a complete data-parallel training example below:

import os
# Make the CPU backend expose 8 devices so the example runs without accelerators.
os.environ['XLA_FLAGS'] = '--xla_force_host_platform_device_count=8'
import jax
import jax.numpy as jnp
import numpy as np
import optax
from flax import nnx
from jax.experimental import mesh_utils
import matplotlib.pyplot as plt
# create a mesh + shardings
num_devices = jax.local_device_count()
mesh = jax.sharding.Mesh(
  mesh_utils.create_device_mesh((num_devices,)), ('data',)
)
model_sharding = jax.NamedSharding(mesh, jax.sharding.PartitionSpec())
data_sharding = jax.NamedSharding(mesh, jax.sharding.PartitionSpec('data'))
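# An empty PartitionSpec() partitions no axes, so state placed with
# model_sharding is fully replicated on every device, while
# PartitionSpec('data') splits the leading (batch) axis across the 'data'
# axis of the mesh.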
# create model
class MLP(nnx.Module):
  def __init__(self, din, dmid, dout, *, rngs: nnx.Rngs):
    self.linear1 = nnx.Linear(din, dmid, rngs=rngs)
    self.linear2 = nnx.Linear(dmid, dout, rngs=rngs)

  def __call__(self, x):
    return self.linear2(nnx.relu(self.linear1(x)))
model = MLP(1, 64, 1, rngs=nnx.Rngs(0))
optimizer = nnx.Optimizer(model, optax.adamw(1e-2))
# replicate state
state = nnx.state((model, optimizer))
state = jax.device_put(state, model_sharding)
nnx.update((model, optimizer), state)
# visualize model sharding
print('model sharding')
jax.debug.visualize_array_sharding(model.linear1.kernel.value)
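# With a fully replicated kernel, the visualization shows a single tile
# spanning all devices rather than one tile per device.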
@nnx.jit
def train_step(model: MLP, optimizer: nnx.Optimizer, x, y):
  def loss_fn(model: MLP):
    y_pred = model(x)
    return jnp.mean((y - y_pred) ** 2)

  loss, grads = nnx.value_and_grad(loss_fn)(model)
  optimizer.update(grads)
  return loss
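# Note: with the batch sharded and the parameters replicated, the compiler
# inserts the cross-device gradient all-reduce automatically when the
# jitted train_step is compiled (nnx.jit wraps jax.jit); unlike jax.pmap,
# no explicit jax.lax.pmean is needed.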
def dataset(steps, batch_size):
  for _ in range(steps):
    x = np.random.uniform(-2, 2, size=(batch_size, 1))
    y = 0.8 * x**2 + 0.1 + np.random.normal(0, 0.1, size=x.shape)
    yield x, y
for step, (x, y) in enumerate(dataset(1000, 16)):
  # shard data
  x, y = jax.device_put((x, y), data_sharding)
  # train
  loss = train_step(model, optimizer, x, y)

  if step == 0:
    print('data sharding')
    jax.debug.visualize_array_sharding(x)

  if step % 100 == 0:
    print(f'step={step}, loss={loss}')
# dereplicate state
state = nnx.state((model, optimizer))
state = jax.device_get(state)
nnx.update((model, optimizer), state)
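# After device_get the state lives on the host as regular arrays, so the
# model can be called outside of jit for the plotting below.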
X, Y = next(dataset(1, 1000))
x_range = np.linspace(X.min(), X.max(), 100)[:, None]
y_pred = model(x_range)
# plot
plt.scatter(X, Y, label='data')
plt.plot(x_range, y_pred, color='black', label='model')
plt.legend()
plt.show()
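For evaluation you can reuse the same data sharding with another nnx.jit-ed function. A minimal sketch (eval_step and the 1024-example batch are illustrative, not part of the example above, and it fits most naturally before the state is pulled back to the host):

@nnx.jit
def eval_step(model: MLP, x, y):
  y_pred = model(x)
  return jnp.mean((y - y_pred) ** 2)

x_eval, y_eval = next(dataset(1, 1024))
x_eval, y_eval = jax.device_put((x_eval, y_eval), data_sharding)
print('eval loss:', eval_step(model, x_eval, y_eval))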
-
I'm a long-time JAX user, and recently I've started trying to get used to NNX. What is currently the simplest way to do data-parallel distributed training (no model sharding)? In the past I could just replicate all my model parameters to all devices with jax.device_put_replicated and use jax.pmap to automatically take care of the outer device dimension. However, I'm unsure of how to use nnx.jit to do the same thing. Any pointers would be gratefully received, as I've read https://flax.readthedocs.io/en/latest/guides/flax_gspmd.html and am still none the wiser. Naively attempting to replicate all my model params and an input batch to every device (with a jax.tree_map on the output of nnx.split) fails, as there is a dimension mismatch with my embedding module.

EDIT: in particular, it's very inconvenient to deal with an explicit leading device dimension in all of the tensors in my modules, as some modules (such as the jax_flash_attention package) expect tensors of a particular shape. What I really want is just to pmap this over the leading device dimension, but I don't know whether such a thing is possible inside an NNX module.
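For reference, the jax.device_put_replicated + jax.pmap workflow mentioned here looks roughly like this. This is a sketch with toy parameters and a made-up loss, not code from this thread:

from functools import partial
import jax
import jax.numpy as jnp

# Toy replicated parameters: device_put_replicated stacks one copy per device,
# adding the leading device axis that pmap maps over.
params = {'w': jnp.ones((4, 4)), 'b': jnp.zeros((4,))}
params = jax.device_put_replicated(params, jax.local_devices())

def loss_fn(p, x):
  return jnp.mean((x @ p['w'] + p['b']) ** 2)

@partial(jax.pmap, axis_name='batch')
def train_step(p, x):
  loss, grads = jax.value_and_grad(loss_fn)(p, x)
  grads = jax.lax.pmean(grads, axis_name='batch')  # average grads across devices
  p = jax.tree_util.tree_map(lambda a, g: a - 1e-2 * g, p, grads)
  return p, loss

n = jax.local_device_count()
x = jnp.ones((n, 32, 4))  # per-device batches stacked on the device axis
params, loss = train_step(params, x)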