Replies: 1 comment
-
I guess what I'm really asking is what's the canonical way to create an array that is replicated across multiple processes. It turns out I can actually populate such an array with different values in different processes successfully. Consider the following code:
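Something along these lines (a minimal sketch rather than the exact snippet; it assumes 4 single-device processes that have already been launched for `jax.distributed.initialize()`, and uses the process index as the per-process value):

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

jax.distributed.initialize()  # assumes coordinator/process info comes from the launcher

# One mesh axis over all devices; an empty PartitionSpec means "fully replicated".
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
replicated = NamedSharding(mesh, P())

# Each process fills its local copy with its own process index.
local_value = np.full((8,), jax.process_index(), dtype=np.float32)
local_arrays = [jax.device_put(local_value, d) for d in mesh.local_devices]

arr = jax.make_array_from_single_device_arrays((8,), replicated, local_arrays)

print(f"process {jax.process_index()}: "
      f"fully_replicated={arr.sharding.is_fully_replicated}, "
      f"local values={arr.addressable_shards[0].data}")
```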
This code outputs the following:
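With the sketch above, each process reports a fully replicated sharding while holding its own values, roughly:

```
process 0: fully_replicated=True, local values=[0. 0. 0. 0. 0. 0. 0. 0.]
process 1: fully_replicated=True, local values=[1. 1. 1. 1. 1. 1. 1. 1.]
process 2: fully_replicated=True, local values=[2. 2. 2. 2. 2. 2. 2. 2.]
process 3: fully_replicated=True, local values=[3. 3. 3. 3. 3. 3. 3. 3.]
```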
So even though the sharding describes the array as fully replicated, each process can end up holding different values.
-
Let's say I have a distributed training setup that combines pjit-based model and data parallelism and spans multiple processes. I'm struggling to figure out the recommended way to load a batch of data in this setup.
For simplicity, let's assume that I have 4 devices, each managed by a separate process, and that my batch is sharded into 2 chunks, so that processes 0 & 1 should process chunk #0 and processes 2 & 3 should process chunk #1.
How should I load the data in this case? Should each process just load the data it's going to need (meaning that 0 and 1 will each load chunk #0, while 2 and 3 will load chunk #1) and then call `make_array_from_single_device_arrays`? I assume not, because that goes against the whole idea of the batch being replicated over some devices: what if processes 0 and 1 load different data into it? Perhaps I should pick a single process within each replica to load the data? Or load it all from a single process and then replicate it across the whole mesh? If so, I'm not quite sure how to achieve that correctly. For example, can I call `make_array_from_single_device_arrays` on a subset of the mesh devices without causing synchronization locks? Any suggestions or pointers to code that does something similar would be much appreciated!
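To make the first option concrete, here is a rough sketch of per-process loading (assumptions: 4 single-device processes, a (2, 2) mesh with axes ("data", "model"), a global batch of shape (8, 128) sharded over "data" and replicated over "model", device order following process indices, and `np.full` standing in for the actual loader):

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

jax.distributed.initialize()

# Assumed mesh layout: rows are the "data" axis, columns the "model" axis,
# with jax.devices() ordered by process index (so row 0 = processes 0 & 1).
devices = np.array(jax.devices()).reshape(2, 2)
mesh = Mesh(devices, axis_names=("data", "model"))
batch_sharding = NamedSharding(mesh, P("data"))  # shard dim 0 over "data", replicate over "model"

global_shape = (8, 128)
local_device = jax.local_devices()[0]

# Ask the sharding which slice of the global batch this device should hold,
# instead of hard-coding the process -> chunk mapping.
index = batch_sharding.addressable_devices_indices_map(global_shape)[local_device]
row_ids = np.arange(global_shape[0])[index[0]]        # global row indices for this device
chunk_id = int(row_ids[0]) // (global_shape[0] // 2)  # 0 for processes 0 & 1, 1 for 2 & 3

# Load only those rows; np.full is a placeholder for the real data loader.
local_chunk = np.full((len(row_ids), global_shape[1]), chunk_id, dtype=np.float32)

batch = jax.make_array_from_single_device_arrays(
    global_shape, batch_sharding, [jax.device_put(local_chunk, local_device)]
)

print(f"process {jax.process_index()}: chunk {chunk_id}, "
      f"local shard {batch.addressable_shards[0].data.shape}, global {batch.shape}")
```

`jax.make_array_from_callback` wraps the same pattern: it calls a user callback with exactly these per-device indices and assembles the global array for you.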