Autodifferentiation through parallelized operators with xmap #14982

cmunna0052 · 2023-03-14T17:44:53Z

cmunna0052
Mar 14, 2023

This question concerns the same basic setup as my previous one #14879, but with a slightly different approach to sharding that gets a bit further before breaking. I am still trying to shard a feed-forward network on MNIST dataset by splitting up the weight matrices into 10 groups of columns. However, now I have defined an xmapped matrix multiplication operation on its own, with the following code:

def init_params(layer_neurons):
    params = {}
    newKey = jax.random.PRNGKey(42)
    i=0
    for n_in, n_out in zip(layer_neurons[:-1],layer_neurons[1:]):
        newKey,subKey = jax.random.split(newKey)
        params['linear'+str(i)] = dict(w = jax.random.normal(key=subKey,shape=(n_in,n_out))*jnp.sqrt(2/n_in),
                           b = jnp.ones(shape = (n_out)))
        i+=1
    return params

device_array = np.array(jax.devices())
params = init_params([784,100,10])

in_axes = ({0:'batch'},{'b':{0:'hidden'},'w':{1:'hidden'}})
axis_resources = {'hidden':'x'}
sharded_matmul = jmaps.xmap(lambda x, params: (x@params['w'] + params['b']), in_axes, 
                            out_axes = ['batch','hidden'], axis_resources = axis_resources)

def forward(params,x):
    for i in range(len(params)-1): 
        x = jax.nn.sigmoid(sharded_matmul(x, params['linear'+str(i)]))
    i = len(params)-1
    x = sharded_matmul(x, params['linear'+str(i)]) 
    return jax.device_put(x,jax.devices()[0])

This code works (though I am not sure why I don't have to recombine the x vector at each step with jax.lax.all_gather, and in fact doing so causes an error).

Now the problem comes at the next step, where I try to backpropagate with the following:

@jax.custom_vjp
def f_pmean(x): return jax.lax.pmean(x, "batch");
def f_pmean_fwd(x): return (f_pmean(x), None);
def f_pmean_bwd(_, g): return (jax.lax.pmean(g, "batch"),);
f_pmean.defvjp(f_pmean_fwd, f_pmean_bwd)

def loss_from_fwd(fwd_outs,y):
    result = jnp.sum(-jnp.log(jax.nn.softmax(fwd_outs,axis=-1))*y)
    return f_pmean(result)  

in_axes = ({0:'batch'},{0:'batch'})
axis_resources = {'batch':'y'}
sharded_loss = jmaps.xmap(loss_from_fwd, in_axes, out_axes = [...], axis_resources = axis_resources)

def loss_fn(params,x,y): return sharded_loss(forward(params,x), y);

def train_batch(params,x_batch,y_batch,lr):
    loss_grads = jax.grad(loss_fn)(params,x_batch,y_batch)
    return jax.tree_map(lambda p, g: p - lr * g, params, loss_grads)
with jax.sharding.Mesh(device_array.reshape(1,-1),('y','x')):
    new_params1 = train_batch(params,inputs,answer,.01)

This correctly calculates the loss but fails in the train_batch portion with the error assert len(arg) == n, f'length mismatch: [6,6,2]'. I added in the custom_vjp because the regular jax.lax.pmean was throwing a similar error, and I assumed it wasn't a differentiable operator anyway. Any ideas on how I should get through this?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Autodifferentiation through parallelized operators with xmap #14982

{{title}}

Replies: 0 comments

Select a reply

Autodifferentiation through parallelized operators with xmap #14982

cmunna0052 Mar 14, 2023

Replies: 0 comments

cmunna0052
Mar 14, 2023