Replies: 7 comments 14 replies
-
I was looking at this closely; it would be very helpful to see the discussion or guidelines on how you are doing the Linen API design spec. If you could share that with us, that would be awesome! Please and thank you!
-
Yes, we are about to suggest that people play around with the Linen abstraction as an alpha release this week. We'll also post some of our design goals, perhaps as a new GitHub discussion. For now, to answer your question, you can take a look at some of our ported examples at https://github.com/google/flax/tree/master/linen_examples. The VAE example most clearly shows how to deal with module instances and methods on them.
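For readers landing here later, a minimal sketch of what defining, initializing, and applying a Linen module looks like. This is my own toy example (the `MLP` module is made up, not taken from the linked examples) and assumes the current Linen API:

```python
import jax
import jax.numpy as jnp
from flax import linen as nn

class MLP(nn.Module):  # hypothetical toy module, not from the linked examples
    features: int

    @nn.compact
    def __call__(self, x):
        x = nn.Dense(64)(x)
        x = nn.relu(x)
        return nn.Dense(self.features)(x)

model = MLP(features=10)                           # module instances are plain dataclasses
x = jnp.ones((4, 3))
variables = model.init(jax.random.PRNGKey(0), x)   # returns the parameter pytree
y = model.apply(variables, x)                      # parameters are passed back explicitly
```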
-
@avital I tried the example, but I see that you never actually get hold of a reference to a module instance. Say you have a way to get a pretrained model:

```python
model, pretrained_params = load_pretrained()
```

and you want to create a new model to e.g. perform transfer learning:

```python
class AwesomeClassifier(nn.Module):
    @nn.compact
    def __call__(self, x):
        x = self.somehow_use_pretrained_model(x)
        logits = nn.Dense(10)(x)
        return logits
```

How would you construct this module, and how are the pretrained parameters passed in?
-
@avital

```python
import numpy as np
from jax import random
from jax.config import config
config.enable_omnistaging()

from flax import linen as nn


class Child(nn.Module):
    @nn.compact
    def __call__(self, x):
        x = nn.Dense(32)(x)
        x = nn.Dense(10)(x)
        return x


class Parent(nn.Module):
    child: nn.Module

    @nn.compact
    def __call__(self, x):
        x = self.child()(x)
        x = nn.Dense(2)(x)
        return x


def load_pretrained(x):
    key = random.PRNGKey(42)
    child_params = Child().init(key, x)["param"]
    return Child, child_params


x = np.random.uniform(size=(64, 3))
model, pretrained_params = load_pretrained(x)

key = random.PRNGKey(42)
params = Parent(model).init(key, x)["param"]

# Manually add pretrained <== HERE
params["Child_0"] = pretrained_params  # FrozenDict forbids this but you get the idea

y = Parent(model).apply({"param": params}, x)
```

To transfer the pretrained parameters from the standalone `Child` into the `Parent`, I have to manually patch the parameter dict as shown at the `<== HERE` comment above.
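A small follow-up sketch (my own, not from the thread): one way around the FrozenDict restriction is to unfreeze, patch, and re-freeze, assuming `flax.core.freeze`/`unfreeze` and the same `"param"` collection name used in the snippet above:

```python
from flax.core import freeze, unfreeze

# Copy the frozen tree into a mutable dict, patch in the pretrained subtree, re-freeze.
params = unfreeze(Parent(model).init(key, x)["param"])
params["Child_0"] = pretrained_params
params = freeze(params)

y = Parent(model).apply({"param": params}, x)
```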
-
I also think this is very nice, and all frameworks are "forced" into this good habit by JAX. The thing is that they can also be wrongly modified so that they no longer agree with the computation, which is the issue we are seeing here.
I mean yes, BUT modern Python, especially with the help of the
Yeah, Elegy uses the equivalent of

```python
model = nn.Linear(10)
model.init(...)(...)
model.w, model.b  # <== model tracks its own weights so you can pass it around
```

Although you can also get them as a dict, which is needed for actual training:

```python
model.get_parameters()  # {"w": ..., "b": ...}
```

The thing is that if I pass the module to a parent:

```python
parent = Parent(model)
parent.init(...)(...)
parent.get_parameters()  # {"child": {"w": ..., "b": ...}, ...}
```

Elegy is intentionally stateful: if you train the parent, the child's parameters are updated as well, since the child tracks them by reference.
Nice! Will definitely check it out when it's available.
-
@avital This is a bit more meta, but I've been intrigued by the ideas proposed in parallax. I like:

```python
def loss(model, x, y):
    preds = model(x)
    return jnp.mean(jnp.square(y - preds))

grads = jax.grad(loss)(model, x, y)
model = jax.tree_multimap(lambda p, g: p - 0.01 * g, model, grads)
```

Things I don't like:

Anyway, it would be interesting if the JAX ecosystem could join forces to create a standard.
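For comparison, the same functional update pattern can be written against an explicit Flax parameter pytree instead of a self-contained model object. A rough sketch of my own, assuming `net` is some Linen module instance and the collection name matches your Flax version:

```python
import jax
import jax.numpy as jnp

# Assumes: net is a Flax Linen module instance and
# params = net.init(key, x)["params"]   (or "param" in the snippet further up, depending on version)

def loss(params, x, y):
    preds = net.apply({"params": params}, x)
    return jnp.mean(jnp.square(y - preds))

grads = jax.grad(loss)(params, x, y)
# Plain SGD step: params and grads share the same tree structure, so tree_map applies directly.
params = jax.tree_util.tree_map(lambda p, g: p - 0.01 * g, params, grads)
```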
-
Apologies for reviving this old thread, but would this still be the recommended way of doing transfer learning?
I'd like to be able to freeze / not calculate gradients on the pretrained part of the model.
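Not an official answer, but one common pattern is to zero out the gradients of the pretrained subtree before the optimizer step. A rough sketch assuming the Parent/Child layout and the `loss`/`params` names from the snippets above:

```python
import jax
import jax.numpy as jnp
from flax.core import freeze, unfreeze

grads = jax.grad(loss)(params, x, y)

# Zero the gradients under the pretrained submodule so an SGD/optimizer step leaves it untouched.
grads = unfreeze(grads)
grads["Child_0"] = jax.tree_util.tree_map(jnp.zeros_like, grads["Child_0"])
grads = freeze(grads)
```

Optax also ships utilities such as `multi_transform` for applying different (or no) updates to different parts of the parameter tree, which may be cleaner for larger models.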
-
Hey, I am the main developer of Elegy. This library was initially supposed to be a Keras-like interface on top of an intermediate library like Flax or Haiku. Haiku was chosen at first, but it soon became apparent that if modules don't hold references to each other, things like transfer learning become very painful, since you have to merge parameter structures manually to match the code structure.
So in version 0.2.0, all submodules register themselves under their parent module and you can very simply extract and reuse them, since they also keep track of their own parameters and states. Take a look at this VAE example, where the standalone decoder is extracted by reference from the full VAE in order to generate new samples: https://github.com/poets-ai/elegy/blob/master/examples/mnist_vae.py#L197
But with this change we had to implement our own Module class and port all the layers since we were no longer compatible with Haiku (which was not the initial idea). I recently saw some comments in the Linen API which seem to point in this direction:
https://github.com/google/flax/blob/master/flax/linen/module.py#L51
However, it seems the API is very new and there is no documentation yet. I was wondering if you could give me some details on its capabilities: