Replies: 1 comment 1 reply
- Thank you for pointing this out. This is a typo on our end!
- Thanks, the playbook looks pretty cool!
I am curious: is this advice for specific settings/norms?
For modern LMs, the order is typically `x + f(Norm(x))`, i.e., the transformer block normalizes its input before the attention/MLP sub-layer and adds the result back through the residual connection. Some examples are T5 and GPT-2, and I think PaLM also applies LayerNorm before the MLP/attention.
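To make the `x + f(Norm(x))` ordering concrete, here is a minimal sketch of a pre-norm block in PyTorch, assuming standard multi-head self-attention and a two-layer MLP; the names `PreNormBlock`, `d_model`, `n_heads`, and `d_ff` are illustrative and not taken from any of the models above.

```python
import torch
import torch.nn as nn


class PreNormBlock(nn.Module):
    """Transformer block using the x + f(Norm(x)) ("pre-norm") ordering."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LayerNorm is applied *before* each sub-layer, and the residual adds
        # the sub-layer output back onto the un-normalized input.
        # (A post-norm block would instead compute Norm(x + f(x)).)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


# Example usage (shapes only): batch of 2 sequences, length 16, width 64.
block = PreNormBlock(d_model=64, n_heads=4, d_ff=256)
out = block(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```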