Replies: 1 comment 1 reply
- Thank you for pointing this out. This is a typo on our end!
- Thanks, the playbook looks pretty cool!
I am curious: is this advice for specific settings/norms?
For modern LMs, the order is typically `x + f(Norm(x))`, i.e., the transformer block normalizes its input before the attention/MLP sub-layer and adds the result back through the residual connection. Some examples are T5 and GPT-2, and I think PaLM also applies LayerNorm before the MLP/attention.
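To make the `x + f(Norm(x))` ordering concrete, here is a minimal sketch of a pre-norm block in PyTorch, assuming standard multi-head self-attention and a two-layer MLP; the names `PreNormBlock`, `d_model`, `n_heads`, and `d_ff` are illustrative and not taken from any of the models above.

```python
import torch
import torch.nn as nn


class PreNormBlock(nn.Module):
    """Transformer block using the x + f(Norm(x)) ("pre-norm") ordering."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LayerNorm is applied *before* each sub-layer, and the residual adds
        # the sub-layer output back onto the un-normalized input.
        # (A post-norm block would instead compute Norm(x + f(x)).)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


# Example usage (shapes only): batch of 2 sequences, length 16, width 64.
block = PreNormBlock(d_model=64, n_heads=4, d_ff=256)
out = block(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```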