-
Yeah, I think you are right here; it's subtly different when looking at it like this.
-
I had problems understanding how LayerNorm is computed in a Transformer and came across this paper:
https://openaccess.thecvf.com/content/ICCV2021W/NeurArch/papers/Yao_Leveraging_Batch_Normalization_for_Vision_Transformers_ICCVW_2021_paper.pdf
If you take a look at p. 414, you will see that LayerNorm is calculated differently in Transformers than in CNNs: the mean and standard deviation are computed for each token independently, along the embedding dimension. So the normalization is applied over the whole embedding vector of each individual token. This is conceptually closer to InstanceNorm, but it operates across the feature dimension (the embeddings).
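For anyone who wants to verify this, here is a minimal PyTorch sketch (the tensor shapes are made up purely for illustration) showing that the statistics are computed per token, over the embedding dimension only:

```python
import torch
import torch.nn as nn

# Made-up shapes for illustration: batch of 2 sequences,
# 4 tokens each, 8-dimensional embeddings.
x = torch.randn(2, 4, 8)

# Transformer-style LayerNorm: normalizes over the last
# (embedding) dimension, i.e. each token's vector separately.
ln = nn.LayerNorm(normalized_shape=8, eps=1e-5)
y = ln(x)

# Manual equivalent: mean/variance along the embedding axis,
# computed independently for every (batch, token) position.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mu) / torch.sqrt(var + 1e-5)

# Matches, since gamma=1 and beta=0 at initialization.
print(torch.allclose(y, manual, atol=1e-5))  # True
```

For contrast, BatchNorm (the subject of the paper above) would instead compute statistics for each feature across the batch, rather than per token across the features.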
Especially for computer vision tasks this is important to know, as normalization then works differently in a ViT (where every image is divided into patches, and each patch acts like a token) than in a CNN.