-
Yeah, I think you are right here; it's subtly different when looking at it like this.
-
I had problems understanding how LayerNorm is computed in a Transformer and came across this paper:
https://openaccess.thecvf.com/content/ICCV2021W/NeurArch/papers/Yao_Leveraging_Batch_Normalization_for_Vision_Transformers_ICCVW_2021_paper.pdf
If you take a look at p. 414, you will see that LayerNorm is calculated differently in Transformers than in CNNs: the mean and standard deviation are computed for each token independently, along the embedding dimension. So the normalization is applied over the whole embedding vector of each individual token. This is conceptually closer to InstanceNorm, but it operates across the feature dimension (the embeddings).
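For anyone who wants to verify this, here is a minimal PyTorch sketch (the tensor shapes are made up purely for illustration) showing that the statistics are computed per token, over the embedding dimension only:

```python
import torch
import torch.nn as nn

# Made-up shapes for illustration: batch of 2 sequences,
# 4 tokens each, 8-dimensional embeddings.
x = torch.randn(2, 4, 8)

# Transformer-style LayerNorm: normalizes over the last
# (embedding) dimension, i.e. each token's vector separately.
ln = nn.LayerNorm(normalized_shape=8, eps=1e-5)
y = ln(x)

# Manual equivalent: mean/variance along the embedding axis,
# computed independently for every (batch, token) position.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mu) / torch.sqrt(var + 1e-5)

# Matches, since gamma=1 and beta=0 at initialization.
print(torch.allclose(y, manual, atol=1e-5))  # True
```

For contrast, BatchNorm (the subject of the paper above) would instead compute statistics for each feature across the batch, rather than per token across the features.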
Especially for computer vision tasks this is important to know, as normalization then works differently in a ViT (where every image is divided into patches, and each patch acts like a token) than in a CNN.