add parallel_attention_blocks #446
Conversation
Looking forward to the unit tests; will do a more thorough review after that.
@@ -45,3 +45,17 @@ def forward(self, x: Tensor) -> Tensor:
            self.eps,
        )
        return output.type_as(x)


class RMSNorm(nn.Module):
add docstring
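For illustration, a minimal sketch of what the documented class could look like; the constructor arguments, the eps default, and the fp32 upcast shown here are assumptions for the sketch, not necessarily the PR's actual signature:

```python
import torch
from torch import Tensor, nn


class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization (Zhang & Sennrich, 2019).

    Normalizes by the root-mean-square of the last dimension instead of
    mean and variance, so there is no centering and no bias term.

    Args:
        dim (int): size of the last dimension of the input.
        eps (float): small constant added for numerical stability.
    """

    def __init__(self, dim: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x: Tensor) -> Tensor:
        # Compute in fp32 for stability, then cast back to the input dtype.
        x_fp32 = x.float()
        normed = x_fp32 * torch.rsqrt(x_fp32.pow(2).mean(-1, keepdim=True) + self.eps)
        return (normed * self.scale).type_as(x)
```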
* We use SwiGLU for the activation function
* SwiGLU will approximate same total num params as traditional MLP with GELU
* Cross Attention is not enabled here (but available)
what does "but available" mean?
Ah - we support cross attention in the main codebase for parallel attention. However, it's not really useful for TMM, so I removed it for this PR (hence the 'but available'). Let me expand that comment to note it's in the main codebase.
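On the SwiGLU bullet above: the parameter-count equivalence usually comes from scaling the hidden dimension by 2/3, so the extra gate projection does not add parameters overall. A hedged sketch under that assumption (module and attribute names are illustrative, not the PR's):

```python
import torch.nn.functional as F
from torch import Tensor, nn


class SwiGLUFeedForward(nn.Module):
    """SwiGLU MLP: w2(silu(w1(x)) * w3(x)).

    The hidden dim is scaled by 2/3 so the three projections hold roughly
    the same number of parameters as a two-layer GELU MLP with 4x expansion:
    3 * dim * (8/3 * dim) == 2 * dim * (4 * dim).
    """

    def __init__(self, dim: int, expansion: int = 4) -> None:
        super().__init__()
        hidden_dim = int(2 * expansion * dim / 3)
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: Tensor) -> Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```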
* MQA and GQA are enabled - modify heads via 'num_heads_group_query_attn'
This seems to cover GQA - how would I use MQA?
Easy (but I will expand this in the comments to clarify):
num_heads_group_query_attn = 1 and you now have MQA.
num_heads_group_query_attn > 1 and < Q heads gives you GQA.
I should probably also clarify that the number of Q heads needs to be a multiple of 'num_heads_group_query_attn' (there is an assert check so you ultimately can't miss it, but it might be nicer to note this in the docstring).
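To make the num_heads_group_query_attn settings concrete, here is a hedged sketch of the usual K/V head grouping; the repeat_kv helper and the shapes below are illustrative assumptions, not the PR's exact code:

```python
import torch


def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand (B, n_kv_heads, L, head_dim) keys/values to match the query head count."""
    if n_rep == 1:
        return x
    bsz, n_kv_heads, seqlen, head_dim = x.shape
    return (
        x[:, :, None, :, :]
        .expand(bsz, n_kv_heads, n_rep, seqlen, head_dim)
        .reshape(bsz, n_kv_heads * n_rep, seqlen, head_dim)
    )


num_q_heads = 8
num_heads_group_query_attn = 1  # 1 -> MQA; 1 < n < num_q_heads -> GQA; n == num_q_heads -> standard MHA
assert num_q_heads % num_heads_group_query_attn == 0  # Q heads must be a multiple of the KV head count

k = torch.randn(2, num_heads_group_query_attn, 16, 64)       # (B, n_kv_heads, L, head_dim)
k = repeat_kv(k, num_q_heads // num_heads_group_query_attn)  # -> (2, 8, 16, 64), matches Q heads
```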
q_ = q.float().reshape(*q.shape[:-1], -1, 2)  # B H L D/2 2
k_ = k.float().reshape(*k.shape[:-1], -1, 2)  # B H L D/2 2
If we care about making this scriptable, it might be good to ditch the list unpacking.
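One hedged way to address the scriptability concern, assuming the q/k layout stays B H L D: spell out the leading dimensions so no *-unpacking of the shape is needed (the dummy tensors below are only for illustration):

```python
import torch

# Illustrative q/k with the documented B H L D layout.
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 8, 16, 64)

# Scriptable alternative: index the leading dims explicitly instead of
# *-unpacking q.shape, which TorchScript does not compile.
q_ = q.float().reshape(q.shape[0], q.shape[1], q.shape[2], -1, 2)  # B H L D/2 2
k_ = k.float().reshape(k.shape[0], k.shape[1], k.shape[2], -1, 2)  # B H L D/2 2
```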
Summary:
This PR adds Parallel Attention blocks to Torch MultiModal.
There are three main additions:
a - Rotary Embeddings
b - RMS Norm
c - Parallel_Blocks (a generic sketch of the parallel block pattern follows below)
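For readers unfamiliar with the pattern, a generic sketch of a parallel attention block, where attention and the MLP read the same normalized input and both outputs are summed into the residual. This sketch uses stock LayerNorm/GELU/MultiheadAttention for brevity; the PR's version swaps in RMS Norm, SwiGLU, rotary embeddings, and MQA/GQA per the additions above:

```python
import torch
from torch import Tensor, nn


class ParallelTransformerBlock(nn.Module):
    """Parallel-form block: x + attn(norm(x)) + mlp(norm(x)).

    A single pre-norm feeds both branches, so the attention and MLP outputs
    are computed from the same input and summed into the residual, rather
    than applied sequentially as in a standard transformer block.
    """

    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: Tensor) -> Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out + self.mlp(h)


# Example usage with illustrative sizes.
block = ParallelTransformerBlock(dim=64, num_heads=8)
out = block(torch.randn(2, 16, 64))  # (batch, seq_len, dim)
```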
Test plan:
The code has been tested separately in ViT and LLM applications and with rotary unit tests.
However, a general unit test still needs to be added, and the rotary unit tests need to be migrated.