
add Alibi positional embeddings #462

Open · wants to merge 15 commits into base: main
Conversation

lessw2020
Contributor

Summary:
This PR adds an ALiBi positional embeddings class (per the ALiBi paper, https://arxiv.org/abs/2108.12409).
It generates the ALiBi attention mask, which is added to the attention scores after QK^T / sqrt(d_k), and replaces the usual sinusoidal-type positional embeddings.
The class is designed to be instantiated once, outside the transformer block loop, based on max_seq_length; the layers then retrieve the attention mask for the current sequence length, so only a single mask buffer needs to be created.
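Rough usage sketch (illustrative only: the import path is taken from the file touched in this PR, and the call that slices the buffer to the current sequence length is an assumed interface, not necessarily the final one):

```python
import torch

# Hypothetical usage sketch, not the final API: how the bias is retrieved per
# layer (calling the module with the current sequence length) is assumed.
from torchmultimodal.modules.layers.position_embedding import AlibiPositionEmbeddings

max_seq_len, num_heads, curr_seq_len, head_dim = 2048, 8, 512, 64

# instantiate once, outside the transformer block loop
alibi = AlibiPositionEmbeddings(max_seq_len=max_seq_len, num_heads=num_heads)

# inside each attention layer: slice the single precomputed buffer to the
# current sequence length...
alibi_bias = alibi(curr_seq_len)  # assumed shape (num_heads, curr_seq_len, curr_seq_len)

# ...and add it to the raw attention scores after QK^T / sqrt(d_k)
q = torch.randn(1, num_heads, curr_seq_len, head_dim)
k = torch.randn(1, num_heads, curr_seq_len, head_dim)
scores = q @ k.transpose(-2, -1) / head_dim**0.5
scores = scores + alibi_bias  # broadcasts over the batch dim; softmax follows
```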

Test plan:
I tested by running a 200M GPT-2 model on 10% of OpenWebText to compare training curves between learned embeddings (the GPT-2 default) and ALiBi.
[Image: alibi_training_curves]

I also added a unit test with three tests:
a - the shape of the ALiBi mask
b - verify the first head's row entry
c - verify the last head's last row entry
Note that half the mask is -inf; when trying to use allclose with the -inf entries they will not match, so I targeted entries that contain only finite values.
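For illustration, the kind of spot checks this amounts to (not the actual test code; just a tiny stand-in mask):

```python
import torch

# Illustrative only, not the PR's test code: build a tiny causal bias with
# -inf above the diagonal and check shape / finite entries / -inf positions
# separately, rather than comparing whole tensors that contain -inf.
seq_len = 4
upper = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
bias = torch.zeros(seq_len, seq_len).masked_fill(upper, float("-inf"))

assert bias.shape == (seq_len, seq_len)                        # shape check
assert torch.isclose(bias[seq_len - 1, 0], torch.tensor(0.0))  # a finite entry
assert torch.isinf(bias[0, seq_len - 1])                       # upper triangle is -inf
```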

facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Sep 6, 2023
@lessw2020
Contributor Author

The unit test failure is not related.

@lessw2020
Contributor Author

The test failure is not related; it appears to be a rounding issue:
test_model.py::TestAudioMaskedAutoEncoder::test_audio_mae_train_masking - AssertionError: actual: 512.999755859375, expected: 513.

@lessw2020
Contributor Author

I reran the training with the updates to confirm:
[Image: proper_causal_mask]

codecov-commenter commented Sep 7, 2023

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison: base (1fd96dc) 74.01% vs. head (c54548f) 74.13%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #462      +/-   ##
==========================================
+ Coverage   74.01%   74.13%   +0.12%     
==========================================
  Files         207      207              
  Lines       14203    14274      +71     
==========================================
+ Hits        10512    10582      +70     
- Misses       3691     3692       +1     
| Files | Coverage Δ |
| --- | --- |
| ...rchmultimodal/modules/layers/position_embedding.py | 100.00% <100.00%> (ø) |
| tests/modules/layers/test_position_embedding.py | 98.76% <97.29%> (-1.24%) ⬇️ |


return self.alibi_mask[..., :curr_seq_len, :curr_seq_len]

@classmethod
def build_causal_attention_mask(cls, seq_len: int, num_heads: int) -> torch.Tensor:
Contributor

FWIW, there is also the get_causal_attention_mask utility (you may even be able to use get_extended_attention_mask from the same file in lieu of the repeat, though it does broadcast to an extra dim for batch size).
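For context, roughly what such a utility provides (plain torch illustration, not the actual torchmultimodal helpers named above):

```python
import torch

# Generic illustration (plain torch, not the torchmultimodal utilities named
# above): a boolean causal mask, optionally repeated per attention head.
seq_len, num_heads = 8, 4
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # True = may attend
per_head = causal.unsqueeze(0).repeat(num_heads, 1, 1)               # (num_heads, seq_len, seq_len)
```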

max_seq_len: int,
num_heads: int,
) -> None:
"""recommended usage: create alibi mask before transformer block loop and integrate
Contributor

Yeah, this is a bit tricky. Kinda similar to RoPE embeddings: integrating this properly will necessitate rethinking some aspects of our transformer implementation. For instance, it seems like one assumption here is that our transformer's mask should be float dtype and not bool.
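For concreteness, a minimal sketch of the two mask conventions (arbitrary shapes; not our transformer code):

```python
import torch

# Minimal sketch (arbitrary shapes): a bool "keep" mask is applied with
# masked_fill, while a float additive mask (like the ALiBi bias) is simply
# added to the attention scores.
scores = torch.randn(2, 8, 16, 16)  # (batch, heads, q_len, k_len)

bool_mask = torch.tril(torch.ones(16, 16, dtype=torch.bool))  # True = attend
scores_bool = scores.masked_fill(~bool_mask, float("-inf"))

float_mask = torch.zeros(8, 16, 16)   # e.g. per-head ALiBi bias plus causal -inf
scores_float = scores + float_mask    # broadcasts over the batch dim
```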

@@ -169,3 +170,108 @@ def forward(self, t: Tensor) -> Tensor:
if self.embed_dim % 2 == 1:
embeddings = nn.functional.pad(embeddings, (0, 1))
return embeddings


class AlibiPositionEmbeddings(nn.Module):
Contributor

High level q: if we're not using model forward and mostly using class/static methods, why not just define it as a function? Offhand I don't see a reason why this needs to be stateful (it's very possible I'm missing something though).
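For concreteness, a sketch of the stateless alternative (hypothetical; build_alibi_bias is not part of this PR, and for brevity it only handles power-of-2 head counts):

```python
import torch

# Hypothetical sketch of a stateless-function alternative; build_alibi_bias is
# not part of this PR. Slopes follow the paper's power-of-2 rule: a geometric
# sequence starting at 2^(-8/num_heads) with that same value as its ratio.
def build_alibi_bias(max_seq_len: int, num_heads: int) -> torch.Tensor:
    start = 2 ** (-8.0 / num_heads)
    slopes = torch.tensor([start ** (i + 1) for i in range(num_heads)])
    # relative position (j - i) is <= 0 for the allowed (past/present) keys
    rel_pos = torch.arange(max_seq_len)[None, :] - torch.arange(max_seq_len)[:, None]
    bias = slopes[:, None, None] * rel_pos.clamp(max=0)
    causal = torch.triu(torch.full((max_seq_len, max_seq_len), float("-inf")), diagonal=1)
    return bias + causal  # (num_heads, max_seq_len, max_seq_len)

# call sites would then just slice: build_alibi_bias(2048, 8)[..., :seq_len, :seq_len]
```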

@staticmethod
def get_slopes(num_heads: int) -> List[float]:
"""for n heads, a range from (0,1) and is the geometric sequence
that starts at 2^(-8/n) and uses this same value as its ratio
Contributor

Thank you for explaining/documenting the magic numbers 🙂
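For anyone reading along, the concrete values for n = 8 heads (a quick sketch matching the docstring's description):

```python
# Geometric sequence starting at 2^(-8/n) with that same value as its ratio;
# for n = 8 heads this is 1/2, 1/4, ..., 1/256.
n = 8
start = 2 ** (-8.0 / n)
slopes = [start ** (i + 1) for i in range(n)]
print(slopes)  # [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
```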

return get_slopes_power_of_2(num_heads)

# paper authors note that they only trained models that have 2^a heads for some a.
# This has beneficial properties related to input being power of 2.
Contributor

Do you know what these properties are? Tbh I am confused by this because even if n is a power of 2 some of the ratios will not be rational for n > 8

b = get_slopes_power_of_2(2 * closest_power_of_2)[0::2][
: num_heads - closest_power_of_2
]
return [x for pair in zip(b, a) for x in pair] + a[len(b) :]
Contributor

IMO this is hard to parse. Agree with @daviswer's comment about returning values in order, but could we just do sorted(a+b)? (Maybe I'm missing a tricky case; if so, a comment explaining this would suffice instead.)
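For example, with num_heads = 6 (illustrative sketch; the helper is re-derived from the docstring's geometric-sequence rule):

```python
# Worked example of the splicing above for num_heads = 6, so closest_power_of_2 = 4.
def get_slopes_power_of_2(n):
    start = 2 ** (-8.0 / n)
    return [start ** (i + 1) for i in range(n)]

a = get_slopes_power_of_2(4)            # [0.25, 0.0625, 0.015625, 0.00390625]
b = get_slopes_power_of_2(8)[0::2][:2]  # [0.5, 0.125]
spliced = [x for pair in zip(b, a) for x in pair] + a[len(b):]
print(spliced)  # [0.5, 0.25, 0.125, 0.0625, 0.015625, 0.00390625]
# i.e. in this case the two lists end up merged in descending order
```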

# paper authors note that they only trained models that have 2^a heads for some a.
# This has beneficial properties related to input being power of 2.

# Closest power of 2 below is workaround for when num of heads is not power of 2
Contributor

Their method of interpolating is a bit unusual. Maybe explicitly explain that for $\mathrm{num\_heads} = 2^N + k$ they are splicing the geometric series with ratio $2^{-8/2^N}$ with every other one of the first $2k$ elements of the geometric series with ratio $2^{-8/2^{N+1}}$ (assuming I am even understanding it correctly 😅)
