
What is the purpose of c_proj here? #135

Open
brynhayder opened this issue Apr 10, 2024 · 1 comment

Comments


brynhayder commented Apr 10, 2024

self.c_proj = nn.Linear(config.n_embd, config.n_embd)

Why do we need an additional linear transformation after the MHA and before the MLP when the dimensions are the same?

(I understand that this is how the initial transformer implementation was written, but I took this operation to be for dimension consistency between sequential attention operations. It seems superfluous here since the first linear in the MLP can already take linear combinations of the attention outputs.)
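As a toy illustration of that point (a hypothetical PyTorch sketch, not code from this repo, and ignoring the residual connection and LayerNorm that sit between the two layers in a real block): two consecutive linear maps with no nonlinearity in between compose into a single linear map, which is why a square c_proj looks redundant at first glance.

```python
import torch

n_embd = 8
W_O = torch.randn(n_embd, n_embd)       # square c_proj weight (hypothetical)
W_fc = torch.randn(4 * n_embd, n_embd)  # first MLP linear weight (hypothetical)

x = torch.randn(3, n_embd)              # concatenated attention-head outputs
# Applying c_proj and then the MLP's first linear ...
y_two_steps = (x @ W_O.T) @ W_fc.T
# ... is the same linear map as a single fused weight W_fc @ W_O.
y_fused = x @ (W_fc @ W_O).T
print(torch.allclose(y_two_steps, y_fused, atol=1e-5))  # True
```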


theicfire commented May 6, 2024

I think the point is that W^O can change the dimensionality if the concatenated head outputs are larger or smaller than the model dimension (i.e. if d_k or d_v were chosen differently). The paper ultimately chose parameters that make W^O square, but keeping the projection may have helped the authors experiment with other settings.
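A minimal sketch of that reading (hypothetical PyTorch with made-up names n_head/d_head, no causal masking or dropout, not code from this repo): if the per-head value dimension is decoupled from n_embd, the output projection W^O (c_proj here) is what maps the concatenated heads back to the model width, and it only happens to be square when n_head * d_head == n_embd.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Hypothetical multi-head attention where the per-head width d_head
    is not tied to n_embd, so c_proj (W^O) is a genuine reshaping step."""

    def __init__(self, n_embd, n_head, d_head):
        super().__init__()
        self.n_head, self.d_head = n_head, d_head
        # Q, K, V projections: n_embd -> n_head * d_head each
        self.c_attn = nn.Linear(n_embd, 3 * n_head * d_head)
        # Output projection W^O: (n_head * d_head) -> n_embd.
        # Square only when n_head * d_head == n_embd.
        self.c_proj = nn.Linear(n_head * d_head, n_embd)

    def forward(self, x):
        B, T, _ = x.shape
        q, k, v = self.c_attn(x).split(self.n_head * self.d_head, dim=2)
        # (B, T, n_head * d_head) -> (B, n_head, T, d_head)
        q = q.view(B, T, self.n_head, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.d_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (self.d_head ** 0.5)
        y = att.softmax(dim=-1) @ v              # (B, n_head, T, d_head)
        y = y.transpose(1, 2).reshape(B, T, -1)  # concatenate the heads
        return self.c_proj(y)                    # back to (B, T, n_embd)

# e.g. 12 heads of width 96 concatenate to 1152, projected back to n_embd=768
mha = MultiHeadAttention(n_embd=768, n_head=12, d_head=96)
print(mha(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```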

Related: #118 (comment)
