Commit

Publish
DouglasOrr committed Apr 17, 2024
1 parent 2fc0cc1 commit beb955d
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion 2024-04-transformers/article.html
@@ -284,7 +284,7 @@ <h2 id="multi-layer-perceptron-mlp-geglu">Multi-layer perceptron (MLP, GeGLU)</h
<p><strong>Summary</strong></p>
<p>A final figure might help review the journey we've been on, from ReLU -&gt; ReGLU -&gt; GeGLU. To make things legible, we're now looking at a slice through the surfaces we've seen so far, setting <code>x[1]</code> to a constant value, and just looking at how <code>y[0]</code> depends on <code>x[0]</code>.</p>
<p><img alt="Three line plots, shown as x[0] varies from -2 to 2. The first, ReLU, is piecewise linear. The second, ReGLU, is piecewise quadratic with gradient discontinuities. The third, GeGLU, is smooth but still vaguely quadratic." class="img-fluid" src="img/mlp_slice.png" /></p>
-<p>So Gemma's MLP, the GeGLU, can be thought of as a piecewise-quadratic function with smooth boundaries between the pieces. Where our example had 6 regions across a 2-vector input, Gemma's MLPs can have up to 32768 regions across their 2048-vector input.</p>
+<p>So Gemma's MLP, the GeGLU, can be thought of as a piecewise-quadratic function with smooth boundaries between the pieces. Where our example had 6 regions across a 2-vector input, Gemma's MLPs may have a vast number of regions (perhaps $10^{2000}$) across their 2048-vector input.</p>
<p>The purpose of the MLP in Gemma is to use this function to independently transform each token, ready to form another attention query or ready to match against output tokens. Although MLPs cannot fuse information from across the context by themselves (which is the fundamental task of a language model), our experience shows that including the MLP makes attention much more efficient at doing exactly this.</p>
</details>
<h2 id="final-norm">Final norm</h2>
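
The ReLU -> ReGLU -> GeGLU slice described in the hunk above can be sketched in a few lines of Python. This is only an illustration, not the article's plotting code: the weights `W_GATE`, `W_UP` and `W_DOWN`, the helper names, and the choice of 3 hidden units are all hypothetical, and the exact (erf) form of GELU is assumed here even though a model may use a tanh approximation.

```python
import math

def relu(z):
    return max(z, 0.0)

def gelu(z):
    # Exact GELU via erf (assumption: a real model may use the tanh approximation).
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical weights: 2 inputs -> 3 hidden units -> a single output y[0].
W_GATE = [(1.0, 0.5), (-1.0, 1.0), (0.5, -1.0)]
W_UP = [(0.5, -1.0), (1.0, 1.0), (-1.0, 0.5)]
W_DOWN = [1.0, -0.5, 0.75]

def dot(w, x):
    return w[0] * x[0] + w[1] * x[1]

def relu_mlp(x):
    # Plain ReLU MLP: piecewise linear in x.
    return sum(d * relu(dot(g, x)) for g, d in zip(W_GATE, W_DOWN))

def glu_mlp(x, act):
    # Gated MLP: act(gate) * up. With relu this is ReGLU (piecewise quadratic),
    # with gelu it is GeGLU (the same shape with smoothed boundaries).
    return sum(d * act(dot(g, x)) * dot(u, x)
               for g, u, d in zip(W_GATE, W_UP, W_DOWN))

def slice_y0(f, x1=0.5, n=9):
    # Hold x[1] constant and sweep x[0] from -2 to 2, as in the figure.
    xs = [-2.0 + 4.0 * i / (n - 1) for i in range(n)]
    return [(x0, f((x0, x1))) for x0 in xs]

if __name__ == "__main__":
    for name, f in [("ReLU", relu_mlp),
                    ("ReGLU", lambda x: glu_mlp(x, relu)),
                    ("GeGLU", lambda x: glu_mlp(x, gelu))]:
        print(name, [round(y, 3) for _, y in slice_y0(f)])
```

Far from the gate boundaries, the ReGLU and GeGLU values nearly coincide, which is one way to see GeGLU as a smoothed version of the piecewise-quadratic ReGLU.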
