From beb955dfc039d9a279a6172b427e9e16cacb06a2 Mon Sep 17 00:00:00 2001
From: Douglas Orr
Date: Wed, 17 Apr 2024 07:53:23 +0100
Subject: [PATCH] Publish

---
 2024-04-transformers/article.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/2024-04-transformers/article.html b/2024-04-transformers/article.html
index ae73ef6..f3b05d7 100644
--- a/2024-04-transformers/article.html
+++ b/2024-04-transformers/article.html
@@ -284,7 +284,7 @@

Multi-layer perceptron (MLP, GeGLU)

Summary

A final figure might help review the journey we've been on, from ReLU -> ReGLU -> GeGLU. To make things legible, we're now looking at a slice through the surfaces we've seen so far, setting x[1] to a constant value, and just looking at how y[0] depends on x[0].

Three line plots of y[0] as x[0] varies from -2 to 2. The first, ReLU, is piecewise linear. The second, ReGLU, is piecewise quadratic with gradient discontinuities. The third, GeGLU, is smooth but still vaguely quadratic.

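For concreteness, here is a minimal numpy sketch of the three variants traced in the figure; the weight names (W_in, W_gate, W_up, W_out) and the tanh approximation of GELU are illustrative choices, not necessarily Gemma's exact implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def relu_mlp(x, W_in, W_out):
    # plain ReLU MLP: piecewise-linear in x
    return np.maximum(x @ W_in, 0.0) @ W_out

def reglu_mlp(x, W_gate, W_up, W_out):
    # ReGLU: a ReLU gate multiplies a linear "up" branch, giving a
    # piecewise-quadratic function with gradient discontinuities
    return (np.maximum(x @ W_gate, 0.0) * (x @ W_up)) @ W_out

def geglu_mlp(x, W_gate, W_up, W_out):
    # GeGLU: same shape as ReGLU, but GELU smooths the boundaries between pieces
    return (gelu(x @ W_gate) * (x @ W_up)) @ W_out
```

Each function only mixes along the last axis, so stacking tokens as rows of x transforms every token with the same weights, independently of its neighbours.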
-So Gemma's MLP, the GeGLU, can be thought of as a piecewise-quadratic function with smooth boundaries between the pieces. Where our example had 6 regions across a 2-vector input, Gemma's MLPs can have up to 32768 regions across their 2048-vector input.
+So Gemma's MLP, the GeGLU, can be thought of as a piecewise-quadratic function with smooth boundaries between the pieces. Where our example had 6 regions across a 2-vector input, Gemma's MLPs may have a vast number of regions (perhaps $10^{2000}$) across their 2048-vector input.

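As a rough sanity check on that order of magnitude (a sketch, not the article's own calculation): ignoring the GELU's smoothing, each gate unit's zero crossing is a hyperplane through the origin (the gate projection has no bias), and n such hyperplanes in general position divide d-dimensional space into 2 * sum_{k=0}^{d-1} C(n-1, k) regions. Assuming a gate width of 16384 acting on the 2048-dimensional input:

```python
from math import comb

def central_regions(n_hyperplanes, dim):
    # Regions cut out of R^dim by n hyperplanes through the origin,
    # assuming general position: 2 * sum_{k=0}^{dim-1} C(n-1, k)
    return 2 * sum(comb(n_hyperplanes - 1, k) for k in range(dim))

print(central_regions(3, 2))
# 6 -- matches the toy example above, if it uses 3 gate units (an assumption)

print(len(str(central_regions(16384, 2048))))  # takes a few seconds
# thousands of digits under these assumptions, consistent with the
# "vast number of regions" above
```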
The purpose of the MLP in Gemma is to use this function to independently transform each token, ready to form another attention query or ready to match against output tokens. Although MLPs cannot fuse information from across the context by themselves (which is the fundamental task of a language model), our experience shows that including the MLP makes attention much more efficient at doing exactly this.

Final norm