Multi-layer perceptron (MLP, GeGLU)

Summary
A final figure might help review the journey we've been on, from ReLU -> ReGLU -> GeGLU. To make things legible, we're now looking at a slice through the surfaces we've seen so far, setting x[1] to a constant value and just looking at how y[0] depends on x[0].
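To make the comparison concrete, here is a minimal sketch (not the notebook's own code) that evaluates all three variants along such a slice; the matrices W_gate, W_up and W_down are small random stand-in weights chosen purely for illustration.

```python
# Compare ReLU, ReGLU and GeGLU MLPs along a 1-d slice: x[1] held constant,
# x[0] varying. Weights are illustrative, not taken from any real model.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 2, 3
W_gate = rng.normal(size=(d_model, d_hidden))
W_up = rng.normal(size=(d_model, d_hidden))
W_down = rng.normal(size=(d_hidden, d_model))

def relu(z):
    return np.maximum(z, 0)

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def relu_mlp(x):   # plain two-layer MLP
    return relu(x @ W_gate) @ W_down

def reglu_mlp(x):  # gated, with a ReLU gate
    return (relu(x @ W_gate) * (x @ W_up)) @ W_down

def geglu_mlp(x):  # gated, with a GELU gate (as in Gemma)
    return (gelu(x @ W_gate) * (x @ W_up)) @ W_down

# The slice: x[1] fixed at a constant, x[0] swept over a range
x0 = np.linspace(-3, 3, 7)
x = np.stack([x0, np.full_like(x0, 0.5)], axis=-1)
for name, fn in [("ReLU", relu_mlp), ("ReGLU", reglu_mlp), ("GeGLU", geglu_mlp)]:
    print(name, np.round(fn(x)[:, 0], 3))  # y[0] along the slice
```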
So Gemma's MLP, the GeGLU, can be thought of as a piecewise-quadratic function with smooth boundaries between the pieces. Where our example had 6 regions across a 2-vector input, Gemma's MLPs may have a vast number of regions (perhaps $10^{2000}$) across their 2048-vector input.
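The small-example region count can be sanity-checked numerically. In the sketch below (an illustration only, with a hypothetical 2x3 gate matrix), the three hidden units partition the 2-d input plane by their sign pattern; each distinct pattern is one region, within which the gate is linear and the gated product therefore quadratic. A ReLU gate gives hard boundaries between these regions, a GELU gate smooths them, but the count of pieces is the same.

```python
# Count the linear regions of a 3-unit gate over a 2-d input by enumerating
# the sign patterns it produces on a dense grid. Weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)
W_gate = rng.normal(size=(2, 3))  # hypothetical example weights, no bias

# Which side of each of the 3 gate hyperplanes does each grid point fall on?
grid = np.stack(np.meshgrid(np.linspace(-2, 2, 200),
                            np.linspace(-2, 2, 200)), axis=-1).reshape(-1, 2)
signs = (grid @ W_gate) > 0

# Distinct sign patterns = distinct regions: typically 6 for 3 hyperplanes
# through the origin in general position in 2-d.
print("distinct regions:", len({tuple(row) for row in signs}))
```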
The purpose of the MLP in Gemma is to use this function to independently transform each token, ready to form another attention query or to be matched against output tokens. Although MLPs cannot fuse information from across the context by themselves (which is the fundamental task of a language model), our experience shows that including the MLP makes attention much more efficient at doing exactly this.
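The "independently transform each token" point is easy to check: because the MLP is applied position-wise, perturbing one token's vector leaves every other position's output untouched. The sketch below uses toy shapes and random weights purely for illustration.

```python
# Show that a GeGLU MLP mixes nothing across positions: edit one token and
# only that position's output changes. Shapes and weights are toy values.
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, d_hidden = 4, 8, 16
W_gate, W_up = rng.normal(size=(2, d_model, d_hidden))
W_down = rng.normal(size=(d_hidden, d_model))

def gelu(z):
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def geglu_mlp(x):  # x: (seq_len, d_model) -> (seq_len, d_model), position-wise
    return (gelu(x @ W_gate) * (x @ W_up)) @ W_down

tokens = rng.normal(size=(seq_len, d_model))
edited = tokens.copy()
edited[2] += 10.0  # perturb token 2 only

delta = geglu_mlp(edited) - geglu_mlp(tokens)
print(np.abs(delta).max(axis=-1))  # nonzero only at position 2
```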