Improve the intro in the n-gram example
parsiad committed Aug 17, 2024
1 parent 5ed226e commit 337ebe5
Showing 1 changed file with 20 additions and 4 deletions.
24 changes: 20 additions & 4 deletions examples/n-gram.ipynb
@@ -7,12 +7,28 @@
"source": [
"# n-gram language model\n",
"\n",
"An [n-gram language model](https://en.wikipedia.org/wiki/Word_n-gram_language_model) is a statistical model of language that models the distribution of the $k$-th token $X_k$ on the previous $n$ tokens $X_{k - 1}, \\ldots, X_{k - n}$.\n",
"An efficient way to compute an estimator for the probabilities of this model is by [counting n-grams](https://en.wikipedia.org/wiki/Word_n-gram_language_model#Approximation_method).\n",
"An [n-gram language model](https://en.wikipedia.org/wiki/Word_n-gram_language_model) is a statistical model of language.\n",
"It assumes that the probability of the next token in a sequence depends on a fixed size window (a.k.a. context) of previous tokens.\n",
"\n",
"This estimator can also be approximated by performing gradient descent on the cross entropy loss, as is done below.\n",
"For example, in a trigram ($n = 3$) model, the [likelihood](https://en.wikipedia.org/wiki/Likelihood_function) of observing the sentence $X_1 \\, X_2 \\cdots X_T$ is\n",
"$$\n",
"\\mathbb{P}(X_1 \\, X_2 \\cdots X_T)\n",
"= \\prod_{t = 1}^T \\mathbb{P}(X_t \\mid X_{t - 2} \\, X_{t - 1})\n",
"$$\n",
"where, typically, $X_t$ is assigned a placeholder value (e.g., a null or start of sentence token) whenever $t \\leq 0$.\n",
"The logarithm of the likelihood is\n",
"$$\n",
"\\log \\mathbb{P}(X_1 \\, X_2 \\cdots X_T)\n",
"\\propto \\frac{1}{T} \\sum_{t = 1}^T \\log \\mathbb{P}(X_t \\mid X_{t - 2} \\, X_{t - 1})\n",
"$$\n",
"which we recognize as the [cross-entropy](https://parsiad.ca/blog/2023/motivating_the_cross_entropy_loss).\n",
"We allow each probability $p_{x,x^\\prime,x^{\\prime\\prime}} \\equiv \\mathbb{P}(x \\mid x^\\prime \\, x^{\\prime\\prime})$ to be a distinct parameter of the model.\n",
"In this case, letting $V$ denote the set of tokens (a.k.a. the vocabulary), the trigram model has $|V|^3$ parameters.\n",
"An efficient way to compute a [maxium likelihood estimator](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) for $p$ is by [counting n-grams](https://en.wikipedia.org/wiki/Word_n-gram_language_model#Approximation_method).\n",
"\n",
"Although this is not an efficient way to compute the estimator, it is useful as it demonstrates how a language model without a closed form solution (e.g., [recurrent neural networks](https://en.wikipedia.org/wiki/Recurrent_neural_network) and [large language models](https://en.wikipedia.org/wiki/Large_language_model)) can be learned."
"However, one could also approximate this estimator by performing gradient ascent on the cross entropy.\n",
"This notebook uses Micrograd++ to do just that.\n",
"While **this approach is not efficient**, it is useful in that it demonstrates how a more complicated language model *without* a closed form solution (e.g., [recurrent neural networks](https://en.wikipedia.org/wiki/Recurrent_neural_network) and [large language models](https://en.wikipedia.org/wiki/Large_language_model)) can be learned by maximizing cross entropy iteratively."
]
},
{
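The counting estimator mentioned in the revised intro has a simple closed form: estimate $\mathbb{P}(x \mid x' \, x'')$ by the number of times the trigram $x'' \, x' \, x$ occurs divided by the number of times the context $x'' \, x'$ occurs. A minimal Python sketch on a toy corpus is below; the function name `trigram_mle`, the `<s>` placeholder token, and the example sentences are illustrative assumptions, not code taken from the notebook.

```python
# A minimal sketch (not the notebook's code): closed-form maximum likelihood
# estimation of a trigram model by counting n-grams.
from collections import Counter

BOS = "<s>"  # placeholder token standing in for positions t <= 0

def trigram_mle(sentences):
    """Estimate P(x | x', x'') as count(x'' x' x) / count(x'' x')."""
    trigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        tokens = [BOS, BOS] + list(sentence)
        for i in range(2, len(tokens)):
            context = (tokens[i - 2], tokens[i - 1])
            trigram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    return {ngram: n / context_counts[ngram[:2]] for ngram, n in trigram_counts.items()}

probs = trigram_mle([["the", "cat", "sat"], ["the", "cat", "slept"]])
print(probs[("the", "cat", "sat")])  # 0.5
```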

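The notebook itself uses Micrograd++ for the gradient-based alternative. As a rough illustration of the same idea without that dependency, the NumPy sketch below performs gradient descent on the cross-entropy of per-context logits; the toy data, learning rate, and iteration count are assumptions for illustration and not the notebook's actual code.

```python
# Hypothetical sketch (not the notebook's Micrograd++ code): learn trigram
# probabilities by gradient descent on the cross-entropy.
import numpy as np

rng = np.random.default_rng(0)
V = 4                                      # assumed toy vocabulary size
T = 200                                    # assumed number of observed trigrams
contexts = rng.integers(0, V * V, size=T)  # each (x_{t-2}, x_{t-1}) flattened to one id
targets = rng.integers(0, V, size=T)       # the corresponding x_t

logits = np.zeros((V * V, V))              # one unnormalized distribution per context
lr = 1.0
for _ in range(500):
    z = logits[contexts]                           # (T, V) copy via fancy indexing
    z -= z.max(axis=1, keepdims=True)              # shift for numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)              # softmax probabilities
    grad = p
    grad[np.arange(T), targets] -= 1.0             # d(cross-entropy) / d(logits)
    np.add.at(logits, contexts, -lr * grad / T)    # descent step, accumulated per context

# softmax(logits[c]) now approximates the empirical frequency of each target
# token under context c, i.e., the counting estimator above.
```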