Improve the intro in the n-gram example
parsiad committed Aug 17, 2024
1 parent 5ed226e commit 337ebe5
Showing 1 changed file with 20 additions and 4 deletions.
24 changes: 20 additions & 4 deletions examples/n-gram.ipynb
@@ -7,12 +7,28 @@
"source": [
"# n-gram language model\n",
"\n",
"An [n-gram language model](https://en.wikipedia.org/wiki/Word_n-gram_language_model) is a statistical model of language that models the distribution of the $k$-th token $X_k$ on the previous $n$ tokens $X_{k - 1}, \\ldots, X_{k - n}$.\n",
"An efficient way to compute an estimator for the probabilities of this model is by [counting n-grams](https://en.wikipedia.org/wiki/Word_n-gram_language_model#Approximation_method).\n",
"An [n-gram language model](https://en.wikipedia.org/wiki/Word_n-gram_language_model) is a statistical model of language.\n",
"It assumes that the probability of the next token in a sequence depends on a fixed size window (a.k.a. context) of previous tokens.\n",
"\n",
"This estimator can also be approximated by performing gradient descent on the cross entropy loss, as is done below.\n",
"For example, in a trigram ($n = 3$) model, the [likelihood](https://en.wikipedia.org/wiki/Likelihood_function) of observing the sentence $X_1 \\, X_2 \\cdots X_T$ is\n",
"$$\n",
"\\mathbb{P}(X_1 \\, X_2 \\cdots X_T)\n",
"= \\prod_{t = 1}^T \\mathbb{P}(X_t \\mid X_{t - 2} \\, X_{t - 1})\n",
"$$\n",
"where, typically, $X_t$ is assigned a placeholder value (e.g., a null or start of sentence token) whenever $t \\leq 0$.\n",
"The logarithm of the likelihood is\n",
"$$\n",
"\\log \\mathbb{P}(X_1 \\, X_2 \\cdots X_T)\n",
"\\propto \\frac{1}{T} \\sum_{t = 1}^T \\log \\mathbb{P}(X_t \\mid X_{t - 2} \\, X_{t - 1})\n",
"$$\n",
"which we recognize as the [cross-entropy](https://parsiad.ca/blog/2023/motivating_the_cross_entropy_loss).\n",
"We allow each probability $p_{x,x^\\prime,x^{\\prime\\prime}} \\equiv \\mathbb{P}(x \\mid x^\\prime \\, x^{\\prime\\prime})$ to be a distinct parameter of the model.\n",
"In this case, letting $V$ denote the set of tokens (a.k.a. the vocabulary), the trigram model has $|V|^3$ parameters.\n",
"An efficient way to compute a [maxium likelihood estimator](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) for $p$ is by [counting n-grams](https://en.wikipedia.org/wiki/Word_n-gram_language_model#Approximation_method).\n",
"\n",
"Although this is not an efficient way to compute the estimator, it is useful as it demonstrates how a language model without a closed form solution (e.g., [recurrent neural networks](https://en.wikipedia.org/wiki/Recurrent_neural_network) and [large language models](https://en.wikipedia.org/wiki/Large_language_model)) can be learned."
"However, one could also approximate this estimator by performing gradient ascent on the cross entropy.\n",
"This notebook uses Micrograd++ to do just that.\n",
"While **this approach is not efficient**, it is useful in that it demonstrates how a more complicated language model *without* a closed form solution (e.g., [recurrent neural networks](https://en.wikipedia.org/wiki/Recurrent_neural_network) and [large language models](https://en.wikipedia.org/wiki/Large_language_model)) can be learned by maximizing cross entropy iteratively."
]
},
{
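The counting estimator mentioned in the revised intro has a simple closed form: estimate $\mathbb{P}(x \mid x' \, x'')$ by the number of times the trigram $x'' \, x' \, x$ occurs divided by the number of times the context $x'' \, x'$ occurs. A minimal Python sketch on a toy corpus is below; the function name `trigram_mle`, the `<s>` placeholder token, and the example sentences are illustrative assumptions, not code taken from the notebook.

```python
# A minimal sketch (not the notebook's code): closed-form maximum likelihood
# estimation of a trigram model by counting n-grams.
from collections import Counter

BOS = "<s>"  # placeholder token standing in for positions t <= 0

def trigram_mle(sentences):
    """Estimate P(x | x', x'') as count(x'' x' x) / count(x'' x')."""
    trigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        tokens = [BOS, BOS] + list(sentence)
        for i in range(2, len(tokens)):
            context = (tokens[i - 2], tokens[i - 1])
            trigram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    return {ngram: n / context_counts[ngram[:2]] for ngram, n in trigram_counts.items()}

probs = trigram_mle([["the", "cat", "sat"], ["the", "cat", "slept"]])
print(probs[("the", "cat", "sat")])  # 0.5
```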

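The notebook itself uses Micrograd++ for the gradient-based alternative. As a rough illustration of the same idea without that dependency, the NumPy sketch below performs gradient descent on the cross-entropy of per-context logits; the toy data, learning rate, and iteration count are assumptions for illustration and not the notebook's actual code.

```python
# Hypothetical sketch (not the notebook's Micrograd++ code): learn trigram
# probabilities by gradient descent on the cross-entropy.
import numpy as np

rng = np.random.default_rng(0)
V = 4                                      # assumed toy vocabulary size
T = 200                                    # assumed number of observed trigrams
contexts = rng.integers(0, V * V, size=T)  # each (x_{t-2}, x_{t-1}) flattened to one id
targets = rng.integers(0, V, size=T)       # the corresponding x_t

logits = np.zeros((V * V, V))              # one unnormalized distribution per context
lr = 1.0
for _ in range(500):
    z = logits[contexts]                           # (T, V) copy via fancy indexing
    z -= z.max(axis=1, keepdims=True)              # shift for numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)              # softmax probabilities
    grad = p
    grad[np.arange(T), targets] -= 1.0             # d(cross-entropy) / d(logits)
    np.add.at(logits, contexts, -lr * grad / T)    # descent step, accumulated per context

# softmax(logits[c]) now approximates the empirical frequency of each target
# token under context c, i.e., the counting estimator above.
```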