## 5. History of language modelling

> Explanations, formulas, visualisations:
> - Lena Voita's blog: [Language Modelling](https://lena-voita.github.io/nlp_course/language_modeling.html)
> - Eisenstein 6 (ignore details of NN architectures for now)
> - Jurafsky-Martin [3](https://web.stanford.edu/~jurafsky/slp3/3.pdf)

### Language as a stochastic process

<img src="figures/signal-noise.png" alt="signal-noise" width="700"/>

<img src="figures/text_basics.jpg" alt="text_basics" width="600"/>

One way of modelling language computationally is to view it as a code sequence, that is, a sequence of symbols generated by an encoder in the sense of information theory. By observing these sequences and looking for regularities in them, we model the encoder.

Initially, this approach to language modelling concerned only the form of language, that is, the rules of grammar. A famous example showing the difference between form and content is the following sentence by Noam Chomsky:

> Colorless green ideas sleep furiously.

We judge this sentence as correctly formed, although it does not mean anything. If we stripped off the formal units, the sentence would sound like this:

> Color green idea sleep furious.

Or we could give all the words the same form:

> Colorless greenless idealess sleepless furiousless.

Or we could shuffle the order:

> Sleep colorless furiously ideas green.
> Sleep color furious idea green.
> Sleepless colorless furiousless idealess greenless.

Only the first sentence is grammatical; all the other versions are not. When we say we model the form of language, we make predictions about which combinations of words (or subword units) are grammatical, regardless of whether these sequences have a meaning.

### Statistical or n-gram models

We model the form of sequences by calculating the probability of a sequence.

In statistical modelling, we decompose sequences into n-grams. The probability of the sequence is a product of the n-gram probabilities, which are estimated on a large corpus using n-gram counts (maximum likelihood estimation).

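As an illustration, here is a minimal sketch of maximum likelihood estimation for a bigram model in plain Python; the toy corpus and the helper names (`bigram_prob`, `sequence_prob`) are invented for the example, not part of the course material.

```python
from collections import defaultdict

# Toy corpus; a real model would be estimated on millions of sentences.
corpus = [
    "<s> colorless green ideas sleep furiously </s>",
    "<s> green ideas sleep </s>",
]

# Count bigrams and their histories.
bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[(prev, curr)] += 1
        unigram_counts[prev] += 1

def bigram_prob(prev, curr):
    """Maximum likelihood estimate: count(prev, curr) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / unigram_counts[prev]

def sequence_prob(sentence):
    """P(w_1 ... w_T) approximated as the product of bigram probabilities."""
    tokens = sentence.split()
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, curr)
    return prob

print(sequence_prob("<s> green ideas sleep </s>"))   # observed word order: non-zero
print(sequence_prob("<s> sleep green ideas </s>"))   # shuffled word order: zero under MLE
```

The shuffled variant gets probability zero because its bigrams were never observed, which is exactly the estimation problem that the techniques below address.
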
We typically do not model the subword structure in statistical models; word-level tokens are taken as discrete, atomic units. What n-gram modelling captures is the probability of the joint occurrence of several words in a given order. The order of words is thus the main aspect of grammaticality targeted by statistical models.

Modelling the probability of a sequence in this way was useful for speech-to-text conversion, machine translation and grammar checking.

Estimating n-gram probabilities is harder than it sounds: the longer the n-gram, the harder the estimation. Real statistical models employ various techniques to overcome the estimation problem (smoothing is sketched in code after this list):

- Smoothing: redistributing some probability mass to unobserved items
- Back-off: combining the probabilities of different n-gram orders
- Out-of-vocabulary (OOV) words: no finite word-level vocabulary can cover all the words that appear in new text

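Here is a minimal sketch of add-one (Laplace) smoothing built on top of the bigram counts from the previous sketch; the variable names (`bigram_counts`, `unigram_counts`, `V`) carry over from that example, and the formula is the standard (count + 1) / (history count + V).

```python
# Add-one (Laplace) smoothing on top of the counts from the previous sketch.
# Every bigram, seen or unseen, now receives a small, non-zero probability.
vocabulary = set(unigram_counts) | {curr for (_, curr) in bigram_counts}
V = len(vocabulary)

def smoothed_bigram_prob(prev, curr):
    """(count(prev, curr) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, curr)] + 1) / (unigram_counts[prev] + V)

# The shuffled sentence no longer gets probability zero, just a much lower score.
print(smoothed_bigram_prob("<s>", "sleep"))   # unseen bigram, still > 0
print(smoothed_bigram_prob("<s>", "green"))   # seen bigram, higher probability
```
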
The main limitation is that the decomposition of texts into n-grams is too simplistic: it cannot capture long-distance dependencies.

### Neural language models

Neural networks started being used for estimating the probability of sequences in the early 2000s. At that time, the task was seen exactly as before: scoring the grammaticality of a sequence of words. The first models were recurrent neural networks (RNNs) trained on the task of predicting the next word.

The most important differences between statistical and (early) neural models (a minimal model is sketched after this list):
- Full history: because the hidden recurrent states carry information about the whole history of a sequence, there is no need to decompose it into n-grams
- Representation (embedding): to predict the next word, neural networks need to represent words in a feature space

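For concreteness, here is a minimal sketch of a recurrent next-word predictor, assuming PyTorch; the class name `TinyRNNLM` and the layer sizes are illustrative choices, not a reference implementation.

```python
import torch
import torch.nn as nn

class TinyRNNLM(nn.Module):
    """Minimal recurrent language model: embed tokens, run an RNN, score the next token."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)          # representation (embedding)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)    # hidden state carries the full history
        self.out = nn.Linear(hidden_dim, vocab_size)                  # scores over the vocabulary

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        hidden_states, _ = self.rnn(embedded)     # (batch, seq_len, hidden_dim)
        return self.out(hidden_states)            # next-token logits at each position

# Toy usage: a distribution over the next word after a sequence of 5 token ids.
model = TinyRNNLM(vocab_size=1000)
tokens = torch.randint(0, 1000, (1, 5))
logits = model(tokens)
next_word_probs = torch.softmax(logits[0, -1], dim=-1)
```

Training would pair each position's logits with the actual next token via cross-entropy, which is exactly the self-supervision discussed below.
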
The last point was crucial for generalising the use of language models beyond grammaticality: the representations are the content. So, in order to predict the next word, we actually model the meaning of all the words, not just the form of the sequence.

An important notion for further developments was *self-supervision*: to train a neural language model we do not need human annotation. Words naturally occurring in text are the only information we need. The model tries to predict a word and can then easily check whether it succeeded. **Important**: this does not mean that neural language models are unsupervised; neural networks cannot learn without supervision.

### Masked language modelling (MLM)

The next step towards modern large models was abandoning the idea of predicting the *next* word. Instead, we look at both the left and the right context and predict a *target* word. This is what led to the currently most popular language modelling objective: masked language modelling (MLM). Note that taking both directions into account is only possible in encoding. When we use language models to generate text (decoding), we can only look at the left context.

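To make the objective concrete, here is a minimal sketch using the Hugging Face `transformers` fill-mask pipeline; it assumes the `transformers` library and the `bert-base-uncased` checkpoint are available and is only meant to illustrate the masking idea.

```python
from transformers import pipeline

# A BERT-style model fills in the masked position using both left and right context.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prediction in unmasker("Colorless green ideas [MASK] furiously."):
    print(prediction["token_str"], round(prediction["score"], 3))
```
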
The word2vec set of algorithms can be seen as one of the early implementations of the MLM idea. It is also the first implementation of modelling the meaning of words by means of a language model. An important difference compared to contemporary models is that it models individual words: each word gets a single vector representation, which is, in some sense, the sum of all uses of that word in the training corpus (all contexts are collapsed).

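As an example of this static-embedding setting, here is a minimal word2vec sketch assuming the `gensim` library (4.x API); the toy sentences and parameter values are invented for illustration.

```python
from gensim.models import Word2Vec

# Tiny toy corpus: lists of tokens. Real word2vec models are trained on huge corpora.
sentences = [
    ["colorless", "green", "ideas", "sleep", "furiously"],
    ["green", "ideas", "sleep"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# One static vector per word: all contexts of "green" are collapsed into a single representation.
vector = model.wv["green"]
print(vector.shape)                       # (50,)
print(model.wv.most_similar("green"))     # nearest neighbours in the embedding space
```
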
The next step was modelling subword units (fastText).

BERT-like models are the current technology. They model subword sequences, resulting in contextual, or dynamic, embeddings.

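To illustrate what contextual means, here is a minimal sketch, again assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint; the helper `embedding_of` is invented for the example. The same surface word receives different vectors in different sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    """Return the contextual vector of `word` in `sentence` (assumes `word` is a single WordPiece token)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

# "bank" gets a different vector in each sentence: its embedding is dynamic.
river = embedding_of("She sat on the bank of the river.", "bank")
money = embedding_of("He deposited cash at the bank.", "bank")
print(torch.cosine_similarity(river, money, dim=0))     # below 1.0: the two occurrences differ
```
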
The MLM task is more about meaning than grammaticality:
- focus on the meaning in context
- word2vec: first step, static embeddings
- BERT-like models: current technology behind LLMs, dynamic embeddings

### Large language models (LLMs)

- general encoders
- all trained with MLM (or a similar objective)
- it is still an open question to what degree they model human linguistic competence

### Statistical vs. neural

- statistical models are still used in practice for ASR (automatic speech recognition): fast and well understood
- neural models are used for other text generation tasks: machine translation, summarisation, robot-writers

--------------