Explanations and visualisations:
- Jurafsky-Martin 11
- Lena Voita's blog: Transfer Learning
- Jay Alammar's blog: A Visual Guide to Using BERT for the First Time
Source: http://transfersoflearning.blogspot.com/2010/03/positive-transfer.html
- Trained models could be shared, but they were strictly task-specific
- Models could be trained with joint or multi-task learning: find the parameters that minimise two losses, e.g. for PoS tagging and lemmatisation
- Neural networks could be designed with parameter sharing: some parts of the network are shared between two tasks (their combined loss is used for weight updates), while other parts are task-specific (only one task's loss updates their weights); see the sketch after this list
- Consecutive transfer: first pre-train an LLM, then train the main model with the LLM as input
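A minimal PyTorch sketch of the parameter-sharing idea: a shared encoder feeds two task-specific heads, the shared part is updated by the combined loss, and each head is updated only by its own task's loss. All layer sizes, tag counts and data below are made up for illustration.

```python
import torch
import torch.nn as nn

class SharedTagger(nn.Module):
    """Shared encoder + two task-specific heads (illustrative sizes)."""
    def __init__(self, vocab_size=10_000, emb_dim=128, hidden=256,
                 n_pos_tags=17, n_lemma_classes=500):
        super().__init__()
        # shared part: receives gradients from both task losses
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        # task-specific parts: each receives gradients only from its own loss
        self.pos_head = nn.Linear(hidden, n_pos_tags)
        self.lemma_head = nn.Linear(hidden, n_lemma_classes)

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))
        return self.pos_head(states), self.lemma_head(states)

model = SharedTagger()
loss_fn = nn.CrossEntropyLoss()
optim = torch.optim.Adam(model.parameters())

# one toy training step with random data (batch of 2, sequence length 5)
tokens = torch.randint(0, 10_000, (2, 5))
pos_gold = torch.randint(0, 17, (2, 5))
lemma_gold = torch.randint(0, 500, (2, 5))

optim.zero_grad()
pos_logits, lemma_logits = model(tokens)
# joint objective: minimise the sum of the two task losses
loss = (loss_fn(pos_logits.reshape(-1, 17), pos_gold.reshape(-1))
        + loss_fn(lemma_logits.reshape(-1, 500), lemma_gold.reshape(-1)))
loss.backward()
optim.step()
```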
- An LLM is trained on raw text with self-supervised learning
- The main model is trained (most of the time) on labelled data with supervised learning
- How the pre-trained weights are passed on depends on whether the main task is at the token or sentence level
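As a small illustration (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint), the same forward pass gives per-token vectors for token-level tasks and a single pooled vector for sentence-level tasks:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer("Transfer learning is everywhere.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

# token-level task (e.g. PoS tagging): one vector per (sub)word position
token_vectors = outputs.last_hidden_state          # shape (1, seq_len, 768)

# sentence-level task (e.g. sentiment): one vector, here the [CLS] position
sentence_vector = outputs.last_hidden_state[:, 0]  # shape (1, 768)
```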
- Fine-tuning: This term is typically used in cases where the learning objective (task) used for pre-training the LLM is different from the objective (task) on which the main model is trained. For example, a typical task for pre-training LLMs is masked language modelling (MLM), while the main model is trained for text classification or sentiment analysis. All weights are updated with the main model's loss.
- Continued training: This term is used in cases where a pre-trained LLM is used to improve text representation on a new domain or a new language. In this case, the learning objective is the same for the pre-trained LLM and the main LM (e.g. we use MLM in both cases), but pre-training and main training are done on different data sets. All weights are updated with the main model's loss.
- One- and few-shot learning, also known as meta-learning: a pre-trained LLM performs the main task without updating its parameters, learning to classify new examples by relying on a similarity function. This terminology is very new and not yet summarised in a textbook; an overview of terms and references can be found in Timo Schick's PhD thesis.
- Adaptation: the pre-trained weights are kept fixed; only a small component is trained to use the pre-trained weights for a new task
- Zero-shot classification: an unsupervised setting without updating the LLM's parameters (a sketch contrasting fine-tuning, adaptation, and zero-shot use follows this list)
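A rough sketch contrasting these settings, again assuming the Hugging Face transformers library; the model name and label set are only examples, and the zero-shot pipeline downloads its own default model.

```python
from transformers import AutoModelForSequenceClassification, pipeline

# Fine-tuning / continued training: load the pre-trained weights, add a
# classification head, and let the main model's loss update ALL weights.
clf = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
# ... pass clf to a normal supervised training loop (e.g. transformers.Trainer)

# Adaptation: keep the pre-trained weights fixed and train only a small
# component on top (here: the classification head).
for param in clf.base_model.parameters():
    param.requires_grad = False   # the pre-trained encoder is frozen
# only the classification head (clf.classifier in this architecture) trains

# Zero-shot classification: no parameter updates at all; the pre-trained
# model is used as-is with a set of candidate labels.
zero_shot = pipeline("zero-shot-classification")
print(zero_shot("The plot was dull and the acting wooden.",
                candidate_labels=["positive", "negative"]))
```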
Source: https://www.interconnects.ai/p/llm-development-paths
- model architecture types:
- only the encoder part of Transformers (e.g. BERT, RoBERTa)
- only the decoder part of Transformers (e.g. GPT)
- full encoder-decoder Transformers (e.g. T5)
- training objective:
- masked language modelling (e.g. BERT, RoBERTa); a fill-mask sketch follows this list
- discriminating between alternative fillings of slots (e.g. ELECTRA)
- reconstructing the input after permutations (e.g. XLNet)
- model size:
- several models are trained using fewer parameters (e.g. DistilBERT)
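To make the masked-language-modelling objective concrete, here is a small fill-mask example (assuming the Hugging Face transformers library; the sentence is arbitrary):

```python
from transformers import pipeline

# Masked language modelling as used to pre-train BERT-style encoders:
# the model predicts the token hidden behind the [MASK] placeholder.
unmasker = pipeline("fill-mask", model="bert-base-cased")
for prediction in unmasker("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))
```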
- English (Wikipedia and BooksCorpus): bert-base, cased and uncased
- French: FlauBERT (BERT), CamemBERT (RoBERTa), variants
- Bosnian, Croatian, Montenegrin, Serbian: BERTić (ELECTRA), trained on almost all the texts available online
- many, many more!
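All of these models are used through the same interface; a minimal loading sketch, assuming the Hugging Face transformers library and that the Hub identifiers below are still the current ones:

```python
from transformers import AutoTokenizer, AutoModel

for name in ["bert-base-cased",        # English BERT
             "camembert-base",         # French RoBERTa-style model
             "classla/bcms-bertic"]:   # BERTić (ELECTRA); identifier assumed
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    print(name, model.config.model_type)
```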