A chronological catalogue of Transformer-based models, with links to the original papers, short notes on each model and, where available, reference implementations.
Year | Name | Paper | Info | Implementation |
---|---|---|---|---|
2017 | Transformer | Attention Is All You Need | The focus of the original research was on translation tasks. | TensorFlow + article |
2018 | GPT | Improving Language Understanding by Generative Pre-Training | The first pretrained Transformer model, fine-tuned on various NLP tasks to obtain state-of-the-art results | |
2018 | BERT | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Another large pretrained model, this one designed to produce better bidirectional representations of sentences | PyTorch |
2019 | GPT-2 | Language Models are Unsupervised Multitask Learners | An improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns | |
2019 | DistilBERT - Distilled BERT | DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | A distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT’s performance | |
2019 | BART | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension | Large pretrained encoder-decoder model, pretrained as a denoising autoencoder, using the same architecture as the original Transformer model. | |
2019 | T5 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | Large pretrained encoder-decoder model that casts every NLP task into a text-to-text format, using the same architecture as the original Transformer model. | |
2019 | ALBERT | ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | ||
2019 | RoBERTa - A Robustly Optimized BERT Pretraining Approach | RoBERTa: A Robustly Optimized BERT Pretraining Approach | ||
2019 | CTRL | CTRL: A Conditional Transformer Language Model for Controllable Generation | ||
2019 | Transformer XL | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Adds segment-level recurrence over past hidden states combined with relative positional encoding, enabling longer-term dependencies | |
2019 | DialoGPT | DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation | Trained on 147M conversation-like exchanges extracted from Reddit comment chains spanning 2005 through 2017 | PyTorch |
2019 | ERNIE | ERNIE: Enhanced Language Representation with Informative Entities | Uses both large-scale textual corpora and knowledge graphs (KGs) to train an enhanced language representation model that exploits lexical, syntactic and knowledge information simultaneously. | |
2020 | GPT-3 | Language Models are Few-Shot Learners | An even bigger version of GPT-2 that performs well on a variety of tasks without fine-tuning (zero-shot and few-shot learning) | |
2020 | ELECTRA | ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators | | |
2020 | mBART | Multilingual Denoising Pre-training for Neural Machine Translation | ||
2021 | CLIP (Contrastive Language-Image Pre-Training) | Learning Transferable Visual Models From Natural Language Supervision | CLIP is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. | PyTorch |
2021 | DALL-E | Zero-Shot Text-to-Image Generation | | PyTorch |
2021 | Gopher | Scaling Language Models: Methods, Analysis & Insights from Training Gopher | ||
2021 | Decision Transformer | Decision Transformer: Reinforcement Learning via Sequence Modeling | An architecture that casts the problem of RL as conditional sequence modeling. | PyTorch |
2021 | GLaM (Generalist Language Model) | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | A family of language models that uses a sparsely activated mixture-of-experts architecture to scale model capacity while incurring substantially less training cost than dense variants. | |
2022 | ChatGPT / InstructGPT | Training language models to follow instructions with human feedback | A language model much better at following user intentions than GPT-3. It is fine-tuned with Reinforcement Learning from Human Feedback (RLHF) to achieve conversational dialogue, using human-written demonstrations and preference data so that responses sound more human-like. | |
2022 | Chinchilla | Training Compute-Optimal Large Language Models | Uses the same compute budget as Gopher but with 70B parameters and 4x more data. | |
2022 | LaMDA - Language Models for Dialog Applications | LaMDA: Language Models for Dialog Applications | A family of Transformer-based neural language models specialized for dialog | |
2022 | DQ-BART | DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization | Jointly distills and quantizes BART, transferring knowledge from the full-precision teacher model to a quantized, distilled low-precision student model. | |
2022 | Flamingo | Flamingo: a Visual Language Model for Few-Shot Learning | A family of Visual Language Models (VLMs) that can be rapidly adapted to novel multimodal tasks using only a handful of annotated examples. | |
2022 | Gato | A Generalist Agent | A single generalist agent beyond the realm of text outputs: Gato works as a multi-modal, multi-task, multi-embodiment generalist policy. | |
2022 | GODEL | GODEL: Large-Scale Pre-Training for Goal-Directed Dialog | In contrast with earlier models such as DialoGPT, GODEL adds a phase of grounded pre-training designed to better support adapting it to a wide range of downstream dialog tasks that require information external to the current conversation (e.g., a database or document) to produce good responses. | PyTorch |
2023 | GPT-4 | GPT-4 Technical Report | The model accepts multimodal inputs: images and text | |
2023 | BloombergGPT | BloombergGPT: A Large Language Model for Finance | An LLM specialised in the financial domain, trained on Bloomberg's extensive data sources | |
2023 | BLOOM | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | BLOOM (BigScience Large Open-science Open-access Multilingual Language Model) is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total) | |
2023 | Llama 2 | Llama 2: Open Foundation and Fine-Tuned Chat Models | | PyTorch #1 PyTorch #2 |
2023 | Claude | Claude | Claude can analyze 75k words (100k tokens); GPT-4 handles at most 32.7k tokens. | |
2023 | SelfCheckGPT | SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models | A simple sampling-based approach that can be used to fact-check black-box models in a zero-resource fashion, i.e. without an external database. | |
Name | Size (# Parameters) | Training Tokens | Training data |
---|---|---|---|
GLaM | 1.2T | ||
Gopher | 280B | 300B | |
BLOOM | 176B | ROOTS corpus | |
GPT-3 | 175B | ||
LaMDA | 137B | 168B | 1.56T words of public dialog data and web text |
Chinchilla | 70B | 1.4T | |
Llama 2 | 7B, 13B and 70B | ||
BloombergGPT | 50B | 363B (financial) + 345B (general-purpose) | |
Falcon40B | 40B | 1T | 1,000B tokens of RefinedWeb |
- M = million | B = billion | T = trillion
- ALBERT | Alpaca
- BART | BERT | Big Bird | BLOOM
- Chinchilla | CLIP | CTRL | chatGPT | Claude
- DALL-E | DALL-E-2 | Decision Transformers | DialoGPT | DistilBERT | DQ-BART
- ELECTRA | ERNIE
- Flamingo | Falcon40B
- Gato | Gopher | GLaM | GLIDE | GPT | GPT-2 | GPT-3 | GPT-4 | GPT-Neo | GODEL | GPT-J
- Imagen | InstructGPT
- Jurassic-1
- LaMDA | Llama 2
- mBART | Megatron | Minerva | MT-NLG
- OPT
- PaLM | Pegasus
- RoBERTa
- SeeKer | Swin Transformer | Switch | SelfCheckGPT
- Transformer | T5 | Trajectory Transformers | Transformer XL | Turing-NLG
- ViT
- Wu Dao 2.0
- XLM-RoBERTa | XLNet
Architecture | Models | Tasks |
---|---|---|
Encoder-only, also called auto-encoding Transformer models | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa | Sentence classification, named entity recognition, extractive question answering |
Decoder-only, aka auto-regressive (or causal) Transformer models | CTRL, GPT, GPT-2, Transformer XL | Text generation given a prompt |
Encoder-Decoder, aka sequence-to-sequence Transformer models | BART, T5, Marian, mBART | Summarisation, translation, generative question answering |
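As a quick illustration of the three families above, here is a minimal sketch using the Hugging Face `pipeline` API; the checkpoint names (`distilbert-base-uncased-finetuned-sst-2-english`, `gpt2`, `facebook/bart-large-cnn`) are common public models chosen purely for the example.

```python
# Minimal sketch: one pipeline per architecture family (assumes `transformers` is
# installed and the example checkpoints below are available on the Hugging Face Hub).
from transformers import pipeline

# Encoder-only (auto-encoding): sentence classification
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("Transformers make transfer learning easy."))

# Decoder-only (auto-regressive / causal): text generation given a prompt
generator = pipeline("text-generation", model="gpt2")
print(generator("The Transformer architecture", max_new_tokens=20))

# Encoder-decoder (sequence-to-sequence): summarisation
summariser = pipeline("summarization", model="facebook/bart-large-cnn")
text = ("The Transformer is a neural network architecture based entirely on attention, "
        "dispensing with recurrence and convolutions. Introduced in 2017 for machine "
        "translation, it now underpins most large language models.")
print(summariser(text, max_length=30, min_length=10))
```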
- Hugging Face is a popular NLP library that also offers an easy way to deploy models via its Inference API. When you build a model with the Hugging Face library, you can train it and upload it to the Model Hub. Read more about this here. A minimal example of querying the Inference API is sketched below.
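A hedged sketch of calling the hosted Inference API over plain HTTP; the model id and the `HF_API_TOKEN` environment variable are placeholders to replace with your own.

```python
# Sketch of querying the Hugging Face Inference API; requires a Hub access token.
import os
import requests

MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"  # example model, swap as needed
API_URL = f"https://api-inference.huggingface.co/models/{MODEL_ID}"
headers = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}  # your Hub access token

response = requests.post(API_URL, headers=headers,
                         json={"inputs": "I love this library!"})
print(response.json())  # e.g. a list of label/score pairs for this classifier
```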
- List of notebooks
- 2014 | Neural Machine Translation by Jointly Learning to Align and Translate
- 2022 | A Survey on GPT-3
- 2022 | Efficiently Scaling Transformer Inference
- Must-Read Papers on Pre-trained Language Models (PLMs)
- Building a synth with ChatGPT
- PubMed GPT: a Domain-Specific Large Language Model for Biomedical Text
- ChatGPT - Where it lacks
- Awesome ChatGPT Prompts
- ChatGPT vs. GPT3: The Ultimate Comparison
- Prompt Engineering 101: Introduction and resources
- Transformer models: an introduction and catalog — 2022 Edition
- Can GPT-3 or BERT Ever Understand Language?—The Limits of Deep Learning Language Models
- 10 Things You Need to Know About BERT and the Transformer Architecture That Are Reshaping the AI Landscape
- Comprehensive Guide to Transformers
- Unmasking BERT: The Key to Transformer Model Performance
- Transformer NLP Models (Meena and LaMDA): Are They “Sentient” and What Does It Mean for Open-Domain Chatbots?
- Hugging Face Pre-trained Models: Find the Best One for Your Task
- Large Transformer Model Inference Optimization
- 4-part tutorial on how transformers work: Part 1 | Part 2 | Part 3 | Part 4
- What Makes a Dialog Agent Useful?
- Understanding Large Language Models -- A Transformative Reading List
- Prompt Engineering
- Building LLM applications for production
- Developer's Guide To LLMOps: Prompt Engineering, LLM Agents, and Observability
- Argument for using RL LLMs
- Why Google and OpenAI are losing against the open-source communities
- You probably don't know how to do Prompt Engineering!
- The Full Story of Large Language Models and RLHF
- Understanding OpenAI's Evals
- What We Know About LLMs (Primer)
- F**k You, Show Me The Prompt.
- Building a search engine with a pre-trained BERT model
- Fine tuning pre-trained BERT model on Text Classification Task
- Fine tuning pre-trained BERT model on the Amazon product review dataset
- Sentiment analysis with Hugging Face transformer
- Fine tuning pre-trained BERT model on YELP review Classification Task
- HuggingFace API
- HuggingFace mask filling
- HuggingFace NER (named entity recognition)
- HuggingFace question answering within context
- HuggingFace text generation
- HuggingFace text summarisation
- HuggingFace zero-shot learning
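In the spirit of the fine-tuning notebooks above, here is a minimal `Trainer`-based sketch; the checkpoint, the `yelp_polarity` dataset and the hyperparameters are illustrative assumptions rather than the notebooks' exact setup.

```python
# Minimal fine-tuning sketch with the Hugging Face Trainer (assumes `transformers`
# and `datasets` are installed; checkpoint, dataset and hyperparameters are examples).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

dataset = load_dataset("yelp_polarity")  # binary sentiment reviews, a stand-in for the Yelp/Amazon data

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="bert-finetuned-reviews",
                         per_device_train_batch_size=16,
                         num_train_epochs=1)

trainer = Trainer(model=model,
                  args=args,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
                  eval_dataset=tokenized["test"].select(range(500)))
trainer.train()
```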
- Two notebooks are available for each topic:
    - One with coloured boxes, located outside the folder `GitHub_MD_rendering`
    - One in black-and-white, located under the folder `GitHub_MD_rendering`
- The easiest option would be for you to clone this repository.
- Navigate to Google Colab and open the notebook directly from Colab.
- You can then also write it back to GitHub provided permission to Colab is granted. The whole procedure is automated.
- How to Code BERT Using PyTorch
- miniGPT in PyTorch
- nanoGPT in PyTorch
- TensorFlow implementation of Attention Is All You Need + article (see the attention sketch below)
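For reference alongside the implementations linked above, here is a small self-contained PyTorch sketch of the scaled dot-product attention at the core of Attention Is All You Need; it is a teaching aid, not the linked repositories' code.

```python
# Scaled dot-product attention: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: tensors of shape (batch, heads, seq_len, d_k)."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # pairwise similarity
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))       # hide disallowed positions
    weights = torch.softmax(scores, dim=-1)                         # attention distribution
    return torch.matmul(weights, v)                                 # weighted sum of values

# Tiny smoke test with random tensors
q = k = v = torch.randn(1, 2, 4, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 2, 4, 8])
```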