- T5
  - Paper
  - Architecture
    - Encoder-Decoder
- GPT
- GPT-Neo
- GPT-J-6B
- Megatron-11B
- PanGu-α-13B
- FairSeq
- GLaM
- LaMDA
- Jurassic-1
- MT-NLG
- ERNIE
- Gopher
  - Paper
  - Conclusion
    - Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit.
- Chinchilla
  - Paper
  - Conclusion
    - We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant.
    - We find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. (A minimal sketch of this scaling rule follows the list.)
- PaLM
  - Paper
  - Architecture
    - Decoder
- PaLM 2
- OPT
  - Paper
  - Architecture
    - Decoder
- GPT-NeoX
- BLOOM
  - Paper
  - Architecture
    - Decoder
- LLaMA
- GLM
- BloombergGPT
- MOSS
- OpenLLaMA: An Open Reproduction of LLaMA
- Dolly
- Panda
- WeLM
- Baichuan
- Llama 2
- Qwen
- Chameleon
- Mixtral
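
The Chinchilla conclusion quoted above says that model size and the number of training tokens should be scaled in equal proportion. The snippet below is a minimal sketch of that rule, assuming the common training-compute approximation C ≈ 6 · N · D (N parameters, D tokens) and a rough 20-tokens-per-parameter ratio implied by the paper's 70B-parameter, 1.4T-token configuration; the function names and constants are illustrative, not taken from the paper.

```python
# Minimal sketch of the Chinchilla compute-optimal scaling rule quoted above.
# Assumptions (illustrative, not from this list): training FLOPs C ~= 6 * N * D,
# and roughly 20 training tokens per parameter at the compute-optimal point.

def compute_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training-token count for a given model size."""
    return n_params * tokens_per_param


def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training-compute estimate, C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


if __name__ == "__main__":
    n = 70e9                           # Chinchilla-scale model: 70B parameters
    d = compute_optimal_tokens(n)      # ~1.4e12 tokens
    print(f"params={n:.1e}  tokens={d:.1e}  flops={training_flops(n, d):.2e}")
    # Doubling the parameter count doubles the compute-optimal token budget too.
    print(f"2x params -> tokens={compute_optimal_tokens(2 * n):.1e}")
```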