A repository for organizing datasets used to train open LLMs.
To download or find information about the most commonly used datasets, see https://huggingface.co/datasets (a minimal loading example follows the list below).
- falcon-refinedweb
- ultraChat
- ShareGPT_Vicuna_unfiltered
- pku-saferlhf-dataset
- RefGPT-Dataset
- Luotuo-QA-A-CoQA-Chinese
- Wizard-LM-Chinese-instruct-evol
- alpaca_chinese_dataset
- Zhihu-KOL
- Alpaca-GPT-4_zh-cn
- Baize Dataset
- h2oai/h2ogpt-fortune2000-personalized
- SHP
- ELI5
- evol_instruct_70k
- MOSS SFT data
- ShareGPT52K
- GPT-4all Dataset
- COIG
- RedPajama-Data-1T
- OpenAssistant Conversations Dataset (OASST1)
- Alpaca-COT
- CBook-150K
- databricks-dolly-15k (possible zh-cn version)
- AlpacaDataCleaned
- GPT-4-LLM Dataset
- GPTeacher
- HC3
- Alpaca data Download
- OIG OIG-small-chip2
- ChatAlpaca data
- InstructionWild
- Firefly(流萤)
- BELLE (0.5M, 1M, and 2M versions)
- GuanacoDataset
- xP3 (and some variants)
- OpenAI WebGPT
- OpenAI Summarization Comparison
- Natural Instructions (GitHub & download)
- hh-rlhf on Huggingface
- OpenAI PRM800k
- falcon-refinedweb
- Common Crawl
- nlp_Chinese_Corpus
- The Pile (V1)
- Huggingface dataset for C4
- TensorFlow dataset for C4
- ROOTS
- Pushshift Reddit
- Gutenberg project
- CLUECorpus
- ChatGPT-Jailbreak-Prompts
- awesome-chinese-legal-resources
- Long Form
- symbolic-instruction-tuning
- Safety Prompt
- Tapir-Cleaned
- instructional_codesearchnet_python
- finance-alpaca
- WebText(Reddit links) - Private Dataset
- MassiveText - Private Dataset
- Korean-Open-LLM-Datasets
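Most of the public datasets above are hosted on the Hugging Face Hub and can be pulled with the `datasets` library. A minimal sketch, assuming the `Anthropic/hh-rlhf` and `tiiuae/falcon-refinedweb` repository ids for the corresponding entries above; any other listed dataset follows the same pattern:

```python
# Minimal sketch: loading listed datasets from the Hugging Face Hub.
# The repository ids below are assumptions for the hh-rlhf and falcon-refinedweb
# entries above; substitute the id of any other dataset in this list.
from datasets import load_dataset

# Download (or reuse the local cache) and load the train split.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")
print(dataset.column_names)   # inspect the schema
print(dataset[0])             # inspect a single record

# For very large pretraining corpora, stream instead of downloading everything.
streamed = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
print(next(iter(streamed)))
```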
| | OIG | hh-rlhf | xP3 | Natural instruct | AlpacaDataCleaned | GPT-4-LLM | Alpaca-CoT |
|---|---|---|---|---|---|---|---|
| OIG | - | Contains | Overlap | Overlap | Overlap | | Overlap |
| hh-rlhf | Part of | - | | | | | Overlap |
| xP3 | Overlap | | - | Overlap | | | Overlap |
| Natural instruct | Overlap | | Overlap | - | | | Overlap |
| AlpacaDataCleaned | Overlap | | | | - | Overlap | Overlap |
| GPT-4-LLM | | | | | Overlap | - | Overlap |
| Alpaca-CoT | Overlap | Overlap | Overlap | Overlap | Overlap | Overlap | - |
- Attention Is All You Need
- Improving Language Understanding by Generative Pre-Training
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Language Models are Unsupervised Multitask Learners
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Scaling Laws for Neural Language Models
- Language Models are Few-Shot Learners
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Evaluating Large Language Models Trained on Code
- On the Opportunities and Risks of Foundation Models
- Finetuned Language Models are Zero-Shot Learners
- Multitask Prompted Training Enables Zero-Shot Task Generalization
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- WebGPT: Improving the Factual Accuracy of Language Models through Web Browsing
- Improving language models by retrieving from trillions of tokens
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- LaMDA: Language Models for Dialog Applications
- Solving Quantitative Reasoning Problems with Language Models
- Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
- Training language models to follow instructions with human feedback
- PaLM: Scaling Language Modeling with Pathways
- An empirical analysis of compute-optimal large language model training
- OPT: Open Pre-trained Transformer Language Models
- Unifying Language Learning Paradigms
- Emergent Abilities of Large Language Models
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
- Language Models are General-Purpose Interfaces
- Improving alignment of dialogue agents via targeted human judgements
- Scaling Instruction-Finetuned Language Models
- GLM-130B: An Open Bilingual Pre-trained Model
- Holistic Evaluation of Language Models
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
- Galactica: A Large Language Model for Science
- OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
- The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
- LLaMA: Open and Efficient Foundation Language Models
- Language Is Not All You Need: Aligning Perception with Language Models
- PaLM-E: An Embodied Multimodal Language Model
- GPT-4 Technical Report
- Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
- Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
- PaLM 2 Technical Report
- RWKV: Reinventing RNNs for the Transformer Era
- Let’s Verify Step by Step - OpenAI
- Switch Transformer: Paper
- GLaM: Paper
- PaLM: Paper
- MT-NLG: Paper
- J1-Jumbo: api, Paper
- OPT: api, ckpt, Paper, OPT-175B License Agreement
- BLOOM: api, ckpt, Paper, BigScience RAIL License v1.0
- GPT 3.0: api, Paper
- LaMDA: Paper
- GLM: ckpt, Paper, The GLM-130B License
- YaLM: ckpt, Blog, Apache 2.0 License
- LLaMA: ckpt, Paper, Non-commercial bespoke license
- GPT-NeoX: ckpt, Paper, Apache 2.0 License
- UL2: ckpt, Paper, Apache 2.0 License
- T5: ckpt, Paper, Apache 2.0 License
- CPM-Bee: api, Paper
- rwkv-4: ckpt, Github, Apache 2.0 License
- GPT-J: ckpt, Github, Apache 2.0 License
- GPT-Neo: ckpt, Github, MIT License
- Flan-PaLM: Link
- BLOOMZ: Link
- InstructGPT: Link
- Galactica: Link
- OpenChatKit: Link
- Flan-UL2: Link
- Flan-T5: Link
- T0: Link
- Alpaca: Link
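Many of the checkpoints linked above can be loaded directly with the `transformers` library. A minimal sketch, using `google/flan-t5-base` purely as a small, openly licensed example of the Flan-T5 entry above; other checkpoints work the same way with the matching Auto class:

```python
# Minimal sketch: running an instruction-tuned open checkpoint from the list above.
# "google/flan-t5-base" is used only as a small, openly licensed example.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

prompt = "Translate English to German: How old are you?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Decoder-only checkpoints (e.g. GPT-J, GPT-NeoX, LLaMA-family models) use `AutoModelForCausalLM` instead.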
- Visualization of the Open LLM Leaderboard: https://github.com/dsdanielpark/Open-LLM-Leaderboard-Report
- Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- LLaMA - A foundational, 65-billion-parameter large language model.
- Alpaca - A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations.
- Flan-Alpaca - Instruction Tuning from Humans and Machines.
- Baize - Baize is an open-source chat model trained with LoRA.
- Cabrita - A Portuguese instruction-finetuned LLaMA.
- Vicuna - An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality.
- Llama-X - Open Academic Research on Improving LLaMA to SOTA LLM.
- Chinese-Vicuna - A Chinese Instruction-following LLaMA-based Model.
- GPTQ-for-LLaMA - 4-bit quantization of LLaMA using GPTQ.
- GPT4All - Demo, data, and code to train an open-source assistant-style large language model based on GPT-J and LLaMA.
- Koala - A Dialogue Model for Academic Research.
- BELLE - Be Everyone's Large Language model Engine.
- StackLLaMA - A hands-on guide to train LLaMA with RLHF.
- RedPajama - An open-source recipe to reproduce the LLaMA training dataset.
- Chimera - Latin Phoenix.
- CaMA - a Chinese-English Bilingual LLaMA Model.
- BLOOM - BigScience Large Open-science Open-access Multilingual Language Model.
- BLOOMZ & mT0 - a family of models capable of following human instructions in dozens of languages zero-shot.
- Phoenix
- T5 - Text-to-Text Transfer Transformer.
- T0 - Multitask Prompted Training Enables Zero-Shot Task Generalization.
- OPT - Open Pre-trained Transformer Language Models.
- UL2 - a unified framework for pretraining models that are universally effective across datasets and setups.
- GLM - a General Language Model pretrained with an autoregressive blank-filling objective, which can be finetuned on various natural language understanding and generation tasks.
- ChatGLM-6B - an open-source dialogue language model supporting Chinese and English, based on the General Language Model (GLM) architecture.
- RWKV - Parallelizable RNN with Transformer-level LLM Performance.
- ChatRWKV - like ChatGPT, but powered by the RWKV (100% RNN) language model.
- StableLM - Stability AI Language Models.
- YaLM - a GPT-like neural network for generating and processing text.
- GPT-Neo - An implementation of model & data parallel GPT3-like models.
- GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile.
- Dolly - a cheap-to-build LLM that exhibits a surprising degree of the instruction-following capability seen in ChatGPT.
- Pythia - Interpreting Autoregressive Transformers Across Time and Scale.
- Dolly 2.0 - the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
- OpenFlamingo - an open-source reproduction of DeepMind's Flamingo model.
- Cerebras-GPT - A Family of Open, Compute-efficient, Large Language Models.
- GALACTICA - The GALACTICA models are trained on a large-scale scientific corpus.
- GALPACA - GALACTICA 30B fine-tuned on the Alpaca dataset.
- Palmyra - Palmyra Base was primarily pre-trained with English text.
- Camel - a state-of-the-art instruction-following large language model.
- h2oGPT
- PanGu-α - PanGu-α is a 200B parameter autoregressive pretrained Chinese language model.
- MOSS - MOSS is an open-source dialogue language model that supports Chinese and English.
- Open-Assistant - a project meant to give everyone access to a great chat-based large language model.
- HuggingChat - powered by Open Assistant's latest model, one of the best open-source chat models currently available, and the Hugging Face Inference API.
- StarCoder - Hugging Face LLM for Code
- MPT-7B - Open LLM for commercial use by MosaicML
- Serving OPT-175B, BLOOM-176B and CodeGen-16B using Alpa
- Alpa
- Megatron-LM GPT2 tutorial
- DeepSpeed Chat
- pretrain_gpt3_175B.sh
- Megatron-LM
- deepspeed.ai
- Github repo
- Colossal-AI
- An open-source solution that replicates the ChatGPT training pipeline, runnable with as little as 1.6 GB of GPU memory and offering up to 7.73x faster training.
- BMTrain
- Mesh TensorFlow (mtf)
- This tutorial
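Most of these frameworks wrap a standard PyTorch model with a small amount of engine code plus a JSON/YAML config. A minimal sketch of the DeepSpeed pattern with an illustrative ZeRO stage-2 config; the base model id, batch size, and learning rate are placeholder assumptions:

```python
# Minimal sketch of the DeepSpeed training pattern with a ZeRO stage-2 config.
# The base model id, batch size, and learning rate are placeholder assumptions.
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # shard optimizer states and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

# deepspeed.initialize wraps the model in an engine that handles ZeRO sharding,
# mixed precision, and gradient accumulation.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# In the training loop, the engine replaces loss.backward() / optimizer.step():
#   loss = engine(input_ids=batch, labels=batch).loss
#   engine.backward(loss)
#   engine.step()
```

The same script is typically launched with the `deepspeed` CLI, which handles placing one process per GPU.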
- github: https://github.com/huggingface/peft
- abstract: Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters. Fine-tuning large-scale PLMs is often prohibitively costly; PEFT methods instead fine-tune only a small number of (extra) model parameters, greatly decreasing computational and storage costs, and recent state-of-the-art PEFT techniques achieve performance comparable to full fine-tuning. PEFT is seamlessly integrated with Hugging Face Accelerate for large-scale models, leveraging DeepSpeed and Big Model Inference.
- Supported methods (a minimal LoRA sketch follows this list):
- LoRA: LoRA: Low-Rank Adaptation of Large Language Models
- Prefix Tuning: Prefix-Tuning: Optimizing Continuous Prompts for Generation, P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
- P-Tuning: GPT Understands, Too
- Prompt Tuning: The Power of Scale for Parameter-Efficient Prompt Tuning
- AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
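A minimal sketch of the PEFT workflow using LoRA; the base model id and the LoRA hyperparameters below are illustrative assumptions, not recommendations:

```python
# Minimal sketch: attaching LoRA adapters to a causal LM with the peft library.
# The base model id and LoRA hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_model_id = "bigscience/bloomz-560m"  # assumed small base model for demonstration
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(base_model_id)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # causal language modeling
    r=8,                           # rank of the low-rank update matrices
    lora_alpha=32,                 # scaling factor applied to the update
    lora_dropout=0.05,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# The wrapped model trains with a normal Trainer or training loop, and the
# adapter alone can be saved with model.save_pretrained("lora-adapter").
```

The other supported methods follow the same pattern, substituting their respective config classes (e.g. prefix tuning or prompt tuning configs) for `LoraConfig`.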
- [Andrej Karpathy] State of GPT video
- [Hyung Won Chung] Instruction finetuning and RLHF lecture Youtube
- [Jason Wei] Scaling, emergence, and reasoning in large language models Slides
- [Susan Zhang] Open Pretrained Transformers Youtube
- [Ameet Deshpande] How Does ChatGPT Work? Slides
- [Yao Fu] The Source of the Capability of Large Language Models: Pretraining, Instructional Fine-tuning, Alignment, and Specialization Bilibili
- [Hung-yi Lee] ChatGPT: Analyzing the Principle Youtube
- [Jay Mody] GPT in 60 Lines of NumPy Link
- [ICML 2022] Welcome to the "Big Model" Era: Techniques and Systems to Train and Serve Bigger Models Link
- [NeurIPS 2022] Foundational Robustness of Foundation Models Link
- [Andrej Karpathy] Let's build GPT: from scratch, in code, spelled out. Video|Code
- [DAIR.AI] Prompt Engineering Guide Link
- [Philipp Schmid] Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers Link
- [HuggingFace] Illustrating Reinforcement Learning from Human Feedback (RLHF) Link
- [HuggingFace] What Makes a Dialog Agent Useful? Link
- [HeptaAI] ChatGPT Kernel: InstructGPT, PPO Reinforcement Learning Based on Feedback Instructions Link
- [Yao Fu] How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources Link
- [Stephen Wolfram] What Is ChatGPT Doing ... and Why Does It Work? Link
- [Jingfeng Yang] Why did all of the public reproduction of GPT-3 fail? Link
- [Hung-yi Lee] How ChatGPT Was (Possibly) Made - The Socialization Process of GPT Video
- [OpenAI] Improving mathematical reasoning with process supervision
- [DeepLearning.AI] ChatGPT Prompt Engineering for Developers Homepage
- [Princeton] Understanding Large Language Models Homepage
- [Stanford] CS224N-Lecture 11: Prompting, Instruction Finetuning, and RLHF Slides
- [Stanford] CS324-Large Language Models Homepage
- [Stanford] CS25-Transformers United V2 Homepage
- [Stanford Webinar] GPT-3 & Beyond Video
- [MIT] Introduction to Data-Centric AI Homepage
- Google "We Have No Moat, And Neither Does OpenAI" [2023-05-05]
- AI competition statement [2023-04-20] [petergabriel]
- Noam Chomsky: The False Promise of ChatGPT [2023-03-08][Noam Chomsky]
- Is ChatGPT 175 Billion Parameters? Technical Analysis [2023-03-04][Owen]
- The Next Generation Of Large Language Models [2023-02-07][Forbes]
- Large Language Model Training in 2023 [2023-02-03][Cem Dilmegani]
- What Are Large Language Models Used For? [2023-01-26][NVIDIA]
- Large Language Models: A New Moore's Law [2021-10-26][Huggingface]
- LLMsPracticalGuide
- Awesome ChatGPT Prompts
- awesome-chatgpt-prompts-zh
- Awesome ChatGPT
- Chain-of-Thoughts Papers
- Instruction-Tuning-Papers
- LLM Reading List
- Reasoning using Language Models
- Chain-of-Thought Hub
- Awesome GPT
- Awesome GPT-3
- Awesome LLM Human Preference Datasets
- RWKV-howto
- Amazing-Bard-Prompts
- Arize-Phoenix
- Emergent Mind
- ShareGPT
- Major LLMs + Data Availability
- 500+ Best AI Tools
- Cohere Summarize Beta
- chatgpt-wrapper
- Open-evals
- Cursor
- AutoGPT
- OpenAGI
- HuggingGPT
Since this repository focuses on collecting datasets for LLMs, contributions are welcome; feel free to add datasets in any form you prefer.
[1] https://github.com/KennethanCeyer/awesome-llm
[2] https://github.com/Hannibal046/Awesome-LLM
[3] https://github.com/Zjh-819/LLMDataHub
[4] https://huggingface.co/datasets