A Comprehensive Survey of Small Language Models: Technology, On-Device Applications, Efficiency, Enhancements for LLMs, and Trustworthiness
This repo includes the papers discussed in our latest survey on small language models.
📖 Read the full paper here: Paper Link
- 2024/11/04: The first version of our survey is on arXiv!
If our survey is useful for your research, please kindly cite our paper:
@article{wang2024comprehensive,
  title={A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness},
  author={Wang, Fali and Zhang, Zhiwei and Zhang, Xianren and Wu, Zongyu and Mo, Tzuhao and Lu, Qiuhao and Wang, Wanjing and Li, Rui and Xu, Junjie and Tang, Xianfeng and others},
  journal={arXiv preprint arXiv:2411.03350},
  year={2024}
}
Model | #Params | Date | Paradigm | Domain | Code | HF Model | Paper/Blog |
---|---|---|---|---|---|---|---|
Llama 3.2 | 1B; 3B | 2024.9 | Pre-train | Generic | Github | HF | Blog |
Qwen 1 | 1.8B; 7B; 14B; 72B | 2023.12 | Pre-train | Generic | Github | HF | Paper |
Qwen 1.5 | 0.5B; 1.8B; 4B; 7B; 14B; 32B; 72B | 2024.2 | Pre-train | Generic | Github | HF | Paper |
Qwen 2 | 0.5B; 1.5B; 7B; 57B; 72B | 2024.6 | Pre-train | Generic | Github | HF | Paper |
Qwen 2.5 | 0.5B; 1.5B; 3B; 7B; 14B; 32B; 72B | 2024.9 | Pre-train | Generic | Github | HF | Paper |
Gemma | 2B; 7B | 2024.2 | Pre-train | Generic | - | HF | Paper |
Gemma 2 | 2B; 9B; 27B | 2024.7 | Pre-train | Generic | - | HF | Paper |
H2O-Danube3 | 500M; 4B | 2024.7 | Pre-train | Generic | - | HF | Paper |
Fox-1 | 1.6B | 2024.6 | Pre-train | Generic | - | HF | Blog |
Rene | 1.3B | 2024.5 | Pre-train | Generic | - | HF | Paper |
MiniCPM | 1.2B; 2.4B | 2024.4 | Pre-train | Generic | Github | HF | Paper |
OLMo | 1B; 7B | 2024.2 | Pre-train | Generic | Github | HF | Paper |
TinyLlama | 1B | 2024.1 | Pre-train | Generic | Github | HF | Paper |
Phi-1 | 1.3B | 2023.6 | Pre-train | Coding | - | HF | Paper |
Phi-1.5 | 1.3B | 2023.9 | Pre-train | Generic | - | HF | Paper |
Phi-2 | 2.7B | 2023.12 | Pre-train | Generic | - | HF | Paper |
Phi-3 | 3.8B; 7B; 14B | 2024.4 | Pre-train | Generic | - | HF | Paper |
Phi-3.5 | 3.8B; 4.2B; 6.6B | 2024.4 | Pre-train | Generic | - | HF | Paper |
OpenELM | 270M; 450M; 1.1B; 3B | 2024.4 | Pre-train | Generic | Github | HF | Paper |
MobiLlama | 0.5B; 0.8B | 2024.2 | Pre-train | Generic | Github | HF | Paper |
MobileLLM | 125M; 350M | 2024.2 | Pre-train | Generic | Github | HF | Paper |
StableLM | 3B; 7B | 2023.4 | Pre-train | Generic | Github | HF | Paper |
StableLM 2 | 1.6B | 2024.2 | Pre-train | Generic | Github | HF | Paper |
Cerebras-GPT | 111M-13B | 2023.4 | Pre-train | Generic | - | HF | Paper |
BLOOM, BLOOMZ | 560M; 1.1B; 1.7B; 3B; 7.1B; 176B | 2022.11 | Pre-train | Generic | - | HF | Paper |
OPT | 125M; 350M; 1.3B; 2.7B; 6.7B | 2022.5 | Pre-train | Generic | - | HF | Paper |
XGLM | 1.7B; 2.9B; 7.5B | 2021.12 | Pre-train | Generic | Github | HF | Paper |
GPT-Neo | 125M; 350M; 1.3B; 2.7B | 2021.5 | Pre-train | Generic | Github | - | Paper |
Megatron-gpt2 | 355M; 2.5B; 8.3B | 2019.9 | Pre-train | Generic | Github | - | Paper, Blog |
MINITRON | 4B; 8B; 15B | 2024.7 | Pruning and Distillation | Generic | Github | HF | Paper |
Orca 2 | 7B | 2023.11 | Distillation | Generic | - | HF | Paper |
Dolly-v2 | 3B; 7B; 12B | 2023.4 | Instruction tuning | Generic | Github | HF | Blog |
LaMini-LM | 61M-7B | 2023.4 | Distillation | Generic | Github | HF | Blog |
Specialized FlanT5 | 250M; 760M; 3B | 2023.1 | Instruction Tuning | Generic (math) | Github | - | Paper |
FlanT5 | 80M; 250M; 780M; 3B | 2022.10 | Instruction Tuning | Generic | Github | HF | Paper |
T5 | 60M; 220M; 770M; 3B; 11B | 2019.9 | Pre-train | Generic | Github | HF | Paper |
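
Most of the checkpoints listed above are published on Hugging Face. As a quick start, the sketch below loads one of the smaller entries with the `transformers` library and generates a short completion; the model id is only an illustrative choice, and any causal-LM checkpoint from the table can be substituted.

```python
# Minimal sketch: running one of the small models listed above with Hugging Face
# transformers. The model id is an illustrative assumption; swap in any causal-LM
# checkpoint from the table (given enough RAM/VRAM for its parameter count).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative choice from the table
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,  # fp16 on GPU, fp32 on CPU
).to(device)

prompt = "Small language models are attractive for on-device use because"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
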
- Transformer: Attention is all you need. Ashish Vaswani et al. NeurIPS 2017.
- Mamba 1: Mamba: Linear-time sequence modeling with selective state spaces. Albert Gu and Tri Dao. COLM 2024. [Paper].
- Mamba 2: Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. Tri Dao and Albert Gu. ICML 2024. [Paper] [Code]
- MobiLlama: "MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT". Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan. arXiv 2024. [Paper] [Github] [HuggingFace]
- MobileLLM: "MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases". Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra. ICML 2024. [Paper] [Github] [HuggingFace]
- Rethinking optimization and architecture for tiny language models. Yehui Tang, Fangcheng Liu, Yunsheng Ni, Yuchuan Tian, Zheyuan Bai, Yi-Qi Hu, Sichao Liu, Shangling Jui, Kai Han, and Yunhe Wang. ICML 2024. [Paper] [Code]
- MindLLM: "MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications". Yizhe Yang, Huashan Sun, Jiawei Li, Runheng Liu, Yinghao Li, Yuhang Liu, Heyan Huang, Yang Gao. arXiv 2023. [Paper] [HuggingFace]
- Direct preference optimization: Your language model is secretly a reward model. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. NeurIPS 2023. [Paper] [Code]
- Enhancing chat language models by scaling high-quality instructional conversations. Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. EMNLP 2023. [Paper] [Code]
- SlimOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces, with Verification. Wing Lian, Guan Wang, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". Huggingface, 2023. [Data]
- Stanford Alpaca: An Instruction-following LLaMA model. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. GitHub, 2023. [Blog] [Github] [HuggingFace]
- OpenChat: Advancing Open-source Language Models with Mixed-Quality Data. Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. ICLR, 2024. [Paper] [Code] [HuggingFace]
- Training language models to follow instructions with human feedback. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. NeurIPS, 2022. [Paper]
- RLHF: "Training language models to follow instructions with human feedback". Long Ouyang et al. 2022. [Paper]
- MobileBERT: "MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices". Zhiqing Sun et al. ACL 2020. [Paper] [Github] [HuggingFace]
- Language models are unsupervised multitask learners. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. OpenAI Blog, 2019. [Paper]
- TinyStories: "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?". Ronen Eldan et al. 2023. [Paper] [HuggingFace]
- AS-ES: "AS-ES Learning: Towards Efficient CoT Learning in Small Models". Nuwa Xi et al. 2024. [Paper]
- Self-Amplify: "Self-AMPLIFY: Improving Small Language Models with Self Post Hoc Explanations". Milan Bhan et al. 2024. [Paper]
- Large Language Models Can Self-Improve. Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. EMNLP 2023. [Paper]
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing. Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. NeurIPS 2024. [Paper] [Code]
- GKD: "On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes". Rishabh Agarwal et al. ICLR 2024. [Paper]
- DistiLLM: "DistiLLM: Towards Streamlined Distillation for Large Language Models". Jongwoo Ko et al. ICML 2024. [Paper] [Github]
- Adapt-and-Distill: "Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains". Yunzhi Yao et al. ACL 2021. [Paper] [Github]
- SmoothQuant: "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models". Guangxuan Xiao et al. ICML 2023. [Paper] [Github]
- BiLLM: "BiLLM: Pushing the Limit of Post-Training Quantization for LLMs". Wei Huang et al. 2024. [Paper] [Github]
- LLM-QAT: "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models". Zechun Liu et al. 2023. [Paper]
- PB-LLM: "PB-LLM: Partially Binarized Large Language Models". Yuzhang Shang et al. 2024. [Paper] [Github]
- OneBit: "OneBit: Towards Extremely Low-bit Large Language Models". Yuzhuang Xu et al. NeurIPS 2024. [Paper]
- BitNet: "BitNet: Scaling 1-bit Transformers for Large Language Models". Hongyu Wang et al. 2023. [Paper]
- BitNet b1.58: "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits". Shuming Ma et al. 2024. [Paper]
- SqueezeLLM: "SqueezeLLM: Dense-and-Sparse Quantization". Sehoon Kim et al. ICML 2024. [Paper] [Github]
- JSQ: "Compressing Large Language Models by Joint Sparsification and Quantization". Jinyang Guo et al. PMLR 2024. [Paper] [Github]
- FrameQuant: "FrameQuant: Flexible Low-Bit Quantization for Transformers". Harshavardhan Adepu et al. 2024. [Paper] [Github]
- LQER: "LQER: Low-Rank Quantization Error Reconstruction for LLMs". Cheng Zhang et al. ICML 2024. [Paper] [Github]
- I-LLM: "I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models". Xing Hu et al. 2024. [Paper] [Github]
- PV-Tuning: "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression". Vladimir Malinovskii et al. 2024. [Paper]
- PEQA: "Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization". Jeonghoon Kim et al. NeurIPS 2023. [Paper]
- QLoRA: "QLoRA: Efficient Finetuning of Quantized LLMs". Tim Dettmers et al. NeurIPS 2023. [Paper] [Github] (see the 4-bit loading sketch after the reference list below)
- Ma et al.: "Large Language Model Is Not a Good Few-shot Information Extractor, but a Good Reranker for Hard Samples!". Yubo Ma et al. EMNLP 2023. [Paper] [Github]
- MoQE: "Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness". Young Jin Kim et al. 2023. [Paper]
- SLM-RAG: "Can Small Language Models With Retrieval-Augmented Generation Replace Large Language Models When Learning Computer Science?". Suqing Liu et al. ITiCSE 2024. [Paper]
- Alpaca: "Alpaca: A Strong, Replicable Instruction-Following Model". Rohan Taori et al. 2023. [Paper] [Github] [HuggingFace]
- Stable Beluga 7B: "Stable Beluga 2". Mahan et al. 2023. [HuggingFace]
- Fine-tuned BioGPT: "Improving Small Language Models on PubMedQA via Generative Data Augmentation". Zhen Guo et al. 2023. [Paper]
- Financial SLMs: "Fine-tuning Smaller Language Models for Question Answering over Financial Documents". Karmvir Singh Phogat et al. 2024. [Paper]
- ColBERT: "ColBERT Retrieval and Ensemble Response Scoring for Language Model Question Answering". Alex Gichamba et al. IEEE 2024. [Paper]
- T-SAS: "Test-Time Self-Adaptive Small Language Models for Question Answering". Soyeong Jeong et al. ACL 2023. [Paper] [Github]
- Rationale Ranking: "Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval". Tim Hartill et al. 2023. [Paper]
- Phi-3.5-mini: "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone". Marah Abdin et al. 2024. [Paper] [HuggingFace]
- TinyLlama: "TinyLlama: An Open-Source Small Language Model". Peiyuan Zhang et al. 2024. [Paper] [HuggingFace]
- CodeLlama: "Code Llama: Open Foundation Models for Code". Baptiste Rozière et al. 2024. [Paper] [HuggingFace]
- CodeGemma: "CodeGemma: Open Code Models Based on Gemma". Heri Zhao et al. 2024. [Paper] [HuggingFace]
- PromptRec: "Could Small Language Models Serve as Recommenders? Towards Data-centric Cold-start Recommendations". Xuansheng Wu, et al. 2024. [Paper] [Github]
- SLIM: "Can Small Language Models be Good Reasoners for Sequential Recommendation?". Yuling Wang et al. 2024. [Paper]
- BiLLP: "Large Language Models are Learnable Planners for Long-Term Recommendation". Wentao Shi et al. 2024. [Paper]
- ONCE: "ONCE: Boosting Content-based Recommendation with Both Open- and Closed-source Large Language Models". Qijiong Liu et al. 2023. [Paper] [Github] [HuggingFace]
- RecLoRA: "Lifelong Personalized Low-Rank Adaptation of Large Language Models for Recommendation". Jiachen Zhu et al. 2024. [Paper]
- Content encoder: "Pre-training Tasks for Embedding-based Large-scale Retrieval". Wei-Cheng Chang et al. ICLR 2020. [Paper]
- Poly-encoders: "Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring". Samuel Humeau et al. ICLR 2020. [Paper]
- Twin-BERT: "TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for Efficient Retrieval". Wenhao Lu et al. 2020. [Paper]
- H-ERNIE: "H-ERNIE: A Multi-Granularity Pre-Trained Language Model for Web Search". Xiaokai Chu et al. SIGIR 2022. [Paper]
- Ranker: "Passage Re-ranking with BERT". Rodrigo Nogueira et al. 2019. [Paper] [Github]
- Rewriter: "Query Rewriting for Retrieval-Augmented Large Language Models". Xinbei Ma et al. EMNLP 2023. [Paper] [Github]
- Octopus: "Octopus: On-device language model for function calling of software APIs". Wei Chen et al. 2024. [Paper]
- MobileAgent: "Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration". Junyang Wang et al. 2024. [Paper] [Github] [HuggingFace]
- Revolutionizing Mobile Interaction: "Revolutionizing Mobile Interaction: Enabling a 3 Billion Parameter GPT LLM on Mobile". Samuel Carreira et al. 2023. [Paper]
- AutoDroid: "AutoDroid: LLM-powered Task Automation in Android". Hao Wen et al. 2023. [Paper]
- On-device Agent for Text Rewriting: "Towards an On-device Agent for Text Rewriting". Yun Zhu et al. 2023. [Paper]
- EDGE-LLM: "EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting". Zhongzhi Yu et al. 2024. [Paper] [Github]
- LLM-PQ: "LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization". Juntao Zhao et al. 2024. [Paper] [Github]
- AWQ: "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration". Ji Lin et al. MLSys 2024. [Paper] [Github]
- MobileAIBench: "MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases". Rithesh Murthy et al. 2024. [Paper] [Github]
- MobileLLM: "MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases". Zechun Liu et al. ICML 2024. [Paper] [Github] [HuggingFace]
- EdgeMoE: "EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models". Rongjie Yi et al. 2023. [Paper] [Github]
- GEAR: "GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM". Hao Kang et al. 2024. [Paper] [Github]
- DMC: "Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference". Piotr Nawrot et al. 2024. [Paper]
- Transformer-Lite: "Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs". Luchang Li et al. 2024. [Paper]
- LLMaaS: "LLM as a System Service on Mobile Devices". Wangsong Yin et al. 2024. [Paper]
- LLMCad: "LLMCad: Fast and Scalable On-device Large Language Model Inference". Daliang Xu et al. 2023. [Paper]
- LinguaLinked: "LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices". Junchen Zhao et al. 2023. [Paper]
- Calibrating Large Language Models Using Their Generations Only. Dennis Ulmer, Martin Gubri, Hwaran Lee, Sangdoo Yun, Seong Joon Oh. ACL 2024 Long. [pdf] [code]
- Pareto Optimal Learning for Estimating Large Language Model Errors. Theodore Zhao, Mu Wei, J. Samuel Preston, Hoifung Poon. ACL 2024 Long. [pdf]
- The Internal State of an LLM Knows When It’s Lying. Amos Azaria, Tom Mitchell. EMNLP 2023 Findings. [pdf]
- Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs. Jiejun Tan, Zhicheng Dou, Yutao Zhu, Peidong Guo, Kun Fang, Ji-Rong Wen. ACL 2024 Long. [pdf] [code] [huggingface]
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi. ICLR 2024 Oral. [pdf] [huggingface] [code] [website] [model] [data]
- LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu. ICLR 2024 Workshop ME-FoMo Poster. [pdf]
- Corrective Retrieval Augmented Generation. Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, Zhen-Hua Ling. arXiv 2024.1. [pdf] [code]
- Self-Knowledge Guided Retrieval Augmentation for Large Language Models. Yile Wang, Peng Li, Maosong Sun, Yang Liu. EMNLP 2023 Findings. [pdf] [code]
- In-Context Retrieval-Augmented Language Models. Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, Yoav Shoham. TACL 2023. [pdf] [code]
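
To make the retrieval-augmented generation entries above more concrete, here is a minimal sketch of the in-context RAG pattern they build on: retrieve a few passages for the query, prepend them to the prompt, and let a small LM answer. The TF-IDF retriever, toy corpus, and model id are simplifying assumptions; the cited works add dense retrieval, self-reflection, or proxy models on top of this basic loop.

```python
# Minimal sketch of in-context retrieval-augmented generation with a small LM.
# Assumptions: a toy in-memory corpus, a TF-IDF retriever (real systems use dense
# retrieval over a proper index), and an illustrative Hugging Face model id.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModelForCausalLM, AutoTokenizer

corpus = [
    "Retrieval-augmented generation prepends retrieved passages to the prompt at inference time.",
    "Quantization stores model weights in low-bit formats to shrink their memory footprint.",
    "TinyLlama is a compact, open-source 1.1B-parameter language model.",
]

def retrieve(query, k=2):
    """Return the k passages most similar to the query under TF-IDF cosine similarity."""
    vectorizer = TfidfVectorizer().fit(corpus + [query])
    scores = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(corpus))[0]
    return [corpus[i] for i in scores.argsort()[::-1][:k]]

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative small model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

question = "What does retrieval-augmented generation do?"
prompt = "Context:\n" + "\n".join(retrieve(question)) + f"\n\nQuestion: {question}\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
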
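The quantization entries earlier in the list (SmoothQuant, QLoRA, AWQ, and related post-training methods) all serve the same practical goal: fitting a model into less memory at inference time. A common off-the-shelf approximation of that workflow is 4-bit NF4 loading via `bitsandbytes`, sketched below under the assumption of a CUDA GPU and, again, an illustrative model id; it is not a reimplementation of any specific cited method.

```python
# Minimal sketch: loading a model in 4-bit NF4 with bitsandbytes through transformers.
# Assumptions: a CUDA GPU, the bitsandbytes and accelerate packages installed, and an
# illustrative model id. This shows off-the-shelf weight-only quantization, not the
# specific algorithms of the cited papers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative choice

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as popularized by QLoRA
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the actual matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # requires accelerate; places weights on the GPU
)

inputs = tokenizer("Why quantize a language model?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
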