{% hint style="warning" %} Large language models (LLMs) are hot and diverse compared to conventional models. Therefore, I have classified the related works for LLMs in another paper list. {% endhint %}
{% hint style="info" %} I am actively maintaining this list. {% endhint %}
- Usher: Holistic Interference Avoidance for Resource Optimized ML Inference (OSDI 2024) [Paper] [Code]
- UVA & GaTech
- Paella: Low-latency Model Serving with Software-defined GPU Scheduling (SOSP 2023) [Paper]
- UPenn & DBOS, Inc.
- Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences (OSDI 2022) [Personal Notes] [Paper] [Code] [Benchmark] [Artifact]
- SJTU
- REEF: GPU kernel preemption; dynamic kernel padding.
- INFaaS: Automated Model-less Inference Serving (ATC 2021) [Paper] [Code]
- Stanford
- Best Paper
- Model-less interface: users state performance and cost requirements, and the system automatically selects among model variants (a simplified selection sketch follows this entry).
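  A minimal sketch of variant selection under a latency SLO, not INFaaS's actual policy; the variant metadata and numbers below are hypothetical, invented purely for illustration:

  ```python
  from dataclasses import dataclass
  from typing import List, Optional

  @dataclass
  class Variant:
      name: str               # e.g., "resnet50-gpu-fp16" (hypothetical)
      p99_latency_ms: float   # profiled tail latency (made-up numbers)
      cost_per_1k: float      # dollars per 1k queries (made-up numbers)
      accuracy: float         # top-1 accuracy

  def select_variant(variants: List[Variant], slo_ms: float,
                     min_accuracy: float) -> Optional[Variant]:
      """Pick the cheapest variant that meets both the latency SLO and accuracy floor."""
      feasible = [v for v in variants
                  if v.p99_latency_ms <= slo_ms and v.accuracy >= min_accuracy]
      return min(feasible, key=lambda v: v.cost_per_1k) if feasible else None

  if __name__ == "__main__":
      zoo = [
          Variant("resnet50-cpu",      120.0, 0.08, 0.76),
          Variant("resnet50-gpu-fp32",  15.0, 0.40, 0.76),
          Variant("resnet50-gpu-fp16",   9.0, 0.30, 0.75),
      ]
      print(select_variant(zoo, slo_ms=50, min_accuracy=0.75))
  ```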
- Clipper: A Low-Latency Online Prediction Serving System (NSDI 2017) [Personal Notes] [Paper] [Code]
- UC Berkeley
- Caching, batching, adaptive model selection (a simplified batching sketch follows this entry).
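  A minimal sketch of the dynamic-batching idea, not Clipper's actual implementation (Clipper additionally adapts the maximum batch size with an AIMD scheme to stay under the latency SLO); queue layout and parameters are assumptions for illustration:

  ```python
  import queue
  import threading
  import time

  def batching_loop(req_queue: "queue.Queue", predict_batch,
                    max_batch_size: int = 8, timeout_s: float = 0.005):
      """Drain requests into batches: dispatch when the batch is full or the timeout expires."""
      while True:
          batch = [req_queue.get()]                 # block for the first request
          deadline = time.monotonic() + timeout_s
          while len(batch) < max_batch_size:
              remaining = deadline - time.monotonic()
              if remaining <= 0:
                  break
              try:
                  batch.append(req_queue.get(timeout=remaining))
              except queue.Empty:
                  break
          predict_batch(batch)                      # one batched model invocation

  if __name__ == "__main__":
      q = queue.Queue()
      threading.Thread(target=batching_loop,
                       args=(q, lambda b: print(f"batch of {len(b)}")),
                       daemon=True).start()
      for i in range(20):
          q.put(i)
          time.sleep(0.001)
      time.sleep(0.1)
  ```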
- TensorFlow-Serving: Flexible, High-Performance ML Serving (NIPS 2017 Workshop on ML Systems) [Paper]
- Serving Unseen Deep Learning Models with Near-Optimal Configurations: a Fast Adaptive Search Approach (SoCC 2022) [Personal Notes] [Paper] [Code]
- ISCAS
- Characterizes a DL model by its key operators to quickly find near-optimal serving configurations for unseen models (an operator-extraction sketch follows this entry).
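  A minimal sketch of extracting an operator profile from a model graph; this is illustrative only, assumes a hypothetical `model.onnx` file, and uses the `onnx` package rather than whatever model representation the paper actually uses:

  ```python
  from collections import Counter
  import onnx

  def operator_profile(path: str) -> Counter:
      """Count how often each operator type appears in an ONNX graph."""
      model = onnx.load(path)
      return Counter(node.op_type for node in model.graph.node)

  if __name__ == "__main__":
      profile = operator_profile("model.onnx")   # hypothetical model file
      for op, count in profile.most_common(10):
          print(f"{op:20s} {count}")
  ```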
- Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving (SoCC 2021) [Paper] [Code]
- HKUST & Alibaba
- Meta-learning; Bayesian optimization; Kubernetes (see the configuration-search sketch after this entry).
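  A minimal sketch of Bayesian optimization over serving configurations, not Morphling's algorithm (which warm-starts the search with meta-learning); the configuration space, the toy measurement function, and the `scikit-optimize` dependency are all assumptions for illustration:

  ```python
  from skopt import gp_minimize
  from skopt.space import Categorical, Integer

  # Hypothetical search space: CPU cores and maximum batch size for one model.
  space = [Integer(1, 8, name="cpu_cores"),
           Categorical([1, 4, 8, 16, 32], name="max_batch_size")]

  def measure_negative_rps(config):
      """Stand-in for deploying the config and benchmarking it; real systems measure, not model."""
      cpu_cores, max_batch_size = config
      rps = cpu_cores * 100 * (1 - 1 / (1 + max_batch_size))   # fake throughput model
      return -rps                                              # gp_minimize minimizes

  result = gp_minimize(measure_negative_rps, space, n_calls=15, random_state=0)
  print("best config:", result.x, "throughput:", -result.fun)
  ```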
- A Survey of Multi-Tenant Deep Learning Inference on GPU (MLSys 2022 Workshop on Cloud Intelligence / AIOps) [Paper]
- George Mason & Microsoft & Maryland
- A Survey of Large-Scale Deep Learning Serving System Optimization: Challenges and Opportunities (arXiv 2111.14247) [Paper]
- George Mason & Microsoft & Pittsburgh & Maryland