Releases: sgl-project/sglang

Release v0.4.0

04 Dec 02:14
f8b0326

Highlights

blog: https://lmsys.org/blog/2024-12-04-sglang-v0-4/

We’re excited to release SGLang v0.4, featuring significant performance improvements and new features:

  • Zero-overhead batch scheduler: 1.1x increase in throughput.
  • Cache-aware load balancer: up to 1.9x increase in throughput with 3.8x higher cache hit rate.
  • Data parallelism attention for DeepSeek models: up to 1.9x decoding throughput improvement.
  • Fast structured outputs with xgrammar: up to 10x faster (see the sketch after this list).
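
For a concrete sense of how constrained generation is typically used, here is a minimal, hypothetical sketch of requesting schema-constrained JSON from a locally running SGLang server through its OpenAI-compatible endpoint. The port, model name, and the exact response_format payload are assumptions for illustration, not taken from these notes.

```python
# Hypothetical sketch: schema-constrained JSON generation against a local
# SGLang server via the OpenAI-compatible API. Port, model name, and the
# response_format payload are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

city_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
}

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Describe Paris as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city", "schema": city_schema},
    },
)
print(json.loads(response.choices[0].message.content))
```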

What's Changed

Release v0.3.6

22 Nov 11:36
9a00e6f

Highlights

  • Reduce CPU overhead by enabling the overlap scheduler by default, for 1.1x higher throughput. (#2105, #2067, #2095)
  • Support data parallelism for attention and MLA, for 1.5x higher decoding throughput (see the launch sketch after this list). (#1970, #2061)
  • Cache-aware load balancer with a 4x higher cache hit rate (#1934)
  • Support xgrammar backend for grammar-guided decoding (#2056)
  • Support Prometheus metrics (#1853, #1981)
  • Support torch 2.5.1 (#2069) and torch-native tensor parallelism (#1876)
  • Support graceful termination (#1838) and watchdog (#1816)
  • Support notebook-style documentation (https://sgl-project.github.io/)
  • Add an offline benchmark script (#1968)
  • Bug, deadlock, NaN, and OOM fixes (#2083, #1850, #1800, #1779, #1789, #1858)
  • New models: Phi3-small (#2062), Gemma-2 reward model (#1954), GPT-2 (#1833)
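
As a rough illustration of how several of these features are switched on at launch, here is a hedged sketch that starts the server from Python. The model path and all flag names (--enable-dp-attention, --dp-size, --grammar-backend, --enable-metrics) are assumptions that mirror the feature names above; check `python -m sglang.launch_server --help` for your version.

```python
# Hypothetical launch sketch: start an SGLang server with data-parallel
# attention, the xgrammar grammar backend, and Prometheus metrics enabled.
# All flag names and the model path are assumptions mirroring the feature
# names above; the overlap scheduler needs no flag since it is on by default.
import subprocess

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "deepseek-ai/DeepSeek-V2-Lite",  # example model (assumption)
    "--port", "30000",
    "--enable-dp-attention",   # data parallelism for attention / MLA
    "--dp-size", "2",
    "--grammar-backend", "xgrammar",
    "--enable-metrics",        # expose Prometheus metrics
]
server = subprocess.Popen(cmd)  # metrics would then be scraped from /metrics
```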

What's Changed

Release v0.3.4.post1

22 Oct 04:30
1f26e8b

Highlights

  • Hosted the first LMSYS online meetup: Efficient LLM Deployment and Serving.
    • Covered CPU overhead hiding, faster constrained decoding, and DeepSeek MLA. Slides
  • Added an Engine API for offline inference with reduced overhead (see the sketch after this list). Usage. #1614 #1567
  • Added an overlap scheduler for reducing CPU overhead #1738
  • New models: Llama 3.2 (#1551), Qwen2-VL (#1721), OLMo (#1676), GLM-4 (#1736).
  • Added support for reward models #1525.
  • Added support for Intel XPU #1480.
  • Improved stability for greedy decoding #1589.
  • Accelerated multi-LoRA serving #1587.
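
Below is a minimal sketch of the offline Engine usage mentioned above, assuming the `sglang.Engine` interface with an example model path and sampling parameters; the linked Usage docs remain the authoritative reference.

```python
# Minimal sketch of offline inference with the Engine API (no HTTP server).
# The model path, sampling parameters, and output field names are assumptions
# for illustration; see the Usage docs for the authoritative interface.
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct")

prompts = ["The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 32}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])

llm.shutdown()  # release GPU resources when done
```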

What's Changed

Release v0.3.2

02 Oct 17:19
37c5899

Highlights

  • Support torch.compile and CUDA graph for the Triton attention backend and DeepSeek MLA #1442 #1422
  • Initial support for multi-LoRA serving #1307
  • Integrate torchao for quantization #1341
  • Optimize the CPU scheduler overhead
  • Multiple critical bug fixes for Llama and LLaVA (tokenizer, modality)
  • Support AMD backend #1420
  • New models: MiniCPM3, OLMoE

What's Changed

Release v0.3.0

19 Sep 10:09
5ab9418

Highlights

Check out the release blog post https://lmsys.org/blog/2024-09-04-sglang-v0-3/ for detailed instructions and descriptions of the items below.

  • Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
  • Up to 1.5x lower latency with torch.compile on small batch sizes
  • Support for interleaved text and multi-image/video in LLaVA-OneVision
  • Support for interleaved window attention and 2x longer context length in Gemma-2
  • Chunked prefill is turned on by default (you can choose to run prefill and decode separately or mix them).
  • Added multi-GPU accuracy and performance tests, plus nightly accuracy tests for more models.

What's Changed

Release v0.2.13

19 Sep 10:08
5bd9537

Highlights

  • New features: Support window attention for Gemma-2 (#1056 #1090 #1112), enable chunked prefill by default (#1040 #984), support all sampling penalties (#973)
  • New models: Support the embedding model e5-mistral (#983 #987 #988 #997 #1014) and a comprehensive OpenAI-compatible API (see the sketch after this list).
  • Performance: Accelerated Multi-head Latent Attention (MLA), bringing a 2x end-to-end improvement on DeepSeek v2 (#905).
  • More CI tests: accuracy tests (multiple benchmarks), unit tests (APIs, model implementations), E2E tests (high-pressure and performance tests), MoE tests
  • Refactoring and fixes: more modular code, better stability, and more kernels from flashinfer (#907)
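
As an example of the OpenAI-compatible surface for the new embedding model, here is a hedged sketch of requesting embeddings from a local server; the port and the Hugging Face model identifier are assumptions for illustration.

```python
# Hypothetical sketch: request embeddings from a local SGLang server running
# the e5-mistral embedding model through the OpenAI-compatible API.
# The port and model identifier are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

result = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",
    input=["query: how do rankings work?", "passage: rankings are computed daily."],
)
for item in result.data:
    print(len(item.embedding), item.embedding[:4])
```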

What's Changed

Release v0.2.9

02 Aug 08:55
30a9b2e

Highlights

  • New feature: Chunked prefill (#800, #811)
  • New models: DeepSeek v2
  • Performance improvement: vectorized logprob computation
  • Accuracy fixes: fix the double-BOS problem in the chat template; compute logits in float32; update flashinfer sampling kernels
  • Feature fixes: added many missing logprob-related features in the OpenAI API server (see the sketch after this list)
  • CI/CD infrastructure is now fully ready. The tests cover the frontend, the backend, accuracy, and performance.
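
To illustrate the logprob-related features mentioned above, here is a hedged sketch that asks the OpenAI-compatible completions endpoint for token logprobs; the port, model name, and the server's support for the echo parameter are assumptions for illustration.

```python
# Hypothetical sketch: request token logprobs from the OpenAI-compatible
# completions endpoint of a local SGLang server. The port, model name, and
# echo support are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="The capital of France is",
    max_tokens=8,
    logprobs=3,  # top-3 logprobs per generated token
    echo=True,   # also return logprobs for the prompt tokens
)
choice = response.choices[0]
print(choice.text)
print(choice.logprobs.token_logprobs)
```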

What's Changed

New Contributors

Full Changelog: v0.2.5...v0.2.9

Release v0.2.5

26 Jul 19:56
5bd06b4

Highlights

  • We recently released a blog post. Compared to TensorRT-LLM and vLLM, SGLang Runtime consistently delivers superior or competitive performance in both online and offline scenarios, across models from Llama-8B to Llama-405B, on A100 and H100 GPUs, with FP8 and FP16. SGLang consistently outperforms vLLM, achieving up to 3.1x higher throughput on Llama-70B, and often matches or outperforms TensorRT-LLM.

  • We have now automated the release processes for PyPI, Docker, and GitHub Releases using GitHub workflows. Previously, because GitHub Releases were not automated, tags were not updated in time, leading to a jump from v0.2.0 directly to v0.2.5.

  • Everyone is welcome to try https://github.com/sgl-project/sglang and to participate actively in the community, including but not limited to issues, PRs, and discussions. Cheers!

Release v0.2.0

25 Jul 15:58
1a491d0

Highlights

  • We performed extensive engineering to improve the base performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, using FP8 and FP16. See the latest blog.
  • New models: Llama 3 405B, DeepSeek MoE, InternLM, GPTBigCode, Mistral-Nemo

What's Changed

New Contributors

Release v0.1.20

14 Jul 00:33
5d264a9

Highlights

  • Enable CUDA graph by default, bringing a 1.5x - 2x speedup for small-batch-size decoding (#612)
  • Model support: Gemma 2, MiniCPM, Qwen2 MoE
  • Docker support (#217)
  • Various latency optimizations

What's Changed

New Contributors

Full Changelog: v0.1.18...v0.1.20