Highlights
blog: https://lmsys.org/blog/2024-12-04-sglang-v0-4/
We’re excited to release SGLang v0.4, featuring significant performance improvements and new features:
- Zero-overhead batch scheduler: 1.1x increase in throughput.
- Cache-aware load balancer: up to 1.9x increase in throughput with 3.8x higher cache hit rate.
- Data parallelism attention for DeepSeek models: up to 1.9x decoding throughput improvement.
- Fast structured outputs with xgrammar: up to 10x faster.
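The structured-outputs speedup applies when generation is constrained by a grammar or JSON schema. As a minimal sketch, assuming an SGLang server running with its OpenAI-compatible API, the request body for a schema-constrained completion might look like the following (the model name, schema, and endpoint usage are illustrative placeholders, not part of this release):

```python
import json

# Illustrative JSON schema that the xgrammar-backed path would enforce
# during decoding (schema contents are a made-up example).
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

# Request body for an OpenAI-compatible /v1/chat/completions endpoint.
# The model name below is a placeholder; substitute your served model.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "user", "content": "Return the capital of France as JSON."}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
}

# Serialize the body; in a real deployment you would POST this to the
# running server (e.g. with requests or httpx).
body = json.dumps(payload)
print(json.loads(body)["response_format"]["type"])
```

This only constructs the request; sending it requires a live server, which is why the HTTP call itself is omitted here.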
What's Changed
- fix: add xgrammar dependency by @zhyncs in #2126
- docs: fix module docstrings and copyright headers by @XuehaiPan in #2077
- feat(pre-commit): trim unnecessary notebook metadata from git history by @XuehaiPan in #2127
- Expose max total num tokens from Runtime & Engine API by @henryhmko in #2092
- Only stream output on tp rank 0 by @merrymercy in #2124
- Revert "Only stream output on tp rank 0" by @merrymercy in #2130
- Add initial support for intel Gaudi accelerators by @ankurneog in #2121
- Add simple CPU offloading support. by @janimo in #2081
- Fix grid size in Triton decoding kernel by @ispobock in #2134
- [CI] Fix test cases by @merrymercy in #2137
- Add concurrency option for benchmark by @cermeng in #2136
- Fix dp print message by @merrymercy in #2138
- fix: resolve bench_serving args by @zhyncs in #2139
- [router] cache-aware load-balancing router v1 by @ByronHsu in #2114
- Bump sglang-router to 0.0.5 by @ByronHsu in #2142
- update router doc by @ByronHsu in #2143
- fix dp_rank env by @ByronHsu in #2144
- Add more api routes (completion, health, etc) to the router by @ByronHsu in #2146
- add prefix match for certain tenant by @ByronHsu in #2147
- Improve sglang router by @ByronHsu in #2148
- Merged three native APIs into one: get_server_info by @henryhmko in #2152
- feat: remove the dependency on FusedMoE by @zhyncs in #2153
- feat: update gitignore and add tuning config for FusedMoE by @zhyncs in #2155
- fix: resolve end-of-file-fixer by @zhyncs in #2157
- Simplify `Scheduler.update_running_batch` by @merrymercy in #2154
- feat: update other MoE models deps by @zhyncs in #2156
- Update CI threshold & Improve code style by @merrymercy in #2159
- fix: use torch.sum for compatible by @zhyncs in #2161
- Fix mixed chunked prefill in overlap mode by @merrymercy in #2158
- Balance CI tests by @merrymercy in #2162
- Rename triton_fused_moe -> fused_moe_triton by @merrymercy in #2163
- Fix docs by @merrymercy in #2164
- [Fused moe] add tuning fused configs for qwen2 57b and mixtral 8x7b by @BBuf in #2167
- Allow overwrite flashinfer use_tensorcore by @merrymercy in #2169
- Replace prob based with threshold based load balancing by @ByronHsu in #2170
- feat: fused_moe fp8 monkey patch by @zhyncs in #2174
- [Fix] Avoid calling fill_vocab_mask for terminated requests by @Ubospica in #2175
- [CI] Split test cases in CI for better load balancing by @merrymercy in #2180
- Bump rustls from 0.23.16 to 0.23.18 in /rust by @dependabot in #2182
- [feat] Refactor session control interface and add CI by @Ying1123 in #2173
- [router] Replace print with logger by @ByronHsu in #2183
- Use custom allreduce w/ torch.compile by @merrymercy in #2185
- [Performance]: Process affinity to CPU cores with multiple sockets support by @HaiShaw in #2171
- Update CI threshold by @merrymercy in #2186
- Update XGrammar to the latest API by @Ubospica in #2176
- [router] Rust e2e test by @ByronHsu in #2184
- Input_embeds support by @RinRin-32 in #2052
- [CI] Minor fix for CI by @merrymercy in #2187
- Rename double sparsity config file by @merrymercy in #2188
- Release v0.3.6.post1 by @merrymercy in #2189
- Update sampler.py to skip the success check by @merrymercy in #2197
- remove unused imports by @WrRan in #2195
- Remove unresolved reference 'self' by @apemost in #2198
- using `is not`, not `!=`, to test `None` by @WrRan in #2196
- fix: add cuda-python for xgrammar by @zhyncs in #2199
- minor: update check_env by @zhyncs in #2201
- add sglang version to get_server_info by @binarycrayon in #2206
- docs: update adoption by @zhyncs in #2204
- Bump router to 0.0.9 with better logging by @ByronHsu in #2207
- Fix rust warning by @ByronHsu in #2208
- Fix flaky tests by @merrymercy in #2212
- [feat] Support session control for vision language models by @Ying1123 in #2210
- Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default by @merrymercy in #2217
- Revert "Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default" by @merrymercy in #2221
- Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default by @merrymercy in #2222
- Release v0.3.6.post2 by @merrymercy in #2214
- Rename DP_RANK to SGLANG_DP_RANK by @merrymercy in #2218
- [3rdparty, document] Updated Documentation that for triton fused_moe kernel tuning for AMD Instinct GPUs by @kkHuang-amd in #2191
- Bump sglang-router to 0.0.10 for env name change by @ByronHsu in #2226
- fix typo prompts by @qibaoyuan in #2224
- Remove fused_moe_grok by @merrymercy in #2223
- add profile in offline benchmark & update doc by @bjmsong in #2123
- Rename tuned MI300X config files for fused_moe_triton by @HaiShaw in #2228
- Update Install Method 2. From source by @HaiShaw in #2232
- Fix chunked prefill size for bench_offline_throughput by @merrymercy in #2234
- Disable overlap scheduler for multimodal models by @merrymercy in #2235
- Add OLMo2 model. by @janimo in #2233
- Crash the server correctly during error by @merrymercy in #2231
- Fix memory leak during abort by @merrymercy in #2238
- fix missing launch server import by @qeternity in #2242
- [fix] Fix prefix caching for multi-image/video by @Ying1123 in #2239
- Update backend.md by @merrymercy in #2250
- Update backend.md by @merrymercy in #2251
- Revert "Add simple CPU offloading support" by @Ying1123 in #2252
- Revert "Revert "Add simple CPU offloading support"" by @Ying1123 in #2253
- Simplify tokenizer manager by @merrymercy in #2254
- Fix hash collision for multi modal models by @merrymercy in #2256
- [Minor] fix the style for multimodal models by @merrymercy in #2257
- chore: bump v0.3.6.post3 by @zhyncs in #2259
- minor: add sgl-kernel dir by @zhyncs in #2261
- [benchmark] Add fused_moe_triton benchmark and tuning tools by @BBuf in #2225
- Fix the default chunked prefill size by @merrymercy in #2268
- Support LoRA in Completion API by @bjmsong in #2243
- Add new contributors so they can trigger CI automatically by @merrymercy in #2269
- update weights from disk by @zhaochenyang20 in #2265
- add get weights by parameter name for llama by @zhaochenyang20 in #2266
- [CI] Print summary on github actions by @merrymercy in #2274
- [CI] Kill zombie processes by @merrymercy in #2280
- [FEAT] Support GGUF format by @zhengy001 in #2215
- [Fix] fix assertion error for chunked prefill when disabling cache by @wangraying in #2282
- Revert "[FEAT] Support GGUF format" by @merrymercy in #2285
- Revert "[Fix] fix assertion error for chunked prefill when disabling cache" by @merrymercy in #2286
- [CI] Fix ci tests by @merrymercy in #2284
- Revert "Revert "[FEAT] Support GGUF format"" by @merrymercy in #2287
- feat: add Dockerfile for development by @zhyncs in #2289
- [CI] Fix missing files in run_suite.py by @merrymercy in #2288
- adapt vllm distributed module to sglang by @yizhang2077 in #2244
- Fix chunked prefill when ignore eos by @hnyls2002 in #2290
- [CI] Balance CI tests by @merrymercy in #2293
- feat: add should_use_tensor_core by @zhyncs in #2179
- Feat: upgrade outlines & support compatibility with the old version by @gobraves in #2292
- minor: support flashinfer nightly by @zhyncs in #2295
- Add a simple torch native attention backend by @YangQun1 in #2241
- feat: skip good first issue by @zhyncs in #2298
- minor: rm unused _grouped_size_compiled_for_decode_kernels by @zhyncs in #2299
- feat: support sgl-kernel pypi by @zhyncs in #2302
- Fix logprob for completions by @merrymercy in #2301
- feat: use warp reduce as a simple example by @zhyncs in #2304
- fix: resolve CodeQL cpp issue by @zhyncs in #2305
- misc: update build setup by @zhyncs in #2306
- Online weight updates from torch.distributed by @zhaochenyang20 in #2279
- [Fix] Fix the padded hash value for image tokens by @merrymercy in #2309
- Use rocminfo instead of rocm-smi for more OS/WSL support by @HaiShaw in #2310
- [Minor] Fix code style by @merrymercy in #2311
- Add more fused moe benchmark utilities by @merrymercy in #2314
- Update model_loader deps and qqq quantization deps (#2220) by @zhyncs in #2318
- Relax to include more AMD GPUs by @HaiShaw in #2319
- [feat] Enable chunked prefill for llava-onevision by @Ying1123 in #2281
- [Minor] Fix logger and style by @merrymercy in #2325
- Revert "[feat] Enable chunked prefill for llava-onevision" by @Ying1123 in #2329
- ROCm Container: set SGLANG_SET_CPU_AFFINITY=1 by @HaiShaw in #2328
- Add missing license for router wheel by @MrAta in #2324
- Improve torch compile for fused moe by @merrymercy in #2327
- fix: resolve cmake url for Dockerfile.dev by @zhyncs in #2335
- Fix gptq for moe layers by @merrymercy in #2300
- [router] Copy license when publishing & bump version by @ByronHsu in #2339
- chore: bump v0.4.0 by @zhyncs in #2338
New Contributors
- @henryhmko made their first contribution in #2092
- @ankurneog made their first contribution in #2121
- @cermeng made their first contribution in #2136
- @Ubospica made their first contribution in #2175
- @dependabot made their first contribution in #2182
- @RinRin-32 made their first contribution in #2052
- @WrRan made their first contribution in #2195
- @apemost made their first contribution in #2198
- @qibaoyuan made their first contribution in #2224
- @zhengy001 made their first contribution in #2215
- @wangraying made their first contribution in #2282
- @gobraves made their first contribution in #2292
- @YangQun1 made their first contribution in #2241
- @MrAta made their first contribution in #2324
Full Changelog: v0.3.6...v0.4.0