Highlights
blog: https://lmsys.org/blog/2024-12-04-sglang-v0-4/
We’re excited to release SGLang v0.4, featuring significant performance improvements and new features:
- Zero-overhead batch scheduler: 1.1x increase in throughput.
- Cache-aware load balancer: up to 1.9x increase in throughput with 3.8x higher cache hit rate.
- Data parallelism attention for DeepSeek models: up to 1.9x decoding throughput improvement.
- Fast structured outputs with xgrammar: up to 10x faster.
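The structured-outputs speedup applies when generation is constrained by a grammar or JSON schema. As a minimal sketch, assuming an SGLang server running with its OpenAI-compatible API, the request body for a schema-constrained completion might look like the following (the model name, schema, and endpoint usage are illustrative placeholders, not part of this release):

```python
import json

# Illustrative JSON schema that the xgrammar-backed path would enforce
# during decoding (schema contents are a made-up example).
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

# Request body for an OpenAI-compatible /v1/chat/completions endpoint.
# The model name below is a placeholder; substitute your served model.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "user", "content": "Return the capital of France as JSON."}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
}

# Serialize the body; in a real deployment you would POST this to the
# running server (e.g. with requests or httpx).
body = json.dumps(payload)
print(json.loads(body)["response_format"]["type"])
```

This only constructs the request; sending it requires a live server, which is why the HTTP call itself is omitted here.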
What's Changed
- fix: add xgrammar dependency by @zhyncs in #2126
- docs: fix module docstrings and copyright headers by @XuehaiPan in #2077
- feat(pre-commit): trim unnecessary notebook metadata from git history by @XuehaiPan in #2127
- Expose max total num tokens from Runtime & Engine API by @henryhmko in #2092
- Only stream output on tp rank 0 by @merrymercy in #2124
- Revert "Only stream output on tp rank 0" by @merrymercy in #2130
- Add initial support for intel Gaudi accelerators by @ankurneog in #2121
- Add simple CPU offloading support. by @janimo in #2081
- Fix grid size in Triton decoding kernel by @ispobock in #2134
- [CI] Fix test cases by @merrymercy in #2137
- Add concurrency option for benchmark by @cermeng in #2136
- Fix dp print message by @merrymercy in #2138
- fix: resolve bench_serving args by @zhyncs in #2139
- [router] cache-aware load-balancing router v1 by @ByronHsu in #2114
- Bump sglang-router to 0.0.5 by @ByronHsu in #2142
- update router doc by @ByronHsu in #2143
- fix dp_rank env by @ByronHsu in #2144
- Add more api routes (completion, health, etc) to the router by @ByronHsu in #2146
- add prefix match for certain tenant by @ByronHsu in #2147
- Improve sglang router by @ByronHsu in #2148
- Merged three native APIs into one: get_server_info by @henryhmko in #2152
- feat: remove the dependency on FusedMoE by @zhyncs in #2153
- feat: update gitignore and add tuning config for FusedMoE by @zhyncs in #2155
- fix: resolve end-of-file-fixer by @zhyncs in #2157
- Simplify `Scheduler.update_running_batch` by @merrymercy in #2154
- feat: update other MoE models deps by @zhyncs in #2156
- Update CI threshold & Improve code style by @merrymercy in #2159
- fix: use torch.sum for compatible by @zhyncs in #2161
- Fix mixed chunked prefill in overlap mode by @merrymercy in #2158
- Balance CI tests by @merrymercy in #2162
- Rename triton_fused_moe -> fused_moe_triton by @merrymercy in #2163
- Fix docs by @merrymercy in #2164
- [Fused moe] add tuning fused configs for qwen2 57b and mixtral 8x7b by @BBuf in #2167
- Allow overwrite flashinfer use_tensorcore by @merrymercy in #2169
- Replace prob based with threshold based load balancing by @ByronHsu in #2170
- feat: fused_moe fp8 monkey patch by @zhyncs in #2174
- [Fix] Avoid calling fill_vocab_mask for terminated requests by @Ubospica in #2175
- [CI] Split test cases in CI for better load balancing by @merrymercy in #2180
- Bump rustls from 0.23.16 to 0.23.18 in /rust by @dependabot in #2182
- [feat] Refactor session control interface and add CI by @Ying1123 in #2173
- [router] Replace print with logger by @ByronHsu in #2183
- Use custom allreduce w/ torch.compile by @merrymercy in #2185
- [Performance]: Process affinity to CPU cores with multiple sockets support by @HaiShaw in #2171
- Update CI threshold by @merrymercy in #2186
- Update XGrammar to the latest API by @Ubospica in #2176
- [router] Rust e2e test by @ByronHsu in #2184
- Input_embeds support by @RinRin-32 in #2052
- [CI] Minor fix for CI by @merrymercy in #2187
- Rename double sparsity config file by @merrymercy in #2188
- Release v0.3.6.post1 by @merrymercy in #2189
- Update sampler.py to skip the success check by @merrymercy in #2197
- remove unused imports by @WrRan in #2195
- Remove unresolved reference 'self' by @apemost in #2198
- using `is not`, not `!=`, to test `None` by @WrRan in #2196
- fix: add cuda-python for xgrammar by @zhyncs in #2199
- minor: update check_env by @zhyncs in #2201
- add sglang version to get_server_info by @binarycrayon in #2206
- docs: update adoption by @zhyncs in #2204
- Bump router to 0.0.9 with better logging by @ByronHsu in #2207
- Fix rust warning by @ByronHsu in #2208
- Fix flaky tests by @merrymercy in #2212
- [feat] Support session control for vision language models by @Ying1123 in #2210
- Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default by @merrymercy in #2217
- Revert "Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default" by @merrymercy in #2221
- Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default by @merrymercy in #2222
- Release v0.3.6.post2 by @merrymercy in #2214
- Rename DP_RANK to SGLANG_DP_RANK by @merrymercy in #2218
- [3rdparty, document] Updated Documentation that for triton fused_moe kernel tuning for AMD Instinct GPUs by @kkHuang-amd in #2191
- Bump sglang-router to 0.0.10 for env name change by @ByronHsu in #2226
- fix typo prompts by @qibaoyuan in #2224
- Remove fused_moe_grok by @merrymercy in #2223
- add profile in offline benchmark & update doc by @bjmsong in #2123
- Rename tuned MI300X config files for fused_moe_triton by @HaiShaw in #2228
- Update Install Method 2. From source by @HaiShaw in #2232
- Fix chunked prefill size for bench_offline_throughput by @merrymercy in #2234
- Disable overlap scheduler for multimodal models by @merrymercy in #2235
- Add OLMo2 model. by @janimo in #2233
- Crash the server correctly during error by @merrymercy in #2231
- Fix memory leak during abort by @merrymercy in #2238
- fix missing launch server import by @qeternity in #2242
- [fix] Fix prefix caching for multi-image/video by @Ying1123 in #2239
- Update backend.md by @merrymercy in #2250
- Update backend.md by @merrymercy in #2251
- Revert "Add simple CPU offloading support" by @Ying1123 in #2252
- Revert "Revert "Add simple CPU offloading support"" by @Ying1123 in #2253
- Simplify tokenizer manager by @merrymercy in #2254
- Fix hash collision for multi modal models by @merrymercy in #2256
- [Minor] fix the style for multimodal models by @merrymercy in #2257
- chore: bump v0.3.6.post3 by @zhyncs in #2259
- minor: add sgl-kernel dir by @zhyncs in #2261
- [benchmark] Add fused_moe_triton benchmark and tuning tools by @BBuf in #2225
- Fix the default chunked prefill size by @merrymercy in #2268
- Support LoRA in Completion API by @bjmsong in #2243
- Add new contributors so they can trigger CI automatically by @merrymercy in #2269
- update weights from disk by @zhaochenyang20 in #2265
- add get weights by parameter name for llama by @zhaochenyang20 in #2266
- [CI] Print summary on github actions by @merrymercy in #2274
- [CI] Kill zombie processes by @merrymercy in #2280
- [FEAT] Support GGUF format by @zhengy001 in #2215
- [Fix] fix assertion error for chunked prefill when disabling cache by @wangraying in #2282
- Revert "[FEAT] Support GGUF format" by @merrymercy in #2285
- Revert "[Fix] fix assertion error for chunked prefill when disabling cache" by @merrymercy in #2286
- [CI] Fix ci tests by @merrymercy in #2284
- Revert "Revert "[FEAT] Support GGUF format"" by @merrymercy in #2287
- feat: add Dockerfile for development by @zhyncs in #2289
- [CI] Fix missing files in run_suite.py by @merrymercy in #2288
- adapt vllm distributed module to sglang by @yizhang2077 in #2244
- Fix chunked prefill when ignore eos by @hnyls2002 in #2290
- [CI] Balance CI tests by @merrymercy in #2293
- feat: add should_use_tensor_core by @zhyncs in #2179
- Feat: upgrade outlines & support compatibility with the old version by @gobraves in #2292
- minor: support flashinfer nightly by @zhyncs in #2295
- Add a simple torch native attention backend by @YangQun1 in #2241
- feat: skip good first issue by @zhyncs in #2298
- minor: rm unused _grouped_size_compiled_for_decode_kernels by @zhyncs in #2299
- feat: support sgl-kernel pypi by @zhyncs in #2302
- Fix logprob for completions by @merrymercy in #2301
- feat: use warp reduce as a simple example by @zhyncs in #2304
- fix: resolve CodeQL cpp issue by @zhyncs in #2305
- misc: update build setup by @zhyncs in #2306
- Online weight updates from torch.distributed by @zhaochenyang20 in #2279
- [Fix] Fix the padded hash value for image tokens by @merrymercy in #2309
- Use rocminfo instead of rocm-smi for more OS/WSL support by @HaiShaw in #2310
- [Minor] Fix code style by @merrymercy in #2311
- Add more fused moe benchmark utilities by @merrymercy in #2314
- Update model_loader deps and qqq quantization deps (#2220) by @zhyncs in #2318
- Relax to include more AMD GPUs by @HaiShaw in #2319
- [feat] Enable chunked prefill for llava-onevision by @Ying1123 in #2281
- [Minor] Fix logger and style by @merrymercy in #2325
- Revert "[feat] Enable chunked prefill for llava-onevision" by @Ying1123 in #2329
- ROCm Container: set SGLANG_SET_CPU_AFFINITY=1 by @HaiShaw in #2328
- Add missing license for router wheel by @MrAta in #2324
- Improve torch compile for fused moe by @merrymercy in #2327
- fix: resolve cmake url for Dockerfile.dev by @zhyncs in #2335
- Fix gptq for moe layers by @merrymercy in #2300
- [router] Copy license when publishing & bump version by @ByronHsu in #2339
- chore: bump v0.4.0 by @zhyncs in #2338
New Contributors
- @henryhmko made their first contribution in #2092
- @ankurneog made their first contribution in #2121
- @cermeng made their first contribution in #2136
- @Ubospica made their first contribution in #2175
- @dependabot made their first contribution in #2182
- @RinRin-32 made their first contribution in #2052
- @WrRan made their first contribution in #2195
- @apemost made their first contribution in #2198
- @qibaoyuan made their first contribution in #2224
- @zhengy001 made their first contribution in #2215
- @wangraying made their first contribution in #2282
- @gobraves made their first contribution in #2292
- @YangQun1 made their first contribution in #2241
- @MrAta made their first contribution in #2324
Full Changelog: v0.3.6...v0.4.0