Releases: sgl-project/sglang
Release v0.4.0
Highlights
blog: https://lmsys.org/blog/2024-12-04-sglang-v0-4/
We’re excited to release SGLang v0.4, featuring significant performance improvements and new features:
- Zero-overhead batch scheduler: 1.1x increase in throughput.
- Cache-aware load balancer: up to 1.9x increase in throughput with 3.8x higher cache hit rate.
- Data parallelism attention for DeepSeek models: up to 1.9x decoding throughput improvement.
- Fast structured outputs with xgrammar: up to 10x faster (see the example sketch after this list).
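For readers who want to try the structured-output path, here is a minimal sketch against SGLang's OpenAI-compatible endpoint. It assumes a server is already running on localhost:30000 with the xgrammar backend enabled; the launch command and the --grammar-backend flag name are assumptions based on this release, and the model name "default" and the schema are placeholders.

```python
# Minimal sketch: grammar-constrained JSON output via the OpenAI-compatible API.
# Assumes a server was started roughly like (flag name is an assumption;
# check --help for your installed version):
#   python -m sglang.launch_server --model-path <your-model> \
#       --grammar-backend xgrammar --port 30000
import json

import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Hypothetical schema used only for illustration.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["name", "population"],
}

response = client.chat.completions.create(
    model="default",  # placeholder; the server serves whichever model was launched
    messages=[{"role": "user", "content": "Give me facts about Paris as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
    temperature=0,
)
print(json.loads(response.choices[0].message.content))
```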
What's Changed
- fix: add xgrammar dependency by @zhyncs in #2126
- docs: fix module docstrings and copyright headers by @XuehaiPan in #2077
- feat(pre-commit): trim unnecessary notebook metadata from git history by @XuehaiPan in #2127
- Expose max total num tokens from Runtime & Engine API by @henryhmko in #2092
- Only stream output on tp rank 0 by @merrymercy in #2124
- Revert "Only stream output on tp rank 0" by @merrymercy in #2130
- Add initial support for intel Gaudi accelerators by @ankurneog in #2121
- Add simple CPU offloading support. by @janimo in #2081
- Fix grid size in Triton decoding kernel by @ispobock in #2134
- [CI] Fix test cases by @merrymercy in #2137
- Add concurrency option for benchmark by @cermeng in #2136
- Fix dp print message by @merrymercy in #2138
- fix: resolve bench_serving args by @zhyncs in #2139
- [router] cache-aware load-balancing router v1 by @ByronHsu in #2114
- Bump sglang-router to 0.0.5 by @ByronHsu in #2142
- update router doc by @ByronHsu in #2143
- fix dp_rank env by @ByronHsu in #2144
- Add more api routes (completion, health, etc) to the router by @ByronHsu in #2146
- add prefix match for certain tenant by @ByronHsu in #2147
- Improve sglang router by @ByronHsu in #2148
- Merged three native APIs into one: get_server_info by @henryhmko in #2152
- feat: remove the dependency on FusedMoE by @zhyncs in #2153
- feat: update gitignore and add tuning config for FusedMoE by @zhyncs in #2155
- fix: resolve end-of-file-fixer by @zhyncs in #2157
- Simplify Scheduler.update_running_batch by @merrymercy in #2154
- feat: update other MoE models deps by @zhyncs in #2156
- Update CI threshold & Improve code style by @merrymercy in #2159
- fix: use torch.sum for compatible by @zhyncs in #2161
- Fix mixed chunked prefill in overlap mode by @merrymercy in #2158
- Balance CI tests by @merrymercy in #2162
- Rename triton_fused_moe -> fused_moe_triton by @merrymercy in #2163
- Fix docs by @merrymercy in #2164
- [Fused moe] add tuning fused configs for qwen2 57b and mixtral 8x7b by @BBuf in #2167
- Allow overwrite flashinfer use_tensorcore by @merrymercy in #2169
- Replace prob based with threshold based load balancing by @ByronHsu in #2170
- feat: fused_moe fp8 monkey patch by @zhyncs in #2174
- [Fix] Avoid calling fill_vocab_mask for terminated requests by @Ubospica in #2175
- [CI] Split test cases in CI for better load balancing by @merrymercy in #2180
- Bump rustls from 0.23.16 to 0.23.18 in /rust by @dependabot in #2182
- [feat] Refactor session control interface and add CI by @Ying1123 in #2173
- [router] Replace print with logger by @ByronHsu in #2183
- Use custom allreduce w/ torch.compile by @merrymercy in #2185
- [Performance]: Process affinity to CPU cores with multiple sockets support by @HaiShaw in #2171
- Update CI threshold by @merrymercy in #2186
- Update XGrammar to the latest API by @Ubospica in #2176
- [router] Rust e2e test by @ByronHsu in #2184
- Input_embeds support by @RinRin-32 in #2052
- [CI] Minor fix for CI by @merrymercy in #2187
- Rename double sparsity config file by @merrymercy in #2188
- Release v0.3.6.post1 by @merrymercy in #2189
- Update sampler.py to skip the success check by @merrymercy in #2197
- remove unused imports by @WrRan in #2195
- Remove unresolved reference 'self' by @apemost in #2198
- using "is not", not "!=", to test None by @WrRan in #2196
- fix: add cuda-python for xgrammar by @zhyncs in #2199
- minor: update check_env by @zhyncs in #2201
- add sglang version to get_server_info by @binarycrayon in #2206
- docs: update adoption by @zhyncs in #2204
- Bump router to 0.0.9 with better logging by @ByronHsu in #2207
- Fix rust warning by @ByronHsu in #2208
- Fix flaky tests by @merrymercy in #2212
- [feat] Support session control for vision language models by @Ying1123 in #2210
- Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default by @merrymercy in #2217
- Revert "Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default" by @merrymercy in #2221
- Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default by @merrymercy in #2222
- Release v0.3.6.post2 by @merrymercy in #2214
- Rename DP_RANK to SGLANG_DP_RANK by @merrymercy in #2218
- [3rdparty, document] Updated Documentation that for triton fused_moe kernel tuning for AMD Instinct GPUs by @kkHuang-amd in #2191
- Bump sglang-router to 0.0.10 for env name change by @ByronHsu in #2226
- fix typo prompts by @qibaoyuan in #2224
- Remove fused_moe_grok by @merrymercy in #2223
- add profile in offline benchmark & update doc by @bjmsong in #2123
- Rename tuned MI300X config files for fused_moe_triton by @HaiShaw in #2228
- Update Install Method 2. From source by @HaiShaw in #2232
- Fix chunked prefill size for bench_offline_throughput by @merrymercy in #2234
- Disable overlap scheduler for multimodal models by @merrymercy in #2235
- Add OLMo2 model. by @janimo in #2233
- Crash the server correctly during error by @merrymercy in #2231
- Fix memory leak during abort by @merrymercy in #2238
- fix missing launch server import by @qeternity in #2242
- [fix] Fix prefix caching for multi-image/video by @Ying1123 in #2239
- Update backend.md by @merrymercy in #2250
- Update backend.md by @merrymercy in #2251
- Revert "Add simple CPU offloading support" by @Ying1123 in #2252
- Revert "Revert "Add simple CPU offloading support"" by @Ying1123 in #2253
- Simplify tokenizer manager by @merrymercy in #2254
- Fix hash collision for multi modal models by @merrymercy in #2256
- [Minor] fix the style for multimodal models by @merrymercy in #2257
- chore: bump v0.3.6.post3 by @zhyncs in https://github.com/sgl-project/sglang/pul...
Release v0.3.6
Highlights
- Reduce CPU overhead by enabling overlap scheduler by default. 1.1x higher throughput. (#2105, #2067, #2095)
- Support data parallelism for attention and MLA. 1.5x higher decoding throughput. (#1970, #2061)
- Cache-aware load balancer. 4x higher cache hit rate (#1934)
- Support xgrammar backend for grammar-guided decoding (#2056)
- Support Prometheus metrics (#1853, #1981); see the metrics example after this list
- Support torch 2.5.1 (#2069) and torch-native tensor parallelism (#1876)
- Support graceful termination (#1838) and watchdog (#1816)
- Support notebook-style documentation (https://sgl-project.github.io/)
- Add an offline benchmark script (#1968)
- Bug, deadlock, NaN, and OOM fixes (#2083, #1850, #1800, #1779, #1789, #1858)
- New models: Phi3-small (#2062), Gemma-2 reward model (#1954), GPT-2 (#1833)
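As a small illustration of the Prometheus metrics highlight above, the sketch below scrapes the metrics endpoint of a running server. It assumes the server was launched with metrics enabled (the --enable-metrics flag name is an assumption) and listens on localhost:30000; metric names vary between versions.

```python
# Minimal sketch: read Prometheus metrics from a running SGLang server.
# Assumed launch (flag name may differ; verify with --help):
#   python -m sglang.launch_server --model-path <your-model> --enable-metrics --port 30000
import requests

resp = requests.get("http://localhost:30000/metrics", timeout=5)
resp.raise_for_status()

# Show only SGLang-specific series to keep the output short.
for line in resp.text.splitlines():
    if line.startswith("sglang"):
        print(line)
```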
What's Changed
- Fix edge case for truncated by @ByronHsu in #1747
- Fuse more ops & Simplify token mapping by @merrymercy in #1758
- [API] add get memory pool size by @Ying1123 in #1760
- Fix perf regression for set_kv_buffer by @merrymercy in #1765
- [Fix] Fix abort in data parallelism by @merrymercy in #1767
- Fix stop condition for <|eom_id|> by @merrymercy in #1766
- Update docs by @merrymercy in #1768
- Fix missing additional_stop_token_ids by @merrymercy in #1769
- Fix out of memory message. by @hnyls2002 in #1771
- Crash the server on warnings in CI by @merrymercy in #1772
- Fix the perf regression due to additional_stop_token_ids by @merrymercy in #1773
- Fix MockTokenizer in the unit tests by @merrymercy in #1774
- [Bug] Catch any errors caused by parsing json schema by @zolinthecow in #1776
- [Fix] Fix NaN issues by fixing the cuda graph padding values for flashinfer by @merrymercy in #1779
- [Fix] Fix cuda graph padding for triton attention backend by @merrymercy in #1782
- check user-specified model_max_len with hf derived max_model_len by @BBuf in #1778
- Re-introduce get_cuda_graph_seq_len_fill_value by @merrymercy in #1783
- Enhance the test case for chunked prefill and check memory leak by @merrymercy in #1785
- Fix seq_lens_sum for cuda graph runner in padded cases by @merrymercy in #1789
- Qwen2vl support cuda graph and disable radix cache by @yizhang2077 in #1780
- Fix log parsing in the chunked prefill unit tests by @merrymercy in #1793
- Fix memory leak when doing chunked prefill by @hnyls2002 in #1787
- [Fix] Fix the log parsing in chunked prefill unit tests by @merrymercy in #1794
- Revert "Fix memory leak when doing chunked prefill" by @merrymercy in #1797
- Fix logprob in the overlapped mode by @merrymercy in #1795
- Release v0.3.4.post2 by @merrymercy in #1796
- [Performance] Support both xgrammar and outlines for constrained decoding by @DarkSharpness in #1752
- [Fix] Fix --skip-tokenizer-init by @merrymercy in #1798
- move max_position_embeddings to the last by @hliuca in #1799
- add support for ipynb by @zhaochenyang20 in #1786
- Fix possible ZMQ hanging by @hnyls2002 in #1800
- Set ZMQ buffer size heuristic by @hnyls2002 in #1801
- Allow consecutive ports when launching multiple sglang servers. by @hnyls2002 in #1802
- fix int conversion for SGLANG_CPU_COUNT by @ByronHsu in #1803
- Update ci workflows by @merrymercy in #1804
- Update links by @merrymercy in #1805
- Simplify our docs with complicated functions into utils by @zhaochenyang20 in #1807
- Fix docs ci by @zhaochenyang20 in #1808
- Provide an argument to set the maximum batch size for cuda graph by @merrymercy in #1809
- Improve the user control of new_token_ratio by @merrymercy in #1811
- Update hyperparameter_tuning.md by @merrymercy in #1813
- Add a watch dog thread by @merrymercy in #1816
- Fix unit tests by @merrymercy in #1817
- Add openAI compatible API by @zhaochenyang20 in #1810
- Fix Triton decode kernel & ut by @ispobock in #1819
- support token ids in engine.generate by @ByronHsu in #1820
- Fix docs deploy ci by @zhaochenyang20 in #1821
- [router] rust-based router by @ByronHsu in #1790
- Fix update_weights deadlock for DP by @ByronHsu in #1825
- fix get_memory_pool_size deadlock for DP by @ByronHsu in #1830
- Support setting use_thread in the run_program for easier debugging. by @liuyanyi in #1823
- [3rdparty, document] Add 3rdparty/amd, with profiling and tuning instructions to be added by @HaiShaw in #1822
- stop_str of qwen2-vl template should be a tuple not a str by @yizhang2077 in #1834
- [FP8 KV Cache, Mixtral] Avoid KeyError at loading pre-quantized FP8 m… by @HaiShaw in #1835
- Gpt2 by @DanielC12321 in #1833
- Improve openai api documents by @zhaochenyang20 in #1827
- Update docs by @merrymercy in #1839
- Update README.md by @merrymercy in #1840
- [Production] Drain requests before exit when receive SIGTERM by @Ying1123 in #1838
- [Performance, Hardware] MoE weights padding to AMD MI300x GPUs by @HaiShaw in #1836
- Fix suggest edit by @zhaochenyang20 in #1842
- [Performance, Triton Kernel Args] _decode_grouped_softmax_reducev_fwd… by @HaiShaw in #1845
- Make decode log interval configurable by @ByronHsu in #1847
- Fix mixed chunked prefill by @merrymercy in #1850
- Refactor tokenizer manager by @ByronHsu in #1846
- Simplify documentation by @merrymercy in #1851
- Fix warnings in doc build by @merrymercy in #1852
- delete unused character by @geeker-smallwhite in #1855
- Fix memory leak for chunked prefill 2 by @merrymercy in #1858
- [Build, ROCm] Dockerfile.rocm for Instinct GPUs, with package updates by @HaiShaw in #1861
- Fix retraction + overlap by @hnyls2002 in #1860
- change file tree by @zhaochenyang20 in #1859
- Update vocab embedding deps and add TP switch by @ispobock in #1856
- minor: add human eval by @zhyncs in #1754
- Add vlm document by @zhaochenyang20 in #1866
- minor: update nightly eval by @zhyncs in #1867
- [3rdparty, document] Updated Documentation that covers performance tuning techniques for AMD Instinct GPUs. by @yichiche in #1871
- Improve docs and fix the broken links by @merrymercy in #1875
- Add a FAQ documentation by @merrymercy in #1877
- Update docs title by @merrymercy in #1879
- Update docs and workflow by @merrymercy in #1881
- Fix doc links by @merrymercy in #1882
- Fix incorrect context length for llama3.2-11b by @rchen19 in #1873
- add native api docs by @zhaochenyang20 in #1883
- Update index.rst to improve the order of docs by @merrymercy in #1885
- Native api by...
Release v0.3.4.post1
Highlights
- Hosted the first LMSYS online meetup: Efficient LLM Deployment and Serving.
- Covered CPU overhead hiding, faster constrained decoding, and DeepSeek MLA. Slides
- Added an Engine API for offline inference with reduced overhead (usage sketch after this list). #1614 #1567
- Added an overlap scheduler for reducing CPU overhead #1738
- New models: Llama 3.2 (#1551), QWen-VL2 (#1721), OLMo (#1676), GLM 4 (#1736).
- Added support for reward models #1525.
- Added support for Intel XPU #1480.
- Improved stability for greedy decoding #1589.
- Accelerated multi-LoRA serving #1587.
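The Engine highlight above refers to running inference without an HTTP server. A minimal sketch of that usage follows; the model path is a placeholder, and the exact output format (here a dict with a "text" field) may differ slightly between versions.

```python
# Minimal sketch: offline inference with the SGLang Engine API (no HTTP server).
import sglang as sgl

# Placeholder model path; substitute any model you have access to.
llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

prompts = ["The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0, "max_new_tokens": 32}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])

llm.shutdown()  # release GPU memory and worker processes
```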
What's Changed
- [Fix] Ignore model import error by @merrymercy in #1513
- minor: fix config by @hnyls2002 in #1524
- [Event] Update meeting link by @Ying1123 in #1529
- [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B by @Ying1123 in #1525
- Add float8 dynamic quant to torchao_utils by @jerryzh168 in #1528
- [FIX] Catch syntax error of Regex Guide to avoid crash by @du00cs in #1521
- [bugfix] Add modelscope package to avoid docker image without modelscope by @KylinMountain in #1520
- Fix RuntimeEndpoint.select method by @jeffrey-fong in #1495
- Multiple minor fixes by @merrymercy in #1530
- Make detokenizer_manager.py not asyncio by @merrymercy in #1532
- Organize image inputs by @hnyls2002 in #1531
- Improve process creation by @merrymercy in #1534
- fix ipv6 url when warm up model by @cauyxy in #1537
- Move scheduler code from tp_worker.py to scheduler.py by @merrymercy in #1538
- Process image in parallel by @hnyls2002 in #1539
- Let ModelRunner take InputMetadata as input, instead of ScheduleBatch by @merrymercy in #1541
- Rename InputMetadata -> ForwardBatch by @merrymercy in #1543
- Clean up batch data structures: Introducing ModelWorkerBatch by @merrymercy in #1544
- [Fix, LoRA] fix LoRA with updates in main by @Ying1123 in #1545
- Organize Attention Backends by @hnyls2002 in #1547
- Fix bugs of logprobs_nums by @hnyls2002 in #1548
- Dispatch flashinfer wrappers by @hnyls2002 in #1550
- Simplify flashinfer dispatch by @hnyls2002 in #1552
- [Refactor] Simplify io_struct and tokenizer_manager by @Ying1123 in #1549
- [Performance, Hardware] MoE tuning on AMD MI300x GPUs by @kkHuang-amd in #1554
- [Fix] Fix all the Huggingface paths by @tbarton16 in #1553
- [Fix] do not maintain regex_fsm in SamplingBatchInfo by @merrymercy in #1555
- [Fix] Move ScheduleBatch out of SamplingInfo by @merrymercy in #1556
- Move status check in the memory pool to CPU by @merrymercy in #1557
- [Fix] Fix AttributeError in Qwen2.5 LoRA: 'Qwen2ForCausalLM' object has no attribute 'get_hidden_dim' by @mssongit in #1536
- [FP8 KV Cache] Avoid KeyError at loading pre-quantized FP8 model with kv_scale by @HaiShaw in #1559
- Organize sampling batch info better by @merrymercy in #1562
- Use ipc instead of tcp in zmq by @merrymercy in #1566
- Make input_ids a torch.Tensor by @merrymercy in #1568
- [Minifix] Remove extra space in cot example by @FredericOdermatt in #1569
- [Fix] Fix major performance bug in certain cases by @Ying1123 in #1563
- Refine the add request reasons to avoid corner cases. by @hnyls2002 in #1574
- chore: update README.md by @eltociear in #1580
- [Easy] use .text() instead of .text by @ByronHsu in #1577
- [Event] Update README.md by @Ying1123 in #1572
- Add llama implementation with no tensor parallel linears by @jerryzh168 in #1561
- Backend method not found when SRT Runtime is used by @ByronHsu in #1576
- default sampling param should be deepcopied by @ByronHsu in #1581
- Fix styling by @ByronHsu in #1583
- Fix runtime.generate when sampling param is not passed by @ByronHsu in #1582
- Support min_tokens in sgl.gen by @ByronHsu in #1573
- [Minor] Improve the style and fix flaky tests by @merrymercy in #1584
- [Bug] Fix decode stats error on output_len 1 by @HaiShaw in #1585
- Clean up event loop by @merrymercy in #1586
- [LoRA, Performance] Speedup multi-LoRA serving - Step 1 by @Ying1123 in #1587
- [Minor, Performance] Use torch.argmax for greedy sampling by @Ying1123 in #1589
- Test consistency for single and batch separately by @ByronHsu in #1590
- Update README.md by @merrymercy in #1591
- Fix modality for image inputs by @merrymercy in #1592
- Provide an offline engine API by @ByronHsu in #1567
- [Fix] Fix the case where prompt_len = 0 by @merrymercy in #1593
- Use atexit hook to implicitly shutdown Runtime by @ByronHsu in #1595
- Use is_flashinfer_available to replace is_hip for flashinfer check by @merrymercy in #1596
- Fix chunked prefill condition by @ispobock in #1594
- Fix the port_args in bench_latency by @merrymercy in #1597
- Remove references to squeezellm by @janimo in #1603
- [Profile] Add pytorch profiler by @Ying1123 in #1604
- [Engine] Fix generate hanging issue after the first call by @ByronHsu in #1606
- Release v0.3.3 by @merrymercy in #1605
- [Minor] Fix logging typo by @amosyou in #1615
- Fix test_vision_openai_server on CI by @ByronHsu in #1620
- [Performance, hardware] MoE tuning update to AMD MI300x GPUs by @HaiShaw in #1619
- Update README.md by @kushal34712 in #1625
- Update README.md by @merrymercy in #1629
- Add device support by @liangan1 in #1607
- Nit about the decorator of PortArgs.init_new by @glen-amd in #1611
- [Bug] Fix the Image Input of Batch Generation by @OBJECT907 in #1579
- Add the ability to enable and disable the Profiler via HTTP API. by @Abatom in #1626
- Fix the correctness test in bench_latency.py when tp > 1 and test_generation_models.py by @merrymercy in #1631
- Add image_token in conversation.py by @merrymercy in #1632
- Added a "Back To Top" Button by @JanumalaAkhilendra in #1633
- Fix constrained decoding by @merrymercy in #1634
- Add back data parallelism by @merrymercy in #1635
- Release v0.3.3.post1 by @merrymercy in #1636
- [engine] support async and streaming by @ByronHsu in #1614
- [Fix] Fix the style of test_large_max_new_tokens.py by @merrymercy in #1638
- fix missing ignore_eos in v1/chat/completions by @learninmou in #1642
- Fix ignore_eos in the OpenAI ChatCompletions API by @merrymercy in #1645
- [Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch by @liangan1 in #1480
- Fix...
Release v0.3.2
Highlights
- Support torch.compile and CUDA graph for the Triton attention backend and DeepSeek MLA #1442 #1422 (see the example after this list)
- Initial support for multi-LoRA serving #1307
- Integrate torchao for quantization #1341
- Optimize the CPU scheduler overhead
- Multiple critical bug fixes for llama and llava (tokenizer, modality)
- Support AMD backend #1420
- New models: MiniCPM3, OLMoE
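To exercise the first highlight above, the sketch below sends a plain completion to a server that was launched with the Triton attention backend and torch.compile enabled. The flag names come from the PRs in this release (#1380, #1422) but may change later; the model name is a placeholder.

```python
# Minimal sketch: query a server launched with the Triton backend + torch.compile,
# e.g. (flag names taken from this release's PRs; verify with --help):
#   python -m sglang.launch_server --model-path <your-model> \
#       --attention-backend triton --enable-torch-compile --port 30000
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="default",  # placeholder; SGLang serves whichever model was launched
    prompt="List three prime numbers:",
    max_tokens=16,
    temperature=0,
)
print(completion.choices[0].text)
```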
What's Changed
- Remove useless fields in global_config.py by @merrymercy in #1328
- docs: update README by @zhyncs in #1336
- docs: highlight ttft itl and throughput by @zhyncs in #1337
- docs: add conclusion by @zhyncs in #1340
- Optimize schedule by @hnyls2002 in #1339
- Fix some online scheduling delay by @hnyls2002 in #1345
- [triton] Support head_dim not 2^n in triton extend and decode attention by @ByronHsu in #1281
- [Feat] Add modalities for vision server when handling pixel values for llava by @kcz358 in #1346
- [server] Passing model_override_args to launch_server via the CLI. by @kevin85421 in #1298
- [Minor] Many cleanup by @merrymercy in #1357
- Add torchao quant (int4/int8/fp8) to llama models by @jerryzh168 in #1341
- [CI] Return output logprobs in unit test by @Ying1123 in #1361
- Unify forward mode by @hnyls2002 in #1360
- Support OpenAI API json_schema response format by @zifeitong in #1363
- Adding Documentation for installation by @zhaochenyang20 in #1300
- [Docs] Improve documentations by @merrymercy in #1368
- fix bug of undefined is_single in method create_abort_task by @wcsjtu in #1370
- Support MiniCPM3 by @Achazwl in #1371
- Fix CORS compatibility with OpenAI, vLLM, TGI, LMDeploy by @josephrocca in #1373
- [Minor] improve kill scripts and torchao import by @merrymercy in #1375
- Fix vocab mask update bug by @hnyls2002 in #1376
- [Minor] move triton attention kernels into a separate folder by @merrymercy in #1379
- Deprecate --disable-flashinfer and introduce --attention-backend by @merrymercy in #1380
- Organize flashinfer indices update by @hnyls2002 in #1378
- remove assertion in triton attention and add an unit test by @ByronHsu in #1385
- BaiChuan2 Model by @blacker521 in #1367
- [Fix] Fix --disable-flashinfer by @merrymercy in #1389
- Improve error reporting during server launch by @merrymercy in #1390
- Refactor attention backend by @merrymercy in #1381
- Add no commit to main rule by @hnyls2002 in #1393
- Fix README format by @Achazwl in #1399
- Support cuda graph in the triton attention backend by @merrymercy in #1401
- kernel: use tensor cores for flashinfer gqa kernels by @yzh119 in #1403
- [Minor Fix] Fix llava modalities issue for single-image by @kcz358 in #1402
- Add Support for XVERSE Models (Dense and MoE) to sglang by @hxer7963 in #1397
- [Feature] Initial support for multi-LoRA serving by @Ying1123 in #1307
- [Minor, CI] remove lora test from minimal suite by @Ying1123 in #1406
- Make stop reason a dict instead of str by @merrymercy in #1407
- [CI] Include triton backend and online serving benchmark into CI by @merrymercy in #1408
- [Minor] Raise exception for wrong import by @Ying1123 in #1409
- Balance test in CI by @merrymercy in #1411
- Update pr-test.yml by @merrymercy in #1412
- ci: fix finish by @zhyncs in #1414
- Optimize conflicts between CUDA graph and vocab mask tensors by @hnyls2002 in #1392
- Add torchao quant for mixtral and qwen_moe by @jerryzh168 in #1418
- Add pytorch sampling backend ut by @ispobock in #1425
- fix: resolve nightly eval by @zhyncs in #1426
- Enable torch.compile for triton backend by @merrymercy in #1422
- Add libibverbs-dev to Dockerfile by @Aphoh in #1427
- Update backend.md by @merrymercy in #1429
- [Fix] Fix logprob and normalized_logprob by @merrymercy in #1428
- Release v0.3.1 by @merrymercy in #1430
- Remove deprecated configs by @merrymercy in #1431
- [Feature] Support LoRA path renaming and add LoRA serving benchmarks by @Ying1123 in #1433
- Revert "[Minor] Raise exception for wrong import (#1409)" by @Ying1123 in #1432
- Add constrained_json_whitespace_pattern to ServerArgs by @zifeitong in #1438
- Clean up model loader by @merrymercy in #1440
- Simplify sampler and its error handling by @merrymercy in #1441
- [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm by @HaiShaw in #1420
- Fix torch compile for deepseek-v2 by @ispobock in #1442
- Add OLMoE model by @janimo in #1444
- Release 0.3.1.post1 by @merrymercy in #1445
- Enable MLA by default by @ispobock in #1447
- Fix attention backend by @ispobock in #1448
- fix schedule bug by @hnyls2002 in #1450
- Fix schedule bug by @hnyls2002 in #1451
- Fixed n>1 causing list index out of range with VLM by @jasonyux in #1449
- Add bench_server_latency.py by @merrymercy in #1452
- [Bugfix] Enable SGLang on AMD GPUs via PyTorch for ROCm (#1419) by @HaiShaw in #1453
- Fix oom issues with fp8 for llama by @merrymercy in #1454
- Fuse top_k and top_k in the sampler by @merrymercy in #1457
- [Event] Add public meeting invite to README by @Ying1123 in #1458
- fix: create new dict every time for putting new frame by @Luodian in #1464
- Fix padding in the cuda graph by @merrymercy in #1469
- Release v0.3.1.post2 by @merrymercy in #1470
- Fix env vars in bench_latency by @merrymercy in #1472
- feat: update linear deps 1/N by @zhyncs in #1305
- minor: add quant eval compared with base by @zhyncs in #1475
- Add OLMoE by @Muennighoff in #1476
- Fix triton head num by @ispobock in #1482
- Add MLA gsm8k eval by @ispobock in #1484
- chore: bump v0.3.1.post3 by @zhyncs in #1483
- fix incorrect links in documentation by @rchen19 in #1481
- doc: update backend by @zhyncs in #1486
- Better unit tests for adding a new model by @merrymercy in #1488
- Pr fix max workers by @wellhowtosay in #1456
- Add a unit test for data parallelism by @merrymercy in #1489
- Add AMD tests to CI by @Ying1123 in #1491
- Update dockerfile to include datamodel_code_generator by @merrymercy in #1492
- [API, Feature] Support response prefill for openai API by @Ying1123 in #1490
- minor: add mla fp8 test by @zhyncs in #1494
- Fix the overhead due to penalizer in bench_latency by @merrymercy i...
Release v0.3.0
Highlights
Checkout the release blog post https://lmsys.org/blog/2024-09-04-sglang-v0-3/ to find detailed instructions and descriptions for the items below.
- Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
- Up to 1.5x lower latency with torch.compile on small batch sizes
- Support for interleaved text and multi-image/video in LLaVA-OneVision (see the example after this list)
- Support for interleaved window attention and 2x longer context length in Gemma-2
- Chunked prefill is turned on by default (you can choose to keep prefill and decode separate or to mix them).
- Add multi-GPU accuracy, performance test, and nightly accuracy test for more models.
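For the LLaVA-OneVision highlight above, the sketch below sends one interleaved text-plus-image request through the OpenAI-compatible vision API. It assumes a LLaVA-OneVision server is already running on localhost:30000; the model name and the image URL are placeholders.

```python
# Minimal sketch: interleaved text + image request via the OpenAI-compatible API.
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",  # placeholder for the launched LLaVA-OneVision model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.jpg"},  # placeholder URL
                },
            ],
        }
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```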
What's Changed
- update hyperparameter guide by @merrymercy in #1114
- ci: compatible with fork repo by @zhyncs in #1115
- fix: resolve Python.h header missing by @zhyncs in #1119
- Fix the deadlock in multi-node tp by @merrymercy in #1122
- Mixed style of chunked prefill by @hnyls2002 in #1013
- Fix port conflicts between local CI and runner CI. by @hnyls2002 in #1131
- Fix CI accuracy && time out limit by @hnyls2002 in #1133
- fix: use fp16 dtype for sm75 by @zhyncs in #1136
- Improve the code style: more comments and remove useless packages by @merrymercy in #1139
- Improve benchmark by @merrymercy in #1140
- Fix duplicated imports in hf_transformers_utils.py by @merrymercy in #1141
- fixed a typo by @min-xu-et in #1143
- [Docs] Add instruction for running on clouds and kubernetes with SkyPilot by @Michaelvll in #1144
- [Feat]Add support for optional start len of logprobs by @yichuan520030910320 in #1035
- Optimize MLA/GQA/MQA Triton decoding by @ispobock in #1138
- feat: allow streaming for multi-prompt and/or parallel sampling by @vhain in #1134
- Improve docs and warnings by @merrymercy in #1164
- [Feature] add disable-custom-all-reduce by @Xu-Chen in #1148
- misc: add hypervisor vendor by @zhyncs in #1165
- support /v1/health using a generation 1 token by @LucienShui in #1154
- fix: resolve README render by @zhyncs in #1166
- [Feat] Support update weights without restart server by @shanyu-sys in #1157
- Improve multi-node stability by @merrymercy in #1171
- fix: custom op fallback forward native when lower sm80 by @zhyncs in #1177
- [Feature] Add a function to convert sampling_params to kwargs by @gryffindor-rr in #1170
- Support min-p sampling by @intervitens in #1167
- [Docs] Fix rendering of details in README by @Michaelvll in #1179
- Improve code style of sampler by @hnyls2002 in #1168
- [Minor] Improve logging and rename the health check endpoint name by @merrymercy in #1180
- Fix broken penalty by @hnyls2002 in #1184
- Fix benchmark script by @Ying1123 in #1185
- [Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. by @kcz358 in #1123
- feat: use gelu_tanh_and_mul by @zhyncs in #1193
- Cleanup readme, llava examples, usage examples and nccl init by @merrymercy in #1194
- Update README.md by @merrymercy in #1198
- [CI] Fix the problem of hf runner too slow by @Ying1123 in #1202
- [Fix] the issue of random order when input is a list by @Ying1123 in #1199
- Relax the assert in moe throughput test to fix the flaky CI by @merrymercy in #1207
- [Fix] Fixing the multi-images error for llava-onevision by @kcz358 in #1205
- Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model by @zhaochenyang20 in #1186
- [Minor] Improve the function organization in TokenizerManager & improve loggers by @merrymercy in #1208
- [Minor] Temporarily skip flaky test by @Ying1123 in #1209
- [CI] Fix the issue of unit test hanging by @Ying1123 in #1211
- Update CI workflows by @merrymercy in #1210
- Update CI runner docs by @merrymercy in #1213
- [Feature] Support fp8 e5m2 kv cache with flashinfer by @ispobock in #1204
- Update workflow files by @merrymercy in #1214
- improve the threshold and ports in tests by @wisclmy0611 in #1215
- [CI] Fix CI by @wisclmy0611 in #1217
- [Fix] Multi-images loading error by @kcz358 in #1218
- [Minor] improve CI and dependencies by @hnyls2002 in #1212
- [CI] Parallelize unit tests in CI by @wisclmy0611 in #1219
- Move sampler into CUDA graph by @hnyls2002 in #1201
- chore: bump v0.2.14 by @zhyncs in #1155
- [FEAT] JSON constrained support by @havetc in #1125
- Torch compile CI throughput test by @hnyls2002 in #1223
- [FEAT] Support batches cancel by @caiyueliang in #1222
- [Minor] add delete test and delete tmp file on ci server by @yichuan520030910320 in #1227
- [FIX] Wrong logger by @havetc in #1230
- feat: replace get_act_fn for gpt_bigcode by @zhyncs in #1231
- Fix readme by @ArtificialZeng in #1236
- Fix bench latency benchmark by @hnyls2002 in #1225
- [Minor] Add more type annotations by @merrymercy in #1237
- feat: support sm75 with FlashInfer v0.1.6 by @zhyncs in #1233
- Update README.md by @merrymercy in #1239
- hotfix: revert sampler CUDA Graph by @zhyncs in #1242
- Add sglang.bench_latency to CI by @merrymercy in #1243
- fix: increase max_new_tokens when testing generation models by @zhyncs in #1244
- feat: update GemmaRMSNorm by @zhyncs in #1232
- Fix llava on multi images by @merrymercy in #1247
- feat: replace GeluAndMul by @zhyncs in #1234
- fix: resolve qwen2 moe weight loader by @zhyncs in #1252
- chore: bump v0.2.14.post2 by @zhyncs in #1250
- make json_schema usable from gen by @qeternity in #1254
- fix data racing due to mutable reference using deepcopy by @xiezhq-hermann in #1255
- Sampler cudagraph by @hnyls2002 in #1253
- fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader by @lxww302 in #1260
- Transpose mla weight offline by @ispobock in #1261
- EXAONE 3.0 Model Support by @Deepfocused in #1258
- Update README Support Exaone 3.0 by @Deepfocused in #1267
- Report median instead of mean in bench_latency.py by @merrymercy in #1269
- Allow more flexible assistant and system response by @BabyChouSr in #1256
- fix: resolve the fp8 bug introduced by vLLM 0.5.5 by @zhyncs in #1276
- [doc] fix quick start link by @ByronHsu in #1282
- Optimize the update flashinfer indices by @xiaobochen123 in #1262
- [CI] Add more multi-gpu tests by @merrymercy in #1280
- feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 by @zhyncs in #1285
- [CI] merge all ci tests into one file by @merrymercy i...
Release v0.2.13
Highlights
- New Feature: Support window attention for Gemma-2 (#1056 #1090 #1112), enable chunked-prefill by default (#1040 #984), support all sampling penalties (#973)
- New Models: Support the embedding model e5-mistral (#983 #987 #988 #997 #1014) and a comprehensive OpenAI-compatible API (see the embedding example after this list).
- Performance: Accelerate Multi-head Latent Attention (MLA), bringing a 2x end-to-end improvement on DeepSeek V2 (#905).
- More CI Tests: Accuracy test (multiple benchmarks), unit test (APIs, model implementations), E2E test (high pressure test, performance test), MoE test
- Refactor and fix: More modular, better stability, use more kernels from flashinfer (#907)
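For the e5-mistral embedding highlight above, the sketch below calls the OpenAI-compatible embeddings route. It assumes an embedding server is already running on localhost:30000 (for example serving intfloat/e5-mistral-7b-instruct); the model name passed to the client is a placeholder.

```python
# Minimal sketch: request an embedding from a running SGLang embedding server.
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
result = client.embeddings.create(
    model="default",  # placeholder; the server embeds with whichever model was launched
    input=["SGLang is a fast serving framework for large language models."],
)
vector = result.data[0].embedding
print(len(vector), vector[:5])
```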
What's Changed
- fix: set env in runner by @zhyncs in #891
- docs: update setup runner by @zhyncs in #884
- misc: update cuda graph capture exception log by @zhyncs in #894
- chore: add multipart dep for fastapi by @zhyncs in #895
- [minor] fixed code formatting doc by @min-xu-et in #896
- Bump version to 0.2.9.post1 by @Ying1123 in #899
- Update the base image of the docker by @Ying1123 in #900
- Reorder CI unit tests. by @hnyls2002 in #908
- fixed an error handling in bench_latency.py by @min-xu-et in #904
- Add model accuracy test - step 1 by @Ying1123 in #866
- latency test enhancement - part 1 by @min-xu-et in #909
- Improve the structure of CI by @Ying1123 in #911
- fix: use e2e and unit test only for original repo or pr by @zhyncs in #912
- misc: add triton in check_env PACKAGE_LIST by @zhyncs in #914
- Support MLA for DeepSeek-V2 with Triton - step 1 by @ispobock in #905
- enhance latency test - part 2 by @min-xu-et in #915
- Make API Key OpenAI-compatible by @Ying1123 in #917
- Update hyperparameter_tuning.md by @Ying1123 in #918
- Fix CI && python3.8 compatible by @hnyls2002 in #920
- Support more OpenAI API test by @yichuan520030910320 in #916
- Bump version to 0.2.10 by @Ying1123 in #923
- latency test enhancement - final part by @min-xu-et in #921
- Test openai vision api by @Ying1123 in #925
- Test regex in vision api by @Ying1123 in #926
- Update README.md by @Ying1123 in #927
- Fix prompt len in parallel sampling by @yichuan520030910320 in #928
- docs: update README by @zhyncs in #935
- Remove leftover auth_token by @AidanCooper in #934
- Feat: add alternative choices selection methods by @AidanCooper in #835
- Fix union operator by @ispobock in #940
- Support multiple args options by @yichuan520030910320 in #941
- Fix stuck in get_new_prefill_batch by @hnyls2002 in #948
- Organize code (rename, movement) by @hnyls2002 in #953
- fix nsys cannot profile cuda kernel by @mpjlu in #957
- Add support for Batch API test by @yichuan520030910320 in #936
- Show more error messages for warmup errors by @Ying1123 in #932
- misc: update issue template by @zhyncs in #963
- misc: simplify test by @yichuan520030910320 in #964
- misc: add compute capability in check_env by @zhyncs in #965
- Make req_pool_indices on CPU by @hnyls2002 in #960
- misc: fix the req_to_token member change by @hnyls2002 in #967
- chore: update vllm to 0.5.4 by @zhyncs in #966
- chore: bump v0.2.11 by @zhyncs in #970
- Purge self-runner's pip cache weekly by @hnyls2002 in #975
- Run purge-cache only in sgl-project by @hnyls2002 in #976
- misc: correct the int data type for token ids and indices by @xiezhq-hermann in #969
- PrefillAdder abstraction by @hnyls2002 in #968
- RadixCache method adjust by @hnyls2002 in #977
- Adjust max prefix len by @hnyls2002 in #980
- #590 Increase default , track changes in examples and documentation by @foszto in #971
- [minor] Update type annotation in tokenizer_manager.py by @Ying1123 in #982
- Fix chunked prefill by @hnyls2002 in #984
- Add llama embedding modules [unreachable code] - step 1/3 by @Ying1123 in #983
- Add io struct for embedding models [unreachable code] - step 2/3 by @Ying1123 in #987
- Adjust InputeMetadata and ScheduleBatch by @hnyls2002 in #981
- support more options about usage in stream mode by @yichuan520030910320 in #985
- Create contributor_guide.md by @Ying1123 in #992
- feat: frequency, min_new_tokens, presence, and repetition penalties by @vhain in #973
- Move torch.compile configs into cuda_graph_runner.py by @Ying1123 in #993
- Add e5-mistral embedding model - step 3/3 by @Ying1123 in #988
- test: negative value testing for frequency, presence penalizers by @vhain in #995
- support models from www.modelscope.cn by @liuyhwangyh in #994
- bugfix: penalizers to be merged before reqs by @vhain in #1001
- fix: resolve correctness_test issue by @zhyncs in #1002
- Minor bugfix on benchmark serving by @ywang96 in #1005
- Add openai embedding API by @Ying1123 in #997
- Add skip_tokenizer_init args. by @gryffindor-rr in #959
- Fix benchmark latency by @wisclmy0611 in #1007
- Some warnings to crash when CI by @hnyls2002 in #1009
- Reduce the overhead when cache is disabled by @hnyls2002 in #1010
- Support embedding input as a list by @Ying1123 in #1014
- misc: update test config by @zhyncs in #990
- fix: force max new tokens to be 1 for embedding request by @Ying1123 in #1019
- Clean up unit tests by @merrymercy in #1020
- Fix input_ids && rename to fill_ids by @hnyls2002 in #1021
- feat: use FlashInfer rmsnorm and silu by @zhyncs in #907
- misc: update issue template by @zhyncs in #1024
- Clean up readme and arguments of chunked prefill by @merrymercy in #1022
- Fix wrong assert by @hnyls2002 in #1028
- Improve type annotation by @merrymercy in #1029
- hotfix: add CustomOp abstraction by @zhyncs in #1027
- Fix the case where r.prefix_indices is None by @merrymercy in #1031
- Fix triton args init by @hnyls2002 in #1034
- Fix the case when max_new_tokens is too large by @merrymercy in #1025
- Test the case when max_new_tokens is very large by @merrymercy in #1038
- Fix the prefix indices by @hnyls2002 in #1037
- Improve end-to-end throughput test and its coverage by @merrymercy in #1039
- Delete the useless test/srt/test_throughput.py by @merrymercy in #1045
- minor: some potential bugs by @hnyls2002 in #1044
- Clean up the comments and names under python/sglang/srt/layers by @merrymercy in #1047
- fix...
Release v0.2.9
Highlights
- New feature: Chunked prefill (#800, #811)
- New models: Deepseek v2
- Performance improvement: vectorized logprob computation
- Accuracy fix: fix the double BOS problem in the chat template; move logits to float32; update flashinfer sampling kernels
- Feature fix: fixed many missing logprob-related features in the OpenAI API server (see the example after this list)
- CI/CD infrastructure is now fully ready. The tests cover the frontend, the backend, accuracy, and performance.
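For the logprob fixes called out above, the sketch below asks the OpenAI-compatible completions endpoint to return per-token logprobs, including for the echoed prompt. It assumes a server on localhost:30000; the model name is a placeholder and the exact shape of the logprobs payload may vary slightly across versions.

```python
# Minimal sketch: per-token logprobs (prompt + completion) via /v1/completions.
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="default",          # placeholder; the server serves the launched model
    prompt="The capital of France is",
    max_tokens=8,
    temperature=0,
    logprobs=3,               # return top-3 alternatives per token
    echo=True,                # also return logprobs for the prompt tokens
)
choice = completion.choices[0]
print(choice.text)
print(choice.logprobs.token_logprobs)
```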
What's Changed
- Deepseek v2 support by @hnyls2002 in #693
- Fix context length by @hnyls2002 in #757
- docs: update model support by @zhyncs in #760
- fix: not run workflows on fork repo by @zhyncs in #762
- Update supported models by @hnyls2002 in #763
- Fix TransformerTokenizer init for chatglm2 & 3 by @ispobock in #761
- [Minor] Improve the code style in TokenizerManager by @merrymercy in #767
- Update readme by @Ying1123 in #769
- feat: add fake tag by @zhyncs in #770
- Fix max_tokens for OpenAI chat completion API by @merrymercy in #766
- Fix max new tokens by @merrymercy in #772
- Move sampling logits to float32 by @merrymercy in #773
- minor refactor: move check server args to server_args.py by @wisclmy0611 in #774
- Fix return_log_probs with cuda graph by @merrymercy in #775
- Rename prefill_token_logprobs -> input_token_logprobs; decode_token_logprobs -> output_token_logprobs by @merrymercy in #776
- Allow disabling flashinfer sampling kernel by @merrymercy in #778
- Bump version to 0.2.6 by @merrymercy in #779
- fix: replace pillow with PIL in PACKAGE_LIST by @zhyncs in #781
- docs: init readthedocs support by @zhyncs in #783
- fix: init readthedocs support by @zhyncs in #784
- fix: exclude logo png in gitignore by @zhyncs in #785
- docs: update index by @zhyncs in #786
- Vectorize logprobs computation by @Ying1123 in #787
- docs: update README by @zhyncs in #788
- docs: make badges center by @zhyncs in #789
- chore: add copyright for srt by @zhyncs in #790
- Fix echo + lobprob for OpenAI API when the prompt is a list by @Ying1123 in #791
- Update README.md by @Ying1123 in #792
- Lazy-import third-party backends by @bgyoon in #794
- Fix lazy import location by @Ying1123 in #795
- Fix logging by @Ying1123 in #796
- Add role documentation, add system begin & end tokens by @objnf-dev in #793
- Chunked prefill support by @hnyls2002 in #797
- Revert "Chunked prefill support" by @Ying1123 in #799
- Chunked prefill by @hnyls2002 in #800
- fix: update flashinfer to 0.1.2 to fix sampling for cu118 by @zhyncs in #803
- Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118" by @Ying1123 in #805
- feat: add chat template for internlm2-chat by @zhyncs in #802
- Revert "Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118"" by @Ying1123 in #806
- Add support for OpenAI API : offline batch(file) processing by @yichuan520030910320 in #699
- Organize public APIs by @hnyls2002 in #809
- Remove inf value for chunked prefill size by @hnyls2002 in #812
- Revert "Organize public APIs" by @Ying1123 in #815
- fix: use v0.2.5 for benchmark by @zhyncs in #814
- Fix LiteLLM kwargs by @qeternity in #817
- Code structure refactor by @hnyls2002 in #807
- docs: update README by @zhyncs in #819
- Fix streaming bug by @objnf-dev in #820
- feat: add runner by @zhyncs in #821
- feat: add pr e2e test by @zhyncs in #822
- Support disable_ignore_eos in bench_serving.py by @Ying1123 in #824
- Adjust default mem fraction to avoid OOM by @Ying1123 in #823
- Add awq_marlin by @Ying1123 in #826
- misc: update e2e test benchmark config by @zhyncs in #825
- misc: enable e2e test when push by @zhyncs in #828
- docs: add set up runner by @zhyncs in #829
- chore: bump v0.2.7 by @zhyncs in #830
- Add --max-total-tokens by @hnyls2002 in #840
- Fix List input bug by @yichuan520030910320 in #838
- Add req slots leaking check by @hnyls2002 in #842
- docs: update README.md by @eltociear in #843
- misc: update e2e test paths config by @zhyncs in #848
- chore: update flashinfer to v0.1.3 by @zhyncs in #850
- Fix llama for classification by @Ying1123 in #855
- Add troubleshooting doc by @Ying1123 in #856
- Fix #857 by @kaifronsdal in #858
- Add support for logprobs in OpenAI chat API by @yichuan520030910320 in #852
- Support chunked prefill when radix cache is disabled by @hnyls2002 in #811
- misc: update e2e test paths config by @zhyncs in #860
- Rename github workflows by @Ying1123 in #861
- misc: disable auto release by @zhyncs in #862
- misc: add cancel previous at e2e by @zhyncs in #864
- Add OpenAI backend to the CI test by @Ying1123 in #869
- Fix openai CI tests by @Ying1123 in #870
- misc: use pip cache purge and add unit test ci by @zhyncs in #871
- misc: update unit test config by @zhyncs in #873
- Fix unit tests for the frontend language part by @Ying1123 in #872
- bump to 0.2.8 by @Ying1123 in #877
- Make scripts under /test/srt as unit tests by @Ying1123 in #875
- Update runner docs by @hnyls2002 in #876
- Improve the coverage of the openai api server test by @Ying1123 in #878
- Implement served_model_name to customize model id when use local mode… by @dionren in #749
- Update runner docs by @hnyls2002 in #879
- Add more unit tests to CI by @Ying1123 in #880
- Add accuracy test to CI: MMLU by @Ying1123 in #882
- Update workflow name by @Ying1123 in #883
- Fix the double BOS problem in the HF chat template by @Ying1123 in #888
- Add benchmark: HumanEval by @Ying1123 in #889
- Increase openai client limit by @Ying1123 in #886
- Bump version to v0.2.9 by @Ying1123 in #890
New Contributors
- @bgyoon made their first contribution in #794
- @objnf-dev made their first contribution in #793
- @kaifronsdal made their first contribution in #858
- @dionren made their first contribution in #749
Full Changelog: v0.2.5...v0.2.9
Release v0.2.5
Highlights
- We recently released a blog post. Compared to TensorRT-LLM and vLLM, SGLang Runtime consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, using FP8 and FP16. SGLang consistently outperforms vLLM, achieving up to 3.1x higher throughput on Llama-70B, and it often matches or sometimes outperforms TensorRT-LLM.
- We have now automated the release processes for PyPI, Docker, and GitHub releases using GitHub workflows. Previously, because releases were not automated, GitHub tags were not updated in time, which led to a jump from v0.2.0 directly to v0.2.5.
- We welcome everyone to try https://github.com/sgl-project/sglang and to participate actively in the community through issues, PRs, and discussions. Cheers!
Release v0.2.0
Highlights
- We performed extensive engineering to improve the base performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, using FP8 and FP16. See the latest blog.
- New models: Llama3 405B, Deepseek MoE, InternLM, GPTBigCode, Mistral-Nemo
What's Changed
- Optimize mem indices management by @hnyls2002 in #619
- Unify index operations by @hnyls2002 in #620
- Simplify mem state by @wisclmy0611 in #623
- Improve tensor parallel performance by @Ying1123 in #625
- Bump version to 0.1.21 by @Ying1123 in #626
- Fix model forward grad by @hnyls2002 in #628
- Update docker file by @Ying1123 in #629
- Disable NCCL_NVLS by default by @Ying1123 in #631
- Add qwen2 tie word embedding by @yileld in #630
- Add support for VertexAI safety settings by @AidanCooper in #624
- Fix vertexai by @hnyls2002 in #633
- Reduce docker size by @hnyls2002 in #632
- clean up step function by @Ying1123 in #635
- feat: support internlm2 by @zhyncs in #636
- misc: add pre-commit config by @zhyncs in #637
- misc: add issue and pr template by @zhyncs in #638
- Flashinfer sample kernel by @hnyls2002 in #617
- Move global_server_args_dict by @hnyls2002 in #642
- Increase the capacity of the memory pool by @Ying1123 in #643
- feat: add check_env by @zhyncs in #645
- Remove the dependency of rpyc by @wisclmy0611 in #646
- misc: rm rpyc from PACKAGE_LIST by @zhyncs in #649
- fix: set ulimit -n 65535 by @zhyncs in #647
- feat: add lint workflow by @zhyncs in #648
- fix: resolve lint error by @zhyncs in #650
- Remove useless variables in infer_batch.py by @Ying1123 in #651
- Detokenize incrementally when streaming by @hnyls2002 in #653
- TokenizerManager.context_len should inherit from server_args.conte… by @shrirajh in #654
- Remove cached triton launcher by @merrymercy in #656
- perf: reduce ttft and itl with stream_interval 1 by @zhyncs in #658
- feat: add benchmark serving by @zhyncs in #657
- refactor model loader [unreachable code]: initial refactor by @Ying1123 in #655
- misc: update SGLang package description by @zhyncs in #659
- Update Readme by @Ying1123 in #660
- feat: update check env by @zhyncs in #661
- Improve docs by @Ying1123 in #662
- Add benchmark instructions by @Ying1123 in #663
- Fix jump forward when streaming by @hnyls2002 in #665
- Fix kill process util by @ispobock in #666
- Add support for OpenAI API parallel sampling by @yichuan520030910320 in #640
- Update OpenAI API by @wisclmy0611 in #667
- Temporary fix invalid sample results by @hnyls2002 in #668
- Support random dataset in bench_serving.py by @merrymercy in #669
- Revert "Temporary fix invalid sample results" by @hnyls2002 in #673
- refactor model loader: initial refactor by @Ying1123 in #664
- Fix cuda graph with flashinfer by @merrymercy in #675
- Tmp fix illegal sample by @hnyls2002 in #676
- Update version to 0.1.22 by @Ying1123 in #677
- Fallback when sampling failed by @ispobock in #678
- feat: support TRT LLM benchmark and multiple benchmarks by @zhyncs in #670
- Decouple kv by @hnyls2002 in #679
- Support gpt-bigcode model class by @hnyls2002 in #681
- support non-streaming benchmark by @merrymercy in #682
- Fix StreamExecutor.fork() losing the current role start index. by @max99x in #684
- feat: update bench serving by @zhyncs in #685
- misc: update output file logic by @zhyncs in #686
- Allow disabling streaming in bench by @merrymercy in #687
- docs: update README by @zhyncs in #688
- Support Deepseek MoE Model by @hnyls2002 in #689
- misc: recommend to use chat model for benchmark by @zhyncs in #690
- Support Mistral-Nemo by @ispobock in #691
- docs: update README by @zhyncs in #692
- fix: update bench serving by @zhyncs in #694
- misc: update output token logic by @zhyncs in #695
- Tune params by @Ying1123 in #696
- Fix trt benchmark by @Ying1123 in #697
- misc: fix typo by @zhyncs in #698
- Fix flashinfer by @Ying1123 in #700
- Fix hf config loading by @ispobock in #702
- Use min new token ratio at start by @hnyls2002 in #701
- feat: add e2e latency by @zhyncs in #704
- Update vllm version to support llama3.1 by @Ying1123 in #705
- bump version to 0.1.23 by @Ying1123 in #706
- Reduce hardcoded logic of kernel usage by @wisclmy0611 in #707
- Fix multi-node deadlock by @merrymercy in #709
- Auto adjust new ratio by @hnyls2002 in #708
- Fix prefill size by @Ying1123 in #711
- docs: update README by @zhyncs in #712
- docs: update doc by @zhyncs in #713
- fix: llama 3.1 405b fp8 by @zhyncs in #714
- misc: update doc by @zhyncs in #715
- Improve benchmark scripts by @Ying1123 in #717
- Bump version to 0.1.24 by @Ying1123 in #718
- docs: update supported models by @zhyncs in #719
- docs: update comment by @zhyncs in #721
- chore: add close inactive issues workflow by @zhyncs in #722
- misc: update build instruction by @zhyncs in #724
- fix: fp8 config by @Ying1123 in #723
- Fix dockerfile and triton cache manager by @hnyls2002 in #720
- chore: bump v0.1.25 by @zhyncs in #725
- fix: resolve the logo display issue on the PyPI page by @zhyncs in #726
- misc: update bug issue template by @zhyncs in #727
- Revert "fix: fp8 config" by @Ying1123 in #728
- Fix bugs (fp8 checkpoints, triton cache manager) by @Ying1123 in #729
- Bump version to 0.2.0 by @Ying1123 in #730
New Contributors
- @yileld made their first contribution in #630
- @AidanCooper made their first contribution in #624
- @zhyncs made their first contribution in #636
- @shrirajh made their first contribution in #654
- @yichuan520030910320 made their first contribution in https://github.com/...
Release v0.1.20
Highlights
- Enable CUDA graph by default. It brings a 1.5x to 2x speedup for small-batch decoding (#612)
- Model support: Gemma2, minicpm, Qwen2 MoE
- Docker support (#217)
- Various latency optimizations
What's Changed
- Add docker file by @Ying1123 in #588
- Add Gemma2 by @Ying1123 in #592
- Format by @Ying1123 in #593
- Fix Llava model by @wisclmy0611 in #594
- Add --enable-p2p-check option by @hnyls2002 in #599
- Fix streaming by @hnyls2002 in #600
- Reduce number of workspaces for flashinfer by @wisclmy0611 in #601
- add LogitsMetadata by @hnyls2002 in #604
- add minicpm support by @Titan-p in #602
- Make sglang compat with vllm 0.5.1 by @M0gician in #598
- Add Qwen2 MoE support by @M0gician in #603
- Update chat template for qwen and yi-1.5. by @for-just-we in #530
- [Feat] Expose logprob options to sgl.gen API by @huyiwen in #503
- Fix bench latency by @merrymercy in #607
- Code clean up: Remove deprecated prefill move InputMetadata to infer_batch.py by @merrymercy in #609
- Clean up the usage of flashinfer by @merrymercy in #610
- Cleanup attention backend: flashinfer and triton by @merrymercy in #611
- Enable cuda graph by default by @merrymercy in #612
- Improve benchmark scripts & fix llava by @merrymercy in #613
- Memorypool chunked prefetch by @hnyls2002 in #614
- Improve benchmark scripts by @merrymercy in #615
- Fix memory pool index error by @Ying1123 in #616
- Bump version to 0.1.20 by @merrymercy in #618
New Contributors
- @wisclmy0611 made their first contribution in #594
- @Titan-p made their first contribution in #586
- @M0gician made their first contribution in #598
- @for-just-we made their first contribution in #530
Full Changelog: v0.1.18...v0.1.20