v1.7.0 - Continuous batching feature supported.
Functionality
- Refactor framework to support continuous batching feature. vllm-xft, a fork of vLLM, integrates the xFasterTransformer backend and maintains compatibility with most of the official vLLM's features (see the serving sketch after this list).
- Remove FP32 data type option of KV Cache.
- Add get_env() python API to get recommended LD_PRELOAD set (see the usage sketch after this list).
- Add GPU build option for Intel Arc GPU series.
- Exposed the interface of the LLaMA model, including Attention and decoder.
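
As a rough illustration of the vllm-xft integration, here is a minimal offline-inference sketch. It assumes the fork keeps the stock vLLM Python API (LLM / SamplingParams); the model path, dtype value, and prompt are placeholders rather than values taken from this release.

```python
# Minimal sketch (assumption): vllm-xft preserves the official vLLM offline API,
# so inference is driven by the familiar LLM / SamplingParams objects while the
# xFasterTransformer backend handles continuous batching underneath.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/xft-converted-model", dtype="bfloat16")  # placeholder path
params = SamplingParams(max_tokens=64)

# generate() accepts a batch of prompts; requests are scheduled by the engine.
outputs = llm.generate(["Hello, xFasterTransformer!"], params)
for out in outputs:
    print(out.outputs[0].text)
```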
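
And a minimal sketch of the new get_env() helper, assuming it is exposed at the package top level and returns the recommended LD_PRELOAD setting as text:

```python
# Minimal sketch (assumption): get_env() is importable from the top-level package
# and returns the recommended LD_PRELOAD value as a string.
import xfastertransformer

print(xfastertransformer.get_env())
```

In practice the printed value would be exported into the shell environment before launching the model process.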
Performance
- Update xDNN to release v1.5.1.
- Baichuan series models support full FP16 pipeline to improve performance.
- More FP16 data type kernels added, including MHA, MLP, YARN rotary_embedding, rmsnorm and rope.
- Kernel implementation of crossAttnByHead.
Dependency
- Bump torch to 2.3.0.
BUG fix
- Fixed the segmentation fault error when running with more than 4 ranks.
- Fixed the bugs of core dump and hang when running cross nodes.
What's Changed
Generated release notes
- [Fix] add utf-8 encoding. by @marvin-Yu in #354
- [Benchmark] Calculate throughput using avg latency. by @Duyi-Wang in #360
- [GPU] Add GPU build option. by @changqi1 in #359
- Fix Qwen prompt.json by @JunxiChhen in #368
- [Model] Fix ICX build issue. by @changqi1 in #370
- [CMake] Remove evaluation under XFT_BUILD_TESTS option. by @Duyi-Wang in #374
- [Kernel][UT] Kernel impl. of crossAttnByHead and unit test for cross attention. by @pujiang2018 in #348
- [API] Add LLaMA attention API. by @changqi1 in #378
- [Finetune] Scripts for Llama2-7b lora finetune example using stock pytorch by @ustcuna in #327
- [Demo] Add abbreviation for output length. by @Duyi-Wang in #385
- [API] Add LLaMA decoder API. by @changqi1 in #386
- [API] Optimize API Impl. by @changqi1 in #396
- [Framework] Continuous Batching Support by @pujiang2018 in #357
- [KVCache] Remove FP32 data type. by @Duyi-Wang in #399
- [Interface] Change return shape of forward_cb. by @Duyi-Wang in #400
- [Example] Add demo of offline continuous batching by @pujiang2018 in #401
- [Layers] Add alibiSlopes Attn && Flash Attn for CB. by @abenmao in #402
- [Interface] Support List[int] and List[List[int]] for set_input_sb. by @Duyi-Wang in #404
- [Bug] fix incorrect input offset computing by @pujiang2018 in #405
- [Example] Fix incorrect tensor dimension with latest interface by @pujiang2018 in #406
- [Models/Layers/Kernels] Add Baichuan1/2 full-link bf16 support & Fix next-tok gen bug by @abenmao in #407
- [xDNN] Release v1.5.0. by @changqi1 in #410
- [Kernel] Add FP16 rmsnorm and rope kernels. by @changqi1 in #408
- [Kenrel] Add FP16 LLaMA YARN rotary_embedding. by @changqi1 in #412
- [Benchmark] Add platform options. Support real model. by @JunxiChhen in #409
- [Dependency] Update torch to 2.3.0. by @Duyi-Wang in #416
- [COMM] Fix bugs of core dump && hang when running cross nodes by @abenmao in #423
- [xDNN] Release v1.5.1. by @changqi1 in #422
- [Kernel] Add FP16 MHA and MLP kernels. by @changqi1 in #415
- [Python] Add get_env() to get LD_PRELOAD set. by @Duyi-Wang in #427
- Add --padding and fix bug by @yangkunx in #418
- [Layers] Fixed the seg fault error when running with more than 4 ranks by @abenmao in #424
- [Kernel] Less compute for Self-Attention (Q * K) by @pujiang2018 in #420
- [Dependency] Update libiomp5.so to 5.0.20230815 contained in mkl. by @Duyi-Wang in #430
- [Distribute] Add distribute support for continuous batching api. by @Duyi-Wang in #421
- [Layers] Fixed error in yarn by @abenmao in #429
- [README] Update readme. by @Duyi-Wang in #431
- [Dependency] Fix wrong so path returned in get_env(). by @Duyi-Wang in #432
- [Version] v1.7.0. by @Duyi-Wang in #433
New Contributors
Full Changelog: v1.6.0...v1.7.0