v1.7.0 - Continuous batching feature supported.
Functionality
- Refactor framework to support continuous batching feature. vllm-xft, a fork of vLLM, integrates the xFasterTransformer backend and maintains compatibility with most of the official vLLM's features (see the serving sketch after this list).
- Remove FP32 data type option of KV Cache.
- Add get_env() python API to get recommended LD_PRELOAD set (see the usage sketch after this list).
- Add GPU build option for Intel Arc GPU series.
- Exposed the interface of the LLaMA model, including Attention and decoder.
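
As a rough illustration of the vllm-xft integration, here is a minimal offline-inference sketch. It assumes the fork keeps the stock vLLM Python API (LLM / SamplingParams); the model path, dtype value, and prompt are placeholders rather than values taken from this release.

```python
# Minimal sketch (assumption): vllm-xft preserves the official vLLM offline API,
# so inference is driven by the familiar LLM / SamplingParams objects while the
# xFasterTransformer backend handles continuous batching underneath.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/xft-converted-model", dtype="bfloat16")  # placeholder path
params = SamplingParams(max_tokens=64)

# generate() accepts a batch of prompts; requests are scheduled by the engine.
outputs = llm.generate(["Hello, xFasterTransformer!"], params)
for out in outputs:
    print(out.outputs[0].text)
```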
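
And a minimal sketch of the new get_env() helper, assuming it is exposed at the package top level and returns the recommended LD_PRELOAD setting as text:

```python
# Minimal sketch (assumption): get_env() is importable from the top-level package
# and returns the recommended LD_PRELOAD value as a string.
import xfastertransformer

print(xfastertransformer.get_env())
```

In practice the printed value would be exported into the shell environment before launching the model process.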
Performance
- Update xDNN to release v1.5.1.
- Baichuan series models support full FP16 pipeline to improve performance.
- More FP16 data type kernels added, including MHA, MLP, YARN rotary_embedding, rmsnorm and rope.
- Kernel implementation of crossAttnByHead.
Dependency
- Bump torch to 2.3.0.
BUG fix
- Fixed the segmentation fault error when running with more than 4 ranks.
- Fixed the bugs of core dump and hang when running cross nodes.
What's Changed
Generated release notes
- [Fix] add utf-8 encoding. by @marvin-Yu in #354
- [Benchmark] Calculate throughput using avg latency. by @Duyi-Wang in #360
- [GPU] Add GPU build option. by @changqi1 in #359
- Fix Qwen prompt.json by @JunxiChhen in #368
- [Model] Fix ICX build issue. by @changqi1 in #370
- [CMake] Remove evaluation under XFT_BUILD_TESTS option. by @Duyi-Wang in #374
- [Kernel][UT] Kernel impl. of crossAttnByHead and unit test for cross attention. by @pujiang2018 in #348
- [API] Add LLaMA attention API. by @changqi1 in #378
- [Finetune] Scripts for Llama2-7b lora finetune example using stock pytorch by @ustcuna in #327
- [Demo] Add abbreviation for output length. by @Duyi-Wang in #385
- [API] Add LLaMA decoder API. by @changqi1 in #386
- [API] Optimize API Impl. by @changqi1 in #396
- [Framework] Continuous Batching Support by @pujiang2018 in #357
- [KVCache] Remove FP32 data type. by @Duyi-Wang in #399
- [Interface] Change return shape of forward_cb. by @Duyi-Wang in #400
- [Example] Add demo of offline continuous batching by @pujiang2018 in #401
- [Layers] Add alibiSlopes Attn && Flash Attn for CB. by @abenmao in #402
- [Interface] Support List[int] and List[List[int]] for set_input_sb. by @Duyi-Wang in #404
- [Bug] fix incorrect input offset computing by @pujiang2018 in #405
- [Example] Fix incorrect tensor dimension with latest interface by @pujiang2018 in #406
- [Models/Layers/Kernels] Add Baichuan1/2 full-link bf16 support & Fix next-tok gen bug by @abenmao in #407
- [xDNN] Release v1.5.0. by @changqi1 in #410
- [Kernel] Add FP16 rmsnorm and rope kernels. by @changqi1 in #408
- [Kenrel] Add FP16 LLaMA YARN rotary_embedding. by @changqi1 in #412
- [Benchmark] Add platform options. Support real model. by @JunxiChhen in #409
- [Dependency] Update torch to 2.3.0. by @Duyi-Wang in #416
- [COMM] Fix bugs of core dump && hang when running cross nodes by @abenmao in #423
- [xDNN] Release v1.5.1. by @changqi1 in #422
- [Kernel] Add FP16 MHA and MLP kernels. by @changqi1 in #415
- [Python] Add get_env() to get LD_PRELOAD set. by @Duyi-Wang in #427
- Add --padding and fix bug by @yangkunx in #418
- [Layers] Fixed the seg fault error when running with more than 4 ranks by @abenmao in #424
- [Kernel] Less compute for Self-Attention (Q * K) by @pujiang2018 in #420
- [Dependency] Update libiomp5.so to 5.0.20230815 contained in mkl. by @Duyi-Wang in #430
- [Distribute] Add distribute support for continuous batching api. by @Duyi-Wang in #421
- [Layers] Fixed error in yarn by @abenmao in #429
- [README] Update readme. by @Duyi-Wang in #431
- [Dependency] Fix wrong so path returned in get_env(). by @Duyi-Wang in #432
- [Version] v1.7.0. by @Duyi-Wang in #433
New Contributors
Full Changelog: v1.6.0...v1.7.0