10 Oct 08:17

Duyi-Wang

v1.8.2

b43edc8

v1.8.2

Performance

Enable flash attention by default for the W8A8 data type to speed up first-token generation.

Benchmark

When the number of ranks is 1, run in single mode to avoid the dependency on mpirun.
Support SNC-3 platform.
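
The single-rank rule above can be sketched as a small launcher helper. This is only an illustration: build_launch_cmd and the benchmark.py script name are hypothetical and do not come from the benchmark scripts; the only real command assumed is the standard `mpirun -n`.

```python
from typing import List


def build_launch_cmd(script_args: List[str], num_ranks: int) -> List[str]:
    """Return the command line used to start a benchmark run.

    Hypothetical helper: with one rank the script is started directly, so
    mpirun does not have to be installed; with more ranks the same command
    is wrapped in `mpirun -n <num_ranks>`.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    if num_ranks == 1:
        return list(script_args)
    return ["mpirun", "-n", str(num_ranks), *script_args]


single = build_launch_cmd(["python", "benchmark.py"], 1)
multi = build_launch_cmd(["python", "benchmark.py"], 2)
```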

31 Jul 08:08

Duyi-Wang

v1.8.1

df57cb2

v1.8.1

Functionality

Expose the interface of embedding lookup.

Performance

Optimized the performance of grouped query attention (GQA).
Enhanced the performance of creating keys for the oneDNN primitive cache.
Set the [bs][nh][seq][hs] layout as the default for KV Cache, resulting in better performance.
Improved the task split imbalance issue in self-attention.
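
To make the [bs][nh][seq][hs] layout concrete, here is a minimal sketch of how a flat offset into such a buffer could be computed. It assumes the usual reading of the abbreviations (batch size, number of heads, sequence length, head size) and a row-major buffer; kv_offset is a hypothetical name, not a function from the library.

```python
def kv_offset(b: int, h: int, s: int, d: int,
              num_heads: int, max_seq: int, head_size: int) -> int:
    """Flat element offset for a row-major [bs][nh][seq][hs] buffer.

    Assumption for illustration only: for a fixed batch entry and head,
    all sequence positions are stored back to back, so stepping to the
    next token of the same head moves by exactly head_size elements.
    """
    return ((b * num_heads + h) * max_seq + s) * head_size + d


# Stride between two consecutive tokens of the same head.
token_stride = (kv_offset(0, 0, 1, 0, 4, 16, 8)
                - kv_offset(0, 0, 0, 0, 4, 16, 8))
```

With this ordering, one head's keys for a whole sequence form a single contiguous run, which is a plausible reason for the layout to be cache-friendly in attention.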

23 Jul 01:25

Duyi-Wang

v1.8.0

faa25f4

v1.8.0 Continuous Batching on Single ARC GPU and AMX_FP16 Support.

Highlight

Continuous batching on a single ARC GPU is supported and can be integrated via vllm-xft.
Introduce Intel AMX instruction support for the float16 data type.

Models

Support ChatGLM4 series models.
Introduce BF16/FP16 full path support for Qwen series models.

BUG fix

Fixed memory leak of oneDNN primitive cache.
Fixed SPR-HBM flat QUAD mode detection issue in benchmark scripts.
Fixed head split error for distributed grouped-query attention (GQA).
Fixed an issue with the invokeAttentionLLaMA API.
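
For readers unfamiliar with what a head split in distributed GQA involves, the following is a minimal sketch, not the library's implementation. It assumes that KV heads divide evenly across ranks and that each rank takes the query heads belonging to its KV heads; split_gqa_heads is a hypothetical name.

```python
from typing import Tuple


def split_gqa_heads(num_q_heads: int, num_kv_heads: int,
                    world_size: int, rank: int) -> Tuple[range, range]:
    """Return (query_heads, kv_heads) owned by one rank.

    Illustration only: every query head must land on the same rank as the
    KV head it attends with, so the split is driven by the KV heads.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    if num_kv_heads % world_size != 0:
        raise ValueError("this sketch only handles evenly divisible KV heads")
    group = num_q_heads // num_kv_heads        # query heads per KV head
    kv_per_rank = num_kv_heads // world_size
    kv_start = rank * kv_per_rank
    kv_heads = range(kv_start, kv_start + kv_per_rank)
    q_heads = range(kv_start * group, (kv_start + kv_per_rank) * group)
    return q_heads, kv_heads


q_heads, kv_heads = split_gqa_heads(32, 8, world_size=2, rank=1)
```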

What's Changed

[Kernel] Enable continuous batching on single GPU. by @changqi1 in #452
[Bugfix] fixed shm reduceAdd & rope error when batch size is large by @abenmao in #457
[Feature] Enable AMX FP16 on next generation CPU by @wenhuanh in #456
[Kernel] Cache oneDNN primitive when M < XFT_PRIMITIVE_CACHE_M, default 256. by @Duyi-Wang in #460
[Dependency] Pin python requirements.txt version. by @Duyi-Wang in #458
[Dependency] Bump web_demo requirement. by @Duyi-Wang in #463
[Layers] Enable AMX FP16 of FlashAttn by @abenmao in #459
[Layers] Fix invokeAttentionLLaMA API by @wenhuanh in #464
[Readme] Add accepted papers by @wenhuanh in #465
[Kernel] Make SelfAttention prepared for AMX_FP16; More balanced task split in Cross Attention by @pujiang2018 in #466
[Kernel] Upgrade xDNN to v1.5.2 and make AMX_FP16 work by @pujiang2018 in #468
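
The caching rule from #460 (cache a primitive only when M < XFT_PRIMITIVE_CACHE_M, default 256) can be sketched as below. Only the environment variable name and its default come from the entry above; the PrimitiveCache class, the (M, N, K) cache key, and the build callback are assumptions made for illustration.

```python
import os
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[int, int, int]


class PrimitiveCache:
    """Toy cache that only keeps entries whose M is below a threshold."""

    def __init__(self, build: Callable[[Key], object],
                 threshold: Optional[int] = None) -> None:
        if threshold is None:
            # Variable name and default (256) are taken from the release note.
            threshold = int(os.environ.get("XFT_PRIMITIVE_CACHE_M", "256"))
        self._build = build
        self._threshold = threshold
        self._cache: Dict[Key, object] = {}

    def get(self, m: int, n: int, k: int) -> object:
        key = (m, n, k)
        if m >= self._threshold:
            # Large M: build on every call and do not store the result.
            return self._build(key)
        if key not in self._cache:
            self._cache[key] = self._build(key)
        return self._cache[key]

    def __len__(self) -> int:
        return len(self._cache)


build_calls = []
cache = PrimitiveCache(lambda key: build_calls.append(key) or key,
                       threshold=256)
cache.get(8, 64, 64)
cache.get(8, 64, 64)      # served from the cache, no second build
cache.get(512, 64, 64)
cache.get(512, 64, 64)    # M >= threshold, built again
```

Bounding the cache by M keeps the number of distinct cached shapes small, since large M values (long prompts) vary widely while small ones repeat during decoding.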

Full Changelog: v1.7.3...v1.8.0

Contributors

abenmao, changqi1, and 3 other contributors

01 Jul 01:52

Duyi-Wang

v1.7.3

d01f0fd

v1.7.3

BUG fix

Fixed SHM reduceAdd & rope error when batch size is large.
Fixed the issue of abnormal usage of oneDNN primitive cache.

18 Jun 05:07

Duyi-Wang

v1.7.2

da2a7fa

v1.7.2 - Continuous batching feature supports Qwen 1.0 & hybrid data types.

Functionality

Add continuous batching support for Qwen 1.0 models.
Enable hybrid data types for the continuous batching feature, including BF16_FP16, BF16_INT8, BF16_W8A8, BF16_INT4, BF16_NF4, W8A8_INT8, W8A8_INT4, W8A8_NF4.
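
As a reading aid for the hybrid names above, the sketch below splits a name into its two parts. It assumes that the part before the underscore is the data type used for the first token and the part after it is used for the following tokens; the release note itself does not spell this out, and parse_hybrid_dtype is a hypothetical helper.

```python
from typing import NamedTuple


class HybridDtype(NamedTuple):
    first_token: str   # assumed: data type used for the first (prefill) token
    next_token: str    # assumed: data type used for subsequent tokens


def parse_hybrid_dtype(name: str) -> HybridDtype:
    """Split a hybrid name such as 'BF16_INT8' into its two data types."""
    first, sep, rest = name.partition("_")
    if not sep or not first or not rest:
        raise ValueError(f"not a hybrid data type name: {name!r}")
    return HybridDtype(first.lower(), rest.lower())


pairs = [parse_hybrid_dtype(n) for n in ("BF16_FP16", "BF16_W8A8", "W8A8_NF4")]
```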

BUG fix

Fixed the conversion fault in Baichuan1 models.

What's Changed

[Doc] Add vllm benchmark docs. by @marvin-Yu in #448
[Kernel] Add GPU kernels and enable LLaMA model. by @changqi1 in #372
[Tools] Add Baichuan1/2 convert tool by @abenmao in #451
[Layers] Add qwenRope support for Qwen1.0 in CB mode by @abenmao in #449
[Framework] Remove duplicated code by @xiangzez in #450
[Model] Support hybrid model in continuous batching. by @Duyi-Wang in #453
[Version] v1.7.2. by @Duyi-Wang in #454

Full Changelog: v1.7.1...v1.7.2

Contributors

marvin-Yu, abenmao, and 3 other contributors

12 Jun 05:27

Duyi-Wang

v1.7.1

38658b1

v1.7.1 - Continuous batching feature supports ChatGLM2/3.

Functionality

Add continuous batching support for ChatGLM2/3 models.
Qwen2Convert supports Qwen2 models quantized by GPTQ, such as GPTQ-Int8 and GPTQ-Int4, via the parameter from_quantized_model="gptq".

BUG fix

Fixed the segmentation fault when running with more than 2 ranks in vllm-xft serving.

What's Changed

[README] Update README.md. by @Duyi-Wang in #434
[README] Update README.md. by @Duyi-Wang in #435
[Common] Add INT8/UINT4 to BF16 weight convert by @xiangzez in #436
Add Continue Batching support for Chatglm2/3 by @a3213105 in #438
[Model] Add Qwen2 GPTQ model support by @xiangzez in #439
[Model] Fix array out of bounds when rank > 2. by @Duyi-Wang in #441
Bump gradio from 4.19.2 to 4.36.0 in /examples/web_demo by @dependabot in #442
[Version] v1.7.1. by @Duyi-Wang in #445

Full Changelog: v1.7.0...v1.7.1

Contributors

a3213105, dependabot, and 2 other contributors

05 Jun 05:13

Duyi-Wang

v1.7.0

76ddad7

v1.7.0 - Continuous batching feature supported.

Functionality

Refactor the framework to support continuous batching. vllm-xft, a fork of vLLM, integrates the xFasterTransformer backend and maintains compatibility with most of the official vLLM features.
Remove the FP32 data type option for KV Cache.
Add the get_env() Python API to get the recommended LD_PRELOAD setting.
Add a GPU build option for the Intel Arc GPU series.
Expose the interface of the LLaMA model, including Attention and decoder.
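
For context on what continuous batching means, here is a toy simulation of the scheduling idea. It is a generic sketch, not the xFasterTransformer or vllm-xft scheduler: run_continuous_batching is a hypothetical function, and each request is reduced to an id plus the number of tokens it still has to generate.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def run_continuous_batching(requests: List[Tuple[str, int]],
                            max_batch: int) -> List[List[str]]:
    """Return the request ids that share the batch in each decode step.

    A finished request frees its slot immediately, and a waiting request
    is admitted before the next step instead of waiting for the whole
    batch to drain.
    """
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    if any(tokens < 1 for _, tokens in requests):
        raise ValueError("every request must generate at least one token")

    waiting: Deque[Tuple[str, int]] = deque(requests)
    active: Dict[str, int] = {}
    schedule: List[List[str]] = []
    while waiting or active:
        # Fill free slots with waiting requests.
        while waiting and len(active) < max_batch:
            request_id, tokens = waiting.popleft()
            active[request_id] = tokens
        schedule.append(list(active))
        # One decode step: every active request produces one token.
        for request_id in list(active):
            active[request_id] -= 1
            if active[request_id] == 0:
                del active[request_id]
    return schedule


schedule = run_continuous_batching([("a", 1), ("b", 3), ("c", 2)], max_batch=2)
```

In this run request "c" joins as soon as "a" finishes, while "b" is still generating, so the three requests complete in three steps rather than the five that a drain-then-refill batch would need.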

Performance

Update xDNN to release v1.5.1.
Baichuan series models support a full FP16 pipeline to improve performance.
Added more FP16 kernels, including MHA, MLP, YARN rotary_embedding, rmsnorm and rope.
Added a kernel implementation of crossAttnByHead.

Dependency

Bump torch to 2.3.0.

BUG fix

Fixed the segmentation fault when running with more than 4 ranks.
Fixed core dump and hang bugs when running across nodes.

What's Changed

[Fix] add utf-8 encoding. by @marvin-Yu in #354
[Benchmark] Calculate throughput using avg latency. by @Duyi-Wang in #360
[GPU] Add GPU build option. by @changqi1 in #359
Fix Qwen prompt.json by @JunxiChhen in #368
[Model] Fix ICX build issue. by @changqi1 in #370
[CMake] Remove evaluation under XFT_BUILD_TESTS option. by @Duyi-Wang in #374
[Kernel][UT] Kernel impl. of crossAttnByHead and unit test for cross attention. by @pujiang2018 in #348
[API] Add LLaMA attention API. by @changqi1 in #378
[Finetune] Scripts for Llama2-7b lora finetune example using stock pytorch by @ustcuna in #327
[Demo] Add abbreviation for output length. by @Duyi-Wang in #385
[API] Add LLaMA decoder API. by @changqi1 in #386
[API] Optimize API Impl. by @changqi1 in #396
[Framework] Continuous Batching Support by @pujiang2018 in #357
[KVCache] Remove FP32 data type. by @Duyi-Wang in #399
[Interface] Change return shape of forward_cb. by @Duyi-Wang in #400
[Example] Add demo of offline continuous batching by @pujiang2018 in #401
[Layers] Add alibiSlopes Attn && Flash Attn for CB. by @abenmao in #402
[Interface] Support List[int] and List[List[int]] for set_input_sb. by @Duyi-Wang in #404
[Bug] fix incorrect input offset computing by @pujiang2018 in #405
[Example] Fix incorrect tensor dimension with latest interface by @pujiang2018 in #406
[Models/Layers/Kernels] Add Baichuan1/2 full-link bf16 support & Fix next-tok gen bug by @abenmao in #407
[xDNN] Release v1.5.0. by @changqi1 in #410
[Kernel] Add FP16 rmsnorm and rope kernels. by @changqi1 in #408
[Kernel] Add FP16 LLaMA YARN rotary_embedding. by @changqi1 in #412
[Benchmark] Add platform options. Support real model. by @JunxiChhen in #409
[Dependency] Update torch to 2.3.0. by @Duyi-Wang in #416
[COMM] Fix bugs of core dump && hang when running cross nodes by @abenmao in #423
[xDNN] Release v1.5.1. by @changqi1 in #422
[Kernel] Add FP16 MHA and MLP kernels. by @changqi1 in #415
[Python] Add get_env() to get LD_PRELOAD set. by @Duyi-Wang in #427
Add --padding and fix bug by @yangkunx in #418
[Layers] Fixed the seg fault error when running with more than 4 ranks by @abenmao in #424
[Kernel] Less compute for Self-Attention (Q * K) by @pujiang2018 in #420
[Dependency] Update libiomp5.so to 5.0.20230815 contained in mkl. by @Duyi-Wang in #430
[Distribute] Add distribute support for continuous batching api. by @Duyi-Wang in #421
[Layers] Fixed error in yarn by @abenmao in #429
[README] Update readme. by @Duyi-Wang in #431
[Dependency] Fix wrong so path returned in get_env(). by @Duyi-Wang in #432
[Version] v1.7.0. by @Duyi-Wang in #433

New Contributors

@ustcuna made their first contribution in #327
@yangkunx made their first contribution in #418

Full Changelog: v1.6.0...v1.7.0

Contributors

marvin-Yu, abenmao, and 6 other contributors

26 Apr 07:48

Duyi-Wang

v1.6.0

f9cdcba

v1.6.0 - Llama3 and Qwen2 series models supported.

Functionality

Support Llama3 and Qwen2 series models.
Add the INT8 KV cache data type; use the kv_cache_dtype parameter to select int8, fp16 (default) or fp32.
More models enable the full BF16 pipeline, including ChatGLM2/3 and YaRN-Llama.
Add invokeMLPLLaMA FP16 API.
Support logits output using the forward() API.
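
To show why an INT8 KV cache saves memory at a small accuracy cost, here is a minimal sketch of symmetric per-row quantization. It is a generic technique chosen for illustration; the release note does not say which scheme or scale granularity the library uses, and quantize_row / dequantize_row are hypothetical names.

```python
from typing import List, Tuple


def quantize_row(row: List[float]) -> Tuple[List[int], float]:
    """Quantize one row to int8 range with a single symmetric scale."""
    max_abs = max((abs(x) for x in row), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(x / scale))) for x in row]
    return quantized, scale


def dequantize_row(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from int8 codes and their scale."""
    return [q * scale for q in quantized]


row = [0.5, -1.0, 0.25]
codes, scale = quantize_row(row)
restored = dequantize_row(codes, scale)
```

Each value is stored in one byte plus a shared scale per row, and the round-trip error per element is bounded by half a quantization step.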

Dependency

Bump transformers to 4.40.0 to support Llama3 models.

Performance

Update xDNN to release v1.4.6

BUG fix

Fix numeric overflow when calculating softmax in sampling.
Fix assert bug when concatenating gate and up.
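
The softmax overflow mentioned above is a common pitfall: evaluating exp(x) directly overflows for large logits. The release note does not describe the actual fix, so the following is only the standard max-subtraction technique, with stable_softmax as a hypothetical name.

```python
import math
from typing import List


def stable_softmax(logits: List[float]) -> List[float]:
    """Softmax that subtracts the maximum logit before exponentiating.

    Subtracting a constant from every logit does not change the result,
    but it keeps every exponent at or below zero, so exp() cannot overflow.
    """
    if not logits:
        return []
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = stable_softmax([1000.0, 1001.0, 1002.0])
```

A naive version would call math.exp(1000.0), which raises OverflowError in Python, while the shifted version only evaluates exp(-2), exp(-1) and exp(0).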

What's Changed

[Model] Expose KV cache data type in Llama model. by @pujiang2018 in #313
[API] Format rotary_embedding api. by @changqi1 in #303
[Kernel] Add kernel support for INT8 KV cache. by @pujiang2018 in #314
[Convert] Fix Qwen convert issue. by @marvin-Yu in #315
[API] Add invokeMLPLLaMA FP16 API. by @changqi1 in #302
[Build] Fix build issue. by @changqi1 in #316
Chatglm2/3 bf16 pipeline support by @a3213105 in #301
[README] Add README_CN.md. by @Duyi-Wang in #317
[Kernel] Bug fix for small_gemm_transb by @pujiang2018 in #318
[Eval] Get logits output. by @marvin-Yu in #319
[CMake] Add oneccl build depends for comm_helper. by @Duyi-Wang in #322
[Layers] fix assert bug when concat gate&up by @abenmao in #323
[Sample] Fix numeric overflow when calculate softmax. by @Duyi-Wang in #326
[Models] Use factory class to create decoder. by @Duyi-Wang in #321
[README] Update readme for the dependent lib. by @xwang98 in #331
[KVCache] INT8 KV cache implementation and related changes by @pujiang2018 in #320
[Model] Add Qwen2 model. by @marvin-Yu in #330
[KVCache] Add interface and register for kvcache. by @Duyi-Wang in #336
[Demo] Add kvcache type option in web demo. by @Duyi-Wang in #338
[Benchmark] Add KVCache data type option. by @Duyi-Wang in #337
[model] Add llama3 model. by @marvin-Yu in #340
[Kernel] Add 'acc' param in small_gemm, add lacked and remove unused small_gemm kernels. by @pujiang2018 in #346
[xDNN] Release v1.4.6. by @changqi1 in #342
[Evaluation] fix the model register bug in evaluation by @abenmao in #347
[Models] YaRN-Llama full-link bf16 support by @abenmao in #344
[UT] Remove beam search test temporarily. by @Duyi-Wang in #349
[Version] v1.6.0. by @Duyi-Wang in #352

New Contributors

@xwang98 made their first contribution in #331

Full Changelog: v1.5.0...v1.6.0

Contributors

a3213105, marvin-Yu, and 5 other contributors

12 Apr 06:10

Duyi-Wang

v1.5.0

62994fa

v1.5.0 - Gemma series models supported.

Functionality

Support Gemma series models, including Gemma and CodeGemma, and the DeepSeek model.
The Llama converter can convert a quantized Hugging Face model into xFT-format INT8/INT4 model files via the parameter from_quantized_model='gptq'.
Support loading INT4 data weights directly from local files.
Optimize memory usage during QWen model conversion, particularly for QWen 72B.

Dependency

Bump transformers to 4.38.1 to support Gemma models.
Add protobuf to support new behavior in the tokenizer.

Performance

Update xDNN to release v1.4.5
Add GPU kernel library gpuDNN v0.1 to support Intel Arc GPU series.
Optimize ROPE performance by reducing repeated sin and cos embedding table data.
Accelerate KVCache copy by increasing parallelism in self-attention.
Accelerate the addreduce operation in the long-sequence case by transposing KVCache and tuning comm.
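
One way to picture the ROPE table reduction: in the rotate-half formulation the cos/sin values for the two halves of a head are identical, so a table of width head_size / 2 is sufficient. The sketch below illustrates that idea only; the note does not state what exactly was de-duplicated, and build_rope_table / apply_rope are hypothetical names.

```python
import math
from typing import List, Tuple

Table = List[List[float]]


def build_rope_table(max_pos: int, head_size: int,
                     base: float = 10000.0) -> Tuple[Table, Table]:
    """Build cos/sin tables of width head_size // 2 (no duplicated half)."""
    half = head_size // 2
    inv_freq = [base ** (-2.0 * i / head_size) for i in range(half)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in range(max_pos)]
    sin = [[math.sin(p * f) for f in inv_freq] for p in range(max_pos)]
    return cos, sin


def apply_rope(x: List[float], pos: int, cos: Table, sin: Table) -> List[float]:
    """Rotate one head vector; element i is paired with element i + half."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        c, s = cos[pos][i], sin[pos][i]
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out


cos_table, sin_table = build_rope_table(max_pos=4, head_size=8)
vector = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rotated = apply_rope(vector, 3, cos_table, sin_table)
```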

BUG fix

Fix an incorrect computation that was performed in integer but should have been in float.
Fix disordered timeline.
Fix runtime issue of Qwen when seq_length is bigger than 32768.

What's Changed

[Kernel] Fix the incorrect computing which should be in float, but was in integer by @pujiang2018 in #267
[Layer] Reduce repeated sin and cos embedding table data to optimize ROPE perf. by @changqi1 in #266
[Kernel] increase parallelism for KV cache copy in self attention by @pujiang2018 in #268
[Include] Fix include not work. by @Duyi-Wang in #271
Issue qwen72b seq length by @a3213105 in #273
[Common] Unify memory allocation into xft::alloc by @pujiang2018 in #272
[Timeline] Fix disordered timeline. by @changqi1 in #277
[model] Add deepseek model. by @marvin-Yu in #274
[Bug] Fix incorrect context parameter order. by @changqi1 in #280
[CI] Check for UT status. by @marvin-Yu in #278
[CMake] Check existence of MKL & oneDNN directory before installation. by @Duyi-Wang in #283
Add KVCache trans for long sequence && tuned comm for faster Addreduce by @abenmao in #279
[Dependency] Add protobuf in requirements.txt by @Duyi-Wang in #284
[xDNN] Release v1.4.5. by @changqi1 in #285
[CI] Add rls test case. by @marvin-Yu in #286
[Bug] fix baichuan model test issue. by @marvin-Yu in #287
[Fix] Fix baichuan2-13 without rope. by @marvin-Yu in #289
[Tools] Add convert tool for Llama models quantized by AutoGPTQ by @xiangzez in #276
[Common] Support loading int4 weights by @xiangzez in #275
[KVCache] KV Cache refactor and related unit test case fix by @pujiang2018 in #290
[Model] Update isMaster func. by @changqi1 in #292
[Bug] Fix oneDNN GPU build issue. by @changqi1 in #293
[UT] add unit test for selfAttention, and a small fix by @pujiang2018 in #294
[gpuDNN] Add gpuDNN v0.1.0 library files. by @feng-intel in #291
[UT] MLP unit test case fix by @abenmao in #296
[Fix] Reduce convert memory usage. by @marvin-Yu in #297
[ENV] Use Meyers' Singleton Env object. by @Duyi-Wang in #295
[fix] fix compile issue. by @marvin-Yu in #299
[Example] Add gemma model config and web demo. by @marvin-Yu in #304
[Model] Add gemma model support. by @marvin-Yu in #259
[example] add gemma model support with example. by @marvin-Yu in #307
Bump transformers from 4.36.0 to 4.38.0 in /examples/web_demo by @dependabot in #308
Fix timeline compile issue by @xiangzez in #309
[Build] Fix build issues. by @changqi1 in #310
[Version] v1.5.0. by @Duyi-Wang in #311

New Contributors

@feng-intel made their first contribution in #291

Full Changelog: v1.4.0...v1.5.0

Contributors

a3213105, marvin-Yu, and 7 other contributors

08 Mar 05:55

Duyi-Wang

v1.4.0

7587560

v1.4.0 - Fully BF16 support in Llama for better performance and serving framework support.

Functionality

Introduce pure BF16 support for Llama series models; the full BF16 data type can now be used to utilize AMX more effectively when deploying Llama models.
Add MLServer serving framework support and a demo in the serving directory.
GCC for compiling release binary files has been updated from GCC 8.5 to GCC 12.
Introduce the pipeline parallel feature for distributed deployment. Enable it with cmake .. -DWITH_PIPELINE_PARALLEL=ON at compile time and use the XFT_PIPELINE_STAGE macro to define the number of pipeline parallel stages.
Deprecate the convert tool scripts in the tools directory; using Convert from the xfastertransformer Python wheel is recommended instead.
Support loading int8 data weights directly from local files.
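
As background for the pipeline parallel feature, the sketch below shows one simple way decoder layers could be divided among stages. How the library actually assigns layers is not stated in these notes, so the contiguous, near-even split and the stage_layers name are assumptions for illustration.

```python
def stage_layers(num_layers: int, num_stages: int, stage: int) -> range:
    """Return the contiguous block of layer indices owned by one stage.

    Layers are split as evenly as possible; when the division is not
    exact, the earliest stages each take one extra layer.
    """
    if not 0 <= stage < num_stages:
        raise ValueError("stage must be in [0, num_stages)")
    base, extra = divmod(num_layers, num_stages)
    start = stage * base + min(stage, extra)
    end = start + base + (1 if stage < extra else 0)
    return range(start, end)


two_stage = [stage_layers(32, 2, s) for s in range(2)]
uneven = [stage_layers(10, 3, s) for s in range(3)]
```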

Performance

Update xDNN to release v1.4.4.
Accelerate model weight loading by optimizing the cast operation after loading, gaining up to a 50% speedup.
Optimize BF16 performance using AMX instructions when batch size <= 8, and add XFT_USE_AMX_M to set the threshold of M at which AMX is used instead of AVX512 (default 1).

Demo & Benchmark

Update the transformers dependency requirement from 4.30.0 to 4.36.0 to address high-risk CVE vulnerabilities.
Add a distributed inference benchmark script that supports deployment across platforms.
Add single node platform support in benchmark script.
Add Yi model web demo.
Enhance the command-line chat mode in pytorch demo.py, using --chat true to enable.

BUG fix

Fix calculation issue in Qwen models and enhance LogN support for long token sequence.
Fix unsync issue in multi-rank model when do_sample is enabled.
Fix Baichuan models calculation and convert issue.
Fix repetition penalties not taking effect on other batches.
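
The repetition penalty fix concerns applying the penalty to every sequence in a batch, not only the first. The sketch below uses the widely used divide-or-multiply formulation; whether the library uses exactly this formula is an assumption, and apply_repetition_penalty is a hypothetical name.

```python
from typing import List


def apply_repetition_penalty(logits: List[List[float]],
                             generated_ids: List[List[int]],
                             penalty: float) -> List[List[float]]:
    """Penalize already generated tokens, row by row, for the whole batch.

    Positive logits are divided by the penalty and negative ones are
    multiplied by it, so a penalty above 1 always lowers the score.
    """
    result = [list(row) for row in logits]
    for row, seen in zip(result, generated_ids):
        for token in set(seen):
            if row[token] > 0:
                row[token] /= penalty
            else:
                row[token] *= penalty
    return result


batch_logits = [[2.0, -1.0, 0.5], [1.0, 4.0, -2.0]]
batch_history = [[0], [1, 2]]
penalized = apply_repetition_penalty(batch_logits, batch_history, penalty=2.0)
```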

What's Changed

[Demo] Update web demo to adapt gradio 4.11.0. by @Duyi-Wang in #201
Bump gradio from 3.40.1 to 4.11.0 in /examples/web_demo by @dependabot in #150
[demo] Add Yi model demo. by @marvin-Yu in #200
[Dependency] transformers version warning when error occurs. by @Duyi-Wang in #202
[Tools] Deprecate convert tools in tools dir. by @Duyi-Wang in #203
[benchmark] Add one node benchmark. by @marvin-Yu in #205
Bump transformers from 4.30.0 to 4.36.0 by @dependabot in #145
Bump transformers from 4.30.0 to 4.36.0 in /examples/web_demo by @dependabot in #144
[CMake] Check if the compiler really supports avx512bf16 with try_compile by @pujiang2018 in #206
[Layer] Fine grained data type definition for Attention and MLP by @pujiang2018 in #194
Add recommend GCC version by @a3213105 in #207
[TP] Make split dimension align with oneDNN packing by @pujiang2018 in #208
Support loading int8 weights by @xiangzez in #157
[benchmark] Add distributed benchmark. by @marvin-Yu in #211
[ci] Fix python path issue. by @marvin-Yu in #214
[Fix] Fix repetition penalties not taking effect on other batches. by @Duyi-Wang in #212
[xDNN] Release v1.4.3. by @changqi1 in #213
[ci] Add workflow permission. by @marvin-Yu in #218
[Layer] Enable pipeline parallel feature. by @changqi1 in #221
[Dockerfile] Remove dockerfile. by @Duyi-Wang in #219
[CI] Align using benchmark tests. by @marvin-Yu in #216
[xDNN] Release v1.4.4. by @changqi1 in #223
[Layer] Support pure full-link BF16 LLaMa model. by @pujiang2018 in #222
[Layers] Qwen LogN for query by @a3213105 in #215
[Layer] Convert static MMHelper class to instance Class in DecoderContext. by @changqi1 in #225
[models][layers/tools] Refine and bugfix for baichuan models by @abenmao in #226
[Serving] Add MLServer serving support. by @Duyi-Wang in #217
[Dependencies] Remove tokenizers requirement. by @Duyi-Wang in #227
[kernel] Add ICX compiler. by @changqi1 in #228
[Env] Add XFT_ENGINE env variable. by @changqi1 in #231
[CMake] Open the pip-install information for MKL. by @marvin-Yu in #234
[Fix] Add parameter check for logN and NTK rotary embedding of QWEN by @a3213105 in #232
[CMake] Remove force reinstall for mkl dependencies. by @Duyi-Wang in #237
[Example] Add seq_length in qwen fake config.ini by @Duyi-Wang in #238
[Tools] Accelerate model loading. by @marvin-Yu in #224
[Fix] Fix the wrong output of QWEN-14B. by @marvin-Yu in #240
fix issue #220 by @a3213105 in #242
Bump gradio from 4.11.0 to 4.19.2 in /examples/web_demo by @dependabot in #241
[Example] Add llama2 chat support in Cli demo. by @Duyi-Wang in #243
[Dependency] Update web demo requirement. by @Duyi-Wang in #246
[Docs] Initial documents. by @Duyi-Wang in #248
Fix Opt issue by @xiangzez in #251
[Serving] Fix fail to set pad_token_id when it's not None in single mode. by @Duyi-Wang in #254
[layers] Add bf16-type input/output support for flash attention by @abenmao in #252
[Kernel] Set USE_AMX_M to 1. by @Duyi-Wang in #245
[Benchmark] Fix typo in benchmark script. by @Duyi-Wang in #261
[Attention Kernel/Layer] group attention support in full-link BF16 path; attention layer refactor by @pujiang2018 in #258
[Search] Sync sample result in multi-rank. by @Duyi-Wang in #260
[Benchmark] Update model cfg for transformers>4.36. by @Duyi-Wang in #257
[Layer] Use flash attention when larger than threshold ('>=' to '>') by @pujiang2018 in #265
[Benchmark] Modify CPU affinity logic, add CI prompt output. by @marvin-Yu in #263
[Version] v1.4.0. by @Duyi-Wang in #262

New Contributors

@dependabot made their first contribution in #150

Full Changelog: v1.3.1...v1.4.0

Contributors

a3213105, marvin-Yu, and 6 other contributors

Releases: intel/xFasterTransformer
