Releases · intel/xFasterTransformer
v1.8.2
v1.8.1
Functionality
- Expose the interface of embedding lookup.
Performance
- Optimized the performance of grouped query attention (GQA).
- Enhanced the performance of creating keys for the oneDNN primitive cache.
- Set the [bs][nh][seq][hs] layout as the default for the KV cache, resulting in better performance (illustrated below).
- Mitigated the task-split imbalance in self-attention.
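For readers unfamiliar with the layout notation above, the following sketch (illustrative only, not xFT code; all names are generic) shows how a flat offset is computed for a buffer laid out as `[bs][nh][seq][hs]`, i.e. batch, then head, then sequence position, then head size.

```python
# Illustrative only: flat offset into a KV cache buffer laid out as
# [bs][nh][seq][hs]; names are generic, not xFT internals.
def kv_offset(b, h, s, e, num_heads, seq_len, head_size):
    """Flat index of (batch b, head h, position s, element e)."""
    return ((b * num_heads + h) * seq_len + s) * head_size + e

# Example: second batch, head 3, token 10, first element.
print(kv_offset(1, 3, 10, 0, num_heads=8, seq_len=1024, head_size=128))
```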
v1.8.0 - Continuous Batching on a Single ARC GPU and AMX_FP16 Support.
Highlight
- Continuous batching on a single ARC GPU is supported and can be integrated via `vllm-xft` (example below).
- Introduce Intel AMX instruction support for the `float16` data type.
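A minimal sketch of driving the xFT backend through vllm-xft, assuming the fork keeps vLLM's offline `LLM`/`SamplingParams` API; the model path is a placeholder and exact arguments may differ.

```python
# Minimal sketch, assuming vllm-xft preserves vLLM's offline LLM API;
# the model path is a placeholder for an xFT-converted checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/xft-converted-model")
params = SamplingParams(temperature=0.8, max_tokens=64)
for out in llm.generate(["What is continuous batching?"], params):
    print(out.outputs[0].text)
```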
Models
- Support ChatGLM4 series models.
- Introduce BF16/FP16 full path support for Qwen series models.
Bug fixes
- Fixed a memory leak in the oneDNN primitive cache.
- Fixed the SPR-HBM flat QUAD mode detection issue in benchmark scripts.
- Fixed a head-split error in distributed grouped-query attention (GQA).
- Fixed an issue with the invokeAttentionLLaMA API.
What's Changed
- [Kernel] Enable continuous batching on single GPU. by @changqi1 in #452
- [Bugfix] fixed shm reduceAdd & rope error when batch size is large by @abenmao in #457
- [Feature] Enable AMX FP16 on next generation CPU by @wenhuanh in #456
- [Kernel] Cache oneDNN primitive when M < `XFT_PRIMITIVE_CACHE_M`, default 256. by @Duyi-Wang in #460
- [Denpendency] Pin python requirements.txt version. by @Duyi-Wang in #458
- [Dependency] Bump web_demo requirement. by @Duyi-Wang in #463
- [Layers] Enable AMX FP16 of FlashAttn by @abenmao in #459
- [Layers] Fix invokeAttentionLLaMA API by @wenhuanh in #464
- [Readme] Add accepted papers by @wenhuanh in #465
- [Kernel] Make SelfAttention prepared for AMX_FP16; More balanced task split in Cross Attention by @pujiang2018 in #466
- [Kernel] Upgrade xDNN to v1.5.2 and make AMX_FP16 work by @pujiang2018 in #468
Full Changelog: v1.7.3...v1.8.0
v1.7.3
v1.7.2 - Continuous batching feature supports Qwen 1.0 & hybrid data types.
Functionality
- Add continuous batching support of Qwen 1.0 models.
- Enable hybrid data types for the continuous batching feature, including `BF16_FP16`, `BF16_INT8`, `BF16_W8A8`, `BF16_INT4`, `BF16_NF4`, `W8A8_INT8`, `W8A8_int4`, and `W8A8_NF4`.
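As a sketch of how one of these hybrid types might be selected, the snippet below assumes the type name is passed (lower-cased) through the `dtype` argument of `xfastertransformer.AutoModel.from_pretrained`; the exact spelling accepted by the API is not stated in these notes.

```python
# Minimal sketch, assuming hybrid type names are accepted by the `dtype`
# argument of AutoModel.from_pretrained; the path is a placeholder.
import xfastertransformer

model = xfastertransformer.AutoModel.from_pretrained(
    "/path/to/xft-converted-model", dtype="bf16_int8"
)
```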
Bug fixes
- Fixed the conversion fault in Baichuan1 models.
What's Changed
- [Doc] Add vllm benchmark docs. by @marvin-Yu in #448
- [Kernel] Add GPU kernels and enable LLaMA model. by @changqi1 in #372
- [Tools] Add Baichuan1/2 convert tool by @abenmao in #451
- [Layers] Add qwenRope support for Qwen1.0 in CB mode by @abenmao in #449
- [Framework] Remove duplicated code by @xiangzez in #450
- [Model] Support hybrid model in continuous batching. by @Duyi-Wang in #453
- [Version] v1.7.2. by @Duyi-Wang in #454
Full Changelog: v1.7.1...v1.7.2
v1.7.1 - Continuous batching feature supports ChatGLM2/3.
Functionality
- Add continuous batching support of ChatGLM2/3 models.
- Qwen2Convert supports Qwen2 models quantized by GPTQ, such as GPTQ-Int8 and GPTQ-Int4, via the parameter `from_quantized_model="gptq"`.
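A hedged sketch of invoking the converter: only the `from_quantized_model="gptq"` parameter comes from the note above; the positional input/output arguments and paths are assumptions.

```python
# Minimal sketch; only from_quantized_model="gptq" is taken from the
# release note, the remaining arguments and paths are assumptions.
import xfastertransformer

xfastertransformer.Qwen2Convert().convert(
    "/path/to/Qwen2-GPTQ-Int4",      # GPTQ-quantized Hugging Face model
    "/path/to/xft-output-model",     # destination for xFT-format files
    from_quantized_model="gptq",
)
```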
Bug fixes
- Fixed the segmentation fault error when running with more than 2 ranks in vllm-xft serving.
What's Changed
- [README] Update README.md. by @Duyi-Wang in #434
- [README] Update README.md. by @Duyi-Wang in #435
- [Common]Add INT8/UINT4 to BF16 weight convert by @xiangzez in #436
- Add Continue Batching support for Chatglm2/3 by @a3213105 in #438
- [Model] Add Qwen2 GPTQ model support by @xiangzez in #439
- [Model] Fix array out of bounds when rank > 2. by @Duyi-Wang in #441
- Bump gradio from 4.19.2 to 4.36.0 in /examples/web_demo by @dependabot in #442
- [Version] v1.7.1. by @Duyi-Wang in #445
Full Changelog: v1.7.0...v1.7.1
v1.7.0 - Continuous batching feature supported.
Functionality
- Refactor the framework to support the continuous batching feature. `vllm-xft`, a fork of vLLM, integrates the xFasterTransformer backend and maintains compatibility with most of the official vLLM features.
- Remove the FP32 data type option for the KV cache.
- Add the `get_env()` Python API to get the recommended LD_PRELOAD set (example below).
- Add a GPU build option for the Intel Arc GPU series.
- Expose the interface of the LLaMA model, including attention and decoder.
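A minimal sketch of the `get_env()` API mentioned above; the exact return format is not shown in these notes, so treat the shell usage in the comment as an assumption.

```python
# Minimal sketch: print the recommended LD_PRELOAD value so it can be
# exported before launching inference. The return format is assumed.
import xfastertransformer

print(xfastertransformer.get_env())
# e.g. in a shell wrapper (assumed usage):
#   LD_PRELOAD=$(python -c "import xfastertransformer; print(xfastertransformer.get_env())") python demo.py
```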
Performance
- Update xDNN to release `v1.5.1`.
- Baichuan series models support a full FP16 pipeline to improve performance.
- More FP16 kernels added, including MHA, MLP, YaRN rotary_embedding, rmsnorm, and rope.
- Kernel implementation of crossAttnByHead.
Dependency
- Bump `torch` to `2.3.0`.
Bug fixes
- Fixed the segmentation fault error when running with more than 4 ranks.
- Fixed core dump and hang bugs when running across nodes.
What's Changed
- [Fix] add utf-8 encoding. by @marvin-Yu in #354
- [Benchmark] Calculate throughput using avg latency. by @Duyi-Wang in #360
- [GPU] Add GPU build option. by @changqi1 in #359
- Fix Qwen prompt.json by @JunxiChhen in #368
- [Model] Fix ICX build issue. by @changqi1 in #370
- [CMake] Remove evaluation under XFT_BUILD_TESTS option. by @Duyi-Wang in #374
- [Kernel][UT] Kernel impl. of crossAttnByHead and unit test for cross attention. by @pujiang2018 in #348
- [API] Add LLaMA attention API. by @changqi1 in #378
- [Finetune] Scripts for Llama2-7b lora finetune example using stock pytorch by @ustcuna in #327
- [Demo] Add abbreviation for output length. by @Duyi-Wang in #385
- [API] Add LLaMA decoder API. by @changqi1 in #386
- [API] Optimize API Impl. by @changqi1 in #396
- [Framework] Continuous Batching Support by @pujiang2018 in #357
- [KVCache] Remove FP32 data type. by @Duyi-Wang in #399
- [Interface] Change return shape of forward_cb. by @Duyi-Wang in #400
- [Example] Add demo of offline continuous batching by @pujiang2018 in #401
- [Layers] Add alibiSlopes Attn && Flash Attn for CB. by @abenmao in #402
- [Interface] Support List[int] and List[List[int]] for set_input_sb. by @Duyi-Wang in #404
- [Bug] fix incorrect input offset computing by @pujiang2018 in #405
- [Example] Fix incorrect tensor dimension with latest interface by @pujiang2018 in #406
- [Models/Layers/Kernels] Add Baichuan1/2 full-link bf16 support & Fix next-tok gen bug by @abenmao in #407
- [xDNN] Release v1.5.0. by @changqi1 in #410
- [Kernel] Add FP16 rmsnorm and rope kernels. by @changqi1 in #408
- [Kenrel] Add FP16 LLaMA YARN rotary_embedding. by @changqi1 in #412
- [Benchmark] Add platform options. Support real model. by @JunxiChhen in #409
- [Dependency] Update torch to 2.3.0. by @Duyi-Wang in #416
- [COMM] Fix bugs of core dump && hang when running cross nodes by @abenmao in #423
- [xDNN] Release v1.5.1. by @changqi1 in #422
- [Kernel] Add FP16 MHA and MLP kernels. by @changqi1 in #415
- [Python] Add `get_env()` to get LD_PRELOAD set. by @Duyi-Wang in #427
- Add --padding and fix bug by @yangkunx in #418
- [Layers] Fixed the seg fault error when running with more than 4 ranks by @abenmao in #424
- [Kernel] Less compute for Self-Attention (Q * K) by @pujiang2018 in #420
- [Dependency] Update libiomp5.so to `5.0.20230815` contained in mkl. by @Duyi-Wang in #430
- [Distribute] Add distribute support for continuous batching api. by @Duyi-Wang in #421
- [Layers] Fixed error in yarn by @abenmao in #429
- [README] Update readme. by @Duyi-Wang in #431
- [Dependency] Fix wrong so path returned in `get_env()`. by @Duyi-Wang in #432
- [Version] v1.7.0. by @Duyi-Wang in #433
New Contributors
Full Changelog: v1.6.0...v1.7.0
v1.6.0 - Llama3 and Qwen2 series models supported.
Functionality
- Support Llama3 and Qwen2 series models.
- Add an INT8 KV cache data type, specified via the `kv_cache_dtype` parameter, including `int8`, `fp16` (default), and `fp32` (see the example after this list).
- More models enable the full BF16 pipeline, including ChatGLM2/3 and YaRN-Llama.
- Add invokeMLPLLaMA FP16 API.
- Support logits output using the `forward()` API.
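A hedged sketch combining two of the items above: it assumes `kv_cache_dtype` is accepted by `AutoModel.from_pretrained` alongside `dtype`, and that `forward()` returns logits for the given token ids; both assumptions are inferred from these notes rather than verified against the API.

```python
# Minimal sketch; kv_cache_dtype placement and the forward() signature are
# assumptions inferred from the notes above. Path and token ids are placeholders.
import torch
import xfastertransformer

model = xfastertransformer.AutoModel.from_pretrained(
    "/path/to/xft-converted-model", dtype="bf16", kv_cache_dtype="int8"
)
input_ids = torch.tensor([[1, 2, 3, 4]], dtype=torch.int64)
logits = model.forward(input_ids)
print(logits.shape)
```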
Dependency
- Bump `transformers` to `4.40.0` to support Llama3 models.
Performance
- Update xDNN to release `v1.4.6`.
Bug fixes
- Fix numeric overflow when calculating softmax in sampling.
- Fix assert bug when concatenating gate & up.
What's Changed
- [Model] Expose KV cache data type in Llama model. by @pujiang2018 in #313
- [API] Format rotary_embedding api. by @changqi1 in #303
- [Kernel] Add kernel support for INT8 KV cache. by @pujiang2018 in #314
- [Convert] Fix Qwen convert issue. by @marvin-Yu in #315
- [API] Add invokeMLPLLaMA FP16 API. by @changqi1 in #302
- [Build] Fix build issue. by @changqi1 in #316
- Chatglm2/3 bf16 pipeline support by @a3213105 in #301
- [README] Add README_CN.md. by @Duyi-Wang in #317
- [Kernel] Bug fix for small_gemm_transb by @pujiang2018 in #318
- [Eval] Get logits output. by @marvin-Yu in #319
- [CMake] Add oneccl build depends for comm_helper. by @Duyi-Wang in #322
- [Layers] fix assert bug when concat gate&up by @abenmao in #323
- [Sample] Fix numeric overflow when calculate softmax. by @Duyi-Wang in #326
- [Models] Use factory class to create decoder. by @Duyi-Wang in #321
- [RAEDME] Update readme for the dependent lib. by @xwang98 in #331
- [KVCache] INT8 KV cache implementation and related changes by @pujiang2018 in #320
- [Model] Add Qwen2 model. by @marvin-Yu in #330
- [KVCache] Add inferface and register for kvcache. by @Duyi-Wang in #336
- [Demo] Add kvcache type option in web demo. by @Duyi-Wang in #338
- [Benchmark] Add KVCache data type option. by @Duyi-Wang in #337
- [model] Add llama3 model. by @marvin-Yu in #340
- [Kernel] Add 'acc' param in small_gemm, add lacked and remove unused small_gemm kernels. by @pujiang2018 in #346
- [xDNN] Release v1.4.6. by @changqi1 in #342
- [Evaluation] fix the model register bug in evaluation by @abenmao in #347
- [Models] YaRN-Llama full-link bf16 support by @abenmao in #344
- [UT] Remove beam search test temporarily. by @Duyi-Wang in #349
- [Version] v1.6.0. by @Duyi-Wang in #352
New Contributors
Full Changelog: v1.5.0...v1.6.0
v1.5.0 - Gemma series models supported.
Functionality
- Support Gemma series models, including Gemma and CodeGemma, and the DeepSeek model.
- The Llama converter supports converting Hugging Face models quantized by GPTQ (via the parameter `from_quantized_model='gptq'`) into xFT-format INT8/INT4 model files.
- Support loading INT4 data weights directly from local files.
- Optimize memory usage during Qwen model conversion, particularly for Qwen-72B.
Dependency
- Bump `transformers` to `4.38.1` to support Gemma models.
- Add `protobuf` to support new behavior in `tokenizer`.
Performance
- Update xDNN to release `v1.4.5`.
- Add GPU kernel library gpuDNN v0.1 to support the Intel Arc GPU series.
- Optimize RoPE performance by reducing repeated sin and cos embedding table data.
- Accelerate KV cache copy by increasing parallelism in self-attention.
- Accelerate the addreduce operation for long sequences by transposing the KV cache and tuning communication.
Bug fixes
- Fix an incorrect computation that should have been done in float but was done in integer.
- Fix disordered timeline.
- Fix a runtime issue in Qwen when seq_length is larger than 32768.
What's Changed
- [Kernel] Fix the incorrect computing which should be in float, but was in integer by @pujiang2018 in #267
- [Layer] Reduce repeated sin and cos embedding table data to optimize ROPE perf. by @changqi1 in #266
- [Kernel] increase parallelism for KV cache copy in self attention by @pujiang2018 in #268
- [Include] Fix include not work. by @Duyi-Wang in #271
- Issue qwen72b seq length by @a3213105 in #273
- [Common] Unify memory allocation into xft::alloc by @pujiang2018 in #272
- [Timeline] Fix disordered timeline. by @changqi1 in #277
- [model] Add deepseek model. by @marvin-Yu in #274
- [Bug] Fix incorrect context parameter order. by @changqi1 in #280
- [CI] Check for UT status. by @marvin-Yu in #278
- [CMake] Check existence of MKL & oneDNN directory before installation. by @Duyi-Wang in #283
- Add KVCache trans for long sequence && tuned comm for faster Addreduce by @abenmao in #279
- [Dependency] Add protobuf in requirements.txt by @Duyi-Wang in #284
- [xDNN] Release v1.4.5. by @changqi1 in #285
- [CI] Add rls test case. by @marvin-Yu in #286
- [Bug] fix baichuan model test issue. by @marvin-Yu in #287
- [Fix] Fix baichuan2-13 without rope. by @marvin-Yu in #289
- [Tools] Add convert tool for Llama models quantized by AutoGPTQ by @xiangzez in #276
- [Common] Support loading int4 weights by @xiangzez in #275
- [KVCache] KV Cache refactor and related unit test case fix by @pujiang2018 in #290
- [Model] Update isMaster func. by @changqi1 in #292
- [Bug] Fix oneDNN GPU build issue. by @changqi1 in #293
- [UT] add unit test for selfAttention, and a small fix by @pujiang2018 in #294
- [gpuDNN] Add gpuDNN v0.1.0 library files. by @feng-intel in #291
- [UT] MLP unit test case fix by @abenmao in #296
- [Fix] Reduce convert memory usage. by @marvin-Yu in #297
- [ENV] Use Meyers' Singleton Env object. by @Duyi-Wang in #295
- [fix] fix compile issue. by @marvin-Yu in #299
- [Example] Add gemma model config and web demo. by @marvin-Yu in #304
- [Model] Add gemma model support. by @marvin-Yu in #259
- [example] add gemma model support with example. by @marvin-Yu in #307
- Bump transformers from 4.36.0 to 4.38.0 in /examples/web_demo by @dependabot in #308
- Fix timeline compile issue by @xiangzez in #309
- [Build] Fix build issues. by @changqi1 in #310
- [Version] v1.5.0. by @Duyi-Wang in #311
New Contributors
- @feng-intel made their first contribution in #291
Full Changelog: v1.4.0...v1.5.0
v1.4.0 - Full BF16 support in Llama for better performance, and serving framework support.
Functionality
- Introduce pure BF16 support for Llama series models; a fully BF16 data type path can now be used to utilize AMX more effectively when deploying Llama models.
- Add MLServer serving framework support and a demo in the `serving` directory.
- The GCC version used for compiling release binaries has been updated from GCC 8.5 to GCC 12.
- Introduce a pipeline parallel feature for distributed deployment. Enable it with `cmake .. -DWITH_PIPELINE_PARALLEL=ON` at compile time and use the `XFT_PIPELINE_STAGE` macro to define the number of pipeline parallel stages.
- Deprecate the convert tool scripts in the `tools` directory; using `Convert` from the xfastertransformer Python wheel is recommended.
- Support loading INT8 data weights directly from local files.
Performance
- Update xDNN to release `v1.4.4`.
- Accelerate model weight loading by optimizing the cast operation after loading, gaining up to 50% speed-up.
- Optimize BF16 performance with AMX instructions when batch size <= 8, and add `XFT_USE_AMX_M` to set the threshold of M for using AMX instead of AVX512; default is `1`.
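A small sketch of using the `XFT_USE_AMX_M` threshold; since it is read from the environment, setting it in the launching shell works as well, and the exact point at which xFT reads it is an assumption.

```python
# Minimal sketch: set the M threshold for choosing AMX over AVX512 before the
# model is created. Only the variable name and default come from the note
# above; the path is a placeholder.
import os
os.environ["XFT_USE_AMX_M"] = "8"

import xfastertransformer
model = xfastertransformer.AutoModel.from_pretrained("/path/to/model", dtype="bf16")
```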
Demo & Benchmark
- Update the `transformers` dependency requirement from `4.30.0` to `4.36.0` due to high-risk CVE vulnerabilities.
- Add a distributed inference benchmark script that supports deployment across platforms.
- Add single-node platform support in the benchmark script.
- Add a Yi model web demo.
- Enhance the command-line chat mode in the PyTorch demo.py; use `--chat true` to enable it.
Bug fixes
- Fix a calculation issue in Qwen models and enhance LogN support for long token sequences.
- Fix an unsync issue in multi-rank mode when `do_sample` is enabled.
- Fix Baichuan models' calculation and conversion issues.
- Fix repetition penalties not taking effect on other batches.
What's Changed
- [Demo] Update web demo to adapt gradio 4.11.0. by @Duyi-Wang in #201
- Bump gradio from 3.40.1 to 4.11.0 in /examples/web_demo by @dependabot in #150
- [demo] Add Yi model demo. by @marvin-Yu in #200
- [Dependency] transformers version warning when error occurs. by @Duyi-Wang in #202
- [Tools] Deprecate convert tools in tools dir. by @Duyi-Wang in #203
- [benchmark] Add one node benchmark. by @marvin-Yu in #205
- Bump transformers from 4.30.0 to 4.36.0 by @dependabot in #145
- Bump transformers from 4.30.0 to 4.36.0 in /examples/web_demo by @dependabot in #144
- [CMake] Check if the compiler really supports avx512bf16 with try_compile by @pujiang2018 in #206
- [Layer] Fine grained data type definition for Attention and MLP by @pujiang2018 in #194
- Add recommend GCC version by @a3213105 in #207
- [TP] Make split dimension align with oneDNN packing by @pujiang2018 in #208
- Support loading int8 weights by @xiangzez in #157
- [benchmark] Add distributed benchmark. by @marvin-Yu in #211
- [ci] Fix python path issue. by @marvin-Yu in #214
- [Fix] Fix repetition penalties not taking effect on other batches. by @Duyi-Wang in #212
- [xDNN] Release v1.4.3. by @changqi1 in #213
- [ci] Add workflow permission. by @marvin-Yu in #218
- [Layer] Enable pipeline parallel feature. by @changqi1 in #221
- [Dockerfile] Remove dockerfile. by @Duyi-Wang in #219
- [CI] Align using benchmark tests. by @marvin-Yu in #216
- [xDNN] Release v1.4.4. by @changqi1 in #223
- [Layer] Support pure full-link BF16 LLaMa model. by @pujiang2018 in #222
- [Layers] Qwen LogN for query by @a3213105 in #215
- [Layer] Convert static MMHelper class to instance Class in DecoderContext. by @changqi1 in #225
- [models][layers/tools] Refine and bugfix for baichuan models by @abenmao in #226
- [Serving] Add MLServer serving support. by @Duyi-Wang in #217
- [Dependencies] Remove tokenizers requirement. by @Duyi-Wang in #227
- [kernel] Add ICX compiler. by @changqi1 in #228
- [Env] Add XFT_ENGINE env variable. by @changqi1 in #231
- [CMake] Open the pip-install information for MKL. by @marvin-Yu in #234
- [Fix] Add parameter check for logN and NTK rotary embedding of QWEN by @a3213105 in #232
- [CMake] Remvoe force reinstall for mkl dependencies. by @Duyi-Wang in #237
- [Example] Add seq_length in qwen fake config.ini by @Duyi-Wang in #238
- [Tools] Accelerate model loading. by @marvin-Yu in #224
- [Fix] Fix the wrong output of QWEN-14B. by @marvin-Yu in #240
- fix issue #220 by @a3213105 in #242
- Bump gradio from 4.11.0 to 4.19.2 in /examples/web_demo by @dependabot in #241
- [Example] Add llama2 chat support in Cli demo. by @Duyi-Wang in #243
- [Dependency] Update web demo requirement. by @Duyi-Wang in #246
- [Docs] Initial documents. by @Duyi-Wang in #248
- Fix Opt issue by @xiangzez in #251
- [Serving] Fix fail to set pad_token_id when it's not None in single mode. by @Duyi-Wang in #254
- [layers] Add bf16-type input/output support for flash attention by @abenmao in #252
- [Kernel] Set USE_AMX_M to 1. by @Duyi-Wang in #245
- [Benchmark] Fix typo in benchmark script. by @Duyi-Wang in #261
- [Attention Kernel/Layer] group attention support in full-link BF16 path; attention layer refactor by @pujiang2018 in #258
- [Search] Sync smaple result in multi-rank. by @Duyi-Wang in #260
- [Benchmark] Update model cfg for transformers>4.36. by @Duyi-Wang in #257
- [Layer] Use flash attention when larger than threshold ('>=' to '>') by @pujiang2018 in #265
- [Benchmark] Modify CPU affinity logic, add CI prompt output. by @marvin-Yu in #263
- [Version] v1.4.0. by @Duyi-Wang in #262
New Contributors
- @dependabot made their first contribution in #150
Full Changelog: v1.3.1...v1.4.0