v1.4.0 - Full BF16 support in Llama for better performance, and serving framework support.
Functionality
- Introduce pure BF16 support for Llama series models; the fully BF16 data path utilizes AMX more effectively when deploying Llama models.
- Add MLServer serving framework support, with a demo in the `serving` directory.
- The GCC used to compile release binaries has been updated from GCC 8.5 to GCC 12.
- Introduce a pipeline parallel feature for distributed deployment. Enable it by compiling with `cmake .. -DWITH_PIPELINE_PARALLEL=ON`, and use the `XFT_PIPELINE_STAGE` macro to define the number of pipeline parallel stages.
- Deprecate the convert tool scripts in the `tools` directory; using `Convert` from the xfastertransformer Python wheel is recommended instead (see the sketch after this list).
- Support loading int8 weights directly from local files.
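
A minimal sketch of the recommended wheel-based conversion, assuming a Llama checkpoint; the converter class name (`LlamaConvert`) and the paths are illustrative assumptions, so check the wheel for the converter matching your model family:

```python
# Sketch: convert a Hugging Face Llama checkpoint with the xfastertransformer
# wheel instead of the deprecated tools/ scripts. The class name and paths
# are assumptions for illustration.
import xfastertransformer

xfastertransformer.LlamaConvert().convert(
    "/data/llama-2-7b-hf",   # input: Hugging Face checkpoint directory
    "/data/llama-2-7b-xft",  # output: converted xFT weights
)
```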
Performance
- Update xDNN to release `v1.4.4`.
- Accelerate model weight loading by optimizing the cast operation performed after loading, for up to a 50% speedup.
- Optimize BF16 performance using AMX instructions when batch size <= 8, and add `XFT_USE_AMX_M` to set the M threshold at which AMX is used instead of AVX512; the default is `1`. See the sketch after this list.
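
A minimal sketch of tuning this threshold, assuming the environment variable is read by the runtime at model creation; the threshold semantics, model path, and `dtype` value here are illustrative:

```python
# Sketch: raise the AMX threshold so small-M GEMMs stay on AVX512.
# XFT_USE_AMX_M must be set before the model is created; the exact
# threshold semantics are an assumption for illustration.
import os
os.environ["XFT_USE_AMX_M"] = "8"  # default is 1

import xfastertransformer
model = xfastertransformer.AutoModel.from_pretrained(
    "/data/llama-2-7b-xft", dtype="bf16"
)
```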
Demo & Benchmark
- Update the `transformers` dependency requirement from `4.30.0` to `4.36.0` due to high-risk CVE vulnerabilities.
- Add a distributed inference benchmark script that supports deployment across platforms.
- Add single-node platform support in the benchmark script.
- Add Yi model web demo.
- Enhance the command-line chat mode in the PyTorch `demo.py`; use `--chat true` to enable it.
Bug fixes
- Fix a calculation issue in Qwen models and enhance LogN support for long token sequences.
- Fix an out-of-sync issue in multi-rank models when `do_sample` is enabled.
- Fix calculation and conversion issues in Baichuan models.
- Fix repetition penalties not taking effect on other batches.
What's Changed
- [Demo] Update web demo to adapt gradio 4.11.0. by @Duyi-Wang in #201
- Bump gradio from 3.40.1 to 4.11.0 in /examples/web_demo by @dependabot in #150
- [demo] Add Yi model demo. by @marvin-Yu in #200
- [Dependency] transformers version warning when error occurs. by @Duyi-Wang in #202
- [Tools] Deprecate convert tools in tools dir. by @Duyi-Wang in #203
- [benchmark] Add one node benchmark. by @marvin-Yu in #205
- Bump transformers from 4.30.0 to 4.36.0 by @dependabot in #145
- Bump transformers from 4.30.0 to 4.36.0 in /examples/web_demo by @dependabot in #144
- [CMake] Check if the compiler really supports avx512bf16 with try_compile by @pujiang2018 in #206
- [Layer] Fine grained data type definition for Attention and MLP by @pujiang2018 in #194
- Add recommend GCC version by @a3213105 in #207
- [TP] Make split dimension align with oneDNN packing by @pujiang2018 in #208
- Support loading int8 weights by @xiangzez in #157
- [benchmark] Add distributed benchmark. by @marvin-Yu in #211
- [ci] Fix python path issue. by @marvin-Yu in #214
- [Fix] Fix repetition penalties not taking effect on other batches. by @Duyi-Wang in #212
- [xDNN] Release v1.4.3. by @changqi1 in #213
- [ci] Add workflow permission. by @marvin-Yu in #218
- [Layer] Enable pipeline parallel feature. by @changqi1 in #221
- [Dockerfile] Remove dockerfile. by @Duyi-Wang in #219
- [CI] Align using benchmark tests. by @marvin-Yu in #216
- [xDNN] Release v1.4.4. by @changqi1 in #223
- [Layer] Support pure full-link BF16 LLaMa model. by @pujiang2018 in #222
- [Layers] Qwen LogN for query by @a3213105 in #215
- [Layer] Convert static MMHelper class to instance Class in DecoderContext. by @changqi1 in #225
- [models][layers/tools] Refine and bugfix for baichuan models by @abenmao in #226
- [Serving] Add MLServer serving support. by @Duyi-Wang in #217
- [Dependencies] Remove tokenizers requirement. by @Duyi-Wang in #227
- [kernel] Add ICX compiler. by @changqi1 in #228
- [Env] Add XFT_ENGINE env variable. by @changqi1 in #231
- [CMake] Open the pip-install information for MKL. by @marvin-Yu in #234
- [Fix] Add parameter check for logN and NTK rotary embedding of QWEN by @a3213105 in #232
- [CMake] Remove force reinstall for mkl dependencies. by @Duyi-Wang in #237
- [Example] Add seq_length in qwen fake config.ini by @Duyi-Wang in #238
- [Tools] Accelerate model loading. by @marvin-Yu in #224
- [Fix] Fix the wrong output of QWEN-14B. by @marvin-Yu in #240
- fix issue #220 by @a3213105 in #242
- Bump gradio from 4.11.0 to 4.19.2 in /examples/web_demo by @dependabot in #241
- [Example] Add llama2 chat support in Cli demo. by @Duyi-Wang in #243
- [Dependency] Update web demo requirement. by @Duyi-Wang in #246
- [Docs] Initial documents. by @Duyi-Wang in #248
- Fix Opt issue by @xiangzez in #251
- [Serving] Fix fail to set pad_token_id when it's not None in single mode. by @Duyi-Wang in #254
- [layers] Add bf16-type input/output support for flash attention by @abenmao in #252
- [Kernel] Set USE_AMX_M to 1. by @Duyi-Wang in #245
- [Benchmark] Fix typo in benchmark script. by @Duyi-Wang in #261
- [Attention Kernel/Layer] group attention support in full-link BF16 path; attention layer refactor by @pujiang2018 in #258
- [Search] Sync sample result in multi-rank. by @Duyi-Wang in #260
- [Benchmark] Update model cfg for transformers>4.36. by @Duyi-Wang in #257
- [Layer] Use flash attention when larger than threshold ('>=' to '>') by @pujiang2018 in #265
- [Benchmark] Modify CPU affinity logic, add CI prompt output. by @marvin-Yu in #263
- [Version] v1.4.0. by @Duyi-Wang in #262
New Contributors
- @dependabot made their first contribution in #150
Full Changelog: v1.3.1...v1.4.0