v1.4.0 - Full BF16 support in Llama for better performance, and serving framework support.
Functionality
- Introduce pure BF16 support for Llama series models; the fully BF16 data path utilizes AMX more effectively when deploying Llama models.
- Add MLServer serving framework support, with a demo in the `serving` directory.
- The GCC used to compile release binaries has been updated from GCC 8.5 to GCC 12.
- Introduce a pipeline parallel feature for distributed deployment. Enable it by compiling with `cmake .. -DWITH_PIPELINE_PARALLEL=ON`, and use the `XFT_PIPELINE_STAGE` macro to define the number of pipeline parallel stages.
- Deprecate the convert tool scripts in the `tools` directory; using `Convert` from the xfastertransformer Python wheel is recommended instead (see the sketch after this list).
- Support loading int8 weights directly from local files.
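
A minimal sketch of the recommended wheel-based conversion, assuming a Llama checkpoint; the converter class name (`LlamaConvert`) and the paths are illustrative assumptions, so check the wheel for the converter matching your model family:

```python
# Sketch: convert a Hugging Face Llama checkpoint with the xfastertransformer
# wheel instead of the deprecated tools/ scripts. The class name and paths
# are assumptions for illustration.
import xfastertransformer

xfastertransformer.LlamaConvert().convert(
    "/data/llama-2-7b-hf",   # input: Hugging Face checkpoint directory
    "/data/llama-2-7b-xft",  # output: converted xFT weights
)
```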
Performance
- Update xDNN to release `v1.4.4`.
- Accelerate model weight loading by optimizing the cast operation performed after loading, for up to a 50% speedup.
- Optimize BF16 performance using AMX instructions when batch size <= 8, and add `XFT_USE_AMX_M` to set the M threshold at which AMX is used instead of AVX512; the default is `1`. See the sketch after this list.
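
A minimal sketch of tuning this threshold, assuming the environment variable is read by the runtime at model creation; the threshold semantics, model path, and `dtype` value here are illustrative:

```python
# Sketch: raise the AMX threshold so small-M GEMMs stay on AVX512.
# XFT_USE_AMX_M must be set before the model is created; the exact
# threshold semantics are an assumption for illustration.
import os
os.environ["XFT_USE_AMX_M"] = "8"  # default is 1

import xfastertransformer
model = xfastertransformer.AutoModel.from_pretrained(
    "/data/llama-2-7b-xft", dtype="bf16"
)
```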
Demo & Benchmark
- Update the `transformers` dependency requirement from `4.30.0` to `4.36.0` due to high-risk CVE vulnerabilities.
- Add a distributed inference benchmark script that supports deployment across platforms.
- Add single-node platform support in the benchmark script.
- Add Yi model web demo.
- Enhance the command-line chat mode in the PyTorch `demo.py`; use `--chat true` to enable it.
Bug fixes
- Fix a calculation issue in Qwen models and enhance LogN support for long token sequences.
- Fix an out-of-sync issue in multi-rank models when `do_sample` is enabled.
- Fix calculation and conversion issues in Baichuan models.
- Fix repetition penalties not taking effect on other batches.
What's Changed
- [Demo] Update web demo to adapt gradio 4.11.0. by @Duyi-Wang in #201
- Bump gradio from 3.40.1 to 4.11.0 in /examples/web_demo by @dependabot in #150
- [demo] Add Yi model demo. by @marvin-Yu in #200
- [Dependency] transformers version warning when error occurs. by @Duyi-Wang in #202
- [Tools] Deprecate convert tools in tools dir. by @Duyi-Wang in #203
- [benchmark] Add one node benchmark. by @marvin-Yu in #205
- Bump transformers from 4.30.0 to 4.36.0 by @dependabot in #145
- Bump transformers from 4.30.0 to 4.36.0 in /examples/web_demo by @dependabot in #144
- [CMake] Check if the compiler really supports avx512bf16 with try_compile by @pujiang2018 in #206
- [Layer] Fine grained data type definition for Attention and MLP by @pujiang2018 in #194
- Add recommend GCC version by @a3213105 in #207
- [TP] Make split dimension align with oneDNN packing by @pujiang2018 in #208
- Support loading int8 weights by @xiangzez in #157
- [benchmark] Add distributed benchmark. by @marvin-Yu in #211
- [ci] Fix python path issue. by @marvin-Yu in #214
- [Fix] Fix repetition penalties not taking effect on other batches. by @Duyi-Wang in #212
- [xDNN] Release v1.4.3. by @changqi1 in #213
- [ci] Add workflow permission. by @marvin-Yu in #218
- [Layer] Enable pipeline parallel feature. by @changqi1 in #221
- [Dockerfile] Remove dockerfile. by @Duyi-Wang in #219
- [CI] Align using benchmark tests. by @marvin-Yu in #216
- [xDNN] Release v1.4.4. by @changqi1 in #223
- [Layer] Support pure full-link BF16 LLaMa model. by @pujiang2018 in #222
- [Layers] Qwen LogN for query by @a3213105 in #215
- [Layer] Convert static MMHelper class to instance Class in DecoderContext. by @changqi1 in #225
- [models][layers/tools] Refine and bugfix for baichuan models by @abenmao in #226
- [Serving] Add MLServer serving support. by @Duyi-Wang in #217
- [Dependencies] Remove tokenizers requirement. by @Duyi-Wang in #227
- [kernel] Add ICX compiler. by @changqi1 in #228
- [Env] Add XFT_ENGINE env variable. by @changqi1 in #231
- [CMake] Open the pip-install information for MKL. by @marvin-Yu in #234
- [Fix] Add parameter check for logN and NTK rotary embedding of QWEN by @a3213105 in #232
- [CMake] Remove force reinstall for mkl dependencies. by @Duyi-Wang in #237
- [Example] Add seq_length in qwen fake config.ini by @Duyi-Wang in #238
- [Tools] Accelerate model loading. by @marvin-Yu in #224
- [Fix] Fix the wrong output of QWEN-14B. by @marvin-Yu in #240
- fix issue #220 by @a3213105 in #242
- Bump gradio from 4.11.0 to 4.19.2 in /examples/web_demo by @dependabot in #241
- [Example] Add llama2 chat support in Cli demo. by @Duyi-Wang in #243
- [Dependency] Update web demo requirement. by @Duyi-Wang in #246
- [Docs] Initial documents. by @Duyi-Wang in #248
- Fix Opt issue by @xiangzez in #251
- [Serving] Fix fail to set pad_token_id when it's not None in single mode. by @Duyi-Wang in #254
- [layers] Add bf16-type input/output support for flash attention by @abenmao in #252
- [Kernel] Set USE_AMX_M to 1. by @Duyi-Wang in #245
- [Benchmark] Fix typo in benchmark script. by @Duyi-Wang in #261
- [Attention Kernel/Layer] group attention support in full-link BF16 path; attention layer refactor by @pujiang2018 in #258
- [Search] Sync sample result in multi-rank. by @Duyi-Wang in #260
- [Benchmark] Update model cfg for transformers>4.36. by @Duyi-Wang in #257
- [Layer] Use flash attention when larger than threshold ('>=' to '>') by @pujiang2018 in #265
- [Benchmark] Modify CPU affinity logic, add CI prompt output. by @marvin-Yu in #263
- [Version] v1.4.0. by @Duyi-Wang in #262
New Contributors
- @dependabot made their first contribution in #150
Full Changelog: v1.3.1...v1.4.0