Release Intel® Extension for PyTorch* v2.2.0+cpu Release Notes · intel/intel-extension-for-pytorch

We are excited to announce the release of Intel® Extension for PyTorch* 2.2.0+cpu which accompanies PyTorch 2.2. This release mainly brings in our latest optimization on Large Language Model (LLM) including new dedicated API set (ipex.llm), new capability for auto-tuning accuracy recipe for LLM, and a broader list of optimized LLM models, together with a set of bug fixing and small optimization. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try this release and feedback as to improve further on this product.

Highlights

Large Language Model (LLM) optimization

Intel® Extension for PyTorch* provides a new dedicated module, ipex.llm, to host for Large Language Models (LLMs) specific APIs. With ipex.llm, Intel® Extension for PyTorch* provides comprehensive LLM optimization cross various popular datatypes including FP32/BF16/INT8/INT4. Specifically for low precision, both SmoothQuant and Weight-Only quantization are supported for various scenarios. And user can also run Intel® Extension for PyTorch* with Tensor Parallel to fit in the multiple ranks or multiple nodes scenarios to get even better performance.

A typical API under this new module is ipex.llm.optimize, which is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations for both model-wise and content-generation-wise. ipex.llm.optimize is an upgrade API to replace previous ipex.optimize_transformers, which will bring you more consistent LLM experience and performance. Below shows a simple example of ipex.llm.optimize for fp32 or bf16 inference:
```
import torch
import intel_extension_for_pytorch as ipex
import transformers

model= transformers.AutoModelForCausalLM(model_name_or_path).eval()

dtype = torch.float # or torch.bfloat16
model = ipex.llm.optimize(model, dtype=dtype)

model.generate(YOUR_GENERATION_PARAMS)
```
More examples of this API can be found at LLM optimization API.

Besides the new optimization API for LLM inference, Intel® Extension for PyTorch* also provides new capability for users to auto-tune a good quantization recipe for running SmoothQuant INT8 with good accuracy. SmoothQuant is a popular method to improve the accuracy of int8 quantization. The new auto-tune API allows automatic global alpha tuning, and automatic layer-by-layer alpha tuning provided by Intel® Neural Compressor for the best INT8 accuracy. More details can be found at SmoothQuant Recipe Tuning API Introduction.

Intel® Extension for PyTorch* newly optimized many more LLM models including more llama2 variance like llama2-13b/llama2-70b, encoder-decoder model like T5, code generation models like starcoder/codegen, and more like Baichuan, Baichuan2, ChatGLM2, ChatGLM3, mistral, mpt, dolly, etc.. A full list of optimized models can be found at LLM Optimization.
Bug fixing and other optimization
- Further optimized the performance of LLMs #2349 #2412, #2469, #2476
- Optimized the Flash Attention Operator #2317 #2334 #2392 #2480
- Fixed the static quantization of the ELSER model #2491
- Switched deepspeed to the public release version on PyPI #2473 #2511
- Upgrade oneDNN to v3.3.4 #2433

Full Changelog: v2.1.100+cpu...v2.2.0+cpu

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intel® Extension for PyTorch* v2.2.0+cpu Release Notes

Highlights