Intel® Extension for PyTorch* v2.2.0+cpu Release Notes
We are excited to announce the release of Intel® Extension for PyTorch* 2.2.0+cpu which accompanies PyTorch 2.2. This release mainly brings in our latest optimization on Large Language Model (LLM) including new dedicated API set (ipex.llm
), new capability for auto-tuning accuracy recipe for LLM, and a broader list of optimized LLM models, together with a set of bug fixing and small optimization. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try this release and feedback as to improve further on this product.
Highlights
-
Large Language Model (LLM) optimization
Intel® Extension for PyTorch* provides a new dedicated module,
ipex.llm
, to host for Large Language Models (LLMs) specific APIs. Withipex.llm
, Intel® Extension for PyTorch* provides comprehensive LLM optimization cross various popular datatypes including FP32/BF16/INT8/INT4. Specifically for low precision, both SmoothQuant and Weight-Only quantization are supported for various scenarios. And user can also run Intel® Extension for PyTorch* with Tensor Parallel to fit in the multiple ranks or multiple nodes scenarios to get even better performance.A typical API under this new module is
ipex.llm.optimize
, which is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations for both model-wise and content-generation-wise.ipex.llm.optimize
is an upgrade API to replace previousipex.optimize_transformers
, which will bring you more consistent LLM experience and performance. Below shows a simple example ofipex.llm.optimize
for fp32 or bf16 inference:import torch import intel_extension_for_pytorch as ipex import transformers model= transformers.AutoModelForCausalLM(model_name_or_path).eval() dtype = torch.float # or torch.bfloat16 model = ipex.llm.optimize(model, dtype=dtype) model.generate(YOUR_GENERATION_PARAMS)
More examples of this API can be found at LLM optimization API.
Besides the new optimization API for LLM inference, Intel® Extension for PyTorch* also provides new capability for users to auto-tune a good quantization recipe for running SmoothQuant INT8 with good accuracy. SmoothQuant is a popular method to improve the accuracy of int8 quantization. The new auto-tune API allows automatic global alpha tuning, and automatic layer-by-layer alpha tuning provided by Intel® Neural Compressor for the best INT8 accuracy. More details can be found at SmoothQuant Recipe Tuning API Introduction.
Intel® Extension for PyTorch* newly optimized many more LLM models including more llama2 variance like llama2-13b/llama2-70b, encoder-decoder model like T5, code generation models like starcoder/codegen, and more like Baichuan, Baichuan2, ChatGLM2, ChatGLM3, mistral, mpt, dolly, etc.. A full list of optimized models can be found at LLM Optimization.
-
Bug fixing and other optimization
Full Changelog: v2.1.100+cpu...v2.2.0+cpu