release for phi3 model (#2913)
jingxu10 authored May 21, 2024
1 parent d85c47f commit 2b852fd
Showing 147 changed files with 29,543 additions and 4 deletions.
8 changes: 4 additions & 4 deletions llm/llama3/cpu/index.html
@@ -113,14 +113,14 @@ <h1>1. Environment Setup
<p>There are several environment setup methodologies provided. You can choose either of them according to your usage scenario. The Docker-based ones are recommended.</p>
<section id="recommended-docker-based-environment-setup-with-pre-built-wheels">
<h2>1.1 [RECOMMENDED] Docker-based environment setup with pre-built wheels<a class="headerlink" href="#recommended-docker-based-environment-setup-with-pre-built-wheels" title="Link to this heading"></a></h2>
-# Get the Intel® Extension for PyTorch\* source code
+# Get the Intel® Extension for PyTorch* source code
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout 2.3-rc1-sp
git submodule sync
git submodule update --init --recursive

-# Build an image with the provided Dockerfile by installing from Intel® Extension for PyTorch\* prebuilt wheel files
+# Build an image with the provided Dockerfile by installing from Intel® Extension for PyTorch* prebuilt wheel files
DOCKER_BUILDKIT=1 docker build -f examples/cpu/inference/python/llm/Dockerfile -t ipex-llm:2.3.0 .

# Run the container with command below
@@ -136,7 +136,7 @@ <h2>1.1 [RECOMMENDED] Docker-based environment setup with pre-built wheels
</section>
<section id="conda-based-environment-setup-with-pre-built-wheels">
<h2>1.2 Conda-based environment setup with pre-built wheels<a class="headerlink" href="#conda-based-environment-setup-with-pre-built-wheels" title="Link to this heading"></a></h2>
-# Get the Intel® Extension for PyTorch\* source code
+# Get the Intel® Extension for PyTorch* source code
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout 2.3-rc1-sp
@@ -321,4 +321,4 @@ <h2>Miscellaneous Tips
</script>

</body>
-</html>
+</html>
Binary file added llm/phi3/cpu/_images/1ins_cus.gif
Binary file added llm/phi3/cpu/_images/1ins_log.gif
Binary file added llm/phi3/cpu/_images/1ins_phy.gif
Binary file added llm/phi3/cpu/_images/1ins_soc.gif
Binary file added llm/phi3/cpu/_images/GenAI-bf16.gif
Binary file added llm/phi3/cpu/_images/GenAI-int8.gif
Binary file added llm/phi3/cpu/_images/autotp_bf16_llama.gif
Binary file added llm/phi3/cpu/_images/autotp_woq_int8_llama.gif
Binary file added llm/phi3/cpu/_images/bf16_llama.gif
Binary file added llm/phi3/cpu/_images/figure1_memory_layout.png
Binary file added llm/phi3/cpu/_images/figure2_dispatch.png
Binary file added llm/phi3/cpu/_images/figure3_strided_layout.png
Binary file added llm/phi3/cpu/_images/hypertune.png
Binary file added llm/phi3/cpu/_images/int8_pattern.png
Binary file added llm/phi3/cpu/_images/kmp_affinity.jpg
Binary file added llm/phi3/cpu/_images/llm_iakv_1.png
Binary file added llm/phi3/cpu/_images/llm_iakv_2.png
Binary file added llm/phi3/cpu/_images/m7i_m6i_comp_gptj6b.png
Binary file added llm/phi3/cpu/_images/m7i_m6i_comp_llama13b.png
Binary file added llm/phi3/cpu/_images/m7i_m6i_comp_llama7b.png
Binary file added llm/phi3/cpu/_images/nins_cus.gif
Binary file added llm/phi3/cpu/_images/nins_lat.gif
Binary file added llm/phi3/cpu/_images/nins_thr.gif
Binary file added llm/phi3/cpu/_images/smoothquant_int8_llama.gif
Binary file added llm/phi3/cpu/_images/split_sgd.png
Binary file added llm/phi3/cpu/_images/two_socket_config.png
Binary file added llm/phi3/cpu/_images/woq_int4_gptj.gif
Binary file added llm/phi3/cpu/_images/woq_int8_llama.gif
3 changes: 3 additions & 0 deletions llm/phi3/cpu/_sources/design_doc/cpu/isa_dyndisp.md.txt
@@ -0,0 +1,3 @@
# Intel® Extension for PyTorch\* CPU ISA Dynamic Dispatch Design Doc

The design document is redirected to [this link](../../tutorials/features/isa_dynamic_dispatch.md) now.
150 changes: 150 additions & 0 deletions llm/phi3/cpu/_sources/index.md.txt
@@ -0,0 +1,150 @@
# Intel® Extension for PyTorch\* Large Language Model (LLM) Feature Get Started For Phi 3 models

Intel® Extension for PyTorch* provides dedicated optimizations for running Phi 3 models faster, including techniques such as paged attention and ROPE fusion. A set of data types is supported for various scenarios, including BF16 and weight-only quantization.

# 1. Environment Setup

Several environment setup methodologies are provided. You can choose any of them according to your usage scenario. The Docker-based ones are recommended.

## 1.1 [RECOMMENDED] Docker-based environment setup with pre-built wheels

```bash
# Get the Intel® Extension for PyTorch* source code
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout 2.3-phi-3
git submodule sync
git submodule update --init --recursive

# Build an image with the provided Dockerfile by installing from Intel® Extension for PyTorch* prebuilt wheel files
DOCKER_BUILDKIT=1 docker build -f examples/cpu/inference/python/llm/Dockerfile -t ipex-llm:phi3 .

# Run the container with command below
docker run --rm -it --privileged ipex-llm:phi3 bash

# When the command prompt shows inside the docker container, enter llm examples directory
cd llm

# Activate environment variables
source ./tools/env_activate.sh
```

## 1.2 Conda-based environment setup with pre-built wheels

```bash
# Get the Intel® Extension for PyTorch* source code
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout 2.3-phi-3
git submodule sync
git submodule update --init --recursive

# Create a conda environment (pre-built wheel only available with python=3.10)
conda create -n llm python=3.10 -y
conda activate llm

# Setup the environment with the provided script
# A sample "prompt.json" file for benchmarking is also downloaded
cd examples/cpu/inference/python/llm
bash ./tools/env_setup.sh 7

# Activate environment variables
source ./tools/env_activate.sh
```
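
For either setup method, a quick sanity check can confirm that the wheels are importable in the activated environment (a minimal sketch; the printed version strings will depend on the wheels you installed):

```bash
# Verify that PyTorch and Intel® Extension for PyTorch* import correctly
python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__, ipex.__version__)"
```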
<br>

# 2. How To Run Phi 3 with ipex.llm

**ipex.llm provides a single script to facilitate running generation tasks as below:**

```
# if you are using a docker container built from commands above in Sec. 1.1, the placeholder LLM_DIR below is /home/ubuntu/llm
# if you are using a conda env created with commands above in Sec. 1.2, the placeholder LLM_DIR below is intel-extension-for-pytorch/examples/cpu/inference/python/llm
cd <LLM_DIR>
python run.py --help # for more detailed usages
```

| Key args of run.py | Notes |
|---|---|
| model id | use "--model-name-or-path" or "-m" to specify the <PHI3_MODEL_ID_OR_LOCAL_PATH>, which is either a model id from Huggingface or a downloaded local path |
| generation | default: beam search (beam size = 4), "--greedy" for greedy search |
| input tokens | use "--input-tokens" to provide a fixed input prompt size, with <INPUT_LENGTH> in [1024, 2048, 4096, 8192, 130944]; if "--input-tokens" is not used, use "--prompt" to choose other strings as inputs |
| output tokens | default: 32, use "--max-new-tokens" to choose any other size |
| batch size | default: 1, use "--batch-size" to choose any other size |
| token latency | enable "--token-latency" to print out the first or next token latency |
| generation iterations | use "--num-iter" and "--num-warmup" to control the repeated iterations of generation, default: 100-iter/10-warmup |
| streaming mode output | greedy search only (work with "--greedy"), use "--streaming" to enable the streaming generation output |

*Note:* You may need to log in to your HuggingFace account to access the model files. Please refer to [HuggingFace login](https://huggingface.co/docs/huggingface_hub/quick-start#login).
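
As an illustration of how these arguments combine, a single-instance greedy BF16 run might look like the sketch below. The model id, input length, and generation settings here are illustrative placeholders, not requirements:

```bash
# A minimal sketch, assuming the microsoft/Phi-3-mini-4k-instruct model id from Huggingface;
# adjust the model id, input length, and generation arguments to your scenario
python run.py --benchmark -m microsoft/Phi-3-mini-4k-instruct --dtype bfloat16 --ipex --greedy --input-tokens 1024 --max-new-tokens 32 --token-latency
```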

## 2.1 Usage of running Phi 3 models

The _\<PHI3_MODEL_ID_OR_LOCAL_PATH\>_ in the commands below specifies the Phi 3 model you will run, which can be found on [HuggingFace Models](https://huggingface.co/models).

### 2.1.1 Run generation with multiple instances on multiple CPU numa nodes

#### 2.1.1.1 Prepare:

```bash
unset KMP_AFFINITY
```

In the DeepSpeed cases below, we recommend "--shard-model" to shard the model weights more evenly for better memory usage when running with DeepSpeed.

If "--shard-model" is used, a copy of the sharded model weights files will be saved in the "--output-dir" path (the default path is "./saved_results" if not provided).
If you have already used "--shard-model" and generated such a sharded model path (or your model weights files are already well sharded), in further repeated benchmarks please remove "--shard-model" and replace "-m <PHI3_MODEL_ID_OR_LOCAL_PATH>" with "-m <shard model path>" to skip the repeated sharding steps, as sketched in the example below.

In addition, standalone model sharding scripts are provided in section 2.1.1.4, in case you would like to generate the sharded model weights files in advance of running distributed inference.
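
For example, once a first run with "--shard-model" has saved the sharded weights, a repeated BF16 benchmark can skip the sharding step (a sketch; "<shard model path>" is whatever path "--output-dir" produced):

```bash
# Repeated run: point "-m" at the previously saved sharded model and drop "--shard-model"
deepspeed --bind_cores_to_rank run.py --benchmark -m <shard model path> --dtype bfloat16 --ipex --greedy --input-tokens <INPUT_LENGTH> --autotp
```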

#### 2.1.1.2 BF16:

- Command:
```bash
deepspeed --bind_cores_to_rank run.py --benchmark -m <PHI3_MODEL_ID_OR_LOCAL_PATH> --dtype bfloat16 --ipex --greedy --input-tokens <INPUT_LENGTH> --autotp --shard-model
```

#### 2.1.1.3 Weight-only quantization (INT8):

By default, for weight-only quantization, we use quantization with [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html) inference ("--quant-with-amp") to get peak performance and fair accuracy.
For weight-only quantization with DeepSpeed, we quantize the model and then run the benchmark. The quantized model won't be saved.

- Command:
```bash
deepspeed --bind_cores_to_rank run.py --benchmark -m <PHI3_MODEL_ID_OR_LOCAL_PATH> --ipex --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --greedy --input-tokens <INPUT_LENGTH> --autotp --shard-model --output-dir "saved_results"
```

#### 2.1.1.4 How to Shard Model weight files for Distributed Inference with DeepSpeed

To reduce memory usage, we can shard the model weights files under a local path before launching distributed tests with DeepSpeed.

```
cd ./utils
# general command:
python create_shard_model.py -m <PHI3_MODEL_ID_OR_LOCAL_PATH> --save-path ./local_phi3_model_shard
# After sharding the model, use "-m ./local_phi3_model_shard" in later tests
```

### 2.1.2 Run generation with single instance on a single numa node
#### 2.1.2.1 BF16:

- Command:
```bash
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <PHI3_MODEL_ID_OR_LOCAL_PATH> --dtype bfloat16 --ipex --greedy --input-tokens <INPUT_LENGTH>
```

#### 2.1.2.2 Weight-only quantization (INT8):

By default, for weight-only quantization, we use quantization with [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html) inference ("--quant-with-amp") to get peak performance and fair accuracy.

- Command:
```bash
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <PHI3_MODEL_ID_OR_LOCAL_PATH> --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --output-dir "saved_results" --greedy --input-tokens <INPUT_LENGTH>
```

#### 2.1.2.3 Notes:

(1) [_numactl_](https://linux.die.net/man/8/numactl) is used to specify the memory and cores of your hardware to get better performance. _\<node N\>_ specifies the [numa](https://en.wikipedia.org/wiki/Non-uniform_memory_access) node id (e.g., 0 to use the memory from the first numa node). _\<physical cores list\>_ specifies the physical cores you are using from the _\<node N\>_ numa node. You can use the [_lscpu_](https://man7.org/linux/man-pages/man1/lscpu.1.html) command in Linux to check the numa node information.
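
For instance, if _lscpu_ reports that numa node 0 contains physical cores 0-55, the single-instance BF16 command could be pinned as in the sketch below (the core count and node id are assumptions about your machine; take them from your own _lscpu_ output):

```bash
# Inspect the numa topology
lscpu | grep -i "numa node"

# Example pinning: 56 physical cores on numa node 0
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m <PHI3_MODEL_ID_OR_LOCAL_PATH> --dtype bfloat16 --ipex --greedy --input-tokens <INPUT_LENGTH>
```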

(2) For all quantization benchmarks, both the quantization and inference stages are triggered by default. The quantization stage auto-generates a quantized model named "best_model.pt" in the "--output-dir" path, and the inference stage launches inference with that quantized model. For inference-only benchmarks (to avoid repeating the quantization stage), you can reuse these quantized models by adding "--quantized-model-path <output_dir + "best_model.pt">".
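
For example, if a previous weight-only INT8 run used --output-dir "saved_results", an inference-only rerun might look like the sketch below (the path follows the "best_model.pt" naming described above; whether every quantization flag must be repeated may depend on your setup):

```bash
# Inference-only rerun that reuses the previously quantized model and skips the quantization stage
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <PHI3_MODEL_ID_OR_LOCAL_PATH> --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --greedy --input-tokens <INPUT_LENGTH> --quantized-model-path "./saved_results/best_model.pt"
```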

## Miscellaneous Tips
Intel® Extension for PyTorch* also provides dedicated optimizations for many other Large Language Models (LLMs), covering a set of data types supported for various scenarios. For more details, please check this [Intel® Extension for PyTorch* doc](https://github.com/intel/intel-extension-for-pytorch/blob/release/2.3/README.md).
100 changes: 100 additions & 0 deletions llm/phi3/cpu/_sources/index.rst.txt
@@ -0,0 +1,100 @@
.. meta::
:description: This website introduces Intel® Extension for PyTorch*
:keywords: Intel optimization, PyTorch, Intel® Extension for PyTorch*, GPU, discrete GPU, Intel discrete GPU

Intel® Extension for PyTorch*
#############################

Intel® Extension for PyTorch* extends PyTorch* with the latest performance optimizations for Intel hardware.
Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel X\ :sup:`e`\ Matrix Extensions (XMX) AI engines on Intel discrete GPUs.
Moreover, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* ``xpu`` device.

In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Starting from 2.1.0, specific optimizations for certain
LLMs are introduced in the Intel® Extension for PyTorch*. For more information on LLM optimizations, refer to the `Large Language Models (LLM) <tutorials/llm.html>`_ section.

The extension can be loaded as a Python module for Python programs or linked as a C++ library for C++ programs. In Python scripts, users can enable it dynamically by importing ``intel_extension_for_pytorch``.

.. note::

- GPU features are not included in CPU-only packages.
- Optimizations for CPU-only may have a newer code base due to different development schedules.

Intel® Extension for PyTorch* has been released as an open-source project at `Github <https://github.com/intel/intel-extension-for-pytorch>`_. You can find the source code and instructions on how to get started at:

- **CPU**: `CPU main branch <https://github.com/intel/intel-extension-for-pytorch/tree/main>`_ | `Quick Start <https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/getting_started>`_
- **XPU**: `XPU main branch <https://github.com/intel/intel-extension-for-pytorch/tree/xpu-main>`_ | `Quick Start <https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/getting_started>`_

You can find more information about the product at:

- `Features <https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/features>`_
- `Performance <https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/performance>`_

Architecture
------------

Intel® Extension for PyTorch* is structured as shown in the following figure:

.. figure:: ../images/intel_extension_for_pytorch_structure.png
:width: 800
:align: center
:alt: Architecture of Intel® Extension for PyTorch*

Architecture of Intel® Extension for PyTorch*

- **Eager Mode**: In the eager mode, the PyTorch frontend is extended with custom Python modules (such as fusion modules), optimal optimizers, and INT8 quantization APIs. Further performance improvement is achieved by converting eager-mode models into graph mode using extended graph fusion passes.
- **Graph Mode**: In the graph mode, fusions reduce operator/kernel invocation overhead, resulting in improved performance. Compared to the eager mode, the graph mode in PyTorch* normally yields better performance from optimization techniques like operation fusion. Intel® Extension for PyTorch* amplifies them with more comprehensive graph optimizations. Both PyTorch ``Torchscript`` and ``TorchDynamo`` graph modes are supported. With ``Torchscript``, we recommend using ``torch.jit.trace()`` as your preferred option, as it generally supports a wider range of workloads compared to ``torch.jit.script()``. With ``TorchDynamo``, the ``ipex`` backend is available to provide good performance.
- **CPU Optimization**: On CPU, Intel® Extension for PyTorch* automatically dispatches operators to underlying kernels based on detected instruction set architecture (ISA). The extension leverages vectorization and matrix acceleration units available on Intel hardware. The runtime extension offers finer-grained thread runtime control and weight sharing for increased efficiency.
- **GPU Optimization**: On GPU, optimized operators and kernels are implemented and registered through the PyTorch dispatching mechanism. These operators and kernels are accelerated by the native vectorization and matrix calculation features of Intel GPU hardware. Intel® Extension for PyTorch* for GPU utilizes the `DPC++ <https://github.com/intel/llvm#oneapi-dpc-compiler>`_ compiler that supports the latest `SYCL* <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html>`_ standard and also a number of extensions to the SYCL* standard, which can be found in the `sycl/doc/extensions <https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions>`_ directory.


Support
-------
The team tracks bugs and enhancement requests using `GitHub issues <https://github.com/intel/intel-extension-for-pytorch/issues/>`_. Before submitting a suggestion or bug report, search the existing GitHub issues to see if your issue has already been reported.

.. toctree::
:caption: ABOUT
:maxdepth: 3
:hidden:

tutorials/introduction
tutorials/features
Large Language Models (LLM)<tutorials/llm>
tutorials/performance
tutorials/releases
tutorials/known_issues
tutorials/blogs_publications
tutorials/license

.. toctree::
:maxdepth: 3
:caption: GET STARTED
:hidden:

tutorials/installation
tutorials/getting_started
tutorials/examples
tutorials/cheat_sheet

.. toctree::
:maxdepth: 3
:caption: DEVELOPER REFERENCE
:hidden:

tutorials/api_doc

.. toctree::
:maxdepth: 3
:caption: PERFORMANCE TUNING
:hidden:

tutorials/performance_tuning/tuning_guide
tutorials/performance_tuning/launch_script
tutorials/performance_tuning/torchserve

.. toctree::
:maxdepth: 3
:caption: CONTRIBUTING GUIDE
:hidden:

tutorials/contribution
