diff --git a/cpu/2.3.0+cpu/_sources/tutorials/llm.rst.txt b/cpu/2.3.0+cpu/_sources/tutorials/llm.rst.txt
index 4cb02e6a0..e9690b677 100644
--- a/cpu/2.3.0+cpu/_sources/tutorials/llm.rst.txt
+++ b/cpu/2.3.0+cpu/_sources/tutorials/llm.rst.txt
@@ -30,14 +30,14 @@ Verified for distributed inference mode via DeepSpeed
*Note*: The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp32/bf16). We are working in progress to better support the models in the tables with various data types. In addition, more models will be optimized in the future.
-Please check `LLM best known practice <../../examples/cpu/inference/python/llm>`_ for instructions to install/setup environment and example scripts.
+Please check `LLM best known practice `_ for instructions to install/setup environment and example scripts.
Module Level Optimization API for customized LLM (Prototype)
In the past year, LLM has been flourishing with many open-sourced models contributed to the community, while researchers are building their own LLMs from transformer blocks with variants in implementation details. To help LLM researchers and developers improve their productivity, Intel® Extension for PyTorch* provides module level optimizations for commonly used LLM modules and functionalities, which are operators or certain operator combinations in nature.
-Please check `LLM module level optimization practice <../../examples/cpu/inference/python/llm-modeling>`_ to better understand how to use `module level APIs `_ to optimize your LLM and achieve better performance.
+Please check `LLM module level optimization practice `_ to better understand how to use `module level APIs `_ to optimize your LLM and achieve better performance.
diff --git a/cpu/2.3.0+cpu/design_doc/cpu/isa_dyndisp.html b/cpu/2.3.0+cpu/design_doc/cpu/isa_dyndisp.html
index 42911773f..759ecfb0d 100644
--- a/cpu/2.3.0+cpu/design_doc/cpu/isa_dyndisp.html
+++ b/cpu/2.3.0+cpu/design_doc/cpu/isa_dyndisp.html
@@ -125,7 +125,7 @@ Intel® Extension for PyTorch* CPU ISA Dynamic Dispatch Design DocSphinx using a
provided by Read the Docs.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD),
diff --git a/cpu/2.3.0+cpu/genindex.html b/cpu/2.3.0+cpu/genindex.html
index e881b05a6..ef43fe926 100644
--- a/cpu/2.3.0+cpu/genindex.html
+++ b/cpu/2.3.0+cpu/genindex.html
@@ -375,7 +375,7 @@ V
Built with Sphinx using a
provided by Read the Docs.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD),
diff --git a/cpu/2.3.0+cpu/index.html b/cpu/2.3.0+cpu/index.html
index faeb6e121..b05e1f672 100644
--- a/cpu/2.3.0+cpu/index.html
+++ b/cpu/2.3.0+cpu/index.html
@@ -182,7 +182,7 @@ Support using a
provided by Read the Docs.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD),
diff --git a/cpu/2.3.0+cpu/py-modindex.html b/cpu/2.3.0+cpu/py-modindex.html
index 7aab41737..323572252 100644
--- a/cpu/2.3.0+cpu/py-modindex.html
+++ b/cpu/2.3.0+cpu/py-modindex.html
@@ -165,7 +165,7 @@ Python Module Index
Built with Sphinx using a
provided by Read the Docs.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD),
diff --git a/cpu/2.3.0+cpu/search.html b/cpu/2.3.0+cpu/search.html
index a1f97b24d..d94b3c725 100644
--- a/cpu/2.3.0+cpu/search.html
+++ b/cpu/2.3.0+cpu/search.html
@@ -133,7 +133,7 @@
Built with Sphinx using a
provided by Read the Docs.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD),
diff --git a/cpu/2.3.0+cpu/searchindex.js b/cpu/2.3.0+cpu/searchindex.js
index c8847b61b..e55de425d 100644
--- a/cpu/2.3.0+cpu/searchindex.js
+++ b/cpu/2.3.0+cpu/searchindex.js
@@ -1 +1 @@
-Search.setIndex({"docnames": ["design_doc/cpu/isa_dyndisp", "index", "tutorials/api_doc", "tutorials/blogs_publications", "tutorials/cheat_sheet", "tutorials/contribution", "tutorials/examples", "tutorials/features", "tutorials/features/amp", "tutorials/features/auto_channels_last", "tutorials/features/codeless_optimization", "tutorials/features/fast_bert", "tutorials/features/graph_capture", "tutorials/features/graph_optimization", "tutorials/features/hypertune", "tutorials/features/int8_overview", "tutorials/features/int8_recipe_tuning_api", "tutorials/features/isa_dynamic_dispatch", "tutorials/features/nhwc", "tutorials/features/optimizer_fusion", "tutorials/features/runtime_extension", "tutorials/features/split_sgd", "tutorials/features/sq_recipe_tuning_api", "tutorials/getting_started", "tutorials/installation", "tutorials/introduction", "tutorials/known_issues", "tutorials/license", "tutorials/llm", "tutorials/llm/llm_optimize", "tutorials/performance", "tutorials/performance_tuning/launch_script", "tutorials/performance_tuning/torchserve", "tutorials/performance_tuning/tuning_guide", "tutorials/releases"], "filenames": ["design_doc/cpu/isa_dyndisp.md", "index.rst", "tutorials/api_doc.rst", "tutorials/blogs_publications.md", "tutorials/cheat_sheet.md", "tutorials/contribution.md", "tutorials/examples.md", "tutorials/features.rst", "tutorials/features/amp.md", "tutorials/features/auto_channels_last.md", "tutorials/features/codeless_optimization.md", "tutorials/features/fast_bert.md", "tutorials/features/graph_capture.md", "tutorials/features/graph_optimization.md", "tutorials/features/hypertune.md", "tutorials/features/int8_overview.md", "tutorials/features/int8_recipe_tuning_api.md", "tutorials/features/isa_dynamic_dispatch.md", "tutorials/features/nhwc.md", "tutorials/features/optimizer_fusion.md", "tutorials/features/runtime_extension.md", "tutorials/features/split_sgd.rst", "tutorials/features/sq_recipe_tuning_api.md", "tutorials/getting_started.md", 
"tutorials/installation.md", "tutorials/introduction.rst", "tutorials/known_issues.md", "tutorials/license.md", "tutorials/llm.rst", "tutorials/llm/llm_optimize.md", "tutorials/performance.md", "tutorials/performance_tuning/launch_script.md", "tutorials/performance_tuning/torchserve.md", "tutorials/performance_tuning/tuning_guide.md", "tutorials/releases.md"], "titles": ["Intel\u00ae Extension for PyTorch* CPU ISA Dynamic Dispatch Design Doc", "Intel\u00ae Extension for PyTorch*", "API Documentation", "Blogs & Publications", "Cheat Sheet", "Contribution", "Examples", "Features", "Auto Mixed Precision (AMP)", "Auto Channels Last", "Codeless Optimization (Prototype)", "Fast BERT (Prototype)", "Graph Capture (Prototype)", "Graph Optimization", "HyperTune (Prototype)", "Intel\u00ae Extension for PyTorch* optimizations for quantization", "INT8 Recipe Tuning API (Prototype)", "ISA Dynamic Dispatching", "Channels Last", "Optimizer Fusion", "Runtime Extension", "Split SGD", "Smooth Quant Recipe Tuning API (Prototype)", "Quick Start", "Installation", "Introduction", "Troubleshooting", "License", "Large Language Models (LLM) Optimization Overview", "Transformers Optimization Frontend API", "Performance", "Launch Script Usage Guide", "TorchServe with Intel\u00ae Extension for PyTorch*", "Performance Tuning Guide", "Releases"], "terms": {"The": [0, 1, 2, 5, 6, 7, 8, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 28, 29, 30, 31, 32, 33, 34], "document": [0, 7, 17, 20, 29, 34], "i": [0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 21, 22, 23, 26, 27, 28, 29, 30, 32, 33, 34], "redirect": 0, "thi": [0, 2, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 26, 27, 28, 29, 30, 31, 34], "link": [0, 1, 6, 17, 34], "now": [0, 2, 7, 15, 18, 32, 33, 34], "intel optim": 1, "intel\u00ae extension for pytorch*": 1, "gpu": [1, 3, 18, 34], "discrete gpu": 1, "intel discrete gpu": 1, "extend": [1, 18, 25, 33, 34], "latest": [1, 2, 25, 28, 30, 34], "perform": [1, 2, 
3, 4, 6, 7, 8, 9, 10, 13, 14, 15, 16, 18, 19, 21, 25, 28, 29, 31], "optim": [1, 3, 4, 6, 8, 9, 11, 12, 14, 16, 18, 20, 21, 23, 25, 26, 31, 32, 33, 34], "hardwar": [1, 3, 17, 25, 28, 32, 34], "take": [1, 2, 7, 8, 10, 12, 13, 14, 18, 21, 25, 26, 30, 31, 33], "advantag": [1, 2, 7, 9, 12, 18, 21, 25, 30, 31, 33], "advanc": [1, 2, 6, 7, 16, 25, 28], "vector": [1, 2, 6, 17, 18, 25, 28], "512": [1, 6, 11, 16, 25, 28, 31], "avx": [1, 6, 17, 25, 28], "neural": [1, 3, 7, 16, 22, 25, 28, 33, 34], "network": [1, 3, 7, 8, 20, 25, 28, 33], "instruct": [1, 5, 6, 7, 8, 17, 21, 23, 24, 25, 28, 30, 33, 34], "vnni": [1, 15, 17, 25, 28], "matrix": [1, 6, 7, 25, 28], "amx": [1, 3, 6, 7, 17, 25, 28, 30], "cpu": [1, 3, 4, 5, 6, 7, 8, 10, 14, 15, 16, 19, 20, 23, 25, 26, 28, 30, 31, 32, 34], "well": [1, 2, 5, 6, 7, 11, 16, 20, 21, 24, 28, 32, 33, 34], "x": [1, 5, 6, 8, 10, 13, 15, 16, 17, 18, 20, 21, 23, 26, 34], "e": [1, 2, 6, 7, 8, 12, 16, 17, 18, 28, 31, 33, 34], "xmx": 1, "ai": [1, 2, 3, 7, 28], "engin": [1, 6, 18, 33], "discret": 1, "moreov": [1, 2, 28], "provid": [1, 2, 5, 6, 7, 8, 11, 12, 13, 14, 16, 20, 22, 24, 26, 28, 29, 31, 32, 33, 34], "easi": [1, 3, 21], "acceler": [1, 2, 3, 6, 7, 13, 28, 29, 30, 34], "through": [1, 2, 6, 7, 8, 12, 25, 28, 33, 34], "xpu": [1, 2, 3, 34], "devic": [1, 2, 15, 29, 31, 34], "In": [1, 2, 6, 7, 8, 12, 16, 17, 18, 19, 21, 23, 28, 31, 32, 33, 34], "current": [1, 2, 5, 7, 11, 13, 14, 15, 16, 17, 19, 20, 26, 28, 29, 34], "technolog": [1, 7, 28], "landscap": [1, 7, 28], "gener": [1, 5, 6, 7, 10, 12, 16, 17, 18, 21, 23, 28, 29, 30, 31, 32, 33, 34], "genai": [1, 7, 28], "workload": [1, 6, 7, 8, 10, 11, 12, 21, 26, 28, 29, 30, 31, 33, 34], "model": [1, 2, 3, 4, 8, 9, 10, 11, 12, 14, 16, 23, 24, 25, 26, 29, 30, 33, 34], "have": [1, 2, 5, 6, 7, 9, 14, 17, 18, 20, 21, 23, 26, 27, 28, 30, 31, 32, 33, 34], "gain": [1, 7, 26, 28, 34], "widespread": [1, 7, 28], "attent": [1, 2, 7, 28, 34], "popular": [1, 7, 22, 28, 30, 34], "larg": [1, 2, 19, 23, 24, 25, 26, 29, 
30, 33, 34], "languag": [1, 2, 23, 24, 25, 26, 29, 34], "llm": [1, 16, 22, 24, 25, 29, 34], "emerg": [1, 7, 28], "domin": [1, 7, 28], "drive": [1, 7, 28], "applic": [1, 2, 7, 20, 28, 32, 33], "start": [1, 3, 4, 5, 6, 7, 10, 20, 24, 34], "from": [1, 2, 3, 4, 5, 8, 10, 11, 13, 15, 16, 17, 18, 19, 20, 21, 23, 25, 28, 29, 31, 32, 33, 34], "2": [1, 2, 3, 8, 10, 16, 17, 18, 20, 21, 25, 26, 27, 28, 29, 30, 31, 33], "1": [1, 2, 3, 4, 6, 8, 10, 11, 12, 13, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 28, 29, 30, 31, 33], "0": [1, 2, 4, 5, 8, 10, 11, 13, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 30, 31, 32, 33], "specif": [1, 2, 5, 6, 7, 12, 18, 20, 26, 28, 31, 33, 34], "certain": [1, 7, 26, 28, 29, 31, 33], "ar": [1, 2, 3, 5, 6, 7, 8, 10, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 28, 29, 30, 31, 32, 33, 34], "introduc": [1, 3, 7, 15, 18, 21, 22, 31, 33, 34], "For": [1, 2, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 31, 32, 33, 34], "more": [1, 2, 5, 6, 7, 8, 10, 11, 13, 16, 17, 19, 20, 21, 23, 26, 28, 32, 33, 34], "inform": [1, 2, 6, 7, 14, 17, 18, 28, 31, 32, 33, 34], "refer": [1, 7, 9, 13, 14, 16, 17, 18, 20, 22, 23, 24, 25, 32, 34], "section": [1, 6, 7, 8, 14, 20, 23, 24, 25, 28, 29, 32, 33, 34], "can": [1, 2, 5, 6, 7, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 26, 28, 29, 30, 31, 32, 33, 34], "load": [1, 2, 6, 7, 13, 15, 16, 17, 23, 29, 32, 34], "python": [1, 2, 4, 10, 14, 17, 20, 26, 28, 29, 31, 32, 33, 34], "modul": [1, 6, 7, 8, 13, 16, 17, 26, 29, 31, 34], "program": [1, 5, 7, 11, 20, 31, 33, 34], "c": [1, 7, 8, 16, 17, 20, 26, 28, 31, 32, 33, 34], "librari": [1, 2, 5, 6, 7, 17, 20, 32, 33, 34], "script": [1, 2, 3, 4, 5, 6, 7, 8, 10, 14, 17, 20, 23, 24, 26, 28, 29, 30, 32, 33, 34], "user": [1, 2, 7, 9, 10, 12, 13, 15, 16, 18, 20, 26, 31, 32, 33, 34], "enabl": [1, 2, 3, 4, 6, 7, 8, 10, 13, 16, 18, 20, 22, 23, 26, 28, 31, 32, 33, 34], "dynam": [1, 4, 20, 28, 32, 33, 34], "import": [1, 2, 4, 5, 6, 7, 10, 11, 12, 13, 15, 16, 17, 
18, 20, 21, 23, 25, 26, 28, 29, 32, 33, 34], "intel_extension_for_pytorch": [1, 2, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15, 16, 17, 20, 23, 25, 29, 32, 34], "featur": [1, 2, 3, 5, 8, 10, 13, 14, 18, 20, 23, 25, 26, 28, 30, 31, 32, 33, 34], "includ": [1, 2, 5, 6, 7, 10, 14, 15, 17, 23, 26, 27, 28, 30, 34], "onli": [1, 2, 5, 7, 8, 10, 11, 13, 14, 15, 16, 17, 18, 20, 21, 26, 28, 31, 32, 34], "packag": [1, 2, 5, 6, 7, 10, 23, 25, 26, 32, 33, 34], "mai": [1, 2, 3, 5, 6, 7, 8, 9, 16, 17, 18, 20, 26, 28, 31, 32, 33, 34], "newer": [1, 28, 33], "code": [1, 2, 5, 6, 7, 10, 11, 12, 13, 18, 19, 21, 23, 24, 26, 27, 29, 33, 34], "base": [1, 2, 3, 4, 5, 6, 7, 10, 11, 17, 20, 21, 26, 28, 29, 30, 32, 33, 34], "due": [1, 8, 10, 17, 20, 26], "differ": [1, 2, 6, 7, 15, 16, 17, 18, 20, 28, 31, 32, 33, 34], "develop": [1, 3, 6, 28, 30, 33, 34], "schedul": [1, 2, 13, 20, 31, 33], "ha": [1, 2, 7, 10, 14, 17, 18, 20, 21, 26, 28, 30, 31, 33, 34], "been": [1, 6, 7, 10, 17, 18, 28, 31, 33, 34], "releas": [1, 17, 18, 26, 30, 33], "an": [1, 2, 5, 6, 7, 8, 10, 11, 13, 14, 16, 17, 18, 19, 20, 21, 26, 31, 32, 33, 34], "open": [1, 16, 28, 33], "sourc": [1, 5, 6, 17, 27, 28, 33, 34], "project": [1, 6], "github": [1, 2, 5, 6, 7, 8, 34], "you": [1, 2, 5, 6, 7, 8, 13, 14, 15, 17, 18, 20, 23, 25, 26, 28, 29, 31, 33, 34], "find": [1, 2, 6, 7, 14, 16, 23, 26, 30, 31, 34], "how": [1, 2, 6, 10, 15, 17, 18, 23, 28, 32, 33, 34], "get": [1, 2, 3, 4, 6, 7, 10, 11, 15, 17, 20, 21, 22, 26, 28, 29, 30, 31, 33, 34], "main": [1, 2, 5, 6, 14, 20, 31, 32], "branch": [1, 7, 30], "quick": [1, 20, 24, 25], "about": [1, 2, 5, 7, 13, 16, 32, 33, 34], "product": [1, 2, 7, 14, 28, 34], "structur": [1, 18, 31], "shown": [1, 6, 18, 28, 31, 32], "follow": [1, 2, 4, 5, 6, 7, 8, 11, 14, 15, 16, 17, 18, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34], "figur": [1, 2, 21, 28, 33], "eager": [1, 7, 12, 23, 32, 34], "mode": [1, 2, 5, 7, 10, 12, 18, 20, 23, 26, 32, 34], "frontend": [1, 2, 7, 20, 28, 34], "custom": [1, 2, 7, 26, 34], 
"fusion": [1, 2, 7, 10, 21, 28, 34], "int8": [1, 2, 3, 4, 17, 18, 20, 22, 28, 29, 34], "quantiz": [1, 3, 4, 13, 22, 26, 28, 30, 32, 34], "api": [1, 3, 6, 10, 11, 15, 20, 26, 33, 34], "further": [1, 2, 5, 6, 7, 18, 20, 28, 33, 34], "improv": [1, 3, 7, 8, 13, 20, 22, 28, 30, 32, 33], "achiev": [1, 2, 6, 7, 28, 33, 34], "convert": [1, 2, 4, 6, 7, 8, 9, 10, 13, 16, 17, 18, 20, 23, 26, 32, 34], "graph": [1, 4, 8, 10, 16, 23, 26, 31, 34], "us": [1, 2, 3, 4, 5, 6, 11, 14, 15, 17, 18, 19, 21, 23, 24, 25, 26, 27, 28, 32, 33, 34], "pass": [1, 2, 5, 10, 17, 20, 26, 32, 34], "reduc": [1, 2, 7, 15, 19, 20, 21, 22, 26, 28, 33, 34], "oper": [1, 2, 6, 8, 13, 15, 21, 32, 33, 34], "kernel": [1, 2, 7, 20, 26, 28, 30, 33, 34], "invoc": [1, 7], "overhead": [1, 2, 7, 10, 19, 20, 26, 28, 33, 34], "result": [1, 2, 6, 10, 12, 14, 16, 18, 20, 21, 30, 31, 32, 33], "compar": [1, 2, 7, 13, 18, 21, 26, 28, 30, 31, 33, 34], "normal": [1, 2, 6, 7, 13, 20, 28, 33, 34], "yield": [1, 7, 33], "better": [1, 2, 6, 7, 15, 18, 20, 28, 31, 32, 33, 34], "techniqu": [1, 2, 7, 11, 12, 28, 34], "like": [1, 2, 3, 5, 6, 7, 8, 14, 18, 19, 21, 26, 28, 31, 33, 34], "amplifi": 1, "them": [1, 5, 7, 18, 19, 28, 31, 33], "comprehens": [1, 34], "both": [1, 2, 6, 7, 16, 18, 19, 21, 28, 29, 31, 32, 33, 34], "torchscript": [1, 2, 5, 7, 10, 11, 12, 19, 23, 26, 32, 34], "torchdynamo": [1, 7, 12, 23, 34], "With": [1, 2, 7, 10, 20, 31, 34], "we": [1, 2, 5, 6, 7, 8, 9, 10, 14, 15, 16, 17, 18, 19, 20, 21, 23, 28, 30, 32, 33, 34], "recommend": [1, 5, 6, 7, 9, 10, 15, 16, 20, 23, 30, 31, 33, 34], "torch": [1, 2, 4, 6, 8, 10, 11, 12, 13, 15, 16, 18, 20, 23, 26, 29, 32, 33, 34], "jit": [1, 2, 5, 6, 7, 8, 13, 15, 16, 18, 20, 23, 26, 32, 34], "trace": [1, 6, 7, 8, 12, 13, 15, 16, 20, 23, 26, 32, 34], "your": [1, 5, 6, 7, 8, 10, 14, 15, 20, 23, 24, 26, 27, 28, 29, 34], "prefer": [1, 7, 8, 15, 24], "option": [1, 2, 5, 7, 10, 14, 15, 16, 29, 31, 34], "wider": 1, "rang": [1, 6, 7, 15, 16, 19, 21, 26, 31, 32, 34], "ipex": [1, 2, 3, 4, 6, 
7, 9, 11, 12, 13, 15, 16, 17, 19, 20, 23, 26, 29, 31, 32, 34], "backend": [1, 2, 3, 6, 7, 12, 13, 16, 17, 23, 26, 28, 31, 33, 34], "avail": [1, 2, 6, 7, 11, 17, 20, 22, 23, 29, 31, 33, 34], "good": [1, 2, 5, 7, 12, 18, 19, 28, 33, 34], "On": [1, 2, 7, 18, 28, 33], "automat": [1, 2, 6, 7, 9, 10, 12, 13, 15, 16, 18, 22, 28, 31, 32, 33, 34], "dispatch": [1, 34], "underli": [1, 17, 28], "detect": [1, 6, 12, 17, 26, 33, 34], "set": [1, 2, 4, 5, 6, 7, 8, 14, 15, 16, 17, 21, 24, 26, 28, 30, 31, 32, 33, 34], "isa": [1, 34], "leverag": [1, 7, 11, 28, 32, 34], "unit": [1, 2, 33], "runtim": [1, 8, 13, 17, 31, 33, 34], "offer": [1, 5, 33], "finer": [1, 7, 20], "grain": [1, 3, 7, 20], "thread": [1, 2, 7, 20, 26, 30, 31, 32, 33, 34], "control": [1, 2, 7, 20, 26, 31, 33, 34], "weight": [1, 2, 7, 10, 12, 13, 15, 16, 18, 20, 22, 23, 26, 28, 34], "share": [1, 5, 6, 16, 20, 32, 33, 34], "increas": [1, 2, 3, 21, 26, 28, 30, 33, 34], "effici": [1, 7, 11, 19, 20, 28, 31, 33, 34], "implement": [1, 5, 7, 11, 19, 26, 28, 33, 34], "regist": [1, 7, 10, 16, 17, 34], "mechan": [1, 7, 17, 21, 34], "These": [1, 5, 6, 7, 8, 13, 28], "nativ": [1, 6, 7, 8, 17, 19, 21, 26, 28, 34], "calcul": [1, 2, 8, 16, 21, 22], "util": [1, 6, 7, 10, 13, 15, 16, 18, 21, 28, 31, 33, 34], "dpc": 1, "compil": [1, 5, 6, 23, 26, 33, 34], "sycl": 1, "standard": [1, 34], "also": [1, 2, 6, 7, 10, 13, 14, 16, 18, 19, 28, 30, 31, 33, 34], "number": [1, 2, 5, 6, 7, 14, 16, 19, 20, 21, 26, 32, 34], "which": [1, 2, 5, 7, 8, 10, 14, 15, 16, 17, 18, 20, 26, 28, 30, 31, 32, 33, 34], "found": [1, 6, 7, 14, 16, 18, 29, 31, 32, 33, 34], "doc": [1, 2, 5, 11, 29, 34], "directori": [1, 5, 6, 14, 29, 31, 32], "team": [1, 5], "track": 1, "bug": [1, 5, 34], "enhanc": [1, 3, 28, 34], "request": [1, 5, 20, 32], "issu": [1, 2, 5, 8, 21, 26, 33], "befor": [1, 2, 5, 6, 13, 14, 17, 18, 20, 31, 33, 34], "submit": [1, 5, 7, 20], "suggest": [1, 2, 15, 18, 20, 33, 34], "report": [1, 17], "search": [1, 2, 4, 5, 7, 16, 22, 28, 31], "exist": [1, 5, 7, 
13, 26, 31, 33], "see": [1, 2, 5, 8, 14, 34], "alreadi": [1, 5, 6, 18, 28, 33], "pytorch": [2, 3, 4, 6, 7, 8, 9, 10, 13, 14, 16, 17, 20, 23, 25, 26, 27, 28, 29, 30, 31, 33, 34], "dtype": [2, 4, 6, 7, 8, 10, 11, 13, 15, 16, 17, 23, 26, 29, 31, 34], "none": [2, 6, 29, 31], "o1": [2, 26, 34], "inplac": [2, 4, 6, 13, 15, 18, 23, 32], "fals": [2, 4, 6, 7, 8, 13, 14, 15, 16, 17, 20, 22, 23, 26, 31, 32, 34], "conv_bn_fold": [2, 26, 34], "linear_bn_fold": 2, "weights_prepack": [2, 6, 7, 23, 26], "replace_dropout_with_ident": 2, "optimize_lstm": 2, "split_master_weight_for_bf16": 2, "fuse_update_step": 2, "auto_kernel_select": [2, 7, 30], "sample_input": [2, 9, 34], "graph_mod": [2, 4, 7, 12, 34], "concat_linear": 2, "appli": [2, 6, 7, 8, 12, 13, 16, 18, 19, 21, 23, 26, 28, 29, 31, 34], "given": [2, 6, 13, 14, 16, 28], "nn": [2, 6, 7, 8, 10, 13, 15, 16, 18, 20, 26, 34], "If": [2, 5, 6, 7, 8, 9, 10, 13, 14, 15, 16, 17, 20, 26, 31, 32, 33, 34], "train": [2, 3, 4, 7, 11, 13, 15, 16, 18, 21, 23, 26, 28, 29, 31, 34], "otherwis": [2, 7, 20], "infer": [2, 3, 4, 7, 10, 11, 12, 15, 18, 20, 21, 23, 26, 30, 33, 34], "conv": [2, 8, 10, 13, 15, 20, 26, 34], "bn": [2, 10, 15, 26, 34], "fold": [2, 10, 15, 16, 26, 34], "prepack": [2, 6, 10, 18, 26, 28, 34], "so": [2, 5, 6, 7, 8, 15, 17, 18, 20, 30, 31, 32, 33, 34], "onednn": [2, 3, 13, 17, 26, 28, 34], "order": [2, 17, 18, 21, 31, 33, 34], "cach": [2, 5, 7, 19, 20, 30, 34], "reus": [2, 33], "memori": [2, 6, 7, 8, 9, 10, 13, 19, 20, 21, 26, 28, 30, 32, 34], "layout": [2, 26, 34], "call": [2, 6, 8, 13, 17, 18, 21, 26, 32, 33, 34], "block": [2, 5, 16, 20, 22, 28, 33, 34], "although": [2, 33], "itself": [2, 5, 18], "enough": [2, 7, 19], "usag": [2, 6, 7, 8, 23, 25, 32, 33, 34], "perspect": [2, 13, 18, 21, 28, 31, 33], "drawback": [2, 21], "run": [2, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 26, 30, 31, 32, 33, 34], "split": [2, 6, 7, 16, 17, 19, 20, 26, 34], "one": [2, 5, 7, 12, 13, 14, 16, 18, 19, 20, 26, 29, 31, 33, 34], "sever": [2, 7, 10, 19, 30, 
31, 34], "dimens": [2, 18, 26], "data": [2, 4, 6, 7, 8, 9, 10, 11, 12, 13, 16, 17, 18, 19, 20, 21, 23, 26, 31, 32, 34], "fix": [2, 5, 7, 34], "size": [2, 6, 7, 11, 15, 16, 17, 18, 23, 26, 28, 30, 32, 33, 34], "each": [2, 8, 14, 16, 17, 19, 20, 21, 31, 32, 33, 34], "time": [2, 5, 7, 14, 16, 17, 18, 19, 26, 28, 30, 33, 34], "execut": [2, 4, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 19, 20, 26, 31, 32, 33, 34], "detail": [2, 5, 6, 7, 8, 9, 11, 13, 17, 18, 24, 25, 26, 28, 30, 32, 33, 34], "mermori": 2, "format": [2, 5, 6, 7, 9, 14, 22, 26, 28, 31, 33, 34], "manual": [2, 7, 10, 14, 18, 20, 34], "To": [2, 5, 6, 7, 10, 13, 15, 16, 17, 18, 20, 21, 23, 28, 32, 33, 34], "predefin": 2, "shape": [2, 6, 7, 16, 20, 23, 30, 33, 34], "prior": [2, 23], "match": [2, 8, 17, 31], "requir": [2, 5, 6, 8, 10, 16, 18, 21, 26, 28, 29, 31, 32, 34], "won": [2, 7, 8, 17, 26], "t": [2, 5, 7, 8, 14, 15, 16, 17, 18, 20, 26, 32, 34], "convers": [2, 8, 13, 34], "directli": [2, 6, 33, 34], "go": [2, 5, 8], "methodologi": [2, 6, 7, 19, 33], "possibl": [2, 14, 15, 19, 28, 33, 34], "avoid": [2, 10, 20, 21, 26, 31, 32, 33, 34], "thu": [2, 7, 8, 10, 18, 20, 21, 28, 31, 32, 33], "paramet": [2, 6, 7, 8, 10, 16, 17, 19, 20, 21, 26, 28, 29, 30, 31, 33, 34], "work": [2, 5, 6, 7, 14, 15, 17, 20, 26, 28, 29, 31, 33, 34], "bfloat16": [2, 3, 4, 7, 10, 11, 17, 18, 23, 29, 31, 34], "half": [2, 7, 17, 21], "k": [2, 5], "float16": [2, 8], "cast": [2, 8, 21, 28], "accord": [2, 13, 28, 33, 34], "default": [2, 4, 6, 7, 10, 12, 13, 15, 16, 17, 20, 22, 23, 26, 28, 30, 32, 33, 34], "valu": [2, 6, 10, 14, 16, 17, 19, 20, 21, 22, 26, 28, 31, 32, 33, 34], "mean": [2, 16, 17, 18, 20, 22, 28, 34], "do": [2, 5, 8, 16, 18, 20, 21, 26, 28, 30, 31, 32, 33, 34], "noth": 2, "note": [2, 3, 5, 6, 15, 16, 17, 18, 20, 22, 24, 28, 30, 31, 32, 33], "type": [2, 4, 5, 6, 7, 10, 16, 17, 18, 20, 21, 23, 30, 31, 32, 34], "conv2d": [2, 7, 8, 10, 13, 18, 20, 26, 34], "linear": [2, 6, 7, 8, 13, 15, 16, 18, 26, 33, 34], "convtranspose2d": [2, 13], 
"case": [2, 6, 7, 9, 12, 16, 17, 18, 28, 31, 33, 34], "addit": [2, 6, 7, 17, 21, 28, 34], "embed": [2, 7, 28, 34], "lstm": [2, 10, 15, 34], "sgd": [2, 6, 7, 8, 16, 19], "string": [2, 31], "o0": [2, 26, 34], "No": [2, 18, 34], "function": [2, 5, 6, 7, 8, 10, 11, 12, 14, 15, 17, 20, 21, 23, 26, 28, 29, 31, 33, 34], "just": [2, 14, 29, 34], "return": [2, 6, 7, 8, 10, 16, 17, 20, 26, 34], "origin": [2, 6, 7, 12, 13, 15, 17, 20, 29, 34], "dropout": [2, 10], "remov": [2, 5, 21, 34], "inferenc": 2, "master": [2, 7, 21, 31], "fuse": [2, 7, 13, 16, 19, 28, 34], "updat": [2, 5, 7, 16, 19, 21, 22, 34], "step": [2, 5, 6, 7, 8, 14, 16, 19, 21, 32], "overridden": [2, 17], "explicitli": [2, 8, 16, 20, 26, 31, 34], "bool": [2, 14], "whether": [2, 6, 8, 16, 18, 22, 23, 33], "conv_bn": 2, "It": [2, 6, 7, 8, 10, 13, 17, 18, 20, 21, 23, 26, 29, 31, 33, 34], "knob": [2, 4, 12, 31], "overwrit": [2, 31], "configur": [2, 4, 6, 7, 14, 15, 16, 17, 31, 32, 34], "linear_bn": 2, "convolut": [2, 6, 7, 13, 20, 33, 34], "reorder": [2, 18, 28], "doesn": [2, 15, 16, 18, 26, 34], "support": [2, 5, 6, 7, 13, 15, 16, 17, 18, 19, 20, 21, 25, 26, 28, 29, 31, 32, 33, 34], "replac": [2, 5, 7, 10, 26, 34], "ident": [2, 10, 18], "aten": [2, 6, 7, 34], "opportunit": 2, "bf16": [2, 3, 7, 17, 19, 21, 23, 26, 28, 30, 34], "save": [2, 5, 6, 7, 13, 14, 15, 16, 18, 21, 28, 32, 34], "solut": [2, 7, 26, 28, 34], "all": [2, 5, 6, 8, 13, 14, 17, 19, 20, 28, 29, 32, 33, 34], "param": [2, 19, 31], "tupl": [2, 6, 17, 20], "tensor": [2, 6, 7, 8, 11, 15, 16, 17, 20, 26, 28, 32, 34], "feed": [2, 9, 18], "sampl": [2, 6, 9, 14, 16, 17, 29, 33], "input": [2, 6, 7, 9, 10, 13, 15, 16, 17, 18, 22, 23, 26, 29, 30, 32, 33, 34], "impact": [2, 7, 20], "pack": [2, 20, 34], "intel": [2, 3, 4, 7, 8, 9, 10, 11, 13, 14, 16, 17, 20, 21, 22, 23, 25, 26, 27, 28, 29, 34], "extens": [2, 3, 4, 6, 9, 10, 13, 14, 16, 17, 23, 24, 25, 27, 28, 29, 30, 31, 33, 34], "per": [2, 10, 15, 16, 20, 30, 31, 32, 33, 34], "some": [2, 5, 7, 8, 13, 16, 17, 18, 
20, 26, 28, 31, 32, 33, 34], "heurist": [2, 20, 34], "real": [2, 7, 14, 15, 30, 34], "best": [2, 6, 7, 8, 14, 16, 17, 22, 24, 28, 33, 34], "try": [2, 5, 6, 7, 12, 14, 16, 26, 31, 33, 34], "select": [2, 5, 7, 13, 24, 34], "true": [2, 4, 6, 10, 12, 13, 14, 15, 16, 17, 22, 23, 31, 32, 33, 34], "might": [2, 7, 18, 26, 33, 34], "cost": [2, 6, 28, 30, 33], "extra": [2, 5, 10, 20, 31, 32], "combin": [2, 12, 14, 28, 31, 34], "method": [2, 8, 15, 16, 18, 22, 26, 33, 34], "multipl": [2, 5, 7, 8, 16, 17, 18, 26, 28, 30, 32, 33, 34], "subgraph": 2, "modifi": [2, 5, 6], "other": [2, 6, 7, 8, 14, 17, 18, 19, 23, 28, 31, 33], "place": [2, 8, 28, 33, 34], "scenario": [2, 6, 7, 18, 33, 34], "convolutuon": 2, "counterpart": [2, 7, 18, 34], "pleas": [2, 6, 7, 11, 16, 22, 26, 28, 31, 33, 34], "invok": [2, 6, 8, 10, 13, 20, 23, 26, 29, 34], "ddp": [2, 6], "distribut": [2, 3, 7, 16, 31, 32, 33], "deepcopi": 2, "rather": [2, 18], "than": [2, 5, 7, 17, 18, 20, 21, 26, 33, 34], "allreduc": 2, "caus": [2, 7, 21, 26, 28, 31, 33, 34], "unpredict": 2, "accuraci": [2, 3, 6, 7, 8, 15, 16, 21, 22, 26, 28, 34], "loss": [2, 5, 6, 8, 16, 18, 21, 26], "exampl": [2, 5, 7, 8, 13, 18, 19, 21, 22, 23, 24, 25, 28, 29, 32, 33, 34], "load_state_dict": [2, 34], "path": [2, 6, 7, 14, 18, 20, 23, 31, 33, 34], "eval": [2, 4, 6, 8, 10, 11, 12, 13, 15, 16, 20, 23, 26, 29, 32, 34], "optimized_model": [2, 34], "evalu": [2, 16, 34], "optimized_optim": 2, "altern": [2, 6, 18], "motiv": [2, 20], "ad": [2, 7, 10, 33, 34], "alia": 2, "unifi": [2, 31], "style": [2, 5], "modular": 2, "float32": [2, 13, 21, 23, 26, 30, 31, 34], "quantization_config": [2, 6, 29], "qconfig_summary_fil": [2, 6, 29], "low_precision_checkpoint": [2, 6, 29], "deployment_mod": [2, 6, 23], "transform": [2, 3, 4, 6, 10, 11, 13, 16, 18, 22, 23, 28, 32, 33, 34], "focu": [2, 10, 18, 29, 34], "especi": [2, 5, 28, 34], "task": [2, 7, 28, 31, 33, 34], "famili": [2, 28, 33], "full": [2, 5, 18, 32, 33, 34], "llama": [2, 3, 6, 28], "gpt": [2, 28, 30], "j": 
[2, 5, 17, 28, 30], "neox": [2, 28], "opt": [2, 6, 17, 28], "falcon": [2, 28], "bloom": [2, 28], "codegen": [2, 28, 34], "baichuan": [2, 28, 34], "chatglm": [2, 28], "gptbigcod": [2, 28], "t5": [2, 26, 28, 34], "mistral": [2, 28, 34], "mpt": [2, 28, 34], "mixtral": [2, 28], "stablelm": [2, 28], "qwen": [2, 28], "git": [2, 5, 28], "llava": [2, 28], "yuan": [2, 28], "phi": [2, 28], "scope": [2, 7, 8, 21, 34], "abov": [2, 5, 10, 19, 28, 30, 31, 32], "transpar": [2, 7, 29, 33, 34], "benifit": 2, "float": [2, 6, 7, 8, 14, 15, 16, 17, 21, 29, 34], "when": [2, 5, 6, 7, 8, 9, 14, 18, 19, 20, 21, 22, 25, 26, 28, 30, 31, 32, 33, 34], "mix": [2, 6, 13, 23, 26, 28, 34], "str": [2, 6, 14, 23, 31], "specifi": [2, 5, 6, 14, 20, 31, 33, 34], "either": [2, 26, 31], "object": [2, 6, 7, 14, 17, 20, 33, 34], "defin": [2, 5, 6, 7, 8, 10, 16, 17, 18, 22, 32], "recip": [2, 4, 7, 13, 15, 26, 28, 34], "quant": [2, 16], "static": [2, 4, 16, 26, 28, 31, 32, 33, 34], "onc": [2, 5, 6, 14, 17, 18, 20, 21, 32, 33], "quantizat": 2, "config": [2, 6, 11, 23, 31, 32], "json": [2, 6, 15, 16, 32, 34], "file": [2, 4, 5, 6, 8, 14, 15, 16, 17, 18, 31, 34], "under": [2, 6, 8, 18, 20, 27, 31, 34], "need": [2, 5, 6, 7, 10, 13, 14, 16, 17, 18, 19, 20, 21, 23, 26, 29, 31, 32, 33, 34], "calibr": [2, 13, 22, 26, 29, 30, 32, 34], "dict": [2, 6, 23], "int4": [2, 28, 29, 34], "": [2, 3, 5, 8, 10, 14, 15, 18, 19, 20, 21, 22, 26, 31, 32, 33], "should": [2, 5, 8, 15, 20, 28, 31, 33], "state_dict": [2, 6], "checkpoint": [2, 6, 29], "pt": [2, 6, 13, 14, 15, 23, 32, 34], "gptq": [2, 6, 34], "etc": [2, 5, 6, 17, 34], "where": [2, 5, 7, 16, 21, 33], "kei": [2, 7, 28, 34], "scale": [2, 3, 6, 15, 28], "zero": [2, 6, 15, 34], "point": [2, 6, 8, 15, 21, 33, 34], "bia": [2, 8, 20, 34], "weight_kei": 2, "packed_weight": 2, "scale_kei": 2, "zero_point_kei": 2, "packed_zp": 2, "bias_kei": 2, "chang": [2, 5, 6, 7, 8, 10, 11, 12, 15, 17, 18, 20, 23, 25, 26, 29, 31], "make": [2, 5, 6, 7, 14, 15, 17, 21, 23, 28, 32, 33], "n": [2, 6, 
7, 16, 18, 19, 20, 26, 32, 33, 34], "thei": [2, 7, 8, 31, 33], "uint4": 2, "compress": 2, "along": [2, 5, 6, 21, 33, 34], "store": [2, 17, 18, 19, 21, 28, 31, 32, 33, 34], "int32": 2, "state": [2, 15, 19, 28], "automaticlli": 2, "deploy": [2, 7, 13, 34], "torchscirpt": 2, "workabl": 2, "forward": [2, 6, 8, 13, 16, 20, 21, 26, 32, 33, 34], "after": [2, 5, 7, 13, 20, 21, 23, 24, 32, 33, 34], "deepspe": [2, 34], "parallel": [2, 5, 6, 7, 28, 33, 34], "class": [2, 5, 6, 7, 8, 10, 16, 20, 26, 34], "verbos": [2, 4, 31], "demand": [2, 7], "easier": [2, 18, 21], "debug": [2, 31], "dump": [2, 31], "messag": [2, 6, 10, 12, 18, 31], "contain": [2, 5, 6, 13, 17, 26, 31, 32, 33, 34], "durat": [2, 21], "while": [2, 7, 8, 11, 12, 18, 21, 26, 28, 32, 33, 34], "via": [2, 5, 6, 7, 18, 20, 30, 31, 33, 34], "environ": [2, 5, 6, 17, 20, 24, 28, 30, 31, 32, 33], "variabl": [2, 5, 17, 30, 31, 32, 33, 34], "name": [2, 5, 7, 14, 17, 25, 28, 31, 32, 33, 34], "dnnl_verbos": 2, "howev": [2, 5, 7, 8, 9, 16, 20, 26, 28, 31, 33, 34], "those": [2, 15, 33], "amount": [2, 16, 26, 28, 33], "investig": [2, 31], "singl": [2, 7, 13, 14, 16, 19, 20, 30, 32, 34], "iter": [2, 16, 21, 28, 34], "out": [2, 5, 6, 7, 8, 10, 13, 16, 19, 20, 30, 31, 33, 34], "second": [2, 10, 28, 32, 33], "verbose_on": 2, "verbose_off": 2, "disabl": [2, 6, 7, 13, 26, 31, 33, 34], "verbose_on_cr": 2, "creation": 2, "linearsilu": [2, 34], "silu": [2, 13], "http": [2, 5, 16, 34], "org": [2, 7, 16, 26, 34], "stabl": [2, 3, 8, 34], "html": [2, 5, 16], "output": [2, 6, 7, 8, 13, 14, 16, 18, 23, 26, 34], "same": [2, 5, 7, 10, 15, 16, 17, 18, 20, 21, 28, 31, 32, 33, 34], "init": [2, 5, 15, 34], "linear_modul": 2, "4096": [2, 33], "ipex_fus": 2, "randn": [2, 10, 13, 16, 18, 32, 34], "linearsilumul": [2, 34], "multipli": 2, "mul": [2, 13, 16], "linear2silumul": [2, 34], "linear_": 2, "linear_m": 2, "two": [2, 7, 14, 16, 20, 21, 28, 32, 33, 34], "linear_s_modul": 2, "linear_m_modul": 2, "linearrelu": [2, 34], "relu": [2, 7, 13, 16, 18, 26, 
34], "linearnewgelu": [2, 34], "newgeluactiv": 2, "com": [2, 5, 34], "huggingfac": [2, 6, 26, 28, 32, 34], "blob": 2, "src": [2, 17], "activ": [2, 6, 7, 15, 16, 20, 28, 31, 33], "py": [2, 5, 10, 14, 20, 31, 32, 34], "l50": 2, "new_gelu": 2, "lineargelu": [2, 34], "gelu": [2, 13, 34], "linearmul": [2, 34], "linearadd": [2, 34], "add": [2, 5, 7, 8, 13, 14, 19, 21, 32, 34], "linearaddadd": [2, 34], "other_1": 2, "other_2": 2, "rotaryembed": [2, 34], "max_position_embed": 2, "int": [2, 6, 7, 14, 17, 23, 26, 29, 31, 34], "pos_embd_dim": 2, "10000": 2, "backbon": 2, "co": 2, "paper": [2, 34], "2104": 2, "09864": 2, "queri": [2, 17, 18], "multi": [2, 7, 14, 20, 28, 31, 33, 34], "head": [2, 34], "comput": [2, 6, 7, 13, 15, 16, 18, 20, 21, 28, 30, 31, 32, 33, 34], "max": [2, 6, 16, 17, 22, 23, 26, 34], "posit": [2, 28, 33, 34], "frequenc": [2, 30], "exact": 2, "g": [2, 7, 8, 16, 17, 18, 28, 34], "gptjforcausallm": 2, "architectur": [2, 28, 30, 33], "eleutherai": [2, 28], "6b": [2, 28, 30], "l4": 2, "batch": [2, 6, 7, 13, 16, 18, 20, 23, 26, 30, 32, 34], "sequenc": [2, 18, 21, 28, 34], "length": [2, 5, 14, 21, 26, 30, 34], "num_head": 2, "num_kv_head": 2, "head_dim": 2, "position_id": [2, 6], "element": [2, 18, 19], "past_kv_length": 2, "id": [2, 31, 32], "construct": [2, 7, 13], "current_posit": 2, "num": [2, 20, 32, 33, 34], "dim": [2, 6, 18, 23], "offset": [2, 18, 28], "sin": 2, "neighbor": 2, "rotary_dim": 2, "rotary_ndim": 2, "rotari": [2, 28], "64": [2, 8, 10, 16, 20, 30, 31, 34], "gptj": 2, "rope_modul": 2, "2048": [2, 6], "32": [2, 6, 18, 21, 23, 30, 31, 32], "16": [2, 17, 20, 21, 30, 31, 32], "256": [2, 30], "arang": [2, 6, 16], "unsqueez": 2, "query_roteri": 2, "direct": [2, 5, 13], "apply_funct": 2, "without": [2, 5, 6, 7, 8, 10, 16, 20, 21, 26, 32, 34], "initi": [2, 20, 32], "assum": [2, 7, 8, 23, 32, 33, 34], "arg": [2, 4, 6, 7, 14, 16, 19, 23, 31, 32, 34], "num_token": 2, "rotary_half": 2, "rmsnorm": [2, 28, 34], "hidden_s": [2, 6], "ep": [2, 7, 10, 19], "1e": 
[2, 7, 10, 16], "06": [2, 31, 32], "hidden": [2, 18, 28], "modeling_llama": 2, "l76": 2, "variance_epsilon": 2, "6": [2, 5, 7, 11, 14, 20, 30, 31, 32, 33, 34], "ones": [2, 6, 17], "hidden_st": 2, "usual": [2, 18, 20, 33], "rmsnorm_modul": 2, "fastlayernorm": [2, 34], "normalized_shap": 2, "layernorm": [2, 13, 16, 22, 34], "list": [2, 5, 7, 8, 13, 14, 16, 18, 25, 29, 31, 32, 33, 34], "denomin": 2, "numer": [2, 8, 33], "stabil": [2, 8, 34], "layernorm_modul": 2, "05": [2, 7, 10, 30, 31], "expect": [2, 7, 30, 34], "indirectaccesskvcacheattent": [2, 34], "text_max_length": 2, "kv_cach": [2, 28], "decod": [2, 28, 30, 34], "layer": [2, 16, 20, 22, 28, 34], "bring": [2, 6, 7, 9, 15, 16, 21, 28, 31, 33, 34], "beam": [2, 28], "idx": [2, 28, 31], "concat": [2, 20, 26, 28, 34], "entir": [2, 16, 28], "context": [2, 5, 6, 8, 20, 28, 33], "dot": [2, 7, 18, 28], "veri": [2, 5, 15, 18, 28], "long": [2, 6, 18, 21, 26, 28, 34], "bottleneck": [2, 28], "indirect": 2, "access": [2, 6, 7, 18, 19, 32], "iakv": [2, 28], "firstli": [2, 28], "pre": [2, 28, 34], "alloc": [2, 10, 20, 28, 30, 32, 34], "buffer": [2, 28], "index": [2, 5, 18, 28, 33], "histori": [2, 14, 28], "decid": [2, 15, 20, 28], "timestamp": [2, 28], "max_seq": 2, "head_num": 2, "head_siz": 2, "token": [2, 6, 23, 28, 30], "everi": [2, 28], "kv": 2, "seq_len": [2, 30], "scale_attn": 2, "sqrt": [2, 13, 19], "layer_past": 2, "seq_info": 2, "key_cach": 2, "value_cach": 2, "info": [2, 6, 17, 26, 31, 32, 34], "head_mask": 2, "mask": [2, 7, 17, 26], "yet": [2, 6, 26, 34], "attention_mask": [2, 6], "attn_output": 2, "attn_weight": 2, "first": [2, 3, 5, 6, 7, 9, 10, 12, 16, 19, 20, 21, 26, 31, 32, 33], "matmul": [2, 8, 13, 26, 34], "new_layer_past": 2, "l1318": 2, "def": [2, 6, 8, 10, 16, 20, 26, 34], "_reorder_cach": 2, "self": [2, 6, 8, 10, 16, 20, 26, 34], "past_key_valu": [2, 6], "beam_idx": 2, "len": [2, 6, 7, 13, 16, 17], "4": [2, 6, 11, 13, 14, 18, 20, 23, 28, 30, 31, 33, 34], "3": [2, 5, 6, 7, 8, 10, 12, 13, 14, 16, 17, 18, 
20, 21, 28, 30, 31, 33], "pagedattent": [2, 34], "vllm": 2, "blog": [2, 34], "2023": [2, 3, 30], "20": [2, 7, 18, 30, 31, 32, 34], "page": [2, 6, 13, 20, 24, 29, 30, 33, 34], "num_block": 2, "block_siz": 2, "basic": [2, 4, 16, 21, 33], "logic": [2, 14, 18, 32, 33], "dram": 2, "manag": [2, 8, 13, 20, 28, 31], "slot": [2, 30], "reshape_and_cach": 2, "single_query_cached_kv_attent": 2, "mha": [2, 34], "intra": 2, "tabl": [2, 7, 17, 28, 30, 34], "map": [2, 6, 18, 30], "physic": [2, 14, 20, 32, 33], "slot_map": 2, "allcat": 2, "keytensor": 2, "num_seq": 2, "_i_": 2, "block_numb": 2, "head_map": 2, "block_tabl": 2, "context_len": 2, "max_context_len": 2, "alibi_slop": 2, "5": [2, 6, 10, 13, 14, 16, 17, 18, 19, 20, 21, 22, 26, 28, 30, 31, 32, 33, 34], "max_num_blocks_per_seq": 2, "optin": 2, "alibi": 2, "slope": 2, "varlenattent": [2, 34], "scaled_dot_product_attent": 2, "accept": [2, 34], "variant": [2, 8, 28], "among": [2, 31, 32, 33], "doe": [2, 7, 13, 18, 20, 26, 34], "query_token": 2, "total": [2, 6, 30, 33], "key_token": 2, "value_token": 2, "seqlen_q": 2, "batch_siz": [2, 6, 11, 13, 16, 18, 23, 32], "seqlen_k": 2, "max_seqlen_q": 2, "max_seqlen_k": 2, "pdropout": 2, "probabl": 2, "greater": 2, "softmax_scal": 2, "factor": [2, 6, 16, 31], "softmax": [2, 13, 34], "is_caus": 2, "causal": 2, "varlenattention_modul": 2, "emply_lik": 2, "rotary_embed": [2, 34], "rms_norm": [2, 34], "fast_layer_norm": [2, 34], "indirect_access_kv_cache_attent": [2, 34], "add_casual_mask": 2, "varlen_attent": [2, 34], "zero_tensor": 2, "return_softmax": 2, "gen_": 2, "fast_bert": [2, 4, 6, 7, 11, 34], "unpad": 2, "tpp": [2, 28], "speedup": [2, 6, 8, 28, 30, 34], "still": [2, 5, 7, 8, 13, 16, 18, 21, 26, 34], "squenc": 2, "sparsiti": 2, "seed": 2, "libxsmm": 2, "though": [2, 7], "peak": [2, 7, 11, 34], "enable_onednn_fus": [2, 13], "get_smooth_quant_qconfig_map": [2, 6, 29], "alpha": [2, 6, 19, 22], "act_observ": 2, "act_ic_observ": 2, "wei_observ": 2, "wei_ic_observ": 2, 
"share_weight_observ": 2, "smoothquant": [2, 6, 7, 16, 22, 28, 34], "arxiv": 2, "pdf": 2, "2211": 2, "10438": 2, "hyper": [2, 30, 33, 34], "observ": [2, 9, 13, 15, 34], "op": [2, 7, 15, 16, 22, 28, 34], "histogramobserv": [2, 15], "q": [2, 28], "min": [2, 16, 22, 26, 34], "affect": [2, 31], "argument": [2, 6, 7, 22, 26, 31], "ao": [2, 6, 15], "minmaxobserv": [2, 6, 15], "channel": [2, 3, 10, 15, 16, 26, 34], "perchannelminmaxobserv": [2, 6, 15], "with_arg": [2, 6, 15], "ch_axi": 2, "qint8": [2, 6, 15], "qscheme": [2, 6, 15, 34], "per_channel_symmetr": [2, 6, 15], "qconfig": [2, 4, 6, 13, 16, 26, 29, 32, 34], "prepar": [2, 4, 6, 13, 16, 26, 29, 32, 34], "example_input": [2, 4, 6, 13, 15, 29, 32, 34], "bn_fold": 2, "example_kwarg_input": 2, "fp32": [2, 4, 16, 17, 19, 21, 23, 28, 34], "A": [2, 5, 6, 7, 10, 11, 17, 26, 28, 31, 33, 34], "even": [2, 5, 7, 33, 34], "prepared_model": [2, 4, 6, 13, 15, 16, 26, 29, 34], "original_model": 2, "later": [2, 7, 25, 33], "unexpect": 2, "behavior": [2, 20, 31, 33], "insert": [2, 16], "fake": 2, "introduct": [2, 7, 28, 33, 34], "avaiabl": 2, "autotun": [2, 4, 22, 34], "calib_dataload": [2, 6, 16, 34], "calib_func": 2, "eval_func": [2, 16, 34], "op_type_dict": 2, "smoothquant_arg": [2, 16], "sampling_s": [2, 4, 16, 34], "accuracy_criterion": [2, 4, 16, 34], "tuning_tim": [2, 4, 16, 34], "driven": 2, "tune": [2, 3, 4, 7, 8, 15, 20, 26, 28, 31, 32, 34], "help": [2, 5, 6, 17, 23, 28, 31, 33, 34], "quickli": 2, "dataload": [2, 6, 10, 13, 16, 20, 22, 29, 34], "post": [2, 4, 5, 7, 15, 28, 34], "process": [2, 6, 7, 11, 12, 14, 16, 19, 20, 21, 26, 31, 32, 33], "metric": [2, 16, 30], "scalar": 2, "higher": [2, 7, 13, 17, 18, 28], "constraint": [2, 34], "optyp": 2, "wise": [2, 16, 19, 22, 29, 34], "space": [2, 7, 16, 18, 22, 33], "global": [2, 20, 22, 34], "algorithm": [2, 13, 18, 30, 34], "would": [2, 5, 6, 14, 16, 17, 18, 30, 31, 32, 33, 34], "explor": 2, "100": [2, 4, 14, 16, 17, 30, 32], "accuracy_criterion_typ": 2, "rel": [2, 4, 16, 31, 
34], "absolut": [2, 31], "accuracy_criterion_valu": 2, "maximum": [2, 16, 17], "allow": [2, 8, 14, 16, 22, 31, 33, 34], "01": [2, 4, 7, 16, 31, 32, 34], "timeout": [2, 5, 21], "earli": [2, 34], "stop": [2, 33], "is_runtime_ext_en": 2, "helper": 2, "check": [2, 5, 6, 7, 13, 18, 28, 29, 31, 34], "exetens": 2, "openmp": [2, 7, 20, 26, 30, 32, 34], "preload": [2, 31], "cpupool": [2, 20, 34], "core_id": [2, 20, 31], "node_id": [2, 20, 31, 32, 34], "abstract": [2, 11, 20], "pool": [2, 20, 34], "core": [2, 7, 14, 17, 30, 33, 34], "numa": [2, 20, 31, 32, 34], "node": [2, 20, 30, 32, 33, 34], "pin": [2, 20], "cpu_pool": [2, 20, 34], "region": [2, 8, 17, 33], "design": [2, 5, 8, 18, 21, 29, 34], "decor": 2, "multistreammodulehint": [2, 20, 34], "kwarg": [2, 29], "hint": [2, 20], "multistreammodul": [2, 7, 20, 26, 34], "its": [2, 6, 7, 8, 14, 17, 21, 28, 30, 31, 32, 33, 34], "arbitrari": 2, "keyword": 2, "num_stream": [2, 20, 34], "auto": [2, 6, 10, 17, 18, 22, 23, 26, 28, 31, 33, 34], "concat_output": 2, "input_split_hint": [2, 20], "multi_stream": 2, "output_concat_hint": [2, 20], "stream": [2, 7, 20, 34], "throughput": [2, 3, 18, 20, 26, 28, 30, 34], "insid": [2, 5, 20, 31], "divis": [2, 20], "equal": [2, 15, 20, 32, 33], "remaind": [2, 20], "divisor": [2, 20], "batchsiz": [2, 20], "larger": [2, 20, 30, 33], "piec": [2, 20], "less": [2, 8, 18, 20, 26, 34], "mini": [2, 20, 34], "don": [2, 5, 8, 14, 17, 34], "want": [2, 5, 7, 14, 15, 17, 20, 31, 34], "leav": [2, 20, 33], "scriptmodul": [2, 13, 20], "union": 2, "instanc": [2, 7, 10, 14, 32, 34], "reason": [2, 10, 18, 20, 34], "flag": [2, 5, 7, 17, 20, 31, 34], "indic": [2, 6, 18, 28], "concaten": [2, 21], "raw": 2, "asynchron": [2, 7], "get_core_list_of_node_id": 2, "softwar": [3, 27, 34], "jul": 3, "deep": [3, 7, 8, 11, 13, 14, 21, 33], "learn": [3, 7, 8, 11, 13, 14, 21, 31, 33], "boost": [3, 6, 7, 9, 21, 30, 31, 33, 34], "dl": [3, 7, 34], "hug": 3, "face": 3, "bert": [3, 4, 10, 30, 34], "googl": [3, 5, 28], "cloud": 3, 
"platform": [3, 7, 18, 32, 33, 34], "gcp": 3, "technologi": [3, 7], "guid": [3, 6, 7, 17, 32, 34], "apr": 3, "mar": [3, 32], "new": [3, 5, 12, 16, 17, 18, 20, 23, 26, 29, 33], "x86": 3, "sapphir": 3, "rapid": 3, "part": [3, 5, 7, 8, 18, 21, 26, 33, 34], "jan": 3, "secur": 3, "torchserv": [3, 34], "confer": 3, "dec": 3, "2022": [3, 31, 32], "what": [3, 5, 6, 8, 23], "pyg": 3, "diffus": [3, 34], "arc": 3, "nov": 3, "13": [3, 10, 17, 30, 31, 32, 33], "potenti": [3, 7, 34], "fine": [3, 20, 31, 32, 33, 34], "fx": [3, 7, 10, 26, 34], "sep": [3, 17], "empow": 3, "xeon": [3, 7, 14, 21, 28, 30, 32, 33, 34], "scalabl": [3, 7, 21, 28, 30, 33, 34], "processor": [3, 7, 19, 21, 28, 30, 33, 34], "aug": [3, 30], "vision": [3, 6, 30], "last": [3, 10, 21, 26, 34], "One": [3, 18, 19, 31, 33], "click": 3, "compressor": [3, 7, 16, 22, 34], "4x": 3, "jun": 3, "grokk": 3, "principl": [3, 18], "kt": 3, "person": 3, "text": [3, 6, 26, 28, 30, 33], "speech": [3, 33], "2021": [3, 17, 31, 32], "up": [3, 7, 11, 20, 24, 28, 33, 34], "modern": 3, "naver": 3, "low": [3, 4, 6, 7, 21, 23, 31, 33, 34], "latenc": [3, 14, 18, 28, 30, 32, 34], "machin": [3, 5, 6, 7, 14, 17, 26, 31, 32, 33, 34], "feb": 3, "dlrm": [3, 7, 26, 30, 34], "oneccl": [3, 6, 31, 34], "mention": [3, 10, 20, 21, 34], "deprec": [3, 26], "facebook": [3, 6, 28], "3rd": [3, 7, 21, 30, 34], "gen": [3, 30, 34], "capabl": [3, 17, 34], "2020": 3, "collabor": 3, "2019": 3, "caff": 3, "2017": 3, "command": [4, 5, 6, 14, 23, 31, 32, 33, 34], "descript": [4, 7, 16, 18, 20, 25, 33, 34], "instal": [4, 5, 6, 23, 25, 26, 28, 33, 34], "m": [4, 14, 20, 26, 31, 32, 33, 34], "pip": [4, 5, 34], "captur": [4, 34], "log": [4, 6, 13, 31, 32, 34], "prompt": [4, 6, 23, 34], "export": [4, 31, 33], "onednn_verbos": 4, "dure": [4, 6, 7, 10, 13, 16, 21, 31, 33, 34], "precis": [4, 6, 13, 21, 23, 26, 30, 34], "no_grad": [4, 6, 10, 11, 12, 13, 15, 16, 20, 23, 26, 29, 32, 34], "amp": [4, 6, 10, 23, 26, 34], "autocast": [4, 6, 7, 10, 23, 34], "prototyp": [4, 13, 
20, 26, 34], "fast": [4, 12, 33, 34], "bertmodelmodel": 4, "bertmodel": [4, 6, 11, 32], "from_pretrain": [4, 6, 11, 23, 29, 32], "uncas": [4, 6, 10, 11, 32, 34], "launch": [4, 6, 20, 32, 34], "autom": [4, 7, 8, 14, 31, 32, 34], "ipexrun": [4, 10, 31, 34], "lt": [4, 28, 30], "your_pytorch_script": [4, 31], "gt": [4, 14, 28, 33], "hypertun": [4, 34], "hyperparamet": [4, 7], "conf": [4, 13, 14, 31, 34], "your_conf_fil": [4, 34], "your_python_script": [4, 34], "default_static_qconfigprepared_model": 4, "anyplac": 4, "d": [4, 5, 6, 7, 8, 13, 26, 28, 34], "calibration_data_load": [4, 6, 13], "converted_model": [4, 6, 26, 34], "default_dynamic_qconfigprepared_model": 4, "tuned_model": [4, 16, 34], "eval_funct": 4, "convert_model": [4, 13, 15, 16], "thank": [5, 34], "interest": 5, "begin": 5, "intent": 5, "propos": [5, 7, 11, 16, 18, 21], "intend": 5, "shall": [5, 18, 33], "discuss": [5, 18, 33], "agre": 5, "plan": [5, 7, 10], "look": [5, 14, 16, 18], "ahead": 5, "outstand": 5, "pick": 5, "comment": [5, 14, 17, 22, 34], "particular": [5, 6, 8, 29, 34], "ask": 5, "pull": 5, "here": [5, 8, 10, 13, 16, 17, 18, 20, 26, 32, 33, 34], "uninstal": 5, "ll": [5, 32, 33], "know": 5, "fulli": [5, 15, 17, 21, 33, 34], "warn": [5, 6, 12, 31, 32, 34], "skip": [5, 6, 17, 18, 31], "few": [5, 7, 9, 13, 16, 18, 32, 34], "alwai": [5, 6, 7, 8, 18, 31, 33, 34], "loop": [5, 21, 29], "re": [5, 8, 32, 33], "feel": [5, 18, 34], "lazi": 5, "ye": 5, "clone": 5, "copi": [5, 17, 18], "cd": [5, 6], "rebas": [5, 34], "submodul": 5, "sync": [5, 20], "recurs": 5, "job": 5, "setup": [5, 6, 28, 34], "symlink": 5, "tree": [5, 6], "reinstal": [5, 26], "again": [5, 19, 32], "__init__": [5, 6, 8, 10, 16, 20, 26, 34], "repeatedli": 5, "interfac": [5, 6, 18, 26, 28], "pyi": 5, "non": [5, 8, 13, 18, 30, 32, 34], "cpp": [5, 6, 33], "cc": [5, 6, 17], "cu": 5, "h": [5, 6, 7, 16, 18, 26, 31, 32], "sure": [5, 14, 15, 32, 33], "until": [5, 20, 21, 33], "next": [5, 7, 34], "clean": 5, "cmake": [5, 6, 17, 34], "must": [5, 
14, 17, 19], "maco": 5, "linux": [5, 6, 17, 30, 31, 33], "homebrew": 5, "brew": 5, "our": [5, 16, 19, 28, 33, 34], "error": [5, 6, 7, 10, 16, 18, 21, 22, 26, 34], "printf": 5, "stdio": 5, "nint": 5, "hello": 5, "world": [5, 7], "clang": 5, "simpl": [5, 7, 8, 11, 18, 33, 34], "binari": [5, 6, 7, 8, 17, 34], "folder": 5, "mani": [5, 14, 28, 31, 33, 34], "wai": [5, 10, 16, 18, 28, 34], "rm": 5, "rf": 5, "toplevel": 5, "over": [5, 7, 8, 9, 16, 18, 30, 31, 34], "made": [5, 34], "edit": [5, 26, 34], "repo": [5, 6, 7], "commit": 5, "ani": [5, 8, 10, 17, 18, 32, 34], "keep": [5, 12, 18, 21, 28, 32, 33, 34], "realli": 5, "untrack": 5, "deinit": 5, "f": [5, 6, 13, 16, 28, 34], "xdf": 5, "within": [5, 16, 21, 29, 33, 34], "experi": [5, 7, 10, 12, 16, 18, 26, 33, 34], "env_key1": 5, "env_val1": 5, "env_key2": 5, "env_val2": 5, "suit": 5, "locat": [5, 17, 34], "test_": 5, "individu": [5, 30], "filenam": 5, "repres": [5, 7, 21], "wish": [5, 7], "test_jit": 5, "narrow": 5, "down": [5, 32, 34], "testclassnam": 5, "testnam": 5, "let": [5, 10, 18, 19, 20, 21], "sai": 5, "test_sequenti": 5, "testjit": 5, "expecttest": 5, "hypothesi": 5, "mypi": 5, "depend": [5, 7, 17, 18, 25, 26, 33, 34], "conda": [5, 33], "offici": [5, 32, 33, 34], "unittest": 5, "substr": 5, "test_nn": 5, "v": 5, "testnn": 5, "test_bceloss": 5, "test_mseloss": 5, "keystrok": 5, "ci": 5, "quicklint": 5, "aren": 5, "setup_lint": 5, "target": [5, 6, 10, 13, 14, 17, 34], "makefil": 5, "complet": [5, 6, 14, 18, 29, 33], "tab": 5, "trail": [5, 21], "newlin": 5, "quick_check": 5, "flake8": 5, "cmakelint": 5, "tidi": 5, "changed_onli": 5, "written": [5, 6, 17], "framework": [5, 34], "runner": 5, "bin": [5, 6, 17, 31, 32], "gtest_filt": 5, "testsuit": 5, "maycontainalia": 5, "containeraliasingtest": 5, "test_alias_analysi": 5, "docstr": 5, "line": [5, 10, 13, 18, 31, 32, 33], "limit": [5, 8, 10, 20, 26, 32, 33, 34], "80": [5, 30, 31], "charact": 5, "fit": [5, 7, 33, 34], "jupyt": 5, "popup": 5, "prerequisit": [5, 6], "r": 
[5, 6, 7, 14, 23, 30, 32, 33], "txt": [5, 6, 32], "_build": 5, "rst": 5, "live": 5, "tutori": [5, 6, 15, 16, 34], "autofunct": 5, "autoclass": 5, "shorten": 5, "sphinx": 5, "produc": [5, 8], "miss": 5, "relat": [6, 13, 17, 31, 33, 34], "demonstr": [6, 18, 26, 32], "box": [6, 10, 33], "benefit": [6, 7, 8, 10, 20, 21, 28, 32, 33, 34], "against": 6, "below": [6, 8, 10, 14, 19, 20, 21, 22, 23, 26, 28, 31, 32, 33, 34], "criterion": [6, 8, 16, 22], "zero_grad": [6, 7, 16], "torchvis": [6, 10, 12, 13, 16, 18, 32, 34], "lr": [6, 7, 8, 16, 19], "001": [6, 8], "download": [6, 13, 16], "dataset": [6, 13, 16, 29, 30, 33, 34], "cifar10": [6, 13], "compos": [6, 13], "resiz": [6, 13], "224": [6, 8, 10, 12, 13, 30, 32, 34], "totensor": [6, 13, 16], "train_dataset": [6, 13], "root": [6, 13, 16, 17, 28], "train_load": [6, 8], "128": [6, 8, 10, 13, 20, 30, 34], "crossentropyloss": [6, 16], "momentum": [6, 10, 21], "9": [6, 7, 14, 17, 23, 25, 31, 32], "uncom": 6, "batch_idx": [6, 13], "enumer": [6, 13, 16, 29], "backward": [6, 7, 8, 16, 21, 33, 34], "print": [6, 11, 12, 13, 14, 16, 17, 23, 31], "model_state_dict": 6, "optimizer_state_dict": 6, "pth": 6, "finish": [6, 11, 12, 13, 16, 20], "noqa": [6, 11, 12, 13, 16, 23, 29], "f401": [6, 11, 12, 13, 16, 23, 29], "oneapi": [6, 33], "collect": [6, 32, 33, 34], "commun": [6, 28, 31, 32, 33, 34], "bind": [6, 7, 31, 32, 33, 34], "o": [6, 17, 23, 30], "dist": 6, "oneccl_bindings_for_pytorch": 6, "torch_ccl": 6, "master_addr": 6, "127": [6, 31, 34], "master_port": 6, "29500": [6, 31], "rank": [6, 31, 34], "pmi_rank": 6, "world_siz": [6, 29], "pmi_siz": [6, 29], "init_process_group": 6, "ccl": [6, 31, 34], "init_method": 6, "env": [6, 29], "dist_sampl": 6, "distributedsampl": 6, "sampler": 6, "distributeddataparallel": 6, "batch_id": 6, "destroy_process_group": 6, "nlp": [6, 7, 26, 30, 34], "resnet50_weight": [6, 12, 13], "rand": [6, 8, 12, 13, 20, 26, 34], "vocab_s": [6, 11, 32], "seq_length": [6, 11, 32], "randint": [6, 11, 32], "freez": [6, 
8, 10, 13, 15, 16, 20, 23, 26, 32, 34], "check_trac": [6, 13, 32], "strict": [6, 32], "sinc": [6, 7, 18, 19, 20, 21, 26, 33, 34], "manual_se": [6, 11], "43": [6, 11, 31, 32], "12": [6, 10, 14, 17, 30, 31, 32], "instanti": 6, "qconfig_map": 6, "default_static_qconfig_map": 6, "own": [6, 15, 28], "qconfigmap": 6, "per_tensor_affin": [6, 15, 34], "quint8": [6, 15], "set_glob": 6, "traced_model": [6, 10, 13, 15, 16, 26, 34], "static_quantized_model": 6, "local": [6, 20, 28, 31, 32, 33], "default_dynamic_qconfig_map": 6, "placeholderobserv": [6, 15], "is_dynam": [6, 15], "dynamic_quantized_model": 6, "dedic": [6, 28, 34], "faster": [6, 7, 8, 30, 33], "variou": [6, 7, 14, 28, 33, 34], "38": [6, 11, 31, 32], "account": 6, "pretrain": [6, 32, 34], "login": 6, "argpars": [6, 23], "autoconfig": [6, 23], "automodelforcausallm": [6, 23, 29, 34], "autotoken": [6, 23], "parser": [6, 23], "argumentpars": [6, 23], "add_help": [6, 23], "add_argu": [6, 23], "choic": [6, 21, 23, 31], "choos": [6, 8, 20, 23, 31, 33, 34], "dinner": [6, 23], "greedi": [6, 23], "action": [6, 23], "store_tru": [6, 23], "parse_arg": [6, 23], "amp_en": [6, 23], "els": [6, 14, 17, 18, 23], "amp_dtyp": [6, 23], "getattr": [6, 23], "model_id": [6, 23], "125m": 6, "trust_remote_cod": [6, 23], "torch_dtyp": [6, 23], "low_cpu_mem_usag": [6, 23], "memory_format": [6, 7, 18, 23], "channels_last": [6, 7, 18, 23, 33, 34], "num_beam": [6, 23], "generate_kwarg": [6, 23], "do_sampl": [6, 23], "temperatur": [6, 23], "input_s": [6, 23], "return_tensor": [6, 23], "input_id": [6, 23], "inference_mod": [6, 23, 29], "gen_id": [6, 23], "max_new_token": [6, 23], "gen_text": [6, 23], "batch_decod": [6, 23], "skip_special_token": [6, 23], "input_tokens_length": [6, 23], "output_tokens_length": [6, 23], "total_new_token": [6, 23], "zip": [6, 23, 34], "flush": [6, 23], "typic": [6, 10, 28, 33, 34], "summari": [6, 34], "narg": 6, "neelnanda": 6, "pile": 6, "10k": 6, "meta": [6, 18, 28, 29], "7b": [6, 28, 30], "hf": [6, 28], 
"beam_idx_tmp": 6, "contigu": [6, 13, 18, 33, 34], "global_past_key_valu": 6, "num_attention_head": 6, "user_model": [6, 15], "num_hidden_lay": 6, "pad_val": 6, "pad_max": 6, "tokenize_funct": 6, "set_format": 6, "column": 6, "elif": 6, "collate_batch": 6, "position_ids_pad": 6, "input_ids_pad": 6, "last_ind": 6, "attention_mask_pad": 6, "append": [6, 7], "vstack": 6, "calib_dataset": [6, 29], "load_dataset": 6, "calib_evalu": 6, "shuffl": 6, "collate_fn": 6, "break": [6, 16, 34], "calibration_sampl": 6, "save_qconf_summari": [6, 15, 16, 29], "qconf_summari": [6, 15, 16, 29], "int8_qconfig": 6, "done": [6, 10, 16, 17, 26, 33, 34], "Will": [6, 18], "exit": [6, 31], "benchmark": [6, 26, 30, 31, 34], "lowp": 6, "fp16": [6, 17, 29], "unrel": 6, "lowp_mod": [6, 29], "fall": [6, 12], "back": [6, 12, 17, 18, 21, 26], "implicitli": 6, "determin": [6, 17, 21, 33], "woqweightdtyp": [6, 29], "weight_dtyp": [6, 29], "woqlowpmod": [6, 29], "get_weight_only_quant_qconfig_map": [6, 29], "known": [6, 10, 28], "practic": [6, 21, 24, 28, 33], "libtorch": [6, 34], "suppos": [6, 14, 33], "handl": [6, 18, 33], "servic": [6, 28, 30, 33], "regular": [6, 21], "unlik": 6, "app": [6, 34], "iostream": 6, "argc": 6, "const": [6, 17], "char": 6, "argv": 6, "catch": 6, "c10": [6, 17], "std": [6, 17, 19], "cerr": 6, "ivalu": 6, "push_back": 6, "cout": 6, "slice": [6, 18], "end": [6, 13, 20, 34], "endl": 6, "cmakelist": 6, "cmake_minimum_requir": 6, "version": [6, 7, 16, 17, 25, 26, 27, 32, 33, 34], "fatal_error": 6, "find_packag": 6, "add_execut": 6, "target_link_librari": 6, "torch_ipex_librari": 6, "set_properti": 6, "properti": [6, 32], "cxx_standard": 6, "17": [6, 30, 31, 32], "mkdir": 6, "build": [6, 28, 33, 34], "dcmake_prefix_path": 6, "libpytorch_path": 6, "had": [6, 33], "verifi": [6, 7], "ldd": 6, "workspac": 6, "identif": [6, 17], "gnu": [6, 17, 32], "xx": 6, "cxx": [6, 17], "abi": [6, 17, 34], "usr": [6, 17, 31, 32], "torchconfig": 6, "22": [6, 30, 31, 32], "kineto_librari": 6, 
"notfound": 6, "stack": [6, 8], "most": [6, 7, 13, 21, 28, 30, 32, 33, 34], "recent": [6, 7, 18], "append_torchlib_if_found": 6, "ipexconfig": 6, "84": [6, 30, 31, 33], "lib": [6, 31, 32], "libintel": [6, 34], "ext": [6, 34], "0x00007f3cf98e0000": 6, "libc10": 6, "0x00007f3cf985a000": 6, "0x00007f3cf70fc000": 6, "libtorch_cpu": 6, "0x00007f3ce16ac000": 6, "libdnnl_graph": 6, "0x00007f3cde954000": 6, "former": 6, "zoo": [6, 30], "simpli": [6, 7, 26, 31], "overview": [7, 25, 29, 34], "three": [7, 16, 17], "claus": [7, 10, 19], "guidanc": 7, "intel_pytorch_extens": [7, 25, 26, 34], "10": [7, 14, 16, 17, 18, 21, 25, 26, 31, 32, 33], "correct": [7, 18, 25, 34], "speed": [7, 11, 19, 28, 33, 34], "happen": 7, "inductor": [7, 34], "level": [7, 10, 13, 16, 18, 20, 21, 26, 33, 34], "migrat": 7, "pattern": [7, 11, 18, 28, 34], "highli": [7, 23, 28, 33, 34], "adapt": 7, "nchw": [7, 33], "nhwc": [7, 33, 34], "could": [7, 13, 16, 18, 26, 32, 33, 34], "anymor": [7, 34], "aka": [7, 18], "cooper": [7, 30, 34], "lake": [7, 30, 34], "avx512": [7, 17, 18, 32, 34], "partial": 7, "upstream": [7, 18, 34], "land": [7, 34], "pr": [7, 18, 34], "being": [7, 33], "review": [7, 34], "instead": [7, 8, 14, 19, 20, 29, 30, 31, 32, 33, 34], "device_nam": [7, 8], "conduct": 7, "frequent": 7, "websit": 7, "registr": 7, "topologi": [7, 18, 19, 26, 30, 31, 33, 34], "roialign": [7, 34], "nm": [7, 34], "cnn": [7, 18, 26, 30, 33, 34], "frozenbatchnorm2d": 7, "num_featur": 7, "batchnorm2d": [7, 10, 26, 34], "statist": 7, "affin": [7, 10, 15, 20, 31, 32, 33], "w": [7, 16, 18, 21, 30, 32], "interact": [7, 34], "beyond": 7, "kind": 7, "gender": 7, "hobbi": 7, "between": [7, 8, 17, 20, 33, 34], "man": [7, 33], "plai": [7, 33], "footbal": 7, "b": [7, 8, 16, 28], "mergedembeddingbag": 7, "embedding_spec": 7, "embeddingspec": 7, "merg": [7, 34], "embeddingbag": [7, 26, 34], "At": [7, 17], "stage": [7, 10, 19, 20, 29, 33, 34], "spars": [7, 18, 34], "dens": [7, 18], "gradient": 7, "mergedembeddingbagwithsgd": 7, 
"emblist": 7, "modulist": 7, "emb1": 7, "emb2": 7, "emb3": 7, "emb_m": 7, "in1": 7, "in2": 7, "in3": 7, "in_m": 7, "emb": 7, "in_i": 7, "merged_emb": 7, "from_embeddingbag_list": 7, "minim": [7, 14, 17, 33], "heavi": 7, "big": [7, 18], "read": [7, 19], "futur": [7, 28, 34], "visit": [7, 33], "mergedembeddingbagwith": 7, "weight_decai": [7, 19], "grad": [7, 19], "creat": [7, 16, 20, 33, 34], "decai": 7, "to_bfloat16_train": 7, "merged_input": 7, "linearize_indices_and_offset": 7, "need_linearize_indices_and_offset": 7, "booltensor": 7, "becom": [7, 28, 33], "balanc": [7, 16, 22, 33], "embedingbag": 7, "often": 7, "categor": 7, "power": [7, 33, 34], "law": 7, "ag": 7, "video": 7, "game": 7, "19": [7, 30, 31, 32, 34], "29": [7, 31, 32], "row": 7, "write": [7, 17], "address": [7, 18, 31, 32, 33, 34], "conflict": [7, 17], "solv": [7, 19, 33], "togeth": [7, 14, 20, 33, 34], "immedi": 7, "right": [7, 21, 23, 28], "friendli": [7, 33], "gemm": [7, 18, 26, 28, 34], "aim": [7, 10, 16, 33], "math": 7, "wa": [7, 31, 32, 33, 34], "test": [7, 16, 17, 30, 34], "broad": [7, 9, 34], "toggl": 7, "switch": [7, 17, 31, 33, 34], "concern": 7, "footprint": [7, 21, 28, 34], "stick": 7, "splitsgd": [7, 21], "spawn": [7, 20], "subject": [7, 17, 20, 27, 34], "built": [7, 17, 20, 34], "deliv": [7, 28, 34], "separ": [7, 19, 27, 33], "smooth": 7, "ptq": 7, "tackl": 7, "problem": [7, 19, 26, 32, 33], "systemat": 7, "outlier": [7, 16], "commonli": [7, 28, 33, 34], "hopefulli": 7, "eas": [7, 18, 34], "small": [7, 19, 33, 34], "turn": [7, 34], "boolean": [7, 34], "off": [7, 8, 21, 28, 30, 34], "area": [7, 14], "extrem": [7, 14, 33], "situat": [7, 14], "huge": [7, 14, 33], "impract": [7, 14], "consum": [7, 14], "launcher": [7, 13, 31, 33, 34], "integr": [7, 18, 28, 33, 34], "conveni": [8, 34], "lower": [8, 17, 21, 28, 34], "becaus": [8, 17, 18, 21, 28, 33, 34], "lighter": 8, "smaller": [8, 17], "sacrif": 8, "trade": [8, 28, 30, 34], "slower": [8, 33, 34], "accur": 8, "primarili": [8, 34], "show": 
[8, 17, 21, 28, 29, 30, 31, 32, 33, 34], "simplenet": [8, 34], "super": [8, 10, 16, 20, 26, 34], "stride": [8, 10, 20, 34], "pad": [8, 10, 20, 34], "y": [8, 15, 16, 20, 21, 34], "chosen": [8, 14, 17], "maintain": 8, "categori": [8, 34], "circumst": 8, "imag": [8, 13, 18, 33, 34], "label": 8, "float64": 8, "suppli": 8, "addmm": 8, "addmm_": 8, "cannot": [8, 19, 26, 34], "describ": [8, 13, 18, 21, 32, 33], "expos": 8, "namespac": [8, 17], "regardless": [8, 34], "unlist": 8, "downstream": 8, "believ": [8, 18], "unstabl": 8, "conv1d": [8, 13], "conv3d": [8, 13, 34], "conv_transpose1d": 8, "conv_transpose2d": 8, "conv_transpose3d": 8, "bmm": [8, 34], "mm": 8, "baddbmm": 8, "addbmm": 8, "conv_tbc": 8, "group_norm": 8, "_native_multi_head_attent": 8, "avg_pool3d": 8, "binary_cross_entropi": 8, "grid_sampl": 8, "polar": 8, "prod": 8, "quantil": 8, "nanquantil": 8, "stft": 8, "cdist": 8, "view_as_complex": 8, "choleski": 8, "cholesky_invers": 8, "cholesky_solv": 8, "invers": 8, "lu_solv": 8, "matrix_rank": 8, "orgqr": 8, "ormqr": 8, "pinvers": 8, "max_unpool2d": 8, "max_unpool3d": 8, "adaptive_avg_pool3d": 8, "reflection_pad1d": 8, "reflection_pad2d": 8, "replication_pad1d": 8, "replication_pad2d": 8, "replication_pad3d": 8, "mse_loss": 8, "cosine_embedding_loss": 8, "nll_loss": 8, "nll_loss2d": 8, "hinge_embedding_loss": 8, "poisson_nll_loss": 8, "smooth_l1_loss": 8, "cross_entropy_loss": 8, "l1_loss": 8, "huber_loss": 8, "margin_ranking_loss": 8, "soft_margin_loss": 8, "triplet_margin_loss": 8, "multi_margin_loss": 8, "ctc_loss": 8, "kl_div": 8, "multilabel_margin_loss": 8, "binary_cross_entropy_with_logit": 8, "fft_fft": 8, "fft_ifft": 8, "fft_fft2": 8, "fft_ifft2": 8, "fft_fftn": 8, "fft_ifftn": 8, "fft_rfft": 8, "fft_irfft": 8, "fft_rfft2": 8, "fft_irfft2": 8, "fft_rfftn": 8, "fft_irfftn": 8, "fft_hfft": 8, "fft_ihfft": 8, "linalg_cond": 8, "linalg_matrix_rank": 8, "linalg_solv": 8, "linalg_choleski": 8, "linalg_svdv": 8, "linalg_eigv": 8, "linalg_eigvalsh": 8, 
"linalg_inv": 8, "linalg_householder_product": 8, "linalg_tensorinv": 8, "linalg_tensorsolv": 8, "fake_quantize_per_tensor_affin": 8, "eig": 8, "geqrf": 8, "lstsq": 8, "_lu_with_info": 8, "qr": 8, "svd": 8, "symeig": 8, "triangular_solv": 8, "fractional_max_pool2d": 8, "fractional_max_pool3d": 8, "adaptive_max_pool3d": 8, "multilabel_margin_loss_forward": 8, "linalg_qr": 8, "linalg_cholesky_ex": 8, "linalg_svd": 8, "linalg_eig": 8, "linalg_eigh": 8, "linalg_lstsq": 8, "linalg_inv_ex": 8, "cat": [8, 31, 32, 34], "index_copi": 8, "intervent": 8, "mixtur": [8, 34], "enable_auto_channels_last": 9, "disable_auto_channels_last": 9, "regress": [9, 34], "rais": 10, "oob": [10, 34], "easili": [10, 15], "who": 10, "inevit": 10, "simplifi": [10, 34], "snippet": [10, 29], "optimum": 10, "monkei": 10, "patch": [10, 34], "embedding_bag": 10, "qa": [10, 34], "clear": 10, "ninstanc": [10, 14, 31, 34], "ncore": [10, 31], "28": [10, 14, 16, 30, 31, 32, 33, 34], "run_qa": [10, 34], "model_name_or_path": [10, 29, 34], "dataset_nam": [10, 34], "squad": [10, 30, 34], "do_ev": [10, 34], "per_device_train_batch_s": [10, 34], "learning_r": [10, 34], "3e": [10, 34], "num_train_epoch": [10, 34], "max_seq_length": [10, 34], "384": [10, 32, 34], "doc_strid": [10, 34], "output_dir": [10, 14, 34], "tmp": [10, 32, 34], "debug_squad": [10, 34], "dummymodul": 10, "input1": 10, "kernel_s": 10, "7": [10, 14, 17, 20, 21, 31, 32, 34], "track_running_stat": 10, "customized_forward": 10, "method1": 10, "success": [10, 24], "method2": 10, "fail": [10, 26, 34], "top": [10, 21, 34], "unabl": 10, "hook": [10, 16], "As": [10, 19, 20, 28, 31, 32, 33, 34], "behaviour": 10, "repeat": [10, 18, 21], "feasibl": 10, "idea": [11, 21, 33], "primit": [11, 20, 30, 34], "portabl": 11, "hpc": 11, "ensur": [11, 19, 20, 32], "perf": [11, 18], "tri": 12, "failur": [12, 34], "incorrect": [12, 26, 34], "trigger": 12, "meanwhil": [12, 33, 34], "resnet50": [12, 13, 14, 18, 30, 31, 33, 34], "dag": 13, "acycl": 13, "straight": 
[13, 33], "cover": [13, 18, 31], "constant": 13, "resourc": [13, 20, 28, 32, 33], "focus": [13, 34], "front": [13, 34], "batchnorm": [13, 17, 18, 26, 34], "propag": [13, 21, 33], "graph_for": 13, "regard": 13, "rn50": [13, 34], "sum": [13, 16, 18, 19, 34], "convrelu": 13, "convsumrelu": 13, "default_static_qconfig": [13, 15, 32, 34], "quantized_model": [13, 15, 34], "244": 13, "convtranspose3d": 13, "ab": [13, 32], "clamp": 13, "elu": 13, "exp": 13, "hardtanh": 13, "hardswish": [13, 34], "mish": 13, "sigmoid": [13, 34], "pow": 13, "round": [13, 21], "squar": [13, 28], "tanh": [13, 34], "leaki": 13, "_": [13, 15, 16, 17, 18, 20, 30, 31, 32, 33, 34], "div": 13, "view": [13, 18, 20, 21], "transpos": [13, 34], "dequant": [13, 16], "partit": [13, 33], "leaky_relu": 13, "___": 13, "divid": [13, 32, 33, 34], "maxpool2d": 13, "_____": 13, "stock": [13, 30, 34], "owner": 13, "otheriws": 13, "compuat": 13, "wikipedia": [13, 33], "There": [14, 16, 20, 33, 34], "thing": [14, 33], "yaml": 14, "strategi": [14, 33, 34], "grid": 14, "random": 14, "max_trial": 14, "trial": 14, "record": [14, 32], "csv": 14, "hyperparam": 14, "mandatori": 14, "hp": 14, "ncores_per_inst": 14, "all_physical_cor": 14, "ncore_per_inst": [14, 34], "all_logical_cor": 14, "use_all_nod": 14, "num_nod": 14, "use_logical_cor": [14, 32], "is_hyperthreading_en": 14, "disable_numactl": [14, 32], "disable_iomp": [14, 32], "malloc": [14, 31, 33], "tc": 14, "je": 14, "previou": [14, 16, 18, 33, 34], "hyperparamt": 14, "8": [14, 16, 30, 31, 32, 33], "respect": [14, 16, 30, 31, 34], "maxim": 14, "statement": [14, 17], "higher_is_bett": 14, "target_v": 14, "inf": 14, "minimum": [14, 16, 18], "platinum": [14, 30, 32, 33], "8180m": [14, 33], "socket": [14, 30, 32, 33, 34], "anoth": [14, 31, 33, 34], "conf_fil": [14, 34], "hypertune_directori": 14, "termin": 14, "15": [14, 17, 30, 31, 32], "339081764221191": 14, "gave": 14, "side": [15, 33], "compon": [15, 26, 27, 28], "much": [15, 18, 21, 28, 33], "abl": 15, "similar": 
[15, 17, 33], "satisfi": [15, 26], "tradeoff": 15, "reduce_rang": 15, "methond": 15, "obsev": 15, "symmetr": 15, "sete": 15, "skylak": 15, "quant_stat": 15, "calibration_data_set": [15, 34], "qparam": 15, "And": [15, 20, 32, 34], "achang": 15, "overrid": 15, "load_qconf_summari": 15, "dynamic_qconfig": 15, "default_dynamic_qconfig": [15, 32], "per_tensor_symmetr": 15, "gru": 15, "lstmcell": 15, "rnncell": 15, "grucel": 15, "bother": 16, "desir": [16, 31], "receip": [16, 20], "sq": 16, "difficulti": 16, "vari": 16, "across": [16, 31], "herebi": 16, "obtain": 16, "abil": 16, "optdecoderlay": 16, "blockwis": 16, "consist": [16, 28, 33, 34], "major": 16, "adjust": 16, "accordingli": 16, "predict": 16, "criteria": 16, "consider": 16, "numpi": 16, "np": [16, 31], "tolist": 16, "auto_alpha_arg": 16, "init_alpha": [16, 22], "baselin": [16, 22, 34], "alpha_min": [16, 22], "alpha_max": [16, 22], "99": [16, 30, 34], "alpha_step": [16, 22], "step_siz": [16, 22], "shared_criterion": [16, 22], "enable_blockwise_loss": [16, 22], "portion": 16, "beginn": 16, "quickstart_tutori": 16, "training_data": 16, "fashionmnist": 16, "test_data": 16, "loader": 16, "train_dataload": 16, "test_dataload": 16, "neuralnetwork": 16, "flatten": [16, 20], "linear_relu_stack": 16, "sequenti": 16, "logit": 16, "loss_fn": 16, "pred": 16, "backpropag": 16, "item": 16, "7f": 16, "5d": 16, "epoch": 16, "argmax": 16, "inc": [16, 17, 22, 28], "accu": 16, "tuned_conf": 16, "explain": [17, 18, 21], "fork": [17, 33], "avx512_vnni": 17, "avx512_bf16": 17, "avx2": [17, 26, 34], "avx2_vnni": 17, "avx512_fp16": 17, "11": [17, 31, 32], "gcc": 17, "findavx": 17, "bodi": 17, "anonym": 17, "virtual": 17, "polymorph": 17, "pertain": 17, "cpuid": 17, "statu": 17, "pointer": 17, "system": [17, 33], "specifii": 17, "complier": 17, "isacodegen": 17, "suffix": 17, "adaptiveaveragepoolingkrnl": 17, "isa_codegen": 17, "o3": 17, "d__avx__": 17, "dcpu_capability_avx2": 17, "mavx2": 17, "mfma": 17, "mno": 17, "avx256": 17, 
"unalign": [17, 34], "dcpu_cap": 17, "dcpu_capability_default": 17, "d__avx512f__": 17, "mavx512f": 17, "mavx512bw": 17, "mavx512vl": 17, "mavx512dq": 17, "dcpu_capability_avx512": 17, "mavx512vnni": 17, "dcpu_capability_avx512_vnni": 17, "mavx512bf16": 17, "dcpu_capability_avx512_bf16": 17, "mamx": 17, "tile": 17, "dcpu_capability_amx": 17, "mavx512fp16": 17, "dcpu_capability_avx512_fp16": 17, "align": [17, 18, 21, 34], "stead": 17, "sleef": 17, "width": [17, 18], "isa_nam": 17, "inlin": 17, "compat": [17, 21], "definit": [17, 21], "Such": 17, "But": [17, 18], "tip": 17, "newkernelkrnl": 17, "newkernel": 17, "header": 17, "special": [17, 18, 28], "fastest": 17, "cpuinfo": 17, "mykernel": 17, "fn_type": 17, "void": 17, "ipex_declare_dispatch": 17, "ipex_define_dispatch": 17, "ipex_register_dispatch": 17, "kcpu": 17, "declar": 17, "ideep": [17, 18], "common": [17, 21, 28, 31, 33], "intrins": 17, "cvtfp32tobf16": 17, "pragma": 17, "torch_ipex": [17, 34], "cvt_fp32_to_bf16": 17, "dst": 17, "cvt_fp32_to_bf16_kernel_impl": 17, "cvt_fp32_to_bf16_kernel_fn": 17, "cvt_fp32_to_bf16_kernel_stub": 17, "macro": 17, "cpu_capability_avx512": 17, "cpu_capability_avx512_bf16": 17, "hav": 17, "cvtfp32tobf16krnl": 17, "vec512": 17, "vec256": 17, "endif": 17, "immintrin": 17, "__m256i": 17, "_cvt_fp32_to_bf16": 17, "__m512": 17, "reinterpret_cast": 17, "_mm512_cvtneps_pbh": 17, "__m512i": 17, "_mm512_castps_si512": 17, "nan": [17, 34], "_mm512_set1_epi32": 17, "0xffff": 17, "mask_valu": 17, "_mm512_cmp_ps_mask": 17, "_cmp_ord_q": 17, "0x1": 17, "vec_bia": 17, "0x7fff": 17, "uint32_t": 17, "lsb": 17, "t_valu": 17, "_mm512_and_si512": 17, "_mm512_srli_epi32": 17, "rounding_bia": 17, "_mm512_add_epi32": 17, "_mm512_mask_blend_epi32": 17, "_mm512_cvtusepi32_epi16": 17, "f32": [17, 18], "_mm512_loadu_p": 17, "_mm256_storeu_si256": 17, "_mm512_maskz_loadu_p": 17, "_mm256_mask_storeu_epi16": 17, "getveclength": 17, "get_cpp_typesize_and_vecs": 17, "scalartyp": 17, 
"get_cpp_typesize_and_vecsize_kernel_impl": 17, "get_cpp_typesize_and_vecsize_kernel_fn": 17, "get_cpp_typesize_and_vecsize_kernel_stub": 17, "types": 17, "vectors": 17, "getveclengthkrnl": 17, "doubl": 17, "make_tupl": 17, "sizeof": 17, "complexdoubl": 17, "complex": 17, "complexfloat": 17, "decltyp": 17, "impl": 17, "scalartypetocpptyp": 17, "torch_check": 17, "09": [17, 31], "58": [17, 31], "anaconda": 17, "copyright": [17, 27], "credit": 17, "licens": 17, "_c": [17, 26], "_get_current_isa_level": 17, "_get_highest_cpu_support_isa_level": 17, "_get_highest_binary_support_isa_level": 17, "quit": [17, 34], "By": [17, 31, 33], "aten_cpu_cap": 17, "effect": [17, 21, 26, 32, 33], "intern": [17, 18, 20, 32], "purpos": [17, 31, 32, 33], "addtion": 17, "tool": [17, 33, 34], "subfold": 17, "rh": 17, "toolset": 17, "33": [17, 31, 32], "cmakefil": 17, "cpu_featur": 17, "dir": [17, 31], "66": [17, 31, 34], "cpu_feature_main": 17, "xcr0": 17, "00000000000602e7": 17, "mmx": 17, "sse": 17, "sse2": 17, "sse3": 17, "ssse3": 17, "sse4_1": 17, "sse4_2": 17, "aes_ni": 17, "sha": 17, "xsave": 17, "fma": 17, "f16c": 17, "avx_vnni": 17, "avx512_f": 17, "avx512_cd": 17, "avx512_pf": 17, "avx512_er": 17, "avx512_vl": 17, "avx512_bw": 17, "avx512_dq": 17, "avx512_ifma": 17, "avx512_vbmi": 17, "avx512_vpopcntdq": 17, "avx512_4fmap": 17, "avx512_4vnniw": 17, "avx512_vbmi2": 17, "avx512_vpclmul": 17, "avx512_bitalg": 17, "avx512_vp2intersect": 17, "amx_bf16": 17, "amx_til": 17, "amx_int8": 17, "prefetchw": 17, "prefetchwt1": 17, "represent": 18, "multidimension": 18, "arrai": 18, "nd": 18, "1d": 18, "semant": 18, "attribut": 18, "coo": 18, "canon": 18, "assign": [18, 32, 33], "2d": 18, "height": 18, "illustr": [18, 19, 21, 31, 33], "actual": [18, 21], "bmp": 18, "contiguous_format": [18, 33], "tensorflow": 18, "close": [18, 31, 33], "to_mkldnn": 18, "difficult": 18, "manipul": 18, "to_dens": 18, "natur": [18, 21, 28], "hold": [18, 33], "secret": 18, "ingredi": 18, "almost": 18, "foundat": 
[18, 33], "upper": [18, 33], "fact": [18, 33], "expens": 18, "benefici": 18, "nb": 18, "me": 18, "roughli": 18, "50": [18, 31, 32], "mkldnn": 18, "mkldnn_util": 18, "subsequ": [18, 33], "concept": [18, 33], "diagram": [18, 33], "hard": [18, 26], "conclus": 18, "necessari": 18, "neglig": 18, "move": [18, 33], "organ": 18, "question": [18, 30], "reinterpret": 18, "answer": [18, 30], "chw": 18, "hw": 18, "stride_n": 18, "stride_c": 18, "stride_h": 18, "stride_w": 18, "merit": 18, "express": [18, 34], "noncontigu": 18, "n1": 18, "n2": 18, "mind": [18, 32], "someth": 18, "reli": [18, 20], "rfc": 18, "hwc": 18, "wc": 18, "chwn": 18, "hwn": 18, "wn": 18, "empti": [18, 31], "outplac": [18, 34], "is_contigu": 18, "_appli": 18, "brief": [18, 28, 34], "imagenet": [18, 30], "spontan": 18, "tell": [18, 20, 33], "NOT": [18, 31], "compris": 18, "explicit": [18, 20, 33], "implicit": 18, "tensoriter": 18, "guidelin": 18, "awar": [18, 20, 31, 32], "my": 18, "upsampl": [18, 34], "cudnn": 18, "accommod": 18, "md": 18, "format_tag": 18, "src_md": 18, "desc": 18, "data_typ": 18, "src_mem": 18, "src_data_ptr": 18, "card": 18, "hwio": 18, "resnext101": [18, 34], "detectron2": 18, "8x": 18, "lamb": [19, 21], "adagrad": [19, 21], "clr": 19, "lr_decai": 19, "state_sum": 19, "addcmul_": 19, "add_": 19, "addcdiv_": 19, "whole": [19, 20, 33], "storag": 19, "onboard": [19, 33], "third": [19, 34], "high": [19, 21, 33], "bound": [19, 20, 28, 33], "bottl": 19, "neck": 19, "prevent": 19, "pseudo": [19, 21, 34], "adagrad_fused_step": 19, "group": [19, 20, 33], "grad0": 19, "grad1": 19, "grad_n": 19, "param_n": 19, "state_sum_n": 19, "adagrad_step": 19, "grad_i": 19, "param_i": 19, "state_sum_i": 19, "other_arg": 19, "coupl": [20, 33, 34], "omp": [20, 26, 31, 32, 33, 34], "ld_preload": [20, 31, 32, 33], "libiomp5": [20, 31, 32, 33], "model_script": 20, "examplenet": 20, "examplenet1": 20, "x1": 20, "start_dim": 20, "examplenet2": 20, "conv2": 20, "x2": 20, "y1": 20, "y2": 20, "model1": 20, 
"traced_model1": 20, "model2": 20, "traced_model2": 20, "multi_stream_model": [20, 34], "datatyp": [20, 34], "receipt": 20, "steam": [20, 34], "input_hint": 20, "output_hint": 20, "pthread": 20, "async": [20, 34], "wake": 20, "synchron": [20, 26, 34], "imper": [20, 34], "suffer": 20, "gil": 20, "hurt": 20, "mitig": [20, 30], "omp_num_thread": [20, 26, 31, 32, 34], "phase": 20, "s1": 20, "c1": 20, "numactl": [20, 31, 32], "outsid": 20, "superset": 20, "undefin": [20, 33], "gb": 20, "simultan": 20, "correspond": [20, 31, 34], "cpu_pool1": 20, "cpu_pool2": 20, "task1": 20, "task2": 20, "y1_futur": 20, "y2_futur": 20, "y_runtim": 20, "kmp_": 20, "fulfil": 20, "worker": [20, 31], "serv": [20, 34], "sub": [20, 28, 33], "wait": [20, 33], "futuretensor": 20, "didn": 20, "dlopen": 20, "symbol": 20, "bottom": 21, "bit": [21, 28], "sign": 21, "expon": 21, "mantissa": 21, "23": [21, 31, 32], "capac": [21, 30], "digit": 21, "shorter": [21, 28], "fewer": 21, "neg": 21, "disadvantag": 21, "shift": 21, "left": [21, 28, 32], "lose": 21, "decim": 21, "valid": [21, 34], "1234500000": 21, "0000012345": 21, "1234512345": 21, "sens": 21, "fraction": 21, "12345": 21, "00000": 21, "signific": 21, "bui": 21, "involv": 21, "ground": 21, "truth": 21, "chain": 21, "rule": [21, 34], "meet": [21, 33, 34], "wide": [21, 34], "understand": [21, 28, 33], "formula": 21, "\u03b1": 21, "gw": 21, "denot": 21, "receiv": 21, "rate": 21, "earlier": 21, "inaccur": 21, "exactli": 21, "kept": 21, "halv": 21, "recov": 21, "fp32_w": 21, "concat_fp32_from_bf16": 21, "bf16_w": 21, "fp32_gw": 21, "bf16_gw": 21, "weight_dacai": 21, "split_bf16_from_fp32": 21, "ratio": [22, 30, 34], "beta": [23, 26], "demostr": 23, "cheat": 23, "sheet": 23, "pypi": [26, 34], "occupi": 26, "remark": [26, 30, 33], "__name__": [26, 34], "__main__": [26, 31, 32, 34], "112": [26, 30, 33, 34], "nnc": 26, "poor": [26, 34], "xlm": 26, "roberta": [26, 34], "casual": 26, "gpt2": 26, "summar": 26, "classif": [26, 30], "allenai": 26, 
"longform": 26, "409": 26, "workaround": [26, 34], "_jit_set_texpr_fuser_en": 26, "csrc": 26, "tensorexpr_fus": 26, "settensorexprfuseren": 26, "longer": [26, 30], "complic": [26, 31, 33], "undergo": [26, 29], "runtimeerror": [26, 34], "overflow": [26, 34], "unpack": [26, 34], "exce": [26, 30, 33, 34], "quantize_per_tensor": 26, "pseudocod": [26, 34], "omp_num_threa": 26, "set_num_thread": [26, 34], "freezed_model": [26, 34], "run_benchmark": [26, 34], "flow": 26, "bag": [26, 34], "progress": [26, 28, 34], "abnorm": [26, 34], "tbd": 26, "transformerencoderlay": 26, "encount": [26, 34], "rnnt": [26, 34], "joint_net": [26, 34], "caller": [26, 34], "apach": [27, 32], "notic": [27, 31, 32], "term": 27, "condit": 27, "multiheadattent": 28, "feedforward": 28, "lot": [28, 34], "besid": [28, 33, 34], "adopt": [28, 34], "modelfamili": 28, "hub": 28, "staticquantizationint8": 28, "onlyquantizationint8": 28, "onlyquantizationint4": 28, "13b": [28, 30, 34], "70b": [28, 34], "8b": 28, "20b": 28, "dolli": [28, 34], "databrick": 28, "v2": [28, 30, 34], "12b": 28, "tiiuae": 28, "40b": 28, "30b": 28, "3b": 28, "bigscienc": 28, "1b7": 28, "salesforc": 28, "2b": 28, "baichuan2": [28, 34], "chat": 28, "thudm": 28, "chatglm3": [28, 34], "chatglm2": [28, 34], "bigcod": 28, "starcod": [28, 34], "flan": 28, "xl": 28, "mosaicml": 28, "mistralai": 28, "v0": 28, "8x7b": 28, "stabilityai": 28, "1_6b": 28, "liuhaotian": 28, "v1": [28, 34], "microsoft": 28, "ieityuan": 28, "yuan2": 28, "102b": 28, "signifi": 28, "perfect": 28, "codellama": 28, "rope": 28, "past": 28, "year": 28, "flourish": 28, "contribut": [28, 31, 34], "research": 28, "web": 28, "legend": 28, "autotp": 28, "obviou": 28, "hotspot": 28, "lead": 28, "significantli": [28, 34], "heavier": 28, "io": 28, "occurr": 28, "ship": 28, "2nd": 28, "4th": [28, 30], "except": [28, 31], "beeter": 28, "Its": 28, "seen": 28, "woq": 28, "integ": [28, 33], "bandwidth": 28, "reorder_cach": 28, "beam_width": 28, "secondli": 28, "elimin": 28, 
"shard": 28, "content": [29, 34], "your_calibration_dataset": 29, "calib_sampl": 29, "calibration_model": 29, "qconfig_summary_file_path": 29, "nf4": 29, "init_distribut": 29, "get_acceler": 29, "communication_backend_nam": 29, "var": 29, "ondevic": 29, "init_infer": 29, "mp_size": 29, "base_dir": 29, "repo_root": 29, "checkpoints_json": 29, "zone": [30, 34], "articl": [30, 33], "llama2": [30, 34], "1024": [30, 33], "were": [30, 31, 32, 33], "carri": 30, "m7i": 30, "m6i": [30, 32], "47x": 30, "62x": 30, "57x": 30, "58x": 30, "85x": 30, "27x": 30, "38x": 30, "29x": 30, "36x": 30, "conclud": [30, 34], "respons": 30, "session": 30, "exhibit": 30, "wherea": 30, "p90": 30, "26x": 30, "sec": 30, "39": [30, 31, 32, 34], "26": [30, 31, 32], "49": [30, 31, 32], "170": 30, "21": [30, 31, 32], "measur": [30, 34], "17th": 30, "16xlarg": 30, "u": [30, 32], "west": 30, "ubuntu": 30, "04": [30, 31], "1009": 30, "sw": 30, "workload1": 30, "inference2": 30, "realtim": 30, "inference3": 30, "tunabl": [30, 32], "8380": 30, "30ghz": 30, "83x": 30, "44x": 30, "ssd": [30, 34], "resnet34": [30, 34], "16x": 30, "coco": 30, "1200": 30, "resnext": 30, "32x16d": 30, "81x": 30, "21x": 30, "vgg": 30, "75x": 30, "19x": 30, "shufflenetv2_x1": 30, "07x": 30, "78x": 30, "04x": 30, "max_seq_len": 30, "384task": 30, "jemalloc": [30, 32, 34], "05x": 30, "96x": 30, "mrpc": 30, "128task": 30, "distilbert": 30, "12x": 30, "dnnl": 30, "base_text_classif": 30, "f1": 30, "81": [30, 31], "79": [30, 31], "93": 30, "02": [30, 32], "85": [30, 31], "86": [30, 31], "top1": 30, "76": [30, 31], "75": [30, 31], "98": 30, "78": [30, 31], "199": 30, "48": [30, 31, 32], "vgg11": 30, "69": [30, 31], "67": [30, 31, 34], "96": 30, "44": [30, 31, 32], "36": [30, 31, 32], "92": 30, "97": 30, "shufflenet": 30, "histogram": [30, 34], "40": [30, 31, 32, 34], "ucod": 30, "0xd0002a0": 30, "ON": 30, "turboboost": 30, "bio": 30, "ddr": 30, "16gb": 30, "3200": 30, "dcpmm": 30, "256gb": 30, "host": [30, 34], "cento": 30, "2105": 
30, "18": [30, 31, 32], "305": 30, "el8_4": 30, "x86_64": 30, "docker": [30, 34], "spectr": 30, "meltdown": 30, "24x": 30, "31x": 30, "15x": 30, "30x": 30, "mobilenet": 30, "08x": 30, "03x": 30, "09x": 30, "39x": 30, "35x": 30, "160": 30, "55x": 30, "06x": 30, "fpn": 30, "71x": 30, "20x": 30, "13x": 30, "32x": 30, "48x": 30, "11x": 30, "terabyt": 30, "14x": 30, "02x": 30, "10x": 30, "33x": 30, "8380h": 30, "90ghz": 30, "56": [30, 31, 32, 33], "67x": 30, "45x": 30, "77x": 30, "18x": 30, "formerli": [30, 33, 34], "0x700001c": 30, "wlydcrb1": 30, "sy": 30, "0016": 30, "p29": 30, "2006080250": 30, "64gb": 30, "768gb": 30, "influenc": [31, 33], "properli": 31, "themselv": [31, 34], "free": [31, 34], "mainli": [31, 34], "around": 31, "interpret": 31, "prefix": 31, "cross": [31, 32, 33, 34], "taskset": 31, "malloc_conf": [31, 33], "crash": [31, 33, 34], "nnode": 31, "nproc": 31, "count": 31, "addr": 31, "ip": 31, "hostnam": 31, "proc": 31, "port": 31, "hostfil": 31, "mpi": 31, "mpiexec": 31, "hydra": 31, "ppn": 31, "genv": 31, "i_mpi_pin_domain": 31, "codeless": 31, "ut": 31, "exclus": 31, "mutual": 31, "ld": 31, "favorit": 31, "kmp": [31, 33], "granular": [31, 32, 33], "compact": [31, 32, 33], "stdout": 31, "afterward": [31, 33], "undesir": 31, "_timestamp_inst": 31, "_timestamp_instance_": 31, "_core": 31, "run_20210712212258_inst": 31, "run_20210712212258_instance_0_cores_0": 31, "gif": 31, "07": 31, "764": 31, "conda_prefix": [31, 32], "virtual_env": [31, 32], "lib64": [31, 32], "home": [31, 32], "drop": [31, 32], "kmp_affin": [31, 32, 33], "kmp_blocktim": [31, 32, 33], "14": [31, 32, 34], "24": [31, 32], "25": [31, 32], "27": [31, 32, 33], "30": [31, 32], "31": [31, 32], "34": [31, 32], "35": [31, 32], "37": [31, 32, 34], "41": [31, 32], "42": [31, 32], "tee": 31, "run_20210712223308_inst": 31, "run_20210712223308_instance_0_cores_0": 31, "87": 31, "08": 31, "117": 31, "88": 31, "118": 31, "45": [31, 32], "46": [31, 32], "47": [31, 32], "51": [31, 32], "52": [31, 
32], "53": [31, 32], "54": [31, 32], "55": [31, 32, 33], "57": 31, "59": 31, "60": 31, "61": 31, "62": 31, "63": [31, 34], "65": 31, "68": [31, 34], "70": 31, "71": 31, "72": 31, "73": 31, "74": 31, "77": 31, "82": 31, "83": [31, 33], "run_20210712214504_inst": 31, "run_20210712214504_instance_0_cores_22": 31, "513": 31, "run_20210712220928_inst": 31, "run_20210712220928_instance_0_cores_0": 31, "355": 31, "356": 31, "deduct": 31, "run_20210712221615_inst": 31, "run_20210712221615_instance_0_cores_11": 31, "591": 31, "run_20210712221150_inst": 31, "run_20210712221150_instance_0_cores_0": 31, "run_20210712221150_instance_1_cores_22": 31, "233": 31, "236": 31, "run_20210712221415_inst": 31, "run_20210712221415_instance_0_cores_0": 31, "run_20210712221415_instance_1_cores_4": 31, "run_20210712221415_instance_2_cores_8": 31, "run_20210712221415_instance_3_cores_12": 31, "run_20210712221415_instance_4_cores_16": 31, "run_20210712221415_instance_5_cores_20": 31, "run_20210712221415_instance_6_cores_24": 31, "run_20210712221415_instance_7_cores_28": 31, "run_20210712221415_instance_8_cores_32": 31, "run_20210712221415_instance_9_cores_36": 31, "run_20210712221415_instance_10_cores_40": 31, "140": 31, "143": 31, "146": 31, "149": 31, "151": 31, "154": 31, "157": 31, "159": 31, "162": 31, "164": 31, "167": 31, "run_20210712221305_inst": 31, "run_20210712221305_instance_0_cores_0": 31, "run_20210712221305_instance_1_cores_11": 31, "run_20210712221305_instance_2_cores_22": 31, "run_20210712221305_instance_3_cores_33": 31, "470": 31, "471": 31, "473": 31, "476": 31, "479": 31, "instance_idx": 31, "independ": 31, "confirm": 31, "175": 31, "176": 31, "177": 31, "run_20220106130151_instance_0_cores_0": 31, "sometim": [31, 33], "235": 31, "jemallocl": 31, "oversize_threshold": [31, 33], "background_thread": [31, 33], "metadata_thp": [31, 33], "dirty_decay_m": [31, 33], "9000000000": [31, 33], "muzzy_decay_m": [31, 33], "libjemalloc": 31, "run_20210713153048_instance_0_cores_0": 
31, "654": 31, "libtcmalloc": [31, 32], "655": 31, "run_20210713153333_instance_0_cores_0": 31, "784": 31, "run_20210713153659_instance_0_cores_0": 31, "blocktim": 31, "00": [31, 34], "760": [31, 32], "761": [31, 32], "omp_schedul": [31, 33], "omp_proc_bind": [31, 33], "run_20210713152500_instance_0_cores_0": 31, "give": [32, 34], "ipex_en": 32, "procedur": 32, "tunin": 32, "dramat": [32, 33], "cpu_launcher_en": 32, "cpu_launcher_arg": 32, "hyperthread": 32, "present": 32, "ital": 32, "ptmalloc": 32, "use_default_alloc": [32, 34], "tcmalloc": 32, "enable_tcmalloc": 32, "enable_jemalloc": 32, "nth": [32, 33], "uniform": 32, "overlap": 32, "signficantli": 32, "8180": 32, "affinit": 32, "addition": 32, "kill": 32, "unutil": 32, "restart": 32, "remain": 32, "aliv": 32, "taken": 32, "care": 32, "worri": 32, "continu": [32, 34], "Then": 32, "interrupt": 32, "dummi": 32, "dummy_tensor": 32, "scheme": 32, "bert_int8_jit": 32, "n_iter": 32, "rn50_int8_jit": 32, "usus": 32, "rn50_ipex_int8": 32, "handler": 32, "image_classifi": 32, "similarli": 32, "bert_ipex_int8": 32, "transformer_handler_gener": 32, "setup_config": 32, "seq_classification_artifact": 32, "index_to_nam": 32, "nc": 32, "model_stor": 32, "server": [32, 33], "rest": 32, "model_log": 32, "096": 32, "8375c": 32, "03": 32, "981": 32, "982": 32, "previous": 32, "cases": 32, "223": 32, "site": 32, "model_service_work": 32, "sock": 32, "unix": 32, "9000": 32, "762": 32, "763": 32, "9001": 32, "274": 32, "9002": 32, "975": 32, "9003": 32, "bench": 32, "amazon": 32, "ec2": 32, "24xlarg": 32, "reproduc": 32, "url": [32, 34], "modelurl": 32, "inputpath": 32, "concurr": [32, 33], "huggingface_transform": 32, "sample_text_captum_input": 32, "graphic": 33, "xe": 33, "briefli": 33, "background": 33, "knowledg": 33, "c620": 33, "seri": 33, "chipset": 33, "purlei": 33, "chip": 33, "inclus": 33, "1mb": 33, "l2": 33, "2666": 33, "mhz": 33, "ddr4": 33, "six": 33, "ultra": 33, "interconnect": 33, "upi": 33, "microarchitectur": 
33, "connect": 33, "transfer": 33, "equip": 33, "motherboard": 33, "attach": 33, "remot": 33, "asu": 33, "z11pa": 33, "d8": 33, "competit": 33, "stall": 33, "busi": 33, "uma": 33, "lscpu": 33, "retriev": 33, "111": 33, "50ghz": 33, "node0": 33, "node1": 33, "sophist": 33, "brought": [33, 34], "polici": 33, "put": 33, "sysctl": 33, "great": 33, "placement": 33, "cpunodebind": 33, "membind": 33, "multithread": 33, "primari": 33, "consecut": 33, "join": 33, "libgomp": 33, "libiomp": 33, "hang": [33, 34], "gomp_cpu_affin": 33, "comma": 33, "invalid": 33, "thrash": 33, "did": [33, 34], "compet": 33, "unus": 33, "proclist": 33, "millisecond": 33, "sleep": 33, "200m": 33, "period": 33, "elaps": 33, "overal": 33, "appropri": 33, "reserv": 33, "sole": 33, "penal": 33, "role": 33, "unnecessari": 33, "destruct": 33, "emphas": 33, "fragment": 33, "mmuzzy_decay_m": 33, "forg": 33, "dealloc": 33, "costli": 33, "gpertool": 33, "plu": 33, "pretti": 33, "nifti": 33, "analysi": 33, "gperftool": 33, "set_flush_denorm": 33, "warm": 33, "therefor": 33, "threshold": 33, "usuali": 33, "come": 33, "maskrcnn": [33, 34], "wav2vec2": 33, "recognit": 33, "onednn_primitive_cache_capac": 33, "65536": 33, "voic": 33, "excit": 34, "announc": 34, "accompani": 34, "privat": 34, "broader": 34, "sincer": 34, "encourag": 34, "feedback": 34, "creator": 34, "reach": 34, "hf_beam_sampl": 34, "hf_beam_search": 34, "hf_greedy_search": 34, "hf_sampl": 34, "walk": 34, "2561": 34, "2584": 34, "2617": 34, "2663": 34, "2733": 34, "act": 34, "2550": 34, "2568": 34, "2641": 34, "2675": 34, "2613": 34, "upgrad": 34, "v3": 34, "2747": 34, "misc": 34, "2468": 34, "2627": 34, "2631": 34, "2704": 34, "changelog": 34, "optimize_transform": 34, "your_generation_param": 34, "newli": 34, "varianc": 34, "encod": 34, "2349": 34, "2412": 34, "2469": 34, "2476": 34, "flash": 34, "2317": 34, "2334": 34, "2392": 34, "2480": 34, "elser": 34, "2491": 34, "public": 34, "2473": 34, "2511": 34, "2433": 34, "2253": 34, "2251": 34, 
"2236": 34, "2278": 34, "2257": 34, "dockerfil": 34, "ux": 34, "2229": 34, "2195": 34, "2299": 34, "2315": 34, "2283": 34, "2280": 34, "2292": 34, "2275": 34, "2319": 34, "2198": 34, "2264": 34, "2290": 34, "experiment": 34, "workflow": 34, "1563": 34, "excess": 34, "1677": 34, "1688": 34, "1664": 34, "lar": 34, "1695": 34, "dictionari": 34, "1682": 34, "2137": 34, "1568": 34, "1585": 34, "1590": 34, "1587": 34, "1594": 34, "old": 34, "hypervisor": 34, "vm": 34, "1513": 34, "1593": 34, "padding_mod": 34, "1580": 34, "1566": 34, "transnetv2": 34, "1564": 34, "rnn": 34, "avx512_core_vnni": 34, "1592": 34, "1589": 34, "1517": 34, "hero": 34, "inspir": 34, "stanford": 34, "consumpt": 34, "ve": 34, "1341": 34, "instancenorm": 34, "1330": 34, "1414": 34, "1473": 34, "1419": 34, "1488": 34, "webpag": 34, "1318": 34, "1353": 34, "1328": 34, "1355": 34, "1367": 34, "1384": 34, "1295": 34, "1392": 34, "1376": 34, "1373": 34, "1338": 34, "1391": 34, "1322": 34, "usabl": 34, "effort": 34, "cv": 34, "refin": 34, "identifi": 34, "torchrun": 34, "shortcut": 34, "mkl": 34, "sgemm": 34, "geomean": 34, "auto_ipex": 34, "hood": 34, "calibrated_model": 34, "model_to_be_calibr": 34, "992": 34, "64byte": 34, "addlayernorm": 34, "retinanet": 34, "1032": 34, "1053": 34, "1074": 34, "tightli": 34, "matur": 34, "offlin": 34, "becam": 34, "bake": 34, "wave2vec": 34, "albert": 34, "facilit": 34, "minmax": 34, "movingaverageminmax": 34, "polish": 34, "flexibl": 34, "quantconf": 34, "multi_stream_input_hint": 34, "multi_stream_output_hint": 34, "adam": 34, "822": 34, "3d": 34, "642": 34, "deconv3d": 34, "692": 34, "787": 34, "swish": 34, "fsi": 34, "risk": 34, "551": 34, "leakyrelu": 34, "589": 34, "407": 34, "647": 34, "convolution1d": 34, "657": 34, "einsum": 34, "alphafold2": 34, "674": 34, "711": 34, "threa": 34, "slow": 34, "equival": 34, "joint": 34, "net": 34, "pend": 34, "648": 34, "684": 34, "685": 34, "dockerhub": 34, "wheel": 34, "sdk": 34, "2x": 34, "5x": 34, "reduct": 34, "center": 
34, "deploi": 34, "u8": 34, "s8": 34, "satur": 34, "occur": 34, "u7": 34, "unsign": 34, "s7": 34, "worth": 34, "upload": 34, "pip3": 34, "whl": 34, "220mb": 34, "5mb": 34, "dep": 34, "220m": 34, "cxx11": 34, "224m": 34, "7m": 34, "5m": 34, "qkv": 34, "278": 34, "531": 34, "432": 34, "438": 34, "602": 34, "sliu": 34, "hardsigmoid": 34, "relu6": 34, "selu": 34, "524": 34, "452": 34, "425": 34, "100mb": 34, "40mb": 34, "meant": 34, "resolv": 34, "te": 34, "wrap": 34, "bactchnorm": 34, "205": 34, "straightforward": 34, "underhood": 34, "torchvison": 34, "hugginfac": 34, "legal": 34, "resnet18": 34, "resnet18_xpu": 34, "enable_auto_mixed_precis": 34, "mixed_dtyp": 34, "mymodel": 34, "xx_c": 34, "xx_v": 34, "clibrat": 34, "ampconf": 34, "automixprecis": 34, "running_mod": 34, "cali_dataset": 34, "trace_model": 34, "omp_set_num_thread": 34, "model_execut": 34, "same_model_execution_again": 34, "descriptor": 34, "rc3": 34, "parti": 34, "49786": 34, "rc": 34, "readm": 34, "stakehold": 34, "5rc3": 34, "dpcpp": 34, "heterogen": 34, "bfp16": 34, "proper": 34, "tacotron2": 34, "frozenbatchnorm": 34, "embeddingbad": 34, "daili": 34, "resnext3d": 34, "maskrnn": 34, "codenam": 34, "mlp": 34, "eltwis": 34, "7x": 34, "enable_auto_optim": 34, "streamlin": 34, "enable_auto_mix_precis": 34, "inject": 34, "resnet3d": 34, "fb": 34, "yolov3": 34, "maxpool": 34}, "objects": {"": [[2, 0, 0, "-", "intel_extension_for_pytorch"]], "intel_extension_for_pytorch.cpu": [[2, 0, 0, "-", "runtime"]], "intel_extension_for_pytorch.cpu.runtime": [[2, 1, 1, "", "CPUPool"], [2, 1, 1, "", "MultiStreamModule"], [2, 1, 1, "", "MultiStreamModuleHint"], [2, 1, 1, "", "Task"], [2, 2, 1, "", "get_core_list_of_node_id"], [2, 2, 1, "", "is_runtime_ext_enabled"], [2, 1, 1, "", "pin"]], "intel_extension_for_pytorch": [[2, 2, 1, "", "enable_onednn_fusion"], [2, 2, 1, "", "fast_bert"], [2, 0, 0, "-", "llm"], [2, 2, 1, "", "optimize"], [2, 0, 0, "-", "quantization"], [2, 1, 1, "", "verbose"]], 
"intel_extension_for_pytorch.llm": [[2, 0, 0, "-", "functional"], [2, 0, 0, "-", "modules"], [2, 2, 1, "", "optimize"]], "intel_extension_for_pytorch.llm.functional": [[2, 2, 1, "", "fast_layer_norm"], [2, 2, 1, "", "indirect_access_kv_cache_attention"], [2, 2, 1, "", "rms_norm"], [2, 2, 1, "", "rotary_embedding"], [2, 2, 1, "", "varlen_attention"]], "intel_extension_for_pytorch.llm.modules": [[2, 1, 1, "", "FastLayerNorm"], [2, 1, 1, "", "IndirectAccessKVCacheAttention"], [2, 1, 1, "", "Linear2SiluMul"], [2, 1, 1, "", "LinearAdd"], [2, 1, 1, "", "LinearAddAdd"], [2, 1, 1, "", "LinearGelu"], [2, 1, 1, "", "LinearMul"], [2, 1, 1, "", "LinearNewGelu"], [2, 1, 1, "", "LinearRelu"], [2, 1, 1, "", "LinearSilu"], [2, 1, 1, "", "LinearSiluMul"], [2, 1, 1, "", "PagedAttention"], [2, 1, 1, "", "RMSNorm"], [2, 1, 1, "", "RotaryEmbedding"], [2, 1, 1, "", "VarlenAttention"]], "intel_extension_for_pytorch.nn": [[7, 1, 1, "", "FrozenBatchNorm2d"]], "intel_extension_for_pytorch.nn.functional": [[7, 2, 1, "", "interaction"]], "intel_extension_for_pytorch.nn.modules": [[7, 1, 1, "", "MergedEmbeddingBag"], [7, 1, 1, "", "MergedEmbeddingBagWithSGD"]], "intel_extension_for_pytorch.quantization": [[2, 2, 1, "", "autotune"], [2, 2, 1, "", "convert"], [2, 2, 1, "", "get_smooth_quant_qconfig_mapping"], [2, 2, 1, "", "prepare"]]}, "objtypes": {"0": "py:module", "1": "py:class", "2": "py:function"}, "objnames": {"0": ["py", "module", "Python module"], "1": ["py", "class", "Python class"], "2": ["py", "function", "Python function"]}, "titleterms": {"intel": [0, 1, 5, 6, 15, 30, 31, 32, 33], "extens": [0, 1, 5, 7, 15, 20, 26, 32], "pytorch": [0, 1, 5, 15, 18, 32], "cpu": [0, 2, 17, 18, 33], "isa": [0, 7, 17], "dynam": [0, 6, 7, 15, 17, 26], "dispatch": [0, 7, 17], "design": [0, 17, 20, 31], "doc": 0, "architectur": 1, "support": [1, 8, 10], "api": [2, 7, 9, 13, 16, 17, 18, 22, 25, 28, 29], "document": [2, 5, 25, 32, 33], "gener": [2, 26], "llm": [2, 6, 7, 23, 28, 30], "modul": [2, 10, 20, 
28], "level": [2, 17, 28], "optim": [2, 7, 10, 13, 15, 19, 28, 29], "prototyp": [2, 6, 7, 10, 11, 12, 14, 16, 22, 28], "fast": [2, 6, 7, 11], "bert": [2, 6, 7, 11, 32], "graph": [2, 7, 12, 13, 28], "quantiz": [2, 6, 7, 15, 16, 29], "runtim": [2, 7, 20, 26], "blog": 3, "public": 3, "cheat": 4, "sheet": 4, "contribut": 5, "develop": 5, "tip": 5, "debug": [5, 17], "unit": 5, "test": 5, "python": [5, 6, 7], "better": 5, "local": 5, "pytest": 5, "lint": 5, "c": [5, 6, 18], "write": [5, 18], "build": [5, 17], "exampl": [6, 10, 11, 12, 14, 16, 17, 20, 31], "train": [6, 8], "singl": [6, 28, 31], "instanc": [6, 28, 30, 31], "float32": [6, 8], "bfloat16": [6, 8, 21, 26, 30], "distribut": [6, 28, 29], "infer": [6, 8, 28, 29, 31, 32], "eager": [6, 8], "mode": [6, 28, 31], "resnet50": [6, 32], "torchscript": [6, 8], "torchdynamo": [6, 26], "beta": [6, 7], "new": [6, 7, 34], "featur": [6, 7, 11, 12, 17], "from": [6, 7], "2": [6, 7, 14, 32, 34], "0": [6, 7, 34], "int8": [6, 7, 13, 16, 26, 30, 32], "static": [6, 15], "calibr": [6, 15], "deploy": 6, "larg": [6, 7, 28], "languag": [6, 7, 28], "model": [6, 7, 13, 15, 18, 20, 28, 32], "fp32": [6, 10, 13, 29, 30], "bf16": [6, 10, 13, 29], "smooth": [6, 16, 22], "weight": [6, 29], "onli": [6, 29], "int4": 6, "ai": [6, 30], "refer": [6, 8], "easi": 7, "us": [7, 8, 9, 10, 13, 16, 20, 31], "1": [7, 14, 32, 34], "torch": 7, "compil": [7, 17], "auto": [7, 8, 9, 16, 20], "channel": [7, 9, 18, 33], "last": [7, 9, 18, 33], "mix": [7, 8], "precis": [7, 8, 28], "amp": [7, 8], "oper": [7, 18, 19, 28], "codeless": [7, 10], "13": [7, 34], "captur": [7, 12], "hypertun": [7, 14], "introduct": [8, 19, 25], "case": [8, 10, 20], "default": [8, 9, 14, 18, 31], "path": 8, "autocast": 8, "op": 8, "elig": 8, "specif": [8, 17], "behavior": 8, "can": 8, "promot": 8, "widest": 8, "input": [8, 20], "type": [8, 28], "eas": [9, 13], "enabl": 9, "disabl": 9, "known": [9, 20, 34], "issu": [9, 20, 34], "motiv": 10, "usag": [10, 11, 12, 14, 16, 20, 26, 29, 31], 
"huggingfac": 10, "The": 10, "origin": 10, "command": 10, "ipex": [10, 28], "launch": [10, 31], "appli": 10, "forward": 10, "method": 10, "explicitli": 10, "instead": 10, "__call__": 10, "attr": 10, "alreadi": 10, "jit": 10, "trace": 10, "descript": [11, 12], "prerequisit": 11, "methodologi": [13, 28], "fusion": [13, 19], "pattern": 13, "fold": 13, "your_conf_fil": 14, "hyperparamet": 14, "launcher": [14, 32], "defin": [14, 15], "search": 14, "space": 14, "tune": [14, 16, 22, 33], "user": 14, "your_python_script": 14, "qconfig": 15, "prepar": 15, "do": 15, "convert": 15, "deploi": [15, 32], "recip": [16, 20, 22], "autotun": 16, "algorithm": 16, "alpha": [16, 34], "fix": 16, "determin": 16, "through": 16, "overview": [17, 28, 30, 31, 33], "requir": [17, 20], "code": 17, "folder": 17, "struct": 17, "kernel": [17, 18], "implement": [17, 20], "csrc": 17, "aten": [17, 18], "xyzkrnl": 17, "cpp": 17, "stub": 17, "xyz": 17, "h": 17, "dyndisp": 17, "dispatchstub": 17, "codegen": 17, "process": 17, "add": 17, "custom": [17, 28], "intrin": 17, "vec": 17, "privat": 17, "select": 17, "manual": 17, "check": 17, "what": [18, 34], "i": [18, 20, 31], "memori": [18, 31, 33], "format": 18, "all": [18, 31], "That": 18, "matter": 18, "nchw": 18, "b": 18, "nhwc": 18, "wip": 18, "block": 18, "nchw16c": 18, "stride": 18, "layout": 18, "tensor": 18, "creation": 18, "convers": 18, "d": 18, "coverag": 18, "statu": 18, "regist": [18, 32], "nativ": 18, "manner": 18, "onednn": [18, 33], "creat": [18, 32], "convolut": 18, "primit": [18, 33], "target": 18, "multistream": 20, "examples1": 20, "basic": 20, "examples2": 20, "set": 20, "examples3": 20, "structur": [20, 33], "output": 20, "perform": [20, 26, 30, 32, 33, 34], "asynchron": 20, "task": 20, "configur": [20, 30, 33], "core": [20, 31, 32], "bind": 20, "detail": 20, "how": 20, "iomp": 20, "preload": 20, "load": 20, "dure": 20, "split": 21, "sgd": 21, "stochast": 21, "gradient": 21, "descent": 21, "quant": 22, "quick": 23, "start": [23, 25, 
32], "instal": [24, 32], "get": 25, "troubleshoot": 26, "regress": 26, "shape": 26, "result": [26, 34], "correct": 26, "licens": 27, "list": 28, "verifi": 28, "via": 28, "deepspe": [28, 29], "demo": 28, "linear": 28, "low": 28, "data": [28, 30], "indirect": 28, "access": [28, 33], "kv": 28, "cach": [28, 33], "transform": 29, "frontend": 29, "pseudocod": 29, "common": 29, "scenario": 29, "smoothquant": 29, "woq": 29, "center": 30, "product": 30, "v1": 30, "11": [30, 34], "number": [30, 31, 33], "accuraci": 30, "softwar": [30, 33], "version": 30, "hardwar": [30, 33], "200": [30, 34], "an": 30, "aw": 30, "ec2": 30, "c6i": 30, "2xlarg": 30, "10": [30, 34], "script": 31, "guid": [31, 33], "physic": 31, "ii": 31, "includ": 31, "logic": 31, "iii": 31, "node": 31, "iv": 31, "your": 31, "multipl": 31, "v": 31, "throughput": 31, "vi": 31, "latenc": 31, "vii": 31, "viii": 31, "index": 31, "jemalloc": [31, 33], "tcmalloc": [31, 33], "alloc": [31, 33], "openmp": [31, 33], "librari": 31, "gnu": [31, 33], "torchserv": 32, "content": [32, 33], "thi": [32, 33], "serv": 32, "pin": 32, "boost": 32, "multi": 32, "worker": 32, "scale": 32, "export": 32, "serial": 32, "file": 32, "archiv": 32, "3": [32, 34], "4": 32, "benchmark": 32, "non": 33, "uniform": 33, "numa": 33, "numactl": 33, "omp_num_thread": 33, "omp_thread_limit": 33, "denorm": 33, "releas": 34, "highlight": 34, "100": 34, "12": 34, "300": 34, "": 34, "chang": 34, "9": 34, "8": 34, "improv": 34, "other": 34, "note": 34}, "envversion": {"sphinx.domains.c": 3, "sphinx.domains.changeset": 1, "sphinx.domains.citation": 1, "sphinx.domains.cpp": 9, "sphinx.domains.index": 1, "sphinx.domains.javascript": 3, "sphinx.domains.math": 2, "sphinx.domains.python": 4, "sphinx.domains.rst": 2, "sphinx.domains.std": 2, "sphinx": 58}, "alltitles": {"Intel\u00ae Extension for PyTorch* CPU ISA Dynamic Dispatch Design Doc": [[0, "intel-extension-for-pytorch-cpu-isa-dynamic-dispatch-design-doc"]], "Intel\u00ae Extension for PyTorch*": [[1, 
"intel-extension-for-pytorch"]], "Architecture": [[1, "architecture"]], "Support": [[1, "support"]], "API Documentation": [[2, "api-documentation"], [25, "api-documentation"]], "General": [[2, "general"]], "LLM Module Level Optimizations (Prototype)": [[2, "llm-module-level-optimizations-prototype"]], "Fast Bert (Prototype)": [[2, "fast-bert-prototype"], [6, "fast-bert-prototype"]], "Graph Optimization": [[2, "graph-optimization"], [7, "graph-optimization"], [13, "graph-optimization"], [28, "graph-optimization"]], "Quantization": [[2, "module-intel_extension_for_pytorch.quantization"]], "CPU Runtime": [[2, "module-intel_extension_for_pytorch.cpu.runtime"]], "Blogs & Publications": [[3, "blogs-publications"]], "Cheat Sheet": [[4, "cheat-sheet"]], "Contribution": [[5, "contribution"]], "Contributing to Intel\u00ae Extension for PyTorch*": [[5, "contributing-to-intel-extension-for-pytorch"]], "Developing Intel\u00ae Extension for PyTorch*": [[5, "developing-intel-extension-for-pytorch"]], "Tips and Debugging": [[5, "tips-and-debugging"]], "Unit testing": [[5, "unit-testing"]], "Python Unit Testing": [[5, "python-unit-testing"]], "Better local unit tests with pytest": [[5, "better-local-unit-tests-with-pytest"]], "Local linting": [[5, "local-linting"]], "C++ Unit Testing": [[5, "c-unit-testing"]], "Writing documentation": [[5, "writing-documentation"]], "Building documentation": [[5, "building-documentation"]], "Tips": [[5, "tips"]], "Examples": [[6, "examples"]], "Python": [[6, "python"]], "Training": [[6, "training"]], "Single-instance Training": [[6, "single-instance-training"]], "Float32": [[6, "float32"], [6, "id1"]], "BFloat16": [[6, "bfloat16"], [6, "id6"], [21, "bfloat16"], [26, "bfloat16"]], "Distributed Training": [[6, "distributed-training"]], "Inference": [[6, "inference"]], "Eager Mode": [[6, "eager-mode"], [6, "id7"]], "Resnet50": [[6, "resnet50"], [6, "id2"], [6, "id4"], [6, "id8"], [6, "id11"], [6, "id14"]], "BERT": [[6, "bert"], [6, "id3"], [6, "id5"], 
[6, "id9"], [6, "id12"], [6, "id15"], [32, "bert"]], "TorchScript Mode": [[6, "torchscript-mode"], [6, "id10"]], "TorchDynamo Mode (Beta, NEW feature from 2.0.0)": [[6, "torchdynamo-mode-beta-new-feature-from-2-0-0"], [6, "id13"]], "INT8": [[6, "int8"], [26, "int8"]], "Static Quantization": [[6, "static-quantization"], [15, "static-quantization"]], "Calibration": [[6, "calibration"]], "Deployment": [[6, "deployment"]], "Dynamic Quantization": [[6, "dynamic-quantization"], [15, "dynamic-quantization"]], "Large Language Model (LLM)": [[6, "large-language-model-llm"]], "FP32/BF16": [[6, "fp32-bf16"], [29, "fp32-bf16"]], "Smooth Quantization INT8": [[6, "smooth-quantization-int8"]], "Weight Only Quantization INT8/INT4": [[6, "weight-only-quantization-int8-int4"]], "C++": [[6, "c"]], "Intel\u00ae AI Reference Models": [[6, "intel-ai-reference-models"]], "Features": [[7, "features"]], "Easy-to-use Python API": [[7, "easy-to-use-python-api"]], "Large Language Models (LLM, NEW feature from 2.1.0)": [[7, "large-language-models-llm-new-feature-from-2-1-0"]], "torch.compile (Beta, NEW feature from 2.0.0)": [[7, "torch-compile-beta-new-feature-from-2-0-0"]], "ISA Dynamic Dispatching": [[7, "isa-dynamic-dispatching"], [17, "isa-dynamic-dispatching"]], "Auto Channels Last": [[7, "auto-channels-last"], [9, "auto-channels-last"]], "Auto Mixed Precision (AMP)": [[7, "auto-mixed-precision-amp"], [8, "auto-mixed-precision-amp"]], "Operator Optimization": [[7, "operator-optimization"]], "Optimizer Optimization": [[7, "optimizer-optimization"]], "Runtime Extension": [[7, "runtime-extension"], [20, "runtime-extension"], [26, "runtime-extension"]], "INT8 Quantization": [[7, "int8-quantization"]], "Codeless Optimization (Prototype, NEW feature from 1.13.0)": [[7, "codeless-optimization-prototype-new-feature-from-1-13-0"]], "Graph Capture (Prototype, NEW feature from 1.13.0)": [[7, "graph-capture-prototype-new-feature-from-1-13-0"]], "HyperTune (Prototype, NEW feature from 1.13.0)": [[7, 
"hypertune-prototype-new-feature-from-1-13-0"]], "Fast BERT Optimization (Prototype, NEW feature from 2.0.0)": [[7, "fast-bert-optimization-prototype-new-feature-from-2-0-0"]], "Introduction": [[8, "introduction"], [19, "introduction"], [25, "introduction"]], "Use Case": [[8, "use-case"]], "Default Precision": [[8, "default-precision"]], "Inference with Eager Path": [[8, "inference-with-eager-path"]], "Inference with TorchScript Path": [[8, "inference-with-torchscript-path"]], "Training Support": [[8, "training-support"]], "Autocast Op Reference": [[8, "autocast-op-reference"]], "Op Eligibility": [[8, "op-eligibility"]], "Op-Specific Behavior": [[8, "op-specific-behavior"]], "Ops that can autocast to bfloat16": [[8, "ops-that-can-autocast-to-bfloat16"]], "Ops that can autocast to float32": [[8, "ops-that-can-autocast-to-float32"]], "Ops that promote to the widest input type": [[8, "ops-that-promote-to-the-widest-input-type"]], "Ease-of-use auto channels last API": [[9, "ease-of-use-auto-channels-last-api"]], "default": [[9, "default"]], "enable": [[9, "enable"]], "disable": [[9, "disable"]], "Known issue": [[9, "known-issue"], [34, "known-issue"], [34, "id43"]], "Codeless Optimization (Prototype)": [[10, "codeless-optimization-prototype"]], "Motivation": [[10, "motivation"]], "Example Usage with HuggingFace": [[10, "example-usage-with-huggingface"]], "The origin command with ipex launch": [[10, "the-origin-command-with-ipex-launch"]], "Command to apply ipex optimization for FP32": [[10, "command-to-apply-ipex-optimization-for-fp32"]], "Command to apply ipex optimization for BF16": [[10, "command-to-apply-ipex-optimization-for-bf16"]], "Use Case not supported": [[10, "use-case-not-supported"]], "Module uses forward method explicitly instead of the __call__ attr": [[10, "module-uses-forward-method-explicitly-instead-of-the-call-attr"]], "Already using ipex.optimize": [[10, "already-using-ipex-optimize"]], "Already using Jit Trace": [[10, "already-using-jit-trace"]], 
"Fast BERT (Prototype)": [[11, "fast-bert-prototype"]], "Feature Description": [[11, "feature-description"], [12, "feature-description"]], "Prerequisite": [[11, "prerequisite"]], "Usage Example": [[11, "usage-example"], [12, "usage-example"], [16, "usage-example"]], "Graph Capture (Prototype)": [[12, "graph-capture-prototype"]], "Ease-of-use graph optimization API": [[13, "ease-of-use-graph-optimization-api"]], "FP32 and BF16 models": [[13, "fp32-and-bf16-models"]], "INT8 models": [[13, "int8-models"]], "Methodology": [[13, "methodology"]], "Fusion": [[13, "fusion"]], "FP32 and BF16 fusion patterns": [[13, "fp32-and-bf16-fusion-patterns"]], "INT8 fusion patterns": [[13, "int8-fusion-patterns"]], "Folding": [[13, "folding"]], "HyperTune (Prototype)": [[14, "hypertune-prototype"]], "Usage of Hypertune": [[14, "usage-of-hypertune"]], "your_conf_file": [[14, "your-conf-file"]], "Hyperparameters": [[14, "hyperparameters"]], "Launcher Hyperparameters": [[14, "launcher-hyperparameters"]], "Defining hyperparameters and their search spaces": [[14, "defining-hyperparameters-and-their-search-spaces"]], "1. Defining hyperparameters to tune:": [[14, "defining-hyperparameters-to-tune"]], "2. 
Defining the search spaces of the hyperparameters:": [[14, "defining-the-search-spaces-of-the-hyperparameters"]], "Default search space": [[14, "default-search-space"]], "User defined search space": [[14, "user-defined-search-space"]], "": [[14, "your-python-script"]], "Usage Examples": [[14, "usage-examples"], [31, "usage-examples"]], "Intel\u00ae Extension for PyTorch* optimizations for quantization": [[15, "intel-extension-for-pytorch-optimizations-for-quantization"]], "Define qconfig": [[15, "define-qconfig"]], "Prepare Model and Do Calibration": [[15, "prepare-model-and-do-calibration"]], "Convert to Static Quantized Model and Deploy": [[15, "convert-to-static-quantized-model-and-deploy"]], "Define QConfig": [[15, "id1"]], "Prepare Model": [[15, "prepare-model"]], "Convert to Dynamic Quantized Model and Deploy": [[15, "convert-to-dynamic-quantized-model-and-deploy"]], "INT8 Recipe Tuning API (Prototype)": [[16, "int8-recipe-tuning-api-prototype"]], "Smooth Quantization Autotune": [[16, "smooth-quantization-autotune"]], "Algorithm: Auto-tuning of $\\alpha$.": [[16, "algorithm-auto-tuning-of-alpha"]], "$\\alpha$ Usage": [[16, "alpha-usage"]], "Using a fixed alpha": [[16, "using-a-fixed-alpha"]], "Determining the alpha through auto-tuning": [[16, "determining-the-alpha-through-auto-tuning"]], "Overview": [[17, "overview"], [30, "overview"], [31, "overview"], [33, "overview"]], "CPU ISA build compiler requirement": [[17, "cpu-isa-build-compiler-requirement"]], "Dynamic Dispatch Design": [[17, "dynamic-dispatch-design"]], "Code Folder Struct": [[17, "code-folder-struct"]], "Kernel implementation: csrc/cpu/aten/kernels/xyzKrnl.cpp": [[17, "kernel-implementation-csrc-cpu-aten-kernels-xyzkrnl-cpp"]], "Kernel Stub: csrc/cpu/aten/xyz.cpp and csrc/cpu/aten/xyz.h": [[17, "kernel-stub-csrc-cpu-aten-xyz-cpp-and-csrc-cpu-aten-xyz-h"]], "Dispatch Stub implementation: csrc/cpu/dyndisp/DispatchStub.cpp and csrc/cpu/dyndisp/DispatchStub.h": [[17, 
"dispatch-stub-implementation-csrc-cpu-dyndisp-dispatchstub-cpp-and-csrc-cpu-dyndisp-dispatchstub-h"]], "CodeGen Process": [[17, "codegen-process"]], "Add Custom Kernel": [[17, "add-custom-kernel"]], "ISA intrinics specific kernel example:": [[17, "isa-intrinics-specific-kernel-example"]], "Vec specific kernel example:": [[17, "vec-specific-kernel-example"]], "Private Debug APIs": [[17, "private-debug-apis"]], "Example:": [[17, "example"], [17, "id1"]], "Select ISA level manually.": [[17, "select-isa-level-manually"]], "CPU feature check": [[17, "cpu-feature-check"]], "Channels Last": [[18, "channels-last"], [33, "channels-last"]], "What is Channels Last": [[18, "what-is-channels-last"]], "Memory Format Is All That Matters": [[18, "memory-format-is-all-that-matters"]], "a. NCHW (default)": [[18, "a-nchw-default"]], "b. NHWC (WIP for CPU)": [[18, "b-nhwc-wip-for-cpu"]], "c. Blocked (nChw16c)": [[18, "c-blocked-nchw16c"]], "PyTorch Strided Layout": [[18, "pytorch-strided-layout"]], "PyTorch Channels Last Memory Format APIs": [[18, "pytorch-channels-last-memory-format-apis"]], "a. tensor creation": [[18, "a-tensor-creation"]], "b. tensor conversion": [[18, "b-tensor-conversion"]], "c. model conversion": [[18, "c-model-conversion"]], "d. operator coverage": [[18, "d-operator-coverage"]], "Writing Channels Last Kernels": [[18, "writing-channels-last-kernels"]], "a. Status on CPU": [[18, "a-status-on-cpu"]], "b. Register Channels Last Kernel in ATen Native Manner": [[18, "b-register-channels-last-kernel-in-aten-native-manner"]], "c. Register oneDNN Kernel on Channels Last": [[18, "c-register-onednn-kernel-on-channels-last"]], "oneDNN NHWC APIs": [[18, "onednn-nhwc-apis"]], "a. Create NHWC Memory": [[18, "a-create-nhwc-memory"]], "b. 
Create Convolution Primitive": [[18, "b-create-convolution-primitive"]], "CPU Channels Last Targets": [[18, "cpu-channels-last-targets"]], "Optimizer Fusion": [[19, "optimizer-fusion"]], "Operation Fusion": [[19, "operation-fusion"]], "Requirements": [[20, "requirements"]], "Use Cases": [[20, "use-cases"]], "Example of MultiStream Module": [[20, "example-of-multistream-module"]], "Examples1: Basic Usage": [[20, "examples1-basic-usage"]], "Examples2: Usage with \u201cAUTO\u201d setting": [[20, "examples2-usage-with-auto-setting"]], "Examples3: Usage for models with structure inputs/outputs": [[20, "examples3-usage-for-models-with-structure-inputs-outputs"]], "Performance recipes": [[20, "performance-recipes"]], "Known issues": [[20, "known-issues"], [34, "id37"]], "Example of asynchronous task": [[20, "example-of-asynchronous-task"]], "Example of configuring core binding": [[20, "example-of-configuring-core-binding"]], "Detail Design": [[20, "detail-design"]], "How the core binding is implemented": [[20, "how-the-core-binding-is-implemented"]], "Design of Task": [[20, "design-of-task"]], "IOMP preload or load during the runtime": [[20, "iomp-preload-or-load-during-the-runtime"]], "Split SGD": [[21, "split-sgd"], [21, "id2"]], "Stochastic Gradient Descent (SGD)": [[21, "stochastic-gradient-descent-sgd"]], "Smooth Quant Recipe Tuning API (Prototype)": [[22, "smooth-quant-recipe-tuning-api-prototype"]], "Quick Start": [[23, "quick-start"]], "LLM Quick Start": [[23, "llm-quick-start"]], "Installation": [[24, "installation"]], "Get Started": [[25, "get-started"]], "Troubleshooting": [[26, "troubleshooting"]], "General Usage": [[26, "general-usage"]], "Performance Regression": [[26, "performance-regression"]], "TorchDynamo": [[26, "torchdynamo"]], "Dynamic Shape": [[26, "dynamic-shape"]], "Result Correctness": [[26, "result-correctness"]], "License": [[27, "license"]], "Large Language Models (LLM) Optimization Overview": [[28, 
"large-language-models-llm-optimization-overview"]], "ipex.llm Optimized Model List": [[28, "ipex-llm-optimized-model-list"]], "Verified for single instance mode": [[28, "verified-for-single-instance-mode"]], "Verified for distributed inference mode via DeepSpeed": [[28, "verified-for-distributed-inference-mode-via-deepspeed"]], "Module Level Optimization API for customized LLM (Prototype)": [[28, "module-level-optimization-api-for-customized-llm-prototype"]], "Demos": [[28, "demos"]], "Optimization Methodologies": [[28, "optimization-methodologies"]], "Linear Operator Optimization": [[28, "linear-operator-optimization"]], "Low Precision Data Types": [[28, "low-precision-data-types"]], "Indirect Access KV Cache": [[28, "indirect-access-kv-cache"]], "Distributed Inference": [[28, "distributed-inference"]], "Transformers Optimization Frontend API": [[29, "transformers-optimization-frontend-api"]], "Pseudocode of Common Usage Scenarios": [[29, "pseudocode-of-common-usage-scenarios"]], "SmoothQuant": [[29, "smoothquant"]], "Weight Only Quantization (WOQ)": [[29, "weight-only-quantization-woq"]], "Distributed Inference with DeepSpeed": [[29, "distributed-inference-with-deepspeed"]], "Performance": [[30, "performance"], [34, "performance"]], "Performance Data for Intel\u00ae AI Data Center Products": [[30, "performance-data-for-intel-ai-data-center-products"]], "LLM Performance": [[30, "llm-performance"]], "INT8 with v1.11": [[30, "int8-with-v1-11"]], "Performance Numbers": [[30, "performance-numbers"], [30, "id1"], [30, "id4"]], "Accuracy": [[30, "accuracy"]], "Configuration": [[30, "configuration"], [30, "id2"], [30, "id5"]], "Software Version": [[30, "software-version"], [30, "id3"], [30, "id6"]], "Hardware Configuration": [[30, "hardware-configuration"], [30, "id7"], [33, "hardware-configuration"]], "FP32 with v1.11.200 on an AWS EC2 C6i.2xlarge instance": [[30, "fp32-with-v1-11-200-on-an-aws-ec2-c6i-2xlarge-instance"]], "FP32 and BFloat16 with v1.10": [[30, 
"fp32-and-bfloat16-with-v1-10"]], "Launch Script Usage Guide": [[31, "launch-script-usage-guide"]], "Usage of launch script": [[31, "usage-of-launch-script"]], "Single instance for inference": [[31, "single-instance-for-inference"]], "I. Use all physical cores": [[31, "i-use-all-physical-cores"]], "II. Use all cores including logical cores": [[31, "ii-use-all-cores-including-logical-cores"]], "III. Use physical cores on designated nodes": [[31, "iii-use-physical-cores-on-designated-nodes"]], "IV. Use your designated number of cores": [[31, "iv-use-your-designated-number-of-cores"]], "Multiple instances for inference": [[31, "multiple-instances-for-inference"]], "V. Throughput mode": [[31, "v-throughput-mode"]], "VI. Latency mode": [[31, "vi-latency-mode"]], "VII. Your designated number of instances": [[31, "vii-your-designated-number-of-instances"]], "VIII. Your designated number of instances and instance index": [[31, "viii-your-designated-number-of-instances-and-instance-index"]], "Usage of Jemalloc/TCMalloc/Default memory allocator": [[31, "usage-of-jemalloc-tcmalloc-default-memory-allocator"]], "Jemalloc": [[31, "jemalloc"], [33, "jemalloc"]], "TCMalloc": [[31, "tcmalloc"], [33, "tcmalloc"]], "Default memory allocator": [[31, "default-memory-allocator"]], "Usage of OpenMP library": [[31, "usage-of-openmp-library"]], "Intel OpenMP Library": [[31, "intel-openmp-library"]], "GNU OpenMP Library": [[31, "gnu-openmp-library"]], "TorchServe with Intel\u00ae Extension for PyTorch*": [[32, "torchserve-with-intel-extension-for-pytorch"]], "Contents of this Document": [[32, "contents-of-this-document"], [33, "contents-of-this-document"]], "Install Intel\u00ae Extension for PyTorch*": [[32, "install-intel-extension-for-pytorch"]], "Serving model with Intel\u00ae Extension for PyTorch*": [[32, "serving-model-with-intel-extension-for-pytorch"]], "TorchServe with Launcher": [[32, "torchserve-with-launcher"]], "Launcher Core Pinning to Boost Performance of TorchServe Multi 
Worker Inference": [[32, "launcher-core-pinning-to-boost-performance-of-torchserve-multi-worker-inference"]], "Scaling workers": [[32, "scaling-workers"]], "Creating and Exporting INT8 model for Intel\u00ae Extension for PyTorch*": [[32, "creating-and-exporting-int8-model-for-intel-extension-for-pytorch"]], "1. Creating a serialized file": [[32, "creating-a-serialized-file"]], "ResNet50": [[32, "resnet50"]], "2. Creating a Model Archive": [[32, "creating-a-model-archive"]], "3. Start TorchServe to serve the model": [[32, "start-torchserve-to-serve-the-model"]], "4. Registering and Deploying model": [[32, "registering-and-deploying-model"]], "Benchmarking with Launcher": [[32, "benchmarking-with-launcher"]], "Benchmarking with Launcher Core Pinning": [[32, "benchmarking-with-launcher-core-pinning"]], "Performance Boost with Intel\u00ae Extension for PyTorch* and Launcher": [[32, "performance-boost-with-intel-extension-for-pytorch-and-launcher"]], "Performance Tuning Guide": [[33, "performance-tuning-guide"]], "Intel CPU Structure": [[33, "intel-cpu-structure"]], "Non-Uniform Memory Access (NUMA)": [[33, "non-uniform-memory-access-numa"]], "Software Configuration": [[33, "software-configuration"]], "Numactl": [[33, "numactl"]], "OpenMP": [[33, "openmp"]], "OMP_NUM_THREADS": [[33, "omp-num-threads"]], "OMP_THREAD_LIMIT": [[33, "omp-thread-limit"]], "GNU OpenMP": [[33, "gnu-openmp"]], "Intel OpenMP": [[33, "intel-openmp"]], "Memory Allocator": [[33, "memory-allocator"]], "Denormal Number": [[33, "denormal-number"]], "OneDNN primitive cache": [[33, "onednn-primitive-cache"]], "Releases": [[34, "releases"]], "2.3.0": [[34, "id1"]], "Highlights": [[34, "highlights"], [34, "id3"], [34, "id5"], [34, "id7"], [34, "id9"], [34, "id11"], [34, "id13"], [34, "id15"], [34, "id18"], [34, "id21"], [34, "id24"], [34, "id26"], [34, "id29"]], "2.2.0": [[34, "id2"]], "2.1.100": [[34, "id4"]], "2.1.0": [[34, "id6"]], "2.0.100": [[34, "id8"]], "2.0.0": [[34, "id10"]], "Known Issues": 
[[34, "known-issues"], [34, "id16"], [34, "id22"], [34, "id30"]], "1.13.100": [[34, "id12"]], "1.13.0": [[34, "id14"]], "1.12.300": [[34, "id17"]], "1.12.100": [[34, "id19"]], "1.12.0": [[34, "id20"]], "1.11.200": [[34, "id23"]], "1.11.0": [[34, "id25"]], "What\u2019s Changed": [[34, "what-s-changed"], [34, "id31"]], "1.10.100": [[34, "id27"]], "1.10.0": [[34, "id28"]], "1.9.0": [[34, "id32"]], "What\u2019s New": [[34, "what-s-new"], [34, "id34"], [34, "id36"], [34, "id39"], [34, "id42"]], "1.8.0": [[34, "id33"]], "1.2.0": [[34, "id35"]], "Performance Improvement": [[34, "performance-improvement"]], "Others": [[34, "others"]], "1.1.0": [[34, "id38"]], "1.0.2": [[34, "id40"]], "1.0.1-Alpha": [[34, "alpha"]], "1.0.0-Alpha": [[34, "id41"]], "Performance Result": [[34, "performance-result"]], "NOTE": [[34, "note"]]}, "indexentries": {"cpupool (class in intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.CPUPool"]], "fastlayernorm (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.FastLayerNorm"]], "indirectaccesskvcacheattention (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.IndirectAccessKVCacheAttention"]], "linear2silumul (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.Linear2SiluMul"]], "linearadd (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearAdd"]], "linearaddadd (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearAddAdd"]], "lineargelu (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearGelu"]], "linearmul (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearMul"]], "linearnewgelu (class in intel_extension_for_pytorch.llm.modules)": [[2, 
"intel_extension_for_pytorch.llm.modules.LinearNewGelu"]], "linearrelu (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearRelu"]], "linearsilu (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearSilu"]], "linearsilumul (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearSiluMul"]], "multistreammodule (class in intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.MultiStreamModule"]], "multistreammodulehint (class in intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.MultiStreamModuleHint"]], "pagedattention (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.PagedAttention"]], "rmsnorm (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.RMSNorm"]], "rotaryembedding (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.RotaryEmbedding"]], "task (class in intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.Task"]], "varlenattention (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.VarlenAttention"]], "autotune() (in module intel_extension_for_pytorch.quantization)": [[2, "intel_extension_for_pytorch.quantization.autotune"]], "convert() (in module intel_extension_for_pytorch.quantization)": [[2, "intel_extension_for_pytorch.quantization.convert"]], "enable_onednn_fusion() (in module intel_extension_for_pytorch)": [[2, "intel_extension_for_pytorch.enable_onednn_fusion"]], "fast_bert() (in module intel_extension_for_pytorch)": [[2, "intel_extension_for_pytorch.fast_bert"]], "fast_layer_norm() (in module intel_extension_for_pytorch.llm.functional)": [[2, "intel_extension_for_pytorch.llm.functional.fast_layer_norm"]], 
"get_core_list_of_node_id() (in module intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.get_core_list_of_node_id"]], "get_smooth_quant_qconfig_mapping() (in module intel_extension_for_pytorch.quantization)": [[2, "intel_extension_for_pytorch.quantization.get_smooth_quant_qconfig_mapping"]], "indirect_access_kv_cache_attention() (in module intel_extension_for_pytorch.llm.functional)": [[2, "intel_extension_for_pytorch.llm.functional.indirect_access_kv_cache_attention"]], "intel_extension_for_pytorch": [[2, "module-intel_extension_for_pytorch"]], "intel_extension_for_pytorch.cpu.runtime": [[2, "module-intel_extension_for_pytorch.cpu.runtime"]], "intel_extension_for_pytorch.llm": [[2, "module-intel_extension_for_pytorch.llm"]], "intel_extension_for_pytorch.llm.functional": [[2, "module-intel_extension_for_pytorch.llm.functional"]], "intel_extension_for_pytorch.llm.modules": [[2, "module-intel_extension_for_pytorch.llm.modules"]], "intel_extension_for_pytorch.quantization": [[2, "module-intel_extension_for_pytorch.quantization"]], "is_runtime_ext_enabled() (in module intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.is_runtime_ext_enabled"]], "module": [[2, "module-intel_extension_for_pytorch"], [2, "module-intel_extension_for_pytorch.cpu.runtime"], [2, "module-intel_extension_for_pytorch.llm"], [2, "module-intel_extension_for_pytorch.llm.functional"], [2, "module-intel_extension_for_pytorch.llm.modules"], [2, "module-intel_extension_for_pytorch.quantization"]], "optimize() (in module intel_extension_for_pytorch)": [[2, "intel_extension_for_pytorch.optimize"]], "optimize() (in module intel_extension_for_pytorch.llm)": [[2, "intel_extension_for_pytorch.llm.optimize"]], "pin (class in intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.pin"]], "prepare() (in module intel_extension_for_pytorch.quantization)": [[2, 
"intel_extension_for_pytorch.quantization.prepare"]], "rms_norm() (in module intel_extension_for_pytorch.llm.functional)": [[2, "intel_extension_for_pytorch.llm.functional.rms_norm"]], "rotary_embedding() (in module intel_extension_for_pytorch.llm.functional)": [[2, "intel_extension_for_pytorch.llm.functional.rotary_embedding"]], "varlen_attention() (in module intel_extension_for_pytorch.llm.functional)": [[2, "intel_extension_for_pytorch.llm.functional.varlen_attention"]], "verbose (class in intel_extension_for_pytorch)": [[2, "intel_extension_for_pytorch.verbose"]], "frozenbatchnorm2d (class in intel_extension_for_pytorch.nn)": [[7, "intel_extension_for_pytorch.nn.FrozenBatchNorm2d"]], "mergedembeddingbag (class in intel_extension_for_pytorch.nn.modules)": [[7, "intel_extension_for_pytorch.nn.modules.MergedEmbeddingBag"]], "mergedembeddingbagwithsgd (class in intel_extension_for_pytorch.nn.modules)": [[7, "intel_extension_for_pytorch.nn.modules.MergedEmbeddingBagWithSGD"]], "interaction() (in module intel_extension_for_pytorch.nn.functional)": [[7, "intel_extension_for_pytorch.nn.functional.interaction"]]}})
\ No newline at end of file
+Search.setIndex({"docnames": ["design_doc/cpu/isa_dyndisp", "index", "tutorials/api_doc", "tutorials/blogs_publications", "tutorials/cheat_sheet", "tutorials/contribution", "tutorials/examples", "tutorials/features", "tutorials/features/amp", "tutorials/features/auto_channels_last", "tutorials/features/codeless_optimization", "tutorials/features/fast_bert", "tutorials/features/graph_capture", "tutorials/features/graph_optimization", "tutorials/features/hypertune", "tutorials/features/int8_overview", "tutorials/features/int8_recipe_tuning_api", "tutorials/features/isa_dynamic_dispatch", "tutorials/features/nhwc", "tutorials/features/optimizer_fusion", "tutorials/features/runtime_extension", "tutorials/features/split_sgd", "tutorials/features/sq_recipe_tuning_api", "tutorials/getting_started", "tutorials/installation", "tutorials/introduction", "tutorials/known_issues", "tutorials/license", "tutorials/llm", "tutorials/llm/llm_optimize", "tutorials/performance", "tutorials/performance_tuning/launch_script", "tutorials/performance_tuning/torchserve", "tutorials/performance_tuning/tuning_guide", "tutorials/releases"], "filenames": ["design_doc/cpu/isa_dyndisp.md", "index.rst", "tutorials/api_doc.rst", "tutorials/blogs_publications.md", "tutorials/cheat_sheet.md", "tutorials/contribution.md", "tutorials/examples.md", "tutorials/features.rst", "tutorials/features/amp.md", "tutorials/features/auto_channels_last.md", "tutorials/features/codeless_optimization.md", "tutorials/features/fast_bert.md", "tutorials/features/graph_capture.md", "tutorials/features/graph_optimization.md", "tutorials/features/hypertune.md", "tutorials/features/int8_overview.md", "tutorials/features/int8_recipe_tuning_api.md", "tutorials/features/isa_dynamic_dispatch.md", "tutorials/features/nhwc.md", "tutorials/features/optimizer_fusion.md", "tutorials/features/runtime_extension.md", "tutorials/features/split_sgd.rst", "tutorials/features/sq_recipe_tuning_api.md", "tutorials/getting_started.md", 
"tutorials/installation.md", "tutorials/introduction.rst", "tutorials/known_issues.md", "tutorials/license.md", "tutorials/llm.rst", "tutorials/llm/llm_optimize.md", "tutorials/performance.md", "tutorials/performance_tuning/launch_script.md", "tutorials/performance_tuning/torchserve.md", "tutorials/performance_tuning/tuning_guide.md", "tutorials/releases.md"], "titles": ["Intel\u00ae Extension for PyTorch* CPU ISA Dynamic Dispatch Design Doc", "Intel\u00ae Extension for PyTorch*", "API Documentation", "Blogs & Publications", "Cheat Sheet", "Contribution", "Examples", "Features", "Auto Mixed Precision (AMP)", "Auto Channels Last", "Codeless Optimization (Prototype)", "Fast BERT (Prototype)", "Graph Capture (Prototype)", "Graph Optimization", "HyperTune (Prototype)", "Intel\u00ae Extension for PyTorch* optimizations for quantization", "INT8 Recipe Tuning API (Prototype)", "ISA Dynamic Dispatching", "Channels Last", "Optimizer Fusion", "Runtime Extension", "Split SGD", "Smooth Quant Recipe Tuning API (Prototype)", "Quick Start", "Installation", "Introduction", "Troubleshooting", "License", "Large Language Models (LLM) Optimization Overview", "Transformers Optimization Frontend API", "Performance", "Launch Script Usage Guide", "TorchServe with Intel\u00ae Extension for PyTorch*", "Performance Tuning Guide", "Releases"], "terms": {"The": [0, 1, 2, 5, 6, 7, 8, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 28, 29, 30, 31, 32, 33, 34], "document": [0, 7, 17, 20, 29, 34], "i": [0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 21, 22, 23, 26, 27, 28, 29, 30, 32, 33, 34], "redirect": 0, "thi": [0, 2, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 26, 27, 28, 29, 30, 31, 34], "link": [0, 1, 6, 17, 34], "now": [0, 2, 7, 15, 18, 32, 33, 34], "intel optim": 1, "intel\u00ae extension for pytorch*": 1, "gpu": [1, 3, 18, 34], "discrete gpu": 1, "intel discrete gpu": 1, "extend": [1, 18, 25, 33, 34], "latest": [1, 2, 25, 28, 30, 34], "perform": [1, 2, 
3, 4, 6, 7, 8, 9, 10, 13, 14, 15, 16, 18, 19, 21, 25, 28, 29, 31], "optim": [1, 3, 4, 6, 8, 9, 11, 12, 14, 16, 18, 20, 21, 23, 25, 26, 31, 32, 33, 34], "hardwar": [1, 3, 17, 25, 28, 32, 34], "take": [1, 2, 7, 8, 10, 12, 13, 14, 18, 21, 25, 26, 30, 31, 33], "advantag": [1, 2, 7, 9, 12, 18, 21, 25, 30, 31, 33], "advanc": [1, 2, 6, 7, 16, 25, 28], "vector": [1, 2, 6, 17, 18, 25, 28], "512": [1, 6, 11, 16, 25, 28, 31], "avx": [1, 6, 17, 25, 28], "neural": [1, 3, 7, 16, 22, 25, 28, 33, 34], "network": [1, 3, 7, 8, 20, 25, 28, 33], "instruct": [1, 5, 6, 7, 8, 17, 21, 23, 24, 25, 28, 30, 33, 34], "vnni": [1, 15, 17, 25, 28], "matrix": [1, 6, 7, 25, 28], "amx": [1, 3, 6, 7, 17, 25, 28, 30], "cpu": [1, 3, 4, 5, 6, 7, 8, 10, 14, 15, 16, 19, 20, 23, 25, 26, 28, 30, 31, 32, 34], "well": [1, 2, 5, 6, 7, 11, 16, 20, 21, 24, 28, 32, 33, 34], "x": [1, 5, 6, 8, 10, 13, 15, 16, 17, 18, 20, 21, 23, 26, 34], "e": [1, 2, 6, 7, 8, 12, 16, 17, 18, 28, 31, 33, 34], "xmx": 1, "ai": [1, 2, 3, 7, 28], "engin": [1, 6, 18, 33], "discret": 1, "moreov": [1, 2, 28], "provid": [1, 2, 5, 6, 7, 8, 11, 12, 13, 14, 16, 20, 22, 24, 26, 28, 29, 31, 32, 33, 34], "easi": [1, 3, 21], "acceler": [1, 2, 3, 6, 7, 13, 28, 29, 30, 34], "through": [1, 2, 6, 7, 8, 12, 25, 28, 33, 34], "xpu": [1, 2, 3, 34], "devic": [1, 2, 15, 29, 31, 34], "In": [1, 2, 6, 7, 8, 12, 16, 17, 18, 19, 21, 23, 28, 31, 32, 33, 34], "current": [1, 2, 5, 7, 11, 13, 14, 15, 16, 17, 19, 20, 26, 28, 29, 34], "technolog": [1, 7, 28], "landscap": [1, 7, 28], "gener": [1, 5, 6, 7, 10, 12, 16, 17, 18, 21, 23, 28, 29, 30, 31, 32, 33, 34], "genai": [1, 7, 28], "workload": [1, 6, 7, 8, 10, 11, 12, 21, 26, 28, 29, 30, 31, 33, 34], "model": [1, 2, 3, 4, 8, 9, 10, 11, 12, 14, 16, 23, 24, 25, 26, 29, 30, 33, 34], "have": [1, 2, 5, 6, 7, 9, 14, 17, 18, 20, 21, 23, 26, 27, 28, 30, 31, 32, 33, 34], "gain": [1, 7, 26, 28, 34], "widespread": [1, 7, 28], "attent": [1, 2, 7, 28, 34], "popular": [1, 7, 22, 28, 30, 34], "larg": [1, 2, 19, 23, 24, 25, 26, 29, 
30, 33, 34], "languag": [1, 2, 23, 24, 25, 26, 29, 34], "llm": [1, 16, 22, 24, 25, 29, 34], "emerg": [1, 7, 28], "domin": [1, 7, 28], "drive": [1, 7, 28], "applic": [1, 2, 7, 20, 28, 32, 33], "start": [1, 3, 4, 5, 6, 7, 10, 20, 24, 34], "from": [1, 2, 3, 4, 5, 8, 10, 11, 13, 15, 16, 17, 18, 19, 20, 21, 23, 25, 28, 29, 31, 32, 33, 34], "2": [1, 2, 3, 8, 10, 16, 17, 18, 20, 21, 25, 26, 27, 28, 29, 30, 31, 33], "1": [1, 2, 3, 4, 6, 8, 10, 11, 12, 13, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 28, 29, 30, 31, 33], "0": [1, 2, 4, 5, 8, 10, 11, 13, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 30, 31, 32, 33], "specif": [1, 2, 5, 6, 7, 12, 18, 20, 26, 28, 31, 33, 34], "certain": [1, 7, 26, 28, 29, 31, 33], "ar": [1, 2, 3, 5, 6, 7, 8, 10, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 28, 29, 30, 31, 32, 33, 34], "introduc": [1, 3, 7, 15, 18, 21, 22, 31, 33, 34], "For": [1, 2, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 31, 32, 33, 34], "more": [1, 2, 5, 6, 7, 8, 10, 11, 13, 16, 17, 19, 20, 21, 23, 26, 28, 32, 33, 34], "inform": [1, 2, 6, 7, 14, 17, 18, 28, 31, 32, 33, 34], "refer": [1, 7, 9, 13, 14, 16, 17, 18, 20, 22, 23, 24, 25, 32, 34], "section": [1, 6, 7, 8, 14, 20, 23, 24, 25, 28, 29, 32, 33, 34], "can": [1, 2, 5, 6, 7, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 26, 28, 29, 30, 31, 32, 33, 34], "load": [1, 2, 6, 7, 13, 15, 16, 17, 23, 29, 32, 34], "python": [1, 2, 4, 10, 14, 17, 20, 26, 28, 29, 31, 32, 33, 34], "modul": [1, 6, 7, 8, 13, 16, 17, 26, 29, 31, 34], "program": [1, 5, 7, 11, 20, 31, 33, 34], "c": [1, 7, 8, 16, 17, 20, 26, 28, 31, 32, 33, 34], "librari": [1, 2, 5, 6, 7, 17, 20, 32, 33, 34], "script": [1, 2, 3, 4, 5, 6, 7, 8, 10, 14, 17, 20, 23, 24, 26, 28, 29, 30, 32, 33, 34], "user": [1, 2, 7, 9, 10, 12, 13, 15, 16, 18, 20, 26, 31, 32, 33, 34], "enabl": [1, 2, 3, 4, 6, 7, 8, 10, 13, 16, 18, 20, 22, 23, 26, 28, 31, 32, 33, 34], "dynam": [1, 4, 20, 28, 32, 33, 34], "import": [1, 2, 4, 5, 6, 7, 10, 11, 12, 13, 15, 16, 17, 
18, 20, 21, 23, 25, 26, 28, 29, 32, 33, 34], "intel_extension_for_pytorch": [1, 2, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15, 16, 17, 20, 23, 25, 29, 32, 34], "featur": [1, 2, 3, 5, 8, 10, 13, 14, 18, 20, 23, 25, 26, 28, 30, 31, 32, 33, 34], "includ": [1, 2, 5, 6, 7, 10, 14, 15, 17, 23, 26, 27, 28, 30, 34], "onli": [1, 2, 5, 7, 8, 10, 11, 13, 14, 15, 16, 17, 18, 20, 21, 26, 28, 31, 32, 34], "packag": [1, 2, 5, 6, 7, 10, 23, 25, 26, 32, 33, 34], "mai": [1, 2, 3, 5, 6, 7, 8, 9, 16, 17, 18, 20, 26, 28, 31, 32, 33, 34], "newer": [1, 28, 33], "code": [1, 2, 5, 6, 7, 10, 11, 12, 13, 18, 19, 21, 23, 24, 26, 27, 29, 33, 34], "base": [1, 2, 3, 4, 5, 6, 7, 10, 11, 17, 20, 21, 26, 28, 29, 30, 32, 33, 34], "due": [1, 8, 10, 17, 20, 26], "differ": [1, 2, 6, 7, 15, 16, 17, 18, 20, 28, 31, 32, 33, 34], "develop": [1, 3, 6, 28, 30, 33, 34], "schedul": [1, 2, 13, 20, 31, 33], "ha": [1, 2, 7, 10, 14, 17, 18, 20, 21, 26, 28, 30, 31, 33, 34], "been": [1, 6, 7, 10, 17, 18, 28, 31, 33, 34], "releas": [1, 17, 18, 26, 30, 33], "an": [1, 2, 5, 6, 7, 8, 10, 11, 13, 14, 16, 17, 18, 19, 20, 21, 26, 31, 32, 33, 34], "open": [1, 16, 28, 33], "sourc": [1, 5, 6, 17, 27, 28, 33, 34], "project": [1, 6], "github": [1, 2, 5, 6, 7, 8, 34], "you": [1, 2, 5, 6, 7, 8, 13, 14, 15, 17, 18, 20, 23, 25, 26, 28, 29, 31, 33, 34], "find": [1, 2, 6, 7, 14, 16, 23, 26, 30, 31, 34], "how": [1, 2, 6, 10, 15, 17, 18, 23, 28, 32, 33, 34], "get": [1, 2, 3, 4, 6, 7, 10, 11, 15, 17, 20, 21, 22, 26, 28, 29, 30, 31, 33, 34], "main": [1, 2, 5, 6, 14, 20, 31, 32], "branch": [1, 7, 30], "quick": [1, 20, 24, 25], "about": [1, 2, 5, 7, 13, 16, 32, 33, 34], "product": [1, 2, 7, 14, 28, 34], "structur": [1, 18, 31], "shown": [1, 6, 18, 28, 31, 32], "follow": [1, 2, 4, 5, 6, 7, 8, 11, 14, 15, 16, 17, 18, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34], "figur": [1, 2, 21, 28, 33], "eager": [1, 7, 12, 23, 32, 34], "mode": [1, 2, 5, 7, 10, 12, 18, 20, 23, 26, 32, 34], "frontend": [1, 2, 7, 20, 28, 34], "custom": [1, 2, 7, 26, 34], 
"fusion": [1, 2, 7, 10, 21, 28, 34], "int8": [1, 2, 3, 4, 17, 18, 20, 22, 28, 29, 34], "quantiz": [1, 3, 4, 13, 22, 26, 28, 30, 32, 34], "api": [1, 3, 6, 10, 11, 15, 20, 26, 33, 34], "further": [1, 2, 5, 6, 7, 18, 20, 28, 33, 34], "improv": [1, 3, 7, 8, 13, 20, 22, 28, 30, 32, 33], "achiev": [1, 2, 6, 7, 28, 33, 34], "convert": [1, 2, 4, 6, 7, 8, 9, 10, 13, 16, 17, 18, 20, 23, 26, 32, 34], "graph": [1, 4, 8, 10, 16, 23, 26, 31, 34], "us": [1, 2, 3, 4, 5, 6, 11, 14, 15, 17, 18, 19, 21, 23, 24, 25, 26, 27, 28, 32, 33, 34], "pass": [1, 2, 5, 10, 17, 20, 26, 32, 34], "reduc": [1, 2, 7, 15, 19, 20, 21, 22, 26, 28, 33, 34], "oper": [1, 2, 6, 8, 13, 15, 21, 32, 33, 34], "kernel": [1, 2, 7, 20, 26, 28, 30, 33, 34], "invoc": [1, 7], "overhead": [1, 2, 7, 10, 19, 20, 26, 28, 33, 34], "result": [1, 2, 6, 10, 12, 14, 16, 18, 20, 21, 30, 31, 32, 33], "compar": [1, 2, 7, 13, 18, 21, 26, 28, 30, 31, 33, 34], "normal": [1, 2, 6, 7, 13, 20, 28, 33, 34], "yield": [1, 7, 33], "better": [1, 2, 6, 7, 15, 18, 20, 28, 31, 32, 33, 34], "techniqu": [1, 2, 7, 11, 12, 28, 34], "like": [1, 2, 3, 5, 6, 7, 8, 14, 18, 19, 21, 26, 28, 31, 33, 34], "amplifi": 1, "them": [1, 5, 7, 18, 19, 28, 31, 33], "comprehens": [1, 34], "both": [1, 2, 6, 7, 16, 18, 19, 21, 28, 29, 31, 32, 33, 34], "torchscript": [1, 2, 5, 7, 10, 11, 12, 19, 23, 26, 32, 34], "torchdynamo": [1, 7, 12, 23, 34], "With": [1, 2, 7, 10, 20, 31, 34], "we": [1, 2, 5, 6, 7, 8, 9, 10, 14, 15, 16, 17, 18, 19, 20, 21, 23, 28, 30, 32, 33, 34], "recommend": [1, 5, 6, 7, 9, 10, 15, 16, 20, 23, 30, 31, 33, 34], "torch": [1, 2, 4, 6, 8, 10, 11, 12, 13, 15, 16, 18, 20, 23, 26, 29, 32, 33, 34], "jit": [1, 2, 5, 6, 7, 8, 13, 15, 16, 18, 20, 23, 26, 32, 34], "trace": [1, 6, 7, 8, 12, 13, 15, 16, 20, 23, 26, 32, 34], "your": [1, 5, 6, 7, 8, 10, 14, 15, 20, 23, 24, 26, 27, 28, 29, 34], "prefer": [1, 7, 8, 15, 24], "option": [1, 2, 5, 7, 10, 14, 15, 16, 29, 31, 34], "wider": 1, "rang": [1, 6, 7, 15, 16, 19, 21, 26, 31, 32, 34], "ipex": [1, 2, 3, 4, 6, 
7, 9, 11, 12, 13, 15, 16, 17, 19, 20, 23, 26, 29, 31, 32, 34], "backend": [1, 2, 3, 6, 7, 12, 13, 16, 17, 23, 26, 28, 31, 33, 34], "avail": [1, 2, 6, 7, 11, 17, 20, 22, 23, 29, 31, 33, 34], "good": [1, 2, 5, 7, 12, 18, 19, 28, 33, 34], "On": [1, 2, 7, 18, 28, 33], "automat": [1, 2, 6, 7, 9, 10, 12, 13, 15, 16, 18, 22, 28, 31, 32, 33, 34], "dispatch": [1, 34], "underli": [1, 17, 28], "detect": [1, 6, 12, 17, 26, 33, 34], "set": [1, 2, 4, 5, 6, 7, 8, 14, 15, 16, 17, 21, 24, 26, 28, 30, 31, 32, 33, 34], "isa": [1, 34], "leverag": [1, 7, 11, 28, 32, 34], "unit": [1, 2, 33], "runtim": [1, 8, 13, 17, 31, 33, 34], "offer": [1, 5, 33], "finer": [1, 7, 20], "grain": [1, 3, 7, 20], "thread": [1, 2, 7, 20, 26, 30, 31, 32, 33, 34], "control": [1, 2, 7, 20, 26, 31, 33, 34], "weight": [1, 2, 7, 10, 12, 13, 15, 16, 18, 20, 22, 23, 26, 28, 34], "share": [1, 5, 6, 16, 20, 32, 33, 34], "increas": [1, 2, 3, 21, 26, 28, 30, 33, 34], "effici": [1, 7, 11, 19, 20, 28, 31, 33, 34], "implement": [1, 5, 7, 11, 19, 26, 28, 33, 34], "regist": [1, 7, 10, 16, 17, 34], "mechan": [1, 7, 17, 21, 34], "These": [1, 5, 6, 7, 8, 13, 28], "nativ": [1, 6, 7, 8, 17, 19, 21, 26, 28, 34], "calcul": [1, 2, 8, 16, 21, 22], "util": [1, 6, 7, 10, 13, 15, 16, 18, 21, 28, 31, 33, 34], "dpc": 1, "compil": [1, 5, 6, 23, 26, 33, 34], "sycl": 1, "standard": [1, 34], "also": [1, 2, 6, 7, 10, 13, 14, 16, 18, 19, 28, 30, 31, 33, 34], "number": [1, 2, 5, 6, 7, 14, 16, 19, 20, 21, 26, 32, 34], "which": [1, 2, 5, 7, 8, 10, 14, 15, 16, 17, 18, 20, 26, 28, 30, 31, 32, 33, 34], "found": [1, 6, 7, 14, 16, 18, 29, 31, 32, 33, 34], "doc": [1, 2, 5, 11, 29, 34], "directori": [1, 5, 6, 14, 29, 31, 32], "team": [1, 5], "track": 1, "bug": [1, 5, 34], "enhanc": [1, 3, 28, 34], "request": [1, 5, 20, 32], "issu": [1, 2, 5, 8, 21, 26, 33], "befor": [1, 2, 5, 6, 13, 14, 17, 18, 20, 31, 33, 34], "submit": [1, 5, 7, 20], "suggest": [1, 2, 15, 18, 20, 33, 34], "report": [1, 17], "search": [1, 2, 4, 5, 7, 16, 22, 28, 31], "exist": [1, 5, 7, 
13, 26, 31, 33], "see": [1, 2, 5, 8, 14, 34], "alreadi": [1, 5, 6, 18, 28, 33], "pytorch": [2, 3, 4, 6, 7, 8, 9, 10, 13, 14, 16, 17, 20, 23, 25, 26, 27, 28, 29, 30, 31, 33, 34], "dtype": [2, 4, 6, 7, 8, 10, 11, 13, 15, 16, 17, 23, 26, 29, 31, 34], "none": [2, 6, 29, 31], "o1": [2, 26, 34], "inplac": [2, 4, 6, 13, 15, 18, 23, 32], "fals": [2, 4, 6, 7, 8, 13, 14, 15, 16, 17, 20, 22, 23, 26, 31, 32, 34], "conv_bn_fold": [2, 26, 34], "linear_bn_fold": 2, "weights_prepack": [2, 6, 7, 23, 26], "replace_dropout_with_ident": 2, "optimize_lstm": 2, "split_master_weight_for_bf16": 2, "fuse_update_step": 2, "auto_kernel_select": [2, 7, 30], "sample_input": [2, 9, 34], "graph_mod": [2, 4, 7, 12, 34], "concat_linear": 2, "appli": [2, 6, 7, 8, 12, 13, 16, 18, 19, 21, 23, 26, 28, 29, 31, 34], "given": [2, 6, 13, 14, 16, 28], "nn": [2, 6, 7, 8, 10, 13, 15, 16, 18, 20, 26, 34], "If": [2, 5, 6, 7, 8, 9, 10, 13, 14, 15, 16, 17, 20, 26, 31, 32, 33, 34], "train": [2, 3, 4, 7, 11, 13, 15, 16, 18, 21, 23, 26, 28, 29, 31, 34], "otherwis": [2, 7, 20], "infer": [2, 3, 4, 7, 10, 11, 12, 15, 18, 20, 21, 23, 26, 30, 33, 34], "conv": [2, 8, 10, 13, 15, 20, 26, 34], "bn": [2, 10, 15, 26, 34], "fold": [2, 10, 15, 16, 26, 34], "prepack": [2, 6, 10, 18, 26, 28, 34], "so": [2, 5, 6, 7, 8, 15, 17, 18, 20, 30, 31, 32, 33, 34], "onednn": [2, 3, 13, 17, 26, 28, 34], "order": [2, 17, 18, 21, 31, 33, 34], "cach": [2, 5, 7, 19, 20, 30, 34], "reus": [2, 33], "memori": [2, 6, 7, 8, 9, 10, 13, 19, 20, 21, 26, 28, 30, 32, 34], "layout": [2, 26, 34], "call": [2, 6, 8, 13, 17, 18, 21, 26, 32, 33, 34], "block": [2, 5, 16, 20, 22, 28, 33, 34], "although": [2, 33], "itself": [2, 5, 18], "enough": [2, 7, 19], "usag": [2, 6, 7, 8, 23, 25, 32, 33, 34], "perspect": [2, 13, 18, 21, 28, 31, 33], "drawback": [2, 21], "run": [2, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 26, 30, 31, 32, 33, 34], "split": [2, 6, 7, 16, 17, 19, 20, 26, 34], "one": [2, 5, 7, 12, 13, 14, 16, 18, 19, 20, 26, 29, 31, 33, 34], "sever": [2, 7, 10, 19, 30, 
31, 34], "dimens": [2, 18, 26], "data": [2, 4, 6, 7, 8, 9, 10, 11, 12, 13, 16, 17, 18, 19, 20, 21, 23, 26, 31, 32, 34], "fix": [2, 5, 7, 34], "size": [2, 6, 7, 11, 15, 16, 17, 18, 23, 26, 28, 30, 32, 33, 34], "each": [2, 8, 14, 16, 17, 19, 20, 21, 31, 32, 33, 34], "time": [2, 5, 7, 14, 16, 17, 18, 19, 26, 28, 30, 33, 34], "execut": [2, 4, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 19, 20, 26, 31, 32, 33, 34], "detail": [2, 5, 6, 7, 8, 9, 11, 13, 17, 18, 24, 25, 26, 28, 30, 32, 33, 34], "mermori": 2, "format": [2, 5, 6, 7, 9, 14, 22, 26, 28, 31, 33, 34], "manual": [2, 7, 10, 14, 18, 20, 34], "To": [2, 5, 6, 7, 10, 13, 15, 16, 17, 18, 20, 21, 23, 28, 32, 33, 34], "predefin": 2, "shape": [2, 6, 7, 16, 20, 23, 30, 33, 34], "prior": [2, 23], "match": [2, 8, 17, 31], "requir": [2, 5, 6, 8, 10, 16, 18, 21, 26, 28, 29, 31, 32, 34], "won": [2, 7, 8, 17, 26], "t": [2, 5, 7, 8, 14, 15, 16, 17, 18, 20, 26, 32, 34], "convers": [2, 8, 13, 34], "directli": [2, 6, 33, 34], "go": [2, 5, 8], "methodologi": [2, 6, 7, 19, 33], "possibl": [2, 14, 15, 19, 28, 33, 34], "avoid": [2, 10, 20, 21, 26, 31, 32, 33, 34], "thu": [2, 7, 8, 10, 18, 20, 21, 28, 31, 32, 33], "paramet": [2, 6, 7, 8, 10, 16, 17, 19, 20, 21, 26, 28, 29, 30, 31, 33, 34], "work": [2, 5, 6, 7, 14, 15, 17, 20, 26, 28, 29, 31, 33, 34], "bfloat16": [2, 3, 4, 7, 10, 11, 17, 18, 23, 29, 31, 34], "half": [2, 7, 17, 21], "k": [2, 5], "float16": [2, 8], "cast": [2, 8, 21, 28], "accord": [2, 13, 28, 33, 34], "default": [2, 4, 6, 7, 10, 12, 13, 15, 16, 17, 20, 22, 23, 26, 28, 30, 32, 33, 34], "valu": [2, 6, 10, 14, 16, 17, 19, 20, 21, 22, 26, 28, 31, 32, 33, 34], "mean": [2, 16, 17, 18, 20, 22, 28, 34], "do": [2, 5, 8, 16, 18, 20, 21, 26, 28, 30, 31, 32, 33, 34], "noth": 2, "note": [2, 3, 5, 6, 15, 16, 17, 18, 20, 22, 24, 28, 30, 31, 32, 33], "type": [2, 4, 5, 6, 7, 10, 16, 17, 18, 20, 21, 23, 30, 31, 32, 34], "conv2d": [2, 7, 8, 10, 13, 18, 20, 26, 34], "linear": [2, 6, 7, 8, 13, 15, 16, 18, 26, 33, 34], "convtranspose2d": [2, 13], 
"case": [2, 6, 7, 9, 12, 16, 17, 18, 28, 31, 33, 34], "addit": [2, 6, 7, 17, 21, 28, 34], "embed": [2, 7, 28, 34], "lstm": [2, 10, 15, 34], "sgd": [2, 6, 7, 8, 16, 19], "string": [2, 31], "o0": [2, 26, 34], "No": [2, 18, 34], "function": [2, 5, 6, 7, 8, 10, 11, 12, 14, 15, 17, 20, 21, 23, 26, 28, 29, 31, 33, 34], "just": [2, 14, 29, 34], "return": [2, 6, 7, 8, 10, 16, 17, 20, 26, 34], "origin": [2, 6, 7, 12, 13, 15, 17, 20, 29, 34], "dropout": [2, 10], "remov": [2, 5, 21, 34], "inferenc": 2, "master": [2, 7, 21, 31], "fuse": [2, 7, 13, 16, 19, 28, 34], "updat": [2, 5, 7, 16, 19, 21, 22, 34], "step": [2, 5, 6, 7, 8, 14, 16, 19, 21, 32], "overridden": [2, 17], "explicitli": [2, 8, 16, 20, 26, 31, 34], "bool": [2, 14], "whether": [2, 6, 8, 16, 18, 22, 23, 33], "conv_bn": 2, "It": [2, 6, 7, 8, 10, 13, 17, 18, 20, 21, 23, 26, 29, 31, 33, 34], "knob": [2, 4, 12, 31], "overwrit": [2, 31], "configur": [2, 4, 6, 7, 14, 15, 16, 17, 31, 32, 34], "linear_bn": 2, "convolut": [2, 6, 7, 13, 20, 33, 34], "reorder": [2, 18, 28], "doesn": [2, 15, 16, 18, 26, 34], "support": [2, 5, 6, 7, 13, 15, 16, 17, 18, 19, 20, 21, 25, 26, 28, 29, 31, 32, 33, 34], "replac": [2, 5, 7, 10, 26, 34], "ident": [2, 10, 18], "aten": [2, 6, 7, 34], "opportunit": 2, "bf16": [2, 3, 7, 17, 19, 21, 23, 26, 28, 30, 34], "save": [2, 5, 6, 7, 13, 14, 15, 16, 18, 21, 28, 32, 34], "solut": [2, 7, 26, 28, 34], "all": [2, 5, 6, 8, 13, 14, 17, 19, 20, 28, 29, 32, 33, 34], "param": [2, 19, 31], "tupl": [2, 6, 17, 20], "tensor": [2, 6, 7, 8, 11, 15, 16, 17, 20, 26, 28, 32, 34], "feed": [2, 9, 18], "sampl": [2, 6, 9, 14, 16, 17, 29, 33], "input": [2, 6, 7, 9, 10, 13, 15, 16, 17, 18, 22, 23, 26, 29, 30, 32, 33, 34], "impact": [2, 7, 20], "pack": [2, 20, 34], "intel": [2, 3, 4, 7, 8, 9, 10, 11, 13, 14, 16, 17, 20, 21, 22, 23, 25, 26, 27, 28, 29, 34], "extens": [2, 3, 4, 6, 9, 10, 13, 14, 16, 17, 23, 24, 25, 27, 28, 29, 30, 31, 33, 34], "per": [2, 10, 15, 16, 20, 30, 31, 32, 33, 34], "some": [2, 5, 7, 8, 13, 16, 17, 18, 
20, 26, 28, 31, 32, 33, 34], "heurist": [2, 20, 34], "real": [2, 7, 14, 15, 30, 34], "best": [2, 6, 7, 8, 14, 16, 17, 22, 24, 28, 33, 34], "try": [2, 5, 6, 7, 12, 14, 16, 26, 31, 33, 34], "select": [2, 5, 7, 13, 24, 34], "true": [2, 4, 6, 10, 12, 13, 14, 15, 16, 17, 22, 23, 31, 32, 33, 34], "might": [2, 7, 18, 26, 33, 34], "cost": [2, 6, 28, 30, 33], "extra": [2, 5, 10, 20, 31, 32], "combin": [2, 12, 14, 28, 31, 34], "method": [2, 8, 15, 16, 18, 22, 26, 33, 34], "multipl": [2, 5, 7, 8, 16, 17, 18, 26, 28, 30, 32, 33, 34], "subgraph": 2, "modifi": [2, 5, 6], "other": [2, 6, 7, 8, 14, 17, 18, 19, 23, 28, 31, 33], "place": [2, 8, 28, 33, 34], "scenario": [2, 6, 7, 18, 33, 34], "convolutuon": 2, "counterpart": [2, 7, 18, 34], "pleas": [2, 6, 7, 11, 16, 22, 26, 28, 31, 33, 34], "invok": [2, 6, 8, 10, 13, 20, 23, 26, 29, 34], "ddp": [2, 6], "distribut": [2, 3, 7, 16, 31, 32, 33], "deepcopi": 2, "rather": [2, 18], "than": [2, 5, 7, 17, 18, 20, 21, 26, 33, 34], "allreduc": 2, "caus": [2, 7, 21, 26, 28, 31, 33, 34], "unpredict": 2, "accuraci": [2, 3, 6, 7, 8, 15, 16, 21, 22, 26, 28, 34], "loss": [2, 5, 6, 8, 16, 18, 21, 26], "exampl": [2, 5, 7, 8, 13, 18, 19, 21, 22, 23, 24, 25, 28, 29, 32, 33, 34], "load_state_dict": [2, 34], "path": [2, 6, 7, 14, 18, 20, 23, 31, 33, 34], "eval": [2, 4, 6, 8, 10, 11, 12, 13, 15, 16, 20, 23, 26, 29, 32, 34], "optimized_model": [2, 34], "evalu": [2, 16, 34], "optimized_optim": 2, "altern": [2, 6, 18], "motiv": [2, 20], "ad": [2, 7, 10, 33, 34], "alia": 2, "unifi": [2, 31], "style": [2, 5], "modular": 2, "float32": [2, 13, 21, 23, 26, 30, 31, 34], "quantization_config": [2, 6, 29], "qconfig_summary_fil": [2, 6, 29], "low_precision_checkpoint": [2, 6, 29], "deployment_mod": [2, 6, 23], "transform": [2, 3, 4, 6, 10, 11, 13, 16, 18, 22, 23, 28, 32, 33, 34], "focu": [2, 10, 18, 29, 34], "especi": [2, 5, 28, 34], "task": [2, 7, 28, 31, 33, 34], "famili": [2, 28, 33], "full": [2, 5, 18, 32, 33, 34], "llama": [2, 3, 6, 28], "gpt": [2, 28, 30], "j": 
[2, 5, 17, 28, 30], "neox": [2, 28], "opt": [2, 6, 17, 28], "falcon": [2, 28], "bloom": [2, 28], "codegen": [2, 28, 34], "baichuan": [2, 28, 34], "chatglm": [2, 28], "gptbigcod": [2, 28], "t5": [2, 26, 28, 34], "mistral": [2, 28, 34], "mpt": [2, 28, 34], "mixtral": [2, 28], "stablelm": [2, 28], "qwen": [2, 28], "git": [2, 5, 28], "llava": [2, 28], "yuan": [2, 28], "phi": [2, 28], "scope": [2, 7, 8, 21, 34], "abov": [2, 5, 10, 19, 28, 30, 31, 32], "transpar": [2, 7, 29, 33, 34], "benifit": 2, "float": [2, 6, 7, 8, 14, 15, 16, 17, 21, 29, 34], "when": [2, 5, 6, 7, 8, 9, 14, 18, 19, 20, 21, 22, 25, 26, 28, 30, 31, 32, 33, 34], "mix": [2, 6, 13, 23, 26, 28, 34], "str": [2, 6, 14, 23, 31], "specifi": [2, 5, 6, 14, 20, 31, 33, 34], "either": [2, 26, 31], "object": [2, 6, 7, 14, 17, 20, 33, 34], "defin": [2, 5, 6, 7, 8, 10, 16, 17, 18, 22, 32], "recip": [2, 4, 7, 13, 15, 26, 28, 34], "quant": [2, 16], "static": [2, 4, 16, 26, 28, 31, 32, 33, 34], "onc": [2, 5, 6, 14, 17, 18, 20, 21, 32, 33], "quantizat": 2, "config": [2, 6, 11, 23, 31, 32], "json": [2, 6, 15, 16, 32, 34], "file": [2, 4, 5, 6, 8, 14, 15, 16, 17, 18, 31, 34], "under": [2, 6, 8, 18, 20, 27, 31, 34], "need": [2, 5, 6, 7, 10, 13, 14, 16, 17, 18, 19, 20, 21, 23, 26, 29, 31, 32, 33, 34], "calibr": [2, 13, 22, 26, 29, 30, 32, 34], "dict": [2, 6, 23], "int4": [2, 28, 29, 34], "": [2, 3, 5, 8, 10, 14, 15, 18, 19, 20, 21, 22, 26, 31, 32, 33], "should": [2, 5, 8, 15, 20, 28, 31, 33], "state_dict": [2, 6], "checkpoint": [2, 6, 29], "pt": [2, 6, 13, 14, 15, 23, 32, 34], "gptq": [2, 6, 34], "etc": [2, 5, 6, 17, 34], "where": [2, 5, 7, 16, 21, 33], "kei": [2, 7, 28, 34], "scale": [2, 3, 6, 15, 28], "zero": [2, 6, 15, 34], "point": [2, 6, 8, 15, 21, 33, 34], "bia": [2, 8, 20, 34], "weight_kei": 2, "packed_weight": 2, "scale_kei": 2, "zero_point_kei": 2, "packed_zp": 2, "bias_kei": 2, "chang": [2, 5, 6, 7, 8, 10, 11, 12, 15, 17, 18, 20, 23, 25, 26, 29, 31], "make": [2, 5, 6, 7, 14, 15, 17, 21, 23, 28, 32, 33], "n": [2, 6, 
7, 16, 18, 19, 20, 26, 32, 33, 34], "thei": [2, 7, 8, 31, 33], "uint4": 2, "compress": 2, "along": [2, 5, 6, 21, 33, 34], "store": [2, 17, 18, 19, 21, 28, 31, 32, 33, 34], "int32": 2, "state": [2, 15, 19, 28], "automaticlli": 2, "deploy": [2, 7, 13, 34], "torchscirpt": 2, "workabl": 2, "forward": [2, 6, 8, 13, 16, 20, 21, 26, 32, 33, 34], "after": [2, 5, 7, 13, 20, 21, 23, 24, 32, 33, 34], "deepspe": [2, 34], "parallel": [2, 5, 6, 7, 28, 33, 34], "class": [2, 5, 6, 7, 8, 10, 16, 20, 26, 34], "verbos": [2, 4, 31], "demand": [2, 7], "easier": [2, 18, 21], "debug": [2, 31], "dump": [2, 31], "messag": [2, 6, 10, 12, 18, 31], "contain": [2, 5, 6, 13, 17, 26, 31, 32, 33, 34], "durat": [2, 21], "while": [2, 7, 8, 11, 12, 18, 21, 26, 28, 32, 33, 34], "via": [2, 5, 6, 7, 18, 20, 30, 31, 33, 34], "environ": [2, 5, 6, 17, 20, 24, 28, 30, 31, 32, 33], "variabl": [2, 5, 17, 30, 31, 32, 33, 34], "name": [2, 5, 7, 14, 17, 25, 28, 31, 32, 33, 34], "dnnl_verbos": 2, "howev": [2, 5, 7, 8, 9, 16, 20, 26, 28, 31, 33, 34], "those": [2, 15, 33], "amount": [2, 16, 26, 28, 33], "investig": [2, 31], "singl": [2, 7, 13, 14, 16, 19, 20, 30, 32, 34], "iter": [2, 16, 21, 28, 34], "out": [2, 5, 6, 7, 8, 10, 13, 16, 19, 20, 30, 31, 33, 34], "second": [2, 10, 28, 32, 33], "verbose_on": 2, "verbose_off": 2, "disabl": [2, 6, 7, 13, 26, 31, 33, 34], "verbose_on_cr": 2, "creation": 2, "linearsilu": [2, 34], "silu": [2, 13], "http": [2, 5, 16, 34], "org": [2, 7, 16, 26, 34], "stabl": [2, 3, 8, 34], "html": [2, 5, 16], "output": [2, 6, 7, 8, 13, 14, 16, 18, 23, 26, 34], "same": [2, 5, 7, 10, 15, 16, 17, 18, 20, 21, 28, 31, 32, 33, 34], "init": [2, 5, 15, 34], "linear_modul": 2, "4096": [2, 33], "ipex_fus": 2, "randn": [2, 10, 13, 16, 18, 32, 34], "linearsilumul": [2, 34], "multipli": 2, "mul": [2, 13, 16], "linear2silumul": [2, 34], "linear_": 2, "linear_m": 2, "two": [2, 7, 14, 16, 20, 21, 28, 32, 33, 34], "linear_s_modul": 2, "linear_m_modul": 2, "linearrelu": [2, 34], "relu": [2, 7, 13, 16, 18, 26, 
34], "linearnewgelu": [2, 34], "newgeluactiv": 2, "com": [2, 5, 34], "huggingfac": [2, 6, 26, 28, 32, 34], "blob": 2, "src": [2, 17], "activ": [2, 6, 7, 15, 16, 20, 28, 31, 33], "py": [2, 5, 10, 14, 20, 31, 32, 34], "l50": 2, "new_gelu": 2, "lineargelu": [2, 34], "gelu": [2, 13, 34], "linearmul": [2, 34], "linearadd": [2, 34], "add": [2, 5, 7, 8, 13, 14, 19, 21, 32, 34], "linearaddadd": [2, 34], "other_1": 2, "other_2": 2, "rotaryembed": [2, 34], "max_position_embed": 2, "int": [2, 6, 7, 14, 17, 23, 26, 29, 31, 34], "pos_embd_dim": 2, "10000": 2, "backbon": 2, "co": 2, "paper": [2, 34], "2104": 2, "09864": 2, "queri": [2, 17, 18], "multi": [2, 7, 14, 20, 28, 31, 33, 34], "head": [2, 34], "comput": [2, 6, 7, 13, 15, 16, 18, 20, 21, 28, 30, 31, 32, 33, 34], "max": [2, 6, 16, 17, 22, 23, 26, 34], "posit": [2, 28, 33, 34], "frequenc": [2, 30], "exact": 2, "g": [2, 7, 8, 16, 17, 18, 28, 34], "gptjforcausallm": 2, "architectur": [2, 28, 30, 33], "eleutherai": [2, 28], "6b": [2, 28, 30], "l4": 2, "batch": [2, 6, 7, 13, 16, 18, 20, 23, 26, 30, 32, 34], "sequenc": [2, 18, 21, 28, 34], "length": [2, 5, 14, 21, 26, 30, 34], "num_head": 2, "num_kv_head": 2, "head_dim": 2, "position_id": [2, 6], "element": [2, 18, 19], "past_kv_length": 2, "id": [2, 31, 32], "construct": [2, 7, 13], "current_posit": 2, "num": [2, 20, 32, 33, 34], "dim": [2, 6, 18, 23], "offset": [2, 18, 28], "sin": 2, "neighbor": 2, "rotary_dim": 2, "rotary_ndim": 2, "rotari": [2, 28], "64": [2, 8, 10, 16, 20, 30, 31, 34], "gptj": 2, "rope_modul": 2, "2048": [2, 6], "32": [2, 6, 18, 21, 23, 30, 31, 32], "16": [2, 17, 20, 21, 30, 31, 32], "256": [2, 30], "arang": [2, 6, 16], "unsqueez": 2, "query_roteri": 2, "direct": [2, 5, 13], "apply_funct": 2, "without": [2, 5, 6, 7, 8, 10, 16, 20, 21, 26, 32, 34], "initi": [2, 20, 32], "assum": [2, 7, 8, 23, 32, 33, 34], "num_token": 2, "rotary_half": 2, "rmsnorm": [2, 28, 34], "hidden_s": [2, 6], "ep": [2, 7, 10, 19], "1e": [2, 7, 10, 16], "06": [2, 31, 32], "hidden": [2, 
18, 28], "modeling_llama": 2, "l76": 2, "variance_epsilon": 2, "6": [2, 5, 7, 11, 14, 20, 30, 31, 32, 33, 34], "ones": [2, 6, 17], "hidden_st": 2, "usual": [2, 18, 20, 33], "rmsnorm_modul": 2, "fastlayernorm": [2, 34], "normalized_shap": 2, "layernorm": [2, 13, 16, 22, 34], "list": [2, 5, 7, 8, 13, 14, 16, 18, 25, 29, 31, 32, 33, 34], "denomin": 2, "numer": [2, 8, 33], "stabil": [2, 8, 34], "layernorm_modul": 2, "05": [2, 7, 10, 30, 31], "indirectaccesskvcacheattent": [2, 34], "text_max_length": 2, "kv_cach": [2, 28], "decod": [2, 28, 30, 34], "layer": [2, 16, 20, 22, 28, 34], "bring": [2, 6, 7, 9, 15, 16, 21, 28, 31, 33, 34], "beam": [2, 28], "idx": [2, 28, 31], "concat": [2, 20, 26, 28, 34], "entir": [2, 16, 28], "context": [2, 5, 6, 8, 20, 28, 33], "dot": [2, 7, 18, 28], "veri": [2, 5, 15, 18, 28], "long": [2, 6, 18, 21, 26, 28, 34], "bottleneck": [2, 28], "indirect": 2, "access": [2, 6, 7, 18, 19, 32], "iakv": [2, 28], "firstli": [2, 28], "pre": [2, 28, 34], "alloc": [2, 10, 20, 28, 30, 32, 34], "buffer": [2, 28], "index": [2, 5, 18, 28, 33], "histori": [2, 14, 28], "decid": [2, 15, 20, 28], "timestamp": [2, 28], "max_seq": 2, "head_num": 2, "head_siz": 2, "token": [2, 6, 23, 28, 30], "everi": [2, 28], "kv": 2, "seq_len": [2, 30], "scale_attn": 2, "sqrt": [2, 13, 19], "layer_past": 2, "seq_info": 2, "key_cach": 2, "value_cach": 2, "info": [2, 6, 17, 26, 31, 32, 34], "head_mask": 2, "mask": [2, 7, 17, 26], "yet": [2, 6, 26, 34], "attention_mask": [2, 6], "attn_weight": 2, "first": [2, 3, 5, 6, 7, 9, 10, 12, 16, 19, 20, 21, 26, 31, 32, 33], "matmul": [2, 8, 13, 26, 34], "new_layer_past": 2, "attn_output": 2, "l1318": 2, "def": [2, 6, 8, 10, 16, 20, 26, 34], "_reorder_cach": 2, "self": [2, 6, 8, 10, 16, 20, 26, 34], "past_key_valu": [2, 6], "beam_idx": 2, "len": [2, 6, 7, 13, 16, 17], "4": [2, 6, 11, 13, 14, 18, 20, 23, 28, 30, 31, 33, 34], "3": [2, 5, 6, 7, 8, 10, 12, 13, 14, 16, 17, 18, 20, 21, 28, 30, 31, 33], "pagedattent": [2, 34], "vllm": 2, "blog": [2, 34], 
"2023": [2, 3, 30], "20": [2, 7, 18, 30, 31, 32, 34], "page": [2, 6, 13, 20, 24, 29, 30, 33, 34], "num_block": 2, "block_siz": 2, "basic": [2, 4, 16, 21, 33], "logic": [2, 14, 18, 32, 33], "dram": 2, "manag": [2, 8, 13, 20, 28, 31], "slot": [2, 30], "reshape_and_cach": 2, "single_query_cached_kv_attent": 2, "mha": [2, 34], "intra": 2, "tabl": [2, 7, 17, 28, 30, 34], "map": [2, 6, 18, 30], "physic": [2, 14, 20, 32, 33], "slot_map": 2, "allcat": 2, "keytensor": 2, "num_seq": 2, "block_numb": 2, "head_map": 2, "block_tabl": 2, "context_len": 2, "max_context_len": 2, "alibi_slop": 2, "5": [2, 6, 10, 13, 14, 16, 17, 18, 19, 20, 21, 22, 26, 28, 30, 31, 32, 33, 34], "max_num_blocks_per_seq": 2, "optin": 2, "alibi": 2, "slope": 2, "varlenattent": [2, 34], "scaled_dot_product_attent": 2, "accept": [2, 34], "variant": [2, 8, 28], "among": [2, 31, 32, 33], "doe": [2, 7, 13, 18, 20, 26, 34], "arg": [2, 4, 6, 7, 14, 16, 19, 23, 31, 32, 34], "query_token": 2, "total": [2, 6, 30, 33], "key_token": 2, "value_token": 2, "seqlen_q": 2, "batch_siz": [2, 6, 11, 13, 16, 18, 23, 32], "seqlen_k": 2, "max_seqlen_q": 2, "max_seqlen_k": 2, "pdropout": 2, "probabl": 2, "greater": 2, "softmax_scal": 2, "factor": [2, 6, 16, 31], "softmax": [2, 13, 34], "is_caus": 2, "causal": 2, "varlenattention_modul": 2, "emply_lik": 2, "rotary_embed": [2, 34], "rms_norm": [2, 34], "fast_layer_norm": [2, 34], "expect": [2, 7, 30, 34], "indirect_access_kv_cache_attent": [2, 34], "add_casual_mask": 2, "varlen_attent": [2, 34], "zero_tensor": 2, "return_softmax": 2, "gen_": 2, "fast_bert": [2, 4, 6, 7, 11, 34], "unpad": 2, "tpp": [2, 28], "speedup": [2, 6, 8, 28, 30, 34], "still": [2, 5, 7, 8, 13, 16, 18, 21, 26, 34], "squenc": 2, "sparsiti": 2, "seed": 2, "libxsmm": 2, "though": [2, 7], "peak": [2, 7, 11, 34], "enable_onednn_fus": [2, 13], "get_smooth_quant_qconfig_map": [2, 6, 29], "alpha": [2, 6, 19, 22], "act_observ": 2, "act_ic_observ": 2, "wei_observ": 2, "wei_ic_observ": 2, "share_weight_observ": 2, 
"smoothquant": [2, 6, 7, 16, 22, 28, 34], "arxiv": 2, "pdf": 2, "2211": 2, "10438": 2, "hyper": [2, 30, 33, 34], "observ": [2, 9, 13, 15, 34], "op": [2, 7, 15, 16, 22, 28, 34], "histogramobserv": [2, 15], "q": [2, 28], "min": [2, 16, 22, 26, 34], "affect": [2, 31], "argument": [2, 6, 7, 22, 26, 31], "ao": [2, 6, 15], "minmaxobserv": [2, 6, 15], "channel": [2, 3, 10, 15, 16, 26, 34], "perchannelminmaxobserv": [2, 6, 15], "with_arg": [2, 6, 15], "ch_axi": 2, "qint8": [2, 6, 15], "qscheme": [2, 6, 15, 34], "per_channel_symmetr": [2, 6, 15], "qconfig": [2, 4, 6, 13, 16, 26, 29, 32, 34], "prepar": [2, 4, 6, 13, 16, 26, 29, 32, 34], "example_input": [2, 4, 6, 13, 15, 29, 32, 34], "bn_fold": 2, "example_kwarg_input": 2, "fp32": [2, 4, 16, 17, 19, 21, 23, 28, 34], "A": [2, 5, 6, 7, 10, 11, 17, 26, 28, 31, 33, 34], "even": [2, 5, 7, 33, 34], "prepared_model": [2, 4, 6, 13, 15, 16, 26, 29, 34], "original_model": 2, "later": [2, 7, 25, 33], "unexpect": 2, "behavior": [2, 20, 31, 33], "insert": [2, 16], "fake": 2, "introduct": [2, 7, 28, 33, 34], "avaiabl": 2, "autotun": [2, 4, 22, 34], "calib_dataload": [2, 6, 16, 34], "calib_func": 2, "eval_func": [2, 16, 34], "op_type_dict": 2, "smoothquant_arg": [2, 16], "sampling_s": [2, 4, 16, 34], "accuracy_criterion": [2, 4, 16, 34], "tuning_tim": [2, 4, 16, 34], "driven": 2, "tune": [2, 3, 4, 7, 8, 15, 20, 26, 28, 31, 32, 34], "help": [2, 5, 6, 17, 23, 28, 31, 33, 34], "quickli": 2, "dataload": [2, 6, 10, 13, 16, 20, 22, 29, 34], "post": [2, 4, 5, 7, 15, 28, 34], "process": [2, 6, 7, 11, 12, 14, 16, 19, 20, 21, 26, 31, 32, 33], "metric": [2, 16, 30], "scalar": 2, "higher": [2, 7, 13, 17, 18, 28], "constraint": [2, 34], "optyp": 2, "wise": [2, 16, 19, 22, 29, 34], "space": [2, 7, 16, 18, 22, 33], "global": [2, 20, 22, 34], "algorithm": [2, 13, 18, 30, 34], "would": [2, 5, 6, 14, 16, 17, 18, 30, 31, 32, 33, 34], "explor": 2, "100": [2, 4, 14, 16, 17, 30, 32], "accuracy_criterion_typ": 2, "rel": [2, 4, 16, 31, 34], "absolut": [2, 31], 
"accuracy_criterion_valu": 2, "maximum": [2, 16, 17], "allow": [2, 8, 14, 16, 22, 31, 33, 34], "01": [2, 4, 7, 16, 31, 32, 34], "timeout": [2, 5, 21], "earli": [2, 34], "stop": [2, 33], "is_runtime_ext_en": 2, "helper": 2, "check": [2, 5, 6, 7, 13, 18, 28, 29, 31, 34], "exetens": 2, "openmp": [2, 7, 20, 26, 30, 32, 34], "preload": [2, 31], "cpupool": [2, 20, 34], "core_id": [2, 20, 31], "node_id": [2, 20, 31, 32, 34], "abstract": [2, 11, 20], "pool": [2, 20, 34], "core": [2, 7, 14, 17, 30, 33, 34], "numa": [2, 20, 31, 32, 34], "node": [2, 20, 30, 32, 33, 34], "pin": [2, 20], "cpu_pool": [2, 20, 34], "region": [2, 8, 17, 33], "design": [2, 5, 8, 18, 21, 29, 34], "decor": 2, "multistreammodulehint": [2, 20, 34], "kwarg": [2, 29], "hint": [2, 20], "multistreammodul": [2, 7, 20, 26, 34], "its": [2, 6, 7, 8, 14, 17, 21, 28, 30, 31, 32, 33, 34], "arbitrari": 2, "keyword": 2, "num_stream": [2, 20, 34], "auto": [2, 6, 10, 17, 18, 22, 23, 26, 28, 31, 33, 34], "concat_output": 2, "input_split_hint": [2, 20], "multi_stream": 2, "output_concat_hint": [2, 20], "stream": [2, 7, 20, 34], "throughput": [2, 3, 18, 20, 26, 28, 30, 34], "insid": [2, 5, 20, 31], "divis": [2, 20], "equal": [2, 15, 20, 32, 33], "remaind": [2, 20], "divisor": [2, 20], "batchsiz": [2, 20], "larger": [2, 20, 30, 33], "piec": [2, 20], "less": [2, 8, 18, 20, 26, 34], "mini": [2, 20, 34], "don": [2, 5, 8, 14, 17, 34], "want": [2, 5, 7, 14, 15, 17, 20, 31, 34], "leav": [2, 20, 33], "scriptmodul": [2, 13, 20], "union": 2, "instanc": [2, 7, 10, 14, 32, 34], "reason": [2, 10, 18, 20, 34], "flag": [2, 5, 7, 17, 20, 31, 34], "indic": [2, 6, 18, 28], "concaten": [2, 21], "raw": 2, "asynchron": [2, 7], "get_core_list_of_node_id": 2, "softwar": [3, 27, 34], "jul": 3, "deep": [3, 7, 8, 11, 13, 14, 21, 33], "learn": [3, 7, 8, 11, 13, 14, 21, 31, 33], "boost": [3, 6, 7, 9, 21, 30, 31, 33, 34], "dl": [3, 7, 34], "hug": 3, "face": 3, "bert": [3, 4, 10, 30, 34], "googl": [3, 5, 28], "cloud": 3, "platform": [3, 7, 18, 32, 
33, 34], "gcp": 3, "technologi": [3, 7], "guid": [3, 6, 7, 17, 32, 34], "apr": 3, "mar": [3, 32], "new": [3, 5, 12, 16, 17, 18, 20, 23, 26, 29, 33], "x86": 3, "sapphir": 3, "rapid": 3, "part": [3, 5, 7, 8, 18, 21, 26, 33, 34], "jan": 3, "secur": 3, "torchserv": [3, 34], "confer": 3, "dec": 3, "2022": [3, 31, 32], "what": [3, 5, 6, 8, 23], "pyg": 3, "diffus": [3, 34], "arc": 3, "nov": 3, "13": [3, 10, 17, 30, 31, 32, 33], "potenti": [3, 7, 34], "fine": [3, 20, 31, 32, 33, 34], "fx": [3, 7, 10, 26, 34], "sep": [3, 17], "empow": 3, "xeon": [3, 7, 14, 21, 28, 30, 32, 33, 34], "scalabl": [3, 7, 21, 28, 30, 33, 34], "processor": [3, 7, 19, 21, 28, 30, 33, 34], "aug": [3, 30], "vision": [3, 6, 30], "last": [3, 10, 21, 26, 34], "One": [3, 18, 19, 31, 33], "click": 3, "compressor": [3, 7, 16, 22, 34], "4x": 3, "jun": 3, "grokk": 3, "principl": [3, 18], "kt": 3, "person": 3, "text": [3, 6, 26, 28, 30, 33], "speech": [3, 33], "2021": [3, 17, 31, 32], "up": [3, 7, 11, 20, 24, 28, 33, 34], "modern": 3, "naver": 3, "low": [3, 4, 6, 7, 21, 23, 31, 33, 34], "latenc": [3, 14, 18, 28, 30, 32, 34], "machin": [3, 5, 6, 7, 14, 17, 26, 31, 32, 33, 34], "feb": 3, "dlrm": [3, 7, 26, 30, 34], "oneccl": [3, 6, 31, 34], "mention": [3, 10, 20, 21, 34], "deprec": [3, 26], "facebook": [3, 6, 28], "3rd": [3, 7, 21, 30, 34], "gen": [3, 30, 34], "capabl": [3, 17, 34], "2020": 3, "collabor": 3, "2019": 3, "caff": 3, "2017": 3, "command": [4, 5, 6, 14, 23, 31, 32, 33, 34], "descript": [4, 7, 16, 18, 20, 25, 33, 34], "instal": [4, 5, 6, 23, 25, 26, 28, 33, 34], "m": [4, 14, 20, 26, 31, 32, 33, 34], "pip": [4, 5, 34], "captur": [4, 34], "log": [4, 6, 13, 31, 32, 34], "prompt": [4, 6, 23, 34], "export": [4, 31, 33], "onednn_verbos": 4, "dure": [4, 6, 7, 10, 13, 16, 21, 31, 33, 34], "precis": [4, 6, 13, 21, 23, 26, 30, 34], "no_grad": [4, 6, 10, 11, 12, 13, 15, 16, 20, 23, 26, 29, 32, 34], "amp": [4, 6, 10, 23, 26, 34], "autocast": [4, 6, 7, 10, 23, 34], "prototyp": [4, 13, 20, 26, 34], "fast": [4, 12, 
33, 34], "bertmodelmodel": 4, "bertmodel": [4, 6, 11, 32], "from_pretrain": [4, 6, 11, 23, 29, 32], "uncas": [4, 6, 10, 11, 32, 34], "launch": [4, 6, 20, 32, 34], "autom": [4, 7, 8, 14, 31, 32, 34], "ipexrun": [4, 10, 31, 34], "lt": [4, 28, 30], "your_pytorch_script": [4, 31], "gt": [4, 14, 28, 33], "hypertun": [4, 34], "hyperparamet": [4, 7], "conf": [4, 13, 14, 31, 34], "your_conf_fil": [4, 34], "your_python_script": [4, 34], "default_static_qconfigprepared_model": 4, "anyplac": 4, "d": [4, 5, 6, 7, 8, 13, 26, 28, 34], "calibration_data_load": [4, 6, 13], "converted_model": [4, 6, 26, 34], "default_dynamic_qconfigprepared_model": 4, "tuned_model": [4, 16, 34], "eval_funct": 4, "convert_model": [4, 13, 15, 16], "thank": [5, 34], "interest": 5, "begin": 5, "intent": 5, "propos": [5, 7, 11, 16, 18, 21], "intend": 5, "shall": [5, 18, 33], "discuss": [5, 18, 33], "agre": 5, "plan": [5, 7, 10], "look": [5, 14, 16, 18], "ahead": 5, "outstand": 5, "pick": 5, "comment": [5, 14, 17, 22, 34], "particular": [5, 6, 8, 29, 34], "ask": 5, "pull": 5, "here": [5, 8, 10, 13, 16, 17, 18, 20, 26, 32, 33, 34], "uninstal": 5, "ll": [5, 32, 33], "know": 5, "fulli": [5, 15, 17, 21, 33, 34], "warn": [5, 6, 12, 31, 32, 34], "skip": [5, 6, 17, 18, 31], "few": [5, 7, 9, 13, 16, 18, 32, 34], "alwai": [5, 6, 7, 8, 18, 31, 33, 34], "loop": [5, 21, 29], "re": [5, 8, 32, 33], "feel": [5, 18, 34], "lazi": 5, "ye": 5, "clone": 5, "copi": [5, 17, 18], "cd": [5, 6], "rebas": [5, 34], "submodul": 5, "sync": [5, 20], "recurs": 5, "job": 5, "setup": [5, 6, 28, 34], "symlink": 5, "tree": [5, 6], "reinstal": [5, 26], "again": [5, 19, 32], "__init__": [5, 6, 8, 10, 16, 20, 26, 34], "repeatedli": 5, "interfac": [5, 6, 18, 26, 28], "pyi": 5, "non": [5, 8, 13, 18, 30, 32, 34], "cpp": [5, 6, 33], "cc": [5, 6, 17], "cu": 5, "h": [5, 6, 7, 16, 18, 26, 31, 32], "sure": [5, 14, 15, 32, 33], "until": [5, 20, 21, 33], "next": [5, 7, 34], "clean": 5, "cmake": [5, 6, 17, 34], "must": [5, 14, 17, 19], "maco": 5, 
"linux": [5, 6, 17, 30, 31, 33], "homebrew": 5, "brew": 5, "our": [5, 16, 19, 28, 33, 34], "error": [5, 6, 7, 10, 16, 18, 21, 22, 26, 34], "printf": 5, "stdio": 5, "nint": 5, "hello": 5, "world": [5, 7], "clang": 5, "simpl": [5, 7, 8, 11, 18, 33, 34], "binari": [5, 6, 7, 8, 17, 34], "folder": 5, "mani": [5, 14, 28, 31, 33, 34], "wai": [5, 10, 16, 18, 28, 34], "rm": 5, "rf": 5, "toplevel": 5, "over": [5, 7, 8, 9, 16, 18, 30, 31, 34], "made": [5, 34], "edit": [5, 26, 34], "repo": [5, 6, 7], "commit": 5, "ani": [5, 8, 10, 17, 18, 32, 34], "keep": [5, 12, 18, 21, 28, 32, 33, 34], "realli": 5, "untrack": 5, "deinit": 5, "f": [5, 6, 13, 16, 28, 34], "xdf": 5, "within": [5, 16, 21, 29, 33, 34], "experi": [5, 7, 10, 12, 16, 18, 26, 33, 34], "env_key1": 5, "env_val1": 5, "env_key2": 5, "env_val2": 5, "suit": 5, "locat": [5, 17, 34], "test_": 5, "individu": [5, 30], "filenam": 5, "repres": [5, 7, 21], "wish": [5, 7], "test_jit": 5, "narrow": 5, "down": [5, 32, 34], "testclassnam": 5, "testnam": 5, "let": [5, 10, 18, 19, 20, 21], "sai": 5, "test_sequenti": 5, "testjit": 5, "expecttest": 5, "hypothesi": 5, "mypi": 5, "depend": [5, 7, 17, 18, 25, 26, 33, 34], "conda": [5, 33], "offici": [5, 32, 33, 34], "unittest": 5, "substr": 5, "test_nn": 5, "v": 5, "testnn": 5, "test_bceloss": 5, "test_mseloss": 5, "keystrok": 5, "ci": 5, "quicklint": 5, "aren": 5, "setup_lint": 5, "target": [5, 6, 10, 13, 14, 17, 34], "makefil": 5, "complet": [5, 6, 14, 18, 29, 33], "tab": 5, "trail": [5, 21], "newlin": 5, "quick_check": 5, "flake8": 5, "cmakelint": 5, "tidi": 5, "changed_onli": 5, "written": [5, 6, 17], "framework": [5, 34], "runner": 5, "bin": [5, 6, 17, 31, 32], "gtest_filt": 5, "testsuit": 5, "maycontainalia": 5, "containeraliasingtest": 5, "test_alias_analysi": 5, "docstr": 5, "line": [5, 10, 13, 18, 31, 32, 33], "limit": [5, 8, 10, 20, 26, 32, 33, 34], "80": [5, 30, 31], "charact": 5, "fit": [5, 7, 33, 34], "jupyt": 5, "popup": 5, "prerequisit": [5, 6], "r": [5, 6, 7, 14, 23, 30, 32, 
33], "txt": [5, 6, 32], "_build": 5, "rst": 5, "live": 5, "tutori": [5, 6, 15, 16, 34], "autofunct": 5, "autoclass": 5, "shorten": 5, "sphinx": 5, "produc": [5, 8], "miss": 5, "relat": [6, 13, 17, 31, 33, 34], "demonstr": [6, 18, 26, 32], "box": [6, 10, 33], "benefit": [6, 7, 8, 10, 20, 21, 28, 32, 33, 34], "against": 6, "below": [6, 8, 10, 14, 19, 20, 21, 22, 23, 26, 28, 31, 32, 33, 34], "criterion": [6, 8, 16, 22], "zero_grad": [6, 7, 16], "torchvis": [6, 10, 12, 13, 16, 18, 32, 34], "lr": [6, 7, 8, 16, 19], "001": [6, 8], "download": [6, 13, 16], "dataset": [6, 13, 16, 29, 30, 33, 34], "cifar10": [6, 13], "compos": [6, 13], "resiz": [6, 13], "224": [6, 8, 10, 12, 13, 30, 32, 34], "totensor": [6, 13, 16], "train_dataset": [6, 13], "root": [6, 13, 16, 17, 28], "train_load": [6, 8], "128": [6, 8, 10, 13, 20, 30, 34], "crossentropyloss": [6, 16], "momentum": [6, 10, 21], "9": [6, 7, 14, 17, 23, 25, 31, 32], "uncom": 6, "batch_idx": [6, 13], "enumer": [6, 13, 16, 29], "backward": [6, 7, 8, 16, 21, 33, 34], "print": [6, 11, 12, 13, 14, 16, 17, 23, 31], "model_state_dict": 6, "optimizer_state_dict": 6, "pth": 6, "finish": [6, 11, 12, 13, 16, 20], "noqa": [6, 11, 12, 13, 16, 23, 29], "f401": [6, 11, 12, 13, 16, 23, 29], "oneapi": [6, 33], "collect": [6, 32, 33, 34], "commun": [6, 28, 31, 32, 33, 34], "bind": [6, 7, 31, 32, 33, 34], "o": [6, 17, 23, 30], "dist": 6, "oneccl_bindings_for_pytorch": 6, "torch_ccl": 6, "master_addr": 6, "127": [6, 31, 34], "master_port": 6, "29500": [6, 31], "rank": [6, 31, 34], "pmi_rank": 6, "world_siz": [6, 29], "pmi_siz": [6, 29], "init_process_group": 6, "ccl": [6, 31, 34], "init_method": 6, "env": [6, 29], "dist_sampl": 6, "distributedsampl": 6, "sampler": 6, "distributeddataparallel": 6, "batch_id": 6, "destroy_process_group": 6, "nlp": [6, 7, 26, 30, 34], "resnet50_weight": [6, 12, 13], "rand": [6, 8, 12, 13, 20, 26, 34], "vocab_s": [6, 11, 32], "seq_length": [6, 11, 32], "randint": [6, 11, 32], "freez": [6, 8, 10, 13, 15, 16, 20, 23, 
26, 32, 34], "check_trac": [6, 13, 32], "strict": [6, 32], "sinc": [6, 7, 18, 19, 20, 21, 26, 33, 34], "manual_se": [6, 11], "43": [6, 11, 31, 32], "12": [6, 10, 14, 17, 30, 31, 32], "instanti": 6, "qconfig_map": 6, "default_static_qconfig_map": 6, "own": [6, 15, 28], "qconfigmap": 6, "per_tensor_affin": [6, 15, 34], "quint8": [6, 15], "set_glob": 6, "traced_model": [6, 10, 13, 15, 16, 26, 34], "static_quantized_model": 6, "local": [6, 20, 28, 31, 32, 33], "default_dynamic_qconfig_map": 6, "placeholderobserv": [6, 15], "is_dynam": [6, 15], "dynamic_quantized_model": 6, "dedic": [6, 28, 34], "faster": [6, 7, 8, 30, 33], "variou": [6, 7, 14, 28, 33, 34], "38": [6, 11, 31, 32], "account": 6, "pretrain": [6, 32, 34], "login": 6, "argpars": [6, 23], "autoconfig": [6, 23], "automodelforcausallm": [6, 23, 29, 34], "autotoken": [6, 23], "parser": [6, 23], "argumentpars": [6, 23], "add_help": [6, 23], "add_argu": [6, 23], "choic": [6, 21, 23, 31], "choos": [6, 8, 20, 23, 31, 33, 34], "dinner": [6, 23], "greedi": [6, 23], "action": [6, 23], "store_tru": [6, 23], "parse_arg": [6, 23], "amp_en": [6, 23], "els": [6, 14, 17, 18, 23], "amp_dtyp": [6, 23], "getattr": [6, 23], "model_id": [6, 23], "125m": 6, "trust_remote_cod": [6, 23], "torch_dtyp": [6, 23], "low_cpu_mem_usag": [6, 23], "memory_format": [6, 7, 18, 23], "channels_last": [6, 7, 18, 23, 33, 34], "num_beam": [6, 23], "generate_kwarg": [6, 23], "do_sampl": [6, 23], "temperatur": [6, 23], "input_s": [6, 23], "return_tensor": [6, 23], "input_id": [6, 23], "inference_mod": [6, 23, 29], "gen_id": [6, 23], "max_new_token": [6, 23], "gen_text": [6, 23], "batch_decod": [6, 23], "skip_special_token": [6, 23], "input_tokens_length": [6, 23], "output_tokens_length": [6, 23], "total_new_token": [6, 23], "zip": [6, 23, 34], "flush": [6, 23], "typic": [6, 10, 28, 33, 34], "summari": [6, 34], "narg": 6, "neelnanda": 6, "pile": 6, "10k": 6, "meta": [6, 18, 28, 29], "7b": [6, 28, 30], "hf": [6, 28], "beam_idx_tmp": 6, "contigu": [6, 
13, 18, 33, 34], "global_past_key_valu": 6, "num_attention_head": 6, "user_model": [6, 15], "num_hidden_lay": 6, "pad_val": 6, "pad_max": 6, "tokenize_funct": 6, "set_format": 6, "column": 6, "elif": 6, "collate_batch": 6, "position_ids_pad": 6, "input_ids_pad": 6, "last_ind": 6, "attention_mask_pad": 6, "append": [6, 7], "vstack": 6, "calib_dataset": [6, 29], "load_dataset": 6, "calib_evalu": 6, "shuffl": 6, "collate_fn": 6, "break": [6, 16, 34], "calibration_sampl": 6, "save_qconf_summari": [6, 15, 16, 29], "qconf_summari": [6, 15, 16, 29], "int8_qconfig": 6, "done": [6, 10, 16, 17, 26, 33, 34], "Will": [6, 18], "exit": [6, 31], "benchmark": [6, 26, 30, 31, 34], "lowp": 6, "fp16": [6, 17, 29], "unrel": 6, "lowp_mod": [6, 29], "fall": [6, 12], "back": [6, 12, 17, 18, 21, 26], "implicitli": 6, "determin": [6, 17, 21, 33], "woqweightdtyp": [6, 29], "weight_dtyp": [6, 29], "woqlowpmod": [6, 29], "get_weight_only_quant_qconfig_map": [6, 29], "known": [6, 10, 28], "practic": [6, 21, 24, 28, 33], "libtorch": [6, 34], "suppos": [6, 14, 33], "handl": [6, 18, 33], "servic": [6, 28, 30, 33], "regular": [6, 21], "unlik": 6, "app": [6, 34], "iostream": 6, "argc": 6, "const": [6, 17], "char": 6, "argv": 6, "catch": 6, "c10": [6, 17], "std": [6, 17, 19], "cerr": 6, "ivalu": 6, "push_back": 6, "cout": 6, "slice": [6, 18], "end": [6, 13, 20, 34], "endl": 6, "cmakelist": 6, "cmake_minimum_requir": 6, "version": [6, 7, 16, 17, 25, 26, 27, 32, 33, 34], "fatal_error": 6, "find_packag": 6, "add_execut": 6, "target_link_librari": 6, "torch_ipex_librari": 6, "set_properti": 6, "properti": [6, 32], "cxx_standard": 6, "17": [6, 30, 31, 32], "mkdir": 6, "build": [6, 28, 33, 34], "dcmake_prefix_path": 6, "libpytorch_path": 6, "had": [6, 33], "verifi": [6, 7], "ldd": 6, "workspac": 6, "identif": [6, 17], "gnu": [6, 17, 32], "xx": 6, "cxx": [6, 17], "abi": [6, 17, 34], "usr": [6, 17, 31, 32], "torchconfig": 6, "22": [6, 30, 31, 32], "kineto_librari": 6, "notfound": 6, "stack": [6, 8], "most": 
[6, 7, 13, 21, 28, 30, 32, 33, 34], "recent": [6, 7, 18], "append_torchlib_if_found": 6, "ipexconfig": 6, "84": [6, 30, 31, 33], "lib": [6, 31, 32], "libintel": [6, 34], "ext": [6, 34], "0x00007f3cf98e0000": 6, "libc10": 6, "0x00007f3cf985a000": 6, "0x00007f3cf70fc000": 6, "libtorch_cpu": 6, "0x00007f3ce16ac000": 6, "libdnnl_graph": 6, "0x00007f3cde954000": 6, "former": 6, "zoo": [6, 30], "simpli": [6, 7, 26, 31], "overview": [7, 25, 29, 34], "three": [7, 16, 17], "claus": [7, 10, 19], "guidanc": 7, "intel_pytorch_extens": [7, 25, 26, 34], "10": [7, 14, 16, 17, 18, 21, 25, 26, 31, 32, 33], "correct": [7, 18, 25, 34], "speed": [7, 11, 19, 28, 33, 34], "happen": 7, "inductor": [7, 34], "level": [7, 10, 13, 16, 18, 20, 21, 26, 33, 34], "migrat": 7, "pattern": [7, 11, 18, 28, 34], "highli": [7, 23, 28, 33, 34], "adapt": 7, "nchw": [7, 33], "nhwc": [7, 33, 34], "could": [7, 13, 16, 18, 26, 32, 33, 34], "anymor": [7, 34], "aka": [7, 18], "cooper": [7, 30, 34], "lake": [7, 30, 34], "avx512": [7, 17, 18, 32, 34], "partial": 7, "upstream": [7, 18, 34], "land": [7, 34], "pr": [7, 18, 34], "being": [7, 33], "review": [7, 34], "instead": [7, 8, 14, 19, 20, 29, 30, 31, 32, 33, 34], "device_nam": [7, 8], "conduct": 7, "frequent": 7, "websit": 7, "registr": 7, "topologi": [7, 18, 19, 26, 30, 31, 33, 34], "roialign": [7, 34], "nm": [7, 34], "cnn": [7, 18, 26, 30, 33, 34], "frozenbatchnorm2d": 7, "num_featur": 7, "batchnorm2d": [7, 10, 26, 34], "statist": 7, "affin": [7, 10, 15, 20, 31, 32, 33], "w": [7, 16, 18, 21, 30, 32], "interact": [7, 34], "beyond": 7, "kind": 7, "gender": 7, "hobbi": 7, "between": [7, 8, 17, 20, 33, 34], "man": [7, 33], "plai": [7, 33], "footbal": 7, "b": [7, 8, 16, 28], "mergedembeddingbag": 7, "embedding_spec": 7, "embeddingspec": 7, "merg": [7, 34], "embeddingbag": [7, 26, 34], "At": [7, 17], "stage": [7, 10, 19, 20, 29, 33, 34], "spars": [7, 18, 34], "dens": [7, 18], "gradient": 7, "mergedembeddingbagwithsgd": 7, "emblist": 7, "modulist": 7, "emb1": 7, 
"emb2": 7, "emb3": 7, "emb_m": 7, "in1": 7, "in2": 7, "in3": 7, "in_m": 7, "emb": 7, "in_i": 7, "merged_emb": 7, "from_embeddingbag_list": 7, "minim": [7, 14, 17, 33], "heavi": 7, "big": [7, 18], "read": [7, 19], "futur": [7, 28, 34], "visit": [7, 33], "mergedembeddingbagwith": 7, "weight_decai": [7, 19], "grad": [7, 19], "creat": [7, 16, 20, 33, 34], "decai": 7, "to_bfloat16_train": 7, "merged_input": 7, "linearize_indices_and_offset": 7, "need_linearize_indices_and_offset": 7, "booltensor": 7, "becom": [7, 28, 33], "balanc": [7, 16, 22, 33], "embedingbag": 7, "often": 7, "categor": 7, "power": [7, 33, 34], "law": 7, "ag": 7, "video": 7, "game": 7, "19": [7, 30, 31, 32, 34], "29": [7, 31, 32], "row": 7, "write": [7, 17], "address": [7, 18, 31, 32, 33, 34], "conflict": [7, 17], "solv": [7, 19, 33], "togeth": [7, 14, 20, 33, 34], "immedi": 7, "right": [7, 21, 23, 28], "friendli": [7, 33], "gemm": [7, 18, 26, 28, 34], "aim": [7, 10, 16, 33], "math": 7, "wa": [7, 31, 32, 33, 34], "test": [7, 16, 17, 30, 34], "broad": [7, 9, 34], "toggl": 7, "switch": [7, 17, 31, 33, 34], "concern": 7, "footprint": [7, 21, 28, 34], "stick": 7, "splitsgd": [7, 21], "spawn": [7, 20], "subject": [7, 17, 20, 27, 34], "built": [7, 17, 20, 34], "deliv": [7, 28, 34], "separ": [7, 19, 27, 33], "smooth": 7, "ptq": 7, "tackl": 7, "problem": [7, 19, 26, 32, 33], "systemat": 7, "outlier": [7, 16], "commonli": [7, 28, 33, 34], "hopefulli": 7, "eas": [7, 18, 34], "small": [7, 19, 33, 34], "turn": [7, 34], "boolean": [7, 34], "off": [7, 8, 21, 28, 30, 34], "area": [7, 14], "extrem": [7, 14, 33], "situat": [7, 14], "huge": [7, 14, 33], "impract": [7, 14], "consum": [7, 14], "launcher": [7, 13, 31, 33, 34], "integr": [7, 18, 28, 33, 34], "conveni": [8, 34], "lower": [8, 17, 21, 28, 34], "becaus": [8, 17, 18, 21, 28, 33, 34], "lighter": 8, "smaller": [8, 17], "sacrif": 8, "trade": [8, 28, 30, 34], "slower": [8, 33, 34], "accur": 8, "primarili": [8, 34], "show": [8, 17, 21, 28, 29, 30, 31, 32, 33, 34], 
"simplenet": [8, 34], "super": [8, 10, 16, 20, 26, 34], "stride": [8, 10, 20, 34], "pad": [8, 10, 20, 34], "y": [8, 15, 16, 20, 21, 34], "chosen": [8, 14, 17], "maintain": 8, "categori": [8, 34], "circumst": 8, "imag": [8, 13, 18, 33, 34], "label": 8, "float64": 8, "suppli": 8, "addmm": 8, "addmm_": 8, "cannot": [8, 19, 26, 34], "describ": [8, 13, 18, 21, 32, 33], "expos": 8, "namespac": [8, 17], "regardless": [8, 34], "unlist": 8, "downstream": 8, "believ": [8, 18], "unstabl": 8, "conv1d": [8, 13], "conv3d": [8, 13, 34], "conv_transpose1d": 8, "conv_transpose2d": 8, "conv_transpose3d": 8, "bmm": [8, 34], "mm": 8, "baddbmm": 8, "addbmm": 8, "conv_tbc": 8, "group_norm": 8, "_native_multi_head_attent": 8, "avg_pool3d": 8, "binary_cross_entropi": 8, "grid_sampl": 8, "polar": 8, "prod": 8, "quantil": 8, "nanquantil": 8, "stft": 8, "cdist": 8, "view_as_complex": 8, "choleski": 8, "cholesky_invers": 8, "cholesky_solv": 8, "invers": 8, "lu_solv": 8, "matrix_rank": 8, "orgqr": 8, "ormqr": 8, "pinvers": 8, "max_unpool2d": 8, "max_unpool3d": 8, "adaptive_avg_pool3d": 8, "reflection_pad1d": 8, "reflection_pad2d": 8, "replication_pad1d": 8, "replication_pad2d": 8, "replication_pad3d": 8, "mse_loss": 8, "cosine_embedding_loss": 8, "nll_loss": 8, "nll_loss2d": 8, "hinge_embedding_loss": 8, "poisson_nll_loss": 8, "smooth_l1_loss": 8, "cross_entropy_loss": 8, "l1_loss": 8, "huber_loss": 8, "margin_ranking_loss": 8, "soft_margin_loss": 8, "triplet_margin_loss": 8, "multi_margin_loss": 8, "ctc_loss": 8, "kl_div": 8, "multilabel_margin_loss": 8, "binary_cross_entropy_with_logit": 8, "fft_fft": 8, "fft_ifft": 8, "fft_fft2": 8, "fft_ifft2": 8, "fft_fftn": 8, "fft_ifftn": 8, "fft_rfft": 8, "fft_irfft": 8, "fft_rfft2": 8, "fft_irfft2": 8, "fft_rfftn": 8, "fft_irfftn": 8, "fft_hfft": 8, "fft_ihfft": 8, "linalg_cond": 8, "linalg_matrix_rank": 8, "linalg_solv": 8, "linalg_choleski": 8, "linalg_svdv": 8, "linalg_eigv": 8, "linalg_eigvalsh": 8, "linalg_inv": 8, "linalg_householder_product": 
8, "linalg_tensorinv": 8, "linalg_tensorsolv": 8, "fake_quantize_per_tensor_affin": 8, "eig": 8, "geqrf": 8, "lstsq": 8, "_lu_with_info": 8, "qr": 8, "svd": 8, "symeig": 8, "triangular_solv": 8, "fractional_max_pool2d": 8, "fractional_max_pool3d": 8, "adaptive_max_pool3d": 8, "multilabel_margin_loss_forward": 8, "linalg_qr": 8, "linalg_cholesky_ex": 8, "linalg_svd": 8, "linalg_eig": 8, "linalg_eigh": 8, "linalg_lstsq": 8, "linalg_inv_ex": 8, "cat": [8, 31, 32, 34], "index_copi": 8, "intervent": 8, "mixtur": [8, 34], "enable_auto_channels_last": 9, "disable_auto_channels_last": 9, "regress": [9, 34], "rais": 10, "oob": [10, 34], "easili": [10, 15], "who": 10, "inevit": 10, "simplifi": [10, 34], "snippet": [10, 29], "optimum": 10, "monkei": 10, "patch": [10, 34], "embedding_bag": 10, "qa": [10, 34], "clear": 10, "ninstanc": [10, 14, 31, 34], "ncore": [10, 31], "28": [10, 14, 16, 30, 31, 32, 33, 34], "run_qa": [10, 34], "model_name_or_path": [10, 29, 34], "dataset_nam": [10, 34], "squad": [10, 30, 34], "do_ev": [10, 34], "per_device_train_batch_s": [10, 34], "learning_r": [10, 34], "3e": [10, 34], "num_train_epoch": [10, 34], "max_seq_length": [10, 34], "384": [10, 32, 34], "doc_strid": [10, 34], "output_dir": [10, 14, 34], "tmp": [10, 32, 34], "debug_squad": [10, 34], "dummymodul": 10, "input1": 10, "kernel_s": 10, "7": [10, 14, 17, 20, 21, 31, 32, 34], "track_running_stat": 10, "customized_forward": 10, "method1": 10, "success": [10, 24], "method2": 10, "fail": [10, 26, 34], "top": [10, 21, 34], "unabl": 10, "hook": [10, 16], "As": [10, 19, 20, 28, 31, 32, 33, 34], "behaviour": 10, "repeat": [10, 18, 21], "feasibl": 10, "idea": [11, 21, 33], "primit": [11, 20, 30, 34], "portabl": 11, "hpc": 11, "ensur": [11, 19, 20, 32], "perf": [11, 18], "tri": 12, "failur": [12, 34], "incorrect": [12, 26, 34], "trigger": 12, "meanwhil": [12, 33, 34], "resnet50": [12, 13, 14, 18, 30, 31, 33, 34], "dag": 13, "acycl": 13, "straight": [13, 33], "cover": [13, 18, 31], "constant": 13, 
"resourc": [13, 20, 28, 32, 33], "focus": [13, 34], "front": [13, 34], "batchnorm": [13, 17, 18, 26, 34], "propag": [13, 21, 33], "graph_for": 13, "regard": 13, "rn50": [13, 34], "sum": [13, 16, 18, 19, 34], "convrelu": 13, "convsumrelu": 13, "default_static_qconfig": [13, 15, 32, 34], "quantized_model": [13, 15, 34], "244": 13, "convtranspose3d": 13, "ab": [13, 32], "clamp": 13, "elu": 13, "exp": 13, "hardtanh": 13, "hardswish": [13, 34], "mish": 13, "sigmoid": [13, 34], "pow": 13, "round": [13, 21], "squar": [13, 28], "tanh": [13, 34], "leaki": 13, "_": [13, 15, 16, 17, 18, 20, 30, 31, 32, 33, 34], "div": 13, "view": [13, 18, 20, 21], "transpos": [13, 34], "dequant": [13, 16], "partit": [13, 33], "leaky_relu": 13, "___": 13, "divid": [13, 32, 33, 34], "maxpool2d": 13, "_____": 13, "stock": [13, 30, 34], "owner": 13, "otheriws": 13, "compuat": 13, "wikipedia": [13, 33], "There": [14, 16, 20, 33, 34], "thing": [14, 33], "yaml": 14, "strategi": [14, 33, 34], "grid": 14, "random": 14, "max_trial": 14, "trial": 14, "record": [14, 32], "csv": 14, "hyperparam": 14, "mandatori": 14, "hp": 14, "ncores_per_inst": 14, "all_physical_cor": 14, "ncore_per_inst": [14, 34], "all_logical_cor": 14, "use_all_nod": 14, "num_nod": 14, "use_logical_cor": [14, 32], "is_hyperthreading_en": 14, "disable_numactl": [14, 32], "disable_iomp": [14, 32], "malloc": [14, 31, 33], "tc": 14, "je": 14, "previou": [14, 16, 18, 33, 34], "hyperparamt": 14, "8": [14, 16, 30, 31, 32, 33], "respect": [14, 16, 30, 31, 34], "maxim": 14, "statement": [14, 17], "higher_is_bett": 14, "target_v": 14, "inf": 14, "minimum": [14, 16, 18], "platinum": [14, 30, 32, 33], "8180m": [14, 33], "socket": [14, 30, 32, 33, 34], "anoth": [14, 31, 33, 34], "conf_fil": [14, 34], "hypertune_directori": 14, "termin": 14, "15": [14, 17, 30, 31, 32], "339081764221191": 14, "gave": 14, "side": [15, 33], "compon": [15, 26, 27, 28], "much": [15, 18, 21, 28, 33], "abl": 15, "similar": [15, 17, 33], "satisfi": [15, 26], "tradeoff": 
15, "reduce_rang": 15, "methond": 15, "obsev": 15, "symmetr": 15, "sete": 15, "skylak": 15, "quant_stat": 15, "calibration_data_set": [15, 34], "qparam": 15, "And": [15, 20, 32, 34], "achang": 15, "overrid": 15, "load_qconf_summari": 15, "dynamic_qconfig": 15, "default_dynamic_qconfig": [15, 32], "per_tensor_symmetr": 15, "gru": 15, "lstmcell": 15, "rnncell": 15, "grucel": 15, "bother": 16, "desir": [16, 31], "receip": [16, 20], "sq": 16, "difficulti": 16, "vari": 16, "across": [16, 31], "herebi": 16, "obtain": 16, "abil": 16, "optdecoderlay": 16, "blockwis": 16, "consist": [16, 28, 33, 34], "major": 16, "adjust": 16, "accordingli": 16, "predict": 16, "criteria": 16, "consider": 16, "numpi": 16, "np": [16, 31], "tolist": 16, "auto_alpha_arg": 16, "init_alpha": [16, 22], "baselin": [16, 22, 34], "alpha_min": [16, 22], "alpha_max": [16, 22], "99": [16, 30, 34], "alpha_step": [16, 22], "step_siz": [16, 22], "shared_criterion": [16, 22], "enable_blockwise_loss": [16, 22], "portion": 16, "beginn": 16, "quickstart_tutori": 16, "training_data": 16, "fashionmnist": 16, "test_data": 16, "loader": 16, "train_dataload": 16, "test_dataload": 16, "neuralnetwork": 16, "flatten": [16, 20], "linear_relu_stack": 16, "sequenti": 16, "logit": 16, "loss_fn": 16, "pred": 16, "backpropag": 16, "item": 16, "7f": 16, "5d": 16, "epoch": 16, "argmax": 16, "inc": [16, 17, 22, 28], "accu": 16, "tuned_conf": 16, "explain": [17, 18, 21], "fork": [17, 33], "avx512_vnni": 17, "avx512_bf16": 17, "avx2": [17, 26, 34], "avx2_vnni": 17, "avx512_fp16": 17, "11": [17, 31, 32], "gcc": 17, "findavx": 17, "bodi": 17, "anonym": 17, "virtual": 17, "polymorph": 17, "pertain": 17, "cpuid": 17, "statu": 17, "pointer": 17, "system": [17, 33], "specifii": 17, "complier": 17, "isacodegen": 17, "suffix": 17, "adaptiveaveragepoolingkrnl": 17, "isa_codegen": 17, "o3": 17, "d__avx__": 17, "dcpu_capability_avx2": 17, "mavx2": 17, "mfma": 17, "mno": 17, "avx256": 17, "unalign": [17, 34], "dcpu_cap": 17, 
"dcpu_capability_default": 17, "d__avx512f__": 17, "mavx512f": 17, "mavx512bw": 17, "mavx512vl": 17, "mavx512dq": 17, "dcpu_capability_avx512": 17, "mavx512vnni": 17, "dcpu_capability_avx512_vnni": 17, "mavx512bf16": 17, "dcpu_capability_avx512_bf16": 17, "mamx": 17, "tile": 17, "dcpu_capability_amx": 17, "mavx512fp16": 17, "dcpu_capability_avx512_fp16": 17, "align": [17, 18, 21, 34], "stead": 17, "sleef": 17, "width": [17, 18], "isa_nam": 17, "inlin": 17, "compat": [17, 21], "definit": [17, 21], "Such": 17, "But": [17, 18], "tip": 17, "newkernelkrnl": 17, "newkernel": 17, "header": 17, "special": [17, 18, 28], "fastest": 17, "cpuinfo": 17, "mykernel": 17, "fn_type": 17, "void": 17, "ipex_declare_dispatch": 17, "ipex_define_dispatch": 17, "ipex_register_dispatch": 17, "kcpu": 17, "declar": 17, "ideep": [17, 18], "common": [17, 21, 28, 31, 33], "intrins": 17, "cvtfp32tobf16": 17, "pragma": 17, "torch_ipex": [17, 34], "cvt_fp32_to_bf16": 17, "dst": 17, "cvt_fp32_to_bf16_kernel_impl": 17, "cvt_fp32_to_bf16_kernel_fn": 17, "cvt_fp32_to_bf16_kernel_stub": 17, "macro": 17, "cpu_capability_avx512": 17, "cpu_capability_avx512_bf16": 17, "hav": 17, "cvtfp32tobf16krnl": 17, "vec512": 17, "vec256": 17, "endif": 17, "immintrin": 17, "__m256i": 17, "_cvt_fp32_to_bf16": 17, "__m512": 17, "reinterpret_cast": 17, "_mm512_cvtneps_pbh": 17, "__m512i": 17, "_mm512_castps_si512": 17, "nan": [17, 34], "_mm512_set1_epi32": 17, "0xffff": 17, "mask_valu": 17, "_mm512_cmp_ps_mask": 17, "_cmp_ord_q": 17, "0x1": 17, "vec_bia": 17, "0x7fff": 17, "uint32_t": 17, "lsb": 17, "t_valu": 17, "_mm512_and_si512": 17, "_mm512_srli_epi32": 17, "rounding_bia": 17, "_mm512_add_epi32": 17, "_mm512_mask_blend_epi32": 17, "_mm512_cvtusepi32_epi16": 17, "f32": [17, 18], "_mm512_loadu_p": 17, "_mm256_storeu_si256": 17, "_mm512_maskz_loadu_p": 17, "_mm256_mask_storeu_epi16": 17, "getveclength": 17, "get_cpp_typesize_and_vecs": 17, "scalartyp": 17, "get_cpp_typesize_and_vecsize_kernel_impl": 17, 
"get_cpp_typesize_and_vecsize_kernel_fn": 17, "get_cpp_typesize_and_vecsize_kernel_stub": 17, "types": 17, "vectors": 17, "getveclengthkrnl": 17, "doubl": 17, "make_tupl": 17, "sizeof": 17, "complexdoubl": 17, "complex": 17, "complexfloat": 17, "decltyp": 17, "impl": 17, "scalartypetocpptyp": 17, "torch_check": 17, "09": [17, 31], "58": [17, 31], "anaconda": 17, "copyright": [17, 27], "credit": 17, "licens": 17, "_c": [17, 26], "_get_current_isa_level": 17, "_get_highest_cpu_support_isa_level": 17, "_get_highest_binary_support_isa_level": 17, "quit": [17, 34], "By": [17, 31, 33], "aten_cpu_cap": 17, "effect": [17, 21, 26, 32, 33], "intern": [17, 18, 20, 32], "purpos": [17, 31, 32, 33], "addtion": 17, "tool": [17, 33, 34], "subfold": 17, "rh": 17, "toolset": 17, "33": [17, 31, 32], "cmakefil": 17, "cpu_featur": 17, "dir": [17, 31], "66": [17, 31, 34], "cpu_feature_main": 17, "xcr0": 17, "00000000000602e7": 17, "mmx": 17, "sse": 17, "sse2": 17, "sse3": 17, "ssse3": 17, "sse4_1": 17, "sse4_2": 17, "aes_ni": 17, "sha": 17, "xsave": 17, "fma": 17, "f16c": 17, "avx_vnni": 17, "avx512_f": 17, "avx512_cd": 17, "avx512_pf": 17, "avx512_er": 17, "avx512_vl": 17, "avx512_bw": 17, "avx512_dq": 17, "avx512_ifma": 17, "avx512_vbmi": 17, "avx512_vpopcntdq": 17, "avx512_4fmap": 17, "avx512_4vnniw": 17, "avx512_vbmi2": 17, "avx512_vpclmul": 17, "avx512_bitalg": 17, "avx512_vp2intersect": 17, "amx_bf16": 17, "amx_til": 17, "amx_int8": 17, "prefetchw": 17, "prefetchwt1": 17, "represent": 18, "multidimension": 18, "arrai": 18, "nd": 18, "1d": 18, "semant": 18, "attribut": 18, "coo": 18, "canon": 18, "assign": [18, 32, 33], "2d": 18, "height": 18, "illustr": [18, 19, 21, 31, 33], "actual": [18, 21], "bmp": 18, "contiguous_format": [18, 33], "tensorflow": 18, "close": [18, 31, 33], "to_mkldnn": 18, "difficult": 18, "manipul": 18, "to_dens": 18, "natur": [18, 21, 28], "hold": [18, 33], "secret": 18, "ingredi": 18, "almost": 18, "foundat": [18, 33], "upper": [18, 33], "fact": [18, 33], 
"expens": 18, "benefici": 18, "nb": 18, "me": 18, "roughli": 18, "50": [18, 31, 32], "mkldnn": 18, "mkldnn_util": 18, "subsequ": [18, 33], "concept": [18, 33], "diagram": [18, 33], "hard": [18, 26], "conclus": 18, "necessari": 18, "neglig": 18, "move": [18, 33], "organ": 18, "question": [18, 30], "reinterpret": 18, "answer": [18, 30], "chw": 18, "hw": 18, "stride_n": 18, "stride_c": 18, "stride_h": 18, "stride_w": 18, "merit": 18, "express": [18, 34], "noncontigu": 18, "n1": 18, "n2": 18, "mind": [18, 32], "someth": 18, "reli": [18, 20], "rfc": 18, "hwc": 18, "wc": 18, "chwn": 18, "hwn": 18, "wn": 18, "empti": [18, 31], "outplac": [18, 34], "is_contigu": 18, "_appli": 18, "brief": [18, 28, 34], "imagenet": [18, 30], "spontan": 18, "tell": [18, 20, 33], "NOT": [18, 31], "compris": 18, "explicit": [18, 20, 33], "implicit": 18, "tensoriter": 18, "guidelin": 18, "awar": [18, 20, 31, 32], "my": 18, "upsampl": [18, 34], "cudnn": 18, "accommod": 18, "md": 18, "format_tag": 18, "src_md": 18, "desc": 18, "data_typ": 18, "src_mem": 18, "src_data_ptr": 18, "card": 18, "hwio": 18, "resnext101": [18, 34], "detectron2": 18, "8x": 18, "lamb": [19, 21], "adagrad": [19, 21], "clr": 19, "lr_decai": 19, "state_sum": 19, "addcmul_": 19, "add_": 19, "addcdiv_": 19, "whole": [19, 20, 33], "storag": 19, "onboard": [19, 33], "third": [19, 34], "high": [19, 21, 33], "bound": [19, 20, 28, 33], "bottl": 19, "neck": 19, "prevent": 19, "pseudo": [19, 21, 34], "adagrad_fused_step": 19, "group": [19, 20, 33], "grad0": 19, "grad1": 19, "grad_n": 19, "param_n": 19, "state_sum_n": 19, "adagrad_step": 19, "grad_i": 19, "param_i": 19, "state_sum_i": 19, "other_arg": 19, "coupl": [20, 33, 34], "omp": [20, 26, 31, 32, 33, 34], "ld_preload": [20, 31, 32, 33], "libiomp5": [20, 31, 32, 33], "model_script": 20, "examplenet": 20, "examplenet1": 20, "x1": 20, "start_dim": 20, "examplenet2": 20, "conv2": 20, "x2": 20, "y1": 20, "y2": 20, "model1": 20, "traced_model1": 20, "model2": 20, "traced_model2": 20, 
"multi_stream_model": [20, 34], "datatyp": [20, 34], "receipt": 20, "steam": [20, 34], "input_hint": 20, "output_hint": 20, "pthread": 20, "async": [20, 34], "wake": 20, "synchron": [20, 26, 34], "imper": [20, 34], "suffer": 20, "gil": 20, "hurt": 20, "mitig": [20, 30], "omp_num_thread": [20, 26, 31, 32, 34], "phase": 20, "s1": 20, "c1": 20, "numactl": [20, 31, 32], "outsid": 20, "superset": 20, "undefin": [20, 33], "gb": 20, "simultan": 20, "correspond": [20, 31, 34], "cpu_pool1": 20, "cpu_pool2": 20, "task1": 20, "task2": 20, "y1_futur": 20, "y2_futur": 20, "y_runtim": 20, "kmp_": 20, "fulfil": 20, "worker": [20, 31], "serv": [20, 34], "sub": [20, 28, 33], "wait": [20, 33], "futuretensor": 20, "didn": 20, "dlopen": 20, "symbol": 20, "bottom": 21, "bit": [21, 28], "sign": 21, "expon": 21, "mantissa": 21, "23": [21, 31, 32], "capac": [21, 30], "digit": 21, "shorter": [21, 28], "fewer": 21, "neg": 21, "disadvantag": 21, "shift": 21, "left": [21, 28, 32], "lose": 21, "decim": 21, "valid": [21, 34], "1234500000": 21, "0000012345": 21, "1234512345": 21, "sens": 21, "fraction": 21, "12345": 21, "00000": 21, "signific": 21, "bui": 21, "involv": 21, "ground": 21, "truth": 21, "chain": 21, "rule": [21, 34], "meet": [21, 33, 34], "wide": [21, 34], "understand": [21, 28, 33], "formula": 21, "\u03b1": 21, "gw": 21, "denot": 21, "receiv": 21, "rate": 21, "earlier": 21, "inaccur": 21, "exactli": 21, "kept": 21, "halv": 21, "recov": 21, "fp32_w": 21, "concat_fp32_from_bf16": 21, "bf16_w": 21, "fp32_gw": 21, "bf16_gw": 21, "weight_dacai": 21, "split_bf16_from_fp32": 21, "ratio": [22, 30, 34], "beta": [23, 26], "demostr": 23, "cheat": 23, "sheet": 23, "pypi": [26, 34], "occupi": 26, "remark": [26, 30, 33], "__name__": [26, 34], "__main__": [26, 31, 32, 34], "112": [26, 30, 33, 34], "nnc": 26, "poor": [26, 34], "xlm": 26, "roberta": [26, 34], "casual": 26, "gpt2": 26, "summar": 26, "classif": [26, 30], "allenai": 26, "longform": 26, "409": 26, "workaround": [26, 34], 
"_jit_set_texpr_fuser_en": 26, "csrc": 26, "tensorexpr_fus": 26, "settensorexprfuseren": 26, "longer": [26, 30], "complic": [26, 31, 33], "undergo": [26, 29], "runtimeerror": [26, 34], "overflow": [26, 34], "unpack": [26, 34], "exce": [26, 30, 33, 34], "quantize_per_tensor": 26, "pseudocod": [26, 34], "omp_num_threa": 26, "set_num_thread": [26, 34], "freezed_model": [26, 34], "run_benchmark": [26, 34], "flow": 26, "bag": [26, 34], "progress": [26, 28, 34], "abnorm": [26, 34], "tbd": 26, "transformerencoderlay": 26, "encount": [26, 34], "rnnt": [26, 34], "joint_net": [26, 34], "caller": [26, 34], "apach": [27, 32], "notic": [27, 31, 32], "term": 27, "condit": 27, "multiheadattent": 28, "feedforward": 28, "lot": [28, 34], "besid": [28, 33, 34], "adopt": [28, 34], "modelfamili": 28, "hub": 28, "staticquantizationint8": 28, "onlyquantizationint8": 28, "onlyquantizationint4": 28, "13b": [28, 30, 34], "70b": [28, 34], "8b": 28, "20b": 28, "dolli": [28, 34], "databrick": 28, "v2": [28, 30, 34], "12b": 28, "tiiuae": 28, "40b": 28, "30b": 28, "3b": 28, "bigscienc": 28, "1b7": 28, "salesforc": 28, "2b": 28, "baichuan2": [28, 34], "chat": 28, "thudm": 28, "chatglm3": [28, 34], "chatglm2": [28, 34], "bigcod": 28, "starcod": [28, 34], "flan": 28, "xl": 28, "mosaicml": 28, "mistralai": 28, "v0": 28, "8x7b": 28, "stabilityai": 28, "1_6b": 28, "liuhaotian": 28, "v1": [28, 34], "microsoft": 28, "ieityuan": 28, "yuan2": 28, "102b": 28, "signifi": 28, "perfect": 28, "codellama": 28, "rope": 28, "past": 28, "year": 28, "flourish": 28, "contribut": [28, 31, 34], "research": 28, "web": 28, "legend": 28, "autotp": 28, "obviou": 28, "hotspot": 28, "lead": 28, "significantli": [28, 34], "heavier": 28, "io": 28, "occurr": 28, "ship": 28, "2nd": 28, "4th": [28, 30], "except": [28, 31], "beeter": 28, "Its": 28, "seen": 28, "woq": 28, "integ": [28, 33], "bandwidth": 28, "reorder_cach": 28, "beam_width": 28, "secondli": 28, "elimin": 28, "shard": 28, "content": [29, 34], 
"your_calibration_dataset": 29, "calib_sampl": 29, "calibration_model": 29, "qconfig_summary_file_path": 29, "nf4": 29, "init_distribut": 29, "get_acceler": 29, "communication_backend_nam": 29, "var": 29, "ondevic": 29, "init_infer": 29, "mp_size": 29, "base_dir": 29, "repo_root": 29, "checkpoints_json": 29, "zone": [30, 34], "articl": [30, 33], "llama2": [30, 34], "1024": [30, 33], "were": [30, 31, 32, 33], "carri": 30, "m7i": 30, "m6i": [30, 32], "47x": 30, "62x": 30, "57x": 30, "58x": 30, "85x": 30, "27x": 30, "38x": 30, "29x": 30, "36x": 30, "conclud": [30, 34], "respons": 30, "session": 30, "exhibit": 30, "wherea": 30, "p90": 30, "26x": 30, "sec": 30, "39": [30, 31, 32, 34], "26": [30, 31, 32], "49": [30, 31, 32], "170": 30, "21": [30, 31, 32], "measur": [30, 34], "17th": 30, "16xlarg": 30, "u": [30, 32], "west": 30, "ubuntu": 30, "04": [30, 31], "1009": 30, "sw": 30, "workload1": 30, "inference2": 30, "realtim": 30, "inference3": 30, "tunabl": [30, 32], "8380": 30, "30ghz": 30, "83x": 30, "44x": 30, "ssd": [30, 34], "resnet34": [30, 34], "16x": 30, "coco": 30, "1200": 30, "resnext": 30, "32x16d": 30, "81x": 30, "21x": 30, "vgg": 30, "75x": 30, "19x": 30, "shufflenetv2_x1": 30, "07x": 30, "78x": 30, "04x": 30, "max_seq_len": 30, "384task": 30, "jemalloc": [30, 32, 34], "05x": 30, "96x": 30, "mrpc": 30, "128task": 30, "distilbert": 30, "12x": 30, "dnnl": 30, "base_text_classif": 30, "f1": 30, "81": [30, 31], "79": [30, 31], "93": 30, "02": [30, 32], "85": [30, 31], "86": [30, 31], "top1": 30, "76": [30, 31], "75": [30, 31], "98": 30, "78": [30, 31], "199": 30, "48": [30, 31, 32], "vgg11": 30, "69": [30, 31], "67": [30, 31, 34], "96": 30, "44": [30, 31, 32], "36": [30, 31, 32], "92": 30, "97": 30, "shufflenet": 30, "histogram": [30, 34], "40": [30, 31, 32, 34], "ucod": 30, "0xd0002a0": 30, "ON": 30, "turboboost": 30, "bio": 30, "ddr": 30, "16gb": 30, "3200": 30, "dcpmm": 30, "256gb": 30, "host": [30, 34], "cento": 30, "2105": 30, "18": [30, 31, 32], "305": 30, 
"el8_4": 30, "x86_64": 30, "docker": [30, 34], "spectr": 30, "meltdown": 30, "24x": 30, "31x": 30, "15x": 30, "30x": 30, "mobilenet": 30, "08x": 30, "03x": 30, "09x": 30, "39x": 30, "35x": 30, "160": 30, "55x": 30, "06x": 30, "fpn": 30, "71x": 30, "20x": 30, "13x": 30, "32x": 30, "48x": 30, "11x": 30, "terabyt": 30, "14x": 30, "02x": 30, "10x": 30, "33x": 30, "8380h": 30, "90ghz": 30, "56": [30, 31, 32, 33], "67x": 30, "45x": 30, "77x": 30, "18x": 30, "formerli": [30, 33, 34], "0x700001c": 30, "wlydcrb1": 30, "sy": 30, "0016": 30, "p29": 30, "2006080250": 30, "64gb": 30, "768gb": 30, "influenc": [31, 33], "properli": 31, "themselv": [31, 34], "free": [31, 34], "mainli": [31, 34], "around": 31, "interpret": 31, "prefix": 31, "cross": [31, 32, 33, 34], "taskset": 31, "malloc_conf": [31, 33], "crash": [31, 33, 34], "nnode": 31, "nproc": 31, "count": 31, "addr": 31, "ip": 31, "hostnam": 31, "proc": 31, "port": 31, "hostfil": 31, "mpi": 31, "mpiexec": 31, "hydra": 31, "ppn": 31, "genv": 31, "i_mpi_pin_domain": 31, "codeless": 31, "ut": 31, "exclus": 31, "mutual": 31, "ld": 31, "favorit": 31, "kmp": [31, 33], "granular": [31, 32, 33], "compact": [31, 32, 33], "stdout": 31, "afterward": [31, 33], "undesir": 31, "_timestamp_inst": 31, "_timestamp_instance_": 31, "_core": 31, "run_20210712212258_inst": 31, "run_20210712212258_instance_0_cores_0": 31, "gif": 31, "07": 31, "764": 31, "conda_prefix": [31, 32], "virtual_env": [31, 32], "lib64": [31, 32], "home": [31, 32], "drop": [31, 32], "kmp_affin": [31, 32, 33], "kmp_blocktim": [31, 32, 33], "14": [31, 32, 34], "24": [31, 32], "25": [31, 32], "27": [31, 32, 33], "30": [31, 32], "31": [31, 32], "34": [31, 32], "35": [31, 32], "37": [31, 32, 34], "41": [31, 32], "42": [31, 32], "tee": 31, "run_20210712223308_inst": 31, "run_20210712223308_instance_0_cores_0": 31, "87": 31, "08": 31, "117": 31, "88": 31, "118": 31, "45": [31, 32], "46": [31, 32], "47": [31, 32], "51": [31, 32], "52": [31, 32], "53": [31, 32], "54": [31, 32], 
"55": [31, 32, 33], "57": 31, "59": 31, "60": 31, "61": 31, "62": 31, "63": [31, 34], "65": 31, "68": [31, 34], "70": 31, "71": 31, "72": 31, "73": 31, "74": 31, "77": 31, "82": 31, "83": [31, 33], "run_20210712214504_inst": 31, "run_20210712214504_instance_0_cores_22": 31, "513": 31, "run_20210712220928_inst": 31, "run_20210712220928_instance_0_cores_0": 31, "355": 31, "356": 31, "deduct": 31, "run_20210712221615_inst": 31, "run_20210712221615_instance_0_cores_11": 31, "591": 31, "run_20210712221150_inst": 31, "run_20210712221150_instance_0_cores_0": 31, "run_20210712221150_instance_1_cores_22": 31, "233": 31, "236": 31, "run_20210712221415_inst": 31, "run_20210712221415_instance_0_cores_0": 31, "run_20210712221415_instance_1_cores_4": 31, "run_20210712221415_instance_2_cores_8": 31, "run_20210712221415_instance_3_cores_12": 31, "run_20210712221415_instance_4_cores_16": 31, "run_20210712221415_instance_5_cores_20": 31, "run_20210712221415_instance_6_cores_24": 31, "run_20210712221415_instance_7_cores_28": 31, "run_20210712221415_instance_8_cores_32": 31, "run_20210712221415_instance_9_cores_36": 31, "run_20210712221415_instance_10_cores_40": 31, "140": 31, "143": 31, "146": 31, "149": 31, "151": 31, "154": 31, "157": 31, "159": 31, "162": 31, "164": 31, "167": 31, "run_20210712221305_inst": 31, "run_20210712221305_instance_0_cores_0": 31, "run_20210712221305_instance_1_cores_11": 31, "run_20210712221305_instance_2_cores_22": 31, "run_20210712221305_instance_3_cores_33": 31, "470": 31, "471": 31, "473": 31, "476": 31, "479": 31, "instance_idx": 31, "independ": 31, "confirm": 31, "175": 31, "176": 31, "177": 31, "run_20220106130151_instance_0_cores_0": 31, "sometim": [31, 33], "235": 31, "jemallocl": 31, "oversize_threshold": [31, 33], "background_thread": [31, 33], "metadata_thp": [31, 33], "dirty_decay_m": [31, 33], "9000000000": [31, 33], "muzzy_decay_m": [31, 33], "libjemalloc": 31, "run_20210713153048_instance_0_cores_0": 31, "654": 31, "libtcmalloc": [31, 32], 
"655": 31, "run_20210713153333_instance_0_cores_0": 31, "784": 31, "run_20210713153659_instance_0_cores_0": 31, "blocktim": 31, "00": [31, 34], "760": [31, 32], "761": [31, 32], "omp_schedul": [31, 33], "omp_proc_bind": [31, 33], "run_20210713152500_instance_0_cores_0": 31, "give": [32, 34], "ipex_en": 32, "procedur": 32, "tunin": 32, "dramat": [32, 33], "cpu_launcher_en": 32, "cpu_launcher_arg": 32, "hyperthread": 32, "present": 32, "ital": 32, "ptmalloc": 32, "use_default_alloc": [32, 34], "tcmalloc": 32, "enable_tcmalloc": 32, "enable_jemalloc": 32, "nth": [32, 33], "uniform": 32, "overlap": 32, "signficantli": 32, "8180": 32, "affinit": 32, "addition": 32, "kill": 32, "unutil": 32, "restart": 32, "remain": 32, "aliv": 32, "taken": 32, "care": 32, "worri": 32, "continu": [32, 34], "Then": 32, "interrupt": 32, "dummi": 32, "dummy_tensor": 32, "scheme": 32, "bert_int8_jit": 32, "n_iter": 32, "rn50_int8_jit": 32, "usus": 32, "rn50_ipex_int8": 32, "handler": 32, "image_classifi": 32, "similarli": 32, "bert_ipex_int8": 32, "transformer_handler_gener": 32, "setup_config": 32, "seq_classification_artifact": 32, "index_to_nam": 32, "nc": 32, "model_stor": 32, "server": [32, 33], "rest": 32, "model_log": 32, "096": 32, "8375c": 32, "03": 32, "981": 32, "982": 32, "previous": 32, "cases": 32, "223": 32, "site": 32, "model_service_work": 32, "sock": 32, "unix": 32, "9000": 32, "762": 32, "763": 32, "9001": 32, "274": 32, "9002": 32, "975": 32, "9003": 32, "bench": 32, "amazon": 32, "ec2": 32, "24xlarg": 32, "reproduc": 32, "url": [32, 34], "modelurl": 32, "inputpath": 32, "concurr": [32, 33], "huggingface_transform": 32, "sample_text_captum_input": 32, "graphic": 33, "xe": 33, "briefli": 33, "background": 33, "knowledg": 33, "c620": 33, "seri": 33, "chipset": 33, "purlei": 33, "chip": 33, "inclus": 33, "1mb": 33, "l2": 33, "2666": 33, "mhz": 33, "ddr4": 33, "six": 33, "ultra": 33, "interconnect": 33, "upi": 33, "microarchitectur": 33, "connect": 33, "transfer": 33, 
"equip": 33, "motherboard": 33, "attach": 33, "remot": 33, "asu": 33, "z11pa": 33, "d8": 33, "competit": 33, "stall": 33, "busi": 33, "uma": 33, "lscpu": 33, "retriev": 33, "111": 33, "50ghz": 33, "node0": 33, "node1": 33, "sophist": 33, "brought": [33, 34], "polici": 33, "put": 33, "sysctl": 33, "great": 33, "placement": 33, "cpunodebind": 33, "membind": 33, "multithread": 33, "primari": 33, "consecut": 33, "join": 33, "libgomp": 33, "libiomp": 33, "hang": [33, 34], "gomp_cpu_affin": 33, "comma": 33, "invalid": 33, "thrash": 33, "did": [33, 34], "compet": 33, "unus": 33, "proclist": 33, "millisecond": 33, "sleep": 33, "200m": 33, "period": 33, "elaps": 33, "overal": 33, "appropri": 33, "reserv": 33, "sole": 33, "penal": 33, "role": 33, "unnecessari": 33, "destruct": 33, "emphas": 33, "fragment": 33, "mmuzzy_decay_m": 33, "forg": 33, "dealloc": 33, "costli": 33, "gpertool": 33, "plu": 33, "pretti": 33, "nifti": 33, "analysi": 33, "gperftool": 33, "set_flush_denorm": 33, "warm": 33, "therefor": 33, "threshold": 33, "usuali": 33, "come": 33, "maskrcnn": [33, 34], "wav2vec2": 33, "recognit": 33, "onednn_primitive_cache_capac": 33, "65536": 33, "voic": 33, "excit": 34, "announc": 34, "accompani": 34, "privat": 34, "broader": 34, "sincer": 34, "encourag": 34, "feedback": 34, "creator": 34, "reach": 34, "hf_beam_sampl": 34, "hf_beam_search": 34, "hf_greedy_search": 34, "hf_sampl": 34, "walk": 34, "2561": 34, "2584": 34, "2617": 34, "2663": 34, "2733": 34, "act": 34, "2550": 34, "2568": 34, "2641": 34, "2675": 34, "2613": 34, "upgrad": 34, "v3": 34, "2747": 34, "misc": 34, "2468": 34, "2627": 34, "2631": 34, "2704": 34, "changelog": 34, "optimize_transform": 34, "your_generation_param": 34, "newli": 34, "varianc": 34, "encod": 34, "2349": 34, "2412": 34, "2469": 34, "2476": 34, "flash": 34, "2317": 34, "2334": 34, "2392": 34, "2480": 34, "elser": 34, "2491": 34, "public": 34, "2473": 34, "2511": 34, "2433": 34, "2253": 34, "2251": 34, "2236": 34, "2278": 34, "2257": 34, 
"dockerfil": 34, "ux": 34, "2229": 34, "2195": 34, "2299": 34, "2315": 34, "2283": 34, "2280": 34, "2292": 34, "2275": 34, "2319": 34, "2198": 34, "2264": 34, "2290": 34, "experiment": 34, "workflow": 34, "1563": 34, "excess": 34, "1677": 34, "1688": 34, "1664": 34, "lar": 34, "1695": 34, "dictionari": 34, "1682": 34, "2137": 34, "1568": 34, "1585": 34, "1590": 34, "1587": 34, "1594": 34, "old": 34, "hypervisor": 34, "vm": 34, "1513": 34, "1593": 34, "padding_mod": 34, "1580": 34, "1566": 34, "transnetv2": 34, "1564": 34, "rnn": 34, "avx512_core_vnni": 34, "1592": 34, "1589": 34, "1517": 34, "hero": 34, "inspir": 34, "stanford": 34, "consumpt": 34, "ve": 34, "1341": 34, "instancenorm": 34, "1330": 34, "1414": 34, "1473": 34, "1419": 34, "1488": 34, "webpag": 34, "1318": 34, "1353": 34, "1328": 34, "1355": 34, "1367": 34, "1384": 34, "1295": 34, "1392": 34, "1376": 34, "1373": 34, "1338": 34, "1391": 34, "1322": 34, "usabl": 34, "effort": 34, "cv": 34, "refin": 34, "identifi": 34, "torchrun": 34, "shortcut": 34, "mkl": 34, "sgemm": 34, "geomean": 34, "auto_ipex": 34, "hood": 34, "calibrated_model": 34, "model_to_be_calibr": 34, "992": 34, "64byte": 34, "addlayernorm": 34, "retinanet": 34, "1032": 34, "1053": 34, "1074": 34, "tightli": 34, "matur": 34, "offlin": 34, "becam": 34, "bake": 34, "wave2vec": 34, "albert": 34, "facilit": 34, "minmax": 34, "movingaverageminmax": 34, "polish": 34, "flexibl": 34, "quantconf": 34, "multi_stream_input_hint": 34, "multi_stream_output_hint": 34, "adam": 34, "822": 34, "3d": 34, "642": 34, "deconv3d": 34, "692": 34, "787": 34, "swish": 34, "fsi": 34, "risk": 34, "551": 34, "leakyrelu": 34, "589": 34, "407": 34, "647": 34, "convolution1d": 34, "657": 34, "einsum": 34, "alphafold2": 34, "674": 34, "711": 34, "threa": 34, "slow": 34, "equival": 34, "joint": 34, "net": 34, "pend": 34, "648": 34, "684": 34, "685": 34, "dockerhub": 34, "wheel": 34, "sdk": 34, "2x": 34, "5x": 34, "reduct": 34, "center": 34, "deploi": 34, "u8": 34, "s8": 
34, "satur": 34, "occur": 34, "u7": 34, "unsign": 34, "s7": 34, "worth": 34, "upload": 34, "pip3": 34, "whl": 34, "220mb": 34, "5mb": 34, "dep": 34, "220m": 34, "cxx11": 34, "224m": 34, "7m": 34, "5m": 34, "qkv": 34, "278": 34, "531": 34, "432": 34, "438": 34, "602": 34, "sliu": 34, "hardsigmoid": 34, "relu6": 34, "selu": 34, "524": 34, "452": 34, "425": 34, "100mb": 34, "40mb": 34, "meant": 34, "resolv": 34, "te": 34, "wrap": 34, "bactchnorm": 34, "205": 34, "straightforward": 34, "underhood": 34, "torchvison": 34, "hugginfac": 34, "legal": 34, "resnet18": 34, "resnet18_xpu": 34, "enable_auto_mixed_precis": 34, "mixed_dtyp": 34, "mymodel": 34, "xx_c": 34, "xx_v": 34, "clibrat": 34, "ampconf": 34, "automixprecis": 34, "running_mod": 34, "cali_dataset": 34, "trace_model": 34, "omp_set_num_thread": 34, "model_execut": 34, "same_model_execution_again": 34, "descriptor": 34, "rc3": 34, "parti": 34, "49786": 34, "rc": 34, "readm": 34, "stakehold": 34, "5rc3": 34, "dpcpp": 34, "heterogen": 34, "bfp16": 34, "proper": 34, "tacotron2": 34, "frozenbatchnorm": 34, "embeddingbad": 34, "daili": 34, "resnext3d": 34, "maskrnn": 34, "codenam": 34, "mlp": 34, "eltwis": 34, "7x": 34, "enable_auto_optim": 34, "streamlin": 34, "enable_auto_mix_precis": 34, "inject": 34, "resnet3d": 34, "fb": 34, "yolov3": 34, "maxpool": 34}, "objects": {"": [[2, 0, 0, "-", "intel_extension_for_pytorch"]], "intel_extension_for_pytorch.cpu": [[2, 0, 0, "-", "runtime"]], "intel_extension_for_pytorch.cpu.runtime": [[2, 1, 1, "", "CPUPool"], [2, 1, 1, "", "MultiStreamModule"], [2, 1, 1, "", "MultiStreamModuleHint"], [2, 1, 1, "", "Task"], [2, 2, 1, "", "get_core_list_of_node_id"], [2, 2, 1, "", "is_runtime_ext_enabled"], [2, 1, 1, "", "pin"]], "intel_extension_for_pytorch": [[2, 2, 1, "", "enable_onednn_fusion"], [2, 2, 1, "", "fast_bert"], [2, 0, 0, "-", "llm"], [2, 2, 1, "", "optimize"], [2, 0, 0, "-", "quantization"], [2, 1, 1, "", "verbose"]], "intel_extension_for_pytorch.llm": [[2, 0, 0, "-", 
"functional"], [2, 0, 0, "-", "modules"], [2, 2, 1, "", "optimize"]], "intel_extension_for_pytorch.llm.functional": [[2, 2, 1, "", "fast_layer_norm"], [2, 2, 1, "", "indirect_access_kv_cache_attention"], [2, 2, 1, "", "rms_norm"], [2, 2, 1, "", "rotary_embedding"], [2, 2, 1, "", "varlen_attention"]], "intel_extension_for_pytorch.llm.modules": [[2, 1, 1, "", "FastLayerNorm"], [2, 1, 1, "", "IndirectAccessKVCacheAttention"], [2, 1, 1, "", "Linear2SiluMul"], [2, 1, 1, "", "LinearAdd"], [2, 1, 1, "", "LinearAddAdd"], [2, 1, 1, "", "LinearGelu"], [2, 1, 1, "", "LinearMul"], [2, 1, 1, "", "LinearNewGelu"], [2, 1, 1, "", "LinearRelu"], [2, 1, 1, "", "LinearSilu"], [2, 1, 1, "", "LinearSiluMul"], [2, 1, 1, "", "PagedAttention"], [2, 1, 1, "", "RMSNorm"], [2, 1, 1, "", "RotaryEmbedding"], [2, 1, 1, "", "VarlenAttention"]], "intel_extension_for_pytorch.nn": [[7, 1, 1, "", "FrozenBatchNorm2d"]], "intel_extension_for_pytorch.nn.functional": [[7, 2, 1, "", "interaction"]], "intel_extension_for_pytorch.nn.modules": [[7, 1, 1, "", "MergedEmbeddingBag"], [7, 1, 1, "", "MergedEmbeddingBagWithSGD"]], "intel_extension_for_pytorch.quantization": [[2, 2, 1, "", "autotune"], [2, 2, 1, "", "convert"], [2, 2, 1, "", "get_smooth_quant_qconfig_mapping"], [2, 2, 1, "", "prepare"]]}, "objtypes": {"0": "py:module", "1": "py:class", "2": "py:function"}, "objnames": {"0": ["py", "module", "Python module"], "1": ["py", "class", "Python class"], "2": ["py", "function", "Python function"]}, "titleterms": {"intel": [0, 1, 5, 6, 15, 30, 31, 32, 33], "extens": [0, 1, 5, 7, 15, 20, 26, 32], "pytorch": [0, 1, 5, 15, 18, 32], "cpu": [0, 2, 17, 18, 33], "isa": [0, 7, 17], "dynam": [0, 6, 7, 15, 17, 26], "dispatch": [0, 7, 17], "design": [0, 17, 20, 31], "doc": 0, "architectur": 1, "support": [1, 8, 10], "api": [2, 7, 9, 13, 16, 17, 18, 22, 25, 28, 29], "document": [2, 5, 25, 32, 33], "gener": [2, 26], "llm": [2, 6, 7, 23, 28, 30], "modul": [2, 10, 20, 28], "level": [2, 17, 28], "optim": [2, 7, 10, 13, 15, 
19, 28, 29], "prototyp": [2, 6, 7, 10, 11, 12, 14, 16, 22, 28], "fast": [2, 6, 7, 11], "bert": [2, 6, 7, 11, 32], "graph": [2, 7, 12, 13, 28], "quantiz": [2, 6, 7, 15, 16, 29], "runtim": [2, 7, 20, 26], "blog": 3, "public": 3, "cheat": 4, "sheet": 4, "contribut": 5, "develop": 5, "tip": 5, "debug": [5, 17], "unit": 5, "test": 5, "python": [5, 6, 7], "better": 5, "local": 5, "pytest": 5, "lint": 5, "c": [5, 6, 18], "write": [5, 18], "build": [5, 17], "exampl": [6, 10, 11, 12, 14, 16, 17, 20, 31], "train": [6, 8], "singl": [6, 28, 31], "instanc": [6, 28, 30, 31], "float32": [6, 8], "bfloat16": [6, 8, 21, 26, 30], "distribut": [6, 28, 29], "infer": [6, 8, 28, 29, 31, 32], "eager": [6, 8], "mode": [6, 28, 31], "resnet50": [6, 32], "torchscript": [6, 8], "torchdynamo": [6, 26], "beta": [6, 7], "new": [6, 7, 34], "featur": [6, 7, 11, 12, 17], "from": [6, 7], "2": [6, 7, 14, 32, 34], "0": [6, 7, 34], "int8": [6, 7, 13, 16, 26, 30, 32], "static": [6, 15], "calibr": [6, 15], "deploy": 6, "larg": [6, 7, 28], "languag": [6, 7, 28], "model": [6, 7, 13, 15, 18, 20, 28, 32], "fp32": [6, 10, 13, 29, 30], "bf16": [6, 10, 13, 29], "smooth": [6, 16, 22], "weight": [6, 29], "onli": [6, 29], "int4": 6, "ai": [6, 30], "refer": [6, 8], "easi": 7, "us": [7, 8, 9, 10, 13, 16, 20, 31], "1": [7, 14, 32, 34], "torch": 7, "compil": [7, 17], "auto": [7, 8, 9, 16, 20], "channel": [7, 9, 18, 33], "last": [7, 9, 18, 33], "mix": [7, 8], "precis": [7, 8, 28], "amp": [7, 8], "oper": [7, 18, 19, 28], "codeless": [7, 10], "13": [7, 34], "captur": [7, 12], "hypertun": [7, 14], "introduct": [8, 19, 25], "case": [8, 10, 20], "default": [8, 9, 14, 18, 31], "path": 8, "autocast": 8, "op": 8, "elig": 8, "specif": [8, 17], "behavior": 8, "can": 8, "promot": 8, "widest": 8, "input": [8, 20], "type": [8, 28], "eas": [9, 13], "enabl": 9, "disabl": 9, "known": [9, 20, 34], "issu": [9, 20, 34], "motiv": 10, "usag": [10, 11, 12, 14, 16, 20, 26, 29, 31], "huggingfac": 10, "The": 10, "origin": 10, "command": 10, 
"ipex": [10, 28], "launch": [10, 31], "appli": 10, "forward": 10, "method": 10, "explicitli": 10, "instead": 10, "__call__": 10, "attr": 10, "alreadi": 10, "jit": 10, "trace": 10, "descript": [11, 12], "prerequisit": 11, "methodologi": [13, 28], "fusion": [13, 19], "pattern": 13, "fold": 13, "your_conf_fil": 14, "hyperparamet": 14, "launcher": [14, 32], "defin": [14, 15], "search": 14, "space": 14, "tune": [14, 16, 22, 33], "user": 14, "your_python_script": 14, "qconfig": 15, "prepar": 15, "do": 15, "convert": 15, "deploi": [15, 32], "recip": [16, 20, 22], "autotun": 16, "algorithm": 16, "alpha": [16, 34], "fix": 16, "determin": 16, "through": 16, "overview": [17, 28, 30, 31, 33], "requir": [17, 20], "code": 17, "folder": 17, "struct": 17, "kernel": [17, 18], "implement": [17, 20], "csrc": 17, "aten": [17, 18], "xyzkrnl": 17, "cpp": 17, "stub": 17, "xyz": 17, "h": 17, "dyndisp": 17, "dispatchstub": 17, "codegen": 17, "process": 17, "add": 17, "custom": [17, 28], "intrin": 17, "vec": 17, "privat": 17, "select": 17, "manual": 17, "check": 17, "what": [18, 34], "i": [18, 20, 31], "memori": [18, 31, 33], "format": 18, "all": [18, 31], "That": 18, "matter": 18, "nchw": 18, "b": 18, "nhwc": 18, "wip": 18, "block": 18, "nchw16c": 18, "stride": 18, "layout": 18, "tensor": 18, "creation": 18, "convers": 18, "d": 18, "coverag": 18, "statu": 18, "regist": [18, 32], "nativ": 18, "manner": 18, "onednn": [18, 33], "creat": [18, 32], "convolut": 18, "primit": [18, 33], "target": 18, "multistream": 20, "examples1": 20, "basic": 20, "examples2": 20, "set": 20, "examples3": 20, "structur": [20, 33], "output": 20, "perform": [20, 26, 30, 32, 33, 34], "asynchron": 20, "task": 20, "configur": [20, 30, 33], "core": [20, 31, 32], "bind": 20, "detail": 20, "how": 20, "iomp": 20, "preload": 20, "load": 20, "dure": 20, "split": 21, "sgd": 21, "stochast": 21, "gradient": 21, "descent": 21, "quant": 22, "quick": 23, "start": [23, 25, 32], "instal": [24, 32], "get": 25, "troubleshoot": 26, 
"regress": 26, "shape": 26, "result": [26, 34], "correct": 26, "licens": 27, "list": 28, "verifi": 28, "via": 28, "deepspe": [28, 29], "demo": 28, "linear": 28, "low": 28, "data": [28, 30], "indirect": 28, "access": [28, 33], "kv": 28, "cach": [28, 33], "transform": 29, "frontend": 29, "pseudocod": 29, "common": 29, "scenario": 29, "smoothquant": 29, "woq": 29, "center": 30, "product": 30, "v1": 30, "11": [30, 34], "number": [30, 31, 33], "accuraci": 30, "softwar": [30, 33], "version": 30, "hardwar": [30, 33], "200": [30, 34], "an": 30, "aw": 30, "ec2": 30, "c6i": 30, "2xlarg": 30, "10": [30, 34], "script": 31, "guid": [31, 33], "physic": 31, "ii": 31, "includ": 31, "logic": 31, "iii": 31, "node": 31, "iv": 31, "your": 31, "multipl": 31, "v": 31, "throughput": 31, "vi": 31, "latenc": 31, "vii": 31, "viii": 31, "index": 31, "jemalloc": [31, 33], "tcmalloc": [31, 33], "alloc": [31, 33], "openmp": [31, 33], "librari": 31, "gnu": [31, 33], "torchserv": 32, "content": [32, 33], "thi": [32, 33], "serv": 32, "pin": 32, "boost": 32, "multi": 32, "worker": 32, "scale": 32, "export": 32, "serial": 32, "file": 32, "archiv": 32, "3": [32, 34], "4": 32, "benchmark": 32, "non": 33, "uniform": 33, "numa": 33, "numactl": 33, "omp_num_thread": 33, "omp_thread_limit": 33, "denorm": 33, "releas": 34, "highlight": 34, "100": 34, "12": 34, "300": 34, "": 34, "chang": 34, "9": 34, "8": 34, "improv": 34, "other": 34, "note": 34}, "envversion": {"sphinx.domains.c": 3, "sphinx.domains.changeset": 1, "sphinx.domains.citation": 1, "sphinx.domains.cpp": 9, "sphinx.domains.index": 1, "sphinx.domains.javascript": 3, "sphinx.domains.math": 2, "sphinx.domains.python": 4, "sphinx.domains.rst": 2, "sphinx.domains.std": 2, "sphinx": 58}, "alltitles": {"Intel\u00ae Extension for PyTorch* CPU ISA Dynamic Dispatch Design Doc": [[0, "intel-extension-for-pytorch-cpu-isa-dynamic-dispatch-design-doc"]], "Intel\u00ae Extension for PyTorch*": [[1, "intel-extension-for-pytorch"]], "Architecture": [[1, 
"architecture"]], "Support": [[1, "support"]], "API Documentation": [[2, "api-documentation"], [25, "api-documentation"]], "General": [[2, "general"]], "LLM Module Level Optimizations (Prototype)": [[2, "llm-module-level-optimizations-prototype"]], "Fast Bert (Prototype)": [[2, "fast-bert-prototype"], [6, "fast-bert-prototype"]], "Graph Optimization": [[2, "graph-optimization"], [7, "graph-optimization"], [13, "graph-optimization"], [28, "graph-optimization"]], "Quantization": [[2, "module-intel_extension_for_pytorch.quantization"]], "CPU Runtime": [[2, "module-intel_extension_for_pytorch.cpu.runtime"]], "Blogs & Publications": [[3, "blogs-publications"]], "Cheat Sheet": [[4, "cheat-sheet"]], "Contribution": [[5, "contribution"]], "Contributing to Intel\u00ae Extension for PyTorch*": [[5, "contributing-to-intel-extension-for-pytorch"]], "Developing Intel\u00ae Extension for PyTorch*": [[5, "developing-intel-extension-for-pytorch"]], "Tips and Debugging": [[5, "tips-and-debugging"]], "Unit testing": [[5, "unit-testing"]], "Python Unit Testing": [[5, "python-unit-testing"]], "Better local unit tests with pytest": [[5, "better-local-unit-tests-with-pytest"]], "Local linting": [[5, "local-linting"]], "C++ Unit Testing": [[5, "c-unit-testing"]], "Writing documentation": [[5, "writing-documentation"]], "Building documentation": [[5, "building-documentation"]], "Tips": [[5, "tips"]], "Examples": [[6, "examples"]], "Python": [[6, "python"]], "Training": [[6, "training"]], "Single-instance Training": [[6, "single-instance-training"]], "Float32": [[6, "float32"], [6, "id1"]], "BFloat16": [[6, "bfloat16"], [6, "id6"], [21, "bfloat16"], [26, "bfloat16"]], "Distributed Training": [[6, "distributed-training"]], "Inference": [[6, "inference"]], "Eager Mode": [[6, "eager-mode"], [6, "id7"]], "Resnet50": [[6, "resnet50"], [6, "id2"], [6, "id4"], [6, "id8"], [6, "id11"], [6, "id14"]], "BERT": [[6, "bert"], [6, "id3"], [6, "id5"], [6, "id9"], [6, "id12"], [6, "id15"], [32, "bert"]], 
"TorchScript Mode": [[6, "torchscript-mode"], [6, "id10"]], "TorchDynamo Mode (Beta, NEW feature from 2.0.0)": [[6, "torchdynamo-mode-beta-new-feature-from-2-0-0"], [6, "id13"]], "INT8": [[6, "int8"], [26, "int8"]], "Static Quantization": [[6, "static-quantization"], [15, "static-quantization"]], "Calibration": [[6, "calibration"]], "Deployment": [[6, "deployment"]], "Dynamic Quantization": [[6, "dynamic-quantization"], [15, "dynamic-quantization"]], "Large Language Model (LLM)": [[6, "large-language-model-llm"]], "FP32/BF16": [[6, "fp32-bf16"], [29, "fp32-bf16"]], "Smooth Quantization INT8": [[6, "smooth-quantization-int8"]], "Weight Only Quantization INT8/INT4": [[6, "weight-only-quantization-int8-int4"]], "C++": [[6, "c"]], "Intel\u00ae AI Reference Models": [[6, "intel-ai-reference-models"]], "Features": [[7, "features"]], "Easy-to-use Python API": [[7, "easy-to-use-python-api"]], "Large Language Models (LLM, NEW feature from 2.1.0)": [[7, "large-language-models-llm-new-feature-from-2-1-0"]], "torch.compile (Beta, NEW feature from 2.0.0)": [[7, "torch-compile-beta-new-feature-from-2-0-0"]], "ISA Dynamic Dispatching": [[7, "isa-dynamic-dispatching"], [17, "isa-dynamic-dispatching"]], "Auto Channels Last": [[7, "auto-channels-last"], [9, "auto-channels-last"]], "Auto Mixed Precision (AMP)": [[7, "auto-mixed-precision-amp"], [8, "auto-mixed-precision-amp"]], "Operator Optimization": [[7, "operator-optimization"]], "Optimizer Optimization": [[7, "optimizer-optimization"]], "Runtime Extension": [[7, "runtime-extension"], [20, "runtime-extension"], [26, "runtime-extension"]], "INT8 Quantization": [[7, "int8-quantization"]], "Codeless Optimization (Prototype, NEW feature from 1.13.0)": [[7, "codeless-optimization-prototype-new-feature-from-1-13-0"]], "Graph Capture (Prototype, NEW feature from 1.13.0)": [[7, "graph-capture-prototype-new-feature-from-1-13-0"]], "HyperTune (Prototype, NEW feature from 1.13.0)": [[7, "hypertune-prototype-new-feature-from-1-13-0"]], "Fast 
BERT Optimization (Prototype, NEW feature from 2.0.0)": [[7, "fast-bert-optimization-prototype-new-feature-from-2-0-0"]], "Introduction": [[8, "introduction"], [19, "introduction"], [25, "introduction"]], "Use Case": [[8, "use-case"]], "Default Precision": [[8, "default-precision"]], "Inference with Eager Path": [[8, "inference-with-eager-path"]], "Inference with TorchScript Path": [[8, "inference-with-torchscript-path"]], "Training Support": [[8, "training-support"]], "Autocast Op Reference": [[8, "autocast-op-reference"]], "Op Eligibility": [[8, "op-eligibility"]], "Op-Specific Behavior": [[8, "op-specific-behavior"]], "Ops that can autocast to bfloat16": [[8, "ops-that-can-autocast-to-bfloat16"]], "Ops that can autocast to float32": [[8, "ops-that-can-autocast-to-float32"]], "Ops that promote to the widest input type": [[8, "ops-that-promote-to-the-widest-input-type"]], "Ease-of-use auto channels last API": [[9, "ease-of-use-auto-channels-last-api"]], "default": [[9, "default"]], "enable": [[9, "enable"]], "disable": [[9, "disable"]], "Known issue": [[9, "known-issue"], [34, "known-issue"], [34, "id43"]], "Codeless Optimization (Prototype)": [[10, "codeless-optimization-prototype"]], "Motivation": [[10, "motivation"]], "Example Usage with HuggingFace": [[10, "example-usage-with-huggingface"]], "The origin command with ipex launch": [[10, "the-origin-command-with-ipex-launch"]], "Command to apply ipex optimization for FP32": [[10, "command-to-apply-ipex-optimization-for-fp32"]], "Command to apply ipex optimization for BF16": [[10, "command-to-apply-ipex-optimization-for-bf16"]], "Use Case not supported": [[10, "use-case-not-supported"]], "Module uses forward method explicitly instead of the __call__ attr": [[10, "module-uses-forward-method-explicitly-instead-of-the-call-attr"]], "Already using ipex.optimize": [[10, "already-using-ipex-optimize"]], "Already using Jit Trace": [[10, "already-using-jit-trace"]], "Fast BERT (Prototype)": [[11, "fast-bert-prototype"]], 
"Feature Description": [[11, "feature-description"], [12, "feature-description"]], "Prerequisite": [[11, "prerequisite"]], "Usage Example": [[11, "usage-example"], [12, "usage-example"], [16, "usage-example"]], "Graph Capture (Prototype)": [[12, "graph-capture-prototype"]], "Ease-of-use graph optimization API": [[13, "ease-of-use-graph-optimization-api"]], "FP32 and BF16 models": [[13, "fp32-and-bf16-models"]], "INT8 models": [[13, "int8-models"]], "Methodology": [[13, "methodology"]], "Fusion": [[13, "fusion"]], "FP32 and BF16 fusion patterns": [[13, "fp32-and-bf16-fusion-patterns"]], "INT8 fusion patterns": [[13, "int8-fusion-patterns"]], "Folding": [[13, "folding"]], "HyperTune (Prototype)": [[14, "hypertune-prototype"]], "Usage of Hypertune": [[14, "usage-of-hypertune"]], "your_conf_file": [[14, "your-conf-file"]], "Hyperparameters": [[14, "hyperparameters"]], "Launcher Hyperparameters": [[14, "launcher-hyperparameters"]], "Defining hyperparameters and their search spaces": [[14, "defining-hyperparameters-and-their-search-spaces"]], "1. Defining hyperparameters to tune:": [[14, "defining-hyperparameters-to-tune"]], "2. 
Defining the search spaces of the hyperparameters:": [[14, "defining-the-search-spaces-of-the-hyperparameters"]], "Default search space": [[14, "default-search-space"]], "User defined search space": [[14, "user-defined-search-space"]], "": [[14, "your-python-script"]], "Usage Examples": [[14, "usage-examples"], [31, "usage-examples"]], "Intel\u00ae Extension for PyTorch* optimizations for quantization": [[15, "intel-extension-for-pytorch-optimizations-for-quantization"]], "Define qconfig": [[15, "define-qconfig"]], "Prepare Model and Do Calibration": [[15, "prepare-model-and-do-calibration"]], "Convert to Static Quantized Model and Deploy": [[15, "convert-to-static-quantized-model-and-deploy"]], "Define QConfig": [[15, "id1"]], "Prepare Model": [[15, "prepare-model"]], "Convert to Dynamic Quantized Model and Deploy": [[15, "convert-to-dynamic-quantized-model-and-deploy"]], "INT8 Recipe Tuning API (Prototype)": [[16, "int8-recipe-tuning-api-prototype"]], "Smooth Quantization Autotune": [[16, "smooth-quantization-autotune"]], "Algorithm: Auto-tuning of $\\alpha$.": [[16, "algorithm-auto-tuning-of-alpha"]], "$\\alpha$ Usage": [[16, "alpha-usage"]], "Using a fixed alpha": [[16, "using-a-fixed-alpha"]], "Determining the alpha through auto-tuning": [[16, "determining-the-alpha-through-auto-tuning"]], "Overview": [[17, "overview"], [30, "overview"], [31, "overview"], [33, "overview"]], "CPU ISA build compiler requirement": [[17, "cpu-isa-build-compiler-requirement"]], "Dynamic Dispatch Design": [[17, "dynamic-dispatch-design"]], "Code Folder Struct": [[17, "code-folder-struct"]], "Kernel implementation: csrc/cpu/aten/kernels/xyzKrnl.cpp": [[17, "kernel-implementation-csrc-cpu-aten-kernels-xyzkrnl-cpp"]], "Kernel Stub: csrc/cpu/aten/xyz.cpp and csrc/cpu/aten/xyz.h": [[17, "kernel-stub-csrc-cpu-aten-xyz-cpp-and-csrc-cpu-aten-xyz-h"]], "Dispatch Stub implementation: csrc/cpu/dyndisp/DispatchStub.cpp and csrc/cpu/dyndisp/DispatchStub.h": [[17, 
"dispatch-stub-implementation-csrc-cpu-dyndisp-dispatchstub-cpp-and-csrc-cpu-dyndisp-dispatchstub-h"]], "CodeGen Process": [[17, "codegen-process"]], "Add Custom Kernel": [[17, "add-custom-kernel"]], "ISA intrinics specific kernel example:": [[17, "isa-intrinics-specific-kernel-example"]], "Vec specific kernel example:": [[17, "vec-specific-kernel-example"]], "Private Debug APIs": [[17, "private-debug-apis"]], "Example:": [[17, "example"], [17, "id1"]], "Select ISA level manually.": [[17, "select-isa-level-manually"]], "CPU feature check": [[17, "cpu-feature-check"]], "Channels Last": [[18, "channels-last"], [33, "channels-last"]], "What is Channels Last": [[18, "what-is-channels-last"]], "Memory Format Is All That Matters": [[18, "memory-format-is-all-that-matters"]], "a. NCHW (default)": [[18, "a-nchw-default"]], "b. NHWC (WIP for CPU)": [[18, "b-nhwc-wip-for-cpu"]], "c. Blocked (nChw16c)": [[18, "c-blocked-nchw16c"]], "PyTorch Strided Layout": [[18, "pytorch-strided-layout"]], "PyTorch Channels Last Memory Format APIs": [[18, "pytorch-channels-last-memory-format-apis"]], "a. tensor creation": [[18, "a-tensor-creation"]], "b. tensor conversion": [[18, "b-tensor-conversion"]], "c. model conversion": [[18, "c-model-conversion"]], "d. operator coverage": [[18, "d-operator-coverage"]], "Writing Channels Last Kernels": [[18, "writing-channels-last-kernels"]], "a. Status on CPU": [[18, "a-status-on-cpu"]], "b. Register Channels Last Kernel in ATen Native Manner": [[18, "b-register-channels-last-kernel-in-aten-native-manner"]], "c. Register oneDNN Kernel on Channels Last": [[18, "c-register-onednn-kernel-on-channels-last"]], "oneDNN NHWC APIs": [[18, "onednn-nhwc-apis"]], "a. Create NHWC Memory": [[18, "a-create-nhwc-memory"]], "b. 
Create Convolution Primitive": [[18, "b-create-convolution-primitive"]], "CPU Channels Last Targets": [[18, "cpu-channels-last-targets"]], "Optimizer Fusion": [[19, "optimizer-fusion"]], "Operation Fusion": [[19, "operation-fusion"]], "Requirements": [[20, "requirements"]], "Use Cases": [[20, "use-cases"]], "Example of MultiStream Module": [[20, "example-of-multistream-module"]], "Examples1: Basic Usage": [[20, "examples1-basic-usage"]], "Examples2: Usage with \u201cAUTO\u201d setting": [[20, "examples2-usage-with-auto-setting"]], "Examples3: Usage for models with structure inputs/outputs": [[20, "examples3-usage-for-models-with-structure-inputs-outputs"]], "Performance recipes": [[20, "performance-recipes"]], "Known issues": [[20, "known-issues"], [34, "id37"]], "Example of asynchronous task": [[20, "example-of-asynchronous-task"]], "Example of configuring core binding": [[20, "example-of-configuring-core-binding"]], "Detail Design": [[20, "detail-design"]], "How the core binding is implemented": [[20, "how-the-core-binding-is-implemented"]], "Design of Task": [[20, "design-of-task"]], "IOMP preload or load during the runtime": [[20, "iomp-preload-or-load-during-the-runtime"]], "Split SGD": [[21, "split-sgd"], [21, "id2"]], "Stochastic Gradient Descent (SGD)": [[21, "stochastic-gradient-descent-sgd"]], "Smooth Quant Recipe Tuning API (Prototype)": [[22, "smooth-quant-recipe-tuning-api-prototype"]], "Quick Start": [[23, "quick-start"]], "LLM Quick Start": [[23, "llm-quick-start"]], "Installation": [[24, "installation"]], "Get Started": [[25, "get-started"]], "Troubleshooting": [[26, "troubleshooting"]], "General Usage": [[26, "general-usage"]], "Performance Regression": [[26, "performance-regression"]], "TorchDynamo": [[26, "torchdynamo"]], "Dynamic Shape": [[26, "dynamic-shape"]], "Result Correctness": [[26, "result-correctness"]], "License": [[27, "license"]], "Large Language Models (LLM) Optimization Overview": [[28, 
"large-language-models-llm-optimization-overview"]], "ipex.llm Optimized Model List": [[28, "ipex-llm-optimized-model-list"]], "Verified for single instance mode": [[28, "verified-for-single-instance-mode"]], "Verified for distributed inference mode via DeepSpeed": [[28, "verified-for-distributed-inference-mode-via-deepspeed"]], "Module Level Optimization API for customized LLM (Prototype)": [[28, "module-level-optimization-api-for-customized-llm-prototype"]], "Demos": [[28, "demos"]], "Optimization Methodologies": [[28, "optimization-methodologies"]], "Linear Operator Optimization": [[28, "linear-operator-optimization"]], "Low Precision Data Types": [[28, "low-precision-data-types"]], "Indirect Access KV Cache": [[28, "indirect-access-kv-cache"]], "Distributed Inference": [[28, "distributed-inference"]], "Transformers Optimization Frontend API": [[29, "transformers-optimization-frontend-api"]], "Pseudocode of Common Usage Scenarios": [[29, "pseudocode-of-common-usage-scenarios"]], "SmoothQuant": [[29, "smoothquant"]], "Weight Only Quantization (WOQ)": [[29, "weight-only-quantization-woq"]], "Distributed Inference with DeepSpeed": [[29, "distributed-inference-with-deepspeed"]], "Performance": [[30, "performance"], [34, "performance"]], "Performance Data for Intel\u00ae AI Data Center Products": [[30, "performance-data-for-intel-ai-data-center-products"]], "LLM Performance": [[30, "llm-performance"]], "INT8 with v1.11": [[30, "int8-with-v1-11"]], "Performance Numbers": [[30, "performance-numbers"], [30, "id1"], [30, "id4"]], "Accuracy": [[30, "accuracy"]], "Configuration": [[30, "configuration"], [30, "id2"], [30, "id5"]], "Software Version": [[30, "software-version"], [30, "id3"], [30, "id6"]], "Hardware Configuration": [[30, "hardware-configuration"], [30, "id7"], [33, "hardware-configuration"]], "FP32 with v1.11.200 on an AWS EC2 C6i.2xlarge instance": [[30, "fp32-with-v1-11-200-on-an-aws-ec2-c6i-2xlarge-instance"]], "FP32 and BFloat16 with v1.10": [[30, 
"fp32-and-bfloat16-with-v1-10"]], "Launch Script Usage Guide": [[31, "launch-script-usage-guide"]], "Usage of launch script": [[31, "usage-of-launch-script"]], "Single instance for inference": [[31, "single-instance-for-inference"]], "I. Use all physical cores": [[31, "i-use-all-physical-cores"]], "II. Use all cores including logical cores": [[31, "ii-use-all-cores-including-logical-cores"]], "III. Use physical cores on designated nodes": [[31, "iii-use-physical-cores-on-designated-nodes"]], "IV. Use your designated number of cores": [[31, "iv-use-your-designated-number-of-cores"]], "Multiple instances for inference": [[31, "multiple-instances-for-inference"]], "V. Throughput mode": [[31, "v-throughput-mode"]], "VI. Latency mode": [[31, "vi-latency-mode"]], "VII. Your designated number of instances": [[31, "vii-your-designated-number-of-instances"]], "VIII. Your designated number of instances and instance index": [[31, "viii-your-designated-number-of-instances-and-instance-index"]], "Usage of Jemalloc/TCMalloc/Default memory allocator": [[31, "usage-of-jemalloc-tcmalloc-default-memory-allocator"]], "Jemalloc": [[31, "jemalloc"], [33, "jemalloc"]], "TCMalloc": [[31, "tcmalloc"], [33, "tcmalloc"]], "Default memory allocator": [[31, "default-memory-allocator"]], "Usage of OpenMP library": [[31, "usage-of-openmp-library"]], "Intel OpenMP Library": [[31, "intel-openmp-library"]], "GNU OpenMP Library": [[31, "gnu-openmp-library"]], "TorchServe with Intel\u00ae Extension for PyTorch*": [[32, "torchserve-with-intel-extension-for-pytorch"]], "Contents of this Document": [[32, "contents-of-this-document"], [33, "contents-of-this-document"]], "Install Intel\u00ae Extension for PyTorch*": [[32, "install-intel-extension-for-pytorch"]], "Serving model with Intel\u00ae Extension for PyTorch*": [[32, "serving-model-with-intel-extension-for-pytorch"]], "TorchServe with Launcher": [[32, "torchserve-with-launcher"]], "Launcher Core Pinning to Boost Performance of TorchServe Multi 
Worker Inference": [[32, "launcher-core-pinning-to-boost-performance-of-torchserve-multi-worker-inference"]], "Scaling workers": [[32, "scaling-workers"]], "Creating and Exporting INT8 model for Intel\u00ae Extension for PyTorch*": [[32, "creating-and-exporting-int8-model-for-intel-extension-for-pytorch"]], "1. Creating a serialized file": [[32, "creating-a-serialized-file"]], "ResNet50": [[32, "resnet50"]], "2. Creating a Model Archive": [[32, "creating-a-model-archive"]], "3. Start TorchServe to serve the model": [[32, "start-torchserve-to-serve-the-model"]], "4. Registering and Deploying model": [[32, "registering-and-deploying-model"]], "Benchmarking with Launcher": [[32, "benchmarking-with-launcher"]], "Benchmarking with Launcher Core Pinning": [[32, "benchmarking-with-launcher-core-pinning"]], "Performance Boost with Intel\u00ae Extension for PyTorch* and Launcher": [[32, "performance-boost-with-intel-extension-for-pytorch-and-launcher"]], "Performance Tuning Guide": [[33, "performance-tuning-guide"]], "Intel CPU Structure": [[33, "intel-cpu-structure"]], "Non-Uniform Memory Access (NUMA)": [[33, "non-uniform-memory-access-numa"]], "Software Configuration": [[33, "software-configuration"]], "Numactl": [[33, "numactl"]], "OpenMP": [[33, "openmp"]], "OMP_NUM_THREADS": [[33, "omp-num-threads"]], "OMP_THREAD_LIMIT": [[33, "omp-thread-limit"]], "GNU OpenMP": [[33, "gnu-openmp"]], "Intel OpenMP": [[33, "intel-openmp"]], "Memory Allocator": [[33, "memory-allocator"]], "Denormal Number": [[33, "denormal-number"]], "OneDNN primitive cache": [[33, "onednn-primitive-cache"]], "Releases": [[34, "releases"]], "2.3.0": [[34, "id1"]], "Highlights": [[34, "highlights"], [34, "id3"], [34, "id5"], [34, "id7"], [34, "id9"], [34, "id11"], [34, "id13"], [34, "id15"], [34, "id18"], [34, "id21"], [34, "id24"], [34, "id26"], [34, "id29"]], "2.2.0": [[34, "id2"]], "2.1.100": [[34, "id4"]], "2.1.0": [[34, "id6"]], "2.0.100": [[34, "id8"]], "2.0.0": [[34, "id10"]], "Known Issues": 
[[34, "known-issues"], [34, "id16"], [34, "id22"], [34, "id30"]], "1.13.100": [[34, "id12"]], "1.13.0": [[34, "id14"]], "1.12.300": [[34, "id17"]], "1.12.100": [[34, "id19"]], "1.12.0": [[34, "id20"]], "1.11.200": [[34, "id23"]], "1.11.0": [[34, "id25"]], "What\u2019s Changed": [[34, "what-s-changed"], [34, "id31"]], "1.10.100": [[34, "id27"]], "1.10.0": [[34, "id28"]], "1.9.0": [[34, "id32"]], "What\u2019s New": [[34, "what-s-new"], [34, "id34"], [34, "id36"], [34, "id39"], [34, "id42"]], "1.8.0": [[34, "id33"]], "1.2.0": [[34, "id35"]], "Performance Improvement": [[34, "performance-improvement"]], "Others": [[34, "others"]], "1.1.0": [[34, "id38"]], "1.0.2": [[34, "id40"]], "1.0.1-Alpha": [[34, "alpha"]], "1.0.0-Alpha": [[34, "id41"]], "Performance Result": [[34, "performance-result"]], "NOTE": [[34, "note"]]}, "indexentries": {"cpupool (class in intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.CPUPool"]], "fastlayernorm (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.FastLayerNorm"]], "indirectaccesskvcacheattention (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.IndirectAccessKVCacheAttention"]], "linear2silumul (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.Linear2SiluMul"]], "linearadd (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearAdd"]], "linearaddadd (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearAddAdd"]], "lineargelu (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearGelu"]], "linearmul (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearMul"]], "linearnewgelu (class in intel_extension_for_pytorch.llm.modules)": [[2, 
"intel_extension_for_pytorch.llm.modules.LinearNewGelu"]], "linearrelu (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearRelu"]], "linearsilu (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearSilu"]], "linearsilumul (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearSiluMul"]], "multistreammodule (class in intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.MultiStreamModule"]], "multistreammodulehint (class in intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.MultiStreamModuleHint"]], "pagedattention (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.PagedAttention"]], "rmsnorm (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.RMSNorm"]], "rotaryembedding (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.RotaryEmbedding"]], "task (class in intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.Task"]], "varlenattention (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.VarlenAttention"]], "autotune() (in module intel_extension_for_pytorch.quantization)": [[2, "intel_extension_for_pytorch.quantization.autotune"]], "convert() (in module intel_extension_for_pytorch.quantization)": [[2, "intel_extension_for_pytorch.quantization.convert"]], "enable_onednn_fusion() (in module intel_extension_for_pytorch)": [[2, "intel_extension_for_pytorch.enable_onednn_fusion"]], "fast_bert() (in module intel_extension_for_pytorch)": [[2, "intel_extension_for_pytorch.fast_bert"]], "fast_layer_norm() (in module intel_extension_for_pytorch.llm.functional)": [[2, "intel_extension_for_pytorch.llm.functional.fast_layer_norm"]], 
"get_core_list_of_node_id() (in module intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.get_core_list_of_node_id"]], "get_smooth_quant_qconfig_mapping() (in module intel_extension_for_pytorch.quantization)": [[2, "intel_extension_for_pytorch.quantization.get_smooth_quant_qconfig_mapping"]], "indirect_access_kv_cache_attention() (in module intel_extension_for_pytorch.llm.functional)": [[2, "intel_extension_for_pytorch.llm.functional.indirect_access_kv_cache_attention"]], "intel_extension_for_pytorch": [[2, "module-intel_extension_for_pytorch"]], "intel_extension_for_pytorch.cpu.runtime": [[2, "module-intel_extension_for_pytorch.cpu.runtime"]], "intel_extension_for_pytorch.llm": [[2, "module-intel_extension_for_pytorch.llm"]], "intel_extension_for_pytorch.llm.functional": [[2, "module-intel_extension_for_pytorch.llm.functional"]], "intel_extension_for_pytorch.llm.modules": [[2, "module-intel_extension_for_pytorch.llm.modules"]], "intel_extension_for_pytorch.quantization": [[2, "module-intel_extension_for_pytorch.quantization"]], "is_runtime_ext_enabled() (in module intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.is_runtime_ext_enabled"]], "module": [[2, "module-intel_extension_for_pytorch"], [2, "module-intel_extension_for_pytorch.cpu.runtime"], [2, "module-intel_extension_for_pytorch.llm"], [2, "module-intel_extension_for_pytorch.llm.functional"], [2, "module-intel_extension_for_pytorch.llm.modules"], [2, "module-intel_extension_for_pytorch.quantization"]], "optimize() (in module intel_extension_for_pytorch)": [[2, "intel_extension_for_pytorch.optimize"]], "optimize() (in module intel_extension_for_pytorch.llm)": [[2, "intel_extension_for_pytorch.llm.optimize"]], "pin (class in intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.pin"]], "prepare() (in module intel_extension_for_pytorch.quantization)": [[2, 
"intel_extension_for_pytorch.quantization.prepare"]], "rms_norm() (in module intel_extension_for_pytorch.llm.functional)": [[2, "intel_extension_for_pytorch.llm.functional.rms_norm"]], "rotary_embedding() (in module intel_extension_for_pytorch.llm.functional)": [[2, "intel_extension_for_pytorch.llm.functional.rotary_embedding"]], "varlen_attention() (in module intel_extension_for_pytorch.llm.functional)": [[2, "intel_extension_for_pytorch.llm.functional.varlen_attention"]], "verbose (class in intel_extension_for_pytorch)": [[2, "intel_extension_for_pytorch.verbose"]], "frozenbatchnorm2d (class in intel_extension_for_pytorch.nn)": [[7, "intel_extension_for_pytorch.nn.FrozenBatchNorm2d"]], "mergedembeddingbag (class in intel_extension_for_pytorch.nn.modules)": [[7, "intel_extension_for_pytorch.nn.modules.MergedEmbeddingBag"]], "mergedembeddingbagwithsgd (class in intel_extension_for_pytorch.nn.modules)": [[7, "intel_extension_for_pytorch.nn.modules.MergedEmbeddingBagWithSGD"]], "interaction() (in module intel_extension_for_pytorch.nn.functional)": [[7, "intel_extension_for_pytorch.nn.functional.interaction"]]}})
\ No newline at end of file
diff --git a/cpu/2.3.0+cpu/tutorials/api_doc.html b/cpu/2.3.0+cpu/tutorials/api_doc.html
index 2be3c8ff2..5424314e8 100644
--- a/cpu/2.3.0+cpu/tutorials/api_doc.html
+++ b/cpu/2.3.0+cpu/tutorials/api_doc.html
@@ -421,13 +421,15 @@
+result = torch.nn.functional.silu(linear(input))
- Parameters:
-linear (torch.nn.Linear module) – the original torch.nn.Linear module to be fused with silu.
+linear (torch.nn.Linear module) – the original torch.nn.Linear
+module to be fused with silu.
@@ -451,9 +453,9 @@ LLM Module Level Optimizations (Prototype)
on the result, and multiplies the result by other:
-result = torch.nn.functional.silu(linear(input)) * other
+result = torch.nn.functional.silu(linear(input)) * other
- Parameters:
linear (torch.nn.Linear module) – the original torch.nn.Linear module to
@@ -479,17 +481,20 @@
LLM Module Level Optimizations (Prototype)