Merge pull request #1036 from zachlasiuk/main
New KleidiAI basics Learning Path
Showing 11 changed files with 531 additions and 0 deletions.
Binary file added (BIN, +18.7 KB): .../learning-paths/cross-platform/kleidiai-explainer/Arm_KleidiAI_square_color.png
Binary file added (BIN, +66 KB): content/learning-paths/cross-platform/kleidiai-explainer/KleidiAI-src-matmul.JPG
Binary file added (BIN, +23.6 KB): content/learning-paths/cross-platform/kleidiai-explainer/KleidiAI-src.JPG
46 changes: 46 additions & 0 deletions
content/learning-paths/cross-platform/kleidiai-explainer/_index.md
---
title: KleidiAI basics - Improving AI/ML workloads from servers to phones

minutes_to_complete: 60

who_is_this_for: This is an introductory topic for people wanting to learn how Generative AI workloads execute on hardware, and how KleidiAI accelerates them.

learning_objectives:
- Understand how basic math operations power Large Language Models.
- Learn how the KleidiAI micro-kernels speed up Generative AI inference performance.
- Run a basic C++ matrix multiplication example to showcase the speedup the KleidiAI micro-kernels deliver.

prerequisites:
- An Arm Linux machine that implements the Int8 Matrix Multiplication (*i8mm*) architecture feature; this example uses an AWS Graviton 3 instance. Instructions on setting up an Arm-based server are [found here](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/aws/).
- A basic understanding of linear algebra terminology such as dot product and matrix multiplication.
author_primary: Zach Lasiuk
### Tags
skilllevels: Introductory
subjects: ML
armips:
- Cortex-X
- Cortex-A
- Neoverse
tools_software_languages:
- C++
- GenAI
- Coding
- NEON
operatingsystems:
- Linux

### Cross-platform metadata only
shared_path: true
shared_between:
- servers-and-cloud-computing
- smartphones-and-mobile

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1                       # _index.md always has weight of 1 to order correctly
layout: "learningpathall"       # All files under learning paths have this same wrapper
learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
24 changes: 24 additions & 0 deletions
content/learning-paths/cross-platform/kleidiai-explainer/_next-steps.md
---
next_step_guidance: Check out the KleidiAI further reading below to better understand what it is capable of.

recommended_path: /learning-paths/servers-and-cloud-computing/llama-cpu/
further_reading:
    - resource:
        title: KleidiAI documentation
        link: https://gitlab.arm.com/kleidi/kleidiai/-/blob/main/docs/matmul_qsi4cx/README.md?ref_type=heads
        type: documentation
    - resource:
        title: KleidiAI visualized
        link: https://community.arm.com/arm-community-blogs/b/ai-and-ml-blog/posts/kleidiai
        type: blog

# ================================================================================
# FIXED, DO NOT MODIFY
# ================================================================================
weight: 21                  # set to always be larger than the content in this path, and one more than 'review'
title: "Next Steps"         # Always the same
layout: "learningpathall"   # All files under learning paths have this same wrapper
---
34 changes: 34 additions & 0 deletions
content/learning-paths/cross-platform/kleidiai-explainer/_review.md
---
review:
    - questions:
        question: >
            What devices does KleidiAI NOT work on?
        answers:
            - AWS Graviton 3 (C7g, M7g, R7g)
            - NVIDIA Grace (GB200 NVL72)
            - Google Pixel 8 Pro
            - Vivo Y22
            - It runs on all of these
        correct_answer: 5
        explanation: >
            KleidiAI runs on all of the devices listed above, and more.
    - questions:
        question: >
            If your ML framework supports KleidiAI, you automatically benefit from its AI workload acceleration.
        answers:
            - True, I don't need to do anything else to enable it.
            - False, I need to manually activate it.
        correct_answer: 1
        explanation: >
            Once your ML framework adopts KleidiAI, you will automatically see AI workload acceleration on supported machines.
# ================================================================================
# FIXED, DO NOT MODIFY
# ================================================================================
title: "Review"                # Always the same title
weight: 20                     # Set to always be larger than the content in this path
layout: "learningpathall"      # All files under learning paths have this same wrapper
---
Binary file added (BIN, +629 KB): content/learning-paths/cross-platform/kleidiai-explainer/neural-node-pic.jpg
75 changes: 75 additions & 0 deletions
content/learning-paths/cross-platform/kleidiai-explainer/page1.md
---
title: KleidiAI and matrix multiplication
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---
## What is KleidiAI?

KleidiAI is a set of micro-kernels that integrate into machine learning frameworks, accelerating your AI inference on Arm-based platforms. KleidiAI's micro-kernels are hand-optimized in Arm assembly code to leverage modern architecture instructions that greatly speed up AI inference on Arm CPUs.

You don't need to do anything to get the benefits of KleidiAI. It applies automatically if two conditions are met:
1. Your ML Framework integrates KleidiAI, and
2. Your hardware platform supports the required Arm instructions for your inference (covered further down this page).

![KleidiAI#center](Arm_KleidiAI_square_color.png "Optimized micro-kernels for AI workloads on Arm CPUs")
## How does Generative AI mathematically execute in hardware?

{{% notice Quote %}}
“Any sufficiently advanced technology is indistinguishable from magic” - Arthur C. Clarke
{{% /notice %}}

In the case of today's Generative AI models, the math behind the perceived magic is **matrix multiplication**. To understand this, and to better understand KleidiAI itself, this section offers a high-level explanation of neural network architecture.

Neural networks consist of layers of neurons. Each neuron in a layer is connected to all neurons in the previous layer. Each of these connections has a unique connection strength, learned through training. This is called a connection's *weight*.

During inference (such as trying to generate the next token/word from a given input), each neuron performs a weighted sum of its inputs and then decides its value via an activation function. The weighted sum is the dot product of each connected neuron's input (*x*) and its connection weight (*w*). The calculations for an entire layer of neurons can be performed efficiently via matrix multiplication, where the input matrix is multiplied by the weight matrix.

For example, in the image below, *z1* is calculated as a dot product of the connected *x*'s and *w*'s from the previous layer. All *z* values in Layer 0 can therefore be efficiently calculated with a single matrix multiplication operation.
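As a concrete illustration, here is a minimal C++ sketch (illustrative only; the layer sizes and values are invented and this is not KleidiAI code) that computes one layer's pre-activation values *z* as a plain matrix-vector product of weights and inputs:

```cpp
// Illustrative sketch only - not KleidiAI code. Sizes and values are invented.
// Computes one layer's pre-activation values z, where each z[i] is the dot
// product of neuron i's weights with the previous layer's outputs x.
#include <array>
#include <cstdio>

int main() {
    constexpr int kInputs = 3;   // neurons in the previous layer
    constexpr int kNeurons = 2;  // neurons in this layer

    // Each row holds one neuron's learned connection weights w.
    std::array<std::array<float, kInputs>, kNeurons> W = {{
        {0.2f, -0.5f, 0.8f},
        {1.0f,  0.3f, -0.7f}
    }};
    std::array<float, kInputs> x = {0.5f, 1.5f, -1.0f};  // previous layer's outputs
    std::array<float, kNeurons> z = {};

    for (int i = 0; i < kNeurons; ++i) {
        for (int j = 0; j < kInputs; ++j) {
            z[i] += W[i][j] * x[j];  // weighted sum (dot product)
        }
    }

    std::printf("z = [%f, %f]\n", z[0], z[1]);
    return 0;
}
```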
![Neural Network example#center](neural-node-pic.jpg "Figure 1. Zoomed in on neural network node")

Sidebar: In addition to *weights*, each neuron in a neural network is assigned a *bias*. These weights and biases are learned during training and make up a model's parameters. For example, the Llama 3 model with 8 billion parameters has around 8 billion individual weights and biases that embody what it learned during training. Generally speaking, the more parameters a model has, the more information it can retain from its training, leading to more capable models. For more information about Llama 3, view its [Hugging Face model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B).
### Why is speeding up matrix multiplication crucial for AI performance?
What does this all mean? An 8 billion parameter model generating one token requires billions of dot product calculations, with at least hundreds of millions of matrix multiplication operations. Therefore, speeding up matrix multiplication is critical both to running massive Generative AI models on servers and to running smaller models on constrained devices like smartphones.

KleidiAI uses modern Arm CPU instructions to accelerate matrix multiplication and overall AI inference.
## What Arm features does KleidiAI leverage?
Each KleidiAI matrix multiplication micro-kernel uses a specific Arm architecture feature to enhance AI inference. Below is a description of each architecture feature KleidiAI uses to accelerate matrix multiplication (a short, illustrative intrinsics sketch follows the list):

* **Dot Product**: KleidiAI uses the `vdotq_s32` intrinsic, a vector dot product introduced as part of the Advanced SIMD (NEON) extension. It computes dot products of vectors of 8-bit integers and accumulates the results into 32-bit integers. View the `vdot` documentation [here](https://developer.arm.com/documentation/ddi0597/2024-03/SIMD-FP-Instructions/VDOT--by-element---BFloat16-floating-point-indexed-dot-product--vector--by-element--).

* **SMMLA**: KleidiAI also makes use of the Int8 Matrix Multiplication (i8mm) feature, including the `SMMLA` instruction, which stands for *Signed 8-bit integer matrix multiply-accumulate*. It multiplies a 2x8 matrix of 8-bit integers by an 8x2 matrix of 8-bit integers, accumulating the result into a 2x2 matrix of 32-bit integers. For more information, view the *SMMLA* and *i8mm* documentation [here](https://developer.arm.com/documentation/ddi0602/latest/SIMD-FP-Instructions/SMMLA--vector---Signed-8-bit-integer-matrix-multiply-accumulate--vector--).

* **FMLA**: This instruction stands for *Floating-point Multiply Accumulate* and is used here for 16-bit operations. It is included as part of the Advanced SIMD extension, multiplying and accumulating two vectors, each containing eight 16-bit numbers. View the `FMLA` documentation [here](https://developer.arm.com/documentation/ddi0602/2024-03/SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-Multiply-Add-to-accumulator--vector--).

* **FMOPA**: This instruction stands for *Floating-point outer product and accumulate*. It is included in the Arm Scalable Matrix Extension (SME). The single-precision `FMOPA` variant enables optimized matrix multiplication on 32-bit numbers. View the `FMOPA` documentation [here](https://developer.arm.com/documentation/ddi0602/2023-12/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-?lang=en).
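To make the first two features more concrete, here is a minimal, hedged C++ sketch (not taken from the KleidiAI sources; the build command and data are assumptions) that calls the `vdotq_s32` and `vmmlaq_s32` intrinsics directly on small int8 vectors:

```cpp
// Illustrative sketch only - not KleidiAI source. Assumed build command:
//   g++ -O2 -march=armv8.2-a+dotprod+i8mm dot_example.cpp -o dot_example
#include <arm_neon.h>
#include <cstdio>

int main() {
    // Two vectors of sixteen signed 8-bit integers each.
    int8_t a_bytes[16], b_bytes[16];
    for (int i = 0; i < 16; ++i) {
        a_bytes[i] = static_cast<int8_t>(i);
        b_bytes[i] = 1;
    }
    int8x16_t a = vld1q_s8(a_bytes);
    int8x16_t b = vld1q_s8(b_bytes);

    // vdotq_s32: each 32-bit lane accumulates the dot product of the
    // corresponding group of four int8 elements from a and b.
    int32x4_t dot_acc = vdotq_s32(vdupq_n_s32(0), a, b);

    // vmmlaq_s32 (i8mm / SMMLA): treats a and b as packed 2x8 int8 matrices
    // and accumulates their 2x2 int32 matrix product.
    int32x4_t mmla_acc = vmmlaq_s32(vdupq_n_s32(0), a, b);

    int32_t dot[4], mmla[4];
    vst1q_s32(dot, dot_acc);
    vst1q_s32(mmla, mmla_acc);
    std::printf("vdotq_s32:  %d %d %d %d\n", dot[0], dot[1], dot[2], dot[3]);
    std::printf("vmmlaq_s32: %d %d %d %d\n", mmla[0], mmla[1], mmla[2], mmla[3]);
    return 0;
}
```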
Today, Arm-powered hardware containing these instructions exists in cloud servers and smartphones. Below are some examples of the first products from popular vendors that support KleidiAI:

| Area | Example Product | Arm-based SoC | Arm Architecture |
| --------- | ----------------- | ---------------- | ----------- |
| Smartphone | Google Pixel 6 | Google Tensor G1 | Armv8.2 |
| Smartphone | OPPO Reno6 Pro 5G | MediaTek Dimensity 1200 | Armv8.2 |
| Smartphone | Vivo Y22 | MediaTek Helio G70 | Armv8.2 |
| Smartphone | Xiaomi Mi 11 | Qualcomm Snapdragon 888 | Armv8.2 |
| Smartphone | Samsung Galaxy S20 Ultra | Samsung Exynos 990 | Armv8.2 |
| Smartphone | Google Pixel 8 Pro | Google Tensor G3 | Armv9.0 |
| Smartphone | Samsung Galaxy S22 | Snapdragon 8 Gen 1 | Armv9.0 |
| Smartphone | OPPO Find X5 Pro | Snapdragon 8 Gen 1 | Armv9.0 |
| Smartphone | Xiaomi 12T | MediaTek Dimensity 9000 | Armv9.0 |
| Server | c8y | Alibaba Yitian 710 | Armv9.0 |
| Server | GB200 NVL72 | NVIDIA Grace | Armv9.0 |
| Server | C7g, M7g, R7g | AWS Graviton 3 | Armv8.4 |

The remainder of this Learning Path will answer the following questions while stepping through a C++ example:
* How does KleidiAI 'just work' with ML Frameworks?
* What do the micro-kernels in KleidiAI functionally do?
* How are the KleidiAI micro-kernels actually speeding up matrix multiplication?
96 changes: 96 additions & 0 deletions
content/learning-paths/cross-platform/kleidiai-explainer/page2.md
---
title: KleidiAI in a real software stack
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---
## High-level KleidiAI architecture
This section provides an abstracted overview of KleidiAI's components before diving into the specifics. The KleidiAI source files are publicly accessible in the [KleidiAI GitLab repository](https://gitlab.arm.com/kleidi/kleidiai). Navigate there in your web browser to follow along and understand KleidiAI's structure.

KleidiAI's micro-kernels are located in the `/kai/ukernels/matmul` directory; navigate there now. There are essentially two types of KleidiAI micro-kernels today:
1. Quantizing/Packing routines - under the `pack` directory.
2. Matrix Multiplication routines - the three directories starting with `matmul_clamp`. Each directory contains routines specialized for a specific input data type.

![KleidiAI source directory](KleidiAI-src.JPG "Figure 3. KleidiAI src directory")
### What quantization levels does KleidiAI support?
KleidiAI has multiple matrix multiplication micro-kernels, and dynamic quantization routines, to optimally support all model quantization levels. To learn more about model quantization and how selecting the right quantization level affects your AI-based application, refer to [this Learning Path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama-cpu/llama-chatbot#quantization-format).

KleidiAI currently has three matrix multiplication directories that each handle different input/output types, and the set will evolve to support more over time:

| uKernel | Output type | Input types |
| --------- | ----------------- | -------------- |
| `matmul_clamp_f16_f16_f16` | 16-bit floating-point | 16-bit floating-point |
| `matmul_clamp_f32_f32_f32` | 32-bit floating-point | 32-bit floating-point |
| `matmul_clamp_f32_qa8dxP_qs4cxP` | 32-bit floating-point | 8-bit integer and 4-bit integer |
### How to select the right KleidiAI micro-kernel?

Only one matrix multiply micro-kernel will be used for your given AI application. Each AI model and workload (for example, a [Gemma](https://huggingface.co/blog/gemma) model running text generation) has unique inference characteristics. KleidiAI has various matrix multiplication micro-kernel routines to optimize the inference speed of different workloads. The ML Framework provider selects the optimal micro-kernel on your behalf when implementing KleidiAI in their framework. No extra effort is required when using a framework with KleidiAI.
## KleidiAI in a real-world example
Before deep-diving into KleidiAI's code, it is helpful to see how KleidiAI micro-kernels interact with a GenAI model and ML Framework at a high level. The steps below describe in words how KleidiAI speeds up matrix multiplication. The example is of a user asking a chatbot app a question on their smartphone. This is the example application stack to analyze:

![KleidiAI in Stack](sw-stack.png "KleidiAI in a real-world software stack.")
### Simple inference walkthrough

#### Stage 1: Input
* **A user** inputs their question into the chatbot, such as "What is the capital of the United States?"
* **The chatbot app** uses MediaPipe to convert that text into a series of tokens representing the question as a matrix of FP16 numbers (as the Gemma-2b model is in FP16 format).
* **MediaPipe** invokes the large language model (LLM) inference with the tokenized input, feeding it into the first neural network layer of the Gemma-2b model.
* **XNNPack** starts executing the inference, managing the mathematical operations inside the LLM's neural network layers.
* **KleidiAI** is called to accelerate the essential matrix multiplication operations propagating the input through the neural network.
#### Stage 2: KleidiAI Quantizing & Packing micro-kernels
* **KleidiAI** receives two large matrices of FP16 numbers to perform matrix multiplication. The *i8mm* micro-kernels were selected for this workload, requiring quantization from FP16 to lower precision numbers to enhance computational efficiency.
* **The matrix of inputs**, also known as the Left-Hand Side matrix (LHS), is quantized into INT8.
* **The matrix of model weights**, also known as the Right-Hand Side matrix (RHS), is quantized into INT4, with two numbers packed into an INT8 memory space. This packing is done to take advantage of the *i8mm* architecture feature, which operates on 8-bit memory chunks. A small illustrative packing sketch follows this list.
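The following minimal C++ sketch (an illustration under assumptions, not KleidiAI's actual packing micro-kernel) shows the idea of packing two signed 4-bit weights into a single byte, as described for the RHS above:

```cpp
// Illustrative sketch only - not KleidiAI's packing micro-kernel.
// Packs pairs of weights already quantized to the INT4 range [-8, 7]
// into single bytes: two 4-bit values per 8-bit memory space.
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<uint8_t> pack_int4_pairs(const std::vector<int8_t>& q4) {
    std::vector<uint8_t> packed;
    packed.reserve((q4.size() + 1) / 2);
    for (std::size_t i = 0; i < q4.size(); i += 2) {
        uint8_t lo = static_cast<uint8_t>(q4[i]) & 0x0F;   // first value -> low nibble
        uint8_t hi = (i + 1 < q4.size())
                         ? static_cast<uint8_t>((static_cast<uint8_t>(q4[i + 1]) & 0x0F) << 4)
                         : 0;                               // second value -> high nibble
        packed.push_back(static_cast<uint8_t>(hi | lo));
    }
    return packed;
}

int main() {
    std::vector<int8_t> weights_q4 = {3, -2, 7, -8};  // already INT4-quantized weights
    for (uint8_t byte : pack_int4_pairs(weights_q4)) {
        std::printf("0x%02X ", static_cast<unsigned>(byte));
    }
    std::printf("\n");
    return 0;
}
```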
#### Stage 3: KleidiAI Matrix Multiplication micro-kernels
* **KleidiAI** takes the prepared input and model weight matrices (LHS and RHS respectively) and performs matrix multiplication using optimized *SMMLA* instructions from the *i8mm* micro-kernels.
* **KleidiAI** unpacks and de-quantizes the result back into the original FP16 number format and sends the resulting matrix to XNNPack.
#### Stage 4: Finish Inference and Output
* **XNNPack** completes the inference by sending the output of each neural network layer into the next, continuing through all layers of the Gemma-2b model, calling KleidiAI to execute dynamic quantization and matrix multiplication.
* **XNNPack** sends the inference result, a final matrix, to MediaPipe.
* **MediaPipe** decodes the numerical matrix into a series of tokens representing the answer to the original question.
* **The chatbot app** receives these tokens from MediaPipe and displays the answer to the user as it streams in from multiple inferences.
* **The user** sees the answer on the screen: “The capital of the USA is Washington, D.C.”

Note that this overview leaves out details for the sake of brevity, but it is helpful for a conceptual understanding of how KleidiAI interacts with ML Frameworks and Generative AI models.

There are several nuances in the above process that will help you understand KleidiAI better.
### Why are model weights quantized to INT4, and inputs quantized to INT8?
KleidiAI optimizes for a balance of size, accuracy, and execution speed.

Model weights can be quantized down to INT4 without inducing large errors because, after training, weights typically stay within a range suitable for INT4 quantization. Furthermore, selecting INT4 halves the model size relative to INT8 (and quarters it relative to FP16), a significant benefit to both memory storage and throughput, especially critical when deploying to constrained devices like smartphones.

In contrast, neural network activations (described as 'inputs' so far) are highly variable and not as evenly distributed across the full data range. For example, a common activation function - ReLU - outputs zero for any negative input and leaves any positive input unchanged. This results in a number distribution with many small values but also occasional large values. Having only 16 distinct values (from INT4 quantization) would result in large quantization errors.

As a result, KleidiAI strategically selected INT4 quantization for model parameters and INT8 for the inputs propagating through the neural network.
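To illustrate the idea of dynamically quantizing activations to INT8, here is a minimal, hedged C++ sketch (the values and the per-tensor symmetric scheme are assumptions for illustration, not KleidiAI's exact routine):

```cpp
// Illustrative sketch only - not KleidiAI's quantization routine.
// Per-tensor symmetric INT8 dynamic quantization of activations, with
// dequantization to show the (small) round-trip error.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> activations = {0.0f, 0.1f, 2.7f, -3.5f, 0.02f};

    // The scale is chosen at runtime from the observed range, which is what
    // makes the quantization "dynamic".
    float max_abs = 0.0f;
    for (float v : activations) max_abs = std::max(max_abs, std::fabs(v));
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;

    std::vector<int8_t> quantized(activations.size());
    for (std::size_t i = 0; i < activations.size(); ++i) {
        quantized[i] = static_cast<int8_t>(std::lround(activations[i] / scale));
    }

    for (std::size_t i = 0; i < quantized.size(); ++i) {
        std::printf("%+.3f -> %4d -> %+.3f\n",
                    activations[i], quantized[i], quantized[i] * scale);
    }
    return 0;
}
```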
{{% notice What AI model size should you select %}}
Most models are trained in FP32 or FP16 formats and are by default available to download at that size. To realize the benefits of a lower memory footprint, select a pre-quantized version of your desired AI model, ideally in INT4 format. You can quantize models yourself through various tools, but the quickest and easiest way is to locate a pre-quantized version of the same model.

KleidiAI supports matrix multiplication of models across FP32, FP16, INT8, and INT4 formats - your ML framework provider will select the optimal KleidiAI micro-kernel to balance inference speed and accuracy for your use-case.

In short, it is highly recommended to select an INT4 pre-quantized model when inference speed is critical, especially on smartphones and edge devices.
{{% /notice %}}
### Why does KleidiAI need to 'pack' before matrix multiplication?
The goal of KleidiAI's packing micro-kernels is to prepare the incoming matrices for efficient matrix multiplication. Each matrix multiplication routine has a corresponding packing routine, which may or may not require dynamic quantization depending on the incoming number formats.

For example, the *SMMLA* instruction operates on 8-bit numbers. The role of packing after quantization is to organize two 4-bit integers into a single 8-bit memory space, and the *SMMLA*-based micro-kernels are hand-written to operate efficiently on the two packed integers at once.

The power of KleidiAI comes from its deep understanding of AI workload requirements, quantization techniques, data packing strategies, and advanced Arm instructions, combining to squeeze the most performance out of AI models on Arm CPUs.

The next section dives into the technical specifics of how KleidiAI delivers these performance uplifts by stepping through a C++ example.