
Commit

chore: update gpt2 use case llm multi head attention run with rounding (
jfrery authored Jun 7, 2024
1 parent 8501ae0 commit 78bf464
Showing 3 changed files with 83 additions and 65 deletions.
131 changes: 77 additions & 54 deletions use_case_examples/llm/QGPT2Evaluate.ipynb
@@ -8,16 +8,16 @@
"# Evaluation of GPT-2 models with FHE-compliant operators \n",
"\n",
"This notebook presents a first approach of how to execute a GPT model in FHE, where some specific parts of the model are converted to FHE computations.\n",
"In the following, we consider the GPT-2 model with the language modeling head on top, with the following configuration: 12 layers, 12 attention heads, 768 embedding dimensions and a vocabulary size of 50257 words.\n",
"Additionally, our QGPT-2 models are built around Hugging Face's Transformer library, sharing the same API with only a few additional steps to acknowledge.\n",
"The following considers the GPT-2 model with the language modeling head on top, with the following configuration: 12 layers, 12 attention heads, 768 embedding dimensions and a vocabulary size of 50257 words.\n",
"Additionally, the QGPT-2 models are built around Hugging Face's Transformer library, sharing the same API with only a few additional steps to acknowledge.\n",
"\n",
"We therefore evaluate the performance of two quantized versions of the GPT-2 model:\n",
"The performance of two quantized versions of the GPT-2 model:\n",
"- a model that quantizes a single attention head found in the first layer: SingleHeadQGPT2Model\n",
"- a model that quantizes a complete multi-head attention pass found in the first layer: MultiHeadsQGPT2Model\n",
"\n",
"The quantized operators from these models are FHE-compliant, meaning that these specific part can be executed in FHE, while the rest of the model are done in float in the clear.\n",
"We therefore explain how to load these models from the Hugging Face's associated pre-trained ones, calibrate them, compile their FHE circuit and then execute their inference with some FHE.\n",
"Finally, we compare different top-k accuracies on the next token predicted from a base text for both models with respect to the number of bits of quantization used. Using these figures, we show that inputs and weights can be quantized over less than 8 bits to make the inference reach near-floating point performances.\n"
"This notebook explains how to load these models from the Hugging Face's associated pre-trained ones, calibrate them, compile their FHE circuit and then execute their inference with some FHE.\n",
"Finally, different top-k accuracies on the next token predicted from a base text for both models with respect to the number of bits of quantization used are compared. Using these figures, the notebook shows that inputs and weights can be quantized over less than 8 bits to make the inference reach near-floating point performances.\n"
]
},
{
@@ -35,6 +35,7 @@
"outputs": [],
"source": [
"import logging\n",
"import time\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import torch\n",
@@ -50,7 +51,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We load the GPT-2 model (GPT2LMHeadModel) and tokenizer."
"Load the GPT-2 model (GPT2LMHeadModel) and tokenizer."
]
},
{
@@ -70,7 +71,7 @@
"source": [
"### Example with Hugging Face\n",
"\n",
"First, we show a simple example using a short sequence and generate a few tokens with Hugging Face's GPT-2 model. "
"First, a simple example using a short sequence is shown, generating a few tokens with Hugging Face's GPT-2 model."
]
},
{
@@ -87,7 +88,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now encode the sequence using the tokenizer and retrieve the input token ids."
"Now encode the sequence using the tokenizer and retrieve the input token ids."
]
},
{
@@ -105,7 +106,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we generate 10 new tokens and decode the output sentence. "
"Finally, generate 10 new tokens and decode the output sentence. "
]
},
{
@@ -147,7 +148,7 @@
"Here, the model's first attention head of its first layer, as well as its associated first projection, is done using quantized operators. \n",
"The rest remains the same as in Hugging Face's implementation. Mode details can be found in SingleHeadQGPT2Model's documentation.\n",
"\n",
"We first load the model using the pre-trained files. \n",
"First, load the model using the pre-trained files. \n",
"Only 7 bits of quantization is needed in order to recover the same sentence as with the initial floating point model."
]
},
@@ -167,7 +168,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We then generate the next 10 tokens in the clear with the quantized operators in order to retrieve the sentence. "
"Then generate the next 10 tokens in the clear with the quantized operators in order to retrieve the sentence. "
]
},
{
@@ -206,9 +207,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the sentence exactly matches the one from Hugging Face original model.\n",
"The generated sentence exactly matches the one from Hugging Face original model.\n",
"\n",
"We now generate the logits for the next predicted token in the clear."
"The logits for the next predicted token are now generated in the clear."
]
},
{
@@ -227,14 +228,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us compile the model using the input token ids. \n",
"Here, the model executes the forward pass in the clear, which computes and stores the necessary quantization parameters. \n",
"The model is compiled using the input token ids.\n",
"Here, the model executes the forward pass in the clear, which computes and stores the necessary quantization parameters.\n",
"Once this is done, a FHE circuit is built, which then can be used to execute the forward pass with its quantized parts done in FHE."
]
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 12,
"metadata": {},
"outputs": [
{
@@ -254,14 +255,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the circuit's bit-width reaches 15 bits at most.\n",
"The circuit's bit-width reaches 15 bits at most.\n",
"\n",
"Now let's set the model in simulation mode in order to retrieve the logits that are expected to be computed with FHE. "
"The model is set in simulation mode in order to retrieve the logits that are expected to be computed with FHE."
]
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
@@ -270,24 +271,35 @@
"output_logits_simulated = proj_single_head_qgpt2(input_ids).logits"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The single head attention is now ready for execution in FHE."
]
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Simulated logits are equal to the quantized clear ones: False\n"
"Time taken to execute the single head in FHE: 166.38 seconds\n"
]
}
],
"source": [
"print(\n",
" \"Simulated logits are equal to the quantized clear ones:\",\n",
" torch.equal(output_logits_clear, output_logits_simulated),\n",
")"
"proj_single_head_qgpt2.set_fhe_mode(fhe=\"execute\")\n",
"\n",
"start = time.time()\n",
"output_logits_fhe = proj_single_head_qgpt2(input_ids).logits\n",
"end = time.time()\n",
"execution_time = end - start\n",
"\n",
"print(f\"Time taken to execute the single head in FHE: {execution_time:.2f} seconds\")"
]
},
{
@@ -297,16 +309,16 @@
"source": [
"## Multi-Head Attention Model\n",
"\n",
"Here, the model's multi-head attention found in its first layer is done using quantized operators. \n",
"The rest remains the same as in Hugging Face's implementation. Mode details can be found in MultiHeadsQGPT2Model's documentation.\n",
"Here, the model's multi-head attention found in its first layer is done using quantized operators.\n",
"The rest remains the same as in Hugging Face's implementation. More details can be found in MultiHeadsQGPT2Model's documentation.\n",
"\n",
"We first load the model using the pre-trained files. \n",
"In this case, 10 bits of quantization is needed in order to recover almost the same sentence as with the initial floating point model.\n"
"The model is loaded using the pre-trained files.\n",
"In this case, 10 bits of quantization are needed in order to recover almost the same sentence as with the initial floating point model."
]
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
@@ -318,12 +330,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, we generate new tokens in the clear and decode the output sentence."
"Similarly, new tokens are generated in the clear and the output sentence is decoded."
]
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 17,
"metadata": {},
"outputs": [
{
@@ -340,7 +352,7 @@
"'Computations on encrypted data can help protect your privacy.'"
]
},
"execution_count": 13,
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
@@ -357,12 +369,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the sentence matches the one from Hugging Face original model. However, in its current form, the MHA part need less precision to run on encrypted data. We can reduce the precision to 7 bits for this."
"The generated sentence matches the one from Hugging Face's original model. However, in its current form, the MHA part requires less precision to run on encrypted data. The precision can be reduced to 7 bits for this operation."
]
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 18,
"metadata": {},
"outputs": [
{
@@ -379,7 +391,7 @@
"'Computations on encrypted data can help protect your data from'"
]
},
"execution_count": 14,
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
Expand All @@ -397,14 +409,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the meaning is pretty close and now we can check the compilation to FHE.\n",
"The meaning of the generated sentence is close to the original, and the compilation to FHE can now be checked.\n",
"\n",
"We now generate the logits for the next predicted token in the clear and then compile the model the same way."
"The logits for the next predicted token are generated in the clear, and then the model is compiled in the same way."
]
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -415,7 +427,7 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 20,
"metadata": {},
"outputs": [
{
@@ -437,12 +449,12 @@
"source": [
"In this case, the circuit reaches 15 bits of precision.\n",
"\n",
"We then execute it with simulation and observe that we once again retrieve the same outputs as the ones computes in the clear with quantized operators."
"The model is then executed with simulation, and it is observed that the same outputs as those computed in the clear with quantized operators are retrieved."
]
},
{
"cell_type": "code",
"execution_count": 17,
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -451,24 +463,35 @@
"output_logits_simulated = proj_12_heads_qgpt2(input_ids).logits"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The multi-head attention is now ready for execution in FHE."
]
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Simulated logits are equal to the quantized clear ones: False\n"
"Time taken to execute the multi-head in FHE: 862.97 seconds\n"
]
}
],
"source": [
"print(\n",
" \"Simulated logits are equal to the quantized clear ones:\",\n",
" torch.equal(output_logits_clear, output_logits_simulated),\n",
")"
"proj_12_heads_qgpt2.set_fhe_mode(fhe=\"execute\")\n",
"\n",
"start = time.time()\n",
"output_logits_fhe = proj_12_heads_qgpt2(input_ids).logits\n",
"end = time.time()\n",
"\n",
"execution_time = end - start\n",
"print(f\"Time taken to execute the multi-head in FHE: {execution_time:.2f} seconds\")"
]
},
{
@@ -478,13 +501,13 @@
"source": [
"## Performance Evaluation\n",
"\n",
"In the following, we evaluate the impact of the number of bits used for quantization on the models' performance.\n",
"Here, we check for the top-k accuracies, with a few different values of k, on the predicted next logits with respected to the one computed by the initial floating point model. "
"In the following, the impact of the number of bits used for quantization on the models' performance is evaluated.\n",
"Top-k accuracies are checked, with a few different values of k, on the predicted next logits with respect to the one computed by the initial floating point model."
]
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 24,
"metadata": {},
"outputs": [
{
@@ -635,7 +658,7 @@
" # Generate the top-k tokens for the Hugging Face floating point model\n",
" hf_topk_tokens_list = generate_topk_tokens(gpt2_model, gpt2_tokenizer, text, 1)\n",
"\n",
" # Generate the top-k tokens for the clone model\n",
" # Generate the top-k tokens for the quantized model\n",
" clone_topk_tokens_list = generate_topk_tokens(model, gpt2_tokenizer, text, top_k)\n",
"\n",
" # Compute the top-k accuracy for each token in the text\n",
@@ -674,8 +697,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that for the first model, where only a single head if done with FHE-compliant operators, 4 bits are enough to recover 95\\% of Hugging Face's performance in terms of top1 accuracy. \n",
"For the second model, which implements a complete multi-head attention with quantized operators, 7 bits gives a 80\\% exact predictions while the top-5 and top-10 accuracies reach 98\\%."
"The results show that for the first model, where only a single head is done with FHE-compliant operators, 4 bits are enough to recover 95% of Hugging Face's performance in terms of top-1 accuracy.\n",
"For the second model, which implements a complete multi-head attention with quantized operators, 7 bits gives a 80% exact predictions while the top-5 and top-10 accuracies reach 98%."
]
}
],
15 changes: 5 additions & 10 deletions use_case_examples/llm/README.md
@@ -25,7 +25,6 @@ pip install -r requirements.txt

## How to Use

You can use this code to predict the next token based on an input sentence via various GPT-2 models. There are three distinct modes of inference:

1. Clear Quantized: Inference on unencrypted data.
@@ -74,19 +73,15 @@ The evaluation details are also available in the [QGPT2Evaluate.ipynb](./QGPT2Ev

## FHE Execution

The multi-head attention (MHA) and single-head variants now use a rounding approach, significantly improving their execution times in Fully Homomorphic Encryption (FHE) mode.

See [rounded table lookup](https://docs.zama.ai/concrete/v/main-1/tutorials/rounded_table_lookups) from the Concrete library for more details.
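
As a minimal, plain-Python sketch of the idea (an illustration only, not the Concrete API used in this repository): rounding drops the least-significant bits of an accumulator before a table lookup, so the homomorphic table lookup only has to cover the remaining most-significant bits.

```python
# Illustration only: plain Python, not the Concrete API.
# Rounding away least-significant bits before a table lookup (TLU) means the
# TLU only has to cover the remaining most-significant bits.


def round_lsbs(value: int, lsbs_to_remove: int) -> int:
    """Round `value` to the nearest multiple of 2**lsbs_to_remove."""
    half = 1 << (lsbs_to_remove - 1)
    return ((value + half) >> lsbs_to_remove) << lsbs_to_remove


# A 15-bit accumulator rounded to 7 significant bits: the subsequent table
# lookup needs 2**7 entries instead of 2**15.
accumulator = 23_456
print(round_lsbs(accumulator, lsbs_to_remove=8))  # 23552
```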

For the single-head model, the execution time is 166.38 seconds on a single 196-core CPU machine (an hp7c from AWS). The multi-head attention model, which is a full attention block from GPT-2, now runs in about 862.97 seconds (~14 minutes) under the same conditions. All of these timings correspond to actual FHE execution on encrypted data.

Note that these computations were done using 8 input tokens.

You can replicate these results by running the [QGPT2Evaluate.ipynb](./QGPT2Evaluate.ipynb) notebook.
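
For reference, the FHE timing in the notebook boils down to switching the model into FHE execution mode and timing a forward pass. The sketch below mirrors that notebook cell and assumes `proj_single_head_qgpt2` and `input_ids` have already been loaded, calibrated and compiled as shown in the notebook.

<!--pytest-codeblocks:skip-->

```python
import time

# Assumes `proj_single_head_qgpt2` and `input_ids` were prepared as in the
# QGPT2Evaluate.ipynb notebook (model loaded, calibrated and compiled).
proj_single_head_qgpt2.set_fhe_mode(fhe="execute")

start = time.time()
output_logits_fhe = proj_single_head_qgpt2(input_ids).logits
end = time.time()

print(f"Time taken to execute the single head in FHE: {end - start:.2f} seconds")
```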

## Additional Classes and Functions

