diff --git a/docs/deep-learning/fhe_assistant.md b/docs/deep-learning/fhe_assistant.md
index d3b0b67e8..07932550b 100644
--- a/docs/deep-learning/fhe_assistant.md
+++ b/docs/deep-learning/fhe_assistant.md
@@ -59,21 +59,29 @@ The most common compilation errors stem from the following causes:

 #### 1. TLU input maximum bit-width is exceeded

-This error can occur when `rounding_threshold_bits` is not used and accumulated intermediate values in the computation exceed 16-bits. The most common approaches to fix this issue are:
+**Error message**: `this [N]-bit value is used as an input to a table lookup`
+
+**Cause**: This error can occur when `rounding_threshold_bits` is not used and accumulated intermediate values in the computation exceed 16 bits.
+
+**Possible solutions**:

 - Reduce quantization `n_bits`. However, this may reduce accuracy. When quantization `n_bits` must be below 6, it is best to use [Quantization Aware Training](../deep-learning/fhe_friendly_models.md).
 - Use `rounding_threshold_bits`. This feature is described [here](../explanations/advanced_features.md#rounded-activations-and-quantizers). It is recommended to use the [`fhe.Exactness.APPROXIMATE`](../references/api/concrete.ml.torch.compile.md#function-compile_torch_model) setting, and set the rounding bits to 1 or 2 bits higher than the quantization `n_bits`
 - Use [pruning](../explanations/pruning.md)

-#### 2. No crypto-parameters can be found for the ML model: `RuntimeError: NoParametersFound` is raised by the compiler
+#### 2. No crypto-parameters can be found
+
+**Error message**: `RuntimeError: NoParametersFound` is raised by the compiler

-This error occurs when using `rounding_threshold_bits` in the `compile_torch_model` function. The solutions in this case are similar to the ones for the previous error.
+**Cause**: This error occurs when using `rounding_threshold_bits` in the `compile_torch_model` function.

-#### 3. Quantization import failed with
+**Possible solutions**: The solutions are similar to those for the previous error.

-The error associated is `Error occurred during quantization aware training (QAT) import [...] Could not determine a unique scale for the quantization!`.
+#### 3. Quantization import failed

-This error is a due to missing quantization operators in the model that is imported as a quantized aware training model. See [this guide](../deep-learning/fhe_friendly_models.md) for a guide on how to use Brevitas layers. This error message is generated when not all layers take inputs that are quantized through `QuantIdentity` layers.
+**Error message**: `Error occurred during quantization aware training (QAT) import [...] Could not determine a unique scale for the quantization!`.
+
+**Cause**: This error is due to missing quantization operators in the model that is imported as a quantization aware training model. It is generated when not all layers take inputs that are quantized through `QuantIdentity` layers. See [this guide](../deep-learning/fhe_friendly_models.md) for instructions on how to use Brevitas layers.

 A common example is related to the concatenation operator. Suppose two tensors `x` and `y` are produced by two layers and need to be concatenated:

@@ -85,10 +93,12 @@ y = self.dense2(y)

 z = torch.cat([x, y])
 ```

-In the example above, the `x` and `y` layers need quantization before being concatenated. When using quantization aware training with Brevitas the following approach will fix this error:
+In the example above, the `x` and `y` layers need quantization before being concatenated.
+
+**Possible solutions**:

-1. Add a new `QuantIdentity` layer in your model. Suppose it is called `quant_concat`.
-1. In the `forward` function, before concatenation of `x` and `y`, apply it to both tensors that are concatenated:
+1. If the error occurs for the first layer of the model: Add a `QuantIdentity` layer in your model and apply it to the input of the `forward` function, before the first layer is computed.
+1. If the error occurs for a concatenation or addition layer: Add a new `QuantIdentity` layer in your model. Suppose it is called `quant_concat`. In the `forward` function, before concatenating `x` and `y`, apply it to both tensors. Using a common `QuantIdentity` layer to quantize both tensors ensures that they have the same scale:

@@ -96,8 +106,6 @@ In the example above, the `x` and `y` layers need quantization before being conc
 z = torch.cat([self.quant_concat(x), self.quant_concat(y)])
 ```

-The usage of a common `Quantidentity` layer to quantize both tensors that are concatenated ensures that they have the same scale.
-
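+For illustration, here is a minimal sketch of a model that applies this pattern. The class name, layer names, dimensions and bit-widths are placeholders chosen for this example, not values prescribed by Concrete ML; adapt them to your own model:
+
+```python
+import brevitas.nn as qnn
+import torch
+from torch import nn
+
+
+class ConcatModel(nn.Module):
+    def __init__(self, n_bits=4):
+        super().__init__()
+        # Each input of the forward function is quantized
+        self.quant_x = qnn.QuantIdentity(bit_width=n_bits, return_quant_tensor=True)
+        self.quant_y = qnn.QuantIdentity(bit_width=n_bits, return_quant_tensor=True)
+        self.dense1 = qnn.QuantLinear(10, 8, bias=True, weight_bit_width=n_bits)
+        self.dense2 = qnn.QuantLinear(10, 8, bias=True, weight_bit_width=n_bits)
+        # A single QuantIdentity shared by both branches, so that the two
+        # concatenated tensors are quantized with the same scale
+        self.quant_concat = qnn.QuantIdentity(bit_width=n_bits)
+
+    def forward(self, x, y):
+        x = self.dense1(self.quant_x(x))
+        y = self.dense2(self.quant_y(y))
+        # Both operands go through the same quantizer before the concatenation
+        return torch.cat([self.quant_concat(x), self.quant_concat(y)], dim=1)
+```
+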
 ## Debugging compilation errors

 Compilation errors due to FHE incompatible models, such as maximum bit-width exceeded or `NoParametersFound` can be debugged by examining the bit-widths associated with various intermediate values of the FHE computation.
diff --git a/docs/deep-learning/torch_support.md b/docs/deep-learning/torch_support.md
index c6a16730f..f6f3aabd9 100644
--- a/docs/deep-learning/torch_support.md
+++ b/docs/deep-learning/torch_support.md
@@ -5,7 +5,7 @@ In addition to the built-in models, Concrete ML supports generic machine learnin
 There are two approaches to build [FHE-compatible deep networks](../getting-started/concepts.md#model-accuracy-considerations-under-fhe-constraints):

 - [Quantization Aware Training (QAT)](../explanations/quantization.md) requires using custom layers, but can quantize weights and activations to low bit-widths. Concrete ML works with [Brevitas](../explanations/inner-workings/external_libraries.md#brevitas), a library providing QAT support for PyTorch. To use this mode, compile models using `compile_brevitas_qat_model`
-- Post-training Quantization: in this mode a vanilla PyTorch model can be compiled. However, when quantizing weights & activations to fewer than 7 bits the accuracy can decrease strongly. To use this mode, compile models with `compile_torch_model`.
+- Post-training Quantization: in this mode a vanilla PyTorch model can be compiled. However, when quantizing weights & activations to fewer than 7 bits, the accuracy can decrease significantly. On the other hand, depending on the model size, quantizing with 6-8 bits can be incompatible with FHE constraints. To use this mode, compile models with `compile_torch_model` (see the sketch after this list).
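+
+As a sketch of the post-training quantization entry point, and of the `rounding_threshold_bits` parameter discussed in the next paragraph, a call to `compile_torch_model` could look as follows; `torch_model`, `torch_inputset` and the bit-width values are placeholders rather than recommendations:
+
+```python
+from concrete.ml.torch.compile import compile_torch_model
+
+# torch_model: a trained, vanilla PyTorch nn.Module (placeholder name)
+# torch_inputset: a representative calibration set, e.g. a NumPy array (placeholder name)
+quantized_module = compile_torch_model(
+    torch_model,
+    torch_inputset,
+    n_bits=6,  # post-training quantization bit-width
+    rounding_threshold_bits=6,  # see the next paragraph for how to choose this value
+)
+```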

 Both approaches should be used with the `rounding_threshold_bits` parameter set accordingly. The best values for this parameter need to be determined through experimentation. A good initial value to try is `6`. See [here](../explanations/advanced_features.md#rounded-activations-and-quantizers) for more details.

@@ -15,7 +15,7 @@ Both approaches should be used with the `rounding_threshold_bits` parameter set
 ## Quantization-aware training

-The following example uses a simple QAT PyTorch model that implements a fully connected neural network with two hidden layers. Due to its small size, making this model respect FHE constraints is relatively easy.
+The following example uses a simple QAT PyTorch model that implements a fully connected neural network with two hidden layers. Due to its small size, making this model respect FHE constraints is relatively easy. To use QAT, Brevitas `QuantIdentity` nodes must be inserted into the PyTorch model, including one that quantizes the input of the `forward` function.

 ```python
 import brevitas.nn as qnn
@@ -63,6 +63,10 @@ quantized_module = compile_brevitas_qat_model(
 ```

+{% hint style="warning" %}
+If `QuantIdentity` layers are missing for any input or intermediate value, the compile function will raise an error. See the [common compilation errors page](./fhe_assistant.md#common-compilation-errors) for an explanation.
+{% endhint %}
+
 ## Post-training quantization

 The following example uses a simple PyTorch model that implements a fully connected neural network with two hidden layers. The model is compiled to use FHE using `compile_torch_model`.

@@ -103,7 +107,11 @@ quantized_module = compile_torch_model(

 ## Configuring quantization parameters

-The PyTorch/Brevitas models, created following the example above, require the user to configure quantization parameters such as `bit_width` (activation bit-width) and `weight_bit_width`. The quantization parameters, along with the number of neurons on each layer, will determine the accumulator bit-width of the network. Larger accumulator bit-widths result in higher accuracy but slower FHE inference time.
+With QAT, the PyTorch/Brevitas models, created following the example above, require the user to configure quantization parameters such as `bit_width` (activation bit-width) and `weight_bit_width`. When using this mode, set `n_bits=None` in the `compile_brevitas_qat_model` function.
+
+With PTQ, the user needs to set the `n_bits` value in the `compile_torch_model` function. A trade-off between accuracy, FHE compatibility, and latency must be determined manually.
+
+The quantization parameters, along with the number of neurons on each layer, will determine the accumulator bit-width of the network. Larger accumulator bit-widths result in higher accuracy but slower FHE inference time.

 ## Running encrypted inference
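+
+To make the QAT configuration described above under "Configuring quantization parameters" concrete, a minimal sketch of the compile call is shown below; `brevitas_model` and `torch_inputset` are placeholder names for a trained Brevitas model and a representative calibration set, not identifiers from this documentation:
+
+```python
+from concrete.ml.torch.compile import compile_brevitas_qat_model
+
+# Placeholders: a trained Brevitas model whose layers already define
+# bit_width / weight_bit_width, and a representative calibration set
+quantized_module = compile_brevitas_qat_model(
+    brevitas_model,
+    torch_inputset,
+    n_bits=None,  # quantization parameters are taken from the Brevitas layers
+)
+```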