From 3b2c017b46e20f0ee81e030703d8b1450b4aa235 Mon Sep 17 00:00:00 2001
From: zhelal
Date: Thu, 3 Oct 2024 10:36:32 +0200
Subject: [PATCH] updating docs to reflect new changes

---
 docs/source/developer_guides/quantization.md | 10 +++++++++-
 docs/source/package_reference/vera.md        |  7 -------
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/docs/source/developer_guides/quantization.md b/docs/source/developer_guides/quantization.md
index 114021cafc..c0848c086f 100644
--- a/docs/source/developer_guides/quantization.md
+++ b/docs/source/developer_guides/quantization.md
@@ -187,9 +187,17 @@ peft_config = LoraConfig(...)
 quantized_model = get_peft_model(quantized_model, peft_config)
 ```
 
+## Other supported PEFT methods
+
+Besides LoRA, the following PEFT methods also support quantization:
+
+- **VeRA** (supports bitsandbytes quantization)
+- **AdaLoRA** (supports both bitsandbytes and GPTQ quantization)
+- **(IA)³** (supports bitsandbytes quantization)
+
 ## Next steps
 
 If you're interested in learning more about quantization, the following may be helpful:
 
-* Learn more about details about QLoRA and check out some benchmarks on its impact in the [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes) blog post.
+* Learn more details about QLoRA and check out some benchmarks on its impact in the [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes) blog post.
 * Read more about different quantization schemes in the Transformers [Quantization](https://hf.co/docs/transformers/main/quantization) guide.
diff --git a/docs/source/package_reference/vera.md b/docs/source/package_reference/vera.md
index 9f7bb19a38..107998aa6f 100644
--- a/docs/source/package_reference/vera.md
+++ b/docs/source/package_reference/vera.md
@@ -22,13 +22,6 @@ When saving the adapter parameters, it's possible to eschew storing the low rank
 
 To handle different shapes of adapted layers, VeRA initializes shared A and B matrices with the largest required size for each dimension. During the forward pass, submatrices A and B for a given layer are sliced out from these shared matrices and used as described in the paper. For example, adapting two linear layers of shapes (100, 20) and (80, 50) will create A and B matrices of shapes (rank, 50) and (100, rank) respectively. Then, to adapt a layer of shape (100, 20), submatrices A and B of shapes (rank, 20) and (100, rank) will be extracted.
 
-VeRA currently has the following constraints:
-
-- Only `nn.Linear` layers are supported.
-- Quantized layers are not supported.
-
-If these constraints don't work for your use case, use LoRA instead.
-
 The abstract from the paper is:
 
 > Low-rank adapation (LoRA) is a popular method that reduces the number of trainable parameters when finetuning large language models, but still faces acute storage challenges when scaling to even larger models or deploying numerous per-user or per-task adapted models. In this work, we present Vector-based Random Matrix Adaptation (VeRA), which significantly reduces the number of trainable parameters compared to LoRA, yet maintains the same performance. It achieves this by using a single pair of low-rank matrices shared across all layers and learning small scaling vectors instead. We demonstrate its effectiveness on the GLUE and E2E benchmarks, image classification tasks, and show its application in instruction-tuning of 7B and 13B language models.
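
Below is a minimal sketch of the VeRA + bitsandbytes combination that the updated quantization guide now lists, using the public `VeraConfig`/`get_peft_model` and `BitsAndBytesConfig` APIs. The base model name, `target_modules`, and rank are illustrative placeholders, not values taken from the docs or this patch.

```python
# Sketch only: applying VeRA to a 4-bit bitsandbytes-quantized model.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import VeraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Hypothetical base model; any causal LM with bitsandbytes support should work.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", quantization_config=bnb_config
)

# VeRA trains small per-layer scaling vectors on top of frozen shared matrices.
vera_config = VeraConfig(r=256, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, vera_config)
model.print_trainable_parameters()
```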
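
The shared-matrix slicing described in the VeRA reference page can also be illustrated with a few lines of plain PyTorch. The shapes follow the (100, 20) and (80, 50) example from the text; this is a standalone illustration of the idea, not PEFT's actual implementation.

```python
import torch

rank = 4
# Adapting linear layers with weight shapes (100, 20) and (80, 50) means the
# shared matrices are sized to the largest dimension seen in either layer:
# A has shape (rank, 50) and B has shape (100, rank).
shared_A = torch.randn(rank, 50)
shared_B = torch.randn(100, rank)

# To adapt the (100, 20) layer, slice out submatrices of shape (rank, 20) and (100, rank).
A_sub = shared_A[:, :20]
B_sub = shared_B[:100, :]
print(A_sub.shape, B_sub.shape)  # torch.Size([4, 20]) torch.Size([100, 4])
```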