Commit

updating docs to reflect new changes
zhelal committed Oct 3, 2024
1 parent 0752ff1 commit 3b2c017
Showing 2 changed files with 9 additions and 8 deletions.
docs/source/developer_guides/quantization.md: 10 changes (9 additions, 1 deletion)
@@ -187,9 +187,17 @@ peft_config = LoraConfig(...)
quantized_model = get_peft_model(quantized_model, peft_config)
```
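
For context, here is one way the complete setup around this snippet might look, sketched with a bitsandbytes 4-bit configuration; the checkpoint name and the `LoraConfig` arguments are illustrative placeholders, not values prescribed by this guide.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load a base model quantized to 4-bit with bitsandbytes (placeholder checkpoint).
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=bnb_config,
)

# Wrap the quantized model with a LoRA adapter; only the adapter weights are trainable.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
quantized_model = get_peft_model(quantized_model, peft_config)
quantized_model.print_trainable_parameters()
```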

## Other Supported PEFT Methods

Besides LoRA, the following PEFT methods also support quantization (see the sketch after this list):

- **VeRA** (supports bitsandbytes quantization)
- **AdaLoRA** (supports both bitsandbytes and GPTQ quantization)
- **(IA)³** (supports bitsandbytes quantization)
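
As a rough illustration of the list above, the following sketch applies VeRA on top of a bitsandbytes 4-bit model; the checkpoint name and the `VeraConfig` values are assumptions made for the example, not recommendations.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import VeraConfig, get_peft_model

# Load a 4-bit bitsandbytes-quantized base model (placeholder checkpoint).
quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
)

# VeRA trains only small per-layer scaling vectors on top of shared frozen matrices.
vera_config = VeraConfig(r=256, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(quantized_model, vera_config)
model.print_trainable_parameters()
```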

## Next steps

If you're interested in learning more about quantization, the following may be helpful:

* Learn more about details about QLoRA and check out some benchmarks on its impact in the [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes) blog post.
* Learn more details about QLoRA and check out some benchmarks on its impact in the [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes) blog post.
* Read more about different quantization schemes in the Transformers [Quantization](https://hf.co/docs/transformers/main/quantization) guide.
docs/source/package_reference/vera.md: 7 changes (0 additions, 7 deletions)
@@ -22,13 +22,6 @@ When saving the adapter parameters, it's possible to eschew storing the low rank

To handle different shapes of adapted layers, VeRA initializes shared A and B matrices with the largest required size for each dimension. During the forward pass, submatrices A and B for a given layer are sliced out from these shared matrices and used as described in the paper. For example, adapting two linear layers of shapes (100, 20) and (80, 50) will create A and B matrices of shapes (rank, 50) and (100, rank) respectively. Then, to adapt a layer of shape (100, 20), submatrices A and B of shapes (rank, 20) and (100, rank) will be extracted.
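
The slicing described above can be pictured with a small standalone sketch (plain PyTorch, not PEFT's internal code); the shapes mirror the example in the paragraph, and the vectors `d` and `b` stand in for VeRA's trainable scaling vectors.

```python
import torch

rank = 4

# Shared frozen matrices sized for the largest dimensions across adapted layers:
# the largest input dimension is 50, the largest output dimension is 100.
shared_A = torch.randn(rank, 50)
shared_B = torch.randn(100, rank)

def vera_delta(out_features, in_features, d, b):
    """Build the weight update for one layer by slicing the shared matrices.

    d (shape (rank,)) and b (shape (out_features,)) play the role of the
    trainable scaling vectors.
    """
    A = shared_A[:, :in_features]   # (rank, in_features)
    B = shared_B[:out_features, :]  # (out_features, rank)
    # delta_W = diag(b) @ B @ diag(d) @ A
    return (b.unsqueeze(1) * B) @ (d.unsqueeze(1) * A)

# Adapting a layer of shape (100, 20): the slices have shapes (rank, 20) and (100, rank).
delta_w = vera_delta(100, 20, d=torch.ones(rank), b=torch.zeros(100))
print(delta_w.shape)  # torch.Size([100, 20])
```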

VeRA currently has the following constraints:

- Only `nn.Linear` layers are supported.
- Quantized layers are not supported.

If these constraints don't work for your use case, use LoRA instead.

The abstract from the paper is:

> Low-rank adaptation (LoRA) is a popular method that reduces the number of trainable parameters when finetuning large language models, but still faces acute storage challenges when scaling to even larger models or deploying numerous per-user or per-task adapted models. In this work, we present Vector-based Random Matrix Adaptation (VeRA), which significantly reduces the number of trainable parameters compared to LoRA, yet maintains the same performance. It achieves this by using a single pair of low-rank matrices shared across all layers and learning small scaling vectors instead. We demonstrate its effectiveness on the GLUE and E2E benchmarks, image classification tasks, and show its application in instruction-tuning of 7B and 13B language models.
