# Algorithm explanation

## Basic Idea

Linear Layer:

$Y = W \cdot X$

Fine-tuning means learning an additional $W'$:

$Y = W \cdot X + W' \cdot X$

and normally $shape(W') = shape(W)$
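
A minimal PyTorch sketch of this additive formulation (all names here are illustrative):

```python
import torch

out_f, in_f, batch = 8, 16, 4
W  = torch.randn(out_f, in_f)                      # frozen pretrained weight
dW = torch.zeros(out_f, in_f, requires_grad=True)  # W', same shape as W, trainable
X  = torch.randn(in_f, batch)

Y = W @ X + dW @ X                                 # only dW receives gradients
```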


## Conventional Methods

Ref

Linear:

$Y_{out \times batch} = W_{out \times in} \cdot X_{in \times batch}$

$\xrightarrow{} Y_{out \times batch} = W_{out \times in} \cdot X_{in \times batch} + Wa_{out \times dim} \cdot Wb_{dim \times in} \cdot X_{in \times batch}$
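
As a sketch, the factored form never materializes the full $[out \times in]$ delta (variable names are illustrative):

```python
import torch

out_f, in_f, dim, batch = 8, 16, 4, 2
W  = torch.randn(out_f, in_f)        # frozen pretrained weight
Wa = torch.randn(out_f, dim) * 0.01  # trainable low-rank factors
Wb = torch.randn(dim, in_f) * 0.01
X  = torch.randn(in_f, batch)

Y = W @ X + Wa @ (Wb @ X)            # delta applied without ever forming Wa @ Wb
```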

LoRA for Convolution: first consider the im2col view of convolution as a matmul:

$X: [channel, width, height]$

$\xrightarrow{reorder}[c \times kw \times kh, outw \times outh]$

$Kernels: [out, c, kw, kh] \xrightarrow{reshape} [out, c \times kw \times kh]$

$Conv(X, Kernels) = Kernels \times X \xrightarrow{reshape} [out, outw, outh]$
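
A small self-contained check of the im2col view (shapes chosen arbitrarily; this is a sanity-check sketch, not the repo's code):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)                    # [batch, c, h, w]
k = torch.randn(16, 3, 3, 3)                   # [out, c, kh, kw]

cols = F.unfold(x, kernel_size=3, padding=1)   # im2col: [1, c*kh*kw, outh*outw]
y = k.reshape(16, -1) @ cols                   # kernels as [out, c*kh*kw], plain matmul
y = y.reshape(1, 16, 8, 8)                     # back to the feature-map shape

assert torch.allclose(y, F.conv2d(x, k, padding=1), atol=1e-4)
```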

We can then write the conventional LoRA for a conv layer $Conv(in, out, ksize, padding, stride)$ as

$\xrightarrow{}Conv(dim, out, 1)\circ Conv(in, dim, ksize, padding, stride)$

With this method we get $W' = Wa \cdot Wb$ with $rank(W') \le dim$.
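
A sketch of this two-conv decomposition as a PyTorch module (class and attribute names are mine, not the repo's API):

```python
import torch.nn as nn

class LoRAConvDelta(nn.Module):
    """Conventional LoRA delta for Conv(in, out, ksize, padding, stride)."""
    def __init__(self, in_ch, out_ch, ksize, stride=1, padding=0, dim=4):
        super().__init__()
        # Conv(in, dim, ksize, padding, stride): the "Wb" half, rank-dim bottleneck
        self.down = nn.Conv2d(in_ch, dim, ksize, stride, padding, bias=False)
        # Conv(dim, out, 1): the "Wa" half, 1x1 projection back to out channels
        self.up = nn.Conv2d(dim, out_ch, 1, bias=False)

    def forward(self, x):
        return self.up(self.down(x))   # added to the frozen conv's output
```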


## Hadamard Product

Ref

Consider $W' = Wa \odot Wb$; then $rank(W') \le rank(Wa) \times rank(Wb)$. We then apply the conventional low-rank method to $Wa$ and $Wb$ themselves, which means that with 2x dim we can reach the square of the rank.

Rank is not the same as information capacity, but they are related.

Based on the experimental results from the paper, although $rank(Wa) \times rank(Wb)$ is only an upper bound, in practice the Hadamard product almost always produces a $W'$ with $rank(W') = rank(Wa) \times rank(Wb)$.
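
A quick numerical check of this rank behavior with random factors (a sketch, not a proof):

```python
import torch

out_f, in_f, dim = 64, 64, 4
Wa1, Wa2 = torch.randn(out_f, dim), torch.randn(dim, in_f)
Wb1, Wb2 = torch.randn(out_f, dim), torch.randn(dim, in_f)

Wa = Wa1 @ Wa2     # rank <= dim
Wb = Wb1 @ Wb2     # rank <= dim
dW = Wa * Wb       # Hadamard product, rank <= dim * dim

print(torch.linalg.matrix_rank(Wa),
      torch.linalg.matrix_rank(Wb),
      torch.linalg.matrix_rank(dW))   # with random factors, typically 4, 4, 16
```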

### Why custom backward

With $W' = (Wa_1 \cdot Wa_2) \odot (Wb_1 \cdot Wb_2)$, computing the backpropagation requires $\Delta{W'}$ and $Wa$ to calculate $\Delta{Wb}$, and likewise $Wb$ for $\Delta{Wa}$.

With PyTorch's autograd, this kind of operation caches $Wa$ and $Wb$ for the backward pass, which means it keeps 2x the full weight size in memory just for backward.

To avoid this, I implemented a custom backward that reconstructs $Wa$ and $Wb$ only when they are actually needed; this saves a lot of memory.
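
A minimal sketch of such a custom autograd function for $W' = (Wa_1 \cdot Wa_2) \odot (Wb_1 \cdot Wb_2)$; this is illustrative and may differ from the actual implementation in the repo:

```python
import torch

class HadaWeightSketch(torch.autograd.Function):
    # Only the small factors are saved for backward; the full-size Wa/Wb
    # are rebuilt on demand instead of being cached by autograd.

    @staticmethod
    def forward(ctx, wa1, wa2, wb1, wb2):
        ctx.save_for_backward(wa1, wa2, wb1, wb2)   # small factors only
        return (wa1 @ wa2) * (wb1 @ wb2)            # W'

    @staticmethod
    def backward(ctx, grad_out):
        wa1, wa2, wb1, wb2 = ctx.saved_tensors
        wa = wa1 @ wa2                  # reconstructed only when needed
        wb = wb1 @ wb2
        grad_wa = grad_out * wb         # dL/dWa (elementwise-product rule)
        grad_wb = grad_out * wa         # dL/dWb
        return (grad_wa @ wa2.T, wa1.T @ grad_wa,   # dL/dWa1, dL/dWa2
                grad_wb @ wb2.T, wb1.T @ grad_wb)   # dL/dWb1, dL/dWb2
```

Usage would be `dW = HadaWeightSketch.apply(wa1, wa2, wb1, wb2)`.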

## CP-Decomposition

Ref

As mentioned before, the weight of a convolution layer has shape $[out, in, kw, kh]$, and we simply unfold it to $[out, in \times kw \times kh]$ for the decomposition.

But there is actually a method, called CP decomposition, that can decompose a tensor of any shape.

Using CP decomposition for a convolution layer looks like this:

$\tau: [dim, dim, kw, kh]$
$x_1: [dim, out]$
$x_2: [dim, in]$
$W' = \tau \times_1 x_1 \times_2 x_2$
$W': [out, in, kw, kh]$
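
A sketch of reconstructing $W'$ from the CP factors with an einsum (variable names are mine):

```python
import torch

dim, out_ch, in_ch, kw, kh = 4, 16, 8, 3, 3
tau = torch.randn(dim, dim, kw, kh)
x1  = torch.randn(dim, out_ch)
x2  = torch.randn(dim, in_ch)

# W'[o, i, kw, kh] = sum_{a, b} tau[a, b, kw, kh] * x1[a, o] * x2[b, i]
dW = torch.einsum('abxy,ao,bi->oixy', tau, x1, x2)
print(dW.shape)   # torch.Size([16, 8, 3, 3]) = [out, in, kw, kh]
```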

Or, writing this as multiple conv layers:

Conv(in, dim, (1, 1))

Conv(dim, dim, (kw, kh), stride, padding)

Conv(dim, out, (1, 1))
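
As a PyTorch sketch (layer order and names are illustrative):

```python
import torch.nn as nn

def cp_conv_delta(in_ch, out_ch, ksize, stride=1, padding=0, dim=4):
    # CP-decomposed conv delta as three chained convs
    return nn.Sequential(
        nn.Conv2d(in_ch, dim, 1, bias=False),                     # Conv(in, dim, (1, 1))    ~ x2
        nn.Conv2d(dim, dim, ksize, stride, padding, bias=False),  # Conv(dim, dim, (kw, kh)) ~ tau
        nn.Conv2d(dim, out_ch, 1, bias=False),                    # Conv(dim, out, (1, 1))   ~ x1
    )
```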

For the Hadamard product implementation, just build two different $W'$ in this way and multiply them together elementwise.
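
For instance, reusing the contraction above (a sketch with random factors; the helper name is hypothetical):

```python
import torch

dim, out_ch, in_ch, kw, kh = 4, 16, 8, 3, 3

def cp_weight(tau, x1, x2):
    # W' = tau x_1 x1 x_2 x2, same contraction as above
    return torch.einsum('abxy,ao,bi->oixy', tau, x1, x2)

# two independent CP-decomposed deltas combined with a Hadamard product
dW1 = cp_weight(torch.randn(dim, dim, kw, kh), torch.randn(dim, out_ch), torch.randn(dim, in_ch))
dW2 = cp_weight(torch.randn(dim, dim, kw, kh), torch.randn(dim, out_ch), torch.randn(dim, in_ch))
dW  = dW1 * dW2   # final delta weight, shape [out, in, kw, kh]
```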


## Sparse Bias

Todo...