Linear Layer:
$Y = W \cdot X$
Fine-tuning learns an update $W'$:
$Y = W \cdot X + W' \cdot X$
and normally $shape(W') = shape(W)$
Ref
Linear:
$Y_{out \times batch} = W_{out \times in} \cdot X_{in \times batch}$
$\xrightarrow{} Y_{out \times batch} = W_{out \times in} \cdot X_{in \times batch} + Wa_{out \times dim} \cdot Wb_{dim \times in} \cdot X_{in \times batch}$
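As a minimal sketch of this forward pass (NumPy, with made-up sizes; `dim` is the LoRA rank):

```python
import numpy as np

rng = np.random.default_rng(0)
out_f, in_f, batch, dim = 8, 16, 4, 2  # hypothetical sizes; dim = LoRA rank

W  = rng.standard_normal((out_f, in_f))  # frozen pretrained weight
Wa = rng.standard_normal((out_f, dim))   # trainable factor, out x dim
Wb = rng.standard_normal((dim, in_f))    # trainable factor, dim x in
X  = rng.standard_normal((in_f, batch))

# base path plus low-rank update path
Y = W @ X + Wa @ (Wb @ X)
assert Y.shape == (out_f, batch)
```

The merged update $Wa \cdot Wb$ has the same shape as $W$ but rank at most $dim$.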
LoRA for Convolution:
Consider first how im2col turns convolution into matmul:
$X:[channel, width, height]$
$\xrightarrow{reorder}[c \times kw \times kh, outw \times outh]$
$Kernels: [out, c, kw, kh] \xrightarrow{reshape} [out, c \times kw \times kh]$
$Conv(X, Kernels) = Kernels \times X \xrightarrow{reshape} [out, outw, outh]$
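A small NumPy check of this trick (stride 1, no padding, hypothetical sizes): im2col flattens every kernel-sized patch into one column, so the whole convolution becomes a single matmul.

```python
import numpy as np

rng = np.random.default_rng(0)
c, h, w = 3, 5, 5            # input channels and spatial size (hypothetical)
out_ch, kh, kw = 4, 3, 3     # kernel count and kernel size
X = rng.standard_normal((c, h, w))
K = rng.standard_normal((out_ch, c, kh, kw))

outh, outw = h - kh + 1, w - kw + 1   # stride 1, no padding
# im2col: each kernel-sized patch of X becomes one column
cols = np.empty((c * kh * kw, outh * outw))
for i in range(outh):
    for j in range(outw):
        cols[:, i * outw + j] = X[:, i:i + kh, j:j + kw].ravel()

# convolution as one matmul, then reshape back to the output map
Y = (K.reshape(out_ch, -1) @ cols).reshape(out_ch, outh, outw)

# verify against a naive direct convolution
ref = np.empty((out_ch, outh, outw))
for o in range(out_ch):
    for i in range(outh):
        for j in range(outw):
            ref[o, i, j] = np.sum(K[o] * X[:, i:i + kh, j:j + kw])
assert np.allclose(Y, ref)
```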
Then the conventional LoRA for a conv layer can be written as:
$Conv(in, out, ksize, padding, stride)$
$\xrightarrow{}Conv(dim, out, 1)\circ Conv(in, dim, ksize, padding, stride)$
With this method we get
$W' = Wa \cdot Wb$ with $rank(W') \le dim$
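Counting parameters makes the saving concrete (hypothetical layer sizes): the decomposed pair $Conv(in, dim, ksize) \circ Conv(dim, out, 1)$ is far smaller than a full-shape update of the original conv weight.

```python
# hypothetical layer sizes
in_ch, out_ch, k, dim = 64, 128, 3, 8

full = out_ch * in_ch * k * k              # full-shape W' for Conv(in, out, k)
lora = dim * in_ch * k * k + out_ch * dim  # Conv(in, dim, k) + Conv(dim, out, 1)

assert lora < full
```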
Ref
Consider $W' = Wa \odot Wb$; then $rank(W') \le rank(Wa) \cdot rank(Wb)$.
Then apply the conventional low-rank method to $Wa$ and $Wb$ separately. This means that with 2x the parameters (rank $dim$ per factor) we get a squared rank bound of $dim^2$.
Rank != information capacity, but they are related.
Based on the experimental results from the paper: although $rank(Wa) \cdot rank(Wb)$ is only an upper bound, in practice the produced $W'$ almost always has rank exactly $rank(Wa) \cdot rank(Wb)$.
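A quick NumPy check of this (random rank-$dim$ factors, made-up sizes); with generic factors the Hadamard product's rank does hit the $dim^2$ bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 16, 2
# two independent rank-<=dim matrices, as in the conventional method
Wa = rng.standard_normal((n, dim)) @ rng.standard_normal((dim, n))
Wb = rng.standard_normal((n, dim)) @ rng.standard_normal((dim, n))

W_prime = Wa * Wb  # Hadamard (elementwise) product
r = int(np.linalg.matrix_rank(W_prime))
assert r <= dim * dim  # upper bound: rank(Wa) * rank(Wb)
```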
With $W' = (Wa_1 \cdot Wa_2) \odot (Wb_1 \cdot Wb_2)$, backpropagation needs $\Delta{W'}$ and $Wa$ to compute $\Delta{Wb}$, and likewise $Wb$ for $\Delta{Wa}$.
With PyTorch's autograd, this kind of operation caches $Wa$ and $Wb$ for the backward pass, which means caching 2x the weight size just for backward.
To avoid this, I implemented a custom backward that reconstructs $Wa$ and $Wb$ only when they are actually needed; this saves tons of memory.
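The gradient math being recomputed is just the following (NumPy sketch with made-up sizes; a real version would live in a custom `torch.autograd.Function` whose `backward` rebuilds $Wa$ and $Wb$ instead of letting autograd cache them):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 8, 2
Wa1 = rng.standard_normal((n, dim)); Wa2 = rng.standard_normal((dim, n))
Wb1 = rng.standard_normal((n, dim)); Wb2 = rng.standard_normal((dim, n))

C = rng.standard_normal((n, n))   # stand-in for the upstream gradient dL/dW'

def loss(Wa1_):
    # toy scalar loss L = <C, W'> so that dL/dW' = C
    return np.sum(C * ((Wa1_ @ Wa2) * (Wb1 @ Wb2)))

# backward: rebuild Wa and Wb only here, instead of caching them in forward
Wa, Wb = Wa1 @ Wa2, Wb1 @ Wb2
dW  = C                 # dL/dW'
dWa = dW * Wb           # dL/dWa = dL/dW' (hadamard) Wb
dWb = dW * Wa           # dL/dWb = dL/dW' (hadamard) Wa
dWa1 = dWa @ Wa2.T      # chain through Wa = Wa1 @ Wa2

# finite-difference check on one entry of Wa1
eps = 1e-6
Wa1p = Wa1.copy(); Wa1p[0, 0] += eps
num = (loss(Wa1p) - loss(Wa1)) / eps
assert abs(num - dWa1[0, 0]) < 1e-4
```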
Ref
As mentioned before, the weight shape of a convolution layer is $[out, in, kw, kh]$, and we simply unfold it to $[out, in \times kw \times kh]$ for decomposition.
But there is actually a method to decompose a tensor of any shape, called CP decomposition.
Using CP decomposition for convolution looks like:
$\tau: [dim, dim, kw, kh]$
$x_1: [dim, out]$
$x_2: [dim, in]$
$W' = \tau \times_1 x_1 \times_2 x_2$
$W': [out, in, kw, kh]$
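The two mode products can be written with `einsum` (NumPy, sizes made up):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, out_ch, in_ch, kw, kh = 2, 4, 3, 3, 3
tau = rng.standard_normal((dim, dim, kw, kh))  # core tensor
x1  = rng.standard_normal((dim, out_ch))       # mode-1 factor
x2  = rng.standard_normal((dim, in_ch))        # mode-2 factor

# W' = tau x_1 x1 x_2 x2: contract the first two axes of the core
W_prime = np.einsum('abkl,ao,bi->oikl', tau, x1, x2)
assert W_prime.shape == (out_ch, in_ch, kw, kh)
```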
Or, written as a chain of conv layers:
Conv(in, dim, (1, 1))
↓
Conv(dim, dim, (kw, kh), stride, padding)
↓
Conv(dim, out, (1, 1))
For the Hadamard product implementation, just build two different $W'$ this way and multiply them together elementwise.
Todo...