Optimize DoRA in eval and no dropout #2122
base: main
Conversation
src/peft/tuners/lora/layer.py
Outdated
if isinstance(dropout, nn.Identity):
    print("no dropout, optimize here")
else:
    print("dropout, same ops")
@BenjaminBossan did you envision something like this?
My intuition was:
- Figure out whether there is dropout or not
- Use a flag for dropout
- Pass the flag to the `forward` of the DoRA layers, where I would need to skip the alignment step and reuse `x` (the base model outputs)

Let me know if I am on the right track.
Note: I could not figure out a way to check whether the model is in eval mode. How would you have done it?
Yes, I think the dropout check is valid as is. Regarding eval mode, I think that checking `self.training` should work.
On how to proceed, my thinking was that if we find that we can make this optimization, we pass the base result as an additional argument to the DoRA `forward` (with `None` as the default for that argument). There, we use the base result if it's given; if not, we calculate it like we currently do. Could be that I'm missing something, but that's my idea.
The good news is that since we have a working implementation, we can then compare the results using both approaches and they should be identical (of course not when there is dropout, but apart from that).
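To make the idea concrete, here is a minimal sketch of what the caller side could look like; the argument name `base_result` and its exact placement are assumptions based on this discussion, not the final implementation:

```python
# Sketch only: inside the LoRA Linear forward, after result = self.base_layer(x, *args, **kwargs)
if isinstance(dropout, nn.Identity) or not self.training:
    # no dropout is applied, so the DoRA path can reuse the base layer output
    base_result = result
else:
    x = dropout(x)
    base_result = None  # DoRA forward falls back to computing the base result itself

result = result + self.lora_magnitude_vector[active_adapter](
    x,
    lora_A=lora_A,
    lora_B=lora_B,
    scaling=scaling,
    base_layer=self.get_base_layer(),
    base_result=base_result,  # hypothetical new argument with default None
)
```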
Neat solution!
Thanks for implementing this task. I have a few comments:
I don't think we need the `do_optimize` argument. When `result` is passed, let's use it, else don't.
Also, let's refactor this a bit and avoid an early return. Instead, let's do something like:
if result is not None:
    # let's also add a comment to explain
    base_result = ...
else:
    base_result = F.linear(x, transpose(weight, self.fan_in_fan_out))
Then in this line, replace `F.linear(x, transpose(weight, self.fan_in_fan_out))` by `base_result`.
A small caveat I see with this approach is that we assume that we can just remove the bias to get the base result. This will work for normal `nn.Linear` layers but may not work for other types. But let's maybe not worry about that for now.
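For a plain `nn.Linear`, subtracting the bias does recover the bias-free product; a quick, purely illustrative sanity check:

```python
import torch
import torch.nn.functional as F

linear = torch.nn.Linear(4, 3)
x = torch.randn(2, 4)
with_bias = linear(x)                      # x @ W.T + b
without_bias = F.linear(x, linear.weight)  # x @ W.T
assert torch.allclose(with_bias - linear.bias, without_bias, atol=1e-6)
```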
We should also run an example with and without the optimization to ensure that we get the same results and also a speedup and memory improvement.
Thanks for the update. I left a comment where I think the calculation is not quite right.
Also, it would be great if you could check a DoRA example to see the changes caused by this PR. Probably one of the existing examples could be used. We should ensure that:
- The results are the same (assuming dropout is 0)
- Training is faster
- Memory usage is lower
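One way to check this could be a small training-step benchmark run once on `main` and once on this branch, then comparing the numbers; the model name and hyperparameters below are placeholders, not part of this PR:

```python
import time
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

torch.manual_seed(0)  # same seed so the adapter init matches across the two runs
base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").cuda()
config = LoraConfig(r=16, lora_dropout=0.0, use_dora=True)  # dropout 0, so outputs should match
model = get_peft_model(base, config)

batch = torch.randint(0, base.config.vocab_size, (8, 256)).cuda()

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
for _ in range(20):
    loss = model(input_ids=batch, labels=batch).loss
    loss.backward()
    model.zero_grad()
torch.cuda.synchronize()
print(f"time: {time.perf_counter() - start:.2f} s")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
# for the "results are the same" check, compare model(input_ids=batch).logits between the two branches
```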
bias = base_layer.bias
if bias is not None:
    base_result = base_result - bias
result_dora = mag_norm_scale * base_result + mag_norm_scale * lora_result * scaling
Hmm, wait, should this not be the exact same calculation as in line 103? I.e. we should leave the condition after calculating the `base_result` and then do the same calculation of `dora_result` for both cases.
I am not sure that I follow. With the `base_result` in place:
- We first subtract the bias
- Compute the `dora_result`, where we scale the `base_result` with `mag_norm_scale`

But without the `base_result`:
- We compute the `base_result` with the linear forward
- Compute the `dora_result`, where we scale the `base_result` with `(1 - mag_norm_scale)`

Aren't they going to be different for each case?
Okay, so I'm a bit confused, let's try to resolve this.
In the old code, we basically have:
dora_result = (mag_norm_scale - 1) * base_result + mag_norm_scale * lora_result * lora_scale
(variable names slightly changed for clarity)
My thinking is that the `base_result` is either calculated right there (old code) or we use the `base_result` that is being passed as an argument, but the basic equation stays the same.
Of course, as you correctly noted, the bias needs to be subtracted first and then added back in the latter case.
In the currently proposed code, in one case we calculate `mag_norm_scale * base_result` and in the other `(mag_norm_scale - 1) * base_result`. This looks inconsistent to me.
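Writing out what each variant gives once the DoRA term is added back onto the full base layer output may make this concrete; this assumes the caller still does `result = result + result_dora`, with `result = base_result + bias`:

```python
# with (mag_norm_scale - 1), as in the old code:
#   result = (base_result + bias) + (mag_norm_scale - 1) * base_result + mag_norm_scale * lora_result * scaling
#          = mag_norm_scale * base_result + bias + mag_norm_scale * lora_result * scaling
#
# with mag_norm_scale:
#   result = (base_result + bias) + mag_norm_scale * base_result + mag_norm_scale * lora_result * scaling
#          = (1 + mag_norm_scale) * base_result + bias + mag_norm_scale * lora_result * scaling
#     -> base_result is counted twice in this case
```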
Fixes #2107