Optimize DoRA in eval and no dropout #2122

Open · wants to merge 6 commits into base: main
Conversation

ariG23498 (Contributor)

Fixes #2107

Comment on lines 588 to 592
if isinstance(dropout, nn.Identity):
    print("no dropout, optimize here")
else:
    print("dropout, same ops")

ariG23498 (Contributor, Author)

@BenjaminBossan did you envision something like this?

My intuition was:

  1. Figure out whether there is dropout or not
  2. Use a flag for dropout
  3. Pass the flag to the forward of the DoRA layers -- where I would need to skip the alignment step and reuse x (the base model outputs)

Let me know if I am on the right track.

Note: I could not figure out a way to check whether the model is in eval mode. How would you have done it?

BenjaminBossan (Member)

Yes, I think the dropout check is valid as is. Regarding eval mode, I think that checking self.training should work.
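
A minimal sketch of what that combined check could look like, assuming the dropout module and the standard nn.Module training flag are accessible as usual (the helper name is hypothetical, not part of peft):

import torch.nn as nn

def can_reuse_base_result(module: nn.Module, dropout: nn.Module) -> bool:
    # The base layer output can only be reused when dropout is effectively
    # a no-op: either it is nn.Identity, or the module is in eval mode.
    return isinstance(dropout, nn.Identity) or not module.training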

On how to proceed, my thinking was that if we find that we can make this optimization, we pass the base result as an additional argument to DoRA forward (default for that argument being None) and there, we use this base result if it's given and if not, we calculate it like we currently do. Could be that I'm missing something but that's my idea.

The good news is that, since we already have a working implementation, we can compare the results of both approaches, and they should be identical (of course not when there is dropout, but apart from that).
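
A rough sketch of the suggested signature change, with hypothetical, simplified names (the real DoRA forward in peft takes different arguments; this only illustrates the "use it if given, otherwise compute it" idea):

import torch.nn.functional as F

def dora_forward(x, lora_A, lora_B, weight, scaling, mag_norm_scale, base_result=None):
    # If the caller already computed the base layer output (no dropout or
    # eval mode), reuse it instead of recomputing the linear projection.
    # (Bias handling for a passed-in base_result is discussed further down.)
    if base_result is None:
        base_result = F.linear(x, weight)
    lora_result = lora_B(lora_A(x))
    # Downstream DoRA scaling stays exactly as in the current implementation.
    return (mag_norm_scale - 1) * base_result + mag_norm_scale * lora_result * scaling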


Neat solution!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ariG23498 marked this pull request as ready for review October 2, 2024 10:23
BenjaminBossan (Member) left a comment

Thanks for implementing this task. I have a few comments:

I don't think we need the do_optimize argument. When result is passed, let's use it, else don't.

Also, let's refactor this a bit and avoid an early return. Instead, let's do something like:

if result is not None:
    # let's also add a comment to explain
    base_result = ...
else:
    base_result = F.linear(x, transpose(weight, self.fan_in_fan_out))

Then in this line:

https://github.com/huggingface/peft/pull/2122/files#diff-bcb4d7d165949d2eb3eac3203b9067589261e52a1192b7a31f101a3ec98855acR99

replace F.linear(x, transpose(weight, self.fan_in_fan_out)) by base_result.

A small caveat I see with this approach is that we assume that we can just remove the bias to get the base result. This will work for normal nn.Linear layers but may not work for other types. But let's maybe not worry about that for now.

We should also run an example with and without the optimization to ensure that we get the same results and also a speedup and memory improvement.
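
Putting these suggestions together, a hedged sketch of how the refactored DoRA path could look; apply_dora, magnitude, and the weight-norm computation are placeholders loosely modeled on the discussion and on the equation quoted further down, not the exact peft code:

import torch
import torch.nn.functional as F

def apply_dora(x, lora_A, lora_B, scaling, magnitude, base_layer, base_result=None):
    weight = base_layer.weight
    # Norm of the combined weight (W + scaling * B @ A), taken per output row.
    lora_weight = lora_B.weight @ lora_A.weight
    weight_norm = torch.linalg.norm(weight + scaling * lora_weight, dim=1)
    mag_norm_scale = (magnitude / weight_norm).view(1, -1)

    if base_result is not None:
        # Reuse the base layer output, but subtract its bias first so the
        # rescaling below acts on x @ W.T only (the caller adds the full
        # base result, bias included, back in).
        bias = base_layer.bias
        if bias is not None:
            base_result = base_result - bias
    else:
        base_result = F.linear(x, weight)

    lora_result = lora_B(lora_A(x))
    # The same equation for both branches:
    return (mag_norm_scale - 1) * base_result + mag_norm_scale * lora_result * scaling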

BenjaminBossan (Member) left a comment

Thanks for the update. I left a comment where I think the calculation is not quite right.

Also, it would be great if you could check a DoRA example to see the changes caused by this PR. Probably one of the existing examples could be used. We should ensure that:

  • The results are the same (assuming dropout being 0)
  • Training should be faster
  • Memory usage should be lower
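
A hedged sketch of how such a before/after comparison could be scripted; model and batch are assumed to be a peft-wrapped causal LM and a tokenized batch that already contains labels, and the harness itself is illustrative rather than an existing peft utility:

import time
import torch

def benchmark(model, batch, steps=20):
    # Rough timing / peak-memory harness for a few training steps on CUDA.
    torch.cuda.reset_peak_memory_stats()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        loss = model(**batch).loss  # assumes the batch includes labels
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    return loss.item(), elapsed, peak_mb

Running this once on main and once on this PR (with lora_dropout=0) should then show matching losses, a shorter elapsed time, and a lower peak memory.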

bias = base_layer.bias
if bias is not None:
    base_result = base_result - bias
result_dora = mag_norm_scale * base_result + mag_norm_scale * lora_result * scaling
BenjaminBossan (Member)

Hmm, wait, should this not be the exact same calculation as in line 103? I.e. we should leave the if/else once base_result has been calculated and then do the same calculation of dora_result for both cases.

ariG23498 (Contributor, Author)

I am not sure that I follow. With the base_result in place:

  1. We first subtract the bias
  2. Compute the dora_result, where we scale the base_result with mag_norm_scale

But without the base_result:

  1. We compute the base_result with the linear forward
  2. Compute the dora_result where we scale the base_result with (1 - mag_norm_scale)

Aren't they going to be different for each case?

BenjaminBossan (Member)

Okay, so I'm a bit confused, let's try to resolve this.

In the old code, we basically have:

dora_result = (mag_norm_scale - 1) * base_result + mag_norm_scale * lora_result * lora_scale

(variable names slightly changed for clarity)

My thinking is that the base_result is either calculated right there (old code) or we use the base_result that is being passed as an argument, but the basic equation stays the same.

Of course, as you correctly noted, the bias needs to be subtracted first and then added back in the latter case.

In the currently proposed code, in one case we calculate mag_norm_scale * base_result and in the other (mag_norm_scale - 1) * base_result. This looks inconsistent to me.
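
As a sanity check of that argument, a small sketch comparing the two formulations numerically; all shapes and tensors below are made up purely for illustration:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 16)
base_layer = torch.nn.Linear(16, 8)
mag_norm_scale = torch.rand(1, 8)
lora_result = torch.randn(4, 8)
lora_scale = 0.5

# Old path: the base result is computed locally, without any bias.
old = (mag_norm_scale - 1) * F.linear(x, base_layer.weight) + mag_norm_scale * lora_result * lora_scale

# New path: reuse the base layer output, subtracting its bias first.
base_result = base_layer(x) - base_layer.bias
new = (mag_norm_scale - 1) * base_result + mag_norm_scale * lora_result * lora_scale

print(torch.allclose(old, new, atol=1e-6))  # expected: True

Using mag_norm_scale * base_result in one branch and (mag_norm_scale - 1) * base_result in the other would break this equality, which is exactly the inconsistency pointed out above.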

Successfully merging this pull request may close these issues: Optimize DoRA computation when there is no dropout.

4 participants