When using torch.no_grad(), will DeepSpeed *not* perform memory optimization on the parameters under no_grad?
#5805
Unanswered
ojipadeson asked this question in Q&A
Replies: 1 comment
- @ojipadeson, DeepSpeed does optimize for parameters that have no gradients, i.e., frozen parameters. See the examples in DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py (lines 310 to 312, commit ffe0af2) and DeepSpeed/deepspeed/runtime/zero/stage3.py (line 446, commit ffe0af2). Can you please provide repro steps to help investigate what you are observing? Thanks!
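A minimal sketch (not part of the thread; `CustomModel` and `ds_config` are illustrative) of the distinction the answer relies on: frozen parameters are the ones with `requires_grad=False`, while `torch.no_grad()` only suppresses autograd during a forward pass and does not mark anything as frozen, so freezing should be done on the parameters themselves before handing the model to DeepSpeed.

```python
import torch
import torch.nn as nn

class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(1024, 1024)  # stand-in for the frozen LLM part
        self.head = nn.Linear(1024, 10)        # trainable task head

    def forward(self, x):
        return self.head(self.backbone(x))

model = CustomModel()

# Freeze the backbone by flipping requires_grad; wrapping its forward pass in
# torch.no_grad() alone would not change requires_grad on these parameters.
for p in model.backbone.parameters():
    p.requires_grad_(False)

trainable = [p for p in model.parameters() if p.requires_grad]
print(f"trainable tensors: {len(trainable)} / {len(list(model.parameters()))}")

# Hypothetical ZeRO-3 config; the engine would then be created roughly like:
#   engine, _, _, _ = deepspeed.initialize(model=model,
#                                           model_parameters=trainable,
#                                           config=ds_config)
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
}
```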
- I froze part of the LLM and introduced it into my custom model. I found that in this case the memory usage of the first GPU is always much higher than the others. Does this mean that DeepSpeed will not perform memory optimization for the non-gradient part?
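A small diagnostic sketch (my addition, not from the thread; `report_gpu_memory` is an illustrative helper name) that could help gather the repro data asked for above, by logging per-rank CUDA memory to see where rank 0 diverges from the other ranks.

```python
import torch
import torch.distributed as dist

def report_gpu_memory(tag: str = "") -> None:
    """Print allocated/reserved CUDA memory for the current rank."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[rank {rank}] {tag}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

# Example: call at interesting points, e.g. after model construction and after
# deepspeed.initialize(), then compare the output across ranks.
# report_gpu_memory("after deepspeed.initialize")
```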