When using torch.no_grad(), will DeepSpeed *not* perform memory optimization on the parameters under no_grad?
#5805
Unanswered
ojipadeson asked this question in Q&A
Replies: 1 comment
- @ojipadeson, DeepSpeed does optimize for parameters that have no gradients, i.e., frozen parameters. See the examples in DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py (lines 310 to 312, commit ffe0af2) and DeepSpeed/deepspeed/runtime/zero/stage3.py (line 446, commit ffe0af2). Can you please provide repro steps to help investigate what you are observing? Thanks!
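A minimal sketch (not part of the thread; `CustomModel` and `ds_config` are illustrative) of the distinction the answer relies on: frozen parameters are the ones with `requires_grad=False`, while `torch.no_grad()` only suppresses autograd during a forward pass and does not mark anything as frozen, so freezing should be done on the parameters themselves before handing the model to DeepSpeed.

```python
import torch
import torch.nn as nn

class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(1024, 1024)  # stand-in for the frozen LLM part
        self.head = nn.Linear(1024, 10)        # trainable task head

    def forward(self, x):
        return self.head(self.backbone(x))

model = CustomModel()

# Freeze the backbone by flipping requires_grad; wrapping its forward pass in
# torch.no_grad() alone would not change requires_grad on these parameters.
for p in model.backbone.parameters():
    p.requires_grad_(False)

trainable = [p for p in model.parameters() if p.requires_grad]
print(f"trainable tensors: {len(trainable)} / {len(list(model.parameters()))}")

# Hypothetical ZeRO-3 config; the engine would then be created roughly like:
#   engine, _, _, _ = deepspeed.initialize(model=model,
#                                           model_parameters=trainable,
#                                           config=ds_config)
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
}
```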
- I froze part of the LLM and introduced it into my custom model. I found that in this case the memory usage of the first GPU is always much higher than the others. Does this mean that DeepSpeed will not perform memory optimization for the non-gradient part?
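A small diagnostic sketch (my addition, not from the thread; `report_gpu_memory` is an illustrative helper name) that could help gather the repro data asked for above, by logging per-rank CUDA memory to see where rank 0 diverges from the other ranks.

```python
import torch
import torch.distributed as dist

def report_gpu_memory(tag: str = "") -> None:
    """Print allocated/reserved CUDA memory for the current rank."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[rank {rank}] {tag}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

# Example: call at interesting points, e.g. after model construction and after
# deepspeed.initialize(), then compare the output across ranks.
# report_gpu_memory("after deepspeed.initialize")
```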