Replies: 2 comments 1 reply
-
In any case, gradient accumulation induces an increase in memory; this is explained rather well in this paper. For SGD without momentum, for example, this means that MultiSteps should roughly double the memory requirements.
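To illustrate where that extra memory goes, here is a minimal sketch (the parameter shapes and the plain SGD base optimizer are arbitrary choices for illustration, and the `acc_grads` field name matches recent optax versions): the MultiSteps state carries a gradient accumulator pytree with the same structure and shapes as the parameters, which is what roughly doubles the footprint for momentum-free SGD.

```python
import jax
import jax.numpy as jnp
import optax

params = {"w": jnp.zeros((1024, 1024))}  # stand-in parameter pytree

# Wrap plain SGD so updates are applied only every 8 micro-batches.
tx = optax.MultiSteps(optax.sgd(1e-2), every_k_schedule=8)
state = tx.init(params)

# The state holds an accumulator mirroring the parameter shapes, i.e. one
# extra gradient-sized copy kept alive between micro-batches.
print(jax.tree_util.tree_map(jnp.shape, state.acc_grads))
```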
-
Is there a more flexible way to accumulate the gradient than using MultiSteps, something like gradient_batch += gradient, without increasing memory too much? I tried several ways to do it, but every time I got an OOM.
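One way to express the gradient_batch += gradient style of accumulation is to keep an explicit accumulator pytree and only call the optimizer once it is full. This is only a sketch under assumptions (the loss function, parameter shapes, and SGD optimizer are placeholders), and note that it still keeps one extra gradient-sized copy in memory, so it does not fundamentally avoid the overhead that MultiSteps has:

```python
import jax
import jax.numpy as jnp
import optax

# Placeholder model and loss purely for illustration.
def loss_fn(params, batch):
    preds = batch["x"] @ params["w"]
    return jnp.mean((preds - batch["y"]) ** 2)

tx = optax.sgd(1e-2)

@jax.jit
def accumulate(acc_grads, params, micro_batch):
    # gradient_batch += gradient, expressed over the whole pytree.
    grads = jax.grad(loss_fn)(params, micro_batch)
    return jax.tree_util.tree_map(lambda a, g: a + g, acc_grads, grads)

@jax.jit
def apply_accumulated(params, opt_state, acc_grads, num_micro_batches):
    mean_grads = jax.tree_util.tree_map(lambda g: g / num_micro_batches, acc_grads)
    updates, opt_state = tx.update(mean_grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    # Zero the accumulator for the next group of micro-batches.
    acc_grads = jax.tree_util.tree_map(jnp.zeros_like, acc_grads)
    return params, opt_state, acc_grads

params = {"w": jnp.zeros((16, 1))}
opt_state = tx.init(params)
acc_grads = jax.tree_util.tree_map(jnp.zeros_like, params)
for _ in range(8):
    micro_batch = {"x": jnp.ones((1, 16)), "y": jnp.ones((1, 1))}
    acc_grads = accumulate(acc_grads, params, micro_batch)
params, opt_state, acc_grads = apply_accumulated(params, opt_state, acc_grads, 8)
```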
-
Hi, I'm using MultiSteps for gradient accumulation. When the batch size is 8 and without MultiSteps, the code costs ~13G of GPU memory. When I try to use MultiSteps with every_k_schedule set to 8, i.e. only one sample at a time, I get an OOM. My code is something like the following. Do I have to split an update function from the step?
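The original snippet is not preserved in the thread, so the following is only a hedged sketch of the kind of setup being described, not the poster's actual code; the loss function, parameter shapes, and the adam base optimizer are placeholders:

```python
import jax
import jax.numpy as jnp
import optax

# Placeholder model and loss purely for illustration.
def loss_fn(params, batch):
    preds = batch["x"] @ params["w"]
    return jnp.mean((preds - batch["y"]) ** 2)

# Apply an optimizer update only every 8 micro-batches (one sample each).
tx = optax.MultiSteps(optax.adam(1e-3), every_k_schedule=8)

params = {"w": jnp.zeros((16, 1))}
opt_state = tx.init(params)

@jax.jit
def step(params, opt_state, micro_batch):
    grads = jax.grad(loss_fn)(params, micro_batch)
    # MultiSteps accumulates grads internally and emits zero updates
    # until every_k_schedule micro-batches have been seen.
    updates, opt_state = tx.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state

for _ in range(8):
    micro_batch = {"x": jnp.ones((1, 16)), "y": jnp.ones((1, 1))}
    params, opt_state = step(params, opt_state, micro_batch)
```

In this style there is no separate update function; the MultiSteps wrapper decides inside step when to actually apply the accumulated update.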