
Mismatch with epoch when using gradient_accumulation #31677

Open
2 of 4 tasks
SangbumChoi opened this issue Jun 28, 2024 · 7 comments

System Info

  • transformers version: 4.43.0.dev0
  • Platform: Linux-5.4.0-167-generic-x86_64-with-glibc2.35
  • Python version: 3.10.14
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA TITAN RTX

Who can help?

@muellerzr @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

This issue is about a mismatch between the configured number of epochs and the actual number of training epochs.
Even though I set 24 epochs in TrainingArguments and set gradient_accumulation_steps to 2, there is a mismatch in how max_steps is calculated when it is not set explicitly.

# May be slightly incorrect if the last batch in the training dataloader has a smaller size but it's

Expected behavior

num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps

If we just use true division, it solves the issue. Is there any specific reason that num_update_steps_per_epoch should remain an integer?
num_update_steps_per_epoch = len_dataloader / args.gradient_accumulation_steps
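
As an illustration, here is a simplified sketch of the surrounding computation (a paraphrase, not the exact Trainer code), comparing the current floor division with the proposed true division:

    import math

    def max_steps_for(len_dataloader, grad_accum, num_train_epochs, use_floor=True):
        # Floor division (current behaviour) drops the fractional update step per epoch;
        # true division (proposed change) keeps it, so max_steps covers the full epochs.
        updates_per_epoch = (
            len_dataloader // grad_accum if use_floor else len_dataloader / grad_accum
        )
        updates_per_epoch = max(updates_per_epoch, 1)
        return math.ceil(num_train_epochs * updates_per_epoch)

With a dataloader length that is not a multiple of gradient_accumulation_steps, the two variants diverge, which is exactly what the logged values in the next comment show.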

SangbumChoi (Contributor, Author) commented Jun 28, 2024

These are the intermediate values that I logged
(batch size = 4, num_gpu = 4, gradient_accumulation_steps = 2, num_train_images = 140):

Before change
_inner_training_loop

has_length(train_dataloader) True
args.max_steps > 0 False
max_steps 96
args.num_train_epochs 24

total_epoch -> 22 (which is not expected!)

After change
_inner_training_loop

has_length(train_dataloader) True
args.max_steps > 0 False
max_steps 108
args.num_train_epochs 24

total_epoch -> 24
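
Plugging the reported numbers into that calculation reproduces both logged values, assuming the per-step dataloader length is ceil(140 / (4 * 4)) = 9 batches (an assumption, but one that is consistent with the logs above):

    import math

    batches_per_epoch = math.ceil(140 / (4 * 4))      # 9 batches per epoch (assumed)
    print(math.ceil(24 * (batches_per_epoch // 2)))   # floor division -> 96  ("before change")
    print(math.ceil(24 * (batches_per_epoch / 2)))    # true division  -> 108 ("after change")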

muellerzr self-assigned this Jul 1, 2024
SangbumChoi (Contributor, Author)

@muellerzr Hi, what do you think about the suggested modification? Is there anything to be concerned about? If not, I can open a PR!

github-actions bot closed this as completed Aug 5, 2024
huggingface deleted a comment from github-actions bot Aug 5, 2024
amyeroberts reopened this Aug 5, 2024
amyeroberts (Collaborator)

Gentle ping @muellerzr

SangbumChoi (Contributor, Author) commented Sep 12, 2024

@muellerzr If you don't have the bandwidth, I can try opening a PR and get a review from you!

SunMarc (Member) commented Sep 12, 2024

Thanks for the report @SangbumChoi! I see your point: the max_steps calculation is not entirely accurate and might affect some parts of our code. However, I would like to reassure you that we are indeed running 24 epochs, as you can see in the training loop:
for epoch in range(epochs_trained, num_train_epochs):
If you are up for investigating a bit, you have access to the global step inside the training loop (self.state.global_step) to check how many updates we perform.
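
One way to check this without touching the training loop is a TrainerCallback that prints self.state.global_step and self.state.epoch at the end of every epoch; a minimal sketch:

    from transformers import TrainerCallback

    class StepEpochLogger(TrainerCallback):
        """Print the optimizer-step count and fractional epoch at the end of each epoch."""

        def on_epoch_end(self, args, state, control, **kwargs):
            print(
                f"epoch={state.epoch:.2f} "
                f"global_step={state.global_step} "
                f"max_steps={state.max_steps}"
            )

    # usage: trainer = Trainer(..., callbacks=[StepEpochLogger()])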

SangbumChoi (Contributor, Author) commented Sep 13, 2024

@SunMarc Thanks for the reply.
I see that we can access self.state.global_step for the original value and work out the progress backwards.
The problem for me was that the output file trainer_state.json ends with epoch 21.94. Is this also expected? It seems quite confusing...

    {
      "epoch": 21.942857142857143,
      "step": 192,
      "total_flos": 2.24235763302767e+19,
      "train_loss": 37.259278148412704,
      "train_runtime": 2309.467,
      "train_samples_per_second": 2.847,
      "train_steps_per_second": 0.083
    }

trainer_state.json
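
For reference, the top-level fields of trainer_state.json already expose the quantities involved; a minimal way to inspect them (the path below is a placeholder for the actual output_dir):

    import json

    with open("output_dir/trainer_state.json") as f:
        state = json.load(f)

    # Final fractional epoch vs. the configured epochs and the derived max_steps.
    print(state["epoch"], state["num_train_epochs"], state["global_step"], state["max_steps"])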

SunMarc (Member) commented Sep 17, 2024

The problem for me was that the output file trainer_state.json ends with epoch 21.94. Is this also expected? It seems quite confusing...

This is indeed a bit confusing. Can you share a minimal reproducer, @SangbumChoi? That would be very helpful in order to investigate this further!
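
A minimal reproducer along these lines might look like the sketch below; the tiny model and random dataset are placeholders, and the sizes mirror the settings reported above (140 samples, per_device_train_batch_size=4, gradient_accumulation_steps=2, num_train_epochs=24):

    import torch
    from torch.utils.data import Dataset
    from transformers import Trainer, TrainingArguments

    class RandomRegressionDataset(Dataset):
        """140 random (inputs, labels) pairs, matching the dataset size in the report."""

        def __init__(self, size=140, dim=16):
            self.x = torch.randn(size, dim)
            self.y = torch.randn(size, 1)

        def __len__(self):
            return len(self.x)

        def __getitem__(self, idx):
            return {"inputs": self.x[idx], "labels": self.y[idx]}

    class TinyModel(torch.nn.Module):
        def __init__(self, dim=16):
            super().__init__()
            self.linear = torch.nn.Linear(dim, 1)

        def forward(self, inputs=None, labels=None):
            preds = self.linear(inputs)
            loss = torch.nn.functional.mse_loss(preds, labels)
            return {"loss": loss, "logits": preds}

    args = TrainingArguments(
        output_dir="epoch_mismatch_repro",
        num_train_epochs=24,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,
        report_to="none",
    )
    trainer = Trainer(model=TinyModel(), args=args, train_dataset=RandomRegressionDataset())
    trainer.train()

    # Compare the fractional epoch and step count against num_train_epochs=24.
    print(trainer.state.epoch, trainer.state.global_step, trainer.state.max_steps)

On a single GPU this gives 35 batches per epoch, so floor division yields 17 update steps per epoch instead of 17.5, and the same kind of gap should be visible in trainer.state.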
