
Mismatch with epoch when using gradient_accumulation #31677

Open
2 of 4 tasks
SangbumChoi opened this issue Jun 28, 2024 · 7 comments

System Info

  • transformers version: 4.43.0.dev0
  • Platform: Linux-5.4.0-167-generic-x86_64-with-glibc2.35
  • Python version: 3.10.14
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA TITAN RTX

Who can help?

@muellerzr @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

This issue is about a mismatch between the configured number of epochs and the actual number of training epochs.
Even though I set 24 epochs in TrainingArguments and set gradient_accumulation_steps to 2, there is a mismatch in how max_steps is calculated when it is not set explicitly.

# May be slightly incorrect if the last batch in the training dataloader has a smaller size but it's

Expected behavior

num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps

If we just use true division, it solves the issue. Is there any specific reason that num_update_steps_per_epoch should remain an integer?
num_update_steps_per_epoch = len_dataloader / args.gradient_accumulation_steps
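
As an illustration, here is a simplified sketch of the surrounding computation (a paraphrase, not the exact Trainer code), comparing the current floor division with the proposed true division:

    import math

    def max_steps_for(len_dataloader, grad_accum, num_train_epochs, use_floor=True):
        # Floor division (current behaviour) drops the fractional update step per epoch;
        # true division (proposed change) keeps it, so max_steps covers the full epochs.
        updates_per_epoch = (
            len_dataloader // grad_accum if use_floor else len_dataloader / grad_accum
        )
        updates_per_epoch = max(updates_per_epoch, 1)
        return math.ceil(num_train_epochs * updates_per_epoch)

With a dataloader length that is not a multiple of gradient_accumulation_steps, the two variants diverge, which is exactly what the logged values in the next comment show.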

SangbumChoi (Contributor, Author) commented Jun 28, 2024

These are the intermediate values that I logged
(batch size = 4, num_gpu = 4, gradient_accumulation_steps = 2, num_train_images = 140):

Before change
_inner_training_loop

has_length(train_dataloader) True
args.max_steps > 0 False
max_steps 96
args.num_train_epochs 24

total_epoch -> 22 (which is not expected!)

After change
_inner_training_loop

has_length(train_dataloader) True
args.max_steps > 0 False
max_steps 108
args.num_train_epochs 24

total_epoch -> 24
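
Plugging the reported numbers into that calculation reproduces both logged values, assuming the per-step dataloader length is ceil(140 / (4 * 4)) = 9 batches (an assumption, but one that is consistent with the logs above):

    import math

    batches_per_epoch = math.ceil(140 / (4 * 4))      # 9 batches per epoch (assumed)
    print(math.ceil(24 * (batches_per_epoch // 2)))   # floor division -> 96  ("before change")
    print(math.ceil(24 * (batches_per_epoch / 2)))    # true division  -> 108 ("after change")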

muellerzr self-assigned this Jul 1, 2024
SangbumChoi (Contributor, Author)

@muellerzr Hi, what do you think about the suggested modification? Is there anything to be concerned about? If not, I can open a PR!

github-actions bot closed this as completed Aug 5, 2024
huggingface deleted a comment from github-actions bot Aug 5, 2024
amyeroberts reopened this Aug 5, 2024
amyeroberts (Collaborator)

Gentle ping @muellerzr

SangbumChoi (Contributor, Author) commented Sep 12, 2024

@muellerzr If you don't have the bandwidth, I can try opening a PR and get a review from you!

SunMarc (Member) commented Sep 12, 2024

Thanks for the report @SangbumChoi! I see your point: the max_steps calculation is not entirely accurate and might affect some parts of our code. However, I would like to reassure you that we are indeed running 24 epochs, as you can see in the training loop:
for epoch in range(epochs_trained, num_train_epochs):
If you are up for investigating a bit, you have access to the global step inside the training loop (self.state.global_step) to check how many updates we perform.
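
One way to check this without touching the training loop is a TrainerCallback that prints self.state.global_step and self.state.epoch at the end of every epoch; a minimal sketch:

    from transformers import TrainerCallback

    class StepEpochLogger(TrainerCallback):
        """Print the optimizer-step count and fractional epoch at the end of each epoch."""

        def on_epoch_end(self, args, state, control, **kwargs):
            print(
                f"epoch={state.epoch:.2f} "
                f"global_step={state.global_step} "
                f"max_steps={state.max_steps}"
            )

    # usage: trainer = Trainer(..., callbacks=[StepEpochLogger()])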

SangbumChoi (Contributor, Author) commented Sep 13, 2024

@SunMarc Thanks for the reply.
I see that we can access self.state.global_step for the original value and work out the progress backwards.
The problem for me was that the output file trainer_state.json ends with epoch 21.94. Is this also expected? It seems quite confusing...

    {
      "epoch": 21.942857142857143,
      "step": 192,
      "total_flos": 2.24235763302767e+19,
      "train_loss": 37.259278148412704,
      "train_runtime": 2309.467,
      "train_samples_per_second": 2.847,
      "train_steps_per_second": 0.083
    }

trainer_state.json
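
For reference, the top-level fields of trainer_state.json already expose the quantities involved; a minimal way to inspect them (the path below is a placeholder for the actual output_dir):

    import json

    with open("output_dir/trainer_state.json") as f:
        state = json.load(f)

    # Final fractional epoch vs. the configured epochs and the derived max_steps.
    print(state["epoch"], state["num_train_epochs"], state["global_step"], state["max_steps"])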

SunMarc (Member) commented Sep 17, 2024

The problem for me was that the output file trainer_state.json ends with epoch 21.94. Is this also expected? It seems quite confusing...

This is indeed a bit confusing. Can you share a minimal reproducer, @SangbumChoi? That would be very helpful in order to investigate this further!
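
A minimal reproducer along these lines might look like the sketch below; the tiny model and random dataset are placeholders, and the sizes mirror the settings reported above (140 samples, per_device_train_batch_size=4, gradient_accumulation_steps=2, num_train_epochs=24):

    import torch
    from torch.utils.data import Dataset
    from transformers import Trainer, TrainingArguments

    class RandomRegressionDataset(Dataset):
        """140 random (inputs, labels) pairs, matching the dataset size in the report."""

        def __init__(self, size=140, dim=16):
            self.x = torch.randn(size, dim)
            self.y = torch.randn(size, 1)

        def __len__(self):
            return len(self.x)

        def __getitem__(self, idx):
            return {"inputs": self.x[idx], "labels": self.y[idx]}

    class TinyModel(torch.nn.Module):
        def __init__(self, dim=16):
            super().__init__()
            self.linear = torch.nn.Linear(dim, 1)

        def forward(self, inputs=None, labels=None):
            preds = self.linear(inputs)
            loss = torch.nn.functional.mse_loss(preds, labels)
            return {"loss": loss, "logits": preds}

    args = TrainingArguments(
        output_dir="epoch_mismatch_repro",
        num_train_epochs=24,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,
        report_to="none",
    )
    trainer = Trainer(model=TinyModel(), args=args, train_dataset=RandomRegressionDataset())
    trainer.train()

    # Compare the fractional epoch and step count against num_train_epochs=24.
    print(trainer.state.epoch, trainer.state.global_step, trainer.state.max_steps)

On a single GPU this gives 35 batches per epoch, so floor division yields 17 update steps per epoch instead of 17.5, and the same kind of gap should be visible in trainer.state.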
