
Multi-GPU Training Object Detection #33525

Closed
2 of 4 tasks
SangbumChoi opened this issue Sep 17, 2024 · 4 comments · May be fixed by #33561

@SangbumChoi
Contributor

SangbumChoi commented Sep 17, 2024

System Info

  • transformers version: 4.45.0.dev0
  • Platform: Linux-5.4.0-167-generic-x86_64-with-glibc2.35
  • Python version: 3.10.14
  • Huggingface_hub version: 0.24.5
  • Safetensors version: 0.4.4
  • Accelerate version: 0.33.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA TITAN RTX

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

https://github.com/huggingface/transformers/blob/main/examples/pytorch/object-detection/run_object_detection.py

Simply running this script with two or more GPUs

Expected behavior

While investigating #31677 (cc. @SunMarc), the former issues
#28740
#31461
#13197
needed to be resolved first, so I dug into it.

I found that the issue is no longer related to @NielsRogge's comments (which concerned normalizing num_boxes). It is now caused by how Accelerate and the Trainer concatenate targets across multiple GPUs, which produces the following error:

    indices = [linear_sum_assignment(c[i]) for i, c in enumerate(cost_matrix.split(sizes, -1))]
IndexError: index 2 is out of bounds for dimension 0 with size 2

If we do not use the Trainer class, there is no bug, and the shapes look as follows:

cost_matrix torch.Size([2, 100, 34])
sizes [22, 12]

However, when we use the Trainer class, we get the shapes below: three GPUs have concatenated their individual target sizes into one list, which no longer matches the batch dimension.

cost_matrix torch.Size([2, 100, 17])
sizes [3, 2, 3, 1, 3, 5]
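The failure mode above can be sketched in isolation (shapes taken from the report; the list comprehension mirrors the matcher line from the traceback). `split(sizes, -1)` returns one chunk per entry in `sizes`, but `c[i]` indexes the batch dimension, so any `i >= batch_size` raises the IndexError:

```python
# Minimal sketch (assumed random data, shapes from the report) of why the
# matcher's indexing breaks when per-GPU target sizes are concatenated.
import torch

batch_size, num_queries = 2, 100

# Single-process case: one size per image in the batch -> 2 chunks, i stays
# within the batch dimension of size 2. This works.
cost_matrix = torch.rand(batch_size, num_queries, 34)
sizes = [22, 12]
chunks = cost_matrix.split(sizes, -1)
ok = [c[i] for i, c in enumerate(chunks)]
print(len(ok))  # 2

# Trainer case: sizes concatenated across 3 GPUs -> 6 chunks, but the batch
# dimension still has only 2 entries, so c[2] fails.
cost_matrix = torch.rand(batch_size, num_queries, 17)
sizes = [3, 2, 3, 1, 3, 5]
try:
    [c[i] for i, c in enumerate(cost_matrix.split(sizes, -1))]
except IndexError as e:
    print(e)  # index 2 is out of bounds for dimension 0 with size 2
```

This suggests the fix has to keep the `sizes` list aligned with each process's local batch rather than the globally gathered targets.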

I am still investigating how to fix this problem fundamentally (ideally without modifying model files, e.g. by simply adding an argument such as do_train_concat), but I am opening this issue for other people who might be interested. (cc. @qubvel)

@qubvel
Member

qubvel commented Sep 17, 2024

Hi @SangbumChoi, thanks for opening the issue! It will be great to have it fixed 🙂

@qubvel
Member

qubvel commented Sep 17, 2024

@SangbumChoi
Contributor Author

#32525: that error is related to a torchmetrics CPU/GPU device problem, see Lightning-AI/torchmetrics#2477.

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
