
Multi-GPU Training Object Detection #33525

Closed
2 of 4 tasks
SangbumChoi opened this issue Sep 17, 2024 · 4 comments · May be fixed by #33561

@SangbumChoi
Contributor

SangbumChoi commented Sep 17, 2024

System Info

  • transformers version: 4.45.0.dev0
  • Platform: Linux-5.4.0-167-generic-x86_64-with-glibc2.35
  • Python version: 3.10.14
  • Huggingface_hub version: 0.24.5
  • Safetensors version: 0.4.4
  • Accelerate version: 0.33.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA TITAN RTX

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

https://github.com/huggingface/transformers/blob/main/examples/pytorch/object-detection/run_object_detection.py

Simply running this script with two or more GPUs

Expected behavior

While investigating #31677 (cc. @SunMarc), the former issues
#28740
#31461
#13197
needed to be resolved first, so I dug into it.

I found that the issue is no longer related to @NielsRogge's comments (which concerned normalizing num_boxes). It is now caused by how Accelerate and the Trainer concatenate targets across multiple GPUs, which produces the following error:

    indices = [linear_sum_assignment(c[i]) for i, c in enumerate(cost_matrix.split(sizes, -1))]
IndexError: index 2 is out of bounds for dimension 0 with size 2

If we do not use the Trainer class, there is no bug, and the shapes look as follows:

cost_matrix torch.Size([2, 100, 34])
sizes [22, 12]

However, when we use the Trainer class, we get the shapes below: three GPUs have concatenated their individual target sizes into one list, which no longer matches the batch dimension.

cost_matrix torch.Size([2, 100, 17])
sizes [3, 2, 3, 1, 3, 5]
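The failure mode above can be sketched in isolation (shapes taken from the report; the list comprehension mirrors the matcher line from the traceback). `split(sizes, -1)` returns one chunk per entry in `sizes`, but `c[i]` indexes the batch dimension, so any `i >= batch_size` raises the IndexError:

```python
# Minimal sketch (assumed random data, shapes from the report) of why the
# matcher's indexing breaks when per-GPU target sizes are concatenated.
import torch

batch_size, num_queries = 2, 100

# Single-process case: one size per image in the batch -> 2 chunks, i stays
# within the batch dimension of size 2. This works.
cost_matrix = torch.rand(batch_size, num_queries, 34)
sizes = [22, 12]
chunks = cost_matrix.split(sizes, -1)
ok = [c[i] for i, c in enumerate(chunks)]
print(len(ok))  # 2

# Trainer case: sizes concatenated across 3 GPUs -> 6 chunks, but the batch
# dimension still has only 2 entries, so c[2] fails.
cost_matrix = torch.rand(batch_size, num_queries, 17)
sizes = [3, 2, 3, 1, 3, 5]
try:
    [c[i] for i, c in enumerate(cost_matrix.split(sizes, -1))]
except IndexError as e:
    print(e)  # index 2 is out of bounds for dimension 0 with size 2
```

This suggests the fix has to keep the `sizes` list aligned with each process's local batch rather than the globally gathered targets.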

I am still investigating how to fix this problem fundamentally (ideally without modifying model files, e.g. by simply adding an argument such as do_train_concat), but I am opening this issue for other people who might be interested. (cc. @qubvel)

@qubvel
Member

qubvel commented Sep 17, 2024

Hi @SangbumChoi, thanks for opening the issue! It will be great to have it fixed 🙂

@qubvel
Member

qubvel commented Sep 17, 2024

@SangbumChoi
Contributor Author

#32525: that error is related to a torchmetrics CPU/GPU device problem, see Lightning-AI/torchmetrics#2477.

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
