multi gpus training cause out of memory error #48

Open
Dufresue opened this issue Nov 8, 2023 · 1 comment

Dufresue commented Nov 8, 2023

When I try to run the train_caption.py script like this:

export Data_ROOT=path/to/coco_dataset
python train_caption.py exp.name=caption_rds model.detector.checkpoint=4ds_detector_path

I encountered errors like this:
[screenshot: bug2]

Below are the changes I made in coco_config.yml:
ngpus_per_node: 2
world_size: 2
batch_size: 4
num_workers: 2

However, when I set ngpus_per_node: 1 and world_size: 1, it runs properly.
[screenshot: WeChat image_20231108104933]
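A rough sketch of why the two-GPU setting may use much more host memory than the single-GPU one. It assumes (this is not confirmed from the script) that training starts one process per GPU and that each of those processes creates its own DataLoader with num_workers worker processes. The function name is hypothetical and only for illustration.

```python
def count_training_processes(ngpus_per_node: int, num_workers: int) -> int:
    """Number of processes on one node.

    Assumption: one trainer process per GPU, and each trainer starts
    `num_workers` DataLoader worker processes of its own.
    """
    return ngpus_per_node * (1 + num_workers)


# Compare the working and the failing settings from coco_config.yml.
for ngpus in (1, 2):
    n = count_training_processes(ngpus, num_workers=2)
    print(f"ngpus_per_node={ngpus}, num_workers=2 -> {n} processes")
```

If each process holds its own copy of the dataset annotations, host memory use scales with this count, so lowering num_workers is one thing to try on a 16 GB machine.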

Can anyone help? Thanks a lot!

Dufresue (Author) commented

I believe this is caused by running out of memory. I opened the system monitor and found that after starting the train_caption.py script, memory usage keeps increasing until it is full, which causes an OOM error. My computer has 16 GB of memory. Is the problem caused by insufficient memory, or is there a problem in the code where resources are not released in time?
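One way to tell the two cases apart is to log the memory of each training process at regular intervals. The sketch below is a minimal example using only the Python standard library on Linux or macOS; the helper names are hypothetical and the loop is a stand-in for real training iterations, not code from train_caption.py. A peak that levels off suggests a fixed footprint that is simply too large for the machine, while a peak that rises at every check suggests something is accumulating.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of the current process, in MB.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def keeps_growing(samples_mb, min_step_mb=1.0) -> bool:
    """True if every sample exceeds the previous one by at least min_step_mb."""
    pairs = zip(samples_mb, samples_mb[1:])
    return len(samples_mb) >= 2 and all(b - a >= min_step_mb for a, b in pairs)


samples = []
for step in range(3):
    # Placeholder for one training iteration (hypothetical).
    samples.append(peak_rss_mb())

print("peak RSS samples (MB):", samples)
print("still growing:", keeps_growing(samples))
```

In a real run, the sampling call would go inside the training loop every few hundred iterations, in each GPU process separately.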

Dufresue changed the title from "multi gpus training error" to "multi gpus training cause out of memory error" on Nov 16, 2023