Multi-GPU training causes out-of-memory error #48
I believe this is caused by running out of memory. I opened the system monitor and found that after starting the train_caption.py script, memory usage keeps increasing until it is full, which triggers an OOM error. My computer has 16 GB of memory. Is the problem caused by insufficient memory, or is there a problem in the code where resources are not released in time?
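One way to tell steady growth from a one-time startup cost is to log the process's memory at regular points in the training loop. The sketch below is a minimal, hypothetical helper (neither `peak_rss_mib` nor `keeps_growing` comes from this repository). It uses only the standard library and assumes a Unix system, where `ru_maxrss` is reported in kilobytes on Linux and in bytes on macOS. It measures only the calling process, so each GPU process would need to log its own value.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes, macOS reports bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def keeps_growing(samples, min_step_mib=1.0):
    """Return True if every sample exceeds the previous one by min_step_mib or more."""
    if len(samples) < 2:
        return False
    return all(b - a >= min_step_mib for a, b in zip(samples, samples[1:]))


# Hypothetical usage: call peak_rss_mib() every N iterations and keep the values.
samples = [peak_rss_mib()]
```

If the logged values keep rising for as long as training runs, that points toward something accumulating rather than a fixed footprint that is simply too large for 16 GB.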
When I run the train_caption.py script like this:
I get errors like this:
![bug2](https://private-user-images.githubusercontent.com/47912661/281243510-bf1e6490-d4dd-45ba-a93b-eae21386b33e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTQyNDEsIm5iZiI6MTcyMDExMzk0MSwicGF0aCI6Ii80NzkxMjY2MS8yODEyNDM1MTAtYmYxZTY0OTAtZDRkZC00NWJhLWE5M2ItZWFlMjEzODZiMzNlLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3MjU0MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWY2YzIyYzY3OGUyYjIyNjFjNzBiZDIyZjg3OWY5ZGZmNGQ1MzM3MGVhYzIwNjdiYTk1ODc2MmY3ZjY4NWFiMDEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.K-Jjs_zZ2XrAmqoHxa7zalHidTHpefxqa5cdQ2Edyhw)
Below are the changes I made in coco_config.yml:
ngpus_per_node: 2
world_size: 2
batch_size: 4
num_workers: 2
However, when I set ngpus_per_node: 1 and world_size: 1, it runs properly.
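A rough way to see why the two-GPU setting could use more host memory is to count the processes involved. The sketch below assumes (this is not confirmed for this repository) that training starts one process per GPU and that each process builds its own PyTorch `DataLoader` with `num_workers` worker processes. The function `count_processes` is a hypothetical name used only for illustration.

```python
def count_processes(ngpus_per_node: int, num_workers: int) -> int:
    """Estimate the number of processes on one node.

    Assumes one training process per GPU, each with its own set of
    DataLoader workers (an assumption about how the script is launched).
    """
    return ngpus_per_node * (1 + num_workers)


two_gpu = count_processes(ngpus_per_node=2, num_workers=2)
one_gpu = count_processes(ngpus_per_node=1, num_workers=2)
print(two_gpu, one_gpu)
```

Under that assumption the two-GPU run has twice as many processes as the single-GPU run, and each one may hold its own copy of the dataset object, so lowering `num_workers` could be a quick test of whether host RAM is the limit.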
![微信图片_20231108104933](https://private-user-images.githubusercontent.com/47912661/281244524-12a2701c-91e9-440e-af52-b68cb2e99765.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTQyNDEsIm5iZiI6MTcyMDExMzk0MSwicGF0aCI6Ii80NzkxMjY2MS8yODEyNDQ1MjQtMTJhMjcwMWMtOTFlOS00NDBlLWFmNTItYjY4Y2IyZTk5NzY1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3MjU0MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTc4YjA1NmZhMjI4NWNjNzA3ZTQxZjU3MzFhMzczYzQ3MGZmNmNhM2QwZDU5MWRlZGY4ODhhMGVmMzU5YTY5YmMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.bi2IlxfXXNV_U010viJjj3Ss0pvw6_BpjaaH0hby1-k)
Can anyone help? Thanks a lot!