
How to run on multiple machines? #42

Open
AnnemSony opened this issue Jul 6, 2023 · 5 comments

Comments

@AnnemSony

No description provided.

@tianrun-chen
Owner

Do you mean multiple GPUs?

@AnnemSony
Author

I have GPUs on multiple machines (i.e., a multi-node cluster). How can I run the training command?
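For a true multi-node run, torch.distributed.launch (and its successor, torchrun) accept --nnodes, --node_rank, --master_addr, and --master_port. Below is a sketch for two machines with 4 GPUs each; 10.0.0.1 and port 29500 are placeholder values you must replace with the reachable address of node 0, and train.py / configs/demo.yaml are the script and config from this repo:

```shell
# On node 0 (the rendezvous host, whose IP the other nodes can reach):
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nnodes 2 --nproc_per_node 4 --node_rank 0 \
    --master_addr 10.0.0.1 --master_port 29500 \
    train.py --config configs/demo.yaml

# On node 1, change only --node_rank; all other flags stay identical:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nnodes 2 --nproc_per_node 4 --node_rank 1 \
    --master_addr 10.0.0.1 --master_port 29500 \
    train.py --config configs/demo.yaml
```

Note that all launcher flags come before train.py; anything after the script name is passed to the script itself, not to the launcher.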

@chusheng0505

chusheng0505 commented Jul 11, 2023

Hi, I have 4 GPUs and am trying to tune the SAM-Adapter model. I used the command provided in the repo:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch train.py --nnodes 1 --nproc_per_node 4 --config configs/demo.yaml

Training succeeded, but I found that only one GPU is actually used. How can I solve this? (I have checked the PyTorch documentation but have no idea how to debug it.)
@tianrun-chen

@Bill-Ren

I also encountered this problem: only GPU 0 was used during distributed training. At the same time, I could not find the two arguments --nnodes 1 --nproc_per_node 4 anywhere in train.py's argument parsing. Why is that?

@Bill-Ren

(quoting chusheng0505's comment above)

I found the solution. The launcher arguments must come before the script name; otherwise they are forwarded to train.py instead of being parsed by the launcher. Run it like this:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nnodes 1 --nproc_per_node 4 train.py --config configs/demo.yaml --tag exp1

You can check the usage of torch.distributed.launch for details.
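The argument-ordering pitfall above can be illustrated without PyTorch. The launcher treats the training script as a positional argument and forwards everything after it untouched, so flags placed after train.py never reach the launcher and it falls back to its single-process default. The sketch below (parse_like_launcher is a simplified stand-in, not the real launcher code) mimics that split with argparse:

```python
import argparse

def parse_like_launcher(argv):
    """Simplified model of how torch.distributed.launch splits argv:
    launcher flags must precede the training script; everything from the
    script name onward is forwarded to the script untouched."""
    p = argparse.ArgumentParser()
    p.add_argument("--nnodes", type=int, default=1)
    p.add_argument("--nproc_per_node", type=int, default=1)
    p.add_argument("training_script")
    # REMAINDER gathers all remaining tokens, including option-like ones.
    p.add_argument("training_script_args", nargs=argparse.REMAINDER)
    return p.parse_args(argv)

# Broken ordering from the issue: the flags land after train.py, so the
# launcher keeps its defaults and spawns a single process (one GPU used).
bad = parse_like_launcher(
    ["train.py", "--nnodes", "1", "--nproc_per_node", "4",
     "--config", "configs/demo.yaml"])
# bad.nproc_per_node is 1 (the default); the flags went to train.py instead.

# Fixed ordering: launcher flags first, then the script and its own flags.
good = parse_like_launcher(
    ["--nnodes", "1", "--nproc_per_node", "4",
     "train.py", "--config", "configs/demo.yaml"])
# good.nproc_per_node is 4; train.py only receives --config configs/demo.yaml.
```

This is why the corrected command works: torch.distributed.launch sees --nproc_per_node 4 and spawns four workers, one per visible GPU.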
