RuntimeError: world_size (1) is not divisible by 2 #11480

Open
chanansh opened this issue Dec 5, 2024 · 0 comments
Labels
bug Something isn't working

chanansh commented Dec 5, 2024

Describe the bug

Unable to run the hello-world script `nemo llm pretrain --factory llama3_8b` as described in the documentation: https://github.com/NVIDIA/NeMo/blob/main/examples/llm/pretrain/README.md

I get:

[rank0]:   File "/opt/megatron-lm/megatron/core/parallel_state.py", line 532, in initialize_model_parallel
[rank0]:     raise RuntimeError(f"world_size ({world_size}) is not divisible by {total_model_size}")
[rank0]: RuntimeError: world_size (1) is not divisible by 2
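For context, the check that fires in `megatron/core/parallel_state.py` can be sketched as below. This is a hedged simplification (parameter names are illustrative, not Megatron-LM's exact signature): the number of launched processes (`world_size`) must be divisible by the product of the model-parallel sizes, and the llama3_8b recipe apparently requests a model-parallel size of 2 while only one process was launched.

```python
# Simplified sketch of Megatron-LM's divisibility check in
# initialize_model_parallel (names illustrative, not the real signature).
def check_world_size(world_size, tensor_parallel_size=1, pipeline_parallel_size=1):
    # Total number of ranks needed to hold one copy of the model.
    total_model_size = tensor_parallel_size * pipeline_parallel_size
    if world_size % total_model_size != 0:
        raise RuntimeError(
            f"world_size ({world_size}) is not divisible by {total_model_size}"
        )
    # Remaining factor is the number of data-parallel replicas.
    return world_size // total_model_size

# check_world_size(1, tensor_parallel_size=2) reproduces the error above;
# check_world_size(8, tensor_parallel_size=2) would succeed on an 8-GPU node.
```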

Steps/Code to reproduce bug

srun -p batch -N 1 --container-image nvcr.io/nvidia/nemo:24.09 --pty bash
The machine has 8 GPUs according to nvidia-smi.
Run nemo llm pretrain --factory llama3_8b
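As a hedged aside (not part of the original report): PyTorch's distributed initialization derives `world_size` from the `WORLD_SIZE` environment variable, which a launcher such as `torchrun` sets to the number of spawned processes. A bare invocation inside an interactive shell launches a single process, so `world_size` defaults to 1 regardless of how many GPUs `nvidia-smi` shows:

```python
import os

# With no distributed launcher, WORLD_SIZE is unset and the
# effective world size falls back to 1, triggering the error above.
world_size = int(os.environ.get("WORLD_SIZE", "1"))

# A launcher like `torchrun --nproc-per-node=8 ...` would export
# WORLD_SIZE=8, one rank per GPU on this node.
```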

Expected behavior

I would expect it to run seamlessly.

Environment overview (please complete the following information)

srun -p batch -N 1 --container-image nvcr.io/nvidia/nemo:24.09 --pty bash
The machine has 8 GPUs according to nvidia-smi.

@chanansh chanansh added the bug Something isn't working label Dec 5, 2024