RuntimeError: world_size (1) is not divisible by 2 #11480

Open
chanansh opened this issue Dec 5, 2024 · 0 comments
Labels
bug Something isn't working

chanansh commented Dec 5, 2024

Describe the bug

Unable to run the hello-world script `nemo llm pretrain --factory llama3_8b` as described in the documentation: https://github.com/NVIDIA/NeMo/blob/main/examples/llm/pretrain/README.md

I get:

[rank0]:   File "/opt/megatron-lm/megatron/core/parallel_state.py", line 532, in initialize_model_parallel
[rank0]:     raise RuntimeError(f"world_size ({world_size}) is not divisible by {total_model_size}")
[rank0]: RuntimeError: world_size (1) is not divisible by 2
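For context, the check that fires in `megatron/core/parallel_state.py` can be sketched as below. This is a hedged simplification (parameter names are illustrative, not Megatron-LM's exact signature): the number of launched processes (`world_size`) must be divisible by the product of the model-parallel sizes, and the llama3_8b recipe apparently requests a model-parallel size of 2 while only one process was launched.

```python
# Simplified sketch of Megatron-LM's divisibility check in
# initialize_model_parallel (names illustrative, not the real signature).
def check_world_size(world_size, tensor_parallel_size=1, pipeline_parallel_size=1):
    # Total number of ranks needed to hold one copy of the model.
    total_model_size = tensor_parallel_size * pipeline_parallel_size
    if world_size % total_model_size != 0:
        raise RuntimeError(
            f"world_size ({world_size}) is not divisible by {total_model_size}"
        )
    # Remaining factor is the number of data-parallel replicas.
    return world_size // total_model_size

# check_world_size(1, tensor_parallel_size=2) reproduces the error above;
# check_world_size(8, tensor_parallel_size=2) would succeed on an 8-GPU node.
```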

Steps/Code to reproduce bug

srun -p batch -N 1 --container-image nvcr.io/nvidia/nemo:24.09 --pty bash
The machine has 8 GPUs according to nvidia-smi.
Run nemo llm pretrain --factory llama3_8b
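As a hedged aside (not part of the original report): PyTorch's distributed initialization derives `world_size` from the `WORLD_SIZE` environment variable, which a launcher such as `torchrun` sets to the number of spawned processes. A bare invocation inside an interactive shell launches a single process, so `world_size` defaults to 1 regardless of how many GPUs `nvidia-smi` shows:

```python
import os

# With no distributed launcher, WORLD_SIZE is unset and the
# effective world size falls back to 1, triggering the error above.
world_size = int(os.environ.get("WORLD_SIZE", "1"))

# A launcher like `torchrun --nproc-per-node=8 ...` would export
# WORLD_SIZE=8, one rank per GPU on this node.
```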

Expected behavior

I would expect it to run seamlessly.

Environment overview (please complete the following information)

srun -p batch -N 1 --container-image nvcr.io/nvidia/nemo:24.09 --pty bash
The machine has 8 GPUs according to nvidia-smi.

@chanansh chanansh added the bug Something isn't working label Dec 5, 2024