
[Bug]: collective nonSFG is not supported during hpu graph capturing #192

Closed · xinsu626 opened this issue Aug 16, 2024 · 3 comments
Labels: bug Something isn't working

@xinsu626

Your current environment

I am using the following Docker image: vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest.

🐛 Describe the bug

On the main branch of the vllm-fork repository, I attempted to run the "meta-llama/Meta-Llama-3-70B" model using the following code:

import os

# Set the lazy-mode flag before importing vLLM so it is already in the
# environment when the HPU backend initializes.
os.environ['PT_HPU_LAZY_MODE'] = '1'

from vllm import LLM, SamplingParams

prompts = [
    "The president of the United States is",
    "The capital of France is",
]

sampling_params = SamplingParams(n=1, temperature=0, max_tokens=30)
llm = LLM(model="meta-llama/Meta-Llama-3-70B", max_num_seqs=32, tensor_parallel_size=8)
outputs = llm.generate(prompts, sampling_params)

However, I encountered the following error:

(RayWorkerWrapper pid=26165) ERROR 08-16 05:39:57 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [repeated 6x across cluster]
(RayWorkerWrapper pid=26165) ERROR 08-16 05:39:57 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2220, in all_reduce [repeated 6x across cluster]
(RayWorkerWrapper pid=26165) ERROR 08-16 05:39:57 worker_base.py:382]     work = group.allreduce([tensor], opts) [repeated 6x across cluster]
(RayWorkerWrapper pid=26165) ERROR 08-16 05:39:57 worker_base.py:382] RuntimeError: collective nonSFG is not supported during hpu graph capturing [repeated 6x across cluster]
xinsu626 added the bug label on Aug 16, 2024
@kdamaszk

Hi @xinsu626, please set this variable: PT_HPU_ENABLE_LAZY_COLLECTIVES=true. It is required to make HPU graphs work with tensor parallelism.
Please check: Environment variables
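
A minimal sketch of how the suggested fix could be applied to the reproducer above (the placement below is an assumption; exporting the variable in the shell before launching, e.g. PT_HPU_ENABLE_LAZY_COLLECTIVES=true python repro.py, is equivalent):

import os

# Assumed placement: set both HPU flags before importing vLLM so they are
# already in the environment when the backend and the tensor-parallel
# workers start up.
os.environ["PT_HPU_LAZY_MODE"] = "1"
os.environ["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "true"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-70B", max_num_seqs=32, tensor_parallel_size=8)
outputs = llm.generate(
    ["The president of the United States is"],
    SamplingParams(n=1, temperature=0, max_tokens=30),
)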


xinsu626 commented Aug 26, 2024


@kdamaszk Got it. Thank you for your help!


m9e commented Sep 9, 2024

Is this functionally the same as PT_HPU_LAZY_MODE being set? (e.g., per the README warning, should it only be set with eager mode?)
