
[Bug]: collective nonSFG is not supported during hpu graph capturing #192

Closed · xinsu626 opened this issue Aug 16, 2024 · 3 comments
Labels: bug Something isn't working

@xinsu626

Your current environment

I am using the following Docker image: vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest.

🐛 Describe the bug

On the main branch of the vllm-fork repository, I attempted to run the "meta-llama/Meta-Llama-3-70B" model using the following code:

import os

# Set the lazy-mode flag before importing vLLM so it is already in the
# environment when the HPU backend initializes.
os.environ['PT_HPU_LAZY_MODE'] = '1'

from vllm import LLM, SamplingParams

prompts = [
    "The president of the United States is",
    "The capital of France is",
]

sampling_params = SamplingParams(n=1, temperature=0, max_tokens=30)
llm = LLM(model="meta-llama/Meta-Llama-3-70B", max_num_seqs=32, tensor_parallel_size=8)
outputs = llm.generate(prompts, sampling_params)

However, I encountered the following error:

(RayWorkerWrapper pid=26165) ERROR 08-16 05:39:57 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [repeated 6x across cluster]
(RayWorkerWrapper pid=26165) ERROR 08-16 05:39:57 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2220, in all_reduce [repeated 6x across cluster]
(RayWorkerWrapper pid=26165) ERROR 08-16 05:39:57 worker_base.py:382]     work = group.allreduce([tensor], opts) [repeated 6x across cluster]
(RayWorkerWrapper pid=26165) ERROR 08-16 05:39:57 worker_base.py:382] RuntimeError: collective nonSFG is not supported during hpu graph capturing [repeated 6x across cluster]
xinsu626 added the bug label on Aug 16, 2024
@kdamaszk

Hi @xinsu626, please set this variable: PT_HPU_ENABLE_LAZY_COLLECTIVES=true. It is required to make HPU graphs work with tensor parallelism.
Please check: Environment variables
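
A minimal sketch of how the suggested fix could be applied to the reproducer above (the placement below is an assumption; exporting the variable in the shell before launching, e.g. PT_HPU_ENABLE_LAZY_COLLECTIVES=true python repro.py, is equivalent):

import os

# Assumed placement: set both HPU flags before importing vLLM so they are
# already in the environment when the backend and the tensor-parallel
# workers start up.
os.environ["PT_HPU_LAZY_MODE"] = "1"
os.environ["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "true"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-70B", max_num_seqs=32, tensor_parallel_size=8)
outputs = llm.generate(
    ["The president of the United States is"],
    SamplingParams(n=1, temperature=0, max_tokens=30),
)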


xinsu626 commented Aug 26, 2024


@kdamaszk Got it. Thank you for your help!


m9e commented Sep 9, 2024

Is this functionally the same as PT_HPU_LAZY_MODE being set? (e.g., per the README warning, should it only be set with eager mode?)
