Replies: 3 comments
-
Hello! I have the same problem with the following setup:

import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4"
os.environ["TOKENIZERS_PARALLELISM"] = "True"
os.environ["VLLM_USE_MODELSCOPE"] = "True"

from vllm import LLM, SamplingParams


def vllm_call(model_id, prompts, devices=1):
    sampling_params = SamplingParams(temperature=0.1, top_p=0.95)
    llm = LLM(model=model_id,
              quantization='bitsandbytes',
              load_format='bitsandbytes',
              max_model_len=4000,
              gpu_memory_utilization=0.95,
              pipeline_parallel_size=devices,
              # tensor_parallel_size=devices,
              enforce_eager=None)
    outputs = llm.generate(prompts, sampling_params)
    return outputs


if __name__ == "__main__":
    model_id = "./Mistral-Nemo-Instruct-2407_bab-4bit-double"
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    outputs = vllm_call(model_id, prompts, devices=4)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt}, Generated text: {generated_text}")
-
Same issue :P
-
I'm facing the same issue. Any updates?
-
I trained Llama-3.1 with QLoRA as below.
Run inference:
I got:
I use vllm==0.6.2.
Any suggestion or help is highly appreciated.
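For context, a common pattern for serving a QLoRA-trained adapter with vLLM is either to merge the adapter into the base weights first, or to pass the adapter per request with enable_lora. A minimal sketch follows; the base model name and adapter path are hypothetical placeholders, not taken from the report above:

```python
# Sketch: two common ways to serve a QLoRA adapter with vLLM.
# BASE and ADAPTER below are hypothetical placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

BASE = "meta-llama/Llama-3.1-8B-Instruct"  # base model the adapter was trained on
ADAPTER = "./llama31-qlora-adapter"        # output_dir of the QLoRA run

# Option A: load the base model and attach the LoRA adapter at request time.
llm = LLM(model=BASE, enable_lora=True, max_model_len=4096)
params = SamplingParams(temperature=0.1, top_p=0.95, max_tokens=128)
outputs = llm.generate(
    ["The capital of France is"],
    params,
    lora_request=LoRARequest("qlora-adapter", 1, ADAPTER),
)
print(outputs[0].outputs[0].text)

# Option B (alternative): merge the adapter into the base weights with PEFT
# (PeftModel.from_pretrained(...).merge_and_unload()), save the merged model,
# and point LLM(model=...) at that directory without enable_lora.
```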