How to unload and change models for local offline inferencing with Aphrodite? #510
I'm trying to compare a few different models by running the same prompts through them using local offline inferencing with Aphrodite, since the API doesn't support changing models. Here's the code I'm using:

import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import json
from notify_run import Notify
from aphrodite import LLM, SamplingParams
import torch
import gc
from aphrodite.distributed.parallel_state import destroy_model_parallel
datasets = data
models = [
    "/work/ml/text-generation-webui/models/MurtazaNasir_Llama-3-70B-Instruct-32k-v0.1-h6-exl2_4.25bpw",
    "/work/ml/text-generation-webui/models/turboderp_Llama-3-8B-Instruct-exl2_5.0bpw",
]
max_tokens = 25
prompts = [
    "What is a man? A miserable little",
    "Once upon a time",
]
sampling_params = SamplingParams(temperature=1.1, min_p=0.05)
for model_path in models:
    llm = LLM(
        model=model_path,
        tensor_parallel_size=2,
        kv_cache_dtype="fp8",
        quantization="exl2",
        disable_custom_all_reduce=True,
        max_model_len=3000,
        gpu_memory_utilization=0.95,
    )
    outputs = llm.generate(prompts, sampling_params)
    print(f"Results for model {model_path}:")
    print(outputs)
    # Attempted cleanup before loading the next model
    destroy_model_parallel()
    del llm.llm_engine
    del llm
    gc.collect()
    torch.cuda.empty_cache()

I'm trying to unload the first model and load the second one, but I haven't been able to get it to work, even after trying a few different approaches.

What is the correct way to unload a model and load a new one in Aphrodite for local inferencing? Is there a specific sequence of steps or additional cleanup required to fully release the GPU memory and resources used by the first model?

I'd appreciate any guidance or code examples showing the proper way to handle switching between models. Thanks in advance for any help!
Replies: 1 comment 1 reply
Hi, and sorry for the late response! I didn't see this discussion. You're generally right; those steps would kill the process and clean up any leftover objects in memory. Can you try running all of these?

import torch
import gc
from aphrodite.distributed import destroy_model_parallel
destroy_model_parallel()
del llm.llm_engine.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()
torch.distributed.destroy_process_group()
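
If in-process cleanup like this still leaves GPU memory allocated, another option is to run each model in its own Python process, since the driver reclaims a process's GPU memory when it exits. The following is only a rough sketch of that idea, not something from the Aphrodite docs: `run_isolated`, `compare_models`, `WORKER_SCRIPT`, and the `RESULT_JSON:` marker are hypothetical names, the `LLM` arguments are copied from the question, and reading `o.outputs[0].text` assumes Aphrodite's output objects follow the vLLM-style layout.

```python
import json
import subprocess
import sys

# Script executed in the child process. It loads ONE model, generates, and
# prints the texts as a single marked JSON line. The LLM arguments mirror the
# question; `o.outputs[0].text` is an assumption (vLLM-style output objects).
WORKER_SCRIPT = r"""
import json, sys
from aphrodite import LLM, SamplingParams

job = json.load(sys.stdin)
llm = LLM(model=job["model"], tensor_parallel_size=2, quantization="exl2",
          max_model_len=3000, gpu_memory_utilization=0.95)
params = SamplingParams(temperature=1.1, min_p=0.05)
outputs = llm.generate(job["prompts"], params)
print("RESULT_JSON:" + json.dumps([o.outputs[0].text for o in outputs]))
"""


def run_isolated(script, job, marker="RESULT_JSON:"):
    """Run `script` in a fresh Python process, passing `job` as JSON on stdin.

    Returns the JSON value from the last stdout line starting with `marker`,
    so library log lines printed to stdout are ignored.
    """
    proc = subprocess.run(
        [sys.executable, "-c", script],
        input=json.dumps(job),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child fails
    )
    for line in reversed(proc.stdout.splitlines()):
        if line.startswith(marker):
            return json.loads(line[len(marker):])
    raise RuntimeError("worker produced no result line")


def compare_models(models, prompts):
    """Generate with each model in turn, one child process per model."""
    results = {}
    for model_path in models:
        # The child exits before the next model loads, freeing its GPU memory.
        results[model_path] = run_isolated(
            WORKER_SCRIPT, {"model": model_path, "prompts": prompts}
        )
    return results
```

Calling `compare_models(models, prompts)` with the lists from the question would return a dict mapping each model path to its generated texts. The cost is that every model pays full process startup and load time, which is usually acceptable for a one-off comparison.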
Thanks for the suggestion! I actually got it working. Here's the gist of how I'm starting and killing models now, including support for multiple endpoints to distribute the task across GPUs: