How to unload and change models for local offline inferencing with Aphrodite? #510

murtaza-nasir · 2024-06-13T05:36:45Z

I'm trying to compare a few different models by running the same prompts through them using local offline inferencing with Aphrodite, since the API doesn't support changing models.

Here's the code I'm using:

import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import json
from notify_run import Notify
from aphrodite import LLM, SamplingParams
import gc
import torch
from aphrodite.distributed.parallel_state import destroy_model_parallel

datasets = data
models = [
    "/work/ml/text-generation-webui/models/MurtazaNasir_Llama-3-70B-Instruct-32k-v0.1-h6-exl2_4.25bpw",
    "/work/ml/text-generation-webui/models/turboderp_Llama-3-8B-Instruct-exl2_5.0bpw",
]
max_tokens = 25

prompts = [
    "What is a man? A miserable little",
    "Once upon a time",
]

# Pass max_tokens explicitly; it was defined above but never used.
sampling_params = SamplingParams(temperature=1.1, min_p=0.05, max_tokens=max_tokens)

for model_path in models:
    llm = LLM(
        model=model_path,
        tensor_parallel_size=2,
        kv_cache_dtype="fp8",
        quantization="exl2",
        disable_custom_all_reduce=True,
        max_model_len=3000,
        gpu_memory_utilization=0.95,
    )
    
    outputs = llm.generate(prompts, sampling_params)
    print(f"Results for model {model_path}:")
    print(outputs)

    destroy_model_parallel()
    del llm.llm_engine
    del llm
    gc.collect()
    torch.cuda.empty_cache()

I'm trying to unload the first model and load the second one, but I haven't been able to get it to work. I've tried a few different approaches like destroy_model_parallel(), deleting the llm and llm_engine objects, calling gc.collect() and torch.cuda.empty_cache(), but I can never get the second model to load successfully after running the first one.

What is the correct way to unload a model and load a new one in Aphrodite for local inferencing? Is there a specific sequence of steps or additional cleanup required to fully release the GPU memory and resources used by the first model?

I'd appreciate any guidance or code examples showing the proper way to handle switching between models. Thanks in advance for any help!

Answered by murtaza-nasir

Aug 12, 2024

AlpinDale · 2024-08-12T18:38:54Z

AlpinDale
Aug 12, 2024
Maintainer

Hi, and sorry for the late response! I didn't see this discussion.

You're generally right; those steps would kill the process and clean up any leftover objects in memory. Can you try running all of these?

import gc
import torch
from aphrodite.distributed import destroy_model_parallel

destroy_model_parallel()
del llm.llm_engine.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()
torch.distributed.destroy_process_group()
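
To see whether the cleanup steps above actually returned GPU memory, one option is to compare per-GPU memory use before loading and after cleanup. This is only a minimal sketch, assuming `nvidia-smi` is on the PATH and reports a plain number for every GPU; `parse_memory_used` and `gpu_memory_used_mib` are hypothetical helper names, not part of Aphrodite:

```python
import subprocess


def parse_memory_used(csv_text):
    """Turn nvidia-smi's one-number-per-line output into a list of MiB values."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def gpu_memory_used_mib():
    """Return the memory currently in use on each visible GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used(result.stdout)
```

Calling `gpu_memory_used_mib()` once before creating the `LLM` and again after the cleanup gives two lists to compare. If the numbers stay close to the loaded level, something is probably still holding the model, which with tensor parallelism may be a separate worker process.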

1 reply

murtaza-nasir Aug 12, 2024
Author

Thanks for the suggestion! I actually got it working. Here's the gist of how I'm starting and killing models now, including support for multiple endpoints to distribute the tasks across GPUs:

import subprocess
import os
import psutil
import time
import openai
import concurrent.futures
from tqdm import tqdm

port = 5000
cmd_path = os.path.expanduser("~/work/ml/aphrodite-engine/runtime.sh")
num_actual_gpus = 4

def start_model(model_path, model_dtype, num_gpus, gpu_offset=0, port=5000):
    if num_gpus == 4:
        cmd = f"{cmd_path} python -m aphrodite.endpoints.openai.api_server --model '{model_path}' --dtype 'half' -q {model_dtype} --tensor-parallel-size {num_gpus} --port {port} --host 0.0.0.0"
    else:
        gpu_ids = ','.join(str(i) for i in range(gpu_offset, gpu_offset + num_gpus))
        cmd = f"CUDA_VISIBLE_DEVICES={gpu_ids} {cmd_path} python -m aphrodite.endpoints.openai.api_server --model '{model_path}' --dtype 'half' -q {model_dtype} --tensor-parallel-size {num_gpus} --port {port} --host 0.0.0.0"
    
    with open(f"{port}_output.log", 'w') as file:
        return subprocess.Popen(cmd, shell=True, stdout=file, stderr=file)

def kill_process_by_command(port):
    # Terminate every API server process that was started with this port.
    for proc in psutil.process_iter(['pid', 'cmdline']):
        # cmdline can be None when psutil is denied access to a process
        cmdline = ' '.join(proc.info['cmdline'] or [])
        if "aphrodite.endpoints.openai.api_server" in cmdline and f"--port {port}" in cmdline:
            proc.terminate()
            print(f"Terminated process on port {port}")

# Example usage
models = [
    ("/path/to/model1", "exl2", "model1_name", 1),
    ("/path/to/model2", "exl2", "model2_name", 2),
]

for model_path, model_dtype, model_name, num_gpus in models:
    processes = []
    num_endpoints = num_actual_gpus // num_gpus
    for i in range(num_endpoints):
        port = 5000 + i
        processes.append(start_model(model_path, model_dtype, num_gpus, gpu_offset=i * num_gpus, port=port))
    time.sleep(100)  # Wait for models to load
    
    try:
        print(f"Running tasks for {model_name}")
        # Your processing code here
        # Example of using multiple endpoints:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = [executor.submit(process_task, task, openai.OpenAI(base_url=f'http://127.0.0.1:{5000 + i % num_endpoints}/v1/', api_key='dummy')) for i, task in enumerate(tasks)]
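            # Collect each task's return value (in completion order), not the Future objects
            results = [future.result() for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures))]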
        
        # Process results...
    
    finally:
        for i in range(num_endpoints):
            kill_process_by_command(5000 + i)
    
    print(f'Finished all tasks for {model_name}')

This setup starts a separate process for each model, uses it, then cleans up before moving to the next one. It handles models that need different numbers of GPUs and, when possible, sets up multiple endpoints to finish tasks faster: I'm on 3090s, and running a 70B across 4 GPUs doesn't give any speed advantage, especially at small context lengths. This setup works well for my use case.
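
The fixed `time.sleep(100)` in the loop above could be replaced by polling each port until something is listening on it. Here is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical helper, and the timeout values are placeholders. An open port is only a proxy for readiness, so sending a real request to the server would be a stronger check.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

In the loop above, this would be called once per endpoint, e.g. `wait_for_port("127.0.0.1", 5000 + i)`, and the run could be skipped if it returns `False`.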

Answer selected by AlpinDale
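
One caveat on the process-per-model approach in this thread: `terminate()` returns immediately, so the next model may start loading before the previous server has exited and released its GPU memory. Below is a minimal sketch of a stop routine that waits for the exit. It assumes the server was launched from an argument list (without `shell=True`), so the `Popen` handle refers to the server itself rather than a wrapping shell; `stop_process` is a hypothetical helper name.

```python
import subprocess
import sys


def stop_process(proc, timeout=30.0):
    """Ask a child process to exit, wait for it, and force-kill it if it hangs."""
    if proc.poll() is None:  # still running
        proc.terminate()  # polite request (SIGTERM on POSIX)
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()  # forced stop (SIGKILL on POSIX)
            proc.wait()
    return proc.returncode
```

Calling this on every handle returned by the launcher in the `finally` block means the loop moves on to the next model only after the old processes are gone.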
