vLLM-xft

vLLM-xFT is a fork of vLLM that integrates the xfastertransformer backend while maintaining compatibility with most of the official vLLM features.

Install

pip install vllm-xft
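After installing, an optional sanity check is to confirm that both packages import; it should print the installed vLLM version:

python -c 'import vllm, xfastertransformer; print(vllm.__version__)'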

Usage

Notice: Preloading libiomp5.so is required!

Serving (OpenAI-Compatible Server)

# Preload libiomp5.so by following cmd or LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

python -m vllm.entrypoints.openai.api_server \
        --model ${MODEL_PATH} \
        --tokenizer ${TOKEN_PATH} \
        --dtype bf16 \
        --kv-cache-dtype fp16 \
        --served-model-name xft \
        --port 8000 \
        --trust-remote-code 
  • --max-num-batched-tokens: maximum number of batched tokens; defaults to max(MAX_SEQ_LEN_OF_MODEL, 2048).
  • --max-num-seqs: maximum number of sequences per batch; defaults to 256.

For more arguments, please refer to the official vLLM docs. An example combining the two flags above with the serving command is shown below.
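This is a minimal sketch; the values 4096 and 128 are illustrative, not tuned recommendations:

python -m vllm.entrypoints.openai.api_server \
        --model ${MODEL_PATH} \
        --tokenizer ${TOKEN_PATH} \
        --dtype bf16 \
        --kv-cache-dtype fp16 \
        --served-model-name xft \
        --port 8000 \
        --trust-remote-code \
        --max-num-batched-tokens 4096 \
        --max-num-seqs 128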

Query example

  curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "xft",
  "prompt": "San Francisco is a",
  "max_tokens": 16,
  "temperature": 0
  }'
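
Because the server speaks the OpenAI API, the chat endpoint can be queried the same way. This is a sketch assuming the served model provides a chat template (true for the instruct models used later in this document); adjust the message content as needed:

  curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "xft",
  "messages": [{"role": "user", "content": "What is the capital of France?"}],
  "max_tokens": 32,
  "temperature": 0
  }'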

Distributed (Multi-rank)

Use oneCCL's mpirun to run the workload. The master (rank 0) is launched the same way as in the single-rank case above, and the slaves (rank > 0) should use the following command:

python -m vllm.entrypoints.slave --dtype bf16 --model ${MODEL_PATH} --kv-cache-dtype fp16

Please keep the slaves' parameters aligned with the master's.

Serving (OpenAI-Compatible Server)

Here is an example on a 2-socket platform with 48 cores per socket.

# Preload libiomp5.so by following cmd or LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

OMP_NUM_THREADS=48 mpirun \
        -n 1 numactl --all -C 0-47 -m 0 \
          python -m vllm.entrypoints.openai.api_server \
            --model ${MODEL_PATH} \
            --tokenizer ${TOKEN_PATH} \
            --dtype bf16 \
            --kv-cache-dtype fp16 \
            --served-model-name xft \
            --port 8000 \
            --trust-remote-code \
        : -n 1 numactl --all -C 48-95 -m 1 \
          python -m vllm.entrypoints.slave \
            --dtype bf16 \
            --model ${MODEL_PATH} \
            --kv-cache-dtype fp16

Benchmarking vLLM-xFT

Downloading vLLM

git clone https://github.com/Duyi-Wang/vllm.git && cd vllm/benchmarks

Downloading the ShareGPT dataset

You can download the dataset by running:

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

Benchmark offline inference throughput.

This script is used to benchmark the offline inference throughput of a specified model. It sets up the environment, defines the paths for the tokenizer, model, and dataset, and uses numactl to bind the process to appropriate CPU resources for optimized performance.

#!/bin/bash

# Preload libiomp5.so by following cmd or LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

# Define the paths for the tokenizer and the model
TOKEN_PATH=/data/models/Qwen2-7B-Instruct
MODEL_PATH=/data/models/Qwen2-7B-Instruct-xft
DATASET_PATH=ShareGPT_V3_unfiltered_cleaned_split.json

# Use numactl to bind to appropriate CPU resources
# --tokenizer/--model/--dataset point at the tokenizer, model, and dataset paths defined above
numactl -C 0-47 -l python benchmark_throughput.py \
        --tokenizer ${TOKEN_PATH} \
        --model ${MODEL_PATH} \
        --dataset ${DATASET_PATH}
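
If you prefer synthetic prompts over the ShareGPT dataset, upstream vLLM's benchmark_throughput.py can also generate fixed-length requests. The sketch below assumes the upstream --input-len, --output-len, and --num-prompts flags are available in this fork; verify with python benchmark_throughput.py --help:

# Synthetic workload: 256 prompts of 512 input tokens, 128 output tokens each
numactl -C 0-47 -l python benchmark_throughput.py \
        --tokenizer ${TOKEN_PATH} \
        --model ${MODEL_PATH} \
        --input-len 512 \
        --output-len 128 \
        --num-prompts 256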

Benchmark online serving throughput.

This guide explains how to benchmark the online serving throughput for a model. It includes instructions for setting up the server and running the client benchmark script.

  1. On the server side, you can refer to the following code to start the test API server:
#!/bin/bash

# Preload libiomp5.so by following cmd or LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

# Define the paths for the tokenizer and the model
TOKEN_PATH=/data/models/Qwen2-7B-Instruct
MODEL_PATH=/data/models/Qwen2-7B-Instruct-xft

# Start the API server using numactl to bind to appropriate CPU resources
# --dtype bf16: model weights in bfloat16; --kv-cache-dtype fp16: KV cache in float16
numactl -C 0-47 -l python -m vllm.entrypoints.openai.api_server \
        --model ${MODEL_PATH} \
        --tokenizer ${TOKEN_PATH} \
        --dtype bf16 \
        --kv-cache-dtype fp16 \
        --served-model-name xft \
        --port 8000 \
        --trust-remote-code
  2. On the client side, you can use python benchmark_serving.py --help to see the required configuration parameters. Here is a reference example:
$ python benchmark_serving.py --model xft --tokenizer /data/models/Qwen2-7B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json
============ Serving Benchmark Result ============
Successful requests:                     xxxx
Benchmark duration (s):                  xxxx
Total input tokens:                      xxxx
Total generated tokens:                  xxxx
Request throughput (req/s):              xxxx
Input token throughput (tok/s):          xxxx
Output token throughput (tok/s):         xxxx
---------------Time to First Token----------------
Mean TTFT (ms):                          xxxx
Median TTFT (ms):                        xxxx
P99 TTFT (ms):                           xxxx
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          xxx
Median TPOT (ms):                        xxx
P99 TPOT (ms):                           xxx
==================================================