SDXL deployment example on inf2 #538

Merged (2 commits) on Feb 1, 2024
24 changes: 22 additions & 2 deletions README.md
@@ -16,7 +16,7 @@
**NOS (`torch-nos`)** is a fast and flexible PyTorch inference server, specifically designed for optimizing and running inference of popular foundational AI models.
<br>

## **Why use NOS?**
## 🛠️ **Why use NOS?**

- 👩‍💻 **Easy-to-use**: Built for [PyTorch](https://pytorch.org/) and designed to optimize, serve and auto-scale PyTorch models in production without compromising on developer experience.
- 🥷 **Flexible**: Run and serve several foundational AI models ([Stable Diffusion](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), [CLIP](https://huggingface.co/openai/clip-vit-base-patch32), [Whisper](https://huggingface.co/openai/whisper-large-v2)) in a single place.
@@ -37,7 +37,27 @@
* **[Jan 2024]** ✍️ [blog] [Getting started with NOS tutorials](https://docs.nos.run/docs/blog/-getting-started-with-nos-tutorials.html) is available [here](./examples/tutorials/)!
* **[Dec 2023]** 🛝 [repo] We open-sourced the [NOS playground](https://github.com/autonomi-ai/nos-playground) to help you get started with more examples built on NOS!

## **What can NOS do?**
## 🚀 Quickstart

We highly recommend going through our [quickstart guide](https://docs.nos.run/docs/quickstart.html) to get started. To install the NOS client, run the following commands:

```bash
conda create -n nos python=3.8
conda activate nos
pip install torch-nos
```

Once the client is installed, you can start the NOS server via the NOS `serve` CLI. This will automatically detect your local environment, download the docker runtime image and spin up the NOS server:

```bash
nos serve up --http
```

You are now ready to run your first inference request with NOS! You can run any of the following commands to try things out.

*Note:* For the above quickstart to work out of the box, we expect the user to have [Docker](https://docs.docker.com/get-docker/), [Nvidia Docker](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) and [Docker Compose](https://docs.docker.com/compose/install/) pre-installed on their machine. If you run into any issues, please visit our [quickstart](https://docs.nos.run/docs/quickstart.html) page or ping us on [Discord](https://discord.gg/QAGgvTuvgg).
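
For reference, here is a minimal Python sketch of a first request using the NOS gRPC client (the same `Client` API used throughout the examples in this repository). The model id and call signature below are placeholders that mirror the Stable Diffusion examples in the docs; substitute any model registered with your server:

```python
from nos.client import Client

# Connect to the local NOS server (gRPC on port 50051 by default) and wait until it is ready.
client = Client("[::]:50051")
assert client.WaitForServer()

# Placeholder model id -- substitute any model id that your server has registered.
model = client.Module("stabilityai/stable-diffusion-xl-base-1.0")

# Parameter names follow the Stable Diffusion examples; other models expose different signatures.
images = model(prompts=["a photo of an astronaut riding a horse on mars"], width=512, height=512)
```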

## 👩‍💻 **What can NOS do?**

### 💬 Chat / LLM Agents (ChatGPT-as-a-Service)
---
16 changes: 8 additions & 8 deletions docs/concepts/runtime-environments.md
@@ -2,10 +2,10 @@ The NOS inference server supports custom runtime environments through the use of

### ⚡️ NOS Inference Runtime

We use docker to configure different worker configurations to run workloads in different runtime environments. The configured runtime environments are specified in the [InferenceServiceRuntime](../api/server.md#inferenceserviceruntime) class, which wraps the generic [`DockerRuntime`] class. For convenience, we have pre-built some runtime environments that can be used out-of-the-box `cpu`, `gpu`, `trt-runtime` etc.
We use Docker to define the worker configurations that run workloads in different runtime environments. The configured runtime environments are specified in the [InferenceServiceRuntime](../api/server.md#inferenceserviceruntime) class, which wraps the generic [`DockerRuntime`] class. For convenience, we have pre-built several runtime environments that can be used out-of-the-box: `cpu`, `gpu`, `inf2`, etc.

This is the general flow of how the runtime environments are configured:
- Configure runtime environments including `cpu`, `gpu`, `trt-runtime` etc in the [`InferenceServiceRuntime`](../api/server.md#inferenceserviceruntime) `config` dictionary.
- Configure runtime environments such as `cpu`, `gpu`, `inf2`, etc. in the [`InferenceServiceRuntime`](../api/server.md#inferenceserviceruntime) `config` dictionary.
- Start the server with the appropriate runtime environment via the `--runtime` flag.
- The ray cluster is now configured within the appropriate runtime environment and has access to the appropriate libraries and binaries.

@@ -15,12 +15,12 @@ For custom runtime support, we use [Ray](https://ray.io) to configure different

The following runtimes are supported by NOS:

| Status | Name | Pyorch | HW | Base | Description |
| - | --- | --- | --- | --- | --- |
| ✅ | [`autonomi/nos:latest-gpu`](https://hub.docker.com/r/autonomi/nos/tags) | [`2.0.1`](https://pypi.org/project/torch/2.0.1/) | CPU | `debian:buster-slim` | CPU-only runtime. |
| ✅ | [`autonomi/nos:latest-gpu`](https://hub.docker.com/r/autonomi/nos/tags) | [`2.0.1`](https://pypi.org/project/torch/2.0.1/) | NVIDIA GPU | `nvidia/cuda:11.8.0-base-ubuntu22.04` | GPU runtime. |
| **Coming Soon** | `trt` | [`2.0.1`](https://pypi.org/project/torch/2.0.1/) | NVIDIA GPU | `nvidia/cuda:11.7.0-base-ubuntu22.04` | GPU runtime with TensorRT (8.4.2.4). |
| **Coming Soon** | `inf2` | [`1.13.1`](https://pypi.org/project/torch/1.13.1/) | [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) | `debian:buster-slim` | Inf2 runtime with [torch-neuronx](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install.html). |
| Status | Name | PyTorch | HW | Base | Size | Description |
| - | --- | --- | --- | --- | --- | --- |
| ✅ | [`autonomi/nos:latest-cpu`](https://hub.docker.com/r/autonomi/nos/tags) | [`2.1.1`](https://pypi.org/project/torch/2.1.1/) | CPU | `debian:buster-slim` | 1.1 GB | CPU-only runtime. |
| ✅ | [`autonomi/nos:latest-gpu`](https://hub.docker.com/r/autonomi/nos/tags) | [`2.1.1`](https://pypi.org/project/torch/2.1.1/) | NVIDIA GPU | `nvidia/cuda:11.8.0-base-ubuntu22.04` | 3.9 GB | GPU runtime. |
| ✅ | [`autonomi/nos:latest-inf2`](https://hub.docker.com/r/autonomi/nos/tags) | [`1.13.1`](https://pypi.org/project/torch/1.13.1/) | [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) | `debian:buster-slim` | 1.7 GB | Inf2 runtime with [torch-neuronx](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install.html). |
| **Coming Soon** | `trt` | [`2.0.1`](https://pypi.org/project/torch/2.0.1/) | NVIDIA GPU | `nvidia/cuda:11.8.0-base-ubuntu22.04` | - | GPU runtime with TensorRT (8.4.2.4). |

### 🛠️ Adding a custom runtime

6 changes: 3 additions & 3 deletions docs/quickstart.md
@@ -57,14 +57,14 @@ You can start the nos server programmatically via either the CLI or SDK:

=== "Via CLI"

You can start the nos server (in daemon mode) via the NOS `serve` CLI:
You can start the nos server via the NOS `serve` CLI:
```bash
nos serve up -d
nos serve up
```

Optionally, to use the REST API, you can start an HTTP gateway proxy alongside the gRPC server:
```bash
nos serve up -d --http
nos serve up --http
```

!!!note
@@ -19,7 +19,7 @@ setup: |
sudo apt-get install -y docker-compose-plugin

cd /app && python3 -m venv .venv && source .venv/bin/activate
pip install git+https://github.com/spillai/nos.git pytest
pip install git+https://github.com/autonomi-ai/nos.git pytest

run: |
source /app/.venv/bin/activate
2 changes: 1 addition & 1 deletion examples/inf2/embeddings/tests/test_embeddings_inf2.py
@@ -1,7 +1,7 @@
import numpy as np


def test_embeddings():
def test_embeddings_inf2():
from models.embeddings_inf2 import EmbeddingServiceInf2

model = EmbeddingServiceInf2()
@@ -2,7 +2,7 @@


@pytest.mark.parametrize("model_id", ["BAAI/bge-small-en-v1.5"])
def test_embeddings_client(model_id):
def test_embeddings_inf2_client(model_id):
import numpy as np

from nos.client import Client
34 changes: 34 additions & 0 deletions examples/inf2/sdxl/README.md
@@ -0,0 +1,34 @@
## SDXL Service

Start the server via:
```bash
nos serve up -c serve.yaml --http
```

Optionally, you can provide the `inf2` runtime flag, but this is automatically inferred.

```bash
nos serve up -c serve.yaml --http --runtime inf2
```

### Run the tests

```bash
pytest -sv ./tests/test_sdxl_inf2_client.py
```

### Call the service

You can also call the service via the REST API directly:

```bash
curl \
-X POST http://<service-ip>:8000/v1/infer \
-H 'Content-Type: application/json' \
-d '{
"model_id": "BAAI/bge-small-en-v1.5",
"inputs": {
"texts": ["fox jumped over the moon"]
}
}'
```
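
For reference, a minimal Python equivalent of the request above (a sketch assuming the `requests` package is installed and the HTTP gateway is reachable on port 8000):

```python
import requests

# Mirrors the curl request above; replace localhost with your service IP if needed.
response = requests.post(
    "http://localhost:8000/v1/infer",
    json={
        "model_id": "stabilityai/stable-diffusion-xl-base-1.0-inf2",
        "inputs": {
            "prompts": ["a photo of an astronaut riding a horse on mars"],
            "height": 1024,
            "width": 1024,
        },
    },
)
response.raise_for_status()
print(response.json())
```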
26 changes: 26 additions & 0 deletions examples/inf2/sdxl/job-inf2-sdxl-deployment.yaml
@@ -0,0 +1,26 @@
# Usage: sky launch -c <cluster-name> job-inf2-sdxl-deployment.yaml
# image_id: ami-09c62125a680f0ead # us-east-2
# image_id: ami-0d4155c8606f16f5b # us-west-1
# image_id: ami-096319086cc3d5f23 # us-west-2

file_mounts:
/app: .

resources:
cloud: aws
region: us-west-2
instance_type: inf2.8xlarge
image_id: ami-096319086cc3d5f23 # us-west-2
disk_size: 256
ports:
- 8000

setup: |
sudo apt-get install -y docker-compose-plugin

cd /app && python3 -m venv .venv && source .venv/bin/activate
pip install git+https://github.com/autonomi-ai/nos.git pytest

run: |
source /app/.venv/bin/activate
cd /app && NOS_LOGGING_LEVEL=DEBUG nos serve up -c serve.yaml --http
113 changes: 113 additions & 0 deletions examples/inf2/sdxl/models/sdxl_inf2.py
@@ -0,0 +1,113 @@
"""SDXL model accelerated with AWS Neuron (using optimum-neuron)."""
from dataclasses import dataclass, field, replace
from pathlib import Path
from typing import Any, Dict, List, Union

import torch
from PIL import Image

from nos.constants import NOS_CACHE_DIR
from nos.hub import HuggingFaceHubConfig
from nos.neuron.device import NeuronDevice


@dataclass(frozen=True)
class StableDiffusionInf2Config(HuggingFaceHubConfig):
"""SDXL model configuration for Inf2."""

batch_size: int = 1
"""Batch size for the model."""

image_height: int = 1024
"""Height of the image."""

image_width: int = 1024
"""Width of the image."""

compiler_args: Dict[str, Any] = field(
default_factory=lambda: {"auto_cast": "matmul", "auto_cast_type": "bf16"}, repr=False
)
"""Compiler arguments for the model."""

@property
def id(self) -> str:
"""Model ID."""
return f"{self.model_name}-bs-{self.batch_size}-{self.image_height}x{self.image_width}-{self.compiler_args.get('auto_cast_type', 'fp32')}"


class StableDiffusionXLInf2:
configs = {
"stabilityai/stable-diffusion-xl-base-1.0-inf2": StableDiffusionInf2Config(
model_name="stabilityai/stable-diffusion-xl-base-1.0",
),
}

def __init__(self, model_name: str = "stabilityai/stable-diffusion-xl-base-1.0-inf2"):
from nos.logging import logger

NeuronDevice.setup_environment()
try:
cfg = StableDiffusionXLInf2.configs[model_name]
except KeyError:
raise ValueError(f"Invalid model_name: {model_name}, available models: {self.configs.keys()}")
self.logger = logger
self.model = None
self.__load__(cfg)

def __load__(self, cfg: StableDiffusionInf2Config):
from optimum.neuron import NeuronStableDiffusionXLPipeline

if self.model is not None:
self.logger.debug(f"De-allocating existing model [cfg={self.cfg}, id={self.cfg.id}]")
del self.model
self.model = None
self.cfg = cfg

# Load model from cache if available, otherwise load from HF and compile
# (cache is keyed on model_name, batch_size, image dimensions and cast type)
self.logger.debug(f"Loading model [cfg={self.cfg}, id={self.cfg.id}]")
cache_dir = NOS_CACHE_DIR / "neuron" / self.cfg.id
if Path(cache_dir).exists():
self.logger.debug(f"Loading model from {cache_dir}")
self.model = NeuronStableDiffusionXLPipeline.from_pretrained(str(cache_dir))
self.logger.debug(f"Loaded model from {cache_dir}")
else:
input_shapes = {
"batch_size": self.cfg.batch_size,
"height": self.cfg.image_height,
"width": self.cfg.image_width,
}
self.model = NeuronStableDiffusionXLPipeline.from_pretrained(
self.cfg.model_name, export=True, **self.cfg.compiler_args, **input_shapes
)
self.model.save_pretrained(str(cache_dir))
self.logger.debug(f"Saved model to {cache_dir}")
self.logger.debug(f"Loaded neuron model [id={self.cfg.id}]")

@torch.inference_mode()
def __call__(
self,
prompts: Union[str, List[str]],
num_images: int = 1,
num_inference_steps: int = 50,
guidance_scale: float = 7.5,
height: int = 512,
width: int = 512,
) -> List[Image.Image]:
"""Generate images from text prompt."""

if isinstance(prompts, str):
prompts = [prompts]
if isinstance(prompts, list) and len(prompts) != 1:
raise ValueError(f"Invalid number of prompts: {len(prompts)}, expected: 1")
if height != self.cfg.image_height or width != self.cfg.image_width:
cfg = replace(self.cfg, image_height=height, image_width=width)
self.logger.debug(f"Re-loading model [cfg={cfg}, id={cfg.id}, prev_id={self.cfg.id}]")
self.__load__(cfg)
assert self.model is not None
return self.model(
prompts,
num_images_per_prompt=num_images,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
).images
14 changes: 14 additions & 0 deletions examples/inf2/sdxl/serve.yaml
@@ -0,0 +1,14 @@
images:
custom-inf2:
base: autonomi/nos:latest-inf2
env:
NOS_LOGGING_LEVEL: DEBUG
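# One Inferentia2 chip provides 2 NeuronCores; make both visible to the Neuron runtime.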
NOS_NEURON_CORES: 2
NEURON_RT_VISIBLE_CORES: 2

models:
stabilityai/stable-diffusion-xl-base-1.0-inf2:
model_cls: StableDiffusionXLInf2
model_path: models/sdxl_inf2.py
default_method: __call__
runtime_env: custom-inf2
9 changes: 9 additions & 0 deletions examples/inf2/sdxl/tests/test_sdxl_inf2.py
@@ -0,0 +1,9 @@
def test_sdxl_inf2():
from models.sdxl_inf2 import StableDiffusionXLInf2
from PIL import Image

model = StableDiffusionXLInf2()
prompts = "a photo of an astronaut riding a horse on mars"
response = model(prompts=prompts, height=1024, width=1024, num_inference_steps=50)
assert response is not None
assert isinstance(response[0], Image.Image)
21 changes: 21 additions & 0 deletions examples/inf2/sdxl/tests/test_sdxl_inf2_client.py
@@ -0,0 +1,21 @@
import pytest


@pytest.mark.parametrize("model_id", ["stabilityai/stable-diffusion-xl-base-1.0-inf2"])
def test_sdxl_inf2_client(model_id):
from PIL import Image

from nos.client import Client

# Create a client
client = Client("[::]:50051")
assert client.WaitForServer()

# Load the SDXL model
model = client.Module(model_id)

# Run inference
prompts = "a photo of an astronaut riding a horse on mars"
response = model(prompts=prompts, height=1024, width=1024, num_inference_steps=50)
assert response is not None
assert isinstance(response[0], Image.Image)
4 changes: 2 additions & 2 deletions nos/neuron/device.py
@@ -1,8 +1,6 @@
import os
from dataclasses import dataclass

import torch_neuronx

from nos.constants import NOS_CACHE_DIR
from nos.logging import logger

@@ -21,6 +19,8 @@ def get(cls):

@staticmethod
def device_count() -> int:
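# Import lazily so this module can be imported on hosts without the Neuron SDK installed.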
import torch_neuronx

try:
return torch_neuronx.xla_impl.data_parallel.device_count()
except (RuntimeError, AssertionError):