
Compatibility with llama-cpp-python and add documentation #1779


@0x33taji I've managed to get it working -- the key was using http://localhost:8000/v1 for baseURL

Run command

python3 -m llama_cpp.server \
    --model "<path-to-models>/nous-hermes-2-mixtral-8x7B-dpo/Nous-Hermes-2-Mixtral-8x7B-DPO.Q5_K_M.gguf" \
    --chat_format chatml --n_gpu_layers -1 --n_ctx 8192
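
A quick sanity check before pointing LibreChat at the server is to list the models it exposes over the OpenAI-compatible API. Below is a minimal sketch, assuming the openai Python package (v1+) is installed and the server above is listening on its default port 8000; by default the server does not enforce an API key, so any placeholder string works:

# Sketch: list the models served by llama_cpp.server via its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="1234")
for model in client.models.list():
    print(model.id)  # should print the loaded GGUF model name

If this prints the GGUF model name, the /v1 endpoints are up and the same baseURL will work from LibreChat.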

librechat.yaml

# Configuration version (required)
version: 1.0.2

cache: true

# Definition of custom endpoints
endpoints:
  custom:
    - name: "Local LLAMA"
      apiKey: "1234"
      baseURL: "http://localhost:8000/v1"
      models:
        default: ["Nous-Hermes-2-Mixtral-8x7B-DPO.Q5_K_M.gguf"]
        fetch: true
      titleConvo: true
      titleModel: "Nous-Hermes-2-Mixtral-8x7B-DPO.Q5_K_M.gguf"
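
With that configuration, LibreChat talks to the endpoint through standard OpenAI chat completions. To test the server independently of LibreChat, here is a minimal sketch of the equivalent request, again assuming the openai Python package and the server started above:

# Sketch: send a chat completion through the same baseURL LibreChat uses
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="1234")
response = client.chat.completions.create(
    model="Nous-Hermes-2-Mixtral-8x7B-DPO.Q5_K_M.gguf",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)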

Answer selected by danny-avila

This discussion was converted from issue #1769 on February 12, 2024 08:57.