feat: add Voyage AI vectorizer integration (#256)
To configure a vectorizer with Voyage AI:

```sql
SELECT ai.create_vectorizer(
    'my_table'::regclass,
    embedding => ai.embedding_voyageai(
      'voyage-3-lite',
      512
    ),
    -- other parameters...
);
```

The vectorizer worker connects to the Voyage AI API with the API key
specified in the `VOYAGE_API_KEY` environment variable.

To get a vector embedding from SQL, use the `ai.voyageai_embed`
function:

```sql
SELECT ai.voyageai_embed('voyage-3-lite', 'text to embed');
```

Co-authored-by: Sergio Moya <1083296+smoya@users.noreply.github.com>
JamesGuthrie and smoya authored Dec 5, 2024
1 parent 55444b3 commit 1b56d62
Showing 26 changed files with 1,456 additions and 8 deletions.
1 change: 1 addition & 0 deletions DEVELOPMENT.md
@@ -143,6 +143,7 @@ To set up the tests:
ENABLE_OLLAMA_TESTS=1
ENABLE_ANTHROPIC_TESTS=1
ENABLE_COHERE_TESTS=1
ENABLE_VOYAGEAI_TESTS=1
ENABLE_VECTORIZER_TESTS=1
ENABLE_DUMP_RESTORE_TESTS=1
ENABLE_PRIVILEGES_TESTS=1
2 changes: 2 additions & 0 deletions README.md
@@ -62,6 +62,7 @@ For other use cases, first [Install pgai](#installation) in Timescale Cloud, a p
* [OpenAI](./docs/openai.md) - configure pgai for OpenAI, then use the model to tokenize, embed, chat complete and moderate. This page also includes advanced examples.
* [Anthropic](./docs/anthropic.md) - configure pgai for Anthropic, then use the model to generate content.
* [Cohere](./docs/cohere.md) - configure pgai for Cohere, then use the model to tokenize, embed, chat complete, classify, and rerank.
* [Voyage AI](./docs/voyageai.md) - configure pgai for Voyage AI, then use the model to embed.
- Leverage LLMs for data processing tasks such as classification, summarization, and data enrichment ([see the OpenAI example](/docs/openai.md)).


@@ -175,6 +176,7 @@ You can use pgai to integrate AI from the following providers:
- [Anthropic](./docs/anthropic.md)
- [Cohere](./docs/cohere.md)
- [Llama 3 (via Ollama)](/docs/ollama.md)
- [Voyage AI](/docs/voyageai.md)
Learn how to [moderate](/docs/moderate.md) content directly in the database using triggers and background jobs.
44 changes: 43 additions & 1 deletion docs/vectorizer-api-reference.md
@@ -253,6 +253,7 @@ The embedding functions are:

- [ai.embedding_openai](#aiembedding_openai)
- [ai.embedding_ollama](#aiembedding_ollama)
- [ai.embedding_voyageai](#aiembedding_voyageai)

### ai.embedding_openai

@@ -318,7 +319,6 @@ SELECT ai.create_vectorizer(
'nomic-embed-text',
768,
base_url => "http://my.ollama.server:443"
truncate => false,
options => '{ "num_ctx": 1024 }',
keep_alive => "10m"
),
@@ -343,6 +343,48 @@ The function takes several parameters to customize the Ollama embedding configur

A JSON configuration object that you can use in [ai.create_vectorizer](#create-vectorizers).

### ai.embedding_voyageai

You use the `ai.embedding_voyageai` function to use a Voyage AI model to generate embeddings.

The purpose of `ai.embedding_voyageai` is to:
- Define which Voyage AI model to use.
- Specify the dimensionality of the embeddings.
- Configure the model's truncation behaviour and the API key name.
- Configure the input type.

#### Example usage

This function is used to create an embedding configuration object that is passed as an argument to [ai.create_vectorizer](#create-vectorizers):

```sql
SELECT ai.create_vectorizer(
    'my_table'::regclass,
    embedding => ai.embedding_voyageai(
      'voyage-3-lite',
      512,
      api_key_name => 'TEST_API_KEY'
    ),
    -- other parameters...
);
```

#### Parameters

The function takes several parameters to customize the Voyage AI embedding configuration:

| Name         | Type    | Default          | Required | Description |
|--------------|---------|------------------|----------|-------------|
| model        | text    | -                | ✔        | Specify the name of the [Voyage AI model](https://docs.voyageai.com/docs/embeddings#model-choices) to use. |
| dimensions   | int     | -                | ✔        | Define the number of dimensions for the embedding vectors. This should match the output dimensions of the chosen model. |
| truncate     | boolean | true             | ✖        | Truncates the end of each input to fit within the chosen model's context length. Embedding fails (for a given chunk) if set to false and the context length is exceeded. |
| input_type   | text    | 'document'       | ✖        | Type of the input text: 'query', 'document', or null to leave it unset. Any other value raises an error. |
| api_key_name | text    | `VOYAGE_API_KEY` | ✖        | Set the name of the environment variable that contains the Voyage AI API key. This allows for flexible API key management without hardcoding keys in the database. On Timescale Cloud, you should set this to the name of the secret that contains the Voyage AI API key. |

#### Returns

A JSON configuration object that you can use in [ai.create_vectorizer](#create-vectorizers).
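
To make the shape of that configuration object concrete, here is a minimal Python sketch that mirrors the SQL definition of `ai.embedding_voyageai` included in this commit. The helper name `embedding_voyageai_config` is hypothetical and not part of pgai; it only illustrates the validation of `input_type` and the way null values are left out of the result.

```python
import json
from typing import Any, Optional

VALID_INPUT_TYPES = ("query", "document")


def embedding_voyageai_config(
    model: str,
    dimensions: int,
    truncate: bool = True,
    input_type: Optional[str] = "document",
    api_key_name: str = "VOYAGE_API_KEY",
) -> dict[str, Any]:
    """Hypothetical mirror of the config built by ai.embedding_voyageai."""
    # Same check as the SQL function: null is allowed, anything else must be known.
    if input_type is not None and input_type not in VALID_INPUT_TYPES:
        raise ValueError(f'invalid input_type for voyage ai "{input_type}"')
    config = {
        "implementation": "voyageai",
        "config_type": "embedding",
        "model": model,
        "dimensions": dimensions,
        "truncate": truncate,
        "input_type": input_type,
        "api_key_name": api_key_name,
    }
    # Drop null entries, mirroring "absent on null" in the SQL definition.
    return {key: value for key, value in config.items() if value is not None}


if __name__ == "__main__":
    print(json.dumps(embedding_voyageai_config("voyage-3-lite", 512), indent=2))
```

Passing `input_type=None` removes the `input_type` key from the object entirely rather than storing a JSON null.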

## Formatting configuration

You use the `ai.formatting_python_template` function in `pgai` to
182 changes: 182 additions & 0 deletions docs/voyageai.md
@@ -0,0 +1,182 @@
# Use pgai with Voyage AI

This page shows you how to:

- [Configure pgai for Voyage AI](#configure-pgai-for-voyage-ai)
- [Add AI functionality to your database](#usage)
- [Follow advanced AI examples](#advanced-examples)

## Configure pgai for Voyage AI

Most pgai functions require a [Voyage AI API key](https://docs.voyageai.com/docs/api-key-and-installation#authentication-with-api-keys).

- [Handle API keys using pgai from psql](#handle-api-keys-using-pgai-from-psql)
- [Handle API keys using pgai from python](#handle-api-keys-using-pgai-from-python)

### Handle API keys using pgai from psql

The API key is an [optional parameter to pgai functions](https://www.postgresql.org/docs/current/sql-syntax-calling-funcs.html).
You can either:

* [Run AI queries by passing your API key implicitly as a session parameter](#run-ai-queries-by-passing-your-api-key-implicitly-as-a-session-parameter)
* [Run AI queries by passing your API key explicitly as a function argument](#run-ai-queries-by-passing-your-api-key-explicitly-as-a-function-argument)

#### Run AI queries by passing your API key implicitly as a session parameter

To use a [session level parameter when connecting to your database with psql](https://www.postgresql.org/docs/current/config-setting.html#CONFIG-SETTING-SHELL)
to run your AI queries:

1. Set your Voyage AI key as an environment variable in your shell:
```bash
export VOYAGE_API_KEY="this-is-my-super-secret-api-key-dont-tell"
```
1. Use the session level parameter when you connect to your database:

```bash
PGOPTIONS="-c ai.voyage_api_key=$VOYAGE_API_KEY" psql -d "postgres://<username>:<password>@<host>:<port>/<database-name>"
```

1. Run your AI query:

`ai.voyage_api_key` is set for the duration of your psql session, so you do not need to specify it for pgai functions.

```sql
SELECT * FROM ai.voyageai_embed('voyage-3-lite', 'sample text to embed');
```

#### Run AI queries by passing your API key explicitly as a function argument

1. Set your Voyage AI key as an environment variable in your shell:
```bash
export VOYAGE_API_KEY="this-is-my-super-secret-api-key-dont-tell"
```

2. Connect to your database and set your API key as a [psql variable](https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-VARIABLES):

```bash
psql -d "postgres://<username>:<password>@<host>:<port>/<database-name>" -v voyage_api_key=$VOYAGE_API_KEY
```
Your API key is now available as a psql variable named `voyage_api_key` in your psql session.

You can also log into the database, then set `voyage_api_key` using the `\getenv` [metacommand](https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-META-COMMAND-GETENV):

```sql
\getenv voyage_api_key VOYAGE_API_KEY
```

3. Pass your API key to your parameterized query:
```sql
SELECT *
FROM ai.voyageai_embed('voyage-3-lite', 'sample text to embed', api_key=>$1)
\bind :voyage_api_key
\g
```

Use [\bind](https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-META-COMMAND-BIND) to pass the value of `voyage_api_key` to the parameterized query.

The `\bind` metacommand is available in psql version 16+.

4. Once you have used `\getenv` to load the environment variable into a psql variable,
you can optionally set it as a session-level parameter, which pgai functions then use implicitly.
```sql
SELECT set_config('ai.voyage_api_key', $1, false) IS NOT NULL
\bind :voyage_api_key
\g
```

```sql
SELECT * FROM ai.voyageai_embed('voyage-3-lite', 'sample text to embed');
```

### Handle API keys using pgai from python

1. In your Python environment, include the dotenv and postgres driver packages:

```bash
pip install python-dotenv
pip install psycopg2-binary
```

1. Set your Voyage AI key in a .env file or as an environment variable:
```bash
VOYAGE_API_KEY="this-is-my-super-secret-api-key-dont-tell"
DB_URL="your connection string"
```

1. Pass your API key as a parameter to your queries:

```python
import os

import psycopg2
from dotenv import load_dotenv

# Load VOYAGE_API_KEY and DB_URL from the .env file or the environment.
load_dotenv()
VOYAGE_API_KEY = os.environ["VOYAGE_API_KEY"]
DB_URL = os.environ["DB_URL"]

with psycopg2.connect(DB_URL) as conn:
    with conn.cursor() as cur:
        # Pass the API key as a query parameter; never build it into the SQL string.
        cur.execute("SELECT * FROM ai.voyageai_embed('voyage-3-lite', 'sample text to embed', api_key=>%s)", (VOYAGE_API_KEY,))
        records = cur.fetchall()
```

Do not use string manipulation to embed the key as a literal in the SQL query.
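
The fetched rows contain the embedding in their single column. As a sketch, and assuming no pgvector type adapter is registered with psycopg2 (in which case the driver typically returns the vector in its text form, such as `[0.1,0.2]`), a small helper can turn that text into a list of floats. `parse_vector` is a hypothetical name and is not part of pgai or psycopg2.

```python
def parse_vector(text: str) -> list[float]:
    """Parse a vector in bracketed text form, e.g. '[0.1,0.2]', into floats."""
    stripped = text.strip()
    if not (stripped.startswith("[") and stripped.endswith("]")):
        raise ValueError(f"not a bracketed vector: {text!r}")
    body = stripped[1:-1].strip()
    if not body:
        # An empty pair of brackets yields an empty list.
        return []
    return [float(part) for part in body.split(",")]
```

With the query above, this could be applied as `parse_vector(records[0][0])`, provided the column really arrives as a string in your setup.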


## Usage

This section shows you how to use AI directly from your database using SQL.

- [Embed](#embed): generate [embeddings](https://docs.voyageai.com/docs/embeddings) using a
specified model.

### Embed

Generate [embeddings](https://docs.voyageai.com/docs/embeddings) using a specified model.

- Request an embedding using a specific model:

```sql
SELECT ai.voyageai_embed
( 'voyage-3-lite'
, 'the purple elephant sits on a red mushroom'
);
```

The data returned looks like:

```text
voyageai_embed
--------------------------------------------------------
[0.005978798,-0.020522336,...-0.0022857306,-0.023699166]
(1 row)
```

- Pass an array of text inputs:

```sql
SELECT ai.voyageai_embed
( 'voyage-3-lite'
, array['Timescale is Postgres made Powerful', 'the purple elephant sits on a red mushroom']
);
```

- Specify the input type:

The Voyage AI API accepts an `input_type` of `"document"` or `"query"`; you can
also leave it unset. Setting it to match how the text is used should improve
retrieval quality:

```sql
SELECT ai.voyageai_embed
( 'voyage-3-lite'
, 'A query'
, input_type => 'query'
);
```
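
When calling `ai.voyageai_embed` from application code, the input type can be passed as a query parameter like any other value. The following is a minimal sketch, assuming a driver that uses `%s` placeholders such as psycopg2; `build_embed_query` is a hypothetical helper, not part of pgai. It rejects unknown input types before the query is sent and keeps all values out of the SQL string.

```python
from typing import Optional

# None means "leave input_type unset"; the other values are those named above.
VALID_INPUT_TYPES = (None, "query", "document")


def build_embed_query(
    model: str, text: str, input_type: Optional[str] = None
) -> tuple[str, tuple[str, str, Optional[str]]]:
    """Return a parameterized SQL statement and its parameters."""
    if input_type not in VALID_INPUT_TYPES:
        raise ValueError(f"unsupported input_type: {input_type!r}")
    sql = "SELECT ai.voyageai_embed(%s, %s, input_type => %s)"
    return sql, (model, text, input_type)
```

A cursor could then run it with `cur.execute(*build_embed_query('voyage-3-lite', 'A query', 'query'))`.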


22 changes: 22 additions & 0 deletions projects/extension/ai/voyageai.py
@@ -0,0 +1,22 @@
import voyageai
from typing import Optional, Generator, Union

DEFAULT_KEY_NAME = "VOYAGE_API_KEY"


def embed(
    model: str,
    input: Union[list[str]],
    api_key: str,
    input_type: Optional[str] = None,
    truncation: Optional[bool] = None,
) -> Generator[tuple[int, list[float]], None, None]:
    client = voyageai.Client(api_key=api_key)
    args = {}
    if truncation is not None:
        args["truncation"] = truncation
    response = client.embed(input, model=model, input_type=input_type, **args)
    if not hasattr(response, "embeddings"):
        return None
    for idx, obj in enumerate(response.embeddings):
        yield idx, obj
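
The tail of `embed` can be exercised without network access. The sketch below is illustrative only: `iter_embeddings` is a hypothetical function that copies the last four lines of `embed`, and `fake_response` is a stand-in object rather than a real Voyage AI client response. It shows that the generator yields `(index, embedding)` pairs, and that a bare `return` inside a generator simply ends iteration, so a response without an `embeddings` attribute produces no items.

```python
from types import SimpleNamespace
from typing import Any, Iterator


def iter_embeddings(response: Any) -> Iterator[tuple[int, list[float]]]:
    """Yield (index, embedding) pairs, mirroring the end of embed()."""
    if not hasattr(response, "embeddings"):
        # Returning from a generator ends iteration; callers see no items.
        return
    for idx, embedding in enumerate(response.embeddings):
        yield idx, embedding


# Stand-in for a client response; not a real voyageai object.
fake_response = SimpleNamespace(embeddings=[[0.1, 0.2], [0.3, 0.4]])
pairs = list(iter_embeddings(fake_response))
empty = list(iter_embeddings(object()))
```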
3 changes: 2 additions & 1 deletion projects/extension/requirements.txt
@@ -3,4 +3,5 @@ tiktoken==0.7.0
ollama==0.2.1
anthropic==0.29.0
cohere==5.5.8
backoff==2.2.1
backoff==2.2.1
voyageai==0.3.1
1 change: 1 addition & 0 deletions projects/extension/setup.cfg
@@ -13,3 +13,4 @@ install_requires =
anthropic==0.29.0
cohere==5.5.8
backoff==2.2.1
voyageai==0.3.1
33 changes: 33 additions & 0 deletions projects/extension/sql/idempotent/008-embedding.sql
@@ -47,6 +47,37 @@ $func$ language sql immutable security invoker
set search_path to pg_catalog, pg_temp
;

-------------------------------------------------------------------------------
-- embedding_voyageai
create or replace function ai.embedding_voyageai
( model text
, dimensions int
, truncate boolean default true
, input_type text default 'document'
, api_key_name text default 'VOYAGE_API_KEY'
) returns jsonb
as $func$
begin
if input_type is not null and input_type not in ('query', 'document') then
-- Note: purposefully not using an enum here because types make life complicated
raise exception 'invalid input_type for voyage ai "%"', input_type;
end if;

return json_object
( 'implementation': 'voyageai'
, 'config_type': 'embedding'
, 'model': model
, 'dimensions': dimensions
, 'truncate': truncate
, 'input_type': input_type
, 'api_key_name': api_key_name
absent on null
);
end
$func$ language plpgsql immutable security invoker
set search_path to pg_catalog, pg_temp
;

-------------------------------------------------------------------------------
-- _validate_embedding
create or replace function ai._validate_embedding(config jsonb) returns void
@@ -69,6 +100,8 @@ begin
-- ok
when 'ollama' then
-- ok
when 'voyageai' then
-- ok
else
if _implementation is null then
raise exception 'embedding implementation not specified';