
feat: SQLAlchemy and alembic integration #208

Closed
wants to merge 23 commits into from
- `ee55c8e` feat: add docs about potential python integration (Askir, Nov 8, 2024)
- `1ce31e1` feat: add vectorized annotation (Askir, Nov 12, 2024)
- `6f3d033` feat: add VectorizerField implementation (Askir, Nov 12, 2024)
- `c09bac0` docs: reset docs since inacurate (Askir, Nov 12, 2024)
- `528522c` feat: add basic alembic operations (Askir, Nov 13, 2024)
- `85ccc75` feat: make autogenerate work (Askir, Nov 14, 2024)
- `17e4b50` chore: ditch reversible autogenerate for now (Askir, Nov 16, 2024)
- `9e94399` chore: refactor tests (Askir, Nov 18, 2024)
- `fe43e3c` chore: test refactoring (Askir, Nov 19, 2024)
- `42034d4` chore: add all paramters to vectorizer field interface (Askir, Nov 20, 2024)
- `91de685` docs: add basic python integration docs (Askir, Nov 20, 2024)
- `91bd528` chore: move packages around, add sqlalchemy as optional extra (Askir, Nov 21, 2024)
- `e097851` chore: change package structure (Askir, Nov 22, 2024)
- `fb17a93` chore: make use of drop_all parameter (Askir, Nov 22, 2024)
- `d25d5b2` chore: align vectorizers on target_table name (Askir, Nov 22, 2024)
- `074f233` docs: update docs to reflect current interface (Askir, Nov 22, 2024)
- `37ffbdb` chore: move sql and python generation to dataclasses (Askir, Nov 25, 2024)
- `5f4abf3` chore: fix comparison logic (Askir, Nov 27, 2024)
- `63fd2ef` chore: fix tests (Askir, Nov 27, 2024)
- `128f836` chore: use pydantic parsing for loading from db (Askir, Nov 27, 2024)
- `51c4cb9` chore: add more extensive tests for vectorizer creation and add ollam… (Askir, Nov 28, 2024)
- `e64a08e` chore: add extensive tests for alembic migrations (Askir, Nov 29, 2024)
- `9249bc1` chore: make CI install all extras for tests (Askir, Nov 29, 2024)
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml

```diff
@@ -81,7 +81,7 @@ jobs:

       - name: Install dependencies
         working-directory: ./projects/pgai
-        run: uv sync
+        run: uv sync --all-extras

       - name: Lint
         run: just pgai lint
```
299 changes: 299 additions & 0 deletions docs/python-integration.md
# SQLAlchemy Integration

pgai provides SQLAlchemy integration for managing and querying vector embeddings in your database through a simple declarative interface.

## VectorizerField

### Basic Setup

```python
from sqlalchemy import Column, Integer, Text
from sqlalchemy.orm import DeclarativeBase
from pgai.sqlalchemy import VectorizerField
from pgai.configuration import ChunkingConfig, EmbeddingConfig


class Base(DeclarativeBase):
    pass


class BlogPost(Base):
    __tablename__ = "blog_posts"

    id = Column(Integer, primary_key=True)
    title = Column(Text, nullable=False)
    content = Column(Text, nullable=False)

    content_embeddings = VectorizerField(
        # Review comment (Collaborator): I think this is a great interface. One nit:
        # VectorizerField or Vectorizer? In my mind this isn't really a field...
        embedding=EmbeddingConfig(
            model="text-embedding-3-small",
            dimensions=768
        ),
        chunking=ChunkingConfig(
            chunk_column="content",
            chunk_size=500
        ),
    )
```
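The chunking itself runs inside the database when the vectorizer executes, so the exact splitting behavior is server-side. Still, the interaction of `chunk_size` and `chunk_overlap` is easier to reason about with a rough, hypothetical character-level sketch (the real chunker is separator-aware and may split differently):

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Rough illustration only: fixed-size windows that advance by
    chunk_size - chunk_overlap characters. The actual vectorizer
    chunks server-side with separator-aware logic."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 1200-character document with chunk_size=500 and chunk_overlap=50
# advances 450 characters per chunk, yielding 3 chunks.
chunks = chunk_text("a" * 1200, chunk_size=500, chunk_overlap=50)
print(len(chunks))  # 3
```

Larger overlap values preserve more context across chunk boundaries at the cost of more embeddings per document.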

### Querying with VectorizerField

Once your model is set up, you can query the embeddings in several ways:

#### Basic ORM Queries

```python
from sqlalchemy.orm import Session

# Get all embeddings
with Session(engine) as session:
    # Get embedding entries
    embeddings = session.query(BlogPost.content_embeddings).all()

    # Access embedding vectors and chunks
    for embedding in embeddings:
        vector = embedding.embedding  # numpy array of embeddings
        chunk = embedding.chunk  # text chunk that was embedded
        chunk_seq = embedding.chunk_seq  # sequence number of chunk
```

#### Semantic Search

Using pgvector's distance operators in queries:

```python
from sqlalchemy import func, text
from sqlalchemy.orm import Session


def search_similar_content(session: Session, query_text: str, limit: int = 5):
    return (
        session.query(
            BlogPost.content_embeddings,
            BlogPost.title,
            # Calculate and include distance in results
            BlogPost.content_embeddings.embedding.cosine_distance(
                func.ai.openai_embed(
                    'text-embedding-3-small',
                    query_text,
                    text('dimensions => 768')
                )
            # Review thread on the openai_embed call:
            # Contributor: Can we reference this from the Vectorizer config? That way
            #   we can remove the cognitive load on the user. Maybe that's what @cevian
            #   was referring to when talking about improving the dev experience.
            # Contributor (author): This function is just plain SQLAlchemy core right
            #   now; this works even without my extension code. We can add custom
            #   SQLAlchemy functions, e.g. so you don't have to wrap the dimensions
            #   parameter in a text(). I could also maybe run a subquery in such a
            #   function. We can definitely provide a Python helper that references
            #   the underlying vectorizer or loads it from the Python config. Mat's
            #   idea was to do this on the SQL level though, and then it's
            #   automatically also available in Python, so we save some effort there.
            # Collaborator: I agree we can improve this UX. But I also think we can
            #   do that in a separate PR.
            ).label('distance')
        )
        .order_by('distance')  # Sort by similarity
        .limit(limit)
        .all()
    )

# Usage:
results = search_similar_content(session, "machine learning concepts")
for embedding, title, distance in results:
    print(f"Title: {title}")
    print(f"Matching chunk: {embedding.chunk}")
    print(f"Distance: {distance}\n")
```
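For intuition, pgvector's cosine distance (the `<=>` operator that `cosine_distance` compiles to) is `1 - cosine_similarity`, so identical directions score 0 and orthogonal directions score 1. A quick stdlib illustration of the math:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    # 1 - (a . b) / (|a| * |b|), matching pgvector's <=> operator
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 1.0
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # parallel vectors -> 0.0
```

This is why `order_by('distance')` ascending returns the most similar chunks first.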

#### Advanced Filtering

```python
# Find content within a certain distance threshold
threshold_query = (
session.query(BlogPost.content_embeddings)
.filter(
BlogPost.content_embeddings.embedding.cosine_distance(
func.ai.openai_embed(
'text-embedding-3-small',
'search query',
text('dimensions => 768')
)
) < 0.3
)
)

# Combine with regular SQL filters
combined_query = (
session.query(BlogPost, BlogPost.content_embeddings)
.join(
BlogPost.content_embeddings,
BlogPost.id == BlogPost.content_embeddings.id,
)
.filter(BlogPost.title.ilike("%Python%"))
.order_by(
BlogPost.content_embeddings.embedding.cosine_distance(
func.ai.openai_embed(
'text-embedding-3-small',
'search query',
text('dimensions => 768')
)
)
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably ought to have a limit clause here

)
```

### Model Relationships

You can optionally create SQLAlchemy relationships between your model and its embeddings:
> **Contributor:** Would it be worth it to mention that we expose the relationship via the view that's created? I don't know much about SQLAlchemy, so maybe this relationship approach is better than querying the view directly.
>
> **Contributor (author):** I need to look up exactly how relationships work, but generally speaking it's just an eager join that enables the related objects to be loaded as a list on the parent. Mat mentioned that he's, understandably, a bit skeptical about the automatic join, because you can run into N+1 problems depending on how it is configured, which is why I made it optional.
>
> **Collaborator:** I wonder if we can set up a relationship but not have it do the eager join by default?


```python
class BlogPost(Base):
    # ... other columns as above ...

    content_embeddings = VectorizerField(
        embedding=EmbeddingConfig(
            model="text-embedding-3-small",
            dimensions=768
        ),
        chunking=ChunkingConfig(
            chunk_column="content",
            chunk_size=500
        ),
        add_relationship=True
    )

    # Type hint for the relationship
    content_embeddings_relation: Mapped[list[EmbeddingModel["BlogPost"]]]
```

### Advanced Configuration

The `VectorizerField` supports all configuration options from the [SQL interface](./vectorizer-api-reference.md):

```python
from pgai.sqlalchemy import VectorizerField
from pgai.configuration import (
    EmbeddingConfig,
    ChunkingConfig,
    DiskANNIndexingConfig,
    SchedulingConfig,
    ProcessingConfig
)


class BlogPost(Base):
    content_embeddings = VectorizerField(
        embedding=EmbeddingConfig(
            model="text-embedding-3-small",
            dimensions=768,
            chat_user="custom_user",
            api_key_name="custom_key"
        ),
        chunking=ChunkingConfig(
            chunk_column="content",
            chunk_size=500,
            chunk_overlap=50,
            separator=" ",
            is_separator_regex=True
        ),
        indexing=DiskANNIndexingConfig(
            min_rows=10000,
            storage_layout="memory_optimized"
        ),
        formatting_template="Title: ${title}\nContent: ${chunk}",
        scheduling=SchedulingConfig(
            schedule_interval="1h",
            timezone="UTC"
        ),
        target_schema="public",
        target_table="blog_embeddings",
        view_schema="public",
        view_name="blog_embeddings_view"
    )
```
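The `formatting_template` uses `${...}` placeholders in the style of Python's `string.Template`, filling in columns from the source row plus the current `${chunk}`. The substitution itself happens inside the vectorizer, but a small stdlib sketch shows the effect:

```python
from string import Template

# Same template as in the configuration above
template = Template("Title: ${title}\nContent: ${chunk}")

# Hypothetical source row and chunk, for illustration only
row = {"title": "Vector search in Postgres"}
chunk = "pgvector adds a vector column type..."

formatted = template.substitute(row, chunk=chunk)
print(formatted)
# Title: Vector search in Postgres
# Content: pgvector adds a vector column type...
```

Prepending row context like the title to every chunk this way can noticeably improve retrieval quality, since each embedded chunk carries its document's identity.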

# Alembic Integration

To create the vectorizer in the database, pgai provides two Alembic helpers:

## Creating a Vectorizer

Basic creation:

```python
from alembic import op
from pgai.configuration import EmbeddingConfig, ChunkingConfig


def upgrade():
    op.create_vectorizer(
        'blog_posts',
        embedding=EmbeddingConfig(
            model='text-embedding-3-small',
            dimensions=768
        ),
        chunking=ChunkingConfig(
            chunk_column='content',
            chunk_size=700
        ),
        formatting_template='Title: ${title}\nContent: ${chunk}'
    )
```

## Dropping a Vectorizer

```python
def downgrade():
    # Drop by ID
    op.drop_vectorizer(1, drop_all=True)
```

The `drop_all=True` parameter will also clean up the associated embedding table and view.

## Autogeneration Support

If you don't want to write the configuration twice, you can use Alembic's autogenerate feature to automatically detect changes between your SQLAlchemy models and the underlying database schema.

### Setup

To configure autogeneration, you need to import a custom comparison function and exclude pgai-managed models from Alembic's usual comparators via the `include_object` parameter:

```python
from alembic import context
from pgai.alembic import compare_vectorizers
from pgai.alembic import CreateVectorizerOp, DropVectorizerOp


# Make sure your env.py includes:
def run_migrations_online():
    with connectable.connect() as connection:
        context.configure(
            connection=connection,
            target_metadata=target_metadata,
            include_object=lambda obj, name, type_, reflected, compare_to:
                not obj.info.get("pgai_managed", False)
        )
```
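The `include_object` predicate does nothing more than filter out any schema object whose `info` dict carries the `pgai_managed` flag, so Alembic's default comparators leave pgai's generated tables and views alone. A standalone sketch of that logic, using a hypothetical stub in place of a real SQLAlchemy `Table`:

```python
# Stand-in for the SQLAlchemy Table/Column objects Alembic passes in;
# only the .info dict matters for this predicate
class StubObject:
    def __init__(self, info: dict):
        self.info = info

def include_object(obj, name, type_, reflected, compare_to):
    # Skip anything pgai marks as managed so Alembic's default
    # comparators don't try to drop or recreate it
    return not obj.info.get("pgai_managed", False)

print(include_object(StubObject({"pgai_managed": True}), "blog_embeddings", "table", False, None))  # False
print(include_object(StubObject({}), "blog_posts", "table", False, None))  # True
```

Objects that return `False` here are simply invisible to autogenerate; the `compare_vectorizers` import handles them instead.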

### Features

The autogeneration system will:

1. Detect new vectorizers defined in models and generate creation operations
2. Detect removed vectorizers and generate drop operations
3. Detect changes in vectorizer configuration and generate update operations (as drop + create)

Note: these operations are **not reversible** via `alembic downgrade`.

Example generated migration for a new vectorizer:

```python
def upgrade():
    op.create_vectorizer(
        'blog_posts',
        embedding=EmbeddingConfig(
            model='text-embedding-3-small',
            dimensions=768
        ),
        chunking=ChunkingConfig(
            chunk_column='content',
            chunk_size=500
        )
    )


def downgrade():
    op.drop_vectorizer(1)
```

When changes are made to a model's vectorizer configuration, the autogeneration will create appropriate migration operations:

```python
def upgrade():
    # Update vectorizer configuration
    op.drop_vectorizer(1, drop_objects=True)
    # Review comment (Collaborator): We should probably make it more clear that this
    # is a dangerous and destructive operation which requires re-embedding everything
    # again and paying $$. Perhaps just a code comment explaining this is sufficient.
    op.create_vectorizer(
        'blog_posts',
        embedding=EmbeddingConfig(
            model='text-embedding-3-large',  # Changed model
            dimensions=1536  # Changed dimensions
        ),
        chunking=ChunkingConfig(
            chunk_column='content',
            chunk_size=500
        )
    )
```
8 changes: 7 additions & 1 deletion projects/extension/Dockerfile

```diff
@@ -41,7 +41,6 @@ RUN set -e; \
     dpkg -i pgvectorscale-postgresql-${PG_MAJOR}_${PGVECTORSCALE_VERSION}-Linux_"$TARGET_ARCH".deb; \
     rm pgvectorscale-${PGVECTORSCALE_VERSION}-pg${PG_MAJOR}-"$TARGET_ARCH".zip pgvectorscale-postgresql-${PG_MAJOR}_${PGVECTORSCALE_VERSION}-Linux_"$TARGET_ARCH".deb

-
 ###############################################################################
 # image for use in testing the pgai library
 FROM base AS pgai-test-db
@@ -51,6 +50,13 @@ WORKDIR /pgai
 COPY . .
 RUN just build install

+# Create a custom config file in docker-entrypoint-initdb.d
+RUN mkdir -p /docker-entrypoint-initdb.d && \
+    echo "#!/bin/bash" > /docker-entrypoint-initdb.d/configure-timescaledb.sh && \
+    echo "echo \"shared_preload_libraries = 'timescaledb'\" >> \${PGDATA}/postgresql.conf" >> /docker-entrypoint-initdb.d/configure-timescaledb.sh && \
+    chmod +x /docker-entrypoint-initdb.d/configure-timescaledb.sh
+
+
 ###############################################################################
 # image for use in extension development
```
4 changes: 4 additions & 0 deletions projects/pgai/pgai/alembic/__init__.py

```python
from pgai.alembic.autogenerate import compare_vectorizers
from pgai.alembic.operations import CreateVectorizerOp, DropVectorizerOp

__all__ = ["CreateVectorizerOp", "DropVectorizerOp", "compare_vectorizers"]
```