-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: SQLAlchemy and alembic integration #208
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a very brief pass over the files, left some comments. I'll try to do a more through review later.
projects/pgai/pyproject.toml
Outdated
"sqlalchemy>=2.0.36", | ||
"psycopg2>=2.9.10", | ||
"alembic>=1.14.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add these extensions as extras
, that way users will do pip install pgai[sqlalquemy]
. If we add Django support next, then every users will have sqlalchemy and Django as deps, even if they just want to use the Vectorizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes for sure.
Question is a bit if the vectorizer itself should also live in its own extra? I feel like you wont need that in your core python e.g. fastapi application.
Almost an argument for a separate package but I think we want to make use of the pgai brand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little on the fence about naming this directory extensions
, I don't want people to get confused with the db extension
. Maybe I'm overreacting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: what do you think about these class definitions to a non __init__.py
file. I think putting them inside the __init__.py
makes them less discoverable while browsing the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could also just have pgai.sqlalchemy
and pgai.alembic
? What do you think about that?
pgvector-python doesn't have an extension subpackage either.
func.ai.openai_embed( | ||
'text-embedding-3-small', | ||
query_text, | ||
text('dimensions => 768') | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we reference this from the Vectorizer config? That way we can remove the cognitive load on the user. Maybe that's what @cevian was referring when talking about improving the dev experience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is just plain sqlalchemy core right now, this works even without my extension code. We can add custom sql alchemy functions. e.g. so you don't have to wrap the dimension paramter in a text
, I guess I could also run a subquery maybe in such a function. We can definitely provide a python helper that references the underlying vectorizer or loads it from the python config.
Mats idea was to do this on sql level though and then it's automatically also available in python so we safe some effort there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we can improve this UX. But I also think we can do that in a separate PR
|
||
### Model Relationships | ||
|
||
You can optionally create SQLAlchemy relationships between your model and its embeddings: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth it to mention that we expose the relationship via the view that's created. I don't know much about sqlalchemy, so maybe this relationship
way is better than querying the view directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to look up exactly how relationships work, but generally speaking it's just an eager join that enables the related objects to be loaded as a list on the parent.
Mat mentioned that he's, understandably a bit skeptical about automatic join, because you can run into N+1 problems depending on how it is configured, which is why I made it optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can set up a relationship but not have it do the eager join by default?
84a27f4
to
19ed565
Compare
def _build_hnsw_indexing_params(config: HNSWIndexingConfig) -> str: | ||
"""Build HNSW indexing configuration parameters.""" | ||
params = [] | ||
if config.min_rows is not None: | ||
params.append(f"min_rows=>{config.min_rows}") | ||
if config.opclass is not None: | ||
params.append(f"opclass=>'{config.opclass}'") | ||
if config.m is not None: | ||
params.append(f"m=>{config.m}") | ||
if config.ef_construction is not None: | ||
params.append(f"ef_construction=>{config.ef_construction}") | ||
if config.create_when_queue_empty is not None: | ||
params.append(f"create_when_queue_empty=>{str(config.create_when_queue_empty).lower()}") | ||
return f"ai.indexing_hnsw({', '.join(params)})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I am considering is to make the dataclass config objects smarter:
Currently there is multiple places where logic for each config is placed:
- rendering to SQL (here)
- rendering to python (in autogenerate.py)
- loading from the db state (in compare_vectorizers)
- Storing configuration in the metadata info dict on the vectorizer field
I think if I make these config objects "smarter" and add a few functions:
from_db_state
to_sql()
to_python
or just__repr__
?
For each one. It should make keeping this in sync with our actual underlying db code a lot easier. Since you'll only have to touch one spot in the code and not understand how everything works?
Honestly I am a bit disappointed by alembic, i would have thought that at least the to_python
part is covered by the library generically already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this actually kind of overlaps with the pydantic models we have in the vectorizer part of the library. Maybe there is also a way to re-use that config. But e.g. IndexingConfig
does not exist there and needs to be loaded separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I strongly encourage us to have one set of dataclass objects across all of our libraries. That way we have code reuse and only one place to add things. (and I expect we'll constantly be adding things). My mental model would be one set of dataclass objects (probably pydantic based) that's used (imported) by both vectorizer and these kind of integrations. So that could include things like IndexingConfig
that's used by only some parts of the code and not others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might actually be nice to have access to them in the extension too. I could use the pydantic models to validate the jsonb representations. But I don't know how cumbersome it would be to do this.
else: | ||
# Check for configuration changes | ||
existing_config = existing_vectorizers[table_name] | ||
if _config_has_changed(model_config, existing_config): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change detection part doesn't work yet, no need to review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is great. Straightforward and helpful approach. Left a few comments. The most important one is about trying to get common, reusable data classes between all of our components
def _build_hnsw_indexing_params(config: HNSWIndexingConfig) -> str: | ||
"""Build HNSW indexing configuration parameters.""" | ||
params = [] | ||
if config.min_rows is not None: | ||
params.append(f"min_rows=>{config.min_rows}") | ||
if config.opclass is not None: | ||
params.append(f"opclass=>'{config.opclass}'") | ||
if config.m is not None: | ||
params.append(f"m=>{config.m}") | ||
if config.ef_construction is not None: | ||
params.append(f"ef_construction=>{config.ef_construction}") | ||
if config.create_when_queue_empty is not None: | ||
params.append(f"create_when_queue_empty=>{str(config.create_when_queue_empty).lower()}") | ||
return f"ai.indexing_hnsw({', '.join(params)})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I strongly encourage us to have one set of dataclass objects across all of our libraries. That way we have code reuse and only one place to add things. (and I expect we'll constantly be adding things). My mental model would be one set of dataclass objects (probably pydantic based) that's used (imported) by both vectorizer and these kind of integrations. So that could include things like IndexingConfig
that's used by only some parts of the code and not others.
title = Column(Text, nullable=False) | ||
content = Column(Text, nullable=False) | ||
|
||
content_embeddings = VectorizerField( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a great interface. One nit VectorizerField
or Vectorizer
? In my mind this isn't really a field...
func.ai.openai_embed( | ||
'text-embedding-3-small', | ||
query_text, | ||
text('dimensions => 768') | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we can improve this UX. But I also think we can do that in a separate PR
|
||
### Model Relationships | ||
|
||
You can optionally create SQLAlchemy relationships between your model and its embeddings: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can set up a relationship but not have it do the eager join by default?
```python | ||
def upgrade(): | ||
# Update vectorizer configuration | ||
op.drop_vectorizer(1, drop_objects=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably make it more clear this is a dangerous and destructive operations which requires re-embedding everything again and paying $$. Perhaps just a code comment explaining this is sufficient.
) | ||
|
||
|
||
def _config_has_changed( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this can be as easy as rendering the two configs as create_vectorizer calls and doing a string diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only got partway through. Will try to look at the rest after lunch.
text('dimensions => 768') | ||
) | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably ought to have a limit
clause here
projects/pgai/pgai/configuration.py
Outdated
|
||
|
||
@dataclass | ||
class EmbeddingConfig: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll need an EmbeddingConfigOpenAI
and EmbeddingConfigOllama
, right?
|
||
|
||
@dataclass | ||
class ChunkingConfig: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we need a ChunkingConfigCharacterTextSplitter
and a ChunkingConfigRecursiveCharacterTextSplitter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm only using one right now, because the recursive character text splitter is essentially the same as the character one if you only provide one splitting point right?
Maybe still good to support both for completeness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I'm also just wondering what will happen from a naming convention when we add a third chunking strategy
|
||
|
||
@dataclass | ||
class SchedulingConfig: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be SchedulingConfigTimescaledb
or something similar. There's a non-zero chance we'll add support for pgcron back as an alternative.
"""Base type for embedding models with required attributes""" | ||
|
||
embedding_uuid: Mapped[str] | ||
id: Mapped[int] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aren't constrained to an integer id
foreign key. How should we handle that? Source tables may have compound primary keys of various types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh you're right. I'll try make this a generic, but I'm a bit scared of breaking strict typing. I don't want the embedding class to have a possibly infinite amount of Any fields.
274665d
to
51c4cb9
Compare
8acabab
to
e64a08e
Compare
Adds Python integration for pgai that allows defining and managing vectorizers through SQLAlchemy models and Alembic migrations. The integration provides a declarative interface for vector embeddings and automated migration support.
Features:
Known missing features:
ai.xxx
functionsChunking configuration is not separated for character and recursive splitter (will fix)Scheduling config only supports default and specific scheduling not no scheduling right now (will fix)How to review:
sqlalchemy/__init__.py
: Contains the VectorizerField that defines sqlalchemy embedding models and registers the configuration for alembic.alembic/autogenerate.py
: Contains the comparison functions as well as the python code generation.alemic/operations.py
: Contains the SQL to be executed (and some alembic boilerplate).configuration.py
: Contains dataclasses that map to all the sql parameters for create_vectorizer which act as the input for both the VectorizerField as well as the alembic create_vectorizer and drop_vectorizer operations.