feat: add alembic operations for vectorizer #266

Askir · 2024-12-02T10:03:17Z

This PR adds native Python operations to Alembic so you don't have to write SQL to create vectorizers.

cevian

I gotta say I'm not convinced by the arguments for using a separate set of models rather than reusing the models already in pgai/vectorizer or, at the very least, having both sets of models extend a base model. I think having two sets of models with similar params is really hard to maintain and means quite a bit of code duplication. I'd like some more eyes on this, though. Can James and/or Alejandro chime in here? In particular, I'd like us to consider three designs:

1. Simply extending the pydantic model we already have with optional fields that are present in either the stored JSON or needed for the Alembic stuff, plus some kind of wrappers to create the config objects in Alembic.
2. Factoring common data fields into base classes and using those as mixins (kinda like the ApiKeyMixin now).
3. Maybe I'm just being stubborn and we should have separate models, like Jascha has them now.

Leaving a few comments in, but I think this is the big issue we need to resolve.
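
For illustration only, the second design (common fields factored into a shared base class) could look roughly like the sketch below. It uses plain dataclasses as a stand-in for the pydantic models, and every class and field name in it is hypothetical rather than taken from pgai.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ChunkingFields:
    """Hypothetical fields shared by both sides, declared exactly once."""

    chunk_column: str
    chunk_size: int
    chunk_overlap: int


@dataclass
class WorkerChunking(ChunkingFields):
    """Worker-side config, parsed from stored JSON where the value is always set."""

    separator: str = "\n\n"


@dataclass
class MigrationChunking(ChunkingFields):
    """Migration-side config: None means 'omit the argument, let SQL apply its default'."""

    separator: Optional[str] = None


def shared_field_names() -> set[str]:
    """Names that only need to be edited in one place when the config grows."""
    return {f.name for f in fields(ChunkingFields)}


worker = WorkerChunking(chunk_column="body", chunk_size=800, chunk_overlap=400)
migration = MigrationChunking(chunk_column="body", chunk_size=800, chunk_overlap=400)
print(sorted(shared_field_names()))
```

Under this layout, a new shared field is added to ChunkingFields once and shows up in both subclasses; only fields whose optionality differs between the two sides need to be redeclared.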

projects/pgai/pgai/configuration.py

projects/pgai/pgai/alembic/operations.py

projects/pgai/pgai/configuration.py

docs/python-integration.md

projects/pgai/pgai/alembic/operations.py

projects/pgai/tests/vectorizer/extensions/fixtures/migrations/002_create_vectorizer.py.template

Askir

I have added base.py with shared pydantic classes and an optional @required decorator, so classes don't have to be redefined just to change which params are optional.
With this, adding a new config field to create_vectorizer should mostly mean editing the classes in base.py instead of having to look in two places.

I'm still not convinced that this is the better approach, but I don't feel strongly about it.

Askir · 2025-01-07T14:05:05Z

projects/extension/Dockerfile

+RUN mkdir -p /docker-entrypoint-initdb.d && \
+    echo "#!/bin/bash" > /docker-entrypoint-initdb.d/configure-timescaledb.sh && \
+    echo "echo \"shared_preload_libraries = 'timescaledb'\" >> \${PGDATA}/postgresql.conf" >> /docker-entrypoint-initdb.d/configure-timescaledb.sh && \
+    chmod +x /docker-entrypoint-initdb.d/configure-timescaledb.sh


I had to add this to be able to run create extension if not exists timescaledb. I'm not sure whether this is correct.

Askir · 2025-01-07T14:10:00Z

projects/pgai/pgai/vectorizer/chunking.py


    @cached_property
    def _chunker(self) -> CharacterTextSplitter:
        return CharacterTextSplitter(
-            separator=self.separator,
+            separator=self.separator,  # type: ignore


This type: ignore is now necessary because pyright does not know that the decorator makes the field required, so it complains about possibly passing None.
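
One possible alternative to a blanket type: ignore is to narrow the optional value explicitly before passing it on. The following is a minimal sketch under that assumption; require_value and SplitterConfig are hypothetical names and not part of pgai.

```python
from dataclasses import dataclass
from typing import Optional, TypeVar

T = TypeVar("T")


def require_value(value: Optional[T], name: str) -> T:
    """Return value unchanged, raising if it is None.

    The declared return type is T rather than Optional[T], so a type checker
    treats the result as non-optional without a type: ignore comment.
    """
    if value is None:
        raise ValueError(f"{name} is required but was not set")
    return value


@dataclass
class SplitterConfig:
    # Optional here because the (hypothetical) shared base class allows omission.
    separator: Optional[str] = None


config = SplitterConfig(separator="\n\n")
separator: str = require_value(config.separator, "separator")
```

The trade-off is an extra runtime check at each use site, in exchange for keeping the type checker active on that line.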

Askir · 2025-01-07T16:29:19Z

projects/pgai/pgai/vectorizer/base.py

+                new_fields[name] = new_field
+            else:
+                new_fields[name] = field
+        _cls.model_fields = new_fields


Oh... I'm trying to build a sample application right now. In this repo we use pydantic 2.9, where this works, but in pydantic 2.10 it already breaks:

  File "/Users/jascha/repositories/pgai/examples/discord_bot/.venv/lib/python3.12/site-packages/pgai/vectorizer/chunking.py", line 41, in <module>
    @required
     ^^^^^^^^
  File "/Users/jascha/repositories/pgai/examples/discord_bot/.venv/lib/python3.12/site-packages/pgai/vectorizer/base.py", line 97, in required
    return dec(cls)
           ^^^^^^^^
  File "/Users/jascha/repositories/pgai/examples/discord_bot/.venv/lib/python3.12/site-packages/pgai/vectorizer/base.py", line 94, in dec
    _cls.model_fields = new_fields
    ^^^^^^^^^^^^^^^^^
AttributeError: property 'model_fields' of 'ModelMetaclass' object has no setter

I'll have to go back through this and find another way, but either way this seems brittle. Pydantic is not really designed to allow such overrides: its idea is to use the type system to infer the validation logic, and overriding the types breaks with this declarative approach.
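
The failure mode in the traceback can be reproduced without pydantic. The sketch below uses a stand-in metaclass (ReadOnlyFieldsMeta, a hypothetical name) that exposes model_fields as a property with no setter, which is what the error message indicates; it is not pydantic's actual implementation.

```python
class ReadOnlyFieldsMeta(type):
    """Stand-in metaclass exposing model_fields as a read-only property."""

    @property
    def model_fields(cls) -> dict[str, object]:
        # Return a copy so callers cannot mutate the stored mapping either.
        return dict(cls.__dict__.get("_fields", {}))


class Model(metaclass=ReadOnlyFieldsMeta):
    _fields = {"separator": "optional"}


def try_override(model_cls: type) -> str:
    """Attempt the in-place override used by the decorator and report the result."""
    try:
        # Assignment on the class hits the metaclass property, which has no setter.
        model_cls.model_fields = {"separator": "required"}
    except AttributeError as exc:
        return f"failed: {exc}"
    return "ok"


print(try_override(Model))
```

Because a property on the metaclass is a data descriptor, assignment on the class is routed to it and rejected. A decorator that rewrites fields in place therefore depends on an implementation detail, whereas redeclaring a field with a non-optional annotation in a subclass stays within the declarative model.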

cevian

This is looking a lot better. Some questions remain, but I think this is on the right track.

cevian · 2025-01-07T16:29:38Z

docs/python-integration.md

+
+
+def downgrade() -> None:
+    op.drop_vectorizer(vectorizer_id=1, drop_all=True)


I think it would be better for this to take the target_table name rather than the vectorizer_id (which would probably not be known when writing the migration). The target table should be unique, so we should be able to look up the id from that.
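
A sketch of the lookup suggested here, resolving the id from the target table before dropping. It uses an in-memory SQLite table as a stand-in for the vectorizer catalog; the table layout, the column names, and the resolve_vectorizer_id helper are assumptions made for illustration, not pgai's actual schema or API.

```python
import sqlite3


def resolve_vectorizer_id(conn: sqlite3.Connection, target_table: str) -> int:
    """Look up the vectorizer id for a target table, requiring exactly one match."""
    rows = conn.execute(
        "SELECT id FROM vectorizer WHERE target_table = ?", (target_table,)
    ).fetchall()
    if len(rows) != 1:
        raise LookupError(
            f"expected exactly one vectorizer for {target_table!r}, found {len(rows)}"
        )
    return rows[0][0]


# Stand-in catalog with two vectorizers (hypothetical table names).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vectorizer (id INTEGER PRIMARY KEY, target_table TEXT)")
conn.executemany(
    "INSERT INTO vectorizer (id, target_table) VALUES (?, ?)",
    [(1, "blog_embedding_store"), (2, "docs_embedding_store")],
)

vectorizer_id = resolve_vectorizer_id(conn, "docs_embedding_store")
print(vectorizer_id)
```

Failing loudly when zero or several rows match keeps a downgrade from silently dropping the wrong vectorizer if the uniqueness assumption ever stops holding.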

cevian · 2025-01-07T16:31:15Z

projects/extension/Dockerfile

+RUN mkdir -p /docker-entrypoint-initdb.d && \
+    echo "#!/bin/bash" > /docker-entrypoint-initdb.d/configure-timescaledb.sh && \
+    echo "echo \"shared_preload_libraries = 'timescaledb'\" >> \${PGDATA}/postgresql.conf" >> /docker-entrypoint-initdb.d/configure-timescaledb.sh && \
+    chmod +x /docker-entrypoint-initdb.d/configure-timescaledb.sh


why did you need timescaledb for this pr? This is a dev image so this is fine I'm just curious

cevian · 2025-01-07T16:35:24Z

projects/pgai/pgai/alembic/configuration.py

+        # Get all fields including from parent classes
+        params = {}
+        for field_name, _field in self.model_fields.items():  # type: ignore
+            if field_name != "arg_type":


Is the function_name field included then? How does that work?
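
One way to keep metadata such as arg_type and function_name out of the rendered parameters is to store them as class-level attributes rather than as model fields. The sketch below shows that idea with dataclasses standing in for the pydantic models. The format_sql_params body is an assumed implementation (only the name appears in the diff), and EmbeddingOpenAI with its fields is a hypothetical example class.

```python
from dataclasses import dataclass, fields
from typing import ClassVar, Optional


def format_sql_params(params: dict[str, object]) -> str:
    """Render a dict as SQL named arguments (assumed behavior, not the PR's code)."""
    rendered = []
    for key, value in params.items():
        if isinstance(value, bool):
            literal = "true" if value else "false"
        elif isinstance(value, str):
            # Double single quotes so the value is a valid SQL string literal.
            literal = "'" + value.replace("'", "''") + "'"
        else:
            literal = str(value)
        rendered.append(f"{key} => {literal}")
    return ", ".join(rendered)


@dataclass
class EmbeddingOpenAI:
    # ClassVar attributes are not dataclass fields, so fields() never yields them.
    arg_type: ClassVar[str] = "embedding"
    function_name: ClassVar[str] = "embedding_openai"

    model: str
    dimensions: int
    api_key_name: Optional[str] = None

    def to_sql_argument(self) -> str:
        """Render only the fields that were set; unset ones fall back to SQL defaults."""
        params = {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }
        return f", {self.arg_type} => ai.{self.function_name}({format_sql_params(params)})"


sql = EmbeddingOpenAI(model="text-embedding-3-small", dimensions=768).to_sql_argument()
print(sql)
```

With this split there is no need for a name-based exclusion such as the check against "arg_type", since metadata never enters the field iteration in the first place.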

cevian · 2025-01-07T16:46:19Z

projects/pgai/pgai/alembic/configuration.py

+        return f", {self.arg_type} => ai.{fn_name}({format_sql_params(params)})"  # type: ignore
+
+
+class OpenAIConfig(BaseOpenAIConfig, SQLArgumentMixin):


Naming (and I know naming discussions are always annoying): why not stick to the SQL convention we established and name this EmbeddingOpenAIConfig or EmbeddingConfigOpenAI (and similarly for the others)? The pro is that the name translation from SQL to Python is super easy, which I think would be easier to understand. The con is that it's long.

Otherwise the translation seems a bit ad hoc. E.g. indexing configs have "indexing" in the name, but in a different spot than in the SQL. Let's think about this some more.

cevian · 2025-01-07T16:47:10Z

projects/pgai/pgai/alembic/configuration.py

@@ -0,0 +1,227 @@
+import re


To my eye this is a huge improvement from before

cevian · 2025-01-07T16:50:47Z

projects/pgai/pgai/vectorizer/chunking.py

@@ -33,7 +38,8 @@ def into_chunks(self, item: dict[str, Any]) -> list[str]:
        """


-class LangChainCharacterTextSplitter(BaseModel, Chunker):
+@required


I only see this used in two places. Don't we need it on more models?

cevian · 2025-01-07T16:54:43Z

docs/python-integration.md

@@ -164,7 +164,7 @@ for post, embedding in results:



I think we also need to add docs to adding-embedding-integration.md

Askir force-pushed the jascha/add-alembic-migration-ops branch from c899380 to fd9f1bc Compare December 2, 2024 10:08

Askir mentioned this pull request Dec 2, 2024

feat: SQLAlchemy and alembic integration #208

Closed

Askir force-pushed the jascha/add-alembic-migration-ops branch from fd9f1bc to 6f5ff59 Compare December 3, 2024 13:37

Askir marked this pull request as ready for review December 3, 2024 23:16

Askir requested a review from a team as a code owner December 3, 2024 23:16

Askir force-pushed the jascha/add-vectorizer-field branch from 8742af8 to 36cf4d9 Compare December 4, 2024 13:13

Askir force-pushed the jascha/add-alembic-migration-ops branch from 6f5ff59 to 525ab5b Compare December 4, 2024 13:20

cevian requested changes Dec 4, 2024

JamesGuthrie reviewed Dec 5, 2024

Askir force-pushed the jascha/add-vectorizer-field branch 10 times, most recently from 3b47afc to 8fe145e Compare December 12, 2024 13:46

Askir force-pushed the jascha/add-alembic-migration-ops branch from 525ab5b to 5e76cf9 Compare December 12, 2024 16:44

Askir commented Dec 13, 2024

Askir force-pushed the jascha/add-vectorizer-field branch from 8fe145e to 882f91e Compare December 19, 2024 11:40

Base automatically changed from jascha/add-vectorizer-field to main December 19, 2024 12:32

Askir force-pushed the jascha/add-alembic-migration-ops branch 7 times, most recently from c90ae69 to 7b90575 Compare January 7, 2025 13:57

Askir force-pushed the jascha/add-alembic-migration-ops branch from 7b90575 to 447078f Compare January 7, 2025 14:07

Askir commented Jan 7, 2025

Askir force-pushed the jascha/add-alembic-migration-ops branch from 447078f to 828347f Compare January 7, 2025 14:14

Askir added 9 commits January 7, 2025 15:14

feat: add alembic operations for vectorizer

522850a

chore: cleanup set up of operations

0ea53e4

chore: add shared base class

5675809

docs: update docs

149cac8

chore: unify sql generation

f274f3d

chore: add more test cases

04bfae6

chore: simplify code and tests a bit

cd42edb

chore: use shared base classes, make use of more optional params

5dff77c

chore: revert dockerfile change

ff4c8dc

Askir force-pushed the jascha/add-alembic-migration-ops branch from 828347f to e5e4614 Compare January 7, 2025 14:16

chore: move configuration to alembic package

7bfedb3

Askir force-pushed the jascha/add-alembic-migration-ops branch from e5e4614 to 7bfedb3 Compare January 7, 2025 14:19

Askir requested a review from cevian January 7, 2025 14:22

Askir commented Jan 7, 2025

cevian requested changes Jan 7, 2025
