
ref(ai-autofix): Better design for document chunk models #276

Merged
merged 1 commit into main on Mar 3, 2024

Conversation

jennmueng (Member)

Attempts to fix TIMESERIES-ANALYSIS-SERVICE-2S

Refactors the document chunk models so that the base model (now renamed BaseDocumentChunk) no longer needs a repo_id: chunks with embeddings are now EmbeddedDocumentChunk, and chunks retrieved from the database are StoredDocumentChunk.

This way it is clearer that you're working with a StoredDocumentChunk when retrieving from the database, while the former two models are transitional and only used while processing chunks.
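
For illustration, a minimal sketch of the resulting hierarchy, assuming Pydantic-style models; the inheritance chain and every field other than repo_id are assumptions, not taken from the diff:

from pydantic import BaseModel


class BaseDocumentChunk(BaseModel):
    # Transitional: a raw chunk produced while processing a file.
    # No repo_id needed at this stage.
    path: str       # assumed field
    content: str    # assumed field


class EmbeddedDocumentChunk(BaseDocumentChunk):
    # Transitional: a chunk that has been run through the embedding model.
    embedding: list[float]  # assumed field


class StoredDocumentChunk(EmbeddedDocumentChunk):
    # The only model tied to a database row, and the only place
    # repo_id lives; this is what a db query hands back.
    id: int         # assumed field
    repo_id: int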


# Hydrate the domain model from its database row.
repo_info = RepositoryInfo.from_db(db_repo_info)

# Convert each embedded chunk to its DB model and stage every row for insert.
db_chunks = [chunk.to_db_model() for chunk in embedded_chunks]
session.add_all(db_chunks)
Contributor

i'm wondering if this should be batched? not sure how large of an operation it has to be before we care about that, but i can imagine this potentially being a very large transaction in the case of a large repo

jennmueng (Member, Author)

@corps said today that it shouldn't be an issue at all; he's worked with millions of inserts before. And it's actually good that if one insert fails the whole transaction fails, rather than just a single batch. We don't want repos with partially missing data.

Contributor

FWIW I'm pretty sure there is a way to batch the operation while still treating it as an atomic transaction (e.g. if it fails halfway through, everything is rolled back). At the very least it might be worth adding a TODO here to revisit this. (Also worth noting that each row here is very large relative to a typical pg operation, so it may take significantly fewer rows than usual before it's problematic.)
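
For reference, batching inside a single transaction could look like this in SQLAlchemy (the batch size, engine, and surrounding names are illustrative assumptions, not from this PR): each flush() sends rows to Postgres without committing, so a failure anywhere rolls the whole thing back.

from sqlalchemy.orm import Session

BATCH_SIZE = 512  # illustrative; tune against the large row size noted above

with Session(engine) as session, session.begin():
    for i in range(0, len(db_chunks), BATCH_SIZE):
        session.add_all(db_chunks[i : i + BATCH_SIZE])
        # flush() issues the INSERTs for this batch without committing,
        # so an error in any batch rolls back the entire transaction.
        session.flush()
# session.begin() commits here only if every batch succeeded.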

trillville (Contributor) left a comment:

LGTM overall, had one question.

jennmueng merged commit df843e2 into main on Mar 3, 2024 (3 checks passed).
jennmueng deleted the jenn/autofix/doc-chunk-fix branch on March 3, 2024 at 01:49.