
ref(ai-autofix): Better design for document chunk models #276

Merged
merged 1 commit into main on Mar 3, 2024

Conversation

jennmueng (Member)

Attempts to fix TIMESERIES-ANALYSIS-SERVICE-2S

Refactors the document chunk models so that the base model (now renamed BaseDocumentChunk) no longer needs a repo_id: chunks with embeddings are now EmbeddedDocumentChunk, and chunks retrieved from the database are StoredDocumentChunk.

This way it is clearer that you're working with a StoredDocumentChunk when retrieving from the database, while the former two models are transitional and only used while processing chunks.
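
For illustration, a minimal sketch of the resulting hierarchy, assuming Pydantic-style models; the inheritance chain and every field other than repo_id are assumptions, not taken from the diff:

from pydantic import BaseModel


class BaseDocumentChunk(BaseModel):
    # Transitional: a raw chunk produced while processing a file.
    # No repo_id needed at this stage.
    path: str       # assumed field
    content: str    # assumed field


class EmbeddedDocumentChunk(BaseDocumentChunk):
    # Transitional: a chunk that has been run through the embedding model.
    embedding: list[float]  # assumed field


class StoredDocumentChunk(EmbeddedDocumentChunk):
    # The only model tied to a database row, and the only place
    # repo_id lives; this is what a db query hands back.
    id: int         # assumed field
    repo_id: int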


# Hydrate the domain model from its database row.
repo_info = RepositoryInfo.from_db(db_repo_info)

# Convert each embedded chunk to its DB model and stage every row for insert.
db_chunks = [chunk.to_db_model() for chunk in embedded_chunks]
session.add_all(db_chunks)
Contributor

i'm wondering if this should be batched? not sure how large of an operation it has to be before we care about that, but i can imagine this potentially being a very large transaction in the case of a large repo

jennmueng (Member, Author)

@corps said today that it shouldn't be an issue at all; he's worked with millions of inserts before. And it's actually good that if one insert fails the whole transaction fails, rather than just a single batch. We don't want repos with partially missing data.

Contributor

FWIW I'm pretty sure there is a way to batch the operation while still treating it as an atomic transaction (e.g. if it fails halfway through, everything is rolled back). At the very least it might be worth adding a TODO here to revisit this. (Also worth noting that each row here is very large relative to a typical pg operation, so it may take significantly fewer rows than usual before it's problematic.)
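
For reference, batching inside a single transaction could look like this in SQLAlchemy (the batch size, engine, and surrounding names are illustrative assumptions, not from this PR): each flush() sends rows to Postgres without committing, so a failure anywhere rolls the whole thing back.

from sqlalchemy.orm import Session

BATCH_SIZE = 512  # illustrative; tune against the large row size noted above

with Session(engine) as session, session.begin():
    for i in range(0, len(db_chunks), BATCH_SIZE):
        session.add_all(db_chunks[i : i + BATCH_SIZE])
        # flush() issues the INSERTs for this batch without committing,
        # so an error in any batch rolls back the entire transaction.
        session.flush()
# session.begin() commits here only if every batch succeeded.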

trillville (Contributor) left a comment:

LGTM overall, had one question.

jennmueng merged commit df843e2 into main on Mar 3, 2024 (3 checks passed).
jennmueng deleted the jenn/autofix/doc-chunk-fix branch on March 3, 2024 at 01:49.