Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Index markdown in pgvector #392
base: master
Are you sure you want to change the base?
feat: Index markdown in pgvector #392
Changes from 16 commits
bfa44b5
3243fd5
e914cba
f305ef8
b3b3175
fe25b17
b531bdd
2a43a75
c17e9a9
b6fa280
c08658f
5c3aff8
57014d2
f60d009
4266bbe
9a32436
95e925a
72ba300
6781ec3
5c3458c
40709be
7fba78d
ce3945b
ae7718d
3c74464
e32f721
30c2d01
fade2fe
acb34af
b7aa295
bd0b265
3eb579f
6ad22f1
15b2909
6817d4b
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
Large diffs are not rendered by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost exactly right! I prefer it if builders do not do IO if they can avoid it, for multiple reasons. In this case, that also has the benefit of being able to connect lazilly and hiding the details of the connection pool.
i.e. the builder api like:
And then in
PgVector::setup
(which is only called once):There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@timonv, I'm looking for your input on a design choice here.
If we decide to handle the database connection pool setup within
fn setup(&self)
instead ofPgVectorBuilder
, we'll need to mutatePgVector
withinfn setup()
. This change would mean updating the function signature intrait Persist
to:For example:
This adjustment would introduce breaking changes across the stack, particularly impacting:
swiftide-indexing/src/persist/memory_storage.rs
swiftide-integrations/src/lancedb/persist.rs
swiftide-integrations/src/qdrant/persist.rs
swiftide-integrations/src/redis/persist.rs
Would you prefer moving the IO operations into
Persist::setup()
for these components? If so, we could handle this as a separate PR to streamline the updates.Looking forward to your thoughts!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value isn't optional, is it? What happens if it is None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify: this parameter can’t be
None
because if it is, the user won’t be able to build thePgVector::fields
parameter, andPgVectorBuilder::with_vector()
would fail. We’re ensuring that users configure this parameter correctly before launching the indexing pipeline.I’d love to hear your thoughts and any suggestions for enhancing this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it intentional that this value is optional and can be None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the
batch_size
configuration, I decided to keep it optional, drawing inspiration from the reference Qdrant implementation. By default, the batch size for PGVector is set to50
—though we still plan to fine-tune this for optimal performance.To answer your question, "Is it intentional that this value is optional and can be None?"—yes, it’s designed that way. In the indexing pipeline,
Pipeline::then_store_with()
functions as an adapter, routingNode Streams
to backend storage depending on the batch setting.Here’s the approach:
Persist::batch_size()
returnsNone
, each node is processed individually, withPersist::store()
sending each chunk to the backend.Persist::batch_size()
returnsSome()
, batch processing is enabled. The stream of nodes is grouped into chunks based on the batch size, andPersist::batch_store()
sends these batches to storage.Does this implementation align with requirements? Let me know if there’s anything specific you’d like adjusted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this do a query? Debug is called extensively with
tracing
. Which would mean (I think?) a query on every log / trace statement.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See other comment. Changing that and moving it to the main struct will also remove the need to clone, unwrap and option.