feat: Index markdown in pgvector #392
base: master
- Prototype pipeline for loading, chunking, enhancing, embedding, and storing markdown content in pgvector. Signed-off-by: shamb0 <r.raajey@gmail.com>
ccb90af
to
bfa44b5
Compare
Nice job, this already looks very solid! I have a couple of comments and change requests. Notably, the N+1 is important.
Additionally, I'm also missing tests. Could you add tests for Persist, batch and non batch? For lance and qdrant I used testcontainers, see theirs for an example, it works pretty well. They're in their retrieve implementation iirc.
For the example test dataset, could you remove that and use the root readme instead? For the docker compose yml, I'm not sure, either we should have one at the root, leave it out, or use testcontainers in examples as well.
Really nice job on this, it's looking good!
PgPoolOptions::new()
    .max_connections(connection_max.unwrap_or(PG_POOL_MAX_CONN))
    .connect(url.as_ref())
    .await?,
Instead of having this here, you could create a default_connection_pool method, and have that as the default build_fn for the builder (in the attr macro). Then this method could build and connect. An added benefit is that PgVector would then not need an Option anymore, just a PgPool.
Thanks for the suggestion! I’ve refactored the code accordingly, but please let me know if I’ve misunderstood anything.
I’ve wrapped PgPool inside the PgDBConnectionPool tuple struct and made connection_pool non-optional. The default is now handled via the builder macro with PgDBConnectionPool::default(), and the connection to the Postgres server is established through the try_connect_to_pool() method.
#[builder(default = "PgDBConnectionPool::default()")]
connection_pool: PgDBConnectionPool,
Let me know if you think further improvements are needed.
Yes exactly, almost! I've left some comments
… 1.16.2 (bosun-ai#399) Bumps [SethCohen/github-releases-to-discord](https://github.com/sethcohen/github-releases-to-discord) from 1.15.1 to 1.16.2. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Hopefully fix cargo not updating
Ensure swiftide\* crates are always included. Additionally, these are the default members, excluding examples and benches.
## 🤖 New release * `swiftide`: 0.13.3 -> 0.13.4 (✓ API compatible changes) * `swiftide-core`: 0.13.3 -> 0.13.4 * `swiftide-indexing`: 0.13.3 -> 0.13.4 (✓ API compatible changes) * `swiftide-macros`: 0.13.3 -> 0.13.4 * `swiftide-integrations`: 0.13.3 -> 0.13.4 * `swiftide-query`: 0.13.3 -> 0.13.4 <details><summary><i><b>Changelog</b></i></summary><p> ## `swiftide` <blockquote> ## [0.13.4](bosun-ai/swiftide@v0.13.3...v0.13.4) - 2024-10-21 ### Bug fixes - [47455fb](bosun-ai@47455fb) *(indexing)* Visibility of ChunkMarkdown builder should be public - [2b3b401](bosun-ai@2b3b401) *(indexing)* Improve splitters consistency and provide defaults ([bosun-ai#403](bosun-ai#403)) **Full Changelog**: bosun-ai/swiftide@0.13.3...0.13.4 </blockquote> </p></details> --- This PR was generated with [release-plz](https://github.com/MarcoIeni/release-plz/).
- added Postgres test_util, - completed unit tests for persist and retrieval Signed-off-by: shamb0 <r.raajey@gmail.com>
e307ba2
to
4266bbe
Compare
@timonv, Thanks a lot! I really appreciate your detailed feedback. I’ve addressed most of the points, but I’d like some clarification on your comment about the I’ve also added unit tests, following the primary use cases in LanceDB, and included tests for |
Hey @shamb0,
Getting there! I've provided some more feedback. I still need to run the example to see how it performs, will do that this week.
// Find a free port on the host for Postgres to use
let host_port = portpicker::pick_unused_port().expect("No available free port on the host");

let postgres = testcontainers::GenericImage::new("ankane/pgvector", "v0.5.1")
You should use pgvector/pgvector; ankane's hasn't been updated for a year and is unofficial.
Sounds good. I’ve migrated to the latest image, pgvector/pgvector:pg17:
let postgres = testcontainers::GenericImage::new("pgvector/pgvector", "pg17")
Let me know if there’s anything else you’d suggest here!
.with_mount(Mount::bind_mount(
    temp_data_bind_path,
    "/var/lib/postgresql/data",
))
You might not need the bind mount, or do you want to persist data across tests? We're currently having issues with GitHub runner limitations, disk space being one, so we need to be a bit careful.
Given the resource constraints in the CI/CD build context, Mount::tmpfs_mount() seems like a promising choice since the volume resource is released once the container is stopped.
let postgres = testcontainers::GenericImage::new("pgvector/pgvector", "pg17")
    .with_wait_for(WaitFor::message_on_stdout(
        "database system is ready to accept connections",
    ))
    .with_mapped_port(host_port, 5432.tcp())
    .with_env_var("POSTGRES_USER", "myuser")
    .with_env_var("POSTGRES_PASSWORD", "mypassword")
    .with_env_var("POSTGRES_DB", "mydatabase")
    .with_mount(Mount::tmpfs_mount("/var/lib/postgresql/data"))
    .start()
    .await
    .expect("Failed to start Postgres container");
Does this approach fit the context, or would you suggest any adjustments?
.with_metadata(METADATA_QA_TEXT_NAME)
.table_name("swiftide_pgvector_test".to_string())
.build()
.unwrap();
Almost exactly right! I prefer it if builders do not do IO if they can avoid it, for multiple reasons. In this case, that also has the benefit of being able to connect lazily and hiding the details of the connection pool.
I.e. a builder API like:
let pgv_storage = PgVector::builder()
    .database_url(pgv_db_url)
    .pool_size(10) // With a sane default if omitted
    .vector_size(384)
    .with_vector(EmbeddedField::Combined)
    .with_metadata(METADATA_QA_TEXT_NAME)
    .table_name("swiftide_pgvector_test".to_string())
    .build()
    .unwrap();
And then in PgVector::setup (which is only called once):
async fn setup(&self) -> Result<()> {
self.try_connect_to_pool(self.database_url, self.pool_size).await?;
...
@timonv, I'm looking for your input on a design choice here.
If we decide to handle the database connection pool setup within fn setup(&self) instead of PgVectorBuilder, we'll need to mutate PgVector within fn setup(). This change would mean updating the function signature in trait Persist to:
async fn setup(&mut self) -> Result<()>
For example:
async fn setup(&mut self) -> Result<()> {
self.connection_pool = self.try_connect_to_pool(self.database_url, self.pool_size).await?;
...
}
This adjustment would introduce breaking changes across the stack, particularly impacting:
swiftide-indexing/src/persist/memory_storage.rs
swiftide-integrations/src/lancedb/persist.rs
swiftide-integrations/src/qdrant/persist.rs
swiftide-integrations/src/redis/persist.rs
Would you prefer moving the IO operations into Persist::setup() for these components? If so, we could handle this as a separate PR to streamline the updates.
Looking forward to your thoughts!
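One way to connect lazily without changing the trait signature is interior mutability. The sketch below is only an illustration under stated assumptions: `FakePool`, `Storage`, and `connect` are hypothetical stand-ins (not swiftide or sqlx types), it is synchronous, and it uses `std::sync::OnceLock` so that `setup` can keep taking `&self`.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a real connection pool (e.g. a Postgres pool).
struct FakePool {
    url: String,
    max_connections: u32,
}

/// Hypothetical storage type: the builder would only fill in the config fields.
struct Storage {
    database_url: String,
    pool_size: u32,
    /// Empty until `setup` runs; can be written through a shared reference.
    pool: OnceLock<FakePool>,
}

/// Hypothetical connect function; a real one would do network IO.
fn connect(url: &str, max_connections: u32) -> Result<FakePool, String> {
    if url.is_empty() {
        return Err("empty database URL".to_string());
    }
    Ok(FakePool {
        url: url.to_string(),
        max_connections,
    })
}

impl Storage {
    /// No IO happens here, only configuration is stored.
    fn new(database_url: impl Into<String>, pool_size: u32) -> Self {
        Storage {
            database_url: database_url.into(),
            pool_size,
            pool: OnceLock::new(),
        }
    }

    /// Connects on the first call; later calls return early. Takes `&self`.
    fn setup(&self) -> Result<(), String> {
        if self.pool.get().is_some() {
            return Ok(());
        }
        let pool = connect(&self.database_url, self.pool_size)?;
        // If another thread initialized the pool first, keep theirs and drop ours.
        let _ = self.pool.set(pool);
        Ok(())
    }

    /// Returns the pool if `setup` has already succeeded.
    fn pool(&self) -> Option<&FakePool> {
        self.pool.get()
    }
}
```

With this shape, existing `Persist` implementations would not need a `&mut self` signature; whether that trade-off suits the project is a design decision for the maintainers.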
/// Default sizes of vectors. Vectors can also be of different
/// sizes by specifying the size in the vector configuration.
vector_size: Option<i32>,
This value isn't optional, is it? What happens if it is None?
Just to clarify: this parameter can’t be None because if it is, the user won’t be able to build the PgVector::fields parameter, and PgVectorBuilder::with_vector() would fail. We’re ensuring that users configure this parameter correctly before launching the indexing pipeline.
I’d love to hear your thoughts and any suggestions for enhancing this approach.
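If the value is effectively required, one option is to keep the `Option` only inside the builder and make the built struct hold a plain value. The following is a minimal sketch with hypothetical names (`VectorConfig`, `VectorConfigBuilder`), not the PR's actual builder, which is generated by a derive macro.

```rust
/// Hypothetical built configuration: `vector_size` is not optional here.
#[derive(Debug, PartialEq)]
struct VectorConfig {
    vector_size: i32,
    table_name: String,
}

/// Hypothetical builder: fields are optional only while building.
#[derive(Default)]
struct VectorConfigBuilder {
    vector_size: Option<i32>,
    table_name: Option<String>,
}

impl VectorConfigBuilder {
    fn vector_size(mut self, size: i32) -> Self {
        self.vector_size = Some(size);
        self
    }

    fn table_name(mut self, name: &str) -> Self {
        self.table_name = Some(name.to_string());
        self
    }

    /// Fails early if a required field is missing or invalid.
    fn build(self) -> Result<VectorConfig, String> {
        let vector_size = self
            .vector_size
            .ok_or_else(|| "vector_size is required".to_string())?;
        if vector_size <= 0 {
            return Err("vector_size must be positive".to_string());
        }
        let table_name = self
            .table_name
            .ok_or_else(|| "table_name is required".to_string())?;
        Ok(VectorConfig {
            vector_size,
            table_name,
        })
    }
}
```

A missing `vector_size` then surfaces as a build error rather than as a `None` that later code has to handle.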
/// Batch size for storing nodes.
#[builder(default = "Some(DEFAULT_BATCH_SIZE)")]
batch_size: Option<usize>,
Is it intentional that this value is optional and can be None?
For the batch_size configuration, I decided to keep it optional, drawing inspiration from the reference Qdrant implementation. By default, the batch size for PGVector is set to 50, though we still plan to fine-tune this for optimal performance.
To answer your question, "Is it intentional that this value is optional and can be None?": yes, it’s designed that way. In the indexing pipeline, Pipeline::then_store_with() functions as an adapter, routing Node Streams to backend storage depending on the batch setting.
Here’s the approach:
- When Persist::batch_size() returns None, each node is processed individually, with Persist::store() sending each chunk to the backend.
- If Persist::batch_size() returns Some(), batch processing is enabled. The stream of nodes is grouped into chunks based on the batch size, and Persist::batch_store() sends these batches to storage.
Does this implementation align with requirements? Let me know if there’s anything specific you’d like adjusted.
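The routing described above can be sketched as follows. This is a simplified, synchronous stand-in: the `Persist` trait here is a hypothetical reduction of the real async trait, nodes are plain `String`s, and `store_all` plays the role attributed to `Pipeline::then_store_with()`.

```rust
/// Hypothetical, synchronous reduction of a storage trait.
trait Persist {
    fn batch_size(&self) -> Option<usize>;
    fn store(&mut self, node: String);
    fn batch_store(&mut self, nodes: Vec<String>);
}

/// Test double that records how many nodes each call received.
struct Recorder {
    batch: Option<usize>,
    calls: Vec<usize>,
}

impl Persist for Recorder {
    fn batch_size(&self) -> Option<usize> {
        self.batch
    }

    fn store(&mut self, _node: String) {
        self.calls.push(1);
    }

    fn batch_store(&mut self, nodes: Vec<String>) {
        self.calls.push(nodes.len());
    }
}

/// Routes nodes one by one or in batches, depending on `batch_size()`.
fn store_all<P: Persist>(storage: &mut P, nodes: Vec<String>) {
    match storage.batch_size() {
        None => {
            for node in nodes {
                storage.store(node);
            }
        }
        Some(size) => {
            // `chunks` panics on 0, so treat a zero batch size as 1.
            for chunk in nodes.chunks(size.max(1)) {
                storage.batch_store(chunk.to_vec());
            }
        }
    }
}
```

The last batch may be smaller than the configured size, which a storage backend has to accept.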
impl fmt::Debug for PgVector {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Access the connection pool synchronously and determine the status.
        let connection_status = self.connection_pool.connection_status();
Does this do a query? Debug is called extensively with tracing, which would mean (I think?) a query on every log / trace statement.
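A `Debug` implementation that only formats in-memory state avoids that concern. The sketch below uses hypothetical types (`PoolHandle`, `Storage`) rather than the PR's `PgVector`; it reports whether a pool is present without touching the database.

```rust
use std::fmt;

/// Hypothetical stand-in for a connection pool handle.
struct PoolHandle;

/// Hypothetical storage struct holding configuration and an optional pool.
struct Storage {
    table_name: String,
    vector_size: i32,
    pool: Option<PoolHandle>,
}

impl fmt::Debug for Storage {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("Storage")
            .field("table_name", &self.table_name)
            .field("vector_size", &self.vector_size)
            // Only checks an in-memory Option; no query or network call is made.
            .field("pool_initialized", &self.pool.is_some())
            .finish()
    }
}
```

Because formatting is cheap and side-effect free, it is safe to call from every log or trace statement.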
) -> Result<Self> {
    let pool = self.connection_pool.clone().unwrap_or_default();

    self.connection_pool = Some(pool.try_connect_to_url(url, connection_max).await?);
See other comment. Changing that and moving it to the main struct will also remove the need to clone, unwrap and option.
    .setup()
    .await
    .expect("PgVector setup should not fail when the table already exists");
}
Could you add tests for storing nodes and verifying with a query that it got persisted as expected? For all the different embedding modes and with/without metadata.
}

query.execute(&mut *tx).await.map_err(|e| {
    tracing::error!("Failed to store nodes: {:?}", e);
It's good practice not to log errors in libraries if the error is bubbled up, as the caller is expected to handle it. Could you remove it?
Sounds good! I've cleaned up the extra logs and traces.
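For illustration, the "bubble up instead of logging" pattern could look like the sketch below. `StoreError`, `execute_insert`, and `store_nodes` are hypothetical names, and `std::io::Error` stands in for a database error; the library code attaches context and leaves reporting to the caller.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical library error that carries context plus the underlying cause.
#[derive(Debug)]
struct StoreError {
    context: &'static str,
    source: io::Error,
}

impl fmt::Display for StoreError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.context)
    }
}

impl Error for StoreError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Hypothetical stand-in for a database call that may fail.
fn execute_insert(fail: bool) -> Result<u64, io::Error> {
    if fail {
        Err(io::Error::new(io::ErrorKind::Other, "connection reset"))
    } else {
        Ok(1)
    }
}

/// Library function: no logging here, the error is returned with context.
fn store_nodes(fail: bool) -> Result<u64, StoreError> {
    execute_insert(fail).map_err(|source| StoreError {
        context: "failed to store nodes",
        source,
    })
}
```

The caller can then log once, at the level it considers appropriate, and still reach the original cause through `source()`.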
/// SQL is intended to be efficient and safe for concurrent use.
#[allow(clippy::redundant_closure_for_method_calls)]
/// Generates a bulk insert SQL statement for inserting multiple nodes.
fn generate_bulk_insert_sql(&self, node_count: usize) -> String {
I have a hunch this query can be made simpler, reducing the code and at the same time speeding things up. There are a lot of loops / allocations in the SQL generation that might be avoided. I'm not 100% sure; I haven't fully dived into it. Additionally, what happens if the node is already present? In swiftide we upsert on the id of the node.
Postgres supports 'dynamic' bulk insertion with UNNEST, see https://github.com/launchbadge/sqlx/blob/main/FAQ.md#how-can-i-bind-an-array-to-a-values-clause-how-can-i-do-bulk-inserts.
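To make the idea concrete, here is a rough sketch of how a fixed-shape `UNNEST` upsert statement could be generated. The statement no longer depends on the node count: each column is bound once as an array. The function name, table name, and column list are hypothetical, and the actual column types (including how a vector column would be bound) are not covered here.

```rust
/// Builds an `INSERT ... SELECT * FROM UNNEST(...)` statement with an upsert clause.
///
/// `columns` holds hypothetical `(column_name, postgres_type)` pairs. Identifiers are
/// interpolated directly, so they must come from trusted configuration, never user input.
/// At least one column besides `conflict_column` is expected, otherwise the
/// `DO UPDATE SET` clause would be empty.
fn unnest_upsert_sql(table: &str, columns: &[(&str, &str)], conflict_column: &str) -> String {
    let names: Vec<&str> = columns.iter().map(|(name, _)| *name).collect();

    // One array parameter per column: $1::uuid[], $2::text[], ...
    let params: Vec<String> = columns
        .iter()
        .enumerate()
        .map(|(i, (_, pg_type))| format!("${}::{}[]", i + 1, pg_type))
        .collect();

    // On conflict, overwrite every column except the conflict key itself.
    let updates: Vec<String> = names
        .iter()
        .filter(|name| **name != conflict_column)
        .map(|name| format!("{name} = EXCLUDED.{name}"))
        .collect();

    format!(
        "INSERT INTO {table} ({}) SELECT * FROM UNNEST({}) ON CONFLICT ({conflict_column}) DO UPDATE SET {}",
        names.join(", "),
        params.join(", "),
        updates.join(", "),
    )
}
```

For example, `unnest_upsert_sql("nodes", &[("id", "uuid"), ("chunk", "text")], "id")` yields a single statement that takes two array parameters regardless of how many rows are inserted.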
- Thanks @timonv for suggesting UNNEST; that sounds promising! I'll give it a try and keep you posted.
- Regarding the question, “What happens if the node is already present?”
  - In the current setup, if a node already exists, a duplicate entry is created. I considered using a caching layer, like Redis, to manage duplicate source files after the FileLoader stage, rather than handling it at the storage level.
- On the suggestion to “upsert on the ID of the node”
  - Upserting on the node’s ID could lead to data loss. Here’s why. Let’s take a look at the indexing pipeline structure:
    indexing::Pipeline::from_loader(FileLoader::new(test_dataset_path).with_extensions(&["md"]))
        .then_chunk(ChunkMarkdown::from_chunk_range(10..2048))
        .then(MetadataQAText::new(llm_client.clone()))
        .then_in_batch(Embed::new(fastembed.clone()).with_batch_size(100))
        .then_store_with(pgv_storage.clone())
        .run()
        .await?;
    - Explanation: In this example, the pipeline starts with FileLoader, which processes a single file and creates a node; let’s call it N1.
    - In the next stage, ChunkMarkdown, N1 is split into multiple nodes (a one-to-X relationship), each inheriting the same ID as N1.
    - These nodes, sharing the same ID, are then passed to the storage stage (pgv_storage).
If we use upsert based on the ID, data from previously processed nodes may be overwritten, leading to possible data loss. I’d be glad to hear your thoughts on this or correct any misunderstandings if there’s a different intended approach.
It shouldn't be an issue; the id is lazily created based on a hash of the chunk (usually at index time). Do you think there is a bug here? You might be on to something. I recently redid id generation such that it's lazy, but it indeed seems that it copies over the id when chunking, which it shouldn't. Perhaps the id shouldn't be lazy at all.
@shamb0 I think you identified a pretty hefty potential bug. Nodes should never share the same ID unless their chunk and path are exactly the same. As long as that holds true, it should be safe to upsert. What could happen (maybe with a node cache?) is that the id gets set early, is copied over during chunking, and is never updated again. That's not intentional and is a bug. I've fixed it in #414, thanks!
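The invariant described here (same path and chunk give the same ID, anything else gives a different one) can be illustrated with a small sketch. It is not swiftide's implementation: `Node` is a hypothetical two-field struct, `split` is a naive paragraph splitter, and `DefaultHasher` is used only as a convenient stand-in for whatever hash the project uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical node with only the fields that feed the ID.
struct Node {
    path: String,
    chunk: String,
}

impl Node {
    /// Computes the ID on demand from path and chunk; nothing is memoized,
    /// so a node produced by chunking can never inherit its parent's ID.
    fn id(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.path.hash(&mut hasher);
        self.chunk.hash(&mut hasher);
        hasher.finish()
    }
}

/// Naive stand-in for a chunker: splits the parent's content on blank lines.
fn split(parent: &Node) -> Vec<Node> {
    parent
        .chunk
        .split("\n\n")
        .map(|part| Node {
            path: parent.path.clone(),
            chunk: part.to_string(),
        })
        .collect()
}
```

With IDs derived this way, re-indexing an unchanged chunk maps to the same row, which is what makes an upsert on the ID safe.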
No worries, no scrum gods to please in open source. Thank you for identifying the bug.
HaHa, Thanks! :-) Happy to help improve the project. Bugs are part of our journey—glad we caught it early!
As @shamb0 pointed out in #392, there is a potential issue where Node ids get cached before chunking or other transformations, breaking upserts and potentially resulting in data loss. BREAKING CHANGE: This PR reworks Nodes with a builder API and a private id. Hence, manually creating nodes no longer works. In the future, all the fields are likely to follow the same pattern, so that we can decouple the inner fields from the Node's implementation.
## 🤖 New release * `swiftide`: 0.13.4 -> 0.14.0 (✓ API compatible changes) * `swiftide-core`: 0.13.4 -> 0.14.0 (⚠️ API breaking changes) * `swiftide-indexing`: 0.13.4 -> 0.14.0 (✓ API compatible changes) * `swiftide-macros`: 0.13.4 -> 0.14.0 * `swiftide-integrations`: 0.13.4 -> 0.14.0 (✓ API compatible changes) * `swiftide-query`: 0.13.4 -> 0.14.0 (✓ API compatible changes) ###⚠️ `swiftide-core` breaking changes ``` --- failure inherent_method_missing: pub method removed or renamed --- Description: A publicly-visible method or associated fn is no longer available under its prior name. It may have been renamed or removed entirely. ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.36.0/src/lints/inherent_method_missing.ron Failed in: Node::update_id, previously in file /tmp/.tmpp9ZuUf/swiftide-core/src/node.rs:204 --- failure struct_pub_field_missing: pub struct's pub field removed or renamed --- Description: A publicly-visible struct has at least one public field that is no longer available under its prior name. It may have been renamed or removed entirely. ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.36.0/src/lints/struct_pub_field_missing.ron Failed in: field id of struct Node, previously in file /tmp/.tmpp9ZuUf/swiftide-core/src/node.rs:41 ``` <details><summary><i><b>Changelog</b></i></summary><p> ## `swiftide` <blockquote> ## [0.14.0](v0.13.4...v0.14.0) - 2024-10-27 ### New features - [a866d38](a866d38) *(integrations)* Support in process hugging face models via mistralrs ([#386](#386)) ### Bug fixes - [551a9cb](551a9cb) *(indexing)* [**breaking**] Node ID no longer memoized ([#414](#414)) ````text As @shamb0 pointed out in [#392](#392), there is a potential issue where Node ids are get cached before chunking or other transformations, breaking upserts and potentially resulting in data loss. 
```` **BREAKING CHANGE**: This PR reworks Nodes with a builder API and a private id. Hence, manually creating nodes no longer works. In the future, all the fields are likely to follow the same pattern, so that we can decouple the inner fields from the Node's implementation. - [c091ffa](c091ffa) *(indexing)* Use atomics for key generation in memory storage ([#415](#415)) ### Miscellaneous - [0000000](0000000) Update Cargo.toml dependencies **Full Changelog**: 0.13.4...0.14.0 </blockquote> </p></details> --- This PR was generated with [release-plz](https://github.com/MarcoIeni/release-plz/).
Co-authored-by: Timon Vonk <mail@timonv.nl>
…osun-ai#410) Bumps the minor group with 9 updates in the / directory: | Package | From | To | | --- | --- | --- | | [anyhow](https://github.com/dtolnay/anyhow) | `1.0.90` | `1.0.91` | | [tokio](https://github.com/tokio-rs/tokio) | `1.40.0` | `1.41.0` | | [serde](https://github.com/serde-rs/serde) | `1.0.210` | `1.0.213` | | [spider](https://github.com/spider-rs/spider) | `2.9.15` | `2.10.7` | | [auto_encoder](https://github.com/spider-rs/auto-encoder) | `0.1.4` | `0.1.5` | | [aws-sdk-bedrockruntime](https://github.com/awslabs/aws-sdk-rust) | `1.55.0` | `1.56.0` | | [bytes](https://github.com/tokio-rs/bytes) | `1.7.2` | `1.8.0` | | [proc-macro2](https://github.com/dtolnay/proc-macro2) | `1.0.88` | `1.0.89` | | [thiserror](https://github.com/dtolnay/thiserror) | `1.0.64` | `1.0.65` | Updates `anyhow` from 1.0.90 to 1.0.91 <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/dtolnay/anyhow/releases">anyhow's releases</a>.</em></p> <blockquote> <h2>1.0.91</h2> <ul> <li>Ensure OUT_DIR is left with deterministic contents after build script execution (<a href="https://redirect.github.com/dtolnay/anyhow/issues/388">#388</a>)</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/dtolnay/anyhow/commit/6c52daaa79eb22279248d80bebedc6c14bbb84ec"><code>6c52daa</code></a> Release 1.0.91</li> <li><a href="https://github.com/dtolnay/anyhow/commit/4986853bea70e653e68e6e94f6ac1475bbb5a180"><code>4986853</code></a> Merge pull request <a href="https://redirect.github.com/dtolnay/anyhow/issues/388">#388</a> from dtolnay/outdir</li> <li><a href="https://github.com/dtolnay/anyhow/commit/f130b76204037c99e8d883d3854039c8d1993a81"><code>f130b76</code></a> Clean up dep-info files from OUT_DIR</li> <li>See full diff in <a href="https://github.com/dtolnay/anyhow/compare/1.0.90...1.0.91">compare view</a></li> </ul> </details> <br /> Updates `tokio` from 1.40.0 to 1.41.0 <details> <summary>Release 
notes</summary> <p><em>Sourced from <a href="https://github.com/tokio-rs/tokio/releases">tokio's releases</a>.</em></p> <blockquote> <h2>Tokio v1.41.0</h2> <h1>1.41.0 (Oct 22th, 2024)</h1> <h3>Added</h3> <ul> <li>metrics: stabilize <code>global_queue_depth</code> (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6854">#6854</a>, <a href="https://redirect.github.com/tokio-rs/tokio/issues/6918">#6918</a>)</li> <li>net: add conversions for unix <code>SocketAddr</code> (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6868">#6868</a>)</li> <li>sync: add <code>watch::Sender::sender_count</code> (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6836">#6836</a>)</li> <li>sync: add <code>mpsc::Receiver::blocking_recv_many</code> (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6867">#6867</a>)</li> <li>task: stabilize <code>Id</code> apis (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6793">#6793</a>, <a href="https://redirect.github.com/tokio-rs/tokio/issues/6891">#6891</a>)</li> </ul> <h3>Added (unstable)</h3> <ul> <li>metrics: add H2 Histogram option to improve histogram granularity (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6897">#6897</a>)</li> <li>metrics: rename some histogram apis (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6924">#6924</a>)</li> <li>runtime: add <code>LocalRuntime</code> (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6808">#6808</a>)</li> </ul> <h3>Changed</h3> <ul> <li>runtime: box futures larger than 16k on release mode (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6826">#6826</a>)</li> <li>sync: add <code>#[must_use]</code> to <code>Notified</code> (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6828">#6828</a>)</li> <li>sync: make <code>watch</code> cooperative (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6846">#6846</a>)</li> <li>sync: make <code>broadcast::Receiver</code> cooperative (<a 
href="https://redirect.github.com/tokio-rs/tokio/issues/6870">#6870</a>)</li> <li>task: add task size to tracing instrumentation (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6881">#6881</a>)</li> <li>wasm: enable <code>cfg_fs</code> for <code>wasi</code> target (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6822">#6822</a>)</li> </ul> <h3>Fixed</h3> <ul> <li>net: fix regression of abstract socket path in unix socket (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6838">#6838</a>)</li> </ul> <h3>Documented</h3> <ul> <li>io: recommend <code>OwnedFd</code> with <code>AsyncFd</code> (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6821">#6821</a>)</li> <li>io: document cancel safety of <code>AsyncFd</code> methods (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6890">#6890</a>)</li> <li>macros: render more comprehensible documentation for <code>join</code> and <code>try_join</code> (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6814">#6814</a>, <a href="https://redirect.github.com/tokio-rs/tokio/issues/6841">#6841</a>)</li> <li>net: fix swapped examples for <code>TcpSocket::set_nodelay</code> and <code>TcpSocket::nodelay</code> (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6840">#6840</a>)</li> <li>sync: document runtime compatibility (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6833">#6833</a>)</li> </ul> <p><a href="https://redirect.github.com/tokio-rs/tokio/issues/6793">#6793</a>: <a href="https://redirect.github.com/tokio-rs/tokio/pull/6793">tokio-rs/tokio#6793</a> <a href="https://redirect.github.com/tokio-rs/tokio/issues/6808">#6808</a>: <a href="https://redirect.github.com/tokio-rs/tokio/pull/6808">tokio-rs/tokio#6808</a> <a href="https://redirect.github.com/tokio-rs/tokio/issues/6810">#6810</a>: <a href="https://redirect.github.com/tokio-rs/tokio/pull/6810">tokio-rs/tokio#6810</a> <a href="https://redirect.github.com/tokio-rs/tokio/issues/6814">#6814</a>: <a 
href="https://redirect.github.com/tokio-rs/tokio/pull/6814">tokio-rs/tokio#6814</a> <a href="https://redirect.github.com/tokio-rs/tokio/issues/6821">#6821</a>: <a href="https://redirect.github.com/tokio-rs/tokio/pull/6821">tokio-rs/tokio#6821</a> <a href="https://redirect.github.com/tokio-rs/tokio/issues/6822">#6822</a>: <a href="https://redirect.github.com/tokio-rs/tokio/pull/6822">tokio-rs/tokio#6822</a> <a href="https://redirect.github.com/tokio-rs/tokio/issues/6826">#6826</a>: <a href="https://redirect.github.com/tokio-rs/tokio/pull/6826">tokio-rs/tokio#6826</a> <a href="https://redirect.github.com/tokio-rs/tokio/issues/6828">#6828</a>: <a href="https://redirect.github.com/tokio-rs/tokio/pull/6828">tokio-rs/tokio#6828</a> <a href="https://redirect.github.com/tokio-rs/tokio/issues/6833">#6833</a>: <a href="https://redirect.github.com/tokio-rs/tokio/pull/6833">tokio-rs/tokio#6833</a> <a href="https://redirect.github.com/tokio-rs/tokio/issues/6836">#6836</a>: <a href="https://redirect.github.com/tokio-rs/tokio/pull/6836">tokio-rs/tokio#6836</a> <a href="https://redirect.github.com/tokio-rs/tokio/issues/6838">#6838</a>: <a href="https://redirect.github.com/tokio-rs/tokio/pull/6838">tokio-rs/tokio#6838</a> <a href="https://redirect.github.com/tokio-rs/tokio/issues/6840">#6840</a>: <a href="https://redirect.github.com/tokio-rs/tokio/pull/6840">tokio-rs/tokio#6840</a></p> <!-- raw HTML omitted --> </blockquote> <p>... 
(truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/tokio-rs/tokio/commit/01e04daaa162ce6122bb894fdda0b6803dd32093"><code>01e04da</code></a> chore: prepare Tokio v1.41.0 (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6917">#6917</a>)</li> <li><a href="https://github.com/tokio-rs/tokio/commit/92ccadeb3c7058cdc8799b998f6a19a1171691df"><code>92ccade</code></a> runtime: fix stability feature flags for docs (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6909">#6909</a>)</li> <li><a href="https://github.com/tokio-rs/tokio/commit/fbfeb9a68a22556935b64dc426601b799ec369ac"><code>fbfeb9a</code></a> metrics: rename <code>*_poll_count_*</code> to <code>*_poll_time_*</code> (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6924">#6924</a>)</li> <li><a href="https://github.com/tokio-rs/tokio/commit/da745ff335dea94378f5ba2b79e2c9f97cb217aa"><code>da745ff</code></a> metrics: add H2 Histogram option to improve histogram granularity (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6897">#6897</a>)</li> <li><a href="https://github.com/tokio-rs/tokio/commit/ce1c74f1cc1e31083744b6eb24b0e60ceaca4b4e"><code>ce1c74f</code></a> metrics: fix deadlock in injection_queue_depth_multi_thread test (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6916">#6916</a>)</li> <li><a href="https://github.com/tokio-rs/tokio/commit/28c9a14a2e4da99842d41240b301022017b2a04a"><code>28c9a14</code></a> metrics: rename <code>injection_queue_depth</code> to <code>global_queue_depth</code> (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6918">#6918</a>)</li> <li><a href="https://github.com/tokio-rs/tokio/commit/32e0b4325f877d53ddc76818198becad9312159a"><code>32e0b43</code></a> ci: freeze FreeBSD and wasm-unknown-unknown on rustc 1.81 (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6911">#6911</a>)</li> <li><a 
href="https://github.com/tokio-rs/tokio/commit/1656d8e231903a7b84b9e2d5e3db7aeed13a2966"><code>1656d8e</code></a> sync: add <code>mpsc::Receiver::blocking_recv_many</code> (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6867">#6867</a>)</li> <li><a href="https://github.com/tokio-rs/tokio/commit/c9e998e4b3cbd3326b4ef52e38ae22561d6e1f29"><code>c9e998e</code></a> ci: print the correct sort order of the dictionary on failure (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6905">#6905</a>)</li> <li><a href="https://github.com/tokio-rs/tokio/commit/512e9decfb683d22f4a145459142542caa0894c9"><code>512e9de</code></a> rt: add LocalRuntime (<a href="https://redirect.github.com/tokio-rs/tokio/issues/6808">#6808</a>)</li> <li>Additional commits viewable in <a href="https://github.com/tokio-rs/tokio/compare/tokio-1.40.0...tokio-1.41.0">compare view</a></li> </ul> </details> <br /> Updates `serde` from 1.0.210 to 1.0.213 <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/serde-rs/serde/releases">serde's releases</a>.</em></p> <blockquote> <h2>v1.0.213</h2> <ul> <li>Fix support for macro-generated <code>with</code> attributes inside a newtype struct (<a href="https://redirect.github.com/serde-rs/serde/issues/2847">#2847</a>)</li> </ul> <h2>v1.0.212</h2> <ul> <li>Fix hygiene of macro-generated local variable accesses in serde(with) wrappers (<a href="https://redirect.github.com/serde-rs/serde/issues/2845">#2845</a>)</li> </ul> <h2>v1.0.211</h2> <ul> <li>Improve error reporting about mismatched signature in <code>with</code> and <code>default</code> attributes (<a href="https://redirect.github.com/serde-rs/serde/issues/2558">#2558</a>, thanks <a href="https://github.com/Mingun"><code>@Mingun</code></a>)</li> <li>Show variant aliases in error message when variant deserialization fails (<a href="https://redirect.github.com/serde-rs/serde/issues/2566">#2566</a>, thanks <a 
href="https://github.com/Mingun"><code>@Mingun</code></a>)</li> <li>Improve binary size of untagged enum and internally tagged enum deserialization by about 12% (<a href="https://redirect.github.com/serde-rs/serde/issues/2821">#2821</a>)</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/serde-rs/serde/commit/58a8d229315553c4ae0a8d7eee8e382fbae4b4bf"><code>58a8d22</code></a> Release 1.0.213</li> <li><a href="https://github.com/serde-rs/serde/commit/ef0ed22593a17a5af5ebe48d3b6ef7c3de1b116a"><code>ef0ed22</code></a> Merge pull request <a href="https://redirect.github.com/serde-rs/serde/issues/2847">#2847</a> from dtolnay/newtypewith</li> <li><a href="https://github.com/serde-rs/serde/commit/79925ac3947483013ba8136e43bc0449b99bd10c"><code>79925ac</code></a> Ignore dead_code warning in regression test</li> <li><a href="https://github.com/serde-rs/serde/commit/b60e4092ec83c70e8c7d39574778349b2c5d9f05"><code>b60e409</code></a> Hygiene for macro-generated newtype struct deserialization with 'with' attr</li> <li><a href="https://github.com/serde-rs/serde/commit/fdc36e5c06def28b33d3154f0517969d90b744d8"><code>fdc36e5</code></a> Add regression test for issue 2846</li> <li><a href="https://github.com/serde-rs/serde/commit/49e11ce1bae9fbb9128c9144c4e1051daf7a29ed"><code>49e11ce</code></a> Ignore trivially_copy_pass_by_ref pedantic clippy lint in test</li> <li><a href="https://github.com/serde-rs/serde/commit/7ae1b5f8f39d7a80daaddcc04565f995427bfc41"><code>7ae1b5f</code></a> Release 1.0.212</li> <li><a href="https://github.com/serde-rs/serde/commit/1ac054b34a3139652d20bf9b0a6d206d3837ac3a"><code>1ac054b</code></a> Merge pull request <a href="https://redirect.github.com/serde-rs/serde/issues/2845">#2845</a> from dtolnay/withlocal</li> <li><a href="https://github.com/serde-rs/serde/commit/1e36ef551dae96d1f93e9f23de78dbfc514aed07"><code>1e36ef5</code></a> Fix hygiene of macro-generated local variable accesses in 
serde(with) wrappers</li> <li><a href="https://github.com/serde-rs/serde/commit/0058c7226e72e653d9e22c0879403ff6df195ec6"><code>0058c72</code></a> Add regression test for issue 2844</li> <li>Additional commits viewable in <a href="https://github.com/serde-rs/serde/compare/v1.0.210...v1.0.213">compare view</a></li> </ul> </details> <br /> Updates `spider` from 2.9.15 to 2.10.7 <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/spider-rs/spider/releases">spider's releases</a>.</em></p> <blockquote> <h2>v2.10.6</h2> <h1>Whats Changed</h1> <ol> <li>add html lang auto encoding handling to improve detection</li> <li>add <code>exclude_selector</code> and <code>root_selector</code> transformations output formats</li> <li>add bin file handling to prevent SOF transformations</li> <li>chore(chrome): fix window navigator stealth handling</li> <li>chore: fix subdomains and tld handling</li> <li>chore(chrome): add automation all routes handling</li> </ol> <p><strong>Full Changelog</strong>: <a href="https://github.com/spider-rs/spider/compare/v2.9.15...v2.10.6">https://github.com/spider-rs/spider/compare/v2.9.15...v2.10.6</a></p> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li>See full diff in <a href="https://github.com/spider-rs/spider/commits">compare view</a></li> </ul> </details> <br /> Updates `auto_encoder` from 0.1.4 to 0.1.5 <details> <summary>Commits</summary> <ul> <li>See full diff in <a href="https://github.com/spider-rs/auto-encoder/commits">compare view</a></li> </ul> </details> <br /> Updates `aws-sdk-bedrockruntime` from 1.55.0 to 1.56.0 <details> <summary>Commits</summary> <ul> <li>See full diff in <a href="https://github.com/awslabs/aws-sdk-rust/commits">compare view</a></li> </ul> </details> <br /> Updates `bytes` from 1.7.2 to 1.8.0 <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/tokio-rs/bytes/releases">bytes's releases</a>.</em></p> <blockquote> 
<h2>Bytes 1.8.0</h2> <h1>1.8.0 (October 21, 2024)</h1> <ul> <li>Guarantee address in <code>split_off</code>/<code>split_to</code> for empty slices (<a href="https://redirect.github.com/tokio-rs/bytes/issues/740">#740</a>)</li> </ul> </blockquote> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/tokio-rs/bytes/blob/master/CHANGELOG.md">bytes's changelog</a>.</em></p> <blockquote> <h1>1.8.0 (October 21, 2024)</h1> <ul> <li>Guarantee address in <code>split_off</code>/<code>split_to</code> for empty slices (<a href="https://redirect.github.com/tokio-rs/bytes/issues/740">#740</a>)</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/tokio-rs/bytes/commit/c45697ce4230777aa8467db7ef91e89f282a539f"><code>c45697c</code></a> chore: prepare bytes v1.8.0 (<a href="https://redirect.github.com/tokio-rs/bytes/issues/741">#741</a>)</li> <li><a href="https://github.com/tokio-rs/bytes/commit/0ac54ca706dfc039cc738962581bba4793860605"><code>0ac54ca</code></a> Guarantee address in split_off/split_to for empty slices (<a href="https://redirect.github.com/tokio-rs/bytes/issues/740">#740</a>)</li> <li>See full diff in <a href="https://github.com/tokio-rs/bytes/compare/v1.7.2...v1.8.0">compare view</a></li> </ul> </details> <br /> Updates `phf_macros` from 0.11.2 to 0.10.0 <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/rust-phf/rust-phf/blob/v0.10.0/CHANGELOG.md">phf_macros's changelog</a>.</em></p> <blockquote> <h2>0.10.0</h2> <ul> <li>Constify <code>len</code> and <code>is_empty</code> (<a href="https://redirect.github.com/rust-phf/rust-phf/issues/224">#224</a>)</li> <li>Implement <code>Clone</code>, <code>Debug</code>, and <code>FusedIterator</code> (<a href="https://redirect.github.com/rust-phf/rust-phf/issues/226">#226</a>)</li> </ul> <p><a href="https://redirect.github.com/rust-phf/rust-phf/issues/224">#224</a>: <a 
href="https://redirect.github.com/rust-phf/rust-phf/pull/224">rust-phf/rust-phf#224</a> <a href="https://redirect.github.com/rust-phf/rust-phf/issues/226">#226</a>: <a href="https://redirect.github.com/rust-phf/rust-phf/pull/226">rust-phf/rust-phf#226</a></p> <h2>0.9.1</h2> <p><strong>Yanked except for <code>phf-generator</code>, use 0.10.0 instead.</strong></p> <ul> <li>(phf-generator): Pin <code>criterion</code> version to keep MSRV</li> <li>Constify <code>len</code> and <code>is_empty</code> (<a href="https://redirect.github.com/rust-phf/rust-phf/issues/224">#224</a>) (<strong>yanked</strong>)</li> <li>Implement <code>Clone</code>, <code>Debug</code>, and <code>FusedIterator</code> (<a href="https://redirect.github.com/rust-phf/rust-phf/issues/226">#226</a>) (<strong>yanked</strong>)</li> </ul> <h2>0.9.0</h2> <ul> <li>Our MSRV is now 1.41 or 1.46 (because of dependencies)</li> <li><code>rand</code> dependency has been upgraded to 0.8</li> <li>Fix some crates' build on <code>no_std</code></li> <li>Restore the <code>unicase</code> feature for <code>phf_macros</code></li> <li>Allow using the owned <code>String</code> type for <code>phf</code> dynamic code generation</li> <li>Add back <code>OrderedMap</code> and <code>OrderedSet</code></li> <li>(<strong>breaking change</strong>) Use <code>PhfBorrow</code> trait instead of <code>std::borrow::Borrow</code></li> </ul> <h2>0.8.0</h2> <ul> <li><code>phf_macros</code> now works on stable.</li> <li>:tada: Fixed asymptotic slowdowns when constructing maps over very large datasets (+1M keys)</li> <li>(<strong>breaking change</strong>) The <code>core</code> features of <code>phf</code> and <code>phf_shared</code> have been changed to <code>std</code> default-features.</li> <li>(<strong>breaking change</strong>) The types in <code>phf_codegen</code> can be used with formatting macros via their <code>Display</code> impls and the <code>build()</code> methods no longer take <code>&mut Write</code>.</li> <li>Support has been added 
for using 128-bit integers as keys.</li> <li>(<strong>breaking change</strong>) The <code>OrderedMap</code> and <code>OrderedSet</code> types and the <code>phf_builder</code> crate have been removed due to lack of use.</li> <li>Byte strings now work correctly as keys.</li> <li><code>unicase</code> dependency has been upgraded to 2.4.0</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/rust-phf/rust-phf/commit/3ea14b2166553ad6e7b9afe7244144f5d661b6c6"><code>3ea14b2</code></a> Merge pull request <a href="https://redirect.github.com/rust-phf/rust-phf/issues/230">#230</a> from JohnTitor/release-0.10</li> <li><a href="https://github.com/rust-phf/rust-phf/commit/588ac25dd5c0afccea084e6f94867328a6a30454"><code>588ac25</code></a> Prepare for release 0.10.0</li> <li><a href="https://github.com/rust-phf/rust-phf/commit/fbb18f925018fa621ce8a8d334f6746ae0f1d072"><code>fbb18f9</code></a> Fix publish failure</li> <li><a href="https://github.com/rust-phf/rust-phf/commit/d527f9d016adafe7d2930e37710291030b432838"><code>d527f9d</code></a> Merge pull request <a href="https://redirect.github.com/rust-phf/rust-phf/issues/228">#228</a> from JohnTitor/release-0.9.1</li> <li><a href="https://github.com/rust-phf/rust-phf/commit/9b719789149ef195ef5eba093b7e73255fbef8dc"><code>9b71978</code></a> Prepare for v0.9.1</li> <li><a href="https://github.com/rust-phf/rust-phf/commit/012be08aa1bc23092539bf617317243e672c75b1"><code>012be08</code></a> Merge pull request <a href="https://redirect.github.com/rust-phf/rust-phf/issues/226">#226</a> from bhgomes/iterator-traits</li> <li><a href="https://github.com/rust-phf/rust-phf/commit/e47e4dce434fd8d0ee80a3c57880f6b2465eed90"><code>e47e4dc</code></a> add trait implementations to iterators mirroring std::collections</li> <li><a href="https://github.com/rust-phf/rust-phf/commit/d71851ef62092143914cc5a2bbbb780029a55ceb"><code>d71851e</code></a> Merge pull request <a 
href="https://redirect.github.com/rust-phf/rust-phf/issues/227">#227</a> from JohnTitor/pin-criterion</li> <li><a href="https://github.com/rust-phf/rust-phf/commit/b19afb6544c4c04fb7893661455191942d14e4af"><code>b19afb6</code></a> Pin <code>criterion</code> version</li> <li><a href="https://github.com/rust-phf/rust-phf/commit/65deaf745b5175b6b8e645b6c66e53fc55bb3a85"><code>65deaf7</code></a> Merge pull request <a href="https://redirect.github.com/rust-phf/rust-phf/issues/224">#224</a> from bhgomes/const-fns</li> <li>Additional commits viewable in <a href="https://github.com/rust-phf/rust-phf/compare/phf_macros-v0.11.2...v0.10.0">compare view</a></li> </ul> </details> <br /> Updates `proc-macro2` from 1.0.88 to 1.0.89 <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/dtolnay/proc-macro2/releases">proc-macro2's releases</a>.</em></p> <blockquote> <h2>1.0.89</h2> <ul> <li>Ensure OUT_DIR is left with deterministic contents after build script execution (<a href="https://redirect.github.com/dtolnay/proc-macro2/issues/474">#474</a>)</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/dtolnay/proc-macro2/commit/671d87da32779e3a5c4cecacd8a9234e0e27630c"><code>671d87d</code></a> Release 1.0.89</li> <li><a href="https://github.com/dtolnay/proc-macro2/commit/9574d983aea667b9c4d7a744212df5fc00daef1f"><code>9574d98</code></a> Merge pull request <a href="https://redirect.github.com/dtolnay/proc-macro2/issues/474">#474</a> from dtolnay/outdir</li> <li><a href="https://github.com/dtolnay/proc-macro2/commit/3e8962cc14b0af47f0b07beeab2a560447f969f4"><code>3e8962c</code></a> Clean up dep-info files from OUT_DIR</li> <li>See full diff in <a href="https://github.com/dtolnay/proc-macro2/compare/1.0.88...1.0.89">compare view</a></li> </ul> </details> <br /> Updates `serde_derive` from 1.0.210 to 1.0.213 <details> <summary>Release notes</summary> <p><em>Sourced from <a 
href="https://github.com/serde-rs/serde/releases">serde_derive's releases</a>.</em></p> <blockquote> <h2>v1.0.213</h2> <ul> <li>Fix support for macro-generated <code>with</code> attributes inside a newtype struct (<a href="https://redirect.github.com/serde-rs/serde/issues/2847">#2847</a>)</li> </ul> <h2>v1.0.212</h2> <ul> <li>Fix hygiene of macro-generated local variable accesses in serde(with) wrappers (<a href="https://redirect.github.com/serde-rs/serde/issues/2845">#2845</a>)</li> </ul> <h2>v1.0.211</h2> <ul> <li>Improve error reporting about mismatched signature in <code>with</code> and <code>default</code> attributes (<a href="https://redirect.github.com/serde-rs/serde/issues/2558">#2558</a>, thanks <a href="https://github.com/Mingun"><code>@Mingun</code></a>)</li> <li>Show variant aliases in error message when variant deserialization fails (<a href="https://redirect.github.com/serde-rs/serde/issues/2566">#2566</a>, thanks <a href="https://github.com/Mingun"><code>@Mingun</code></a>)</li> <li>Improve binary size of untagged enum and internally tagged enum deserialization by about 12% (<a href="https://redirect.github.com/serde-rs/serde/issues/2821">#2821</a>)</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/serde-rs/serde/commit/58a8d229315553c4ae0a8d7eee8e382fbae4b4bf"><code>58a8d22</code></a> Release 1.0.213</li> <li><a href="https://github.com/serde-rs/serde/commit/ef0ed22593a17a5af5ebe48d3b6ef7c3de1b116a"><code>ef0ed22</code></a> Merge pull request <a href="https://redirect.github.com/serde-rs/serde/issues/2847">#2847</a> from dtolnay/newtypewith</li> <li><a href="https://github.com/serde-rs/serde/commit/79925ac3947483013ba8136e43bc0449b99bd10c"><code>79925ac</code></a> Ignore dead_code warning in regression test</li> <li><a href="https://github.com/serde-rs/serde/commit/b60e4092ec83c70e8c7d39574778349b2c5d9f05"><code>b60e409</code></a> Hygiene for macro-generated newtype struct deserialization 
with 'with' attr</li> <li><a href="https://github.com/serde-rs/serde/commit/fdc36e5c06def28b33d3154f0517969d90b744d8"><code>fdc36e5</code></a> Add regression test for issue 2846</li> <li><a href="https://github.com/serde-rs/serde/commit/49e11ce1bae9fbb9128c9144c4e1051daf7a29ed"><code>49e11ce</code></a> Ignore trivially_copy_pass_by_ref pedantic clippy lint in test</li> <li><a href="https://github.com/serde-rs/serde/commit/7ae1b5f8f39d7a80daaddcc04565f995427bfc41"><code>7ae1b5f</code></a> Release 1.0.212</li> <li><a href="https://github.com/serde-rs/serde/commit/1ac054b34a3139652d20bf9b0a6d206d3837ac3a"><code>1ac054b</code></a> Merge pull request <a href="https://redirect.github.com/serde-rs/serde/issues/2845">#2845</a> from dtolnay/withlocal</li> <li><a href="https://github.com/serde-rs/serde/commit/1e36ef551dae96d1f93e9f23de78dbfc514aed07"><code>1e36ef5</code></a> Fix hygiene of macro-generated local variable accesses in serde(with) wrappers</li> <li><a href="https://github.com/serde-rs/serde/commit/0058c7226e72e653d9e22c0879403ff6df195ec6"><code>0058c72</code></a> Add regression test for issue 2844</li> <li>Additional commits viewable in <a href="https://github.com/serde-rs/serde/compare/v1.0.210...v1.0.213">compare view</a></li> </ul> </details> <br /> Updates `thiserror` from 1.0.64 to 1.0.65 <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/dtolnay/thiserror/releases">thiserror's releases</a>.</em></p> <blockquote> <h2>1.0.65</h2> <ul> <li>Ensure OUT_DIR is left with deterministic contents after build script execution (<a href="https://redirect.github.com/dtolnay/thiserror/issues/325">#325</a>)</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/dtolnay/thiserror/commit/5088592a4efb6a5c40b4d869eb1a0e2eacf622cb"><code>5088592</code></a> Release 1.0.65</li> <li><a 
href="https://github.com/dtolnay/thiserror/commit/3309b3772afc1ddc248f52ce52cba995fd0bf1da"><code>3309b37</code></a> Merge pull request <a href="https://redirect.github.com/dtolnay/thiserror/issues/325">#325</a> from dtolnay/outdir</li> <li><a href="https://github.com/dtolnay/thiserror/commit/f563b1dc7620304c797cb794ba6e45fcba2b7586"><code>f563b1d</code></a> Clean up dep-info files from OUT_DIR</li> <li><a href="https://github.com/dtolnay/thiserror/commit/a72ea77c457bd4e150e8de93b33d8258f1908feb"><code>a72ea77</code></a> Resolve extra_unused_lifetimes clippy lint</li> <li><a href="https://github.com/dtolnay/thiserror/commit/1b15d6e6a44cd32d3622864ee6a77097a51df185"><code>1b15d6e</code></a> Ignore needless_lifetimes clippy lint</li> <li>See full diff in <a href="https://github.com/dtolnay/thiserror/compare/1.0.64...1.0.65">compare view</a></li> </ul> </details> <br /> Updates `thiserror-impl` from 1.0.64 to 1.0.65 <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/dtolnay/thiserror/releases">thiserror-impl's releases</a>.</em></p> <blockquote> <h2>1.0.65</h2> <ul> <li>Ensure OUT_DIR is left with deterministic contents after build script execution (<a href="https://redirect.github.com/dtolnay/thiserror/issues/325">#325</a>)</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/dtolnay/thiserror/commit/5088592a4efb6a5c40b4d869eb1a0e2eacf622cb"><code>5088592</code></a> Release 1.0.65</li> <li><a href="https://github.com/dtolnay/thiserror/commit/3309b3772afc1ddc248f52ce52cba995fd0bf1da"><code>3309b37</code></a> Merge pull request <a href="https://redirect.github.com/dtolnay/thiserror/issues/325">#325</a> from dtolnay/outdir</li> <li><a href="https://github.com/dtolnay/thiserror/commit/f563b1dc7620304c797cb794ba6e45fcba2b7586"><code>f563b1d</code></a> Clean up dep-info files from OUT_DIR</li> <li><a 
href="https://github.com/dtolnay/thiserror/commit/a72ea77c457bd4e150e8de93b33d8258f1908feb"><code>a72ea77</code></a> Resolve extra_unused_lifetimes clippy lint</li> <li><a href="https://github.com/dtolnay/thiserror/commit/1b15d6e6a44cd32d3622864ee6a77097a51df185"><code>1b15d6e</code></a> Ignore needless_lifetimes clippy lint</li> <li>See full diff in <a href="https://github.com/dtolnay/thiserror/compare/1.0.64...1.0.65">compare view</a></li> </ul> </details> <br /> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
As @shamb0 pointed out in bosun-ai#392, there is a potential issue where Node ids get cached before chunking or other transformations, breaking upserts and potentially resulting in data loss.

BREAKING CHANGE: This PR reworks Nodes with a builder API and a private id. Hence, manually creating nodes no longer works. In the future, all the fields are likely to follow the same pattern, so that we can decouple the inner fields from the Node's implementation.
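To make the failure mode in this commit message concrete, here is a minimal sketch of why a memoized id can break upserts, and how a private, derived id behind a builder avoids it. All names (`MemoizedNode`, `SketchNode`, `SketchNodeBuilder`, `content_id`, and the two `rows_after_upsert_*` helpers) are hypothetical and are not swiftide's actual `Node` API; `DefaultHasher` and the `HashMap` store are stand-ins for whatever hashing and storage the library really uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hypothetical id function: hash of path plus chunk content.
fn content_id(path: &str, chunk: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    path.hash(&mut hasher);
    chunk.hash(&mut hasher);
    hasher.finish()
}

/// Failure mode: the id is cached on first access and copied by `clone()`.
#[derive(Clone)]
pub struct MemoizedNode {
    pub path: String,
    pub chunk: String,
    cached_id: Option<u64>,
}

impl MemoizedNode {
    pub fn new(path: &str, chunk: &str) -> Self {
        Self { path: path.to_string(), chunk: chunk.to_string(), cached_id: None }
    }

    pub fn id(&mut self) -> u64 {
        match self.cached_id {
            Some(id) => id,
            None => {
                let id = content_id(&self.path, &self.chunk);
                self.cached_id = Some(id);
                id
            }
        }
    }
}

/// Sketch of the fix: private fields, builder construction, id derived on demand.
pub struct SketchNode {
    path: String,
    chunk: String,
}

#[derive(Default)]
pub struct SketchNodeBuilder {
    path: String,
    chunk: String,
}

impl SketchNodeBuilder {
    pub fn path(mut self, path: impl Into<String>) -> Self {
        self.path = path.into();
        self
    }

    pub fn chunk(mut self, chunk: impl Into<String>) -> Self {
        self.chunk = chunk.into();
        self
    }

    pub fn build(self) -> SketchNode {
        SketchNode { path: self.path, chunk: self.chunk }
    }
}

impl SketchNode {
    pub fn builder() -> SketchNodeBuilder {
        SketchNodeBuilder::default()
    }

    /// Always computed from the current path and chunk; nothing is stored.
    pub fn id(&self) -> u64 {
        content_id(&self.path, &self.chunk)
    }
}

/// Chunks by line via clone-and-replace, upserts by id, returns the row count.
pub fn rows_after_upsert_memoized(text: &str) -> usize {
    let mut parent = MemoizedNode::new("README.md", text);
    parent.id(); // some earlier step touched the id before chunking
    let mut store: HashMap<u64, String> = HashMap::new();
    for line in text.lines() {
        let mut child = parent.clone(); // the stale cached id is cloned too
        child.chunk = line.to_string();
        let id = child.id();
        store.insert(id, child.chunk.clone());
    }
    store.len()
}

/// Same pipeline, but each chunk is built fresh and its id is derived on demand.
pub fn rows_after_upsert_derived(text: &str) -> usize {
    let mut store: HashMap<u64, String> = HashMap::new();
    for line in text.lines() {
        let child = SketchNode::builder().path("README.md").chunk(line).build();
        store.insert(child.id(), line.to_string());
    }
    store.len()
}
```

In the memoized variant every chunk inherits the parent's cached id, so each upsert overwrites the previous row and only one row survives. In the derived variant each chunk gets an id that reflects its own content. Keeping the fields private also means callers cannot construct a node with a stale id by hand, which is in line with the breaking change described above.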
## 🤖 New release
* `swiftide`: 0.13.4 -> 0.14.0 (✓ API compatible changes)
* `swiftide-core`: 0.13.4 -> 0.14.0 (⚠️ API breaking changes)
* `swiftide-indexing`: 0.13.4 -> 0.14.0 (✓ API compatible changes)
* `swiftide-macros`: 0.13.4 -> 0.14.0
* `swiftide-integrations`: 0.13.4 -> 0.14.0 (✓ API compatible changes)
* `swiftide-query`: 0.13.4 -> 0.14.0 (✓ API compatible changes)

### ⚠️ `swiftide-core` breaking changes

```
--- failure inherent_method_missing: pub method removed or renamed ---

Description:
A publicly-visible method or associated fn is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.36.0/src/lints/inherent_method_missing.ron

Failed in:
  Node::update_id, previously in file /tmp/.tmpp9ZuUf/swiftide-core/src/node.rs:204

--- failure struct_pub_field_missing: pub struct's pub field removed or renamed ---

Description:
A publicly-visible struct has at least one public field that is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.36.0/src/lints/struct_pub_field_missing.ron

Failed in:
  field id of struct Node, previously in file /tmp/.tmpp9ZuUf/swiftide-core/src/node.rs:41
```

<details><summary><i><b>Changelog</b></i></summary><p>

## `swiftide`
<blockquote>

## [0.14.0](bosun-ai/swiftide@v0.13.4...v0.14.0) - 2024-10-27

### New features

- [a866d38](bosun-ai@a866d38) *(integrations)* Support in process hugging face models via mistralrs ([bosun-ai#386](bosun-ai#386))

### Bug fixes

- [551a9cb](bosun-ai@551a9cb) *(indexing)* [**breaking**] Node ID no longer memoized ([bosun-ai#414](bosun-ai#414))

````text
As @shamb0 pointed out in [bosun-ai#392](bosun-ai#392), there is a potential issue where Node ids get cached before chunking or other transformations, breaking upserts and potentially resulting in data loss.
````

**BREAKING CHANGE**: This PR reworks Nodes with a builder API and a private id. Hence, manually creating nodes no longer works. In the future, all the fields are likely to follow the same pattern, so that we can decouple the inner fields from the Node's implementation.

- [c091ffa](bosun-ai@c091ffa) *(indexing)* Use atomics for key generation in memory storage ([bosun-ai#415](bosun-ai#415))

### Miscellaneous

- [0000000](bosun-ai@0000000) Update Cargo.toml dependencies

**Full Changelog**: bosun-ai/swiftide@0.13.4...0.14.0
</blockquote>

</p></details>

---
This PR was generated with [release-plz](https://github.com/MarcoIeni/release-plz/).
Revert the 0.14 release as `mistralrs` is unpublished and unfortunately cannot be released.
Unfortunately, we cannot publish a release that depends on unpublished crates. Once mistral-rs is published, we will be happy to add support again. This reverts commit a866d38.
- **Revert "fix: Revert 0.14 release as mistralrs is unpublished (bosun-ai#417)"**
- **Fix changelog**
## 🤖 New release
* `swiftide`: 0.14.0 -> 0.14.1 (✓ API compatible changes)
* `swiftide-core`: 0.14.0 -> 0.14.1
* `swiftide-indexing`: 0.14.0 -> 0.14.1 (✓ API compatible changes)
* `swiftide-macros`: 0.14.0 -> 0.14.1
* `swiftide-integrations`: 0.14.0 -> 0.14.1
* `swiftide-query`: 0.14.0 -> 0.14.1

<details><summary><i><b>Changelog</b></i></summary><p>

## `swiftide`
<blockquote>

## [0.14.1](bosun-ai/swiftide@v0.14.0...v0.14.1) - 2024-10-27

### Bug fixes

- [5bbcd55](bosun-ai@5bbcd55) Revert 0.14 release as mistralrs is unpublished ([bosun-ai#417](bosun-ai#417))

````text
Revert the 0.14 release as `mistralrs` is unpublished and unfortunately cannot be released.
````

### Miscellaneous

- [07c2661](bosun-ai@07c2661) Re-release 0.14 without mistralrs ([bosun-ai#419](bosun-ai#419))

````text
- **Revert "fix: Revert 0.14 release as mistralrs is unpublished ([bosun-ai#417](https://github.com/bosun-ai/swiftide/pull/417))"**
- **Fix changelog**
````

**Full Changelog**: bosun-ai/swiftide@0.14.0...0.14.1
</blockquote>

</p></details>

---
This PR was generated with [release-plz](https://github.com/MarcoIeni/release-plz/).
- Prototype pipeline for loading, chunking, enhancing, embedding, and storing markdown content in pgvector. Signed-off-by: shamb0 <r.raajey@gmail.com>
* `swiftide`: 0.13.3 -> 0.13.4 (✓ API compatible changes)
* `swiftide-core`: 0.13.3 -> 0.13.4
* `swiftide-indexing`: 0.13.3 -> 0.13.4 (✓ API compatible changes)
* `swiftide-macros`: 0.13.3 -> 0.13.4
* `swiftide-integrations`: 0.13.3 -> 0.13.4
* `swiftide-query`: 0.13.3 -> 0.13.4

<details><summary><i><b>Changelog</b></i></summary><p>

<blockquote>

[0.13.4](bosun-ai/swiftide@v0.13.3...v0.13.4) - 2024-10-21

- [47455fb](bosun-ai@47455fb) *(indexing)* Visibility of ChunkMarkdown builder should be public
- [2b3b401](bosun-ai@2b3b401) *(indexing)* Improve splitters consistency and provide defaults ([bosun-ai#403](bosun-ai#403))

**Full Changelog**: bosun-ai/swiftide@0.13.3...0.13.4
</blockquote>

</p></details>

---
This PR was generated with [release-plz](https://github.com/MarcoIeni/release-plz/).
Addressed review feedback:
- Updated PostgreSQL insertion to use upsert with unnest for bulk indexing of vector rows
- Modified 'start_postgres()' to use the 'pgvector/pgvector:pg17' Docker image and 'Mount::tmpfs_mount()' for in-memory volume
- Cleaned up extra logging and tracing for streamlined output

Signed-off-by: shamb0 <r.raajey@gmail.com>
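For readers unfamiliar with the "upsert with unnest" approach mentioned in this commit, the sketch below builds such a statement as a plain string. It is an illustration only, not the code from this PR: the helper `build_unnest_upsert`, the table name `demo_chunks`, and the column names and types are all assumptions. The idea is that each bind parameter carries a whole array (one element per row), so a single round trip replaces one `INSERT` per node.

```rust
/// Builds one `INSERT ... SELECT * FROM UNNEST(...) ON CONFLICT ...` statement.
///
/// Hypothetical helper for illustration. `columns` pairs each column name
/// with the Postgres array type of its bind parameter. At least one
/// non-key column is assumed, otherwise the `SET` clause would be empty.
pub fn build_unnest_upsert(table: &str, key: &str, columns: &[(&str, &str)]) -> String {
    // Column list, e.g. "id, chunk, embedding".
    let names: Vec<&str> = columns.iter().map(|(name, _)| *name).collect();

    // One array-typed placeholder per column, e.g. "$1::uuid[]".
    let params: Vec<String> = columns
        .iter()
        .enumerate()
        .map(|(i, (_, ty))| format!("${}::{}", i + 1, ty))
        .collect();

    // On conflict, overwrite every non-key column with the incoming value.
    let updates: Vec<String> = names
        .iter()
        .filter(|name| **name != key)
        .map(|name| format!("{} = EXCLUDED.{}", name, name))
        .collect();

    format!(
        "INSERT INTO {} ({}) SELECT * FROM UNNEST({}) ON CONFLICT ({}) DO UPDATE SET {}",
        table,
        names.join(", "),
        params.join(", "),
        key,
        updates.join(", ")
    )
}
```

Called with `("id", "uuid[]")`, `("chunk", "text[]")`, and `("embedding", "vector[]")`, this yields a statement with three array placeholders; the arrays would then be bound through whichever Postgres client is in use, and must all have the same length so that `UNNEST` zips them row by row.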
1d68d1c to bd0b265
Signed-off-by: shamb0 <r.raajey@gmail.com>
- Integrate retrieval functionality.
- Extend unit test coverage by 30%.

Signed-off-by: shamb0 <r.raajey@gmail.com>
Perfect, let me know if I can help / continue the review

Yes @timonv, I need your help regarding the discussion thread on (#392 (comment)). Could you review my recent comments and share any updated thoughts when convenient? Once I have your input, I can proceed with the implementation, and then we can move forward to the intake review process.
Related to: #158
PR Overview (Part-01)
This PR introduces the first part of the indexing pipeline, which loads, chunks, enhances, embeds, and stores markdown content in the `pgvector` store.

How to Run the Example
The provided example, `swiftide/examples/index_md_into_pgvector.rs`, uses the Ollama LLM provider with the `llama3.2:latest` model to generate QA metadata.

Steps to Run:
Start the PostgreSQL services:
Run the example with the default configuration: