feat: Index markdown in pgvector #392
Open
shamb0 wants to merge 35 commits into bosun-ai:master from shamb0:feat/indexing-into-pgvector
Commits
bfa44b5 feat: Index markdown in pgvector (shamb0)
3243fd5 chore(ci): Switch to dependabot for better grouping (#398) (timonv)
e914cba chore(deps): bump SethCohen/github-releases-to-discord from 1.15.1 to… (dependabot[bot])
f305ef8 chore(ci): Explicit allow all for dependabot (timonv)
b3b3175 fix(ci): Update dependabot.yml via ui (#402) (timonv)
fe25b17 fix(indexing): Improve splitters consistency and provide defaults (#403) (timonv)
b531bdd fix(indexing): Visibility of ChunkMarkdown builder should be public (timonv)
2a43a75 chore: Improve workspace configuration (#404) (timonv)
c17e9a9 chore: release v0.13.4 (#400) (SwabbieBosun)
b6fa280 fix(ci): Remove explicit 'all' from dependabot config (timonv)
c08658f chore: Soft update deps (timonv)
5c3aff8 fix(ci): Add zlib to allowed licenses (timonv)
57014d2 fix(ci): Add back allow all in dependabot and fix aws pattern (timonv)
f60d009 feat: Index markdown in pgvector (shamb0)
4266bbe Addressed review comments: (shamb0)
9a32436 Addressed review comments: (shamb0)
95e925a Update examples/index_md_into_pgvector.rs (shamb0)
72ba300 fix(ci): Remove cache fixing ci disk limits (#408) (timonv)
6781ec3 chore(deps): bump the minor group across 1 directory with 12 updates … (dependabot[bot])
5c3458c fix(indexing)!: Node ID no longer memoized (#414) (timonv)
40709be fix(indexing): Use atomics for key generation in memory storage (#415) (timonv)
7fba78d feat(integrations): Support in process hugging face models via mistra… (timonv)
ce3945b chore(deps): bump the minor group across 1 directory with 16 updates … (dependabot[bot])
ae7718d chore: release v0.14.0 (#416) (SwabbieBosun)
3c74464 fix: Revert 0.14 release as mistralrs is unpublished (#417) (timonv)
e32f721 fix(integrations): Revert mistralrs support (#418) (timonv)
30c2d01 chore: Re-release 0.14 without mistralrs (#419) (timonv)
fade2fe chore: release v0.14.1 (#420) (SwabbieBosun)
acb34af feat: Index markdown in pgvector (shamb0)
b7aa295 chore: release v0.13.4 (#400) (SwabbieBosun)
bd0b265 Completed release v0.14.1 intake (shamb0)
3eb579f Merge branch 'master' into feat/indexing-into-pgvector (shamb0)
6ad22f1 merge to upstream master (shamb0)
15b2909 Address review feedback: (shamb0)
6817d4b Merge remote-tracking branch 'upstream/master' into feat/indexing-int… (shamb0)
@@ -0,0 +1,75 @@
/**
 * This example demonstrates how to use the Pgvector integration with Swiftide
 */
use std::path::PathBuf;
use swiftide::{
    indexing::{
        self,
        loaders::FileLoader,
        transformers::{
            metadata_qa_text::NAME as METADATA_QA_TEXT_NAME, ChunkMarkdown, Embed, MetadataQAText,
        },
        EmbeddedField,
    },
    integrations::{self, pgvector::PgVector},
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    tracing_subscriber::fmt::init();
    tracing::info!("Starting PgVector indexing test");

    // Get the manifest directory path
    let manifest_dir = std::env::var("CARGO_MANIFEST_DIR").expect("CARGO_MANIFEST_DIR not set");

    // Create a PathBuf to test dataset from the manifest directory
    let test_dataset_path = PathBuf::from(manifest_dir).join("test_dataset");
    tracing::info!("Test Dataset path: {:?}", test_dataset_path);

    let pgv_db_url = std::env::var("DATABASE_URL")
        .as_deref()
        .unwrap_or("postgresql://myuser:mypassword@localhost:5432/mydatabase")
        .to_owned();

    let ollama_client = integrations::ollama::Ollama::default()
        .with_default_prompt_model("llama3.2:latest")
        .to_owned();

    let fastembed =
        integrations::fastembed::FastEmbed::try_default().expect("Could not create FastEmbed");

    // Configure Pgvector with a default vector size, a single embedding
    // and in addition to embedding the text metadata, also store it in a field
    let pgv_storage = PgVector::builder()
        .try_from_url(pgv_db_url, Some(10))
        .await
        .expect("Failed to connect to postgres server")
        .vector_size(384)
        .with_vector(EmbeddedField::Combined)
        .with_metadata(METADATA_QA_TEXT_NAME)
        .table_name("swiftide_pgvector_test".to_string())
        .build()
        .unwrap();

    // Drop the existing test table before running the test
    tracing::info!("Dropping existing test table if it exists");
    let drop_table_sql = "DROP TABLE IF EXISTS swiftide_pgvector_test";

    if let Some(pool) = pgv_storage.get_pool() {
        sqlx::query(drop_table_sql).execute(pool).await?;
    } else {
        return Err("Failed to get database connection pool".into());
    }

    tracing::info!("Starting indexing pipeline");
    indexing::Pipeline::from_loader(FileLoader::new(test_dataset_path).with_extensions(&["md"]))
        .then_chunk(ChunkMarkdown::from_chunk_range(10..2048))
        .then(MetadataQAText::new(ollama_client.clone()))
        .then_in_batch(Embed::new(fastembed).with_batch_size(100))
        .then_store_with(pgv_storage.clone())
        .run()
        .await?;

    tracing::info!("Indexing test completed successfully");
    Ok(())
}
@@ -0,0 +1,41 @@
# **Swiftide: A Fast, Streaming Indexing and Query Library for AI Applications**

Swiftide is a Rust-native library designed to simplify the development of Large Language Model (LLM) applications. It addresses the challenge of providing context to LLMs for solving real-world problems by enabling efficient ingestion, transformation, indexing, and querying of extensive data. This process, known as Retrieval Augmented Generation (RAG), enhances the capabilities of LLMs.

## **Key Features:**

* **Fast and Modular Indexing:** Swiftide offers a high-performance, streaming indexing pipeline with asynchronous, parallel processing capabilities.
* **Query Pipeline:** An experimental query pipeline facilitates efficient retrieval and processing of information.
* **Versatility:** The library includes various loaders, transformers, semantic chunkers, embedders, and other components, providing flexibility for different use cases.
* **Extensibility:** Developers can bring their own transformers by extending straightforward traits or using closures.
* **Pipeline Management:** Swiftide supports splitting and merging pipelines for complex workflows.
* **Prompt Templating:** Jinja-like templating simplifies the creation of prompts.
* **Storage Options:** Integration with multiple storage backends, including Qdrant, Redis, and LanceDB.
* **Integrations:** Seamless integration with popular tools and platforms like OpenAI, Groq, Redis, Qdrant, Ollama, FastEmbed-rs, Fluvio, LanceDB, and Treesitter.
* **Evaluation:** Pipeline evaluation using RAGAS for performance assessment.
* **Sparse Vector Support:** Enables hybrid search with sparse vector support.
* **Tracing:** Built-in tracing support for logging and debugging.

## **Technical Insights:**

* **Rust-Native:** Developed in Rust for performance, safety, and concurrency.
* **Streaming Architecture:** Employs a streaming architecture for efficient processing of large datasets.
* **Modularity:** Highly modular design allows for customization and extensibility.
* **Asynchronous and Parallel Processing:** Leverages asynchronous and parallel processing for optimal performance.
* **Strong Typing:** The query pipeline is fully and strongly typed, ensuring type safety and developer productivity.
* **OpenAI Integration:** Provides seamless integration with OpenAI for powerful LLM capabilities.

## **Getting Started:**

To get started with Swiftide, developers need to set up a Rust project, add the Swiftide library as a dependency, enable the required integration features, and write a pipeline. Comprehensive examples and documentation are available to guide developers through the process.

## **Current Status and Future Roadmap:**

Swiftide is under active development and may introduce breaking changes as it progresses towards version 1.0. The documentation may not cover all features and could be slightly outdated. Despite these considerations, Swiftide offers a promising solution for building efficient and scalable LLM applications. The project's roadmap includes addressing open issues and incorporating proposed features to enhance its functionality and usability.

## **Community and Contributions:**

The Swiftide community welcomes feedback, questions, and contributions. Developers can connect with the community on Discord and contribute to the project by forking the repository, creating pull requests, or opening issues with enhancement tags.

**Overall, Swiftide presents a powerful and flexible framework for building Retrieval Augmented Generation (RAG) pipelines in Rust. Its focus on performance, modularity, and extensibility makes it a valuable tool for developers working with LLMs and AI applications.**
@@ -0,0 +1,46 @@
services:
  test_env_pgvector:
    image: ankane/pgvector:v0.5.1
    container_name: test_env_pgvector
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-myuser}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-mypassword}
      POSTGRES_DB: ${POSTGRES_DB:-mydatabase}
    ports:
      - "5432:5432"
    volumes:
      - test_env_pgvector_data:/var/lib/postgresql/data
    networks:
      - pg-network
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-myuser} -d ${POSTGRES_DB:-mydatabase}"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  pgadmin:
    image: dpage/pgadmin4
    container_name: test_env_pgadmin
    environment:
      PGADMIN_DEFAULT_EMAIL: ${PGADMIN_DEFAULT_EMAIL:-admin@admin.com}
      PGADMIN_DEFAULT_PASSWORD: ${PGADMIN_DEFAULT_PASSWORD:-root}
    ports:
      - "8080:80"
    volumes:
      - test_env_pgadmin_data:/var/lib/pgadmin
    depends_on:
      - test_env_pgvector
    networks:
      - pg-network
    restart: unless-stopped

networks:
  pg-network:
    name: pg-network

volumes:
  test_env_pgvector_data:
    name: test_env_pgvector_data
  test_env_pgadmin_data:
    name: test_env_pgadmin_data
Almost exactly right! I prefer it if builders do not do IO if they can avoid it, for multiple reasons. In this case, that also has the benefit of being able to connect lazily and hiding the details of the connection pool.
i.e. the builder API like:
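As a rough illustration of an IO-free builder, here is a minimal sketch. All names in it (`db_url`, `max_connections`, the default of 10 connections, the default table name) are hypothetical and are not taken from the review or from Swiftide's actual API. The builder only collects and validates settings, so `build` is synchronous and cannot fail because of the network.

```rust
// Hypothetical sketch: type, field, and method names are illustrative only.

/// Storage handle that records how to connect; no connection exists yet.
#[derive(Debug, Clone, PartialEq)]
pub struct PgVector {
    pub db_url: String,
    pub max_connections: u32,
    pub table_name: String,
    pub vector_size: usize,
}

/// Builder that only collects and validates configuration (no IO, no async).
#[derive(Debug, Default)]
pub struct PgVectorBuilder {
    db_url: Option<String>,
    max_connections: Option<u32>,
    table_name: Option<String>,
    vector_size: Option<usize>,
}

impl PgVector {
    pub fn builder() -> PgVectorBuilder {
        PgVectorBuilder::default()
    }
}

impl PgVectorBuilder {
    pub fn db_url(mut self, url: impl Into<String>) -> Self {
        self.db_url = Some(url.into());
        self
    }

    pub fn max_connections(mut self, n: u32) -> Self {
        self.max_connections = Some(n);
        self
    }

    pub fn table_name(mut self, name: impl Into<String>) -> Self {
        self.table_name = Some(name.into());
        self
    }

    pub fn vector_size(mut self, size: usize) -> Self {
        self.vector_size = Some(size);
        self
    }

    /// Validates the collected settings; fails only on missing or invalid values.
    pub fn build(self) -> Result<PgVector, String> {
        let vector_size = self.vector_size.ok_or("vector_size is required")?;
        if vector_size == 0 {
            return Err("vector_size must be non-zero".to_string());
        }
        Ok(PgVector {
            db_url: self.db_url.ok_or("db_url is required")?,
            // Assumed defaults, chosen only for this sketch.
            max_connections: self.max_connections.unwrap_or(10),
            table_name: self
                .table_name
                .unwrap_or_else(|| "swiftide_pgvector".to_string()),
            vector_size,
        })
    }
}
```

With this shape, a call such as `PgVector::builder().db_url(url).vector_size(384).build()` needs no `.await`, and a bad database URL would only surface later, when the connection is first opened.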
And then in `PgVector::setup` (which is only called once):
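One way the lazy connection could be sketched with only the standard library is shown below. It is an assumption-laden illustration, not the code from the review: `Pool` and `connect` are stand-ins for a real connection pool and its constructor, and `setup` is synchronous here, whereas the real method is presumably async. The point is that `std::sync::OnceLock` lets `setup` fill in the pool through a shared `&self` reference.

```rust
use std::sync::OnceLock;

/// Stand-in for a real connection pool; hypothetical.
#[derive(Debug)]
pub struct Pool {
    pub url: String,
}

/// Stand-in for opening a pool; a real version would perform network IO.
fn connect(url: &str) -> Result<Pool, String> {
    if url.is_empty() {
        return Err("empty database url".to_string());
    }
    Ok(Pool { url: url.to_string() })
}

pub struct PgVector {
    db_url: String,
    // Empty until `setup` runs; hidden from users of the builder.
    pool: OnceLock<Pool>,
}

impl PgVector {
    pub fn new(db_url: impl Into<String>) -> Self {
        Self {
            db_url: db_url.into(),
            pool: OnceLock::new(),
        }
    }

    /// Opens the pool on first call; later calls are no-ops.
    pub fn setup(&self) -> Result<(), String> {
        if self.pool.get().is_none() {
            let pool = connect(&self.db_url)?;
            // If another thread initialised the cell first, its pool is kept
            // and the one created here is dropped.
            let _ = self.pool.set(pool);
        }
        Ok(())
    }

    /// Returns the pool if `setup` has already succeeded.
    pub fn pool(&self) -> Option<&Pool> {
        self.pool.get()
    }
}
```

Because the cell provides interior mutability, the receiver can stay `&self`. In an async setting a comparable once-cell type from the async runtime would be the natural substitute, which is again an assumption rather than something stated in the review.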
@timonv, I'm looking for your input on a design choice here.
If we decide to handle the database connection pool setup within `fn setup(&self)` instead of `PgVectorBuilder`, we'll need to mutate `PgVector` within `fn setup()`. This change would mean updating the function signature in `trait Persist` to:
For example:
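As a rough illustration of the change being described, the sketch below uses a simplified, synchronous stand-in for the trait. It is not the actual `Persist` definition in Swiftide, and `Pool` is a hypothetical placeholder for a connection pool. It shows why a `&mut self` receiver ripples out to every implementor, including ones that mutate nothing.

```rust
// Simplified, synchronous stand-in for the trait under discussion; hypothetical.
pub trait Persist {
    /// Proposed receiver: `&mut self`, so that setup can store the pool it creates.
    fn setup(&mut self) -> Result<(), String>;
}

/// Hypothetical placeholder for a connection pool.
#[derive(Debug)]
pub struct Pool {
    pub url: String,
}

pub struct PgVector {
    pub db_url: String,
    pub pool: Option<Pool>,
}

impl Persist for PgVector {
    fn setup(&mut self) -> Result<(), String> {
        if self.pool.is_none() {
            // A real implementation would connect to the database here.
            self.pool = Some(Pool {
                url: self.db_url.clone(),
            });
        }
        Ok(())
    }
}

/// Any other storage must adopt the new receiver too, even with nothing to mutate.
pub struct MemoryStorage;

impl Persist for MemoryStorage {
    fn setup(&mut self) -> Result<(), String> {
        Ok(())
    }
}
```

Callers are affected as well: they need a mutable binding (or exclusive access) to the storage before calling `setup`.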
This adjustment would introduce breaking changes across the stack, particularly impacting:
* swiftide-indexing/src/persist/memory_storage.rs
* swiftide-integrations/src/lancedb/persist.rs
* swiftide-integrations/src/qdrant/persist.rs
* swiftide-integrations/src/redis/persist.rs

Would you prefer moving the IO operations into `Persist::setup()` for these components? If so, we could handle this as a separate PR to streamline the updates.

Looking forward to your thoughts!