Apify Vector Database Integrations

The Apify Vector Database Integrations facilitate the transfer of data from Apify Actors to a vector database. This process includes data processing, optional splitting into chunks, embedding computation, and data storage.

These integrations support incremental updates, ensuring that only changed data is updated. This reduces unnecessary embedding computation and storage operations, making it ideal for search and retrieval augmented generation (RAG) use cases.
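
The delta detection boils down to comparing a content checksum per dataset item. The following is a minimal sketch of the idea in Python, not the Actors' actual implementation; needs_update and existing_checksums are hypothetical names standing in for whatever per-item state the database already holds:

import hashlib

def needs_update(existing_checksums: dict[str, str], item_id: str, text: str) -> bool:
    """Return True when an item's content has changed since the last run."""
    # Hash the item's text: identical text yields an identical checksum,
    # so unchanged items can be skipped without recomputing embeddings.
    checksum = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return existing_checksums.get(item_id) != checksum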

This repository contains Actors for different vector databases.

How does it work?

  1. Retrieve a dataset as output from an Actor.
  2. [Optional] Split text data into chunks using LangChain.
  3. [Optional] Update only changed data.
  4. Compute embeddings, e.g. using OpenAI or Cohere.
  5. Save data into the database.
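
Written out as a minimal LangChain sketch in Python, the flow looks roughly like this. The Actors handle all of this for you; the dataset ID, the "text" field name, and the local Chroma store are illustrative placeholders, and step 3 is sketched in the previous section:

import os

from apify_client import ApifyClient
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Retrieve a dataset produced by an Actor run
client = ApifyClient(os.environ["APIFY_API_TOKEN"])
items = client.dataset("YOUR_DATASET_ID").list_items().items

# 2. Split text data into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
documents = splitter.split_documents(
    Document(page_content=item["text"], metadata={"url": item.get("url")})
    for item in items
)

# 4. Compute embeddings (OpenAI here; Cohere works the same way)
embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment

# 5. Save data into the database
Chroma.from_documents(documents, embeddings)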

Vector database integrations (Actors)

  • Chroma
  • Milvus
  • Pinecone
  • PostgreSQL (PG-Vector)
  • Qdrant
  • Weaviate

Supported Vector Embeddings

  • OpenAI
  • Cohere

How to add a new integration (PG-Vector example)

  1. Add the database to docker-compose.yml for local testing (if the database is available in Docker):
version: '3.8'

services:
  pgvector-container:
    image: pgvector/pgvector:pg16
    environment:
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=apify
    ports:
      - "5432:5432"
  2. Add the postgres dependency to pyproject.toml:

    poetry add --group=pgvector "langchain_postgres"

    and mark the group pgvector as optional (in pyproject.toml):

    [tool.poetry.group.pgvector]
    optional = true
  3. Create a new Actor in the actors directory, e.g. actors/pgvector, and add the following files:

    • README.md - the Actor documentation
    • .actor/actor.json - the Actor definition
    • .actor/input_schema.json - the Actor input schema
  4. Create a pydantic model for the Actor input schema. Edit the Makefile to generate the model from the input schema:

     datamodel-codegen --input $(DIRS_WITH_ACTORS)/pgvector/.actor/input_schema.json --output $(DIRS_WITH_CODE)/src/models/pgvector_input_model.py  --input-file-type jsonschema  --field-constraints

    and then run

    make pydantic-model
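
    The generated file should then contain a pydantic model named PgvectorIntegration. A rough sketch of its shape, inferred from the fields used in the test fixture in step 11 (the real file is produced by datamodel-codegen, so never write it by hand):

        from pydantic import BaseModel, Field

        class PgvectorIntegration(BaseModel):
            # Field names mirror the Actor input schema; descriptions are illustrative
            postgresSqlConnectionStr: str = Field(..., description="PostgreSQL connection string")
            postgresCollectionName: str = Field(..., description="Collection to store the vectors in")
            embeddingsProvider: str = Field(..., description="Embeddings provider, e.g. OpenAI or Cohere")
            embeddingsApiKey: str = Field(..., description="API key for the embeddings provider")
            datasetFields: list[str] = Field(..., description="Dataset fields to embed, e.g. ['text']")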
  5. Import the created model in src/models/__init__.py:

    from .pgvector_input_model import PgvectorIntegration
  6. Create a new module (pgvector.py) in the vector_stores directory, e.g. vector_stores/pgvector.py, and implement the PGVectorDatabase class with all required methods; a minimal skeleton is sketched below.
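
    A minimal skeleton, assuming the class wraps LangChain's PGVector store; the constructor arguments and method names follow the test fixture in step 11, and the actual class implements more required methods than shown here:

        from langchain_core.documents import Document
        from langchain_core.embeddings import Embeddings
        from langchain_postgres.vectorstores import PGVector

        from ..models import PgvectorIntegration

        class PGVectorDatabase:
            def __init__(self, actor_input: PgvectorIntegration, embeddings: Embeddings) -> None:
                self.actor_input = actor_input
                self.index = PGVector(
                    embeddings=embeddings,
                    collection_name=actor_input.postgresCollectionName,
                    connection=actor_input.postgresSqlConnectionStr,
                )
                # The test suite overrides this to wait for eventual consistency
                self.unit_test_wait_for_index = 0

            def add_documents(self, documents: list[Document], ids: list[str]) -> None:
                self.index.add_documents(documents=documents, ids=ids)

            def delete_all(self) -> None:
                # Drop and recreate the collection so each test starts clean
                self.index.delete_collection()
                self.index.create_collection()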

  7. Add PGVector into SupportedVectorStores in constants.py

       class SupportedVectorStores(str, enum.Enum):
           pgvector = "pgvector"
  8. Add PGVectorDatabase into entrypoint.py

       if actor_type == SupportedVectorStores.pgvector.value:
           await run_actor(PgvectorIntegration(**actor_input), actor_input)
  9. Add PGVectorDatabase and PgvectorIntegration into _types.py

        ActorInputsDb: TypeAlias = ChromaIntegration | PgvectorIntegration | PineconeIntegration | QdrantIntegration
        VectorDb: TypeAlias = ChromaDatabase | PGVectorDatabase | PineconeDatabase | QdrantDatabase
  10. Add PGVectorDatabase into vcs.py

        if isinstance(actor_input, PgvectorIntegration):
            from .vector_stores.pgvector import PGVectorDatabase
    
            return PGVectorDatabase(actor_input, embeddings)
  11. Add a PGVectorDatabase fixture into tests/conftest.py

       @pytest.fixture()
       def db_pgvector(crawl_1: list[Document], embeddings: Embeddings) -> PGVectorDatabase:
           db = PGVectorDatabase(
               actor_input=PgvectorIntegration(
                   postgresSqlConnectionStr=os.getenv("POSTGRESQL_CONNECTION_STR"),
                   postgresCollectionName=INDEX_NAME,
                   embeddingsProvider=EmbeddingsProvider.OpenAI.value,
                   embeddingsApiKey=os.getenv("OPENAI_API_KEY"),
                   datasetFields=["text"],
               ),
               embeddings=embeddings,
           )
    
           db.unit_test_wait_for_index = 0
    
           db.delete_all()
           # Insert initially crawled objects
           db.add_documents(documents=crawl_1, ids=[d.metadata["id"] for d in crawl_1])
    
           yield db
    
           db.delete_all()
  12. Add the db_pgvector fixture into tests/test_vector_stores.py

       DATABASE_FIXTURES = ["db_pinecone", "db_chroma", "db_qdrant", "db_pgvector"]
  13. Update README.md in the actors/pgvector directory

  14. Add pgvector to the README.md in the root directory

  15. Run the tests

    make test
  16. Run the Actor locally

    export ACTOR_PATH_IN_DOCKER_CONTEXT=actors/pgvector
    apify run -p
  17. Set up the Actor on the Apify platform at https://console.apify.com

    Build configuration

    Git URL: https://github.com/apify/store-vector-db
    Branch: master
    Folder: actors/pgvector
    
  18. Test the Actor on the Apify platform
