feat: Set Perform chunking to True by default (#41)
* Perform chunking should be set to True by default
* Correct input_schema link in the README.md
* #33
jirispilka authored Sep 11, 2024
1 parent d4d8fda commit 9af3a1a
Showing 16 changed files with 338 additions and 319 deletions.
4 changes: 2 additions & 2 deletions actors/chroma/.actor/input_schema.json
@@ -140,15 +140,15 @@
"performChunking": {
"title": "Enable text chunking",
"description": "When set to true, the text will be divided into smaller chunks based on the settings provided below. Proper chunking helps optimize retrieval and ensures accurate and efficient responses.",
"default": false,
"default": true,
"type": "boolean",
"sectionCaption": "Text chunking settings"
},
"chunkSize": {
"title": "Maximum chunk size",
"type": "integer",
"description": "Defines the maximum number of characters in each text chunk. Choosing the right size balances between detailed context and system performance. Optimal sizes ensure high relevancy and minimal response time.",
"default": 1000,
"default": 2000,
"minimum": 1
},
"chunkOverlap": {
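
With this commit, a default actor input performs chunking out of the box. A minimal sketch of the affected fields (the `chunkOverlap` value below is illustrative only; its default is not shown in this hunk):

```json
{
  "performChunking": true,
  "chunkSize": 2000,
  "chunkOverlap": 0
}
```

The same two defaults change in every integration's `input_schema.json` in this commit.
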
14 changes: 7 additions & 7 deletions actors/chroma/README.md
@@ -13,7 +13,7 @@ For instance, if you are using the [Website Content Crawler](https://apify.com/a
[Chroma](https://www.trychroma.com/) is an open-source, AI-native vector database designed for simplicity and developer productivity.
It provides SDKs for Python and JavaScript/TypeScript and includes an option for self-hosted servers.

-## How does the Apify-Chroma work?
+## 📋 How does the Apify-Chroma work?

Apify Chroma integration computes text embeddings and stores them in Chroma.
It uses [LangChain](https://www.langchain.com/) to compute embeddings and interact with [Chroma](https://www.trychroma.com/).
@@ -25,7 +25,7 @@ It uses [LangChain](https://www.langchain.com/) to compute embeddings and intera
4. Compute embeddings, e.g. using `OpenAI` or `Cohere` (specify `embeddings` and `embeddingsConfig`)
5. Save data into the database
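
As a rough sketch of how steps 4–5 surface in the actor input, assuming `OpenAI` as the provider, that `embeddingsConfig` carries the model name, and that the fields to store are selected via a `datasetFields`-style option (verify the exact keys against the input schema):

```json
{
  "embeddings": "OpenAI",
  "embeddingsConfig": { "model": "text-embedding-3-small" },
  "datasetFields": ["text"]
}
```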

-## Before you start
+## Before you start

To utilize this integration, ensure you have:

@@ -76,16 +76,16 @@ The URL (`https://fdfe-82-208-25-82.ngrok-free.app`) can be used in the as an in
Note that your specific URL will vary.
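
The ngrok URL then serves as the Chroma connection setting in the integration input; a sketch assuming the field is named `chromaClientHost` (check the input schema for the exact name):

```json
{
  "chromaClientHost": "https://fdfe-82-208-25-82.ngrok-free.app"
}
```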


-## Examples
-
-For detailed input information refer to [input schema](.actor/input_schema.json).
+## 👉 Examples

The configuration consists of three parts: Chroma, embeddings provider, and data.

Ensure that the vector size of your embeddings aligns with the configuration of your Chroma database.
For instance, if you're using the `text-embedding-3-small` model from `OpenAI`, it generates vectors of size `1536`.
This means your Chroma index should also be configured to accommodate vectors of the same size, `1536` in this case.

+For detailed input information refer to the [Input page](https://apify.com/apify/chroma-integration/input-schema).

#### Database: Chroma
```json
{
@@ -197,11 +197,11 @@ To disable this feature, set `deleteExpiredObjects` to `false`.
Otherwise, data crawled by one Actor might expire due to inconsistent crawling schedules.
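
Disabling expiration is a one-field change; a minimal sketch using only the option named above:

```json
{
  "deleteExpiredObjects": false
}
```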


-## Outputs
+## 💾 Outputs

This integration will save the selected fields from your Actor to Chroma.

-## Example configuration
+## 🔢 Example configuration

#### Full Input Example for Website Content Crawler Actor with Chroma integration

4 changes: 2 additions & 2 deletions actors/milvus/.actor/input_schema.json
@@ -132,15 +132,15 @@
"performChunking": {
"title": "Enable text chunking",
"description": "When set to true, the text will be divided into smaller chunks based on the settings provided below. Proper chunking helps optimize retrieval and ensures accurate and efficient responses.",
"default": false,
"default": true,
"type": "boolean",
"sectionCaption": "Text chunking settings"
},
"chunkSize": {
"title": "Maximum chunk size",
"type": "integer",
"description": "Defines the maximum number of characters in each text chunk. Choosing the right size balances between detailed context and system performance. Optimal sizes ensure high relevancy and minimal response time.",
"default": 1000,
"default": 2000,
"minimum": 1
},
"chunkOverlap": {
6 changes: 4 additions & 2 deletions actors/milvus/README.md
@@ -27,6 +27,8 @@ It uses [LangChain](https://www.langchain.com/) to compute embeddings and intera
4. Compute embeddings, e.g. using `OpenAI` or `Cohere` (specify `embeddings` and `embeddingsConfig`)
5. Save data into the database

+![Apify-pinecone-integration](https://raw.githubusercontent.com/apify/actor-vector-database-integrations/master/docs/Apify-milvus-integration-readme.png)

## ✅ Before you start

To utilize this integration, ensure you have:
@@ -40,14 +42,14 @@ For more details, please refer to the [Milvus documentation](https://milvus.io/d

## 👉 Examples

-For detailed input information refer to [input schema](.actor/input_schema.json).

The configuration consists of three parts: Milvus, embeddings provider, and data.

Ensure that the vector size of your embeddings aligns with the configuration of your Milvus index.
For instance, if you're using the `text-embedding-3-small` model from `OpenAI`, it generates vectors of size `1536`.
This means your Milvus index should also be configured to accommodate vectors of the same size, `1536` in this case.

+For detailed input information refer to the [Input page](https://apify.com/apify/milvus-integration/input-schema).

#### Database: Milvus
```json
{
4 changes: 2 additions & 2 deletions actors/pgvector/.actor/input_schema.json
@@ -113,15 +113,15 @@
"performChunking": {
"title": "Enable text chunking",
"description": "When set to true, the text will be divided into smaller chunks based on the settings provided below. Proper chunking helps optimize retrieval and ensures accurate and efficient responses.",
"default": false,
"default": true,
"type": "boolean",
"sectionCaption": "Text chunking settings"
},
"chunkSize": {
"title": "Maximum chunk size",
"type": "integer",
"description": "Defines the maximum number of characters in each text chunk. Choosing the right size balances between detailed context and system performance. Optimal sizes ensure high relevancy and minimal response time.",
"default": 1000,
"default": 2000,
"minimum": 1
},
"chunkOverlap": {
14 changes: 7 additions & 7 deletions actors/pgvector/README.md
@@ -9,7 +9,7 @@ This approach reduces unnecessary embedding computation and storage operations,
💡 **Note**: This Actor is meant to be used together with other Actors' integration sections.
For instance, if you are using the [Website Content Crawler](https://apify.com/apify/website-content-crawler), you can activate PGVector integration to save web data as vectors to PostgreSQL.

-## How does it work?
+## 📋 How does Apify-PGVector integration work?

Apify PGVector integration computes text embeddings and stores them in PostgreSQL.
It uses [LangChain](https://www.langchain.com/) to compute embeddings and interact with [PGVector](https://github.com/pgvector/pgvector).
@@ -21,23 +21,23 @@ It uses [LangChain](https://www.langchain.com/) to compute embeddings and intera
4. Compute embeddings, e.g. using `OpenAI` or `Cohere` (specify `embeddings` and `embeddingsConfig`)
5. Save data into the database

-## Before you start
+## Before you start

To utilize this integration, ensure you have:

- Created or existing `PostgreSQL` database with PGVector extension. You need to know `postgresSqlConnectionStr` and `postgresCollectionName`.
- An account to compute embeddings using one of the providers, e.g., [OpenAI](https://platform.openai.com/docs/guides/embeddings) or [Cohere](https://docs.cohere.com/docs/cohere-embed).

-## Examples
-
-For detailed input information refer to [input schema](.actor/input_schema.json).
+## 👉 Examples

The configuration consists of three parts: PGVector, embeddings provider, and data.

Ensure that the vector size of your embeddings aligns with the configuration of your PostgreSQL.
For instance, if you're using the `text-embedding-3-small` model from `OpenAI`, it generates vectors of size `1536`.
This means your PostgreSQL vector should also be configured to accommodate vectors of the same size, `1536` in this case.

+For detailed input information refer to the [Input page](https://apify.com/apify/pgvector-integration/input-schema).

#### Database: PostgreSQL with PGVector
```json
{
@@ -148,11 +148,11 @@ To disable this feature, set `deleteExpiredObjects` to `false`.
Otherwise, data crawled by one Actor might expire due to inconsistent crawling schedules.


-## Outputs
+## 💾 Outputs

This integration will save the selected fields from your Actor to PostgreSQL.

-## Example configuration
+## 🔢 Example configuration

#### Full Input Example for Website Content Crawler Actor with PostgreSQL integration

4 changes: 2 additions & 2 deletions actors/pinecone/.actor/input_schema.json
@@ -119,15 +119,15 @@
"performChunking": {
"title": "Enable text chunking",
"description": "When set to true, the text will be divided into smaller chunks based on the settings provided below. Proper chunking helps optimize retrieval and ensures accurate and efficient responses.",
"default": false,
"default": true,
"type": "boolean",
"sectionCaption": "Text chunking settings"
},
"chunkSize": {
"title": "Maximum chunk size",
"type": "integer",
"description": "Defines the maximum number of characters in each text chunk. Choosing the right size balances between detailed context and system performance. Optimal sizes ensure high relevancy and minimal response time.",
"default": 1000,
"default": 2000,
"minimum": 1
},
"chunkOverlap": {
6 changes: 5 additions & 1 deletion actors/pinecone/README.md
@@ -31,6 +31,8 @@ It uses [LangChain](https://www.langchain.com/) to compute embeddings and intera
4. Compute embeddings, e.g. using `OpenAI` or `Cohere` (specify `embeddings` and `embeddingsConfig`)
5. Save data into the database

+![Apify-pinecone-integration](https://raw.githubusercontent.com/apify/actor-vector-database-integrations/master/docs/Apify-pinecone-integration-readme.png)

## ✅ Before you start

To utilize this integration, ensure you have:
@@ -43,9 +45,11 @@ To utilize this integration, ensure you have:
The configuration consists of three parts: Pinecone, embeddings provider, and data.

Ensure that the vector size of your embeddings aligns with the configuration of your Pinecone index.
-For instance, if you're using the `text-embedding-3-small` model from `OpenAI`, it generates vectors of size `1536`.
+For instance, if you're using the `text-embedding-3-small` model from `OpenAI`, it generates vectors of size `1536`.
This means your Pinecone index should also be configured to accommodate vectors of the same size, `1536` in this case.

+For detailed input information refer to the [Input page](https://apify.com/apify/pinecone-integration/input-schema).

#### Database: Pinecone
```json
{
4 changes: 2 additions & 2 deletions actors/qdrant/.actor/input_schema.json
@@ -125,15 +125,15 @@
"performChunking": {
"title": "Enable text chunking",
"description": "When set to true, the text will be divided into smaller chunks based on the settings provided below. Proper chunking helps optimize retrieval and ensures accurate and efficient responses.",
"default": false,
"default": true,
"type": "boolean",
"sectionCaption": "Text chunking settings"
},
"chunkSize": {
"title": "Maximum chunk size",
"type": "integer",
"description": "Defines the maximum number of characters in each text chunk. Choosing the right size balances between detailed context and system performance. Optimal sizes ensure high relevancy and minimal response time.",
"default": 1000,
"default": 2000,
"minimum": 1
},
"chunkOverlap": {
12 changes: 6 additions & 6 deletions actors/qdrant/README.md
@@ -16,7 +16,7 @@ Qdrant is mature, large-scale, high-performance, low-latency vector database, op
Built in Rust, Qdrant offers SDKs for a wide range of programming languages, including Rust, JavaScript/TypeScript, Python, Golang, and Java.
It also demonstrates strong performance in [benchmarks](https://qdrant.tech/benchmarks/).

-## How does the Apify-Qdrant integration work?
+## 📋 How does the Apify-Qdrant integration work?

It uses [LangChain](https://www.langchain.com/) to compute vector embeddings and interact with [Qdrant](https://www.qdrant.tech/).

@@ -34,16 +34,16 @@ To utilize this integration, ensure you have:
- A Qdrant instance to connect to. You can run Qdrant using Docker, or quickly set up a free cloud instance at [cloud.qdrant.io](https://cloud.qdrant.io/).
- An account to compute embeddings using one of the providers, e.g., [OpenAI](https://platform.openai.com/docs/guides/embeddings) or [Cohere](https://docs.cohere.com/docs/cohere-embed).

-## Examples
-
-For detailed input information refer to [input schema](.actor/input_schema.json).
+## 👉 Examples

The configuration consists of three parts: Qdrant, embeddings provider, and data.

Ensure that the vector size of your embeddings aligns with the configuration of your Qdrant settings.
For instance, if you're using the `text-embedding-3-small` model from `OpenAI`, it generates vectors of size `1536`.
This means your Qdrant collection should also be configured to accommodate vectors of the same size, `1536` in this case.

+For detailed input information refer to the [Input page](https://apify.com/apify/qdrant-integration/input-schema).

#### Database: Qdrant
```json
{
@@ -156,11 +156,11 @@ To disable this feature, set `deleteExpiredObjects` to `false`.
Otherwise, data crawled by one Actor might expire due to inconsistent crawling schedules.


-## Outputs
+## 💾 Outputs

This integration will save the selected fields from your Actor to Qdrant.

-## Example configuration
+## 🔢 Example configuration

#### Full Input Example for Website Content Crawler Actor with Qdrant integration

4 changes: 2 additions & 2 deletions actors/weaviate/.actor/input_schema.json
@@ -119,15 +119,15 @@
"performChunking": {
"title": "Enable text chunking",
"description": "When set to true, the text will be divided into smaller chunks based on the settings provided below. Proper chunking helps optimize retrieval and ensures accurate and efficient responses.",
"default": false,
"default": true,
"type": "boolean",
"sectionCaption": "Text chunking settings"
},
"chunkSize": {
"title": "Maximum chunk size",
"type": "integer",
"description": "Defines the maximum number of characters in each text chunk. Choosing the right size balances between detailed context and system performance. Optimal sizes ensure high relevancy and minimal response time.",
"default": 1000,
"default": 2000,
"minimum": 1
},
"chunkOverlap": {
14 changes: 7 additions & 7 deletions actors/weaviate/README.md
@@ -16,7 +16,7 @@ It is useful for similarity searches, making it useful for AI applications such
Weaviate supports both raw vectors and structured data, allowing for the combination of vector search with traditional filtering methods.
Clients are available for Python, Java, JavaScript/TypeScript, and Golang.

-## How does the Apify-Weaviate integration work?
+## 📋 How does the Apify-Weaviate integration work?

Apify Weaviate integration computes text embeddings and stores them in Weaviate.
It uses [LangChain](https://www.langchain.com/) to compute embeddings and interact with [Weaviate](https://weaviate.io/).
@@ -28,7 +28,7 @@ It uses [LangChain](https://www.langchain.com/) to compute embeddings and intera
4. Compute embeddings, e.g. using `OpenAI` or `Cohere` (specify `embeddings` and `embeddingsConfig`)
5. Save data into the database

-## Before you start
+## Before you start

To utilize this integration, ensure you have:

@@ -37,9 +37,7 @@ To utilize this integration, ensure you have:

You can run Weaviate using Docker, or try the managed [Weaviate](https://weaviate.io).

-## Examples
-
-For detailed input information refer to [input schema](.actor/input_schema.json).
+## 👉 Examples

The configuration consists of three parts: Weaviate, embeddings provider, and data.

@@ -51,6 +49,8 @@ This means your Weaviate index should also be configured to accommodate vectors
If the embedding model is not set up correctly, the only indication might be in the logs.
Therefore, it's crucial to double-check your configuration to avoid any potential issues.

+For detailed input information refer to the [Input page](https://apify.com/apify/weaviate-integration/input-schema).

#### Database: Weaviate
```json
{
@@ -162,11 +162,11 @@ To disable this feature, set `deleteExpiredObjects` to `false`.
Otherwise, data crawled by one Actor might expire due to inconsistent crawling schedules.


-## Outputs
+## 💾 Outputs

This integration will save the selected fields from your Actor to Weaviate.

-## Example configuration
+## 🔢 Example configuration

#### Full Input Example for Website Content Crawler Actor with Weaviate integration
