feat: Set Perform chunking to True by default (#41)
* Perform chunking should be set to True by default
* Correct input_schema link in the README.md
* #33
jirispilka authored Sep 11, 2024
1 parent d4d8fda commit 9af3a1a
Showing 16 changed files with 338 additions and 319 deletions.
4 changes: 2 additions & 2 deletions actors/chroma/.actor/input_schema.json
@@ -140,15 +140,15 @@
"performChunking": {
"title": "Enable text chunking",
"description": "When set to true, the text will be divided into smaller chunks based on the settings provided below. Proper chunking helps optimize retrieval and ensures accurate and efficient responses.",
"default": false,
"default": true,
"type": "boolean",
"sectionCaption": "Text chunking settings"
},
"chunkSize": {
"title": "Maximum chunk size",
"type": "integer",
"description": "Defines the maximum number of characters in each text chunk. Choosing the right size balances between detailed context and system performance. Optimal sizes ensure high relevancy and minimal response time.",
"default": 1000,
"default": 2000,
"minimum": 1
},
"chunkOverlap": {
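
With this commit, a default actor input performs chunking out of the box. A minimal sketch of the affected fields (the `chunkOverlap` value below is illustrative only; its default is not shown in this hunk):

```json
{
  "performChunking": true,
  "chunkSize": 2000,
  "chunkOverlap": 0
}
```

The same two defaults change in every integration's `input_schema.json` in this commit.
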
14 changes: 7 additions & 7 deletions actors/chroma/README.md
@@ -13,7 +13,7 @@ For instance, if you are using the [Website Content Crawler](https://apify.com/a
[Chroma](https://www.trychroma.com/) is an open-source, AI-native vector database designed for simplicity and developer productivity.
It provides SDKs for Python and JavaScript/TypeScript and includes an option for self-hosted servers.

-## How does the Apify-Chroma work?
+## 📋 How does the Apify-Chroma work?

Apify Chroma integration computes text embeddings and stores them in Chroma.
It uses [LangChain](https://www.langchain.com/) to compute embeddings and interact with [Chroma](https://www.trychroma.com/).
@@ -25,7 +25,7 @@ It uses [LangChain](https://www.langchain.com/) to compute embeddings and intera
4. Compute embeddings, e.g. using `OpenAI` or `Cohere` (specify `embeddings` and `embeddingsConfig`)
5. Save data into the database
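
As a rough sketch of how steps 4–5 surface in the actor input, assuming `OpenAI` as the provider, that `embeddingsConfig` carries the model name, and that the fields to store are selected via a `datasetFields`-style option (verify the exact keys against the input schema):

```json
{
  "embeddings": "OpenAI",
  "embeddingsConfig": { "model": "text-embedding-3-small" },
  "datasetFields": ["text"]
}
```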

-## Before you start
+## Before you start

To utilize this integration, ensure you have:

@@ -76,16 +76,16 @@ The URL (`https://fdfe-82-208-25-82.ngrok-free.app`) can be used in the as an in
Note that your specific URL will vary.
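
The ngrok URL then serves as the Chroma connection setting in the integration input; a sketch assuming the field is named `chromaClientHost` (check the input schema for the exact name):

```json
{
  "chromaClientHost": "https://fdfe-82-208-25-82.ngrok-free.app"
}
```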


-## Examples
-
-For detailed input information refer to [input schema](.actor/input_schema.json).
+## 👉 Examples

The configuration consists of three parts: Chroma, embeddings provider, and data.

Ensure that the vector size of your embeddings aligns with the configuration of your Chroma database.
For instance, if you're using the `text-embedding-3-small` model from `OpenAI`, it generates vectors of size `1536`.
This means your Chroma index should also be configured to accommodate vectors of the same size, `1536` in this case.

+For detailed input information refer to the [Input page](https://apify.com/apify/chroma-integration/input-schema).

#### Database: Chroma
```json
{
@@ -197,11 +197,11 @@ To disable this feature, set `deleteExpiredObjects` to `false`.
Otherwise, data crawled by one Actor might expire due to inconsistent crawling schedules.
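
Disabling expiration is a one-field change; a minimal sketch using only the option named above:

```json
{
  "deleteExpiredObjects": false
}
```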


-## Outputs
+## 💾 Outputs

This integration will save the selected fields from your Actor to Chroma.

-## Example configuration
+## 🔢 Example configuration

#### Full Input Example for Website Content Crawler Actor with Chroma integration

4 changes: 2 additions & 2 deletions actors/milvus/.actor/input_schema.json
@@ -132,15 +132,15 @@
"performChunking": {
"title": "Enable text chunking",
"description": "When set to true, the text will be divided into smaller chunks based on the settings provided below. Proper chunking helps optimize retrieval and ensures accurate and efficient responses.",
"default": false,
"default": true,
"type": "boolean",
"sectionCaption": "Text chunking settings"
},
"chunkSize": {
"title": "Maximum chunk size",
"type": "integer",
"description": "Defines the maximum number of characters in each text chunk. Choosing the right size balances between detailed context and system performance. Optimal sizes ensure high relevancy and minimal response time.",
"default": 1000,
"default": 2000,
"minimum": 1
},
"chunkOverlap": {
6 changes: 4 additions & 2 deletions actors/milvus/README.md
@@ -27,6 +27,8 @@ It uses [LangChain](https://www.langchain.com/) to compute embeddings and intera
4. Compute embeddings, e.g. using `OpenAI` or `Cohere` (specify `embeddings` and `embeddingsConfig`)
5. Save data into the database

+![Apify-pinecone-integration](https://raw.githubusercontent.com/apify/actor-vector-database-integrations/master/docs/Apify-milvus-integration-readme.png)

## ✅ Before you start

To utilize this integration, ensure you have:
@@ -40,14 +42,14 @@ For more details, please refer to the [Milvus documentation](https://milvus.io/d

## 👉 Examples

-For detailed input information refer to [input schema](.actor/input_schema.json).

The configuration consists of three parts: Milvus, embeddings provider, and data.

Ensure that the vector size of your embeddings aligns with the configuration of your Milvus index.
For instance, if you're using the `text-embedding-3-small` model from `OpenAI`, it generates vectors of size `1536`.
This means your Milvus index should also be configured to accommodate vectors of the same size, `1536` in this case.

+For detailed input information refer to the [Input page](https://apify.com/apify/milvus-integration/input-schema).

#### Database: Milvus
```json
{
4 changes: 2 additions & 2 deletions actors/pgvector/.actor/input_schema.json
@@ -113,15 +113,15 @@
"performChunking": {
"title": "Enable text chunking",
"description": "When set to true, the text will be divided into smaller chunks based on the settings provided below. Proper chunking helps optimize retrieval and ensures accurate and efficient responses.",
"default": false,
"default": true,
"type": "boolean",
"sectionCaption": "Text chunking settings"
},
"chunkSize": {
"title": "Maximum chunk size",
"type": "integer",
"description": "Defines the maximum number of characters in each text chunk. Choosing the right size balances between detailed context and system performance. Optimal sizes ensure high relevancy and minimal response time.",
"default": 1000,
"default": 2000,
"minimum": 1
},
"chunkOverlap": {
14 changes: 7 additions & 7 deletions actors/pgvector/README.md
@@ -9,7 +9,7 @@ This approach reduces unnecessary embedding computation and storage operations,
💡 **Note**: This Actor is meant to be used together with other Actors' integration sections.
For instance, if you are using the [Website Content Crawler](https://apify.com/apify/website-content-crawler), you can activate PGVector integration to save web data as vectors to PostgreSQL.

-## How does it work?
+## 📋 How does Apify-PGVector integration work?

Apify PGVector integration computes text embeddings and stores them in PostgreSQL.
It uses [LangChain](https://www.langchain.com/) to compute embeddings and interact with [PGVector](https://github.com/pgvector/pgvector).
@@ -21,23 +21,23 @@ It uses [LangChain](https://www.langchain.com/) to compute embeddings and intera
4. Compute embeddings, e.g. using `OpenAI` or `Cohere` (specify `embeddings` and `embeddingsConfig`)
5. Save data into the database

-## Before you start
+## Before you start

To utilize this integration, ensure you have:

- Created or existing `PostgreSQL` database with PGVector extension. You need to know `postgresSqlConnectionStr` and `postgresCollectionName`.
- An account to compute embeddings using one of the providers, e.g., [OpenAI](https://platform.openai.com/docs/guides/embeddings) or [Cohere](https://docs.cohere.com/docs/cohere-embed).

-## Examples
-
-For detailed input information refer to [input schema](.actor/input_schema.json).
+## 👉 Examples

The configuration consists of three parts: PGVector, embeddings provider, and data.

Ensure that the vector size of your embeddings aligns with the configuration of your PostgreSQL.
For instance, if you're using the `text-embedding-3-small` model from `OpenAI`, it generates vectors of size `1536`.
This means your PostgreSQL vector should also be configured to accommodate vectors of the same size, `1536` in this case.

+For detailed input information refer to the [Input page](https://apify.com/apify/pgvector-integration/input-schema).

#### Database: PostgreSQL with PGVector
```json
{
@@ -148,11 +148,11 @@ To disable this feature, set `deleteExpiredObjects` to `false`.
Otherwise, data crawled by one Actor might expire due to inconsistent crawling schedules.


-## Outputs
+## 💾 Outputs

This integration will save the selected fields from your Actor to PostgreSQL.

-## Example configuration
+## 🔢 Example configuration

#### Full Input Example for Website Content Crawler Actor with PostgreSQL integration

4 changes: 2 additions & 2 deletions actors/pinecone/.actor/input_schema.json
@@ -119,15 +119,15 @@
"performChunking": {
"title": "Enable text chunking",
"description": "When set to true, the text will be divided into smaller chunks based on the settings provided below. Proper chunking helps optimize retrieval and ensures accurate and efficient responses.",
"default": false,
"default": true,
"type": "boolean",
"sectionCaption": "Text chunking settings"
},
"chunkSize": {
"title": "Maximum chunk size",
"type": "integer",
"description": "Defines the maximum number of characters in each text chunk. Choosing the right size balances between detailed context and system performance. Optimal sizes ensure high relevancy and minimal response time.",
"default": 1000,
"default": 2000,
"minimum": 1
},
"chunkOverlap": {
6 changes: 5 additions & 1 deletion actors/pinecone/README.md
@@ -31,6 +31,8 @@ It uses [LangChain](https://www.langchain.com/) to compute embeddings and intera
4. Compute embeddings, e.g. using `OpenAI` or `Cohere` (specify `embeddings` and `embeddingsConfig`)
5. Save data into the database

+![Apify-pinecone-integration](https://raw.githubusercontent.com/apify/actor-vector-database-integrations/master/docs/Apify-pinecone-integration-readme.png)

## ✅ Before you start

To utilize this integration, ensure you have:
@@ -43,9 +45,11 @@ To utilize this integration, ensure you have:
The configuration consists of three parts: Pinecone, embeddings provider, and data.

Ensure that the vector size of your embeddings aligns with the configuration of your Pinecone index.
-For instance, if you're using the `text-embedding-3-small` model from `OpenAI`, it generates vectors of size `1536`.
+For instance, if you're using the `text-embedding-3-small` model from `OpenAI`, it generates vectors of size `1536`.
This means your Pinecone index should also be configured to accommodate vectors of the same size, `1536` in this case.

+For detailed input information refer to the [Input page](https://apify.com/apify/pinecone-integration/input-schema).

#### Database: Pinecone
```json
{
4 changes: 2 additions & 2 deletions actors/qdrant/.actor/input_schema.json
@@ -125,15 +125,15 @@
"performChunking": {
"title": "Enable text chunking",
"description": "When set to true, the text will be divided into smaller chunks based on the settings provided below. Proper chunking helps optimize retrieval and ensures accurate and efficient responses.",
"default": false,
"default": true,
"type": "boolean",
"sectionCaption": "Text chunking settings"
},
"chunkSize": {
"title": "Maximum chunk size",
"type": "integer",
"description": "Defines the maximum number of characters in each text chunk. Choosing the right size balances between detailed context and system performance. Optimal sizes ensure high relevancy and minimal response time.",
"default": 1000,
"default": 2000,
"minimum": 1
},
"chunkOverlap": {
12 changes: 6 additions & 6 deletions actors/qdrant/README.md
@@ -16,7 +16,7 @@ Qdrant is mature, large-scale, high-performance, low-latency vector database, op
Built in Rust, Qdrant offers SDKs for a wide range of programming languages, including Rust, JavaScript/TypeScript, Python, Golang, and Java.
It also demonstrates strong performance in [benchmarks](https://qdrant.tech/benchmarks/).

-## How does the Apify-Qdrant integration work?
+## 📋 How does the Apify-Qdrant integration work?

It uses [LangChain](https://www.langchain.com/) to compute vector embeddings and interact with [Qdrant](https://www.qdrant.tech/).

@@ -34,16 +34,16 @@ To utilize this integration, ensure you have:
- A Qdrant instance to connect to. You can run Qdrant using Docker, or quickly set up a free cloud instance at [cloud.qdrant.io](https://cloud.qdrant.io/).
- An account to compute embeddings using one of the providers, e.g., [OpenAI](https://platform.openai.com/docs/guides/embeddings) or [Cohere](https://docs.cohere.com/docs/cohere-embed).

-## Examples
-
-For detailed input information refer to [input schema](.actor/input_schema.json).
+## 👉 Examples

The configuration consists of three parts: Qdrant, embeddings provider, and data.

Ensure that the vector size of your embeddings aligns with the configuration of your Qdrant settings.
For instance, if you're using the `text-embedding-3-small` model from `OpenAI`, it generates vectors of size `1536`.
This means your Qdrant collection should also be configured to accommodate vectors of the same size, `1536` in this case.

+For detailed input information refer to the [Input page](https://apify.com/apify/qdrant-integration/input-schema).

#### Database: Qdrant
```json
{
@@ -156,11 +156,11 @@ To disable this feature, set `deleteExpiredObjects` to `false`.
Otherwise, data crawled by one Actor might expire due to inconsistent crawling schedules.


-## Outputs
+## 💾 Outputs

This integration will save the selected fields from your Actor to Qdrant.

-## Example configuration
+## 🔢 Example configuration

#### Full Input Example for Website Content Crawler Actor with Qdrant integration

4 changes: 2 additions & 2 deletions actors/weaviate/.actor/input_schema.json
@@ -119,15 +119,15 @@
"performChunking": {
"title": "Enable text chunking",
"description": "When set to true, the text will be divided into smaller chunks based on the settings provided below. Proper chunking helps optimize retrieval and ensures accurate and efficient responses.",
"default": false,
"default": true,
"type": "boolean",
"sectionCaption": "Text chunking settings"
},
"chunkSize": {
"title": "Maximum chunk size",
"type": "integer",
"description": "Defines the maximum number of characters in each text chunk. Choosing the right size balances between detailed context and system performance. Optimal sizes ensure high relevancy and minimal response time.",
"default": 1000,
"default": 2000,
"minimum": 1
},
"chunkOverlap": {
14 changes: 7 additions & 7 deletions actors/weaviate/README.md
@@ -16,7 +16,7 @@ It is useful for similarity searches, making it useful for AI applications such
Weaviate supports both raw vectors and structured data, allowing for the combination of vector search with traditional filtering methods.
Clients are available for Python, Java, JavaScript/TypeScript, and Golang.

-## How does the Apify-Weaviate integration work?
+## 📋 How does the Apify-Weaviate integration work?

Apify Weaviate integration computes text embeddings and stores them in Weaviate.
It uses [LangChain](https://www.langchain.com/) to compute embeddings and interact with [Weaviate](https://weaviate.io/).
@@ -28,7 +28,7 @@ It uses [LangChain](https://www.langchain.com/) to compute embeddings and intera
4. Compute embeddings, e.g. using `OpenAI` or `Cohere` (specify `embeddings` and `embeddingsConfig`)
5. Save data into the database

-## Before you start
+## Before you start

To utilize this integration, ensure you have:

@@ -37,9 +37,7 @@ To utilize this integration, ensure you have:

You can run Weaviate using Docker, or try the managed [Weaviate](https://weaviate.io).

-## Examples
-
-For detailed input information refer to [input schema](.actor/input_schema.json).
+## 👉 Examples

The configuration consists of three parts: Weaviate, embeddings provider, and data.

@@ -51,6 +49,8 @@ This means your Weaviate index should also be configured to accommodate vectors
If the embedding model is not set up correctly, the only indication might be in the logs.
Therefore, it's crucial to double-check your configuration to avoid any potential issues.

+For detailed input information refer to the [Input page](https://apify.com/apify/weaviate-integration/input-schema).

#### Database: Weaviate
```json
{
@@ -162,11 +162,11 @@ To disable this feature, set `deleteExpiredObjects` to `false`.
Otherwise, data crawled by one Actor might expire due to inconsistent crawling schedules.


-## Outputs
+## 💾 Outputs

This integration will save the selected fields from your Actor to Weaviate.

-## Example configuration
+## 🔢 Example configuration

#### Full Input Example for Website Content Crawler Actor with Weaviate integration
