Commit 0ed5ad9

split process and ingest sub-commands and other comments from latest review meeting

Signed-off-by: Daniele Martinoli <dmartino@redhat.com>
dmartinol committed Dec 11, 2024
1 parent c1e5104 commit 0ed5ad9
Showing 1 changed file with 119 additions and 86 deletions: docs/cli/ilab-rag-retrieval.md

# Design Proposal - Embedding Ingestion Pipeline And RAG-Based Chat
**TODOs**:
* Vector store authentication options.
* Document versioning and data update policies.
* Unify prompt management in InstructLab. See the `chat_template` [configuration][chat_template] and the
  `augment_chat_template` [function][augment_chat_template].

**Author**: Daniele Martinoli

This document proposes enhancements to the `ilab` CLI to support workflows utilizing Retrieval-Augmented Generation
(RAG) artifacts within `InstructLab`. The proposed changes introduce new commands and options for the embedding ingestion
and RAG-based chat pipelines:
* A new `ilab data` sub-command to process customer documentation.
* A new `ilab data` sub-command to generate and ingest embeddings from pre-processed documents into a configured vector store.
* An option to enhance the chat pipeline by using the stored embeddings to augment the context of conversations, improving relevance and accuracy.

## 2. Proposed Pipelines
### 2.1 Working Assumption
This proposal aims to serve as a reference design to develop a Proof of Concept for RAG workflows, while
also laying the foundation for future implementations of state-of-the-art RAG artifacts tailored to specific use
cases.

To minimize impact on current and future users, the following boundaries are defined:
* Existing commands will only be updated to add flags or configurations that are maintainable in future versions.
* Configuration parameters will be marked as `Optional` so they do not affect existing runtimes.
* New commands will be added using flags (e.g., `--abc`) or environment variables (e.g., `ILAB_ABC`) to avoid any configuration changes.
This approach ensures configuration compatibility even if the commands are deprecated later.
#### Command Options
To maintain compatibility and simplicity, no new configurations will be introduced for new commands. Instead,
the settings will be defined using the following hierarchy (options higher in the list overriding those below):
* CLI flags (e.g., `--FLAG`).
* Environment variables following a consistent naming convention, such as `ILAB_<UPPERCASE_ARGUMENT_NAME>`.
* Default values, for all applicable use cases.

For example, the `vectordb-uri` argument can be implemented using the `click` module like this:
```py
@click.option(
    "--vectordb-uri",
    default="rag-output.db",
    envvar="ILAB_VECTORDB_URI",
)
```

#### Local Embedding Models
The embedding model used to generate text embeddings must be downloaded locally before executing the pipeline.

For example, this command can be used to download the `sentence-transformers/all-minilm-l6-v2` model to the local models cache:
```bash
ilab model download -rp sentence-transformers/all-minilm-l6-v2
```

If the configured embedding model has not been cached, the command execution will terminate with an error. This requirement applies
consistently to all new and updated commands.
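
As an illustrative sketch (not part of the current CLI), such a pre-check could look like the following, assuming the
default models cache layout under `$HOME/.cache/instructlab/models`; the helper name is purely hypothetical:
```py
from pathlib import Path

import click

# Illustrative helper, not part of the current CLI: fail fast when the
# configured embedding model is missing from the local models cache.
def ensure_model_cached(model_dir: str, embedding_model: str) -> Path:
    model_path = Path(model_dir).expanduser() / embedding_model
    if not model_path.exists():
        raise click.ClickException(
            f"Embedding model '{embedding_model}' is not cached under '{model_dir}'. "
            f"Run: ilab model download -rp {embedding_model}"
        )
    return model_path
```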

### 2.2 Document Processing Pipeline
The proposal is to add a `process` sub-command to the `data` command group:
```
ilab data process --input /path/to/docs/folder --output /path/to/processed/folder
```

#### Command Purpose
Apply the transformation to the customer documents in `/path/to/docs/folder`. Processed artifacts are stored under `/path/to/processed/folder`.

**Notes**:
* In alignment with the current SDG implementation, the folder will not be navigated recursively. Only files located at the root level of the specified
folder will be considered. The same principle applies to all other options outlined below.
* To ensure consistency and avoid issues with document versioning or outdated artifacts, the destination folder will be cleared before execution.
This ensures it contains only the artifacts generated from the most recent run.

The transformation is based on the `instructlab-sdg` modules (the initial step of the `ilab data generate` pipeline).
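
The `instructlab-sdg` entry points are not detailed here; as a rough sketch, under the assumption that recent Docling
releases expose a `DocumentConverter`, the transformation step could be approximated as:
```py
import json
from pathlib import Path

from docling.document_converter import DocumentConverter

def process_docs(input_dir: str, output_dir: str) -> None:
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    converter = DocumentConverter()
    # Only files at the root level are considered, matching the SDG behavior.
    for source in Path(input_dir).glob("*.pdf"):
        result = converter.convert(source)
        artifact = out / f"{source.stem}.json"
        artifact.write_text(json.dumps(result.document.export_to_dict()))
```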

#### Why We Need It
This command streamlines the `ilab data generate` pipeline and eliminates the requirement to define a `qna` document,
which typically includes question-and-answer examples used to drive synthetic data generation.
The goal is not to generate training data for InstructLab-trained models but to utilize the documents for RAG
workflows with pre-tuned models.

#### Usage
The generated artifacts can later be used to generate and ingest the embeddings into a vector database.

### 2.3 Document Processing Pipeline Options

| Option Description | Default Value | CLI Flag | Environment Variable |
|--------------------|---------------|----------|----------------------|
| Base directories where models are stored. | `$HOME/.cache/instructlab/models` | `--model-dir` | `ILAB_MODEL_DIR` |
| Name of the embedding model. | **TBD** | `--embedding-model` | `ILAB_EMBEDDING_MODEL_NAME` |
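
Following the same `click` pattern shown in [2.1 Working Assumption](#21-working-assumption), these options might be
declared as follows; the default embedding model below is only a placeholder, since the actual default is still TBD:
```py
import click

@click.command()
@click.option(
    "--model-dir",
    default="~/.cache/instructlab/models",
    envvar="ILAB_MODEL_DIR",
)
@click.option(
    "--embedding-model",
    default="sentence-transformers/all-minilm-l6-v2",  # placeholder; the default is TBD
    envvar="ILAB_EMBEDDING_MODEL_NAME",
)
def process(model_dir: str, embedding_model: str) -> None:
    """Document processing entry point (illustrative)."""
    ...
```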

### 2.4 Embedding Ingestion Pipeline
The proposal is to add an `ingest` sub-command to the `data` command group:
```
ilab data ingest /path/to/docs/folder
```

#### Working Assumption
The documents at the specified path have already been processed using the `data process` command or an equivalent method
(see [Getting Started with Knowledge Contributions][ilab-knowledge]).

#### Command Purpose
Generate the embeddings from the pre-processed documents in the */path/to/docs/folder* folder and store them in the
configured vector database.

**Notes**:
* To ensure consistency and avoid issues with document versioning or outdated embeddings, the ingested collection will be cleared before execution.
This ensures it contains only the embeddings generated from the most recent run.
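
A minimal sketch of the ingestion step, assuming the Haystack 2.x and `milvus-haystack` APIs from the proposed stack;
the artifact layout (one JSON file per processed document, with a `text` field) is an assumption:
```py
import json
from pathlib import Path

from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from milvus_haystack import MilvusDocumentStore

# Clear the collection on each run so it holds only the latest embeddings,
# mirroring the note above.
store = MilvusDocumentStore(
    connection_args={"uri": "./rag-output.db"},
    collection_name="IlabEmbeddings",
    drop_old=True,
)

embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-minilm-l6-v2"  # placeholder; the default is TBD
)
embedder.warm_up()

# Assumed artifact layout: one JSON file per processed document, with a "text" field.
docs = [
    Document(content=json.loads(path.read_text())["text"])
    for path in Path("/path/to/docs/folder").glob("*.json")
]
store.write_documents(embedder.run(docs)["documents"])
```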

#### Why We Need It
To populate embedding vector stores with pre-processed information that can be used at chat inference time.

#### Supported Databases
The command may support various vector database types. A default configuration will align with the selected
InstructLab technology stack.

#### Usage
The stored embeddings can later be retrieved from the vector database, together with their source text, to enrich the
context of RAG-based chat pipelines.


### 2.5 Embedding Ingestion Pipeline Options

| Option Description | Default Value | CLI Flag | Environment Variable |
|--------------------|---------------|----------|----------------------|
| Vector DB implementation, one of: `milvuslite`, **TBD** | `milvuslite` | `--vectordb-type` | `ILAB_VECTORDB_TYPE` |
| Vector DB service URI. | `./rag-output.db` | `--vectordb-uri` | `ILAB_VECTORDB_URI` |
| Vector DB collection name. | `IlabEmbeddings` | `--vectordb-collection-name` | `ILAB_VECTORDB_COLLECTION_NAME` |
| Vector DB connection username. | | `--vectordb-username` | `ILAB_VECTORDB_USERNAME` |
| Vector DB connection password. | | `--vectordb-password` | `ILAB_VECTORDB_PASSWORD` |
| Base directories where models are stored. | `$HOME/.cache/instructlab/models` | `--model-dir` | `ILAB_MODEL_DIR` |
| Name of the embedding model. | **TBD** | `--embedding-model` | `ILAB_EMBEDDING_MODEL_NAME` |
| Token to download private models. | | `--embedding-model-token` | `ILAB_EMBEDDING_MODEL_TOKEN` |

### 2.6 RAG Chat Pipeline Command
The proposal is to add a `--rag` flag to the `model chat` command, like:
```
ilab model chat --rag
```

When enabled, the chat pipeline augments the conversation with content retrieved from the configured vector store,
enriching the conversational experience with relevant insights. Among other steps, the pipeline will:
* Append the retrieved context to the original LLM request.
* Send the context-augmented request to the LLM and return the response to the user (see the sketch below).
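
A minimal sketch of the retrieval steps, again assuming the Haystack 2.x and `milvus-haystack` APIs; the query text is
only an example:
```py
from haystack.components.embedders import SentenceTransformersTextEmbedder
from milvus_haystack import MilvusDocumentStore, MilvusEmbeddingRetriever

store = MilvusDocumentStore(
    connection_args={"uri": "./rag-output.db"},
    collection_name="IlabEmbeddings",
)
retriever = MilvusEmbeddingRetriever(document_store=store, top_k=10)

query_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-minilm-l6-v2"  # must match the ingestion model
)
query_embedder.warm_up()

user_query = "What does the product documentation say about upgrades?"
embedding = query_embedder.run(user_query)["embedding"]
context_docs = retriever.run(query_embedding=embedding)["documents"]
```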

#### Prompt Template
A default non-configurable template is used with parameters to specify the user query and the context, like:
```text
Given the following information, answer the question.
Context:
{context}
Question:
{user_query}
Answer:
```
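
Filling the template then reduces to plain string substitution; a sketch, reusing `user_query` and `context_docs` from
the retrieval sketch above:
```py
PROMPT_TEMPLATE = """Given the following information, answer the question.
Context:
{context}
Question:
{user_query}
Answer:"""

# Join the retrieved chunks into the {context} slot.
context = "\n".join(doc.content for doc in context_docs)
prompt = PROMPT_TEMPLATE.format(context=context, user_query=user_query)
```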

Future extensions should align prompt management with the existing InstructLab design.

### 2.7 RAG Chat Commands
The `/r` command may be added to the `ilab model chat` command to dynamically toggle the execution of the RAG pipeline.

The current status could be displayed with an additional marker on the chat status bar (e.g., in the top right corner).

### 2.8 RAG Chat Options
As stated in [2.1 Working Assumption](#21-working-assumption), we will introduce new configuration options for the specific `chat` command,
but use flags and environment variables for the options that come from the embedding ingestion pipeline.

| Configuration FQN | Description | Default Value | CLI Flag | Environment Variable |
|-------------------|-------------|---------------|----------|----------------------|
| chat.rag.enabled | Enable or disable the RAG pipeline. | `false` | `--rag` (boolean)| `ILAB_CHAT_RAG_ENABLED` |
| chat.rag.retriever.top_k | The maximum number of documents to retrieve. | `10` | `--retriever-top-k` | `ILAB_CHAT_RAG_RETRIEVER_TOP_K` |
| | Vector DB implementation, one of: `milvuslite`, **TBD** | `milvuslite` | `--vectordb-type` | `ILAB_VECTORDB_TYPE` |
| | Vector DB service URI. | `./rag-output.db` | `--vectordb-uri` | `ILAB_VECTORDB_URI` |
| | Vector DB collection name. | `IlabEmbeddings` | `--vectordb-collection-name` | `ILAB_VECTORDB_COLLECTION_NAME` |
| | Vector DB connection username. | | `--vectordb-username` | `ILAB_VECTORDB_USERNAME` |
| | Vector DB connection password. | | `--vectordb-password` | `ILAB_VECTORDB_PASSWORD` |
| | Base directories where models are stored. | `$HOME/.cache/instructlab/models` | `--model-dir` | `ILAB_MODEL_DIR` |
| | Name of the embedding model. | **TBD** | `--embedding-model` | `ILAB_EMBEDDING_MODEL_NAME` |
| | Token to download private models. | | `--embedding-model-token` | `ILAB_EMBEDDING_MODEL_TOKEN` |

Equivalent YAML document for the newly proposed options:
```yaml
chat:
  rag:
    enabled: false
    retriever:
      top_k: 10
```
### 2.9 References
* [Haystack-DocumentSplitter](https://github.com/deepset-ai/haystack/blob/f0c3692cf2a86c69de8738d53af925500e8a5126/haystack/components/preprocessors/document_splitter.py#L55) is temporarily adopted with default settings until a splitter based on the [docling chunkers][chunkers] is integrated
in Haystack.
* [MilvusEmbeddingRetriever](https://github.com/milvus-io/milvus-haystack/blob/77b27de00c2f0278e28b434f4883853a959f5466/src/milvus_haystack/milvus_embedding_retriever.py#L18)
### 2.10 Workflow Visualization
<!-- https://excalidraw.com/#json=PN2h_LM-Wd2WZYBJfZMDs,WQCq5NDbRXUH2qr8maFFNg -->
Embedding ingestion pipeline:
![ingestion-mvp](../images/ingestion-mvp.png)
RAG-based Chat pipeline:
![rag-chat](../images/rag-chat.png)
### 2.11 Proposed Implementation Stack
> **ℹ️ Note:** This stack is still under review. The proposed list represents potential candidates based on the current state of discussions.
The following technologies form the foundation of the proposed solution:
* [MilvusLite](https://milvus.io/docs/milvus_lite.md): The default vector database for efficient storage and retrieval of embeddings.
* [Docling](https://github.com/DS4SD/docling): Document processing tool. For more details, refer to William’s blog, [Docling: The missing document processing companion for generative AI](https://www.redhat.com/en/blog/docling-missing-document-processing-companion-generative-ai).
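
Assuming this stack, the indexing side could be wired from standard Haystack components; a sketch using the
`DocumentSplitter` defaults mentioned in the references:
```py
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from milvus_haystack import MilvusDocumentStore

store = MilvusDocumentStore(connection_args={"uri": "./rag-output.db"}, drop_old=True)

indexing = Pipeline()
# Default splitter settings, as noted in the References section.
indexing.add_component("splitter", DocumentSplitter())
indexing.add_component(
    "embedder",
    SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-minilm-l6-v2"),
)
indexing.add_component("writer", DocumentWriter(document_store=store))
indexing.connect("splitter", "embedder")
indexing.connect("embedder", "writer")

indexing.run({"splitter": {"documents": [Document(content="...processed text...")]}})
```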
## 3. Design Considerations
This solution must adopt a pluggable design to facilitate the easy integration of additional components:
* Vector stores: Support all selected implementations.
* Embedding models: Handle embedding models using the appropriate embedder implementation for the chosen framework.
* Indexing services: Future extensions should allow retrieval of embeddings from configurable APIs, thereby extending the concept of a vector store
to include these retrieval services.
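
A minimal sketch of what such a pluggable boundary could look like; the `Protocol` name and method signatures are
purely illustrative:
```py
from typing import Any, Protocol

class VectorStoreAdapter(Protocol):
    """Illustrative seam behind which milvuslite (or any other store) can sit."""

    def write(self, embeddings: list[list[float]], payloads: list[dict[str, Any]]) -> None: ...

    def query(self, embedding: list[float], top_k: int) -> list[dict[str, Any]]: ...
```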
## 4. Future Enhancements
### 4.1 Model Evaluation
**TODO** A separate ADR will be defined.
### 4.2 Advanced RAG retrieval steps
- [Ranking retriever's result][ranking]:
```bash
ilab model chat --rag --ranking --ranking-top-k=5 --ranking-model=cross-encoder/ms-marco-MiniLM-L-12-v2
```
- [Query expansion][expansion]:
```bash
ilab model chat --rag --query-expansion --query-expansion-prompt="$QUERY_EXPANSION_PROMPT" --query-expansion-num-of-queries=5
```
- Using a retrieval strategy:
```bash
ilab model chat --rag --retrieval-strategy query-expansion --retrieval-strategy-options="prompt=$QUERY_EXPANSION_PROMPT;num_of_queries=5"
```
- ...
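
As an example of how the ranking step might be implemented internally, assuming Haystack 2.x's
`TransformersSimilarityRanker` (the Haystack 1.x ranker API linked above differs); `user_query` and `context_docs` are
reused from the retrieval sketch in section 2.6:
```py
from haystack.components.rankers import TransformersSimilarityRanker

ranker = TransformersSimilarityRanker(
    model="cross-encoder/ms-marco-MiniLM-L-12-v2", top_k=5
)
ranker.warm_up()
# Re-order the retrieved documents by cross-encoder relevance to the query.
ranked_docs = ranker.run(query=user_query, documents=context_docs)["documents"]
```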

### 4.3 Containerized Indexing Service
Generate a containerized RAG artifact to expose a `/query` endpoint that can serve as an alternative retrieval source:
```bash
ilab data ingest --build-image --image-name=docker.io/user/my_rag_artifacts:1.0
```

Then serve it and use it in a chat session:
```bash
ilab serve --rag-embeddings --image-name=docker.io/user/my_rag_artifacts:1.0 --port 8123
ilab model chat --rag --retriever-type api --retriever-uri http://localhost:8123
```
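
The `/query` contract is not defined yet; purely as a hypothetical illustration, a chat client could call it like this:
```py
import requests

# Hypothetical request/response schema; the real /query contract is TBD.
response = requests.post(
    "http://localhost:8123/query",
    json={"query": "What does the documentation say about upgrades?", "top_k": 10},
    timeout=30,
)
response.raise_for_status()
context_chunks = [hit["text"] for hit in response.json()["results"]]
```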

[ilab-knowledge]: https://github.com/instructlab/taxonomy?tab=readme-ov-file#getting-started-with-knowledge-contributions
[chat_template]: https://github.com/instructlab/instructlab/blob/0a773f05f8f57285930df101575241c649f591ce/src/instructlab/configuration.py#L244
[augment_chat_template]: https://github.com/instructlab/instructlab/blob/48e3f7f1574ae50036d6e342b8d78d8eb9546bd5/src/instructlab/model/backends/llama_cpp.py#L281
[ranking]: https://docs.haystack.deepset.ai/v1.21/reference/ranker-api
[expansion]: https://haystack.deepset.ai/blog/query-expansion
[chunkers]: https://github.com/DS4SD/docling/blob/main/docs/concepts/chunking.md
