Enhanced Metadata Handling for Documents and Data Readers #273
This Pull Request introduces extensible metadata handling for `Document` objects and updates the `DataReader` interface and its implementation (`FileDataReader`) to support metadata extraction and management. The changes improve interoperability with Retrieval-Augmented Generation (RAG) workflows and provide a modular approach to embedding metadata into documents.
Key Changes
Document Class Enhancements:
- `metadata` property to store key-value pairs of extensible metadata.
- `addMetadata` and `toArray` methods to manage and serialize metadata.
DataReader Interface:
- `extractMetadata` method to standardize metadata extraction from document content.
FileDataReader:
- `extractMetadata` method to parse and populate metadata fields from the content.
- `Document` objects.
DocumentUtils Enhancements:
Tests:
Benefits of This Contribution
Enhanced Metadata Support:
- `Document` objects, providing rich context for document retrieval and organization.
Improved RAG Workflows:
Extensibility:
- The `extractMetadata` method allows custom data readers to parse metadata from diverse file formats.
Backward Compatibility:
Ease of Integration:
How to Use the Enhanced Features
1. Add Metadata in a Custom Data Reader
Implement the `extractMetadata` method in a custom data reader to define how metadata is parsed:
2. Retrieve Metadata in a RAG Workflow
Use the `toArray` method to serialize documents with metadata:
3. Store Metadata in Vector Stores
Combine content and metadata for embedding in vector stores:
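A minimal Python sketch of the three steps above, offered as a language-agnostic illustration and not as the PR's actual code. The `Document` stand-in, `HeaderDataReader`, `to_embedding_text`, and the `key: value` header format are all assumptions made for this example; `add_metadata`, `extract_metadata`, and `to_array` only mirror the PR's `addMetadata`, `extractMetadata`, and `toArray` in Python naming, and returning the stripped body alongside the metadata is a choice made here, not something the PR states.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Hypothetical stand-in for the PR's Document class."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)

    def add_metadata(self, key: str, value: Any) -> None:
        # Mirrors addMetadata: store one key-value pair.
        self.metadata[key] = value

    def to_array(self) -> dict[str, Any]:
        # Mirrors toArray: serialize the content together with its metadata.
        return {"content": self.content, "metadata": dict(self.metadata)}


class HeaderDataReader:
    """Hypothetical custom reader. Assumes 'key: value' header lines,
    then a blank line, then the body. This format is an assumption."""

    def extract_metadata(self, raw: str) -> tuple[dict[str, str], str]:
        header, separator, body = raw.partition("\n\n")
        if not separator:
            return {}, raw  # no header block: everything is content
        metadata = {}
        for line in header.splitlines():
            key, colon, value = line.partition(":")
            if not colon:
                return {}, raw  # first block is not a header after all
            metadata[key.strip()] = value.strip()
        return metadata, body

    def read(self, raw: str) -> Document:
        # Step 1: parse metadata and attach it to the document.
        metadata, body = self.extract_metadata(raw)
        doc = Document(content=body)
        for key, value in metadata.items():
            doc.add_metadata(key, value)
        return doc


def to_embedding_text(doc: Document) -> str:
    # Step 3: prepend the metadata to the content so both get embedded.
    record = doc.to_array()  # Step 2: serialized form with metadata
    meta = "\n".join(f"{k}: {v}" for k, v in sorted(record["metadata"].items()))
    return f"{meta}\n\n{record['content']}" if meta else record["content"]


raw = "title: Release notes\ncategory: docs\n\nVersion 2 adds metadata support."
doc = HeaderDataReader().read(raw)
print(doc.to_array())
print(to_embedding_text(doc))
```

Whether metadata should be embedded with the content or stored only as a filterable payload depends on the vector store in use; the sketch shows the first option because that is what step 3 describes.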
Use Cases with RAG Workflows
1. Metadata-Driven Retrieval
Scenario: Search documents by `category` or `tags` before applying semantic search on embeddings.
Query Example:
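One way such a query could work, sketched in Python under stated assumptions: each document is a plain dict shaped like a `toArray` result plus a precomputed `embedding`, and `search_with_metadata` and `cosine_similarity` are hypothetical helpers that are not part of this PR. When both `category` and `tags` are given, this sketch requires both to match.

```python
from math import sqrt


def cosine_similarity(a, b):
    # Plain cosine similarity; returns 0.0 when either vector has zero length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_with_metadata(documents, query_vector, category=None, tags=None, top_k=3):
    # Stage 1: keep only documents whose metadata matches the requested filters.
    wanted_tags = set(tags or [])
    candidates = []
    for doc in documents:
        meta = doc["metadata"]
        if category is not None and meta.get("category") != category:
            continue
        if wanted_tags and not wanted_tags & set(meta.get("tags", [])):
            continue
        candidates.append(doc)
    # Stage 2: rank the remaining documents by similarity to the query.
    candidates.sort(
        key=lambda d: cosine_similarity(d["embedding"], query_vector),
        reverse=True,
    )
    return candidates[:top_k]


# Toy data with 2-dimensional embeddings, for illustration only.
documents = [
    {"content": "Refund policy", "metadata": {"category": "billing", "tags": ["refund"]}, "embedding": [1.0, 0.0]},
    {"content": "Invoice FAQ", "metadata": {"category": "billing", "tags": ["invoice"]}, "embedding": [0.6, 0.8]},
    {"content": "API guide", "metadata": {"category": "engineering", "tags": ["api"]}, "embedding": [1.0, 0.0]},
]

hits = search_with_metadata(documents, [0.9, 0.1], category="billing")
print([d["content"] for d in hits])
```

Filtering first keeps the "API guide" entry out of the results even though its embedding is as close to the query as the best billing document.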
2. Context-Aware Augmented Responses
Scenario: Include metadata (e.g., `sourceType`, `title`) in AI responses to provide additional context.
Example:
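A rough Python sketch of how retrieved chunks might carry `sourceType` and `title` into the prompt so the model can cite its sources. `format_context` and `build_prompt` are hypothetical helpers, and the prompt wording and the fallback labels for missing metadata are assumptions, not part of this PR.

```python
def format_context(chunks):
    # Render each retrieved chunk with a numbered header built from its metadata.
    blocks = []
    for number, chunk in enumerate(chunks, start=1):
        meta = chunk.get("metadata", {})
        title = meta.get("title", "untitled")          # assumed fallback label
        source_type = meta.get("sourceType", "unknown")  # assumed fallback label
        blocks.append(f"[{number}] {title} ({source_type})\n{chunk['content']}")
    return "\n\n".join(blocks)


def build_prompt(question, chunks):
    # Put the metadata-annotated sources ahead of the question.
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{format_context(chunks)}\n\n"
        f"Question: {question}"
    )


chunks = [
    {"content": "Refunds take 5 days.", "metadata": {"title": "Refund policy", "sourceType": "pdf"}},
    {"content": "No metadata here."},
]
print(build_prompt("How long do refunds take?", chunks))
```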
3. Chunk-Based Metadata
Add `chunkNumber` to metadata for better traceability and reconstruction.
Potential Enhancements
User-Defined Metadata Parsers:
Utility Methods for Metadata Queries:
- `findDocumentsByMetadata` to simplify metadata-based retrieval.
Examples for Specific Vector Stores:
Checklist
- `Document`.
- `DataReader` and `FileDataReader` for metadata handling.
- `DocumentUtils` for metadata compatibility.
Let me know if you need any further adjustments or additional information! 🚀