Enhanced Metadata Handling for Documents and Data Readers #273

raihan-js · 2024-11-25T11:56:57Z

This Pull Request introduces extensible metadata handling for Document objects and updates the DataReader interface and its implementation (FileDataReader) to support metadata extraction and management. The changes improve interoperability with Retrieval-Augmented Generation (RAG) workflows and provide a modular approach to embedding metadata into documents.

Key Changes

Document Class Enhancements:
- Added a metadata property to store key-value pairs of extensible metadata.
- Introduced addMetadata and toArray methods to manage and serialize metadata.
DataReader Interface:
- Added an extractMetadata method to standardize metadata extraction from document content.
FileDataReader:
- Implemented the extractMetadata method to parse and populate metadata fields from the content.
- Automatically populates metadata during the creation of Document objects.
DocumentUtils Enhancements:
- Updated utility functions to support metadata when creating documents from arrays.
- Ensured compatibility with new and existing functionality.
Tests:
- Added tests to validate metadata extraction, assignment, and serialization.
- Ensured all existing tests remain functional to maintain backward compatibility.
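
The diff itself is not reproduced in this description, so the following is only a minimal sketch of how the Document changes listed above might look. `SketchDocument` is a stand-in class, and the exact array shape returned by `toArray` is an assumption.

```php
<?php
declare(strict_types=1);

// Stand-in for LLPhant\Embeddings\Document; only the fields needed here.
class SketchDocument
{
    public string $content = '';
    public string $sourceName = '';

    /** @var array<string, mixed> extensible key-value metadata */
    public array $metadata = [];

    public function addMetadata(string $key, mixed $value): void
    {
        $this->metadata[$key] = $value;
    }

    /** @return array<string, mixed> serialized form (assumed shape) */
    public function toArray(): array
    {
        return [
            'content' => $this->content,
            'sourceName' => $this->sourceName,
            'metadata' => $this->metadata,
        ];
    }
}

$document = new SketchDocument();
$document->content = 'Sample document content';
$document->addMetadata('title', 'User Guide');
$document->addMetadata('tags', ['guide', 'product-x']);

// Serialize the document, metadata included, as JSON.
echo json_encode($document->toArray(), JSON_PRETTY_PRINT), PHP_EOL;
```

Keeping metadata as a plain associative array means new fields need no change to the class itself.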

Benefits of This Contribution

Enhanced Metadata Support:
- Metadata (e.g., titles, categories, tags) can now be embedded into Document objects, providing rich context for document retrieval and organization.
Improved RAG Workflows:
- Metadata supports Retrieval-Augmented Generation (RAG) workflows by enabling:
  - Better search and filtering: Use metadata to narrow searches in vector stores.
  - Increased accuracy: Improve relevance by combining metadata filters with embedding similarity.
  - Custom queries: Leverage metadata fields for fine-grained information retrieval.
Extensibility:
- Developers can easily add new metadata fields without modifying the core library.
- The extractMetadata method allows custom data readers to parse metadata from diverse file formats.
Backward Compatibility:
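- Existing behavior is unchanged for code that only creates or reads documents. One caveat: because extractMetadata is added to the DataReader interface, custom classes that implement DataReader must add this method.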
Ease of Integration:
- Standardized metadata handling ensures smooth integration with vector stores, such as Qdrant or Pinecone.

How to Use the Enhanced Features

1. Add Metadata in a Custom Data Reader

Implement the extractMetadata method in a custom data reader to define how metadata is parsed:

use LLPhant\Embeddings\DataReader\DataReader;
use LLPhant\Embeddings\Document;

class MyCustomDataReader implements DataReader
{
    public function getDocuments(): array
    {
        $content = "Sample document content";
        $document = new Document();
        $document->content = $content;

        // Extract and add metadata
        $metadata = $this->extractMetadata($content);
        foreach ($metadata as $key => $value) {
            $document->addMetadata($key, $value);
        }

        return [$document];
    }

    public function extractMetadata(string $content): array
    {
        // Custom metadata extraction logic
        return [
            'title' => 'Extracted Title',
            'category' => 'Extracted Category',
            // Add more metadata fields as needed
        ];
    }
}
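
The extractMetadata body above returns fixed values. As one possible concrete strategy, the sketch below parses leading `Key: value` lines until the first line that does not match. The header format and the function name `extractHeaderMetadata` are assumptions made for illustration; this is not a description of what FileDataReader does.

```php
<?php
declare(strict_types=1);

/**
 * Parses leading "Key: value" lines into a metadata array.
 * Parsing stops at the first blank or non-matching line.
 * Hypothetical helper, not part of the library.
 *
 * @return array<string, string>
 */
function extractHeaderMetadata(string $content): array
{
    $metadata = [];
    $lines = preg_split('/\R/', $content);
    if ($lines === false) {
        return $metadata;
    }

    foreach ($lines as $line) {
        // Stop at the first line that is not a "Key: value" header.
        if (preg_match('/^([A-Za-z][\w-]*):\s*(.+)$/', $line, $matches) !== 1) {
            break;
        }
        $metadata[strtolower($matches[1])] = trim($matches[2]);
    }

    return $metadata;
}

$content = "Title: User Guide for Product X\nCategory: User Guide\n\nBody text starts here.";
print_r(extractHeaderMetadata($content));
```

A reader could call such a helper from its own extractMetadata method and pass each pair to addMetadata, as in the class above.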

2. Retrieve Metadata in a RAG Workflow

Read the metadata property of each document returned by the reader (toArray is available when the whole document, including metadata, needs to be serialized):

$documents = $dataReader->getDocuments();
foreach ($documents as $document) {
    $metadata = $document->metadata;
    print_r($metadata); // Contains: ['title' => 'Extracted Title', 'category' => 'Extracted Category']
}

3. Store Metadata in Vector Stores

Pass the embedding and the metadata together when writing to a vector store. The upsert call below is illustrative; adapt it to the API of the store you use:

foreach ($documents as $document) {
    $vectorStore->upsert([
        'id' => DocumentUtils::getUniqueId($document),
        'embedding' => $document->embedding,
        'metadata' => $document->metadata,
    ]);
}

Use Cases with RAG Workflows

1. Metadata-Driven Retrieval

Scenario: Filter documents by category or tags, then apply semantic search on the embeddings of the remaining documents.

Query Example (generic shape; the exact syntax depends on the vector store):

{
  "filter": {
    "category": "User Guide"
  },
  "vector": [0.12, 0.34, 0.56],
  "top_k": 5
}
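
To make the query above concrete, here is a minimal in-memory sketch of the same idea: keep only the records whose metadata matches the filter, rank the survivors by cosine similarity, and return the top_k ids. The record layout and the function names are assumptions for illustration; a real vector store performs this server-side with its own query syntax.

```php
<?php
declare(strict_types=1);

/** Cosine similarity of two equal-length vectors. */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }
    if ($normA === 0.0 || $normB === 0.0) {
        return 0.0;
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * @param array $records each: ['id' => string, 'embedding' => float[], 'metadata' => array]
 * @param array<string, mixed> $filter exact-match metadata conditions
 * @return string[] ids of the best matches, most similar first
 */
function filteredSearch(array $records, array $filter, array $vector, int $topK): array
{
    // Step 1: metadata filter (every filter key must match exactly).
    $candidates = array_filter($records, function (array $record) use ($filter): bool {
        foreach ($filter as $key => $expected) {
            if (($record['metadata'][$key] ?? null) !== $expected) {
                return false;
            }
        }
        return true;
    });

    // Step 2: rank the remaining records by similarity to the query vector.
    usort(
        $candidates,
        fn (array $x, array $y): int =>
            cosineSimilarity($y['embedding'], $vector) <=> cosineSimilarity($x['embedding'], $vector)
    );

    return array_column(array_slice($candidates, 0, $topK), 'id');
}

$records = [
    ['id' => 'a', 'embedding' => [0.12, 0.34, 0.56], 'metadata' => ['category' => 'User Guide']],
    ['id' => 'b', 'embedding' => [0.90, 0.10, 0.00], 'metadata' => ['category' => 'User Guide']],
    ['id' => 'c', 'embedding' => [0.12, 0.34, 0.56], 'metadata' => ['category' => 'Changelog']],
];

print_r(filteredSearch($records, ['category' => 'User Guide'], [0.12, 0.34, 0.56], 5));
```

Record c has the same embedding as the query but is excluded by the category filter, which is the point of filtering before ranking.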

2. Context-Aware Augmented Responses

Scenario: Include metadata (e.g., sourceType, title) in AI responses to provide additional context.
Example:

"The information comes from the document titled 'User Guide for Product X'."

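One way to obtain such attributions is to prefix each retrieved chunk with its metadata before it is placed in the prompt. This is a minimal sketch; the array layout and the `buildContext` helper are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Formats retrieved chunks so the model can cite their source.
 *
 * @param array $chunks each: ['content' => string, 'metadata' => array<string, string>]
 */
function buildContext(array $chunks): string
{
    $blocks = [];
    foreach ($chunks as $chunk) {
        // Fall back to placeholders when a metadata field is missing.
        $title = $chunk['metadata']['title'] ?? 'Untitled';
        $sourceType = $chunk['metadata']['sourceType'] ?? 'unknown';
        $blocks[] = sprintf("[Source: %s (%s)]\n%s", $title, $sourceType, $chunk['content']);
    }

    return implode("\n\n", $blocks);
}

$context = buildContext([
    [
        'content' => 'Hold the power button for three seconds to reset the device.',
        'metadata' => ['title' => 'User Guide for Product X', 'sourceType' => 'files'],
    ],
]);

echo $context, PHP_EOL;
```
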
3. Chunk-Based Metadata

Scenario: Manage individual chunks of large documents using metadata.
Implementation:
- Add a chunkNumber to metadata for better traceability and reconstruction.
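
A minimal sketch of that idea: split a text into fixed-size chunks, tag each one with a chunkNumber, and rebuild the original by sorting on it. Fixed-size splitting, the array layout, and both function names are simplifying assumptions; LLPhant's own document splitting is not reproduced here.

```php
<?php
declare(strict_types=1);

/**
 * Splits text into chunks of at most $chunkSize bytes, numbering each chunk.
 * str_split is byte-based; multi-byte text would need mb_str_split instead.
 */
function splitWithChunkNumbers(string $text, int $chunkSize): array
{
    $chunks = [];
    foreach (str_split($text, $chunkSize) as $index => $piece) {
        $chunks[] = ['content' => $piece, 'metadata' => ['chunkNumber' => $index]];
    }

    return $chunks;
}

/** Rebuilds the text even if the chunks arrive out of order. */
function reassemble(array $chunks): string
{
    usort(
        $chunks,
        fn (array $a, array $b): int =>
            $a['metadata']['chunkNumber'] <=> $b['metadata']['chunkNumber']
    );

    return implode('', array_column($chunks, 'content'));
}

$text = 'Metadata makes chunks traceable.';
$chunks = splitWithChunkNumbers($text, 10);
shuffle($chunks); // simulate retrieval returning chunks in arbitrary order

echo reassemble($chunks), PHP_EOL;
```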

Potential Enhancements

User-Defined Metadata Parsers:
- Allow users to plug in custom metadata parsing logic for different file types.
Utility Methods for Metadata Queries:
- Provide methods like findDocumentsByMetadata to simplify metadata-based retrieval.
Examples for Specific Vector Stores:
- Add documentation or examples showing integration with vector stores like Qdrant and Pinecone.
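
As an illustration of the second item, a helper like findDocumentsByMetadata could be a plain filter over document objects. This is a hypothetical sketch and not part of this PR; `MetaDoc` stands in for the real Document class.

```php
<?php
declare(strict_types=1);

// Stand-in for a Document carrying a metadata array.
class MetaDoc
{
    /** @param array<string, mixed> $metadata */
    public function __construct(public string $content, public array $metadata = [])
    {
    }
}

/**
 * Returns the documents whose metadata contains every given key-value pair.
 *
 * @param MetaDoc[] $documents
 * @param array<string, mixed> $criteria
 * @return MetaDoc[]
 */
function findDocumentsByMetadata(array $documents, array $criteria): array
{
    return array_values(array_filter(
        $documents,
        function (MetaDoc $document) use ($criteria): bool {
            foreach ($criteria as $key => $expected) {
                if (!array_key_exists($key, $document->metadata)
                    || $document->metadata[$key] !== $expected) {
                    return false;
                }
            }
            return true;
        }
    ));
}

$documents = [
    new MetaDoc('Install steps', ['category' => 'User Guide', 'lang' => 'en']),
    new MetaDoc('Release notes', ['category' => 'Changelog', 'lang' => 'en']),
];

$guides = findDocumentsByMetadata($documents, ['category' => 'User Guide']);
echo $guides[0]->content, PHP_EOL;
```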

Checklist

Added metadata support to Document.
Updated DataReader and FileDataReader for metadata handling.
Enhanced DocumentUtils for metadata compatibility.
Included comprehensive tests.
Validated backward compatibility.

Let me know if you need any further adjustments or additional information! 🚀

raihan-js added 6 commits November 25, 2024 17:53

- aebefa2: Enhanced metadata handling for documents, updated DataReader interface, and added tests
- caba93f: Fixed coding style issues flagged by Pint
- e5857f2: Fix tests and metadata extraction logic in FileDataReader
- d021b86: Fixed coding style issues and test failures
- 409454d: Fixed Rector and unit test errors by updating docblocks and method visibility
- 984619e: Fixed Rector and unit test errors by updating docblocks and method visibility
