Enhanced Metadata Handling for Documents and Data Readers #273

Open
wants to merge 6 commits into
base: main

raihan-js
Contributor

This Pull Request introduces extensible metadata handling for Document objects and updates the DataReader interface and its implementation (FileDataReader) to support metadata extraction and management. The changes improve interoperability with Retrieval-Augmented Generation (RAG) workflows and provide a modular approach to embedding metadata into documents.


Key Changes

  1. Document Class Enhancements:

    • Added a metadata property to store key-value pairs of extensible metadata.
    • Introduced addMetadata and toArray methods to manage and serialize metadata.
  2. DataReader Interface:

    • Added an extractMetadata method to standardize metadata extraction from document content.
  3. FileDataReader:

    • Implemented the extractMetadata method to parse and populate metadata fields from the content.
    • Automatically populates metadata during the creation of Document objects.
  4. DocumentUtils Enhancements:

    • Updated utility functions to support metadata when creating documents from arrays.
    • Ensured compatibility with new and existing functionality.
  5. Tests:

    • Added tests to validate metadata extraction, assignment, and serialization.
    • Ensured all existing tests remain functional to maintain backward compatibility.
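
As a rough illustration of the Document changes listed above, the sketch below shows one way a metadata property, addMetadata, and toArray could fit together. The class name SketchDocument and the shape of the array returned by toArray are assumptions made for this example; the actual implementation in this PR may differ.

```php
<?php

// Hypothetical stand-in for the PR's Document class; not LLPhant's actual code.
class SketchDocument
{
    public string $content = '';

    /** @var array<string, mixed> extensible key-value metadata */
    public array $metadata = [];

    // Adds a metadata entry, overwriting any existing value for the same key.
    public function addMetadata(string $key, mixed $value): void
    {
        $this->metadata[$key] = $value;
    }

    // Serializes the content together with its metadata (assumed shape).
    public function toArray(): array
    {
        return [
            'content' => $this->content,
            'metadata' => $this->metadata,
        ];
    }
}

$document = new SketchDocument();
$document->content = 'Sample document content';
$document->addMetadata('title', 'User Guide');
$document->addMetadata('tags', ['setup', 'install']);

$serialized = $document->toArray();
```

Keeping metadata as a plain associative array means new fields need no schema change, at the cost of no type checking on individual values.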

Benefits of This Contribution

  1. Enhanced Metadata Support:

    • Metadata (e.g., titles, categories, tags) can now be embedded into Document objects, providing rich context for document retrieval and organization.
  2. Improved RAG Workflows:

    • Metadata is critical for Retrieval-Augmented Generation (RAG) workflows, enabling:
      • Better search and filtering: Use metadata to refine searches in vector stores.
      • Increased accuracy: Improve relevance by combining metadata filters with embedding similarity.
      • Custom queries: Leverage metadata fields for fine-grained information retrieval.
  3. Extensibility:

    • Developers can easily add new metadata fields without modifying the core library.
    • The extractMetadata method allows custom data readers to parse metadata from diverse file formats.
  4. Backward Compatibility:

    • Existing Document and FileDataReader behavior remains unchanged. Note that custom classes implementing DataReader must add the new extractMetadata method to satisfy the interface.
  5. Ease of Integration:

    • Standardized metadata handling simplifies integration with vector stores that accept a metadata payload, such as Qdrant or Pinecone.

How to Use the Enhanced Features

1. Add Metadata in a Custom Data Reader

Implement the extractMetadata method in a custom data reader to define how metadata is parsed:

use LLPhant\Embeddings\DataReader\DataReader;
use LLPhant\Embeddings\Document;

class MyCustomDataReader implements DataReader
{
    public function getDocuments(): array
    {
        $content = "Sample document content";
        $document = new Document();
        $document->content = $content;

        // Extract and add metadata
        $metadata = $this->extractMetadata($content);
        foreach ($metadata as $key => $value) {
            $document->addMetadata($key, $value);
        }

        return [$document];
    }

    public function extractMetadata(string $content): array
    {
        // Custom metadata extraction logic
        return [
            'title' => 'Extracted Title',
            'category' => 'Extracted Category',
            // Add more metadata fields as needed
        ];
    }
}

2. Retrieve Metadata in a RAG Workflow

Read the metadata property directly, or call toArray to serialize a document together with its metadata:

$documents = $dataReader->getDocuments();
foreach ($documents as $document) {
    // Metadata as a plain key-value array, e.g. title and category
    print_r($document->metadata);

    // Serialized form of the document, including its metadata
    $serialized = $document->toArray();
}

3. Store Metadata in Vector Stores

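Store each document's embedding together with its metadata in the vector store:

use LLPhant\Embeddings\DocumentUtils;

// upsert() is illustrative; substitute your vector store client's write method
foreach ($documents as $document) {
    $vectorStore->upsert([
        'id' => DocumentUtils::getUniqueId($document),
        'embedding' => $document->embedding,
        'metadata' => $document->metadata,
    ]);
}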

Use Cases with RAG Workflows

1. Metadata-Driven Retrieval

  • Scenario: Search documents by category or tags before applying semantic search on embeddings.

  • Query Example:

    {
      "filter": {
        "category": "User Guide"
      },
      "vector": [0.12, 0.34, 0.56],
      "top_k": 5
    }
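
To make the filter-then-rank idea in this query concrete, here is a minimal in-memory sketch. The filteredSearch and cosineSimilarity helpers and the record layout are assumptions for illustration only; a real vector store would run the filter and the similarity search on the server side.

```php
<?php

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Hypothetical helper: keep records whose metadata matches every filter entry,
// then return the $topK records most similar to the query vector.
function filteredSearch(array $records, array $filter, array $vector, int $topK): array
{
    $candidates = array_filter(
        $records,
        function (array $record) use ($filter): bool {
            foreach ($filter as $key => $expected) {
                if (($record['metadata'][$key] ?? null) !== $expected) {
                    return false;
                }
            }
            return true;
        }
    );

    // Sort by descending similarity to the query vector.
    usort(
        $candidates,
        fn (array $x, array $y): int =>
            cosineSimilarity($y['embedding'], $vector) <=> cosineSimilarity($x['embedding'], $vector)
    );

    return array_slice($candidates, 0, $topK);
}

$records = [
    ['id' => 'a', 'embedding' => [0.12, 0.34, 0.56], 'metadata' => ['category' => 'User Guide']],
    ['id' => 'b', 'embedding' => [0.56, 0.34, 0.12], 'metadata' => ['category' => 'User Guide']],
    ['id' => 'c', 'embedding' => [0.12, 0.34, 0.56], 'metadata' => ['category' => 'Changelog']],
];

$results = filteredSearch($records, ['category' => 'User Guide'], [0.12, 0.34, 0.56], 5);
```

Record c has an embedding identical to the query but is excluded by the category filter, which is the point of filtering before ranking.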

2. Context-Aware Augmented Responses

  • Scenario: Include metadata (e.g., sourceType, title) in AI responses to provide additional context.

  • Example:

    "The information comes from the document titled 'User Guide for Product X'."
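
One way to surface this attribution is to prefix each retrieved chunk with its metadata before passing it to the model. The formatContext helper and the bracketed format are hypothetical; only the title and sourceType field names come from the scenario above.

```php
<?php

// Hypothetical helper: prefix a retrieved chunk with its source metadata so the
// model can attribute the answer. Missing fields fall back to "unknown".
function formatContext(string $content, array $metadata): string
{
    $title = $metadata['title'] ?? 'unknown';
    $sourceType = $metadata['sourceType'] ?? 'unknown';

    return sprintf("[Source: %s (%s)]\n%s", $title, $sourceType, $content);
}

$context = formatContext(
    'Press and hold the power button for three seconds.',
    ['title' => 'User Guide for Product X', 'sourceType' => 'files']
);
```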

3. Chunk-Based Metadata

  • Scenario: Manage individual chunks of large documents using metadata.
  • Implementation:
    • Add a chunkNumber to metadata for better traceability and reconstruction.
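
A small sketch of this idea, assuming fixed-size chunks and plain arrays in place of Document objects. The chunkWithMetadata and reconstruct helpers are hypothetical names, and str_split works on bytes, so multibyte text would need a different splitter.

```php
<?php

// Hypothetical helper: split text into fixed-size chunks and tag each one
// with a chunkNumber so the original order can be restored later.
function chunkWithMetadata(string $text, int $chunkSize, array $baseMetadata = []): array
{
    $chunks = [];
    foreach (str_split($text, $chunkSize) as $index => $piece) {
        $chunks[] = [
            'content' => $piece,
            'metadata' => $baseMetadata + ['chunkNumber' => $index],
        ];
    }
    return $chunks;
}

// Reassemble the text by sorting on chunkNumber, regardless of retrieval order.
function reconstruct(array $chunks): string
{
    usort(
        $chunks,
        fn (array $a, array $b): int => $a['metadata']['chunkNumber'] <=> $b['metadata']['chunkNumber']
    );
    return implode('', array_column($chunks, 'content'));
}

$text = 'Metadata keeps chunks traceable.';
$chunks = chunkWithMetadata($text, 10, ['title' => 'Notes']);

// Simulate chunks coming back from a store in a different order.
$restored = reconstruct(array_reverse($chunks));
```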

Potential Enhancements

  1. User-Defined Metadata Parsers:

    • Allow users to plug in custom metadata parsing logic for different file types.
  2. Utility Methods for Metadata Queries:

    • Provide methods like findDocumentsByMetadata to simplify metadata-based retrieval.
  3. Examples for Specific Vector Stores:

    • Add documentation or examples showing integration with vector stores like Qdrant and Pinecone.
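
As a possible starting point for the findDocumentsByMetadata idea, the sketch below filters an in-memory list by a single metadata field. The signature is an assumption, not part of this PR, and plain objects stand in for Document instances.

```php
<?php

// Hypothetical helper named after the findDocumentsByMetadata idea above.
// Accepts any objects exposing a public $metadata array.
function findDocumentsByMetadata(array $documents, string $key, mixed $value): array
{
    return array_values(array_filter(
        $documents,
        fn (object $document): bool =>
            array_key_exists($key, $document->metadata) && $document->metadata[$key] === $value
    ));
}

// Plain objects stand in for Document instances in this sketch.
$documents = [
    (object) ['content' => 'Install steps', 'metadata' => ['category' => 'User Guide']],
    (object) ['content' => 'Release notes', 'metadata' => ['category' => 'Changelog']],
    (object) ['content' => 'Troubleshooting', 'metadata' => ['category' => 'User Guide']],
];

$guides = findDocumentsByMetadata($documents, 'category', 'User Guide');
```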

Checklist

  • Added metadata support to Document.
  • Updated DataReader and FileDataReader for metadata handling.
  • Enhanced DocumentUtils for metadata compatibility.
  • Included comprehensive tests.
  • Validated backward compatibility.

Let me know if you need any further adjustments or additional information! 🚀
