
Add initial support for Late Chunking #97

Merged

merged 9 commits into development on Dec 21, 2024
Conversation

bhavnicksm
Collaborator

This pull request introduces several enhancements and new features to the Chonkie library, particularly focusing on the addition of the experimental LateChunker and related functionalities. The most important changes include updates to the README.md, the sentence_transformer.py and types.py files, and the addition of new test cases for the LateChunker.

Enhancements and New Features:

  • Documentation Update:

    • README.md: Added a description for the new LateChunker (experimental), which embeds the full text first and then splits it to improve chunk embeddings.
  • Embedding Enhancements:

    • sentence_transformer.py: Added embed_as_tokens and embed_as_tokens_batch methods for token-level embeddings, along with a max_seq_length property.
  • New Data Classes:

    • src/chonkie/types.py: Added LateSentence and LateChunk dataclasses to represent sentences and chunks with embeddings.
  • Testing:

    • tests/chunker/test_late_chunker.py: Added comprehensive test cases for the LateChunker, covering initialization, mode validation, chunking functionality, empty text, single-sentence text, sentence boundaries, and embedding dimensions.
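The late-chunking idea summarized above (embed the whole text first, then split) can be sketched as follows. This is a toy illustration: `late_chunk` and the explicit token spans are stand-ins for exposition, not Chonkie's actual API.

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray, spans: list) -> list:
    """Mean-pool full-document token embeddings over each chunk's token span.

    Late chunking embeds the whole text once, so every token vector is
    contextualized by the full document; each chunk embedding is then
    derived by pooling the token vectors inside that chunk's span.
    """
    return [token_embeddings[start:end].mean(axis=0) for start, end in spans]

# Toy example: 10 tokens with 4-dimensional embeddings, split into two chunks.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 4))
chunk_vecs = late_chunk(tokens, [(0, 6), (6, 10)])
```

Contrast this with conventional chunking, which splits first and embeds each chunk in isolation, losing cross-chunk context.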

…hods

- Updated class docstring for clarity.
- Added new parameters: `min_characters_per_sentence` and `chunk_size` to the constructor.
- Implemented `_create_token_chunks`, `_token_chunk`, and `_split_sentences` methods for better chunking functionality.
- Refactored chunking logic to support both token and sentence modes.
- Introduced embedding handling for chunk embeddings.

These changes improve the flexibility and performance of the LateChunker for various text processing tasks.

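As a rough illustration of the sentence splitting and the `min_characters_per_sentence` parameter mentioned above, here is a minimal sketch; the regex and the merge-short-fragments rule are assumptions for exposition, not the library's actual implementation:

```python
import re

def split_sentences(text: str, min_characters_per_sentence: int = 12) -> list:
    """Naive sentence splitter: break on sentence-ending punctuation and
    merge fragments shorter than the minimum into the previous sentence."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for part in parts:
        if sentences and len(part) < min_characters_per_sentence:
            # Too short to stand alone; fold into the previous sentence.
            sentences[-1] += " " + part
        else:
            sentences.append(part)
    return sentences
```

A minimum-length rule like this keeps abbreviations and tiny fragments from producing degenerate one-word chunks.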
- Added `embed_as_tokens` method to obtain token embeddings for individual texts, accommodating longer texts than the maximum sequence length.
- Introduced `embed_as_tokens_batch` method for batch processing of token embeddings.
- Updated the class to import numpy and added a property for `max_seq_length` to improve usability and flexibility in embedding operations.

These changes enhance the functionality of the SentenceTransformerEmbeddings class for token-level embedding tasks.

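One way to obtain token embeddings for texts longer than the model's `max_seq_length`, as described above, is to encode the tokens in windows and concatenate the results. This sketch assumes a non-overlapping window strategy and a hypothetical `embed_window` callable; it is not the class's actual implementation:

```python
import numpy as np

def embed_as_tokens_windowed(token_ids, embed_window, max_seq_length=512):
    """Embed a token sequence longer than max_seq_length by encoding it in
    non-overlapping windows and concatenating the per-token embeddings."""
    outputs = []
    for start in range(0, len(token_ids), max_seq_length):
        window = token_ids[start:start + max_seq_length]
        outputs.append(embed_window(window))  # shape: (len(window), dim)
    return np.concatenate(outputs, axis=0)

# Stand-in encoder: maps each token id to a fixed 8-dimensional vector.
def fake_encoder(window):
    return np.ones((len(window), 8))

embeddings = embed_as_tokens_windowed(list(range(1000)), fake_encoder,
                                      max_seq_length=512)
```

A real implementation might instead use overlapping windows so tokens near window boundaries keep some bidirectional context.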
- Introduced an `approximate` parameter to control token count estimation.
- Added `_prepare_sentences` method for improved sentence handling and token counting.
- Refactored `_sentence_chunk` method to utilize new sentence preparation logic.
- Implemented `_create_sentence_chunk` for better chunk creation from sentences.
- Updated error handling for embedding model compatibility.

These changes improve the accuracy and efficiency of the LateChunker for processing text into meaningful chunks.
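The `approximate` parameter for token count estimation could look roughly like this; the characters-per-token heuristic and the `count_tokens` signature are illustrative assumptions, not the library's actual code:

```python
def count_tokens(text: str, tokenizer_encode, approximate: bool = True,
                 chars_per_token: float = 4.0) -> int:
    """Estimate the token count cheaply, or fall back to exact tokenization."""
    if approximate:
        # Heuristic: English text averages roughly 4 characters per token,
        # which avoids running the tokenizer on every candidate sentence.
        return max(1, round(len(text) / chars_per_token))
    return len(tokenizer_encode(text))

# Whitespace splitting stands in for a real tokenizer here.
n_approx = count_tokens("Chonkie chunks text quickly.", str.split)
n_exact = count_tokens("Chonkie chunks text quickly.", str.split,
                       approximate=False)
```

The trade-off is speed versus accuracy: the estimate is fast enough to run per sentence, while exact counting is reserved for when chunk boundaries must be precise.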
@bhavnicksm bhavnicksm changed the base branch from main to development December 21, 2024 20:33
@bhavnicksm bhavnicksm merged commit 5761899 into development Dec 21, 2024
1 check passed
@bhavnicksm bhavnicksm deleted the late-chunker branch December 24, 2024 12:25