
Add initial support for Late Chunking #97

Merged

merged 9 commits into development on Dec 21, 2024
Conversation

bhavnicksm
Collaborator

This pull request introduces several enhancements and new features to the Chonkie library, particularly focusing on the addition of the experimental LateChunker and related functionalities. The most important changes include updates to the README.md, the sentence_transformer.py and types.py files, and the addition of new test cases for the LateChunker.

Enhancements and New Features:

  • Documentation Update:

    • README.md: Added a description for the new LateChunker (experimental), which embeds the full text first and then splits it to improve chunk embeddings.
  • Embedding Enhancements:

    • sentence_transformer.py: Added embed_as_tokens and embed_as_tokens_batch methods for token-level embeddings, along with a max_seq_length property.
  • New Data Classes:

    • src/chonkie/types.py: Added LateSentence and LateChunk dataclasses to represent sentences and chunks with embeddings.
  • Testing:

    • tests/chunker/test_late_chunker.py: Added comprehensive test cases for the LateChunker, covering initialization, mode validation, chunking functionality, empty text, single-sentence text, sentence boundaries, and embedding dimensions.
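The late-chunking idea summarized above (embed the whole text first, then split) can be sketched as follows. This is a toy illustration: `late_chunk` and the explicit token spans are stand-ins for exposition, not Chonkie's actual API.

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray, spans: list) -> list:
    """Mean-pool full-document token embeddings over each chunk's token span.

    Late chunking embeds the whole text once, so every token vector is
    contextualized by the full document; each chunk embedding is then
    derived by pooling the token vectors inside that chunk's span.
    """
    return [token_embeddings[start:end].mean(axis=0) for start, end in spans]

# Toy example: 10 tokens with 4-dimensional embeddings, split into two chunks.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 4))
chunk_vecs = late_chunk(tokens, [(0, 6), (6, 10)])
```

Contrast this with conventional chunking, which splits first and embeds each chunk in isolation, losing cross-chunk context.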

…hods

- Updated class docstring for clarity.
- Added new parameters: `min_characters_per_sentence` and `chunk_size` to the constructor.
- Implemented `_create_token_chunks`, `_token_chunk`, and `_split_sentences` methods for better chunking functionality.
- Refactored chunking logic to support both token and sentence modes.
- Introduced embedding handling for chunk embeddings.

These changes improve the flexibility and performance of the LateChunker for various text processing tasks.

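As a rough illustration of the sentence splitting and the `min_characters_per_sentence` parameter mentioned above, here is a minimal sketch; the regex and the merge-short-fragments rule are assumptions for exposition, not the library's actual implementation:

```python
import re

def split_sentences(text: str, min_characters_per_sentence: int = 12) -> list:
    """Naive sentence splitter: break on sentence-ending punctuation and
    merge fragments shorter than the minimum into the previous sentence."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for part in parts:
        if sentences and len(part) < min_characters_per_sentence:
            # Too short to stand alone; fold into the previous sentence.
            sentences[-1] += " " + part
        else:
            sentences.append(part)
    return sentences
```

A minimum-length rule like this keeps abbreviations and tiny fragments from producing degenerate one-word chunks.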
- Added `embed_as_tokens` method to obtain token embeddings for individual texts, accommodating longer texts than the maximum sequence length.
- Introduced `embed_as_tokens_batch` method for batch processing of token embeddings.
- Updated the class to import numpy and added a property for `max_seq_length` to improve usability and flexibility in embedding operations.

These changes enhance the functionality of the SentenceTransformerEmbeddings class for token-level embedding tasks.

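One way to obtain token embeddings for texts longer than the model's `max_seq_length`, as described above, is to encode the tokens in windows and concatenate the results. This sketch assumes a non-overlapping window strategy and a hypothetical `embed_window` callable; it is not the class's actual implementation:

```python
import numpy as np

def embed_as_tokens_windowed(token_ids, embed_window, max_seq_length=512):
    """Embed a token sequence longer than max_seq_length by encoding it in
    non-overlapping windows and concatenating the per-token embeddings."""
    outputs = []
    for start in range(0, len(token_ids), max_seq_length):
        window = token_ids[start:start + max_seq_length]
        outputs.append(embed_window(window))  # shape: (len(window), dim)
    return np.concatenate(outputs, axis=0)

# Stand-in encoder: maps each token id to a fixed 8-dimensional vector.
def fake_encoder(window):
    return np.ones((len(window), 8))

embeddings = embed_as_tokens_windowed(list(range(1000)), fake_encoder,
                                      max_seq_length=512)
```

A real implementation might instead use overlapping windows so tokens near window boundaries keep some bidirectional context.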
- Introduced an `approximate` parameter to control token count estimation.
- Added `_prepare_sentences` method for improved sentence handling and token counting.
- Refactored `_sentence_chunk` method to utilize new sentence preparation logic.
- Implemented `_create_sentence_chunk` for better chunk creation from sentences.
- Updated error handling for embedding model compatibility.

These changes improve the accuracy and efficiency of the LateChunker for processing text into meaningful chunks.
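The `approximate` parameter for token count estimation could look roughly like this; the characters-per-token heuristic and the `count_tokens` signature are illustrative assumptions, not the library's actual code:

```python
def count_tokens(text: str, tokenizer_encode, approximate: bool = True,
                 chars_per_token: float = 4.0) -> int:
    """Estimate the token count cheaply, or fall back to exact tokenization."""
    if approximate:
        # Heuristic: English text averages roughly 4 characters per token,
        # which avoids running the tokenizer on every candidate sentence.
        return max(1, round(len(text) / chars_per_token))
    return len(tokenizer_encode(text))

# Whitespace splitting stands in for a real tokenizer here.
n_approx = count_tokens("Chonkie chunks text quickly.", str.split)
n_exact = count_tokens("Chonkie chunks text quickly.", str.split,
                       approximate=False)
```

The trade-off is speed versus accuracy: the estimate is fast enough to run per sentence, while exact counting is reserved for when chunk boundaries must be precise.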
@bhavnicksm bhavnicksm changed the base branch from main to development December 21, 2024 20:33
@bhavnicksm bhavnicksm merged commit 5761899 into development Dec 21, 2024
1 check passed
@bhavnicksm bhavnicksm deleted the late-chunker branch December 24, 2024 12:25