Merge pull request #3 from bhavnicksm/development
v0.0.1a8
bhavnicksm authored Nov 3, 2024
2 parents 56e33be + d59eaf4 commit 8a8f2a5
Showing 10 changed files with 83 additions and 9 deletions.
45 changes: 45 additions & 0 deletions DOCS.md
@@ -6,11 +6,56 @@

- [πŸ¦› Chonkie Docs](#-chonkie-docs)
- [Table Of Contents](#table-of-contents)
- [Design CHONKosophy](#design-chonkosophy)
- [Dependency Table](#dependency-table)
  - [Why is chunking needed? (And may always be needed!)](#why-is-chunking-needed-and-may-always-be-needed)
- [Chunkers](#chunkers)
- [TokenChunker](#tokenchunker)
- [Initialization](#initialization)
- [Methods](#methods)
- [Example](#example)


# Design CHONKosophy

> Did you know that hippos are surprisingly smart?

A lot of thought went into this repository, and I want to take some space here to walk through the decisions behind it: the what, the why, and the how.

1. **Chonkie is very smart**
2. **Chonkie is surprisingly lightweight**
3. **Chonkie is superrrrr fast**



## Dependency Table

As mentioned in the [Design](#design-chonkosophy) section, Chonkie stays lightweight by keeping the dependencies for each chunker separate, making it more of an aggregate of multiple repositories and Python packages. Python's optional-dependencies feature makes this split possible.

| Chunker | Default | 'sentence' | 'semantic' | 'all' |
|----------|----------|----------|----------|----------|
| TokenChunker |βœ…|βœ…|βœ…|βœ…|
| WordChunker |βœ…|βœ…|βœ…|βœ…|
| SentenceChunker |⚠️|βœ…|⚠️|βœ…|
| SemanticChunker |❌|❌|⚠️/βœ…|βœ…|
| SPDMChunker |❌|❌|⚠️/βœ…|βœ…|

Note: In the table above, `⚠️/βœ…` means that some features are disabled but the chunker works nonetheless.

As the table shows, while the split may be slightly inconvenient in the short run, you can do a surprising amount with just the default dependencies (which are very light). Furthermore, even our maximal option, `all`, is lightweight compared to other libraries commonly used for such tasks.
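
For example, the default install already covers token-level chunking end to end. Here is a minimal sketch; the exact `TokenChunker` parameters are documented in the [TokenChunker](#tokenchunker) section below, so treat the ones shown here as illustrative assumptions:

```python
# Minimal sketch using only Chonkie's default dependencies. The parameter
# names are illustrative assumptions; see the TokenChunker section for the
# actual signature.
from chonkie import TokenChunker

chunker = TokenChunker(chunk_size=512, chunk_overlap=128)
for chunk in chunker.chunk("A very long document that needs CHONKing..."):
    print(chunk.text)

# Heavier chunkers need an extra, e.g. `pip install "chonkie[all]"`.
```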


## Why is chunking needed? (And may always be needed!)

While models have been gaining longer and longer contexts in recent times (as of 2024), they have yet to reach the stage where additional context comes for free. Even with the best architectures, attending over a context of n tokens costs each generated token at least O(n) compute, to say nothing of the memory the context occupies. And as long as we believe attention is all we need, it does not seem likely we will be free of this penalty.
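
As a rough back-of-the-envelope illustration of the memory side, consider the KV cache of a generic transformer; the configuration below is an assumption for a 7B-scale model, not a measurement of any particular one:

```python
# KV-cache growth is linear in context length (illustrative assumptions only).
layers, heads, head_dim = 32, 32, 128  # assumed 7B-scale configuration
bytes_per_value = 2                    # fp16
per_token = 2 * layers * heads * head_dim * bytes_per_value  # keys + values

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {n * per_token / 1e9:5.2f} GB of KV cache")
```

Every irrelevant token retrieved into the prompt pays this rent, which is exactly why tight retrieval matters.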

That means that, to run models efficiently (in both latency and memory), it is absolutely vital that we provide only the most relevant information during the retrieval phase.

Accuracy is one part of retrieval; granularity is the other. You might be able to extract the relevant article for the model to work with, but if only one line of that passage is actually relevant, you are in effect adding a lot of noise that hampers and confuses the model in practice. Ideally, you want to give the model only what it requires (though the ideal scenario is rarely possible). Retrieval quality therefore depends on both accuracy and granularity.

Representation models (or embedding models, as you may call them) are great at compressing a large amount of information (sometimes pages of text) into a vector of just 700-1000 floats, but that compression is not lossless. Most representations are lossy, and when many concepts compete for the same space, much of each is lost. Singular concepts and explanations, by contrast, breed stronger representation vectors. It is therefore vital not to dilute the representation with noise.
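
To make this concrete, here is an illustrative sketch (not part of Chonkie itself) using `sentence-transformers`, which Chonkie's `all` extra installs; the model name and texts are arbitrary:

```python
# Illustrative sketch: a focused chunk competes less for embedding capacity
# than a diluted one. Model and texts are arbitrary choices, not Chonkie code.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What year was the treaty signed?"
focused = "The treaty was signed in 1848, ending the war."
diluted = (
    "The region has a rich history. Trade flourished along the coast. "
    "The treaty was signed in 1848, ending the war. Agriculture remained "
    "the main occupation, and festivals drew visitors from afar."
)

q, f, d = model.encode([query, focused, diluted])
print(util.cos_sim(q, f), util.cos_sim(q, d))  # focused typically scores higher
```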

All this brings me back to chunking. Chunking, done well, ensures that your representation vector (or embedding) is of high enough quality to retrieve the best context for your model to generate with. That, in turn, leads to better-quality RAG generations. I therefore believe chunking is here to stay as long as RAG is, and it deserves to be treated as more than an afterthought.

# Chunkers
## TokenChunker
5 changes: 3 additions & 2 deletions pyproject.toml
Expand Up @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "chonkie"
version = "0.0.1a6"
version = "0.0.1a8"
description = "A simple, efficient text chunking library for RAG applications"
readme = "README.md"
requires-python = ">=3.8"
@@ -36,4 +36,5 @@ all = ["spacy>=3.0.0", "sentence-transformers>=2.0.0", "numpy>=1.23.0"]
dev = ["pytest>=6.2.0"]

[tool.setuptools]
packages = ["chonkie"]
package-dir = {"" = "src"}
packages = ["chonkie", "chonkie.chunker"]
21 changes: 14 additions & 7 deletions chonkie/__init__.py β†’ src/chonkie/__init__.py
@@ -1,11 +1,18 @@
-from .chunker.base import Chunk, BaseChunker
-from .chunker.token import TokenChunker
-from .chunker.word import WordChunker
-from .chunker.sentence import Sentence, SentenceChunk, SentenceChunker
-from .chunker.semantic import SemanticSentence, SemanticChunk, SemanticChunker
-from .chunker.spdm import SPDMChunker
+from .chunker import (
+    BaseChunker,
+    TokenChunker,
+    WordChunker,
+    SentenceChunker,
+    SemanticChunker,
+    SPDMChunker,
+    Chunk,
+    SentenceChunk,
+    SemanticChunk,
+    Sentence,
+    SemanticSentence
+)

__version__ = "0.0.1a6"
__version__ = "0.0.1a8"
__name__ = "chonkie"
__author__ = "Bhavnick Minhas"
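
Since `src/chonkie/chunker/__init__.py` re-exports everything (see the new file below), the move to a `src/` layout leaves the public import surface unchanged. A quick smoke test, assuming the package is installed (e.g. with `pip install -e .` from the repository root):

```python
# The src/ layout should not change any public imports.
import chonkie
from chonkie import TokenChunker, SemanticChunker, Chunk

print(chonkie.__version__)  # expected: "0.0.1a8"
```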

21 changes: 21 additions & 0 deletions src/chonkie/chunker/__init__.py
@@ -0,0 +1,21 @@
from .base import Chunk, BaseChunker
from .token import TokenChunker
from .word import WordChunker
from .sentence import Sentence, SentenceChunk, SentenceChunker
from .semantic import SemanticSentence, SemanticChunk, SemanticChunker
from .spdm import SPDMChunker


__all__ = [
    "Chunk",
    "BaseChunker",
    "TokenChunker",
    "WordChunker",
    "Sentence",
    "SentenceChunk",
    "SentenceChunker",
    "SemanticSentence",
    "SemanticChunk",
    "SemanticChunker",
    "SPDMChunker"
]
6 files renamed without changes.
