Merge pull request #3 from bhavnicksm/development
v0.0.1a8
bhavnicksm authored Nov 3, 2024
2 parents 56e33be + d59eaf4 commit 8a8f2a5
Showing 10 changed files with 83 additions and 9 deletions.
45 changes: 45 additions & 0 deletions DOCS.md
@@ -6,11 +6,56 @@

- [πŸ¦› Chonkie Docs](#-chonkie-docs)
- [Table Of Contents](#table-of-contents)
- [Design CHONKosophy](#design-chonkosophy)
- [Dependency Table](#dependency-table)
  - [Why is chunking needed? (And may always be needed!)](#why-is-chunking-needed-and-may-always-be-needed)
- [Chunkers](#chunkers)
- [TokenChunker](#tokenchunker)
- [Initialization](#initialization)
- [Methods](#methods)
- [Example](#example)


# Design CHONKosophy

> Did you know that hippos are surprisingly smart?

A lot of thought went into this repository, and I want to take some space here to walk through the decisions behind it: the what, the why, and the how.

1. **Chonkie is very smart**
2. **Chonkie is surprisingly lightweight**
3. **Chonkie is superrrrr fast**



## Dependency Table

As mentioned in the [Design](#design-chonkosophy) section, Chonkie stays lightweight by keeping the dependencies for each chunker separate, making it more of an aggregate of multiple repositories and Python packages. Python's optional-dependencies feature makes this split possible.

| Chunker | Default | 'sentence' | 'semantic' | 'all' |
|----------|----------|----------|----------|----------|
| TokenChunker |βœ…|βœ…|βœ…|βœ…|
| WordChunker |βœ…|βœ…|βœ…|βœ…|
| SentenceChunker |⚠️|βœ…|⚠️|βœ…|
| SemanticChunker |❌|❌|⚠️/βœ…|βœ…|
| SPDMChunker |❌|❌|⚠️/βœ…|βœ…|

Note: In the table above, `⚠️/βœ…` means that some features are disabled but the chunker works nonetheless.

As the table shows, while the split may be slightly inconvenient in the short run, you can do a surprising amount with just the default dependencies (which are very light). Furthermore, even our maximal option, `all`, is lightweight compared to other libraries commonly used for such tasks.
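
For example, the default install already covers token-level chunking end to end. Here is a minimal sketch; the exact `TokenChunker` parameters are documented in the [TokenChunker](#tokenchunker) section below, so treat the ones shown here as illustrative assumptions:

```python
# Minimal sketch using only Chonkie's default dependencies. The parameter
# names are illustrative assumptions; see the TokenChunker section for the
# actual signature.
from chonkie import TokenChunker

chunker = TokenChunker(chunk_size=512, chunk_overlap=128)
for chunk in chunker.chunk("A very long document that needs CHONKing..."):
    print(chunk.text)

# Heavier chunkers need an extra, e.g. `pip install "chonkie[all]"`.
```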


## Why is chunking needed? (And may always be needed!)

While models have been gaining longer and longer contexts in recent times (as of 2024), they have yet to reach the stage where additional context comes for free. Even with the best architectures, attending over a context of n tokens costs each generated token at least O(n) compute, to say nothing of the memory the context occupies. And as long as we believe attention is all we need, it does not seem likely we will be free of this penalty.
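
As a rough back-of-the-envelope illustration of the memory side, consider the KV cache of a generic transformer; the configuration below is an assumption for a 7B-scale model, not a measurement of any particular one:

```python
# KV-cache growth is linear in context length (illustrative assumptions only).
layers, heads, head_dim = 32, 32, 128  # assumed 7B-scale configuration
bytes_per_value = 2                    # fp16
per_token = 2 * layers * heads * head_dim * bytes_per_value  # keys + values

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {n * per_token / 1e9:5.2f} GB of KV cache")
```

Every irrelevant token retrieved into the prompt pays this rent, which is exactly why tight retrieval matters.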

That means that, to run models efficiently (in both latency and memory), it is absolutely vital that we provide only the most relevant information during the retrieval phase.

Accuracy is one part of retrieval; granularity is the other. You might be able to extract the relevant article for the model to work with, but if only one line of that passage is actually relevant, you are in effect adding a lot of noise that hampers and confuses the model in practice. Ideally, you want to give the model only what it requires (though the ideal scenario is rarely possible). Retrieval quality therefore depends on both accuracy and granularity.

Representation models (or embedding models, as you may call them) are great at compressing a large amount of information (sometimes pages of text) into a vector of just 700-1000 floats, but that compression is not lossless. Most representations are lossy, and when many concepts compete for the same space, much of each is lost. Singular concepts and explanations, by contrast, breed stronger representation vectors. It is therefore vital not to dilute the representation with noise.
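
To make this concrete, here is an illustrative sketch (not part of Chonkie itself) using `sentence-transformers`, which Chonkie's `all` extra installs; the model name and texts are arbitrary:

```python
# Illustrative sketch: a focused chunk competes less for embedding capacity
# than a diluted one. Model and texts are arbitrary choices, not Chonkie code.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What year was the treaty signed?"
focused = "The treaty was signed in 1848, ending the war."
diluted = (
    "The region has a rich history. Trade flourished along the coast. "
    "The treaty was signed in 1848, ending the war. Agriculture remained "
    "the main occupation, and festivals drew visitors from afar."
)

q, f, d = model.encode([query, focused, diluted])
print(util.cos_sim(q, f), util.cos_sim(q, d))  # focused typically scores higher
```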

All this brings me back to chunking. Chunking, done well, ensures that your representation vector (or embedding) is of high enough quality to retrieve the best context for your model to generate with. That, in turn, leads to better-quality RAG generations. I therefore believe chunking is here to stay as long as RAG is, and it deserves to be treated as more than an afterthought.

# Chunkers
## TokenChunker
5 changes: 3 additions & 2 deletions pyproject.toml
Expand Up @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "chonkie"
version = "0.0.1a6"
version = "0.0.1a8"
description = "A simple, efficient text chunking library for RAG applications"
readme = "README.md"
requires-python = ">=3.8"
@@ -36,4 +36,5 @@ all = ["spacy>=3.0.0", "sentence-transformers>=2.0.0", "numpy>=1.23.0"]
dev = ["pytest>=6.2.0"]

[tool.setuptools]
packages = ["chonkie"]
package-dir = {"" = "src"}
packages = ["chonkie", "chonkie.chunker"]
21 changes: 14 additions & 7 deletions chonkie/__init__.py β†’ src/chonkie/__init__.py
@@ -1,11 +1,18 @@
-from .chunker.base import Chunk, BaseChunker
-from .chunker.token import TokenChunker
-from .chunker.word import WordChunker
-from .chunker.sentence import Sentence, SentenceChunk, SentenceChunker
-from .chunker.semantic import SemanticSentence, SemanticChunk, SemanticChunker
-from .chunker.spdm import SPDMChunker
+from .chunker import (
+    BaseChunker,
+    TokenChunker,
+    WordChunker,
+    SentenceChunker,
+    SemanticChunker,
+    SPDMChunker,
+    Chunk,
+    SentenceChunk,
+    SemanticChunk,
+    Sentence,
+    SemanticSentence
+)

__version__ = "0.0.1a6"
__version__ = "0.0.1a8"
__name__ = "chonkie"
__author__ = "Bhavnick Minhas"
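
Since `src/chonkie/chunker/__init__.py` re-exports everything (see the new file below), the move to a `src/` layout leaves the public import surface unchanged. A quick smoke test, assuming the package is installed (e.g. with `pip install -e .` from the repository root):

```python
# The src/ layout should not change any public imports.
import chonkie
from chonkie import TokenChunker, SemanticChunker, Chunk

print(chonkie.__version__)  # expected: "0.0.1a8"
```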

21 changes: 21 additions & 0 deletions src/chonkie/chunker/__init__.py
@@ -0,0 +1,21 @@
from .base import Chunk, BaseChunker
from .token import TokenChunker
from .word import WordChunker
from .sentence import Sentence, SentenceChunk, SentenceChunker
from .semantic import SemanticSentence, SemanticChunk, SemanticChunker
from .spdm import SPDMChunker


__all__ = [
    "Chunk",
    "BaseChunker",
    "TokenChunker",
    "WordChunker",
    "Sentence",
    "SentenceChunk",
    "SentenceChunker",
    "SemanticSentence",
    "SemanticChunk",
    "SemanticChunker",
    "SPDMChunker"
]
6 files renamed without changes.
