Added LICENSE and README

goldpulpy · Oct 9, 2024 · acd3db8
1 parent 0c4962b

Showing 2 changed files with 219 additions and 0 deletions.
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 .pulpy
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,198 @@
+<h1 align="center">PySentence-Similarity 😊</h1>
+<p align="center">
+    <a href="https://github.com/goldpulpy/pysentence-similarity/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/goldpulpy/pysentence-similarity.svg?color=blue"></a>
+    <img alt="GitHub Actions Workflow Status" src="https://img.shields.io/github/actions/workflow/status/goldpulpy/pysentence-similarity/package.yml">
+</p>
+
+## Information
+
+**pysentence-similarity** is a tool that measures how similar each of a set of sentences is to a base sentence, expressed as a percentage 📊. It compares the semantic content of each input sentence with the base sentence and returns a score reflecting how closely they are related. This is useful for natural language processing tasks such as clustering similar texts 📚, paraphrase detection 🔍, and textual entailment measurement 📈.
+
+The models were converted to ONNX format to optimize and speed up inference. ONNX enables cross-platform compatibility and optimized hardware acceleration, making the models more efficient for large-scale or real-world applications 🚀.
+
+- **High accuracy:** Utilizes a robust Transformer-based architecture, providing high accuracy in semantic similarity calculations 🔬.
+- **Cross-platform support:** The ONNX format provides seamless integration across platforms, making it easy to deploy across environments 🌐.
+- **Scalability:** Efficient processing can handle large datasets, making it suitable for enterprise-level applications 📈.
+- **Real-time processing:** Optimized for fast inference, it can be used in real-world applications without significant latency ⏱️.
+- **Flexible:** Easily adaptable to specific use cases through customization or integration with additional models or features 🛠️.
+- **Low resource consumption:** The model is designed to operate efficiently, reducing memory and CPU/GPU requirements, making it ideal for resource-constrained environments ⚡.
+- **Fast and user-friendly:** The library offers high performance and an intuitive interface, allowing users to quickly and easily integrate it into their projects 🚀.
+
+## Installation 📦
+
+- **Requirements:** Python 3.8 or higher.
+
+```bash
+# install from PyPI
+pip install pysentence-similarity
+
+# install from GitHub
+pip install git+https://github.com/goldpulpy/pysentence-similarity.git
+```
+
+## Supported models 🤝
+
+You don't need to download anything manually; the package downloads the model and its tokenizer from a dedicated Hugging Face [repository](https://huggingface.co/goldpulpy/pysentence-similarity).
+
+The table below lists the models currently available in that repository, with their parameter counts, file sizes per dtype, and a link to the source.
+
+| Model                                 | Parameters | FP32   | FP16  | INT8  | Source link                                                                                 |
+| ------------------------------------- | ---------- | ------ | ----- | ----- | ------------------------------------------------------------------------------------------- |
+| all-MiniLM-L6-v2                      | 22.7M      | 90MB   | 45MB  | 23MB  | [HF](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) 🤗                      |
+| paraphrase-MiniLM-L6-v2               | 22.7M      | 90MB   | 45MB  | 23MB  | [HF](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) 🤗               |
+| all-MiniLM-L12-v2                     | 33.4M      | 127MB  | 65MB  | 32MB  | [HF](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) 🤗                     |
+| gte-small                             | 33.4M      | 127MB  | 65MB  | 32MB  | [HF](https://huggingface.co/thenlper/gte-small) 🤗                                          |
+| all-mpnet-base-v2                     | 109M       | 418MB  | 209MB | 105MB | [HF](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) 🤗                     |
+| paraphrase-multilingual-MiniLM-L12-v2 | 118M       | 449MB  | 225MB | 113MB | [HF](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) 🤗 |
+| text2vec-base-multilingual            | 118M       | 449MB  | 225MB | 113MB | [HF](https://huggingface.co/shibing624/text2vec-base-multilingual) 🤗                       |
+| gte-multilingual-base                 | 305M       | 1.17GB | 599MB | 324MB | [HF](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) 🤗                           |
+| gte-large                             | 335M       | 1.25GB | 640MB | 321MB | [HF](https://huggingface.co/thenlper/gte-large) 🤗                                          |
+| LaBSE                                 | 470M       | 1.75GB | 898MB | 450MB | [HF](https://huggingface.co/sentence-transformers/LaBSE) 🤗                                 |
+
+**pysentence-similarity** supports `FP32`, `FP16`, and `INT8` dtypes.
+
+- **FP32:** 32-bit floating-point format that provides high precision and a wide range of values.
+- **FP16:** 16-bit floating-point format, reducing memory consumption and computation time, with minimal loss of precision (typically less than 1%).
+- **INT8:** 8-bit integer quantized format that greatly reduces model size and speeds up inference, ideal for resource-constrained environments, with little loss of precision.
+
+## Usage examples 📖
+
+### Sentence similarity score 📊
+
+The similarity score expresses how similar each sentence is to the source sentence, as a fraction (0.75 = 75%). The default compute function is `cosine`.
+
+You can use CUDA 12.X by passing the `device='cuda'` parameter to the `Model` object; the default is `cpu`. If the requested device is not available, it automatically falls back to `cpu`.
+
+```python
+from pysentence_similarity import Model, compute_score
+
+# Create an all-MiniLM-L6-v2 model instance using fp16 (the default dtype is `fp32`)
+model = Model("all-MiniLM-L6-v2", dtype="fp16")
+
+sentences = [
+    "This is another test.",
+    "This is yet another test.",
+    "We are testing sentence similarity."
+]
+
+# Convert sentences to embeddings
+# The default is to use mean_pooling as a pooling function
+source_embedding = model.encode("This is a test.")
+embeddings = model.encode(sentences, progress_bar=True)
+
+# Compute similarity scores
+# The rounding parameter controls how the float scores are rounded;
+# the default is 2, which means 2 decimal places.
+compute_score(source_embedding, embeddings)
+# Return: [0.86, 0.77, 0.48]
+```
+
+`compute_score` returns the scores in the same order in which the sentences were encoded.
+
+To see each sentence alongside its score:
+
+```python
+# Compute similarity scores
+scores = compute_score(source_embedding, embeddings)
+
+for sentence, score in zip(sentences, scores):
+    print(f"{sentence} ({score})")
+
+# Output prints:
+# This is another test. (0.86)
+# This is yet another test. (0.77)
+# We are testing sentence similarity. (0.48)
+```
+
+The available compute functions are `cosine`, `euclidean`, `manhattan`, `jaccard`, `pearson`, `minkowski`, `hamming`, `kl_divergence`, `chebyshev`, and `bregman`; you can also pass your own function.
+
+```python
+from pysentence_similarity.compute import euclidean
+
+compute_score(source_embedding, embeddings, compute_function=euclidean)
+# Return: [2.52, 3.28, 5.62]
+```
+
+For pooling, you can use `max_pooling`, `mean_pooling`, `min_pooling`, or your own function.
+
+```python
+from pysentence_similarity.pooling import max_pooling
+
+source_embedding = model.encode("This is a test.", pooling_function=max_pooling)
+embeddings = model.encode(sentences, pooling_function=max_pooling)
+...
+```
+
+### Splitting ✂️
+
+```python
+from pysentence_similarity import Splitter
+
+# Default split markers: '\n'
+splitter = Splitter()
+
+# To split on specific characters instead, keeping the markers in the output:
+splitter = Splitter(markers_to_split=["!", "?", "."], preserve_markers=True)
+
+# Test text
+text = "Hello world! How are you? I'm fine."
+
+# Split from text
+splitter.split_from_text(text)
+# Return: ['Hello world!', 'How are you?', "I'm fine."]
+```
+
+Currently, the splitter can read from the following sources: text, file, URL, CSV, and JSON.
+
+### Storage 💾
+
+The storage allows you to save and link sentences and their embeddings for easy access, so you don't need to encode a large corpus of text every time. The storage also enables similarity searching.
+
+The storage must contain both the **sentences** and their **embeddings**.
+
+```python
+from pysentence_similarity import Model, Storage
+
+# Create an instance of the model
+model = Model("all-MiniLM-L6-v2", dtype="fp16")
+
+# Create an instance of the storage
+storage = Storage()
+sentences = [
+    "This is another test.",
+    "This is yet another test.",
+    "We are testing sentence similarity."
+]
+
+# Convert sentences to embeddings
+embeddings = model.encode(sentences)
+
+# Add sentences and their embeddings
+storage.add(sentences, embeddings)
+
+# Save the storage
+storage.save("my_storage.h5")
+```
+
+To load a saved storage and score against it:
+
+```python
+from pysentence_similarity import Model, Storage, compute_score
+
+# Create an instance of the model and storage
+model = Model("all-MiniLM-L6-v2", dtype="fp16")
+storage = Storage.load("my_storage.h5")
+
+# Convert sentence to embedding
+source_embedding = model.encode("This is a test.")
+
+# Compute similarity scores with the storage
+compute_score(source_embedding, storage)
+# Return: [0.86, 0.77, 0.48]
+```
+
+## License 📜
+
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
+
+<h6 align="center">Created by goldpulpy with ❤️</h6>
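
Note on the README added in this commit: its usage section names `cosine` as the default compute function and says scores are rounded to 2 decimal places by default. The idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: `cosine_score` and `score_all` are hypothetical names, and the zero-vector handling is an assumption.

```python
import math
from typing import List, Sequence


def cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: a zero vector is treated as "not similar"
        return 0.0
    return dot / (norm_a * norm_b)


def score_all(
    source: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    rounding: int = 2,
) -> List[float]:
    """Score every embedding against the source, preserving input order."""
    return [round(cosine_score(source, e), rounding) for e in embeddings]


# Toy 3-dimensional vectors; real sentence embeddings have hundreds of dimensions
source = [1.0, 0.0, 0.0]
candidates = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
print(score_all(source, candidates))
# prints [1.0, 0.71, 0.0]
```

The scores come back in the same order as the input vectors, which matches the README's statement about `compute_score`. Distance-style functions such as `euclidean` behave differently: smaller values mean more similar, so the results are not directly comparable with cosine scores.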