diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..892e939
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 .pulpy
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..7ed0a3c
--- /dev/null
+++ b/README.md
@@ -0,0 +1,198 @@
+

PySentence-Similarity 😊

+


+
+## Information
+
+**pysentence-similarity** is a tool that measures how similar each input sentence is to a base sentence, expressed as a percentage 📊. It compares the semantic content of each input sentence with that of the base sentence and returns a score that reflects how closely related they are. This is useful for natural language processing tasks such as clustering similar texts 📚, paraphrase detection 🔍, and textual entailment measurement 📈.
+
+The models are converted to ONNX format to speed up inference. ONNX also enables cross-platform compatibility and hardware acceleration, which makes the models more efficient in large-scale or real-world applications 🚀.
+
+- **High accuracy:** Uses Transformer-based sentence-embedding models, providing high accuracy in semantic similarity calculations 🔬.
+- **Cross-platform support:** The ONNX format integrates across platforms, making it easy to deploy in different environments 🌐.
+- **Scalability:** Efficient processing can handle large datasets, making it suitable for enterprise-level applications 📈.
+- **Real-time processing:** Optimized for fast inference, so it can be used in real-world applications without significant latency ⏱️.
+- **Flexible:** Adaptable to specific use cases through custom compute and pooling functions or integration with additional models 🛠️.
+- **Low resource consumption:** The models are designed to run efficiently, reducing memory and CPU/GPU requirements, which suits resource-constrained environments ⚡.
+- **Fast and user-friendly:** The library offers high performance and an intuitive interface, so it can be integrated into a project quickly 🚀.
+
+## Installation 📦
+
+- **Requirements:** Python 3.8 or higher.
+
+```bash
+# install from PyPI
+pip install pysentence-similarity
+
+# install from GitHub
+pip install git+https://github.com/goldpulpy/pysentence-similarity.git
+```
+
+## Supported models 🤝
+
+You don't need to download anything manually; the package downloads the model and its tokenizer from a dedicated Hugging Face [repository](https://huggingface.co/goldpulpy/pysentence-similarity).
+
+The table below lists the models currently available in that repository, with their file sizes and a link to the source.
+
+| Model                                 | Parameters | FP32   | FP16  | INT8  | Source link                                                                                 |
+| ------------------------------------- | ---------- | ------ | ----- | ----- | ------------------------------------------------------------------------------------------- |
+| all-MiniLM-L6-v2                      | 22.7M      | 90MB   | 45MB  | 23MB  | [HF](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) 🤗                      |
+| paraphrase-MiniLM-L6-v2               | 22.7M      | 90MB   | 45MB  | 23MB  | [HF](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) 🤗               |
+| all-MiniLM-L12-v2                     | 33.4M      | 127MB  | 65MB  | 32MB  | [HF](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) 🤗                     |
+| gte-small                             | 33.4M      | 127MB  | 65MB  | 32MB  | [HF](https://huggingface.co/thenlper/gte-small) 🤗                                          |
+| all-mpnet-base-v2                     | 109M       | 418MB  | 209MB | 105MB | [HF](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) 🤗                     |
+| paraphrase-multilingual-MiniLM-L12-v2 | 118M       | 449MB  | 225MB | 113MB | [HF](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) 🤗 |
+| text2vec-base-multilingual            | 118M       | 449MB  | 225MB | 113MB | [HF](https://huggingface.co/shibing624/text2vec-base-multilingual) 🤗                       |
+| gte-multilingual-base                 | 305M       | 1.17GB | 599MB | 324MB | [HF](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) 🤗                           |
+| gte-large                             | 335M       | 1.25GB | 640MB | 321MB | [HF](https://huggingface.co/thenlper/gte-large) 🤗                                          |
+| LaBSE                                 | 470M       | 1.75GB | 898MB | 450MB | [HF](https://huggingface.co/sentence-transformers/LaBSE) 🤗                                 |
+
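The file sizes in the table roughly follow from the parameter count multiplied by the bytes stored per parameter (4 for FP32, 2 for FP16, 1 for INT8). The sketch below is only back-of-the-envelope arithmetic and is not part of the package; the helper name `estimate_size_mb` is hypothetical, and real files differ somewhat because of metadata, unit conventions, and layers that may stay unquantized.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def estimate_size_mb(num_params: float, dtype: str) -> float:
    """Rough model size in MB (10**6 bytes): parameters x bytes per parameter."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# all-MiniLM-L6-v2 has about 22.7M parameters according to the table.
for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: ~{estimate_size_mb(22.7e6, dtype):.1f} MB")
```

For this model the estimates land close to the 90MB, 45MB, and 23MB listed in the table.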
+**pysentence-similarity** supports `FP32`, `FP16`, and `INT8` dtypes.
+
+- **FP32:** 32-bit floating-point format that provides high precision and a wide range of values.
+- **FP16:** 16-bit floating-point format that reduces memory consumption and computation time with minimal loss of precision (typically less than 1%).
+- **INT8:** 8-bit integer quantized format that greatly reduces model size and speeds up inference with little loss of precision, which suits resource-constrained environments.
+
+## Usage examples 📖
+
+### Sentence similarity score 📊
+
+The similarity score expresses how similar each sentence is to the source sentence as a fraction (0.75 = 75%). The default compute function is `cosine`.
+
+You can use CUDA 12.X by passing `device='cuda'` to the `Model` object; the default is `cpu`. If the requested device is not available, the model automatically falls back to `cpu`.
+
+```python
+from pysentence_similarity import Model, compute_score
+
+# Create an instance of the all-MiniLM-L6-v2 model with dtype fp16 (the default dtype is `fp32`)
+model = Model("all-MiniLM-L6-v2", dtype="fp16")
+
+sentences = [
+    "This is another test.",
+    "This is yet another test.",
+    "We are testing sentence similarity."
+]
+
+# Convert sentences to embeddings
+# The default is to use mean_pooling as a pooling function
+source_embedding = model.encode("This is a test.")
+embeddings = model.encode(sentences, progress_bar=True)
+
+# Compute similarity scores
+# The rounding parameter controls how the float scores are rounded;
+# the default is 2, which means 2 decimal places.
+compute_score(source_embedding, embeddings)
+# Return: [0.86, 0.77, 0.48]
+```
+
+`compute_score` returns the scores in the same order in which the sentences were encoded.
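To make the score concrete, here is a minimal pure-Python sketch of cosine similarity, the default compute function named above. It is an illustration under assumptions, not the package's implementation: the helper name `cosine_similarity` and the toy 3-dimensional vectors are hypothetical, and real embeddings have hundreds of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" chosen by hand for illustration.
source = [1.0, 2.0, 3.0]
candidates = [
    [2.0, 4.0, 6.0],     # same direction as the source
    [3.0, 2.0, 1.0],     # partly aligned with the source
    [-1.0, -2.0, -3.0],  # opposite direction
]

# One score per candidate, in the same order as the candidates list.
scores = [round(cosine_similarity(source, c), 2) for c in candidates]
print(scores)  # [1.0, 0.71, -1.0]
```

Vectors pointing the same way score close to 1, unrelated directions score near 0, and opposite directions score close to -1.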
+
+To print each sentence alongside its `compute_score` result:
+
+```python
+# Compute similarity scores
+scores = compute_score(source_embedding, embeddings)
+
+for sentence, score in zip(sentences, scores):
+    print(f"{sentence} ({score})")
+
+# Output prints:
+# This is another test. (0.86)
+# This is yet another test. (0.77)
+# We are testing sentence similarity. (0.48)
+```
+
+You can use the following compute functions: `cosine`, `euclidean`, `manhattan`, `jaccard`, `pearson`, `minkowski`, `hamming`, `kl_divergence`, `chebyshev`, `bregman`, or your own custom function.
+
+```python
+from pysentence_similarity.compute import euclidean
+
+compute_score(source_embedding, embeddings, compute_function=euclidean)
+# Return: [2.52, 3.28, 5.62]
+```
+
+For pooling, you can use `max_pooling`, `mean_pooling`, `min_pooling`, or your own custom function.
+
+```python
+from pysentence_similarity.pooling import max_pooling
+
+source_embedding = model.encode("This is a test.", pooling_function=max_pooling)
+embeddings = model.encode(sentences, pooling_function=max_pooling)
+...
+```
+
+### Splitting ✂️
+
+```python
+from pysentence_similarity import Splitter
+
+# Default split markers: '\n'
+splitter = Splitter()
+
+# To split on specific characters, pass them explicitly.
+splitter = Splitter(markers_to_split=["!", "?", "."], preserve_markers=True)
+
+# Test text
+text = "Hello world! How are you? I'm fine."
+
+# Split from text
+splitter.split_from_text(text)
+# Return: ['Hello world!', 'How are you?', "I'm fine."]
+```
+
+Currently, the splitter can read from the following sources: text, file, URL, CSV, and JSON.
+
+### Storage 💾
+
+The storage lets you save sentences together with their embeddings for easy access, so you don't need to re-encode a large corpus of text every time. The storage also enables similarity searching.
+
+The storage must store the **sentences** themselves and their **embeddings**.
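Conceptually, similarity search over a storage amounts to keeping each sentence next to its embedding and ranking the pairs against a query embedding. The sketch below illustrates that idea in plain Python; it is not how `Storage` is implemented, and `ToyStorage`, `top_k`, and the toy 2-dimensional vectors are hypothetical names and data chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class ToyStorage:
    """Hypothetical in-memory store: sentences and embeddings in parallel lists."""

    def __init__(self) -> None:
        self.sentences: List[str] = []
        self.embeddings: List[Sequence[float]] = []

    def add(self, sentences: List[str], embeddings: List[Sequence[float]]) -> None:
        # Every sentence must be paired with exactly one embedding.
        if len(sentences) != len(embeddings):
            raise ValueError("each sentence needs exactly one embedding")
        self.sentences.extend(sentences)
        self.embeddings.extend(embeddings)

    def top_k(self, query: Sequence[float], k: int = 1) -> List[Tuple[str, float]]:
        # Score every stored sentence, then keep the k highest-scoring ones.
        scored = [
            (sentence, round(cosine(query, embedding), 2))
            for sentence, embedding in zip(self.sentences, self.embeddings)
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


storage = ToyStorage()
storage.add(["first", "second", "third"], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

best = storage.top_k([1.0, 0.2], k=2)
print(best)  # [('first', 0.98), ('third', 0.83)]
```

The library's own `Storage` class, used in the example that follows, additionally handles saving to and loading from a file.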
+
+```python
+from pysentence_similarity import Model, Storage
+
+# Create an instance of the model
+model = Model("all-MiniLM-L6-v2", dtype="fp16")
+
+# Create an instance of the storage
+storage = Storage()
+sentences = [
+    "This is another test.",
+    "This is yet another test.",
+    "We are testing sentence similarity."
+]
+
+# Convert sentences to embeddings
+embeddings = model.encode(sentences)
+
+# Add sentences and their embeddings
+storage.add(sentences, embeddings)
+
+# Save the storage
+storage.save("my_storage.h5")
+```
+
+Load from the storage:
+
+```python
+from pysentence_similarity import Model, Storage, compute_score
+
+# Create an instance of the model and storage
+model = Model("all-MiniLM-L6-v2", dtype="fp16")
+storage = Storage.load("my_storage.h5")
+
+# Convert sentence to embedding
+source_embedding = model.encode("This is a test.")
+
+# Compute similarity scores with the storage
+compute_score(source_embedding, storage)
+# Return: [0.86, 0.77, 0.48]
+```
+
+## License 📜
+
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
+
+
Created by goldpulpy with ❤️