PySentence-Similarity is a tool designed to identify and find similarities between sentences and a base sentence, expressed as a percentage π. It compares the semantic value of each input sentence to the base sentence, providing a score that reflects how related or similar they are. This tool is useful for various natural language processing tasks such as clustering similar texts π, paraphrase detection π and textual consequence measurement π.
The models were converted to ONNX format to optimize and speed up inference. Converting models to ONNX enables cross-platform compatibility and optimized hardware acceleration, making it more efficient for large-scale or real-world applications π.
- High accuracy: Utilizes a robust Transformer-based architecture, providing high accuracy in semantic similarity calculations π¬.
- Cross-platform support: The ONNX format provides seamless integration across platforms, making it easy to deploy across environments π.
- Scalability: Efficient processing can handle large datasets, making it suitable for enterprise-level applications π.
- Real-time processing: Optimized for fast output, it can be used in real-world applications without significant latency β±οΈ.
- Flexible: Easily adaptable to specific use cases through customization or integration with additional models or features π οΈ.
- Low resource consumption: The model is designed to operate efficiently, reducing memory and CPU/GPU requirements, making it ideal for resource-constrained environments β‘.
- Fast and user-friendly: The library offers high performance and an intuitive interface, allowing users to quickly and easily integrate it into their projects π.
- Requirements: Python 3.8 or higher.
# install from PyPI
pip install pysentence-similarity
# install from GitHub
pip install git+https://github.com/goldpulpy/pysentence-similarity.git
You don't need to download anything; the package itself will download the model and its tokenizer from a special HF repository.
Below are the models currently added to the special repository, including their file size and a link to the source.
Model | Parameters | FP32 | FP16 | INT8 | Source link |
---|---|---|---|---|---|
paraphrase-albert-small-v2 | 11.7M | 45MB | 22MB | 38MB | HF π€ |
all-MiniLM-L6-v2 | 22.7M | 90MB | 45MB | 23MB | HF π€ |
paraphrase-MiniLM-L6-v2 | 22.7M | 90MB | 45MB | 23MB | HF π€ |
multi-qa-MiniLM-L6-cos-v1 | 22.7M | 90MB | 45MB | 23MB | HF π€ |
msmarco-MiniLM-L-6-v3 | 22.7M | 90MB | 45MB | 23MB | HF π€ |
all-MiniLM-L12-v2 | 33.4M | 127MB | 65MB | 32MB | HF π€ |
gte-small | 33.4M | 127MB | 65MB | 32MB | HF π€ |
all-distilroberta-v1 | 82.1M | 313MB | 157MB | 79MB | HF π€ |
all-mpnet-base-v2 | 109M | 418MB | 209MB | 105MB | HF π€ |
multi-qa-mpnet-base-dot-v1 | 109M | 418MB | 209MB | 105MB | HF π€ |
paraphrase-multilingual-MiniLM-L12-v2 | 118M | 449MB | 225MB | 113MB | HF π€ |
text2vec-base-multilingual | 118M | 449MB | 225MB | 113MB | HF π€ |
distiluse-base-multilingual-cased-v1 | 135M | 514MB | 257MB | 129MB | HF π€ |
paraphrase-multilingual-mpnet-base-v2 | 278M | 1.04GB | 530MB | 266MB | HF π€ |
gte-multilingual-base | 305M | 1.17GB | 599MB | 324MB | HF π€ |
gte-large | 335M | 1.25GB | 640MB | 321MB | HF π€ |
all-roberta-large-v1 | 355M | 1.32GB | 678MB | 340MB | HF π€ |
LaBSE | 470M | 1.75GB | 898MB | 450MB | HF π€ |
PySentence-Similarity supports FP32
, FP16
, and INT8
dtypes.
- FP32: 32-bit floating-point format that provides high precision and a wide range of values.
- FP16: 16-bit floating-point format, reducing memory consumption and computation time, with minimal loss of precision (typically less than 1%).
- INT8: 8-bit integer quantized format that greatly reduces model size and speeds up output, ideal for resource-constrained environments, with little loss of precision.
Let's define the similarity score as the percentage of how similar the sentences are to the original sentence (0.75 = 75%), default compute function is cosine
You can use CUDA 12.X by passing the device='cuda'
parameter to the Model object; the default is cpu
. If the device is not available, it will automatically be set to cpu
.
from pysentence_similarity import Model
from pysentence_similarity.utils import compute_score
# Create an instance of the model all-MiniLM-L6-v2; the default dtype is `fp32`
model = Model("all-MiniLM-L6-v2", dtype="fp16")
sentences = [
"This is another test.",
"This is yet another test.",
"We are testing sentence similarity."
]
# Convert sentences to embeddings
# The default is to use mean_pooling as a pooling function
source_embedding = model.encode("This is a test.")
embeddings = model.encode(sentences, progress_bar=True)
# Compute similarity scores
# The rounding parameter allows us to round our float values
# with a default of 2, which means 2 decimal places.
compute_score(source_embedding, embeddings)
# Return: [0.86, 0.77, 0.48]
compute_score
returns in the same index order in which the embedding was encoded.
Let's see the sentence and its evaluation from a computational function
# Compute similarity scores
scores = compute_score(source_embedding, embeddings)
for sentence, score in zip(sentences, scores):
print(f"{sentence} ({score})")
# Output prints:
# This is another test. (0.86)
# This is yet another test. (0.77)
# We are testing sentence similarity. (0.48)
You can use the computational functions: cosine
, euclidean
, manhattan
, jaccard
, pearson
, minkowski
, hamming
, kl_divergence
, chebyshev
, bregman
or your custom function
from pysentence_similarity.compute import euclidean
compute_score(source_embedding, embeddings, compute_function=euclidean)
# Return: [2.52, 3.28, 5.62]
You can use max_pooling
, mean_pooling
, min_pooling
or your custom function
from pysentence_similarity.pooling import max_pooling
source_embedding = model.encode("This is a test.", pooling_function=max_pooling)
embeddings = model.encode(sentences, pooling_function=max_pooling)
...
from pysentence_similarity import Model
from pysentence_similarity.utils import search_similar
# Create an instance of the model
model = Model("all-MiniLM-L6-v2", dtype="fp16")
# Test text
sentences = [
"Hello my name is Bob.",
"I love to eat pizza.",
"We are testing sentence similarity."
"Today is a sunny day.",
"London is the capital of England.",
"I am a student at Stanford University."
]
# Convert query sentence to embedding
query_embedding = model.encode("What's the capital of England?")
# Convert sentences to embeddings
embeddings = model.encode(sentences)
# Search similar sentences
similar = search_similar(
query_embedding=query_embedding,
sentences=sentences,
embeddings=embeddings,
top_k=3 # number of similar sentences to return
)
# Print similar sentences
for idx, (sentence, score) in enumerate(similar, start=1):
print(f"{idx}: {sentence} ({score})")
# Output prints:
# 1: London is the capital of England. (0.81)
# 2: Hello my name is Bob. (0.06)
# 3: I love to eat pizza. (0.05)
With use storage
from pysentence_similarity import Model, Storage
from pysentence_similarity.utils import search_similar
model = Model("all-MiniLM-L6-v2", dtype="fp16")
query_embedding = model.encode("What's the capital of England?")
storage = Storage.load("my_storage.h5")
similar = search_similar(
query_embedding=query_embedding,
storage=storage,
top_k=3
)
...
from pysentence_similarity import Splitter
# Default split markers: '\n'
splitter = Splitter()
# If you want to separate by specific characters.
splitter = Splitter(markers_to_split=["!", "?", "."], preserve_markers=True)
# Test text
text = "Hello world! How are you? I'm fine."
# Split from text
splitter.split_from_text(text)
# Return: ['Hello world!', 'How are you?', "I'm fine."]
At this point, sources for the splitting are available: text, file, URL, CSV, and JSON.
The storage allows you to save and link sentences and their embeddings for easy access, so you don't need to encode a large corpus of text every time. The storage also enables similarity searching.
The storage must store the sentences themselves and their embeddings.
from pysentence_similarity import Model, Storage
# Create an instance of the model
model = Model("all-MiniLM-L6-v2", dtype="fp16")
# Create an instance of the storage
storage = Storage()
sentences = [
"This is another test.",
"This is yet another test.",
"We are testing sentence similarity."
]
# Convert sentences to embeddings
embeddings = model.encode(sentences)
# Add sentences and their embeddings
storage.add(sentences, embeddings)
# Save the storage
storage.save("my_storage.h5")
Load from the storage
from pysentence_similarity import Model, Storage
from pysentence_similarity.utils import compute_score
# Create an instance of the model and storage
model = Model("all-MiniLM-L6-v2", dtype="fp16")
storage = Storage.load("my_storage.h5")
# Convert sentence to embedding
source_embedding = model.encode("This is a test.")
# Compute similarity scores with the storage
compute_score(source_embedding, storage)
# Return: [0.86, 0.77, 0.48]
This project is licensed under the MIT License. See the LICENSE file for details