Common string similarity algorithms sim: (Σ* × Σ*) → [0, 1].
Algorithms with an open value range (such as the Hamming distance) are normalized to [0, 1].
Complexity: O(n)
def hamming_distance(str1: str, str2: str) -> int
def hamming(str1: str, str2: str) -> float
If the strings differ in length, the shorter one is padded with blank symbols so that the algorithm can be applied.
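As an illustration of the padding and normalization described above, here is a minimal sketch. The names `hamming_distance_sketch` and `hamming_sketch` are hypothetical, and the normalization (one minus the distance divided by the longer length) is an assumption, not necessarily what the library does.

```python
from itertools import zip_longest


def hamming_distance_sketch(str1: str, str2: str) -> int:
    # zip_longest pads the shorter string with None, which acts as the
    # "blank symbol" and never equals a real character.
    return sum(c1 != c2 for c1, c2 in zip_longest(str1, str2))


def hamming_sketch(str1: str, str2: str) -> float:
    # Assumed normalization: 1 - distance / length of the longer string.
    longest = max(len(str1), len(str2))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - hamming_distance_sketch(str1, str2) / longest


print(hamming_distance_sketch("karolin", "kathrin"))  # 3
print(hamming_sketch("abcd", "abcf"))                 # 0.75
```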
Complexity: O(n²)
def levenshtein_distance(str1: str, str2: str) -> int
def levenshtein(str1: str, str2: str) -> float
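The quadratic complexity comes from filling a table with one cell per pair of prefixes. The sketch below shows the classic dynamic-programming formulation with two rows; the function names are hypothetical, and the normalization by the longer string length is an assumption.

```python
def levenshtein_distance_sketch(str1: str, str2: str) -> int:
    # previous[j] holds the distance between the processed prefix of str1
    # and the first j characters of str2.
    previous = list(range(len(str2) + 1))
    for i, c1 in enumerate(str1, start=1):
        current = [i]
        for j, c2 in enumerate(str2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_sketch(str1: str, str2: str) -> float:
    # Assumed normalization: 1 - distance / length of the longer string.
    longest = max(len(str1), len(str2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance_sketch(str1, str2) / longest


print(levenshtein_distance_sketch("kitten", "sitting"))  # 3
```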
Complexity: O(n²)
def damerau_levenshtein_distance(str1: str, str2: str) -> int
def damerau_levenshtein(str1: str, str2: str) -> float
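Damerau-Levenshtein extends Levenshtein with transpositions of adjacent characters. The source does not say which variant is implemented, so the sketch below assumes the restricted "optimal string alignment" variant; `osa_distance_sketch` is a hypothetical name.

```python
def osa_distance_sketch(str1: str, str2: str) -> int:
    rows, cols = len(str1) + 1, len(str2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if str1[i - 1] == str2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and str1[i - 1] == str2[j - 2]
                    and str1[i - 2] == str2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance_sketch("ab", "ba"))  # 1
```

In this restricted variant a substring cannot be edited more than once, so `osa_distance_sketch("ca", "abc")` is 3, whereas the unrestricted Damerau-Levenshtein distance would be 2.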
Complexity: O(n²)
def jaro(str1: str, str2: str) -> float
Complexity: O(n²)
def jaro_winkler(str1: str, str2: str, p: float = 0.1) -> float
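To show how the scaling factor p enters the result, here is a sketch of the textbook Jaro and Jaro-Winkler definitions. The function names are hypothetical, and the prefix cap of four characters is the conventional choice, assumed here rather than taken from the source.

```python
def jaro_sketch(str1: str, str2: str) -> float:
    if str1 == str2:
        return 1.0
    len1, len2 = len(str1), len(str2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough together.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(str1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and str2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if str1[i] != str2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler_sketch(str1: str, str2: str, p: float = 0.1) -> float:
    base = jaro_sketch(str1, str2)
    # Length of the common prefix, capped at 4 (assumed convention).
    prefix = 0
    for c1, c2 in zip(str1[:4], str2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)


print(round(jaro_sketch("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler_sketch("MARTHA", "MARHTA"), 4))  # 0.9611
```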
Complexity: O(n)
def jaccard(str1: str, str2: str) -> float
The set-based similarity algorithms combine each character with an index to mimic set element identity ({ (character, index) ∀ c ∈ S₁, S₂ }).
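One way to read the (character, index) construction is sketched below. It assumes that "index" means the occurrence count of the character within its string, so repeated characters stay distinct set elements; the source does not spell this out, and the names `indexed_elements` and `jaccard_sketch` are hypothetical.

```python
from collections import Counter


def indexed_elements(s: str) -> set:
    # Assumption: the index is the number of earlier occurrences of the
    # same character, e.g. "aab" -> {("a", 0), ("a", 1), ("b", 0)}.
    seen = Counter()
    elements = set()
    for ch in s:
        elements.add((ch, seen[ch]))
        seen[ch] += 1
    return elements


def jaccard_sketch(str1: str, str2: str) -> float:
    a, b = indexed_elements(str1), indexed_elements(str2)
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    # Jaccard index: size of the intersection over size of the union.
    return len(a & b) / len(a | b)


print(sorted(indexed_elements("aab")))  # [('a', 0), ('a', 1), ('b', 0)]
```

With this reading, `jaccard_sketch("aab", "ab")` shares two of three distinct elements and yields 2/3.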
Complexity: O(n)
def sorensen_dice(str1: str, str2: str) -> float
Complexity: O(n)
def szymkiewicz_simpson(str1: str, str2: str) -> float
The Szymkiewicz-Simpson coefficient is also known simply as the “overlap” coefficient.
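The Sørensen-Dice and overlap coefficients differ from Jaccard only in the denominator. The sketch below uses their standard definitions on (character, index) sets; the element construction (index as occurrence count) and all function names are assumptions made for illustration.

```python
from collections import Counter


def indexed_elements(s: str) -> set:
    # Assumed construction: pair each character with its occurrence count.
    seen = Counter()
    elements = set()
    for ch in s:
        elements.add((ch, seen[ch]))
        seen[ch] += 1
    return elements


def sorensen_dice_sketch(str1: str, str2: str) -> float:
    a, b = indexed_elements(str1), indexed_elements(str2)
    if not a and not b:
        return 1.0
    # Twice the shared elements over the total number of elements.
    return 2 * len(a & b) / (len(a) + len(b))


def overlap_sketch(str1: str, str2: str) -> float:
    a, b = indexed_elements(str1), indexed_elements(str2)
    if not a or not b:
        # Edge-case convention assumed here: 1.0 only if both are empty.
        return 1.0 if not a and not b else 0.0
    # Shared elements over the size of the smaller set.
    return len(a & b) / min(len(a), len(b))


print(sorensen_dice_sketch("aab", "ab"))  # 0.8
print(overlap_sketch("aab", "ab"))        # 1.0
```

Because the overlap coefficient divides by the smaller set, it reaches 1.0 whenever one string's elements are fully contained in the other's.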