Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

levenshtein bug #13823

Open
MyNameIsRoman opened this issue Mar 28, 2024 · 4 comments
Open

levenshtein bug #13823

MyNameIsRoman opened this issue Mar 28, 2024 · 4 comments

Comments

@MyNameIsRoman
Copy link

MyNameIsRoman commented Mar 28, 2024

Description

The following code:

<?php
echo levenshtein('aa', 'ab', 1, PHP_INT_MAX, 1);

Resulted in this output:

-9223372036854775807

But I expected this output instead:

2

Also levenshtein(str_repeat('a', 100000), str_repeat('b', 1000000), 1, PHP_INT_MAX - 500000, 1)

returning unexpected -9223372036854625811

PHP Version

any

Operating System

any 64-bit

@nielsdos
Copy link
Member

nielsdos commented Mar 28, 2024

This could be solved by changing the computation of optimal cost to use deltas. I'd have to check what the performance impact of that is, and if there is a better way though. I'll have a better look this evening.

@nielsdos
Copy link
Member

It's always going to be possible to choose costs that result in an overflow one way or another.
Using deltas helps in these example cases because these are fairly legit: you don't want to use a substitution, this is a legit concern.
However, using deltas has a noticeable performance penalty of about 15% in my tests for moderately long strings.
One pragmatic solution may be to adjust the costs behind the scenes, e.g. change the maximum to be lowest+second_lowest+1 (for positive costs) which would fix non-pathalogical cases.

@cmb69
Copy link
Member

cmb69 commented Jul 12, 2024

Hmm, I wonder if there are practical applications of high (absolute numbers of) cost values at all. I mean, usually(?) you'd pass very small numbers; so if we would restrict these values, that might not break any reasonable code using the levenshtein() function.

Another option might be to do the internal calculations with floats/doubles. That would be less precise (and probably slower), but at least you won't get completely wrong results.

If we want to avoid any breakage, we may instead choose to just document the issue.

Frankly, I see fee (if any) cases where levenshtein() can be used for proper texts since it is practically constrained to the 26 latin letters (even mixing lower and upper case letters wouldn't make much sense), and I don't know if there are any other applications. So it's probably not worth spending too much time on this issue (nonetheless I'd like to see it resolved in some way), since the reported bugs appear more like the results of failed attempts to find a security issue, than an issue stemming from a pratical application.

@nielsdos
Copy link
Member

nielsdos commented Jul 12, 2024

I made two possible suggestions on how to mitigate it here: #14809 (comment)

I also thought about using floats / doubles but didn't pursue that due to the precision loss.

If we want to avoid any breakage, we may instead choose to just document the issue.

That's a possibility too, not sure though.

One interesting use case which I have seen for levenshtein is sequence alignment cost for DNA or proteins.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants