-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unicode: fix reading buffer overflow in valid_utf8() #5531
Unicode: fix reading buffer overflow in valid_utf8() #5531
Conversation
Minor note:
|
Wow, that's an extensive test script you wrote. Maybe it should be made part of our unit tests invoked on
I guess you mean with code from prior to this PR's changes, to reproduce the original bug? |
Yes, ASAN gets triggered prior to the patch. And it does not fire after the patch. |
The script is slow. It takes 70 seconds using PyPy3 and 57 seconds after compilation with cython3 (without changes to the script: |
I think try even sparser checks for under 1 second. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a look at this and the few things I was worried about (such as 0xF4) was explained in the PR description, so I'm happy with it. Especially with that thorough test script! Also I can't see any reason it would end up slower (perhaps faster).
Nice, CI found problems with symbols referenced in Meanwhile a few words about the commit: I did not feel comfortable inserting a lot of code into In regular I think BTW I have idea that ASAN build:
In regular build, the test takes under 0.4 seconds. So I have 2 more changes in mind:
|
This part was wrong. Now I compile Guarding with |
I use Other way could be to add
|
And the last commit is to remove lookup from I don't insist on this change. But I have an idea for even more obscure modification:
As of Simpler bit trick is Some benchmark is needed to consider the tricks. Also I am not sure how it will play with future compiler optimizations/changes. So I am not going to commit them if they are not interesting/needed. Are they worth a benchmark? |
@AlekseyCherepanov Is this ready to merge? |
It is ready to be merged. I'd like to hear from @magnumripper about my changes to |
Ouch, I intended to remove the explanations from I guess |
I moved the comments to We could rewrite commits to squash initial misplacement of comments and incorrect changes to If there are no objections then this PR can be merged. |
Yes, if you like. You may also fix the typo in @magnumripper Please feel free to merge this when you feel it's ready, or let me know and I merge it. |
I'm fine with you merging it, squashed or not. |
cd3d694
to
63226fd
Compare
63226fd
to
588ab83
Compare
Oh, nice catch! Thanks! I fixed the typo. I merged fixes into respective commits. After pushing I noticed that I wrote too long message for one commit, so I rewrote it and pushed again. Then I rebased the first commit without changes, so all six commits would be listed in a row. (I am puzzled about the rebase: I finished my changes here. |
Each commit has two dates on it, see |
valid_utf8()
reads up to 2 bytes far behind the end of string because it checks bytes right-to-left. I switched order to check left-to-right. It means that innerswitch
should be split. Also I moved around a fewif
s:< 0xC2
is checked before mainswitch
, so theswitch
is for almost valid starting bytes,> 0xF4
was moved intocase 4
now because other cases did not need it (in both versions).diff looks terrible. I guess review of full code would be easier.
To validate my changes, I wrote a script in python to call the function. To call the function, I compiled
unicode.c
alone as a dynamic library and used python'sctypes
.fake.c
is needed to resolve unused symbols belonging to other files in john.The script: