Unicode: fix reading buffer overflow in valid_utf8() #5531

AlekseyCherepanov · 2024-09-08T16:15:58Z

valid_utf8() reads up to 2 bytes far behind the end of string because it checks bytes right-to-left. I switched order to check left-to-right. It means that inner switch should be split. Also I moved around a few ifs: < 0xC2 is checked before main switch, so the switch is for almost valid starting bytes, > 0xF4 was moved into case 4 now because other cases did not need it (in both versions).

diff looks terrible. I guess review of full code would be easier.

To validate my changes, I wrote a script in python to call the function. To call the function, I compiled unicode.c alone as a dynamic library and used python's ctypes. fake.c is needed to resolve unused symbols belonging to other files in john.

The script:

# code to structure allowed utf-8 chars, test john

from ctypes import *

'''
int options;
int real_error;
int john_MD4_Final;
int john_MD4_Update;
int john_MD4_Init;
''' # ^ contents of fake.c

# gcc -fPIC -shared -Wl,-no-undefined john/src/unicode.c fake.c -o u.so
john_valid_utf8 = CDLL("./u.so").valid_utf8
john_valid_utf8.restype = c_int
john_valid_utf8.argtypes = [ c_char_p ]

def check_john(r, b):
    if r == False:
        expected = 0
    elif len(b) == 1:
        expected = 1 # pure ascii
    else:
        expected = 2 # some non-ascii, we have only 1
    j = john_valid_utf8(b)
    assert j == expected, (j, expected, r, b)

# def check_john(r, b):
#     pass

def R(a, b):
    return list(range(a, b))

# Rough ranges
ascii    = R(0x0, 0x80)
trailing = R(0x80, 0xc0)
start2   = R(0xc0, 0xe0)
start3   = R(0xe0, 0xf0)
start4   = R(0xf0, 0x100)

not_trailing = ascii + start2 + start3 + start4
all_bytes = R(0x0, 0x100)

def is_valid(*args):
    b = bytes(args)
    r = None
    try:
        b.decode('utf-8')
        r = True
    except UnicodeDecodeError:
        r = False
    check_john(r, b)
    return r

def combine(args):
    r = [ 0, 0 ] # False and True are 0 and 1
    if len(args) == 1:
        for a in args[0]:
            r[is_valid(a)] += 1

    elif len(args) == 2:
        for a in args[0]:
            for b in args[1]:
                r[is_valid(a, b)] += 1

    elif len(args) == 3:
        for a in args[0]:
            for b in args[1]:
                for c in args[2]:
                    r[is_valid(a, b, c)] += 1

    elif len(args) == 4:
        for a in args[0]:
            for b in args[1]:
                for c in args[2]:
                    for d in args[3]:
                        r[is_valid(a, b, c, d)] += 1

    else:
        assert 0, 'wrong number of args'
    return r

def expect(count_false_and_true, *args):
    assert not (count_false_and_true[0] == -1 and count_false_and_true[1] == -1)
    for i in 0, 1:
        if count_false_and_true[i] == all:
            t = 1
            for a in args:
                t *= len(a)
            count_false_and_true[i] = t
    r = combine(args)
    assert r == count_false_and_true, (r, count_false_and_true)
    return r

# 1 byte
expect([ 0, len(ascii) ], ascii)
expect([ len(trailing), 0 ], trailing)
expect([ len(start2), 0 ], start2)
expect([ len(start3), 0 ], start3)
expect([ len(start4), 0 ], start4)

# 2 bytes starting with trailing
expect([ len(trailing) * len(all_bytes), 0 ], trailing, all_bytes)

# start2
# c0 and c1 are not in the range
expect([ 2 * len(trailing), 0 ], [ 0xc0, 0xc1 ], trailing)
expect([ 2 * len(trailing), (len(start2) - 2) * len(trailing) ], start2, trailing)

expect([ len(start2) * len(not_trailing), 0 ], start2, not_trailing)
expect([ len(start2) * len(trailing) ** 2, 0 ],
       start2, trailing, trailing)

# start3

# E0 80..9F X and ED A0..BF X are invalid
expect([ all, 0 ], [ 0xE0 ], R(0x80, 0xA0), all_bytes)
expect([ all, 0 ], [ 0xED ], R(0xA0, 0xC0), all_bytes)
r1 = expect([ all, 0 ], [ 0xE0 ], R(0x80, 0xA0), trailing)
r2 = expect([ all, 0 ], [ 0xED ], R(0xA0, 0xC0), trailing)
expect([ r1[0] + r2[0], len(start3) * len(trailing) ** 2 - r1[0] - r2[0] ],
       start3, trailing, trailing)

# start4

# F0 80..8F X Y and F4 90..BF X Y are invalid

expect([ 0, all ], [ 0xF0 ], R(0x90, 0xC0), trailing, trailing)
expect([ all, 0 ], [ 0xF0 ], R(0x80, 0x90), all_bytes, all_bytes)
r1 = expect([ all, 0 ], [ 0xF0 ], R(0x80, 0x90), trailing, trailing)

expect([ 0, all ], [ 0xF4 ], R(0x80, 0x90), trailing, trailing)
expect([ all, 0 ], [ 0xF4 ], R(0x90, 0xC0), all_bytes, all_bytes)
r2 = expect([ all, 0 ], [ 0xF4 ], R(0x90, 0xC0), trailing, trailing)

# F5..FF X Y Z are invalid
r3 = expect([ all, 0 ], R(0xF5, 0x100), trailing, trailing, trailing)

expect([ r1[0] + r2[0] + r3[0], len(start4) * len(trailing) ** 3 - r1[0] - r2[0] - r3[0] ],
       start4, trailing, trailing, trailing)

# expect([ len(start3) * len(all_bytes) * len(not_trailing), 0 ],
#        start3, all_bytes, not_trailing)
# expect([ len(start4) * len(all_bytes) ** 2 * len(not_trailing), 0 ],
#        start4, all_bytes, all_bytes, not_trailing)

AlekseyCherepanov · 2024-09-08T20:38:19Z

Minor note: gcc 12 expanded switch for E0,ED,F0,F4 into series of checks, so splitting it into 2 switch statements does not increase size. Actually size in disassembly became smaller because case 1 is not reachable anymore (previously invalid trailing bytes at the start could land into this branch, but I moved their check out from the switch; I count size in disassembly this way: objdump -d unicode.o | sed -ne '/<valid_utf8>:/,/^$/ p' | wc -l).

valid_utf8() is used in multiple places including loader.c. Let's overread line_buf:

$ perl -C0 -e 'print ":12345" x 170 . "a\xf1"' > t.pw && ../run/john t.pw
=================================================================
==1583178==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffc92e219f0 at pc 0x55c00d0dc209 bp 0x7ffc92e21480 sp 0x7ffc92e21478
READ of size 1 at 0x7ffc92e219f0 thread T0
    #0 0x55c00d0dc208 in valid_utf8 .../src/unicode.c:554
    #1 0x55c00d0647b2 in read_file .../src/loader.c:255
    #2 0x55c00d06c368 in ldr_load_pw_file .../src/loader.c:1189
    #3 0x55c00d05a667 in john_load .../src/john.c:1149
    #4 0x55c00d05a667 in john_init .../src/john.c:1598
    #5 0x55c00d05a667 in main .../src/john.c:2084
    #6 0x7fdf45646249 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #7 0x7fdf45646304 in __libc_start_main_impl ../csu/libc-start.c:360
    #8 0x55c00cc0f1f0 in _start (.../run/john+0x1ff1f0)

Address 0x7ffc92e219f0 is located in stack of thread T0 at offset 1280 in frame
    #0 0x55c00d063f1b in read_file .../src/loader.c:209

  This frame has 2 object(s):
    [48, 192) 'file_stat' (line 210)
    [256, 1280) 'line_buf' (line 212) <== Memory access at offset 1280 overflows this variable
[...]

solardiz · 2024-09-12T21:54:51Z

Wow, that's an extensive test script you wrote. Maybe it should be made part of our unit tests invoked on make check, but only when Python is available?

valid_utf8() is used in multiple places including loader.c. Let's overread line_buf:
$ perl -C0 -e 'print ":12345" x 170 . "a\xf1"' > t.pw && ../run/john t.pw

I guess you mean with code from prior to this PR's changes, to reproduce the original bug?

AlekseyCherepanov · 2024-09-12T22:51:32Z

Yes, ASAN gets triggered prior to the patch. And it does not fire after the patch.

AlekseyCherepanov · 2024-09-13T02:08:28Z

The script is slow. It takes 70 seconds using PyPy3 and 57 seconds after compilation with cython3 (without changes to the script: cython3 --embed -3 test_uni2.py && gcc -I/usr/include/python3.11/ test_uni2.c -lpython3.11). I rewrote it in C to make it faster and more structured. It takes 3.7 seconds (I use sparse checks for some ranges; 13 seconds without them, but I still check more than the script). I'll add a commit tomorrow.

solardiz · 2024-09-13T10:01:02Z

It takes 3.7 seconds (I use sparse checks for some ranges; 13 seconds without them, but I still check more than the script). I'll add a commit tomorrow.

I think try even sparser checks for under 1 second. Thanks!

magnumripper

I had a look at this and the few things I was worried about (such as 0xF4) was explained in the PR description, so I'm happy with it. Especially with that thorough test script! Also I can't see any reason it would end up slower (perhaps faster).

AlekseyCherepanov · 2024-09-15T21:33:00Z

Nice, CI found problems with symbols referenced in unicode.o I did not have locally. I can reproduce them: --without-openssl is required for ./configure for that. I'll fix it.

Meanwhile a few words about the commit:

I did not feel comfortable inserting a lot of code into unit-tests.c. But I needed functions to integrate with the report. So I included my file using #include. (Afterthought: _test_valid_utf8 could use a local counter for tests and return it, so it would be ok to compile it separately. OTOH I also use hex() from unit-tests.c.)

In regular make and Makefile.legacy, I compile with actual unicode.o instead of sources. Makefile.legacy required -lcrypto for MD4_Init and friends. I picked it from command in the regular build.

I think memory.c and common.c do not need separate compilation. _JOHN_MISC_NO_LOG is internal to misc.c. So there should not be actual deviations from regular compilation for these files and .o could be reused.

BTW I have idea that switch could be replaced with explicit checks with the same "fall-through" style. Then the code would not need the lookup having the same number of checks in assembly.

ASAN build:

  test 38 test_valid_utf8         -  Performed 49022799 tests 0.76 seconds used

In regular build, the test takes under 0.4 seconds.

So I have 2 more changes in mind:

reuse .o for memory.c and common.c
replace lookup+switch with ifs on values

AlekseyCherepanov · 2024-09-15T23:45:42Z

In regular make and Makefile.legacy, I compile with actual unicode.o instead of sources. Makefile.legacy required -lcrypto for MD4_Init and friends. I picked it from command in the regular build.

This part was wrong. Now I compile tests/unicode.o separately defining UNICODE_NO_OPTIONS and NOT_JOHN, so it does not use options and MD4_* and I don't need -lcrypto and/or fake symbols.

Guarding with #ifndef UNICODE_NO_OPTIONS was slightly incomplete. So one commit is dedicated to the fix.

AlekseyCherepanov · 2024-09-15T23:54:34Z

I use unicode.o as a dependency for tests/unicode.o as a shortcut to reuse all dependencies including unused options.h. I considered it to be a better solution than copy-pasting dependencies with or without modifications. unicode.o should be built anyway in usual places where unit-tests gets build.

Other way could be to add tests/unicode.o as target near unicode.o. Like this:

unicode.o tests/unicode.o: [...]

AlekseyCherepanov · 2024-09-16T00:49:16Z

And the last commit is to remove lookup from valid_utf8() rewriting switch with explicit ifs. The code has similar structure with "fall-through" semantics. (And it looks strange if I try to forget that it is a switch by nature.) Now default case falls into 4+ bytes and then gets filtered out in > 0xF4 check.

I don't insist on this change.

But I have an idea for even more obscure modification: switch for E0/ED could be replaced with the following bit tricks.

if (*source == (0xE0 | (0xED & -(a > 0x9F)))) return 0;

As of E0 | ED gives ED, it is possible to OR values to get ED if the check gives true, and E0 otherwise. Fortunately the check for E0 complements the check for ED. So we don't need to check twice. (To be honest, I cannot formulate a proof easily despite that I inferred this trick myself and checked against the tests.) Just replacing constants, similar check can be written for F0/F4.

Simpler bit trick is ((a = *++srcptr) & 0xC0) == 0x80 to check for trailing bytes. & 0xC0 picks 2 highest bits, == 0x80 checks for 10 there. It might reduce number of branching instructions.

Some benchmark is needed to consider the tricks. Also I am not sure how it will play with future compiler optimizations/changes. So I am not going to commit them if they are not interesting/needed. Are they worth a benchmark?

solardiz · 2024-09-17T15:59:51Z

@AlekseyCherepanov Is this ready to merge?
@magnumripper Would you like to take another look before we merge?

AlekseyCherepanov · 2024-09-17T17:13:44Z

It is ready to be merged. I'd like to hear from @magnumripper about my changes to unicode.h and readability of the final code with ifs. I don't have any understanding if my change to unicode.h is valid in relation to other code backing encodings. The change is based on local context only.

src/unicode.c

AlekseyCherepanov · 2024-09-21T19:58:10Z

Ouch, I intended to remove the explanations from test_valid_utf8.c but I did not include it into the commit.

I guess unicode.h might be the best place to keep the info. It follows wider pattern that things are explained near interface for caller. Initially I did not think about it reading code directly in unicode.c. But I had to read the comments in unicode.h to understand return value for use in the test script.

AlekseyCherepanov · 2024-09-21T21:07:04Z

I moved the comments to unicode.h. Also I added a bit of elaboration for return value. And I added a link about noncharecters (because I spent unexpected amount of time reading Wikipedia semi-randomly (e.g. about PUA) and had to find missing pieces to match my incomplete understanding with Python's behaviour).

We could rewrite commits to squash initial misplacement of comments and incorrect changes to Makefile.in together with respective fixes into nicer commits. Should I do that?

If there are no objections then this PR can be merged.

solardiz · 2024-09-22T14:12:01Z

We could rewrite commits to squash initial misplacement of comments and incorrect changes to Makefile.in together with respective fixes into nicer commits. Should I do that?

Yes, if you like. You may also fix the typo in noncharecters. Thank you!

@magnumripper Please feel free to merge this when you feel it's ready, or let me know and I merge it.

magnumripper · 2024-09-24T20:35:46Z

I'm fine with you merging it, squashed or not.

AlekseyCherepanov · 2024-09-25T06:30:58Z

typo in noncharecters

Oh, nice catch! Thanks!

I fixed the typo. I merged fixes into respective commits.

After pushing I noticed that I wrote too long message for one commit, so I rewrote it and pushed again. Then I rebased the first commit without changes, so all six commits would be listed in a row. (I am puzzled about the rebase: git log shows old date, github shows that change is recent and I don't see the date. Local hash and github's hash are the same.) I guess the last step was not needed because list of commits in a pull request can be viewed in the "commits" tab. Rebasing I could use the latest john, but I did not.

I finished my changes here.

solardiz · 2024-09-25T14:55:12Z

I am puzzled about the rebase: git log shows old date, github shows that change is recent

Each commit has two dates on it, see git log --pretty=fuller. If you want to update both when you amend a commit, use --reset-author. You don't need to do that here - I merely include this explanation because you wondered.

solardiz requested a review from magnumripper September 12, 2024 21:52

magnumripper approved these changes Sep 13, 2024

View reviewed changes

AlekseyCherepanov mentioned this pull request Sep 15, 2024

make's target ../run/dynacomptest is broken; also add it to CI? #5536

Open

magnumripper reviewed Sep 21, 2024

View reviewed changes

src/unicode.c Show resolved Hide resolved

AlekseyCherepanov force-pushed the bleeding-jumbo branch 2 times, most recently from cd3d694 to 63226fd Compare September 25, 2024 06:06

AlekseyCherepanov added 6 commits September 25, 2024 09:09

Unicode: fix reading buffer overflow in valid_utf8()

8dd9398

Unicode: fix incomplete conditional use of options

ab859c1

Unicode: elaborate about return value in comments for valid_utf8()

868f00a

Unicode: add detailed explanation into comments for valid_utf8()

20d01bb

Unicode: add unit tests for valid_utf8()

5174046

Unicode: remove lookup from valid_utf8()

588ab83

AlekseyCherepanov force-pushed the bleeding-jumbo branch from 63226fd to 588ab83 Compare September 25, 2024 06:09

solardiz merged commit 3e2282a into openwall:bleeding-jumbo Sep 25, 2024
31 of 32 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unicode: fix reading buffer overflow in valid_utf8() #5531

Unicode: fix reading buffer overflow in valid_utf8() #5531

AlekseyCherepanov commented Sep 8, 2024

AlekseyCherepanov commented Sep 8, 2024

solardiz commented Sep 12, 2024

AlekseyCherepanov commented Sep 12, 2024

AlekseyCherepanov commented Sep 13, 2024

solardiz commented Sep 13, 2024

magnumripper left a comment

AlekseyCherepanov commented Sep 15, 2024

AlekseyCherepanov commented Sep 15, 2024

AlekseyCherepanov commented Sep 15, 2024

AlekseyCherepanov commented Sep 16, 2024

solardiz commented Sep 17, 2024

AlekseyCherepanov commented Sep 17, 2024

AlekseyCherepanov commented Sep 21, 2024

AlekseyCherepanov commented Sep 21, 2024

solardiz commented Sep 22, 2024

magnumripper commented Sep 24, 2024

AlekseyCherepanov commented Sep 25, 2024

solardiz commented Sep 25, 2024

Unicode: fix reading buffer overflow in valid_utf8() #5531

Unicode: fix reading buffer overflow in valid_utf8() #5531

Conversation

AlekseyCherepanov commented Sep 8, 2024

AlekseyCherepanov commented Sep 8, 2024

solardiz commented Sep 12, 2024

AlekseyCherepanov commented Sep 12, 2024

AlekseyCherepanov commented Sep 13, 2024

solardiz commented Sep 13, 2024

magnumripper left a comment

Choose a reason for hiding this comment

AlekseyCherepanov commented Sep 15, 2024

AlekseyCherepanov commented Sep 15, 2024

AlekseyCherepanov commented Sep 15, 2024

AlekseyCherepanov commented Sep 16, 2024

solardiz commented Sep 17, 2024

AlekseyCherepanov commented Sep 17, 2024

AlekseyCherepanov commented Sep 21, 2024

AlekseyCherepanov commented Sep 21, 2024

solardiz commented Sep 22, 2024

magnumripper commented Sep 24, 2024

AlekseyCherepanov commented Sep 25, 2024

solardiz commented Sep 25, 2024