Unicode decoding and encoding bugs for codepoints greater than 0xFFFF #12

Open
DelleVelleD opened this issue Apr 16, 2022 · 0 comments

DelleVelleD commented Apr 16, 2022

MD_DecodeCodepointFromUtf16 miscalculates codepoints greater than 0xFFFF because it omits the 0x10000 offset when combining a surrogate pair.

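Adding 0x10000 after the two surrogate halves are combined should fix the issue. The combined value needs its own parentheses, because in C `+` binds tighter than `|`; without them the offset would be added to the low half only, and the result would be wrong whenever bit 16 of the shifted high half is already set:

if (1 < max && 0xD800 <= out[0] && out[0] < 0xDC00 && 0xDC00 <= out[1] && out[1] < 0xE000)
{
    // Combine the 10 payload bits of each surrogate, then apply the offset.
    result.codepoint = (((out[0] - 0xD800) << 10) | (out[1] - 0xDC00)) + 0x10000;
    result.advance = 2;
}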

Reference: Step 5 for Decoding UTF-16
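
To make the arithmetic concrete, here is a minimal stand-alone sketch of the surrogate-pair calculation. The helper names `decode_surrogate_pair` and `check_decode_surrogate_pair` are hypothetical and not part of the library; the sketch assumes the caller has already verified that `hi` is in 0xD800..0xDBFF and `lo` is in 0xDC00..0xDFFF.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): combines a valid
   high/low surrogate pair into one code point. */
static uint32_t decode_surrogate_pair(uint16_t hi, uint16_t lo)
{
    uint32_t high_bits = ((uint32_t)hi - 0xD800u) << 10; /* top 10 payload bits */
    uint32_t low_bits  =  (uint32_t)lo - 0xDC00u;        /* bottom 10 payload bits */
    return (high_bits | low_bits) + 0x10000u;            /* offset applied last */
}

/* Spot checks against well-known encodings. */
static void check_decode_surrogate_pair(void)
{
    assert(decode_surrogate_pair(0xD83D, 0xDE00) == 0x1F600u);  /* U+1F600 */
    assert(decode_surrogate_pair(0xD840, 0xDC00) == 0x20000u);  /* U+20000 */
    assert(decode_surrogate_pair(0xDBFF, 0xDFFF) == 0x10FFFFu); /* highest code point */
}
```

The U+20000 case is the useful one: the shifted high half is already 0x10000 there, so an expression that ORs the offset in rather than adding it after the combine would return 0x10000 instead of 0x20000.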


MD_Utf8FromCodepoint sets the first byte incorrectly when the codepoint requires four bytes because it left-shifts MD_bitmask4 by 3 rather than 4.
MD_bitmask4 is the value 0x0F (binary 1111). The first byte of a four-byte UTF-8 sequence must have the form 11110xxx, where the low 3 bits hold the top bits of the codepoint. Shifting 0x0F left by 3 gives 0x78 (binary 01111000), which is missing the leading 1 bit; shifting by 4 gives 0xF0 (binary 11110000), the correct prefix.

Shifting by 4 instead of 3 should fix the issue:

else if (codepoint <= 0x10FFFF)
{
    out[0] = (MD_bitmask4 << 4) | ((codepoint >> 18) & MD_bitmask3);
    out[1] = MD_bit8 | ((codepoint >> 12) & MD_bitmask6);
    out[2] = MD_bit8 | ((codepoint >>  6) & MD_bitmask6);
    out[3] = MD_bit8 | ( codepoint        & MD_bitmask6);
    advance = 4;
}
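
As a quick way to check the four-byte branch in isolation, the sketch below encodes a codepoint with literal constants in place of the library macros. It assumes MD_bit8 is 0x80, MD_bitmask6 is 0x3F, and MD_bitmask3 is 0x07 (values inferred from their names, not confirmed here); `encode_utf8_4byte` and `check_encode_utf8_4byte` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a library function).
   Assumes 0x10000 <= codepoint <= 0x10FFFF. */
static void encode_utf8_4byte(uint32_t codepoint, uint8_t out[4])
{
    out[0] = (uint8_t)(0xF0u | ((codepoint >> 18) & 0x07u)); /* 11110xxx */
    out[1] = (uint8_t)(0x80u | ((codepoint >> 12) & 0x3Fu)); /* 10xxxxxx */
    out[2] = (uint8_t)(0x80u | ((codepoint >>  6) & 0x3Fu)); /* 10xxxxxx */
    out[3] = (uint8_t)(0x80u | ( codepoint        & 0x3Fu)); /* 10xxxxxx */
}

/* Spot checks against well-known encodings. */
static void check_encode_utf8_4byte(void)
{
    uint8_t out[4];

    encode_utf8_4byte(0x1F600u, out);  /* U+1F600 is F0 9F 98 80 */
    assert(out[0] == 0xF0 && out[1] == 0x9F && out[2] == 0x98 && out[3] == 0x80);

    encode_utf8_4byte(0x10FFFFu, out); /* U+10FFFF is F4 8F BF BF */
    assert(out[0] == 0xF4 && out[1] == 0x8F && out[2] == 0xBF && out[3] == 0xBF);
}
```

With a shift of 3, the first byte for U+1F600 would come out as 0x78 rather than 0xF0, which a decoder would read as the ASCII character `x`.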

DelleVelleD changed the title from "MD_DecodeCodepointFromUtf16 incorrectly calculates codepoints greater than 0xFFFF" to "Unicode decoding and encoding bugs for codepoints greater than 0xFFFF" on Apr 17, 2022