Improve Unicode support #2794

matt335672 · 2023-09-18T17:19:08Z

Updated 2023-10-20: Ready for review

I've been looking into the way we handle Unicode in general.

We've been using g_mbstowcs() and g_wcstombs() to handle conversions between UTF-8 and both UTF-16 and UTF-32.

These calls suffer from a number of problems when used for this purpose:-

- The type wchar_t is not portable between platforms.
- The locale needs to be set for these functions to work correctly.
- On the UN*X platforms I've looked at, UTF-32 is supported by these functions and we assume UTF-16 is just the bottom bits of a UTF-32 character. That works fine for characters in the Unicode BMP but fails for other characters, such as emojis, which require surrogate pairs (see #2603, possibly #942).
- The functions are brittle: if a badly-formed or unrecognised character is found, the functions just return -1 with no chance of recovery (possibly #942).
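
To illustrate the surrogate pair problem, here is a minimal sketch of UTF-16 encoding for a single code point. The helper name `encode_utf16` is hypothetical and is not part of xrdp; it assumes the input is a valid code point (at most U+10FFFF and not itself a surrogate).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of xrdp: encode one Unicode code point
 * as UTF-16. Returns the number of 16-bit units written (1 or 2).
 * Assumes cp <= 0x10FFFF and that cp is not a surrogate value. */
static size_t
encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000)
    {
        /* BMP character: the UTF-16 unit equals the code point */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

For U+1F600 this gives the pair 0xD83D 0xDE00, whereas keeping only the bottom 16 bits gives 0xF600, which is an unrelated private-use code point.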

UTF-16 is used for Windows communication, and UTF-32 is used for other reasons, mostly related to font handling in the login window code.

This PR moves UTF-16 support into the Windows marshalling and unmarshalling code (common/parse.[hc]), and changes everything else to mostly use UTF-8, with UTF-32 conversions applied where needed.

New routines are provided to handle the conversions which are robust when presented with incorrect data. These routines are locale-independent. Extensive unit tests have been added for these.
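
As a rough illustration of what "robust when presented with incorrect data" can mean, the sketch below decodes UTF-8 to UTF-32 and substitutes U+FFFD for anything malformed instead of failing outright. The function names and the substitution policy are assumptions made for this example; they are not necessarily what the PR's routines do.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REPLACEMENT_CHAR 0xFFFDu

/* Hypothetical sketch: decode one code point from s[0..len).
 * Returns the number of bytes consumed (0 only when len is 0).
 * Malformed input stores U+FFFD in *cp and consumes one byte. */
static size_t
decode_utf8_char(const unsigned char *s, size_t len, uint32_t *cp)
{
    uint32_t c;
    uint32_t min;
    size_t need;
    size_t i;

    *cp = REPLACEMENT_CHAR;
    if (len == 0)
    {
        return 0;
    }
    c = s[0];
    if (c < 0x80)
    {
        *cp = c;                 /* plain ASCII */
        return 1;
    }
    if ((c & 0xE0) == 0xC0)      { need = 1; min = 0x80;    c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { need = 2; min = 0x800;   c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { need = 3; min = 0x10000; c &= 0x07; }
    else
    {
        return 1;                /* stray continuation or invalid lead byte */
    }
    if (len < need + 1)
    {
        return 1;                /* sequence truncated by end of input */
    }
    for (i = 1; i <= need; ++i)
    {
        if ((s[i] & 0xC0) != 0x80)
        {
            return 1;            /* expected a continuation byte */
        }
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    {
        return 1;                /* overlong, out of range, or surrogate */
    }
    *cp = c;
    return need + 1;
}

/* Hypothetical sketch: convert a NUL-terminated UTF-8 string to UTF-32.
 * Writes at most dst_len code points and returns the number written. */
static size_t
utf8_to_utf32(const char *src, uint32_t *dst, size_t dst_len)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t remaining = strlen(src);
    size_t n = 0;

    while (remaining > 0 && n < dst_len)
    {
        size_t used = decode_utf8_char(s, remaining, &dst[n++]);
        s += used;
        remaining -= used;
    }
    return n;
}
```

Because nothing here calls into the C library's multibyte functions, the result does not depend on the current locale, and a single bad byte costs one replacement character rather than the whole string.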

The types char32_t and char16_t are used for UTF-32 and UTF-16 characters respectively. These are back-ported from C11.

The calls g_mbstowcs() and g_wcstombs() and the type twchar are no longer required and have been removed. There is one remaining use of a bare mbstowcs() call in the genkeymap tool which I haven't touched as it seems to be the best way to achieve what is required.

These calls are now replaced with explicit UTF conversion routines in the common/string_calls.[hc] and common/parse.[hc] modules.

Also removed:-
- The support code in common/os_calls.c to set the locale to use these routines.
- The twchar type in arch.h

metalefty

Overall, LGTM. Let's fix it later if something is not working well.

matt335672 · 2023-11-01T10:44:55Z

Thanks @metalefty

@firewave is working on #2829 at the moment, which is getting there but will be quite a disruptive merge. It involves small mods to lots of files. Let's get that one in first so I don't mess up his workflow.

firewave · 2023-11-01T18:34:01Z

As this is ready to merge - put it in. I have no issues with rebasing/conflicts and the other PR still needs work. No need to postpone other (more substantial) changes because of it.

matt335672 · 2023-11-02T10:57:30Z

Thanks @firewave - I'll do that then. It might simplify the work I'm doing with smartcards.

matt335672 force-pushed the utf_changes_new branch 2 times, most recently from b3ac633 to 71fa683 Compare September 20, 2023 11:55

matt335672 force-pushed the utf_changes_new branch 2 times, most recently from 03401a2 to 7ac278c Compare September 28, 2023 12:04

matt335672 force-pushed the utf_changes_new branch 4 times, most recently from 015d7ae to b17ccae Compare October 17, 2023 08:24

matt335672 mentioned this pull request Oct 17, 2023

Support NetBsd and OpenIndiana #2811

Open

matt335672 added 2 commits October 18, 2023 10:07

Add UTF-8 / UTF-32 conversion routines

0463e55

These are intended to replace non-UTF-16 uses of mbstowcs() / wcstombs()

Add UTF-16 LE I/O routines

0758fe0

These are intended to replace UTF-16 uses of mbstowcs() / wcstombs()
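
For readers unfamiliar with UTF-16 LE on the wire, the following is a minimal sketch of reading one code point from a byte buffer. The name `read_utf16le_char` and the choice to map unpaired surrogates to U+FFFD are assumptions for illustration; the routines added in common/parse.[hc] may behave differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: read one code point from a UTF-16 LE byte buffer.
 * Returns the number of bytes consumed (0 if fewer than 2 are available).
 * An unpaired surrogate is reported as U+FFFD and consumes 2 bytes. */
static size_t
read_utf16le_char(const unsigned char *p, size_t len, uint32_t *cp)
{
    uint32_t hi;
    uint32_t lo;

    if (len < 2)
    {
        return 0;
    }
    hi = (uint32_t)p[0] | ((uint32_t)p[1] << 8);   /* low byte first */
    if (hi < 0xD800 || hi > 0xDFFF)
    {
        *cp = hi;                                  /* BMP character */
        return 2;
    }
    if (hi <= 0xDBFF && len >= 4)
    {
        lo = (uint32_t)p[2] | ((uint32_t)p[3] << 8);
        if (lo >= 0xDC00 && lo <= 0xDFFF)
        {
            /* Valid pair: combine the two 10-bit halves */
            *cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
            return 4;
        }
    }
    *cp = 0xFFFD;                                  /* unpaired surrogate */
    return 2;
}
```

Reading byte by byte, rather than casting the buffer to a 16-bit type, keeps the sketch independent of host endianness and alignment.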

matt335672 force-pushed the utf_changes_new branch 2 times, most recently from 8219c69 to 6671816 Compare October 20, 2023 09:42

matt335672 marked this pull request as ready for review October 20, 2023 15:40

matt335672 linked an issue Oct 20, 2023 that may be closed by this pull request

Clipboard question #942

Closed

matt335672 added 8 commits October 23, 2023 14:15

Remove mbstowcs/wcstombs from g_strtrim()

36ea4a3

Because of the way UTF-8 encoding works, there is no need to use mbstowcs/wcstombs in the implementation of this function.
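
The property relied on here is that every byte of a multi-byte UTF-8 sequence has its top bit set, so an ASCII whitespace byte can never be part of another character. A byte-wise trim is therefore safe. The sketch below shows the idea only; `trim_ascii_whitespace` is a hypothetical name, and its signature and set of trimmed characters are not those of the real g_strtrim().

```c
#include <stddef.h>
#include <string.h>

/* ASCII whitespace test; bytes >= 0x80 (UTF-8 lead and continuation
 * bytes) never match. */
static int
is_ascii_space(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Hypothetical sketch, not the actual g_strtrim(): strip leading and
 * trailing ASCII whitespace from a UTF-8 string in place. */
static char *
trim_ascii_whitespace(char *s)
{
    size_t start = 0;
    size_t end = strlen(s);

    while (s[start] != '\0' && is_ascii_space((unsigned char)s[start]))
    {
        ++start;
    }
    while (end > start && is_ascii_space((unsigned char)s[end - 1]))
    {
        --end;
    }
    memmove(s, s + start, end - start);   /* regions may overlap */
    s[end - start] = '\0';
    return s;
}
```

No wide-character conversion or locale setup is needed, and multi-byte characters inside the string pass through untouched.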

libxrdp: Replace mbstowcs/wcstombs calls

3a5b893

These calls are replaced with the newer UTF-16 parsing code within the parse module

Update xrdp font handling to use new UTF-8 calls

a50afc6

Update clipboard code to use new UTF-8 calls

f8e7fd4

Update drive redirection code to use new UTF-8 calls

8556f83

Update RAIL code to use new UTF-8 calls

1b286a0

Update smartcard code to use new UTF-8 calls

d722ffe

matt335672 force-pushed the utf_changes_new branch from 6671816 to f5f67e2 Compare October 23, 2023 13:20

metalefty approved these changes Nov 1, 2023

matt335672 merged commit 50cff2e into neutrinolabs:devel Nov 2, 2023
13 checks passed

matt335672 deleted the utf_changes_new branch November 2, 2023 10:57

matt335672 mentioned this pull request Nov 13, 2023

ERROR reading clientDir with macOS MS Remote Desktop.app with devel #2853

Closed

matt335672 mentioned this pull request Aug 6, 2024

utf-8 accented characters #1362

Closed
