Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Unicode support #2794

Merged
merged 10 commits into from
Nov 2, 2023
Merged

Conversation

matt335672
Copy link
Member

@matt335672 matt335672 commented Sep 18, 2023

Updated 2023-10-20: Ready for review

Fixes #2603 #942

I've been looking into the way we handle Unicode in general.

We've been using g_mbstowcs() and g_wcstombs() to handle conversions between UTF-8 and both UTF-16 and UTF-32.

These calls suffer from a number of problems when used for this purpose:-

  1. The type wchar_t is not portable between platforms.
  2. The Locale needs to be set for these functions to work correctly.
  3. On the UN*X platforms I've looked at, UTF-32 is supported by these functions and we assume UTF-16 is just the bottom bits of a UTF-32 character. That works fine for characters in the Unicode BMP but fails for other characters such as emojis which require surrogate pairs (see #2603, possibly #942)
  4. The functions are brittle - if a badly-formed or unrecognised character is found the functions just return -1 with no chance of recovery (possibly #942)

UTF-16 is used for Windows communication, and UTF-32 is used for other reasons, mostly related to font handling in the login window code.

This PR moves UTF-16 support into the Windows marshalling and unmarshalling code (common/parse.[hc]), and changes everything else to use UTF-8 mostly with UTF-32 conversions applied where needed.

New routines are provided to handle the conversions which are robust when presented with incorrect data. These routines are locale-independent. Extensive unit tests have been added for these.

The types char32_t and char16_t are used for UTF-32 and UTF-16 characters respectively. These are back-ported from C11.

The calls g_mbstowcs() and g_wcstombs() and the type twchar are no longer required and have been removed. There is one remaining use of a bare mbstowcs() call in the genkeymap tool which I haven't touched as it seems to be the best way to achieve what is required.

@matt335672 matt335672 force-pushed the utf_changes_new branch 2 times, most recently from b3ac633 to 71fa683 Compare September 20, 2023 11:55
@matt335672 matt335672 force-pushed the utf_changes_new branch 2 times, most recently from 03401a2 to 7ac278c Compare September 28, 2023 12:04
@matt335672 matt335672 force-pushed the utf_changes_new branch 4 times, most recently from 015d7ae to b17ccae Compare October 17, 2023 08:24
These are intended to replace non-UTF-16 uses of mbstowcs() / wcstombs()
These are intended to replace UTF-16 uses of mbstowcs() / wcstombs()
@matt335672 matt335672 force-pushed the utf_changes_new branch 2 times, most recently from 8219c69 to 6671816 Compare October 20, 2023 09:42
@matt335672 matt335672 marked this pull request as ready for review October 20, 2023 15:40
@matt335672 matt335672 linked an issue Oct 20, 2023 that may be closed by this pull request
Because of the way UTF-8 encoding works, there is no need to
use mbstowcs/wcstombs in the implementation of this function.
These calls are replaced with the newer UTF-16 parsing code
withing the parse module
These calls are now replaced with explicit UTF conversion routines in
the common/string_calls.[hc] and common/parse.[hc] modules.

Also removed:-
- The support code in common/os_calls.c to set the locale to use
  these routines.
- The twchar type in arch.h
Copy link
Member

@metalefty metalefty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, LGTM. Let's fit it later if something is not working well.

@matt335672
Copy link
Member Author

Thanks @metalefty

@firewave is working on #2829 at the moment which is getting there but will be quite a disruptive merge. It involves small mods to lots of files.. Let's get that one in first so I don't mess up his workflow.

@firewave
Copy link
Contributor

firewave commented Nov 1, 2023

As this is ready to merge - put it in. I have no issues with rebasing/conflicts and the other PR still needs work. No need to postpone other (more substantial) changes because of it.

@matt335672
Copy link
Member Author

Thanks @firewave - I'll do that then. It might simplify the work I'm doing with smartcards.

@matt335672 matt335672 merged commit 50cff2e into neutrinolabs:devel Nov 2, 2023
13 checks passed
@matt335672 matt335672 deleted the utf_changes_new branch November 2, 2023 10:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clipboard can't handle emojis pasted from Windows Clipboard question
3 participants