Unicode support #108

vtereshkov · 2021-07-18T00:12:35Z

Which Unicode encoding should we choose for chars and strings?

UTF-8
Pros: Backward-compatible with ASCII, no need to support both "narrow" and "wide" strings
Cons: Chars have variable width, ambiguous len, sizeof and indexing. Poor support on Windows

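UTF-16
Pros: Native for Windows. Fixed char width within the Basic Multilingual Plane
Cons: Unnatural for Linux. Incompatible with ASCII. Chars outside the Basic Multilingual Plane need surrogate pairs (two 16-bit units), so the width is not truly fixed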

UTF-32
Pros: Native for Linux. Fixed char width. Complete Unicode supported
Cons: Unnatural for Windows. Incompatible with ASCII
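
To make the width trade-offs above concrete, here is a minimal Python sketch (Python is used only for illustration; it is not part of Umka). It counts the bytes each encoding needs for a few sample characters. The helper name `encoded_sizes` and the choice of sample characters are assumptions made for this example.

```python
# Compare how many bytes each encoding needs for a single character.
# The "-le" codec variants are used so that no byte-order mark is prepended.
samples = ["A", "\u044f", "\u20ac", "\U0001fb00"]  # A, я, €, 🬀


def encoded_sizes(ch):
    """Return (utf8, utf16, utf32) byte counts for one character."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(ch.encode(enc)) for enc in encodings)


for ch in samples:
    u8, u16, u32 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8={u8} UTF-16={u16} UTF-32={u32}")
```

Only UTF-32 uses the same number of bytes for every sample. The last character lies outside the Basic Multilingual Plane, so it takes four bytes in UTF-16 as well.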


vtereshkov · 2021-07-18T21:58:18Z

We now have rudimentary support for UTF-8, since recent terminals and C runtime libraries on Windows 10 and Linux support C.UTF-8 or similar locale strings. The string length returned by len() is in bytes, not in characters. Go does the same, though it is inconvenient.

fn main() {
    s := "Привет" + ',' + " мир!"
    printf("Строка: " + s + ", длина: " + repr(len(s)) + '\n')
}

...:~/umka-lang/umka_linux$ ./umka -locale C.UTF-8 ../test.um
Строка: Привет, мир!, длина: 21

On Windows, this feature is available under MSVC, but not under MinGW (older runtime?). It seems that the MSVC runtime is also buggy: scanf() fails to read non-ASCII UTF-8.

On Linux everything works as expected.
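
The difference between the byte length and the character count of the test string can be checked outside Umka. The following is a Python analogue, not Umka code: encoding the string to UTF-8 gives the byte count that len() reports in the example above, while Python's own `len` on a `str` counts code points.

```python
# The test string from the Umka example above.
s = "Привет, мир!"

byte_len = len(s.encode("utf-8"))  # length in UTF-8 bytes
char_len = len(s)                  # length in Unicode code points

print(f"bytes: {byte_len}, characters: {char_len}")
```

Each Cyrillic letter takes two bytes in UTF-8, while the comma, the space and the exclamation mark take one byte each, which accounts for the gap between the two numbers.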

vtereshkov · 2021-07-19T00:26:21Z

@marekmaskarinec Please note the API change: umkaInit() now requires a locale argument, which can be NULL.

vtereshkov · 2021-07-19T09:20:21Z

Need to consider creating a module like utf8 in Go: https://pkg.go.dev/unicode/utf8

vtereshkov · 2021-07-19T12:09:57Z

@marekmaskarinec Do Umka's printf() and scanf() work correctly with non-ASCII UTF-8 strings on Void Linux? Everything is fine on Ubuntu 20, but not on Windows 10.

marekmaskarinec · 2021-07-23T07:25:15Z

This program:

fn main() {
    s := ""
    scanf("%s", &s)
    printf("%s\n", repr([]char(s)))
    printf("%s\n", s)
}

Produces this (input included):

🬀🬾
{ 0xFFFFFFF0 0xFFFFFF9F 0xFFFFFFAC 0xFFFFFF80 0xFFFFFFF0 0xFFFFFF9F 0xFFFFFFAC 0xFFFFFFBE 0x00 } 
🬀🬾

I did not touch the locale.

vtereshkov · 2021-07-23T09:11:37Z

@marekmaskarinec And what if you set -locale C.UTF-8?

marekmaskarinec · 2021-07-23T09:37:13Z

It doesn't seem to work.

[ tests ]$ umka -locale C.UTF-8 test.um
Error test.um (1, 1): Cannot set locale

I think the characters I used to test aren't UTF-8. Should I test with utf-8 characters?

marekmaskarinec · 2021-07-23T09:42:04Z

Here is a test with some Czech characters, which are UTF-8.

řášďéě
{ 0xFFFFFFC5 0xFFFFFF99 0xFFFFFFC3 0xFFFFFFA1 0xFFFFFFC5 0xFFFFFFA1 0xFFFFFFC4 0xFFFFFF8F 0xFFFFFFC3 0xFFFFFFA9 0xFFFFFFC4 0xFFFFFF9B 0x00 } 
řášďéě
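
Both dumps in this exchange can be checked for UTF-8 validity by hand. The sketch below assumes that the 0xFFFFFF.. prefix is just sign extension of 8-bit chars when they are printed (an assumption, not something confirmed in this thread), so it keeps only the low 8 bits of each value, drops the trailing 0x00 terminator, and decodes the rest as UTF-8. The function name `decode_dump` is made up for this example.

```python
def decode_dump(words):
    """Keep the low 8 bits of each dumped value, strip the NUL terminator,
    and decode the remaining bytes as UTF-8."""
    raw = bytes(w & 0xFF for w in words)
    return raw.rstrip(b"\x00").decode("utf-8")


# Values copied from the two repr() dumps in the thread.
first = [0xFFFFFFF0, 0xFFFFFF9F, 0xFFFFFFAC, 0xFFFFFF80,
         0xFFFFFFF0, 0xFFFFFF9F, 0xFFFFFFAC, 0xFFFFFFBE, 0x00]
czech = [0xFFFFFFC5, 0xFFFFFF99, 0xFFFFFFC3, 0xFFFFFFA1,
         0xFFFFFFC5, 0xFFFFFFA1, 0xFFFFFFC4, 0xFFFFFF8F,
         0xFFFFFFC3, 0xFFFFFFA9, 0xFFFFFFC4, 0xFFFFFF9B, 0x00]

print(decode_dump(first))
print(decode_dump(czech))
```

Both byte sequences decode without error: the first is two 4-byte sequences (U+1FB00 and U+1FB3E), the second is six 2-byte sequences. So both test inputs were valid UTF-8.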

vtereshkov · 2021-07-23T13:48:40Z

@marekmaskarinec Thank you. I doubt there are any characters in Unicode that cannot be encoded in UTF-8. And what does the Linux shell command locale -a print on your machine?

marekmaskarinec · 2021-07-24T06:22:01Z

[ ~ ]$ locale -a
C
POSIX
en_GB.utf8
en_US.utf8

vtereshkov · 2021-09-04T23:14:55Z

@marekmaskarinec When running utf8test.um on my Windows machine, I get

bytes: 9
characters: 4
▀: U+2580
€: U+20ac
$: U+24
¢: U+a2

whereas, according to expected.log, it should be

bytes: 6
characters: 2
▀: U+2580
€: U+20ac

I'm not sure that expected.log is correct.

Another problem is that when I print the output to the console rather than a file, the characters are interpreted as Windows-1251 instead of UTF-8:

bytes: 9
characters: 4
тЦА: U+2580
тВм: U+20ac
$: U+24
┬в: U+a2

But as I said in another place, this is probably a problem with the MinGW C runtime.
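
The garbled console output above is what appears when UTF-8 bytes are decoded with a legacy single-byte Cyrillic code page. A minimal Python sketch reproduces it. One observation, offered as a hint rather than a diagnosis: the specific glyphs shown above match code page 866 (the OEM console code page for Russian) rather than Windows-1251.

```python
# The four test characters: ▀ € $ ¢
text = "\u2580\u20ac$\u00a2"
utf8_bytes = text.encode("utf-8")

# Decode the same bytes with two legacy Cyrillic code pages.
as_cp866 = utf8_bytes.decode("cp866")
as_cp1251 = utf8_bytes.decode("cp1251")

print("UTF-8 bytes:", utf8_bytes.hex(" "))
print("as CP866:   ", as_cp866)
print("as CP1251:  ", as_cp1251)
```

If the console code page is the cause, switching the console to UTF-8 (code page 65001) would be the thing to try, although this thread does not confirm that it helps under MinGW.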

marekmaskarinec · 2021-09-05T09:00:34Z

> I'm not sure that expected.log is correct.

Yes. I added some additional characters, so expected.log is incorrect.

vtereshkov · 2021-09-05T21:17:30Z

@marekmaskarinec I have tested utf8.um on a Cyrillic string. The behavior seems to be incorrect:

string: ▀€$¢
bytes: 9
characters: 4
▀: U+2580
€: U+20ac
$: U+24
¢: U+a2

string: Привет, мир!
bytes: 21
characters: 12
ҟ: U+49f
?: U+4c0
Ҹ: U+4b8
Ҳ: U+4b2
ҵ: U+4b5
?: U+4c2
,: U+2c
 : U+20
Ҽ: U+4bc
Ҹ: U+4b8
?: U+4c0
!: U+21

A third-party UTF-8 encoder gives the following representation for "Привет, мир!":

\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\x2C\x20\xD0\xBC\xD0\xB8\xD1\x80\x21

vtereshkov · 2021-09-05T23:25:42Z

@marekmaskarinec Two other things to consider:

r^ < 0x7f etc. Shouldn't it be r^ <= 0x7f?
1 << 8. Shouldn't it be 1 << 7?

marekmaskarinec · 2021-09-06T10:09:49Z

I fixed those things, but with no effect. As far as I know, the problem is in getNextRune. Encoding works as intended.

Update: the problem might be with characters that have significant bits set to 1 in the first byte.

Update 2: it turns out it was a problem with the mask. I fixed it, and now all except two characters decode correctly.
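
For reference, here is a minimal UTF-8 decoder sketch in Python. It is not the code of getNextRune, and the names `decode_rune`, `decode_all` and `two_byte_unmasked` are made up for this example. It also illustrates one hypothesis that is consistent with the wrong Cyrillic code points reported earlier in the thread: every wrong value is exactly 0x80 too high, which is what happens when a continuation byte is OR-ed in without first being masked with 0x3F.

```python
def decode_rune(data, pos=0):
    """Decode one UTF-8 sequence at data[pos]; return (code_point, length).
    Sketch only: overlong forms and surrogates are not rejected."""
    lead = data[pos]
    if lead < 0x80:                      # 0xxxxxxx: ASCII
        return lead, 1
    if lead >> 5 == 0b110:               # 110xxxxx: 2-byte sequence
        length, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:            # 1110xxxx: 3-byte sequence
        length, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:           # 11110xxx: 4-byte sequence
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if pos + length > len(data):
        raise ValueError("truncated sequence")
    for b in data[pos + 1:pos + length]:
        if b >> 6 != 0b10:               # continuation bytes are 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)      # keep only the 6 payload bits
    return cp, length


def decode_all(data):
    """Decode a whole byte string into a list of code points."""
    out, pos = [], 0
    while pos < len(data):
        cp, length = decode_rune(data, pos)
        out.append(cp)
        pos += length
    return out


def two_byte_unmasked(b0, b1):
    """Hypothetical faulty variant: the continuation byte is not masked."""
    return ((b0 & 0x1F) << 6) | b1


# Bytes of the Cyrillic test string as given by the third-party encoder.
data = bytes.fromhex("D09F D180 D0B8 D0B2 D0B5 D182 2C 20 D0BC D0B8 D180 21")
print(" ".join(f"U+{cp:04X}" for cp in decode_all(data)))
print(f"unmasked D0 9F -> U+{two_byte_unmasked(0xD0, 0x9F):04X}")
```

With the mask in place, the first character decodes to U+041F. Without it, the same two bytes give U+049F, the value shown in the faulty output. The symbols in the first test string happen to decode the same either way, because bit 7 of their results is already set by the neighbouring bits, which may explain why that test did not expose the problem.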

vtereshkov · 2021-09-09T23:10:54Z

@marekmaskarinec Are you going to commit the changes? Or do you hope to first figure out what happened with the two remaining characters?

marekmaskarinec · 2021-09-10T12:10:25Z

The changes are currently in my fork, in the utf8 branch. I tried one of the letters that doesn't work, CYRILLIC CAPITAL LETTER ER. It generates 0x440, but the correct code point is 0x420. What I found is that the byte I was getting was d1, but it's supposed to be d0.
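
The expected bytes for this letter can be derived with a small encoder. The sketch below is in Python and is not the code from utf8.um; the name `encode_rune` is made up for this example. It uses the inclusive upper bounds 0x7F, 0x7FF, 0xFFFF and 0x10FFFF for the 1-, 2-, 3- and 4-byte forms, which is the same boundary question raised earlier in the thread (`<= 0x7f` rather than `< 0x7f`).

```python
def encode_rune(cp):
    """Encode one code point as UTF-8 bytes.
    Sketch only: surrogate code points (U+D800..U+DFFF) are not rejected."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# CYRILLIC CAPITAL LETTER ER (U+0420) and CYRILLIC SMALL LETTER ER (U+0440).
print(encode_rune(0x420).hex(" "))
print(encode_rune(0x440).hex(" "))
```

U+0420 encodes with lead byte d0 and U+0440 with lead byte d1, so a decoder that sees d1 where d0 is expected is reading the small letter's sequence, or the bytes differ from what was typed.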

skejeton · 2021-10-16T20:41:46Z

I'm for UTF-8, to be honest. It's either that or UTF-32, but given the poor support for UTF-32, I'd choose UTF-8, since UTF-16 can't represent all characters in 2 bytes anyway. The nature of UTF-8 makes it opt-in: you have an ASCII string, and if you want, you add a non-ASCII character. In that case it uses the 8th bit, so it doesn't conflict with ASCII.

vtereshkov · 2021-10-16T21:19:27Z

@ishdx2 Yes, this is what I chose myself, but I hoped for better support of UTF-8 by the C runtime and consoles across platforms. On Linux the support is very good; on Windows it is not. MinGW does not have UTF-8 locales at all, while MSVC supports them in printf(), but not in scanf(). This is weird.

skejeton · 2022-09-12T05:58:30Z

I'm afraid you have to use UTF-16 winapi functions

vtereshkov · 2024-02-24T13:20:53Z

UTF-8 is now supported by the utf8.um standard library module.

For Windows-specific console I/O problems, see #354.

vtereshkov added the enhancement New feature or request label Jul 18, 2021

vtereshkov added a commit that referenced this issue Jul 18, 2021

Support locale setting (#108)

916105f

vtereshkov added a commit that referenced this issue Jul 19, 2021

Move locale setting to libumka (#108)

2c94ef8

vtereshkov added a commit that referenced this issue Jul 19, 2021

Fix 3D cam example (#108)

64c29e8

marekmaskarinec mentioned this issue Sep 4, 2021

add utf8 library #112

Merged

vtereshkov added a commit that referenced this issue Sep 10, 2021

Fix UTF-8 rune encoding (#108)

12643cd

vtereshkov mentioned this issue Feb 24, 2024

scanf doesn't support UTF-8 on Windows #354

Open

vtereshkov closed this as completed Feb 24, 2024
