Unicode support #108
Comments
Now we have rudimentary support for UTF-8, as the latest terminals and C runtime libraries on Windows 10 and Linux support the
On Windows, this feature is available under MSVC, but not under MinGW (older runtime?). It seems that the MSVC runtime is also buggy: On Linux everything works as expected. |
@marekmaskarinec Please notice the API change: |
Need to consider creating a module like |
@marekmaskarinec Do Umka's |
This program:
fn main() {
    s := ""
    scanf("%s", &s)
    printf("%s\n", repr([]char(s)))
    printf("%s\n", s)
}
Produces this (input included):
I did not touch the locale. |
@marekmaskarinec And what if you set |
It doesn't seem to work.
I think the characters I used to test aren't UTF-8. Should I test with utf-8 characters? |
Here is a test with some Czech characters, which are UTF-8.
|
@marekmaskarinec Thank you. I doubt there are any characters in Unicode which are not UTF-8. And what does the Linux shell command |
|
@marekmaskarinec When running
whereas, according to
I'm not sure that Another problem is that when I print the output to the console rather than a file, the characters are interpreted as Windows-1251 instead of UTF-8:
But as I said in another place, this is probably a problem with the MinGW C runtime. |
Yes. I added some additional character so |
@marekmaskarinec I have tested
A third-party UTF-8 encoder gives the following representation for "Привет, мир!":
|
@marekmaskarinec Two other things to consider:
|
I fixed those things, but with no effect. As far as I know, the problem is in getNextRune. Encoding works as intended. Update: the problem might be with characters that have significant bits set to 1 in the first byte. Update 2: it turns out it was a problem with the mask. I fixed it and now all except two characters decode correctly. |
@marekmaskarinec Are you going to commit the changes? Or do you hope to first figure out what has happened with the two remaining characters? |
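For readers following the mask issue: the thread does not show the actual getNextRune code, so the following is only a minimal C sketch of how a UTF-8 decoder typically classifies the first byte with a mask and then folds in the continuation bytes. The name `decode_rune` and its signature are hypothetical and are not Umka's API. The sketch does not reject overlong encodings or surrogate code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Umka's getNextRune).
   Decodes one code point from the n bytes at s.
   Returns the number of bytes consumed, or 0 if the input is
   truncated or does not start with a valid lead byte. */
size_t decode_rune(const unsigned char *s, size_t n, uint32_t *out)
{
    if (n == 0)
        return 0;

    unsigned char b = s[0];
    size_t len;
    uint32_t cp;

    /* The lead byte's high bits give the sequence length;
       the mask keeps only its payload bits. */
    if (b < 0x80)                { len = 1; cp = b; }          /* 0xxxxxxx */
    else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }   /* 110xxxxx */
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }   /* 1110xxxx */
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }   /* 11110xxx */
    else
        return 0;                /* continuation byte or invalid lead byte */

    if (n < len)
        return 0;                /* truncated sequence */

    /* Each continuation byte must look like 10xxxxxx and adds 6 bits. */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *out = cp;
    return len;
}
```

Using the wrong payload mask for a lead byte (for example `0x3F` instead of `0x1F` on a two-byte sequence) leaves length bits in the result, which is one way a decoder can fail only on some characters. Whether that was the bug discussed here is not stated in the thread.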
The changes are currently in my fork in branch |
I'm for UTF-8, to be honest; either that or UTF-32. Given the poor support for UTF-32, I'd choose UTF-8, since UTF-16 can't represent all characters in 2 bytes anyway. The nature of UTF-8 makes it opt-in: you have a plain ASCII string until you add a non-ASCII character, and such characters use the 8th bit, so they never conflict with ASCII. |
@ishdx2 Yes, this is what I chose myself, but I hoped for better support of UTF-8 by the C runtime and consoles across platforms. On Linux the support is very good, on Windows it is not. MinGW does not have UTF-8 locales at all, while MSVC supports them in |
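The "8th bit" point can be illustrated with a small encoder. This is a sketch in C under stated assumptions, not code from Umka: the name `encode_rune` is hypothetical. It shows that code points below 0x80 come out as a single byte identical to ASCII, while every byte of a longer sequence has its high bit set.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes code point cp as UTF-8 into out.
   Returns the number of bytes written (1..4), or 0 if cp is a
   surrogate or lies beyond U+10FFFF. */
size_t encode_rune(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        /* ASCII range: one byte, high bit clear. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;            /* surrogates are not valid scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, `encode_rune(0x41, buf)` writes the single byte 0x41, and `encode_rune(0x041F, buf)` (Cyrillic "П") writes the two bytes 0xD0 0x9F.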
I'm afraid you have to use UTF-16 winapi functions |
UTF-8 is now supported by the For Windows-specific console I/O problems, see #354. |
What Unicode to choose for chars and strings?
UTF-8
Pros: Backward-compatible with ASCII, no need to support both "narrow" and "wide" strings
Cons: Chars have variable width, ambiguous len, sizeof and indexing. Poor support on Windows
UTF-16
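Pros: Native for Windows. Fixed char width (only for characters in the Basic Multilingual Plane)
Cons: Unnatural for Linux. Incompatible with ASCII. Not all Unicode chars can be represented in a single 16-bit unit (the rest need surrogate pairs)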
UTF-32
Pros: Native for Linux. Fixed char width. Complete Unicode supported
Cons: Unnatural for Windows. Incompatible with ASCII
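The "ambiguous len" drawback listed for UTF-8 can be made concrete with a short C sketch. The function name `utf8_count_code_points` is hypothetical, and the sketch assumes the input is valid, NUL-terminated UTF-8. It counts code points by skipping continuation bytes, so the result can be compared with the byte length from `strlen`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts code points in a valid,
   NUL-terminated UTF-8 string by counting every byte that is
   not a continuation byte (10xxxxxx). */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "Привет, мир!" used earlier in the thread, `strlen` reports 21 bytes while this function reports 12 code points, so a language has to decide which of the two its len returns and what indexing means.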