Why are runes implemented as u32 (utf32) and not [4]u8 (utf8)? #22461
-
This was copied from Go. One value is easier to work with than an array?
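A minimal sketch of that convenience in V (my own illustration, using what I believe are standard builtin methods): because a rune holds the codepoint as a u32, comparisons and arithmetic are plain integer operations.

```v
fn main() {
	r := `é`        // rune literal: a u32 holding the codepoint U+00E9
	println(u32(r)) // 233
	// Range checks and arithmetic are plain integer operations on the codepoint:
	println(u32(r) >= u32(`a`) && u32(r) <= u32(`z`)) // false
	println(rune(u32(r) + 1)) // ê (U+00EA)
}
```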
-
It would still be one value: 4 bytes of UTF-8 and a UTF-32 codepoint are both 32 bit, i.e.

```
union {
	value int
	bytes [4]u8
}
```

The difference is in the byte representation, due to endianness. An int that holds a UTF-32 codepoint has a defined value with a platform-specific byte representation, whilst an int that holds 4 UTF-8 bytes has a defined byte representation but a platform-specific value. In C this effect can be achieved by using a multi-character character constant (instead of converting to UTF-32):

```c
int utf8 = '🚀';                      // or '\xF0\x9F\x9A\x80'
int utf32 = string_utf32_code("🚀");  // or 0x1f680
```

So I'm not arguing against using one value, but about which data this value holds! Not needing to convert would make interactions between strings and runes cheaper (padding is still needed, though). I'll probably create a benchmark to check whether there is a real-world difference here. Since the source is Go, I will check whether I can find out why they did it this way.
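To illustrate the same point in V (a rough sketch of my own, reinterpreting a pointer instead of using a union): the reinterpreted value depends on the machine's endianness, while the codepoint does not.

```v
fn main() {
	bytes := [u8(0xf0), 0x9f, 0x9a, 0x80] // UTF-8 encoding of 🚀; this byte order is fixed
	// Reinterpreting the bytes as a u32 gives a platform-specific value:
	// 0x809a9ff0 on a little-endian machine, 0xf09f9a80 on a big-endian one.
	reinterpreted := unsafe { *(&u32(bytes.data)) }
	println(reinterpreted.hex())
	// The UTF-32 codepoint, by contrast, is the same value on every platform:
	codepoint := u32(`🚀`)
	println(codepoint.hex()) // 1f680
}
```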
-
After further investigating this, I did not find an explicit statement, only a few hints. The two representations under discussion are:

- a u32 storing the 1-4 UTF-8 bytes of a codepoint
- a u32 storing a UTF-32 codepoint

To make it easier to understand where V uses UTF-8 and where it uses UTF-32, I propose updating the docs!
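As a starting point for such a docs update, here is a small sketch of where the boundary sits, as far as I can tell: strings store UTF-8 bytes, runes store UTF-32 codepoints, and `.runes()` / `.str()` are the conversion points.

```v
fn main() {
	s := 'V★'           // string: stored as UTF-8 bytes
	println(s.len)       // 4 (1 byte for `V` + 3 bytes for `★`)
	println(s.bytes())   // [86, 226, 152, 133]
	rs := s.runes()      // decode: UTF-8 bytes -> UTF-32 codepoints
	println(rs.len)      // 2
	println(u32(rs[1]))  // 9733 (0x2605)
	println(rs[1].str()) // encode: UTF-32 codepoint -> UTF-8 string, prints ★
}
```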
-
I was wondering why runes are implemented as UTF-32 and not UTF-8:
I don't see the benefit of converting back and forth between UTF-8 and UTF-32.
Why not just use [4]u8 and store the data as is:
| character | rune as u32 (UTF-32 codepoint) | [4]u8 (UTF-8 bytes, as is) |
|-----------|--------------------------------|----------------------------|
| `a`       | 00000061                       | 61 00 00 00                |
| `1`       | 00000031                       | 31 00 00 00                |
| `©`       | 000000a9                       | c2 a9 00 00                |
| `★`       | 00002605                       | e2 98 85 00                |
| `🚀`      | 0001f680                       | f0 9f 9a 80                |
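For reference, a little sketch that prints both representations for the characters above (my own illustration, using `r.str().bytes()` to get the UTF-8 encoding):

```v
fn main() {
	for r in [`a`, `1`, `©`, `★`, `🚀`] {
		utf8 := r.str().bytes() // the 1-4 UTF-8 bytes, as they sit inside a V string
		println('${u32(r):08x}  ${utf8.hex()}')
	}
}
```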
I'm currently working on #22117 and investigating how things are implemented in V, and I'm looking forward to learning more about its design decisions!