Why are runes implemented as u32 (utf32) and not [4]u8 (utf8)? #22461
-
This was copied from Go. One value is easier to work with than an array?
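A minimal sketch of that convenience in V (my own illustration, using what I believe are standard builtin methods): because a rune holds the codepoint as a u32, comparisons and arithmetic are plain integer operations.

```v
fn main() {
	r := `é`        // rune literal: a u32 holding the codepoint U+00E9
	println(u32(r)) // 233
	// Range checks and arithmetic are plain integer operations on the codepoint:
	println(u32(r) >= u32(`a`) && u32(r) <= u32(`z`)) // false
	println(rune(u32(r) + 1)) // ê (U+00EA)
}
```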
-
It would still be one value: 4 bytes of UTF-8 and a UTF-32 codepoint are both 32 bit, i.e.

```
union {
	value int
	bytes [4]u8
}
```

The difference is in the byte representation, due to endianness. An int that holds a UTF-32 codepoint has a defined value with a platform-specific byte representation, whilst an int that holds 4 UTF-8 bytes has a defined byte representation but a platform-specific value. In C this effect can be achieved by using a multi-character character constant (instead of converting to UTF-32):

```c
int utf8 = '🚀';                      // or '\xF0\x9F\x9A\x80'
int utf32 = string_utf32_code("🚀");  // or 0x1f680
```

So I'm not arguing against using one value, but about which data this value holds! Not needing to convert would make interactions between strings and runes cheaper (padding is still needed, though). I'll probably create a benchmark to check whether there is a real-world difference here. Since the source is Go, I will check whether I can find out why they did it this way.
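To illustrate the same point in V (a rough sketch of my own, reinterpreting a pointer instead of using a union): the reinterpreted value depends on the machine's endianness, while the codepoint does not.

```v
fn main() {
	bytes := [u8(0xf0), 0x9f, 0x9a, 0x80] // UTF-8 encoding of 🚀; this byte order is fixed
	// Reinterpreting the bytes as a u32 gives a platform-specific value:
	// 0x809a9ff0 on a little-endian machine, 0xf09f9a80 on a big-endian one.
	reinterpreted := unsafe { *(&u32(bytes.data)) }
	println(reinterpreted.hex())
	// The UTF-32 codepoint, by contrast, is the same value on every platform:
	codepoint := u32(`🚀`)
	println(codepoint.hex()) // 1f680
}
```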
-
After further investigating this, I did not find an explicit statement, only a few hints. The two representations under discussion are:

- a u32 storing the 1-4 UTF-8 bytes of a codepoint
- a u32 storing a UTF-32 codepoint

To make it easier to understand where V uses UTF-8 and where it uses UTF-32, I propose updating the docs!
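As a starting point for such a docs update, here is a small sketch of where the boundary sits, as far as I can tell: strings store UTF-8 bytes, runes store UTF-32 codepoints, and `.runes()` / `.str()` are the conversion points.

```v
fn main() {
	s := 'V★'           // string: stored as UTF-8 bytes
	println(s.len)       // 4 (1 byte for `V` + 3 bytes for `★`)
	println(s.bytes())   // [86, 226, 152, 133]
	rs := s.runes()      // decode: UTF-8 bytes -> UTF-32 codepoints
	println(rs.len)      // 2
	println(u32(rs[1]))  // 9733 (0x2605)
	println(rs[1].str()) // encode: UTF-32 codepoint -> UTF-8 string, prints ★
}
```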
-
I was wondering why runes are implemented as UTF-32 and not UTF-8:
I don't see the benefit of converting back and forth between UTF-8 and UTF-32.
Why not just use [4]u8 and store the data as is:
| character | rune as u32 (UTF-32 codepoint) | [4]u8 (UTF-8 bytes, as is) |
|-----------|--------------------------------|----------------------------|
| `a`       | 00000061                       | 61 00 00 00                |
| `1`       | 00000031                       | 31 00 00 00                |
| `©`       | 000000a9                       | c2 a9 00 00                |
| `★`       | 00002605                       | e2 98 85 00                |
| `🚀`      | 0001f680                       | f0 9f 9a 80                |
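For reference, a little sketch that prints both representations for the characters above (my own illustration, using `r.str().bytes()` to get the UTF-8 encoding):

```v
fn main() {
	for r in [`a`, `1`, `©`, `★`, `🚀`] {
		utf8 := r.str().bytes() // the 1-4 UTF-8 bytes, as they sit inside a V string
		println('${u32(r):08x}  ${utf8.hex()}')
	}
}
```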
I'm currently working on #22117 and investigating how things are implemented in V, and I'm looking forward to learning more about its design decisions!