decode() on bytes should support UTF-16 #1788

sethhall · 2024-07-10T16:06:23Z

It looks like the current implementation of decode() only supports ASCII and UTF-8 when decoding into a string, and the library currently in use is strictly for UTF-8. To support anything with Windows roots, it would be nice to support UTF-16 as well.

I poked around for a few minutes and found a small library that might work for decoding UTF-16 into a string type: https://github.com/nemtrif/utfcpp

sethhall · 2024-08-31T13:58:22Z

This still needs to be done (the decode() method was clearly built with this in mind), but as a stopgap I have a UTF-16 string reader, which converts to UTF-8 internally, implemented natively in Spicy here: https://github.com/sethhall/spicy-parsers/blob/main/unicode/utf16.spicy

Ethanholtking · 2024-10-01T15:43:57Z

Hello, I'd like to help solve this issue, but I'm having difficulty finding where exactly the problem is. Could someone please help me locate where decode() is implemented?

bbannier · 2024-10-01T15:54:57Z

> Hello I'd like to help solve this issue, however, I'm having difficulty trying to find where exactly the problem is. Could someone please help me locate where the decode() is?

Implementing the runtime part would go roughly like the following:

Adding a UTF16 Charset value here:

spicy/hilti/runtime/include/types/bytes.h

Line 42 in 943dea8

HILTI_RT_ENUM(Charset, Undef, UTF8, ASCII);

Implementing handling of Charset::UTF16 in Bytes::decode here:

spicy/hilti/runtime/src/types/bytes.cc

Lines 105 to 132 in 943dea8

    
    std::string Bytes::decode(bytes::Charset cs, bytes::DecodeErrorStrategy errors) const {
        switch ( cs.value() ) {
            case bytes::Charset::UTF8:
                // Data is already in UTF-8, but let's validate it.
                return Bytes(str(), cs, errors).str();
            case bytes::Charset::ASCII: {
                std::string s;
                for ( auto c : str() ) {
                    if ( c >= 32 && c < 0x7f )
                        s += c;
                    else {
                        switch ( errors.value() ) {
                            case DecodeErrorStrategy::IGNORE: break;
                            case DecodeErrorStrategy::REPLACE: s += "?"; break;
                            case DecodeErrorStrategy::STRICT: throw RuntimeError("illegal ASCII character in string");
                        }
                    }
                }
                return s;
            }
            case bytes::Charset::Undef: throw RuntimeError("unknown character set for decoding");
        }
        cannot_be_reached();
    }
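As a rough illustration of what a UTF-16 branch would need to do, here is a minimal standalone sketch that converts UTF-16 bytes to a UTF-8 `std::string`. It does not use the HILTI runtime: `ErrorStrategy`, `ByteOrder`, `appendUtf8`, and `decodeUtf16` are hypothetical names, the error message merely mirrors the existing UTF-8 one, and the explicit byte-order parameter is an assumption, since the issue does not say whether the new charset should be little-endian, big-endian, or both.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical stand-ins for illustration; not the HILTI runtime types.
enum class ErrorStrategy { Ignore, Replace, Strict };
enum class ByteOrder { Little, Big };

// Appends the UTF-8 encoding of a Unicode scalar value to `out`.
static void appendUtf8(std::string& out, std::uint32_t cp) {
    if ( cp < 0x80 )
        out += static_cast<char>(cp);
    else if ( cp < 0x800 ) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if ( cp < 0x10000 ) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decodes raw UTF-16 bytes into UTF-8, handling surrogate pairs.
std::string decodeUtf16(std::string_view data, ByteOrder order, ErrorStrategy errors) {
    std::string out;

    auto onError = [&]() {
        switch ( errors ) {
            case ErrorStrategy::Ignore: break;
            case ErrorStrategy::Replace: out += "\xEF\xBF\xBD"; break; // U+FFFD
            case ErrorStrategy::Strict: throw std::runtime_error("illegal UTF16 sequence in string");
        }
    };

    // Reads the 16-bit code unit starting at byte offset `i`.
    auto unit = [&](std::size_t i) -> std::uint32_t {
        auto a = static_cast<unsigned char>(data[i]);
        auto b = static_cast<unsigned char>(data[i + 1]);
        return order == ByteOrder::Little ? (a | (b << 8)) : ((a << 8) | b);
    };

    std::size_t i = 0;
    while ( i + 1 < data.size() ) {
        std::uint32_t u = unit(i);
        i += 2;

        if ( u < 0xD800 || u > 0xDFFF )
            appendUtf8(out, u); // plain BMP code unit
        else if ( u <= 0xDBFF && i + 1 < data.size() ) {
            std::uint32_t lo = unit(i);
            if ( lo >= 0xDC00 && lo <= 0xDFFF ) {
                appendUtf8(out, 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
                i += 2;
            }
            else
                onError(); // unpaired high surrogate; `lo` is re-read on the next pass
        }
        else
            onError(); // lone low surrogate, or high surrogate at end of input
    }

    if ( i < data.size() )
        onError(); // trailing odd byte

    return out;
}
```

In the runtime itself, the equivalent logic would sit in a new `case` of the `switch` in `Bytes::decode`, throwing `RuntimeError` and using `DecodeErrorStrategy` rather than the stand-ins above. A library such as utfcpp could replace the hand-written conversion.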

The C++ unit test for Bytes::decode should also be updated here:

spicy/hilti/runtime/src/tests/bytes.cc

Lines 64 to 83 in 943dea8

    
    TEST_CASE("decode") {
        CHECK_EQ("123"_b.decode(bytes::Charset::ASCII), "123");
        CHECK_EQ("abc"_b.decode(bytes::Charset::ASCII), "abc");
        CHECK_EQ("abc"_b.decode(bytes::Charset::UTF8), "abc");
        CHECK_EQ("\xF0\x9F\x98\x85"_b.decode(bytes::Charset::UTF8), "\xF0\x9F\x98\x85");
        CHECK_EQ("\xF0\x9F\x98\x85"_b.decode(bytes::Charset::ASCII), "????");
        CHECK_EQ("€100"_b.decode(bytes::Charset::ASCII, bytes::DecodeErrorStrategy::REPLACE), "???100");
        CHECK_EQ("€100"_b.decode(bytes::Charset::ASCII, bytes::DecodeErrorStrategy::IGNORE), "100");
        CHECK_THROWS_WITH_AS("123ä4"_b.decode(bytes::Charset::ASCII, bytes::DecodeErrorStrategy::STRICT),
                             "illegal ASCII character in string", const RuntimeError&);
        CHECK_EQ("\xc3\x28"_b.decode(bytes::Charset::UTF8, bytes::DecodeErrorStrategy::REPLACE), "\ufffd(");
        CHECK_EQ("\xc3\x28"_b.decode(bytes::Charset::UTF8, bytes::DecodeErrorStrategy::IGNORE), "(");
        CHECK_THROWS_WITH_AS("\xc3\x28"_b.decode(bytes::Charset::UTF8, bytes::DecodeErrorStrategy::STRICT),
                             "illegal UTF8 sequence in string", const RuntimeError&);
        CHECK_THROWS_WITH_AS("123"_b.decode(bytes::Charset::Undef), "unknown character set for decoding",
                             const RuntimeError&);
    }
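New UTF-16 test cases need raw UTF-16 input bytes, which are tedious to write by hand because of embedded NUL bytes and surrogate pairs. One possible way to produce them, sketched here with hypothetical names (`ByteOrder`, `utf16Bytes`) that are not part of the runtime or its test suite, is to serialize a `char16_t` string literal into bytes:

```cpp
#include <string>
#include <string_view>

// Hypothetical helper type for illustration; not part of the HILTI runtime.
enum class ByteOrder { Little, Big };

// Serializes UTF-16 code units into raw bytes in the requested byte order.
std::string utf16Bytes(std::u16string_view units, ByteOrder order) {
    std::string out;
    out.reserve(units.size() * 2);

    for ( char16_t u : units ) {
        auto lo = static_cast<char>(u & 0xFF);
        auto hi = static_cast<char>((u >> 8) & 0xFF);

        if ( order == ByteOrder::Little ) {
            out += lo;
            out += hi;
        }
        else {
            out += hi;
            out += lo;
        }
    }

    return out;
}
```

For example, `utf16Bytes(u"\U0001F605", ByteOrder::Little)` yields the four bytes `3D D8 05 DE`, the surrogate pair for the same emoji that the existing UTF-8 checks encode as `\xF0\x9F\x98\x85`. Those bytes could then feed a `CHECK_EQ` against whatever UTF-16 `Charset` value ends up being added.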

To make this available in Spicy code, it needs to be added to both HILTI and Spicy:

Add it to HILTI here:

spicy/hilti/lib/hilti.hlt

Line 14 in 943dea8

public type Charset = enum { ASCII, UTF8 } &cxxname="hilti::rt::bytes::Charset";

There should also be a new test case here: https://github.com/zeek/spicy/blob/943dea8d284c3b6fd65426e6e22abce1669ceeb1/tests/hilti/types/bytes/decode.hlt

Add it to Spicy here:

spicy/spicy/lib/spicy.spicy

Lines 33 to 37 in 943dea8

    
    ## Specifies the character set for bytes encoding/decoding.
    public type Charset = enum {
        ASCII,
        UTF8
    } &cxxname="hilti::rt::bytes::Charset";

Adding a test case is not strictly needed since this just wraps HILTI functionality.

rsmmr · 2024-10-28T15:23:11Z

@Ethanholtking Did that help? Are you working on this?

sethhall added the "Enhancement" label (Improvement of existing functionality) · Jul 10, 2024

rsmmr added the "Good first issue" label (Good for newcomers) · Jul 22, 2024
