decode() on bytes should support UTF-16 #1788

Open
sethhall opened this issue Jul 10, 2024 · 4 comments
Labels
Enhancement Improvement of existing functionality Good first issue Good for newcomers

Comments

@sethhall
Member

It looks like the current implementation of decode() only supports ASCII and UTF-8, and the library currently in use is strictly for UTF-8. To support anything with Windows roots, it would be nice to support UTF-16 as well.

I poked around for a few minutes and found a small library that might work for decoding UTF-16 into a string type: https://github.com/nemtrif/utfcpp

@sethhall sethhall added the Enhancement Improvement of existing functionality label Jul 10, 2024
@rsmmr rsmmr added the Good first issue Good for newcomers label Jul 22, 2024
@sethhall
Member Author

This still needs to be done (the decode() method was clearly built with this in mind), but as a stopgap I have a UTF-16 string reader implemented natively in Spicy (it converts to UTF-8 internally) here: https://github.com/sethhall/spicy-parsers/blob/main/unicode/utf16.spicy

@Ethanholtking

Hello I'd like to help solve this issue, however, I'm having difficulty trying to find where exactly the problem is. Could someone please help me locate where the decode() is?

@bbannier
Member

bbannier commented Oct 1, 2024

> Hello I'd like to help solve this issue, however, I'm having difficulty trying to find where exactly the problem is. Could someone please help me locate where the decode() is?

Implementing the runtime part would go roughly like the following:

  1. Adding a UTF16 Charset value here:
    HILTI_RT_ENUM(Charset, Undef, UTF8, ASCII);
  2. Implementing handling of Charset::UTF16 in Bytes::decode here:
    std::string Bytes::decode(bytes::Charset cs, bytes::DecodeErrorStrategy errors) const {
        switch ( cs.value() ) {
            case bytes::Charset::UTF8:
                // Data is already in UTF-8, but let's validate it.
                return Bytes(str(), cs, errors).str();
            case bytes::Charset::ASCII: {
                std::string s;
                for ( auto c : str() ) {
                    if ( c >= 32 && c < 0x7f )
                        s += c;
                    else {
                        switch ( errors.value() ) {
                            case DecodeErrorStrategy::IGNORE: break;
                            case DecodeErrorStrategy::REPLACE: s += "?"; break;
                            case DecodeErrorStrategy::STRICT: throw RuntimeError("illegal ASCII character in string");
                        }
                    }
                }
                return s;
            }
            case bytes::Charset::Undef: throw RuntimeError("unknown character set for decoding");
        }
        cannot_be_reached();
    }
    The C++ unit test for Bytes::decode should also be updated here:
    TEST_CASE("decode") {
        CHECK_EQ("123"_b.decode(bytes::Charset::ASCII), "123");
        CHECK_EQ("abc"_b.decode(bytes::Charset::ASCII), "abc");
        CHECK_EQ("abc"_b.decode(bytes::Charset::UTF8), "abc");
        CHECK_EQ("\xF0\x9F\x98\x85"_b.decode(bytes::Charset::UTF8), "\xF0\x9F\x98\x85");
        CHECK_EQ("\xF0\x9F\x98\x85"_b.decode(bytes::Charset::ASCII), "????");
        CHECK_EQ("€100"_b.decode(bytes::Charset::ASCII, bytes::DecodeErrorStrategy::REPLACE), "???100");
        CHECK_EQ("€100"_b.decode(bytes::Charset::ASCII, bytes::DecodeErrorStrategy::IGNORE), "100");
        CHECK_THROWS_WITH_AS("123ä4"_b.decode(bytes::Charset::ASCII, bytes::DecodeErrorStrategy::STRICT),
                             "illegal ASCII character in string", const RuntimeError&);
        CHECK_EQ("\xc3\x28"_b.decode(bytes::Charset::UTF8, bytes::DecodeErrorStrategy::REPLACE), "\ufffd(");
        CHECK_EQ("\xc3\x28"_b.decode(bytes::Charset::UTF8, bytes::DecodeErrorStrategy::IGNORE), "(");
        CHECK_THROWS_WITH_AS("\xc3\x28"_b.decode(bytes::Charset::UTF8, bytes::DecodeErrorStrategy::STRICT),
                             "illegal UTF8 sequence in string", const RuntimeError&);
        CHECK_THROWS_WITH_AS("123"_b.decode(bytes::Charset::Undef), "unknown character set for decoding",
                             const RuntimeError&);
    }
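
For anyone picking this up, here is a minimal standalone sketch of what a UTF-16 branch could do. It is not HILTI runtime code: `ErrorStrategy`, `appendUtf8`, and `decodeUtf16LE` are hypothetical names, the input is assumed to be little-endian without a BOM (the common case for Windows data), and the error message is only modeled on the existing "illegal UTF8 sequence in string" text. The three strategies are meant to mirror IGNORE, REPLACE, and STRICT from the snippet above.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical stand-in for hilti::rt::bytes::DecodeErrorStrategy.
enum class ErrorStrategy { Ignore, Replace, Strict };

// Append one Unicode code point to `out`, encoded as UTF-8.
void appendUtf8(std::string& out, std::uint32_t cp) {
    if ( cp < 0x80 )
        out += static_cast<char>(cp);
    else if ( cp < 0x800 ) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if ( cp < 0x10000 ) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decode little-endian UTF-16 (assumption: no BOM handling) into UTF-8.
std::string decodeUtf16LE(std::string_view data, ErrorStrategy errors = ErrorStrategy::Replace) {
    std::string out;

    // Apply the error strategy to one malformed unit.
    auto invalid = [&]() {
        switch ( errors ) {
            case ErrorStrategy::Ignore: break;
            case ErrorStrategy::Replace: appendUtf8(out, 0xFFFD); break;
            case ErrorStrategy::Strict: throw std::runtime_error("illegal UTF16 sequence in string");
        }
    };

    // Read the 16-bit code unit starting at byte offset `i`.
    auto unitAt = [&](std::size_t i) -> std::uint32_t {
        return static_cast<unsigned char>(data[i]) | (static_cast<unsigned char>(data[i + 1]) << 8);
    };

    std::size_t i = 0;
    while ( i + 1 < data.size() ) {
        std::uint32_t u = unitAt(i);
        i += 2;

        if ( u >= 0xD800 && u <= 0xDBFF ) { // high surrogate, needs a low surrogate next
            if ( i + 1 < data.size() ) {
                std::uint32_t lo = unitAt(i);
                if ( lo >= 0xDC00 && lo <= 0xDFFF ) {
                    i += 2;
                    appendUtf8(out, 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
                    continue;
                }
            }
            invalid(); // unpaired high surrogate; the following unit is decoded on its own
        }
        else if ( u >= 0xDC00 && u <= 0xDFFF )
            invalid(); // low surrogate without a preceding high surrogate
        else
            appendUtf8(out, u);
    }

    if ( i < data.size() )
        invalid(); // odd trailing byte

    return out;
}
```

In the runtime, logic along these lines would presumably sit behind a new `case bytes::Charset::UTF16:` in `Bytes::decode`, with matching checks added to the "decode" test case. Whether to hand-roll the conversion or pull in a library such as utfcpp, and whether big-endian input needs its own Charset value, are open design questions.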

To make this available in Spicy code, it needs to be added to both HILTI and Spicy:

  1. Add it to HILTI here:
    public type Charset = enum { ASCII, UTF8 } &cxxname="hilti::rt::bytes::Charset";
    There should also be a new test case here: https://github.com/zeek/spicy/blob/943dea8d284c3b6fd65426e6e22abce1669ceeb1/tests/hilti/types/bytes/decode.hlt
  2. Add it to Spicy here:
    ## Specifies the character set for bytes encoding/decoding.
    public type Charset = enum {
        ASCII,
        UTF8
    } &cxxname="hilti::rt::bytes::Charset";
    Adding a test case is not strictly needed since this just wraps HILTI functionality.

@rsmmr
Member

rsmmr commented Oct 28, 2024

@Ethanholtking Did that help? Are you working on this?


4 participants