Add new API function utf8_to_uv() #22541

Open
wants to merge 14 commits into
base: blead

Conversation

khwilliamson
Contributor

This is designed to replace the problematic utf8_to_uvchr(). Its behavior varies depending on whether warnings are enabled, and no code in core actually takes that into account.

If warnings are enabled:

A zero return can mean either success or failure

    Hence a zero return must be disambiguated.  Success would come
    from the next character being a NUL.

If failure, <retlen> will be -1, so can't be used to find where to
start parsing again.

If disabled:

Both the return and <retlen> will be usable values, but the return
of the REPLACEMENT CHARACTER is ambiguous.  It could mean failure,
or it could mean that that was the next character in the input and
was successfully decoded.
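
The two ambiguities above can be modelled with a minimal sketch. Everything here is hypothetical: `sketch_old_style` is not a Perl function and only imitates the old return convention as described in this comment. To stay short it assumes that the only valid inputs are ASCII bytes and the three-byte encoding of U+FFFD; every other byte is treated as malformed, which real UTF-8 decoding does not do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu

/* Hypothetical model of the OLD return convention only (not Perl's code). */
static uint32_t sketch_old_style(const uint8_t *s, const uint8_t *e,
                                 size_t *retlen, bool warnings_enabled)
{
    size_t avail = (s < e) ? (size_t)(e - s) : 0;
    uint32_t ret;
    size_t len;

    if (avail >= 1 && s[0] < 0x80) {                 /* ASCII, including NUL */
        ret = s[0]; len = 1;
    } else if (avail >= 3 && s[0] == 0xEF && s[1] == 0xBF && s[2] == 0xBD) {
        ret = SKETCH_REPLACEMENT; len = 3;           /* genuine U+FFFD */
    } else if (warnings_enabled) {                   /* failure, warnings on */
        ret = 0; len = (size_t) -1;
    } else {                                         /* failure, warnings off */
        ret = SKETCH_REPLACEMENT; len = avail ? 1 : 0;
    }
    if (retlen) *retlen = len;
    return ret;
}

static void sketch_old_demo(void)
{
    const uint8_t nul[] = { 0x00 }, bad[] = { 0xFF };
    const uint8_t rep[] = { 0xEF, 0xBF, 0xBD };
    size_t len_a, len_b;

    /* Warnings enabled: both calls return 0; only retlen tells them apart,
     * and the failing retlen cannot be used to resume parsing. */
    assert(sketch_old_style(nul, nul + 1, &len_a, true) == 0);
    assert(sketch_old_style(bad, bad + 1, &len_b, true) == 0);
    assert(len_a == 1 && len_b == (size_t) -1);

    /* Warnings disabled: both calls return U+FFFD; the return value alone
     * cannot say whether the input was well-formed. */
    assert(sketch_old_style(rep, rep + 3, &len_a, false) == SKETCH_REPLACEMENT);
    assert(sketch_old_style(bad, bad + 1, &len_b, false) == SKETCH_REPLACEMENT);
    assert(len_a == 3 && len_b == 1);
}
```

Calling `sketch_old_demo()` from a `main()` runs the checks. The point is only that the caller needs extra state, or a second look at the input, to interpret the return value.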

utf8_to_uv() solves these. This commit includes a few changes to use it, to show it works. I have WIP that changes the rest of core to use it. I found that it makes coding simpler.

The new function returns true upon success and false on failure. It is passed pointers into which it stores the computed code point and byte length. These values always contain the correct information, regardless of whether the input is malformed.

It is easy to test for failure in a conditional and then to take appropriate action. However, most often it seems the appropriate action is to use, going forward, the REPLACEMENT CHARACTER returned in failure cases.

And if you don't care particularly if it succeeds or not, you just use it without testing the result. This happens when you are confident that the input is well-formed, or say in converting a string for display.
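
As a rough illustration of that calling convention, here is a self-contained sketch. `sketch_utf8_to_uv` and `sketch_scan` are hypothetical stand-ins, not the functions in this pull request: the decoder handles only standard UTF-8 up to U+10FFFF (no Perl-extended sequences, no EBCDIC, no warnings), and how many bytes it reports on failure is an assumption of this sketch rather than a statement about the real function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu

/* Hypothetical stand-in: returns success, always fills *cp and *advance. */
static bool sketch_utf8_to_uv(const uint8_t *s, const uint8_t *e,
                              uint32_t *cp, size_t *advance)
{
    /* Smallest code point that legitimately needs 1..4 bytes. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t avail = (s < e) ? (size_t)(e - s) : 0;
    size_t len = 0, used = 0;
    uint32_t v = 0;
    bool ok = false;

    if (avail > 0) {
        used = 1;
        if (s[0] < 0x80)                { len = 1; v = s[0]; }
        else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
        else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
        else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
        ok = (len != 0);            /* len 0: stray continuation or bad lead */
        while (ok && used < len) {
            if (used >= avail || (s[used] & 0xC0) != 0x80) {
                ok = false;         /* truncated or not a continuation byte */
            } else {
                v = (v << 6) | (uint32_t)(s[used] & 0x3F);
                used++;
            }
        }
        if (ok && (v < min_for_len[len] || v > 0x10FFFF))
            ok = false;             /* overlong or out of range */
    }

    *cp = ok ? v : SKETCH_REPLACEMENT;
    if (advance) *advance = used;   /* at least 1 whenever s < e */
    return ok;
}

/* Walk a buffer, counting characters and malformations, never stalling. */
static void sketch_scan(const uint8_t *s, const uint8_t *e,
                        size_t *chars, size_t *bad)
{
    *chars = *bad = 0;
    while (s < e) {
        uint32_t cp;
        size_t advance;
        if (!sketch_utf8_to_uv(s, e, &cp, &advance))
            (*bad)++;               /* cp already holds U+FFFD here */
        (*chars)++;
        s += advance;
    }
}
```

A caller that ignores the return value simply proceeds with U+FFFD; a caller that tests it can tell a genuine U+FFFD in the input (true) from a synthesized one (false).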

There is another function utf8_to_uv_flags() which merely extends this API for more flexible use, and doesn't offer the advantages over the existing API function that does the same thing. I included it because the main function is just a small wrapper around it, and the API is similar and some may prefer it.

@karenetheridge
Member

karenetheridge commented Aug 29, 2024

After this is in, is it possible for Devel::PPPort to generate a warning when utf8_to_uvchr is used, with a reference to its replacement?

(I'm not suggesting that you do this Tony, just wondering if we can and should)

@khwilliamson
Contributor Author

It is easy in ppport.h to either output a hint or a warning about using any particular function. It's just text. Here's an example:

*** WARNING: my_sprintf
*** It's safer to use my_snprintf instead


@khwilliamson
Contributor Author

The name I used for this function was the first name used for this functionality, being removed in 5.7. A succession of names has followed, as each previous incarnation was found to be deficient for one reason or another. uvchr apparently came about as a result of supporting non-ASCII systems, and I presume it was chosen to somehow note the fluidity of the underlying character set. I thought that it had been long enough since utf8_to_uv had been in use that it could safely be reused. But in the meantime I came up with an alternative name, utf8_to_cp. (We would want to make an inverse, cp_to_utf8.) I think cp is in common enough usage these days that the name would be self-explanatory to modern programmers.

But I don't know. I'm opening this up for discussion

@tonycoz
Contributor

tonycoz commented Sep 4, 2024

If it returns a native code point rather than a Unicode code point I'd be inclined to avoid "cp".

We've generally used "uvchr" for this but utf8_to_uvchr() is what we're trying to replace.

Maybe utf8_decode_uvchr(), but that has its own potential for confusion.

Maybe utf8_to_uvchr4() to emphasize that this is an alternative to utf8_to_uvchr().

@khwilliamson
Contributor Author

I'm then inclined to go with utf8_to_uv

utf8.c Outdated
Comment on lines 1029 to 1010
=for apidoc utf8_to_uv
=for apidoc_item strict_utf8_to_uv
=for apidoc_item c9strict_utf8_to_uv
=for apidoc_item valid_utf8_to_uvchr
=for apidoc_item utf8_to_uvchr_buf
=for apidoc_item utf8_to_uvchr
Contributor

One problem with documenting these in the same place is I need to read (or at least skim) the documentation for the hard to use correctly APIs to use the easier to use correctly APIs.

Contributor Author

I don't understand this; when I look at the resultant pod, except for the names, the newer ones are first in the group. In order to see the calling prototypes you have to skip over the old names, but the prototypes are visually distinctive enough that that's easy to do.

Contributor

I'm sure it probably doesn't matter to the apidoc parser this is for, but I just want to note that since there are no newlines in between these lines, this constitutes a single =for directive which contains additional lines that happen to start with =for.

Contributor

It's not the list of names that I'm talking about, but the descriptive text needs to cover both sets of functions.

Contributor Author

I still don't understand your objection. Here is the entire section. The favored functions are listed first; if you want to skip reading the disfavored ones to go look at the signatures, my claim is that the latter is visually distinctive enough that skimming is not required. I just added a sentence to make it stand out a bit more. That could be expanded.

But I do think it is important that these be in the same section. There are several paragraphs at the beginning that apply to all types, and those shouldn't be repeated. Also it's easier to contrast the types when in the same section

"utf8_to_uv"
"extended_utf8_to_uv"
"strict_utf8_to_uv"
"c9strict_utf8_to_uv"
"valid_utf8_to_uvchr"
"utf8_to_uvchr_buf"
"utf8_to_uvchr"
    "DEPRECATED!" It is planned to remove "utf8_to_uvchr" from a future
    release of Perl. Do not use it for new code; remove it from existing
    code.

    These functions each translate from UTF-8 to UTF-32 (or UTF-64 on 64
    bit platforms). In other words, to a code point ordinal value. (On
    EBCDIC platforms, the initial encoding is UTF-EBCDIC, and the output
    is a native code point).

    For example, the string "A" would be converted to the number 65 on
    an ASCII platform, and to 193 on an EBCDIC one. Converting the
    string "ABC" would yield the same results, as the functions stop
    after the first character converted. Converting the string "\N{LATIN
    CAPITAL LETTER A WITH MACRON} plus anything more in the string"
    would yield the number 0x100 on both types of platforms, since the
    first character is U+0100.

    The functions whose names contain "to_uvchr" are older than the
    functions whose names don't have "chr" in them. The API in the older
    functions is harder to use correctly, and so they are kept only for
    backwards compatibility, and may eventually become deprecated. If
    you are writing a module and use Devel::PPPort, your code can use
    the new functions back to at least Perl v5.7.1.
    ("valid_utf8_to_uvchr" is the exception to this name rule; its API
    is not problematic, and it is in no danger of becoming deprecated.
    But it is highly specialized so should rarely occur in actual code.)

    All the functions accept, without complaint, well-formed UTF-8 for
    any non-problematic Unicode code point 0 .. 0x10FFFF. There are two
    types of Unicode problematic code points: surrogate characters and
    non-character code points. (See perlunicode.) Some of the functions
    reject one or both of these. Private use characters and those code
    points yet to be assigned to a particular character are never
    considered problematic. Additionally, most of the functions accept
    non-Unicode code points, those starting at 0x110000.

    "utf8_to_uv" forms
        Almost all code should use only "utf8_to_uv",
        "extended_utf8_to_uv", "strict_utf8_to_uv", or
        "c9strict_utf8_to_uv". The other functions are either the
        problematic old form, or are for highly specialized uses.

        These four functions each return "true" if the sequence of bytes
        starting at "s" form a complete, legal UTF-8 (or UTF-EBCDIC)
        sequence for a code point. If so, *cp will be set to the native
        code point value it represents, and *advance will be set to its
        length, in bytes.

        Otherwise, each function returns "false" and sets *cp to the
        Unicode REPLACEMENT CHARACTER, and *advance to the next position
        along "s", where the next possible UTF-8 character could begin.

        The functions only examine as many bytes along "s" as are needed
        to form a complete UTF-8 representation of a single code point.
        Under no circumstances do they examine any byte beyond "e - 1",
        failing if the code point requires more than "e - s" bytes to
        represent.

        The functions differ only in what flavor of UTF-8 they accept.
        All reject syntactically invalid UTF-8. "strict_utf8_to_uv"
        additionally rejects any UTF-8 that translates into a code point
        that isn't specified by Unicode to be freely exchangeable,
        namely the surrogate characters and non-character code points.
        "c9strict_utf8_to_uv" instead uses the exchangeable definition
        given by Unicode's Corrigendum #9, which rejects only
        surrogates. "extended_utf8_to_uv" accepts all syntactically
        valid UTF-8, as extended by Perl to allow 64-bit code points to
        be encoded.

        "utf8_to_uv" is merely a synonym of "extended_utf8_to_uv" whose
        name explicitly indicates that it accepts Perl-extended UTF-8.
        Perl programs traditionally handle this by default.

        Whenever input is rejected, an explanatory warning message is
        raised, unless "utf8" warnings (or the appropriate subcategory)
        are turned off. A given input sequence may contain multiple
        malformations, giving rise to multiple warnings, as the
        functions attempt to find and report on all malformations in a
        sequence. All the possible malformations are listed in
        "utf8_to_uv_msgs", with some examples of multiple ones for the
        same sequence.

        Often, "s" is an arbitrarily long string containing the UTF-8
        representations of many code points in a row, and these
        functions are called in the course of parsing "s" to find all
        those code points.

        If your code doesn't know how to deal with illegal input, as
        would be typical of a low level routine, the loop could look
        like:

         while (s < e) {
             UV cp;
             Size_t advance;
             (void) utf8_to_uv(s, e, &cp, &advance);
             <handle 'cp'>
             s += advance;
         }

        A REPLACEMENT CHARACTER will be inserted everywhere that
        malformed input occurs. Obviously, we aren't expecting such
        outcomes, but your code will be protected from going off the
        rails.

        If you do have a plan for handling malformed input, you could
        instead write:

         while (s < e) {
             UV cp;
             Size_t advance;

             if (UNLIKELY(! utf8_to_uv(s, e, &cp, &advance))) {
                 <bail out or convert to handleable>
             }

             <handle 'cp'>

             s += advance;
         }

        You may pass NULL to these functions instead of a pointer to
        your "advance" variable. But the only legitimate case to do this
        is if you are only examining the first character in "s", and
        have no plans to ever look further. You could also advance by
        using "UTF8SKIP", but this gives the correct result if and only
        if the input is well-formed; and is extra work always, as the
        functions have already done the equivalent work and return the
        correct value in "advance", regardless of whether the input is
        well-formed or not.

        You must always pass a non-NULL pointer into which to store the
        (first) code point "s" represents. If you don't care about this
        value, you should be using one of the "isUTF8_CHAR" functions
        instead.

    Function where the UTF-8 is known to be valid
        "valid_utf8_to_uvchr" is designed to be used where you generated
        the UTF-8 yourself, so you know it is valid. It skips any error
        checking, assuming the sequence of bytes starting at "s" is
        encoded as Perl extended UTF-8 (or Perl extended UTF-EBCDIC),
        reading as many bytes along "s" as necessary, and returning that
        count in *retlen (if "retlen" is not NULL).

    "utf8_to_uvchr" forms
        These are the old form equivalents of "utf8_to_uv" (and its
        synonym, "extended_utf8_to_uv"). They are "utf8_to_uvchr" and
        "utf8_to_uvchr_buf". There is no old form equivalent of either
        "strict_utf8_to_uv" or "c9strict_utf8_to_uv".

        "utf8_to_uvchr" is DEPRECATED. Do NOT use it; it is a security
        hole ready to bring destruction onto you and yours.
        "utf8_to_uvchr_buf" is discouraged and may eventually become
        deprecated.

        "utf8_to_uvchr_buf" checks if the sequence of bytes starting at
        "s" form a complete, legal UTF-8 (or UTF-EBCDIC) sequence for a
        code point. If so, it returns the code point value the sequence
        represents, and *retlen will be set to its length, in bytes.
        Thus, the next possible character in "s" begins at
        "s + *retlen".

        The function only examines as many bytes along "s" as are needed
        to form a complete UTF-8 representation of a single code point.
        Under no circumstances does it examine any byte beyond "e - 1".

        If the sequence examined starting at "s" is not legal Perl
        extended UTF-8, the translation fails, and the resultant
        behavior unfortunately depends on if the warnings category
        "utf8" is enabled or not.

        If 'utf8' warnings are disabled
            The Unicode REPLACEMENT CHARACTER is silently returned, and
            *retlen is set (if "retlen" isn't "NULL") so that
            ("s" + *retlen) is the next possible position in "s" that
            could begin a non-malformed character.

            But note that it is ambiguous whether a REPLACEMENT
            CHARACTER was actually in the input, or if this function
            synthetically generated one. In the unlikely event that you
            care, you'd have to examine the input to disambiguate.

        If 'utf8' warnings are enabled
            A warning will be displayed, and 0 is returned and *retlen
            is set (if "retlen" isn't "NULL") to -1.

            But note that 0 may also be returned if *s is a legal NUL
            character. This means that you have to disambiguate a 0
            return. You can do this by checking that the first byte of
            "s" is indeed a NUL; or by making sure to always pass a
            non-NULL "retlen" pointer, and by examining it.

            Also note that should you wish to proceed with parsing "s",
            you have no easy way of knowing where to start looking in it
            for the next possible character. It would be better to have
            instead called an equivalent function that provides this
            information; any of the "utf8_to_uv" series, or
            "utf8n_to_uvchr".

        Because of these quirks, "utf8_to_uvchr_buf" is very difficult
        to use correctly and handle all cases. Generally, you need to
        bail out at the first failure it finds.

        The deprecated "utf8_to_uvchr" behaves the same way as
        "utf8_to_uvchr_buf" for well-formed input, and for the
        malformations it is capable of finding, but doesn't find all of
        them, and it can read beyond the end of the input buffer, which
        is why it is deprecated.

    The bottom line is use the "utf8_to_uv()" family of functions.

        bool       utf8_to_uv         (      const U8 * const s,
                                             const U8 * const e,
                                             UV *cp_p, Size_t *advance_p)
        bool  Perl_utf8_to_uv         (      const U8 * const s,
                                             const U8 * const e,
                                             UV *cp_p, Size_t *advance_p)
        bool       extended_utf8_to_uv(      const U8 * const s,
                                             const U8 * const e,
                                             UV *cp_p, Size_t *advance_p)
        bool  Perl_extended_utf8_to_uv(      const U8 * const s,
                                             const U8 * const e,
                                             UV *cp_p, Size_t *advance_p)
        bool       strict_utf8_to_uv  (      const U8 * const s,
                                             const U8 * const e,
                                             UV *cp_p, Size_t *advance_p)
        bool  Perl_strict_utf8_to_uv  (      const U8 * const s,
                                             const U8 * const e,
                                             UV *cp_p, Size_t *advance_p)
        bool       c9strict_utf8_to_uv(      const U8 * const s,
                                             const U8 * const e,
                                             UV *cp_p, Size_t *advance_p)
        bool  Perl_c9strict_utf8_to_uv(      const U8 * const s,
                                             const U8 * const e,
                                             UV *cp_p, Size_t *advance_p)
        UV         valid_utf8_to_uvchr(      const U8 *s, STRLEN *retlen)
        UV    Perl_valid_utf8_to_uvchr(      const U8 *s, STRLEN *retlen)
        UV         utf8_to_uvchr_buf  (      const U8 *s, const U8 *send,
                                             STRLEN *retlen)
        UV    Perl_utf8_to_uvchr_buf  (pTHX_ const U8 *s, const U8 *send,
                                             STRLEN *retlen)
        UV         utf8_to_uvchr      (      const U8 *s, STRLEN *retlen)
        UV    Perl_utf8_to_uvchr      (pTHX_ const U8 *s, STRLEN *retlen)
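
To make the flavors in the quoted pod concrete, here is a minimal sketch of the code point categories involved. All names (`sketch_problems`, `SK_*`, `sketch_accepts`) are hypothetical. The surrogate and non-character ranges are the standard Unicode ones; the reject masks encode only what the quoted paragraph states. Whether the strict flavors also reject code points above 0x10FFFF is deliberately not modelled and should be checked against the real documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical category bits; 0 means an ordinary code point. */
#define SK_SURROGATE      1u
#define SK_NONCHAR        2u
#define SK_ABOVE_UNICODE  4u

/* Reject masks as described in the quoted pod (assumption: nothing more). */
#define SK_REJECT_STRICT    (SK_SURROGATE | SK_NONCHAR)
#define SK_REJECT_C9STRICT  (SK_SURROGATE)
#define SK_REJECT_EXTENDED  0u

static unsigned sketch_problems(uint64_t cp)
{
    if (cp > 0x10FFFF)                 return SK_ABOVE_UNICODE;
    if (cp >= 0xD800 && cp <= 0xDFFF)  return SK_SURROGATE;
    if (cp >= 0xFDD0 && cp <= 0xFDEF)  return SK_NONCHAR;
    if ((cp & 0xFFFE) == 0xFFFE)       return SK_NONCHAR; /* U+nFFFE, U+nFFFF */
    return 0;    /* includes private use and unassigned code points */
}

static bool sketch_accepts(uint64_t cp, unsigned reject_mask)
{
    return (sketch_problems(cp) & reject_mask) == 0;
}

static void sketch_flavor_demo(void)
{
    /* A non-character passes the C9 rule but not the strict one. */
    assert(sketch_accepts(0xFFFF, SK_REJECT_C9STRICT));
    assert(!sketch_accepts(0xFFFF, SK_REJECT_STRICT));

    /* A surrogate is rejected by both strict flavors, accepted by extended. */
    assert(!sketch_accepts(0xD800, SK_REJECT_STRICT));
    assert(!sketch_accepts(0xD800, SK_REJECT_C9STRICT));
    assert(sketch_accepts(0xD800, SK_REJECT_EXTENDED));
}
```

Expressing each flavor as a mask over one classifier mirrors the statement that the functions differ only in what they accept.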

Contributor

This isn't a hard reject, but the 83 lines of description of the "avoid these" functions (9 from "The functions whose names contain "to_uvchr"...", 74 from ""utf8_to_uvchr" forms") seems like a distraction from the documentation of the functions we do want people to use.

Unrelated, the list of functions from "The functions differ only..." could be (syntactically) a list.

utf8.c Outdated Show resolved Hide resolved
@khwilliamson
Contributor Author

I changed the name of perl_utf8_to_uv to extended_utf8_to_uv because of likely confusion

@tonycoz
Contributor

tonycoz commented Oct 29, 2024

Except for the documentation nits which aren't hard rejections, I'm otherwise happy with this.

@khwilliamson
Contributor Author

In testing this, I found a few bugs, which I have force-pushed corrections for. The compare button above hides most irrelevant code changes. Most of the changes were to the pod, either better wording or I found I didn't fully understand how things actually worked.

Regarding the pod concerns: I realized that valid_to_uvchr_buf is an internal function, so its pod needs to stay in perlintern and not be combined with the others. I still believe the pod for the functions that these new ones are preferred over should remain combined with these, but come afterwards. perlapi is not just a reference document, but a teaching one, and I want to persuade people that it is worthwhile to make the effort to convert from the old style to the new. It's like the Raku documentation, which assumes that the reader is converting from Perl and so goes out of its way to highlight the differences. And if you aren't interested in seeing the pod for the old functions, it comes at the end of the group, but before the signatures for all of the items in it. Just stop reading at that point; or, if you are interested in the signatures, their appearance is quite distinct from the remainder of the pod, so it is easy to glance over it to find them.

I'm still working on the tests, which have shown other, pretty obscure bugs in the base code of this function. More pull requests to come on that.

khwilliamson added a commit to khwilliamson/perl5 that referenced this pull request Nov 26, 2024
These are the inverse of the utf8_to_uv family in GH Perl#22541.  They are
just synonyms to existing functions, and are being added to reduce
cognitive load, so if you know one name, you automatically can figure
out the inverse.
There was a path through this function in which the caller's parameter
it asked to be set, &msgs, did not get set.  And doing it at the
beginning means not needing a second place.

Similarly for &errors.  There is no path where it didn't get set, but it
is cleaner to do it at the same time as doing msgs.
These two input parameters are for very specialized uses.
This is a one line function that just calls another function.
It was a macro, but had a long-name function as well.  This converts to
using two macros.
It was a macro, but had a long-name function as well.  This converts to
using two macros.
This is the first of several functions with the naming style
utf8_to_uv(), and which are designed to be used instead of the
problematic current ones that are like utf8_to_uvchr().

The previous ones basically throw away crucial information in their
returns upon failure, creating hassles for the caller.  With them it is
hard to recover from malformed input and continue parsing, which is what
modern UTF-8 handlers have settled on doing.

Originally I planned to replace just the most problematic one,
utf8_to_uvchr_buf(), but I realized that each level threw away
information, so it would be better to start at the base level one, which
utf8_to_uvchr_buf() eventually calls with a bunch of 0 parameters.  The
previous functions all had to disambiguate failure returns.  This stops
that at the root.

The new series all return a boolean as to their success, with a
consistent API throughout.  The old series had one outlier, again
utf8_to_uvchr_buf(), which had a different calling convention and
returns.

The basic logic in the base level function, which this commit handles,
was sound.  It just failed to return relevant information upon failure.

The new API has somewhat different formal parameter names and uses
Size_t instead of STRLEN for one of the parameters.  It also passes the
end of string position instead of a length.  The latter is problematic
when it could go negative, and instead becomes a huge positive number.

The old base function now merely calls the new one, and throws away the
relevant information, as it always has.
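
The remark above about a length that "could go negative" can be illustrated with a short, self-contained sketch. The helper names are hypothetical and unrelated to the actual Perl functions; the sketch only shows how a negative pointer difference, once stored in an unsigned length, turns into a huge value, whereas an end-pointer comparison fails safely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical length-based check: are at least `need` bytes left? */
static int sketch_fits_by_length(size_t curlen, size_t need)
{
    return curlen >= need;
}

/* Hypothetical end-pointer-based check for the same question. */
static int sketch_fits_by_end(const uint8_t *s, const uint8_t *e, size_t need)
{
    return s < e && (size_t)(e - s) >= need;
}

static void sketch_length_demo(void)
{
    const uint8_t buf[8] = { 0 };
    const uint8_t *e = buf + 4;   /* logical end of the string */
    const uint8_t *s = buf + 5;   /* a buggy caller stepped one byte past e */

    ptrdiff_t remaining = e - s;             /* -1 */
    size_t as_length = (size_t) remaining;   /* wraps to SIZE_MAX */

    assert(remaining == -1);
    assert(as_length == SIZE_MAX);
    assert(sketch_fits_by_length(as_length, 4));  /* wrongly: plenty of room */
    assert(!sketch_fits_by_end(s, e, 4));         /* correctly: nothing left */
}
```
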
This is just utf8n_to_uvchr_error() with a more convenient API that is
harder to misuse.

New code should use this new function instead of the old.
This is just utf8n_to_uvchr() with a more convenient API that is harder
to misuse.

New code should use this new function instead of the old.
This performs the same function as utf8_to_uvchr_buf() with a more
convenient API that is much harder to misuse.

All code should convert to use this new function instead of the old.

The behavior of utf8_to_uvchr_buf() varies depending on whether <utf8>
warnings are enabled, and no code in core actually takes that into
account.

If warnings are enabled:

 A zero return can mean either success or failure

     Hence a zero return must be disambiguated.  Success would come
     from the next character being a NUL.

 If failure, <retlen> will be -1, so can't be used to find where to
 start parsing again.

If disabled:

 Both the return and <retlen> will be usable values, but the return
 of the REPLACEMENT CHARACTER is ambiguous.  It could mean failure,
 or it could mean that that was the next character in the input and
 was successfully decoded.  It may very well not matter to you what
 the source of this particular value was.  It likely means a failure
 somewhere.  But there are occasions where you might care.

The new function returns true upon success; false on failure.  And it is
passed pointers to return the computed code point and byte length into.
These values always contain the correct information, regardless of if
the input is malformed or not.

It is easy to test for failure in a conditional and then to take
appropriate action.  However, most often it seems the appropriate action
is to use, going forward, the REPLACEMENT CHARACTER returned in failure
cases.

And if you don't care particularly if it succeeds or not, you just use
it without testing the result.  This happens when you are confident that
the input is well-formed, or say in converting a string for display.
This is simpler than the existing one.
One of these is a more explicit synonym for that function; the other two
restrict what's acceptable to Unicode's legal interchange or their C9
legal interchange.