Add new API function utf8_to_uv() #22541

khwilliamson · 2024-08-25T22:32:20Z

This is designed to replace the problematic utf8_to_uvchr(). Its behavior varies depending on whether warnings are enabled, and no code in core actually takes that into account.

If warnings are enabled:

    A zero return can mean either success or failure, so it must be
    disambiguated.  Success would come from the next character being a NUL.

    On failure, <retlen> will be -1, so it can't be used to find where to
    start parsing again.

If warnings are disabled:

    Both the return and <retlen> will be usable values, but a return
    of the REPLACEMENT CHARACTER is ambiguous.  It could mean failure,
    or it could mean that that was the next character in the input and
    was successfully decoded.

utf8_to_uv() solves these. This commit includes a few changes to use it, to show it works. I have WIP that changes the rest of core to use it. I found that it makes coding simpler.

The new function returns true upon success and false on failure. It is passed pointers through which it returns the computed code point and byte length. These values always contain the correct information, regardless of whether the input is malformed.

It is easy to test for failure in a conditional and then to take appropriate action. However, most often it seems the appropriate action is to use, going forward, the REPLACEMENT CHARACTER returned in failure cases.

And if you don't particularly care whether it succeeds, you just use it without testing the result. This happens when you are confident that the input is well-formed, or, say, when converting a string for display.
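The calling convention described above can be sketched in self-contained C. Everything prefixed `demo_` is hypothetical and is not Perl's code: `demo_utf8_to_uv()` is a simplified stand-in that only understands plain 1 to 4 byte UTF-8 (no Perl extensions, no warnings, no EBCDIC), written just to show the shape of a boolean return plus always-usable `cp` and `advance` outputs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_REPLACEMENT 0xFFFDu   /* U+FFFD REPLACEMENT CHARACTER */

/* Shared failure path: the outputs stay usable even though we return false. */
static bool
demo_fail(uint32_t *cp, size_t *advance, size_t consumed)
{
    *cp = DEMO_REPLACEMENT;
    if (advance) *advance = consumed;
    return false;
}

/* Hypothetical stand-in for the new-style API (NOT Perl's implementation). */
static bool
demo_utf8_to_uv(const uint8_t *s, const uint8_t *e,
                uint32_t *cp, size_t *advance)
{
    size_t need, i;
    uint32_t value, min;

    assert(s < e);                       /* caller must supply >= 1 byte */

    if (s[0] < 0x80) {                   /* ASCII, including NUL */
        *cp = s[0];
        if (advance) *advance = 1;
        return true;
    }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; value = s[0] & 0x1F; min = 0x80;    }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; value = s[0] & 0x0F; min = 0x800;   }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; value = s[0] & 0x07; min = 0x10000; }
    else return demo_fail(cp, advance, 1);   /* stray continuation or bad start */

    for (i = 1; i < need; i++) {
        /* Never look at or beyond 'e'; stop at the first non-continuation. */
        if ((size_t)(e - s) <= i || (s[i] & 0xC0) != 0x80)
            return demo_fail(cp, advance, i);
        value = (value << 6) | (s[i] & 0x3F);
    }

    if (value < min)                     /* overlong encoding */
        return demo_fail(cp, advance, need);

    *cp = value;
    if (advance) *advance = need;
    return true;
}

/* Parse a whole buffer, counting code points and synthesized replacements. */
static size_t
demo_count_code_points(const uint8_t *s, size_t len, size_t *malformed)
{
    const uint8_t *e = s + len;
    size_t count = 0;

    *malformed = 0;
    while (s < e) {
        uint32_t cp;
        size_t advance;

        if (!demo_utf8_to_uv(s, e, &cp, &advance))
            (*malformed)++;              /* cp is U+FFFD here; keep going */
        count++;
        s += advance;
    }
    return count;
}
```

Because the success flag is separate from the code point, a decoded NUL and a REPLACEMENT CHARACTER that was really in the input are both distinguishable from failure, and the loop can always resume at `s + advance`.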

There is another function utf8_to_uv_flags() which merely extends this API for more flexible use, and doesn't offer the advantages over the existing API function that does the same thing. I included it because the main function is just a small wrapper around it, and the API is similar and some may prefer it.

inline.h

karenetheridge · 2024-08-29T01:42:16Z

After this is in, is it possible for Devel::PPPort to generate a warning when utf8_to_uvchr is used, with a reference to its replacement?

(I'm not suggesting that you do this Tony, just wondering if we can and should)

khwilliamson · 2024-08-29T02:24:25Z

It is easy in ppport.h to either output a hint or a warning about using any particular function. It's just text. Here's an example:

*** WARNING: my_sprintf
*** It's safer to use my_snprintf instead

khwilliamson · 2024-09-02T04:03:27Z

The name I used for this function was the first name used for this functionality, being removed in 5.7. A succession of names has followed, as the previous incarnation was found to be deficient for one reason or another. uvchr apparently came about as a result of supporting non-ASCII systems, and I presume it was chosen to somehow note the fluidity of the underlying character set. I thought that it had been long enough since utf8_to_uv had been in use that it could safely be reused. But I came up with an alternative name in the meantime, utf8_to_cp. (We would want to make an inverse, cp_to_utf8.) I think cp is in common enough usage these days that the name would be self-explanatory to modern programmers.

But I don't know. I'm opening this up for discussion

tonycoz · 2024-09-04T02:10:45Z

If it returns a native code point rather than a Unicode code point I'd be inclined to avoid "cp".

We've generally used "uvchr" for this but utf8_to_uvchr() is what we're trying to replace.

Maybe utf8_decode_uvchr(), but that has its own potential for confusion.

Maybe utf8_to_uvchr4() to emphasize that this is an alternative to utf8_to_uvchr().

khwilliamson · 2024-09-04T19:48:19Z

I'm then inclined to go with utf8_to_uv

tonycoz · 2024-10-20T23:29:46Z

utf8.c

+=for apidoc      utf8_to_uv
+=for apidoc_item strict_utf8_to_uv
+=for apidoc_item c9strict_utf8_to_uv
+=for apidoc_item valid_utf8_to_uvchr
+=for apidoc_item utf8_to_uvchr_buf
+=for apidoc_item utf8_to_uvchr


One problem with documenting these in the same place is I need to read (or at least skim) the documentation for the hard to use correctly APIs to use the easier to use correctly APIs.

I don't understand this; when I look at the resultant pod, except for the names, the newer ones are first in the group. In order to see the calling prototypes you have to skip over the old names, but the prototypes are visually distinctive enough that that's easy to do.

I'm sure it probably doesn't matter to the apidoc parser this is for, but I just want to note that since there are no newlines in between these lines, this constitutes a single =for directive which contains additional lines that happen to start with =for.

It's not the list of names that I'm talking about, but the descriptive text needs to cover both sets of functions.

I still don't understand your objection. Here is the entire section. The favored functions are listed first; if you want to skip reading the disfavored ones to go look at the signatures, my claim is that the latter is visually distinctive enough that skimming is not required. I just added a sentence to make it stand out a bit more. That could be expanded.

But I do think it is important that these be in the same section. There are several paragraphs at the beginning that apply to all types, and those shouldn't be repeated. Also it's easier to contrast the types when in the same section

"utf8_to_uv"
"extended_utf8_to_uv"
"strict_utf8_to_uv"
"c9strict_utf8_to_uv"
"valid_utf8_to_uvchr"
"utf8_to_uvchr_buf"
"utf8_to_uvchr"

"DEPRECATED!" It is planned to remove "utf8_to_uvchr" from a future release of Perl. Do not use it for new code; remove it from existing code.

These functions each translate from UTF-8 to UTF-32 (or UTF-64 on 64 bit platforms). In other words, to a code point ordinal value. (On EBCDIC platforms, the initial encoding is UTF-EBCDIC, and the output is a native code point).

For example, the string "A" would be converted to the number 65 on an ASCII platform, and to 193 on an EBCDIC one. Converting the string "ABC" would yield the same results, as the functions stop after the first character converted. Converting the string "\N{LATIN CAPITAL LETTER A WITH MACRON} plus anything more in the string" would yield the number 0x100 on both types of platforms, since the first character is U+0100.

The functions whose names contain "to_uvchr" are older than the functions whose names don't have "chr" in them. The API in the older functions is harder to use correctly, and so they are kept only for backwards compatibility, and may eventually become deprecated. If you are writing a module and use Devel::PPPort, your code can use the new functions back to at least Perl v5.7.1.

("valid_utf8_to_uvchr" is the exception to this name rule; its API is not problematic, and it is in no danger of becoming deprecated. But it is highly specialized so should rarely occur in actual code.)

All the functions accept, without complaint, well-formed UTF-8 for any non-problematic Unicode code point 0 .. 0x10FFFF. There are two types of Unicode problematic code points: surrogate characters and non-character code points. (See perlunicode.) Some of the functions reject one or both of these. Private use characters and those code points yet to be assigned to a particular character are never considered problematic.

Additionally, most of the functions accept non-Unicode code points, those starting at 0x110000.

"utf8_to_uv" forms

Almost all code should use only "utf8_to_uv", "extended_utf8_to_uv", "strict_utf8_to_uv", or "c9strict_utf8_to_uv". The other functions are either the problematic old form, or are for highly specialized uses.

These four functions each return "true" if the sequence of bytes starting at "s" forms a complete, legal UTF-8 (or UTF-EBCDIC) sequence for a code point. If so, *cp will be set to the native code point value it represents, and *advance will be set to its length, in bytes.

Otherwise, each function returns "false" and sets *cp to the Unicode REPLACEMENT CHARACTER, and *advance to the next position along "s", where the next possible UTF-8 character could begin.

The functions only examine as many bytes along "s" as are needed to form a complete UTF-8 representation of a single code point. Under no circumstances do they examine any byte beyond "e - 1", failing if the code point requires more than "e - s" bytes to represent.

The functions differ only in what flavor of UTF-8 they accept. All reject syntactically invalid UTF-8.

"strict_utf8_to_uv" additionally rejects any UTF-8 that translates into a code point that isn't specified by Unicode to be freely exchangeable, namely the surrogate characters and non-character code points.

"c9strict_utf8_to_uv" instead uses the exchangeable definition given by Unicode's Corrigendum #9, which rejects only surrogates.

"extended_utf8_to_uv" accepts all syntactically valid UTF-8, as extended by Perl to allow 64-bit code points to be encoded.

"utf8_to_uv" is merely a synonym of "extended_utf8_to_uv"; the longer name explicitly indicates that it accepts Perl-extended UTF-8. Perl programs traditionally handle this by default.

Whenever input is rejected, an explanatory warning message is raised, unless "utf8" warnings (or the appropriate subcategory) are turned off.

A given input sequence may contain multiple malformations, giving rise to multiple warnings, as the functions attempt to find and report on all malformations in a sequence. All the possible malformations are listed in "utf8_to_uv_msgs", with some examples of multiple ones for the same sequence.

Often, "s" is an arbitrarily long string containing the UTF-8 representations of many code points in a row, and these functions are called in the course of parsing "s" to find all those code points.

If your code doesn't know how to deal with illegal input, as would be typical of a low level routine, the loop could look like:

    while (s < e) {
        UV cp;
        Size_t advance;

        (void) utf8_to_uv(s, e, &cp, &advance);
        <handle 'cp'>
        s += advance;
    }

A REPLACEMENT CHARACTER will be inserted everywhere that malformed input occurs. Obviously, we aren't expecting such outcomes, but your code will be protected from going off the rails.

If you do have a plan for handling malformed input, you could instead write:

    while (s < e) {
        UV cp;
        Size_t advance;

        if (UNLIKELY(! utf8_to_uv(s, e, &cp, &advance))) {
            <bail out or convert to handleable>
        }
        <handle 'cp'>
        s += advance;
    }

You may pass NULL to these functions instead of a pointer to your "advance" variable. But the only legitimate case to do this is if you are only examining the first character in "s", and have no plans to ever look further. You could also advance by using "UTF8SKIP", but this gives the correct result if and only if the input is well-formed; and is extra work always, as the functions have already done the equivalent work and return the correct value in "advance", regardless of whether the input is well-formed or not.

You must always pass a non-NULL pointer into which to store the (first) code point "s" represents. If you don't care about this value, you should be using one of the "isUTF8_CHAR" functions instead.

Function where the UTF-8 is known to be valid

"valid_utf8_to_uvchr" is designed to be used where you generated the UTF-8 yourself, so you know it is valid. It skips any error checking, assuming the sequence of bytes starting at "s" is encoded as Perl extended UTF-8 (or Perl extended UTF-EBCDIC), reading as many bytes along "s" as necessary, and returning that count in *retlen (if "retlen" is not NULL).

"utf8_to_uvchr" forms

These are the old form equivalents of "utf8_to_uv" (and its synonym, "extended_utf8_to_uv"). They are "utf8_to_uvchr" and "utf8_to_uvchr_buf". There is no old form equivalent of either "strict_utf8_to_uv" or "c9strict_utf8_to_uv".

"utf8_to_uvchr" is DEPRECATED. Do NOT use it; it is a security hole ready to bring destruction onto you and yours.

"utf8_to_uvchr_buf" is discouraged and may eventually become deprecated.

"utf8_to_uvchr_buf" checks if the sequence of bytes starting at "s" forms a complete, legal UTF-8 (or UTF-EBCDIC) sequence for a code point. If so, it returns the code point value the sequence represents, and *retlen will be set to its length, in bytes. Thus, the next possible character in "s" begins at "s + *retlen".

The function only examines as many bytes along "s" as are needed to form a complete UTF-8 representation of a single code point. Under no circumstances does it examine any byte beyond "e - 1".

If the sequence examined starting at "s" is not legal Perl extended UTF-8, the translation fails, and the resultant behavior unfortunately depends on whether the warnings category "utf8" is enabled or not.

If 'utf8' warnings are disabled

The Unicode REPLACEMENT CHARACTER is silently returned, and *retlen is set (if "retlen" isn't "NULL") so that ("s" + *retlen) is the next possible position in "s" that could begin a non-malformed character.

But note that it is ambiguous whether a REPLACEMENT CHARACTER was actually in the input, or if this function synthetically generated one.
In the unlikely event that you care, you'd have to examine the input to disambiguate.

If 'utf8' warnings are enabled

A warning will be displayed, and 0 is returned and *retlen is set (if "retlen" isn't "NULL") to -1.

But note that 0 may also be returned if *s is a legal NUL character. This means that you have to disambiguate a 0 return. You can do this by checking that the first byte of "s" is indeed a NUL; or by making sure to always pass a non-NULL "retlen" pointer, and by examining it.

Also note that should you wish to proceed with parsing "s", you have no easy way of knowing where to start looking in it for the next possible character. It would be better to have instead called an equivalent function that provides this information; any of the "utf8_to_uv" series, or "utf8n_to_uvchr".

Because of these quirks, "utf8_to_uvchr_buf" is very difficult to use correctly and handle all cases. Generally, you need to bail out at the first failure it finds.

The deprecated "utf8_to_uvchr" behaves the same way as "utf8_to_uvchr_buf" for well-formed input, and for the malformations it is capable of finding, but doesn't find all of them, and it can read beyond the end of the input buffer, which is why it is deprecated.

The bottom line is to use the "utf8_to_uv()" family of functions.

    bool utf8_to_uv(const U8 * const s, const U8 * const e, UV *cp_p, Size_t *advance_p)
    bool Perl_utf8_to_uv(const U8 * const s, const U8 * const e, UV *cp_p, Size_t *advance_p)
    bool extended_utf8_to_uv(const U8 * const s, const U8 * const e, UV *cp_p, Size_t *advance_p)
    bool Perl_extended_utf8_to_uv(const U8 * const s, const U8 * const e, UV *cp_p, Size_t *advance_p)
    bool strict_utf8_to_uv(const U8 * const s, const U8 * const e, UV *cp_p, Size_t *advance_p)
    bool Perl_strict_utf8_to_uv(const U8 * const s, const U8 * const e, UV *cp_p, Size_t *advance_p)
    bool c9strict_utf8_to_uv(const U8 * const s, const U8 * const e, UV *cp_p, Size_t *advance_p)
    bool Perl_c9strict_utf8_to_uv(const U8 * const s, const U8 * const e, UV *cp_p, Size_t *advance_p)
    UV valid_utf8_to_uvchr(const U8 *s, STRLEN *retlen)
    UV Perl_valid_utf8_to_uvchr(const U8 *s, STRLEN *retlen)
    UV utf8_to_uvchr_buf(const U8 *s, const U8 *send, STRLEN *retlen)
    UV Perl_utf8_to_uvchr_buf(pTHX_ const U8 *s, const U8 *send, STRLEN *retlen)
    UV utf8_to_uvchr(const U8 *s, STRLEN *retlen)
    UV Perl_utf8_to_uvchr(pTHX_ const U8 *s, STRLEN *retlen)

This isn't a hard reject, but the 83 lines of description of the "avoid these" functions (9 from "The functions whose names contain "to_uvchr"...", 74 from ""utf8_to_uvchr" forms") seem like a distraction from the documentation of the functions we do want people to use.

Unrelated, the list of functions from "The functions differ only..." could be (syntactically) a list.

utf8.c

khwilliamson · 2024-10-22T14:12:37Z

I changed the name of perl_utf8_to_uv to extended_utf8_to_uv because of likely confusion

tonycoz · 2024-10-29T22:40:57Z

Except for the documentation nits which aren't hard rejections, I'm otherwise happy with this.

khwilliamson · 2024-11-25T20:58:50Z

In testing this, I found a few bugs, which I have force-pushed corrections for. The compare button above hides most irrelevant code changes. Most of the changes were to the pod, either better wording or I found I didn't fully understand how things actually worked.

Concerning the pod concerns: I realized that valid_to_uvchr_buf is an internal function, so its pod needs to stay in perlintern and not be combined with the others. I still believe the pod for the functions these new ones are preferred over should remain combined with these, but come afterwards. perlapi is not just a reference document, but a teaching one; and I want to persuade people that it is worthwhile to make the effort to convert from the old-style to the new. It's like Raku documentation assuming that the reader is converting from Perl, so it goes out of its way to highlight the differences. And, if you aren't interested in seeing the pod for the old, it comes at the end of the group, but before the signatures for all of the items in it. Just stop reading at that point; or if you are interested in the signatures, their appearance is quite distinct from the remainder of the pod, so it is easy to just glance over it to see them.

I'm still working on the tests, which have shown other, pretty obscure bugs in the base code of this function. More pull requests to come on that.

These are the inverse of the utf8_to_uv family in GH Perl#22541. They are just synonyms to existing functions, and are being added to reduce cognitive load, so if you know one name, you automatically can figure out the inverse.

There was a path through this function in which the caller's parameter it asked to be set, &msgs, did not get set. And doing it at the beginning means not needing a second place. Similarly for &errors. There is no path where it didn't get set, but it is cleaner to do it at the same time as doing msgs.

This is the first of several functions with the naming style utf8_to_uv(), which are designed to be used instead of the problematic current ones that are like utf8_to_uvchr(). The previous ones basically throw away crucial information in their returns upon failure, creating hassles for the caller. It is hard with them to recover from malformed input and continue parsing, which is what modern UTF-8 handlers have settled on doing.

Originally I planned to replace just the most problematic one, utf8_to_uvchr_buf(), but I realized that each level threw away information, so it would be better to start at the base level one, which utf8_to_uvchr_buf() eventually calls with a bunch of 0 parameters.

Callers of the previous functions all had to disambiguate failure returns. This stops that at the root. The new series all return a boolean as to their success, with a consistent API throughout. The old series had one outlier, again utf8_to_uvchr_buf(), which had a different calling convention and returns.

The basic logic in the base level function, which this commit handles, was sound. It just failed to return relevant information upon failure.

The new API has somewhat different formal parameter names and uses Size_t instead of STRLEN for one of the parameters. It also passes the end of string position instead of a length. A length is problematic when it could go negative, and instead becomes a huge positive number.

The old base function now merely calls the new one, and throws away the relevant information, as it always has.
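The point about a length going "negative" can be illustrated with a small, self-contained C sketch. The `demo_` helpers are hypothetical and only contrast the two bookkeeping styles; they are not taken from the Perl source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Length-style bookkeeping: if 'consumed' ever exceeds 'total', the
 * unsigned subtraction wraps around to a huge positive value instead
 * of going negative. */
static size_t
demo_remaining_by_length(size_t total, size_t consumed)
{
    return total - consumed;
}

/* End-pointer-style bookkeeping: the caller only ever asks whether the
 * current position is still before the end, so there is nothing to wrap.
 * Both pointers must point into (or one past) the same array. */
static bool
demo_has_more(const uint8_t *s, const uint8_t *e)
{
    return s < e;
}
```

With `total == 3` and `consumed == 4`, `demo_remaining_by_length()` yields `SIZE_MAX`, which a callee would take as permission to read far past the buffer, whereas the pointer comparison simply reports that nothing is left.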

This performs the same function as utf8_to_uvchr_buf() with a more convenient API that is much harder to misuse. All code should convert to use this new function instead of the old.

The behavior of utf8_to_uvchr_buf() varies depending on whether <utf8> warnings are enabled, and no code in core actually takes that into account.

If warnings are enabled:

    A zero return can mean either success or failure, so it must be
    disambiguated.  Success would come from the next character being a NUL.

    On failure, <retlen> will be -1, so it can't be used to find where to
    start parsing again.

If warnings are disabled:

    Both the return and <retlen> will be usable values, but a return
    of the REPLACEMENT CHARACTER is ambiguous.  It could mean failure,
    or it could mean that that was the next character in the input and
    was successfully decoded.  It may very well not matter to you what
    the source of this particular value was.  It likely means a failure
    somewhere.  But there are occasions where you might care.

The new function returns true upon success and false on failure. It is passed pointers through which it returns the computed code point and byte length. These values always contain the correct information, regardless of whether the input is malformed.

It is easy to test for failure in a conditional and then to take appropriate action. However, most often it seems the appropriate action is to use, going forward, the REPLACEMENT CHARACTER returned in failure cases.

And if you don't particularly care whether it succeeds, you just use it without testing the result. This happens when you are confident that the input is well-formed, or, say, when converting a string for display.

tonycoz reviewed Aug 28, 2024

tonycoz reviewed Aug 29, 2024

tonycoz reviewed Aug 29, 2024

khwilliamson force-pushed the utf8_to_uv branch from b3321c0 to 1e35e2d Compare August 29, 2024 03:02

tonycoz approved these changes Sep 2, 2024

khwilliamson force-pushed the utf8_to_uv branch from 1e35e2d to 3466be3 Compare October 18, 2024 23:08

tonycoz reviewed Oct 20, 2024

khwilliamson force-pushed the utf8_to_uv branch from 3466be3 to 72418de Compare October 22, 2024 10:34

khwilliamson force-pushed the utf8_to_uv branch from 72418de to c95a178 Compare October 23, 2024 18:27

github-actions bot added the hasConflicts label Oct 28, 2024

khwilliamson force-pushed the utf8_to_uv branch from c95a178 to bd1b0f7 Compare October 28, 2024 20:00

github-actions bot removed the hasConflicts label Oct 28, 2024

khwilliamson force-pushed the utf8_to_uv branch from bd1b0f7 to 2f3f716 Compare November 21, 2024 19:39

github-actions bot added hasConflicts and removed hasConflicts labels Nov 24, 2024

khwilliamson force-pushed the utf8_to_uv branch 2 times, most recently from a553a71 to 3b0b641 Compare November 25, 2024 20:05

khwilliamson mentioned this pull request Nov 26, 2024

Add uv_to_utf8 family of functions #22782

Open

github-actions bot added the hasConflicts label Nov 27, 2024

khwilliamson added 14 commits November 29, 2024 12:35

utf8_to_uv_msgs: Add branch predictions

01d43ff

These two input parameters are for very specialized uses.

Inline utf8_to_uvchr_buf

8d43a9f

This is a one line function that just calls another function.

Merge utf8_to_uvchr_buf() and its helper

ba06da2

The helper adds no value

Convert utf8n_to_uvchr_error to macro

2a894a3

It was a macro, but had a long-name function as well. This converts to using two macros.

Convert utf8n_to_uvchr() to macro

01881fa

It was a macro, but had a long-name function as well. This converts to using two macros.

Add utf8_to_uv_error(s)

f4ee548

This is just utf8n_to_uvchr_error() with a more convenient API that is harder to misuse. New code should use this new function instead of the old.

Add utf8_to_uv_flags()

c1ea65a

This is just utf8n_to_uvchr() with a more convenient API that is harder to misuse. New code should use this new function instead of the old.

Implement utf8_to_uvchr_buf in terms of utf8_to_uv_flags

d9ee4a4

This is simpler than the existing one.

Add utf8_to_uv() flavors

e3bfac5

One of these is a more explicit synonym for that function; the other two restrict what's acceptable to Unicode's legal interchange or their C9 legal interchange.
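As a rough illustration of how the three flavors differ in which code points they accept, here is a self-contained C sketch. The `demo_` predicates are hypothetical and classify an already-decoded code point; they are not Perl's implementation. The surrogate and non-character ranges are the standard Unicode ones. That the two stricter flavors also reject code points above 0x10FFFF is an assumption made here, on the grounds that such values are not legal for interchange.

```c
#include <stdbool.h>
#include <stdint.h>

/* U+D800..U+DFFF are the UTF-16 surrogates. */
static bool
demo_is_surrogate(uint64_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Non-characters: U+FDD0..U+FDEF, plus the last two code points of
 * each of the 17 planes (U+xxFFFE and U+xxFFFF). */
static bool
demo_is_nonchar(uint64_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF)
        || (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}

/* Extended flavor: any syntactically valid sequence is accepted. */
static bool
demo_extended_accepts(uint64_t cp)
{
    (void) cp;
    return true;
}

/* C9 flavor: rejects surrogates and (ASSUMPTION) above-Unicode values. */
static bool
demo_c9strict_accepts(uint64_t cp)
{
    return cp <= 0x10FFFF && !demo_is_surrogate(cp);
}

/* Strict flavor: the C9 rules, and also rejects non-characters. */
static bool
demo_strict_accepts(uint64_t cp)
{
    return demo_c9strict_accepts(cp) && !demo_is_nonchar(cp);
}
```

Under these rules U+FFFE passes the C9 check but not the strict one, while U+D800 is accepted only by the extended flavor.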

Document new utf8_to_uv function family

4f4b533

perldelta for utf8_to_uv() family

65a0a5c

khwilliamson force-pushed the utf8_to_uv branch from 3b0b641 to 65a0a5c Compare November 30, 2024 00:14

khwilliamson removed the hasConflicts label Nov 30, 2024
