Add new API function utf8_to_uv() #22541
base: blead
After this is in, is it possible for Devel::PPPort to generate a warning when utf8_to_uvchr is used, with a reference to its replacement? (I'm not suggesting that you do this, Tony, just wondering if we can and should.)
It is easy in ppport.h to either output a hint or a warning about using any particular function. It's just text. Here's an example:
b3321c0 to 1e35e2d
The name I used for this function was the first name used for this functionality, being removed in 5.7. A succession of names has followed, as the previous incarnation was found to be deficient for one reason or another. But I don't know. I'm opening this up for discussion.
If it returns a native code point rather than a Unicode code point I'd be inclined to avoid "cp". We've generally used "uvchr" for this but utf8_to_uvchr() is what we're trying to replace. Maybe Maybe
I'm then inclined to go with
1e35e2d to 3466be3
utf8.c
Outdated
=for apidoc utf8_to_uv
=for apidoc_item strict_utf8_to_uv
=for apidoc_item c9strict_utf8_to_uv
=for apidoc_item valid_utf8_to_uvchr
=for apidoc_item utf8_to_uvchr_buf
=for apidoc_item utf8_to_uvchr
One problem with documenting these in the same place is I need to read (or at least skim) the documentation for the hard to use correctly APIs to use the easier to use correctly APIs.
I don't understand this; when I look at the resultant pod, except for the names, the newer ones are first in the group. In order to see the calling prototypes you have to skip over the old names, but the prototypes are visually distinctive enough that that's easy to do.
I'm sure it probably doesn't matter to the apidoc parser this is for, but I just want to note that since there are no newlines in between these lines, this constitutes a single =for directive which contains additional lines that happen to start with =for.
It's not the list of names that I'm talking about, but the descriptive text, which needs to cover both sets of functions.
I still don't understand your objection. Here is the entire section. The favored functions are listed first; if you want to skip reading the disfavored ones to go look at the signatures, my claim is that the latter is visually distinctive enough that skimming is not required. I just added a sentence to make it stand out a bit more. That could be expanded.
But I do think it is important that these be in the same section. There are several paragraphs at the beginning that apply to all types, and those shouldn't be repeated. Also it's easier to contrast the types when in the same section
"utf8_to_uv"
"extended_utf8_to_uv"
"strict_utf8_to_uv"
"c9strict_utf8_to_uv"
"valid_utf8_to_uvchr"
"utf8_to_uvchr_buf"
"utf8_to_uvchr"
"DEPRECATED!" It is planned to remove "utf8_to_uvchr" from a future
release of Perl. Do not use it for new code; remove it from existing
code.
These functions each translate from UTF-8 to UTF-32 (or UTF-64 on 64
bit platforms). In other words, to a code point ordinal value. (On
EBCDIC platforms, the initial encoding is UTF-EBCDIC, and the output
is a native code point).
For example, the string "A" would be converted to the number 65 on
an ASCII platform, and to 193 on an EBCDIC one. Converting the
string "ABC" would yield the same results, as the functions stop
after the first character converted. Converting the string "\N{LATIN
CAPITAL LETTER A WITH MACRON} plus anything more in the string"
would yield the number 0x100 on both types of platforms, since the
first character is U+0100.
The functions whose names contain "to_uvchr" are older than the
functions whose names don't have "chr" in them. The API in the older
functions is harder to use correctly, and so they are kept only for
backwards compatibility, and may eventually become deprecated. If
you are writing a module and use Devel::PPPort, your code can use
the new functions back to at least Perl v5.7.1.
("valid_utf8_to_uvchr" is the exception to this name rule; its API
is not problematic, and it is in no danger of becoming deprecated.
But it is highly specialized so should rarely occur in actual code.)
All the functions accept, without complaint, well-formed UTF-8 for
any non-problematic Unicode code point 0 .. 0x10FFFF. There are two
types of Unicode problematic code points: surrogate characters and
non-character code points. (See perlunicode.) Some of the functions
reject one or both of these. Private use characters and those code
points yet to be assigned to a particular character are never
considered problematic. Additionally, most of the functions accept
non-Unicode code points, those starting at 0x110000.
"utf8_to_uv" forms
Almost all code should use only "utf8_to_uv",
"extended_utf8_to_uv", "strict_utf8_to_uv", or
"c9strict_utf8_to_uv". The other functions are either the
problematic old form, or are for highly specialized uses.
These four functions each return "true" if the sequence of bytes
starting at "s" forms a complete, legal UTF-8 (or UTF-EBCDIC)
sequence for a code point. If so, *cp will be set to the native
code point value it represents, and *advance will be set to its
length, in bytes.
Otherwise, each function returns "false" and sets *cp to the
Unicode REPLACEMENT CHARACTER, and *advance to the next position
along "s", where the next possible UTF-8 character could begin.
The functions only examine as many bytes along "s" as are needed
to form a complete UTF-8 representation of a single code point.
Under no circumstances do they examine any byte beyond "e - 1",
failing if the code point requires more than "e - s" bytes to
represent.
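To make the contract described above concrete, here is a minimal standalone sketch. It is not Perl's implementation: `toy_utf8_to_uv` is a hypothetical stand-in that only handles standard (non-extended) UTF-8 on an ASCII platform, uses `uint32_t`/`size_t` in place of `UV`/`Size_t`, and emits no warnings. It only illustrates the return convention: a boolean result, the REPLACEMENT CHARACTER in `*cp` on failure, a usable `*advance` either way, and no read at or beyond `e`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, NOT Perl's utf8_to_uv(): decodes one standard
 * UTF-8 sequence from [s, e) and mimics the documented return contract. */
static bool toy_utf8_to_uv(const uint8_t *s, const uint8_t *e,
                           uint32_t *cp, size_t *advance)
{
    /* Smallest code point that legitimately needs 'need' bytes. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t need, i;
    uint32_t value;

    *cp = 0xFFFD;                       /* REPLACEMENT CHARACTER by default */
    if (s >= e) {                       /* empty input: nothing to examine */
        if (advance) *advance = 0;
        return false;
    }

    if (s[0] < 0x80)                { need = 1; value = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; value = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; value = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; value = s[0] & 0x07; }
    else {                              /* stray continuation or bad start byte */
        if (advance) *advance = 1;
        return false;
    }

    for (i = 1; i < need; i++) {
        /* Stop before 'e', and stop at the first non-continuation byte. */
        if (s + i >= e || (s[i] & 0xC0) != 0x80) {
            if (advance) *advance = i;  /* next character could start here */
            return false;
        }
        value = (value << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (advance) *advance = need;
    if (value < min_for_len[need] || value > 0x10FFFF)
        return false;                   /* overlong, or above this toy's range */

    *cp = value;
    return true;
}
```

Called on the bytes `C4 80 'B'` with all three bytes in range, the sketch reports U+0100 and an advance of 2; with `e` pointing just past the first byte it fails, leaving 0xFFFD in `*cp` and an advance of 1.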
The functions differ only in what flavor of UTF-8 they accept.
All reject syntactically invalid UTF-8. "strict_utf8_to_uv"
additionally rejects any UTF-8 that translates into a code point
that isn't specified by Unicode to be freely exchangeable,
namely the surrogate characters and non-character code points.
"c9strict_utf8_to_uv" instead uses the exchangeable definition
given by Unicode's Corrigendum #9, which rejects only
surrogates. "extended_utf8_to_uv" accepts all syntactically
valid UTF-8, as extended by Perl to allow 64-bit code points to
be encoded.
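The difference between the flavors can be sketched as a predicate on the decoded code point. This is an illustration only, not Perl's code: the `flavor` enum and `flavor_accepts` are hypothetical names, the code point is taken to be a Unicode (ASCII-platform) value, and the treatment of code points above 0x10FFFF by the two strict flavors is an assumption, since the text above does not spell it out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; not part of the Perl API. */
typedef enum { FLAVOR_EXTENDED, FLAVOR_C9STRICT, FLAVOR_STRICT } flavor;

/* Surrogates occupy U+D800 .. U+DFFF. */
static bool is_surrogate(uint64_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Non-characters are U+FDD0 .. U+FDEF plus the last two code points
 * of every plane (U+xxFFFE and U+xxFFFF). */
static bool is_noncharacter(uint64_t cp)
{
    if (cp > 0x10FFFF)
        return false;
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/* Would a function of the given flavor accept this code point?
 * ASSUMPTION: both strict flavors also reject anything above 0x10FFFF. */
static bool flavor_accepts(flavor f, uint64_t cp)
{
    switch (f) {
    case FLAVOR_EXTENDED:
        return true;
    case FLAVOR_C9STRICT:
        return cp <= 0x10FFFF && !is_surrogate(cp);
    case FLAVOR_STRICT:
        return cp <= 0x10FFFF && !is_surrogate(cp) && !is_noncharacter(cp);
    }
    return false;
}
```

Under these assumptions U+FFFE passes the C9 check but fails the strict one, while U+D800 fails both.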
"utf8_to_uv" is merely a synonym of "extended_utf8_to_uv" whose
name explicitly indicates that it accepts Perl-extended UTF-8.
Perl programs traditionally handle this by default.
Whenever input is rejected, an explanatory warning message is
raised, unless "utf8" warnings (or the appropriate subcategory)
are turned off. A given input sequence may contain multiple
malformations, giving rise to multiple warnings, as the
functions attempt to find and report on all malformations in a
sequence. All the possible malformations are listed in
"utf8_to_uv_msgs", with some examples of multiple ones for the
same sequence.
Often, "s" is an arbitrarily long string containing the UTF-8
representations of many code points in a row, and these
functions are called in the course of parsing "s" to find all
those code points.
If your code doesn't know how to deal with illegal input, as
would be typical of a low level routine, the loop could look
like:
    while (s < e) {
        UV cp;
        Size_t advance;
        (void) utf8_to_uv(s, e, &cp, &advance);
        <handle 'cp'>
        s += advance;
    }
A REPLACEMENT CHARACTER will be inserted everywhere that
malformed input occurs. Obviously, we aren't expecting such
outcomes, but your code will be protected from going off the
rails.
If you do have a plan for handling malformed input, you could
instead write:
    while (s < e) {
        UV cp;
        Size_t advance;
        if (UNLIKELY(! utf8_to_uv(s, e, &cp, &advance))) {
            <bail out or convert to handleable>
        }
        <handle 'cp'>
        s += advance;
    }
You may pass NULL to these functions instead of a pointer to
your "advance" variable. But the only legitimate case to do this
is if you are only examining the first character in "s", and
have no plans to ever look further. You could also advance by
using "UTF8SKIP", but this gives the correct result if and only
if the input is well-formed; and is extra work always, as the
functions have already done the equivalent work and return the
correct value in "advance", regardless of whether the input is
well-formed or not.
You must always pass a non-NULL pointer into which to store the
(first) code point "s" represents. If you don't care about this
value, you should be using one of the "isUTF8_CHAR" functions
instead.
Function where the UTF-8 is known to be valid
"valid_utf8_to_uvchr" is designed to be used where you generated
the UTF-8 yourself, so you know it is valid. It skips any error
checking, assuming the sequence of bytes starting at "s" is
encoded as Perl extended UTF-8 (or Perl extended UTF-EBCDIC),
reading as many bytes along "s" as necessary, and returning that
count in *retlen (if "retlen" is not NULL).
"utf8_to_uvchr" forms
These are the old form equivalents of "utf8_to_uv" (and its
synonym, "extended_utf8_to_uv"). They are "utf8_to_uvchr" and
"utf8_to_uvchr_buf". There is no old form equivalent of either
"strict_utf8_to_uv" or "c9strict_utf8_to_uv".
"utf8_to_uvchr" is DEPRECATED. Do NOT use it; it is a security
hole ready to bring destruction onto you and yours.
"utf8_to_uvchr_buf" is discouraged and may eventually become
deprecated.
"utf8_to_uvchr_buf" checks if the sequence of bytes starting at
"s" forms a complete, legal UTF-8 (or UTF-EBCDIC) sequence for a
code point. If so, it returns the code point value the sequence
represents, and *retlen will be set to its length, in bytes.
Thus, the next possible character in "s" begins at
"s + *retlen".
The function only examines as many bytes along "s" as are needed
to form a complete UTF-8 representation of a single code point.
Under no circumstances does it examine any byte beyond "e - 1".
If the sequence examined starting at "s" is not legal Perl
extended UTF-8, the translation fails, and the resultant
behavior unfortunately depends on if the warnings category
"utf8" is enabled or not.
If 'utf8' warnings are disabled
The Unicode REPLACEMENT CHARACTER is silently returned, and
*retlen is set (if "retlen" isn't "NULL") so that
("s" + *retlen) is the next possible position in "s" that
could begin a non-malformed character.
But note that it is ambiguous whether a REPLACEMENT
CHARACTER was actually in the input, or if this function
synthetically generated one. In the unlikely event that you
care, you'd have to examine the input to disambiguate.
If 'utf8' warnings are enabled
A warning will be displayed, and 0 is returned and *retlen
is set (if "retlen" isn't "NULL") to -1.
But note that 0 may also be returned if *s is a legal NUL
character. This means that you have to disambiguate a 0
return. You can do this by checking that the first byte of
"s" is indeed a NUL; or by making sure to always pass a
non-NULL "retlen" pointer, and by examining it.
Also note that should you wish to proceed with parsing "s",
you have no easy way of knowing where to start looking in it
for the next possible character. It would be better to have
instead called an equivalent function that provides this
information; any of the "utf8_to_uv" series, or
"utf8n_to_uvchr".
Because of these quirks, "utf8_to_uvchr_buf" is very difficult
to use correctly and handle all cases. Generally, you need to
bail out at the first failure it finds.
The deprecated "utf8_to_uvchr" behaves the same way as
"utf8_to_uvchr_buf" for well-formed input, and for the
malformations it is capable of finding, but doesn't find all of
them, and it can read beyond the end of the input buffer, which
is why it is deprecated.
The bottom line is use the "utf8_to_uv()" family of functions.
bool utf8_to_uv ( const U8 * const s,
const U8 * const e,
UV *cp_p, Size_t *advance_p)
bool Perl_utf8_to_uv ( const U8 * const s,
const U8 * const e,
UV *cp_p, Size_t *advance_p)
bool extended_utf8_to_uv( const U8 * const s,
const U8 * const e,
UV *cp_p, Size_t *advance_p)
bool Perl_extended_utf8_to_uv( const U8 * const s,
const U8 * const e,
UV *cp_p, Size_t *advance_p)
bool strict_utf8_to_uv ( const U8 * const s,
const U8 * const e,
UV *cp_p, Size_t *advance_p)
bool Perl_strict_utf8_to_uv ( const U8 * const s,
const U8 * const e,
UV *cp_p, Size_t *advance_p)
bool c9strict_utf8_to_uv( const U8 * const s,
const U8 * const e,
UV *cp_p, Size_t *advance_p)
bool Perl_c9strict_utf8_to_uv( const U8 * const s,
const U8 * const e,
UV *cp_p, Size_t *advance_p)
UV valid_utf8_to_uvchr( const U8 *s, STRLEN *retlen)
UV Perl_valid_utf8_to_uvchr( const U8 *s, STRLEN *retlen)
UV utf8_to_uvchr_buf ( const U8 *s, const U8 *send,
STRLEN *retlen)
UV Perl_utf8_to_uvchr_buf (pTHX_ const U8 *s, const U8 *send,
STRLEN *retlen)
UV utf8_to_uvchr ( const U8 *s, STRLEN *retlen)
UV Perl_utf8_to_uvchr (pTHX_ const U8 *s, STRLEN *retlen)
This isn't a hard reject, but the 83 lines of description of the "avoid these" functions (9 from "The functions whose names contain "to_uvchr"...", 74 from ""utf8_to_uvchr" forms") seem like a distraction from the documentation of the functions we do want people to use.
Unrelated, the list of functions from "The functions differ only..." could be (syntactically) a list.
3466be3 to 72418de
I changed the name of
72418de to c95a178
c95a178 to bd1b0f7
Except for the documentation nits, which aren't hard rejections, I'm otherwise happy with this.
bd1b0f7 to 2f3f716
a553a71 to 3b0b641
In testing this, I found a few bugs, which I have force-pushed corrections for. The compare button above hides most irrelevant code changes. Most of the changes were to the pod: either better wording, or places where I found I hadn't fully understood how things actually worked. As for the concerns raised about the pod, I realized that I'm still working on the tests, which have shown other, pretty obscure bugs in the base code of this function. More pull requests to come on that.
These are the inverse of the utf8_to_uv family in GH Perl#22541. They are just synonyms to existing functions, and are being added to reduce cognitive load, so if you know one name, you automatically can figure out the inverse.
There was a path through this function in which the parameter the caller asked to be set, &msgs, did not get set. Setting it at the beginning means not needing a second place to do so. Similarly for &errors: there is no path where it didn't get set, but it is cleaner to set it at the same time as msgs.
These two input parameters are for very specialized uses.
This is a one line function that just calls another function.
The helper adds no value
It was a macro, but had a long-name function as well. This converts to using two macros.
This is the first of several functions with the naming style utf8_to_uv(), which are designed to be used instead of the problematic current ones that are like utf8_to_uvchr(). The previous ones basically throw away crucial information in their returns upon failure, creating hassles for the caller. With them it is hard to recover from malformed input and continue parsing, which is what modern UTF-8 handlers have settled on doing.

Originally I planned to replace just the most problematic one, utf8_to_uvchr_buf(), but I realized that each level threw away information, so it would be better to start at the base level one, which utf8_to_uvchr_buf() eventually calls with a bunch of 0 parameters. Callers of the previous functions all had to disambiguate failure returns. This stops that at the root.

The new series all return a boolean as to their success, with a consistent API throughout. The old series had one outlier, again utf8_to_uvchr_buf(), which had a different calling convention and returns.

The basic logic in the base level function, which this commit handles, was sound. It just failed to return relevant information upon failure.

The new API has somewhat different formal parameter names and uses Size_t instead of STRLEN for one of the parameters. It also passes the end of string position instead of a length. A length is problematic because, when it would go negative, it instead becomes a huge positive number.

The old base function now merely calls the new one, and throws away the relevant information, as it always has.
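The point about lengths going negative can be illustrated in plain C. This is a standalone sketch, not code from the commit: `remaining_as_length` and `has_input` are hypothetical helpers, and `size_t` stands in for the unsigned STRLEN/Size_t types.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Length-style interface: the caller computes "bytes remaining".
 * If 'cur' has already moved past 'end', the negative difference is
 * converted to an unsigned type and becomes a huge positive number. */
static size_t remaining_as_length(const uint8_t *cur, const uint8_t *end)
{
    return (size_t)(end - cur);
}

/* End-pointer interface: a plain comparison stays meaningful even
 * when 'cur' has overshot 'end'. */
static bool has_input(const uint8_t *cur, const uint8_t *end)
{
    return cur < end;
}
```

With `cur` one byte past `end` inside the same buffer, `remaining_as_length` yields `SIZE_MAX`, whereas `has_input` simply reports that nothing is left.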
This is just utf8n_to_uvchr_error() with a more convenient API that is harder to misuse. New code should use this new function instead of the old.
This is just utf8n_to_uvchr() with a more convenient API that is harder to misuse. New code should use this new function instead of the old.
This performs the same function as utf8_to_uvchr_buf() with a more convenient API that is much harder to misuse. All code should convert to use this new function instead of the old.

The behavior of utf8_to_uvchr_buf() varies depending on whether <utf8> warnings are enabled or not, and no code in core actually takes that into account.

If warnings are enabled: a zero return can mean either success or failure, hence a zero return must be disambiguated. Success would come from the next character being a NUL. On failure, <retlen> will be -1, so it can't be used to find where to start parsing again.

If disabled: both the return and <retlen> will be usable values, but the return of the REPLACEMENT CHARACTER is ambiguous. It could mean failure, or it could mean that that was the next character in the input and was successfully decoded. It may very well not matter to you what the source of this particular value was. It likely means a failure somewhere. But there are occasions where you might care.

The new function returns true upon success; false on failure. And it is passed pointers to return the computed code point and byte length into. These values always contain the correct information, regardless of whether the input is malformed or not.

It is easy to test for failure in a conditional and then to take appropriate action. However, most often it seems the appropriate action is to use, going forward, the REPLACEMENT CHARACTER returned in failure cases. And if you don't care particularly if it succeeds or not, you just use it without testing the result. This happens when you are confident that the input is well-formed, or say in converting a string for display.
This is simpler than the existing one.
One of these is a more explicit synonym for that function; the other two restrict what's acceptable to Unicode's legal interchange or their C9 legal interchange.
3b0b641 to 65a0a5c
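This is designed to replace utf8_to_uvchr(), which is problematic. Its behavior varies depending on whether warnings are enabled or not, and no code in core actually takes that into account.
If warnings are enabled: a zero return can mean either success or failure, so it must be disambiguated; and on failure retlen will be -1, so it can't be used to find where to start parsing again.
If disabled: both the return and retlen will be usable values, but the return of the REPLACEMENT CHARACTER is ambiguous. It could mean failure, or it could mean that that was the next character in the input and was successfully decoded.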
utf8_to_uv() solves these. This commit includes a few changes to use it, to show it works. I have WIP that changes the rest of core to use it. I found that it makes coding simpler.
The new function returns true upon success; false on failure. And it is passed pointers to return the computed code point and byte length into. These values always contain the correct information, regardless of if the input is malformed or not.
It is easy to test for failure in a conditional and then to take appropriate action. However, most often it seems the appropriate action is to use, going forward, the REPLACEMENT CHARACTER returned in failure cases.
And if you don't care particularly if it succeeds or not, you just use it without testing the result. This happens when you are confident that the input is well-formed, or say in converting a string for display.
There is another function utf8_to_uv_flags() which merely extends this API for more flexible use, and doesn't offer the advantages over the existing API function that does the same thing. I included it because the main function is just a small wrapper around it, and the API is similar and some may prefer it.