-
Notifications
You must be signed in to change notification settings - Fork 6.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
utils: Add utf8_is_ valid function #80779
base: main
Are you sure you want to change the base?
Conversation
285b4b0
to
ed28e46
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this addition! It sounds useful to sometimes validate a string but not decode it, i.e. input validation.
Some suggestions were proposed above as feedback. Let me know if these suggestions make sense.
Maybe the solution is to completely decode the UTF-8 string, as then it would check for extra things:
- https://en.wikipedia.org/wiki/UTF-8#Surrogates
- https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings
And all other bizzare situations contained here:
https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
if (i + nbyte > len) { | ||
return false; | ||
} | ||
i += nbyte; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we skip nbyte
at once, this does not check if the UTF-8 string bytes skipped are valid (i.e. starts with 0b10xxxxxx
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that it doesn't continue through the entire string. Instead, it returns immediately when the conditions are not met. When the conditions are met, it only performs a single match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this comment is still valid.
Please ignore CI, which is related to function signature. I'll change the signature and function. |
30419f6
to
d68c1b4
Compare
lib/utils/utf8.c
Outdated
{ | ||
int i = 0, nbyte = 0; | ||
int len = strnlen(str, maxlen); | ||
unsigned char *buf = (unsigned char *)str; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be const as we are not modifying the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will trigger a ci warning, that is STATIC_CONST.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually you should never cast away const pointer as that would indicate that you could modify the thing the pointer points to. Usually compiler should warn about this, and I recall some static checkers will.
So it is hard to understand why
const unsigned char *buf = (const unsigned char *)str;
would give a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unresolving this. #80779 (comment) claims it's answered here, but I disagree. Accessing str
directly should be possible without the additional local buf
and should be possible to use const
.
Please provide additional information about the warning you see
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I won't add const, jukkar has already tested it. This will make my PR fail CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rruuaanng I am once again going to ask you not to resolve comments.
In the implementation you don't need the buf
at all, you can just use str
directly. Are you seeing issues with that approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
printf("%d %d %d", 0xe4, (const char)'\xe4', (const unsigned char)'\xe4');
output:
228 -28 228
I need to reiterate that you should read the above review carefully :)
Edit
I have said that when I did not use a method and optimization it was for a reason, not because I did not discover it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you share the log of the build warning/error you get? #80779 (comment) is just a screenshot of your editor, which doesn't really say much :)
The other utf8 functions works fine with const char *
, so I'm curious to see what the issue here is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, other implementations have problems. They do not check the extended bytes or the abnormal bytes. The above case has explained why I want to convert it to unsigned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I won't add const, jukkar has already tested it. This will make my PR fail CI.
I have not said anything about testing or accepting this, please don't claim that.
What I have said is that you should not cast away constness of a variable. I don't get why you talk about static variable in this case.
fbec8b3
to
bf92c6a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to see answers to unsolved issues. Also please do not mark questions resolved.
That is to say, the only question left about this PR is why it should be converted to unsigned. Edit |
I have pushed the original changes. If CI has any related issues, we can check it. |
lib/utils/utf8.c
Outdated
} else { | ||
if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE && | ||
str[i] >= SEQUENCE_MIN_LEN_2_BYTE) { | ||
nbyte = 2; | ||
} else if (str[i] <= SEQUENCE_MAX_LEN_3_BYTE && | ||
str[i] >= SEQUENCE_MIN_LEN_3_BYTE) { | ||
nbyte = 3; | ||
} else if (str[i] <= SEQUENCE_MAX_LEN_4_BYTE && | ||
str[i] >= SEQUENCE_MIN_LEN_4_BYTE) { | ||
nbyte = 4; | ||
} else { | ||
return false; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This else is redundant, please move code to the left:
} else { | |
if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE && | |
str[i] >= SEQUENCE_MIN_LEN_2_BYTE) { | |
nbyte = 2; | |
} else if (str[i] <= SEQUENCE_MAX_LEN_3_BYTE && | |
str[i] >= SEQUENCE_MIN_LEN_3_BYTE) { | |
nbyte = 3; | |
} else if (str[i] <= SEQUENCE_MAX_LEN_4_BYTE && | |
str[i] >= SEQUENCE_MIN_LEN_4_BYTE) { | |
nbyte = 4; | |
} else { | |
return false; | |
} | |
} | |
} | |
if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE && str[i] >= SEQUENCE_MIN_LEN_2_BYTE) { | |
nbyte = 2; | |
} else if (str[i] <= SEQUENCE_MAX_LEN_3_BYTE && str[i] >= SEQUENCE_MIN_LEN_3_BYTE) { | |
nbyte = 3; | |
} else if (str[i] <= SEQUENCE_MAX_LEN_4_BYTE && str[i] >= SEQUENCE_MIN_LEN_4_BYTE) { | |
nbyte = 4; | |
} else { | |
return false; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was over 100 columns and I had to wrap. So, I need to resolve this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last time: Do not resolve other peoples comments!
I do not follow your logic. Moving code to the left makes the 100 column "issue" less, not worse. Please explain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to
str[i] <= SEQUENCE_MAX_LEN_4_BYTE && str[i] >= SEQUENCE_MIN_LEN_4_BYTE)
and it made CI throw out that the code exceeded 100 lines.
Edit
If you want me to give you an explanation, I will change it and you can look at the warnings CI throws. And it's just a look, it's not important. We should focus on the functionality of the decoder, not other.
CI thrown
|
Interesting. It would appear that the libc implementation used for that test does not implement |
Yes, you can certainly use it. But as it is an extension to the standard C library you need to "ask" for its prototype to be included.
at the beginning of the file that uses strnlen (before any header inclusion). |
lib/utils/utf8.c
Outdated
bool utf8_is_valid(const char *str, size_t maxlen) | ||
{ | ||
size_t i = 0, nbyte = 0; | ||
size_t len = strnlen(str, maxlen); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the challenges with strnlen
, could we just do the following? Or would that be considered insecure?
size_t len = strnlen(str, maxlen); | |
size_t len = MIN(strlen(str), maxlen); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider that insecure. Why not do the check when looping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not do the check when looping?
Can you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while (i < maxlen) {
if (str[i] == '\0') {
break;
}
if (str[i] <= ASCII_CHAR && str[i] > '\0') {
i++;
continue;
}
// ...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry to tell you, that empty characters(0x00) also belong to utf8 characters. If you do this, the check will be cut off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check would already be cut off, as strlen
is simply looking for a \0
byte, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UTF-8 is an encoding that is used to represent multibyte character sets in a way that is backward-compatible with single-byte character sets. Another advantage of UTF-8 is that it ensures there are no NULL bytes in the data, with the exception of an actual NULL byte.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
[0x30, 0x31, 0x00, 0x30, 0xff] should be considered a valid UTF8 string as only the first 2 characters (0x30, 0x31) and the NULL terminator should be considered.
For example
char str[100];
str[0] = 0x30; /* or '0' */
str[1] = 0x31; /* or '1' */
str[2] = 0x00; /* or '\0' */
is a valid UTF8 string, even if octets [3..99] may be invalid, since the string stops at the NULL terminator.
I like the suggestion from @pdgendt as it also optimizes the check by avoiding the call to strlen
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. So we don't need to check the empty character into looping.
tests/unit/util/main.c
Outdated
{ | ||
/* Test whether the verification function meets the requirements */ | ||
zassert_true(utf8_is_valid("\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5", 11)); | ||
zassert_true(utf8_is_valid("\x00", 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would disagree. The tests should be readable and understandable by other developers. It is currently not IMO since we have reviewers commenting on this.
I am not a computer, so I have no idea if e.g. \xed\x9f\xbf
is in fact a valid UTF8 string, and thus cannot determine if the check here is correct or not. I could try and parse each individual line here manually, but that is a waste of time, when it's easier to add a description, a comment or use UTF8 characters like "Øåæ".
zassert_false(utf8_is_valid("\xc0\xaf", 2)); | ||
zassert_false(utf8_is_valid("\xc1\xbf", 2)); | ||
zassert_false(utf8_is_valid("\xc0\x80", 2)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have tests where maxlen
is larger and smaller than the provided string as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but we only need to add one. That is to test its behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we should have all 3 cases:
- maxlen smaller than strlen(str)
- maxlen equal to strlen(str) (covered already)
- maxlen larger than strlen(str)
Otherwise you aren't covering all checks in the code, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use strnlen
, I don't think we need to test when maxlen is greater than len (and when maxlen less than len).
Edit
By the way, I would like to reply here about the readability of the test strings. Most of the UTF8 string in the test do not have a readable representation. They are only used to test the decoder's recognition of boundary values (but those with readable characters, I've changed them)
I added
before |
37253ac
to
807aefd
Compare
If it passes CI, I hope you guys can approve it (it’s just a small feature, but it received so many reviews, which surprised me.) |
807aefd
to
0e8c910
Compare
Removing myself as reviewer. I trust the other reviewers to do remaining reviews, but I can't spend more time on this when the author keeps resolving my comments without proper resolutions, but looks like we are nearly there |
0e8c910
to
5e338ae
Compare
lib/utils/utf8.c
Outdated
if (str == NULL) { | ||
return false; | ||
} | ||
len = strnlen(str, maxlen); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, but I fail to see the need to have strnlen
or strlen
here. Worst case scenario is that we loop the entire string twice, without any benefit. If the first two characters aren't valid utf8, we could already stop after 2 steps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good proposal. Maybe I can use len as a parameter. I won't deal with the exception caused by the wrong len. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just use the maxlen
argument directly in the loop? See my comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will modify it later and change i < len
to i < maxlen
.
Edit
And remove strnlen
;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the first two characters aren't valid utf8, we could already stop after 2 steps.
if (str[i] <= ASCII_CHAR && str[i] >= '\0') {
i++;
continue;
} else {
if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE
&& str[i] >= SEQUENCE_MIN_LEN_2_BYTE) {
nbyte = 2;
} else if (str[i] <= SEQUENCE_MAX_LEN_3_BYTE
&& str[i] >= SEQUENCE_MIN_LEN_3_BYTE) {
nbyte = 3;
} else if (str[i] <= SEQUENCE_MAX_LEN_4_BYTE
&& str[i] >= SEQUENCE_MIN_LEN_4_BYTE) {
nbyte = 4;
} else {
return false;
}
}
I almost forgot, my implementation includes this function, it is in the recognition. It will exit the function when it matches non-UTF8. The same is true for the first character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, I have another question. You mentioned that you can put maxlen directly into the loop instead of using strnlen, which seems to be no different from directly using len as a function parameter. In fact, it can be len directly instead of maxlen. Right?
@pdgendt
This PR has been reviewed by multiple people. However, no perfect and specific solution has been discussed, and they are usually scattered. This makes it inconvenient for me to implement them, and I hope we can summarize and list them (do not accept requests for changes in code style, it is useless). I think if there are no exceptions in terms of functionality, it has met the requirements for merging, which can make it available to users as soon as possible. And find bugs in practice. For minimal performance optimization, I don't think we need it, and this can be distributed by others after release. |
I list the following points that I learned from the discussion:
Edit And, thank you for your review and time! My changes may not be suitable for everyone, because everyone has a different perspective on this matter. In fact, we can't force uniformity, but we can summarize a solution that everyone can agree on. |
5e338ae
to
2ff2475
Compare
Add 'utf8_is_valid' function to check if a given string is utf8 encoded. Signed-off-by: James Roy <rruuaanng@outlook.com>
2ff2475
to
0eabadd
Compare
@pdgendt I've changes them, Please review again! :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nitpickery, but one design issue I think needs to be addressed: this code doesn't inspect the actual code point itself, so what does "valid" mean here?
Specifically, this will return "true" for the sequence { 0xf0, 0x80, 0x80, 0xc1 }
, which is a "correct" but non-canonical and very surprising encoding of the ASCII character "A".
Most serious applications view this as bad, as it very often escapes string validation and escaping code and causes security bugs. See the relevant section in the Wikipedia page: https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings
Now, I'm not personally a security nut and won't take a strong position on the way this is "supposed" to work. But absolutely if we don't do the "Standard Security Best Practices" thing we should call that fact out very loudly in the docs for the function.
-1 just to make sure someone adds the docs. Feel free to override and merge if you can't find me to remove it.
while (i < len) { | ||
if (str[i] == '\0') { | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: "while (str[i] != '\0') {
is IMHO clearer and four lines shorter.
Edit: while (i < len && str[i] != '\0') {
that is. Which brings up another nitpick: this will return true
if the null-terminated prefix of the passed string is valid utf-8, it doesn't necessarily validate the whole string.
Is a buffer with trailing garbage "valid" or not? Documentation should specify.
if (str[i] == '\0') { | ||
break; | ||
} | ||
if (str[i] <= ASCII_CHAR && str[i] > '\0') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: the > '\0'
condition is disallowed by your test above and is dead code, eliminate.
continue; | ||
} else { | ||
if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE | ||
&& str[i] >= SEQUENCE_MIN_LEN_2_BYTE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason for using comparisons here? UTF-8 sequences are canonically described as being selected by bitmasks (e.g. "0b1110xxxx" indicates a three byte sequence), and optimized code dealing with it usually exploits things like specialized instructions to count the leading one bits. Doing it by mapping those to range comparisons isn't wrong, it just looks weird to my eyes and seems likely to be needlessly large.
FWIW, my immediate thought here would be to try something like (completely untested!):
nbytes = MIN(4, __builtin_clz(~str[i]));
Add 'utf8_check' function to check if a given string is utf8 encoded.