Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utils: Add utf8_is_ valid function #80779

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rruuaanng
Copy link
Collaborator

Add 'utf8_check' function to check if a given string is utf8 encoded.

Copy link
Collaborator

@josuah josuah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this addition! It sounds useful to sometimes validate a string but not decode it, i.e. input validation.

Some suggestions were proposed above as feedback. Let me know if these suggestions make sense.

Maybe the solution is to completely decode the UTF-8 string, as then it would check for extra things:

And all other bizzare situations contained here:
https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

lib/utils/utf8.c Outdated Show resolved Hide resolved
lib/utils/utf8.c Outdated Show resolved Hide resolved
lib/utils/utf8.c Outdated Show resolved Hide resolved
if (i + nbyte > len) {
return false;
}
i += nbyte;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we skip nbyte at once, this does not check if the UTF-8 string bytes skipped are valid (i.e. starts with 0b10xxxxxx).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that it doesn't continue through the entire string. Instead, it returns immediately when the conditions are not met. When the conditions are met, it only performs a single match.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this comment is still valid.

include/zephyr/sys/util.h Outdated Show resolved Hide resolved
include/zephyr/sys/util.h Outdated Show resolved Hide resolved
include/zephyr/sys/util.h Outdated Show resolved Hide resolved
@rruuaanng
Copy link
Collaborator Author

Please ignore CI, which is related to function signature. I'll change the signature and function.

@rruuaanng rruuaanng force-pushed the utf8-check branch 4 times, most recently from 30419f6 to d68c1b4 Compare November 2, 2024 15:38
lib/utils/utf8.c Outdated Show resolved Hide resolved
lib/utils/utf8.c Outdated
{
int i = 0, nbyte = 0;
int len = strnlen(str, maxlen);
unsigned char *buf = (unsigned char *)str;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be const as we are not modifying the string.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will trigger a ci warning, that is STATIC_CONST.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually you should never cast away const pointer as that would indicate that you could modify the thing the pointer points to. Usually compiler should warn about this, and I recall some static checkers will.
So it is hard to understand why

const unsigned char *buf = (const unsigned char *)str;

would give a warning.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unresolving this. #80779 (comment) claims it's answered here, but I disagree. Accessing str directly should be possible without the additional local buf and should be possible to use const.

Please provide additional information about the warning you see

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't add const, jukkar has already tested it. This will make my PR fail CI.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rruuaanng I am once again going to ask you not to resolve comments.

In the implementation you don't need the buf at all, you can just use str directly. Are you seeing issues with that approach?

Copy link
Collaborator Author

@rruuaanng rruuaanng Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

printf("%d %d %d", 0xe4, (const char)'\xe4', (const unsigned char)'\xe4');

output:
228 -28 228

I need to reiterate that you should read the above review carefully :)

Edit
I have said that when I did not use a method and optimization it was for a reason, not because I did not discover it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you share the log of the build warning/error you get? #80779 (comment) is just a screenshot of your editor, which doesn't really say much :)

The other utf8 functions works fine with const char *, so I'm curious to see what the issue here is

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, other implementations have problems. They do not check the extended bytes or the abnormal bytes. The above case has explained why I want to convert it to unsigned.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't add const, jukkar has already tested it. This will make my PR fail CI.

I have not said anything about testing or accepting this, please don't claim that.

What I have said is that you should not cast away constness of a variable. I don't get why you talk about static variable in this case.

include/zephyr/sys/util.h Outdated Show resolved Hide resolved
@rruuaanng rruuaanng force-pushed the utf8-check branch 2 times, most recently from fbec8b3 to bf92c6a Compare November 3, 2024 02:22
lib/utils/utf8.c Outdated Show resolved Hide resolved
Copy link
Member

@jukkar jukkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to see answers to unsolved issues. Also please do not mark questions resolved.

@rruuaanng
Copy link
Collaborator Author

rruuaanng commented Nov 6, 2024

That is to say, the only question left about this PR is why it should be converted to unsigned.

Edit
And, why not use strnlen? I will initiate modifications to the above two later, and we can analyze the situation of ci together.

@rruuaanng rruuaanng changed the title utils: Add utf8_check function utils: Add utf8_is_ valid function Nov 6, 2024
@rruuaanng
Copy link
Collaborator Author

I have pushed the original changes. If CI has any related issues, we can check it.

lib/utils/utf8.c Outdated
Comment on lines 97 to 117
} else {
if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE &&
str[i] >= SEQUENCE_MIN_LEN_2_BYTE) {
nbyte = 2;
} else if (str[i] <= SEQUENCE_MAX_LEN_3_BYTE &&
str[i] >= SEQUENCE_MIN_LEN_3_BYTE) {
nbyte = 3;
} else if (str[i] <= SEQUENCE_MAX_LEN_4_BYTE &&
str[i] >= SEQUENCE_MIN_LEN_4_BYTE) {
nbyte = 4;
} else {
return false;
}
}
Copy link
Contributor

@rettichschnidi rettichschnidi Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This else is redundant, please move code to the left:

Suggested change
} else {
if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE &&
str[i] >= SEQUENCE_MIN_LEN_2_BYTE) {
nbyte = 2;
} else if (str[i] <= SEQUENCE_MAX_LEN_3_BYTE &&
str[i] >= SEQUENCE_MIN_LEN_3_BYTE) {
nbyte = 3;
} else if (str[i] <= SEQUENCE_MAX_LEN_4_BYTE &&
str[i] >= SEQUENCE_MIN_LEN_4_BYTE) {
nbyte = 4;
} else {
return false;
}
}
}
if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE && str[i] >= SEQUENCE_MIN_LEN_2_BYTE) {
nbyte = 2;
} else if (str[i] <= SEQUENCE_MAX_LEN_3_BYTE && str[i] >= SEQUENCE_MIN_LEN_3_BYTE) {
nbyte = 3;
} else if (str[i] <= SEQUENCE_MAX_LEN_4_BYTE && str[i] >= SEQUENCE_MIN_LEN_4_BYTE) {
nbyte = 4;
} else {
return false;
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was over 100 columns and I had to wrap. So, I need to resolve this comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last time: Do not resolve other peoples comments!

I do not follow your logic. Moving code to the left makes the 100 column "issue" less, not worse. Please explain.

Copy link
Collaborator Author

@rruuaanng rruuaanng Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to

str[i] <= SEQUENCE_MAX_LEN_4_BYTE && str[i] >= SEQUENCE_MIN_LEN_4_BYTE) 

and it made CI throw out that the code exceeded 100 lines.

Edit

If you want me to give you an explanation, I will change it and you can look at the warnings CI throws. And it's just a look, it's not important. We should focus on the functionality of the decoder, not other.

@rruuaanng
Copy link
Collaborator Author

CI thrown

zephyr/CMakeFiles/zephyr.dir/lib/utils/utf8.c.obj -c /__w/zephyr/zephyr/lib/utils/utf8.c
/__w/zephyr/zephyr/lib/utils/utf8.c:91:15: error: call to undeclared function 'strnlen'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
        size_t len = strnlen(str, maxlen);
                     ^
/__w/zephyr/zephyr/lib/utils/utf8.c:91:15: note: did you mean 'strlen'?
/usr/include/string.h:407:15: note: 'strlen' declared here
extern size_t strlen (const char *__s)
              ^
1 error generated.
ninja: build stopped: subcommand failed.

@Thalley
Copy link
Collaborator

Thalley commented Nov 8, 2024

CI thrown

zephyr/CMakeFiles/zephyr.dir/lib/utils/utf8.c.obj -c /__w/zephyr/zephyr/lib/utils/utf8.c
/__w/zephyr/zephyr/lib/utils/utf8.c:91:15: error: call to undeclared function 'strnlen'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
        size_t len = strnlen(str, maxlen);
                     ^
/__w/zephyr/zephyr/lib/utils/utf8.c:91:15: note: did you mean 'strlen'?
/usr/include/string.h:407:15: note: 'strlen' declared here
extern size_t strlen (const char *__s)
              ^
1 error generated.
ninja: build stopped: subcommand failed.

Interesting. It would appear that the libc implementation used for that test does not implement strnlen. The platform seems to be the native_sim. @aescolar can we not use strnlen on native_sim?

@aescolar
Copy link
Member

aescolar commented Nov 8, 2024

Interesting. It would appear that the libc implementation used for that test does not implement strnlen. The platform seems to be the native_sim. @aescolar can we not use strnlen on native_sim?

Yes, you can certainly use it. But as it is an extension to the standard C library you need to "ask" for its prototype to be included.
You can just add

#undef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L

at the beginning of the file that uses strnlen (before any header inclusion).

lib/utils/utf8.c Outdated
bool utf8_is_valid(const char *str, size_t maxlen)
{
size_t i = 0, nbyte = 0;
size_t len = strnlen(str, maxlen);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the challenges with strnlen, could we just do the following? Or would that be considered insecure?

Suggested change
size_t len = strnlen(str, maxlen);
size_t len = MIN(strlen(str), maxlen);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider that insecure. Why not do the check when looping?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not do the check when looping?

Can you elaborate?

Copy link
Collaborator

@pdgendt pdgendt Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

	while (i < maxlen) {
		if (str[i] == '\0') {
			break;
		}

		if (str[i] <= ASCII_CHAR && str[i] > '\0') {
			i++;
			continue;
		}

		// ...
	}

Copy link
Collaborator Author

@rruuaanng rruuaanng Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry to tell you, that empty characters(0x00) also belong to utf8 characters. If you do this, the check will be cut off.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check would already be cut off, as strlen is simply looking for a \0 byte, no?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UTF-8 is an encoding that is used to represent multibyte character sets in a way that is backward-compatible with single-byte character sets. Another advantage of UTF-8 is that it ensures there are no NULL bytes in the data, with the exception of an actual NULL byte.

From https://www.oreilly.com/library/view/secure-programming-cookbook/0596003943/ch03s12.html#:~:text=Discussion,of%20an%20actual%20NULL%20byte.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

[0x30, 0x31, 0x00, 0x30, 0xff] should be considered a valid UTF8 string as only the first 2 characters (0x30, 0x31) and the NULL terminator should be considered.

For example

char str[100];
str[0] = 0x30; /* or '0' */
str[1] = 0x31; /* or '1' */ 
str[2] = 0x00; /* or '\0' */

is a valid UTF8 string, even if octets [3..99] may be invalid, since the string stops at the NULL terminator.

I like the suggestion from @pdgendt as it also optimizes the check by avoiding the call to strlen.

Copy link
Collaborator Author

@rruuaanng rruuaanng Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. So we don't need to check the empty character into looping.

lib/utils/utf8.c Outdated Show resolved Hide resolved
{
/* Test whether the verification function meets the requirements */
zassert_true(utf8_is_valid("\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5", 11));
zassert_true(utf8_is_valid("\x00", 1));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would disagree. The tests should be readable and understandable by other developers. It is currently not IMO since we have reviewers commenting on this.

I am not a computer, so I have no idea if e.g. \xed\x9f\xbf is in fact a valid UTF8 string, and thus cannot determine if the check here is correct or not. I could try and parse each individual line here manually, but that is a waste of time, when it's easier to add a description, a comment or use UTF8 characters like "Øåæ".

include/zephyr/sys/util.h Show resolved Hide resolved
zassert_false(utf8_is_valid("\xc0\xaf", 2));
zassert_false(utf8_is_valid("\xc1\xbf", 2));
zassert_false(utf8_is_valid("\xc0\x80", 2));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have tests where maxlen is larger and smaller than the provided string as well

Copy link
Collaborator Author

@rruuaanng rruuaanng Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but we only need to add one. That is to test its behavior.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should have all 3 cases:

  1. maxlen smaller than strlen(str)
  2. maxlen equal to strlen(str) (covered already)
  3. maxlen larger than strlen(str)

Otherwise you aren't covering all checks in the code, right?

Copy link
Collaborator Author

@rruuaanng rruuaanng Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use strnlen, I don't think we need to test when maxlen is greater than len (and when maxlen less than len).

Edit
By the way, I would like to reply here about the readability of the test strings. Most of the UTF8 string in the test do not have a readable representation. They are only used to test the decoder's recognition of boundary values ​​(but those with readable characters, I've changed them)

@rruuaanng
Copy link
Collaborator Author

Interesting. It would appear that the libc implementation used for that test does not implement strnlen. The platform seems to be the native_sim. @aescolar can we not use strnlen on native_sim?

Yes, you can certainly use it. But as it is an extension to the standard C library you need to "ask" for its prototype to be included. You can just add

#undef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L

at the beginning of the file that uses strnlen (before any header inclusion).

I added

#undef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L

before #include <string.h>. Hope it works.

lib/utils/utf8.c Outdated Show resolved Hide resolved
@rruuaanng
Copy link
Collaborator Author

If it passes CI, I hope you guys can approve it (it’s just a small feature, but it received so many reviews, which surprised me.)

lib/utils/utf8.c Outdated Show resolved Hide resolved
@Thalley Thalley self-requested a review November 11, 2024 14:49
@Thalley
Copy link
Collaborator

Thalley commented Nov 11, 2024

Removing myself as reviewer. I trust the other reviewers to do remaining reviews, but I can't spend more time on this when the author keeps resolving my comments without proper resolutions, but looks like we are nearly there

lib/utils/utf8.c Outdated
if (str == NULL) {
return false;
}
len = strnlen(str, maxlen);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, but I fail to see the need to have strnlen or strlen here. Worst case scenario is that we loop the entire string twice, without any benefit. If the first two characters aren't valid utf8, we could already stop after 2 steps.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good proposal. Maybe I can use len as a parameter. I won't deal with the exception caused by the wrong len. WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use the maxlen argument directly in the loop? See my comment.

Copy link
Collaborator Author

@rruuaanng rruuaanng Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will modify it later and change i < len to i < maxlen.

Edit

And remove strnlen ;)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the first two characters aren't valid utf8, we could already stop after 2 steps.

if (str[i] <= ASCII_CHAR && str[i] >= '\0') {
			i++;
			continue;
		} else {
			if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE
			 && str[i] >= SEQUENCE_MIN_LEN_2_BYTE) {
				nbyte = 2;
			} else if (str[i] <= SEQUENCE_MAX_LEN_3_BYTE
					&& str[i] >= SEQUENCE_MIN_LEN_3_BYTE) {
				nbyte = 3;
			} else if (str[i] <= SEQUENCE_MAX_LEN_4_BYTE
					&& str[i] >= SEQUENCE_MIN_LEN_4_BYTE) {
				nbyte = 4;
			} else {
				return false;
			}
		}

I almost forgot, my implementation includes this function, it is in the recognition. It will exit the function when it matches non-UTF8. The same is true for the first character.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, I have another question. You mentioned that you can put maxlen directly into the loop instead of using strnlen, which seems to be no different from directly using len as a function parameter. In fact, it can be len directly instead of maxlen. Right?
@pdgendt

@rruuaanng
Copy link
Collaborator Author

This PR has been reviewed by multiple people. However, no perfect and specific solution has been discussed, and they are usually scattered. This makes it inconvenient for me to implement them, and I hope we can summarize and list them (do not accept requests for changes in code style, it is useless). I think if there are no exceptions in terms of functionality, it has met the requirements for merging, which can make it available to users as soon as possible. And find bugs in practice. For minimal performance optimization, I don't think we need it, and this can be distributed by others after release.

@rruuaanng
Copy link
Collaborator Author

rruuaanng commented Nov 15, 2024

I list the following points that I learned from the discussion:

  1. Test items should be annotated to explain their purpose
  2. maxlen is not necessary, you can just provide the len parameter and let the user pass the string length. If the length is abnormal, the user will be responsible for any abnormal behavior.

Edit

And, thank you for your review and time! My changes may not be suitable for everyone, because everyone has a different perspective on this matter. In fact, we can't force uniformity, but we can summarize a solution that everyone can agree on.

lib/utils/utf8.c Outdated Show resolved Hide resolved
lib/utils/utf8.c Outdated Show resolved Hide resolved
Add 'utf8_is_valid' function to check if a given string is utf8 encoded.

Signed-off-by: James Roy <rruuaanng@outlook.com>
@rruuaanng
Copy link
Collaborator Author

@pdgendt I've changes them, Please review again! :)

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nitpickery, but one design issue I think needs to be addressed: this code doesn't inspect the actual code point itself, so what does "valid" mean here?

Specifically, this will return "true" for the sequence { 0xf0, 0x80, 0x80, 0xc1 }, which is a "correct" but non-canonical and very surprising encoding of the ASCII character "A".

Most serious applications view this as bad, as it very often escapes string validation and escaping code and causes security bugs. See the relevant section in the Wikipedia page: https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings

Now, I'm not personally a security nut and won't take a strong position on the way this is "supposed" to work. But absolutely if we don't do the "Standard Security Best Practices" thing we should call that fact out very loudly in the docs for the function.

-1 just to make sure someone adds the docs. Feel free to override and merge if you can't find me to remove it.

while (i < len) {
if (str[i] == '\0') {
break;
}
Copy link
Contributor

@andyross andyross Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: "while (str[i] != '\0') { is IMHO clearer and four lines shorter.

Edit: while (i < len && str[i] != '\0') { that is. Which brings up another nitpick: this will return true if the null-terminated prefix of the passed string is valid utf-8, it doesn't necessarily validate the whole string.

Is a buffer with trailing garbage "valid" or not? Documentation should specify.

if (str[i] == '\0') {
break;
}
if (str[i] <= ASCII_CHAR && str[i] > '\0') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: the > '\0' condition is disallowed by your test above and is dead code, eliminate.

continue;
} else {
if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE
&& str[i] >= SEQUENCE_MIN_LEN_2_BYTE) {
Copy link
Contributor

@andyross andyross Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for using comparisons here? UTF-8 sequences are canonically described as being selected by bitmasks (e.g. "0b1110xxxx" indicates a three byte sequence), and optimized code dealing with it usually exploits things like specialized instructions to count the leading one bits. Doing it by mapping those to range comparisons isn't wrong, it just looks weird to my eyes and seems likely to be needlessly large.

FWIW, my immediate thought here would be to try something like (completely untested!):

nbytes = MIN(4, __builtin_clz(~str[i]));

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Base OS Base OS Library (lib/os) area: Testsuite Testsuite area: Utilities
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants