bridge: Fix text channels with split UTF8 chars across frames #20628

martinpitt · 2024-06-20T09:31:18Z

Fixes #19235

src/cockpit/channel.py

allisonkarlitskaya

I would have structured this a bit differently:

- .send_data(data: bytes): leave it alone, but add a comment that the caller is responsible for sending properly-formed utf8 data on non-binary channels.
- .send_text(data: str): essentially self.send_bytes(data.encode())
- .send_magic_stuff(data: bytes): if this is a binary channel, just .send_data(), if it's text then incrementally decode data, passing it to .send_text().

I feel like there's a few places where we would like to use .send_text() — grepping around in the channels shows a few cases of .send_data(x.encode()) and many cases of .send_data(b'') which are probably meant as text.

It might also make sense to rename the existing .send_data() to .send_bytes() and then the new "magic" function would be .send_data(). I'm not too worried about the names.

The main thrust of this is that someone calling self.send_data(x.encode()) on a text channel should not then have their data re-decoded, only to be encoded yet again... There should be a .send_text() to avoid that, and this is what the "magic" function should also call internally on the "string" that it gets from the incremental decoder.
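The three-way split proposed above can be sketched as follows. This is a minimal illustration, not the code from this PR: `SketchChannel` and its `frames` list are hypothetical stand-ins for the real `Channel` and its transport, and the method names follow the rename floated in the comment (`send_bytes()` for the raw path, `send_data()` for the decoding path). Skipping empty decode results is a choice made for this sketch only.

```python
import codecs
from typing import List, Optional


class SketchChannel:
    """Hypothetical stand-in for a channel; outgoing frames land in a list."""

    def __init__(self, is_binary: bool) -> None:
        self.is_binary = is_binary
        self.frames: List[bytes] = []
        # Only text channels need an incremental decoder.
        self.decoder: Optional[codecs.IncrementalDecoder] = (
            None if is_binary else codecs.getincrementaldecoder('utf-8')()
        )

    def send_bytes(self, data: bytes) -> None:
        # Raw path: the caller is responsible for well-formed UTF-8 on text channels.
        self.frames.append(data)

    def send_text(self, data: str) -> None:
        # A str always encodes to valid UTF-8, so no re-decoding is needed.
        self.send_bytes(data.encode())

    def send_data(self, data: bytes) -> None:
        # "Magic" path: bytes of unknown framing.
        if self.decoder is None:
            self.send_bytes(data)
            return
        # The decoder keeps an incomplete trailing character for the next call.
        text = self.decoder.decode(data)
        if text:  # sketch-only choice: do not emit empty frames
            self.send_text(text)
```

With this shape, a caller that already holds a `str` goes through `send_text()` and is encoded exactly once; only `send_data()` pays for the decode and re-encode.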

martinpitt · 2024-06-20T11:28:38Z

@allisonkarlitskaya ack, thanks -- I'll look into rewriting that. (I had a gut feeling to do this, but first wanted to check for regressions -- apparently there are none)

martinpitt · 2024-06-20T14:37:20Z

Reworked as above to split into send_{bytes,text,data}(). I still have that pesky exception message problem from above, but that's too hairy for the rest of the day. Let's get an opinion from the bots in the meantime.

allisonkarlitskaya

I did a survey of channels that are actually sending data other than .send_json():

- fsread provides possibly-split-character data via its do_yield_data() generator. It would be easy to change the open() call in there to change its mode based on if the channel is binary or not
- packages is similar, but takes a different path: AsyncChannel.sendfile(). It would probably be more difficult to adapt.
- HTTP is difficult: I think there's no way to get text data out of the HTTP client API.
- the echo channel sends whatever we send to it. It's a good candidate for send_bytes().
- stream goes via ProtocolChannel. Your first patch moved that to .send_bytes() and didn't move it back to .send_data() later — that means a stream channel opened in text mode could still send invalid utf8.

So in summary: although we could modify fsread and packages to give us text data, because of HTTP and because of the stream channel, I think we're more or less forced to do the incremental coder for ourselves, at least once, and Channel seems to be the sensible spot for it.
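For illustration, the fsread option from the survey (switching the `open()` mode depending on whether the channel is binary) might look roughly like this. It is a sketch under assumptions: `yield_chunks` is a hypothetical generator, not the real `do_yield_data()`, and the two-unit chunk size is chosen only to force a split.

```python
import os
import tempfile
from typing import Iterator, Union


def yield_chunks(path: str, binary: bool, size: int) -> Iterator[Union[bytes, str]]:
    """Hypothetical generator: read `size` bytes (binary) or `size` characters (text)."""
    if binary:
        stream = open(path, 'rb')
    else:
        # newline='' keeps line endings untranslated, like a byte-for-byte read.
        stream = open(path, 'r', encoding='utf-8', newline='')
    with stream:
        while True:
            chunk = stream.read(size)
            if not chunk:
                return
            yield chunk


# Demo: '€' is three bytes in UTF-8, so two-byte reads cut it apart,
# while two-character reads in text mode never do.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write('a€b€'.encode())
try:
    binary_chunks = list(yield_chunks(tmp.name, binary=True, size=2))
    text_chunks = list(yield_chunks(tmp.name, binary=False, size=2))
finally:
    os.unlink(tmp.name)
```

This only helps sources that can hand out text themselves, which is why the summary above still ends up with a decoder in `Channel` for the HTTP and stream cases.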

allisonkarlitskaya · 2024-06-20T15:08:44Z

src/cockpit/channel.py

@@ -219,6 +221,9 @@ def ready(self, **kwargs: JsonValue) -> None:
        self.send_control(command='ready', **kwargs)

    def done(self) -> None:
+        # any residue from partial send_data() frames?
+        if self.decoder and self.decoder.getstate()[0]:


You can call decoder.decode(b'', final=True) if memory serves, and that should throw a UnicodeDecodeError which you can handle in the same way as you do below.

I'm aware of that, but that does more work than checking the state. It is a bit more consistent though. No strong opinion.

See below, reworked into a helper, so both places now use the same code.
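Both ways of detecting leftover bytes can be compared on a plain `codecs` decoder. A small sketch: the two helper names are made up for illustration, and it relies on the first element of `getstate()` being the buffer of still-undecoded input.

```python
import codecs


def residue_via_state(decoder: codecs.IncrementalDecoder) -> bool:
    # Cheap check: look at the buffered, not yet decoded bytes.
    return bool(decoder.getstate()[0])


def residue_via_final(decoder: codecs.IncrementalDecoder) -> bool:
    # Consistent check: ask the decoder to finish and see whether it objects.
    try:
        decoder.decode(b'', final=True)
    except UnicodeDecodeError:
        return True
    return False


partial = codecs.getincrementaldecoder('utf-8')()
partial.decode('€'.encode()[:2])  # two of the three bytes of '€'

clean = codecs.getincrementaldecoder('utf-8')()
clean.decode('€'.encode())  # the complete character
```

The second form has the advantage that the resulting `UnicodeDecodeError` carries the same kind of message as a mid-stream decoding failure, so both can be reported the same way.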

martinpitt · 2024-06-21T05:14:45Z

> the echo channel sends whatever we send to it. It's a good candidate for send_bytes().

I thought the same -- it's a case of GIGO, and it's unlikely that it's accidentally fed with invalid UTF-8. And if so, that'd ideally already fail on the sender side, not on the receiver. So I left it alone.

> that means a stream channel opened in text mode could still send invalid utf8.

Oops, will fix. Thanks!

allisonkarlitskaya · 2024-06-21T07:54:58Z

src/cockpit/channel.py

@@ -125,6 +126,7 @@ def do_control(self, command: str, message: JsonObject) -> None:
            self._ack_bytes = get_enum(message, 'send-acks', ['bytes'], None) is not None
            self.group = get_str(message, 'group', 'default')
            self.is_binary = get_enum(message, 'binary', ['raw'], None) is not None
+            self.decoder: 'codecs.IncrementalDecoder | None' = None


This is weird. Please hang it directly on the class.

How do you mean? Putting it into __init__? In case you meant a class attribute: that would be very much wrong of course.

Hanging default values directly on the class is very common practice and we do it all over the place. Python even invented ClassVar that you need to explicitly use to annotate the cases of that which you actually intend to be treated as actual class attributes.

Sorry, I hate this behavior. I really want to keep it here, where it is crystal clear and explicit that it is an instance variable.

I moved the type declaration into the class, which does make sense.

FTR, I consider the initialization of channel and group to '' at the class level a potential bug and unsafe practice. Accessing these fields in a non-open channel would better raise an AttributeError than silently succeeding and misbehaving. But that's the "agree to disagree" which we elaborated at length in the chat.
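The difference being argued about can be shown in a few lines. This is an illustrative sketch with hypothetical class names, not the bridge's `Channel`: a class-level default is readable before the channel is opened, while a bare class-level annotation declares the type without creating any attribute.

```python
class WithClassDefault:
    # Class-level default: every instance sees '' until it assigns its own value.
    group = ''

    def open(self, group: str) -> None:
        self.group = group  # creates an instance attribute; the class value is untouched


class AnnotationOnly:
    # Bare annotation: documents the type, but creates no attribute at all.
    group: str

    def open(self, group: str) -> None:
        self.group = group


def reads_group(obj: object) -> bool:
    """Return True if `obj.group` can be read, False if it raises AttributeError."""
    try:
        getattr(obj, 'group')
    except AttributeError:
        return False
    return True
```

Moving only the type declaration into the class, as described above, corresponds to the `AnnotationOnly` shape: type checkers see the attribute, but reading it on an unopened object still fails loudly.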

allisonkarlitskaya · 2024-06-21T07:55:36Z

src/cockpit/channel.py

+        if self.decoder:
+            try:
+                self.decoder.decode(b'', final=True)
+            except UnicodeDecodeError as exc:


This looks nicer, thanks.

I wonder if you want to DRY this a bit:

add a final: boolean = False kwarg to send_data(), wired through to the decoder

if final and not block: return to avoid sending an empty frame if the call was just for final=True purposes

make .done() call send_data(b'', final=True)

I think that, for utf-8 at least, self.decoder.decode(b'', final=True) should always return an empty string (or raise) but the idea that maybe it couldn't is bothering me... 😅

This is "matter of taste" territory.

I'm not a fan of exposing the final flag to the external API, as that's just an internal helper. I moved it into a separate private method.
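A private helper along those lines could look like the sketch below. The names `TextChannelSketch` and `_decode`, the `sent` list, and the minimal `ChannelError` class are all assumptions made for this example; only the `'protocol-error'` problem code and the `final=True` call are taken from the snippets under review.

```python
import codecs
from typing import List


class ChannelError(Exception):
    """Hypothetical stand-in for the bridge's ChannelError."""

    def __init__(self, problem: str, message: str = '') -> None:
        super().__init__(message or problem)
        self.problem = problem
        self.message = message


class TextChannelSketch:
    def __init__(self) -> None:
        self.decoder = codecs.getincrementaldecoder('utf-8')()
        self.sent: List[str] = []

    def _decode(self, data: bytes, final: bool = False) -> str:
        # Single place that turns decoding failures into a channel error.
        try:
            return self.decoder.decode(data, final=final)
        except UnicodeDecodeError as exc:
            raise ChannelError('protocol-error', message=str(exc)) from exc

    def send_data(self, data: bytes) -> None:
        text = self._decode(data)
        if text:
            self.sent.append(text)

    def done(self) -> None:
        # Raises if an incomplete character is still buffered.
        self._decode(b'', final=True)
```

Keeping `final` out of the public signature means callers cannot accidentally end the stream early, while `send_data()` and `done()` still share one error path.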

allisonkarlitskaya · 2024-06-21T07:59:32Z

src/cockpit/channel.py

+            block = self.decoder.decode(data)
+        except UnicodeDecodeError as exc:
+            raise ChannelError('protocol-error', message=str(exc)) from exc
+        return self.send_bytes(block.encode())


...or .send_text()?

allisonkarlitskaya · 2024-06-21T08:00:14Z

src/cockpit/channel.py

+
+        Similar to `send_bytes`, but for text data.  The data is sent as UTF-8 encoded bytes.
+        """
+        return self.send_bytes(data.encode("UTF-8"))


Sometimes you use .decode() and .encode() with an argument, sometimes without. utf-8 is the default value, so I think it always makes sense without.

Done as a separate commit throughout the code.

This factorizes a few `.encode()` calls, and clarifies which callers are guaranteed to send UTF-8 text, and which can send arbitrary bytes. This leaves AsyncChannel and GeneratorChannel as the places which potentially send non-UTF-8 data to text channels. We'll deal with them in the next commit to not change too many things at once. Move `send_json()` below `send_text()` to keep all `send_*()` methods together.

Reintroduce `Channel.send_data()` as an API to do incremental UTF-8 decoding for text channel messages, so that it avoids breaking multi-byte UTF-8 characters. These frames were previously ignored completely, as a text web socket just refuses these messages. cockpit-ws warns about them as well:

```c
g_critical ("invalid non-UTF8 @data passed as text to web_socket_connection_send()");
```

That led to pages receiving text channel messages with large holes, as it happened in cockpit-project/cockpit-podman#1733. Avoid this by incrementally decoding messages, and sending them in valid prefixes.

This can happen in AsyncChannel (http-stream2), GeneratorChannel (fsread1), and ProtocolChannel (stream), so use the API there. This is rather expensive, but hopefully text channels are only being used for small messages anyway.

Reproduce the situation in test-server's `/split-utf8` and `/truncated-utf8` endpoints.

Fixes cockpit-project#19235
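The "large holes" failure mode and the effect of re-framing can be reproduced with a toy model. This is a sketch under assumptions: `deliver_text_frames` merely imitates a receiver that discards any frame that is not valid UTF-8 on its own, and `reframe` is a hypothetical helper, not the bridge's implementation.

```python
import codecs
from typing import Iterable, Iterator, List


def deliver_text_frames(frames: Iterable[bytes]) -> str:
    """Imitate a text socket that silently drops frames which are not valid UTF-8."""
    received: List[str] = []
    for frame in frames:
        try:
            received.append(frame.decode())
        except UnicodeDecodeError:
            continue  # the whole frame is lost, leaving a hole
    return ''.join(received)


def reframe(frames: Iterable[bytes]) -> Iterator[bytes]:
    """Re-cut frames so that each one ends on a character boundary."""
    decoder = codecs.getincrementaldecoder('utf-8')()
    for frame in frames:
        text = decoder.decode(frame)
        if text:
            yield text.encode()
    decoder.decode(b'', final=True)  # raises if the stream ends mid-character


payload = 'x€y€z'.encode()
# The second '€' (bytes 5..7) is cut in the middle by the last two frames.
split_frames = [payload[:5], payload[5:7], payload[7:]]

with_holes = deliver_text_frames(split_frames)
repaired = deliver_text_frames(reframe(split_frames))
```

In this model the unmodified frames lose everything after the first frame, while the re-cut frames deliver the full text.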

cockpituous · 2024-06-21T12:10:48Z

pkg/lib/cockpit.js

@@ -3896,7 +3896,7 @@ function factory() {

                if (options.problem) {
                    http_debug("http problem: ", options.problem);
-                    dfd.reject(new BasicError(options.problem));
+                    dfd.reject(new BasicError(options.problem, options.message));


This added line is not executed by any test.

martinpitt · 2024-06-24T07:19:41Z

@allisonkarlitskaya gentle review ping?

allisonkarlitskaya

Thanks for all the updates!

> a wild mix between "no argument", "utf-8", "UTF-8", and "utf-8".

utf-8 and utf-8? 😅

martinpitt · 2024-06-24T08:06:05Z

Ah, I suppose I wanted to write "utf8", and then saw that we don't actually use this form (it works, too, though). 😁 Anyway, harmless enough..
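The equivalence of the spellings is easy to check in Python 3. A short sketch; the sample string is arbitrary:

```python
import codecs

text = 'naïve €'
default_encoded = text.encode()  # no argument: UTF-8 in Python 3

spellings = ['utf-8', 'UTF-8', 'utf8']
all_equal = all(text.encode(name) == default_encoded for name in spellings)

# All spellings resolve to the same codec.
canonical_names = {codecs.lookup(name).name for name in spellings}
```

Dropping the argument therefore changes nothing at runtime; it only removes the inconsistency between call sites.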

martinpitt marked this pull request as draft June 20, 2024 09:34

martinpitt requested a review from allisonkarlitskaya June 20, 2024 10:08

allisonkarlitskaya requested changes Jun 20, 2024

martinpitt force-pushed the utf8-split branch from e45a086 to 295d54f Compare June 20, 2024 10:42

martinpitt force-pushed the utf8-split branch from 295d54f to 2eabad6 Compare June 20, 2024 14:36

martinpitt force-pushed the utf8-split branch from 2eabad6 to 9e75936 Compare June 20, 2024 15:02

allisonkarlitskaya requested changes Jun 20, 2024

martinpitt force-pushed the utf8-split branch from 9e75936 to 2dfe526 Compare June 21, 2024 05:22

martinpitt marked this pull request as ready for review June 21, 2024 05:33

martinpitt requested a review from allisonkarlitskaya June 21, 2024 06:17

allisonkarlitskaya requested changes Jun 21, 2024

martinpitt added 7 commits June 21, 2024 11:49

gitignore: Re-ignore built tmpfiles.d file

077809a

Commit 930d957 renamed/moved the file, but forgot to adjust .gitignore.

cockpit.js: Fix exception reporting for cockpit.http() errors

460ddd0

If the channel closes with a problem that includes a `message` field, copy that into the thrown exception, instead of duplicating the `problem` field (what `BasicError()` does by default).

all: Drop UTF-8 argument from .decode()/.encode()

56caea9

It has always been the default in Python 3, and we were previously using a wild mix between "no argument", "utf-8", "UTF-8", and "utf-8".

bridge: Factorize "binary" option evaluation into Channel

0931a47

It's a generic channel option, also documented as such in doc/protocol.md, so it should be handled centrally. This avoids some duplication and is a prerequisite for the next commits.

Revert "test: Ignore "invalid non-UTF8 @data passed" message"

8505b1e

That bug is fixed for good, so we should not run into this any more. This also broke tests randomly due to pages receiving invalid messages. This reverts commit ab17855.

martinpitt force-pushed the utf8-split branch from 2dfe526 to 8505b1e Compare June 21, 2024 10:06

martinpitt requested a review from allisonkarlitskaya June 21, 2024 10:07

cockpituous reviewed Jun 21, 2024

allisonkarlitskaya approved these changes Jun 24, 2024

martinpitt merged commit d880b3c into cockpit-project:main Jun 24, 2024
82 checks passed

martinpitt deleted the utf8-split branch June 24, 2024 08:06

allisonkarlitskaya mentioned this pull request Jul 26, 2024

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbd in position 6: invalid start byte #20791

Closed
