Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bridge: Fix text channels with split UTF8 chars across frames #20628

Merged
merged 7 commits into from
Jun 24, 2024

Conversation

martinpitt
Copy link
Member

@martinpitt martinpitt commented Jun 20, 2024

Fixes #19235

src/cockpit/channel.py Outdated Show resolved Hide resolved
@martinpitt martinpitt marked this pull request as draft June 20, 2024 09:34
Copy link
Member

@allisonkarlitskaya allisonkarlitskaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have structured this a bit differently:

  • .send_data(data: bytes): leave it alone, but add a comment that the caller is responsible for sending properly-formed utf8 data on non-binary channels.
  • .send_text(data: str): essentially self.send_bytes(data.encode())
  • .send_magic_stuff(data: bytes): if this is a binary channel, just .send_data(), if it's text then incrementally decode data, passing it to .send_text().

I feel like there's a few places where we would like to use .send_text() — grepping around in the channels shows a few cases of .send_data(x.encode()) and many cases of .send_data(b'') which are probably meant as text.

It might also make sense to rename the existing .send_data() to .send_bytes() and then the new "magic" function would be .send_data(). I'm not too worried about the names.

The main thrust of this is that someone calling self.send_data(x.encode()) on a text channel should not then have their data re-decoded, only to be encoded yet again... There should be a .send_text() to avoid that, and this is what the "magic" function should also call internally on the "string" that it gets from the incremental decoder.

pkg/base1/test-http.js Outdated Show resolved Hide resolved
@martinpitt
Copy link
Member Author

@allisonkarlitskaya ack, thanks -- I'll look into rewriting that. (I had a gut feeling to do this, but first wanted to check for regressions -- apparently there are none)

@martinpitt
Copy link
Member Author

Reworked as above to split into send_{bytes,text,data}(). I still have that pesky exception message problem from above, but that's too hairy for the rest of the day. Let's get an opinion from the bots in the meantime.

Copy link
Member

@allisonkarlitskaya allisonkarlitskaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a survey of channels that are actually sending data other than .send_json():

  • fsread provides possibly-split-character data via its do_yield_data() generator. It would be easy to change the open() call in there to change its mode based on if the channel is binary or not
  • packages is similar, but takes a different path: AsyncChannel.sendfile(). It would probably be more difficult to adapt.
  • HTTP is difficult: I think there's no way to get text data out of the HTTP client API.
  • the echo channel sends whatever we send to it. It's a good candidate for send_bytes().
  • stream goes via ProtocolChannel. Your first patch moved that to .send_bytes() and didn't move it back to .send_data() later — that means a stream channel opened in text mode could still send invalid utf8.

So in summary: although we could modify fsread and packages to give us text data, because of HTTP and because of the stream channel, I think we're more or less forced to do the incremental coder for ourselves, at least once, and Channel seems to be the sensible spot for it.

@@ -219,6 +221,9 @@ def ready(self, **kwargs: JsonValue) -> None:
self.send_control(command='ready', **kwargs)

def done(self) -> None:
# any residue from partial send_data() frames?
if self.decoder and self.decoder.getstate()[0]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can call decoder.decode(b'', final=True) if memory serves, and that should throw a UnicodeDecodeError which you can handle in the same way as you do below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm aware of that, but that does more work than checking the state. It is a bit more consistent though. No strong opinion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See below, reworked into a helper, so both places now use the same code.

@martinpitt
Copy link
Member Author

the echo channel sends whatever we send to it. It's a good candidate for send_bytes().

I thought the same -- it's a case of GIGO, and it's unlikely that it's accidentally fed with invalid UTF-8. And if so, that'd ideally already fail on the sender side, not on the receiver. So I left it alone.

that means a stream channel opened in text mode could still send invalid utf8.

Oops, will fix. Thanks!

src/cockpit/channels/metrics.py Show resolved Hide resolved
@@ -125,6 +126,7 @@ def do_control(self, command: str, message: JsonObject) -> None:
self._ack_bytes = get_enum(message, 'send-acks', ['bytes'], None) is not None
self.group = get_str(message, 'group', 'default')
self.is_binary = get_enum(message, 'binary', ['raw'], None) is not None
self.decoder: 'codecs.IncrementalDecoder | None' = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird. Please hang it directly on the class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you mean? Putting it into __init__? In case you meant a class attribute: that would be very much wrong of course.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hanging default values directly on the class is very common practice and we do it all over the place. Python even invented ClassVar that you need to explicitly use to annotate the cases of that which you actually intend to be treated as actual class attributes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I hate this behavior. I really want to keep it here, where it is crystal clear and explicit that it is an instance variable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the type declaration into the class, which does make sense.

FTR, I consider the initialization of channel and group to '' at the class level a potential bug and unsafe practice. Accessing these fields in a non-open channel would better raise an AttributeError than silently succeeding and misbehaving. But that's the "agree to disagree" which we elaborated at length in the chat.

if self.decoder:
try:
self.decoder.decode(b'', final=True)
except UnicodeDecodeError as exc:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks nicer, thanks.

Copy link
Member

@allisonkarlitskaya allisonkarlitskaya Jun 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you want to DRY this a bit:

  • add a final: boolean = False kwarg to send_data(), wired through to the decoder
  • if final and not block: return to avoid sending an empty frame if the call was just for final=True purposes
  • make .done() call send_data(b'', final=True)

I think that, for utf-8 at least, self.decoder.decode(b'', final=True) should always return an empty string (or raise) but the idea that maybe it couldn't is bothering me... 😅

This is "matter of taste" territory.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of exposing the final flag to the external API, as that's just an internal helper. I moved it into a separate private method.

block = self.decoder.decode(data)
except UnicodeDecodeError as exc:
raise ChannelError('protocol-error', message=str(exc)) from exc
return self.send_bytes(block.encode())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...or .send_text()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


Similar to `send_bytes`, but for text data. The data is sent as UTF-8 encoded bytes.
"""
return self.send_bytes(data.encode("UTF-8"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes you use .decode() and .encode() with an argument, sometimes without. utf-8 is the default value, so I think it always makes sense without.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done as a separate commit throughout the code.

Commit 930d957 renamed/moved the file, but forgot to adjust
.gitignore.
If the channel closes with a problem that includes a `message` field,
copy that into the thrown exception, instead of duplicating the
`problem` field (what `BasicError()` does by default).
It has always been the default in Python 3, and we were previously using
a wild mix between "no argument", "utf-8", "UTF-8", and "utf-8".
It's a generic channel option, also documented as such in
doc/protocol.md, so it should be handled centrally. This avoids some DRY
and is a prerequisite for the next commits.
This factorizes a few `.encode()` calls, and clarifies which callers are
guaranteed to send UTF-8 text, and which can send arbitrary bytes.

This leaves AsyncChannnel and GeneratorChannel as the places which
potentially send non-UTF8-data to text channels. We'll deal with them in
the next commit to not change too many things at once.

Move `send_json()` below `send_text()` to keep all `send_*()` methods
together.
Reintroduce `Channel.send_data()` as an API to do incremental UTF-8
decoding for text channel messages, so that it avoids breaking
multi-byte UTF-8 characters. These frames were previously ignored
completely, as a text web socket just refuses these messages. cockpit-ws
warns about them as well:

```c
 g_critical ("invalid non-UTF8 @DaTa passed as text to web_socket_connection_send()");
```

That led to pages receiving text channel messages with large holes, as it
happened in cockpit-project/cockpit-podman#1733.

Avoid this by incrementally decoding messages, and sending them in valid
prefixes. This can happen in AsyncChannel (http-stream2),
GeneratorChannel (fsread1), and ProtocolChannel (stream), so use the API
there. This is rather expensive, but hopefully text channels are only
being used for small messages anyway.

Reproduce the situation in test-server's `/split-utf8` and
`/truncated-utf8` endpoints.

Fixes cockpit-project#19235
That bug is fixed for good, so we should not run into this any more.
This also broke tests randomly due to pages receiving invalid messages.

This reverts commit ab17855.
@@ -3896,7 +3896,7 @@ function factory() {

if (options.problem) {
http_debug("http problem: ", options.problem);
dfd.reject(new BasicError(options.problem));
dfd.reject(new BasicError(options.problem, options.message));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This added line is not executed by any test.

@martinpitt
Copy link
Member Author

@allisonkarlitskaya gentle review ping?

Copy link
Member

@allisonkarlitskaya allisonkarlitskaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the updates!

a wild mix between "no argument", "utf-8", "UTF-8", and "utf-8".
utf-8 and utf-8? 😅

@martinpitt
Copy link
Member Author

Ah, I suppose I wanted to write "utf8", and then saw that we don't actually use this form (it works, too, though). 😁 Anyway, harmless enough..

@martinpitt martinpitt merged commit d880b3c into cockpit-project:main Jun 24, 2024
82 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

text channels drop frames with split UTF-8 multi-byte characters
3 participants