-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bridge: Fix text channels with split UTF8 chars across frames #20628
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have structured this a bit differently:
.send_data(data: bytes)
: leave it alone, but add a comment that the caller is responsible for sending properly-formed utf8 data on non-binary channels..send_text(data: str)
: essentiallyself.send_bytes(data.encode())
.send_magic_stuff(data: bytes)
: if this is a binary channel, just.send_data()
, if it's text then incrementally decodedata
, passing it to.send_text()
.
I feel like there's a few places where we would like to use .send_text()
— grepping around in the channels shows a few cases of .send_data(x.encode())
and many cases of .send_data(b'')
which are probably meant as text.
It might also make sense to rename the existing .send_data()
to .send_bytes()
and then the new "magic" function would be .send_data()
. I'm not too worried about the names.
The main thrust of this is that someone calling self.send_data(x.encode())
on a text channel should not then have their data re-decoded, only to be encoded yet again... There should be a .send_text()
to avoid that, and this is what the "magic" function should also call internally on the "string" that it gets from the incremental decoder.
@allisonkarlitskaya ack, thanks -- I'll look into rewriting that. (I had a gut feeling to do this, but first wanted to check for regressions -- apparently there are none) |
Reworked as above to split into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a survey of channels that are actually sending data other than .send_json():
- fsread provides possibly-split-character data via its
do_yield_data()
generator. It would be easy to change theopen()
call in there to change its mode based on if the channel is binary or not - packages is similar, but takes a different path:
AsyncChannel.sendfile()
. It would probably be more difficult to adapt. - HTTP is difficult: I think there's no way to get text data out of the HTTP client API.
- the echo channel sends whatever we send to it. It's a good candidate for send_bytes().
- stream goes via
ProtocolChannel
. Your first patch moved that to.send_bytes()
and didn't move it back to.send_data()
later — that means a stream channel opened in text mode could still send invalid utf8.
So in summary: although we could modify fsread and packages to give us text data, because of HTTP and because of the stream channel, I think we're more or less forced to do the incremental coder for ourselves, at least once, and Channel seems to be the sensible spot for it.
src/cockpit/channel.py
Outdated
@@ -219,6 +221,9 @@ def ready(self, **kwargs: JsonValue) -> None: | |||
self.send_control(command='ready', **kwargs) | |||
|
|||
def done(self) -> None: | |||
# any residue from partial send_data() frames? | |||
if self.decoder and self.decoder.getstate()[0]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can call decoder.decode(b'', final=True)
if memory serves, and that should throw a UnicodeDecodeError
which you can handle in the same way as you do below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm aware of that, but that does more work than checking the state. It is a bit more consistent though. No strong opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See below, reworked into a helper, so both places now use the same code.
I thought the same -- it's a case of GIGO, and it's unlikely that it's accidentally fed with invalid UTF-8. And if so, that'd ideally already fail on the sender side, not on the receiver. So I left it alone.
Oops, will fix. Thanks! |
src/cockpit/channel.py
Outdated
@@ -125,6 +126,7 @@ def do_control(self, command: str, message: JsonObject) -> None: | |||
self._ack_bytes = get_enum(message, 'send-acks', ['bytes'], None) is not None | |||
self.group = get_str(message, 'group', 'default') | |||
self.is_binary = get_enum(message, 'binary', ['raw'], None) is not None | |||
self.decoder: 'codecs.IncrementalDecoder | None' = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is weird. Please hang it directly on the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you mean? Putting it into __init__
? In case you meant a class attribute: that would be very much wrong of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hanging default values directly on the class is very common practice and we do it all over the place. Python even invented ClassVar
that you need to explicitly use to annotate the cases of that which you actually intend to be treated as actual class attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I hate this behavior. I really want to keep it here, where it is crystal clear and explicit that it is an instance variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the type declaration into the class, which does make sense.
FTR, I consider the initialization of channel
and group
to ''
at the class level a potential bug and unsafe practice. Accessing these fields in a non-open channel would better raise an AttributeError
than silently succeeding and misbehaving. But that's the "agree to disagree" which we elaborated at length in the chat.
src/cockpit/channel.py
Outdated
if self.decoder: | ||
try: | ||
self.decoder.decode(b'', final=True) | ||
except UnicodeDecodeError as exc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks nicer, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if you want to DRY this a bit:
- add a
final: boolean = False
kwarg tosend_data()
, wired through to the decoder - if
final and not block: return
to avoid sending an empty frame if the call was just forfinal=True
purposes - make
.done()
callsend_data(b'', final=True)
I think that, for utf-8 at least, self.decoder.decode(b'', final=True)
should always return an empty string (or raise) but the idea that maybe it couldn't is bothering me... 😅
This is "matter of taste" territory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of exposing the final
flag to the external API, as that's just an internal helper. I moved it into a separate private method.
src/cockpit/channel.py
Outdated
block = self.decoder.decode(data) | ||
except UnicodeDecodeError as exc: | ||
raise ChannelError('protocol-error', message=str(exc)) from exc | ||
return self.send_bytes(block.encode()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...or .send_text()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/cockpit/channel.py
Outdated
|
||
Similar to `send_bytes`, but for text data. The data is sent as UTF-8 encoded bytes. | ||
""" | ||
return self.send_bytes(data.encode("UTF-8")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes you use .decode()
and .encode()
with an argument, sometimes without. utf-8
is the default value, so I think it always makes sense without.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done as a separate commit throughout the code.
Commit 930d957 renamed/moved the file, but forgot to adjust .gitignore.
If the channel closes with a problem that includes a `message` field, copy that into the thrown exception, instead of duplicating the `problem` field (what `BasicError()` does by default).
It has always been the default in Python 3, and we were previously using a wild mix between "no argument", "utf-8", "UTF-8", and "utf-8".
It's a generic channel option, also documented as such in doc/protocol.md, so it should be handled centrally. This avoids some DRY and is a prerequisite for the next commits.
This factorizes a few `.encode()` calls, and clarifies which callers are guaranteed to send UTF-8 text, and which can send arbitrary bytes. This leaves AsyncChannnel and GeneratorChannel as the places which potentially send non-UTF8-data to text channels. We'll deal with them in the next commit to not change too many things at once. Move `send_json()` below `send_text()` to keep all `send_*()` methods together.
Reintroduce `Channel.send_data()` as an API to do incremental UTF-8 decoding for text channel messages, so that it avoids breaking multi-byte UTF-8 characters. These frames were previously ignored completely, as a text web socket just refuses these messages. cockpit-ws warns about them as well: ```c g_critical ("invalid non-UTF8 @DaTa passed as text to web_socket_connection_send()"); ``` That led to pages receiving text channel messages with large holes, as it happened in cockpit-project/cockpit-podman#1733. Avoid this by incrementally decoding messages, and sending them in valid prefixes. This can happen in AsyncChannel (http-stream2), GeneratorChannel (fsread1), and ProtocolChannel (stream), so use the API there. This is rather expensive, but hopefully text channels are only being used for small messages anyway. Reproduce the situation in test-server's `/split-utf8` and `/truncated-utf8` endpoints. Fixes cockpit-project#19235
That bug is fixed for good, so we should not run into this any more. This also broke tests randomly due to pages receiving invalid messages. This reverts commit ab17855.
@@ -3896,7 +3896,7 @@ function factory() { | |||
|
|||
if (options.problem) { | |||
http_debug("http problem: ", options.problem); | |||
dfd.reject(new BasicError(options.problem)); | |||
dfd.reject(new BasicError(options.problem, options.message)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This added line is not executed by any test.
@allisonkarlitskaya gentle review ping? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the updates!
a wild mix between "no argument", "utf-8", "UTF-8", and "utf-8".
utf-8
andutf-8
? 😅
Ah, I suppose I wanted to write "utf8", and then saw that we don't actually use this form (it works, too, though). 😁 Anyway, harmless enough.. |
Fixes #19235