Fix bridge crashes on early channel opening failure #20880
When the channel fails early, i.e. when the websocket goes away and writing the "ready" control message fails, then `Channel.do_close()` was called before `.connection_made()`. This caused a bridge crash:
> AttributeError: 'SubprocessStreamChannel' object has no attribute '_transport'
When the channel fails very early during open() (if writing the "ready" control message fails), `self.path_watch` was not yet initialized. Then close() failed with an `AssertionError`.
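To make the first failure concrete, here is a minimal sketch of the attribute lifecycle. `EarlyCloseChannel`, `init_in_open` and `open_then_fail` are hypothetical names for illustration only; this is not the real channel code, it just models "close arrives before `connection_made()`".

```python
class EarlyCloseChannel:
    """Hypothetical stand-in; only models when `_transport` gets assigned."""

    def __init__(self, init_in_open: bool) -> None:
        self.init_in_open = init_in_open

    def do_open(self) -> None:
        if self.init_in_open:
            # Assumption: mirrors the one-line change proposed in this PR.
            self._transport = None

    def connection_made(self, transport: object) -> None:
        self._transport = transport

    def do_close(self) -> str:
        # Raises AttributeError if nothing ever assigned `_transport`.
        if self._transport is None:
            return 'closed before connection'
        return 'closed transport'


def open_then_fail(channel: EarlyCloseChannel) -> str:
    channel.do_open()
    # Simulate the "ready" write failing: close runs before connection_made().
    try:
        return channel.do_close()
    except AttributeError as exc:
        return f'crash: {exc}'
```

Without the initialization, `open_then_fail()` reports the same kind of `AttributeError` as the crash above; with it, the close path can tell that no transport was ever attached.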
Hmm, our bridge surely is rather crash happy 🤔 (unrelated, retrying)
@@ -386,6 +386,7 @@ async def create_transport(self, loop: asyncio.AbstractEventLoop, options: JsonO
         raise NotImplementedError

     def do_open(self, options: JsonObject) -> None:
+        self._transport = None
This is the wrong spot for that. Please move this to an initialization at the class level where all the other variables are initialized. I'm not sure why I missed this one...
You know I despise instance variables that pretend to be class variables. That would be the wrong spot IMO. If you insist I'll do it to get it out of the way, but this really belongs in the constructor (and open() is kinda sorta the constructor for Channels).
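For readers following the disagreement, the two placements can be sketched side by side. Both class names are hypothetical and `object` stands in for the real transport type; this only shows where the attribute lives, not which spot the project prefers.

```python
class ClassLevelDefault:
    # Class attribute: a shared fallback that is visible on every instance
    # until connection_made() assigns an instance attribute of the same name.
    _transport: 'object | None' = None

    def connection_made(self, transport: object) -> None:
        self._transport = transport


class OpenTimeInit:
    def do_open(self) -> None:
        # Instance attribute, created when the channel is opened.
        self._transport: 'object | None' = None

    def connection_made(self, transport: object) -> None:
        self._transport = transport
```

In both variants a `do_close()` that runs before `connection_made()` reads `None` instead of raising. The difference is that the class-level default is readable even if `do_open()` never ran, while the open-time variant keeps the state strictly per instance.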
@@ -530,8 +530,8 @@ def do_identity_changed(self, fd: 'Handle | None', err: 'int | None') -> None:

     def do_close(self) -> None:
         # non-watch channels close immediately — if we get this, we're watching
-        assert self.path_watch is not None
-        self.path_watch.close()
+        if self.path_watch is not None:
This conditional sort of invalidates the point of the comment above.
There's a thorny philosophical issue behind all of this: what's supposed to happen when writing to the transport fails with an OSError. That's supposed to result in the transport being closed with that exception, of course, but should it also bubble up into the code?
If so, I feel like we should do more to guard individual callers against having to deal with this. I don't want to have every channel having to think about what might happen if their .send() calls throw...
Which is to say: let me do a bit of research on this and see what the standard transports do.
import asyncio
import sys
from typing import override


class Prot(asyncio.Protocol):
    @override
    def connection_made(self, transport: asyncio.BaseTransport) -> None:
        assert isinstance(transport, asyncio.WriteTransport)
        transport.write(b'hihi')
        print('I survived')

    @override
    def connection_lost(self, exc: Exception | None) -> None:
        print('lost', exc)


async def run() -> None:
    loop = asyncio.get_running_loop()
    await loop.connect_write_pipe(Prot, sys.stdin)
    await asyncio.sleep(1000)


asyncio.run(run())
yields
$ python3 x.py < /dev/null
I survived
lost [Errno 9] Bad file descriptor
which is to say: the transport protects the caller of .write() from the exception.
Our transports should guard callers in the same way:
try:
    n_bytes = os.write(self._out_fd, data)
except BlockingIOError:
    n_bytes = 0
except OSError as exc:
    self.abort(exc)
    return
There's a critical difference, though: we call .abort() directly, which immediately calls .connection_lost() on the Protocol.
def abort(self, exc: 'Exception | None' = None) -> None:
    self._closing = True
    self._close_reader()
    self._remove_write_queue()
    self._protocol.connection_lost(exc)
    self._close()
The standard transports punt it to the next mainloop iteration.
def _close(self, exc=None):
    self._closing = True
    if self._buffer:
        self._loop._remove_writer(self._fileno)
    self._buffer.clear()
    self._loop._remove_reader(self._fileno)
    self._loop.call_soon(self._call_connection_lost, exc)
So with a standard transport, the caller's code runs first, but with ours, the abort code runs first.
The correct solution here is to change our transport to be more like the stdlib ones. This might have also been the root cause of #20634 and other similar bugs: it's not that we close multiple times — it's that we never finished opening.
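The ordering difference can be reproduced with a small sketch. `SketchTransport`, `RecordingProtocol`, the `defer` flag and `demo()` are all hypothetical; the write failure is simulated rather than coming from a real file descriptor, and this is not the actual fix, only an illustration of immediate versus `call_soon()`-deferred notification.

```python
import asyncio


class RecordingProtocol:
    """Hypothetical protocol that records the order of events."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def connection_lost(self, exc: 'Exception | None') -> None:
        self.events.append(f'connection_lost({exc})')


class SketchTransport:
    """Hypothetical transport; only models the failing write path."""

    def __init__(self, loop: asyncio.AbstractEventLoop,
                 protocol: RecordingProtocol, defer: bool) -> None:
        self._loop = loop
        self._protocol = protocol
        self._defer = defer
        self._closing = False

    def write(self, data: bytes) -> None:
        if self._closing:
            return
        # Pretend os.write() raised; the caller never sees the exception.
        self.abort(OSError(9, 'Bad file descriptor'))

    def abort(self, exc: 'Exception | None' = None) -> None:
        self._closing = True
        if self._defer:
            # stdlib style: notify on a later loop iteration
            self._loop.call_soon(self._protocol.connection_lost, exc)
        else:
            # immediate style: notify before write() returns
            self._protocol.connection_lost(exc)


async def demo(defer: bool) -> list[str]:
    protocol = RecordingProtocol()
    transport = SketchTransport(asyncio.get_running_loop(), protocol, defer)
    transport.write(b'ready')
    protocol.events.append('caller continues')
    await asyncio.sleep(0)  # give call_soon() callbacks a chance to run
    return protocol.events
```

With `defer=False` the protocol hears `connection_lost()` before the code that called `write()` has finished, which is the situation that trips up a half-opened channel. With `defer=True` the caller finishes first and the loss is delivered afterwards.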
as an aside: I remember when I was writing this transport code way at the beginning I was in a hurry and I thought "meh, I don't understand why the standard transports do it this way — it seems pointless and unnecessary. I'll do it my way and we can change it if we find out that it's wrong."
😅
At least part of this is handled in #20881 now.
and now the second part of it is handled there too. Let's close this one.
These have plagued c-files for a while, see e.g. this crash (top of the weather report).
The second patch is fixing the crash that I got when I was smoke-testing the first one. See cockpit-project/bots#6730 (comment) . It's also on the weather report, example
I've tried for 90 mins to write a unit test for the first one, but I just don't grasp how. I monkeyed around with
But nothing here really fits: test_transport all uses MockTransport, which is so different in behaviour from the real _Transport class that I find it (too) hard to reproduce the behaviour. The test_bridge level felt more promising, but I couldn't make that work either.
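One possible starting point for such a test, sketched under the assumption that a broken pipe is a fair stand-in for the websocket going away: provoke a real `OSError` from `os.write()` rather than mocking it. `write_to_broken_pipe()` is a hypothetical helper, not something from the test suite, and wiring the failing descriptor into the real transport is left open here.

```python
import os


def write_to_broken_pipe() -> 'OSError | None':
    """Return the OSError raised by writing to a pipe that has no reader."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # no reader left, so the write below fails with EPIPE
    try:
        os.write(write_fd, b'ready')
    except OSError as exc:
        return exc
    finally:
        os.close(write_fd)
    return None
```

On POSIX systems, Python ignores SIGPIPE by default, so the failure surfaces as a `BrokenPipeError` instead of killing the process. That is the same class of error the transport's `except OSError` branch would have to handle.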