Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bridge crashes on early channel opening failure #20880

Closed
wants to merge 2 commits into from

Conversation

martinpitt
Copy link
Member

@martinpitt martinpitt commented Aug 13, 2024

These have plagued c-files for a while, see e.g. this crash (top of the weather report).

The second patch is fixing the crash that I got when I was smoke-testing the first one. See cockpit-project/bots#6730 (comment) . It's also on the weather report, example

I've tried for 90 mins to write an unit test for the first one, but I just don't grasp that. I monkeyed around with

diff --git src/cockpit/transports.py src/cockpit/transports.py
index 8aabc7d83..491638587 100644
--- src/cockpit/transports.py
+++ src/cockpit/transports.py
@@ -241,9 +241,11 @@ class _Transport(asyncio.Transport):
 
         try:
             n_bytes = os.write(self._out_fd, data)
+            logger.warning('Transport.write %r succeeded %i', data, n_bytes)
         except BlockingIOError:
             n_bytes = 0
         except OSError as exc:
+            logger.warning('Transport.write %r failed %r', data, exc)
             self.abort(exc)
             return
 
diff --git test/pytest/mocktransport.py test/pytest/mocktransport.py
index ed2c6f721..b06e6b4b9 100644
--- test/pytest/mocktransport.py
+++ test/pytest/mocktransport.py
@@ -25,7 +25,9 @@ class MockTransport(asyncio.Transport):
     def send_data(self, channel: str, data: bytes) -> None:
         msg = channel.encode('ascii') + b'\n' + data
         msg = str(len(msg)).encode('ascii') + b'\n' + msg
+        print("XXXXX MockTransport send_data %r" % data)
         self.protocol.data_received(msg)
+        print("XXXXX MockTransport send_data done")
 
     def send_init(self, version=1, host=MOCK_HOSTNAME, **kwargs):
         self.send_json('', command='init', version=version, host=host, **kwargs)
@@ -124,7 +126,9 @@ class MockTransport(asyncio.Transport):
 
     async def assert_msg(self, expected_channel: str, absent_keys: 'Iterable[str]' = (),
                          **kwargs: JsonValue) -> JsonObject:
+        print("XXX assert_msg waiting next_msg on", expected_channel)
         msg = await self.next_msg(expected_channel)
+        print("XXX assert_msg got next_msg", msg)
         assert msg == dict(msg, **{k.replace('_', '-'): v for k, v in kwargs.items()}), msg
         for absent_key in absent_keys:
             assert absent_key not in msg
diff --git test/pytest/test_bridge.py test/pytest/test_bridge.py
index b4b02d78c..a4ae51bd2 100644
--- test/pytest/test_bridge.py
+++ test/pytest/test_bridge.py
@@ -1379,3 +1379,16 @@ async def test_fsinfo_targets(transport: MockTransport, tmp_path: Path) -> None:
     # double-check with the non-watch variant
     client = await FsInfoClient.open(transport, tmp_path, ['type', 'target', 'targets'], fnmatch='l*')
     assert await client.wait() == state
+
+
+@pytest.mark.asyncio
+async def test_early_write_error(transport: MockTransport, monkeypatch: pytest.MonkeyPatch) -> None:
+    def fail_write(_data):
+        # transport.close_future.set_result(OSError("moo"))
+        transport.protocol.connection_lost(OSError("meh"))
+        transport.send_close("1.1", problem="kaputt")
+
+    # MockTransport.write = fail_write
+    # monkeypatch.setattr(transport, 'write', unittest.mock.Mock(side_effect=OSError))
+    monkeypatch.setattr(transport, 'write', fail_write)
+    await transport.check_open('echo', channel="1.1", problem="mieps")
diff --git test/pytest/test_transport.py test/pytest/test_transport.py
index 7c819c009..cedaf62a0 100644
--- test/pytest/test_transport.py
+++ test/pytest/test_transport.py
@@ -469,3 +469,13 @@ class TestSubprocessTransport:
         while protocol.transport is not None:
             await asyncio.sleep(0.1)
         assert protocol.transport is None
+
+    @pytest.mark.asyncio
+    async def test_early_write_error(self, monkeypatch: pytest.MonkeyPatch) -> None:
+        loop = asyncio.get_running_loop()
+        protocol = Protocol()
+        monkeypatch.setattr(os, 'write', unittest.mock.Mock(side_effect=OSError))
+        transport = cockpit.transports.SubprocessTransport(loop, protocol, ['true'])
+        protocol.close_on_eof = False
+        while not protocol.exited:
+            await asyncio.sleep(0.1)

But nothing here really fits -- test_transport all uses MockTransport which is so different in behaviour from the real _Transport class that I find it (too) hard to reproduce the behaviour. The test_bridge level felt more promising, but I couldn't make that work either.

When the channel fails early, i.e. when the websocket goes away and
writing the "ready" control message fails, then `Channel.do_close()` was
called before `.connection_made()`. This caused a bridge crash:

> AttributeError: 'SubprocessStreamChannel' object has no attribute '_transport'
When the channel fails very early during open() (if writing the "ready"
control message fails), `self.path_watch` did not yet initialized. Then
close() failed with an `AssertionError`.
@martinpitt
Copy link
Member Author

Hmm, our bridge surely is rather crash happy 🤔 (unrelated, retrying)

@@ -386,6 +386,7 @@ async def create_transport(self, loop: asyncio.AbstractEventLoop, options: JsonO
raise NotImplementedError

def do_open(self, options: JsonObject) -> None:
self._transport = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the wrong spot for that. Please move this to an initialization at the class level where all the other variables are initialized. I'm not sure why I missed this one...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know I despise instance variables that pretend to be class variables. That be the wrong spot IMO. If you insist I'll do it to get it out of the way, but this really belongs into the constructor (and open() is kinda sorta the constructor for Channels).

@@ -530,8 +530,8 @@ def do_identity_changed(self, fd: 'Handle | None', err: 'int | None') -> None:

def do_close(self) -> None:
# non-watch channels close immediately — if we get this, we're watching
assert self.path_watch is not None
self.path_watch.close()
if self.path_watch is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conditional sort of invalidates the point of the comment above.

There's a thorny philosophical issue behind all of this: what's supposed to happen when writing to the transport fails with an OSError. That's supposed to result in the transport being closed with that exception, of course, but should it also bubble up into the code?

If so, I feel like we should do more to guard individual callers against having to deal with this. I don't want to have very channel having to think about what might happen if their .send() calls throw...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is to say: let me do a bit of research on this and see what the standard transports do.

Copy link
Member

@allisonkarlitskaya allisonkarlitskaya Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import asyncio
import sys
from typing import override


class Prot(asyncio.Protocol):
    @override
    def connection_made(self, transport: asyncio.BaseTransport) -> None:
        assert isinstance(transport, asyncio.WriteTransport)
        transport.write(b'hihi')
        print('I survived')

    @override
    def connection_lost(self, exc: Exception | None) -> None:
        print('lost', exc)

async def run() -> None:
    loop = asyncio.get_running_loop()
    await loop.connect_write_pipe(Prot, sys.stdin)
    await asyncio.sleep(1000)


asyncio.run(run())

yields

$ python3 x.py < /dev/null
I survived
lost [Errno 9] Bad file descriptor

which is to say: the transport protects the caller of .write() from the exception.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our transports should guard callers in the same way:

        try:
            n_bytes = os.write(self._out_fd, data)
        except BlockingIOError:
            n_bytes = 0
        except OSError as exc:
            self.abort(exc)
            return

There's a critical difference, though: we call .abort() directly, which immediately calls .connection_lost() on the Protocol.

    def abort(self, exc: 'Exception | None' = None) -> None:
        self._closing = True
        self._close_reader()
        self._remove_write_queue()
        self._protocol.connection_lost(exc)
        self._close()

The standard transports punt it to the next mainloop iteration.

    def _close(self, exc=None):
        self._closing = True
        if self._buffer:
            self._loop._remove_writer(self._fileno)
        self._buffer.clear()
        self._loop._remove_reader(self._fileno)
        self._loop.call_soon(self._call_connection_lost, exc)

So with a standard transport, the caller's code runs first, but with ours, the abort code runs first.

The correct solution here is to change our transport to be more like the stdlib ones. This might have also been the root cause of #20634 and other similar bugs: it's not that we close multiple times — it's that we never finished opening.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as an aside: I remember when I was writing this transport code way at the beginning I was in a hurry and I thought "meh, I don't understand why the standard transports do it this way — it seems pointless and unnecessary. I'll do it my way and we can change it if we find out that it's wrong."

😅

@allisonkarlitskaya
Copy link
Member

At least part of this is handled in #20881 now.

@allisonkarlitskaya
Copy link
Member

At least part of this is handled in #20881 now.

and now the second part of it is handled there too. Let's close this one.

@martinpitt martinpitt deleted the init-transport branch August 13, 2024 11:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants