pybridge: Avoid BrokenPipeError crash of SubprocessTransport.write() #19414

martinpitt · 2023-09-28T15:58:03Z

A SubprocessTransport may die soon after starting without draining its
stdin. This can e.g. happen with "connection refused" and similar
networking errors when running ssh.

Trying to write to the process is then likely to trigger a
BrokenPipeError. This messes up proper error handling, as the error
usually gets read from the process'es err and exit code.

At this point "exited" signal didn't arrive yet, so ignore the
BrokenPipeError abort in the Peer, and defer error reporting to the
process exit handler.

Beiboot is broken in an interesting way on ubuntu-2204. It also happens on fedora-39 and rhel-9-4, but much less often -- the automatic retries usually "take care" of it. On ubuntu-2204 this reproduces almost perfectly locally (for some weird reason of timing/py version/etc.), it actually works only very rarely. That makes it ideal to investigate it on, and test fixes on.

What happens: beiboot.py's SshPeer.do_connect_transport starts the ssh transport and immediately starts sending data to it (the stage 1 boot loader). That data does not currently get queued. This always has to wait for ssh doing its (possibly interactive) authentication, but the thing is, we can't really predict -- it may be noninteractive SSH key authentication with none of AuthorizeResponder's handlers ever getting invoked. I.e. it's not clear what to wait on actually.

So in some cases it overflows the kernel fd buffer and thus triggers the BrokenPipeError.

allisonkarlitskaya · 2023-09-28T19:16:29Z

I'm not a fan. This is too complicated.

What if we threw Peer.PeerExited here instead? That's specifically designed for the case where the peer exits and we notice its return code, but if we are unlucky enough to see write() fail with EPIPE first, we can more or less assume that it's the same thing. In that case, the code in peer.py will prefer to take the exception generated by the connection function, if one exists, which will be the error returned by ferny.

NB https://github.com/cockpit-project/cockpit/blob/main/src/cockpit/peer.py#L102

martinpitt · 2023-09-29T03:49:03Z

I'm not a fan. This is too complicated.

Absolutely, see the TODO list. This was an experimental hack, aka. "troll Lis" 😁 , I don't want to land this.

What if we threw Peer.PeerExited here instead?

I don't see how _Transport.write() could do this:

It's too abstract to even know about processes, it's just a refinement of asyncio.Transport. It doesn't even know what a Peer is.
The process didn't exit yet at this point, or at least write() doesn't know about it at that point.
For a generic unix socket/file/etc. transport we probably do want to throw a BrokenPipeError, as there is no better error handling.

One possible way out may be to add a more specific SubprocessTransport.write(), and move most of the write() logic into an internal helper (to avoid repeating the queueing logic). It would still have to abstract from Peer.PeerExited, as accessing Peer in transport.py is still a layering violation.

I played around with this idea, and added a check to SubprocessTransport.write() at the end, if it even knows about the exit yet, and it doesn't:

cockpit.transports-DEBUG: Created transport <cockpit.transports.StdioTransport object at 0x7f836cad8e80> for protocol <__main__.SshBridge object at 0x7f836caa9960>, fds 0 1
cockpit.transports-DEBUG: Created transport <cockpit.transports.SubprocessTransport object at 0x7f836cadab80> for protocol <__main__.SshPeer object at 0x7f836caa9e10>, fds 10 9
cockpit.transports-DEBUG: XXX SubprocessTransport.write b'import traceback\nimport contextlib\nimport os\nimport lzma\nimport sys\ntry:\n\tdef try_exec(argv):\n\t\twith'
cockpit.peer-DEBUG: Peer SshPeer connection lost <class 'BrokenPipeError'> [Errno 32] Broken pipe
cockpit.transports-DEBUG: sent Signals.SIGTERM to process 4542
cockpit.transports-DEBUG: XXX SubprocessTransport.write end; _returncode None
cockpit.transports-DEBUG: Process exited with status 255
[... BrokenPipeError crash here...]

As we can't rely on picking up the "exited" signal before the EPIPE, I don't see any other way than just ignoring EPIPE at that point..

martinpitt · 2023-09-29T04:28:28Z

I reworked this to be less gross and more generic. I still don't claim elegance, but I'm running out of ideas here. Let's get a complete test run to get some feedback.

allisonkarlitskaya · 2023-09-29T08:05:02Z

It's too abstract to even know about processes, it's just a refinement of asyncio.Transport. It doesn't even know what a Peer is.

I would do this from Peer.do_close. You could check for BrokenPipeError at this point and rewrite it to PeerExited before setting it on the future.

allisonkarlitskaya · 2023-09-29T08:06:43Z

It's too abstract to even know about processes, it's just a refinement of asyncio.Transport. It doesn't even know what a Peer is.

I would do this from Peer.do_close. You could check for BrokenPipeError at this point and rewrite it to PeerExited before setting it on the future.

We could also just add BrokenPipeError directly to the except PeerExited linked to above so it gets treated the same way.

martinpitt · 2023-09-29T09:45:08Z

So your second approach works fine -- pushing to give it a full exercise.

However, I'm struggling with writing a unit test. This is what I have so far:

--- test/pytest/test_peer.py
+++ test/pytest/test_peer.py
@@ -5,9 +5,10 @@ import sys
 import pytest
 
 from cockpit.packages import BridgeConfig
-from cockpit.peer import ConfiguredPeer, PeerRoutingRule
+from cockpit.peer import ConfiguredPeer, PeerExited, PeerRoutingRule
 from cockpit.protocol import CockpitProtocolError
 from cockpit.router import Router
+from cockpit.transports import SubprocessTransport
 
 from . import mockpeer
 from .mocktransport import MockTransport
@@ -174,3 +175,20 @@ async def test_await_cancellable_connect_close(monkeypatch, event_loop, bridge):
     while len(asyncio.all_tasks()) > 1:
         await asyncio.sleep(0.1)
     assert peer.was_cancelled
+
+
+@pytest.mark.asyncio
+async def test_spawn_broken_pipe(bridge):
+    class BrokenPipePeer(ConfiguredPeer):
+        async def do_connect_transport(self) -> None:
+            transport = await self.spawn(['sh', '-c', 'echo kaputt >&2; exit 9'], ())
+            assert isinstance(transport, SubprocessTransport)
+            transport.write(b'abcdefg\n' * 10000)
+            while transport.get_returncode() is None:
+                await asyncio.sleep(0.1)
+
+    peer = BrokenPipePeer(bridge, PEER_CONFIG)
+    with pytest.raises(PeerExited) as raises:
+        await peer.start()
+    assert raises.value.exit_code == 9
+    peer.close()

But it still fails on src/cockpit/transports.py:180: BrokenPipeError.

allisonkarlitskaya · 2023-09-29T11:00:03Z

However, I'm struggling with writing a unit test.

The relevant block of code, for reference:

        except (PeerExited, BrokenPipeError):
            # PeerExited is a fairly generic error.  If the connection process is
            # still running, perhaps we'd get a better error message from it.
            # We get BrokenPipeError with .write() if the process has died, but we didn't get
            # the exit code yet. We want to handle errors by reading the process'es stdout/err and
            # exit code, so ignore the BrokenPipeError.
            await connect_task
            # Otherwise, re-raise
            raise

But your added comment is wrong. We don't ignore the BrokenPipeError. We only attempt to see if await connect_task will raise an exception. If it doesn't, then we fall through to the # Otherwise, re-raise, which is why your BrokenPipeError is bubbling up.

This is done in order to give ferny a chance to return an exception from do_connect_transport(), but your do_connect_transport() is succeeding.

In short: you need to add some raise MoreSpecificError() at the end of do_connect_transport().

In general, I'd prefer if you integrated your comment a bit more into the text above it. Perhaps something like:

These are fairly generic errors. PeerExited means that we observed the process exiting. BrokenPipeError means that we got EPIPE when attempting to write() to it. In both cases, the process is gone, but it's not clear why. If the connection process is still running, perhaps we'd get a better error message from it.

allisonkarlitskaya · 2023-09-29T11:04:21Z

One more thing we could do: if we get BrokenPipeError on (specifically) SubprocessTransport then we could just silently ignore it, under the assumption that a process_exited() call is not far behind...

The only issue is that in case this assumption is false, we're going to end up in a bad place.

allisonkarlitskaya · 2023-09-29T11:06:38Z

Two more suggestions about your testcase:

add a short time.sleep(0.1) before the transport.write() to improve the chances of it hitting EPIPE. Note: not await asyncio.sleep(0.1) because that would give a chance for the process_exited() call to happen first.
you don't need to write so much. even a single byte would be enough to trigger the EPIPE, and it's much closer to what we do in reality (the stage1 beiboot bootloader is well under the pipe buffer size).

martinpitt · 2023-09-29T14:29:43Z

@allisonkarlitskaya How about this?

martinpitt · 2023-09-29T14:38:14Z

This py3.7 failure is curious, but I don't have time any more to examine this now. Next week, or if Lis beats me to it 😁

allisonkarlitskaya · 2023-09-29T15:16:33Z

That failure looks like a pretty normal race condition. Probably we get process_exited delivered somehow between the spawn and the write call. We do await the spawn(), after all... I assumed that because this function is "not really async" internally there was no chance for the intercession, but I could also be wrong about this.

I checked what the standard library does in the case of write-after-close and it seems like we're too harsh here. The stock asyncio transports simply ignore the write. So I could see us changing the behaviour there, in any case.

I guess we sort of anticipate that the PeerExited situation might sometimes occur first — indeed the point of the sync sleep() is to increase the chance of the race going the other way. And indeed, in both cases (PeerExited and pipe error), we ought to see the error raised from the connect function...

So let's fix the transport. I'm going to rerun the test here out of curiosity... I wonder if Python 3.7 does an extra return to the mainloop for some reason...

allisonkarlitskaya · 2023-09-29T15:21:38Z

Unsurprisingly, it passed on the retry.

martinpitt · 2023-10-04T02:26:20Z

I can't reproduce that particular wirte() assertion failure with running the test 100 times, and not even with wrapping that part of the test into for _ in range(100): (which is a magnitude faster). However, it's crystal clear what happens -- so let's do what https://github.com/python/cpython/blob/3.11/Lib/asyncio/unix_events.py#L685 does.

With dropping the time.sleep(0.1) the peer.start() usually raises PeerExited, which would be another race condition. But I don't just want to accept either, as that would defeat the point of the test. If this turns out to be too flaky, we could do the loop and assert that it raises a BrokenPipeError in at least 80% of the cases or so?

martinpitt · 2023-10-04T02:33:58Z

Added an extra commit to fix the write() assertion.

martinpitt · 2023-10-04T05:37:36Z

The rawhide failure in TestFirewall.testFirewallPage happens in every PR now, let's investigate/naughty that today. Unrelated. Update: Landed in PR #19428

allisonkarlitskaya

Okay. I've done enough research to satisfy myself about exactly what's going on here.

The issue isn't anything to do with our await on the (not-really-)async spawn() somehow returning to the mainloop nor does it have anything to do with threads. The actual issue is that the SafeChildWatcher (which we use on Python < 3.9, because PidfdChildWatcher isn't available to us) will immediately fire its callback if the child has already exited by the time the watch is registered. Our example program here exits fast enough to cause a race with that registration, triggering the strange issue we saw about writing to an already-closed transport.

This race is not possible after 3.9, which also gives me faith in our unit test testing the thing we want it to — after 3.9 it should be hitting our desired case 100% of the time.

I only have one minor reservation, but it's not super important....

allisonkarlitskaya · 2023-10-04T09:25:18Z

src/cockpit/transports.py

+        # event before seeing BrokenPipeError, we'll try to write to a closed pipe.
+        # Do what the standard library does and ignore, instead of assert
+        if self._closing:
+            logger.warning('ignoring write() to closing transport fd %i', self._out_fd)


This is possibly too loud for something that can occur due to no error at all in a race condition on old Python versions....

Right -- now that we understand it, I lowered it to debug().

A SubprocessTransport may die soon after starting without draining its stdin. This can e.g. happen with "connection refused" and similar networking errors when running `ssh`. Trying to write to the process is then likely to trigger a `BrokenPipeError`. This messes up proper error handling, as the error usually gets read from the process'es err and exit code. At this point "exited" signal didn't arrive yet, so ignore the BrokenPipeError abort in the Peer, and defer error reporting to the process exit handler.

With a subprocess transport it can happen that we get and process the the "exited" event before seeing `BrokenPipeError` when trying to write() to it. This would previously crash with an AssertionError. This can happen with `SafeChildWatcher` (which we use on Python < 3.9, because `PidfdChildWatcher` isn't available to us), which will immediately fire its callback if the child has already exited by the time the watch is registered. `test_spawn_broken_pipe()` from the previous commit exits fast enough to cause a race with that registration. Do what the standard library does [1] and ignore the write. [1] https://github.com/python/cpython/blob/3.11/Lib/asyncio/unix_events.py#L685

The pybridge was enabled on all Debian/Ubuntu in commit 6283d5c.

martinpitt · 2023-10-04T09:35:48Z

@allisonkarlitskaya Nice, thanks for figuring that out! I did the log level change and also incorporated your explanation into the commit message.

martinpitt added the no-test For doc/workflow changes, or experiments which don't need a full CI run, label Sep 28, 2023

martinpitt mentioned this pull request Sep 28, 2023

beiboot: feature parity with cockpit-ssh #19401

Merged

4 tasks

martinpitt requested a review from allisonkarlitskaya September 28, 2023 15:59

martinpitt removed the no-test For doc/workflow changes, or experiments which don't need a full CI run, label Sep 29, 2023

martinpitt force-pushed the beiboot-epipe branch from 96e4af8 to bdce2fc Compare September 29, 2023 04:26

martinpitt changed the title ~~pybridge: Avoid BrokenPipeError crash of ssh transport in beiboot~~ pybridge: Avoid BrokenPipeError crash of SubprocessTransport.write() Sep 29, 2023

martinpitt marked this pull request as ready for review September 29, 2023 05:09

martinpitt force-pushed the beiboot-epipe branch from bdce2fc to 9980351 Compare September 29, 2023 09:44

martinpitt mentioned this pull request Sep 29, 2023

beiboot preparation fixes #19413

Merged

martinpitt force-pushed the beiboot-epipe branch 2 times, most recently from 024c7c6 to f3ce043 Compare September 29, 2023 14:29

martinpitt removed the request for review from allisonkarlitskaya October 4, 2023 02:26

martinpitt force-pushed the beiboot-epipe branch from f3ce043 to d8d6952 Compare October 4, 2023 02:33

martinpitt force-pushed the beiboot-epipe branch from d8d6952 to 8b028b4 Compare October 4, 2023 02:35

martinpitt requested a review from allisonkarlitskaya October 4, 2023 04:59

allisonkarlitskaya previously approved these changes Oct 4, 2023

View reviewed changes

martinpitt added 3 commits October 4, 2023 11:31

test: Enable TestClient on debian-stable and ubuntu-2204

716dbef

The pybridge was enabled on all Debian/Ubuntu in commit 6283d5c.

martinpitt dismissed allisonkarlitskaya’s stale review via 716dbef October 4, 2023 09:35

martinpitt force-pushed the beiboot-epipe branch from 8b028b4 to 716dbef Compare October 4, 2023 09:35

martinpitt requested a review from allisonkarlitskaya October 4, 2023 09:35

allisonkarlitskaya approved these changes Oct 4, 2023

View reviewed changes

martinpitt added the release-blocker Targetted for next release label Oct 4, 2023

martinpitt merged commit a4d7582 into cockpit-project:main Oct 4, 2023
100 checks passed

martinpitt deleted the beiboot-epipe branch October 4, 2023 14:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pybridge: Avoid BrokenPipeError crash of SubprocessTransport.write() #19414

pybridge: Avoid BrokenPipeError crash of SubprocessTransport.write() #19414

martinpitt commented Sep 28, 2023 •

edited

Loading

allisonkarlitskaya commented Sep 28, 2023 •

edited

Loading

martinpitt commented Sep 29, 2023 •

edited

Loading

martinpitt commented Sep 29, 2023

allisonkarlitskaya commented Sep 29, 2023

allisonkarlitskaya commented Sep 29, 2023

martinpitt commented Sep 29, 2023

allisonkarlitskaya commented Sep 29, 2023

allisonkarlitskaya commented Sep 29, 2023

allisonkarlitskaya commented Sep 29, 2023 •

edited

Loading

martinpitt commented Sep 29, 2023

martinpitt commented Sep 29, 2023

allisonkarlitskaya commented Sep 29, 2023

allisonkarlitskaya commented Sep 29, 2023

martinpitt commented Oct 4, 2023

martinpitt commented Oct 4, 2023

martinpitt commented Oct 4, 2023 •

edited

Loading

allisonkarlitskaya left a comment

allisonkarlitskaya Oct 4, 2023

martinpitt Oct 4, 2023

martinpitt commented Oct 4, 2023

pybridge: Avoid BrokenPipeError crash of SubprocessTransport.write() #19414

pybridge: Avoid BrokenPipeError crash of SubprocessTransport.write() #19414

Conversation

martinpitt commented Sep 28, 2023 • edited Loading

allisonkarlitskaya commented Sep 28, 2023 • edited Loading

martinpitt commented Sep 29, 2023 • edited Loading

martinpitt commented Sep 29, 2023

allisonkarlitskaya commented Sep 29, 2023

allisonkarlitskaya commented Sep 29, 2023

martinpitt commented Sep 29, 2023

allisonkarlitskaya commented Sep 29, 2023

allisonkarlitskaya commented Sep 29, 2023

allisonkarlitskaya commented Sep 29, 2023 • edited Loading

martinpitt commented Sep 29, 2023

martinpitt commented Sep 29, 2023

allisonkarlitskaya commented Sep 29, 2023

allisonkarlitskaya commented Sep 29, 2023

martinpitt commented Oct 4, 2023

martinpitt commented Oct 4, 2023

martinpitt commented Oct 4, 2023 • edited Loading

allisonkarlitskaya left a comment

Choose a reason for hiding this comment

allisonkarlitskaya Oct 4, 2023

Choose a reason for hiding this comment

martinpitt Oct 4, 2023

Choose a reason for hiding this comment

martinpitt commented Oct 4, 2023

martinpitt commented Sep 28, 2023 •

edited

Loading

allisonkarlitskaya commented Sep 28, 2023 •

edited

Loading

martinpitt commented Sep 29, 2023 •

edited

Loading

allisonkarlitskaya commented Sep 29, 2023 •

edited

Loading

martinpitt commented Oct 4, 2023 •

edited

Loading