pybridge: Avoid BrokenPipeError crash of SubprocessTransport.write() #19414
Conversation
I'm not a fan. This is too complicated. What if we threw
NB https://github.com/cockpit-project/cockpit/blob/main/src/cockpit/peer.py#L102
Absolutely, see the TODO list. This was an experimental hack, a.k.a. "troll Lis" 😁, I don't want to land this.
I don't see how
One possible way out may be to add something more specific. I played around with this idea, and added a check to SubprocessTransport.write() at the end to see if it even knows about the exit yet -- and it doesn't:
As we can't rely on picking up the "exited" signal before the EPIPE, I don't see any other way than just ignoring EPIPE at that point.
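For context, here is a minimal standalone sketch of the failure mode under discussion (not the cockpit code; the short-lived child command mirrors the one used in the test further down): writing to a child that has already exited raises BrokenPipeError, and the useful diagnostics are the child's stderr and exit code, not the failed write.

```python
import os
import subprocess
import sys

# Spawn a child that exits immediately, similar to the mock command used in
# the test case below.
proc = subprocess.Popen(['sh', '-c', 'echo kaputt >&2; exit 9'],
                        stdin=subprocess.PIPE, stderr=subprocess.PIPE)
proc.wait()  # make sure the child is already gone before we write

try:
    # Writing to the pipe now fails with EPIPE, because the read end is closed.
    os.write(proc.stdin.fileno(), b'abcdefg\n' * 10000)
except BrokenPipeError:
    # The failed write carries no useful information; the child's stderr and
    # exit code do, so report those instead of crashing.
    pass

print('exit code:', proc.returncode, file=sys.stderr)
print('stderr:', proc.stderr.read().decode().strip(), file=sys.stderr)
proc.stdin.close()
proc.stderr.close()
```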
Force-pushed from 96e4af8 to bdce2fc
I reworked this to be less gross and more generic. I still don't claim elegance, but I'm running out of ideas here. Let's get a complete test run to get some feedback.
I would do this from
We could also just add
Force-pushed from bdce2fc to 9980351
So your second approach works fine -- pushing to give it a full exercise. However, I'm struggling with writing a unit test. This is what I have so far:
--- test/pytest/test_peer.py
+++ test/pytest/test_peer.py
@@ -5,9 +5,10 @@ import sys
import pytest
from cockpit.packages import BridgeConfig
-from cockpit.peer import ConfiguredPeer, PeerRoutingRule
+from cockpit.peer import ConfiguredPeer, PeerExited, PeerRoutingRule
from cockpit.protocol import CockpitProtocolError
from cockpit.router import Router
+from cockpit.transports import SubprocessTransport
from . import mockpeer
from .mocktransport import MockTransport
@@ -174,3 +175,20 @@ async def test_await_cancellable_connect_close(monkeypatch, event_loop, bridge):
while len(asyncio.all_tasks()) > 1:
await asyncio.sleep(0.1)
assert peer.was_cancelled
+
+
+@pytest.mark.asyncio
+async def test_spawn_broken_pipe(bridge):
+ class BrokenPipePeer(ConfiguredPeer):
+ async def do_connect_transport(self) -> None:
+ transport = await self.spawn(['sh', '-c', 'echo kaputt >&2; exit 9'], ())
+ assert isinstance(transport, SubprocessTransport)
+ transport.write(b'abcdefg\n' * 10000)
+ while transport.get_returncode() is None:
+ await asyncio.sleep(0.1)
+
+ peer = BrokenPipePeer(bridge, PEER_CONFIG)
+ with pytest.raises(PeerExited) as raises:
+ await peer.start()
+ assert raises.value.exit_code == 9
+ peer.close()
But it still fails on
The relevant block of code, for reference:

except (PeerExited, BrokenPipeError):
    # PeerExited is a fairly generic error. If the connection process is
    # still running, perhaps we'd get a better error message from it.
    # We get BrokenPipeError with .write() if the process has died, but we didn't get
    # the exit code yet. We want to handle errors by reading the process'es stdout/err and
    # exit code, so ignore the BrokenPipeError.
    await connect_task

    # Otherwise, re-raise
    raise

But your added comment is wrong. We don't ignore the
This is done in order to give ferny a chance to return an exception from
In short: you need to add some
In general, I'd prefer if you integrated your comment a bit more into the text above it. Perhaps something like:
One more thing we could do: if we get
The only issue is that in case this assumption is false, we're going to end up in a bad place.
Two more suggestions about your testcase:
Force-pushed from 024c7c6 to f3ce043
@allisonkarlitskaya How about this?
This py3.7 failure is curious, but I don't have time any more to examine this now. Next week, or if Lis beats me to it 😁
That failure looks like a pretty normal race condition. Probably we get The
I checked what the standard library does in the case of write-after-close and it seems like we're too harsh here. The stock asyncio transports simply ignore the write. So I could see us changing the behaviour there, in any case. I guess we sort of anticipate that the
So let's fix the transport. I'm going to rerun the test here out of curiosity... I wonder if Python 3.7 does an extra return to the mainloop for some reason...
Unsurprisingly, it passed on the retry.
I can't reproduce that particular write() assertion failure with running the test 100 times, and not even with wrapping that part of the test into
With dropping the
Force-pushed from f3ce043 to d8d6952
Added an extra commit to fix the write() assertion.
Force-pushed from d8d6952 to 8b028b4
The rawhide failure in TestFirewall.testFirewallPage happens in every PR now; let's investigate/naughty that today. Unrelated. Update: Landed in PR #19428
Okay. I've done enough research to satisfy myself about exactly what's going on here.
The issue isn't anything to do with our await on the (not-really-)async spawn() somehow returning to the mainloop, nor does it have anything to do with threads. The actual issue is that the SafeChildWatcher (which we use on Python < 3.9, because PidfdChildWatcher isn't available to us) will immediately fire its callback if the child has already exited by the time the watch is registered. Our example program here exits fast enough to cause a race with that registration, triggering the strange issue we saw about writing to an already-closed transport.
This race is not possible after 3.9, which also gives me faith in our unit test testing the thing we want it to -- after 3.9 it should be hitting our desired case 100% of the time.
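For illustration, a rough sketch of that version split (an assumption about how one might pick a watcher, not necessarily what cockpit does):

```python
import asyncio
import sys


def pick_child_watcher() -> "asyncio.AbstractChildWatcher":
    """Return a child watcher appropriate for the running interpreter.

    PidfdChildWatcher (Python >= 3.9, Linux >= 5.3) reports the exit via a
    pidfd added to the event loop. SafeChildWatcher, the fallback, calls
    waitpid() when the handler is registered and may therefore fire the
    callback immediately for an already-exited child -- the race described
    above.
    """
    if sys.version_info >= (3, 9):
        return asyncio.PidfdChildWatcher()
    return asyncio.SafeChildWatcher()
```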
I only have one minor reservation, but it's not super important....
src/cockpit/transports.py
# event before seeing BrokenPipeError, we'll try to write to a closed pipe.
# Do what the standard library does and ignore, instead of assert
if self._closing:
    logger.warning('ignoring write() to closing transport fd %i', self._out_fd)
This is possibly too loud for something that can occur due to no error at all in a race condition on old Python versions....
Right -- now that we understand it, I lowered it to debug().
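To make the agreed behaviour concrete, here is a minimal toy sketch of such a guard; the `_closing` and `_out_fd` names come from the quoted snippet, while the surrounding class is invented for illustration and is not the actual SubprocessTransport:

```python
import logging
import os

logger = logging.getLogger(__name__)


class ToyWriteTransport:
    """Toy stand-in for a write transport; only the close guard matters here."""

    def __init__(self, out_fd: int) -> None:
        self._out_fd = out_fd
        self._closing = False

    def close(self) -> None:
        self._closing = True

    def write(self, data: bytes) -> None:
        if self._closing:
            # Like asyncio's unix pipe transports, quietly drop (and merely
            # log) a write that arrives after the transport started closing,
            # instead of failing an assertion.
            logger.debug('ignoring write() to closing transport fd %i', self._out_fd)
            return
        os.write(self._out_fd, data)
```

Dropping (rather than asserting on) the late write matches the CPython _UnixWritePipeTransport behaviour linked in the commit message below.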
A SubprocessTransport may die soon after starting without draining its stdin. This can e.g. happen with "connection refused" and similar networking errors when running `ssh`. Trying to write to the process is then likely to trigger a `BrokenPipeError`. This messes up proper error handling, as the error usually gets read from the process's stderr and exit code. At this point the "exited" signal has not arrived yet, so ignore the BrokenPipeError abort in the Peer, and defer error reporting to the process exit handler.
With a subprocess transport it can happen that we get and process the "exited" event before seeing `BrokenPipeError` when trying to write() to it. This would previously crash with an AssertionError. This can happen with `SafeChildWatcher` (which we use on Python < 3.9, because `PidfdChildWatcher` isn't available to us), which will immediately fire its callback if the child has already exited by the time the watch is registered. `test_spawn_broken_pipe()` from the previous commit exits fast enough to cause a race with that registration. Do what the standard library does [1] and ignore the write.
[1] https://github.com/python/cpython/blob/3.11/Lib/asyncio/unix_events.py#L685
The pybridge was enabled on all Debian/Ubuntu in commit 6283d5c.
Force-pushed from 8b028b4 to 716dbef
@allisonkarlitskaya Nice, thanks for figuring that out! I did the log level change and also incorporated your explanation into the commit message.
A SubprocessTransport may die soon after starting without draining its stdin. This can e.g. happen with "connection refused" and similar networking errors when running ssh. Trying to write to the process is then likely to trigger a BrokenPipeError. This messes up proper error handling, as the error usually gets read from the process's stderr and exit code.
At this point the "exited" signal has not arrived yet, so ignore the BrokenPipeError abort in the Peer, and defer error reporting to the process exit handler.
Beiboot is broken in an interesting way on ubuntu-2204. It also happens on fedora-39 and rhel-9-4, but much less often -- the automatic retries usually "take care" of it. On ubuntu-2204 this reproduces almost perfectly locally (for some weird reason of timing/py version/etc.), it actually works only very rarely. That makes it ideal to investigate it on, and test fixes on.
What happens: beiboot.py's SshPeer.do_connect_transport starts the ssh transport and immediately starts sending data to it (the stage 1 boot loader). That data does not currently get queued. This always has to wait for ssh doing its (possibly interactive) authentication, but the thing is, we can't really predict -- it may be noninteractive SSH key authentication with none of AuthorizeResponder's handlers ever getting invoked. I.e. it's not clear what to wait on, actually. So in some cases it overflows the kernel fd buffer and thus triggers the BrokenPipeError.
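A rough standalone way to provoke the same kind of failure outside of beiboot (assumptions: `ssh` is installed, nothing listens on localhost port 1 so the connection is refused quickly, and the 1 MiB payload merely stands in for the stage 1 boot loader):

```python
import asyncio
import sys


async def main() -> None:
    # ssh fails quickly with "connection refused" and never reads its stdin.
    proc = await asyncio.create_subprocess_exec(
        'ssh', '-p', '1', 'localhost', 'true',
        stdin=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE)

    payload = b'x' * (1024 * 1024)  # comfortably larger than the pipe buffer
    try:
        proc.stdin.write(payload)
        await proc.stdin.drain()
    except (BrokenPipeError, ConnectionResetError):
        # ssh died before draining our data; the interesting diagnostics are
        # its stderr and exit code, not the failed write.
        pass

    stderr = await proc.stderr.read()
    print('ssh exited with', await proc.wait(), file=sys.stderr)
    print(stderr.decode().strip(), file=sys.stderr)


asyncio.run(main())
```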