QUIC: MsQuicListener can randomly fail to accept new connection #55979

wfurt · 2021-07-20T03:05:37Z

That can be obvious with

System.Net.Quic.QuicException : Connection has been shutdown by transport. Error Code: CONNECTION_REFUSED

however there are more hidden implications:
#55642
we have test fragments like

            ValueTask clientTask = clientConnection.ConnectAsync();
            using QuicConnection serverConnection = await listener.AcceptConnectionAsync();
            await clientTask;

if the connect fails, the AcceptConnectionAsync() would hang forever since the clientTask is not awaited.
There are other reasons why we hang and @stephentoub is helping with the other patterns.

as @ManickaP pointed in #53224
https://github.com/microsoft/msquic/blob/3898b2b88085a478eeb844885ea78daa9b060d06/src/core/stream.c#L207-L211

    if (QuicConnIsClosed(Stream->Connection) ||
        Stream->Flags.Started) {
        Status = QUIC_STATUS_INVALID_STATE;
        goto Exit;
    }

That check is QUIC protocol state. If we get close from peer, there is race condition there.
I correlate this with packet capture and I can that the failed stream/connection got REFUSED message as well.
test-refused.pcapng.zip

This is unpleasant as it is difficult to hook any retry logic to INVALID_STATE.

There may be more to it but this is partially caused by (MS?)QUIC design.
I originally thought there may be some race condition starting the listener but clearly the message is well-formate QUIC.
When I debug the failures, there would be NO callback on the MsQuicListener and the logic happens inside msquic.
@nibanks pointed me to QuicWorkerIsOverloaded and when this is true, listener would refuse to take up more work e.g. accept new connections. There may be other reasons why msquic would refuse new connection.

This is also unpleasant as our tests do stress CPU and we are at point where any test can impact any test.

The text was updated successfully, but these errors were encountered:

ghost · 2021-07-20T03:05:41Z

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

That can be obvious with

System.Net.Quic.QuicException : Connection has been shutdown by transport. Error Code: CONNECTION_REFUSED

however there are more hidden implications:
#55642
we have test fragments like

            ValueTask clientTask = clientConnection.ConnectAsync();
            using QuicConnection serverConnection = await listener.AcceptConnectionAsync();
            await clientTask;

if the connect fails, the AcceptConnectionAsync() would hang forever since the clientTask is not awaited.
There are other reasons why we hang and @stephentoub is helping with the other patterns.

as @ManickaP pointed in #53224
https://github.com/microsoft/msquic/blob/3898b2b88085a478eeb844885ea78daa9b060d06/src/core/stream.c#L207-L211

    if (QuicConnIsClosed(Stream->Connection) ||
        Stream->Flags.Started) {
        Status = QUIC_STATUS_INVALID_STATE;
        goto Exit;
    }

That check is QUIC protocol state. If we get close from peer, there is race condition there.
I correlate this with packet capture and I can that the failed stream/connection got REFUSED message as well.
test-refused.pcapng.zip

This is unpleasant as it is difficult to hook any retry logic to INVALID_STATE.

There may be more to it but this is partially caused by (MS?)QUIC design.
I originally thought there may be some race condition starting the listener but clearly the message is well-formate QUIC.
When I debug the failures, there would be NO callback on the MsQuicListener and the logic happens inside msquic.
@nibanks pointed me to QuicWorkerIsOverloaded and when this is true, listener would refuse to take up more work e.g. accept new connections. There may be other reasons why msquic would refuse new connection.

This is also unpleasant as our tests do stress CPU and we are at point where any test can impact any test.

Author:	wfurt
Assignees:	-
Labels:	`area-System.Net.Quic`
Milestone:	-

karelz · 2021-07-20T16:48:59Z

Triage: Crank up the threshold limit higher for testing.

wfurt · 2021-07-25T23:07:40Z

Aside from the CPU check in QuicWorkerIsOverloaded, I see many cases when we hit

if (Connection->ReceiveQueueCount >= QUIC_MAX_RECEIVE_QUEUE_COUNT) {
..
QuicPacketLogDrop(Connection, CxPlatDataPathRecvDataToRecvPacket(Datagram), "Max queue limit reached");

this may be related to existing tests - they are weird in a sense that we send lot of 1-2 bytes chunks so it seems easy or overrun the 180 limit. I wish this is something we could tune up for the test runs @nibanks but it seems to be compile time constant.

nibanks · 2021-07-25T23:35:36Z

IMHO, this is not particularly interesting. When you run things for long enough at max CPU queues build up. You cannot queue unlimited. You will continue to hit issues like this. Dropped packets are not fatal. They will get retransmitted.

wfurt · 2021-07-26T00:03:40Z

Point taken. We have quite a few randomly failing tests in CI - something not trivial to investigate.
I assume we can get to the counters, right? Is there something new can look at to know for sure why given connection failed? (assuming we have access to the Listener and client connection)

wfurt · 2021-07-26T17:19:06Z

I put assert to QuicWorkerIsOverloaded and I can see it hit - took ~ 200 iterations in this case.

(lldb) target create "../../testhost/net6.0-Linux-Debug-x64/dotnet" --core "core"
Core file '/home/furt/github/wfurt-runtime/artifacts/bin/System.Net.Quic.Functional.Tests/net6.0-Unix-Debug/core' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'dotnet', stop reason = signal SIGABRT
  * frame #0: 0x00007fb949b5518b libc.so.6`__GI_raise(sig=2) at raise.c:51
    frame #1: 0x00007fb949b3492e libc.so.6`__GI_abort at abort.c:100
    frame #2: 0x00007fb8cdf6715b libmsquic.so`quic_bugcheck at platform_posix.c:61
    frame #3: 0x00007fb8cdf0a78d libmsquic.so`QuicRegistrationAcceptConnection at worker.h:118
    frame #4: 0x00007fb8cdf0a74c libmsquic.so`QuicRegistrationAcceptConnection(Registration=<unavailable>, Connection=<unavailable>) at registration.c:356
    frame #5: 0x00007fb8cdf07154 libmsquic.so`QuicListenerAcceptConnection(Listener=0x00007fb854006410, Connection=0x00007fb8840070a0, Info=0x00007fb8377fd9c0) at listener.c:557
    frame #6: 0x00007fb8cdf20566 libmsquic.so`QuicBindingAcceptConnection(Binding=<unavailable>, Connection=0x00007fb8840070a0, Info=0x00007fb8377fd9c0) at binding.c:555
    frame #7: 0x00007fb8cdf37f30 libmsquic.so`QuicCryptoProcessData(Crypto=0x00007fb8840078c0, IsClientInitial=<unavailable>) at crypto.c:1735
    frame #8: 0x00007fb8cdf383e6 libmsquic.so`QuicCryptoProcessFrame(Crypto=0x00007fb8840078c0, KeyType=<unavailable>, Frame=<unavailable>) at crypto.c:1250
    frame #9: 0x00007fb8cdf2e841 libmsquic.so`QuicConnRecvFrames(Connection=0x00007fb8840070a0, Path=<unavailable>, Packet=0x00007fb854053410, ECN=<unavailable>) at connection.c:4199
    frame #10: 0x00007fb8cdf313e7 libmsquic.so`QuicConnRecvDatagramBatch(Connection=0x00007fb8840070a0, Path=<unavailable>, BatchCount=<unavailable>, Datagrams=0x00007fb8377fdcc0, Cipher=<unavailable>, RecvState=<unavailable>) at connection.c:5102
    frame #11: 0x00007fb8cdf31dca libmsquic.so`QuicConnRecvDatagrams(Connection=0x00007fb8840070a0, DatagramChain=<unavailable>, DatagramChainCount=<unavailable>, IsDeferred=<unavailable>) at connection.c:5294
    frame #12: 0x00007fb8cdf322dd libmsquic.so`QuicConnFlushRecv(Connection=0x00007fb8840070a0) at connection.c:5438
    frame #13: 0x00007fb8cdf34c08 libmsquic.so`QuicConnDrainOperations(Connection=0x00007fb8840070a0) at connection.c:6819
    frame #14: 0x00007fb8cdf191a5 libmsquic.so`QuicWorkerProcessConnection(Worker=0x00007fb85c0380a0, Connection=0x00007fb8840070a0) at worker.c:488
    frame #15: 0x00007fb8cdf197d6 libmsquic.so`QuicWorkerThread(Context=0x00007fb85c0380a0) at worker.c:576
    frame #16: 0x00007fb94a05c609 libpthread.so.0`start_thread + 217
    frame #17: 0x00007fb949c31293 libc.so.6`__GI___clone at clone.S:95

karelz · 2021-07-27T08:58:55Z

The impact on our tests is larger and larger, affecting entire runtime repo. We need to either find a way to disable the behavior in msquic during our tests, or we have to find a workaround to not hit it in our tests (e.g. run msquic tests serially). We need some solution ASAP (in a day or two).

As an example: Even our BasicTest in #56269 is failing too often and we should disable it in CI due to the noise it is causing.

wfurt · 2021-08-05T18:43:05Z

We talk about it more today @nibanks and just to confirm and clarify. When MsQuic feels it is overloaded it would send back packages with notification instead of just ignoring the request and waiting for re-transmit, right?

And is there any other reason why client may get CONNECTION_REFUSED error?
I'm primarily asking if this is something we can depend on for test connection retry or if there is any better way how detect this situation.

nibanks · 2021-08-05T18:52:05Z

In this case, MsQuic should inform the peer with the CONNECTION_REFUSED error. There shouldn't be any other reasons that you'd see this error from an MsQuic server. You can retry if you want. But you're just going to be adding more load to the system and there's no guarantee you'll succeed any time soon. If you have enough parallel tests all "retrying until they succeed" it's possible you never get there. Perhaps if you have "sleep and retry" model it's work better.

wfurt · 2021-08-05T22:31:48Z

I see no evidence that the system would be generally overloaded. When I run it locally on my vm I see 20-80% cpu utilization. There may be spikes and with docker and virtualization there are no real-time guarantees.

At this moment this is biggest reason. for CI instability impacting all .NET contributors. There are other factors like occasional DNS failures or outages in external Azure services. But all the others are quite rare so we live with them.
Also we are yet to run tests on ARM and some of those machines are SLOW.

If we can move this to the occasional but rare category it would be still big improvement. That may be wait and retry, increasing the threshold or some other strategy. For example could MsQuic give us event so we know that particular connection is going to fail?

ManickaP · 2021-08-10T17:24:58Z

Triage: partial fix in PR. We also want to disable parallelism of QUIC tests to minimize the probability of this issue, after all that's what msquic does in their test suite.
Merge the PR in flight, get in the PR with disabled parallelism and close this.
If it comes back, we'd need to disable tests more aggressively since there's no easy remedy to this problem.

ManickaP · 2023-01-26T16:36:57Z

Triage: there's nothing we can do about the root cause on our side. We could expose counters with the values of msquic statistics (either private or public). Either way, this is diagnostics and not critical.

rzikm · 2024-02-28T17:15:44Z

Now that #98361 is merged, this issue should not be an issue anymore (unless the machine is truly overloaded). Let's keep this issue open until we can validate this when MsQuic 2.4 gets released and the CI images get updated.

wfurt · 2024-02-28T17:19:31Z

Are we brave enough to run tests in parallel again?

rzikm · 2024-02-28T17:20:37Z

Are we brave enough to run tests in parallel again?

I would, try it, once all runners have MsQuic 2.4 (the cert validation does not run async on lower versions because of microsoft/msquic#4132)

nibanks · 2024-02-28T17:25:36Z

Are we brave enough to run tests in parallel again?

I would, try it, once all runners have MsQuic 2.4 (the cert validation does not run async on lower versions because of microsoft/msquic#4132)

Should we take that fix to v2.3? Or do you not care, because you're not actually using v2.3?

rzikm · 2024-02-28T17:47:00Z

We are running 2.3 on CI, so porting it would be nice. Thanks in advance.

nibanks · 2024-02-28T18:04:42Z

We are running 2.3 on CI, so porting it would be nice. Thanks in advance.

Ok. I will try to get around to it this week. If you want, feel free to make a PR to release/2.3 to cherry-pick the PR.

rzikm · 2024-02-29T08:45:40Z

microsoft/msquic#4164

wfurt added the area-System.Net.Quic label Jul 20, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Jul 20, 2021

karelz added test-bug Problem in test source code (most likely) and removed untriaged New issue has not been triaged by the area owner labels Jul 20, 2021

karelz added this to the 6.0.0 milestone Jul 20, 2021

karelz assigned wfurt Jul 20, 2021

wfurt mentioned this issue Jul 24, 2021

Test failure: System.Net.Quic.Tests.QuicStreamTests_MsQuicProvider.BasicTest #56269

Closed

wfurt mentioned this issue Jul 26, 2021

Disable SocketsHttpHandlerTest_Http3_MsQuic.ClientSettingsReceived_Su… #56338

Merged

wfurt mentioned this issue Jul 27, 2021

disable test parallelization for http3 #56413

Merged

CarnaViire mentioned this issue Jul 29, 2021

[HTTP/3] Stress test regression #56310

Closed

wfurt mentioned this issue Aug 4, 2021

Test failure: Dispose_WithPendingAcceptAndConnect_PendingAndSubsequentThrowOperationAbortedException #56875

Closed

ManickaP mentioned this issue Aug 6, 2021

Test failure: ConnectionClosedByPeer_WithPendingAcceptAndConnect_PendingAndSubsequentThrowConnectionAbortedException #56926

Closed

wfurt mentioned this issue Aug 7, 2021

improve resiliency of quic test #57020

Merged

ManickaP mentioned this issue Aug 10, 2021

Test failure: ArgumentValidation_ThrowsExpectedException #56925

Closed

karelz added test-enhancement Improvements of test source code and removed enhancement Product code improvement that does NOT require public API changes/additions labels Aug 2, 2022

karelz modified the milestones: 7.0.0, 8.0.0 Aug 2, 2022

karelz unassigned wfurt Aug 2, 2022

CarnaViire mentioned this issue Oct 18, 2022

[HttpStress] massive timeouts around HTTP/3 stress test startup #77126

Closed

karelz modified the milestones: 8.0.0, 9.0.0 Jul 18, 2023

pepone mentioned this issue Oct 20, 2023

Run QUIC tests serially icerpc/icerpc-csharp#3730

Merged

wfurt mentioned this issue Jan 2, 2024

[QUIC] Disable QUIC tests on CLR stress tests #96392

Merged

rzikm self-assigned this Feb 7, 2024

This was referenced Feb 14, 2024

Don't call user callbacks on MsQuic worker thread. #98361

Merged

Privately expose MsQuic performance counters #98422

Merged

rzikm mentioned this issue Apr 15, 2024

Run MsQuic tests in parallel. #100947

Merged

dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Apr 15, 2024

rzikm closed this as completed in #100947 Apr 16, 2024

github-actions bot locked and limited conversation to collaborators May 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

QUIC: MsQuicListener can randomly fail to accept new connection #55979

QUIC: MsQuicListener can randomly fail to accept new connection #55979

wfurt commented Jul 20, 2021

ghost commented Jul 20, 2021

karelz commented Jul 20, 2021

wfurt commented Jul 25, 2021

nibanks commented Jul 25, 2021

wfurt commented Jul 26, 2021

wfurt commented Jul 26, 2021

karelz commented Jul 27, 2021

wfurt commented Aug 5, 2021

nibanks commented Aug 5, 2021

wfurt commented Aug 5, 2021

ManickaP commented Aug 10, 2021

ManickaP commented Jan 26, 2023

rzikm commented Feb 28, 2024

wfurt commented Feb 28, 2024

rzikm commented Feb 28, 2024 •

edited

Loading

nibanks commented Feb 28, 2024

rzikm commented Feb 28, 2024

nibanks commented Feb 28, 2024

rzikm commented Feb 29, 2024

QUIC: MsQuicListener can randomly fail to accept new connection #55979

QUIC: MsQuicListener can randomly fail to accept new connection #55979

Comments

wfurt commented Jul 20, 2021

ghost commented Jul 20, 2021

karelz commented Jul 20, 2021

wfurt commented Jul 25, 2021

nibanks commented Jul 25, 2021

wfurt commented Jul 26, 2021

wfurt commented Jul 26, 2021

karelz commented Jul 27, 2021

wfurt commented Aug 5, 2021

nibanks commented Aug 5, 2021

wfurt commented Aug 5, 2021

ManickaP commented Aug 10, 2021

ManickaP commented Jan 26, 2023

rzikm commented Feb 28, 2024

wfurt commented Feb 28, 2024

rzikm commented Feb 28, 2024 • edited Loading

nibanks commented Feb 28, 2024

rzikm commented Feb 28, 2024

nibanks commented Feb 28, 2024

rzikm commented Feb 29, 2024

rzikm commented Feb 28, 2024 •

edited

Loading