Missing flow control appears to block traffic #180
We have confirmed that issue (2) above (missing flow control) is indeed an issue, although the operating system buffers appear to be slightly bigger than anticipated. We have worked on a resolution in our fork for review: mandelsoft@bb4a77d. We can demonstrate a blocked tunnel without the patch; with the patch it does not block.
I think we are encountering a similar issue to (4): in our scale tests, some cases always time out on kubectl logs.
Hi @ydp, this issue could well cause frequent timeouts from kubectl logs as it goes via the tunnel. #167 looks like the error handling issue described in (3) above. You should be able to get rid of the blocking issues with our two pull requests mandelsoft@bb4a77d and #179. Keep in mind that we have not done detailed testing of those, except to validate that they indeed fix some or even all of the blocking issues. The first fix in particular is quite involved; the second one is much simpler and will probably fix your issue. It does not, however, fix #167, although that should be quite straightforward to fix (add a bit of error handling).
Hi folks, we've been doing a lot of Kubernetes conformance testing with Konnectivity version 0.0.19 enabled for Kubernetes version 1.21 clusters. Here is the basic structure of our test invocation.
These tests have consistently failed with Konnectivity, even after scaling up. We then built Konnectivity with #179 applied, and now the tests pass.
@marwinski Thank you!
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
We have integrated the apiserver-network-proxy into our project and stumbled across some possibly more fundamental issues with it. Don’t get me wrong: we really like this project and really appreciate the work that has been done; however, these issues would ultimately not allow us to use it, although our understanding of the whole domain might be too limited. These are the issues we know of, or suspect, based on our observations:
(1) My colleague @ScheererJ has filed PR #179 ("Add a timeout when the proxy agent dials to the remote") fixing an issue that caused our tests to fail. The problem was that a blocked Dial call could cause the receive loop to hang for up to 10 minutes, blocking all tunnel traffic. Making the dial asynchronous fixed the tests. We consider this to be a valid fix for the issue.
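For illustration, here is a minimal sketch of the general pattern (not the actual code from #179): the dial is moved out of the receive loop into its own goroutine with a timeout, so a slow or unreachable destination cannot stall the whole tunnel. The names handleDial, onSuccess, and onError are hypothetical and do not exist in the project.

```go
// Hypothetical sketch: a potentially blocking dial is moved out of the
// receive loop so an unreachable destination cannot stall tunnel traffic.
package main

import (
	"context"
	"log"
	"net"
	"time"
)

// handleDial is assumed to be called from the agent's receive loop for a
// dial request. It returns immediately; the dial itself happens in a
// separate goroutine with a timeout.
func handleDial(address string, onSuccess func(net.Conn), onError func(error)) {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()

		var d net.Dialer
		conn, err := d.DialContext(ctx, "tcp", address)
		if err != nil {
			onError(err) // e.g. report a dial failure back through the tunnel
			return
		}
		onSuccess(conn) // e.g. register the connection and start its own read loop
	}()
}

func main() {
	done := make(chan struct{})
	handleDial("example.com:80",
		func(c net.Conn) { log.Printf("connected to %v", c.RemoteAddr()); c.Close(); close(done) },
		func(err error) { log.Printf("dial failed: %v", err); close(done) },
	)
	<-done
}
```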
(2) Even with (1) we see that the runtime of the tests is roughly twice that of our traditional connectivity solution (which has its own issues, which is why we want to replace it). We have no proof, but we strongly suspect the root cause to be in https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/pkg/agent/client.go#L417 (plus a duplicate of this in the server). The problem is that if the receiver (the process to which the agent forwards the traffic) is slow to process data for whatever reason, it will block all connectivity inside the tunnel, just like (1) above.
The only way to fix this, from our point of view, is to introduce some kind of flow control that blocks the sender (i.e. the kube-apiserver) until the other side has received the data (or vice versa). This is of course not trivial, and we have no fix or proposal for it so far.
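To make the idea concrete, here is a minimal, hypothetical sketch of window-based flow control between two tunnel endpoints. None of these names (sendWindow, reserve, ack) exist in the project, and a real implementation would also have to carry the acknowledgements as tunnel packets; this only shows how a bounded window blocks one slow connection without touching the others.

```go
// Hypothetical sketch of window-based flow control between tunnel endpoints.
package main

import (
	"fmt"
	"sync"
)

// sendWindow blocks the sender once more than `limit` bytes are in flight;
// it is released when the receiver acknowledges consumed bytes.
type sendWindow struct {
	mu       sync.Mutex
	cond     *sync.Cond
	inFlight int
	limit    int
}

func newSendWindow(limit int) *sendWindow {
	w := &sendWindow{limit: limit}
	w.cond = sync.NewCond(&w.mu)
	return w
}

// reserve blocks until n bytes fit into the window; called before sending data.
func (w *sendWindow) reserve(n int) {
	w.mu.Lock()
	for w.inFlight+n > w.limit {
		w.cond.Wait()
	}
	w.inFlight += n
	w.mu.Unlock()
}

// ack is called when the other side reports it has consumed n bytes.
func (w *sendWindow) ack(n int) {
	w.mu.Lock()
	w.inFlight -= n
	w.cond.Signal()
	w.mu.Unlock()
}

func main() {
	w := newSendWindow(64 * 1024)
	received := make(chan int, 8)
	chunk := 16 * 1024

	// Sender: reserve window space before forwarding each chunk.
	go func() {
		for i := 0; i < 8; i++ {
			w.reserve(chunk)
			received <- i // pretend the chunk arrived at the receiver
		}
		close(received)
	}()

	// Receiver: consume chunks (possibly slowly) and acknowledge them,
	// which releases the sender's window.
	for i := range received {
		fmt.Printf("consumed chunk %d\n", i)
		w.ack(chunk)
	}
}
```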
(3) The konnectivity agent does not handle the case where its connection to a konnectivity server is broken, possibly due to a restart of the konnectivity server. In this case it will keep consuming resources (threads, possibly CPU, and it writes loads of logs) without ever stopping. We consider this minor, but we also have no fix for it so far.
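As a rough illustration of the kind of handling we mean (not the project's API; serveOnce and run are placeholder names), the agent could tear down the old receive loop when the stream breaks and reconnect with exponential backoff instead of spinning and logging forever:

```go
// Hypothetical sketch: reconnect with backoff after the server connection breaks.
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// serveOnce is assumed to block, receiving packets from one server
// connection, and to return an error when that connection breaks.
func serveOnce(ctx context.Context) error {
	// ... the gRPC stream Recv loop would go here ...
	return errors.New("stream closed by server")
}

// run reconnects with exponential backoff instead of consuming resources
// forever on a dead connection.
func run(ctx context.Context) {
	backoff := time.Second
	for {
		err := serveOnce(ctx)
		if ctx.Err() != nil {
			return // shutting down
		}
		log.Printf("connection to proxy server lost: %v; reconnecting in %s", err, backoff)

		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return
		}
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	run(ctx)
}
```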
(4) Especially with “kubectl exec” we have seen very poor throughput (comparable to a 56k modem). We have not done a deep dive, but we suspect that lots and lots of tiny gRPC messages are being sent, causing the poor throughput.
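If that suspicion is right, one possible mitigation would be to coalesce many small writes into fewer, larger tunnel packets. The sketch below is purely illustrative (the coalescer type and the send callback are hypothetical, standing in for "send one gRPC DATA message"):

```go
// Hypothetical sketch: batch small writes into larger packets before sending.
package main

import (
	"fmt"
	"time"
)

type coalescer struct {
	buf        []byte
	maxSize    int
	flushEvery time.Duration
	in         chan []byte
	send       func([]byte) // one gRPC DATA message per call (assumed)
}

func newCoalescer(maxSize int, flushEvery time.Duration, send func([]byte)) *coalescer {
	c := &coalescer{maxSize: maxSize, flushEvery: flushEvery, in: make(chan []byte, 64), send: send}
	go c.loop()
	return c
}

// Write queues a (possibly tiny) payload for batching.
func (c *coalescer) Write(p []byte) { c.in <- append([]byte(nil), p...) }

func (c *coalescer) flush() {
	if len(c.buf) > 0 {
		c.send(c.buf)
		c.buf = nil
	}
}

func (c *coalescer) loop() {
	t := time.NewTicker(c.flushEvery)
	defer t.Stop()
	for {
		select {
		case p := <-c.in:
			c.buf = append(c.buf, p...)
			if len(c.buf) >= c.maxSize {
				c.flush() // size-based flush
			}
		case <-t.C:
			c.flush() // time-based flush keeps latency bounded
		}
	}
}

func main() {
	sent := 0
	c := newCoalescer(4096, 5*time.Millisecond, func(b []byte) {
		sent++
		fmt.Printf("DATA packet %d: %d bytes\n", sent, len(b))
	})
	for i := 0; i < 1000; i++ {
		c.Write([]byte("tiny payload ")) // 1000 small writes ...
	}
	time.Sleep(20 * time.Millisecond) // ... become only a handful of packets
}
```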
We do see that the apiserver-network-proxy is used in production, but especially with issues (1) and (2) above we struggle to see how that can work reliably (keep my disclaimer about our limited know-how in mind). Anyway, we would really like to use the network proxy, as we have already done all the integration work, and we are willing to help here. Could you share some insights, especially with regard to (2) above, and how we could help?
Thanks