Missing flow control appears to block traffic #180
We have confirmed that issue (2) above (missing flow control) is indeed an issue, although the operating system buffers appear to be slightly bigger than anticipated. We have worked on a resolution in our fork for review: mandelsoft@bb4a77d. We can demonstrate a blocked tunnel without the patch; with the patch it does not block.
I think we are encountering a similar issue to (4): in our scale tests, some cases always time out on kubectl logs.
Hi @ydp, this issue could well cause frequent timeouts from kubectl logs as it goes via the tunnel. #167 looks like the error handling issue described in (3) above. You should be able to get rid of the blocking issues with our two pull requests mandelsoft@bb4a77d and #179. Keep in mind that we have not done detailed testing of those, except to validate that they indeed fix some or even all of the blocking issues. The first fix in particular is quite involved; the second one is much simpler and will probably fix your issue. It does not, however, fix #167, although that should be quite straightforward to fix (add a bit of error handling).
Hi folks, we've been doing a lot of Kubernetes conformance testing with Konnectivity version 0.0.19 enabled for Kubernetes version 1.21 clusters. Here is the basic structure of our test invocation.
These tests have consistently failed with Konnectivity, even after scaling up. We then built Konnectivity with #179 applied, and now the tests pass.
@marwinski Thank you!
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
We have integrated the apiserver-network-proxy into our project and stumbled across some possibly more fundamental issues with it. Don’t get me wrong: we really like this project and really appreciate the work that has been done; however, these issues would ultimately not allow us to use it, although our understanding of the whole domain might be too limited. These are the issues we know of, or suspect, based on our observations:
(1) My colleague @ScheererJ has filed PR #179 ("Add a timeout when the proxy agent dials to the remote") fixing an issue that caused our tests to fail. The problem was that a blocked Dial call could cause the receive loop to hang for up to 10 minutes, blocking all tunnel traffic. Making the dial asynchronous fixed the tests. We consider this to be a valid fix for the issue.
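For illustration, here is a minimal sketch of the general pattern (not the actual code from #179): the dial is moved out of the receive loop into its own goroutine with a timeout, so a slow or unreachable destination cannot stall the whole tunnel. The names handleDial, onSuccess, and onError are hypothetical and do not exist in the project.

```go
// Hypothetical sketch: a potentially blocking dial is moved out of the
// receive loop so an unreachable destination cannot stall tunnel traffic.
package main

import (
	"context"
	"log"
	"net"
	"time"
)

// handleDial is assumed to be called from the agent's receive loop for a
// dial request. It returns immediately; the dial itself happens in a
// separate goroutine with a timeout.
func handleDial(address string, onSuccess func(net.Conn), onError func(error)) {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()

		var d net.Dialer
		conn, err := d.DialContext(ctx, "tcp", address)
		if err != nil {
			onError(err) // e.g. report a dial failure back through the tunnel
			return
		}
		onSuccess(conn) // e.g. register the connection and start its own read loop
	}()
}

func main() {
	done := make(chan struct{})
	handleDial("example.com:80",
		func(c net.Conn) { log.Printf("connected to %v", c.RemoteAddr()); c.Close(); close(done) },
		func(err error) { log.Printf("dial failed: %v", err); close(done) },
	)
	<-done
}
```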
(2) Even with (1) we see that the runtime of the tests is roughly twice that of our traditional connectivity solution (which has its own issues, which is why we want to replace it). We have no proof, but we strongly suspect the root cause to be in https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/pkg/agent/client.go#L417 (plus a duplicate of this in the server). The problem is that if the receiver (the process to which the agent forwards the traffic) is slow to process data for whatever reason, it will block all connectivity inside the tunnel, just like (1) above.
The only way to fix this, from our point of view, is to introduce some kind of flow control that blocks the sender (i.e. the kube-apiserver) until the other side has received the data (or vice versa). This is of course not trivial, and we have no fix or proposal for it so far.
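To make the idea concrete, here is a minimal, hypothetical sketch of window-based flow control between two tunnel endpoints. None of these names (sendWindow, reserve, ack) exist in the project, and a real implementation would also have to carry the acknowledgements as tunnel packets; this only shows how a bounded window blocks one slow connection without touching the others.

```go
// Hypothetical sketch of window-based flow control between tunnel endpoints.
package main

import (
	"fmt"
	"sync"
)

// sendWindow blocks the sender once more than `limit` bytes are in flight;
// it is released when the receiver acknowledges consumed bytes.
type sendWindow struct {
	mu       sync.Mutex
	cond     *sync.Cond
	inFlight int
	limit    int
}

func newSendWindow(limit int) *sendWindow {
	w := &sendWindow{limit: limit}
	w.cond = sync.NewCond(&w.mu)
	return w
}

// reserve blocks until n bytes fit into the window; called before sending data.
func (w *sendWindow) reserve(n int) {
	w.mu.Lock()
	for w.inFlight+n > w.limit {
		w.cond.Wait()
	}
	w.inFlight += n
	w.mu.Unlock()
}

// ack is called when the other side reports it has consumed n bytes.
func (w *sendWindow) ack(n int) {
	w.mu.Lock()
	w.inFlight -= n
	w.cond.Signal()
	w.mu.Unlock()
}

func main() {
	w := newSendWindow(64 * 1024)
	received := make(chan int, 8)
	chunk := 16 * 1024

	// Sender: reserve window space before forwarding each chunk.
	go func() {
		for i := 0; i < 8; i++ {
			w.reserve(chunk)
			received <- i // pretend the chunk arrived at the receiver
		}
		close(received)
	}()

	// Receiver: consume chunks (possibly slowly) and acknowledge them,
	// which releases the sender's window.
	for i := range received {
		fmt.Printf("consumed chunk %d\n", i)
		w.ack(chunk)
	}
}
```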
(3) The konnectivity agent does not handle the case where its connection to a konnectivity server is broken, possibly due to a restart of the konnectivity server. In this case it will keep consuming resources (threads, possibly CPU, and it writes loads of logs) without ever stopping. We consider this minor, but we also have no fix for it so far.
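As a rough illustration of the kind of handling we mean (not the project's API; serveOnce and run are placeholder names), the agent could tear down the old receive loop when the stream breaks and reconnect with exponential backoff instead of spinning and logging forever:

```go
// Hypothetical sketch: reconnect with backoff after the server connection breaks.
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// serveOnce is assumed to block, receiving packets from one server
// connection, and to return an error when that connection breaks.
func serveOnce(ctx context.Context) error {
	// ... the gRPC stream Recv loop would go here ...
	return errors.New("stream closed by server")
}

// run reconnects with exponential backoff instead of consuming resources
// forever on a dead connection.
func run(ctx context.Context) {
	backoff := time.Second
	for {
		err := serveOnce(ctx)
		if ctx.Err() != nil {
			return // shutting down
		}
		log.Printf("connection to proxy server lost: %v; reconnecting in %s", err, backoff)

		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return
		}
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	run(ctx)
}
```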
(4) Especially with “kubectl exec” we have seen very poor throughput (comparable to a 56k modem). We have not done a deep dive, but we suspect that lots and lots of tiny gRPC messages are being sent, causing the poor throughput.
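If that suspicion is right, one possible mitigation would be to coalesce many small writes into fewer, larger tunnel packets. The sketch below is purely illustrative (the coalescer type and the send callback are hypothetical, standing in for "send one gRPC DATA message"):

```go
// Hypothetical sketch: batch small writes into larger packets before sending.
package main

import (
	"fmt"
	"time"
)

type coalescer struct {
	buf        []byte
	maxSize    int
	flushEvery time.Duration
	in         chan []byte
	send       func([]byte) // one gRPC DATA message per call (assumed)
}

func newCoalescer(maxSize int, flushEvery time.Duration, send func([]byte)) *coalescer {
	c := &coalescer{maxSize: maxSize, flushEvery: flushEvery, in: make(chan []byte, 64), send: send}
	go c.loop()
	return c
}

// Write queues a (possibly tiny) payload for batching.
func (c *coalescer) Write(p []byte) { c.in <- append([]byte(nil), p...) }

func (c *coalescer) flush() {
	if len(c.buf) > 0 {
		c.send(c.buf)
		c.buf = nil
	}
}

func (c *coalescer) loop() {
	t := time.NewTicker(c.flushEvery)
	defer t.Stop()
	for {
		select {
		case p := <-c.in:
			c.buf = append(c.buf, p...)
			if len(c.buf) >= c.maxSize {
				c.flush() // size-based flush
			}
		case <-t.C:
			c.flush() // time-based flush keeps latency bounded
		}
	}
}

func main() {
	sent := 0
	c := newCoalescer(4096, 5*time.Millisecond, func(b []byte) {
		sent++
		fmt.Printf("DATA packet %d: %d bytes\n", sent, len(b))
	})
	for i := 0; i < 1000; i++ {
		c.Write([]byte("tiny payload ")) // 1000 small writes ...
	}
	time.Sleep(20 * time.Millisecond) // ... become only a handful of packets
}
```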
We do see that the apiserver-network-proxy is used in production, but especially with issues (1) and (2) above we struggle to see how that can work reliably (keep my disclaimer about our limited know-how in mind). Anyway, we would really like to use the network proxy, as we have already done all the integration work, and we are willing to help here. Could you share some insights, especially with regard to (2) above, and how we could help?
Thanks