Protocol: simplify identifiers #462
Comments
I agree with this direction. Many of the memory leaks we've seen stem from the extra state transition between pending dial and connected. Using a consistent ID across the connection lifetime would resolve the race conditions @jkh52 mentioned, and also let us (eventually) clean up a lot of code.

I think the migration is actually pretty straightforward. We just need to make sure all 3 components (client, server, agent) can understand both the old ("random" + connectionID) and the new (UID) IDs, and translate between them. Requests would continue to carry both sets of IDs until we are confident that no version skew exists (at which point we can clean up handling of the old IDs). I do think we should have a version-skew test framework in place before implementing such a change.

Risks:

Security: 128-bit UUIDs should be sufficient to all but guarantee that no accidental collisions happen. From a security perspective, the only concern is a malicious agent reusing a connection UUID to intercept connection traffic. For this reason, the server should validate that requests with a given UUID are coming from the expected frontend / agent.

Entropy exhaustion: with the above security considerations, I don't think we need to use a secure random generator, which mitigates this concern.

As a minor optimization, we could use the lower 64 bits as the dial ID. |
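To make the "lower 64 bits as the dial ID" idea concrete, here is a minimal sketch. The `TunnelUID` type and its methods are hypothetical names invented for illustration, not part of the existing codebase. It assumes the legacy dial ID is a signed 64-bit value, and it uses the non-cryptographic `math/rand` generator, which is only reasonable if the server validates the sender of each UID as described above.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"math/rand"
)

// TunnelUID is a hypothetical 128-bit identifier generated once by the
// client and used for the whole lifetime of a proxy connection.
type TunnelUID [16]byte

// NewTunnelUID fills all 128 bits from math/rand. This is deliberately
// NOT a secure generator (assumption: the server checks that each UID
// comes from the expected frontend / agent).
func NewTunnelUID() TunnelUID {
	var uid TunnelUID
	binary.BigEndian.PutUint64(uid[:8], rand.Uint64())
	binary.BigEndian.PutUint64(uid[8:], rand.Uint64())
	return uid
}

// DialID returns the lower 64 bits of the UID, so that components which
// only understand the old identifiers still receive a usable dial ID.
// Assumption: the legacy dial ID is an int64.
func (u TunnelUID) DialID() int64 {
	return int64(binary.BigEndian.Uint64(u[8:]))
}

// String renders the UID as 32 hex characters for logs and map keys.
func (u TunnelUID) String() string {
	return hex.EncodeToString(u[:])
}
```

Deriving the dial ID from the UID means an old component and a new component observing the same dial agree on the legacy ID without any extra bookkeeping.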
We (AKS) use http-connect mode. Would be delighted to help. |
Glad to hear it! Does the idea make sense and seem worth pursuing? |
A rough migration plan idea:
|
I'm probably getting ahead of myself, but we might want to add the new
It will be easier to tell with a prototype, but I'm not convinced it's necessary to maintain two separate flows. I think the proxy-server should have look-up tables to translate between the ID types, so it can look up the dial/connection ID from the tunnelUID, and the tunnelUID from the dial/connection ID. Then the server just fills in whatever IDs are missing and executes common logic. In practice, I think this might mean indexing the tunnel by tunnelUID, and just looking up the tunnelUID if it's missing. |
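The look-up tables described in this comment could be sketched roughly as follows. All type and method names (`TunnelIndex`, `AddPending`, `BindConn`, and so on) are hypothetical, and the sketch assumes legacy dial and connection IDs are `int64` values and the tunnelUID is carried as a string. The primary index is keyed by tunnelUID; the two reverse tables only map legacy IDs back to it.

```go
package main

import "sync"

// connKey is the legacy post-dial key: agent ID plus connection ID.
type connKey struct {
	agentID string
	connID  int64
}

// Tunnel holds every identifier known for one proxied connection.
type Tunnel struct {
	UID     string
	DialID  int64
	AgentID string
	ConnID  int64
	HasConn bool // false until the agent reports a connection ID
}

// TunnelIndex indexes tunnels by tunnelUID and keeps reverse tables
// so packets carrying only legacy IDs can be resolved to a tunnel.
type TunnelIndex struct {
	mu     sync.Mutex
	byUID  map[string]*Tunnel
	byDial map[int64]string
	byConn map[connKey]string
}

func NewTunnelIndex() *TunnelIndex {
	return &TunnelIndex{
		byUID:  make(map[string]*Tunnel),
		byDial: make(map[int64]string),
		byConn: make(map[connKey]string),
	}
}

// AddPending registers a tunnel at dial time, before a connection ID exists.
func (x *TunnelIndex) AddPending(uid string, dialID int64) {
	x.mu.Lock()
	defer x.mu.Unlock()
	x.byUID[uid] = &Tunnel{UID: uid, DialID: dialID}
	x.byDial[dialID] = uid
}

// BindConn records the legacy connection ID once the dial has succeeded.
func (x *TunnelIndex) BindConn(uid, agentID string, connID int64) bool {
	x.mu.Lock()
	defer x.mu.Unlock()
	tun, ok := x.byUID[uid]
	if !ok {
		return false
	}
	tun.AgentID, tun.ConnID, tun.HasConn = agentID, connID, true
	x.byConn[connKey{agentID: agentID, connID: connID}] = uid
	return true
}

// ByUID resolves a packet that carries the new identifier.
func (x *TunnelIndex) ByUID(uid string) (Tunnel, bool) {
	x.mu.Lock()
	defer x.mu.Unlock()
	return x.getLocked(uid)
}

// ByDialID resolves a packet that carries only the legacy dial ID.
func (x *TunnelIndex) ByDialID(dialID int64) (Tunnel, bool) {
	x.mu.Lock()
	defer x.mu.Unlock()
	uid, ok := x.byDial[dialID]
	if !ok {
		return Tunnel{}, false
	}
	return x.getLocked(uid)
}

// ByConn resolves a packet that carries only agentID + connectionID.
func (x *TunnelIndex) ByConn(agentID string, connID int64) (Tunnel, bool) {
	x.mu.Lock()
	defer x.mu.Unlock()
	uid, ok := x.byConn[connKey{agentID: agentID, connID: connID}]
	if !ok {
		return Tunnel{}, false
	}
	return x.getLocked(uid)
}

// getLocked returns a copy of the tunnel; the caller must hold x.mu.
func (x *TunnelIndex) getLocked(uid string) (Tunnel, bool) {
	tun, ok := x.byUID[uid]
	if !ok {
		return Tunnel{}, false
	}
	return *tun, true
}

// Remove drops the tunnel and both reverse entries in a single step.
func (x *TunnelIndex) Remove(uid string) {
	x.mu.Lock()
	defer x.mu.Unlock()
	tun, ok := x.byUID[uid]
	if !ok {
		return
	}
	delete(x.byDial, tun.DialID)
	if tun.HasConn {
		delete(x.byConn, connKey{agentID: tun.AgentID, connID: tun.ConnID})
	}
	delete(x.byUID, uid)
}
```

Whichever identifiers arrive, the handler resolves them to the same `Tunnel` record, fills in the missing ones on the outgoing packet, and then runs common logic. Cleanup touches one record instead of two separately keyed structures.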
A server should be able to tell an agent to cancel a pending dial (this feature is missing today). The new key would support this pretty easily. But if server needs to map new key to old key, then it cannot tell the agent to cancel (since it doesn't know connID yet). |
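As a rough illustration of why the new key makes cancellation easy, the agent could track in-flight dials by tunnelUID, which it knows from the moment the dial request arrives. The `PendingDials` type below is a hypothetical sketch that merely borrows the name of the existing concept, and it roots each dial in `context.Background()` only to keep the example short.

```go
package main

import (
	"context"
	"sync"
)

// PendingDials tracks in-flight dials on the agent, keyed by the
// client-generated tunnelUID (hypothetical sketch, not existing code).
type PendingDials struct {
	mu      sync.Mutex
	cancels map[string]context.CancelFunc
}

func NewPendingDials() *PendingDials {
	return &PendingDials{cancels: make(map[string]context.CancelFunc)}
}

// Start registers a dial and returns a context that is cancelled when
// Cancel is called with the same UID. Simplification: the context is
// derived from context.Background() rather than a caller-supplied parent.
func (p *PendingDials) Start(uid string) context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	p.mu.Lock()
	p.cancels[uid] = cancel
	p.mu.Unlock()
	return ctx
}

// Cancel aborts the pending dial for uid, if any, and forgets it.
// It reports whether a pending dial was found.
func (p *PendingDials) Cancel(uid string) bool {
	p.mu.Lock()
	cancel, ok := p.cancels[uid]
	delete(p.cancels, uid)
	p.mu.Unlock()
	if ok {
		cancel()
	}
	return ok
}

// Finish is called when a dial completes on its own (success or
// failure); it releases the context and removes the entry.
func (p *PendingDials) Finish(uid string) {
	p.Cancel(uid)
}
```

The returned context would be passed to something like `net.Dialer.DialContext`, so a cancel message from the server interrupts the dial before any connection ID has been assigned.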
Capturing an idea from @tallclair via chat: this proposal could reduce server responsibility down to message pass-through (minimally: keeping track of frontend and backend associations). As a thought experiment, if there is a single frontend and a single backend, the server is even stateless. The idea might be applicable in unit tests. |
/cc @cheftako |
Alternative proposal for migration plan

Summary: Each component handles incoming packets with only old identifiers (random/connectionID), only new identifier (tunnelUID), or both. Each component outputs packets that always include both sets of identifiers (random/connectionID AND tunnelUID). This way, in the intermediate compatibility phase, each component is both fully backwards compatible and forwards compatible.

Dial flow:
Steady state:
They handle incoming DATA and CLOSE packets accordingly:
Footnotes:
Data structures:
Client:
Server:
Agent:
Rollout:
Phase 0: (current state) - no components understand |
Proposal: have the client generate a globally unique identifier, and use this for the lifetime of the proxy connection.
Summary of the connection dial and cleanup flows we have today:
PendingDials (by dialID)
frontends (by agentID+connectionID) after dial success

Problems: Since proxy connections may be closed/cancelled by either end, it is difficult to reliably clean up in all cases. In particular:
End goal is much simpler:
Migration
The hardest part seems to be a backward compatible migration (and eventual code deletion + cleanup). Care must also be taken to make sure HTTPConnect (tunnel.go) is given parity treatment.
I think the migration is tractable, and would like to hear what others think.