Relay clustering v5 #94
base: main
I think the reason we still overshoot performance is that we are not configuring our new "super relay" clusters with enough resources. Suppose, for example, that we have a 1 Gbit/s machine that runs 4 relays. With perfect load balancing, each relay gets 250 Mbit/s. But our load balancing is far from perfect, and client usage is dynamic, so suppose that each individual relay occasionally reaches 500 Mbit/s of actual usage (observed bandwidth) while the remaining 500 Mbit/s is shared among the other 3 relays. In this situation, the maximum observed bandwidth of any of the 4 relays is only 500 Mbit/s. In the baseline tornettools code, we would have added 4 relays, each with 500 Mbit/s. This gives us a total of 2 Gbit/s, which is double what the shared machine actually had available. In the v5 clustering code in this PR, we add one relay with 500 Mbit/s, which is half of the bandwidth that the machine had available. Thus we overshoot the long tails in some of the performance graphs. I think we can tune the super relay bandwidth calculation by considering that the sum of the median observed bandwidths of the shared relays might be higher than the max of any one of the relays.
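The arithmetic in this example can be checked with a few lines of Python. This is only a restatement of the numbers in the comment; the variable names are illustrative and do not come from the tornettools code.

```python
# Worked numbers from the example above (all values in Mbit/s).
machine_capacity = 1000        # one shared 1 Gbit/s machine
num_relays = 4
peak_observed_per_relay = 500  # each relay occasionally bursts to this rate

# Baseline: one host per relay, each sized by its own peak observed bandwidth.
baseline_total = num_relays * peak_observed_per_relay

# v5 clustering: one "super relay" sized by the max peak of any member relay.
v5_total = peak_observed_per_relay

print(baseline_total / machine_capacity)  # 2.0 (twice the real capacity)
print(v5_total / machine_capacity)        # 0.5 (half the real capacity)
```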
tornettools currently configures Shadow such that every relay is placed on its own host. This placement strategy fails to consider that, in the real world, relay operators often run multiple relays on the same physical server machine in order to make better use of available resources.
This PR adds code that attempts to reconcile the situation. Although there is no ground truth information available about which relays share a machine in the real world, we can do a lot better with the following heuristic: if multiple relays have the same assigned relay family and the same IP address, then it is likely that they share the same machine.
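A minimal sketch of the grouping heuristic, assuming each relay is represented as a dict with hypothetical `fingerprint`, `family`, and `address` keys. These field names and the `cluster_relays` helper are illustrative and are not taken from the PR's code; the sketch also assumes every relay has a family value, so a relay with no family would need to be placed in a cluster of its own.

```python
from collections import defaultdict

def cluster_relays(relays):
    """Group relays that share both a family identifier and an IP address.

    The dict keys used here are assumptions for illustration only.
    """
    clusters = defaultdict(list)
    for relay in relays:
        key = (relay["family"], relay["address"])
        clusters[key].append(relay["fingerprint"])
    return dict(clusters)

# Hypothetical input: A and B share a family and an address, so they are
# assumed to share a machine. C differs by address and D differs by family.
relays = [
    {"fingerprint": "A", "family": "fam1", "address": "192.0.2.10"},
    {"fingerprint": "B", "family": "fam1", "address": "192.0.2.10"},
    {"fingerprint": "C", "family": "fam1", "address": "192.0.2.11"},
    {"fingerprint": "D", "family": "fam2", "address": "192.0.2.10"},
]
clusters = cluster_relays(relays)
```

Keying on the pair rather than on the address alone avoids merging unrelated operators that happen to share an IP address, for example behind the same NAT.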
Additionally, we need heuristics for estimating available bandwidth. The current approach takes the maximum bandwidth ever observed by the relay and treats it as the machine's bandwidth capacity. Without additional ground truth hints, this is the best heuristic we have when a relay does not share a machine. However, when multiple relays share a machine, we should consider that the sum of the steady-state bandwidths of the relays may be higher than the maximum observed by any one relay.
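One way to express this bandwidth heuristic is sketched below. It assumes that "steady state" is approximated by the median of each relay's observed-bandwidth samples, and that the two estimates are combined by taking the larger one. Both choices, along with the `estimate_cluster_capacity` name and the input format, are assumptions for illustration and not necessarily what the PR implements.

```python
from statistics import median

def estimate_cluster_capacity(observed_histories):
    """Estimate the bandwidth capacity of a machine hosting one or more relays.

    observed_histories holds one list of observed-bandwidth samples per relay
    on the machine (a hypothetical input format).
    """
    # Largest bandwidth ever observed by any single relay on the machine.
    peak = max(max(history) for history in observed_histories)
    # Sum of each relay's steady-state bandwidth, approximated by its median.
    steady_sum = sum(median(history) for history in observed_histories)
    # Assumption: use whichever estimate is larger.
    return max(peak, steady_sum)

# Four relays sharing a machine, each with a median of 250 and a peak of 500.
shared = [[200, 250, 500]] * 4
print(estimate_cluster_capacity(shared))             # 1000

# A relay on its own machine falls back to its maximum observed bandwidth.
print(estimate_cluster_capacity([[100, 300, 200]]))  # 300
```

Because a single relay's median can never exceed its own maximum, this sketch reduces to the current max-observed heuristic when a relay does not share a machine.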