
Relay clustering v5 #94

Draft · wants to merge 9 commits into main

Conversation

robgjansen (Member) commented Mar 8, 2023

tornettools currently configures Shadow such that every relay is placed on its own host. This placement strategy fails to consider that, in the real world, relay operators often run multiple relays on the same physical server machine in order to make better use of available resources.

This PR adds code that attempts to address this mismatch. Although there is no ground truth information available about which relays share a machine in the real world, we can do much better with the following heuristic: if multiple relays have the same assigned relay family and the same IP address, then it is likely that they share the same machine.

Additionally, we need heuristics for estimating available bandwidth. The current approach takes the maximum bandwidth ever observed by the relay and assumes the machine has that much bandwidth capacity. Without additional ground truth hints, this is the best heuristic we have when a relay does not share a machine. However, when multiple relays share a machine, we should consider that the sum of the steady-state bandwidths of the relays may be higher than the maximum observed by any one relay.
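
As a concrete illustration of the family-plus-IP clustering heuristic described above, here is a minimal Python sketch (not the PR's actual implementation); the `Relay` fields and function names are hypothetical stand-ins for whatever data model tornettools uses.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Relay:
    # Hypothetical fields for illustration; the PR's real data model may differ.
    fingerprint: str
    address: str                                           # IP address from the consensus
    family: frozenset = field(default_factory=frozenset)   # declared family fingerprints

def cluster_relays(relays):
    """Group relays into clusters that likely share a physical machine.

    Heuristic from the PR description: relays that declare the same family
    and advertise the same IP address probably run on the same machine.
    """
    clusters = defaultdict(list)
    for relay in relays:
        clusters[(relay.family, relay.address)].append(relay)
    return list(clusters.values())
```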

robgjansen (Member, Author) commented:

First of all, it looks like --load_scale=1.5 is a good default these days to bring the overall amount of data transferred by relays in line with current-day Tor.


This is with --load_scale=1.0:

[Figure: relay_goodput1]

This is with --load_scale=1.5:

[Figure: relay_goodput1 5]

robgjansen (Member, Author) commented:

Second, it does look like clustering has the desired effect of creating longer tails in the client performance metrics.


[Figure: circuit_build_time exit]

[Figure: round_trip_time exit]

[Figure: transfer_error_rates_ALL exit]

robgjansen (Member, Author) commented:

Third, it looks like even when using an EXACT IP address match as the clustering criterion, we still overshoot performance a bit.


[Figure: transfer_time_51200 exit]

[Figure: transfer_time_1048576 exit]

[Figure: transfer_time_5242880 exit]

robgjansen (Member, Author) commented Mar 8, 2023

I think the reason that we still overshoot performance is that we are not configuring our new "super relay" clusters with enough resources. Suppose, for example, that we have a 1 Gbit/s machine that runs 4 relays. With perfect load balancing, each relay gets 250 Mbit/s. But our load balancing is far from perfect, and client usage is dynamic, so suppose that each individual relay occasionally reaches 500 Mbit/s of actual usage (observed bandwidth) while the remaining 500 Mbit/s is shared among the other 3 relays. In this situation, the maximum observed bandwidth across the 4 relays is only 500 Mbit/s.

Now, in the baseline tornettools code, we would have added 4 relays, each with 500 Mbit/s. This gives us a total of 2 Gbit/s, which is double what the shared machine actually had available. In the v5 clustering code in this PR, we add one relay with 500 Mbit/s, which is half of the bandwidth that the machine had available. Thus we overshoot the long tails in some of the performance graphs.

I think we can tune the super relay bandwidth calculation by considering that the sum of the median observed bandwidths of the shared relays might be higher than the max of any one of the relays.
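
Here is a minimal sketch of the tuned estimate suggested above, under the assumption that each relay carries a history of observed-bandwidth samples; the field and function names are hypothetical, not the PR's actual code. It takes the larger of the best single-relay observation and the sum of per-relay medians, so a cluster's capacity is not capped by any one relay's peak.

```python
import statistics

def estimate_cluster_bandwidth(relays):
    """Estimate machine capacity for a cluster of co-located relays.

    `relays` is a list of objects with `observed_bw_history`, a sequence of
    observed-bandwidth samples for one relay (hypothetical field name).
    """
    # Best single observation across the cluster: a lower bound on capacity.
    max_single = max(max(r.observed_bw_history) for r in relays)

    # Sum of steady-state (median) usage: co-located relays use the link at
    # the same time, so their combined steady-state load may exceed any one
    # relay's peak.
    sum_medians = sum(statistics.median(r.observed_bw_history) for r in relays)

    return max(max_single, sum_medians)
```

In the 4-relay example above, if each relay's median usage were 250 Mbit/s and its peak 500 Mbit/s, this would yield max(500, 4 × 250) = 1000 Mbit/s, recovering the machine's true 1 Gbit/s capacity.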

robgjansen changed the title from "Link clustering v5" to "Relay clustering v5" on Mar 9, 2023