Relay clustering v5 #94

robgjansen · 2023-03-08T17:01:52Z

tornettools currently configures Shadow such that every relay is placed on its own host. This placement strategy fails to consider that, in the real world, relay operators often run multiple relays on the same physical server machine in order to make better use of available resources.

This PR adds code that attempts to reconcile the situation. Although there is no ground truth information available about which relays share a machine in the real world, we can do a lot better with the following heuristic: if multiple relays have the same assigned relay family and the same IP address, then it is likely that they share the same machine.
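As a rough illustration of the heuristic described above, the grouping can be sketched as follows. This is not the code in this PR: the `cluster_relays` helper and the `fingerprint`, `family`, and `address` keys are hypothetical names, and the sketch assumes each relay already has a family identifier assigned (for example, a relay with no declared family could use its own fingerprint).

```python
from collections import defaultdict


def cluster_relays(relays):
    """Group relays that share both a family and an exact IP address.

    `relays` is a list of dicts with the hypothetical keys
    'fingerprint', 'family', and 'address'. Returns a dict mapping
    (family, address) to the list of fingerprints in that cluster.
    """
    clusters = defaultdict(list)
    for relay in relays:
        # Relays are assumed to share a machine only if BOTH match.
        key = (relay["family"], relay["address"])
        clusters[key].append(relay["fingerprint"])
    return dict(clusters)


example_relays = [
    {"fingerprint": "A", "family": "fam1", "address": "192.0.2.10"},
    {"fingerprint": "B", "family": "fam1", "address": "192.0.2.10"},
    {"fingerprint": "C", "family": "fam1", "address": "192.0.2.11"},
    {"fingerprint": "D", "family": "fam2", "address": "192.0.2.10"},
]

example_clusters = cluster_relays(example_relays)
```

Here only A and B end up on the same simulated host; C has a different address and D belongs to a different family, so each stays on its own host.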

Additionally, we need heuristics for estimating available bandwidth. The current approach takes the maximum bandwidth ever observed by the relay and treats it as the bandwidth capacity of the machine. Without additional ground truth hints, this is the best heuristic we have when a relay does not share a machine. However, when multiple relays share a machine, we should consider that the sum of the steady-state bandwidths of the relays may be higher than the maximum observed by any one relay.

robgjansen · 2023-03-08T17:13:39Z

First of all, it looks like `--load_scale=1.5` is a good default these days to bring the overall amount of data transferred by relays in line with present-day Tor.

This is with `--load_scale=1.0`

This is with `--load_scale=1.5`

robgjansen · 2023-03-08T17:18:34Z

Second, it does look like clustering has the desired effect of creating longer tails in the client performance metrics.

robgjansen · 2023-03-08T17:19:44Z

Third, it looks like even when using an exact IP address match as the clustering criterion, we still overshoot performance a bit.

robgjansen · 2023-03-08T19:08:56Z

I think the reason that we still overshoot performance is that we are not configuring our new "super relay" clusters with enough resources. Suppose, for example, that we have a 1 Gbit/s machine that runs 4 relays. With perfect load balancing, each relay gets 250 Mbit/s. But our load balancing is far from perfect, and client usage is dynamic, so suppose that each individual relay occasionally reaches 500 Mbit/s of actual usage (observed bandwidth) while the remaining 500 Mbit/s is shared among the other 3 relays. In this situation, the maximum observed bandwidth across the 4 relays is only 500 Mbit/s.

Now, in the baseline tornettools code, we would have added 4 relays, each with 500 Mbit/s. This gives us a total of 2 Gbit/s, which is double what the shared machine actually had available. In the v5 clustering code in this PR, we add one relay with 500 Mbit/s, which is half of the bandwidth that the machine had available. Thus we overshoot the long tails in some of the performance graphs.
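The arithmetic in this example can be checked with a few lines. The numbers are the ones from the example above; the variable and function names are made up for illustration only.

```python
MACHINE_CAPACITY = 1000   # Mbit/s, the hypothetical shared machine
NUM_RELAYS = 4
MAX_OBSERVED = 500        # Mbit/s, peak observed by each relay


def configured_bandwidth(strategy):
    """Total bandwidth given to the simulated network for this machine."""
    if strategy == "baseline":
        # One host per relay, each sized by its own max observed bandwidth.
        return NUM_RELAYS * MAX_OBSERVED
    if strategy == "v5":
        # One host for the whole cluster, sized by the max over its relays.
        return MAX_OBSERVED
    raise ValueError(f"unknown strategy: {strategy}")


baseline_ratio = configured_bandwidth("baseline") / MACHINE_CAPACITY
v5_ratio = configured_bandwidth("v5") / MACHINE_CAPACITY
print(baseline_ratio, v5_ratio)
```

The baseline over-provisions the machine by a factor of 2, while v5 clustering under-provisions it by a factor of 2.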

I think we can tune the super relay bandwidth calculation by considering that the sum of the median observed bandwidths of the shared relays might be higher than the max of any one of the relays.
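One possible way to express that idea is sketched below. It is an assumption on my part, not what the PR implements: the `estimate_cluster_bandwidth` helper is hypothetical, each relay is assumed to come with a list of observed bandwidth samples, and taking the larger of the two candidates is just one way to combine them.

```python
from statistics import median


def estimate_cluster_bandwidth(samples_by_relay):
    """Estimate the capacity of a machine shared by several relays.

    `samples_by_relay` maps a relay fingerprint to a non-empty list of
    observed bandwidth samples (Mbit/s). Returns the larger of:
      - the maximum observed by any single relay, and
      - the sum of each relay's median observed bandwidth.
    """
    max_observed = max(max(s) for s in samples_by_relay.values())
    sum_of_medians = sum(median(s) for s in samples_by_relay.values())
    return max(max_observed, sum_of_medians)


# Four relays on one machine, each usually near 250 Mbit/s with an
# occasional 500 Mbit/s peak (illustrative numbers only).
shared = {name: [200, 250, 500] for name in ("A", "B", "C", "D")}
shared_estimate = estimate_cluster_bandwidth(shared)

# A relay alone on its machine falls back to its max observed bandwidth.
single_estimate = estimate_cluster_bandwidth({"E": [100, 200, 300]})
```

With these illustrative samples the shared machine is estimated at 1000 Mbit/s rather than 500 Mbit/s, and the single-relay case is unchanged from the current max-observed heuristic.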

sporksmith and others added 9 commits February 21, 2023 10:35

combine_parsed_consensus_results: split into parts

efd979c

combine_parsed_serverdesc_results -> bandwidths_from_serverdescs

c83e70a

do clustering

7761064

Make family calculations a bit more robust

c232e38

Incorporate country in cluster keys

b2a5b41

record nicknames for debugging

8778984

cluster /8's instead of /16's

03a0d5f

Change clustering criteria from /24 to /32 (ie exact IP match)

bf51557

Fix bug where relays were not being clustered as intended

27b85a5

This was referenced Mar 8, 2023

Link clustering v4 #90

Closed

Link clustering v3 #84

Closed

robgjansen changed the title ~~Link clustering v5~~ Relay clustering v5 Mar 9, 2023
