Introduce entropy for fastestmirror option #324

PhirePhly · 2024-08-12T04:14:49Z

Howdy,

I do want to caveat this PR with the fact that this is the first time I've touched the librepo codebase, and I haven't gotten the test suite running on my local dev box (didn't attempt to after moderate issues just getting the library to build) so I have not run this through all the unit tests. If any of the unit tests expect fastestmirror to always return a stable list of mirrors, that test will break. I didn't see any test definitions that looked to be testing this in a cursory review... I did build and test this change locally on my Alma9 box.

Traditionally the fastestmirror option has relied on taking the provided mirrorlist from the repo CDN and measured the latency to each mirror by timing the libCurl socket open time. This has resulted in several undesirable properties for the fastestmirror option:

Socket open latency is not always a good proxy for download bandwidth performance, so if the nearest mirror to a user happens to be particularly slow, they will always experience a very slow download despite having fastestmirror enabled.
Mirror operators don't appreciate fastestmirror being used by high density deployments (i.e. data centers) where thousands to millions of hosts with fastestmirror enabled all dogpile the single lowest latency mirror despite there being several mirrors within almost as close proximity to the site

This change introduces entropy into the mirror selection process by taking all mirrors with latency measurements less than twice that of the nearest mirror and shuffling those mirrors evenly, then simply appending all other mirrors to the end of the list in strict latency order.

I selected 2x as the upper cutoff for shuffling in mirrors based on nothing analytical. I expect that 2x would do a good job of including all mirrors in the same metro if there's other nearby mirrors since the latency is likely dominated by last mile connectivity. For other users who are already seeing significant >100ms latency to the best mirror, picking <200ms mirrors is going to possibly be noticeably worse, but users 100ms away from the closest mirror are going to be having a poor user experience regardless of which mirror they select.

Traditionally the fastestmirror option has relied on taking the provided mirrorlist from the repo CDN and measured the latency to each mirror by timing the libCurl socket open time. This has resulted in several undesirable properties for the fastestmirror option: 1. Socket open latency is not always a good proxy for download bandwidth performance, so if the nearest mirror to a user happens to be particularly slow, they will always experience a very slow download despite having fastestmirror enabled. 2. Mirror operators don't appreciate fastestmirror being used by high density deployments (i.e. data centers) where thousands to millions of hosts with fastestmirror enabled all dogpile the single lowest latency mirror despite there being several mirrors within almost as close proximity to the site This change introduces entropy into the mirror selection process by taking all mirrors with latency measurements less than twice that of the nearest mirror and shuffling those mirrors evenly, then simply appending all other mirrors to the end of the list in strict latency order.

PhirePhly · 2024-08-29T20:27:16Z

Any thoughts on this? The CI failures seem like they have nothing to do with the code I touched?

evan-goode · 2024-10-01T02:28:56Z

Thanks and sorry for the delay. I think this is a great idea!

I expect that 2x would do a good job of including all mirrors in the same metro if there's other nearby mirrors since the latency is likely dominated by last mile connectivity.

I agree, 2x seems reasonable. As you said, latency to begin with is not a perfect proxy for overall mirror speed; it's quite likely to have 10ms mirror that is slower (less available bandwidth) than a 20ms mirror.

One unit test is failing on my end (test_download_repo_01_via_metalink_badfirsthost_fastestmirror_with_cache). I fixed it here: evan-goode@8d01763. Feel free to cherry-pick it. Once that is fixed and pushed to this PR, the "Copr Build" action should succeed and we'll be able to see the results of our integration test suite ("DNF CI / DNF5 Integration Tests"). I don't think there is much related to fastestmirror in those integration tests, but we'll see.

I'd like to hear a second opinion from someone else on our team before merging this, there might be some use case I'm not thinking of, and so we might want to provide a config option for users to opt-out of this behavior.

Signed-off-by: Kenneth Finnegan <kennethfinnegan2007@gmail.com>

PhirePhly · 2024-10-01T16:50:19Z

Thanks Evan. I was just worried that no one wanted to touch this knob and the discussion was going to get memory-holed.

Good changes. My C rust is showing. Cherry-picked and signed-off.

The only non-consumer/home user use case for fastestmirror that I could come up with was using it to prefer the on-campus mirror as an alternative for the traffic engineering knobs available in Mirror Manager. Picking typical numbers out of the air, the local mirror being 0.8ms away vs public mirrors being 4ms away means there would be zero behavior change here. But I would also love to hear other devs thoughts on how much flexibility we have on the fastestmirror behavior in general.

evan-goode · 2024-10-02T16:19:06Z

Thanks. What if this behavior could be configurable by another option, called fastestmirror_random_ratio or similar? The default setting for this option could be 1 (set maximum latency to 1x lowest latency, i.e. the current behavior of always using the lowest latency mirror) and the "recommended setting" could be 2 to get the behavior described above.

We could make 2 the default if there is a good enough argument for it, but generally I want to avoid having existing users of the fastestmirror option be confused why their DNF is no longer reliably selecting the "fastest mirror". Since fastestmirror is already false by default, maybe it is not too much to ask for fleet operators to set the new option. What do you think?

PhirePhly · 2024-10-02T17:17:51Z

My biggest concern is the moral hazard of making the option available. A layman reading the documentation would say "well of course I always want the fastest mirror; why wouldn't I?" and every "top 10 things to do right after installing Fedora" article would just start including fastestmirror_random_ratio = 1 as well.

I think there's upsides to users for this change if the lowest latency mirror happens to be low performance, but I think there is also value for the mirror operators here that this protects mirrors from any one large install base (or network stud, i.e. countries without mirrors) being able to turn this on and dogpile a random mirror they've never heard of. (Granted, they can always explicitly configure a single URL for a repo, but that takes quite a bit more intent) As a mirror operator I've been working with VPS providers who built their golden OS images with bad settings, so I'd rather continue to have fastestmirror be a heuristic blackbox and if users really need more fine grained traffic engineering controls I think there is/should be other avenues for those problem.

I'm trying to imagine who would be using fastestmirror for fine grained traffic engineering instead of "eh, just pick something good", who would would be impacted by this change, who shouldn't just be talking to the distro about routing their subnet/ASN to the ideal mirror in the first place.

evan-goode · 2024-10-15T00:02:49Z

Good points. We've decided to merge this as-is and put off deciding about the fastestmirror_random_ratio setting until someone actually requests it.

I'll create PRs to update the documentation of the fastestmirror setting in DNF 4 and DNF 5.

The CI failures are unrelated to this patch.

PhirePhly · 2024-10-15T17:40:45Z

Thanks. For completeness on this thread, I did try and drum up some use cases on Reddit and Mastodon.

The only use case anyone mentioned was being in country codes with no mirror coverage and the need to not use geographically close mirrors in other countries that route via another continent and back. So the difference is already between bad and worse when there isn't local mirrors, and this confirms my suspicion that this change will make DNF pick other mirrors than the one furthest south in Florida to serve all of south america.

I'm looking forward to hearing field experience with this change once it trickles its way into general deployment.

evan-goode self-assigned this Oct 1, 2024

fastestmirror: Guard against empty mirrorlist and unreachable mirrors

4657146

Signed-off-by: Kenneth Finnegan <kennethfinnegan2007@gmail.com>

evan-goode merged commit ea8ce15 into rpm-software-management:master Oct 15, 2024
4 of 6 checks passed

This was referenced Oct 17, 2024

Document entropy in fastestmirror option rpm-software-management/dnf#2147

Closed

Document entropy in fastestmirror option rpm-software-management/dnf5#1782

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce entropy for fastestmirror option #324

Introduce entropy for fastestmirror option #324

PhirePhly commented Aug 12, 2024

PhirePhly commented Aug 29, 2024

evan-goode commented Oct 1, 2024

PhirePhly commented Oct 1, 2024 •

edited

Loading

evan-goode commented Oct 2, 2024

PhirePhly commented Oct 2, 2024

evan-goode commented Oct 15, 2024

PhirePhly commented Oct 15, 2024

Introduce entropy for fastestmirror option #324

Introduce entropy for fastestmirror option #324

Conversation

PhirePhly commented Aug 12, 2024

PhirePhly commented Aug 29, 2024

evan-goode commented Oct 1, 2024

PhirePhly commented Oct 1, 2024 • edited Loading

evan-goode commented Oct 2, 2024

PhirePhly commented Oct 2, 2024

evan-goode commented Oct 15, 2024

PhirePhly commented Oct 15, 2024

PhirePhly commented Oct 1, 2024 •

edited

Loading