Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce entropy for fastestmirror option #324

Merged
merged 2 commits into from
Oct 15, 2024

Conversation

PhirePhly
Copy link
Contributor

Howdy,

I do want to caveat this PR with the fact that this is the first time I've touched the librepo codebase, and I haven't gotten the test suite running on my local dev box (didn't attempt to after moderate issues just getting the library to build) so I have not run this through all the unit tests. If any of the unit tests expect fastestmirror to always return a stable list of mirrors, that test will break. I didn't see any test definitions that looked to be testing this in a cursory review... I did build and test this change locally on my Alma9 box.

Traditionally the fastestmirror option has relied on taking the provided mirrorlist from the repo CDN and measured the latency to each mirror by timing the libCurl socket open time. This has resulted in several undesirable properties for the fastestmirror option:

  1. Socket open latency is not always a good proxy for download bandwidth performance, so if the nearest mirror to a user happens to be particularly slow, they will always experience a very slow download despite having fastestmirror enabled.
  2. Mirror operators don't appreciate fastestmirror being used by high density deployments (i.e. data centers) where thousands to millions of hosts with fastestmirror enabled all dogpile the single lowest latency mirror despite there being several mirrors within almost as close proximity to the site

This change introduces entropy into the mirror selection process by taking all mirrors with latency measurements less than twice that of the nearest mirror and shuffling those mirrors evenly, then simply appending all other mirrors to the end of the list in strict latency order.

I selected 2x as the upper cutoff for shuffling in mirrors based on nothing analytical. I expect that 2x would do a good job of including all mirrors in the same metro if there's other nearby mirrors since the latency is likely dominated by last mile connectivity. For other users who are already seeing significant >100ms latency to the best mirror, picking <200ms mirrors is going to possibly be noticeably worse, but users 100ms away from the closest mirror are going to be having a poor user experience regardless of which mirror they select.

Traditionally the fastestmirror option has relied on taking the provided
mirrorlist from the repo CDN and measured the latency to each mirror by
timing the libCurl socket open time. This has resulted in several
undesirable properties for the fastestmirror option:

1. Socket open latency is not always a good proxy for download bandwidth
   performance, so if the nearest mirror to a user happens to be
   particularly slow, they will always experience a very slow download
   despite having fastestmirror enabled.
2. Mirror operators don't appreciate fastestmirror being used by high
   density deployments (i.e. data centers) where thousands to millions
   of hosts with fastestmirror enabled all dogpile the single lowest
   latency mirror despite there being several mirrors within almost as
   close proximity to the site

This change introduces entropy into the mirror selection process by
taking all mirrors with latency measurements less than twice that of the
nearest mirror and shuffling those mirrors evenly, then simply appending
all other mirrors to the end of the list in strict latency order.
@PhirePhly
Copy link
Contributor Author

Any thoughts on this? The CI failures seem like they have nothing to do with the code I touched?

@evan-goode evan-goode self-assigned this Oct 1, 2024
@evan-goode
Copy link
Member

Thanks and sorry for the delay. I think this is a great idea!

I expect that 2x would do a good job of including all mirrors in the same metro if there's other nearby mirrors since the latency is likely dominated by last mile connectivity.

I agree, 2x seems reasonable. As you said, latency to begin with is not a perfect proxy for overall mirror speed; it's quite likely to have 10ms mirror that is slower (less available bandwidth) than a 20ms mirror.

One unit test is failing on my end (test_download_repo_01_via_metalink_badfirsthost_fastestmirror_with_cache). I fixed it here: evan-goode@8d01763. Feel free to cherry-pick it. Once that is fixed and pushed to this PR, the "Copr Build" action should succeed and we'll be able to see the results of our integration test suite ("DNF CI / DNF5 Integration Tests"). I don't think there is much related to fastestmirror in those integration tests, but we'll see.

I'd like to hear a second opinion from someone else on our team before merging this, there might be some use case I'm not thinking of, and so we might want to provide a config option for users to opt-out of this behavior.

Signed-off-by: Kenneth Finnegan <kennethfinnegan2007@gmail.com>
@PhirePhly
Copy link
Contributor Author

PhirePhly commented Oct 1, 2024

Thanks Evan. I was just worried that no one wanted to touch this knob and the discussion was going to get memory-holed.

Good changes. My C rust is showing. Cherry-picked and signed-off.

The only non-consumer/home user use case for fastestmirror that I could come up with was using it to prefer the on-campus mirror as an alternative for the traffic engineering knobs available in Mirror Manager. Picking typical numbers out of the air, the local mirror being 0.8ms away vs public mirrors being 4ms away means there would be zero behavior change here. But I would also love to hear other devs thoughts on how much flexibility we have on the fastestmirror behavior in general.

@evan-goode
Copy link
Member

Thanks. What if this behavior could be configurable by another option, called fastestmirror_random_ratio or similar? The default setting for this option could be 1 (set maximum latency to 1x lowest latency, i.e. the current behavior of always using the lowest latency mirror) and the "recommended setting" could be 2 to get the behavior described above.

We could make 2 the default if there is a good enough argument for it, but generally I want to avoid having existing users of the fastestmirror option be confused why their DNF is no longer reliably selecting the "fastest mirror". Since fastestmirror is already false by default, maybe it is not too much to ask for fleet operators to set the new option. What do you think?

@PhirePhly
Copy link
Contributor Author

My biggest concern is the moral hazard of making the option available. A layman reading the documentation would say "well of course I always want the fastest mirror; why wouldn't I?" and every "top 10 things to do right after installing Fedora" article would just start including fastestmirror_random_ratio = 1 as well.

I think there's upsides to users for this change if the lowest latency mirror happens to be low performance, but I think there is also value for the mirror operators here that this protects mirrors from any one large install base (or network stud, i.e. countries without mirrors) being able to turn this on and dogpile a random mirror they've never heard of. (Granted, they can always explicitly configure a single URL for a repo, but that takes quite a bit more intent) As a mirror operator I've been working with VPS providers who built their golden OS images with bad settings, so I'd rather continue to have fastestmirror be a heuristic blackbox and if users really need more fine grained traffic engineering controls I think there is/should be other avenues for those problem.

I'm trying to imagine who would be using fastestmirror for fine grained traffic engineering instead of "eh, just pick something good", who would would be impacted by this change, who shouldn't just be talking to the distro about routing their subnet/ASN to the ideal mirror in the first place.

@evan-goode
Copy link
Member

Good points. We've decided to merge this as-is and put off deciding about the fastestmirror_random_ratio setting until someone actually requests it.

I'll create PRs to update the documentation of the fastestmirror setting in DNF 4 and DNF 5.

The CI failures are unrelated to this patch.

@evan-goode evan-goode merged commit ea8ce15 into rpm-software-management:master Oct 15, 2024
4 of 6 checks passed
@PhirePhly
Copy link
Contributor Author

Thanks. For completeness on this thread, I did try and drum up some use cases on Reddit and Mastodon.

The only use case anyone mentioned was being in country codes with no mirror coverage and the need to not use geographically close mirrors in other countries that route via another continent and back. So the difference is already between bad and worse when there isn't local mirrors, and this confirms my suspicion that this change will make DNF pick other mirrors than the one furthest south in Florida to serve all of south america.

I'm looking forward to hearing field experience with this change once it trickles its way into general deployment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants