feat: add active peer probing and a cached addr book #90
base: main
Conversation
force-pushed from ff3ec97 to a20a4c3
This adds a metric for evaluating all addr lookups: someguy_cached_router_peer_addr_lookups{cache="unused|hit|miss",origin="providers|peers"}. I've also wired up FindPeers for completeness.
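For reference, a counter with those two labels would typically be declared along these lines with prometheus/client_golang; the variable and function names below are illustrative, not the actual someguy wiring:

```go
// Sketch only: how a counter with cache/origin labels is typically declared
// with prometheus/client_golang. Names are illustrative, not the PR's code.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var peerAddrLookups = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "someguy_cached_router_peer_addr_lookups",
	Help: "Number of peer addr info lookups per origin and cache state",
}, []string{"cache", "origin"})

func init() {
	prometheus.MustRegister(peerAddrLookups)
}

// Example increment: a providers lookup where the cache had the addrs.
func recordProvidersCacheHit() {
	peerAddrLookups.WithLabelValues("hit", "providers").Inc()
}
```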
Made a first pass and dropped some suggestions inline. I also pushed a commit with a new metric (details inline).
As for Open questions, my thinking is:
- The only peers we probe and cache are peers that we've successfully run identify with. Peers returned without multiaddrs from FindProviders for which we have no cached multiaddrs remain unresolved.
- Should we try to call FindPeer inside the iterator so they can be resolved? This could block the streaming of the providers in the iterator.
Indeed, looking at someguy_cached_router_peer_addr_lookups shows we get a cache miss quite often (0 addrs, and the cache does not have them either).
It was a bit difficult to reason about this without some real-world input, so I've piped root CIDs hitting our staging environment to populate the metric:
- with CID duplicates:
ssh ubuntu@kubo-staging-us-east-02.ovh.dwebops.net tail -F /var/log/nginx/access-json.log | stdbuf -o0 jq -r .request_uri | awk -F'[/&?]' '{print $3; fflush()}' | xargs -P 100 -I {cid} curl -s -o /dev/null -w "%{http_code} %{url.path}\n" "http://127.0.0.1:8190/routing/v1/providers/{cid}"
- only unique CIDs:
ssh ubuntu@kubo-staging-us-east-02.ovh.dwebops.net tail -F /var/log/nginx/access-json.log | stdbuf -o0 jq -r .request_uri | awk -F'[/&?]' '!seen[$3]++ {print $3; fflush()}' | xargs -P 100 -I {cid} curl -s -o /dev/null -w "%{http_code} %{url.path}\n" "http://127.0.0.1:8190/routing/v1/providers/{cid}"
A few minutes later http://127.0.0.1:8190/debug/metrics/prometheus shows:
# HELP someguy_cached_router_peer_addr_lookups Number of peer addr info lookups per origin and cache state
# TYPE someguy_cached_router_peer_addr_lookups counter
someguy_cached_router_peer_addr_lookups{cache="hit",origin="providers"} 1323
someguy_cached_router_peer_addr_lookups{cache="miss",origin="providers"} 6574
someguy_cached_router_peer_addr_lookups{cache="unused",origin="providers"} 7686
So yes, finding a way of decreasing the miss count feels useful, given how high it is.
Two ideas:
- Lazy/easy: avoid blocking the iterator by adding peers with cache misses to some queue, then processing them asynchronously at some safe rate, populating the cache in a best-effort fashion. This may not help the first query, but over time all subsequent ones will see an increased cache hit rate.
- Implement a custom iterator: if a peer hits a cache miss, we don't return it but silently move on to the next item, putting the current one on a side queue that is processed asynchronously by calling FindPeer. Once the iterator hits the last item, we go back to the items on the side queue. This way we don't slow down results that already have addrs, and we can wait and stream the remaining ones at the end without impacting the performance of the fast ones (rough sketch after this list).
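A rough, self-contained sketch of the custom-iterator idea, using plain channels instead of the iterator types the router actually uses; Provider, FindPeerFunc, and ReorderUnresolved are made-up names for illustration:

```go
// Sketch only: stream providers that already have addrs immediately, park
// cache misses on a side queue, and resolve those at the end with a
// FindPeer-style lookup. Types here are stand-ins, not the someguy ones.
package deferredresolve

import (
	"context"

	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/multiformats/go-multiaddr"
)

// Provider is a stand-in for whatever record type the router streams.
type Provider struct {
	ID    peer.ID
	Addrs []multiaddr.Multiaddr
}

// FindPeerFunc is a stand-in for a FindPeer-style address lookup.
type FindPeerFunc func(ctx context.Context, id peer.ID) ([]multiaddr.Multiaddr, error)

// ReorderUnresolved forwards providers with addrs as soon as they arrive and
// defers the ones without addrs, so slow lookups never delay fast results.
func ReorderUnresolved(ctx context.Context, in <-chan Provider, findPeer FindPeerFunc) <-chan Provider {
	out := make(chan Provider)
	go func() {
		defer close(out)
		var deferred []Provider
		for p := range in {
			if len(p.Addrs) == 0 {
				deferred = append(deferred, p) // cache miss: park it for later
				continue
			}
			select {
			case out <- p: // fast path: already resolvable
			case <-ctx.Done():
				return
			}
		}
		// Input exhausted: now resolve the parked peers.
		for _, p := range deferred {
			addrs, err := findPeer(ctx, p.ID)
			if err != nil || len(addrs) == 0 {
				continue // still unresolved; skip (or emit as-is, a policy choice)
			}
			p.Addrs = addrs
			select {
			case out <- p:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}
```

The same shape also covers the lazy/easy idea: instead of draining the deferred slice at the end of the stream, hand it to a background worker that populates the cache at a safe rate.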
- Should we probe the last connected addr or all addresses we have for a Peer?
See comment inline; IIUC host.Connect effectively probes all known addrs until one succeeds (see the sketch below). Probably good enough for now; if we need per-addr resolution, we may need to ask go-libp2p for a new API.
Note that vole libp2p identify <multiaddr> connects to a specific multiaddr because it does not run routing and spawns a new libp2p host every time.
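To make the host.Connect point concrete, a per-peer probe would look roughly like this (sketch only; the function name and the 10s timeout are arbitrary choices for illustration):

```go
// Sketch only: probing a peer by handing host.Connect all addresses we
// currently have for it. The swarm dials the known addrs and returns once a
// connection succeeds, so there is no per-address success signal at this level.
package probe

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
)

func probePeer(ctx context.Context, h host.Host, p peer.ID) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	return h.Connect(ctx, peer.AddrInfo{
		ID:    p,
		Addrs: h.Peerstore().Addrs(p),
	})
}
```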
- When should we augment results with cached addresses? Currently, it's done only when there are no addresses in the FindProviders results from kad-dht. The presumption is that if the results from FindProviders have multiaddrs for a peer, they are up to date.
I think the current approach of hitting the cache only when regular routing returns no addrs is sensible (roughly the fallback sketched below). It also makes it easier to reason about metrics like someguy_cached_router_peer_addr_lookups{origin,cache}.
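In other words, the augmentation policy boils down to a fallback like this; AddrCache and GetCachedAddrs are stand-ins for the PR's cached addr book, not its real API:

```go
// Illustrative only: use cached addrs strictly as a fallback, so fresh
// multiaddrs coming back from FindProviders are never overridden.
package fallback

import (
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/multiformats/go-multiaddr"
)

// AddrCache is a stand-in for the PR's cached address book.
type AddrCache interface {
	GetCachedAddrs(p peer.ID) []multiaddr.Multiaddr
}

func resolveAddrs(p peer.ID, routingAddrs []multiaddr.Multiaddr, cache AddrCache) []multiaddr.Multiaddr {
	if len(routingAddrs) > 0 {
		return routingAddrs // trust routing results; presumed up to date
	}
	// Hit or miss here is exactly what the {cache="hit|miss"} labels count.
	return cache.GetCachedAddrs(p)
}
```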
- How do we prevent excessive memory consumption by the cached address book? The memory address book already has built-in limits and cleanup. However, the peers map doesn't. Temp solution: I've added some instrumentation for this.
Cap entries at a TTL of 48h? (Sketch below.)
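A minimal sketch of what a 48h cap could look like, assuming each entry in the peers map records when we last connected to it; peerState and prunePeers are made up for illustration:

```go
// Sketch only: periodically drop peers we have not connected to in 48h, so the
// peers map stays bounded much like the memory addr book bounds itself.
package prune

import (
	"sync"
	"time"

	"github.com/libp2p/go-libp2p/core/peer"
)

// peerState stands in for whatever the PR stores per peer; lastConnected is an
// assumption made for this sketch.
type peerState struct {
	lastConnected time.Time
}

const maxPeerAge = 48 * time.Hour // the proposed 48h cap

func prunePeers(mu *sync.Mutex, peers map[peer.ID]*peerState) {
	mu.Lock()
	defer mu.Unlock()
	cutoff := time.Now().Add(-maxPeerAge)
	for id, st := range peers {
		if st.lastConnected.Before(cutoff) {
			delete(peers, id)
		}
	}
}
```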
// How long to wait since last connection before probing a peer again
PeerProbeThreshold = time.Hour
Thoughts on using this const, so we always engage probing AFTER the go-libp2p TTL expires? (Right now it is also 1h, but if it changes in the future, it could impact the efficiency of our probe.)
Suggested change:
- // How long to wait since last connection before probing a peer again
- PeerProbeThreshold = time.Hour
+ // How long to wait since last connection before probing a peer again
+ PeerProbeThreshold = peerstore.AddressTTL
If you search for references to peerstore.AddressTTL you'll see it isn't used by anything (both in go-libp2p and go-libp2p-kad-dht), so maybe we should remove it?
There are two relevant address TTLs:
- RecentlyConnectedTTL which is 15 minutes
- ProviderAddrTTL which is 24 hours and is only for addresses associated with a provider record you are storing (so much less prevalent)
If we want to probe after the go-libp2p TTL expires, that would have to be 15 minutes for most addresses, but probing that frequently would mean we probe almost every peer on every probePeers run, since that's how often the probe runs. Once we have histogram data from production on how long probePeers takes, we can adjust.
If you want, we can maybe make this a multiple of RecentlyConnectedTTL, e.g. RecentlyConnectedTTL * 4?
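If we go that route, the threshold can simply be derived from the go-libp2p value. Note that the actual identifier in go-libp2p's core/peerstore is RecentlyConnectedAddrTTL, and the TTLs there are declared as vars, so the derived threshold would be a var as well (sketch, not the PR's code):

```go
// Sketch only: derive the probe threshold from go-libp2p's TTL so the two
// values cannot silently drift apart. With the current 15m TTL this is 1h.
package config

import "github.com/libp2p/go-libp2p/core/peerstore"

// How long to wait since last connection before probing a peer again.
var PeerProbeThreshold = 4 * peerstore.RecentlyConnectedAddrTTL
```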
Co-authored-by: Marcin Rataj <lidel@lidel.org>
Co-authored-by: Marcin Rataj <lidel@lidel.org>
2q-lru tracks both frequently and recently used entries separately
we don't need the return count with the 2q-lru cache and the peerAddrLookups metric
mock the libp2p host and use a real event bus
Thanks @lidel. I've addressed all your points.
I've been running this for a little while with the accelerated DHT client (similar to how production runs) and got these metrics:
They reveal an average of ~5 minutes for a probe cycle. I therefore made the following changes to the instrumentation:
Cache counters after running for a couple of hours:
What
This is an attempt to fix #16 by implementing #53.
Also fixes #25
How
memoryAddrBook which …
New magic numbers
We have to start with some defaults. This PR introduces some magic numbers which will likely change as we get some operational data:
someguy/server_addr_book.go, lines 21 to 37 in 19b15aa
Open questions
- The only peers we probe and cache are peers that we've successfully run identify with. Peers returned without multiaddrs from FindProviders for which we have no cached multiaddrs remain unresolved. Should we try to call FindPeer inside the iterator so they can be resolved? This could block the streaming of the providers in the iterator.
- Should we probe the last connected addr or all addresses we have for a peer?
- When should we augment results with cached addresses? Currently, it's done only when there are no addresses in the FindProviders results from kad-dht. The presumption is that if the results from FindProviders have multiaddrs for a peer, they are up to date.
- How do we prevent excessive memory consumption by the cached address book? The memory address book already has built-in limits and cleanup. However, the peers map doesn't. Temp solution: I've added some instrumentation for this.