
feat: add active peer probing and a cached addr book #90

Open · wants to merge 62 commits into main

Conversation

@2color (Member) commented Nov 27, 2024

What

This is an attempt to fix #16 by implementing #53.

Also fixes #25

How

  • New Cached Address Book
  • New Cached Router that enriches results with cached addresses when records have no addresses.
    • Implements a custom iterator for FindProviders that looks up the cache and returns the result with addrs if there's a cache HIT, or dispatches a FindPeer if there's a cache miss and returns the resolved addrs to the user once a result comes back (see the sketch after this list).
  • New background goroutine
    • Subscribes to identify and connectedness events and updates the cached address book.
    • Runs a probe against all peers that meet the probe criteria:
      • Not currently connected
      • Haven't been probed within the probe threshold (1 hour)
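To make the lookup flow concrete, here is a minimal sketch of the HIT/MISS handling; the addrCache and peerRouter interfaces and the resolveAddrs name are illustrative stand-ins, not the PR's actual iterator or types:

package sketch

import (
	"context"

	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/multiformats/go-multiaddr"
)

// addrCache is a stand-in for the cached address book.
type addrCache interface {
	GetCachedAddrs(p peer.ID) []multiaddr.Multiaddr
}

// peerRouter is a stand-in for the underlying router's FindPeer.
type peerRouter interface {
	FindPeer(ctx context.Context, p peer.ID) (peer.AddrInfo, error)
}

// resolveAddrs fills in addresses for a provider record that came back without any:
// a cache HIT returns the cached addrs immediately, a cache MISS falls back to FindPeer.
func resolveAddrs(ctx context.Context, cache addrCache, router peerRouter, p peer.ID) []multiaddr.Multiaddr {
	if addrs := cache.GetCachedAddrs(p); len(addrs) > 0 {
		return addrs // cache HIT
	}
	// cache MISS: dispatch a FindPeer and return whatever it resolves
	ai, err := router.FindPeer(ctx, p)
	if err != nil {
		return nil // still unresolved
	}
	return ai.Addrs
}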

New magic numbers

We have to start with some defaults. This PR introduces some magic numbers which will likely change as we get operational data:

// The TTL to keep recently connected peers for. This should be enough time to probe
const RecentlyConnectedAddrTTL = time.Hour * 24
// Connected peers don't expire until they disconnect
const ConnectedAddrTTL = math.MaxInt64
// How long to wait since last connection before probing a peer again
const PeerProbeThreshold = time.Hour
// How often to run the probe peers function
const ProbeInterval = time.Minute * 5
// How many concurrent probes to run at once
const MaxConcurrentProbes = 20
// How many connect failures to tolerate before clearing a peer's addresses
const MaxConnectFailures = 3
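For illustration, a rough sketch of how these constants might drive the background probe loop; it assumes the constants above are in scope, and shouldProbe / probeLoop / probePeers are made-up names for the example, not the PR's actual functions:

package sketch

import (
	"context"
	"sync"
	"time"

	"github.com/libp2p/go-libp2p/core/peer"
)

// shouldProbe captures the probe criteria above: not currently connected and
// not probed within PeerProbeThreshold.
func shouldProbe(connected bool, lastProbe time.Time) bool {
	return !connected && time.Since(lastProbe) > PeerProbeThreshold
}

// probeLoop runs the probe function every ProbeInterval until the context is cancelled.
func probeLoop(ctx context.Context, run func(context.Context)) {
	ticker := time.NewTicker(ProbeInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			run(ctx)
		}
	}
}

// probePeers dials candidate peers with at most MaxConcurrentProbes in flight.
// Connect failures would feed into the MaxConnectFailures accounting.
func probePeers(ctx context.Context, candidates []peer.ID, connect func(context.Context, peer.ID) error) {
	sem := make(chan struct{}, MaxConcurrentProbes)
	var wg sync.WaitGroup
	for _, p := range candidates {
		sem <- struct{}{}
		wg.Add(1)
		go func(p peer.ID) {
			defer wg.Done()
			defer func() { <-sem }()
			_ = connect(ctx, p)
		}(p)
	}
	wg.Wait()
}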

Open questions

  • The only peers we probe and cache are peers that we've successfully run identify with. Peers returned without multiaddrs from FindProviders for which we have no cached multiaddrs remain unresolved.
    • Should we try to call FindPeer inside the iterator so they can be resolved? This can block the streaming of other providers in the iterator.
    • Another way might be to subscribe to kad-dht query events (not 100% sure if this is possible) and add them to the probe loop.
  • Should we probe the last connected addr or all addresses we have for a Peer?
  • When should we augment results with cached addresses? Currently, it's done only when there are no results in the FindProviders response from kad-dht. The presumption is that if the FindProviders results include multiaddrs for a peer, they are up to date.
  • How do we prevent excessive memory consumption by the cached address book? The memory address book already has built-in limits and cleanup. However, the peers map doesn't. Temporary solution: I've added some instrumentation for this.

@2color 2color marked this pull request as ready for review November 28, 2024 15:43
@2color 2color requested review from lidel and aschmahmann November 28, 2024 15:57
This adds a metric for evaluating all addr lookups:
someguy_cached_router_peer_addr_lookups{cache="unused|hit|miss",origin="providers|peers"}

I've also wired up FindPeers for completeness.
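For reference, this is roughly how such a labelled counter is declared with prometheus/client_golang; the name and help text mirror the metric above, but this is a sketch rather than the PR's exact wiring:

package sketch

import "github.com/prometheus/client_golang/prometheus"

// peerAddrLookups mirrors someguy_cached_router_peer_addr_lookups{cache, origin}.
var peerAddrLookups = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "someguy_cached_router_peer_addr_lookups",
	Help: "Number of peer addr info lookups per origin and cache state",
}, []string{"cache", "origin"})

func init() {
	prometheus.MustRegister(peerAddrLookups)
}

// Example: count a cache hit while streaming provider records.
func recordProviderHit() {
	peerAddrLookups.WithLabelValues("hit", "providers").Inc()
}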
@lidel (Member) left a comment

Made a first pass and dropped some suggestions inline. I also pushed a new metric (details inline).

As for Open questions, my thinking is:

  • The only peers we probe and cache are peers that we've successfully run identify with. Peers returned without multiaddrs from FindProviders for which we have no cached multiaddrs remain unresolved.
    • Should we try to call FindPeer inside the iterator so they can be resolved? This can block the streaming of the providers in the iterator.

Indeed, looking at someguy_cached_router_peer_addr_lookups shows we get cache misses quite often (0 addrs, and the cache does not have them either).

It was a bit difficult to reason about this without real-world input, so I've piped root CIDs hitting our staging environment into it to populate the metric:

  • with CID duplicates: ssh ubuntu@kubo-staging-us-east-02.ovh.dwebops.net tail -F /var/log/nginx/access-json.log | stdbuf -o0 jq -r .request_uri | awk -F'[/&?]' '{print $3; fflush()}' | xargs -P 100 -I {cid} curl -s -o /dev/null -w "%{http_code} %{url.path}\n" "http://127.0.0.1:8190/routing/v1/providers/{cid}"
  • only unique CIDs: ssh ubuntu@kubo-staging-us-east-02.ovh.dwebops.net tail -F /var/log/nginx/access-json.log | stdbuf -o0 jq -r .request_uri | awk -F'[/&?]' '!seen[$3]++ {print $3; fflush()}' | xargs -P 100 -I {cid} curl -s -o /dev/null -w "%{http_code} %{url.path}\n" "http://127.0.0.1:8190/routing/v1/providers/{cid}"

A few minutes later http://127.0.0.1:8190/debug/metrics/prometheus shows:

# HELP someguy_cached_router_peer_addr_lookups Number of peer addr info lookups per origin and cache state
# TYPE someguy_cached_router_peer_addr_lookups counter
someguy_cached_router_peer_addr_lookups{cache="hit",origin="providers"} 1323
someguy_cached_router_peer_addr_lookups{cache="miss",origin="providers"} 6574
someguy_cached_router_peer_addr_lookups{cache="unused",origin="providers"} 7686

So yes, finding a way of decreasing the miss rate feels useful, given how high it is.

Two ideas:

  • Lazy/easy: avoid blocking the iterator by adding peers with cache misses to some queue, and then processing them asynchronously at some safe rate, populating the cache in best-effort fashion. May not help the first query, but all subsequent ones, over time, will get an increased cache hit rate.
  • Implement a custom iterator: if a peer hits a cache miss, we don't return the peer, but silently move to the next item and put the current one on a side queue which is processed asynchronously by calling findPeer. Once the iterator hits the last item, we go back to the items on the side queue. This way we don't slow down results with addrs, and we can wait and stream the slow ones at the end without impacting the perf of the fast ones (see the sketch below).
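To make the second idea concrete, a self-contained sketch of that side-queue shape; providerRecord, recordIter, and resolve are simplified stand-ins, not the PR's iterator or boxo types:

package sketch

import "sync"

// providerRecord and recordIter are simplified stand-ins for the real types.
type providerRecord struct {
	ID    string
	Addrs []string
}

type recordIter interface {
	Next() (rec providerRecord, ok bool)
}

// drainWithSideQueue streams records that already have addrs immediately,
// resolves address-less records asynchronously on a side queue, and yields
// whatever got resolved once the underlying iterator is exhausted.
func drainWithSideQueue(it recordIter, resolve func(id string) []string, yield func(providerRecord)) {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		side []providerRecord
	)
	for {
		rec, ok := it.Next()
		if !ok {
			break
		}
		if len(rec.Addrs) > 0 {
			yield(rec) // fast path: stream records with addrs right away
			continue
		}
		wg.Add(1)
		go func(rec providerRecord) { // cache miss: resolve it on the side
			defer wg.Done()
			rec.Addrs = resolve(rec.ID)
			mu.Lock()
			side = append(side, rec)
			mu.Unlock()
		}(rec)
	}
	wg.Wait() // fast records are done; now stream the side queue
	for _, rec := range side {
		yield(rec)
	}
}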
  • Should we probe the last connected addr or all addresses we have for a Peer?

See comment inline; if I understand correctly, host.Connect effectively probes all known addrs until success.
Probably good enough for now. If we need per-addr resolution, we may need to ask go-libp2p for a new API.

Note that vole libp2p identify <multiaddr> connects to a specific multiaddr because it does not run routing and spawns a new libp2p host every time.

  • When should we augment results with cached addresses? Currently, it's done only when there are no results in the FindProviders response from kad-dht. The presumption is that if the FindProviders results include multiaddrs for a peer, they are up to date.

I think the current approach of hitting the cache only when regular routing returns no addrs is sensible.
It also makes it easier to reason about metrics like someguy_cached_router_peer_addr_lookups{origin,cache}.

  • How do we prevent excessive memory consumption by the cached address book? The memory address book already has built-in limits and cleanup. However, the peers map doesn't. Temporary solution: I've added some instrumentation for this.

Cap at TTL of 48h?
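A small sketch of what such a 48h cap could look like as a periodic sweep over the peer state map; the peerState fields and names here are assumptions for illustration, not the PR's actual data structures:

package sketch

import (
	"sync"
	"time"
)

const peerStateTTL = 48 * time.Hour // hypothetical cap suggested above

type peerState struct {
	lastSeen time.Time // e.g. last successful connection or probe
}

type peerStateMap struct {
	mu    sync.Mutex
	peers map[string]peerState
}

// sweep drops peers not seen within peerStateTTL, bounding the map's memory use.
func (m *peerStateMap) sweep(now time.Time) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for id, st := range m.peers {
		if now.Sub(st.lastSeen) > peerStateTTL {
			delete(m.peers, id)
		}
	}
}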

Comment on lines +49 to +50
// How long to wait since last connection before probing a peer again
PeerProbeThreshold = time.Hour

Thoughts on using this const, so we always engage probing AFTER the go-libp2p TTL expires? (Right now it is also 1h, but if it changes in the future, it could impact the efficiency of our probe.)

https://github.com/libp2p/go-libp2p/blob/8423de3a64f17f6bec18bf57b472e5a3615883db/core/peerstore/peerstore.go#L24

Suggested change
- // How long to wait since last connection before probing a peer again
- PeerProbeThreshold = time.Hour
+ // How long to wait since last connection before probing a peer again
+ PeerProbeThreshold = peerstore.AddressTTL

@2color (Member, Author) commented Nov 29, 2024

If you search for references to peerstore.AddressTTL you'll see it isn't used by anything (in either go-libp2p or go-libp2p-kad-dht), so maybe we should remove it?

There are two relevant address TTLs:

  • RecentlyConnectedTTL which is 15 minutes
  • ProviderAddrTTL which is 24 hours and is only for addresses associated with a provider record you are storing (so much less prevalent)

If we want to probe after the go-libp2p TTL expires, that would have to be 15 minutes for most addresses, but probing that frequently would mean we probe almost every peer on every probePeers run, since that's how often the probe runs. Once we have histogram data from production on how long probePeers takes, we can adjust.

If you want, we can maybe make this a multiple of RecentlyConnectedTTL, e.g. RecentlyConnectedTTL * 4?

Resolved review threads: cached_addr_book.go (7, outdated), CHANGELOG.md, server_cached_router.go (outdated)
2color and others added 3 commits November 29, 2024 16:33
Co-authored-by: Marcin Rataj <lidel@lidel.org>
Co-authored-by: Marcin Rataj <lidel@lidel.org>
@2color (Member, Author) commented Dec 5, 2024

Thanks @lidel. I've addressed all your points.

@2color 2color requested a review from lidel December 6, 2024 11:57
@2color (Member, Author) commented Dec 6, 2024

I've been running this for a little while with the accelerated DHT client (similar to how production runs) and got these metrics:

# HELP someguy_cached_addr_book_peer_state_size Number of peers object currently in the peer state
# TYPE someguy_cached_addr_book_peer_state_size gauge
someguy_cached_addr_book_peer_state_size 9729
# HELP someguy_cached_addr_book_probe_duration_seconds Duration of peer probing operations in seconds
# TYPE someguy_cached_addr_book_probe_duration_seconds histogram
someguy_cached_addr_book_probe_duration_seconds_bucket{le="1"} 0
someguy_cached_addr_book_probe_duration_seconds_bucket{le="2"} 0
someguy_cached_addr_book_probe_duration_seconds_bucket{le="5"} 0
someguy_cached_addr_book_probe_duration_seconds_bucket{le="10"} 0
someguy_cached_addr_book_probe_duration_seconds_bucket{le="30"} 0
someguy_cached_addr_book_probe_duration_seconds_bucket{le="60"} 0
someguy_cached_addr_book_probe_duration_seconds_bucket{le="120"} 0
someguy_cached_addr_book_probe_duration_seconds_bucket{le="300"} 9
someguy_cached_addr_book_probe_duration_seconds_bucket{le="+Inf"} 12
someguy_cached_addr_book_probe_duration_seconds_sum 3197.8019163719996
someguy_cached_addr_book_probe_duration_seconds_count 12

These reveal an average of ~5 minutes per probe cycle.

I therefore made the following changes to the instrumentation:

  1. Added a probedPeersCounter to track how many individual peers are probed.
  2. Increased the probe_duration bucket sizes so we can keep an eye on whether probe cycles get too long as the address book grows (as happens with the accelerated DHT client).

Cache counters after running for a couple of hours:

# TYPE someguy_cached_addr_book_peer_state_size gauge
someguy_cached_addr_book_peer_state_size 10629

# HELP someguy_cached_router_peer_addr_lookups Number of peer addr info lookups per origin and cache state
# TYPE someguy_cached_router_peer_addr_lookups counter
someguy_cached_router_peer_addr_lookups{cache="hit",origin="providers"} 2218
someguy_cached_router_peer_addr_lookups{cache="miss",origin="peers"} 187
someguy_cached_router_peer_addr_lookups{cache="miss",origin="providers"} 504
someguy_cached_router_peer_addr_lookups{cache="unused",origin="peers"} 106
someguy_cached_router_peer_addr_lookups{cache="unused",origin="providers"} 5517

@2color 2color requested a review from sukunrt December 6, 2024 13:04
Resolved review threads: cached_addr_book.go (5), server_cached_router.go
@2color 2color requested a review from lidel December 11, 2024 11:46
Successfully merging this pull request may close these issues:

  • Local storage for local caching purposes
  • No multiaddrs returned from provider record lookups

3 participants