
feat: add active peer probing and a cached addr book #90

Open · wants to merge 62 commits into main

Conversation

@2color (Member) commented Nov 27, 2024

What

This is an attempt to fix #16 by implementing #53.

Also fixes #25

How

  • New Cached Address Book
  • New Cached Router that enriches results with cached addresses when records have no addresses.
    • Implements a custom iterator for FindProviders that looks up the cache and returns the result with addrs if there's a cache HIT, or dispatches a FindPeer if there's a cache miss and returns the resolved addrs to the user once a result comes back (see the sketch after this list).
  • New background goroutine
    • Subscribes to identify and connectedness events and updates the cached address book.
    • Runs a probe against all peers that meet the probe criteria:
      • Not currently connected
      • Haven't been probed within the probe threshold (1 hour)
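To make the lookup flow concrete, here is a minimal sketch of the HIT/MISS handling; the addrCache and peerRouter interfaces and the resolveAddrs name are illustrative stand-ins, not the PR's actual iterator or types:

package sketch

import (
	"context"

	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/multiformats/go-multiaddr"
)

// addrCache is a stand-in for the cached address book.
type addrCache interface {
	GetCachedAddrs(p peer.ID) []multiaddr.Multiaddr
}

// peerRouter is a stand-in for the underlying router's FindPeer.
type peerRouter interface {
	FindPeer(ctx context.Context, p peer.ID) (peer.AddrInfo, error)
}

// resolveAddrs fills in addresses for a provider record that came back without any:
// a cache HIT returns the cached addrs immediately, a cache MISS falls back to FindPeer.
func resolveAddrs(ctx context.Context, cache addrCache, router peerRouter, p peer.ID) []multiaddr.Multiaddr {
	if addrs := cache.GetCachedAddrs(p); len(addrs) > 0 {
		return addrs // cache HIT
	}
	// cache MISS: dispatch a FindPeer and return whatever it resolves
	ai, err := router.FindPeer(ctx, p)
	if err != nil {
		return nil // still unresolved
	}
	return ai.Addrs
}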

New magic numbers

We have to start with some defaults. This PR introduces some magic numbers which will likely change as we get operational data:

// The TTL to keep recently connected peers for. This should be enough time to probe
const RecentlyConnectedAddrTTL = time.Hour * 24
// Connected peers don't expire until they disconnect
const ConnectedAddrTTL = math.MaxInt64
// How long to wait since last connection before probing a peer again
const PeerProbeThreshold = time.Hour
// How often to run the probe peers function
const ProbeInterval = time.Minute * 5
// How many concurrent probes to run at once
const MaxConcurrentProbes = 20
// How many connect failures to tolerate before clearing a peer's addresses
const MaxConnectFailures = 3
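For illustration, a rough sketch of how these constants might drive the background probe loop; it assumes the constants above are in scope, and shouldProbe / probeLoop / probePeers are made-up names for the example, not the PR's actual functions:

package sketch

import (
	"context"
	"sync"
	"time"

	"github.com/libp2p/go-libp2p/core/peer"
)

// shouldProbe captures the probe criteria above: not currently connected and
// not probed within PeerProbeThreshold.
func shouldProbe(connected bool, lastProbe time.Time) bool {
	return !connected && time.Since(lastProbe) > PeerProbeThreshold
}

// probeLoop runs the probe function every ProbeInterval until the context is cancelled.
func probeLoop(ctx context.Context, run func(context.Context)) {
	ticker := time.NewTicker(ProbeInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			run(ctx)
		}
	}
}

// probePeers dials candidate peers with at most MaxConcurrentProbes in flight.
// Connect failures would feed into the MaxConnectFailures accounting.
func probePeers(ctx context.Context, candidates []peer.ID, connect func(context.Context, peer.ID) error) {
	sem := make(chan struct{}, MaxConcurrentProbes)
	var wg sync.WaitGroup
	for _, p := range candidates {
		sem <- struct{}{}
		wg.Add(1)
		go func(p peer.ID) {
			defer wg.Done()
			defer func() { <-sem }()
			_ = connect(ctx, p)
		}(p)
	}
	wg.Wait()
}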

Open questions

  • The only peers we probe and cache are peers that we've successfully run identify with. Peers returned without multiaddrs from FindProviders for which we have no cached multiaddrs remain unresolved.
    • Should we try to call FindPeer inside the iterator so they can be resolved? This can block the streaming of other providers in the iterator.
    • Another way might be to subscribe to kad-dht query events (not 100% sure if this is possible) and add them to the probe loop.
  • Should we probe the last connected addr or all addresses we have for a Peer?
  • When should we augment results with cached addresses? Currently, it's done only when there are no results in the FindProviders response from kad-dht. The presumption is that if the FindProviders results include multiaddrs for a peer, they are up to date.
  • How do we prevent excessive memory consumption by the cached address book? The memory address book already has built-in limits and cleanup. However, the peers map doesn't. Temporary solution: I've added some instrumentation for this.

@2color 2color marked this pull request as ready for review November 28, 2024 15:43
@2color 2color requested review from lidel and aschmahmann November 28, 2024 15:57
This adds a metric for evaluating all addr lookups:
someguy_cached_router_peer_addr_lookups{cache="unused|hit|miss",origin="providers|peers"}

I've also wired up FindPeers for completeness.
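For reference, this is roughly how such a labelled counter is declared with prometheus/client_golang; the name and help text mirror the metric above, but this is a sketch rather than the PR's exact wiring:

package sketch

import "github.com/prometheus/client_golang/prometheus"

// peerAddrLookups mirrors someguy_cached_router_peer_addr_lookups{cache, origin}.
var peerAddrLookups = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "someguy_cached_router_peer_addr_lookups",
	Help: "Number of peer addr info lookups per origin and cache state",
}, []string{"cache", "origin"})

func init() {
	prometheus.MustRegister(peerAddrLookups)
}

// Example: count a cache hit while streaming provider records.
func recordProviderHit() {
	peerAddrLookups.WithLabelValues("hit", "providers").Inc()
}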
@lidel (Member) left a comment

Made a first pass and dropped some suggestions inline. I also pushed a new metric (details inline).

As for Open questions, my thinking is:

  • The only peers we probe and cache are peers that we've successfully run identify with. Peers returned without multiaddrs from FindProviders for which we have no cached multiaddrs remain unresolved.
    • Should we try to call FindPeer inside the iterator so they can be resolved? This can block the streaming of the providers in the iterator.

Indeed, looking at someguy_cached_router_peer_addr_lookups shows we get cache misses quite often (0 addrs, and the cache does not have them either).

It was a bit difficult to reason about this without real-world input, so I've piped root CIDs hitting our staging environment into it to populate the metric:

  • with CID duplicates: ssh ubuntu@kubo-staging-us-east-02.ovh.dwebops.net tail -F /var/log/nginx/access-json.log | stdbuf -o0 jq -r .request_uri | awk -F'[/&?]' '{print $3; fflush()}' | xargs -P 100 -I {cid} curl -s -o /dev/null -w "%{http_code} %{url.path}\n" "http://127.0.0.1:8190/routing/v1/providers/{cid}"
  • only unique CIDs: ssh ubuntu@kubo-staging-us-east-02.ovh.dwebops.net tail -F /var/log/nginx/access-json.log | stdbuf -o0 jq -r .request_uri | awk -F'[/&?]' '!seen[$3]++ {print $3; fflush()}' | xargs -P 100 -I {cid} curl -s -o /dev/null -w "%{http_code} %{url.path}\n" "http://127.0.0.1:8190/routing/v1/providers/{cid}"

A few minutes later http://127.0.0.1:8190/debug/metrics/prometheus shows:

# HELP someguy_cached_router_peer_addr_lookups Number of peer addr info lookups per origin and cache state
# TYPE someguy_cached_router_peer_addr_lookups counter
someguy_cached_router_peer_addr_lookups{cache="hit",origin="providers"} 1323
someguy_cached_router_peer_addr_lookups{cache="miss",origin="providers"} 6574
someguy_cached_router_peer_addr_lookups{cache="unused",origin="providers"} 7686

So yes, finding a way of decreasing the miss rate feels useful, given how high it is.

Two ideas:

  • Lazy/easy: avoid blocking the iterator by adding peers with cache misses to some queue, and then processing them asynchronously at some safe rate, populating the cache in best-effort fashion. May not help the first query, but all subsequent ones, over time, will get an increased cache hit rate.
  • Implement a custom iterator: if a peer hits a cache miss, we don't return the peer, but silently move to the next item and put the current one on a side queue which is processed asynchronously by calling findPeer. Once the iterator hits the last item, we go back to the items on the side queue. This way we don't slow down results with addrs, and we can wait and stream the slow ones at the end without impacting the perf of the fast ones (see the sketch below).
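To make the second idea concrete, a self-contained sketch of that side-queue shape; providerRecord, recordIter, and resolve are simplified stand-ins, not the PR's iterator or boxo types:

package sketch

import "sync"

// providerRecord and recordIter are simplified stand-ins for the real types.
type providerRecord struct {
	ID    string
	Addrs []string
}

type recordIter interface {
	Next() (rec providerRecord, ok bool)
}

// drainWithSideQueue streams records that already have addrs immediately,
// resolves address-less records asynchronously on a side queue, and yields
// whatever got resolved once the underlying iterator is exhausted.
func drainWithSideQueue(it recordIter, resolve func(id string) []string, yield func(providerRecord)) {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		side []providerRecord
	)
	for {
		rec, ok := it.Next()
		if !ok {
			break
		}
		if len(rec.Addrs) > 0 {
			yield(rec) // fast path: stream records with addrs right away
			continue
		}
		wg.Add(1)
		go func(rec providerRecord) { // cache miss: resolve it on the side
			defer wg.Done()
			rec.Addrs = resolve(rec.ID)
			mu.Lock()
			side = append(side, rec)
			mu.Unlock()
		}(rec)
	}
	wg.Wait() // fast records are done; now stream the side queue
	for _, rec := range side {
		yield(rec)
	}
}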
  • Should we probe the last connected addr or all addresses we have for a Peer?

See comment inline; if I understand correctly, host.Connect effectively probes all known addrs until success.
Probably good enough for now. If we need per-addr resolution, we may need to ask go-libp2p for a new API.

Note that vole libp2p identify <multiaddr> connects to a specific multiaddr because it does not run routing and spawns a new libp2p host every time.

  • When should we augment results with cached addresses? Currently, it's done only when there are no results in the FindProviders response from kad-dht. The presumption is that if the FindProviders results include multiaddrs for a peer, they are up to date.

I think the current approach of hitting the cache only when regular routing returns no addrs is sensible.
It also makes it easier to reason about metrics like someguy_cached_router_peer_addr_lookups{origin,cache}.

  • How do we prevent excessive memory consumption by the cached address book? The memory address book already has built-in limits and cleanup. However, the peers map doesn't. Temporary solution: I've added some instrumentation for this.

Cap at TTL of 48h?
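A small sketch of what such a 48h cap could look like as a periodic sweep over the peer state map; the peerState fields and names here are assumptions for illustration, not the PR's actual data structures:

package sketch

import (
	"sync"
	"time"
)

const peerStateTTL = 48 * time.Hour // hypothetical cap suggested above

type peerState struct {
	lastSeen time.Time // e.g. last successful connection or probe
}

type peerStateMap struct {
	mu    sync.Mutex
	peers map[string]peerState
}

// sweep drops peers not seen within peerStateTTL, bounding the map's memory use.
func (m *peerStateMap) sweep(now time.Time) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for id, st := range m.peers {
		if now.Sub(st.lastSeen) > peerStateTTL {
			delete(m.peers, id)
		}
	}
}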

Comment on lines +49 to +50
// How long to wait since last connection before probing a peer again
PeerProbeThreshold = time.Hour

Thoughts on using this const, so we always engage probing AFTER the go-libp2p TTL expires? (Right now it is also 1h, but if it changes in the future, it could impact the efficiency of our probe.)

https://github.com/libp2p/go-libp2p/blob/8423de3a64f17f6bec18bf57b472e5a3615883db/core/peerstore/peerstore.go#L24

Suggested change
- // How long to wait since last connection before probing a peer again
- PeerProbeThreshold = time.Hour
+ // How long to wait since last connection before probing a peer again
+ PeerProbeThreshold = peerstore.AddressTTL

@2color (Member, Author) commented Nov 29, 2024

If you search for references to peerstore.AddressTTL you'll see it isn't used by anything (in either go-libp2p or go-libp2p-kad-dht), so maybe we should remove it?

There are two relevant address TTLs:

  • RecentlyConnectedTTL which is 15 minutes
  • ProviderAddrTTL which is 24 hours and is only for addresses associated with a provider record you are storing (so much less prevalent)

If we want to probe after the go-libp2p TTL expires, that would have to be 15 minutes for most addresses, but probing that frequently would mean we probe almost every peer on every probePeers run, since that's how often the probe runs. Once we have histogram data from production on how long probePeers takes, we can adjust.

If you want, we can maybe make this a multiple of RecentlyConnectedTTL, e.g. RecentlyConnectedTTL * 4?

Resolved review threads: cached_addr_book.go (7, outdated), CHANGELOG.md, server_cached_router.go (outdated)
2color and others added 3 commits November 29, 2024 16:33
Co-authored-by: Marcin Rataj <lidel@lidel.org>
Co-authored-by: Marcin Rataj <lidel@lidel.org>
@2color (Member, Author) commented Dec 5, 2024

Thanks @lidel. I've addressed all your points.

@2color 2color requested a review from lidel December 6, 2024 11:57
@2color (Member, Author) commented Dec 6, 2024

I've been running this for a little while with the accelerated DHT client (similar to how production runs) and got these metrics:

# HELP someguy_cached_addr_book_peer_state_size Number of peers object currently in the peer state
# TYPE someguy_cached_addr_book_peer_state_size gauge
someguy_cached_addr_book_peer_state_size 9729
# HELP someguy_cached_addr_book_probe_duration_seconds Duration of peer probing operations in seconds
# TYPE someguy_cached_addr_book_probe_duration_seconds histogram
someguy_cached_addr_book_probe_duration_seconds_bucket{le="1"} 0
someguy_cached_addr_book_probe_duration_seconds_bucket{le="2"} 0
someguy_cached_addr_book_probe_duration_seconds_bucket{le="5"} 0
someguy_cached_addr_book_probe_duration_seconds_bucket{le="10"} 0
someguy_cached_addr_book_probe_duration_seconds_bucket{le="30"} 0
someguy_cached_addr_book_probe_duration_seconds_bucket{le="60"} 0
someguy_cached_addr_book_probe_duration_seconds_bucket{le="120"} 0
someguy_cached_addr_book_probe_duration_seconds_bucket{le="300"} 9
someguy_cached_addr_book_probe_duration_seconds_bucket{le="+Inf"} 12
someguy_cached_addr_book_probe_duration_seconds_sum 3197.8019163719996
someguy_cached_addr_book_probe_duration_seconds_count 12

These reveal an average of ~5 minutes per probe cycle.

I therefore made the following changes to the instrumentation:

  1. Added a probedPeersCounter to track how many individual peers are probed.
  2. Increased the probe_duration bucket sizes so we can keep an eye on whether probe cycles get too long as the address book grows (as happens with the accelerated DHT client).

Cache counters after running for a couple of hours:

# TYPE someguy_cached_addr_book_peer_state_size gauge
someguy_cached_addr_book_peer_state_size 10629

# HELP someguy_cached_router_peer_addr_lookups Number of peer addr info lookups per origin and cache state
# TYPE someguy_cached_router_peer_addr_lookups counter
someguy_cached_router_peer_addr_lookups{cache="hit",origin="providers"} 2218
someguy_cached_router_peer_addr_lookups{cache="miss",origin="peers"} 187
someguy_cached_router_peer_addr_lookups{cache="miss",origin="providers"} 504
someguy_cached_router_peer_addr_lookups{cache="unused",origin="peers"} 106
someguy_cached_router_peer_addr_lookups{cache="unused",origin="providers"} 5517

@2color 2color requested a review from sukunrt December 6, 2024 13:04
Resolved review threads: cached_addr_book.go (5), server_cached_router.go
@2color 2color requested a review from lidel December 11, 2024 11:46
Successfully merging this pull request may close these issues:

  • Local storage for local caching purposes
  • No multiaddrs returned from provider record lookups

3 participants