Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trace gossipsub scores #10

Merged
merged 13 commits into from
Apr 17, 2024
Merged

Trace gossipsub scores #10

merged 13 commits into from
Apr 17, 2024

Conversation

cortze
Copy link
Contributor

@cortze cortze commented Apr 3, 2024

Decription

The current tracing module of the Libp2p host and the GossipSub events includes most of the interactions between Hermes and remote peers in the Ethereum network. There is still a relevant point missing to debug, with higher resolution, how GossipSub events (mostly PRUNES/DISCONNECTIONS) relate to the PeerScore of each peer per topic.

Thus, this PR aims to aggregate the periodic export of the PeerScores snapshots. This feature was already present in the Lotus + TraceCatcher combo but hasn't landed on Hermes yet.

Tasks

  • Add the periodic inspection of the PeerScores (per peer and per topic), and flush them as traces
  • Allow the Snapshot Frequency to be adjustable (Let me know if you want a smaller name for the flag, I went for a descriptive one ;) )
  • Incorporate the new feature to the Ethereum GossipSub service as an Option
  • Test whether the peerscore traces are getting correctly flushed and tracked at the AWS kinesis instance

@cortze cortze self-assigned this Apr 3, 2024
@cortze
Copy link
Contributor Author

cortze commented Apr 3, 2024

Interestingly, the tests happily pass successfully when I try them out locally:

(trace-gossipsub-scores)$ go test ./tele
PASS
ok      github.com/probe-lab/hermes/tele        0.004s

I haven't even changed the existing tests or functions tested. I'll further look into it.

@guillaumemichel
Copy link
Contributor

It seems that the tests are failing due to the 32-bit architecture, which is weird, I restarted the jobs, let's see if they pass.

If the problem persists, we can get rid of 32-bit tests.

cmd/hermes/cmd_eth.go Outdated Show resolved Hide resolved
cmd/hermes/cmd_eth.go Outdated Show resolved Hide resolved
@@ -15,7 +15,7 @@ import (
gcrypto "github.com/ethereum/go-ethereum/crypto"
"github.com/libp2p/go-libp2p"
mplex "github.com/libp2p/go-libp2p-mplex"
"github.com/libp2p/go-libp2p-pubsub"
pubsub "github.com/libp2p/go-libp2p-pubsub"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this alias? I guess this was added automatically. No big deal, just curious 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be an automated behaviour from VS Code, let me check if i can adjust it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird, I also use vscode with the Go extension, but I don't get aliases for imports. Do you use a custom format extension?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be a clean install of code on manjaro + the go extension. I'll try it out with the one from the ms-vscode guys

host/host.go Outdated Show resolved Hide resolved
host/peerscore.go Outdated Show resolved Hide resolved
host/peerscore.go Outdated Show resolved Hide resolved

// ScoreKeeper is a thread-safe local copy of the score per peer and per copy
// TODO: figure out if this is some sort of info that we want to expose through OpenTelemetry (Still good to have it)
type ScoreKeeper struct {
Copy link
Contributor

@dennis-tra dennis-tra Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently has no mechanism to remove peers. It would grow indefinitely over time. It would be great to have a method on ScoreKeeper that removes peers (and also remove peers when, e.g., we disconnect?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Update() method should directly swap the entire map for the one given by the GossipSub Inspect call.

from the spec

Furthermore, the score is retained for some period of time when a peer disconnects, so that malicious peers cannot easily reset their score when it drops to negative and well behaving peers don't lose their status because of a disconnection.

As far as I know, these scores are eventually dropped, so we would just be tracking all the scores available at the GossipSub level. (There seems to be a RetainScore duration after the Disconnection)

host/peerscore.go Outdated Show resolved Hide resolved
host/host.go Outdated Show resolved Hide resolved
host/host.go Outdated Show resolved Hide resolved
@cortze
Copy link
Contributor Author

cortze commented Apr 9, 2024

Update

There have been some minor changes to the structure of the tool to make the PeerScoring work:

  • The tool now will compute the ForkDigest and the ForkVersion for the given network and for the current epoch -> 8f7e039 .

Previously, the tool would hide some ForkVersion and ForkDigest errors when it was mismatching with the trusted node one as it would update its BeaconStatus to the one of the node. This was making the tool to subscribe to the gossip topics of the wrong ForkDigest .

  • Having the ForkVersion at the PubSub level allows to correctly parse the messages according to the current Fork -> a1bf61b

The tool reported some Decoding Errors when reading AggregateAndProofs from the beacon_block topic , this is now fixed.

The combination of all these three points was a bit tedious to troubleshoot, as a non-correct fork was leading to not having peers in the mesh when testing it locally with a Holesky node, keeping 0 connections open on the advertised network. The wrong decoding of the gossipsub topic was making the service to restart, thus not keeping any score, which was neither configured at the topic level.

There are significant changes in the code (apologies for that, I wasn't expecting such extra changes), but this last commit already connects correctly to the local Prysm node in holesky, and reports some initial peer scores at the beacon_block topic:

peer_id: 16Uiu2HAmGkRcFqRnh4pN66UTeoRBp7uMsfp385Cx4oxAUwN2qbNC map[AppSpecificScore:0 BehaviourPenalty:0 IPColocationFactor:0 PeerID:16Uiu2HAmGkRcFqRnh4pN66UTeoRBp7uMsfp385Cx4oxAUwN2qbNC Score:2.503214489295642 Topics:[<nil> map[FirstMessag
eDeliveries:2.929018111619552 InvalidMessageDeliveries:0 MeshMessageDeliveries:4.671362011321691 TimeInMesh:1m18.999537812s Topic:/eth2/69ae0e99/beacon_block/ssz_snappy]]]
peer_id: 16Uiu2HAm3HNpFPrpq4FshxjS44YFXp8948hyqZBxQNN6FfecEQ3J map[AppSpecificScore:0 BehaviourPenalty:0 IPColocationFactor:0 PeerID:16Uiu2HAm3HNpFPrpq4FshxjS44YFXp8948hyqZBxQNN6FfecEQ3J Score:0.13333333333333333 Topics:[<nil> map[FirstMess
ageDeliveries:0 InvalidMessageDeliveries:0 MeshMessageDeliveries:0 TimeInMesh:1m5.699377854s Topic:/eth2/69ae0e99/beacon_block/ssz_snappy]]]

Let me know if there is any change we would like to apply to the code itself, formatting or style (I haven't been able to fix the auto-fmt of vs-code), although I can manually leave those lines away :)

@cortze cortze marked this pull request as ready for review April 15, 2024 08:50
Copy link
Contributor

@guillaumemichel guillaumemichel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good 👍🏻

We can merge even though the 32-bit tests are failing. Anyway we won't run hermes on a 32-bit arch.

@cortze cortze merged commit 5cd02c6 into main Apr 17, 2024
2 of 4 checks passed
@cortze cortze deleted the trace-gossipsub-scores branch April 17, 2024 09:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants