Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.21.0-rc.0: unstable peers #6989

Closed
twoeths opened this issue Jul 31, 2024 · 7 comments
Closed

v1.21.0-rc.0: unstable peers #6989

twoeths opened this issue Jul 31, 2024 · 7 comments
Labels
meta-bug Issues that identify a bug and require a fix.

Comments

@twoeths
Copy link
Contributor

twoeths commented Jul 31, 2024

Describe the bug

Peer count is unstable on the beta mainnet node, I used to see it dropped to 35 while inbound peer is 0 or 1

Screenshot 2024-07-31 at 13 35 08

it's likely due to Event Loop Lag on Network worker (which is only ~15ms on the stable mainnet node)

Screenshot 2024-07-31 at 13 36 06

Expected behavior

  • Peer count should be ~100 stably
  • Inbound peers should be ~90 as in the stable mainnet node (or around that value)
  • Event loop lag Network Worker should be ~15ms

Steps to reproduce

No response

Additional context

No response

Operating system

Linux

Lodestar version or commit hash

v1.21.0-rc.0

@twoeths twoeths added the meta-bug Issues that identify a bug and require a fix. label Jul 31, 2024
@twoeths
Copy link
Contributor Author

twoeths commented Jul 31, 2024

This is on our unstable mainnet node

Screenshot 2024-07-31 at 13 39 17

the issue started happening from around Jul 27 14:00:00 UTC so #5886 could be the root cause

@nflaig
Copy link
Member

nflaig commented Jul 31, 2024

It does not look that bad on my mainnet nodes

mainnet-1

image

mainnet-2

image

mainnet-3

image

On all of the nodes I can also see an improved event loop lag metric

image

Let me know if there are other metrics to share that might be useful

@twoeths
Copy link
Contributor Author

twoeths commented Jul 31, 2024

@nflaig it only happens on a test mainnet node because it subscribes to all subnets, so it has a lot of I/O load
I don't see the issue on our cip nodes too

@nflaig
Copy link
Member

nflaig commented Jul 31, 2024

@nflaig it only happens on a test mainnet node because it subscribes to all subnets, so it has a lot of I/O load I don't see the issue on our cip nodes too

Right, that's likely it. Will enable it on one of my mainnet nodes as well now

@philknows
Copy link
Member

philknows commented Aug 2, 2024

After putting in rc.1 to beta, you can see the stabilization in some peer metrics here:

Screenshot 2024-08-02 at 11 54 08 AM
Screenshot 2024-08-02 at 11 58 20 AM
![Screenshot 2024-08-02
Screenshot 2024-08-02 at 12 00 47 PM
Screenshot 2024-08-02 at 12 00 52 PM
Screenshot 2024-08-02 at 11 58 07 AM

@twoeths
Copy link
Contributor Author

twoeths commented Aug 6, 2024

attaching a network profile of cayman/yamux branch cc @wemeetagain

mainnet_yamux_network_thread_2024-08-06.cpuprofile.zip

there are so many dial errors in that branch

Screenshot 2024-08-06 at 15 04 41

from the log:

grep -e "unexpected end of input" -rn beacon-2024-08-06.log | wc -l
19724
grep -e "REQUEST_ERROR_DIAL_ERROR" -rn beacon-2024-08-06.log | wc -l
14302

sample error stack trace does not seem relate to yamux

Aug-06 00:02:30.296[network]       ^[[36mverbose^[[39m: Req  error method=ping, version=1, encoding=ssz_snappy, client=NA, peer=16...kfMnnz, requestId=272488, code=REQUEST_ERROR_DIAL_ERROR, error=unexpected end of input
Error: unexpected end of input
    at Object.read (file:///usr/src/lodestar/node_modules/it-byte-stream/dist/src/index.js:86:31)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.read (file:///usr/src/lodestar/node_modules/it-length-prefixed-stream/dist/src/index.js:51:37)
    at async read (file:///usr/src/lodestar/node_modules/@libp2p/multistream-select/dist/src/multistream.js:22:17)
    at async Module.readString (file:///usr/src/lodestar/node_modules/@libp2p/multistream-select/dist/src/multistream.js:33:17)
    at async Module.select (file:///usr/src/lodestar/node_modules/@libp2p/multistream-select/dist/src/select.js:72:20)
    at async ConnectionImpl.newStream [as _newStream] (file:///usr/src/lodestar/node_modules/libp2p/dist/src/upgrader.js:330:50)
    at async ConnectionImpl.newStream (file:///usr/src/lodestar/node_modules/libp2p/dist/src/connection/index.js:94:24)
    at async file:///usr/src/lodestar/packages/reqresp/lib/request/index.js:57:26
    at async withTimeout (file:///usr/src/lodestar/packages/utils/lib/timeout.js:12:16)
Error: REQUEST_ERROR_DIAL_ERROR
    at file:///usr/src/lodestar/packages/reqresp/lib/request/index.js:67:23
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async sendRequest (file:///usr/src/lodestar/packages/reqresp/lib/request/index.js:55:24)
    at async ReqRespBeaconNode.sendRequest (file:///usr/src/lodestar/packages/reqresp/lib/ReqResp.js:109:13)
    at async collectExactOneTyped (file:///usr/src/lodestar/packages/beacon-node/lib/network/reqresp/utils/collect.js:10:22)
    at async PeerManager.requestPing (file:///usr/src/lodestar/packages/beacon-node/lib/network/peers/peerManager.js:340:31)

@twoeths
Copy link
Contributor Author

twoeths commented Oct 1, 2024

issue was resolved because we reverted yamux enhancement

@twoeths twoeths closed this as completed Oct 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
meta-bug Issues that identify a bug and require a fix.
Projects
None yet
Development

No branches or pull requests

3 participants