
Beacon chain lags behind frequently when running the latest tree-states branch #4521

Closed
jimmygchen opened this issue Jul 20, 2023 · 11 comments

@jimmygchen
Member

Description

After merging the latest unstable into tree-states in PR #4514, the BN running the tree-states version seems to lag behind quite frequently for some reason.

@jimmygchen jimmygchen added bug Something isn't working tree-states Upcoming state and database overhaul labels Jul 20, 2023
@jimmygchen jimmygchen self-assigned this Jul 20, 2023
@jimmygchen
Member Author

It looks like gossip_block processing is possibly broken: each block is taking up to 10s to process. I think this might be causing the queues to fill up (and hence the chain lagging behind) and the high memory usage.

@jimmygchen
Member Author

I'm not able to reproduce this when running locally. I've tried the following 3 versions:

  • 8423e9f: Latest tree-states branch
  • 5d2063d: One commit prior to the unstable merge
  • 6954de6: Alpha release version v4.2.990-exp

I haven't seen any significant difference in terms of performance or lagging behind, but maybe I didn't keep them up long enough (I ran each version for about 15 mins).

I'll try doing a fresh deployment to the same node to see if the problem persists.

@jimmygchen
Member Author

Might be worth trying this version as well to narrow down the possible changes that broke it:
#4458

@jimmygchen
Member Author

jimmygchen commented Jul 26, 2023

Our tree states node seems to be back to normal now, 8 hours after rolling back to the alpha release version 🤔

The slowness seems to be correlated to the frequency of State diff applied log occurrences, which wasn't showing up at all until I upgraded to the latest version, and now stopped after I rolled back to the alpha release version.
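One way to quantify that correlation is to tally how often the `State diff applied` log appears per hour. A minimal sketch, assuming syslog-style `Mon DD HH:MM:SS` timestamps; the log path is a placeholder, not the node's actual location:

```shell
#!/bin/sh
# Tally "State diff applied" occurrences per hour in a beacon node log.
# The log path and timestamp layout are assumptions; adjust to your setup.
LOG=${1:-beacon.log}
grep "State diff applied" "$LOG" |
  awk '{ print substr($0, 1, 9) }' |  # keep the "Mon DD HH" hour prefix
  sort | uniq -c
```

A spike in the hourly counts lining up with the lag periods would support the correlation.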

@jimmygchen
Member Author

Re-deployed the latest tree-states branch again and it looks fine. I think there's still something fishy with the new version though, because the old version seems to be able to recover from the slowness, but performance with the new version just keeps degrading over time.

Will leave it for a bit and keep monitoring. (Will try to reproduce with the attestation rewards API call later if it all looks good)

@jimmygchen
Member Author

jimmygchen commented Jul 28, 2023

The node running the latest tree-states branch version degraded over time :/ I think we need to take a deeper look.
Latest alpha release version is fine and not affected.

@michaelsproul
Member

I guess the state diffs being applied are for historic states. It's possible this is a manifestation of the HTTP death spiral issue in combination with some slight slowdown in the new version of tree-states vs the previous alpha. I think the fact we recently pointed the checkpointz server at this node might be part of the reason we've seen a change.

We could try bumping up the hdiff buffer cache size to see if that helps take some of the time off the calls. Grepping the logs for "diff" should show how long each diff takes to apply and whether the buffer cache is being hit. We probably also want to tweak the cache algorithm so that it prefers to keep "deeper" (more general) diffs, rather than ones for e.g. single epochs.
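A cache that prefers to keep deeper diffs could be sketched as an eviction policy keyed on diff depth, falling back to least-recently-used for ties. Everything below (names, fields, structure) is a hypothetical illustration, not Lighthouse's actual tree-states cache:

```rust
use std::collections::HashMap;

// Hypothetical diff buffer cache whose eviction prefers to keep "deeper"
// (more general) diffs over shallow per-epoch ones.
struct CachedDiffBuffer {
    depth: u8,      // hierarchy level: higher = more general diff
    last_used: u64, // logical clock for LRU tie-breaking
    bytes: Vec<u8>, // buffered diff data
}

struct DiffBufferCache {
    capacity: usize,
    clock: u64,
    entries: HashMap<u64, CachedDiffBuffer>, // keyed by slot
}

impl DiffBufferCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, clock: 0, entries: HashMap::new() }
    }

    fn insert(&mut self, slot: u64, buffer: CachedDiffBuffer) {
        if self.entries.len() >= self.capacity && !self.entries.contains_key(&slot) {
            // Evict the shallowest diff first, breaking ties by least recent
            // use, so deep (widely reusable) diffs stay resident the longest.
            let victim = self
                .entries
                .iter()
                .min_by_key(|(_, e)| (e.depth, e.last_used))
                .map(|(slot, _)| *slot);
            if let Some(victim) = victim {
                self.entries.remove(&victim);
            }
        }
        self.entries.insert(slot, buffer);
    }

    fn get(&mut self, slot: u64) -> Option<&CachedDiffBuffer> {
        self.clock += 1;
        let clock = self.clock;
        if let Some(entry) = self.entries.get_mut(&slot) {
            entry.last_used = clock;
        }
        self.entries.get(&slot)
    }
}
```

The design intuition: a deep diff is on the reconstruction path of many states, so evicting it forces re-reads far more often than evicting a single-epoch diff.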

It will also be interesting to test the HTTP API fix once that's applied in unstable.

@michaelsproul
Member

Something else it could be is the database migration frequency. The alpha has it set to 4, but the latest version has it set to 1. This was an attempt to reduce cache misses (by making it more likely the new finalized state is in the cache), but it may have backfired.
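As a toy illustration of that setting (the name and semantics here are hypothetical, not the real Lighthouse config): a frequency of 4 means the migration only runs on every fourth finalized epoch, while a frequency of 1 runs it on every one.

```rust
// Hypothetical sketch of an epoch-based migration schedule; Lighthouse's
// actual setting and its semantics may differ.
fn should_migrate(finalized_epoch: u64, migration_frequency: u64) -> bool {
    finalized_epoch % migration_frequency == 0
}
```

Running the migration more often keeps the finalized state fresher in the cache, but each run does work, so the net effect on performance can go either way.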

@jimmygchen
Member Author

Thanks, it would be interesting to try changing the database migration frequency and see how it performs. It's worth noting that the state-transition test benchmark (block processing times) hasn't changed much since the last time we looked at it together, so I'm inclined to agree that it could be related to the HTTP death spiral issue.

@michaelsproul
Member

michaelsproul commented Aug 7, 2023

I think I found part of the problem in #4573. I'll deploy that now and see if it shows much improvement. I haven't seen the crazy 20GB+ memory usage running the latest tree-states yet, but it does look a lot bumpier than v4.2.990-exp.

@jimmygchen
Member Author

This has been fixed in #4576, closing. 🎉
