[Fix] Introduce GC logging on failure, don't break the loop. #3876
Conversation
@@ -163,7 +163,10 @@ impl<DB: Blockstore + GarbageCollectable + Sync + Send + 'static> MarkAndSweep<D
/// using CAR-backed storage with a snapshot, for implementation simplicity.
Is this note still valid? If not, please update it.
```rust
/// NOTE: This currently does not take into account the fact that we might be starting the node
/// using CAR-backed storage with a snapshot, for implementation simplicity.
```
This note will always be valid; it states that the GC does not care about CAR-backed storage blocks. Perhaps I should rephrase it to make that clearer.
Basically, it works as follows: the database the GC has access to is either ParityDB or MemoryDB, depending on the implementation. Since the GC is designed to wait before it starts the mark step, we don't really need to do anything special. Even if the database does not have many records by that time, it will just do a smaller cleanup in the first cycle.
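For illustration, here is a minimal toy sketch of that cycle; the `ToyGc` type, its fields, and the string-keyed store are made up for this example and are not Forest's actual types:

```rust
use std::{collections::HashSet, time::Duration};

// Toy model of the cycle described above. The GC sleeps before marking,
// so a freshly bootstrapped database simply yields a smaller first cleanup.
struct ToyGc {
    store: HashSet<String>, // stands in for ParityDB/MemoryDB keys
    interval: Duration,
}

impl ToyGc {
    async fn run_cycle(&mut self, reachable: &HashSet<String>) {
        tokio::time::sleep(self.interval).await; // wait before the mark step
        // "Mark": record everything still reachable from the chain head.
        let marked: HashSet<String> =
            self.store.intersection(reachable).cloned().collect();
        // "Sweep": drop everything that was not marked.
        self.store.retain(|k| marked.contains(k));
    }
}
```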
In this case, I don't quite understand how `fn filter` works when it takes the parity-db only. Shouldn't `fn unordered_stream_graph` take the entire DB (parity-db + car-db)?
It should work just fine. We only care about cleaning up unreachable nodes. Since all the data stored in ParityDB is in the future relative to any data in car-db, we don't really care about it, unless I'm missing some weird edge case. I'm assuming that the snapshots we export are finalized, in which case there should not be any issues.
The one enhancement I think we should make, based on your comments and my revisiting the code, is specifying the first epoch that's stored in the database as opposed to the car file. Otherwise
```rust
if depth > current_epoch {
    time::sleep(interval).await;
    return anyhow::Ok(());
}
```
is basically useless. Or we can get rid of it altogether and just rely on the initial sleep that we do later, for simplicity.
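A hedged sketch of that enhancement; `first_db_epoch` is hypothetical and would record the first epoch actually persisted to the database, as opposed to the CAR file:

```rust
/// Hypothetical helper: only proceed with GC once the database itself
/// (excluding CAR-backed storage) holds at least `depth` epochs of history.
fn enough_db_history(depth: i64, current_epoch: i64, first_db_epoch: i64) -> bool {
    depth <= current_epoch - first_db_epoch
}
```

The check would then compare `depth` against the span of database-backed epochs rather than against `current_epoch` itself.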
> I'm assuming that the snapshots we export are finalized

I'm not positive about this; as I understand it, this can only be true when we export a snapshot from `latest - finality`. @lemmih could you weigh in on this?

> just rely on initial sleep

I suggest storing the last GC epoch in a parity-db metadata column and calculating the wait time, instead of waiting for 10h on every start. In an edge case, if we have a node that restarts (maybe for getting updates) every 9h, then GC would never get a chance to run.
@lemmih what do you think about the above? Seems reasonable to me.
We could also initialize the last GC epoch as the heaviest tipset we see in car-backed storage when the database is empty.
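A rough sketch of how those two suggestions could combine; every name here (`last_gc_epoch` read from a parity-db metadata column, `heaviest_car_epoch` as the fallback) is an assumption for illustration, not Forest's actual API:

```rust
/// Hypothetical: how many epochs remain until the next GC run, resuming
/// the countdown across restarts instead of resetting it to the full 10h.
fn epochs_until_next_gc(
    current_epoch: u64,
    last_gc_epoch: Option<u64>, // persisted in a parity-db metadata column
    heaviest_car_epoch: u64,    // fallback when the database is empty
    gc_interval_epochs: u64,
) -> u64 {
    let last = last_gc_epoch.unwrap_or(heaviest_car_epoch);
    gc_interval_epochs.saturating_sub(current_epoch.saturating_sub(last))
}
```

Under this scheme a node that restarts every 9h would still eventually reach a zero wait, instead of restarting the 10h countdown each time.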
I'm not too concerned about this.
@@ -185,6 +188,9 @@ impl<DB: Blockstore + GarbageCollectable + Sync + Send + 'static> MarkAndSweep<D
// Make sure we don't run the GC too often.
time::sleep(interval).await;
Is there a way to not sleep for 10h here and make it testable with CI calibnet checks?
We can make it configurable. This `sleep` was introduced at David's request to avoid running the GC too often.
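If we did make it configurable, it could be as small as an optional knob that defaults to the current behaviour; the type and field names below are hypothetical:

```rust
use std::time::Duration;

/// Hypothetical configuration entry; the default preserves the 10h sleep.
struct GcConfig {
    interval: Duration,
}

impl Default for GcConfig {
    fn default() -> Self {
        Self { interval: Duration::from_secs(10 * 60 * 60) }
    }
}
```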
Actually, I don't think it's necessary. The reason being: we still need to wait `finality` epochs to do `filter` and `sweep`, which makes this scenario pretty much untestable using calibnet checks, unless we want to wait 7.5+ hours.
Why is it necessary to wait `finality` epochs, as opposed to keeping `2 * finality` state roots like snapshot export does?
> which makes this scenario pretty much untestable using calibnet checks

Could you elaborate on why it becomes impossible to proactively trigger GC via a CLI event like before? That was useful for e2e tests.
We can make it artificially testable by introducing more abstractions that allow mocking certain parts of the algorithm.
If I'm not wrong, this still does not allow effective manual e2e testing or benchmarking on calibnet or mainnet, which sounds like a deal breaker to me.
Note that the old GC solved this exact problem (running in the background without downtime) without waiting.
It's not ideal; whether or not it's a deal breaker because it's not manually testable, I'm not really sure. We can always benchmark each of the three steps separately; there's no need to benchmark all three at the same time.
@lemmih could you please chip in with your thoughts on the matter? I kind of agree that this is hard to test and impossible to test manually, but it does offer a much simpler codebase and algorithm.
Waiting for chain finality is required for correctness. Due to possible forks, we may not delete data unless (1) it is unreachable from the current HEAD, and (2) it was written to the database more than `chain_finality` epochs ago. This does make end-to-end testing more difficult: how do we quickly test the GC when it is not allowed to garbage collect anything in the first 7.5 hours?
Lowering (or removing) the wait time is not an option. The wait is required for correctness and cannot be any lower than 7.5 hours on mainnet or calibnet.
I posit that end-to-end testing is not important. We can verify with quickcheck that the GC behaves as expected (i.e. deletes what it should delete, doesn't delete what it shouldn't, etc.). And we can statically guarantee that the GC loop will never exit (except for critical failures, which should terminate Forest). Furthermore, the nodes we run in production will show whether there are any GC leaks over time.
In summary: keep the 7.5-hour wait (this is not optional). Keep the quickcheck tests. Don't do end-to-end tests in the CI (as the GC doesn't do anything interesting in that timespan). Prove to the reader that the GC loop will run as long as Forest itself.
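For illustration, the deletion rule from the first paragraph can be written as a single predicate; the struct and field names are made up for this sketch:

```rust
/// Illustrative only: a block may be deleted iff it is unreachable from the
/// current HEAD *and* was written more than `chain_finality` epochs ago,
/// so that no fork can still resurrect it.
struct BlockMeta {
    reachable_from_head: bool,
    written_at_epoch: i64,
}

fn may_delete(meta: &BlockMeta, current_epoch: i64, chain_finality: i64) -> bool {
    !meta.reachable_from_head
        && current_epoch - meta.written_at_epoch > chain_finality
}
```

A quickcheck property over a predicate like this (it never fires for reachable or recently written blocks) is exactly the kind of unit-level guarantee suggested above.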
As noted in the issue comment, one way out of this is providing GC alarms in our long-running node. This is not ideal, given our current release process, but it's better than pure faith. 🙏
I'm currently figuring out why the GC gets stuck. It could be due to the fact that we bootstrap the node with a snapshot (CAR-backed storage) and the parallel graph walk somehow gets stuck due to not having access to data all the way back to genesis. Once that is debugged and resolved, we'll be able to monitor GC via checks and metrics indeed.
@ruseinov Can this be closed? Or should it be merged?
yep, about time.
Summary of changes
Changes introduced in this pull request:
Reference issue to close (if applicable)
Work on #3863
Other information and links
Change checklist