CosmosFullNode upgrade should be schedule-able at some future block height #15
Comments
This is more like an Epic that needs to be broken into smaller issues. |
Good advice from Andrew: the upgrade needs to happen at the halt height. The block at the halt height must be committed, and then the node should halt. The first block after the upgrade is halt height + 1. |
I also propose we move creating snapshots to be something this Operator handles (maybe in a separate controller). Why? Our own controller can inspect block height, health, and other state, then randomly choose a healthy instance from which to make a snapshot. Additionally, we may need to cleanly exit the pod, take the pod down, create the snapshot, and bring the pod back. There is a chance of data corruption if we snapshot while the pod is writing to its databases. |
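The selection step described above could look roughly like the following. This is only a sketch: the `Instance` type, its fields, and `pickSnapshotCandidate` are hypothetical names for illustration, not part of the operator, and how "healthy" is determined is left open.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Instance is a hypothetical view of one pod as a controller might see it.
type Instance struct {
	Name    string
	Healthy bool
	Height  int64 // latest committed block height
}

// pickSnapshotCandidate filters to healthy instances and returns one of them,
// using pick(n) to choose an index in [0, n). It reports false when no
// instance is healthy, so the caller can skip the snapshot and retry later.
func pickSnapshotCandidate(instances []Instance, pick func(n int) int) (Instance, bool) {
	healthy := make([]Instance, 0, len(instances))
	for _, in := range instances {
		if in.Healthy {
			healthy = append(healthy, in)
		}
	}
	if len(healthy) == 0 {
		return Instance{}, false
	}
	return healthy[pick(len(healthy))], true
}

func main() {
	pods := []Instance{
		{Name: "fullnode-0", Healthy: true, Height: 12345},
		{Name: "fullnode-1", Healthy: false, Height: 12001},
		{Name: "fullnode-2", Healthy: true, Height: 12344},
	}
	// rand.Intn supplies the random choice; tests can inject a fixed picker.
	if c, ok := pickSnapshotCandidate(pods, rand.Intn); ok {
		fmt.Println("snapshot from", c.Name)
	}
}
```

Passing the picker as a function keeps the random choice out of the selection logic, which makes the behavior easy to pin down in tests.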
Thoughts on this building on top of @DavidNix idea for requiring block ranges/versions to be declared in config:
podTemplate:
  image: ghcr.io/strangelove-ventures/heighliner/chain:v2.0.0
After:
podTemplate:
  image: ghcr.io/strangelove-ventures/heighliner/chain
  versions:
    - blockHeightStart: 1
      blockHeightEnd: 12345
      imageTag: "v1.0.0"
    - blockHeightStart: 12346
      imageTag: "v2.0.0"
This will allow us to schedule an upgrade in advance, sync a node from genesis, and have different versions for different pods if they are at different heights. |
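To make the proposed lookup concrete, here is a minimal sketch of resolving an image from such a list. The `VersionRange` type and `imageFor` function are hypothetical; the field names only mirror the YAML keys above, and the sketch assumes that a missing `blockHeightEnd` means "open-ended" and that `height` is the height of the next block the node will process.

```go
package main

import "fmt"

// VersionRange mirrors one entry of the proposed `versions` list.
// BlockHeightEnd == 0 is treated as "no upper bound" (an assumption).
type VersionRange struct {
	BlockHeightStart int64
	BlockHeightEnd   int64
	ImageTag         string
}

// imageFor returns image:tag for the first range that contains height,
// or false when no declared range covers it.
func imageFor(image string, versions []VersionRange, height int64) (string, bool) {
	for _, v := range versions {
		if height < v.BlockHeightStart {
			continue
		}
		if v.BlockHeightEnd != 0 && height > v.BlockHeightEnd {
			continue
		}
		return image + ":" + v.ImageTag, true
	}
	return "", false
}

func main() {
	versions := []VersionRange{
		{BlockHeightStart: 1, BlockHeightEnd: 12345, ImageTag: "v1.0.0"},
		{BlockHeightStart: 12346, ImageTag: "v2.0.0"},
	}
	img, ok := imageFor("ghcr.io/strangelove-ventures/heighliner/chain", versions, 12346)
	fmt.Println(img, ok)
}
```

Because the lookup is per pod, two pods at different heights would resolve to different tags from the same spec, which is the behavior the comment asks for.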
Thanks for the thoughts. I'm still unsure of the final API design at this point, but that is an interesting suggestion. It will take some discovery to get to the final design. Fwiw, I do not think we should query the database directly, as we discussed in person. That's like using a private API that can change on you at any moment, resulting in a brittle architecture. I bet we can leverage public APIs like Tendermint RPC or SDK gRPC, for example. |
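As one illustration of the public-API route, the Tendermint RPC `/status` endpoint reports the latest block height under `result.sync_info.latest_block_height` (encoded as a string). The sketch below only decodes such a response body; `statusResponse` and `parseStatus` are hypothetical names, and fetching the body over HTTP is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// statusResponse mirrors only the /status fields this sketch needs.
type statusResponse struct {
	Result struct {
		SyncInfo struct {
			LatestBlockHeight string `json:"latest_block_height"`
			CatchingUp        bool   `json:"catching_up"`
		} `json:"sync_info"`
	} `json:"result"`
}

// parseStatus extracts the latest block height and the catching-up flag
// from a /status response body.
func parseStatus(body []byte) (int64, bool, error) {
	var s statusResponse
	if err := json.Unmarshal(body, &s); err != nil {
		return 0, false, fmt.Errorf("decode status: %w", err)
	}
	height, err := strconv.ParseInt(s.Result.SyncInfo.LatestBlockHeight, 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("parse latest_block_height: %w", err)
	}
	return height, s.Result.SyncInfo.CatchingUp, nil
}

func main() {
	// Trimmed sample body; a real response carries many more fields.
	body := []byte(`{"result":{"sync_info":{"latest_block_height":"12345","catching_up":false}}}`)
	height, catchingUp, err := parseStatus(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(height, catchingUp)
}
```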
That brings up the chicken/egg problem though. We need to determine which version to run based on the latest block height on the PV before starting the container. Tendermint RPC/gRPC are not available until we've chosen the right version and the node has started. Maybe it's possible, though, to use the base Tendermint image with the PVC mounted, without any peering config, to expose RPC and determine the height so we can choose the right image version. |
I understand the chicken/egg problem. This feature will need some time for discovery to work out different scenarios. My first hunch is figuring out the height before starting the container may not be necessary. It may be ok to start the container and restart it if needed, for example. |
But we are getting too in the weeds of implementation details. In this context, I encourage focusing on defining problems over solutions. You bring up a good problem: "Do we need to know the height before starting the chain?" |
I like this discussion. One requirement I'm assuming is that this is specifically for chain upgrades that include a chain halt. Starting the chain with the right version is one problem, but knowing when to restart a running chain is another. The latter, I'd argue, is the crux of this feature. I want to leverage the fact that chains will halt for major upgrades. Having the operator switch to a new version at precisely the right height will be very difficult. If we rely on the chain halt, that makes this feature easier. Any other use case beyond chain upgrades (where the chain will halt) is open for discussion, but I'd argue that is another feature. |
For minor/bugfix upgrades without a chain halt, perhaps height >= target_height is ok. The operator will not upgrade at precisely that height. |
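The two cases discussed above can be sketched as a single readiness check. `UpgradeKind`, `ChainHalt`, `NonHalt`, and `readyToUpgrade` are hypothetical names, not an agreed design. The sketch assumes the old version halts after committing the block at the halt height, so equality is the expected state for a halt upgrade, while a non-halt upgrade only needs `height >= target_height`.

```go
package main

import "fmt"

// UpgradeKind distinguishes upgrades that rely on a chain halt from those
// that do not.
type UpgradeKind int

const (
	ChainHalt UpgradeKind = iota // node halts after committing the target height
	NonHalt                      // minor/bugfix upgrade, no halt expected
)

// readyToUpgrade reports whether the image can be swapped, given the latest
// committed height observed on the node running the old version.
func readyToUpgrade(kind UpgradeKind, latestHeight, targetHeight int64) bool {
	switch kind {
	case ChainHalt:
		// The halted node sits exactly at the halt height; the new version
		// then produces halt height + 1.
		return latestHeight == targetHeight
	case NonHalt:
		// Precision is not required; any height at or past the target is fine.
		return latestHeight >= targetHeight
	default:
		return false
	}
}

func main() {
	fmt.Println(readyToUpgrade(ChainHalt, 12345, 12345))
	fmt.Println(readyToUpgrade(NonHalt, 12350, 12345))
}
```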
Syncing from genesis, the chain will still automatically halt at the heights of previously passed software upgrade proposals. So the other use cases I mentioned will be covered automatically. If we want to cover non-software-upgrade-proposal use cases, we can set |
The |
Notes to self (for inspiration and research, not to copy):
|
On an unrelated task, I wanted to figure out what cosmovisor was doing. It uses the keeper (so direct database access) to find the |
Good point, halt height forces a panic. Maybe we could do something like:
Using the SDK to grab it directly from the store sounds like a good idea. If only this could be a CLI method so we don't have to worry about different SDK versions. |
Scenario: A human doesn't need to be present for the upgrade
When something goes wrong
then a human is alerted
"This is gonna be the biggest Quality-of-Life enhancement for the infra team" — @agouin
Tasks
This probably needs to be a new CRD with a separate controller such as: CosmosFullNodeUpgrade or the longer CosmosFullNodeScheduledUpgrade.
Therefore, it's a large effort. https://github.com/backube/snapscheduler may be good inspiration for how to properly schedule actions in the future.
The problem is it's toil to schedule at an exact time especially for chain halt upgrades.
The CRD likely needs to distinguish between chain halt and "regular" upgrades with the ability to get height from chains.