
CosmosFullNode upgrade should be schedule-able at some future block height #15

Closed
DavidNix opened this issue Aug 3, 2022 · 15 comments · Fixed by #373
Comments

DavidNix (Contributor) commented Aug 3, 2022

Scenario: A human doesn't need to be present for the upgrade
When something goes wrong
Then a human is alerted


"This is gonna be the biggest Quality-of-Life enhancement for the infra team" — @agouin



This probably needs to be a new CRD with a separate controller such as: CosmosFullNodeUpgrade or the longer CosmosFullNodeScheduledUpgrade.

Therefore, it's a large effort. https://github.com/backube/snapscheduler may be good inspiration for how to properly schedule actions in the future.

The problem is that scheduling at an exact wall-clock time is toil, especially for chain-halt upgrades.

The CRD likely needs to distinguish between chain halt and "regular" upgrades with the ability to get height from chains.
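A rough sketch of what such a CRD's spec could look like, in Go. Every name here is illustrative, not a committed API; the point is only that the spec would need to carry a target height, a new image, and the halt/regular distinction:

```go
package main

import "fmt"

// UpgradeType distinguishes chain-halt (governance) upgrades from
// "regular" version bumps that do not halt the chain.
type UpgradeType string

const (
	UpgradeTypeChainHalt UpgradeType = "ChainHalt"
	UpgradeTypeRegular   UpgradeType = "Regular"
)

// ScheduledUpgradeSpec is a hypothetical spec for the proposed CRD:
// it schedules an image change at a future block height.
type ScheduledUpgradeSpec struct {
	FullNodeRef string      `json:"fullNodeRef"` // target CosmosFullNode
	Height      int64       `json:"height"`      // height at which to upgrade
	Image       string      `json:"image"`       // new image, including tag
	Type        UpgradeType `json:"type"`        // ChainHalt or Regular
}

func main() {
	spec := ScheduledUpgradeSpec{
		FullNodeRef: "example-chain",
		Height:      12346,
		Image:       "ghcr.io/strangelove-ventures/heighliner/chain:v2.0.0",
		Type:        UpgradeTypeChainHalt,
	}
	fmt.Printf("%+v\n", spec)
}
```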

DavidNix (Contributor, Author) commented Aug 3, 2022

This is more like an Epic that needs to be broken into smaller issues.

DavidNix (Contributor, Author) commented:

Good advice from Andrew:

The upgrade needs to happen at the halt height.

The block at the halt height needs to be committed, then the node should halt. The first block after the upgrade is halt height + 1.

DavidNix (Contributor, Author) commented:

I also propose we move snapshot creation into this operator (possibly as a separate controller). Why? Our own controller can inspect block height, health, and other state, and randomly choose a healthy instance from which to take a snapshot. Additionally, we may need to cleanly exit the pod, take the pod down, create the snapshot, then bring the pod back. There is a risk of data corruption if we snapshot while the pod is writing to its databases.
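As a sketch of the "choose a healthy instance" step, assuming the controller already tracks per-pod height and health (type and function names are hypothetical):

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// PodState is a hypothetical view of what the controller already knows
// about each pod: its sync height and whether it is healthy.
type PodState struct {
	Name    string
	Height  int64
	Healthy bool
}

// chooseSnapshotSource picks a random healthy pod that is within maxLag
// blocks of the highest observed height, so the snapshot source is both
// healthy and reasonably fresh. Illustrative only.
func chooseSnapshotSource(pods []PodState, maxLag int64) (PodState, error) {
	var best int64
	for _, p := range pods {
		if p.Height > best {
			best = p.Height
		}
	}
	var candidates []PodState
	for _, p := range pods {
		if p.Healthy && best-p.Height <= maxLag {
			candidates = append(candidates, p)
		}
	}
	if len(candidates) == 0 {
		return PodState{}, errors.New("no healthy pod eligible for snapshot")
	}
	return candidates[rand.Intn(len(candidates))], nil
}

func main() {
	pods := []PodState{
		{Name: "node-0", Height: 100, Healthy: true},
		{Name: "node-1", Height: 40, Healthy: true},  // too far behind
		{Name: "node-2", Height: 99, Healthy: false}, // unhealthy
	}
	src, err := chooseSnapshotSource(pods, 10)
	fmt.Println(src.Name, err)
}
```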

agouin (Member) commented Feb 6, 2023

Thoughts on this building on top of @DavidNix idea for requiring block ranges/versions to be declared in config:
Before:

```yaml
podTemplate:
  image: ghcr.io/strangelove-ventures/heighliner/chain:v2.0.0
```

After:

```yaml
podTemplate:
  image: ghcr.io/strangelove-ventures/heighliner/chain
  versions:
    - blockHeightStart: 1
      blockHeightEnd: 12345
      imageTag: "v1.0.0"
    - blockHeightStart: 12346
      imageTag: "v2.0.0"
```

This will allow us to schedule an upgrade in advance, sync a node from genesis, and have different versions for different pods if they are at different heights.
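A minimal sketch of how a controller could resolve the image tag for a given height from such a versions array. Field names mirror the proposed config but are illustrative:

```go
package main

import "fmt"

// VersionRange mirrors the proposed config: a starting height and the
// image tag to run from that height onward. The end height is implied
// by the next entry (open-ended for the last one).
type VersionRange struct {
	BlockHeightStart int64
	ImageTag         string
}

// imageTagForHeight returns the tag whose range contains height.
// Assumes versions are sorted ascending by BlockHeightStart.
func imageTagForHeight(versions []VersionRange, height int64) string {
	tag := ""
	for _, v := range versions {
		if height >= v.BlockHeightStart {
			tag = v.ImageTag
		}
	}
	return tag
}

func main() {
	versions := []VersionRange{
		{BlockHeightStart: 1, ImageTag: "v1.0.0"},
		{BlockHeightStart: 12346, ImageTag: "v2.0.0"},
	}
	fmt.Println(imageTagForHeight(versions, 12345)) // v1.0.0
	fmt.Println(imageTagForHeight(versions, 12346)) // v2.0.0
}
```

With this, a pod syncing from genesis naturally runs v1.0.0 until it crosses height 12346, while an already-synced pod picks up v2.0.0.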

DavidNix (Contributor, Author) commented Feb 6, 2023

Thanks for the thoughts. I'm still unsure of the final API design at this point, but that is an interesting suggestion. It will take some discovery to get to the final design.

Fwiw, I do not think we query the database directly, as we discussed in person. That's like using a private API which can change on you at any moment, thus resulting in brittle architecture. I bet we can leverage public APIs like Tendermint RPC or SDK GRPC, for example.

agouin (Member) commented Feb 6, 2023

> Fwiw, I do not think we query the database directly, as we discussed in person. That's like using a private API which can change on you at any moment, thus resulting in brittle architecture. I bet we can leverage public APIs like Tendermint RPC or SDK GRPC, for example.

That brings up the chicken/egg problem, though. We need to determine which version to run based on the latest block height on the PV before starting the container, but Tendermint RPC/gRPC are not available until we've chosen the right version and the node has started up.

Maybe it's possible, though, to use the base Tendermint image with the PVC mounted and no peering config, to expose RPC and determine the height, in order to choose the right image version.
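If that approach works, the controller only needs to parse the node's `/status` response to learn the height. A small sketch, assuming the standard Tendermint RPC response shape in which heights are JSON strings:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// statusResult models just the piece of Tendermint RPC's /status reply
// we care about; note that heights come back as JSON strings.
type statusResult struct {
	Result struct {
		SyncInfo struct {
			LatestBlockHeight string `json:"latest_block_height"`
		} `json:"sync_info"`
	} `json:"result"`
}

// parseLatestHeight extracts the height from a raw /status response body.
func parseLatestHeight(body []byte) (int64, error) {
	var s statusResult
	if err := json.Unmarshal(body, &s); err != nil {
		return 0, err
	}
	return strconv.ParseInt(s.Result.SyncInfo.LatestBlockHeight, 10, 64)
}

func main() {
	// In the controller, this body would come from GET <rpc>/status on a
	// pod started with the PVC mounted and peering disabled.
	body := []byte(`{"result":{"sync_info":{"latest_block_height":"12345"}}}`)
	h, err := parseLatestHeight(body)
	fmt.Println(h, err) // 12345 <nil>
}
```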

DavidNix (Contributor, Author) commented Feb 6, 2023

I understand the chicken/egg problem. This feature will need some time for discovery to work out different scenarios. My first hunch is figuring out the height before starting the container may not be necessary. It may be ok to start the container and restart it if needed, for example.

DavidNix (Contributor, Author) commented Feb 6, 2023

But we are getting too deep into the weeds of implementation details. In this context, I encourage focusing on defining problems over solutions. You bring up a good problem: "Do we need to know the height before starting the chain?"

@DavidNix DavidNix changed the title Schedule an upgrade for some time in the future. CosmosFullNode: Schedule an upgrade for some time in the future. Feb 6, 2023
DavidNix (Contributor, Author) commented Feb 6, 2023

I like this discussion. One requirement I'm working from is that this is specifically for chain upgrades that include a chain halt. Starting the chain with the right version is one problem, but knowing when to restart a running chain is another. The latter, I'd argue, is the crux of this feature.

I want to leverage the fact that chains will halt for major upgrades.

Having the operator precisely use a new version at the exact height will be very difficult. If we rely on the chain halt, that makes this feature easier.

Any other use case beyond chain upgrades (where it will halt) is open for discussion but I'd argue that is another feature.

DavidNix (Contributor, Author) commented Feb 6, 2023

For minor/bugfix upgrades without a chain halt, perhaps height >= target_height is ok. The operator will not upgrade at precisely that height.
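That decision rule could be sketched as follows (a hypothetical helper; detecting that the chain has actually halted is a separate problem):

```go
package main

import "fmt"

// shouldApplyUpgrade sketches the rule discussed above: for chain-halt
// upgrades, rely on the chain halting itself at the target height; for
// minor/bugfix upgrades, height >= target is good enough, and the
// operator makes no attempt to switch at precisely that height.
func shouldApplyUpgrade(height, target int64, chainHalt, halted bool) bool {
	if chainHalt {
		return halted && height >= target
	}
	return height >= target
}

func main() {
	fmt.Println(shouldApplyUpgrade(12346, 12346, true, true))   // true: halted at target
	fmt.Println(shouldApplyUpgrade(12346, 12346, false, false)) // true: minor upgrade, past target
	fmt.Println(shouldApplyUpgrade(12000, 12346, false, false)) // false: not yet at target
}
```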

agouin (Member) commented Feb 6, 2023

> Any other use case beyond chain upgrades (where it will halt) is open for discussion but I'd argue that is another feature.

Syncing from genesis, the chain will still automatically halt at prior passed software upgrade proposal heights. So the other use cases I mentioned will be covered automatically.

If we want to cover non-software-upgrade-proposal use cases, we can set the --halt-height flag when the version is not the latest in the array. Providing --halt-height for software-upgrade-proposal chain halts would not be harmful, since the chain was going to halt at that height anyway. So it could be used as a one-size-fits-all. I agree it is unnecessary for most use cases, though, and can be a separate task.
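A sketch of that rule, assuming versions are declared with ascending start heights. Whether the halt point should be the next version's start height or one block earlier depends on the chain's exact --halt-height semantics, so treat the specific value below as an assumption:

```go
package main

import "fmt"

// haltHeightArgs sketches the suggestion above: pass --halt-height only
// when the pod is running a version that is not the latest declared one,
// using the next version's start height as the halt point. The exact
// off-by-one behavior of --halt-height is chain-dependent; illustrative.
func haltHeightArgs(versionStarts []int64, currentIdx int) []string {
	// versionStarts[i] is blockHeightStart of version i, ascending.
	if currentIdx >= len(versionStarts)-1 {
		return nil // latest version: no halt height needed
	}
	next := versionStarts[currentIdx+1]
	return []string{fmt.Sprintf("--halt-height=%d", next)}
}

func main() {
	starts := []int64{1, 12346}
	fmt.Println(haltHeightArgs(starts, 0)) // [--halt-height=12346]
	fmt.Println(haltHeightArgs(starts, 1)) // []
}
```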

DavidNix (Contributor, Author) commented Feb 6, 2023

The --halt-height flag is good to remember. My only worry is that it will cause downtime if it exits the process on all of the pods. What I've seen on major upgrades is that the chain halts but does not crash; it gets stuck peering, so in theory the pod can still accept RPC calls. But we need validation there. This probably warrants testing with --halt-height as well, since that could be universal to major and minor/patch upgrades.

DavidNix (Contributor, Author) commented Feb 8, 2023

Notes to self (for inspiration and research, not to copy):

  • What does cosmovisor do?
  • What is the tendermint RPC /status implementation?

DavidNix (Contributor, Author) commented:

On an unrelated task, I wanted to figure out what cosmovisor was doing. It uses the keeper (so, direct database access) to find the upgradetypes.Plan.

agouin (Member) commented Feb 10, 2023

> The --halt-height is good to remember. My only worry there is it will cause downtime if that exits the process of all the pods.

Good point, --halt-height forces a panic. Maybe we could do something like:

  • set --halt-height to a few blocks before the upgrade height on ~half of the pods (we'll call these the first set of pods)
  • set --halt-height to the upgrade height on the remaining ~half of the pods (the second set of pods)
  • wait for the first set of pods to panic at the pre-upgrade halt height, then restart them without --halt-height and with peering disabled so that they stay up without syncing (this ensures they do not panic, even across restarts, since they are not yet at the upgrade height)
  • wait for the second set of pods to panic at the upgrade halt height, then apply the upgrade version
  • set --halt-height to the upgrade height on the first set of pods, and repeat the process that was done on the second set
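The height assignment in the first two steps could be sketched as follows (the margin value and all names are illustrative):

```go
package main

import "fmt"

// assignHaltHeights sketches the two-phase strategy above: the first
// ~half of the pods halt a few blocks early (then get restarted without
// peering), while the second ~half halt at the upgrade height itself.
func assignHaltHeights(pods []string, upgradeHeight, margin int64) map[string]int64 {
	out := make(map[string]int64, len(pods))
	for i, p := range pods {
		if i < len(pods)/2 {
			out[p] = upgradeHeight - margin // first set: pre-upgrade halt
		} else {
			out[p] = upgradeHeight // second set: halt at upgrade height
		}
	}
	return out
}

func main() {
	heights := assignHaltHeights([]string{"node-0", "node-1", "node-2", "node-3"}, 12346, 3)
	fmt.Println(heights)
}
```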

> On an unrelated task, I wanted to figure out what cosmovisor was doing. It uses the keeper (so, direct database access) to find the upgradetypes.Plan.

Using the SDK to grab it directly from the store sounds like a good idea. If only this could be a CLI method so we don't have to worry about different SDK versions.

@jonathanpberger jonathanpberger changed the title CosmosFullNode: Schedule an upgrade for some time in the future. CosmosFullNode upgrade should be schedule-able at some future block height Jun 20, 2023
@agouin agouin assigned agouin and unassigned DavidNix Oct 22, 2023
qj0r9j0vc2 referenced this issue in bharvest-devops/bharvest-operator Jan 15, 2024