
CosmosFullNode upgrade should be schedule-able at some future block height #15

Closed
DavidNix opened this issue Aug 3, 2022 · 15 comments · Fixed by #373
Comments

DavidNix (Contributor) commented Aug 3, 2022

Scenario: A human doesn't need to be present for the upgrade
When something goes wrong
Then a human is alerted


"This is gonna be the biggest Quality-of-Life enhancement for the infra team" — @agouin



This probably needs to be a new CRD with a separate controller such as: CosmosFullNodeUpgrade or the longer CosmosFullNodeScheduledUpgrade.

Therefore, it's a large effort. https://github.com/backube/snapscheduler may be good inspiration for how to properly schedule actions in the future.

The problem is that scheduling at an exact wall-clock time is toil, especially for chain-halt upgrades.

The CRD likely needs to distinguish between chain halt and "regular" upgrades with the ability to get height from chains.
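A rough sketch of what such a CRD's spec could look like, in Go. Every name here is illustrative, not a committed API; the point is only that the spec would need to carry a target height, a new image, and the halt/regular distinction:

```go
package main

import "fmt"

// UpgradeType distinguishes chain-halt (governance) upgrades from
// "regular" version bumps that do not halt the chain.
type UpgradeType string

const (
	UpgradeTypeChainHalt UpgradeType = "ChainHalt"
	UpgradeTypeRegular   UpgradeType = "Regular"
)

// ScheduledUpgradeSpec is a hypothetical spec for the proposed CRD:
// it schedules an image change at a future block height.
type ScheduledUpgradeSpec struct {
	FullNodeRef string      `json:"fullNodeRef"` // target CosmosFullNode
	Height      int64       `json:"height"`      // height at which to upgrade
	Image       string      `json:"image"`       // new image, including tag
	Type        UpgradeType `json:"type"`        // ChainHalt or Regular
}

func main() {
	spec := ScheduledUpgradeSpec{
		FullNodeRef: "example-chain",
		Height:      12346,
		Image:       "ghcr.io/strangelove-ventures/heighliner/chain:v2.0.0",
		Type:        UpgradeTypeChainHalt,
	}
	fmt.Printf("%+v\n", spec)
}
```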

DavidNix (Contributor, Author) commented Aug 3, 2022

This is more like an Epic that needs to be broken into smaller issues.

DavidNix (Contributor, Author) commented:

Good advice from Andrew:

The upgrade needs to happen at the halt height.

The block at the halt height needs to be committed, then the node should halt. The first block after the upgrade is halt height + 1.

DavidNix (Contributor, Author) commented:

I also propose we move snapshot creation into this operator (possibly as a separate controller). Why? Our own controller can inspect block height, health, and other state, and randomly choose a healthy instance from which to take a snapshot. Additionally, we may need to cleanly exit the pod, take the pod down, create the snapshot, then bring the pod back. There is a risk of data corruption if we snapshot while the pod is writing to its databases.
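As a sketch of the "choose a healthy instance" step, assuming the controller already tracks per-pod height and health (type and function names are hypothetical):

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// PodState is a hypothetical view of what the controller already knows
// about each pod: its sync height and whether it is healthy.
type PodState struct {
	Name    string
	Height  int64
	Healthy bool
}

// chooseSnapshotSource picks a random healthy pod that is within maxLag
// blocks of the highest observed height, so the snapshot source is both
// healthy and reasonably fresh. Illustrative only.
func chooseSnapshotSource(pods []PodState, maxLag int64) (PodState, error) {
	var best int64
	for _, p := range pods {
		if p.Height > best {
			best = p.Height
		}
	}
	var candidates []PodState
	for _, p := range pods {
		if p.Healthy && best-p.Height <= maxLag {
			candidates = append(candidates, p)
		}
	}
	if len(candidates) == 0 {
		return PodState{}, errors.New("no healthy pod eligible for snapshot")
	}
	return candidates[rand.Intn(len(candidates))], nil
}

func main() {
	pods := []PodState{
		{Name: "node-0", Height: 100, Healthy: true},
		{Name: "node-1", Height: 40, Healthy: true},  // too far behind
		{Name: "node-2", Height: 99, Healthy: false}, // unhealthy
	}
	src, err := chooseSnapshotSource(pods, 10)
	fmt.Println(src.Name, err)
}
```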

agouin (Member) commented Feb 6, 2023

Thoughts on this building on top of @DavidNix idea for requiring block ranges/versions to be declared in config:
Before:

```yaml
podTemplate:
  image: ghcr.io/strangelove-ventures/heighliner/chain:v2.0.0
```

After:

```yaml
podTemplate:
  image: ghcr.io/strangelove-ventures/heighliner/chain
  versions:
    - blockHeightStart: 1
      blockHeightEnd: 12345
      imageTag: "v1.0.0"
    - blockHeightStart: 12346
      imageTag: "v2.0.0"
```

This will allow us to schedule an upgrade in advance, sync a node from genesis, and have different versions for different pods if they are at different heights.
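A minimal sketch of how a controller could resolve the image tag for a given height from such a versions array. Field names mirror the proposed config but are illustrative:

```go
package main

import "fmt"

// VersionRange mirrors the proposed config: a starting height and the
// image tag to run from that height onward. The end height is implied
// by the next entry (open-ended for the last one).
type VersionRange struct {
	BlockHeightStart int64
	ImageTag         string
}

// imageTagForHeight returns the tag whose range contains height.
// Assumes versions are sorted ascending by BlockHeightStart.
func imageTagForHeight(versions []VersionRange, height int64) string {
	tag := ""
	for _, v := range versions {
		if height >= v.BlockHeightStart {
			tag = v.ImageTag
		}
	}
	return tag
}

func main() {
	versions := []VersionRange{
		{BlockHeightStart: 1, ImageTag: "v1.0.0"},
		{BlockHeightStart: 12346, ImageTag: "v2.0.0"},
	}
	fmt.Println(imageTagForHeight(versions, 12345)) // v1.0.0
	fmt.Println(imageTagForHeight(versions, 12346)) // v2.0.0
}
```

With this, a pod syncing from genesis naturally runs v1.0.0 until it crosses height 12346, while an already-synced pod picks up v2.0.0.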

DavidNix (Contributor, Author) commented Feb 6, 2023

Thanks for the thoughts. I'm still unsure of the final API design at this point, but that is an interesting suggestion. It will take some discovery to get to the final design.

Fwiw, I do not think we query the database directly, as we discussed in person. That's like using a private API which can change on you at any moment, thus resulting in brittle architecture. I bet we can leverage public APIs like Tendermint RPC or SDK GRPC, for example.

agouin (Member) commented Feb 6, 2023

> Fwiw, I do not think we query the database directly, as we discussed in person. That's like using a private API which can change on you at any moment, thus resulting in brittle architecture. I bet we can leverage public APIs like Tendermint RPC or SDK GRPC, for example.

That brings up the chicken/egg problem, though. We need to determine which version to run based on the latest block height on the PV before starting the container, but Tendermint RPC/gRPC are not available until we've chosen the right version and the node has started up.

Maybe it's possible, though, to use the base Tendermint image with the PVC mounted and no peering config, to expose RPC and determine the height, in order to choose the right image version.
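If that approach works, the controller only needs to parse the node's `/status` response to learn the height. A small sketch, assuming the standard Tendermint RPC response shape in which heights are JSON strings:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// statusResult models just the piece of Tendermint RPC's /status reply
// we care about; note that heights come back as JSON strings.
type statusResult struct {
	Result struct {
		SyncInfo struct {
			LatestBlockHeight string `json:"latest_block_height"`
		} `json:"sync_info"`
	} `json:"result"`
}

// parseLatestHeight extracts the height from a raw /status response body.
func parseLatestHeight(body []byte) (int64, error) {
	var s statusResult
	if err := json.Unmarshal(body, &s); err != nil {
		return 0, err
	}
	return strconv.ParseInt(s.Result.SyncInfo.LatestBlockHeight, 10, 64)
}

func main() {
	// In the controller, this body would come from GET <rpc>/status on a
	// pod started with the PVC mounted and peering disabled.
	body := []byte(`{"result":{"sync_info":{"latest_block_height":"12345"}}}`)
	h, err := parseLatestHeight(body)
	fmt.Println(h, err) // 12345 <nil>
}
```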

DavidNix (Contributor, Author) commented Feb 6, 2023

I understand the chicken/egg problem. This feature will need some time for discovery to work out different scenarios. My first hunch is figuring out the height before starting the container may not be necessary. It may be ok to start the container and restart it if needed, for example.

DavidNix (Contributor, Author) commented Feb 6, 2023

But we are getting too deep into the weeds of implementation details. In this context, I encourage focusing on defining problems over solutions. You bring up a good problem: "Do we need to know the height before starting the chain?"

@DavidNix DavidNix changed the title Schedule an upgrade for some time in the future. CosmosFullNode: Schedule an upgrade for some time in the future. Feb 6, 2023
DavidNix (Contributor, Author) commented Feb 6, 2023

I like this discussion. One requirement I'm working from is that this is specifically for chain upgrades that include a chain halt. Starting the chain with the right version is one problem, but knowing when to restart a running chain is another. The latter, I'd argue, is the crux of this feature.

I want to leverage the fact that chains will halt for major upgrades.

Having the operator precisely use a new version at the exact height will be very difficult. If we rely on the chain halt, that makes this feature easier.

Any other use case beyond chain upgrades (where it will halt) is open for discussion but I'd argue that is another feature.

DavidNix (Contributor, Author) commented Feb 6, 2023

For minor/bugfix upgrades without a chain halt, perhaps height >= target_height is ok. The operator will not upgrade at precisely that height.
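That decision rule could be sketched as follows (a hypothetical helper; detecting that the chain has actually halted is a separate problem):

```go
package main

import "fmt"

// shouldApplyUpgrade sketches the rule discussed above: for chain-halt
// upgrades, rely on the chain halting itself at the target height; for
// minor/bugfix upgrades, height >= target is good enough, and the
// operator makes no attempt to switch at precisely that height.
func shouldApplyUpgrade(height, target int64, chainHalt, halted bool) bool {
	if chainHalt {
		return halted && height >= target
	}
	return height >= target
}

func main() {
	fmt.Println(shouldApplyUpgrade(12346, 12346, true, true))   // true: halted at target
	fmt.Println(shouldApplyUpgrade(12346, 12346, false, false)) // true: minor upgrade, past target
	fmt.Println(shouldApplyUpgrade(12000, 12346, false, false)) // false: not yet at target
}
```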

agouin (Member) commented Feb 6, 2023

> Any other use case beyond chain upgrades (where it will halt) is open for discussion but I'd argue that is another feature.

Syncing from genesis, the chain will still automatically halt at prior passed software upgrade proposal heights. So the other use cases I mentioned will be covered automatically.

If we want to cover non-software-upgrade-proposal use cases, we can set the --halt-height flag when the version is not the latest in the array. Providing --halt-height for software-upgrade-proposal chain halts would not be harmful, since the chain was going to halt at that height anyway. So it could be used as a one-size-fits-all. I agree it is unnecessary for most use cases, though, and can be a separate task.
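A sketch of that rule, assuming versions are declared with ascending start heights. Whether the halt point should be the next version's start height or one block earlier depends on the chain's exact --halt-height semantics, so treat the specific value below as an assumption:

```go
package main

import "fmt"

// haltHeightArgs sketches the suggestion above: pass --halt-height only
// when the pod is running a version that is not the latest declared one,
// using the next version's start height as the halt point. The exact
// off-by-one behavior of --halt-height is chain-dependent; illustrative.
func haltHeightArgs(versionStarts []int64, currentIdx int) []string {
	// versionStarts[i] is blockHeightStart of version i, ascending.
	if currentIdx >= len(versionStarts)-1 {
		return nil // latest version: no halt height needed
	}
	next := versionStarts[currentIdx+1]
	return []string{fmt.Sprintf("--halt-height=%d", next)}
}

func main() {
	starts := []int64{1, 12346}
	fmt.Println(haltHeightArgs(starts, 0)) // [--halt-height=12346]
	fmt.Println(haltHeightArgs(starts, 1)) // []
}
```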

DavidNix (Contributor, Author) commented Feb 6, 2023

The --halt-height flag is good to remember. My only worry is that it will cause downtime if it exits the process on all of the pods. What I've seen on major upgrades is that the chain halts but does not crash; it gets stuck peering, so in theory the pod can still accept RPC calls. But we need validation there. This probably warrants testing with --halt-height as well, since that could be universal to major and minor/patch upgrades.

DavidNix (Contributor, Author) commented Feb 8, 2023

Notes to self (for inspiration and research, not to copy):

  • What does cosmovisor do?
  • What is the tendermint RPC /status implementation?

DavidNix (Contributor, Author) commented:

On an unrelated task, I wanted to figure out what cosmovisor was doing. It uses the keeper (so, direct database access) to find the upgradetypes.Plan.

agouin (Member) commented Feb 10, 2023

> The --halt-height is good to remember. My only worry there is it will cause downtime if that exits the process of all the pods.

Good point, --halt-height forces a panic. Maybe we could do something like:

  • set --halt-height to a few blocks before the upgrade height on ~half of the pods (we'll call these the first set of pods)
  • set --halt-height to the upgrade height on the remaining ~half of the pods (the second set of pods)
  • wait for the first set of pods to panic at the pre-upgrade halt height, then restart them without --halt-height and with peering disabled so that they stay up without syncing (this ensures they do not panic, even across restarts, since they are not yet at the upgrade height)
  • wait for the second set of pods to panic at the upgrade halt height, then apply the upgrade version
  • set --halt-height to the upgrade height on the first set of pods, and repeat the process that was done on the second set
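The height assignment in the first two steps could be sketched as follows (the margin value and all names are illustrative):

```go
package main

import "fmt"

// assignHaltHeights sketches the two-phase strategy above: the first
// ~half of the pods halt a few blocks early (then get restarted without
// peering), while the second ~half halt at the upgrade height itself.
func assignHaltHeights(pods []string, upgradeHeight, margin int64) map[string]int64 {
	out := make(map[string]int64, len(pods))
	for i, p := range pods {
		if i < len(pods)/2 {
			out[p] = upgradeHeight - margin // first set: pre-upgrade halt
		} else {
			out[p] = upgradeHeight // second set: halt at upgrade height
		}
	}
	return out
}

func main() {
	heights := assignHaltHeights([]string{"node-0", "node-1", "node-2", "node-3"}, 12346, 3)
	fmt.Println(heights)
}
```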

> On an unrelated task, I wanted to figure out what cosmovisor was doing. It uses the keeper (so, direct database access) to find the upgradetypes.Plan.

Using the SDK to grab it directly from the store sounds like a good idea. If only this could be a CLI method so we don't have to worry about different SDK versions.

@jonathanpberger jonathanpberger changed the title CosmosFullNode: Schedule an upgrade for some time in the future. CosmosFullNode upgrade should be schedule-able at some future block height Jun 20, 2023
@agouin agouin assigned agouin and unassigned DavidNix Oct 22, 2023
qj0r9j0vc2 referenced this issue in bharvest-devops/bharvest-operator Jan 15, 2024