Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of Fix AP upgrade version issue into release/1.16.x #27364

Conversation

hc-github-team-secure-vault-core
Copy link
Collaborator

Backport

This PR is auto-generated from #27277 to be assessed for backporting due to the inclusion of the label backport/1.16.x.

🚨

Warning automatic cherry-pick of commits failed. If the first commit failed,
you will see a blank no-op commit below. If at least one commit succeeded, you
will see the cherry-picked commits up to, not including, the commit where
the merge conflict occurred.

The person who merged in the original PR is:
@banks
This person should manually cherry-pick the original PR into a new backport PR,
and close this one when the manual backport PR is merged in.

merge conflict error: POST https://api.github.com/repos/hashicorp/vault/merges: 409 Merge conflict []

The below text is copied from the body of the original PR.


This is a subtle issue introduced by #26449 while fixing a different issue. It only impacts Vault Enterprise but is generally simpler to reason about.

That PR was a correct fix, however the effectiveSDKVersion was only ever being set on active nodes not on perf standbys in Enterprise.

The big change here is:

  • Get rid of the custom Set method and instead always set the effectiveSDKVersion from Core during SetupCluster which is necessary on all types of servers
  • As a belt-and-braces measure, default RaftBackend.effectiveSDKVersion back to the binary version so even if we fail to plumb this from core in some edge case in the future it's still the right value in almost all cases (other than testing environments).

I've manually tested this locally. To reproduce the issue it fixes you need to:

  1. Attempt an AP upgrade in Vault Enterprise from any version before 1.15.8 to any version after 1.15.8
  2. Ensure you don't manually set autopilot_upgrade_version in config on the new servers (doing so bypasses the bug).

In this case you'll see the new servers join but remain "target version" because their Echo's are sending an empty version string. You also see some logs about empty upgrade_versions which are legit in this case but are also confusing because you can see them even after the fix due to timing (if AP runs before an Echo has come in but after node joined raft on the leader you'll see that log even on a non-broken cluster and then you don't see a corresponding "OK" log after wards).

TODO

  • also fix the logging confusion where we complain about no upgrade version due to timing and then don't log that it's OK again
  • review this scenario with others and test more cases to be sure this is correct.
  • See if there is a meaningful way we could test this in CI without a flaky complicated end-to-end test just for plumbing...

Overview of commits

@github-actions github-actions bot added the hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed label Jun 5, 2024
Copy link

hashicorp-cla-app bot commented Jun 5, 2024

CLA assistant check

Thank you for your submission! We require that all contributors sign our Contributor License Agreement ("CLA") before we can accept the contribution. Read and sign the agreement

Learn more about why HashiCorp requires a CLA and what the CLA includes


1 out of 2 committers have signed the CLA.

  • banks
  • temp

temp seems not to be a GitHub user.
You need a GitHub account to be able to sign the CLA.
If you have already a GitHub account, please add the email address used for this commit to your account.

Have you signed the CLA already but the status is still pending? Recheck it.

Copy link

github-actions bot commented Jun 5, 2024

CI Results:
All Go tests succeeded! ✅

* Fix AP upgrade version issue

* add heartbeat logging at trace level

* add log to show when heartbeats resume

* Test the plumbing

* Revert "Test the plumbing"

This reverts commit e25fcd8.

* Add CHANGELOG

* Add plumbing test

* Update misleading comment

---------

Co-authored-by: Josh Black <raskchanky@gmail.com>
@raskchanky raskchanky marked this pull request as ready for review June 5, 2024 20:53
Copy link

github-actions bot commented Jun 5, 2024

Build Results:
All builds succeeded! ✅

@banks banks added this to the 1.16.4 milestone Jun 6, 2024
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Josh

@banks
Copy link
Member

banks commented Jun 6, 2024

I used the wrong process for these backports.

@banks banks closed this Jun 6, 2024
@raskchanky
Copy link
Contributor

I'm assuming this should be closed as well, since #27363 was closed and we have separate backport PRs for these now.

@raskchanky raskchanky deleted the backport/fix/autopilot-upgrade-ii/adversely-awake-lemur branch June 6, 2024 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants