
Resolve Errors Around Failed PDB Updates Against Deleted PDBs #57

Merged · 1 commit · Aug 15, 2024
Conversation

@michohl (Contributor) commented Aug 7, 2024

This PR resolves the following error we were seeing in production logs which is related to #54.

The issue occurs because the reconcilePDBs loop deletes the PDB and then attempts to update it after it has been deleted.

time="2024-07-17T14:10:42Z" level=error msg="Failed to update PDB: Operation cannot be fulfilled on poddisruptionbudgets.policy \"my-cool-deployment-pdb-controller\": StorageError: invalid object, Code: 4, Key: /registry/poddisruptionbudgets/my-namespace/my-cool-deployment-pdb-controller, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: <uid redacted>, UID in object meta: "

This is resolved by skipping the rest of the loop iteration (a continue) once the PDB has been deleted.
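For illustration, a minimal sketch of the shape of the change, assuming the policy/v1 typed client; the function and names (shouldDelete, managedPDBs) are placeholders, not the controller's actual code:

```go
// Sketch only: shouldDelete and managedPDBs are placeholders, not the controller's real code.
package pdbsketch

import (
	"context"

	log "github.com/sirupsen/logrus"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func reconcilePDBs(ctx context.Context, client kubernetes.Interface, namespace string, managedPDBs []*policyv1.PodDisruptionBudget) {
	pdbs := client.PolicyV1().PodDisruptionBudgets(namespace)
	for _, pdb := range managedPDBs {
		if shouldDelete(pdb) { // placeholder for the controller's delete condition
			if err := pdbs.Delete(ctx, pdb.Name, metav1.DeleteOptions{}); err != nil {
				log.Errorf("Failed to delete PDB: %v", err)
			}
			// The fix: move on to the next PDB instead of falling through to
			// the Update call against an object that no longer exists.
			continue
		}
		if _, err := pdbs.Update(ctx, pdb, metav1.UpdateOptions{}); err != nil {
			log.Errorf("Failed to update PDB: %v", err)
		}
	}
}

func shouldDelete(pdb *policyv1.PodDisruptionBudget) bool { return false } // placeholder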

I've also wrapped the update and create calls in a RetryOnConflict from the client-go library, since it's easy to add and should prevent possible future issues around updating existing PDBs.
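Roughly, such a wrapper amounts to the following hedged sketch (updatePDB is a hypothetical helper, not the PR's exact code). Note that as written the closure resubmits the same in-memory object on every attempt, which becomes relevant later in this thread:

```go
// Hypothetical helper showing the RetryOnConflict wrapper; not the PR's exact code.
package pdbsketch

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func updatePDB(ctx context.Context, client kubernetes.Interface, pdb *policyv1.PodDisruptionBudget) error {
	// retry.DefaultRetry retries a few times with a short backoff, but only
	// when the returned error is a conflict (HTTP 409).
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Each attempt resubmits the same in-memory object, so a stale
		// resourceVersion will keep conflicting on every retry.
		_, err := client.PolicyV1().PodDisruptionBudgets(pdb.Namespace).Update(ctx, pdb, metav1.UpdateOptions{})
		return err
	})
}
```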

…rageErrors

Signed-off-by: Michael Riesberg-Timmer <MichaelRT@pallid.dev>
@michohl (Contributor, Author) commented Aug 9, 2024

After some more testing, it seems this PR resolves the second error message but does not yet resolve the first one:

time="2024-07-17T14:20:46Z" level=error msg="Failed to update PDB: Operation cannot be fulfilled on poddisruptionbudgets.policy \"my-cool-deployment-pdb-controller\": the object has been modified; please apply your changes to the latest version and try again"

Even with a longer backoff than the default, the update we're trying to make still gets rejected. I'm going to dig some more and try to find where the conflict is coming from.

@michohl changed the title from "Resolve Errors Around Failed PDB Updates" to "Draft: Resolve Errors Around Failed PDB Updates" on Aug 9, 2024
@michohl michohl marked this pull request as draft August 9, 2024 15:59
@michohl (Contributor, Author) commented Aug 9, 2024


After digging some more into this, I still can't find exactly which fields are causing issues, but because we're doing a .Update() instead of a .Patch(), I think we're getting caught up in dynamic fields being changed, which causes sporadic failures.

I was able to work around this in our own clusters by adding a step after the failed .Update() call that does another .Get() of the same object and then applies the desired state over that fresh copy of the managedPDB.
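Roughly, the described workaround looks like the sketch below, assuming desired holds the PDB state the controller wants and the policy/v1 client is in use; updateWithRefresh and the copied fields are hypothetical, not the fork's actual code:

```go
// Sketch of the described workaround; names are hypothetical.
package pdbsketch

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func updateWithRefresh(ctx context.Context, client kubernetes.Interface, desired *policyv1.PodDisruptionBudget) error {
	pdbs := client.PolicyV1().PodDisruptionBudgets(desired.Namespace)
	_, err := pdbs.Update(ctx, desired, metav1.UpdateOptions{})
	if err == nil || !apierrors.IsConflict(err) {
		return err
	}
	// On conflict, fetch a fresh copy (current resourceVersion and any
	// server-managed fields) and lay the desired state over it.
	fresh, getErr := pdbs.Get(ctx, desired.Name, metav1.GetOptions{})
	if getErr != nil {
		return getErr
	}
	fresh.Spec = desired.Spec
	fresh.Labels = desired.Labels
	_, err = pdbs.Update(ctx, fresh, metav1.UpdateOptions{})
	return err
}
```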

I think this PR can still go in as it is now, and I can come back with a second PR containing the described logic for a separate review. I don't want to slow down the fix for the controller attempting to update deleted PDBs, so I will narrow this PR's description to that specific issue and leave the failed updates for a follow-up PR.

@michohl changed the title from "Draft: Resolve Errors Around Failed PDB Updates" to "Resolve Errors Around Failed PDB Updates" on Aug 9, 2024
@michohl michohl marked this pull request as ready for review August 9, 2024 16:57
@michohl changed the title from "Resolve Errors Around Failed PDB Updates" to "Resolve Errors Around Failed PDB Updates Against Deleted PDBs" on Aug 9, 2024
@mikkeloscar (Owner) commented Aug 13, 2024

The way RetryOnConflict is used doesn't seem right, and likely has no effect, according to how it's documented here: https://pkg.go.dev/k8s.io/client-go/util/retry#RetryOnConflict

I would propose removing those changes from this PR; then we can merge the continue fix to avoid updating deleted resources.

Also, it's not really a problem if an update fails, because the controller will automatically retry in the next reconciliation loop.

@michohl (Contributor, Author) commented Aug 13, 2024


Agreed on removing the RetryOnConflict, since it's not relevant to this PR in its more limited scope. I have updated the PR to reflect this change.

I have been running a local fork with a RetryOnConflict in our clusters, though, and it does resolve the failed update errors I mentioned in my previous comment (when combined with some logic to refresh the local cached copy of the PDB we want to update). We can have that conversation in a more relevant PR after this one :)
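For reference, the combination described above follows the pattern from the RetryOnConflict docs linked earlier: re-read the object inside the retry closure so each attempt updates the latest copy. A sketch with hypothetical names, not the fork's actual code:

```go
// Sketch of RetryOnConflict combined with a refresh inside the retry closure.
package pdbsketch

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func updateWithRetry(ctx context.Context, client kubernetes.Interface, desired *policyv1.PodDisruptionBudget) error {
	pdbs := client.PolicyV1().PodDisruptionBudgets(desired.Namespace)
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-read the object on every attempt so the update targets the
		// latest resourceVersion rather than a stale cached copy.
		current, err := pdbs.Get(ctx, desired.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		current.Spec = desired.Spec
		_, err = pdbs.Update(ctx, current, metav1.UpdateOptions{})
		return err
	})
}
```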

@mikkeloscar (Owner): 👍

@szuecs (Collaborator) commented Aug 15, 2024: 👍

@szuecs merged commit 61c5773 into mikkeloscar:master on Aug 15, 2024
5 of 6 checks passed