[release-4.16] OCPBUGS-45041: operator/status clear azure path fix job conditions on operator removal #1158

Open
wants to merge 2 commits into
base: release-4.16

flavianmissi
Member

DO NOT MERGE!

This PR is a reinterpretation of the changes originally made on #1142. The AzurePathFixController source code looks very different between the main branch and 4.14-4.16 branches.
I'm opening this PR to ensure the approach taken on #1142 works similarly here as well.


The original bug (see #1142 for the bug link) was often caught by TestLeaderElection. The test would fail when the operator condition got stuck on AzurePathFixProgressing: Azure path fix job is progressing: 1 pods active; 0 pods failed, while the other controllers had already progressed to the Removed state.
Running TestLeaderElection a few times in a row reliably reproduces this issue on 4.14-4.16 branches.
With the changes in this PR, I can no longer see this failure in my local environment. I have run TestLeaderElection 20 times and it passed every time.

/hold

Remove the early return when .status.storage.azure is unset. This property
is cleared when the operator managementState is set to Removed, and
in such cases the early return would stop the controller from clearing
the conditions and deleting the job. Without this check the
controller can still do its job, even when managementState is Removed.
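
To make the commit message above concrete, here is a minimal sketch of the two behaviours. It is not the controller's actual code: every type and function name (Config, Cluster, syncWithEarlyReturn, syncWithoutEarlyReturn) is a hypothetical stand-in, and "clearing the conditions" is modelled simply as deleting a map entry.

```go
package main

// ManagementState is a simplified stand-in for the operator management state.
type ManagementState string

const (
	Managed ManagementState = "Managed"
	Removed ManagementState = "Removed"
)

// AzureStorage is a hypothetical stand-in for .status.storage.azure.
type AzureStorage struct {
	AccountName string
	Container   string
}

// Config is a hypothetical, trimmed-down image registry config.
type Config struct {
	ManagementState ManagementState
	StatusAzure     *AzureStorage // nil once the operator is Removed
}

// Cluster is a hypothetical view of the state the controller owns.
type Cluster struct {
	JobExists  bool
	Conditions map[string]string
}

// syncWithEarlyReturn models the old behaviour: a nil Azure status
// short-circuits the sync, so cleanup never runs after removal.
func syncWithEarlyReturn(cfg Config, c *Cluster) {
	if cfg.StatusAzure == nil {
		return
	}
	syncWithoutEarlyReturn(cfg, c)
}

// syncWithoutEarlyReturn models the approach described in the commit message:
// the Removed state is handled before anything depends on the Azure status.
func syncWithoutEarlyReturn(cfg Config, c *Cluster) {
	if cfg.ManagementState == Removed {
		c.JobExists = false
		delete(c.Conditions, "AzurePathFixProgressing")
		return
	}
	// The Managed path (job creation, progress reporting) is omitted here.
}
```

With a Removed config whose Azure status is nil, the first function leaves the job and the progressing condition untouched, while the second removes both.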
@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Nov 19, 2024
@openshift-ci openshift-ci bot requested a review from adambkaplan November 19, 2024 08:22
Contributor

openshift-ci bot commented Nov 19, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: flavianmissi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 19, 2024
@flavianmissi
Member Author

e2e failures seem unrelated.
/retest

 	// the move-blobs cmd does not work on Azure Stack Hub. Users on ASH
 	// will have to copy the blobs on their own using something like az copy.
-	if strings.EqualFold(azureStorage.CloudName, "AZURESTACKCLOUD") {
+	azureStorage := imageRegistryConfig.Status.Storage.Azure
+	if azureStorage != nil && strings.EqualFold(azureStorage.CloudName, "AZURESTACKCLOUD") {
 		return nil
 	}

Member

Shouldn't we still check for the presence of the AccountName/Container for the generator?

Member Author

Yeah, the generator has a nil check but it doesn't hurt to have a more specific check here too, will do.
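
A more specific check of the kind discussed here could look roughly like the sketch below. The AzureStorage struct is a simplified stand-in for the status type, and the helper names hasAzureStorageDetails and isAzureStackHub are hypothetical; only the nil guard and the strings.EqualFold comparison come from the diff above.

```go
package main

import "strings"

// AzureStorage is a simplified stand-in for the Azure storage status.
type AzureStorage struct {
	AccountName string
	Container   string
	CloudName   string
}

// hasAzureStorageDetails reports whether the status carries the account name
// and container that a job generator would need. A nil status is incomplete.
func hasAzureStorageDetails(s *AzureStorage) bool {
	return s != nil && s.AccountName != "" && s.Container != ""
}

// isAzureStackHub mirrors the nil-safe cloud name comparison from the diff.
func isAzureStackHub(s *AzureStorage) bool {
	return s != nil && strings.EqualFold(s.CloudName, "AZURESTACKCLOUD")
}
```

Keeping the two predicates separate lets the caller decide what to do for each case, for example skipping the job on Azure Stack Hub and reporting a non-progressing condition when the details are missing.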

)
return utilerrors.NewAggregate([]error{err, updateError})
}
case operatorv1.Unmanaged:
Member

We can also add Force and/or default if we want to cover all the cases.

Member Author

@flavianmissi flavianmissi Nov 20, 2024

Oh, this is the first I've heard of Force - I'm not sure I understand what we would do to handle it here.
The docs state:

// Force means that the operator is actively managing its resources but will not block an upgrade
// if unmet prereqs exist. This state puts the operator at risk for unsuccessful upgrades

What blocks an upgrade? Degraded/Progressing conditions status set to true?

And what should we do on default?

It was also hard for me to decide how to handle Unmanaged - other components just ignore it, but current released versions will still create the job.

Member

I am not sure if people actually use it :D, but IMO we could just do the same as we do for Managed in this case.
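
One way to read the suggestion in this thread is a switch that covers every management state plus a default. The sketch below is only an illustration: the Action type and the decide function are hypothetical, the state names follow the ones mentioned in the thread, and treating Unmanaged as "ignore" is an assumption based on the remark that other components ignore it (the thread notes that released versions still create the job in that state).

```go
package main

import "fmt"

// ManagementState is a simplified stand-in for the operator management state.
type ManagementState string

const (
	Managed   ManagementState = "Managed"
	Unmanaged ManagementState = "Unmanaged"
	Force     ManagementState = "Force"
	Removed   ManagementState = "Removed"
)

// Action is a hypothetical label for what the controller would do next.
type Action string

const (
	ActionReconcile Action = "reconcile" // create or track the path fix job
	ActionCleanup   Action = "cleanup"   // delete the job, clear conditions
	ActionIgnore    Action = "ignore"    // leave everything untouched
)

// decide maps a management state to an action. Force is handled like
// Managed, as suggested in the review; unknown states return an error.
func decide(state ManagementState) (Action, error) {
	switch state {
	case Managed, Force:
		return ActionReconcile, nil
	case Removed:
		return ActionCleanup, nil
	case Unmanaged:
		return ActionIgnore, nil // assumption, see the lead-in
	default:
		return "", fmt.Errorf("unknown management state %q", state)
	}
}
```

Returning an error from the default branch makes an unexpected state visible instead of silently falling through.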

@flavianmissi
Member Author

e2e-azure-operator failed mysteriously...

./hack/test-go.sh -count 1 -timeout 110m -v${WHAT:+ -run="$WHAT"} ./test/e2e/
make: *** [Makefile:39: test-e2e] Error 1

operator logs weren't very helpful...
/test e2e-azure-operator

@flavianmissi
Member Author

again a really unhelpful message from e2e-azure-operator:

./hack/test-go.sh -count 1 -timeout 110m -v${WHAT:+ -run="$WHAT"} ./test/e2e/
make: *** [Makefile:39: test-e2e] Error 1

I'll get a cluster today and run these manually there - will report back when I learn more.

@flavianmissi
Member Author

Looks like the e2e-azure-operator error was legit: it happened when the Azure storage configuration was nil in the CR status without the operator being Removed, which caused the operator to get stuck progressing (AzurePathFixProgressing: The job does not exist). Handling the case when the Azure storage configuration is empty fixes the tests.
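
The failure mode described in this comment can be sketched as a small decision function. This is an assumption-laden illustration, not the code in the PR: the Condition type, the progressingCondition function, and all reason strings are hypothetical.

```go
package main

// AzureStorage is a simplified stand-in for the Azure storage status.
type AzureStorage struct {
	AccountName string
	Container   string
}

// Condition is a hypothetical, reduced operator condition.
type Condition struct {
	Progressing bool
	Reason      string
}

// progressingCondition picks the progressing condition. The second case is
// the one this comment is about: with no Azure storage configuration there
// is nothing to migrate, so a missing job must not be reported as progress.
func progressingCondition(removed bool, azure *AzureStorage, jobExists bool) Condition {
	switch {
	case removed:
		return Condition{Progressing: false, Reason: "Removed"}
	case azure == nil || azure.AccountName == "" || azure.Container == "":
		return Condition{Progressing: false, Reason: "StorageNotConfigured"}
	case !jobExists:
		return Condition{Progressing: true, Reason: "NotFound"}
	default:
		return Condition{Progressing: true, Reason: "Progressing"}
	}
}
```

Without the second case, a nil configuration on a non-Removed operator falls through to the "job does not exist" branch and stays progressing indefinitely.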

@flavianmissi flavianmissi force-pushed the azurepathfix-state-removed-4.16 branch from 015ad4a to ac3c3af Compare November 22, 2024 12:30
@flavianmissi
Member Author

test failures do not seem related to changes in this PR.

/retest

Contributor

openshift-ci bot commented Nov 25, 2024

@flavianmissi: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@flavianmissi
Member Author

/retitle release-4.16 OCPBUGS-45041: operator/status clear azure path fix job conditions on operator removal

@openshift-ci openshift-ci bot changed the title from "WIP: operator/status clear azure path fix job conditions on operator removal" to "release-4.16 OCPBUGS-45041: operator/status clear azure path fix job conditions on operator removal" Nov 26, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 26, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 26, 2024
@openshift-ci-robot
Contributor

@flavianmissi: This pull request references Jira Issue OCPBUGS-45041, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected dependent Jira Issue OCPBUGS-45040 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is New instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@flavianmissi
Member Author

/retitle [release-4.16] OCPBUGS-45041: operator/status clear azure path fix job conditions on operator removal

@openshift-ci openshift-ci bot changed the title from "release-4.16 OCPBUGS-45041: operator/status clear azure path fix job conditions on operator removal" to "[release-4.16] OCPBUGS-45041: operator/status clear azure path fix job conditions on operator removal" Nov 26, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
3 participants