[release-4.16] OCPBUGS-45041: operator/status clear azure path fix job conditions on operator removal #1158
base: release-4.16
Conversation
Remove the early return when .status.storage.azure is unset. This property is cleared when the operator managementState is set to Removed, and in that case the early return would stop the controller from clearing the conditions and deleting the job. Without this check the controller can still do its job, even when managementState is Removed.
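The control flow described above can be sketched roughly as follows. This is a minimal, self-contained sketch, not the operator's actual code: the type and function names (AzureStorage, reconcile) are illustrative stand-ins for the real types in the cluster-image-registry-operator.

```go
package main

import (
	"fmt"
	"strings"
)

// AzureStorage is an illustrative stand-in for the relevant part of
// .status.storage.azure; the real types live in the operator's API.
type AzureStorage struct {
	CloudName string
}

// reconcile sketches the fixed flow: a nil status (which happens when
// managementState is Removed) no longer short-circuits, so the
// controller can still clear conditions and delete the job. Only the
// Azure Stack Hub case skips work, since move-blobs is unsupported there.
func reconcile(azureStorage *AzureStorage) string {
	if azureStorage != nil && strings.EqualFold(azureStorage.CloudName, "AZURESTACKCLOUD") {
		return "skipped: Azure Stack Hub"
	}
	// ... clear AzurePathFix conditions and delete the job here ...
	return "reconciled"
}

func main() {
	// A nil status still reconciles instead of returning early.
	fmt.Println(reconcile(nil))
	fmt.Println(reconcile(&AzureStorage{CloudName: "AzureStackCloud"}))
}
```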
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: flavianmissi. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing

e2e failures seem unrelated.
  // the move-blobs cmd does not work on Azure Stack Hub. Users on ASH
  // will have to copy the blobs on their own using something like az copy.
- if strings.EqualFold(azureStorage.CloudName, "AZURESTACKCLOUD") {
+ azureStorage := imageRegistryConfig.Status.Storage.Azure
+ if azureStorage != nil && strings.EqualFold(azureStorage.CloudName, "AZURESTACKCLOUD") {
  	return nil
  }
Shouldn't we still check for the presence of the AccountName/Container for the generator?
Yeah, the generator has a nil check, but it doesn't hurt to have a more specific check here too. Will do.
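The more specific guard discussed here could look like the sketch below. The field names (AccountName, Container) come from the review comment above; the type and helper are hypothetical, not the operator's exact code.

```go
package main

import "fmt"

// AzureStorage is an illustrative stand-in for the operator's
// .status.storage.azure type, reduced to the fields discussed here.
type AzureStorage struct {
	AccountName string
	Container   string
}

// hasRequiredFields guards against both a nil status and empty required
// fields before handing the values to the generator.
func hasRequiredFields(s *AzureStorage) bool {
	return s != nil && s.AccountName != "" && s.Container != ""
}

func main() {
	fmt.Println(hasRequiredFields(nil))                                                // false
	fmt.Println(hasRequiredFields(&AzureStorage{AccountName: "acct", Container: "c"})) // true
}
```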
	)
	return utilerrors.NewAggregate([]error{err, updateError})
}
case operatorv1.Unmanaged:
We could also add Force and/or default if we want to cover all the cases.
Oh, this is the first I'm hearing about Force; I'm not sure I understand what we would do to handle it here.
The docs state:
// Force means that the operator is actively managing its resources but will not block an upgrade
// if unmet prereqs exist. This state puts the operator at risk for unsuccessful upgrades
What blocks an upgrade? Degraded/Progressing condition status set to true?
And what should we do in the default case?
It was also hard for me to decide how to handle Unmanaged: other components just ignore it, but currently released versions will still create the job.
I am not sure if people actually use it :D, but IMO we could just do the same as we do for Managed in this case.
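Treating Force like Managed, as suggested, could be sketched like this. The state values mirror operatorv1.ManagementState; the handler strings are placeholders standing in for the real reconcile branches, not the operator's actual implementation.

```go
package main

import "fmt"

// ManagementState mirrors the operatorv1.ManagementState values
// discussed in this thread.
type ManagementState string

const (
	Managed   ManagementState = "Managed"
	Unmanaged ManagementState = "Unmanaged"
	Removed   ManagementState = "Removed"
	Force     ManagementState = "Force"
)

// handle sketches a switch covering all states: Force takes the same
// path as Managed, Unmanaged is a no-op, and unknown values are
// reported rather than silently ignored.
func handle(state ManagementState) string {
	switch state {
	case Managed, Force:
		return "run azure path fix job"
	case Removed:
		return "clear conditions and delete job"
	case Unmanaged:
		return "no-op"
	default:
		return fmt.Sprintf("unknown state %q", state)
	}
}

func main() {
	fmt.Println(handle(Force))
	fmt.Println(handle(Removed))
}
```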
Operator logs weren't very helpful...
Again a really unhelpful message from
I'll get a cluster today and run these manually there; will report back when I learn more.
Looks like the
Force-pushed from 015ad4a to ac3c3af (Compare)
Test failures do not seem related to changes in this PR. /retest
@flavianmissi: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/retitle release-4.16 OCPBUGS-45041: operator/status clear azure path fix job conditions on operator removal
@flavianmissi: This pull request references Jira Issue OCPBUGS-45041, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/retitle [release-4.16] OCPBUGS-45041: operator/status clear azure path fix job conditions on operator removal |
DO NOT MERGE!

This PR is a reinterpretation of the changes originally made on #1142. The AzurePathFixController source code looks very different between the main branch and the 4.14-4.16 branches. I'm opening this PR to ensure the approach taken on #1142 works similarly here as well.

The original bug (see #1142 for the bug link) was quite often caught by TestLeaderElection. This test would often fail in a situation where the operator condition would get stuck on AzurePathFixProgressing: Azure path fix job is progressing: 1 pods active; 0 pods failed, while other controllers would have successfully progressed into the Removed state. Running TestLeaderElection a few times in a row reliably reproduces this issue on the 4.14-4.16 branches.

With the changes in this PR, I can no longer see this failure in my local environment. I have run TestLeaderElection 20 times and it passed every time.

/hold